DURATION 2 lab sessions


This exercise builds towards a solution to a problem of the sort that you may well encounter in real life - recognising and obeying a small, purpose designed, computer language (that, for no very good reason, I have called Faust) that greatly improves the flexibility and power of an otherwise simple program.

As a side-effect, it may also help you to understand what goes on inside the parsers used by programs like egrep and flex, as the final version of the program will use a Deterministic Finite State Automaton (DFSA), described by the Faust language, to recognise a second set of input data. (We actually have the beginnings of something like egrep, except that instead of giving it a regular expression, we are giving it the equivalent DFSA. If you are feeling brave, you can get some idea of the FSAs and DFSAs that flex produces by using the -T flag)


On successful completion of this exercise a student will:
Be able to take an informal english description of a computer language and convert it into a formal grammar.
Be able to split a formal grammar for a computer language into two parts, one for which flex is most appropriate, and one for which byacc is the best choice.
Be able to build parse trees and a dictionary for a simple computer language, using flex, byacc and C.
Be aware of the technique of using parse trees in an interpreter for a computer language with non-sequential flow of control.
Be aware of the concepts of concrete syntax and abstract syntax, and of the possibility of modifying either while keeping the other fixed.
Have experience of using Finite State Automata to recognise patterns in text.


This is a 2-session exercise, in 3 parts. You do not need to write 3 separate programs, as you will be marked just on the final version of your program. However, it would be sensible to retain a working solution for each part as you work on the next.

Although you could ignore the 3 separate parts and try to write the whole program in one go, it will not save you any time - this exercise is structured to help you deal with a difficult problem by breaking it up into smaller steps, all of which are used in the final result.

Part 1 - recognising the language - grammar

Given a loosely-defined grammar, write an equivalent formal definition in BNF and then implement it using flex and byacc. The result should be a program capable of recognising the Faust language and reporting some errors, but not yet able to do anything more.

Part 2 - obeying the language - parse trees

Enhance your program to obey the DFSA defined by the input Faust program, by building "parse trees" which can then be interpreted. You should do this by modifying the flex and byacc actions to call the various C routines provided.

Part 3 - improving the language - identifiers & dictionary

Enhance your program to handle identifiers using a dictionary. Use the dictionary to save the state-identifiers and convert them to state-numbers, so that you can build essentially the same parse trees as for the previous part (i.e. have a different concrete syntax but the same abstract syntax). You have to provide routines to convert identifiers to numbers, and back again.


Create a program: rundfsa [ -t ] [ -f dfsafile ] [ -d datafile ]
which inputs a Faust program describing a DFSA from "dfsafile" and runs it using data from "datafile". Both inputs default to "stdin".

In the Faust program for parts 1 and 2, states are referred to using unsigned numbers. The states are numbered consecutively, starting at 1. For part 3, states are referred to using identifiers. (I have not bothered to give different names to these two slightly different versions of Faust - you can think of them as Faust- and Faust+ if you like.)
A Faust program consists of one or more state declarations in any order, followed by the definition of the start state, given by a '*' followed by a state-number (or state-identifier).
States can be declared in any order, and referred to both before and after they are declared. Each state declaration starts with the state-number (or state-identifier), followed by a colon (for an ordinary state) or a "@" (for an accept/terminal state), followed by a list of one or more pairs of a symbol-string and a state-number (or state-identifier), separated by commas and terminated by a semicolon.
The symbol-string is one or more characters between a pair of apostrophes ('), any one of which when input by the DFSA causes a transition to the state given by the state-number (or state-identifier). Almost any characters can appear between the apostrophes containing a symbol-string - only two characters are not permitted, but you will have to work out which two! It is not sufficient just to recognise the characters used in the example below.
Spaces, tabs and newlines can appear between any components of the input, but not inside a state-number (or state-identifier). A space or tab (or both) can appear inside a symbol-string, in which case it is significant, but a newline cannot.
For example, signed reals and integers, with leading spaces, can be read by the following Faust program, using state-numbers for parts 1 and 2:

        7: ' ' 7, '+' 1, '-' 1,
                '0123456789' 2; 1: '0123456789' 2;
        2@ '0123456789' 2, '.'  3, 'E' 4;
        3@ '0123456789' 3, 'E' 4;
        4: '+-' 5, '0123456789' 6;
                5: '0123456789' 6;
                6@ '0123456789' 6;
or using state-identifiers for part 3:
        Start: ' ' Start, '+' Sign, -' Sign,
                '0123456789' Integer; Sign: '0123456789' Integer;
        Integer@ '0123456789' Integer, '.' Fraction, 'E' Exponent;
        Fraction@ '0123456789' Fraction, 'E' Exponent;
        Exponent: '+-' ExponentSign, '0123456789' ExponentValue;
                ExponentSign: '0123456789' ExponentValue;
                ExponentValue@ '0123456789' ExponentValue;

Each line of "datafile" should be in a format acceptable to the DFSA defined by the Faust program (i.e. accept states actually consume the end-of-line). rundfsa runs the Faust DFSA to check each line of "datafile" up to the end-of-file, reporting the line number of each incorrect line.

In part 1, the -t flag should cause rundfsa to echo the input Faust program. (You will have to write some code to make this happen.)
In part 2, when the Faust program is run, the -t flag will also echo each input data character, and output each new state-number in curly brackets. (This should happen automatically if you build the parse trees.) For example, using the Faust program above, input data of 1.23 would be echoed as:

In part 3, state-numbers are replaced by state-identifiers (you have to write some dictionary code to make this happen), so the example data will be echoed as:

Part 1 - recognising the language - grammar

To create a rundfsa program that simply inputs a Faust program, described as above using state-numbers, and then stops:

Draw the DFSA state diagram corresponding to the example Faust program above and identify the parts dealing with integers, fractions and exponents. What kinds of numbers are recognised at each of the accept states?
Write a formal description of the grammar of the Faust language described above, using grammar rules suitable for flex and byacc (hints available via the web pages). You are advised to ask a demonstrator to check your grammar before you implement it

A common mistake is to write a grammar that recognises numbers. This is not what you are supposed to do! You are supposed to write a grammar to recognise any Faust program. You should then use the given Faust program, that happens to recognise numbers, to test your grammar.

Just like the previous exercise, get started using the "makefile" but this time from "$CS2121/p*/ex5" i.e. make sure you are in your directory CS2121/ex5, and then type:

        cp $CS2121/p*/ex5/makefile .
	make start
For this exercise, this copies byacc, flex, and C (and include/.h) programs, example Faust program written using both state-numbers and state-identifiers, and test data.

The C code includes a "main" routine which looks for the various command-line flags (-t, -f, -d) and does whatever is necessary to deal with them. You only need to modify the given flex and byacc code (which initially simply input lines of text - you probably need to throw the existing grammar rules away and start again) and then "make 1" to compile and run the resulting program. You should not need any actions in your byacc code. However, you will need flex actions and, just as in the given code, the action for each pattern must test "trace" and "ECHO" if it is set:

        if (trace) ECHO;

You should get no output from this version of the program, other than the echo of the input Faust program with "-t" (assuming you are testing trace and ECHOing correctly), and messages telling you that there are no parse trees etc. to obey!

Part 2 - obeying the language - parse trees

To create a rundfsa program that inputs a Faust program, converts it to parse trees, and then runs it on the supplied data:

You should not modify the C file nor your grammar rules, but only the actions in the flex and byacc files you wrote for part 1, which can use the routines defined in the C file. You will need to call "savesymbols" from the flex actions (returning the result to byacc using yylval), and set "startstate" and call "definestate", "definepair" and "cons" from the byacc actions. You also need to write the %token and %type declarations for the byacc file to describe the values passed from flex to byacc, and the values passed between grammar rules in byacc.
"savesymbols" saves a symbol-string and returns a pointer to it. You are expected to remove the surrounding apostrophes from the string you pass to this routine (e.g. do some pointer arithmetic).
"startstate" is an integer variable, that you should set equal to the state-number of the start state.
"definepair" is used to create a (symbol-string, state-number) pair, and returns a pointer to it.
"cons" is used to attach a pair to the head of a list of pairs, and returns a pointer to the new list.
"definestate" is used to link a state to its list of pairs and note whether it is an accept state or not.
All these routines can be found in the C file, and also have "extern" declarations in the flex or byacc files, from which you should be able to work out what parameters they require.

These tree-building routines contain a lot of extraneous code, inserted simply to detect various mistakes commonly made while trying to complete this exercise. If you get any output from them, study it carefully to see what it is trying to tell you. In particular, examine the printed parse trees to make sure they correspond to what you expect - a common mistake is to fail to recognise symbol-strings properly.

The Faust program, when it is running properly, will check each line of the datafile. The example dfsafile should find two errors in the given datafile.

You are advised to ask a demonstrator to check your strategy for building the parse trees before you implement it

Part 3 - improving the language - identifiers & dictionary

You are strongly advised to make a backup copy of your program from part 2 before you start modifying it for part 3.

You should type "make 3" to run your program using the new version of the example Faust program, written with state-identifiers instead of state-numbers.

Firstly, the grammar must be changed so that a state is represented by an identifier rather than a number. Each state-identifier is an alphanumeric string (starting with an alphabetic character) up to 32 characters long. You should modify your flex code to recognise state-identifiers rather than state-numbers, and then convert them into numbers by calling "nextname", so that your byacc code is unchanged.

You will need to modify the C file to declare the dictionary and implement "nextname" (to map a state-identifier to a number) and "writestate" (to map a number back to the state-identifier). "nextname" should look up each state-identifier in the dictionary and return the corresponding number. If the state-identifier is not already present, "nextname" must insert it and map it to the next available number, starting with 1. You should not fix your code so that "Start" is recognised as name number 7, etc. as in the example Faust programs - the first name, no matter what it is, should always become name number 1, the second number 2, and so on. You can implement this using a linear search of an array with the same indices as "statelist" (or use "statelist" directly). To hold the characters that make up each state-identifier, use a pointer to a string, created by "strdup".

You should not need to modify anything else in the C file.


This exercise is worth 20 marks in total. It is necessary to produce working programs to demonstrate that you have achieved the Learning Outcomes of this exercise. However, it is also important that you are able to show that you understand the details of what you have written. You are therefore required to demonstrate your work to a member of the laboratory staff.


Demonstrate the final version of your working program (i.e. for part 3, or however far you get) to a demonstrator or lab supervisor.

You should labmail the file "labmail_me", which you should create by typing "make labmail".

Your final program will be marked as follows:
(from part 1)
1 mark - remembering to use "if (trace) ECHO;" in flex
1 mark - recognising symbols correctly
1 mark - recognising numbers and special characters correctly
1 mark - dealing with other characters correctly
1 mark - communication between flex and byacc handled correctly
2 marks - byacc grammar correct
(from part 2)
2 marks - actions in flex correct
2 marks - using %token and %type correctly
3 marks - actions in byacc correct
(from part 3)
6 marks - handling names and the dictionary correctly