UP PREVIOUS

CS5031 Assessment Exercise: Implementing a Simple Computer Language - Faust

PURPOSE

This exercise builds towards a solution to a problem of the sort that you may well encounter in real life - recognising and obeying a small, purpose designed, computer language (that, for no very good reason, I have called Faust) that greatly improves the flexibility and power of an otherwise simple program.

It may also help you to understand what goes on inside the parsers created by flex, as the final version of the program will use a Deterministic Finite State Automaton (DFSA), described by the Faust language, to recognise a second set of input data.

OBJECTIVES

This is a 3-part exercise. The marks are split evenly between the 3 parts. You do not need to write 3 separate programs, as you will be marked just on the final version of your program. However, it would be sensible to retain a working solution for each part as you work on the next. Although you could ignore the 3 separate parts and try to write the whole program in one go, it will not save you any time - this exercise is structured to help you deal with a difficult problem by breaking it up into smaller steps, all of which are used in the final result.

Part 1 - recognising the language - grammar

Given a loosely-defined grammar, write an equivalent formal definition in BNF and then implement it using flex and byacc. The result should be a program capable of recognising the Faust language and reporting some errors, but not yet able to do anything more.

Part 2 - obeying the language - parse trees

Enhance your program to obey the DFSA defined by the input Faust program, by building "parse trees" which can then be interpreted. You should do this by modifying the flex and byacc actions to call the various C routines provided.

Part 3 - improving the language - identifiers & dictionary

Enhance your program to handle identifiers using a dictionary. Use the dictionary to save the state-identifiers and convert them to state-numbers, so that you can build essentially the same parse trees as for the previous part (i.e. have a different concrete syntax but the same abstract syntax). You have to provide routines to convert identifiers to numbers, and back again.

DESCRIPTION

Create a program: rundfsa [ -t ] [ -f dfsafile ] [ -d datafile ]
which inputs a Faust program describing a DFSA from "dfsafile" and runs it using data from "datafile". Both inputs default to "stdin".

In the Faust program for parts 1 and 2, states are referred to using unsigned numbers. The states are numbered consecutively, starting at 1. For part 3, states are referred to using identifiers. (I have not bothered to give different names to these two slightly different versions of Faust - you can think of them as Faust and Faust+ if you like.)
A Faust program consists of one or more state declarations in any order, followed by the definition of the start state, given by a '*' followed by a state-number (or state-identifier).
States can be declared in any order, and referred to both before and after they are declared. Each state declaration starts with the state-number (or state-identifier), followed by a colon (for an ordinary state) or a "@" (for an accept/terminal state), followed by a list of one or more pairs of a symbol-string and a state-number (or state-identifier), separated by commas and terminated by a semicolon.
The symbol-string is one or more characters between a pair of apostrophes ('), any one of which when input by the DFSA causes a transition to the state given by the state-number (or state-identifier). Almost any characters can appear between the apostrophes containing a symbol-string - only two characters are not permitted, but you will have to work out which two! It is not sufficient just to recognise the characters used in the example below.
Spaces, tabs and newlines can appear between any components of the input, but not inside a state-number (or state-identifier). A space or tab (or both) can appear inside a symbol-string, in which case it is significant, but a newline cannot.
For example, signed reals and integers, with leading spaces, can be read by the following Faust program, using state-numbers for parts 1 and 2:

        7: ' ' 7, '+' 1, '-' 1,
                '0123456789' 2; 1: '0123456789' 2;
        2@ '0123456789' 2, '.'  3, 'E' 4;
        3@ '0123456789' 3, 'E' 4;
        4: '+-' 5, '0123456789' 6;
                5: '0123456789' 6;
                6@ '0123456789' 6;
        *7
or using state-identifiers for part 3:
        Start: ' ' Start, '+' Sign, -' Sign,
                '0123456789' Integer; Sign: '0123456789' Integer;
        Integer@ '0123456789' Integer, '.' Fraction, 'E' Exponent;
        Fraction@ '0123456789' Fraction, 'E' Exponent;
        Exponent: '+-' ExponentSign, '0123456789' ExponentValue;
                ExponentSign: '0123456789' ExponentValue;
                ExponentValue@ '0123456789' ExponentValue;
        *Start

Each line of "datafile" should be in a format acceptable to the DFSA defined by the Faust program (i.e. accept states actually consume the end-of-line). rundfsa runs the Faust DFSA to check each line of "datafile" up to the end-of-file, reporting the line number of each incorrect line.

In part 1, the -t flag should cause rundfsa to echo the input Faust program. (You will have to write some code to make this happen.)
In part 2, when the Faust program is run, the -t flag will also echo each input data character, and output each new state-number in curly brackets. (This should happen automatically if you build the parse trees.) For example, using the Faust program above, input data of 1.23 would be echoed as:

        {7}1{2}.{3}2{3}3{3}
In part 3, state-numbers are replaced by state-identifiers (you have to write some dictionary code to make this happen), so the example data will be echoed as:
        {Start}1{Integer}.{Fraction}2{Fraction}3{Fraction}

Part 1 - recognising the language - grammar

To create a rundfsa program that simply inputs a Faust program, described as above using state-numbers, and then stops:

Draw the DFSA state diagram corresponding to the example Faust program above and identify the parts dealing with integers, fractions and exponents. What kinds of numbers are recognised at each of the accept states?
Write a formal description of the grammar of the Faust language described above, using BNF and/or grammar rules suitable for flex and byacc (hints available via the web pages). You are advised to ask a demonstrator to check your grammar before you implement it.

A common mistake is to write a grammar that recognises numbers. This is not what you are supposed to do! You are supposed to write a grammar to recognise any Faust program. You should then use the given Faust program, that happens to recognise numbers, to test your grammar.

Starting byacc, flex, and C (and include/.h) programs, example Faust program and test data and a makefile can be found in $CS5031/p*/ex_ass/part1.
The C code includes a "main" routine which looks for the various command-line flags (-t, -f, -d) and does whatever is necessary to deal with them. You only need to modify the given flex and byacc code (which initially simply input lines of text - you probably need to throw the existing grammar rules away and start again) and then "make test" to make and run the resulting program. You should not need any actions in your byacc code. However, you will need flex actions and, just as in the given code, the action for each pattern must test "trace" and "ECHO" if it is set:

        if (trace) ECHO;

You should get no output from this version of the program, other than the echo of the input Faust program with "-t" (assuming you are testing trace and ECHOing correctly), and messages telling you that there are no parse trees etc. to obey!

Part 2 - obeying the language - parse trees

To create a rundfsa program that inputs a Faust program, converts it to parse trees, and then runs the DFSA on the supplied data:

You should not modify the C file nor your grammar rules, but only the actions in the flex and byacc files you wrote for part 1, which can use the routines defined in the C file. You will need to call "savesymbols" from the flex actions (returning the result to byacc using yylval), and set "startstate" and call "definestate", "definepair" and "cons" from the byacc actions. You also need to write the %token and %type declarations for the byacc file to describe the values passed from flex to byacc, and the values passed between grammar rules in byacc.
"savesymbols" saves a symbol-string and returns a pointer to it. You are expected to remove the surrounding apostrophes from the string you pass to this routine (e.g. do some pointer arithmetic).
"startstate" is an integer variable, that you should set equal to the state-number of the start state.
"definepair" is used to create a (symbol-string, state-number) pair, and returns a pointer to it.
"cons" is used to attach a pair to the head of a list of pairs, and returns a pointer to the new list.
"definestate" is used to link a state to its list of pairs and note whether it is an accept state or not.
All these routines can be found in the C file, and also have "extern" declarations in the flex or byacc files, from which you should be able to work out what parameters they require.

These tree-building routines contain a lot of extraneous code, inserted simply to detect various mistakes commonly made while trying to complete this exercise. If you get any output from them, study it carefully to see what it is trying to tell you. In particular, examine the printed parse trees to make sure they correspond to what you expect - a common mistake is to fail to recognise symbol-strings properly.

The Faust program, when it is running properly, will check each line of the datafile. The example dfsafile should find two errors in the given datafile.

You are advised to ask a demonstrator to check your strategy for building the parse trees before you implement it.

Part 3 - improving the language - identifiers & dictionary

Copy the new version of the example Faust program from $CS5031/p*/ex_ass/part3/dfsa

Firstly, the grammar must be changed so that a state is represented by an identifier rather than a number. Each state-identifier is an alphanumeric string (starting with an alphabetic character) up to 32 characters long. You should modify your flex code to recognise state-identifiers rather than state-numbers, and then convert them into numbers by calling "nextname", so that your byacc code is unchanged.

You will need to modify the C file to declare the dictionary and implement "nextname" (to map a state-identifier to a number) and "writestate" (to map a number back to the state-identifier). "nextname" should look up each state-identifier in the dictionary and return the corresponding number. If the state-identifier is not already present, "nextname" must insert it and map it to the next available number, starting with 1. You can implement this using a linear search of an array with the same indices as "statelist" (or use "statelist" directly). To hold the characters that make up a state-identifier, you can use either: a fixed-size array of characters, big enough for any state-identifier
or: a pointer to a "malloc"ed string, big enough for each individual state-identifier.

You should not need to modify anything else in the C file.

Bonus suggestions

* Use a better data-structure for your dictionary, such as a hash table or a balanced tree. (This is probably only worth 1 mark if you reuse code from another lab exercise.)
* Improve symbol-strings to be more like strings in ANSI C (hints available via the web pages), or more like [ ] in lex (e.g. "^a-zA-Z"). (This is probably only worth 1 mark.)
* The example Faust program you are given recognises numbers, but you may write and run other programs. (This is probably only worth 1 or 2 marks.)
* Improve the Faust language in other ways e.g. by allowing comments, or replacing some of the special characters like "@" and "*" by keywords. (This is probably only worth 1 mark.)
* Even if tracing is turned off, report when an accept state succeeds - you could modify the language to allow the user to specify the message in the state definition. (This is probably only worth 1 mark.)
* If an error is found, report where you were in the input, and what you expected at that point, and what you found instead. (This is probably only worth 1 mark.)
* Add actions to the Faust language e.g. postfix or even infix code, like in earlier exercises. You would have to extend the parse trees to include the code. If you attach code to each state transition, you could e.g. calculate the value of the number you input, especially if you have a special value such as "$" to mean (push) the character just recognised (hints available via the web pages).
* Change the automaton to be non-deterministic (e.g. backtracking, or breadth-first), or a Push-down automaton, or a Turing machine, or whatever. [You can find out more about automata from the CS2121 book-list, or in Aho Sethi & Ullman section 3.6 in the CS2111 book-list (available via the web pages).]
* Instead of reading a Faust description of a DFSA, read (one or more) regular expressions and convert them into the corresponding DFSA - you may find this easier if you use a non-deterministic automaton (NFSA). [e.g. Aho Sethi & Ullman sections 3.7 and 3.8.]

I strongly suggest you do the basic program first, and then enhance it (keeping a copy of the original working verson) rather than plunging straight into any extensions.

ASSESSMENT

This is the assessment exercise for the compilers part of the course only - you will be given details of the assessment work for the operating-systems part of the course separately.

A maximum of 70% of the marks will be awarded for a perfect solution to the basic problem as stated, evenly split between the three parts. The remaining 30% of the marks will only be awarded as bonuses for exceptional work, which must represent a substantial amount of extra effort and be beyond the original exercise specification (except that there are some suggestions above for bonus work). Marks obtained in this way will not match the extra effort needed on your part to obtain them.

The net effect is, that you can obtain good 2nd class marks by doing the exercise as set, but to obtain 1st class marks you have to do something extra. You should not attempt any bonus work until you have completed both the basic program and your assessed work for the operating-system part of the course.

To obtain the assessment marks when you complete part 3, or however far you get, you should must email your source files (lex, yacc and C) to pjj@cs.man.ac.uk and make an appointment to demonstrate your working program. If you have attempted any bonus work, you must also email your test data and the corresponding results your program generates.

It is important that the code submitted is substantially your own work. However, collaboration is an essential part of any real world task and learning from your peers is a good way of aquiring tips and techniques. You may discuss algorithms and grammars and how to debug your program, but you must not allow others to see your actual program - if you do so you may lose marks.


UP PREVIOUS