3rd year projects for 2009/10

What programming language is that?

Difficulty grading: SH=C, INF=C, BM=C, CM=C

There are many different computer languages, most of which were used for a short while and then abandoned. However, a few were used quite heavily before becoming obsolete. It would be helpful to have some way to automatically recognise the language used to write an old program. A straightforward, but time-consuming, way would be to try to compile the text using as many different compilers as possible, and then pick the one that produces the fewest error messages. However, old compilers tend to become unusable shortly after their languages become obsolete.

Better would be to build an analyser, perhaps using pattern-matching tools like lex and yacc, which tries to classify the main features of the text. For example, we could look for various keywords, punctuation and operators, and try to spot those that are specific to particular languages: C and Java use "if" and "else", but Pascal uses "then" as well. We could even look for different conventions for identifiers (e.g. are _ or $ allowed?), comments (e.g. // or /* */ etc.) and literals (do strings use " " or ' '? are there escaped characters like \n? etc.).

It would probably be best to start with a few common programming languages, and add new languages if there was enough time. It might even be best to make the process data-driven, so we could provide files containing lists of language-specific features, which a more general scanner could use. (As an example of a similar facility in a different context, systems such as termcap or terminfo are able to cope with many different kinds of terminals by using files containing descriptions of their characteristics.) Perhaps we could even try to automate the process of building a classifier from examples of programs in various languages. This might make it easier to add new programming languages.
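
As a rough illustration, a data-driven classifier along these lines might be little more than a table of tell-tale patterns and a scoring loop. The feature lists below are illustrative guesses, not a validated set; a real tool would read them from files:

	import re

	# Hypothetical feature table: each language is described purely by a
	# list of tell-tale regexps, so adding a language means adding an entry.
	FEATURES = {
	    "C":      [r"#include\b", r"\bprintf\s*\(", r"/\*", r"->"],
	    "Java":   [r"\bpublic\s+class\b", r"\bimport\s+java\.", r"//"],
	    "Pascal": [r"\bbegin\b", r"\bend\s*[.;]", r":=", r"\bthen\b"],
	}

	def classify(source):
	    # Score each language by how many of its features occur in the text.
	    scores = {lang: sum(bool(re.search(p, source)) for p in pats)
	              for lang, pats in FEATURES.items()}
	    return max(scores, key=scores.get), scores

	sample = "program p; begin x := 1; if x = 1 then writeln(x) end."
	print(classify(sample))   # Pascal should score highest here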

REFERENCES:
The Tower of Babel
Syntax across languages
WikiPedia: programming languages
"man termcap" or "man terminfo"

COURSE PREREQUISITES: -

EQUIPMENT: -

"make clean" - tidying up a filestore

Difficulty grading: SH=F, INF=F, BM=F, CM=F

There are various build tools, like "make" and "ant", that can help to automate the creation of files from other files (e.g. compiling Fred.java to Fred.class). However, they are not always very good at automatically tidying up after themselves - for example, deleting temporary files, or deleting the compiled files when a project is complete.

It would be nice to have a tool that knew about common conventions, like .java and .class, or .c and .o, and could tidy up by e.g. deleting the .class files. It would be even better if it knew to ask the user whether they want to make e.g. a .jar file first. There are several possible extensions.
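
Before any extensions, the core behaviour might be sketched like this. The convention table and the dry-run default are illustrative choices, not a fixed design:

	import os, sys

	# Hypothetical convention table: derived-file suffix -> source suffix.
	DERIVED = {".class": ".java", ".o": ".c"}

	def tidy(root, dry_run=True):
	    # Delete derived files whose source still exists alongside them.
	    for dirpath, _, filenames in os.walk(root):
	        for name in filenames:
	            stem, ext = os.path.splitext(name)
	            src = DERIVED.get(ext)
	            if src and os.path.exists(os.path.join(dirpath, stem + src)):
	                target = os.path.join(dirpath, name)
	                print(("would delete " if dry_run else "deleting ") + target)
	                if not dry_run:
	                    os.remove(target)

	tidy(sys.argv[1] if len(sys.argv) > 1 else ".")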

REFERENCES:
GNU make
Apache Ant

COURSE PREREQUISITES: -

EQUIPMENT: -

Grammar-style description and conversion

Difficulty grading: SH=C, INF=C, BM=C, CM=C

There are many different notations for describing context-free grammars, including many versions of BNF and EBNF. In particular, almost every software tool that deals with grammars, like yacc, has its own version. However, most of these different notations have exactly the same expressive power, so their main effect is just to make it difficult to take a grammar designed for one tool and use it with another tool.

It would be useful to create a system that can take a grammar in any notation and transform it to another notation, driven purely by some standard description of the two notations. (As an example of a similar facility in a different context, systems such as termcap or terminfo are able to cope with many different kinds of terminals by using files containing descriptions of their characteristics.)

Even a simple system only capable of converting between notations of similar complexity, such as between any two varieties of BNF, would be useful. However, it would be even more useful if it could also be used to convert between BNF and EBNF, or to deal with the extras and idiosyncrasies of actual grammar-driven tools like lex and yacc, such as actions and declarations.
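
As a sketch of the data-driven half of the problem, the output side might be driven by a termcap-style table of notation descriptions. The dialect names and field names below are invented for illustration, and reading a grammar in would be the harder half of the project:

	# Invented descriptions of two BNF dialects, termcap-style.
	NOTATIONS = {
	    "yacc-like": {"define": "\n\t: ", "alt": "\n\t| ", "end": "\n\t;\n"},
	    "ebnf-like": {"define": " = ",    "alt": " | ",    "end": " ;\n"},
	}

	# The grammar itself is held in a neutral internal form:
	# rule name -> list of alternatives, each a list of symbols.
	GRAMMAR = {"expr": [["term"], ["expr", "'+'", "term"]]}

	def emit(grammar, style):
	    d = NOTATIONS[style]
	    return "".join(lhs + d["define"]
	                   + d["alt"].join(" ".join(alt) for alt in alts)
	                   + d["end"]
	                   for lhs, alts in grammar.items())

	print(emit(GRAMMAR, "yacc-like"))
	print(emit(GRAMMAR, "ebnf-like"))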

REFERENCES:
BNF and EBNF and a comparison of various BNFs and EBNFs
"man termcap" or "man terminfo"
example request

COURSE PREREQUISITES: -

EQUIPMENT: -

Grammar transformation

Difficulty grading: SH=C, INF=C, BM=C, CM=C

The Java Compiler Compiler (JavaCC) is a parser generator originally created by Sun for use with their Java language. It is roughly equivalent to tools like lex and yacc, but uses a different underlying mechanism (top-down rather than bottom-up parsing). This makes it easier for humans to understand what it is doing, but also makes the basic system less powerful. Because of this, users have to put some effort into modifying an initial grammar to make it acceptable to JavaCC.

For example, we would normally recognise non-associative, left-associative and right-associative operators using non-recursive, left-recursive and right-recursive grammars respectively:

	  non_expression = sub_exp | sub_exp '=' sub_exp
	 left_expression = sub_exp | left_expression '+' sub_exp
	right_expression = sub_exp | sub_exp '^' right_expression

However, with JavaCC, we have to rewrite the grammars:

	  non_expression = sub_exp ( '=' sub_exp )?
	 left_expression = sub_exp ( '+' sub_exp )*
	right_expression = sub_exp ( '^' right_expression )?

This project is about producing a program that helps users to modify their grammars, to get the most out of JavaCC. The program would be interactive, so that it could point out problem areas, make suggestions, and allow the user to pick a suitable modification. Some specific areas that could be investigated are converting left recursion to iteration (as for left_expression above), left-factoring (as for non_expression and right_expression above), using look-ahead to remove ambiguities, and adding error handling.

There are several possible extensions to this project if time permits, implementing other similar transformations on grammars, such as removing empty, trivial, or unusable grammar rules.
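
As a flavour of such transformations, the left-recursion rewrite could be sketched like this, on rules held as lists of alternatives. This is a toy transformation on symbol lists, not real JavaCC input:

	# A = A b | c  becomes  A = c ( b )*  (the classic rewrite shown above).
	def remove_left_recursion(name, alts):
	    recursive = [a[1:] for a in alts if a and a[0] == name]   # the "b" parts
	    base      = [a for a in alts if not a or a[0] != name]    # the "c" parts
	    if not recursive:
	        return alts
	    tail = "( " + " | ".join(" ".join(r) for r in recursive) + " )*"
	    return [b + [tail] for b in base]

	# left_expression = sub_exp | left_expression '+' sub_exp
	print(remove_left_recursion("left_expression",
	      [["sub_exp"], ["left_expression", "'+'", "sub_exp"]]))
	# -> [['sub_exp', "( '+' sub_exp )*"]]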

REFERENCES:
JavaCC and a similar, but not identical, tool TCLLk
A bigger example

COURSE PREREQUISITES: -

EQUIPMENT: -

Translating between Lex and Yacc

Difficulty grading: SH=L, BM=L, CM=L

At first glance, it is a completely stupid idea to want to take a lex program and create a yacc program that does exactly the same thing (or vice versa). However, when writing a grammar-driven program such as a compiler, one of the decisions that has to be made is the exact dividing line between those parts of the overall grammar that will be handled by lex and those that will be handled by yacc, and inevitably the line isn't always correct the first time.

It would be useful if it were possible to feed all or part of a lex grammar into a tool that could build the corresponding yacc, or vice-versa, so that this boundary could be experimented with. It would also be interesting, from a purely teaching viewpoint, to illustrate the relative powers of the two notations (i.e. regular expressions and BNF).

An example given in COMP20121 was converting between the lex:

	[0-9]+ {return number;}

and the yacc:

	number : digit | number digit ;
	digit  : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
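
That particular direction can be mechanised quite directly. A sketch for the special case of a character class followed by "+", where the rule names would be supplied by the user:

	# Turn a lex rule of the form [lo-hi]+ into the equivalent pair of
	# yacc rules, as in the example above.
	def charclass_plus_to_yacc(lo, hi, name, unit):
	    alts = " | ".join("'%c'" % chr(c) for c in range(ord(lo), ord(hi) + 1))
	    return ("%s : %s | %s %s ;\n" % (name, unit, name, unit)
	            + "%s : %s ;" % (unit, alts))

	print(charclass_plus_to_yacc("0", "9", "number", "digit"))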

COURSE PREREQUISITES: -

EQUIPMENT: -

A regular expression development environment

Difficulty grading: INF=F, SH=F, BM=F, CM=F
Area: Programming Language Processing
Max number of students who can do this project: 1

The basic idea is that a user would be able to input a regular expression (regexp) as in e.g. egrep, or a set of regexps as in e.g. lex, and be able to see how the regexp(s) matched parts of an actual sample text. The tool could display the sample text in a window (or even several samples, each in a different window) and draw a box around each chunk of text that matched a regexp (with different colours for different regular expressions).
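
To give a feel for the core matching step, stripped of any GUI, a console sketch might print one "underline" row per regexp instead of coloured boxes. The two patterns here are purely illustrative:

	import re

	REGEXPS = {"number": r"[0-9]+", "name": r"[A-Za-z_]\w*"}

	def annotate(text):
	    print(text)
	    for label, pattern in REGEXPS.items():
	        row = [" "] * len(text)
	        for m in re.finditer(pattern, text):
	            for i in range(m.start(), m.end()):
	                row[i] = "^"
	        print("".join(row), "<-", label)

	annotate("x1 = 42 + offset")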

There are various situations that might arise in which this tool could help.

COURSE PREREQUISITES: -

EQUIPMENT: -

Privatising Java

Difficulty grading: INF=F, SH=F, BM=F, CM=F
Area: Programming Language Processing
Max number of students who can do this project: 1

It is easy to write Java programs that make every method and variable "public". It is much harder to have the discipline to make as much as possible "private", so that the code is easier to modify in the future. This project is about taking a set of Java class sources and modifying them to make as much as possible "private".

A first attempt might just be to go through the sources inserting "private" in front of every declaration, recompiling them, and dealing with the resulting error messages. Better might be to first go through the sources making a list of everything that looks like "name1.name2" (using name1 to help work out which class name2 comes from), and then go through all the sources again, inserting "private" in front of the declaration of any name not found in the first step.
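
A crude sketch of that two-pass idea follows. It is regex-based, so it is easily fooled by comments, strings and local variables; the project itself should use a proper Java parser (see below):

	import re, sys
	from pathlib import Path

	QUALIFIED = re.compile(r"\b\w+\.(\w+)\b")     # pass 1: name1.name2 uses
	# Very rough "member declaration": an indented line of the form
	# "<type> <name> =|(|;" that is not already given a visibility.
	DECL = re.compile(r"^(\t| {4})(?!private |public |protected )"
	                  r"(?=[\w<>\[\]]+ (\w+)\s*[=(;])", re.M)

	def privatise(paths):
	    texts = {p: p.read_text() for p in paths}
	    used = set()
	    for text in texts.values():               # pass 1: collect uses
	        used.update(QUALIFIED.findall(text))
	    for p, text in texts.items():             # pass 2: insert "private"
	        print(DECL.sub(lambda m: m.group(0) if m.group(2) in used
	                       else m.group(1) + "private ", text))

	privatise([Path(a) for a in sys.argv[1:]])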

A possible extension might be to find instances where variables could not be made private as above, and then insert "getVar" and "setVar" methods so that even those variables could be made private. Another possible extension would be to try to make use of "protected" as well.

It would be sensible to reuse an existing open-source Java parser, rather than trying to construct one from scratch.

REFERENCES: Making fields and methods as private as possible - a postmodern Thatcher.

COURSE PREREQUISITES: -

EQUIPMENT: -

Deriving regular expressions from example texts

Difficulty grading: INF=F, SH=F, BM=F, CM=F
Area: Programming Language Processing
Max number of students who can do this project: 1

Recognising text is a common problem, often best solved by writing a grammar and then using a parser-generator (like lex or yacc) to automatically create a recogniser from the grammar. However, writing the grammar to begin with can be very difficult. Perhaps this process could be simplified by a tool that interacted with the user to derive a grammar from an example piece of text.

A complete grammar for a programming language would probably have to be created in two phases. The first (this project) would be simply to recognise individual "words" such as numbers, names, operators and punctuation for use with e.g. lex. This would then be used to create a BNF linking the words into "sentences" such as expressions, statements and declarations for use with e.g. yacc.

The tool might recognise typical strings, such as numbers consisting of digit characters "0" to "9", and perhaps also using "+" and "-" and ".". The tool might then interact with the user to decide whether these strings really were complete words, or whether we are trying to recognise something more complex e.g. course-unit codes like "COMP30900".

A significant problem will be preventing the tool from over-using an incomplete grammar. For example, if it is told that "for" is the start of a for-statement, it shouldn't try to use the "for" inside "before".

Another problem is that of white space (spaces, tabs, newlines etc.). In some circumstances, such as in typical computer languages like Java and C, white space can usually be discarded, although it may separate words. In other circumstances, such as input to a database, white space may be very important as tabs may be used to separate fields and newlines used to separate records. It may be hard to recognise these fundamentally different situations without significant help from the user.

A simple example is a list of symbols such as: abcdabcdddabcabcdd. The first step would be to identify boundaries between repeated groups, i.e. abcd abcddd abc abcdd. The next step would be to distinguish between special cases and generalisations: do we permit only 0 to 3 instances of "d", or do we permit any number of instances? Must the list of groups contain exactly 4 instances of "abcd...", or any number? Must the instances be in the order "...d ...ddd ... ...dd", or is any order possible?
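
A sketch of that generalisation step for this example, producing the competing regexps a real tool would offer the user to choose between. The splitting heuristic is hard-wired here, whereas the tool would have to discover it:

	import re

	sample = "abcdabcdddabcabcdd"
	groups = re.findall(r"abcd*", sample)   # ['abcd', 'abcddd', 'abc', 'abcdd']
	lo = min(g.count("d") for g in groups)
	hi = max(g.count("d") for g in groups)

	candidates = [
	    "(?:abcd{%d,%d}){%d}" % (lo, hi, len(groups)),  # exactly 4 groups, 0-3 d's
	    "(?:abcd{%d,%d})+" % (lo, hi),                  # any number of groups
	    "(?:abcd*)+",                                   # any number of d's too
	    re.escape(sample),                              # exact text, no generalisation
	]
	for c in candidates:
	    print(c, bool(re.fullmatch(c, sample)))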

REFERENCES: Programming By Example
an example of using lex

COURSE PREREQUISITES: -

EQUIPMENT: -

Deriving grammars from parse trees

Difficulty grading: INF=F, SH=F, BM=F, CM=F
Area: Programming Language Processing
Max number of students who can do this project: 1

Creating a grammar for a computer language can be quite hard - often, we have a good idea of the sort of program or script we want to be able to write, but much less idea of how to represent that in a grammar. The idea is to create a system that allows the user to start from such an example text in a window, interact with the system to draw its "parse tree", and finally derive a grammar from the tree.

For the first step, for example, one might tell the system that this part of the text is an expression, the next part is a declaration, the next an if statement, and so on. One could then build up (this is a list of statements) and down (this is a sub-expression; this is a number, this is a variable, this is an operator) to create the complete parse tree.

The final step would be to take these parse trees, take each node in the tree as an example use of a grammar rule, and merge and generalise these to build up a complete grammar. For example, we might have examples of expressions containing several different operators, merge them to allow any of the given operators from the example, and then generalise to include all the operators we might actually want to use. We might even look at using concepts like Unification to help automate the merging.
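
A sketch of the rule-extraction step, with the user's annotations represented as nested tuples. The "merging" here is just set union of alternatives; real generalisation would go further:

	from collections import defaultdict

	def extract(tree, rules):
	    # Each internal node (label, child, ...) is one example use of a rule.
	    label, children = tree[0], tree[1:]
	    if children:
	        rules[label].add(tuple(c[0] if isinstance(c, tuple) else c
	                               for c in children))
	        for c in children:
	            if isinstance(c, tuple):
	                extract(c, rules)

	rules = defaultdict(set)
	# Two expressions annotated by the user: 1+2 and x*3.
	extract(("expr", ("expr", ("number", "1")), ("op", "+"),
	                 ("expr", ("number", "2"))), rules)
	extract(("expr", ("expr", ("name", "x")), ("op", "*"),
	                 ("expr", ("number", "3"))), rules)
	for lhs, alts in rules.items():
	    print(lhs, "->", " | ".join(" ".join(a) for a in alts))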

Other facilities might be to provide suggestions or alternatives to choose from, to help the user when creating the parse trees or when converting the parse trees to a grammar (e.g. to help deal with operator precedence).

REFERENCES: -

COURSE PREREQUISITES: -

EQUIPMENT: -

Animating parse-trees

Difficulty grading: INF=F, SH=F, BM=F, CM=F
Area: Programming Language Processing
Max number of students who can do this project: 1

When one is working with a non-trivial grammar, e.g. for some programming language, it can be very difficult to understand exactly what each part of the grammar is used for. It would be nice to use a tool that could display the grammar and an example program recognised by the grammar, and then be able to select either a part of a program and have the corresponding part of the grammar highlighted, or select a grammar rule and have corresponding parts of the program text highlighted.

One way of accomplishing this would be to actually run the grammar through a parser-generator such as yacc, and use it to parse the example program, creating a parse-tree as it does so. The tool could then use the parse-tree to decide which parts of the grammar or program text to highlight for the user. (Berkeley yacc can be made to automatically output information about a parse, which the tool could read in and use to create the parse-tree.)

The tool could be enhanced to display more detailed information from the parse tree - for example, if the user selected a statement, it might highlight the individual lexemes - names, numbers, etc. - using different colours to indicate the lexeme type, and then draw nested boxes around groups of lexemes to indicate how they are recognised by the grammar rules to form sub-expressions and larger expressions etc.
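
The core lookup needed for this highlighting could be as simple as the following sketch, assuming each tree node carries its rule name and the span of source text it covers:

	def node_at(tree, offset):
	    # Find the smallest node whose span covers the selected offset.
	    rule, start, end, children = tree
	    if not (start <= offset < end):
	        return None
	    for child in children:
	        hit = node_at(child, offset)
	        if hit:
	            return hit
	    return tree

	# "x+1" as expr(term '+' term); spans are character offsets.
	tree = ("expr", 0, 3, [("term", 0, 1, []), ("'+'", 1, 2, []),
	                       ("term", 2, 3, [])])
	print(node_at(tree, 2)[0])   # -> term, so the "term" rule is highlighted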

REFERENCES: -

COURSE PREREQUISITES: -

EQUIPMENT: -

A GUI for the design and generation of grammars

Difficulty grading: INF=F, SH=F, BM=F, CM=F
Area: Programming Language Processing
Max number of students who can do this project: 1

You are probably familiar with the idea of Graphical User Interfaces (GUIs) and the kinds of tools that can be used to create them. However, we often need to build a Textual User Interface (TUI) - for example, to recognise an expression to be drawn by a graph-plotting package, or to input a file of data into a statistical package - and although tools exist to help, they are not necessarily easy to use, particularly to begin with. What is needed is a graphical tool that allows a user to build an informal description of the structure of the text, which can then be transformed into a formal description (a grammar and/or a parser program).

I would like the user to be able to "drag and drop" boxes corresponding to ideas like "a list of ..." or "a choice between ...", and to be able to nest them one inside the other as required, e.g.:

	a number is
	    a list of one or more things, where each thing is
	        a digit character

	an identifier is
	    a letter character
	    followed by
	    a list of none or more things, where each thing is
	        a choice between
	            a letter character
	        or
	            a digit character

It must be possible for the user to experiment with the structure, e.g. by changing one "thing" into "a list of things" or back again. Finally, it should be possible to output the grammar as Regular Expressions or BNF, or even in a form suitable for parsing tools like lex and yacc.
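
Under the hood, the boxes might map onto a small class hierarchy that can render itself as a regular expression (BNF output would follow the same pattern). A sketch, with the class names invented for illustration:

	class Digit:
	    def to_regex(self): return "[0-9]"

	class Letter:
	    def to_regex(self): return "[A-Za-z]"

	class ListOf:                 # "a list of one or more / none or more things"
	    def __init__(self, thing, at_least_one=True):
	        self.thing, self.at_least_one = thing, at_least_one
	    def to_regex(self):
	        return "(?:%s)%s" % (self.thing.to_regex(),
	                             "+" if self.at_least_one else "*")

	class ChoiceOf:               # "a choice between ... or ..."
	    def __init__(self, *things): self.things = things
	    def to_regex(self):
	        return "(?:%s)" % "|".join(t.to_regex() for t in self.things)

	class SequenceOf:             # "... followed by ..."
	    def __init__(self, *things): self.things = things
	    def to_regex(self):
	        return "".join(t.to_regex() for t in self.things)

	number = ListOf(Digit())
	identifier = SequenceOf(Letter(),
	                        ListOf(ChoiceOf(Letter(), Digit()),
	                               at_least_one=False))
	print(number.to_regex())       # (?:[0-9])+
	print(identifier.to_regex())   # [A-Za-z](?:(?:[A-Za-z]|[0-9]))*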

If time allows, the basic concept could be extended in various ways.

REFERENCES: examples

COURSE PREREQUISITES: -

EQUIPMENT: -

Deriving BNF from example texts

Difficulty grading: INF=F, SH=F, BM=F, CM=F
Area: Programming Language Processing
Max number of students who can do this project: 1

Recognising text is a common problem, often best solved by writing a grammar and then using a parser-generator (like lex or yacc) to automatically create a recogniser from the grammar. However, writing the grammar to begin with can be very difficult. Perhaps this process could be simplified by a tool that interacted with the user to derive a grammar from an example piece of text.

A complete grammar for a programming language would probably have to be created in two phases. The first would be simply to recognise individual "words" such as numbers, names, operators and punctuation for use with e.g. lex. This would then be used to create a BNF (this project) linking the words into "sentences" such as expressions, statements and declarations for use with e.g. yacc. For this project, therefore, we will have to assume that the first phase has already happened, by breaking the given text up into "words" using some simple algorithm.

Perhaps the user would select an example statement, and then interact with the tool to construct a simple grammar rule for it. The tool could then attempt to match other chunks of text with this rule, and the user could agree or disagree, allowing the tool to refine the grammar rule.

A significant problem will be preventing the tool from over-using an incomplete grammar. For example, if it knows how to recognise an expression consisting of names, numbers and operators, it shouldn't assume that the names in a declaration are also expressions.

Another problem is that of dealing with a hierarchical structure, for example trying to recognise if-statements and assignments. We can assume that we have a list of simple symbols, such as: i(n<n){n=n;i(n)n=n;} ("i" stands for "if", "n" for a name, and so on). We need to identify n=...; as an assignment statement, and then more complex statement forms, such as the if-statement: i(n) followed by a statement, and the block: { } containing one or more statements, and so on.
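
A sketch of that bottom-up recognition on the symbol string, repeatedly collapsing known statement shapes to a single symbol S. Conditions are crudely pre-collapsed here to keep the sketch short:

	import re

	RULES = [
	    (re.compile(r"n=n;"), "S"),        # assignment
	    (re.compile(r"i\(n\)S"), "S"),     # if (cond) statement
	    (re.compile(r"\{S+\}"), "S"),      # block of one or more statements
	]

	text = "i(n<n){n=n;i(n)n=n;}".replace("n<n", "n")   # pre-collapse conditions
	while True:
	    new = text
	    for pattern, symbol in RULES:
	        new = pattern.sub(symbol, new)
	    if new == text:
	        break
	    text = new
	print(text)   # a single "S" means the whole input was recognised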

REFERENCES: Programming By Example
an example of using lex + yacc

COURSE PREREQUISITES: -

EQUIPMENT: -

Illustrating different parsing strategies

Difficulty grading: INF=F, SH=F, BM=F, CM=F
Area: Teaching Tools
Max number of students who can do this project: 1

Chart parsing is a very simple, general and flexible parsing technique that, unlike those used in most parsers, is able to deal with any context-free grammar. For example, it can handle ambiguous grammars quite simply. It pays for this generality by being significantly less efficient. Because of its flexibility, it is straightforward to make versions with different parsing strategies, such as top-down or bottom-up. These characteristics would make it ideal for use in teaching, to demonstrate parsing strategies.

Chart parsing is mainly used for recognising natural languages, such as English, and various toolkits exist for this, such as NLTK. However, facilities that are important for natural languages, such as being able to deal with enormous vocabularies, words with more than one meaning, and imprecise grammars, are less useful for dealing with computer languages. This project consists of creating a chart-parser that is suitable for use in teaching, perhaps by modifying NLTK or some similar package, or more likely by designing and implementing it from scratch.
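
To show how little machinery is needed, here is a minimal bottom-up chart recogniser. This sketch only answers yes/no; the project would record how each chart entry was built, so that parse trees and different strategies can be displayed:

	# chart[i][j] = set of symbols known to derive tokens[i:j].
	GRAMMAR = {"E": [["E", "+", "E"], ["n"]]}   # deliberately ambiguous

	def covers(chart, i, j, rhs):
	    # Can the symbols in rhs tile the span i..j?
	    if not rhs:
	        return i == j
	    return any(rhs[0] in chart[i][k] and covers(chart, k, j, rhs[1:])
	               for k in range(i, j + 1))

	def chart_parse(tokens, grammar, start):
	    n = len(tokens)
	    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
	    for i, tok in enumerate(tokens):
	        chart[i][i + 1].add(tok)
	    changed = True
	    while changed:                      # apply rules until a fixpoint
	        changed = False
	        for i in range(n):
	            for j in range(i + 1, n + 1):
	                for lhs, rhss in grammar.items():
	                    for rhs in rhss:
	                        if lhs not in chart[i][j] and covers(chart, i, j, rhs):
	                            chart[i][j].add(lhs)
	                            changed = True
	    return start in chart[0][n]

	print(chart_parse(["n", "+", "n", "+", "n"], GRAMMAR, "E"))   # True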

If time permits, there are various possible extensions:
It would be nice if the tool had a simple GUI, to allow parse trees to be displayed, or to allow a user to "single step" through a parse.
Another extension might be to adapt the parser to be able to deal with EBNF as well as BNF.

REFERENCES: NLTK

COURSE PREREQUISITES: -

EQUIPMENT: -

Incrementally writing and debugging grammars

Difficulty grading: INF=F, SH=F, BM=F, CM=F
Area: Programming Language Processing
Max number of students who can do this project: 1

Most programmers, at some point in their careers, need to implement a textual part of a user interface, or a textual configuration file, or similar. The implication of this is that most people who need to use grammars (i.e. regular expressions and/or BNF) will not be experts in their use. Unfortunately, learning how to make best use of grammars is nearly as hard as learning how to program for the first time, as it requires a different way of thinking about programming tasks.

One way to start could be to construct some example texts (e.g. programs) and then try to design grammars that matched them, using a parser-generator (like lex or yacc) to automatically create a recogniser from the grammar. However, it is very hard to write and test just part of the grammar - you usually need an essentially complete grammar to get started! Instead, it would be nice if one could simply use fragments of grammar - for example experiment with using if-statements of the form if ( ... ) ... else ... where we don't yet want to have to worry about what "..." represents or about what precedes or follows the if-statement. (For example, the parser might simply take "..." to mean anything at all except another "if" or "else".)
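
To make the idea concrete, a single grammar fragment might be approximated with a regexp in which each hole simply matches anything up to the next closing bracket or semicolon. This is a deliberately crude sketch; the chart parser described below would do the job properly:

	import re

	COND = r"[^()]*"                 # hole inside the parentheses
	STMT = r"[^;{}]*;"               # crude "statement": anything up to a ';'
	IF_FRAGMENT = re.compile(r"\bif\s*\(" + COND + r"\)\s*" + STMT
	                         + r"\s*else\s*" + STMT)

	sample = "x = 1; if (a < b) y = 2; else y = 3; z = 4;"
	m = IF_FRAGMENT.search(sample)
	print(m.group(0) if m else "no match")
	# -> if (a < b) y = 2; else y = 3;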

Chart parsing is a very simple and very general parsing technique that, unlike those used in most parsers, is able to deal with any context-free grammar. For example, it can handle ambiguous or incomplete grammars. It pays for this generality by being significantly less efficient. However, this does not really matter when one is designing and testing grammars, where efficiency is much less important than flexibility e.g. coping with partial matches.

Chart parsing is mainly used for recognising natural languages, such as English, and various toolkits exist for this, such as NLTK. However, facilities that are important for natural languages, such as being able to deal with enormous vocabularies, words with more than one meaning, and imprecise grammars, are less useful for dealing with computer languages. This project consists of creating a chart-parser that is suitable for use as outlined above, perhaps by modifying NLTK or some similar package, or more likely by designing and implementing it from scratch.

REFERENCES: NLTK

COURSE PREREQUISITES: -

EQUIPMENT: -