3rd year projects for 2010/11

Popularity             Project
08/09  09/10  10/11
  -      -      ?     Why won't my regular expression match?
                      (Regular expression matching in the wild: does the
                      regexp match the whole string? Does it match a
                      substring? If so, where? Where are the submatches?)
  -      -      ?     Why won't my BNF match? (chart parsing)
  -      -      ?     Animating algorithms
  1      -      ?     (?) Mind Maps 08/09
  -      -      ?     ?
  -      2      ?     "make clean" - tidying up a filestore
  1      2      ?     Animating parse-trees
  -      1      ?     Grammar transformation
  2      1      ?     Deriving regular expressions from example texts
  1      0      ?     Deriving grammars from parse trees
  -      0      ?     Grammar-style description and conversion
  -      0      ?     Translating between Lex and Yacc
  0      0      ?     Deriving BNF from example texts
  0      0      ?     Illustrating different parsing strategies
  0      0      ?     Incrementally writing and debugging grammars

Why won't my Regexp work?

No matter whether you are just learning to use regular expressions to recognise patterns in text, or whether you have been using them all your life, there comes a moment when you think you have got it right, but the stupid thing just isn't working! What you need is a tool that shows you why the regular expression isn't matching the text. This tool should have similar abilities to e.g. egrep or flex, but also be able to show where and why the regular expression stopped matching the text.

There are several different situations that need addressing (the sketch after this list illustrates the first two):
- should the regular expression match all the text, or just part of the text?
- if it should match all the text, is it failing partway, or is it failing to start matching at all?
- if it should match part of the text, is it failing to match the right part - is it matching too much or too little?
- are we using just one regular expression, like egrep, or several, like lex?
- if the latter, is it that the wrong regular expression has matched the text?
etc.
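
As a flavour of the first two questions, here is a minimal Python sketch using the standard re module; the function name and messages are invented for illustration. It distinguishes a whole-text match from a substring match and reports where the substring match occurred:

	import re

	def diagnose(pattern, text):
	    # A whole-text match answers the first question outright.
	    regexp = re.compile(pattern)
	    m = regexp.fullmatch(text)
	    if m:
	        print("matches the whole text; submatches:", m.groups())
	        return
	    # Otherwise look for a substring match, and report where it is.
	    m = regexp.search(text)
	    if m:
	        print(f"matches only text[{m.start()}:{m.end()}] = {m.group()!r}")
	    else:
	        print("no match anywhere in the text")

	diagnose(r"([0-9]+)", "abc123def")   # matches only text[3:6] = '123'

A real tool would have to go further: the standard library reports where a successful match is, but not where a failed match got stuck, which is exactly the information this project needs to provide.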

This project will involve creating a regular-expression matching program, similar to, but with more functionality than, e.g. egrep, plus a GUI to help the user make best use of the program.


Why won't my BNF work?

No matter whether you are just learning to use BNF grammars to recognise patterns in text, or whether you have been using them all your life, there comes a moment when you think you have got it right, but the stupid thing just isn't working! What you need is a tool that shows you why the BNF isn't matching the text. This tool should have similar abilities to e.g. bison, but also be able to show where and why the BNF stopped matching the text.

There are several different situations that need addressing:
- usually, BNF is intended to match the whole text, but maybe you are still developing the BNF so only partial matches are possible (e.g. you haven't implemented all the different kinds of statement, when recognising a programming language)
- if it should match the whole text, is it failing partway, or is it failing to start matching at all?
- perhaps only some of the grammar rules are working correctly, but how can we tell which ones?
- if it should match part of the text, is it failing to match the right part - is it matching too much or too little?
- is it that the wrong grammar rule has matched (part of) the text?
etc.

This project will involve creating a simple but flexible parser (perhaps using chart parsing - a slow but simple algorithm) plus a GUI to help the user make best use of the program.


Make my web-site faster!


Why have I gone over quota?

REFERENCES:
http://windirstat.info/
http://windirstat.info/images/windirstat.jpg
http://www.cs.umd.edu/hcil/treemap-history/
http://newsmap.jp
http://kdirstat.sourceforge.net/

"make clean" - tidying up a filestore

Difficulty grading: SH=F, INF=F, BM=F, CM=F

There are various build tools available, like "make" and "ant", that can help to automate the creation of files from other files (e.g. compiling Fred.java to Fred.class). However, they are not always very good at automatically tidying up after themselves - for example, deleting temporary files, or deleting the compiled files when a project is complete.

It would be nice to have a tool that knew about common conventions, like .java and .class, and .c and .o etc., and would be able to tidy up by e.g. deleting the .class files. It would be even better if it knew to ask the user if they want to make e.g. a .jar file first. There are several possible extensions to this project if time permits.
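
A minimal sketch of the core behaviour, assuming a hard-wired table of suffix conventions; a real tool would make the table configurable, know about .jar files, and ask before deleting anything:

	import os

	# Invented convention table: each derived suffix and the source
	# suffix it is normally built from.
	DERIVED = {".class": ".java", ".o": ".c"}

	def tidy(root, dry_run=True):
	    # Delete derived files whose source file is still present.
	    for dirpath, _, filenames in os.walk(root):
	        for name in filenames:
	            base, ext = os.path.splitext(name)
	            src = DERIVED.get(ext)
	            if src and base + src in filenames:
	                path = os.path.join(dirpath, name)
	                print("would delete" if dry_run else "deleting", path)
	                if not dry_run:
	                    os.remove(path)

	tidy(".")   # dry run: only prints what it would delete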

REFERENCES:
GNU make
Apache Ant

COURSE PREREQUISITES: -

EQUIPMENT: -

Grammar-style description and conversion

Difficulty grading: SH=C, INF=C, BM=C, CM=C

There are many different notations for describing context-free grammars, including many versions of BNF and EBNF. In particular, almost every software tool that deals with grammars, like yacc, has its own version. However, most of these different notations have exactly the same expressive power, so their main effect is just to make it difficult to take a grammar designed for one tool and use it with another tool.

It would be useful to create a system that can take a grammar in any notation and transform it to another notation, driven purely by some standard description of the two notations. (As an example of a similar facility in a different context, systems such as termcap or terminfo are able to cope with many different kinds of terminals by using files containing descriptions of their characteristics.)

Even a simple system only capable of converting between notations of similar complexity, such as between any two varieties of BNF, would be useful. However, it would be even more useful if it could also be used to convert between BNF and EBNF, or to deal with the extras and idiosyncrasies of actual grammar-driven tools like lex and yacc, such as actions and declarations.
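
As a hint of what "driven purely by a description" might mean, here is a minimal Python sketch; the two notation descriptions and their keys are invented for illustration, and it only handles a single flat rule:

	# Invented descriptions: the punctuation each notation uses for
	# "is defined as", "or", and end-of-rule.
	NOTATIONS = {
	    "yacc": {"def": ":", "alt": "|", "end": ";"},
	    "ebnf": {"def": "=", "alt": "|", "end": "."},
	}

	def convert(rule, src, dst):
	    # Translate one grammar rule between two described notations.
	    s, d = NOTATIONS[src], NOTATIONS[dst]
	    body = rule.strip().rstrip(s["end"]).strip()
	    lhs, rhs = body.split(s["def"], 1)
	    alts = [a.strip() for a in rhs.split(s["alt"])]
	    return (lhs.strip() + " " + d["def"] + " "
	            + (" " + d["alt"] + " ").join(alts) + " " + d["end"])

	print(convert("expr : expr '+' term | term ;", "yacc", "ebnf"))
	# prints: expr = expr '+' term | term .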

REFERENCES:
BNF and EBNF and a comparison of various BNFs and EBNFs
"man termcap" or "man terminfo"
example request

COURSE PREREQUISITES: -

EQUIPMENT: -

Grammar transformation

Difficulty grading: SH=C, INF=C, BM=C, CM=C

The Java Compiler Compiler (JavaCC) is a parser generator originally created by Sun for use with their Java language. It is roughly equivalent to tools like Lex and Yacc, but has different underlying mechanisms. This makes it easier for humans to understand what it is doing, but also makes the basic system less powerful. Because of this, users have to put some effort into modifying an initial grammar to make it acceptable to JavaCC.

For example, we would normally recognise non-associative, left-associative and right-associative operators using non-recursive, left-recursive and right-recursive grammars respectively:

	  non_expression = sub_exp | sub_exp '=' sub_exp
	 left_expression = sub_exp | left_expression '+' sub_exp
	right_expression = sub_exp | sub_exp '^' right_expression

However, with JavaCC, we have to rewrite the grammars:
	  non_expression = sub_exp ( '=' sub_exp )?
	 left_expression = sub_exp ( '+' sub_exp )*
	right_expression = sub_exp ( '^' right_expression )?

This project is about producing a program that helps users to modify their grammars, to get the most out of JavaCC. The program would be interactive, so that it could point out problem areas, make suggestions, and allow the user to pick a suitable modification. Some specific areas that could be investigated are converting left recursion to iteration (as for left_expression above), left-factoring (as for non_expression and right_expression above), using look-ahead to remove ambiguities, and adding error handling.
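
A minimal sketch of just the first of those transformations, converting direct left recursion to iteration; the rule representation (a name plus a list of alternatives, each a list of symbols) is invented for illustration, and it assumes at most one recursive alternative:

	def remove_left_recursion(name, alts):
	    # Split alternatives into base cases and left-recursive ones.
	    base = [a for a in alts if a[0] != name]
	    rec = [a[1:] for a in alts if a[0] == name]
	    if not rec:
	        return name + " = " + " | ".join(" ".join(a) for a in alts)
	    # X = base | X tail   becomes   X = base ( tail )*
	    tail = " ".join(rec[0])
	    return name + " = " + " ".join(base[0]) + " ( " + tail + " )*"

	print(remove_left_recursion(
	    "left_expression",
	    [["sub_exp"], ["left_expression", "'+'", "sub_exp"]]))
	# prints: left_expression = sub_exp ( '+' sub_exp )*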

There are several possible extensions to this project if time permits, implementing other similar transformations on grammars, such as removing empty, trivial, or unusable grammar rules.

REFERENCES:
JavaCC and a similar, but not identical, tool TCLLk
A bigger example

COURSE PREREQUISITES: -

EQUIPMENT: -

Translating between Lex and Yacc

Difficulty grading: SH=L, BM=L, CM=L

At first glance, it is a completely stupid idea to want to take a lex program and create a yacc program that does exactly the same thing (or vice versa). However, when writing a grammar-driven program such as a compiler, one of the decisions that has to be made is the exact dividing line between those parts of the overall grammar that will be handled by lex and those that will be handled by yacc, and inevitably the line isn't always correct the first time.

It would be useful if it were possible to feed all or part of a lex grammar into a tool that could build the corresponding yacc, or vice-versa, so that this boundary could be experimented with. It would also be interesting, from a purely teaching viewpoint, to illustrate the relative powers of the two notations (i.e. regular expressions and BNF).

An example given in COMP20121 was converting between the lex:

	[0-9]+ {return number;}
and the yacc:
	number : digit | number digit ;
	digit  : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
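
A minimal Python sketch of this particular direction of translation; the function and its parameters are invented for illustration, and it only covers the one pattern shape in the example:

	def charclass_plus_to_yacc(chars, name, elem):
	    # Rewrite NAME matching [chars]+ as the two yacc rules above:
	    # repetition becomes left recursion, the class becomes alternatives.
	    alts = " | ".join("'" + c + "'" for c in chars)
	    return (name + " : " + elem + " | " + name + " " + elem + " ;\n"
	            + elem + "  : " + alts + " ;")

	print(charclass_plus_to_yacc("0123456789", "number", "digit"))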

COURSE PREREQUISITES: -

EQUIPMENT: -

Deriving regular expressions from example texts

Difficulty grading: INF=F, SH=F, BM=F, CM=F
Area: Programming Language Processing
Max number of students who can do this project: 1

Recognising text is a common problem, often best solved by writing a grammar and then using a parser-generator (like lex or yacc) to automatically create a recogniser from the grammar. However, writing the grammar to begin with can be very difficult. Perhaps this process could be simplified by a tool that interacted with the user to derive a grammar from an example piece of text.

A complete grammar for a programming language would probably have to be created in two phases. The first (this project) would be simply to recognise individual "words" such as numbers, names, operators and punctuation for use with e.g. lex. This would then be used to create a BNF linking the words into "sentences" such as expressions, statements and declarations for use with e.g. yacc.

The tool might recognise typical strings, such as numbers consisting of digit characters "0" to "9", and perhaps also using "+" and "-" and ".". The tool might then interact with the user to decide whether these strings really were complete words, or whether we are trying to recognise something more complex e.g. course-unit codes like "COMP30900".

A significant problem will be preventing the tool from over-using an incomplete grammar. For example, if it is told that "for" is the start of a for-statement, it shouldn't try to use the "for" inside "before".

Another problem is that of white space (spaces, tabs, newlines etc.). In some circumstances, such as in typical computer languages like Java and C, white space can usually be discarded, although it may separate words. In other circumstances, such as input to a database, white space may be very important as tabs may be used to separate fields and newlines used to separate records. It may be hard to recognise these fundamentally different situations without significant help from the user.

A simple example is a list of symbols such as: abcdabcdddabcabcdd. The first step would be to identify boundaries between repeated groups, i.e. abcd abcddd abc abcdd. The next step would be to distinguish between special cases and generalisations, i.e. do we permit only 0 to 3 instances of "d", or do we permit any number of instances? Must the list of groups contain exactly 4 instances of "abcd...", or any number? Must the instances be in the order "...d ...ddd ... ...dd", or is any order possible?
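
A minimal Python sketch of exactly this example; the boundary-finding step is hard-wired to the known "abc" prefix, which is precisely the part a real tool would have to discover by interacting with the user:

	import re

	text = "abcdabcdddabcabcdd"
	groups = re.findall(r"abcd*", text)
	print(groups)   # ['abcd', 'abcddd', 'abc', 'abcdd']

	special = r"(abcd{0,3}){4}"   # exactly the counts seen in the example
	general = r"(abcd*)+"         # any number of groups, any number of d's
	for pattern in (special, general):
	    print(pattern, "matches:", bool(re.fullmatch(pattern, text)))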

REFERENCES: Programming By Example
an example of using lex

COURSE PREREQUISITES: -

EQUIPMENT: -

Deriving grammars from parse trees

Difficulty grading: INF=F, SH=F, BM=F, CM=F
Area: Programming Language Processing
Max number of students who can do this project: 1

Creating a grammar for a computer language can be quite hard - often, we have a good idea of the sort of program or script we want to be able to write, but much less idea how to represent that in a grammar. The idea would be to create a system that would allow the user to start from such an example text in a window and then interact with the system to draw the "parse tree", and finally derive a grammar from it.

For the first step, for example, one might tell the system that this part of the text is an expression, the next part is a declaration, the next an if statement, and so on. One could then build up (this is a list of statements) and down (this is a sub-expression; this is a number, this is a variable, this is an operator) to create the complete parse tree.

The final step would be to take these parse trees, take each node in the tree as an example use of a grammar rule, and merge and generalise these to build up a complete grammar. For example, we might have examples of expressions containing several different operators, merge them to allow any of the given operators from the example, and then generalise to include all the operators we might actually want to use. We might even look at using concepts like Unification to help automate the merging.
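
A minimal sketch of the merging step, assuming parse trees are represented as (label, children) tuples; each internal node is treated as one example use of a rule, and the generalisation step is left out:

	from collections import defaultdict

	tree = ("expr", [("expr", [("number", [])]),
	                 ("op", []),
	                 ("expr", [("variable", [])])])

	rules = defaultdict(set)

	def collect(node):
	    # Record this node's children as one right-hand side of its rule.
	    label, children = node
	    if children:
	        rules[label].add(tuple(child[0] for child in children))
	        for child in children:
	            collect(child)

	collect(tree)
	for lhs, rhss in rules.items():
	    print(lhs, "=", " | ".join(" ".join(r) for r in sorted(rhss)))
	# prints: expr = expr op expr | number | variable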

Other facilities might be to provide suggestions or alternatives to choose from, to help the user when creating the parse trees or when converting the parse trees to a grammar (e.g. to help deal with operator precedence).

REFERENCES: -

COURSE PREREQUISITES: -

EQUIPMENT: -

Animating parse-trees

Difficulty grading: INF=F, SH=F, BM=F, CM=F
Area: Programming Language Processing
Max number of students who can do this project: 1

When one is working with a non-trivial grammar, e.g. for some programming language, it can be very difficult to understand exactly what each part of the grammar is used for. It would be nice to use a tool that could display the grammar and an example program recognised by the grammar, and then be able to select either a part of a program and have the corresponding part of the grammar highlighted, or select a grammar rule and have corresponding parts of the program text highlighted.

One way of accomplishing this would be to actually run the grammar through a parser-generator such as yacc, and use it to parse the example program, creating a parse-tree as it does so. The tool could then use the parse-tree to decide which parts of the grammar or program text to highlight for the user. (Berkeley yacc can be made to automatically output information about a parse, which could be read in by the tool and used to create the parse-tree - see reference below.)
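
A minimal sketch of the lookup at the heart of this, assuming each parse-tree node records its rule name and the span of program text it covers (the representation is invented for illustration): selecting a position in the text yields the chain of grammar rules to highlight.

	tree = ("statement", 0, 9, [("expr", 0, 4, []), ("expr", 7, 9, [])])

	def rules_covering(node, pos, path=()):
	    # Walk down the tree collecting every rule whose span covers pos.
	    name, start, end, children = node
	    if not (start <= pos < end):
	        return ()
	    for child in children:
	        inner = rules_covering(child, pos, path + (name,))
	        if inner:
	            return inner
	    return path + (name,)

	print(rules_covering(tree, 2))   # ('statement', 'expr')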

The tool could be enhanced to display more detailed information from the parse tree - for example, if the user selected a statement, it might highlight the individual lexemes - names, numbers, etc. - using different colours to indicate the lexeme type, and then draw nested boxes around groups of lexemes to indicate how they are recognised by the grammar rules to form sub-expressions and larger expressions etc.

REFERENCES: -

COURSE PREREQUISITES: -

EQUIPMENT: -

Deriving BNF from example texts

Difficulty grading: INF=F, SH=F, BM=F, CM=F
Area: Programming Language Processing
Max number of students who can do this project: 1

Recognising text is a common problem, often best solved by writing a grammar and then using a parser-generator (like lex or yacc) to automatically create a recogniser from the grammar. However, writing the grammar to begin with can be very difficult. Perhaps this process could be simplified by a tool that interacted with the user to derive a grammar from an example piece of text.

A complete grammar for a programming language would probably have to be created in two phases. The first would be simply to recognise individual "words" such as numbers, names, operators and punctuation for use with e.g. lex. This would then be used to create a BNF (this project) linking the words into "sentences" such as expressions, statements and declarations for use with e.g. yacc. For this project, therefore, we will have to assume that the first phase has already happened, by breaking the given text up into "words" using some simple algorithm.

Perhaps the user would select an example statement, and then interact with the tool to construct a simple grammar rule for it. The tool could then attempt to match other chunks of text with this rule, and the user could agree or disagree, allowing the tool to refine the grammar rule.

A significant problem will be preventing the tool from over-using an incomplete grammar. For example, if it knows how to recognise an expression consisting of names, numbers and operators, it shouldn't assume that the names in a declaration are also expressions.

Another problem is that of dealing with a hierarchical structure, for example trying to recognise if-statements and assignments. We can assume that we have a list of simple symbols, such as: i(n<n){n=n;i(n)n=n;} ("i" stands for "if", "n" for a name, and so on). We need to identify n=...; as an assignment statement, and then more complex statement forms, such as the if-statement: i(n) followed by a statement, and the block: { } containing one or more statements, and so on.
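
A minimal Python sketch of recognising that symbol string bottom-up: each rule rewrites a recognised statement form to "s", and the invented placeholder rules are applied repeatedly until nothing more matches:

	import re

	text = "i(n<n){n=n;i(n)n=n;}"
	rules = [
	    (r"n=n;", "s"),              # assignment statement
	    (r"i\(n[<>=]?n?\)s", "s"),   # if ( expression ) statement
	    (r"\{s+\}", "s"),            # block of statements
	]

	changed = True
	while changed:
	    changed = False
	    for pattern, repl in rules:
	        new = re.sub(pattern, repl, text, count=1)
	        if new != text:
	            print(pattern, "->", new)
	            text, changed = new, True
	# the whole text eventually reduces to the single statement "s"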

REFERENCES: Programming By Example
an example of using lex + yacc

COURSE PREREQUISITES: -

EQUIPMENT: -

Illustrating different parsing strategies

Difficulty grading: INF=F, SH=F, BM=F, CM=F
Area: Teaching Tools
Max number of students who can do this project: 1

Chart parsing is a very simple, general and flexible parsing technique that, unlike those used in most parsers, is able to deal with any context-free grammar. For example, it can handle ambiguous grammars quite simply. It pays for this generality by being significantly less efficient. Because of its flexibility, it is straightforward to make versions with different parsing strategies, such as top-down or bottom-up. These characteristics would make it ideal for use in teaching, to demonstrate parsing strategies.

Chart parsing is mainly used for recognising natural languages, such as English, and various toolkits exist for this, such as NLTK. However, facilities that are important for natural languages, such as being able to deal with enormous vocabularies, words with more than one meaning, and imprecise grammars, are less useful for dealing with computer languages. This project consists of creating a chart-parser that is suitable for use in teaching, perhaps by modifying NLTK or some similar package, or more likely by designing and implementing it from scratch.
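
As a flavour of what such a teaching tool might demonstrate, here is a minimal sketch using NLTK itself (assuming NLTK 3's nltk.parse.chart module), parsing the same left-recursive toy grammar with two different chart-parsing strategies:

	from nltk import CFG
	from nltk.parse.chart import TopDownChartParser, BottomUpChartParser

	# A deliberately left-recursive grammar: chart parsers cope with it.
	grammar = CFG.fromstring("""
	  E -> E '+' T | T
	  T -> 'n'
	""")
	tokens = "n + n".split()

	for strategy in (TopDownChartParser, BottomUpChartParser):
	    for tree in strategy(grammar).parse(tokens):
	        print(strategy.__name__, tree)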

If time permits, there are various possible extensions:
- It would be nice if the tool had a simple GUI, to allow parse trees to be displayed, or to allow a user to "single step" through a parse.
- Another extension might be to adapt the parser to be able to deal with EBNF as well as BNF.

REFERENCES: NLTK

COURSE PREREQUISITES: -

EQUIPMENT: -

Incrementally writing and debugging grammars

Difficulty grading: INF=F, SH=F, BM=F, CM=F
Area: Programming Language Processing
Max number of students who can do this project: 1

Most programmers, at some point in their careers, need to implement a textual part of a user interface, or a textual configuration file, or similar. The implication of this is that most people who need to use grammars (i.e. regular expressions and/or BNF) will not be experts in their use. Unfortunately, learning how to make best use of grammars is nearly as hard as learning how to program for the first time, as it requires a different way of thinking about programming tasks.

One way to start could be to construct some example texts (e.g. programs) and then try to design grammars that matched them, using a parser-generator (like lex or yacc) to automatically create a recogniser from the grammar. However, it is very hard to write and test just part of the grammar - you usually need an essentially complete grammar to get started! Instead, it would be nice if one could simply use fragments of grammar - for example experiment with using if-statements of the form if ( ... ) ... else ... where we don't yet want to have to worry about what "..." represents or about what precedes or follows the if-statement. (For example, the parser might simply take "..." to mean anything at all except another "if" or "else".)
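
A minimal regular-expression sketch of that last suggestion, taking "..." to mean anything at all except another "if" or "else"; the placeholder pattern is invented for illustration:

	import re

	# What "..." stands for: one or more characters, none of which
	# begins the word "if" or "else".
	hole = r"(?:(?!\b(?:if|else)\b).)+?"
	fragment = re.compile(
	    r"if\s*\(\s*" + hole + r"\s*\)\s*" + hole + r"\s*else\s+" + hole + r"$")

	for text in ("if (x > 0) y = 1; else y = 2;",
	             "if (x) if (y) a; else b;"):
	    print(text, "->", "matches" if fragment.match(text) else "no match")

Note that the second example fails precisely because "..." was defined to exclude a nested "if" - the kind of behaviour the tool would then need to explain to the user.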

Chart parsing is a very simple and very general parsing technique that, unlike those used in most parsers, is able to deal with any context-free grammar. For example, it can handle ambiguous or incomplete grammars. It pays for this generality by being significantly less efficient. However, this does not really matter when one is designing and testing grammars, where efficiency is much less important than flexibility e.g. coping with partial matches.

Chart parsing is mainly used for recognising natural languages, such as English, and various toolkits exist for this, such as NLTK. However, facilities that are important for natural languages, such as being able to deal with enormous vocabularies, words with more than one meaning, and imprecise grammars, are less useful for dealing with computer languages. This project consists of creating a chart-parser that is suitable for use as outlined above, perhaps by modifying NLTK or some similar package, or more likely by designing and implementing it from scratch.

REFERENCES: NLTK

COURSE PREREQUISITES: -

EQUIPMENT: -