My list (make wget; make; make wgets)
| popularity 08/09 | popularity 09/10 | popularity 10/11 | Student view | Editing | Description |
|---|---|---|---|---|---|
| - | - | ? | | | Why won't my regular expression match? (Regular Expression Matching in the Wild: does the regexp match the whole string? Does the regexp match a substring of the string? If so, where? Where are the submatches?) |
| - | - | ? | | | Why won't my BNF match? (chart parsing) |
| - | - | ? | | | Animating algorithms |
| 1 | - | ? | | | (?) Mind Maps 08/09 |
| - | - | ? | | | ? |
| - | 2 | ? | students | editable | "make clean" - tidying up a filestore |
| 1 | 2 | ? | students | editable | Animating parse-trees |
| - | 1 | ? | students | editable | Grammar transformation |
| 2 | 1 | ? | students | editable | Deriving regular expressions from example texts |
| 1 | 0 | ? | students | editable | Deriving grammars from parse trees |
| - | 0 | ? | students | editable | Grammar-style description and conversion |
| - | 0 | ? | students | editable | Translating between Lex and Yacc |
| 0 | 0 | ? | students | editable | Deriving BNF from example texts |
| 0 | 0 | ? | students | editable | Illustrating different parsing strategies |
| 0 | 0 | ? | students | editable | Incrementally writing and debugging grammars |
There are several different situations that need addressing:
- should the regular expression match all the text, or just part of the
text?
- if it should match all the text, is it failing partway, or is it
failing to start matching at all?
- if it should match part of the text, is it failing to match the right
part - is it matching too much or too little?
- are we using just one regular expression, like egrep, or several, like
lex?
- if the latter, is it that the wrong regular expression has matched the
text?
etc.
This project will involve creating a regular-expression matching program, similar to, but with more functionality than, e.g. egrep, plus a GUI to help the user make best use of the program.
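The situations above can be probed directly with an ordinary regular-expression library. A minimal Python sketch using the standard `re` module (the pattern and input text are made-up examples):

```python
import re

pattern = re.compile(r"[0-9]+")   # hypothetical pattern under test
text = "item42x"                  # hypothetical input

# Should it match all the text?  fullmatch answers that directly.
print(pattern.fullmatch(text))    # None: the whole string does not match

# Should it match just part of the text?  search reports where.
m = pattern.search(text)
print(m.group(), m.span())        # 42 (4, 6): too much?  too little?

# Where are the submatches?  Each group carries its own span.
m2 = re.search(r"(\w+)=(\d+)", "timeout=30;")
print(m2.group(1), m2.span(1))    # timeout (0, 7)
print(m2.group(2), m2.span(2))    # 30 (8, 10)
```

A GUI for this project would present exactly this information - whole-string match, substring location, and submatch spans - graphically rather than as printed tuples.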
There are several different situations that need addressing:
- usually, BNF is intended to match the whole text, but maybe you are
still developing the BNF so only partial matches are possible (e.g. you
haven't implemented all the different kinds of statement, when recognising
a programming language)
- if it should match the whole text, is it failing partway, or is it
failing to start matching at all?
- perhaps only some of the grammar rules are working correctly, but how
can we tell which ones?
- if it should match part of the text, is it failing to match the right
part - is it matching too much or too little?
- is it that the wrong grammar rule has matched (part of) the text?
etc.
This project will involve creating a simple but flexible parser (perhaps using chart parsing - a slow but simple algorithm) plus a GUI to help the user make best use of the program.
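As a sketch of why this family of algorithms suits the job: the CYK algorithm, a close relative of chart parsing, fills a chart of text spans and copes with ambiguity by simply counting (or recording) every derivation. A minimal Python version for grammars in Chomsky Normal Form; the grammar here is a made-up, deliberately ambiguous example:

```python
from collections import defaultdict

# Hypothetical ambiguous grammar in CNF:  S -> S S | 'a'
binary = [("S", "S", "S")]      # A -> B C rules
lexical = [("S", "a")]          # A -> terminal rules

def cyk_counts(tokens):
    """Chart over spans: chart[i][j][A] = number of parses of tokens[i:j] as A."""
    n = len(tokens)
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tok in enumerate(tokens):
        for lhs, term in lexical:
            if term == tok:
                chart[i][i + 1][lhs] += 1
    for width in range(2, n + 1):           # shorter spans first
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):       # every split point
                for lhs, b, c in binary:
                    chart[i][j][lhs] += chart[i][k][b] * chart[k][j][c]
    return chart[0][n]

print(cyk_counts(list("aaa"))["S"])   # 2: the trees (a(aa)) and ((aa)a)
```

The O(n^3) triple loop is the "significantly less efficient" price mentioned above; in exchange, ambiguity needs no special handling at all.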
There are various build tools, like "make" and "ant", that can help to automate the creation of files from other files (e.g. compiling Fred.java to Fred.class). However, they are not always very good at automatically tidying up after themselves - for example, deleting temporary files, or deleting the compiled files when a project is complete.
It would be nice to have a tool that knew about common conventions, like .java and .class, or .c and .o etc., and would be able to tidy up by e.g. deleting the .class files. It would be even better if it knew to ask the user whether they want to make e.g. a .jar file first. There are several possible extensions.
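A first cut at such a tool might be nothing more than a scan for derived files whose source file is still present. A Python sketch (the suffix table is illustrative, not any standard):

```python
import os

# Hypothetical convention table: derived-file suffix -> source suffix it came from.
DERIVED = {".class": ".java", ".o": ".c"}

def tidy_candidates(root):
    """List derived files whose source file still exists -- safe to delete,
    since they can be rebuilt."""
    candidates = []
    for dirpath, _dirs, files in os.walk(root):
        present = set(files)
        for name in files:
            base, ext = os.path.splitext(name)
            src = DERIVED.get(ext)
            if src and base + src in present:
                candidates.append(os.path.join(dirpath, name))
    return candidates

if __name__ == "__main__":
    for path in tidy_candidates("."):
        print("candidate for deletion:", path)
```

The interactive part of the project - asking about .jar files, confirming deletions - would sit on top of a scan like this one.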
REFERENCES:
GNU make
Apache Ant
COURSE PREREQUISITES: -
EQUIPMENT: -

There are many different notations for describing context-free grammars, including many versions of BNF and EBNF. In particular, almost every software tool that deals with grammars, like yacc, has its own version. However, most of these different notations have exactly the same expressive power, so their main effect is just to make it difficult to take a grammar designed for one tool and use it with another tool.
It would be useful to create a system that can take a grammar in any notation and transform it to another notation, driven purely by some standard description of the two notations. (As an example of a similar facility in a different context, systems such as termcap or terminfo are able to cope with many different kinds of terminals by using files containing descriptions of their characteristics.)
Even a simple system only capable of converting between notations of similar complexity, such as between any two varieties of BNF, would be useful. However, it would be even more useful if it could also be used to convert between BNF and EBNF, or to deal with the extras and idiosyncrasies of actual grammar-driven tools like lex and yacc, such as actions and declarations.
REFERENCES:
BNF and EBNF
and
a comparison of various BNFs and EBNFs
"man termcap" or "man terminfo"
example request
COURSE PREREQUISITES: -
EQUIPMENT: -

The Java Compiler Compiler (JavaCC) is a parser generator originally created by Sun for use with their Java language. It is roughly equivalent to tools like Lex and Yacc, but has different underlying mechanisms. This makes it easier for humans to understand what it is doing, but also makes the basic system less powerful. Because of this, users have to put some effort into modifying an initial grammar to make it acceptable to JavaCC.
For example, we would normally recognise non-associative, left-associative and right-associative operators using non-recursive, left-recursive and right-recursive grammars respectively:
  non_expression   = sub_exp | sub_exp '=' sub_exp
  left_expression  = sub_exp | left_expression '+' sub_exp
  right_expression = sub_exp | sub_exp '^' right_expression

However, with JavaCC, we have to rewrite the grammars:

  non_expression   = sub_exp ( '=' sub_exp )?
  left_expression  = sub_exp ( '+' sub_exp )*
  right_expression = sub_exp ( '^' right_expression )?
This project is about producing a program that helps users to modify their grammars, to get the most out of JavaCC. The program would be interactive, so that it could point out problem areas, make suggestions, and allow the user to pick a suitable modification. Some specific areas that could be investigated are converting left recursion to iteration (as for left_expression above), left-factoring (as for non_expression and right_expression above), using look-ahead to remove ambiguities, and adding error handling.
There are several possible extensions to this project if time permits, implementing other similar transformations on grammars, such as removing empty, trivial, or unusable grammar rules.
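The first of those transformations can be sketched mechanically. Assuming rules are held as a left-hand side plus a list of alternatives (each a list of symbols - a made-up representation, not JavaCC's), direct left recursion A = A alpha | beta becomes A = ( beta ) ( alpha )*:

```python
def remove_left_recursion(lhs, alternatives):
    """Rewrite a directly left-recursive rule into iteration form.
    Alternatives beginning with lhs supply the repeated tails (alpha);
    the rest supply the base cases (beta)."""
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == lhs]
    base = [alt for alt in alternatives if not alt or alt[0] != lhs]
    if not recursive:
        return lhs, alternatives          # nothing to do
    alphas = " | ".join(" ".join(a) for a in recursive)
    betas = " | ".join(" ".join(b) for b in base)
    return lhs, [f"( {betas} ) ( {alphas} )*"]

print(remove_left_recursion(
    "left_expression",
    [["sub_exp"], ["left_expression", "'+'", "sub_exp"]],
))
# ('left_expression', ["( sub_exp ) ( '+' sub_exp )*"])
```

The interactive program would apply rewrites like this one only after showing the user the problem area and letting them choose, since the transformed grammar builds a different parse tree shape.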
REFERENCES:
JavaCC and
a similar, but not identical, tool
TCLLk
A bigger example
COURSE PREREQUISITES: -
EQUIPMENT: -

At first glance, it is a completely stupid idea to want to take a lex program and create a yacc program that does exactly the same thing (or vice versa). However, when writing a grammar-driven program such as a compiler, one of the decisions that has to be made is the exact dividing line between those parts of the overall grammar that will be handled by lex and those that will be handled by yacc, and inevitably the line isn't always correct the first time.
It would be useful if it were possible to feed all or part of a lex grammar into a tool that could build the corresponding yacc, or vice-versa, so that this boundary could be experimented with. It would also be interesting, from a purely teaching viewpoint, to illustrate the relative powers of the two notations (i.e. regular expressions and BNF).
An example given in COMP20121 was converting between the lex:
  [0-9]+    {return number;}

and the yacc:

  number : digit
         | number digit ;
  digit  : '0' | '1' | '2' | '3' | '4'
         | '5' | '6' | '7' | '8' | '9' ;
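The simplest direction of that conversion can be sketched in a few lines of Python: expanding a lex character class of the form [x-y]+ into the pair of yacc rules above. Only this one pattern shape is handled here; dealing with general regular expressions is the project itself:

```python
import re

def charclass_plus_to_yacc(pattern, name):
    """Turn a lex pattern '[x-y]+' into equivalent yacc rules:
    a left-recursive repetition rule over single-character alternatives."""
    m = re.fullmatch(r"\[(.)-(.)\]\+", pattern)
    if m is None:
        raise ValueError("only [x-y]+ patterns are handled in this sketch")
    lo, hi = m.group(1), m.group(2)
    chars = " | ".join(f"'{chr(c)}'" for c in range(ord(lo), ord(hi) + 1))
    return (f"{name} : digit | {name} digit ;\n"
            f"digit : {chars} ;")

print(charclass_plus_to_yacc("[0-9]+", "number"))
```

The general pattern is visible even in this toy: regular-expression repetition becomes grammar recursion, and alternation becomes rule alternatives, which is also why the reverse direction (yacc to lex) only works for the non-recursive or tail-recursive subset of BNF.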
COURSE PREREQUISITES: -
EQUIPMENT: -

Recognising text is a common problem, often best solved by writing a grammar and then using a parser-generator (like lex or yacc) to automatically create a recogniser from the grammar. However, writing the grammar to begin with can be very difficult. Perhaps this process could be simplified by a tool that interacted with the user to derive a grammar from an example piece of text.
A complete grammar for a programming language would probably have to be created in two phases. The first (this project) would be simply to recognise individual "words" such as numbers, names, operators and punctuation for use with e.g. lex. This would then be used to create a BNF linking the words into "sentences" such as expressions, statements and declarations for use with e.g. yacc.
The tool might recognise typical strings, such as numbers consisting of the digit characters "0" to "9", perhaps also using "+", "-" and ".". The tool might then interact with the user to decide whether these strings really were complete words, or whether we are trying to recognise something more complex, e.g. course-unit codes like "COMP30900".
A significant problem will be preventing the tool from over-using an incomplete grammar. For example, if it is told that "for" is the start of a for-statement, it shouldn't try to use the "for" inside "before".
Another problem is that of white space (spaces, tabs, newlines etc.). In some circumstances, such as in typical computer languages like Java and C, white space can usually be discarded, although it may separate words. In other circumstances, such as input to a database, white space may be very important as tabs may be used to separate fields and newlines used to separate records. It may be hard to recognise these fundamentally different situations without significant help from the user.
A simple example is a list of symbols such as:

  abcdabcdddabcabcdd

The first step would be to identify boundaries between repeated groups, i.e.

  abcd abcddd abc abcdd
The next step would be to distinguish between special cases and
generalisations i.e. do we permit only 0 to 3 instances of "d",
or do we permit any number of instances?
Must the list of groups contain exactly 4 instances of "abcd...", or any
number?
Must the instances be in the order "...d ...ddd ... ...dd" or is any order
possible?
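Those questions correspond to candidate regular expressions of increasing generality, all of which match the example text; a Python sketch:

```python
import re

text = "abcdabcdddabcabcdd"   # the example symbol list above

candidates = [
    r"abcdabcdddabcabcdd",    # the literal text: no generalisation at all
    r"(abcd{0,3}){4}",        # 0 to 3 instances of "d", exactly 4 groups
    r"(abcd*)+",              # any number of "d"s, any number of groups
]
for pat in candidates:
    print(pat, bool(re.fullmatch(pat, text)))   # all three print True

# The tool's job is choosing between them: only the most general candidate
# also accepts unseen variations such as "abcddddd".
print(bool(re.fullmatch(r"(abcd*)+", "abcddddd")))   # True
```

Since every candidate fits the example equally well, the choice can only come from interaction with the user (or from more examples) - which is exactly what makes this a derivation project rather than a search problem.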
REFERENCES:
Programming By Example
an example of using lex
COURSE PREREQUISITES: -
EQUIPMENT: -

Creating a grammar for a computer language can be quite hard - often, we have a good idea of the sort of program or script we want to be able to write, but much less idea how to represent that in a grammar. The idea would be to create a system that would allow the user to start from such an example text in a window and then interact with the system to draw the "parse tree", and finally derive a grammar from it.
For the first step, for example, one might tell the system that this part of the text is an expression, the next part is a declaration, the next an if statement, and so on. One could then build up (this is a list of statements) and down (this is a sub-expression; this is a number, this is a variable, this is an operator) to create the complete parse tree.
The final step would be to take these parse trees, take each node in the tree as an example use of a grammar rule, and merge and generalise these to build up a complete grammar. For example, we might have examples of expressions containing several different operators, merge them to allow any of the given operators from the example, and then generalise to include all the operators we might actually want to use. We might even look at using concepts like Unification to help automate the merging.
Other facilities might be to provide suggestions or alternatives to choose from, to help the user when creating the parse trees or when converting the parse trees to a grammar (e.g. to help deal with operator precedence).
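The rule-extraction step can be sketched directly: treat each node of a hand-built parse tree as one example use of a grammar rule, and merge the alternatives seen for each nonterminal. The tree representation here - (label, children) tuples with plain strings as leaves - is a made-up stand-in for whatever the GUI would build:

```python
from collections import defaultdict

def collect_rules(tree, rules):
    """Record each tree node as a grammar rule: its label is the lhs, the
    labels (or leaf tokens) of its children are the rhs."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rules[label].add(rhs)
    for c in children:
        if not isinstance(c, str):
            collect_rules(c, rules)

# A hypothetical hand-built parse tree for the text "1 + x".
tree = ("expr", [("expr", [("number", ["1"])]),
                 "+",
                 ("expr", [("variable", ["x"])])])
rules = defaultdict(set)
collect_rules(tree, rules)
for lhs in sorted(rules):
    print(lhs, "->", " | ".join(" ".join(r) for r in sorted(rules[lhs])))
```

Merging over many trees happens for free here (the sets accumulate alternatives); the genuinely hard step is the *generalisation* that follows - deciding, for instance, that "+" stands for any operator - which is where unification-style techniques could come in.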
REFERENCES: -
COURSE PREREQUISITES: -
EQUIPMENT: -

When one is working with a non-trivial grammar, e.g. for some programming language, it can be very difficult to understand exactly what each part of the grammar is used for. It would be nice to have a tool that could display the grammar and an example program recognised by the grammar, and then be able to select either a part of the program and have the corresponding part of the grammar highlighted, or select a grammar rule and have the corresponding parts of the program text highlighted.
One way of accomplishing this would be to actually run the grammar through a parser-generator such as yacc, and use it to parse the example program, creating a parse-tree as it does so. The tool could then use the parse-tree to decide which parts of the grammar or program text to highlight for the user. (Berkeley yacc can be made to automatically output information about a parse, which could be input by the tool and used to create the parse-tree - see reference below.)
The tool could be enhanced to display more detailed information from the parse tree - for example, if the user selected a statement, it might highlight the individual lexemes - names, numbers, etc. - using different colours to indicate the lexeme type, and then draw nested boxes around groups of lexemes to indicate how they are recognised by the grammar rules to form sub-expressions and larger expressions etc.
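The core bookkeeping for either direction of highlighting is giving every parse-tree node a source span. A Python sketch over a toy token-level tree (the (label, children) tree shape is an assumption for illustration, not yacc's output format):

```python
def add_spans(tree, pos=0):
    """Trees are (label, children); leaves are token strings.
    Returns (label, (start, end), children) with token-index spans, so that
    selecting tokens [i, j) can be mapped to the rules whose spans cover it."""
    label, children = tree
    out, start = [], pos
    for c in children:
        if isinstance(c, str):
            out.append((c, (pos, pos + 1)))   # a leaf covers one token
            pos += 1
        else:
            node = add_spans(c, pos)
            out.append(node)
            pos = node[1][1]                  # resume after the subtree
    return (label, (start, pos), out)

# Hypothetical tree for the statement "n + n ;"
tree = ("stmt", [("expr", ["n", "+", "n"]), ";"])
print(add_spans(tree))
# ('stmt', (0, 4), [('expr', (0, 3), [...]), (';', (3, 4))])
```

With spans attached, both queries become interval lookups: "which rules cover this selection?" and "which text does this rule's node cover?" - and the nested-boxes display falls out of the nesting of the spans.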
REFERENCES: -
COURSE PREREQUISITES: -
EQUIPMENT: -

Recognising text is a common problem, often best solved by writing a grammar and then using a parser-generator (like lex or yacc) to automatically create a recogniser from the grammar. However, writing the grammar to begin with can be very difficult. Perhaps this process could be simplified by a tool that interacted with the user to derive a grammar from an example piece of text.
A complete grammar for a programming language would probably have to be created in two phases. The first would be simply to recognise individual "words" such as numbers, names, operators and punctuation for use with e.g. lex. This would then be used to create a BNF (this project) linking the words into "sentences" such as expressions, statements and declarations for use with e.g. yacc. For this project, therefore, we will have to assume that the first phase has already happened, by breaking the given text up into "words" using some simple algorithm.
Perhaps the user would select an example statement, and then interact with the tool to construct a simple grammar rule for it. The tool could then attempt to match other chunks of text with this rule, and the user could agree or disagree, allowing the tool to refine the grammar rule.
A significant problem will be preventing the tool from over-using an incomplete grammar. For example, if it knows how to recognise an expression consisting of names, numbers and operators, it shouldn't assume that the names in a declaration are also expressions.
Another problem is that of dealing with a hierarchical structure, for example
trying to recognise if-statements and assignments. We can assume that
we have a list of simple symbols, such as:
i(n<n){n=n;i(n)n=n;}
("i" stands for "if", "n" for a name, and so on).
We need to identify "n=...;" as an assignment statement, and then more complex statement forms, such as the if-statement "i(n)" followed by a statement, and the block "{ }" containing one or more statements, and so on.
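To make the target concrete, here is a hand-written recursive-descent recogniser for that symbol language - the kind of structure the tool would be trying to derive from examples. This is a sketch; the grammar it encodes is only what this one example needs:

```python
def parse_stmt(s, i):
    """Return the index just past one statement starting at s[i].
    Statements: assignment 'n=n;', if-statement 'i(cond)stmt',
    block '{stmt+}'."""
    if s[i] == "n":                       # assignment: n = n ;
        assert s[i:i + 4] == "n=n;"
        return i + 4
    if s[i] == "i":                       # if ( cond ) stmt
        assert s[i + 1] == "("
        j = s.index(")", i + 2)           # cond is everything up to ')'
        return parse_stmt(s, j + 1)
    if s[i] == "{":                       # block: { stmt+ }
        i += 1
        while s[i] != "}":
            i = parse_stmt(s, i)
        return i + 1
    raise SyntaxError(f"unexpected {s[i]!r} at {i}")

s = "i(n<n){n=n;i(n)n=n;}"
print(parse_stmt(s, 0) == len(s))   # True: the whole example is one statement
```

Note how each branch corresponds to one grammar rule; the project is essentially about growing this set of branches interactively from example texts instead of writing them by hand.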
REFERENCES:
Programming By Example
an example of using lex + yacc
COURSE PREREQUISITES: -
EQUIPMENT: -

Chart parsing is a very simple, general and flexible parsing technique that, unlike those used in most parsers, is able to deal with any context-free grammar. For example, it can handle ambiguous grammars quite simply. It pays for this generality by being significantly less efficient. Because of its flexibility, it is straightforward to make versions with different parsing strategies, such as top-down or bottom-up. These characteristics would make it ideal for use in teaching, to demonstrate parsing strategies.
Chart parsing is mainly used for recognising natural languages, such as English, and various toolkits exist for this, such as NLTK. However, facilities that are important for natural languages, such as being able to deal with enormous vocabularies, words with more than one meaning, and imprecise grammars, are less useful for dealing with computer languages. This project consists of creating a chart-parser that is suitable for use in teaching, perhaps by modifying NLTK or some similar package, or more likely by designing and implementing it from scratch.
If time permits, there are various possible extensions:
- It would be nice if the tool had a simple GUI, to allow parse trees to be displayed, or to allow a user to "single step" through a parse.
- Another extension might be to adapt the parser to be able to deal with EBNF as well as BNF.
REFERENCES: NLTK
COURSE PREREQUISITES: -
EQUIPMENT: -

Most programmers, at some point in their careers, need to implement a textual part of a user interface, or a textual configuration file, or similar. The implication of this is that most people who need to use grammars (i.e. regular expressions and/or BNF) will not be experts in their use. Unfortunately, learning how to make best use of grammars is nearly as hard as learning how to program for the first time, as it requires a different way of thinking about programming tasks.
One way to start could be to construct some example texts (e.g. programs)
and then try to design grammars that matched them, using a parser-generator
(like lex or yacc) to automatically create a recogniser from the grammar.
However, it is very hard to write and test just part of the grammar - you
usually need an essentially complete grammar to get started!
Instead, it would be nice if one could simply use fragments
of grammar - for example experiment with using if-statements of the form
if ( ... ) ... else ...
where we don't yet want to have to worry about what "..." represents or
about what precedes or follows the if-statement.
(For example, the parser might simply take "..." to mean anything at all
except another "if" or "else".)
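One crude way to experiment with such fragments, before any chart parser exists, is to compile them to regular expressions, taking "..." to mean "anything except another if or else"; a Python sketch (the fragment syntax is an assumption for illustration):

```python
import re

def fragment_to_regex(fragment):
    """Compile a whitespace-separated grammar fragment to a regex.
    '...' becomes a lazy wildcard that refuses to cross 'if' or 'else';
    every other piece is matched literally, with flexible spacing."""
    parts = []
    for piece in fragment.split():
        if piece == "...":
            parts.append(r"(?:(?!\bif\b|\belse\b).)+?")
        else:
            parts.append(re.escape(piece))
    return re.compile(r"\s*".join(parts))

pat = fragment_to_regex("if ( ... ) ... else ...")
m = pat.search("x = 1; if (a < b) a = b; else b = a;")
print(m.group())   # the matched if-statement fragment
```

Regular expressions quickly run out of steam here (they cannot balance nested if-statements, for instance), which is precisely the motivation for the chart-parsing approach described next.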
Chart parsing is a very simple and very general parsing technique that, unlike those used in most parsers, is able to deal with any context-free grammar. For example, it can handle ambiguous or incomplete grammars. It pays for this generality by being significantly less efficient. However, this does not really matter when one is designing and testing grammars, where efficiency is much less important than flexibility e.g. coping with partial matches.
Chart parsing is mainly used for recognising natural languages, such as English, and various toolkits exist for this, such as NLTK. However, facilities that are important for natural languages, such as being able to deal with enormous vocabularies, words with more than one meaning, and imprecise grammars, are less useful for dealing with computer languages. This project consists of creating a chart-parser that is suitable for use as outlined above, perhaps by modifying NLTK or some similar package, or more likely by designing and implementing it from scratch.
REFERENCES: NLTK
COURSE PREREQUISITES: -
EQUIPMENT: -