MSc Projects 2003/4

All my projects are in the general area of creating, manipulating and using grammars.

PJJ.1 Designing Grammars (ACS,CS)
PJJ.2 Manipulating Grammars (ACS,CS)
PJJ.3 & PJJ.4 Deriving grammars from example texts (ACS,CS)

PJJ.1 Designing Grammars (ACS,CS)

You are probably familiar with the idea of Graphical User Interfaces (GUIs) and the kinds of tools that can be used to create them (e.g. development environments). However, we often need to build a Textual User Interface (TUI) - for example, to recognise an expression to be drawn by a graph-plotting package, or to input a file of data into a statistical package - and although notations (grammars) and tools (e.g. lex and yacc) exist to help, they are not easy for non-experts to use.

One way of making grammars easier to use might be a tool with a GUI, that helps a user to build an informal picture of the structure of the text, which can then be automatically transformed into a grammar. The user should be able to "drag and drop" boxes corresponding to ideas like "a list of ..." or "a choice between ...", and to be able to nest them one inside the other as required. e.g.

a number is

a list of one or more things,
where each thing is
a digit character

an identifier is
a letter character f
o
l
l
o
w
e
d

b
y

a list of none or more things,
where each thing is

a choice between
a letter character
or
a digit character

It would be important for the user to be able to experiment with the structure, e.g. by changing one "thing" into "a list of things" or back again, or by adding or removing options from a set of choices, or by moving boxes from one rule to another. The program should also be able to output the grammar e.g. as lex and yacc programs, so the design can be tested.

If time allows, the basic concept could be extended in various ways e.g.:

Helping the user to add annotation to guide the translation into a grammar.
Helping the user to distinguish between parts best dealt with by e.g. lex (words, names, numbers etc.) and those best dealt with by e.g. yacc (sentences, statements, expressions, declarations etc.).
Helping the user to understand any errors in the grammar reported by e.g. lex and yacc.
Helping the user to add actions to the generated lex and yacc programs.

More information:

Notations for grammars
Examples of english descriptions, pictures, and grammars.
Typical error messages from lex and yacc
More details about errors reported by yacc

PJJ.2 Manipulating Grammars (ACS,CS)

A common problem with grammars is that, for some reason, they are not in the right form for the tools you want to use to process them (e.g. lex and yacc).
At its simplest, this can be a trivial notational problem, that the grammar is not in exactly the right format (e.g. see BNF and EBNF notations for grammars).
At its worst, a grammar can be ambiguous, and the problem is to track down and resolve the ambiguity (e.g. see ambiguous grammars).
However, the most common problem is that the grammar is too complicated for the particular tool you are trying to use to process it (e.g. see how to confuse parsers).
Unfortunately, there is no straighforward way to distinguish between the previous two problems - all tools like yacc can do is report a problem and leave it to the user to sort out (e.g. see confusing yacc).

To solve the first problem, we just need a tool that can read and write grammars, using simple dialogs to describe the exact formats of the input and output.
We can then improve the usefulness of tool by making it manipulate the grammar to remove ambiguities or otherwise simplify it. It will be necessary for the user and the tool to cooperate - the tool providing information about possible modifications, the user selecting particular rules to be changed and the changes to make, and the tool performing the transformations and reporting the results. One important feature will be the ability to backtrack out of one or more unsuccessful transformations and try another line of attack.

If time allows, the basic concept could be extended in various ways e.g.

Discovering new grammar transformations.
(My exisiting list is simply those that I happen to have used in the examples, rather than any systematic exploration of possibilities.)
By recording successful and unsuccessful series of transformations, to identify useful combinations of basic transformations and add them to the options available to the user.
Adding the ability to output the grammar to e.g. yacc after each set of changes to see if they have succeeded in solving the problem, or to use any error messages to guide further changes.
Trying to predict the most appropriate transformations to use, or the rules most in need of transforming, or other similar hints and help.

More information:

PJJ.3 & PJJ.4 Deriving grammars from example texts (ACS,CS)

Recognising text is a common problem, often best solved by writing a grammar and then using a parser-generator (like lex or yacc) to automatically create a recogniser from the grammar. However, writing the grammar to begin with can be very difficult. Perhaps this process could be simplified by a tool that interacted with the user to derive a grammar from an example piece of text.

The complete grammar would probably have to be created in two phases, and so there are two independant projects on offer:

PJJ.3 Deriving regular expressions from example texts

discovering words - numbers, names, operators etc. (e.g. for lex).

The tool might recognise typical strings, such as numbers consisting of digit characters "0" to "9", and perhaps also using "+" and "-" and ".". The tool might then interact with the user to decide whether these strings really were complete words, or whether we are trying to recognise something more complex e.g. course-unit codes like "CS3900".

A significant problem will be preventing the tool from over-using an incomplete grammar. For example, if it is told that "for" is the start of an for-statement, it shouldn't try to use the "for" inside "before".

Another problem is that of white space (spaces, tabs, newlines etc.). In some circumstances, such as in typical computer languages like Java anc C, white space can usually be discarded, although it may separate words. In other circumstances, such as in a database, white space may be very important as tabs may be used to separate fields and newlines to separate records. It may be hard to recognise these fundamentally different situations without significant help from the user.

A simple example is a list of symbols such as: abcdabcdddabcabcd. The first step would be to identify boundaries between repeated groups i.e. abcd abcddd abc abcdd. The next step would be to distinguish between special cases and generalisations i.e. do we permit only 0 to 3 instances of "d", or do we permit any number of instances? Must the list of groups contain exactly 4 instances of "abcd...", or any number? Must the instances be in the order "...d ...ddd ... ...dd" or is any order possible?

PJJ.4 Deriving BNF from example texts

discovering sentences - expressions, statements, declarations etc. (e.g. for yacc).

Perhaps the user would select an example statement, and then interact with the tool to construct a simple grammar rule for it. The tool could then attempt to match other chunks of text with this rule, and the user could agree or disagree, allowing the tool to refine the grammar rule.

A significant problem will be preventing the tool from over-using an incomplete grammar. For example, if it knows how to recognise an expression consisting of names, numbers and operators, it shouldn't assume that the names in a declaration are also expressions.

Another problem is that of dealing with a hierarchical structure, for example trying to recognise if-statements and assignments. We can assume that we have a list of simple symbols, such as: i(n<n){n=n;i(n)n=n;} ("i" stands for "if", "n" for a name, and so on). We need to indentify n=...; as an assignment statement, and then more complex statement forms, such as the if-statement: i(n) followed by a statement, and the block: { } containing one or more statements, and so on.

More information: