All my projects are in the general area of creating, manipulating and using grammars.
You are probably familiar with the idea of Graphical User Interfaces (GUIs) and the kinds of tools that can be used to create them (e.g. development environments). However, we often need to build a Textual User Interface (TUI) - for example, to recognise an expression to be drawn by a graph-plotting package, or to input a file of data into a statistical package - and although notations (grammars) and tools (e.g. lex and yacc) exist to help, they are not easy for non-experts to use.
One way of making grammars easier to use might be a tool with a GUI, that helps a user to build an informal picture of the structure of the text, which can then be automatically transformed into a grammar. The user should be able to "drag and drop" boxes corresponding to ideas like "a list of ..." or "a choice between ...", and to nest them one inside the other as required.
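To make the idea concrete, here is a minimal sketch (all class names are invented for illustration) of nested "box" objects mirroring "a list of ..." and "a choice between ...", each able to render itself as an EBNF-like rule:

```python
# Illustrative sketch: nested "box" objects, as a user might assemble
# them by drag and drop, rendered as an EBNF-like grammar fragment.
# Class names (Lit, ListOf, ChoiceOf, Seq) are invented for this sketch.

class Lit:
    def __init__(self, text): self.text = text
    def ebnf(self): return f'"{self.text}"'

class ListOf:                 # "a list of ..." box: zero or more items
    def __init__(self, item): self.item = item
    def ebnf(self): return "{ " + self.item.ebnf() + " }"

class ChoiceOf:               # "a choice between ..." box
    def __init__(self, *alts): self.alts = alts
    def ebnf(self): return "( " + " | ".join(a.ebnf() for a in self.alts) + " )"

class Seq:                    # boxes placed side by side
    def __init__(self, *parts): self.parts = parts
    def ebnf(self): return " ".join(p.ebnf() for p in self.parts)

# A comma-separated list of numbers or names, built from nested boxes:
item = ChoiceOf(Lit("number"), Lit("name"))
rule = Seq(item, ListOf(Seq(Lit(","), item)))
print(rule.ebnf())
# prints: ( "number" | "name" ) { "," ( "number" | "name" ) }
```

The point of the sketch is that nesting boxes maps directly onto nesting grammar operators, so a drawing of the structure can be flattened into a rule mechanically.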
If time allows, the basic concept could be extended in various ways.
A common problem with grammars is that, for some reason, they are not in the
right form for the tools you want to use to process them (e.g. lex and yacc).
At its simplest, this can be a trivial notational problem, that the grammar is
not in exactly the right format (e.g. see BNF
and EBNF notations for grammars).
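As one illustration of a purely notational mismatch, EBNF's repetition braces have no direct BNF equivalent and must be rewritten as recursion. A sketch of that single rewrite (not a full converter; the function and naming scheme are invented for illustration):

```python
# Illustrative sketch: one EBNF repetition "{ X }" rewritten into the
# equivalent pair of recursive BNF rules. A fresh nonterminal gets an
# empty alternative and a right-recursive alternative.

import itertools

counter = itertools.count()   # supplies fresh nonterminal names

def ebnf_repetition_to_bnf(rule_name, repeated_symbol):
    fresh = f"{rule_name}_rep{next(counter)}"
    return [
        (fresh, []),                          # <fresh> ::= (empty)
        (fresh, [repeated_symbol, fresh]),    # <fresh> ::= X <fresh>
    ]

rules = ebnf_repetition_to_bnf("list", "item")
for lhs, rhs in rules:
    print(f"<{lhs}> ::= {' '.join(rhs) if rhs else '(empty)'}")
# <list_rep0> ::= (empty)
# <list_rep0> ::= item <list_rep0>
```

A format-translation tool would apply a small set of such rewrites in each direction, plus surface changes like the choice of `::=` versus `=` and of quoting conventions.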
At its worst, a grammar can be ambiguous, and the problem is to track down
and resolve the ambiguity (e.g. see
ambiguous grammars).
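The classic example is the grammar E -> E "+" E | "a", under which a string like "a+a+a" has two distinct parse trees, (a+a)+a and a+(a+a). A small brute-force sketch that counts the trees:

```python
# Demonstrating ambiguity by counting parse trees for the grammar
#   E -> E "+" E | "a"
# Every "+" in the string is a possible top-level split, so the count
# is the product of the counts on each side, summed over all splits.

from functools import lru_cache

@lru_cache(maxsize=None)
def parse_trees(s):
    count = 1 if s == "a" else 0
    for i, ch in enumerate(s):
        if ch == "+":
            count += parse_trees(s[:i]) * parse_trees(s[i + 1:])
    return count

print(parse_trees("a+a"))    # 1 - only one tree
print(parse_trees("a+a+a"))  # 2 - (a+a)+a and a+(a+a), so the grammar is ambiguous
```

Any string with more than one tree proves the grammar ambiguous; a tool could use exactly this kind of search to exhibit a witness string to the user.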
However, the most common problem is that the grammar is too complicated for
the particular tool you are trying to use to process it (e.g. see
how to confuse parsers).
Unfortunately, there is no straightforward way to distinguish between the
previous two problems - all tools like yacc can do is report a problem and
leave it to the user to sort out (e.g. see
confusing yacc).
To solve the first problem, we just need a tool that can read and write
grammars, using simple dialogs to describe the exact formats of the input and
output.
We can then improve the usefulness of the tool by making it manipulate the grammar
to remove ambiguities or otherwise simplify it. It will be necessary for the
user and the tool to cooperate - the tool providing information about possible
modifications, the user selecting particular rules to be changed and the
changes to make, and the tool performing the transformations and reporting the
results. One important feature will be the ability to backtrack out of one or
more unsuccessful transformations and try another line of attack.
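One way to support that backtracking is for the tool to keep a stack of grammar snapshots, pushing a copy before each transformation. A sketch under that assumption (the class and the example transformation, removal of immediate left recursion, are chosen for illustration):

```python
# Sketch of user/tool cooperation with backtracking: every applied
# transformation first saves a snapshot, so one or more unsuccessful
# transformations can be undone. Names are invented for this sketch.

import copy

class GrammarSession:
    def __init__(self, grammar):
        self.grammar = grammar        # nonterminal -> list of alternatives
        self.history = []             # snapshots, for backtracking

    def apply(self, transform):
        self.history.append(copy.deepcopy(self.grammar))
        transform(self.grammar)

    def backtrack(self):
        self.grammar = self.history.pop()

def remove_left_recursion(grammar):
    # E -> E + T | T   becomes   E -> T E' ;  E' -> + T E' | (empty)
    for lhs in list(grammar):
        rec = [alt[1:] for alt in grammar[lhs] if alt and alt[0] == lhs]
        base = [alt for alt in grammar[lhs] if not alt or alt[0] != lhs]
        if rec:
            tail = lhs + "'"
            grammar[lhs] = [alt + [tail] for alt in base]
            grammar[tail] = [alt + [tail] for alt in rec] + [[]]

g = GrammarSession({"E": [["E", "+", "T"], ["T"]]})
g.apply(remove_left_recursion)   # tool performs the transformation
print(g.grammar)                 # user inspects the result...
g.backtrack()                    # ...and can reject it and try again
print(g.grammar)
```

A real tool would record which rule was changed and why along with each snapshot, so the history doubles as a report of the results of each step.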
If time allows, the basic concept could be extended in various ways.
Recognising text is a common problem, often best solved by writing a grammar and then using a parser-generator (like lex or yacc) to automatically create a recogniser from the grammar. However, writing the grammar to begin with can be very difficult. Perhaps this process could be simplified by a tool that interacted with the user to derive a grammar from an example piece of text.
The complete grammar would probably have to be created in two phases, and so there are two independent projects on offer:
The tool might recognise typical strings, such as numbers consisting of digit characters "0" to "9", and perhaps also using "+" and "-" and ".". The tool might then interact with the user to decide whether these strings really were complete words, or whether we are trying to recognise something more complex e.g. course-unit codes like "CS3900".
A significant problem will be preventing the tool from over-using an incomplete grammar. For example, if it is told that "for" is the start of a for-statement, it shouldn't try to use the "for" inside "before".
Another problem is that of white space (spaces, tabs, newlines etc.). In some circumstances, such as in typical computer languages like Java and C, white space can usually be discarded, although it may separate words. In other circumstances, such as in a database, white space may be very important, as tabs may be used to separate fields and newlines to separate records. It may be hard to recognise these fundamentally different situations without significant help from the user.
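The "for inside before" problem disappears if the tokenizer always takes the longest match at each position, so that "before" is consumed whole as one word. A sketch (the token classes and keyword set are invented for illustration):

```python
# Sketch of longest-match tokenizing: "before" is consumed as a single
# NAME, so the "for" buried inside it is never seen, while a standalone
# "for" is recognised as a keyword. White space is skipped but still
# separates words, as in languages like Java and C.

import re

TOKEN = re.compile(r"(?P<NAME>[A-Za-z_]\w*)|(?P<NUM>\d+)|(?P<SKIP>\s+)")
KEYWORDS = {"for", "if", "while"}   # invented example set

def tokenize(text):
    tokens = []
    for m in TOKEN.finditer(text):
        if m.lastgroup == "SKIP":
            continue                # white space discarded here
        word = m.group()
        kind = "KEYWORD" if word in KEYWORDS else m.lastgroup
        tokens.append((kind, word))
    return tokens

print(tokenize("for x in before"))
# [('KEYWORD', 'for'), ('NAME', 'x'), ('NAME', 'in'), ('NAME', 'before')]
```

For the database-style situation the SKIP rule would instead have to emit tabs and newlines as tokens in their own right, which is exactly the distinction the tool would need the user's help to make.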
A simple example is a list of symbols such as:
abcdabcdddabcabcdd
The first step would be to identify boundaries between repeated groups i.e.
abcd abcddd abc abcdd
The next step would be to distinguish between special cases and
generalisations i.e. do we permit only 0 to 3 instances of "d",
or do we permit any number of instances?
Must the list of groups contain exactly 4 instances of "abcd...", or any
number?
Must the instances be in the order "...d ...ddd ... ...dd" or is any order
possible?
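Each of those questions corresponds to a different candidate grammar, and all the candidates agree on the example while disagreeing on other inputs. A small sketch using regular expressions as stand-ins for the grammars (the candidate set is chosen for illustration):

```python
# Special case versus generalisation: several candidate patterns all
# accept the worked example "abcd abcddd abc abcdd", but they differ
# on other strings - which is exactly what the tool must ask the user
# about.

import re

example = "abcd abcddd abc abcdd"

candidates = {
    "exactly the example":            r"abcd abcddd abc abcdd",
    "0 to 3 'd's, any number/order":  r"abcd{0,3}( abcd{0,3})*",
    "any number of 'd's, any order":  r"abcd*( abcd*)*",
}

for description, pattern in candidates.items():
    assert re.fullmatch(pattern, example), description  # all fit the example

# They disagree on a group with four "d"s:
print(bool(re.fullmatch(r"abcd{0,3}( abcd{0,3})*", "abcdddd")))  # False
print(bool(re.fullmatch(r"abcd*( abcd*)*", "abcdddd")))          # True
```

Presenting the user with a string like "abcdddd" and asking "should this be accepted?" is one way the tool could choose between the special case and the generalisation.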
Perhaps the user would select an example statement, and then interact with the tool to construct a simple grammar rule for it. The tool could then attempt to match other chunks of text with this rule, and the user could agree or disagree, allowing the tool to refine the grammar rule.
A significant problem will be preventing the tool from over-using an incomplete grammar. For example, if it knows how to recognise an expression consisting of names, numbers and operators, it shouldn't assume that the names in a declaration are also expressions.
Another problem is that of dealing with a hierarchical structure, for example
trying to recognise if-statements and assignments. We can assume that
we have a list of simple symbols, such as:
i(n<n){n=n;i(n)n=n;}
("i" stands for "if", "n" for a name, and so on).
We need to identify n=...; as an assignment statement, and then
more complex statement forms, such as the if-statement:
i(n)
followed by a statement, and the block:
{ }
containing one or more statements, and so on.
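That hierarchy can be captured by mutually recursive rules, one per statement form. A minimal recursive-descent sketch over the one-character symbols of the example ("i" for "if", "n" for a name, and so on):

```python
# Minimal recursive-descent recogniser for the hierarchy above:
#   statement  ->  "n" "=" expression ";"          (assignment)
#               |  "i" "(" expression ")" statement (if-statement)
#               |  "{" statement... "}"             (block)
#   expression ->  "n" [ "<" "n" ]
# The grammar rules here are inferred from the worked example only.

def parse_program(text):
    pos = 0

    def expect(ch):
        nonlocal pos
        if pos < len(text) and text[pos] == ch:
            pos += 1
        else:
            raise SyntaxError(f"expected {ch!r} at position {pos}")

    def expression():
        nonlocal pos
        expect("n")
        if pos < len(text) and text[pos] == "<":
            pos += 1
            expect("n")

    def statement():
        nonlocal pos
        if pos < len(text) and text[pos] == "i":     # if-statement
            pos += 1
            expect("(")
            expression()
            expect(")")
            statement()
        elif pos < len(text) and text[pos] == "{":   # block
            pos += 1
            while pos < len(text) and text[pos] != "}":
                statement()
            expect("}")
        else:                                        # assignment
            expect("n")
            expect("=")
            expression()
            expect(";")

    statement()
    return pos == len(text)

print(parse_program("i(n<n){n=n;i(n)n=n;}"))  # True
```

The nesting in the code mirrors the nesting in the text: the if-statement rule calls the statement rule for its body, and the block rule calls it in a loop, which is exactly the hierarchical structure the tool would have to discover.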
More information: