CS2112: Exercises + Answers from $3 Lex

* What do the following Lex grammar rules mean:
a) [^'] b) ^' c) [^'].* d) '[^']' e) '[^']+' f) "[^']+"

(I am using ` characters to quote any actual characters I refer to.)

a) `[` and `]` are metacharacters, as is `^` at the start of a `[ ]` set, so this means any one character except `'` (including `\n`).

b) The metacharacter `^` has a totally different meaning outside a `[ ]` set, so this means a `'` at the start of a line.

c) `.` and `*` are also metacharacters. This means any one character except `'` (including `\n`), followed by zero or more of any characters except `\n` (i.e. followed by nothing, or by any characters up to but not including the end of line).

d) `'` is not a metacharacter. This means a `'`, then any one character except `'` (including `\n`), and then another `'`. (This could be used to recognise some character literals in C, although it would not cope with e.g. those that start with a `\`. Some languages use `''''` to mean a single `'`).

e) A `'`, then one or more characters (excluding `'`), and then another `'`. (This could be used to recognise a quoted string in Pascal.) e.g. `'hello'` but not `'don't'`. This avoids picking up `'hello' and 'goodbye'` as a single string, instead of 2 strings with a word between them.

f) `"` is a metacharacter, that stops most other metacharacters from having a special meaning, so this means the actual characters `[^']+`.
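
A quick way to check what one of these patterns actually matches is to drop it into a throw-away test scanner and feed it some text. For example, for pattern e) (this is my own sketch, not part of the exercise; the angle brackets in the message are arbitrary):

 %%
 '[^']+'	{printf("<%s>\n", yytext);	/* pattern e) */}
 .|\n		{ECHO;	/* pass everything else through unchanged */}

Given the input `'hello' and 'goodbye'`, this prints something like `<'hello'>`, then ` and `, then `<'goodbye'>`, confirming that the two quoted strings are recognised separately.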

* Write Lex grammar rules to recognise literal numbers as defined in C: e.g. 1234, 1234L, 1234U, 1234UL, 123.4, 1234E-1, 12.34E+1, 123F, 123.4L, 12E+2F, 12E-2L, 0X1ABC, 0X1ABCU, 0X1ABCL, 0X1ABCUL (any of the alphabetic characters can be lower case as well).

There are many possible ways of doing this. Here is one:

 [0-9]+(L|U|UL|l|u|ul|Ul|uL)?
 [0-9]+(\.[0-9]+)?([Ee][-+]?[0-9]+)?[LFlf]?
 0[xX][0-9A-Fa-f]+(L|U|UL|l|u|ul|Ul|uL)?

It would be perfectly acceptable to give each of the possibilities a separate rule, rather than combine them using | and ? etc. as I have. Note that a simple integer (e.g. 1234) is recognised by both the first and second rule, but the precedence rule for Lex means that the first rule will win.
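
For completeness, these rules could themselves be tried out in a minimal throw-away scanner along the following lines (my own sketch, not part of the exercise; the messages are arbitrary, and you would link against the Lex library, e.g. `cc lex.yy.c -ll`, to get a default main):

 %%
 [0-9]+(L|U|UL|l|u|ul|Ul|uL)?			{printf("integer: %s\n", yytext);}
 [0-9]+(\.[0-9]+)?([Ee][-+]?[0-9]+)?[LFlf]?	{printf("real: %s\n", yytext);}
 0[xX][0-9A-Fa-f]+(L|U|UL|l|u|ul|Ul|uL)?	{printf("hex: %s\n", yytext);}
 [ \t\n]+					{/* skip white space */}
 .						{printf("unexpected: %s\n", yytext);}

Note that something like `0X1ABCUL` is still recognised correctly even though the first rule matches its leading `0`, because Lex prefers the rule that matches the most input.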

* Write Lex grammar rules to recognise characters as defined in C: e.g. 'a', '\n', '\027', '\0', '\'', '\x0f', '\\'

 '[^']'
 '\\[abfnrtv\\'"?]'
 '\\0[0-7]*'
 '\\x[0-9a-fA-F]+'

I forgot to mention another class of special characters in C, the trigraphs. These are used to supply some characters missing in international versions of the ASCII character set, and are of the form:
"??"[()<>/!'=-]
and mean the characters `[]{}\|^#~` respectively.
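
If we also wanted the action to translate a trigraph into the character it stands for, one possibility (my own sketch, not part of the original answer) is to use the position of the third character of the trigraph to index a translation table:

 %{
 #include <stdio.h>
 #include <string.h>
 static const char from[]= "()<>/!'=-";
 static const char to[]  = "[]{}\\|^#~";
 %}
 %%
 "??"[()<>/!'=-]	{putchar(to[strchr(from, yytext[2]) - from]);}
 .|\n			{ECHO;}

Here `strchr` finds where the third character of the trigraph (`yytext[2]`) sits in `from`, and the character at the same position in `to` is output instead.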

+ How would you recognise strings as defined in C?

An approximation is:

 \"[^"]+\"
but this only corresponds to `'[^']'` above, and misses out all the other special cases. In practice, we would have to recognise just the leading `"` character in the grammar rule, and then have a chunk of C code in the action that calls Lex's input routine to obtain characters until it finds a matching `"` (in which case it stops and returns control to Lex) or a `\` or `??` (in which case it picks up the special character sequences listed above).
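
A sketch of what that might look like (my own illustration, ignoring trigraphs and error handling; `input` is the same Lex routine used in the comment-stripping example further down):

 \"	{int c;
 	 while ((c= input())!=EOF && c!='"')
 	  if (c=='\\')
 	   input();	/* discard the character following a \ escape */
 	}

In a real scanner the characters read here would of course be saved into a buffer rather than simply discarded.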

* Do the programming languages you know allow you to use both upper and lower case letters - are "Fred", "fred" & "FRED" legal, and if so are they the same identifier or different ones? How about keywords - are "IF", "if" & "If" the same? Do you prefer "IF fred" or "if FRED" or some other combination (aim for maximum clarity, not ease of typing)?

For both C and SML, keywords must be lower case, identifiers can be any case, and the different cases mean different identifiers.
Personally, I used to write Pascal programs like:

 IF i < j THEN one(two) ELSE bigword(smallword);
I miss being able to make keywords stand out like this. On the other hand, I like being able to use `_` in identifiers, like `big_word`. I don't like the C convention of using all upper-case for `#define`d names (I don't like having to edit the whole program if I turn a constant into a variable or vice versa), but I understand why it is necessary (the semantics of the two kinds of identifier are so different with respect to e.g. scope rules, or how to use `;`).

* Write Lex grammar rules to recognise identifiers and (some) keywords for C and for SML.

Because of Lex's precedence rules, special cases must come first. e.g. for C:

 if
 else
 for
 . . .
 [a-zA-Z_][0-9a-zA-Z_]*

Note that some identifiers are reserved to the system e.g. the names of the standard library functions etc., and more generally, those starting with an underscore (underline).

for SML:

 if
 then
 else
 . . .
 [!%&$#+/:<=>?@\~'^|*-]+
 [a-zA-Z'][0-9a-zA-Z_']*

* Using Lex or grep ($1.2.3), if we want to recognise one of a set of characters using square brackets, the order of the characters in the set does not normally matter. However, if we want to include a "]" it has to come first in the set, and a "-" character has to come first or last. Why?

As `]` and `-` have special meanings after a `[`, the only places that unambiguously give them their ordinary meaning (i.e. as the character itself) are as described. `^` also has a special meaning, but only immediately after a `[`, so it has its ordinary meaning in any other position.
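
So, for example, a rule to recognise any one of the three characters `]`, `a` and `-` could be written like this (my own illustration):

 []a-]	{printf("] or a or -\n");}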

* In the first example Lex program ($3.1), the final grammar rule is used to recognise unexpected characters. What would be the effect of using the dot metacharacter instead? i.e.:

 .	{ECHO; printf(" unexpected\n");}
or:
 .+	{ECHO; printf(" unexpected\n");}
Be careful to take into account the way that Lex decides the relative priority of the grammar rules.

The first replacement would output the warning message after each unexpected character, whereas the original only outputs it after each continuous sequence of such characters.
The second replacement tries to remedy this, but unfortunately `.+` will recognise the whole of an input line, except for the end of line itself. Because Lex prefers the rule that matches the most input, each input line will be reported as unexpected in its entirety, unless the whole line happens to match one of the other grammar rules (e.g. the line only contains a number), in which case that rule wins because it comes earlier.

* Write Lex grammar rules to recognise various comment conventions:
a) from "#" to the end of line


#.*$
or
#.*\n
The second version recognises the end of line as well; the first version recognises everything up to but not including the end of line. Actually, the first version could equally well be
#.*
as the metacharacter `.` means any character except end of line itself.

b) from "#" to "#" (including multiple lines)

The obvious rule is wrong:
#.*#
would mean any string of characters other than newline between a pair of `#`s, i.e. the string between them could itself contain a `#` (so two comments plus the text between them might be taken as one), and it couldn't cope with multiple lines, so we have to use:
#[^#]*#

c) from "/*" to "*/" (including multiple lines)
d) from "/*" to "*/" (including multiple lines and nested comments)
You should be careful to avoid taking two short comments plus the text between them as one long comment.
+ If you do this, c) and d) are hard, so don't spend too much time on them if you get stuck - you will either need to use some C code involving "input" and "unput" in an action or else use start states, which you can read about in the references.
I don't expect you to know this for the exam, but you may need to know these sorts of tricks exist some day.

By analogy with b), for c) we would like to write something like
"/*"[^/*]*"*/"
(note we have to quote the metacharacter `*` to make it mean the character itself) but this does not mean what we want - it would reject a comment containing a `/` or a `*` by itself. Also, if we recognise a long comment as a single entity we are likely to overflow whatever internal buffers Lex uses!

Using start states (c.f. Levine, Mason & Brown pp.171-3).

When we find the start of a comment, we go into a special state (declared to Lex in the %s line) that discards all characters up to the end of the comment, where we go back to the normal Lex state.

 %s C
 %%
 "/*"		{BEGIN C; /* enter comment-eating state */}
 <C>"*/"	{BEGIN 0; /* return to normal state */}
 <C>.		{/* discard next character in comment */}
 <C>\n 	{/* discard end of line in comment */}

Using input and unput (c.f. Levine, Mason & Brown pp.158).

When we find the start of a comment, we consume the characters that follow it directly in the action, up to the end of the comment. To get the input characters we use the same function that Lex does (input). If we coded the loop differently, we might need to give characters back to Lex, using unput.

 "/*"	{int c1, c2;
 	 for (c1= 0; (c2= input())!=EOF && (c1!='*'||c2!='/'); c1= c2)
 	  ; /* do nothing */
 	}

To do part d, . . . ?
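
One possible answer (my own sketch, not part of the original answer) is to add a depth counter to the start-state version above, so that a `/*` seen inside a comment just increases the depth and a `*/` only returns to the normal state when the outermost comment has been closed:

 %s C
 %{
  static int depth= 0;
 %}
 %%
 "/*"		{depth++; BEGIN C;	/* one more level of comment */}
 <C>"*/"	{if (--depth==0) BEGIN 0;	/* leave only at the outermost level */}
 <C>.		{/* discard next character in comment */}
 <C>\n		{/* discard end of line in comment */}

Because `%s` states are inclusive, the plain `"/*"` rule still fires inside a comment, and that is exactly what increments the depth when a nested comment starts.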

+ Try to use Lex to write an infix calculator that can deal with precedence and associativity. You will need to replace the if statement in the code in $3.6 by a while loop, that acts while the operator on the stack is of higher (or, if left associative, of equal) precedence than the next operator in the input.
Now add brackets "(" and ")".

If you really want to attempt this, come and talk to me about it. This problem is interesting, but I no longer have the patience to do anything quite so silly myself.