Next: YACC Up: CS2121: The Implementation and Previous: Representation and Meaning

Subsections

LEX

Postfix calculator in LEX

($CS2121/e*/postfix/*)

 %{ /* C declarations used in actions */
 #define stack_size 100
 static int sp, stack [stack_size];
 static void push (int i) {
   if (++sp<stack_size) stack[sp]= i;
   else {printf ("error: stack overflow\n"); exit(1);}
 }
 static int pop (void) {
   if (sp>=0) return stack[sp--];
   else {printf ("error: stack underflow\n"); exit(1);}
 }
 %}

 /*lex definitions - not used in this example*/

 %%

 /* descriptions of       corresponding actions
    expected inputs       (C statements or blocks) */

 [0-9]+                   {push (atoi(yytext));}
 "+"                      {push (pop() + pop());}
 "*"                      {push (pop() * pop());}
 -                        {int rhs= pop(); push (pop() - rhs);}
 "/"                      {int rhs= pop(); push (pop() / rhs);}
 ;                        {printf ("%d\n", pop());}
 [ \t\n]                  ;
 [^-0-9+*/; \t\n]+        {ECHO; printf (" unexpected\n");}

 %%                        /* C code */

 int main (void) {sp= -1; yylex(); return 0;}

 static int yywrap (void) {return 1;}

Descriptions of expected inputs

characters	those characters (special characters must be in " ")
or	`while` or `"while"`	(the keyword) while
`"`characters`"`	`"+"`	(the operator) +
`\`character	`\n \t \\`	newline tab $\backslash$ etc. as for C
	otherwise, $\backslash$ character is an alternative to "character"
	`\+`	+
`^` and `$`	the start and the end of line respectively
	`^aardvark$`	a line just containing `aardvark`
`[`characters`]`	any one character in the set of characters
	`[AEIOUaeiou]`	any one vowel
	`[A-Za-z]`	any one letter
`[^`characters`]`	any one character not in the set of characters
	`[^\n]`	any character except a newline
`.`	any one character except a newline
`+`	1 or more times
`*`	0 or 1 or more times
`{`n`,`m`}`	n to m times
	`[0-9]+`	an unsigned integer
	`[A-Z][a-z]*`	a capitalised name
	`[A-Za-z][A-Za-z0-9_]*`	a C identifier
	`a{2,3}`	aa or aaa
`\|`	separates alternatives
	`WHILE\|while`	either WHILE or while
`(`any sub-pattern`)`	that sub-pattern
	`([A-Z][A-Z])+`	an even number of upper-case letters
pattern`?`	optional (but only the immediately preceding pattern!)
	`ab?c`	ac or abc
`/`pattern	only if followed by the pattern
	`ab/c`	ab, but only if followed by c
	`ab/\n`	the same as `ab$`

Rule Priority: the earliest rule matching the longest possible string.

grammar rules from $\S$ 2.1

-, ; the character itself

"+", "*", "/" special characters have to be quoted to just recognise them

[ \t\n] white space (ignored)

[0-9]+ one or more numeric characters (a number)

[^-0-9+*/; \t\n]+ any other characters

LEX definitions

Not normally needed, but e.g.

 ALPHANUMERIC   [A-Za-z0-9]
 ALPHABETIC     [A-Za-z]
 %%
 {ALPHABETIC}{ALPHANUMERIC}*

recognises identifiers

Actions, C declarations and code

yytext contains the actual characters recognised from input

yyleng the number of characters in yytext

ECHO = printf ("%s", yytext), the default action

yylex routine created by LEX from (expected input, action) lists.

yywrap called at EOF; (boolean) result = terminate or not

LEX deals with EOF (e.g. control-D) as if with the following rule:

EOF if (yywrap( )) return 0;

e.g. use in file inclusion, to return to the original file at the end of the included file.

How LEX is used

FLEX : infix.l $\rightarrow$ infix.c
GCC : infix.c $\rightarrow$ infix
infix : equations $\rightarrow$ answers

LEX has other facilities, but those described above are the most important. Refer to the LEX manual in the departmental library or online if you need to.

Epilogue

We met regular expressions before, with EGREP, and here they are again in LEX. This isn't laziness or lack of imagination on the part of the designers of these programs - regular expressions are well suited to the job we are asking them to perform, finding simple patterns (such as words or numbers) in text.

Furthermore, although there are more facilites in LEX's version of regular expressions than in EGREP's, including some we haven't looked at such as start states, it turns out that these do not help us to recognise more complex patterns, but just help us to describe the same patterns in simpler ways or write the C actions more easily. In fact, you will see in the other half of this course that regular expressions are already as complicated as we can make them without having to fundamentally change (i.e. slow down) the way in which programs like EGREP and LEX work.

However, we do need to describe and search for more complex patterns, even though we have reached a fundamental limit with regular expressions. We can see these limits if we try to use LEX to deal with some other forms of expressions:

Infix expressions
In the postfix calculator, we are relying on the fact that every operator is at the end of its expression, which takes the form:
sub-expression sub-expression operator
Because of this, we can stack values from the sub-expressions as we read them, and trigger the final calculation when we finally see the operator.
If we tried to write an infix calculator, recognising expressions of the form:
sub-expression operator sub-expression
we would have to stack both values and operators, and count sub-expressions so that we trigger the final calculation at the end of the second sub-expression. Worse, if we wanted to treat expressions like 1 + 2 * 3 correctly, and multiply the 2 and the 3 before adding the 1, things would be even more complicated.
All this is doable if we expend sufficient ingenuity, but quite clearly the pattern-matching abilities of LEX are no longer sufficient to assist in this, and we would be writing more and more complex C actions.
What we need to be able to do is describe the pattern for an expression as consisting (recursively) of an expression, an operator, and a second expression, with some corresponding action to perform the calculation once the complete pattern has been recognised. The recogniser will need a built-in stack to enable it to keep track of the various sub-expressions as it processes the nested patterns.
Overloading
It is impossible for LEX to distinguish the different uses of - in:
```
        - 1                       (signed literal)
        - B                       (negation)
        A - 1  or  A - B          (subtraction)
```
One consequence of this is that most languages do not allow signed literals in expressions but use negation and unsigned literals instead. However, even this is beyond LEX when used by itself, as - would have different numbers of operands, unless (as in SML) we use e.g. ~ for negation.
We would have similar problems with any other overloaded operator.

As regular expressions have reached their limit, we are going to have to use a new notation (BNF) to describe these sorts of patterns, and a new tool (YACC) to deal with the BNF.

Exercises

( $\dag$ indicates harder problems)

What do the following LEX grammar rules mean:
(a) [^'] (b) ^' (c) [^'].* (d) '[^']' (e) '[^']+' (f) "[^']+"
Write LEX grammar rules to recognise literal numbers as defined in C: e.g. 1234, 1234L, 1234U, 1234UL, 123.4, 1234E-1, 12.34E+1, 123F, 123.4L, 12E+2F, 12E-2L, 0X1ABC, 0X1ABCU, 0X1ABCL, 0X1ABCUL (any of the alphabetic characters can be lower case as well).
Write LEX grammar rules to recognise characters as defined in C: e.g. 'a', '\n', '\027', '\0', '\'', '\x0f', '\\'
$\dag$ How would you recognise strings as defined in C?
Do the programming languages you know allow you to use both upper and lower case letters - are Fred, fred and FRED legal, and if so are they the same identifier or different ones? How about keywords - are IF, if and If the same? Do you prefer IF fred or if FRED or some other combination (aim for maximum clarity, not ease of typing)?
Write LEX grammar rules to recognise identifiers and (some) keywords for C and for SML.
Using LEX or GREP ( $\S$ 1.3.1), if we want to recognise one of a set of characters using square brackets, the order of the characters in the set does not normally matter. However, if we want to include a ] it has to come first in the set, and a - character has to come first or last. Why?
In the postfix calculator ( $\S$ 2.1) the final grammar rule is used to recognise unexpected characters. What would be the effect of using the dot metacharacter instead? i.e.:
. {ECHO; printf(" unexpected\n");}
or:
.+ {ECHO; printf(" unexpected\n");}
Be careful to take into account the way that LEX decides the relative priority of the grammar rules.
Write LEX grammar rules to recognise various comment conventions:
1. from # to the end of line
2. from # to # (including multiple lines)
3. $\dag$ from /* to */ (including multiple lines)
4. $\dag$ from /* to */ (including multiple lines and nested comments)
You should be careful to avoid taking two short comments plus the text between them as one long comment. If you do this the last two are hard, so do not spend too much time on them if you get stuck - you will either need to use some C code involving input and unput in an action or else use start states, which you can find out about via the readings ( $\S$ 2.8).
$\dag$ Use LEX to write an infix calculator. You should handle brackets ( and ).

Readings

Louden: ch. 4.1
Lesk & Schmidt
Levine, Mason & Brown: chs. 1, 2, 6
Capon & Jinks: ch. 7.2
Aho, Sethi & Ullman: ch. 3.5

Next: YACC Up: CS2121: The Implementation and Previous: Representation and Meaning

Pete Jinks
2004-10-26

`-`, `;`	the character itself
`"+"`, `"*"`, `"/"`	special characters have to be quoted to just recognise them
`[ \t\n]`	white space (ignored)
`[0-9]+`	one or more numeric characters (a number)
`[^-0-9+*/; \t\n]+`	any other characters

`yytext`	contains the actual characters recognised from input
`yyleng`	the number of characters in yytext
`ECHO`	= `printf ("%s", yytext)`, the default action
`yylex`	routine created by LEX from (expected input, action) lists.
`yywrap`	called at EOF; (boolean) result = terminate or not
	LEX deals with EOF (e.g. control-D) as if with the following rule:
	`EOF if (yywrap( )) return 0;`
	e.g. use in file inclusion, to return to the original file at the end of the included file.

LEX