Next: YACC Up: CS2111: Design and Implementation Previous: Expressions: Operators Contents

Subsections

LEX

Postfix calculator in LEX

($CS2111/e*/postfix/*)


 %{ /* C declarations used in actions */
 #define stack_size 100
 static int sp, stack [stack_size];
 static void push (int i) {
   if (++sp<stack_size) stack[sp]= i;
   else {printf ("error: stack overflow\n"); exit(1);}
 }
 static int pop (void) {
   if (sp>=0) return stack[sp - -];
   else {printf ("error: stack underflow\n"); exit(1);}
 }
 %}

 /*lex definitions - not used in this example*/

 %%

 /* descriptions of       corresponding actions
    expected inputs       (C statements or blocks) */

 [0-9]+                   {push (atoi(yytext));}
 "+"                      {push (pop() + pop());}
 "*"                      {push (pop() * pop());}
 -                        {int rhs= pop(); push (pop() - rhs);}
 "/"                      {int rhs= pop(); push (pop() / rhs);}
 ;                        {printf ("%d\n", pop());}
 [ \t\n]                  ;
 [^-0-9+*/; \t\n]+        {ECHO; printf (" unexpected\n");}

 %%                        /* C code */

 int main (void) {sp= -1; yylex(); return 0;}

 static int yywrap (void) {return 1;}

Descriptions of expected inputs

characters	those characters (special characters must be in " ")
or	`while` or `"while"`	(the keyword) while
`"`characters`"`	`"+"`	(the operator) +
`\`character	`\n \t \\`	newline tab $\backslash$ etc. as for C
	otherwise, $\backslash$ character is an alternative to "character"
	`\+`	+
`^` and `$`	the start and the end of line respectively
	`^aardvark$`	a line just containing `aardvark`
`[`characters`]`	any one character in the set of characters
	`[AEIOUaeiou]`	any one vowel
	`[A-Za-z]`	any one letter
`[^`characters`]`	any one character not in the set of characters
	`[^\n]`	any character except a newline
`.`	any one character except a newline

+ 1 or more times

* 0 or 1 or more times

{n,m} n to m times

[0-9]+ an unsigned integer

[A-Z][a-z]* a capitalised name

[A-Za-z][A-Za-z0-9_]* a C identifier

a{2,3} aa or aaa

| separates alternatives

WHILE|while either WHILE or while

(any sub-pattern) that sub-pattern

([A-Z][A-Z])+ an even number of upper-case letters

pattern? optional (but only the immediately preceding pattern!)

ab?c ac or abc

/pattern only if followed by the pattern

ab/c ab, but only if followed by c

ab/\n the same as ab$

<>

Rule Priority: the earliest rule matching the longest possible string.

grammar rules from $\S$ 3.1

`-`, `;`	the character itself
`"+"`, `"*"`, `"/"`	special characters have to be quoted to just recognise them
`[ \t\n]`	white space (ignored)
`[0-9]+`	one or more numeric characters (a number)
`[^-0-9+*/; \t\n]+`	any other characters

LEX definitions

Not normally needed, but e.g.


 ALPHANUMERIC   [A-Za-z0-9]
 ALPHABETIC     [A-Za-z]
 %%
 {ALPHABETIC}{ALPHANUMERIC}*

recognises identifiers

Actions, C declarations and code

`yytext`	contains the actual characters recognised from input
`yyleng`	the number of characters in yytext
`ECHO`	= `printf ("%s", yytext)`, the default action
`yylex`	routine created by LEX from (expected input, action) lists.
`yywrap`	called at EOF; (boolean) result = terminate or not
	LEX deals with EOF (e.g. control-D) as if with the following rule:
	`EOF if (yywrap( )) return 0;`
	e.g. use in file inclusion, to return to the original file at the end of the included file.

How LEX is used

FLEX : infix.l $\rightarrow$ infix.c
GCC : infix.c $\rightarrow$ infix
infix : equations $\rightarrow$ answers

LEX has other facilities, but those described above are the most important. Refer to the LEX manual in the departmental library if necessary.

Using LEX to make an infix calculator

In the postfix calculator, we are relying on the fact that the (diadic) operator is the last item in the triplet describing an expression (sub-expression, sub-expression, operator) to enable us to calculate its value, using a stack to save the values of the sub-expressions. In an infix calculator, we have to detect the last item of (sub-expression, operator, sub-expression), and use the stack to save the first two items until the expression is complete.

Thus the stack must discriminate between numbers (from sub-expressions) and operators so we can stack operators for later. When we see a number we must check to see if it follows an operator (i.e. there is an operator on top of the stack), in which case it is the end of an expression which we must evaluate and stack the result of, or else it is the start of an expression and we must stack it:


[-+*/]         {push_operator (yytext[0]);}

[0-9]+         {rhs= atoi(yytext);
                 if (stack_top_is_operator())
                 {
                   op= pop();
                   lhs= pop();
                   rhs= do_operation (lhs, op, rhs);
                 }
                 push_value (rhs); /* lhs of next expression */
               }

However, the situation is usually more complex than this, as we need to deal with operator precedence and associativity, and brackets, all of which are unnecessary with postfix. For example, with an input of 1 + 2 * 3, the code above would calculate the value of 1 + 2 when the 2 was input, but we should really wait until we have input the next operator to see if it is of higher precedence (or of equal precedence and right associativity), as in this case the * is. (We will need some marker at the end of the expression, such as ;, that we can treat as if it has a lower precedence than any operator.)

Therefore, the characteristics of a more general grammar recogniser than LEX must be:

It has a built-in stack.
It builds up grammar components from lists of smaller fragments.
It can look ahead at one more piece of input to check that what it already has is complete and correct, and if so apply some action that can use the stack.

As we might expect, such tools exist, and the example we are going to look at in the next few sections is YACC.

Language design and implementation

$\S$ 2.10 said "a simple/cheap implementation may force a particular design" for expressions. One such design is to use postfix, where we can ignore precedence, associativity and evaluation order. However, postfix expressions are hard to check (e.g. test for stack underflow or for unused values left on the stack), and combined with its poor readability, this makes it harder to use.

Another simple design is to use fully bracketed prefix or postfix, which may be slightly easier for the user, as we can check that each operator has the right number of operands.

Overloading

It is impossible for LEX to distinguish the different uses of - in:


        - 1                       (signed literal)
        - B                       (negation)
        A - 1  or  A - B          (subtraction)

Therefore (unless someone invents a language which uses 3 different characters) languages do not allow signed literals in expressions but use negation and unsigned literals instead. However, even this is beyond LEX when used by itself, as - would have different numbers of operands, unless (as in SML) we use e.g. ^ for negation.

We would have similar problems with any other overloaded operator.

Exercises

( $\dag$ indicates harder problems)

What do the following LEX grammar rules mean:
(a) [^'] (b) ^' (c) [^'].* (d) '[^']' (e) '[^']+' (f) "[^']+"
Write LEX grammar rules to recognise literal numbers as defined in C: e.g. 1234, 1234L, 1234U, 1234UL, 123.4, 1234E-1, 12.34E+1, 123F, 123.4L, 12E+2F, 12E-2L, 0X1ABC, 0X1ABCU, 0X1ABCL, 0X1ABCUL (any of the alphabetic characters can be lower case as well).
Write LEX grammar rules to recognise characters as defined in C: e.g. 'a', '\n', '\027', '\0', '\'', '\x0f', '\\'
$\dag$ How would you recognise strings as defined in C?
Do the programming languages you know allow you to use both upper and lower case letters - are Fred, fred and FRED legal, and if so are they the same identifier or different ones? How about keywords - are IF, if and If the same? Do you prefer IF fred or if FRED or some other combination (aim for maximum clarity, not ease of typing)?
Write LEX grammar rules to recognise identifiers and (some) keywords for C and for SML.
Using LEX or GREP ( $\S$ 1.4.1), if we want to recognise one of a set of characters using square brackets, the order of the characters in the set does not normally matter. However, if we want to include a ] it has to come first in the set, and a - character has to come first or last. Why?
In the postfix calculator ( $\S$ 3.1) the final grammar rule is used to recognise unexpected characters. What would be the effect of using the dot metacharacter instead? i.e.:
. {ECHO; printf(" unexpected\n");}
or:
.+ {ECHO; printf(" unexpected\n");}
Be careful to take into account the way that LEX decides the relative priority of the grammar rules.
Write LEX grammar rules to recognise various comment conventions:
1. from # to the end of line
2. from # to # (including multiple lines)
3. $\dag$ from /* to */ (including multiple lines)
4. $\dag$ from /* to */ (including multiple lines and nested comments)
You should be careful to avoid taking two short comments plus the text between them as one long comment. If you do this the last two are hard, so do not spend too much time on them if you get stuck - you will either need to use some C code involving input and unput in an action or else use start states, which you can find out about via the readings (3.9).
$\dag$ Try to use LEX to write an infix calculator that can deal with precedence and associativity. You will need to replace the if statement in the code in $\S$ 3.6 by a while loop, that acts while the operator on the stack is of higher (or, if left associative, of equal) precedence than the next operator in the input.
Now add brackets ( and ).

Readings

Louden: ch. 4.1
Lesk & Schmidt
Levine, Mason & Brown: chs. 1, 2, 6
Capon & Jinks: ch. 7.2
Aho, Sethi & Ullman: ch. 3.5

Next: YACC Up: CS2111: Design and Implementation Previous: Expressions: Operators Contents

Pete Jinks
1999-09-30

`+`	1 or more times
`*`	0 or 1 or more times
`{`n`,`m`}`	n to m times
	`[0-9]+`	an unsigned integer
	`[A-Z][a-z]*`	a capitalised name
	`[A-Za-z][A-Za-z0-9_]*`	a C identifier
	`a{2,3}`	aa or aaa
`\|`	separates alternatives
	`WHILE\|while`	either WHILE or while
`(`any sub-pattern`)`	that sub-pattern
	`([A-Z][A-Z])+`	an even number of upper-case letters
pattern`?`	optional (but only the immediately preceding pattern!)
	`ab?c`	ac or abc
`/`pattern	only if followed by the pattern
	`ab/c`	ab, but only if followed by c
	`ab/\n`	the same as `ab$`