next up previous contents
Next: YACC Up: CS2111: Design and Implementation Previous: Expressions: Operators   Contents

Subsections


LEX


Postfix calculator in LEX

($CS2111/e*/postfix/*)

 %{ /* C declarations used in actions */
 #define stack_size 100
 static int sp, stack [stack_size];
 static void push (int i) {
   if (++sp<stack_size) stack[sp]= i;
   else {printf ("error: stack overflow\n"); exit(1);}
 }
 static int pop (void) {
   if (sp>=0) return stack[sp - -];
   else {printf ("error: stack underflow\n"); exit(1);}
 }
 %}

 /*lex definitions - not used in this example*/

 %%

 /* descriptions of       corresponding actions
    expected inputs       (C statements or blocks) */

 [0-9]+                   {push (atoi(yytext));}
 "+"                      {push (pop() + pop());}
 "*"                      {push (pop() * pop());}
 -                        {int rhs= pop(); push (pop() - rhs);}
 "/"                      {int rhs= pop(); push (pop() / rhs);}
 ;                        {printf ("%d\n", pop());}
 [ \t\n]                  ;
 [^-0-9+*/; \t\n]+        {ECHO; printf (" unexpected\n");}

 %%                        /* C code */

 int main (void) {sp= -1; yylex(); return 0;}

 static int yywrap (void) {return 1;}

Descriptions of expected inputs

characters those characters (special characters must be in " ")
or while or "while" (the keyword) while
"characters" "+" (the operator) +
\character \n \t \\ newline tab $\backslash$ etc. as for C
  otherwise, $\backslash$character is an alternative to "character"
  \+ +
^ and $ the start and the end of line respectively
  ^aardvark$ a line just containing aardvark
[characters] any one character in the set of characters
  [AEIOUaeiou] any one vowel
  [A-Za-z] any one letter
[^characters] any one character not in the set of characters
  [^\n] any character except a newline
. any one character except a newline
<>

+ 1 or more times
* 0 or 1 or more times
{n,m} n to m times
  [0-9]+ an unsigned integer
  [A-Z][a-z]* a capitalised name
  [A-Za-z][A-Za-z0-9_]* a C identifier
  a{2,3} aa or aaa
| separates alternatives
  WHILE|while either WHILE or while
(any sub-pattern) that sub-pattern
  ([A-Z][A-Z])+ an even number of upper-case letters
pattern? optional (but only the immediately preceding pattern!)
  ab?c ac or abc
/pattern only if followed by the pattern
  ab/c ab, but only if followed by c
  ab/\n the same as ab$
<>

Rule Priority: the earliest rule matching the longest possible string.

grammar rules from $\S $3.1

-, ; the character itself
"+", "*", "/" special characters have to be quoted to just recognise them
[ \t\n] white space (ignored)
[0-9]+ one or more numeric characters (a number)
[^-0-9+*/; \t\n]+ any other characters
<>

LEX definitions

Not normally needed, but e.g.

 ALPHANUMERIC   [A-Za-z0-9]
 ALPHABETIC     [A-Za-z]
 %%
 {ALPHABETIC}{ALPHANUMERIC}*
recognises identifiers

Actions, C declarations and code

yytext contains the actual characters recognised from input
yyleng the number of characters in yytext
ECHO = printf ("%s", yytext), the default action
yylex routine created by LEX from (expected input, action) lists.
yywrap called at EOF; (boolean) result = terminate or not
  LEX deals with EOF (e.g. control-D) as if with the following rule:
  EOF if (yywrap( )) return 0;
  e.g. use in file inclusion, to return to the original file at the end of the included file.
<>

How LEX is used

FLEX : infix.l $\rightarrow$ infix.c
GCC : infix.c $\rightarrow$ infix
infix : equations $\rightarrow$ answers

LEX has other facilities, but those described above are the most important. Refer to the LEX manual in the departmental library if necessary.


Using LEX to make an infix calculator

In the postfix calculator, we are relying on the fact that the (diadic) operator is the last item in the triplet describing an expression (sub-expression, sub-expression, operator) to enable us to calculate its value, using a stack to save the values of the sub-expressions. In an infix calculator, we have to detect the last item of (sub-expression, operator, sub-expression), and use the stack to save the first two items until the expression is complete.

Thus the stack must discriminate between numbers (from sub-expressions) and operators so we can stack operators for later. When we see a number we must check to see if it follows an operator (i.e. there is an operator on top of the stack), in which case it is the end of an expression which we must evaluate and stack the result of, or else it is the start of an expression and we must stack it:


[-+*/]         {push_operator (yytext[0]);}

[0-9]+         {rhs= atoi(yytext);
                 if (stack_top_is_operator())
                 {
                   op= pop();
                   lhs= pop();
                   rhs= do_operation (lhs, op, rhs);
                 }
                 push_value (rhs); /* lhs of next expression */
               }
However, the situation is usually more complex than this, as we need to deal with operator precedence and associativity, and brackets, all of which are unnecessary with postfix. For example, with an input of 1 + 2 * 3, the code above would calculate the value of 1 + 2 when the 2 was input, but we should really wait until we have input the next operator to see if it is of higher precedence (or of equal precedence and right associativity), as in this case the * is. (We will need some marker at the end of the expression, such as ;, that we can treat as if it has a lower precedence than any operator.)

Therefore, the characteristics of a more general grammar recogniser than LEX must be:

As we might expect, such tools exist, and the example we are going to look at in the next few sections is YACC.

Language design and implementation

$\S $2.10 said "a simple/cheap implementation may force a particular design" for expressions. One such design is to use postfix, where we can ignore precedence, associativity and evaluation order. However, postfix expressions are hard to check (e.g. test for stack underflow or for unused values left on the stack), and combined with its poor readability, this makes it harder to use.

Another simple design is to use fully bracketed prefix or postfix, which may be slightly easier for the user, as we can check that each operator has the right number of operands.

Overloading

It is impossible for LEX to distinguish the different uses of - in:

        - 1                       (signed literal)
        - B                       (negation)
        A - 1  or  A - B          (subtraction)
Therefore (unless someone invents a language which uses 3 different characters) languages do not allow signed literals in expressions but use negation and unsigned literals instead. However, even this is beyond LEX when used by itself, as - would have different numbers of operands, unless (as in SML) we use e.g. ^ for negation.

We would have similar problems with any other overloaded operator.

Exercises

($\dag $ indicates harder problems)


Readings

Louden: ch. 4.1
Lesk & Schmidt
Levine, Mason & Brown: chs. 1, 2, 6
Capon & Jinks: ch. 7.2
Aho, Sethi & Ullman: ch. 3.5
next up previous contents
Next: YACC Up: CS2111: Design and Implementation Previous: Expressions: Operators   Contents
Pete Jinks
1999-09-30