$CS2121/e*/postfix/*
)
%{ /* C declarations used in actions */ #define stack_size 100 static int sp, stack [stack_size]; static void push (int i) { if (++sp<stack_size) stack[sp]= i; else {printf ("error: stack overflow\n"); exit(1);} } static int pop (void) { if (sp>=0) return stack[sp--]; else {printf ("error: stack underflow\n"); exit(1);} } %} /*lex definitions - not used in this example*/ %% /* descriptions of corresponding actions expected inputs (C statements or blocks) */ [0-9]+ {push (atoi(yytext));} "+" {push (pop() + pop());} "*" {push (pop() * pop());} - {int rhs= pop(); push (pop() - rhs);} "/" {int rhs= pop(); push (pop() / rhs);} ; {printf ("%d\n", pop());} [ \t\n] ; [^-0-9+*/; \t\n]+ {ECHO; printf (" unexpected\n");} %% /* C code */ int main (void) {sp= -1; yylex(); return 0;} static int yywrap (void) {return 1;}
characters | those characters (special characters must be in " ") | |
or | while or "while" |
(the keyword) while |
" characters" |
"+" |
(the operator) + |
\ character |
\n \t \\ |
newline tab etc. as for C |
otherwise, character is an alternative to "character" | ||
\+ |
+ | |
^ and $ |
the start and the end of line respectively | |
^aardvark$ |
a line just containing aardvark |
|
[ characters] |
any one character in the set of characters | |
[AEIOUaeiou] |
any one vowel | |
[A-Za-z] |
any one letter | |
[^ characters] |
any one character not in the set of characters | |
[^\n] |
any character except a newline | |
. |
any one character except a newline | |
+ |
1 or more times | |
* |
0 or 1 or more times | |
{ n, m} |
n to m times | |
[0-9]+ |
an unsigned integer | |
[A-Z][a-z]* |
a capitalised name | |
[A-Za-z][A-Za-z0-9_]* |
a C identifier | |
a{2,3} |
aa or aaa | |
| |
separates alternatives | |
WHILE|while |
either WHILE or while | |
( any sub-pattern) |
that sub-pattern | |
([A-Z][A-Z])+ |
an even number of upper-case letters | |
pattern? |
optional (but only the immediately preceding pattern!) | |
ab?c |
ac or abc | |
/ pattern |
only if followed by the pattern | |
ab/c |
ab, but only if followed by c | |
ab/\n |
the same as ab$ |
Rule Priority: the earliest rule matching the longest possible string.
- , ; |
the character itself |
"+" , "*" , "/" |
special characters have to be quoted to just recognise them |
[ \t\n] |
white space (ignored) |
[0-9]+ |
one or more numeric characters (a number) |
[^-0-9+*/; \t\n]+ |
any other characters |
Not normally needed, but e.g.
ALPHANUMERIC [A-Za-z0-9] ALPHABETIC [A-Za-z] %% {ALPHABETIC}{ALPHANUMERIC}*
recognises identifiers
yytext |
contains the actual characters recognised from input |
yyleng |
the number of characters in yytext |
ECHO |
= printf ("%s", yytext) , the default action |
yylex |
routine created by LEX from (expected input, action) lists. |
yywrap |
called at EOF; (boolean) result = terminate or not |
LEX deals with EOF (e.g. control-D) as if with the following rule: | |
EOF if (yywrap( )) return 0; |
|
e.g. use in file inclusion, to return to the original file at the end of the included file. |
FLEX : infix.l infix.c
GCC : infix.c infix
infix : equations answers
LEX has other facilities, but those described above are the most important. Refer to the LEX manual in the departmental library or online if you need to.
We met regular expressions before, with EGREP, and here they are again in LEX. This isn't laziness or lack of imagination on the part of the designers of these programs - regular expressions are well suited to the job we are asking them to perform, finding simple patterns (such as words or numbers) in text.
Furthermore, although there are more facilites in LEX's version of regular expressions than in EGREP's, including some we haven't looked at such as start states, it turns out that these do not help us to recognise more complex patterns, but just help us to describe the same patterns in simpler ways or write the C actions more easily. In fact, you will see in the other half of this course that regular expressions are already as complicated as we can make them without having to fundamentally change (i.e. slow down) the way in which programs like EGREP and LEX work.
However, we do need to describe and search for more complex patterns, even though we have reached a fundamental limit with regular expressions. We can see these limits if we try to use LEX to deal with some other forms of expressions:
In the postfix calculator, we are relying on the fact that every operator is
at the end of its expression, which takes the form:
sub-expression sub-expression operator
Because of this, we can stack values from the sub-expressions as we read
them, and trigger the final calculation when we finally see the operator.
If we tried to write an infix calculator, recognising expressions of the form:
sub-expression operator sub-expression
we would have to stack both values and operators, and count sub-expressions so
that we trigger the final calculation at the end of the second
sub-expression. Worse, if we wanted to treat expressions like 1 + 2 * 3
correctly, and multiply the 2 and the 3 before adding the 1, things would
be even more complicated.
All this is doable if we expend sufficient ingenuity, but quite clearly the pattern-matching abilities of LEX are no longer sufficient to assist in this, and we would be writing more and more complex C actions.
What we need to be able to do is describe the pattern for an expression as consisting (recursively) of an expression, an operator, and a second expression, with some corresponding action to perform the calculation once the complete pattern has been recognised. The recogniser will need a built-in stack to enable it to keep track of the various sub-expressions as it processes the nested patterns.
It is impossible for LEX to distinguish the different uses of
-
in:
- 1 (signed literal) - B (negation) A - 1 or A - B (subtraction)
One consequence of this is that most languages do not allow signed literals in
expressions but use negation and unsigned literals instead. However, even this
is beyond LEX when used by itself, as -
would have different
numbers of operands, unless (as in SML) we use e.g. ~
for negation.
We would have similar problems with any other overloaded operator.
As regular expressions have reached their limit, we are going to have to use a new notation (BNF) to describe these sorts of patterns, and a new tool (YACC) to deal with the BNF.
[^']
(b) ^'
(c) [^'].*
(d) '[^']'
(e) '[^']+'
(f) "[^']+"
1234
, 1234L
, 1234U
, 1234UL
, 123.4
,
1234E-1
, 12.34E+1
, 123F
, 123.4L
, 12E+2F
,
12E-2L
, 0X1ABC
, 0X1ABCU
, 0X1ABCL
, 0X1ABCUL
(any of the alphabetic characters can be lower case as well).
'a'
, '\n'
, '\027'
, '\0'
, '\''
,
'\x0f'
, '\\'
Fred
, fred
and FRED
legal,
and if so are they the same identifier or different ones? How about keywords
- are IF
, if
and If
the same? Do you prefer
IF fred
or if FRED
or some other combination (aim for maximum
clarity, not ease of typing)?
]
it has to come first in the set, and a -
character
has to come first or last. Why?
. {ECHO; printf(" unexpected\n");}
.+ {ECHO; printf(" unexpected\n");}
#
to the end of line
#
to #
(including multiple lines)
/*
to */
(including multiple lines)
/*
to */
(including multiple lines and nested comments)
You should be careful to avoid taking two short comments plus the text
between them as one long comment.
If you do this the last two are hard, so do not spend too much time on them
if you get stuck - you will either need to use some C code involving
input
and unput
in an action or else use start states, which
you can find out about via the readings (2.8).
(
and )
.
Louden: ch. 4.1
Lesk & Schmidt
Levine, Mason & Brown: chs. 1, 2, 6
Capon & Jinks: ch. 7.2
Aho, Sethi & Ullman: ch. 3.5