UP PREVIOUS NEXT

Ambiguous Source Rules.

Lex can handle ambiguous specifications. When more than one expression can match the current input, Lex chooses as follows:

1) The longest match is preferred.

2) Among rules which matched the same number of characters, the rule given first is preferred.

Thus, suppose the rules

integer	keyword action ...;
[a-z]+	identifier action ...;
to be given in that order. If the input is integers , it is taken as an identifier, because [a-z]+ matches 8 characters while integer matches only 7. If the input is integer , both rules match 7 characters, and the keyword rule is selected because it was given first. Anything shorter (e.g. int) will not match the expression integer and so the identifier interpretation is used.

The principle of preferring the longest match makes rules containing expressions like .* dangerous. For example,

'.*'
might seem a good way of recognizing a string in single quotes. But it is an invitation for the program to read far ahead, looking for a distant single quote. Presented with the input
'first' quoted string here, 'second' here
the above expression will match
'first' quoted string here, 'second'
which is probably not what was wanted. A better rule is of the form
'[^'\n]*'
which, on the above input, will stop after 'first' . The consequences of errors like this are mitigated by the fact that the . operator will not match newline. Thus expressions like .* stop on the current line. Don't try to defeat this with expressions like [.\n]+ or equivalents; the Lex generated program will try to read the entire input file, causing internal buffer overflows.

Note that Lex is normally partitioning the input stream, not searching for all possible matches of each expression. This means that each character is accounted for once and only once. For example, suppose it is desired to count occurrences of both she and he in an input text. Some Lex rules to do this might be

she	s++;
he	h++;
\n	|
.	;
where the last two rules ignore everything besides he and she. Remember that . does not include newline. Since she includes he, Lex will normally not recognize the instances of he included in she, since once it has passed a she those characters are gone.

Sometimes the user would like to override this choice. The action REJECT means ``go do the next alternative.'' It causes whatever rule was second choice after the current rule to be executed. The position of the input pointer is adjusted accordingly. Suppose the user really wants to count the included instances of he:

she	{s++; REJECT;}
he	{h++; REJECT;}
\n	|
.	;
these rules are one way of changing the previous example to do just that. After counting each expression, it is rejected; whenever appropriate, the other expression will then be counted. In this example, of course, the user could note that she includes he but not vice versa, and omit the REJECT action on he; in other cases, however, it would not be possible a priori to tell which input characters were in both classes.

Consider the two rules

a[bc]+	{ ... ; REJECT;}
a[cd]+	{ ... ; REJECT;}
If the input is ab , only the first rule matches, and on ad only the second matches. The input string accb matches the first rule for four characters and then the second rule for three characters. In contrast, the input accd agrees with the second rule for four characters and then the first rule for three.

In general, REJECT is useful whenever the purpose of Lex is not to partition the input stream but to detect all examples of some items in the input, and the instances of these items may overlap or include each other. Suppose a digram table of the input is desired; normally the digrams overlap, that is the word the is considered to contain both th and he . Assuming a two-dimensional array named digram to be incremented, the appropriate source is

%%
[a-z][a-z]	{digram[yytext[0]][yytext[1]]++; REJECT;}
.	;
\n	;
where the REJECT is necessary to pick up a letter pair beginning at every character, rather than at every other character.


UP PREVIOUS NEXT