"/*"(.|\n)*"*/" {/*a C comment*/}(i.e. /*, followed by any characters, followed by */ )
We can overcome this if a single character indicates the end, as long as we don't allow nested comments - e.g. Pascal uses "{" and "}" for comments and does not allow nested comments (although a good compiler might give a warning if it detects a "{" inside a comment):
"{"[^}]*"}"(i.e. "{", followed by any characters except a "}", terminated by "}" )
This should mean that recognising strings is easy, as they are usually terminated by the single character ". However, this is made more complicated again because most languages allow a special escape character ("\" for C and SML) so the following are valid strings:
"\n" a newline "\"" a quote character "\\" a backslash character
Note: this assumes that you only want to match correct input, and reject
anything even slightly wrong. In practice, it is normally more user-friendly
to use more generous rules (e.g. match anything between " " characters)
and then somehow check the matched text against the exact rules.
If so, you may be able to use simpler solutions, such as using yyless() as
in J.Levine, T.Mason & D.Brown: Lex and Yacc (2nd ed.) pp.
174-175 or M.E.Lesk & E.Schmidt: Lex - A
Lexical Analyzer Generator lex actions
It is usually best to have two sets of patterns: the first being an exact match, and the second being more generous but always generating an error message, as it will only be used if the first version fails.
{START} ( {SIMPLE} | {COMPLEX} ) * {END}
The SIMPLE pattern should recognise anything legal, but that can't be confused with END, and only consists of a single character. It is usually of the form [...] or [^...]. The COMPLEX pattern is used to deal with the more difficult cases, possibly involving several input characters.
For example, consider ANSI-C comments:
/* a simple comment */ /* also /* a comment */ /* a comment * with a silly character in it */ /* a multi-line comment */ /* two comments on one line */ . . . /* NOT one long comment */ /* a comment with a silly character at the end **/To recognise them, we might try something like this:
START "/*" END "*/" SIMPLE [^*] COMPLEX "*"/[^/]i.e. a comment STARTs with "/*" and ENDs with "*/", and can contain anything except another "*/", so the SIMPLE rule recognises anything except "*", and the COMPLEX rule recognises any "*" as long as it isn't followed by a "/".
Unfortunately, this example doesn't work, as you can't use trailing context anywhere except at the end of the complete pattern, so the (sub)pattern for COMPLEX is rejected by flex.
This kind of solution can handle anything that doesn't require trailing
context, so is OK for strings, but to recognise comments properly you need to
do something different:
Using one, even more complicated, pattern
More generally, we recognise the START, and then we recognise SIMPLE things,
until we meet some characters we are worried about (e.g. a "*" in a comment).
Then, if the "*" is followed by a "/", we are at the END - if it is followed
by another "*" then we are still worried (and need to see what comes next),
but if it is followed by anything else we are still inside the comment and can
look for more SIMPLE things.
Here is a really bad ASCII drawing of the automaton. "S" indicates the state where we are looping, recognising simple things [^*], "W" the worried state where the last character we have seen is a "*", and "E" the end state where we have recognised a complete comment.
[^*/] "/*" <------- "/" -------> S W -------> E ( ) -------> ( ) [^*] "*" "*"This corresponds to the following regular expression:
{START} ( {NOT_WORRYING}* {WORRYING}+ {NOT_WORRYING_NOR_FINAL} )* {NOT_WORRYING}* {WORRYING}+ {FINAL}(I have not substituted the sub-patterns from the automaton into the regular expression, as I don't want to give my students the complete answer to one of their lab exercises!)
For example, you could produce code like:
%x C %% {START} {...; yymore(); BEGIN C;} <C>{SIMPLE} {...; yymore();} <C>{COMPLEX} {...; yymore();} <C>{END} {...; ECHO; BEGIN 0;}i.e. set up SIMPLE, COMPLEX and END as separate patterns, so COMPLEX can now have trailing context.
Make sure you understand the difference between exclusive (%x) and inclusive (%s) start states - I believe lex only provides the latter, whereas flex provides both.
There are examples of start states here and in the lex version of Ze sveedish chef and