Recognising strings and/or comments

The Problem

Simple patterns like:
	"/*"(.|\n)*"*/"		{/*a C comment*/}
(i.e. /*, followed by any characters, followed by */ )
will match everything in the input file from the first "/*" to the last "*/", including uncommented text, because lex always tries to match as much as it can with each pattern.
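For example, given an input file containing:
	/* one */ int x = 0; /* two */
this pattern matches the whole line as a single token, so the code between the two comments is silently thrown away along with them.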

We can overcome this if a single character indicates the end, as long as we don't allow nested comments - e.g. Pascal uses "{" and "}" for comments and does not allow nested comments (although a good compiler might give a warning if it detects a "{" inside a comment):

	"{"[^}]*"}"
(i.e. "{", followed by any characters except a "}", terminated by "}" )

This should mean that recognising strings is easy, as they are usually terminated by the single character ". However, this is complicated by the fact that most languages allow a special escape character ("\" for C and SML), so the following are all valid strings:

	"\n"		a newline
	"\""		a quote character
	"\\"		a backslash character

Note: this assumes that you only want to match correct input, and reject anything even slightly wrong. In practice, it is normally more user-friendly to use more generous rules (e.g. match anything between " " characters) and then somehow check the matched text against the exact rules.
If so, you may be able to use simpler solutions, such as using yyless(), as in J.Levine, T.Mason & D.Brown: Lex and Yacc (2nd ed.), pp. 174-175, or the section on lex actions in M.E.Lesk & E.Schmidt: Lex - A Lexical Analyzer Generator.

It is usually best to have two sets of patterns: the first being an exact match, and the second being more generous but always generating an error message, as it will only be used if the first version fails.
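For example, using the Pascal-style comments above, a sketch of this approach might be (the error message is just a placeholder for whatever recovery you want):

	"{"[^}]*"}"	{ /* exact rule: a complete comment, so discard it */ }
	"{"[^}]*	{ fprintf(stderr, "unterminated comment\n"); }

Because lex prefers the longest match, the second rule can only fire when there is no closing "}" left in the input, i.e. exactly when the first rule fails.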


The Solutions

Using a single, complicated, pattern

You need a pattern of the form
	{START} ( {SIMPLE} | {COMPLEX} ) * {END}

The SIMPLE pattern should recognise anything legal that can't be confused with END, and should only match a single character. It is usually of the form [...] or [^...]. The COMPLEX pattern is used to deal with the more difficult cases, possibly involving several input characters.
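For C-style strings, one possible set of sub-patterns is (a sketch only - it assumes the escape rules above, and that a raw newline may not appear inside a string):

	START	\"
	END	\"
	SIMPLE	[^"\\\n]
	COMPLEX	\\(.|\n)

These slot into the {START}({SIMPLE}|{COMPLEX})*{END} shape above: COMPLEX always swallows the character after a "\", which is what stops an escaped quote being mistaken for END.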

For example, consider ANSI-C comments:

	/* a simple comment */
	/* also /* a comment */
	/* a comment * with a silly character in it */
	/* a multi-line
		comment
	*/
	/* two comments on one line */ . . . /* NOT one long comment */
	/* a comment with a silly character at the end **/
To recognise them, we might try something like this:
	START	"/*"
	END	"*/"
	SIMPLE	[^*]
	COMPLEX	"*"/[^/]
i.e. a comment STARTs with "/*" and ENDs with "*/", and can contain anything except another "*/", so the SIMPLE rule recognises anything except "*", and the COMPLEX rule recognises any "*" as long as it isn't followed by a "/".
(The / operator indicates trailing context, which in this example is anything except an actual "/" character - confusing eh? Also remember that we don't want " " quotes around actual characters inside [...])

Unfortunately, this example doesn't work, as you can't use trailing context anywhere except at the end of the complete pattern, so the (sub)pattern for COMPLEX is rejected by flex.

This kind of solution can handle anything that doesn't require trailing context, so is OK for strings, but to recognise comments properly you need to do something different:

Using one, even more complicated, pattern

More generally, we recognise the START, and then we recognise SIMPLE things, until we meet some characters we are worried about (e.g. a "*" in a comment). Then, if the "*" is followed by a "/", we are at the END - if it is followed by another "*" then we are still worried (and need to see what comes next), but if it is followed by anything else we are still inside the comment and can look for more SIMPLE things.

Here is a really bad ASCII drawing of the automaton. "S" indicates the state where we are looping, recognising simple things [^*], "W" the worried state where the last character we have seen is a "*", and "E" the end state where we have recognised a complete comment.

                   [^*/]
            +-----------------+
            v                 |
    "/*"  +---+     "*"     +---+     "/"
  ------->| S |------------>| W |-------------> E
          +---+             +---+
           ^ |               ^ |
           +-+               +-+
           [^*]              "*"
This corresponds to the following regular expression:
	{START}
	( {NOT_WORRYING}* {WORRYING}+ {NOT_WORRYING_NOR_FINAL} )*
	{NOT_WORRYING}* {WORRYING}+ {FINAL}
(I have not substituted the sub-patterns from the automaton into the regular expression, as I don't want to give my students the complete answer to one of their lab exercises!)

Using several simple patterns

Essentially, we just recognise the START of the object using a pattern, and then do something different to pick up the rest of the object.
Either:
Force lex to temporarily match the input against a new set of patterns, by using start states (e.g. J.Levine, T.Mason & D.Brown: Lex and Yacc (2nd ed.), pp. 171-173,
or the section on left context sensitivity in M.E.Lesk & E.Schmidt: Lex - A Lexical Analyzer Generator).
You may need to use the yymore() function to glue together the fragments of text in yytext (c.f. the section on lex actions in M.E.Lesk & E.Schmidt: Lex - A Lexical Analyzer Generator).

For example, you could produce code like:

	%x C
	%%
	{START}	  	{...; yymore(); BEGIN C;}
	<C>{SIMPLE}	{...; yymore();}
	<C>{COMPLEX}	{...; yymore();}
	<C>{END}  	{...; ECHO; BEGIN 0;}
	
i.e. set up SIMPLE, COMPLEX and END as separate patterns, so COMPLEX can now have trailing context.
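For instance, here is a minimal runnable sketch (flex-flavoured: %option noyywrap and the explicit main are assumptions) that uses an exclusive start state simply to discard ANSI-C comments; because nothing is kept, it does not need yymore():

	%option noyywrap
	%x C
	%%
	"/*"		BEGIN C;	/* START: switch to the comment state */
	<C>"*/"		BEGIN 0;	/* END: back to normal scanning */
	<C>[^*]		;		/* SIMPLE: anything except a "*" is ignored */
	<C>"*"		;		/* a "*" not followed by "/": still inside the comment */
	.|\n		ECHO;		/* everything outside a comment is copied through */
	%%
	int main(void) { yylex(); return 0; }

(A real scanner would probably also want a <C><<EOF>> rule to complain about an unterminated comment.)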

Make sure you understand the difference between exclusive (%x) and inclusive (%s) start states: in an exclusive state only rules explicitly prefixed with that state are active, whereas in an inclusive state rules with no start-state prefix remain active too - I believe lex only provides the latter, whereas flex provides both.

There are examples of start states here and in the lex version of Ze sveedish chef.

Or:
Recognise the start of the text using a lex pattern, and then use normal input to recognise the rest of the text. This involves writing a significant chunk of C in a lex action that uses lex's input() function - see the sketch below. e.g. J.Levine, T.Mason & D.Brown: Lex and Yacc (2nd ed.) pp. 158-159
c.f. exercises on CS2111 handout 3.
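
For example, a sketch along these lines for ANSI-C comments (flex-flavoured: flex's input() returns EOF at the end of the input, whereas the original lex's input() returns 0):

	%option noyywrap
	%{
	#include <stdio.h>
	%}
	%%
	"/*"	{	/* read the rest of the comment one character at a time */
			int c, prev = 0;
			while ((c = input()) != EOF) {
				if (prev == '*' && c == '/')
					break;		/* a "*" followed by a "/": the end */
				prev = c;
			}
			if (c == EOF)
				fprintf(stderr, "end of file inside a comment\n");
		}
	.|\n	ECHO;
	%%
	int main(void) { yylex(); return 0; }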