DURATION 1 lab session


This exercise is designed to familiarise you with the use of flex (manual available via the web pages) so that, when you come to use it later, you can concentrate on the important aspects of the problem.


Using flex, write a simple program that inputs ANSI C code and outputs a list of the words (lexemes, tokens) it has recognised. You are given a skeleton program that naively classifies its input into letters, digits, white space, and special characters. Improve this skeleton by modifying the patterns or adding new ones to detect as many of the different kinds of lexemes in ANSI C as you can.

Start by using simple patterns, to recognise the usual kinds of lexemes (i.e. those in the first part of the test data), and add complexity if you have the time. For example, numbers can simply consist of decimal digits, but you will find examples of other formats in the test data which you can try to write patterns for. Try to avoid allowing impossible formats e.g. if you recognise floating-point numbers, don't allow more than one decimal point! The description below also points out some problem (bonus) areas. There is obviously scope for bonus work in implementing as many different kinds of ANSI C lexemes as possible, along with suitable test data.

Discussions of characters and tokens can be found in various books describing ANSI C mentioned in the book list e.g. "Standard C - A Reference" by Plauger and Brodie. (characters, pre-processing, and syntax)

As an example of how your program should work, an input of

        int main (void) {
          printf ("hello world\n");
          return 0;

should be converted into an output something like this:

	int			built_in_type
	main			identifier
	(			punctuation_or_operator
	void			keyword
	)			punctuation_or_operator
	{			punctuation
	printf			identifier
	(			punctuation_or_operator
	"hello world\n" 	string
	)			punctuation_or_operator
	;			punctuation
	return			keyword
	0			octal_int_number
	;			punctuation
	}			punctuation


Copy the starting files from "$CS2111/p*/ex2/*". The files are a makefile, some test data (ANSI C code), a pre-compiled library (checker.o), and c_lexemes.l, which classifies its input into letters, digits, white space, special characters, and anything else. It also contains a function, "lexeme", which you should call when you recognise each lexeme, to produce a neat listing like that above. Its parameter is the lexeme classification, which must be one of the values from:

	enum lexemes {ignore, float_number, octal_int_number,
		decimal_int_number, hex_int_number, preprocessor_command,
		comment, character, keyword, built_in_type, identifier,
		punctuation_or_operator, punctuation, operator,
		string, unknown, white_space};
(Don't alter this declaration or the one for "lexeme_names", else the checker may go wrong)

c_lexeme.l includes calls to "check" and "report" (provided by the pre-compiled file checker.o and added to your program by the makefile - if you are not using linux then comment out the calls). These routines check your lexemes, and list the total numbers recognised. (Use "ignore" if you don't want a particular lexeme to be checked - for example, most things recognised by the starting version of c_lexeme.l are "ignore"d, as they aren't ANSI C lexemes.)

The "check" only has effect at the end of a line (but not inside a multi-line comment or preprocessor command) so you may find it awkward to relate warnings to lexemes listed a while before. However, in general I hope you will find that these messages provide useful hints about mistakes you make and what you should try next. This facility is new, and any suggestions will be very welcome.

You can compile and run the flex program on the given data by typing "make test".

The starting version of c_lexemes.l already correctly recognises some lexemes, and you should keep these:
* white space (spaces, tabs and newlines), using:

        [ \t\n]	{lexeme(white_space);}
* unexpected characters, to help you identify wrong or missing patterns, using:
        .		{lexeme(unknown);}
You should ensure that this is always the last pattern in your program - if you aren't sure why, see what happens when you move it.

It also recognises other characters, but not as ANSI C lexemes, and warns the checker to "ignore" them:

        [A-Za-z]+				{lexeme(ignore); /*letter(s)*/}
        [0-9]+					{lexeme(ignore); /*decimal digit(s)*/}
        []!@#$%^&*()_+=|\\~`[{};:'",<.>/?-]+	{lexeme(ignore); /*special character(s)*/}
You should edit c_lexemes.l to gradually replace these patterns by ones that recognise ANSI C lexemes:
* identifier, keyword (e.g. "if", "while", "else"), built_in_type (e.g. "int", "char"): Patterns that recognise special cases, such as particular keywords, must come before patterns that recognise more general cases, such as any identifier - if you aren't sure why, see what happens when you move them.
* float_number, decimal_int_number, octal_int_number, hex_int_number, character: Start with simple versions, as in the first part of the test data. Only try recognising the formats in the last part of the test data when you have everything else working.
* punctuation, operator: use punctuation_or_operator for , : ( ) You should only recognise valid sequences of special characters as single lexemes.
* comment, string: (this will also help avoid spuriously recognising any of the other lexemes above that are actually part of a comment or string). Start with simple cases e.g. strings that don't contain " characters, and comments that don't contain * or / characters.
* preprocessor_command: Simple ones consist of any line that has a "#" as the first non-blank character.
Bonus: recognise all strings and/or comments and/or pre-processor commands correctly (hints available via the web pages). (This is probably only worth 1 mark.)


Demonstrate your working program to a demonstrator or lab supervisor. You must "labmail" the final version of your flex code i.e. "c_lexemes.l".