UP

CS5031 Practical 1: Introduction to Lex

PURPOSE

This practical is designed to familiarise you with the use of flex (manual available via the web pages).

OBJECTIVES

Using flex, write a simple program that inputs ANSI C code and outputs a list of the words (lexemes, tokens) it has recognised. You are given a skeleton program that naively classifies its input into letters, digits, white space, and special characters. Improve this skeleton by modifying the patterns or adding new ones to detect as many of the different kinds of lexemes in ANSI C as you can.

Start by using simple patterns, to recognise the usual kinds of lexemes (i.e. those in the first part of the test data), and add complexity if you have the time. For example, numbers can simply consist of decimal digits, but you will find examples of other formats in the test data which you can try to write patterns for. Try to avoid allowing impossible formats e.g. if you recognise floating-point numbers, don't allow more than one decimal point! The description below also points out several problem areas. The previous exercise sheet should help.

Once you have written some new patterns, you do not need to go through the whole list (unless you want to, or have spare time - perhaps in the catch-up sessions at the end of the week). If you have time, there is obviously scope to implement many different kinds of ANSI C lexemes, along with suitable test data.

Discussions of characters and tokens can be found in various books describing ANSI C mentioned in the CS2111 book-list e.g. "Standard C - A Reference" by Plauger and Brodie. (characters, pre-processing, and syntax)

As an example of how your program should work, an input of

        #include <stdio.h>
        int main (void) {
          printf ("hello world\n");
          return 0;
        }

should be converted into an output something like this:

	int			built_in_type
				white_space
	main			identifier
				white_space
	(			punctuation_or_operator
	void			keyword
	)			punctuation_or_operator
				white_space
	{			punctuation
	 			white_space
	printf			identifier
	 			white_space
	(			punctuation_or_operator
	"hello world\n" 	string
	)			punctuation_or_operator
	;			punctuation
	 			white_space
	return			keyword
	 			white_space
	0			octal_int_number
	;			punctuation
				white_space
	}			punctuation
				white_space

DESCRIPTION

All exercises are done on Unix systems (i.e. Linux, SunOS or Solaris). You should start the week by creating a new directory for the course, and then one for each exercise inside it:

	mkdir 503
	cd 503
	mkdir ex1 ex2 ex3 ex_ass

It is important that you use a different directory for each exercise, as at least one file in each shares the same name (makefile). Start each exercise by going to the correct directory and copying across the various starting files e.g. for this lab:

	cd 503			(unless you have already done this above)
	cd ex1
	cp $CS5031/p*/ex1/* .	(note the . at the end - this is important!)

If you get an error message like:

	cp: cannot access /p*/ex1/*
it means that (assuming you have typed in the line correctly - check it!) you do not have the variable CS5031 set up. You can set it yourself by (in ksh or bash):
	CS5031=/opt/info/courses/CS5031
or (in csh):
	set CS5031 = /opt/info/courses/CS5031

Copy the starting files from "$CS5031/p*/ex1/*". The files are a makefile, some test data (ANSI C code), a pre-compiled library (checker.o), and c_lexemes.l, which classifies its input into letters, digits, white space, special characters, and anything else. It also contains a function, "lexeme", which you should call when you recognise each lexeme, to produce a neat listing like that above. Its parameter is the lexeme classification, which must be one of the values from:

	enum lexemes {ignore, float_number, octal_int_number,
		decimal_int_number, hex_int_number, preprocessor_command,
		comment, character, keyword, built_in_type, identifier,
		punctuation_or_operator, punctuation, operator,
		string, unknown, white_space};
(Don't alter this declaration or the one for "lexeme_names", else the checker may go wrong)

c_lexeme.l includes calls to "check" and "report" (provided by the pre-compiled file checker.o and added to your program by the makefile - if you are not using linux then comment out the calls). These routines check your lexemes, and list the total numbers recognised. (Use "ignore" if you don't want a particular lexeme to be checked - for example, most things recognised by the starting version of c_lexeme.l are "ignore"d, as they aren't ANSI C lexemes.)

The "check" only has effect at the end of a line (but not inside a multi-line comment or preprocessor command) so you may find it awkward to relate warnings to lexemes listed a while before. However, in general I hope you will find that these messages provide useful hints about mistakes you make and what you should try next. This facility is new, and any suggestions will be very welcome.

You can compile and run the flex program on the given data by typing "make test".

The starting version of c_lexemes.l already correctly recognises some lexemes, and you should keep these:
* white space (spaces, tabs and newlines), using:

        [ \t\n]	{lexeme(white_space);}
* unexpected characters, to help you identify wrong or missing patterns, using:
        .		{lexeme(unknown);}
You should ensure that this is always the last pattern in your program - if you aren't sure why, see what happens when you move it.

It also recognises other characters, but not as ANSI C lexemes, and warns the checker to "ignore" them:

        [A-Za-z]+				{lexeme(ignore); /*letter(s)*/}
        [0-9]+					{lexeme(ignore); /*decimal digit(s)*/}
        []!@#$%^&*()_+=|\\~`[{};:'",<.>/?-]+	{lexeme(ignore); /*special character(s)*/}
You should edit c_lexemes.l to gradually replace these patterns by ones that recognise ANSI C lexemes:
* identifier, keyword (e.g. "if", "while", "else"), built_in_type (e.g. "int", "char"): Patterns that recognise special cases, such as particular keywords, must come before patterns that recognise more general cases, such as any identifier - if you aren't sure why, see what happens when you move them.
* float_number, decimal_int_number, octal_int_number, hex_int_number, character: Start with simple versions, as in the first part of the test data. Only try recognising the formats in the last part of the test data when you have everything else working.
* punctuation, operator: use punctuation_or_operator for , : ( ) You should only recognise valid sequences of special characters as single lexemes.
* comment, string: (this will also help avoid spuriously recognising any of the other lexemes above that are actually part of a comment or string). Start with simple cases e.g. strings that don't contain " characters, and comments that don't contain * or / characters.
* preprocessor_command: Simple ones consist of any line that has a "#" as the first non-blank character.
If you have time to spare, try to recognise all strings and/or comments and/or pre-processor commands correctly (hints available via the web pages).
* The last pattern in your program should pick up unexpected characters using:
        .        {lexeme("????????");}


UP NEXT