UP PREVIOUS NEXT

Ex. 2121.2 : INTRODUCTION TO LEX

DURATION 1 lab session

AIMS

This exercise is designed to familiarise you with the use of flex (manual available via the web pages) so that, when you come to use it later, you can concentrate on the important aspects of the problem.

LEARNING OUTCOMES

On successful completion of this exercise a student will:
Be able to use flex to recognise the lexemes (words) in a piece of text.
Be aware of the different kinds of lexemes found in typical programming languages.
Be more familiar with some of the details of the C programming language, as standardised by ISO in 1989.

SUMMARY

Using flex, write a simple program that inputs C code and outputs a list of the words (lexemes, tokens) it has recognised. You are given a skeleton program that naively classifies its input into letters, digits, white space, and special characters. Improve this skeleton by adding patterns to detect as many of the different kinds of lexemes in C as you can. The exercise is split into several parts, each building on previous parts to increase the functionality of your program, until you should be able to convert an input of:

        int main (void) {
          printf ("hello world\n");
          return 0;
        }

into an output something like this:

	`int`			built_in_type
	` `			white_space
	`main`			identifier
	` `			white_space
	`(`			punctuation_or_operator
	`void`			keyword
	`)`			punctuation_or_operator
	` `			white_space
	`{`			punctuation
				1 newline(s)
	`  `			white_space
	`printf`       		identifier
	` `			white_space
	`(`			punctuation_or_operator
	`"hello world\n"`	string
	`)`			punctuation_or_operator
	`;`			punctuation
				1 newline(s)
	`  `			white_space
	`return`		keyword
	` `			white_space
	`0`			octal_int_number
	`;`			punctuation
				1 newline(s)
	`}`			punctuation
				1 newline(s)
(Normally, newlines are included with white space, and in C predefined types are included with keywords, but it simplifies this exercise if they are treated separately.)

Discussions of characters and lexemes can be found in various books describing C. The web page for this lab includes links to an online copy of "Standard C - A Reference" by Plauger and Brodie. (characters, pre-processing, and syntax)

DESCRIPTION

Just like the previous exercise, get started using the "makefile" but this time from "$CS2121/p*/ex2" i.e. make sure you are in your directory CS2121/ex2, and then type:

        cp $CS2121/p*/ex2/makefile .
	make start
For this exercise, this copies test data for each part ( 1.c , 2.c , 3.c , 4.c , 5.c , 6.c , 7.c ), a pre-compiled library (checker.o), and c_lexemes.l, which classifies its input into letters, digits, white space, newlines, special characters, and anything else. It also contains a function, "lexeme", which you should call when you recognise each lexeme, to produce a neat listing like that above. Its parameter is the lexeme classification, which must be one of the values from:
	enum lexemes {ignore, float_number, octal_int_number,
		decimal_int_number, hex_int_number, preprocessor_command,
		comment, character, keyword, built_in_type, identifier,
		punctuation_or_operator, punctuation, operator,
		string, unknown, newlines, white_space};
(Don't alter this declaration or the one for "lexeme_names", else the checker may go wrong.)

[In a real C compiler, we probably would not bother to return comments, white-space and newlines as lexemes, but just discard them as we recognise them. Similarly, we probably would not bother to distinguish between the different kinds of integer numbers, or between characters and integers, or between keywords and built-in types. However, for this exercise, to prove that you are recognising everything correctly, you will be returning these items as different lexemes.

Another point of difference from a real compiler - really, we would want to know not just that we have seen some punctuation or a keyword, but which kind of punctuation or keyword, as they are used in different ways in different parts of a real C program. A similar problem arises with literals, like numbers and strings, and identifer names - although, initially, we just need to know that we have seen a value or name in a legal position in a program, eventually we will need to know what that particular value or name is. You will see, in the example given in the lectures, that we handle these two problems in different ways.]

c_lexemes.l includes calls to "check" and "report" (provided by the pre-compiled file checker.o and added to your program by the makefile - if you are not using linux then comment out the calls). These routines check your lexemes, and list the total numbers recognised. (Use "ignore" if you don't want a particular lexeme to be checked - for example, most things recognised by the starting version of c_lexemes.l are "ignore"d, as they aren't C lexemes.)

The "check" only has effect at the end of a line, so you may find it awkward to relate warnings to lexemes listed a while before. (Also, because of this, you may not be warned if you fail to recognise a single lexeme that extends over more than one line, such as a comment!) However, in general I hope you will find that these messages provide useful hints about mistakes you make and what you should try next. This facility is new, and any suggestions for improvements will be very welcome.

As with the previous exercise, I have split the problem into several parts, and provided a "makefile" to simplify running them. Similarly, I have provided some hints. Try to do each part yourself, without referring to the hints, but if you get stuck you might find them useful.

The idea for this exercise is that, as you do each part in turn, you will be gradually improving the patterns in your version of "c_lexemes.l" until it can cope with most C programs. Thus, for part 1, type:

        make 1
This will run your pattern on the correct test data, and then compare the result with the expected answer. It will tell you if you got it right ("no OOPS detected!"), or else it will try to tell you what went wrong ("OOPS..."). If you type "make 1" before you have made any changes to "c_lexemes.l", you should see something like this:

part 1: ./c_lexemes <1.c >1.lexemes
no OOPS detected!
2002-08-12 16:05 GMT: numbers of lexemes & characters processed:
  177 ignore(s)				in   532 characters
    0 float_number(s)			in     0 characters
    0 octal_int_number(s)		in     0 characters
    0 decimal_int_number(s)		in     0 characters
    0 hex_int_number(s)			in     0 characters
    0 preprocessor_command(s)		in     0 characters
    0 comment(s)			in     0 characters
    0 character(s)			in     0 characters
    0 keyword(s)			in     0 characters
    0 built_in_type(s)			in     0 characters
    0 identifier(s)			in     0 characters
    0 punctuation_or_operator(s)	in     0 characters
    0 punctuation(s)			in     0 characters
    0 operator(s)			in     0 characters
    0 string(s)				in     0 characters
    0 unknown(s)			in     0 characters
   20 newline(s)(s)			in    31 characters
  120 white_space(s)			in   299 characters
total:   317 lexemes			in   862 characters
there were 862 characters actually in the data file
If you want to see every lexeme that your patterns matched, type:
        more 1.lexemes

The starting version of c_lexemes.l already recognises some lexemes, and you should keep these patterns:
white space (spaces, tabs and newlines), using:

 [ \t]+			{lexeme(white_space); /*keep this*/}
 [\n]+			{lexeme(newlines);    /*keep this*/}
unexpected characters, to help you identify wrong or missing patterns, using:
 .			{lexeme(unknown);     /*keep this*/}
You should ensure that this is always the last pattern in your program - if you aren't sure why, see what happens when you make it the first rule.

It also recognises other characters, but not as C lexemes, and tells the checker to "ignore" them:

 [A-Za-z]+				{lexeme(ignore); /*get rid of this*/}
 [0-9]+ 				{lexeme(ignore); /*get rid of this*/}
 []!@#$%^&*()_+=|\~`[{};:'",<.>/?-]+    {lexeme(ignore); /*get rid of this*/}
You should edit c_lexemes.l to gradually replace these three patterns by ones that recognise C lexemes, as described below. (hints, including those about regular expressions from the previous exercise)

a) Simple lexemes

Part 1 Modify the patterns in "c_lexemes.l" to recognise identifiers, keywords (e.g. "if", "while", "else") and built_in_types (e.g. "int", "char"). (You should get rid of the pattern currently recognising [A-Za-z]+)

When your answer to part 1 is correct, you should have no "OOPS", no "unknown" characters, many fewer "ignore"d characters, and you should be processing every character in the file i.e. the number of characters in the last two lines output should be the same (862 in this case). You can then proceed to part 2, making further modifications to "c_lexemes.l", and then typing "make 2", and so on.

Part 2 Modify the patterns in "c_lexemes.l" to recognise simple forms of float_number, decimal_int_number, octal_int_number, hex_int_number, and character (i.e. literals such as: 'a' 'b' 'c', not single-character identifiers such as: a b c). To see what I mean by "simple forms", take a look at the test data 2.c
(You should get rid of the pattern currently recognising [0-9]+)
Try to avoid allowing impossible formats e.g. when you recognise floating-point numbers, don't allow more than one decimal point!

As the exercise is cumulative, it is a good idea to check that, as you finish each part, you can still do all the previous parts - for this part, you just need to check that "make 1" still gives the correct answers.

Part 3 Modify the patterns in "c_lexemes.l" to recognise punctuation and operators - you should use punctuation_or_operator for "," ":" "(" and ")". Your patterns should only recognise valid sequences of special characters as lexemes. (You should get rid of the pattern currently recognising []!@#$%^&*()_+=|\~`[{};:'",<.>/?-]+)
Remember to check that you can still do parts 1 and 2. By this stage, there should be no "ignore"d characters when you run your program.

Part 4 Modify the patterns in "c_lexemes.l" to recognise simple forms of comment and string i.e. strings that don't contain " characters, and comments that don't contain * or / characters. Remember to check that you can still do the previous parts.

b) Bonus work

This section contains problems that are more difficult than those in the previous sections, and where marks are harder to obtain. You should complete section (a) before attempting this section. If you are running out of time, you would do better to start exercise 3.

Part 5 Modify the patterns in "c_lexemes.l" to recognise all legal numbers and characters.

Part 6 Modify the patterns in "c_lexemes.l" to recognise all legal strings and comments. (hint)

Part 7 Modify the patterns in "c_lexemes.l" to recognise preprocessor commands. Simple commands consist of any line that has a "#" as the first non-blank character. However, if the character immediately before the end of line is a "\" character (there must not be any white space between them) then the command continues onto the next line, and so on.
Your pattern should recognise all the characters on the line(s) as a single lexeme, from the start of the line containing the "#", up to but not including the final end of line character.

ASSESSMENT CRITERIA

This exercise is worth 10 marks in total. It is necessary to produce a working program to demonstrate that you have achieved the Learning Outcomes of this exercise. However, it is also important that you are able to show that you understand the details of what you have written. You are therefore required to demonstrate your work to a member of the laboratory staff.

MARKING PROCESS

Demonstrate your working program to a demonstrator or lab supervisor by typing "make mark".

You must "labmail" the final version of your flex code i.e. "c_lexemes.l".

You program will be marked as follows:
2 marks - part 1
3 marks - part 2
1 mark - part 3
1 mark - part 4
1 mark - part 5
1 mark - part 6
1 mark - part 7


UP PREVIOUS NEXT