DURATION 1 lab session
This exercise is designed to familiarise you with the use of flex (manual available via the web pages) so that, when you come to use it later, you can concentrate on the important aspects of the problem.
Using flex, write a simple program that inputs ANSI C code and outputs a list of the words (lexemes, tokens) it has recognised. You are given a skeleton program that naively classifies its input into letters, digits, white space, and special characters. Improve this skeleton by modifying the patterns or adding new ones to detect as many of the different kinds of lexemes in ANSI C as you can.
Start by using simple patterns, to recognise the usual kinds of lexemes (i.e. those in the first part of the test data), and add complexity if you have the time. For example, numbers can simply consist of decimal digits, but you will find examples of other formats in the test data which you can try to write patterns for. Try to avoid allowing impossible formats e.g. if you recognise floating-point numbers, don't allow more than one decimal point! The description below also points out some problem (bonus) areas. There is obviously scope for bonus work in implementing as many different kinds of ANSI C lexemes as possible, along with suitable test data.
Discussions of characters and tokens can be found in various books describing ANSI C mentioned in the book list e.g. "Standard C - A Reference" by Plauger and Brodie. (characters, pre-processing, and syntax)
As an example of how your program should work, an input of
int main (void) { printf ("hello world\n"); return 0; }
should be converted into an output something like this:
int built_in_type white_space main identifier white_space ( punctuation_or_operator void keyword ) punctuation_or_operator white_space { punctuation white_space printf identifier white_space ( punctuation_or_operator "hello world\n" string ) punctuation_or_operator ; punctuation white_space return keyword white_space 0 octal_int_number ; punctuation white_space } punctuation white_space
Copy the starting files from "$CS2111/p*/ex2/*". The files are a makefile, some test data (ANSI C code), a pre-compiled library (checker.o), and c_lexemes.l, which classifies its input into letters, digits, white space, special characters, and anything else. It also contains a function, "lexeme", which you should call when you recognise each lexeme, to produce a neat listing like that above. Its parameter is the lexeme classification, which must be one of the values from:
enum lexemes {ignore, float_number, octal_int_number, decimal_int_number, hex_int_number, preprocessor_command, comment, character, keyword, built_in_type, identifier, punctuation_or_operator, punctuation, operator, string, unknown, white_space};(Don't alter this declaration or the one for "lexeme_names", else the checker may go wrong)
c_lexeme.l includes calls to "check" and "report" (provided by the pre-compiled file checker.o and added to your program by the makefile - if you are not using linux then comment out the calls). These routines check your lexemes, and list the total numbers recognised. (Use "ignore" if you don't want a particular lexeme to be checked - for example, most things recognised by the starting version of c_lexeme.l are "ignore"d, as they aren't ANSI C lexemes.)
The "check" only has effect at the end of a line (but not inside a multi-line comment or preprocessor command) so you may find it awkward to relate warnings to lexemes listed a while before. However, in general I hope you will find that these messages provide useful hints about mistakes you make and what you should try next. This facility is new, and any suggestions will be very welcome.
You can compile and run the flex program on the given data by typing "make test".
The starting version of c_lexemes.l already correctly recognises some lexemes,
and you should keep these:
* white space (spaces, tabs and newlines), using:
[ \t\n] {lexeme(white_space);}* unexpected characters, to help you identify wrong or missing patterns, using:
. {lexeme(unknown);}You should ensure that this is always the last pattern in your program - if you aren't sure why, see what happens when you move it.
It also recognises other characters, but not as ANSI C lexemes, and warns the checker to "ignore" them:
[A-Za-z]+ {lexeme(ignore); /*letter(s)*/} [0-9]+ {lexeme(ignore); /*decimal digit(s)*/} []!@#$%^&*()_+=|\\~`[{};:'",<.>/?-]+ {lexeme(ignore); /*special character(s)*/}You should edit c_lexemes.l to gradually replace these patterns by ones that recognise ANSI C lexemes:
Demonstrate your working program to a demonstrator or lab supervisor. You
must "labmail" the final version of your flex code i.e. "c_lexemes.l".