This practical is designed to familiarise you with the use of flex (manual available via the web pages).
Using flex, write a simple program that inputs ANSI C code and outputs a list of the words (lexemes, tokens) it has recognised. You are given a skeleton program that naively classifies its input into letters, digits, white space, and special characters. Improve this skeleton by modifying the patterns or adding new ones to detect as many of the different kinds of lexemes in ANSI C as you can.
Start by using simple patterns to recognise the usual kinds of lexemes (i.e. those in the first part of the test data), and add complexity if you have the time. For example, numbers can simply consist of decimal digits, but you will find examples of other formats in the test data which you can try to write patterns for. Try to avoid allowing impossible formats: for example, if you recognise floating-point numbers, don't allow more than one decimal point! The description below also points out several problem areas. The previous exercise sheet should help.
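To make the point about number formats concrete, here is a minimal sketch in plain C (not flex) of the distinctions your patterns need to draw. The names "number_kind" and "classify_number" are invented for this illustration and are not part of the skeleton. It deliberately ignores exponents and suffixes such as 1e10 or 10UL, which ANSI C also allows. Note that in the C grammar a leading 0 marks an octal constant, so a lone 0 counts as octal.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical classification, for illustration only. */
enum number_kind { NOT_A_NUMBER, OCTAL_INT, DECIMAL_INT, HEX_INT, FLOAT_NUM };

/* Classify a complete candidate lexeme such as "42", "017", "0x1F" or "3.14".
   Simplification: no exponents and no suffixes are accepted. */
enum number_kind classify_number(const char *s)
{
    size_t i, n, digits = 0, points = 0;

    assert(s != NULL);
    n = strlen(s);
    if (n == 0)
        return NOT_A_NUMBER;

    /* Hexadecimal: 0x or 0X followed by at least one hex digit. */
    if (n > 2 && s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
        for (i = 2; i < n; i++)
            if (!isxdigit((unsigned char)s[i]))
                return NOT_A_NUMBER;
        return HEX_INT;
    }

    /* Only decimal digits and decimal points may remain. */
    for (i = 0; i < n; i++) {
        if (isdigit((unsigned char)s[i]))
            digits++;
        else if (s[i] == '.')
            points++;
        else
            return NOT_A_NUMBER;
    }
    if (digits == 0 || points > 1)   /* "." and "1.2.3" are impossible */
        return NOT_A_NUMBER;
    if (points == 1)
        return FLOAT_NUM;

    /* Integers: a leading 0 means octal, so only digits 0-7 may follow. */
    if (s[0] == '0') {
        for (i = 1; i < n; i++)
            if (s[i] > '7')
                return NOT_A_NUMBER;
        return OCTAL_INT;
    }
    return DECIMAL_INT;
}
```

Each branch of this function corresponds roughly to one flex pattern you might write; in flex the rejection cases are handled simply by the pattern not matching.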
Once you have written some new patterns, you do not need to go through the whole list (unless you want to, or have spare time - perhaps in the catch-up sessions at the end of the week). If you have time, there is obviously scope to implement many different kinds of ANSI C lexemes, along with suitable test data.
Discussions of characters and tokens can be found in various books describing ANSI C mentioned in the CS2111 book-list, e.g. "Standard C - A Reference" by Plauger and Brodie (characters, pre-processing, and syntax).
As an example of how your program should work, an input of
    #include <stdio.h>
    int main (void) { printf ("hello world\n"); return 0; }
should be converted into an output something like this:
    int built_in_type
    white_space
    main identifier
    white_space
    ( punctuation_or_operator
    void keyword
    ) punctuation_or_operator
    white_space
    { punctuation
    white_space
    printf identifier
    white_space
    ( punctuation_or_operator
    "hello world\n" string
    ) punctuation_or_operator
    ; punctuation
    white_space
    return keyword
    white_space
    0 octal_int_number
    ; punctuation
    white_space
    } punctuation
    white_space
All exercises are done on Unix systems (i.e. Linux, SunOS or Solaris). You should start the week by creating a new directory for the course, and then one for each exercise inside it:
    mkdir 503
    cd 503
    mkdir ex1 ex2 ex3 ex_ass
It is important that you use a different directory for each exercise, as at least one file in each shares the same name (makefile). Start each exercise by going to the correct directory and copying across the various starting files e.g. for this lab:
    cd 503                  (unless you have already done this above)
    cd ex1
    cp $CS5031/p*/ex1/* .   (note the . at the end - this is important!)
If you get an error message like:
    cp: cannot access /p*/ex1/*
it means that (assuming you have typed in the line correctly - check it!) you do not have the variable CS5031 set up. You can set it yourself as follows (in ksh or bash):
    CS5031=/opt/info/courses/CS5031
or (in csh):
    set CS5031 = /opt/info/courses/CS5031
Copy the starting files from "$CS5031/p*/ex1/*". The files are a makefile, some test data (ANSI C code), a pre-compiled library (checker.o), and c_lexemes.l, which classifies its input into letters, digits, white space, special characters, and anything else. It also contains a function, "lexeme", which you should call when you recognise each lexeme, to produce a neat listing like that above. Its parameter is the lexeme classification, which must be one of the values from:
    enum lexemes {ignore, float_number, octal_int_number, decimal_int_number,
                  hex_int_number, preprocessor_command, comment, character,
                  keyword, built_in_type, identifier, punctuation_or_operator,
                  punctuation, operator, string, unknown, white_space};
(Don't alter this declaration or the one for "lexeme_names", else the checker may go wrong.)
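As a rough illustration of why the enum and the names array must stay in step, here is a minimal sketch of how a classification value can be turned into one line of the listing. The real "lexeme" function and "lexeme_names" array live in the skeleton and may well differ; "demo_names" and "demo_format_lexeme" are hypothetical names made up for this example, and the sketch uses snprintf, which is C99 rather than ANSI C.

```c
#include <assert.h>
#include <stdio.h>

/* Declaration copied from the handout. */
enum lexemes {ignore, float_number, octal_int_number, decimal_int_number,
              hex_int_number, preprocessor_command, comment, character,
              keyword, built_in_type, identifier, punctuation_or_operator,
              punctuation, operator, string, unknown, white_space};

/* Hypothetical stand-in for the skeleton's names array: one string per
   enum value, in the same order, so an enum value can index the array. */
static const char *const demo_names[] = {
    "ignore", "float_number", "octal_int_number", "decimal_int_number",
    "hex_int_number", "preprocessor_command", "comment", "character",
    "keyword", "built_in_type", "identifier", "punctuation_or_operator",
    "punctuation", "operator", "string", "unknown", "white_space"
};

/* Write "<text> <classification>" into buf; returns what snprintf returns,
   i.e. the length the full line needs (excluding the terminating NUL). */
int demo_format_lexeme(char *buf, size_t size, const char *text,
                       enum lexemes kind)
{
    assert(buf != NULL && text != NULL);
    assert((int)kind >= 0 && (int)kind <= (int)white_space);
    return snprintf(buf, size, "%s %s", text, demo_names[kind]);
}
```

Because the enum value is used directly as an array index, reordering or inserting values in only one of the two declarations would make every later name wrong, which is presumably why the handout asks you to leave both alone.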
c_lexemes.l includes calls to "check" and "report" (provided by the pre-compiled file checker.o and added to your program by the makefile - if you are not using Linux then comment out the calls). These routines check your lexemes, and list the total numbers recognised. (Use "ignore" if you don't want a particular lexeme to be checked - for example, most things recognised by the starting version of c_lexemes.l are "ignore"d, as they aren't ANSI C lexemes.)
The "check" routine only takes effect at the end of a line (but not inside a multi-line comment or preprocessor command), so you may find it awkward to relate warnings to lexemes listed a while before. However, in general I hope you will find that these messages provide useful hints about mistakes you make and what you should try next. This facility is new, and any suggestions will be very welcome.
You can compile and run the flex program on the given data by typing "make test".
The starting version of c_lexemes.l already correctly recognises some lexemes, and you should keep these:
* white space (spaces, tabs and newlines), using:
    [ \t\n] {lexeme(white_space);}
* unexpected characters, to help you identify wrong or missing patterns, using:
    . {lexeme(unknown);}
You should ensure that this is always the last pattern in your program - if you aren't sure why, see what happens when you move it.
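The reason lies in how flex chooses between rules: it takes the rule that matches the longest text, and when two rules match the same length, the one listed first in the file wins. The following plain C sketch imitates just that selection step with two hand-written matchers standing in for the patterns [0-9]+ and the catch-all dot. All of the names here ("matcher", "match_digits", "match_any", "pick_rule") are invented for the illustration and are not flex functions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* A "rule" here is a function returning how many characters it matches
   at the start of s (0 means no match). */
typedef size_t (*matcher)(const char *s);

/* Stands in for the pattern [0-9]+ */
size_t match_digits(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Stands in for the pattern .  (any one character except newline) */
size_t match_any(const char *s)
{
    return (s[0] != '\0' && s[0] != '\n') ? 1 : 0;
}

/* Return the index of the winning rule, or -1 if none matches.
   Longest match wins; on a tie the earlier rule is kept, because a later
   rule only replaces the current best if it is strictly longer. */
int pick_rule(const matcher rules[], int nrules, const char *s)
{
    int i, best = -1;
    size_t best_len = 0;

    assert(rules != NULL && s != NULL);
    for (i = 0; i < nrules; i++) {
        size_t len = rules[i](s);
        if (len > best_len) {
            best_len = len;
            best = i;
        }
    }
    return best;
}
```

With the rules in the order {match_digits, match_any}, the input "7;" is won by the digits rule (index 0). With the order reversed, both rules still match exactly one character, so the catch-all at index 0 wins the tie and every one-character lexeme would be reported as unknown. Longer lexemes such as "42" are unaffected, since the longest match takes priority over rule order.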
It also recognises other characters, but not as ANSI C lexemes, and warns the checker to "ignore" them:
    [A-Za-z]+ {lexeme(ignore); /*letter(s)*/}
    [0-9]+ {lexeme(ignore); /*decimal digit(s)*/}
    []!@#$%^&*()_+=|\\~`[{};:'",<.>/?-]+ {lexeme(ignore); /*special character(s)*/}
You should edit c_lexemes.l to gradually replace these patterns by ones that recognise ANSI C lexemes:
. {lexeme("????????");}