This practical is designed to familiarise you with the use of flex (manual available via the web pages).
Using flex, write a simple program that inputs ANSI C code and outputs a list of the words (lexemes, tokens) it has recognised. You are given a skeleton program that naively classifies its input into letters, digits, white space, and special characters. Improve this skeleton by modifying the patterns or adding new ones to detect as many of the different kinds of lexemes in ANSI C (numbers, characters, strings, identifiers, keywords, operators, punctuation, comments etc.) as you can.
You should start off using simple patterns, to recognise the usual kinds of lexemes (i.e. those in the first part of the test data), and add complexity if you have the time. For example, numbers can simply consist of decimal digits, but you will find examples of other formats in the test data which you can try to write patterns for. Try to avoid allowing impossible formats e.g. if you recognise floating-point numbers, don't allow more than one decimal point! The description below also points out several problem areas. The previous exercise sheet should help. If you have time, there is obviously scope to implement as many different kinds of ANSI C lexemes as possible, along with suitable test data.
Discussions of characters and tokens can be found in various books describing ANSI C mentioned in the CS2111 book-list e.g. "Standard C - A Reference" by Plauger and Brodie. (characters, pre-processing, and syntax)
As an example of how your program should work, an input of
#include <stdio.h> int main (void) { printf ("hello world\n"); return 0; }
might give rise to the following output:
#include <stdio.h> preprocessor command int built-in type main identifier ( punctuation void built-in type ) punctuation { punctuation printf identifier ( punctuation "hello world\n" string ) punctuation ; punctuation return keyword 0 integer number ; punctuation } punctuation
Copy the starting files from "$CS5031/p*/ex1/*". The files are a makefile, some test data (ANSI C code) and c_lexemes.l, which at the moment classifies its input into letters, digits, white space, and special characters. It also contains a function, "lexeme", which you should call when you recognise each lexeme, to produce a neat listing like that above. Its parameter is simply a string giving the lexeme classification.
You can compile and run the flex program on the given data by typing "make test".
You should modify the list of existing patterns and associated actions in
c_lexemes.l to recognise and output:
* white space (spaces, tabs and newlines). As this normally only separates
real lexemes, don't call "lexeme" to output it, but just discard it.
* literals (integers, floats, and quoted characters). Start by recognising
simple versions of these, as in the first part of the test data. Only try
recognising the formats in the last part of the test data when you everything
else working.
* punctuation and operators. Don't worry too much about distinguishing
punctuation from operators, as , : ( ) are both, but try to make sure that you
only recognise valid sequences of special characters.
* identifiers, keywords (e.g. "if", "while", "else" and anything else in
the test data), and built-in types (e.g. "int", "char" etc.). Don't confuse
them, even though they look similar - patterns that recognise special cases,
such as particular keywords, must come before patterns that recognise more
general cases, such as any identifier.
* Pre-processor commands. These have a different style to the rest of C.
Simple commands consist of the whole of any line that has a "#" as the first
non-blank character. For the rest of C, newlines are like any other white
space.
* Comments and strings (doing so will also avoid spuriously recognising any
of the other lexemes above that are actually part of a comment or string).
This is difficult, so just do the best you can with simple cases (e.g. strings
that don't contain " characters, and comments that don't flow over more than
one line).
If you have time to spare, you could try to recognise all strings and/or
comments correctly
(hints available via the web pages).
* The last pattern in your program should pick up unexpected characters
using:
. {lexeme("????????");}