UP

CS5031 Practical 1: Introduction to Lex

PURPOSE

This practical is designed to familiarise you with the use of flex (manual available via the web pages).

OBJECTIVES

Using flex, write a simple program that inputs ANSI C code and outputs a list of the words (lexemes, tokens) it has recognised. You are given a skeleton program that naively classifies its input into letters, digits, white space, and special characters. Improve this skeleton by modifying the patterns or adding new ones to detect as many of the different kinds of lexemes in ANSI C (numbers, characters, strings, identifiers, keywords, operators, punctuation, comments etc.) as you can.

You should start off using simple patterns, to recognise the usual kinds of lexemes (i.e. those in the first part of the test data), and add complexity if you have the time. For example, numbers can simply consist of decimal digits, but you will find examples of other formats in the test data which you can try to write patterns for. Try to avoid allowing impossible formats e.g. if you recognise floating-point numbers, don't allow more than one decimal point! The description below also points out several problem areas. The previous exercise sheet should help. If you have time, there is obviously scope to implement as many different kinds of ANSI C lexemes as possible, along with suitable test data.

Discussions of characters and tokens can be found in various books describing ANSI C mentioned in the CS2111 book-list e.g. "Standard C - A Reference" by Plauger and Brodie. (characters, pre-processing, and syntax)

As an example of how your program should work, an input of

        #include <stdio.h>
        int main (void)
        {
          printf ("hello world\n");
          return 0;
        }

might give rise to the following output:

        #include <stdio.h>	preprocessor command
        int			built-in type
        main			identifier
        (			punctuation
        void			built-in type
        )			punctuation
        {			punctuation
        printf			identifier
        (			punctuation
        "hello world\n"  	string
        )			punctuation
        ;  			punctuation
        return			keyword
        0			integer number
        ;  			punctuation
        }			punctuation

DESCRIPTION

Copy the starting files from "$CS5031/p*/ex1/*". The files are a makefile, some test data (ANSI C code) and c_lexemes.l, which at the moment classifies its input into letters, digits, white space, and special characters. It also contains a function, "lexeme", which you should call when you recognise each lexeme, to produce a neat listing like that above. Its parameter is simply a string giving the lexeme classification.

You can compile and run the flex program on the given data by typing "make test".

You should modify the list of existing patterns and associated actions in c_lexemes.l to recognise and output:
* white space (spaces, tabs and newlines). As this normally only separates real lexemes, don't call "lexeme" to output it, but just discard it.
* literals (integers, floats, and quoted characters). Start by recognising simple versions of these, as in the first part of the test data. Only try recognising the formats in the last part of the test data when you everything else working.
* punctuation and operators. Don't worry too much about distinguishing punctuation from operators, as , : ( ) are both, but try to make sure that you only recognise valid sequences of special characters.
* identifiers, keywords (e.g. "if", "while", "else" and anything else in the test data), and built-in types (e.g. "int", "char" etc.). Don't confuse them, even though they look similar - patterns that recognise special cases, such as particular keywords, must come before patterns that recognise more general cases, such as any identifier.
* Pre-processor commands. These have a different style to the rest of C. Simple commands consist of the whole of any line that has a "#" as the first non-blank character. For the rest of C, newlines are like any other white space.
* Comments and strings (doing so will also avoid spuriously recognising any of the other lexemes above that are actually part of a comment or string). This is difficult, so just do the best you can with simple cases (e.g. strings that don't contain " characters, and comments that don't flow over more than one line).
If you have time to spare, you could try to recognise all strings and/or comments correctly (hints available via the web pages).
* The last pattern in your program should pick up unexpected characters using:

        .        {lexeme("????????");}


UP NEXT