Example of Processing Text using Lex alone

Here is a sample of a data-file that we want to try and recognise. It is a list of students and information about them.

CIS W7	Abramson,Paul B		 CS3001 CS3071 CS3102 CS3132 CS3311 CS3322 CS3361 CS3900 EM2490
CE  X1	O'Rourke,Daniel M	 CS3001 CS3041 CS3052 CS3071 CS3082 CS3111 CS3322 CS3900 EM2490
AI  Y6	Naismith,Gregory S	 CS3001 CS3052 CS3071 CS3082 CS3311 CS3322 CS3361 CS3900 EM2490
CIS X4	Sanders,Alexander P	 CS3001 CS3071 CS3102 CS3132 CS3311 CS3322 CS3361 CS3900 EM2490
SI  __	Varney,Samantha		 CS3041 CS3052 CS3071 CS3082 CS3251 CS3311 CS3432 CS3900 EM2490
CS  W1	Smith,Mark		 CS3001 CS3052 CS3071 CS3082 CS3212 CS3232 CS3241 CS3251 CS3900 EM2490
MI  I3	Cooper,Paul		 CS3311 CS3561 CS3572 CS3902 MK3322 HM3111 HM3132 HM3142 HM3152 HM3121
MI  I4	Smythe,Helen Ruth	 CS3561 CS3572 CS3902 MK3322 MK3311 DI3111 HM3132 HM3152 HM3121 HS3132
CE  Z5	Grant,Ellie		 CS3052 CS3071 CS3082 CS3111 CS3311 CS3322 CS3361 CS3900 EM2490

The whole file consists of one or more students, and for each student,
the first item is a school (e.g. CIS),
the next is a tutorial group (e.g. W7),
the next is the student's name (e.g. Abramson,Paul B)
and finally there is list of the modules taken by that student - this can be empty (e.g. if they haven't registered yet) or contain any number of module names (e.g. CS3001 CS3071 CS3102 CS3132 CS3311 CS3322 CS3361 CS3900 EM2490)

So, how do we recognise (and then do something with) this data? There are many different possibilities - we could start from scratch using a general-purpose high-level language like C or Java, or a more specialised text-processing language like Perl or AWK. However, I want to concentrate on two specific tools - Lex and Yacc (or any similar tools, such as Flex and Bison).

I have described how to use Lex and Yacc together to solve this problem here, and for many data-formats this would be the end of the story. However, the text in this example has a particularly simple form - it consists of a series of records, each of which has essentially the same layout. This is in contrast to the situation with a typical programming language like C or Java, where:

there are many different kinds of statements (e.g. assignment, function-call, for-loop, if-then-else, switch) and declarations (e.g. variable, type, function)
the different statements and declarations can be nested (e.g. the body of a for-loop or if-statement is itself a statement; a { } block or a function-declaration can contain many statements and declarations; expressions can contain bracketed sub-expressions)

Because this text has such a simple form, we can take a different approach and use Lex by itself (and incidentally show more features of Lex). Depending on the details of what you are trying to achieve, this approach may simplify the resulting program or make it vastly more complicated, so look before you leap!
For example, I have used Lex by itself to input lists of numbers for use in calculations (e.g. course statistics), and to approximately translate text between various formats (e.g. from html to ASCII and from roff to html).
[I suspect that the line-based applications (e.g. student data, and course statistics) are situations where other tools based on regular-expressions, such as Perl or AWK, would be as good or maybe even better. However, applications that need to work word-by-word rather than line-by-line (e.g. text processing) might be less suitable for tools like AWK, and maybe Perl as well.]

We can start from the Lex code given here for use with Yacc (see here for an explanation of how to create this lex code in the first place) but we need to modify it to be stand-alone. (As mentioned here, we could also include extra code to pick up typical errors, but I have left this out to simplify things.)

	%{
	int mytrace;
	%}
	%%

	[A-Z]+				{if(mytrace)ECHO;/*school*/}

	[A-Z][0-9]			{if(mytrace)ECHO;/*group*/}
	"__"				{if(mytrace)ECHO;/*group*/}

	[A-Z][a-z'][-A-Za-z, ]*		{if(mytrace)ECHO;/*name*/}

	[A-Z][A-Z][0-9][0-9][0-9][0-9]  {if(mytrace)ECHO;/*module*/}
	[A-Z][A-Z][0-9][0-9][0-9]       {if(mytrace)ECHO;/*module*/}

	[ \t\n]				{if(mytrace)ECHO;}
	.				{ECHO; yyerror("unexpected character");}

	%%
	yywrap etc. as usual

[The major difference between this version of the Lex program and the original version is that I have removed the returns from the various actions. This is not actually necessary - we have the option of writing actions without returns and calling yylex once, or writing actions containing returns and calling yylex repeatedly using code like:

	while (yylex()) {/*no loop body necessary*/}

I find it simpler to use the single call to yylex and omit the returns.
Also, for no reason that I can track down, if I use "trace" (or "echo") as a variable name, the program created by flex crashes at run-time when I try to initialise the variable, but if I call it "mytrace" there is no problem - maybe I am getting a name-clash with a curses library.]

If the data is already known to be in the right format (e.g. it is created by a different program and all we are doing is some extra processing) this is sufficient, but otherwise, there is nothing in the code above that checks that the various pieces of data appear in the right order, or even that they appear at all!

We can check that each field is present by setting flags, and checking them at the end of the record (i.e. end-of-line).

	%{
	int mytrace;
	int school, group, module, name;
	void init() {school=group=module=name=0;}
	void check() {
	  if (school==0) yyerror("school missing");
	  else if (school>1) yyerror("multiple schools");
	  if (group==0) yyerror("group missing");
	  else if (group>1) yyerror("multiple groups");
	  if (name==0) yyerror("names missing");
	  else if (name>1) yyerror("multiple names");
	  if (module==0) yyerror("warning - no modules\n");
	}
	%}
	%%

	[A-Z]+				{if(mytrace)ECHO; school++;}

	[A-Z][0-9]			{if(mytrace)ECHO; group++;}
	"__"				{if(mytrace)ECHO; group++;}

	[A-Z][a-z'][-A-Za-z, ]*		{if(mytrace)ECHO; name++;}

	[A-Z][A-Z][0-9][0-9][0-9][0-9]  {if(mytrace)ECHO; module++;}
	[A-Z][A-Z][0-9][0-9][0-9]       {if(mytrace)ECHO; module++;}

	"\n"				{if(mytrace)ECHO; check(); init();}

	[ \t]				{if(mytrace)ECHO;}
	.				{ECHO; yyerror("unexpected character");}

	%%
	yywrap etc. as usual

Alternatively, we can use another of Lex's built-in features - start-states. Essentially, what these do is combine several different patterns used in the Lex program into one long multi-part pattern, so that each part must be present and in the right order.

In the following code, the start-state SCHOOL indicates that we are expecting a school-name, and is used to ensure that each record has a school-name at the start of the record. We then switch to GROUP (using BEGIN) to ensure that the next field will be the group-name, and then to NAME and to MODULE, and back to SCHOOL to recognise the next record.

	%s SCHOOL GROUP NAME MODULE

	%%

	<SCHOOL>[A-Z]+				{if(mytrace)ECHO; BEGIN GROUP;}

	<GROUP>[A-Z][0-9]			{if(mytrace)ECHO; BEGIN NAME;}
	<GROUP>"__"				{if(mytrace)ECHO; BEGIN NAME;}

	<NAME>[A-Z][a-z'][-A-Za-z, ]*   	{if(mytrace)ECHO; BEGIN MODULE;}

	<MODULE>[A-Z][A-Z][0-9][0-9][0-9][0-9]	{if(mytrace)ECHO;}
	<MODULE>[A-Z][A-Z][0-9][0-9][0-9] 	{if(mytrace)ECHO;}

	<MODULE>"\n"				{if(mytrace)ECHO; BEGIN SCHOOL;}

	[ \t]					{if(mytrace)ECHO;}
	"\n"					{if(mytrace)ECHO; yyerror("unexpected end-of-line"); BEGIN SCHOOL;}
	.					{ECHO; yyerror("unexpected character");}

	%%

	yywrap etc. as usual except that we need:
		BEGIN SCHOOL;
	before the call to yylex

To ensure that we start by looking for a school-name on the first line, we need to include BEGIN SCHOOL; in the program before the call to yylex.

There are now two rules recognising a newline - one when we are in the MODULE start-state, as expected, and one for any other situation. This is needed because the pattern that recognises anything and reports it as an error . actually excludes newlines. Rather than just add newline to the existing error-reporting pattern:

	"\n"|.		{ECHO; yyerror("unexpected character");}

I have chosen to include some rudimentary error-recovery: after a newline, no matter what we were trying to recognise on that line, we will start afresh in the SCHOOL start-state trying to recognise the start of a new record.

Of course, if the error is just that a newline has been inserted by mistake in the middle of a record, this will be exactly the wrong thing to do! We could instead try to deal with this with extra rules like:

	<GROUP>^[A-Z][0-9]	{ECHO; yyerror("unexpected end-of-line in the middle of a record"); BEGIN NAME;}

but, as the program should recover anyway after another line has been skipped, and the user will still have to edit the data to correct the problem, in the end this will make very little difference (and users can always find new ways to make mistakes!)

To allow for a list of module-names rather than just one, we don't BEGIN a different start-state after recognising a module-name. We therefore need to deal with whatever indicates the end of the list while still in this start-state. For some data-formats this might mean recognising the next kind of field (see the next paragraph about optional fields) but in this example all we have to do is recognise the end of the record, marked by the end-of-line. (Note that the code would be very similar even if we insisted on the list containing at least one module-name, except that we would probably want to count the module-names and check the count at the end-of-line, as in the previous version of the program.)

If we had decided, say, to make group-name optional, rather than using "__" to indicate a missing value, we could have used code like:

	<GROUP>[A-Z][0-9]			{if(mytrace)ECHO; BEGIN NAME;}

	<GROUP>[A-Z][a-z'][-A-Za-z, ]*   	{/*no group_name*/if(mytrace)ECHO; BEGIN MODULE;}
	<NAME>[A-Z][a-z'][-A-Za-z, ]*   	{if(mytrace)ECHO; BEGIN MODULE;}

or:

	<GROUP>[A-Z][0-9]			{if(mytrace)ECHO; BEGIN NAME;}

	<GROUP,NAME>[A-Z][a-z'][-A-Za-z, ]*   	{if (YYSTATE==GROUP)/*no group_name*/;
						if(mytrace)ECHO; BEGIN MODULE;}

At the moment, when an error is reported, each incorrect character is output in a separate message, like this:

	C unexpected character
	D unexpected character

In general, it is very dangerous to try to recognise more than one character in the error pattern - it is too easy to make a simple mistake and allow it to recognise some valid inputs - but in this case we can improve error reporting slightly by replacing the pattern . by [^ \t\n]+

	[^ \t\n]+	{ECHO; yyerror("unexpected character(s)");}

to get messages like this:

	CD unexpected character(s)

The code will detect and report all format errors, but the error message will just be the rejected characters followed by "unexpected character". If you need better error messages, you might want to use code like:

	<SCHOOL>[A-Z]+	{if(mytrace)ECHO; BEGIN GROUP;}
	<GROUP>[A-Z]+	{ECHO; yyerror("group-name expected but school-name seen instead");}
	<NAME>[A-Z]+	{ECHO; yyerror("student-name expected but school-name seen instead");}
	<MODULE>[A-Z]+	{ECHO; yyerror("module-name expected but school-name seen instead");}

[More information about complex lex patterns and start-states]

What do we do next?

For this example, the next step would probably be to do something with the actual names of students and modules etc., which at the moment we are just discarding. We would probably want to add code e.g. to count the number of students in each tutorial group, or the number taking each module, etc.

We could include code like groupname= strdup(yytext); or group= lookupgroup(yytext); in the actions as we recognise each different kind of name (remembering to cope with the list of module-names for each student) and then use the variables when we recognise the end of the record (i.e. the end-of-line). (Each function like lookupgroup should maintain a name-list, adding new names as they are recognised, and maybe even mainting a count.) Code to output the counts etc. can be placed after the call to yylex.