CIS W7 Abramson,Paul B CS3001 CS3071 CS3102 CS3132 CS3311 CS3322 CS3361 CS3900 EM2490 CE X1 O'Rourke,Daniel M CS3001 CS3041 CS3052 CS3071 CS3082 CS3111 CS3322 CS3900 EM2490 AI Y6 Naismith,Gregory S CS3001 CS3052 CS3071 CS3082 CS3311 CS3322 CS3361 CS3900 EM2490 CIS X4 Sanders,Alexander P CS3001 CS3071 CS3102 CS3132 CS3311 CS3322 CS3361 CS3900 EM2490 SI __ Varney,Samantha CS3041 CS3052 CS3071 CS3082 CS3251 CS3311 CS3432 CS3900 EM2490 CS W1 Smith,Mark CS3001 CS3052 CS3071 CS3082 CS3212 CS3232 CS3241 CS3251 CS3900 EM2490 MI I3 Cooper,Paul CS3311 CS3561 CS3572 CS3902 MK3322 HM3111 HM3132 HM3142 HM3152 HM3121 MI I4 Smythe,Helen Ruth CS3561 CS3572 CS3902 MK3322 MK3311 DI3111 HM3132 HM3152 HM3121 HS3132 CE Z5 Grant,Ellie CS3052 CS3071 CS3082 CS3111 CS3311 CS3322 CS3361 CS3900 EM2490
The whole file consists of one or more students, and for each student,
the first item is a school (e.g. CIS
),
the next is a tutorial group (e.g. W7
),
the next is the student's name (e.g. Abramson,Paul B
)
and finally there is list of the modules taken by that student - this
can be empty (e.g. if they haven't registered yet) or contain any number of
module names (e.g. CS3001 CS3071 CS3102 CS3132 CS3311 CS3322 CS3361 CS3900 EM2490
)
So, how do we recognise (and then do something with) this data? There are many different possibilities - we could start from scratch using a general-purpose high-level language like C or Java, or a more specialised text-processing language like Perl or AWK. However, I want to concentrate on two specific tools - Lex and Yacc (or any similar tools, such as Flex and Bison).
I have described how to use Lex and Yacc together to solve this problem here, and for many data-formats this would be the end of the story. However, the text in this example has a particularly simple form - it consists of a series of records, each of which has essentially the same layout. This is in contrast to the situation with a typical programming language like C or Java, where:
{ }
block or a function-declaration can contain many statements and declarations;
expressions can contain bracketed sub-expressions)
%{ int mytrace; %} %% [A-Z]+ {if(mytrace)ECHO;/*school*/} [A-Z][0-9] {if(mytrace)ECHO;/*group*/} "__" {if(mytrace)ECHO;/*group*/} [A-Z][a-z'][-A-Za-z, ]* {if(mytrace)ECHO;/*name*/} [A-Z][A-Z][0-9][0-9][0-9][0-9] {if(mytrace)ECHO;/*module*/} [A-Z][A-Z][0-9][0-9][0-9] {if(mytrace)ECHO;/*module*/} [ \t\n] {if(mytrace)ECHO;} . {ECHO; yyerror("unexpected character");} %% yywrap etc. as usual
[The major difference between this version of the Lex program and the
original version
is that I have removed the return
s from the various actions.
This is not actually necessary - we have the option of writing actions without
return
s and calling yylex
once,
or writing actions containing return
s and calling
yylex
repeatedly using code like:
while (yylex()) {/*no loop body necessary*/}I find it simpler to use the single call to
yylex
and omit
the return
s.
If the data is already known to be in the right format (e.g. it is created by a different program and all we are doing is some extra processing) this is sufficient, but otherwise, there is nothing in the code above that checks that the various pieces of data appear in the right order, or even that they appear at all!
%{ int mytrace; int school, group, module, name; void init() {school=group=module=name=0;} void check() { if (school==0) yyerror("school missing"); else if (school>1) yyerror("multiple schools"); if (group==0) yyerror("group missing"); else if (group>1) yyerror("multiple groups"); if (name==0) yyerror("names missing"); else if (name>1) yyerror("multiple names"); if (module==0) yyerror("warning - no modules\n"); } %} %% [A-Z]+ {if(mytrace)ECHO; school++;} [A-Z][0-9] {if(mytrace)ECHO; group++;} "__" {if(mytrace)ECHO; group++;} [A-Z][a-z'][-A-Za-z, ]* {if(mytrace)ECHO; name++;} [A-Z][A-Z][0-9][0-9][0-9][0-9] {if(mytrace)ECHO; module++;} [A-Z][A-Z][0-9][0-9][0-9] {if(mytrace)ECHO; module++;} "\n" {if(mytrace)ECHO; check(); init();} [ \t] {if(mytrace)ECHO;} . {ECHO; yyerror("unexpected character");} %% yywrap etc. as usual
In the following code, the start-state SCHOOL indicates that we are expecting a school-name, and is used to ensure that each record has a school-name at the start of the record. We then switch to GROUP (using BEGIN) to ensure that the next field will be the group-name, and then to NAME and to MODULE, and back to SCHOOL to recognise the next record.
%s SCHOOL GROUP NAME MODULE %% <SCHOOL>[A-Z]+ {if(mytrace)ECHO; BEGIN GROUP;} <GROUP>[A-Z][0-9] {if(mytrace)ECHO; BEGIN NAME;} <GROUP>"__" {if(mytrace)ECHO; BEGIN NAME;} <NAME>[A-Z][a-z'][-A-Za-z, ]* {if(mytrace)ECHO; BEGIN MODULE;} <MODULE>[A-Z][A-Z][0-9][0-9][0-9][0-9] {if(mytrace)ECHO;} <MODULE>[A-Z][A-Z][0-9][0-9][0-9] {if(mytrace)ECHO;} <MODULE>"\n" {if(mytrace)ECHO; BEGIN SCHOOL;} [ \t] {if(mytrace)ECHO;} "\n" {if(mytrace)ECHO; yyerror("unexpected end-of-line"); BEGIN SCHOOL;} . {ECHO; yyerror("unexpected character");} %% yywrap etc. as usual except that we need: BEGIN SCHOOL; before the call to yylexTo ensure that we start by looking for a school-name on the first line, we need to include
BEGIN SCHOOL;
in the program
before the call to yylex
.
There are now two rules recognising a newline - one when we are in the MODULE
start-state, as expected, and one for any other situation. This is needed
because the pattern that recognises anything and reports it as an error
.
actually excludes newlines. Rather than just add newline to the
existing error-reporting pattern:
"\n"|. {ECHO; yyerror("unexpected character");}I have chosen to include some rudimentary error-recovery: after a newline, no matter what we were trying to recognise on that line, we will start afresh in the SCHOOL start-state trying to recognise the start of a new record.
Of course, if the error is just that a newline has been inserted by mistake in the middle of a record, this will be exactly the wrong thing to do! We could instead try to deal with this with extra rules like:
<GROUP>^[A-Z][0-9] {ECHO; yyerror("unexpected end-of-line in the middle of a record"); BEGIN NAME;}but, as the program should recover anyway after another line has been skipped, and the user will still have to edit the data to correct the problem, in the end this will make very little difference (and users can always find new ways to make mistakes!)
To allow for a list of module-names rather than just one, we don't BEGIN a different start-state after recognising a module-name. We therefore need to deal with whatever indicates the end of the list while still in this start-state. For some data-formats this might mean recognising the next kind of field (see the next paragraph about optional fields) but in this example all we have to do is recognise the end of the record, marked by the end-of-line. (Note that the code would be very similar even if we insisted on the list containing at least one module-name, except that we would probably want to count the module-names and check the count at the end-of-line, as in the previous version of the program.)
<GROUP>[A-Z][0-9] {if(mytrace)ECHO; BEGIN NAME;} <GROUP>[A-Z][a-z'][-A-Za-z, ]* {/*no group_name*/if(mytrace)ECHO; BEGIN MODULE;} <NAME>[A-Z][a-z'][-A-Za-z, ]* {if(mytrace)ECHO; BEGIN MODULE;}or:
<GROUP>[A-Z][0-9] {if(mytrace)ECHO; BEGIN NAME;} <GROUP,NAME>[A-Z][a-z'][-A-Za-z, ]* {if (YYSTATE==GROUP)/*no group_name*/; if(mytrace)ECHO; BEGIN MODULE;}At the moment, when an error is reported, each incorrect character is output in a separate message, like this:
C unexpected character D unexpected characterIn general, it is very dangerous to try to recognise more than one character in the error pattern - it is too easy to make a simple mistake and allow it to recognise some valid inputs - but in this case we can improve error reporting slightly by replacing the pattern
.
by [^ \t\n]+
[^ \t\n]+ {ECHO; yyerror("unexpected character(s)");}to get messages like this:
CD unexpected character(s)
The code will detect and report all format errors, but the error message will just be the rejected characters followed by "unexpected character". If you need better error messages, you might want to use code like:
<SCHOOL>[A-Z]+ {if(mytrace)ECHO; BEGIN GROUP;} <GROUP>[A-Z]+ {ECHO; yyerror("group-name expected but school-name seen instead");} <NAME>[A-Z]+ {ECHO; yyerror("student-name expected but school-name seen instead");} <MODULE>[A-Z]+ {ECHO; yyerror("module-name expected but school-name seen instead");}
[More information about complex lex patterns and start-states]
We could include code like groupname= strdup(yytext);
or
group= lookupgroup(yytext);
in the actions as we recognise
each different kind of name (remembering to cope with the list of module-names
for each student) and then use the variables when we recognise the end of the
record (i.e. the end-of-line). (Each function like lookupgroup
should maintain a
name-list,
adding new names as they are recognised, and maybe even mainting a count.)
Code to output the counts etc. can be placed after the call to
yylex
.