DURATION: 1 lab session


To introduce you to describing pieces of text using regular expressions.


Write a series of patterns (regular expressions) suitable for use with egrep, trying out most of the available facilities. (A condensed version of the man page for egrep is available via the web pages.)
Investigate the differences between egrep and flex by rewriting some of these regular expressions for use with flex. (A description of the regular expressions used with flex is also available via the web pages.)


You are advised to attempt each part in sequence, but only the parts indicated by a '*' are actually marked.

a) egrep

The simplest way of using egrep is to type the whole command on a line. To simplify marking, type your commands into an edit window, and copy-and-paste them into a shell window. (Alternatively, if you know how to use shell scripts or something similar, feel free.)

        egrep 'pattern' file
lists every line in the file that contains the pattern. It is important that you quote the pattern using ' characters so that the shell does not try to recognise special characters that are meant for egrep. Try
        egrep 'tom' /usr/dict/words
Do not use the '-i' facility of egrep to ignore the case of letters, and (except for a6) do not use the '-v' facility to invert the test - I want you to learn how to use regular expressions in general, rather than with a specific program.
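As a quick check of the basic form, the sketch below builds a small made-up word list and searches it. The file name sample_words.txt and its contents are illustrative only and are not part of the exercise files. It uses "grep -E", the standard spelling of egrep, because some recent systems print a warning when egrep itself is invoked.

```shell
# Build a tiny sample word list (hypothetical data, one word per line).
printf '%s\n' atom bottom cat tomato Tom > sample_words.txt

# List every line that contains the lower-case string "tom".
# The single quotes stop the shell from interpreting the pattern.
grep -E 'tom' sample_words.txt
# prints: atom, bottom, tomato (one per line); "Tom" does not match
```

Note that "Tom" is not listed, because without '-i' the match is case-sensitive.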

For the first 6 parts you should search in "$CS2111/p*/ex1/words", which contains one word per line.

a1 Find all the words containing an apostrophe ('). To fool the shell into passing an apostrophe on to egrep, use

as the pattern to recognise one apostrophe, surrounded by more apostrophes as described above i.e.
All these apostrophes are overkill, but they will work here and elsewhere (e.g. in a6). If you understand bash you can omit some.
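The sketch below shows two standard bash spellings that pass a single apostrophe to the pattern. They are offered as illustrations of shell quoting, not necessarily the exact form this handout intends; the file sample_apos.txt is made up for the example.

```shell
# Hypothetical sample data: two of the four words contain an apostrophe.
printf '%s\n' "don't" cat "o'clock" dog > sample_apos.txt

# Spelling 1: a backslash-escaped apostrophe between two empty
# single-quoted strings. The shell joins the pieces into the pattern: '
grep -E ''\''' sample_apos.txt

# Spelling 2: the apostrophe inside double quotes.
grep -E "'" sample_apos.txt
# each command prints: don't, o'clock (one per line)
```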

a2 Find all the words containing any decimal digits.

a3 Find all the proper names (i.e. starting with an upper-case letter).

a4 Find all the acronyms (i.e. words containing no lower-case letters).

a5 Find all the single-character words.

a6* Create a file "words" containing every word in $CS2111/p*/ex1/words that is not a single character, is not an acronym, and contains neither a decimal digit nor an apostrophe (proper names that aren't acronyms should be kept).
An obvious way to do this is to use a different egrep for each set of words to discard, and run them in sequence. However, I want you to use a single egrep, using -v to invert the meaning of the search, so that only lines that don't match the pattern are copied to the output:

        egrep -v 'pattern' $CS2111/p*/ex1/words >words
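To see how a single inverted search can discard several kinds of line at once, here is a minimal sketch on unrelated made-up data (the files sample_v.txt and kept.txt are hypothetical). The alternation operator | lets one pattern describe every kind of line to throw away, and '-v' keeps only the lines that match none of the alternatives.

```shell
# Hypothetical sample data, one word per line.
printf '%s\n' box quiz apple queue pear axe > sample_v.txt

# Discard every line that contains an "x" OR starts with a "q";
# the lines that match neither alternative are written to kept.txt.
grep -E -v 'x|^q' sample_v.txt > kept.txt

cat kept.txt
# prints: apple, pear (one per line)
```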

For the next 4 parts you should search in your file "words".

a7 Find all the words that contain no vowels (i.e. aeiouAEIOU).

a8* Find all the words that consist only of (some or all of) the letters aeiouyAEIOUY.

a9 Find all the words that contain the five vowels in alphabetic order (and contain any other vowels and letters etc.).
For this part and the next part, what is required is a pattern of the form

        ...A...E...I...O...U...

where A, E, I, O, U indicate sub-patterns that recognise the vowels, and the "..."s indicate various sub-patterns that recognise whatever is between the vowels and maybe also what precedes and follows the 5 vowels.
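A smaller analogue of this shape, assuming made-up data in a hypothetical file sample_order.txt: find the words that contain the letters a, b and c in that order. Here each "..." between the letters is simply ".*" (anything at all), and nothing is needed before the first or after the last letter because egrep matches anywhere in the line. For the vowel exercises you will need to decide what each "..." should be.

```shell
# Hypothetical sample data, one word per line.
printf '%s\n' abc cab aXbYc bca alphabetic > sample_order.txt

# An "a", then anything, then a "b", then anything, then a "c".
grep -E 'a.*b.*c' sample_order.txt
# prints: abc, aXbYc, alphabetic (one per line)
```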

a10* Find all the words that contain each of the five vowels exactly once, and in alphabetic order.

You are now going to use another file, "$CS2111/p*/ex1/text", which is similar to "words", except that it has many words on each line. You will need to use the fact that spaces or newlines separate the words (i.e. no word contains a space or a newline). Remember that the words you are looking for can occur at the start or end of a line, as well as somewhere in the middle.
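One way to express "a whole word, wherever it sits on the line" is sketched below on made-up data (the file sample_text.txt is hypothetical). The word must be preceded by either the start of the line or a space, and followed by either a space or the end of the line.

```shell
# Hypothetical sample data: several words per line, separated by spaces.
printf '%s\n' 'cat sat' 'the cat' 'concatenate this' 'a cat here' 'cats' > sample_text.txt

# Match "cat" only as a complete word:
#   (^| )  start of line or a space before it
#   ( |$)  a space or end of line after it
grep -E '(^| )cat( |$)' sample_text.txt
# prints: cat sat, the cat, a cat here (one line each)
```

The lines "concatenate this" and "cats" are not listed, because there "cat" is only part of a longer word.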

a11 This is similar to a9: find all the lines on which there is a word that contains the five vowels in alphabetic order.

a12* This is similar to a10: find all the lines on which there is a word that contains each of the five vowels exactly once, and in alphabetic order.

b) flex

Copy the files "my_grep.l" and "makefile" from "$CS2111/p*/ex1". The flex file "my_grep.l" has two important lines, between the lines consisting of "%%":

        your_pattern_goes_here  	{printf("%s\n", yytext);}
        \n|.       	       	       	{/*discard everything else*/}
You should replace "your_pattern_goes_here" by a regular expression (e.g. from part a - but you must not put ' characters around the regular expression, nor can you put any spaces or tabs before your pattern) and then compile and run it on $CS2111/p*/ex1/words by using the command:
        make test
The part of the input that matches the regular expression will be output by the printf command, and everything else will be thrown away.

b1 Find all the single-character words (a5).

b2 Try using 'tom' as the regular expression. You should find that the correct number of 'tom's are located, but that you do not see the rest of the words.
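The same effect can be seen without flex by using the '-o' option that GNU and BSD grep provide, which prints only the matched text rather than the whole line. This is only an analogy for what the printf of yytext does, not a substitute for the flex part; the file sample_tom.txt is made up for the example.

```shell
# Hypothetical sample data, one word per line.
printf '%s\n' atom bottom cat tomato > sample_tom.txt

# Whole lines that contain "tom" (what egrep normally shows).
grep -E 'tom' sample_tom.txt
# prints: atom, bottom, tomato (one per line)

# Only the matched text (similar to printing yytext in flex).
grep -o -E 'tom' sample_tom.txt
# prints: tom, tom, tom (one per line)
```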

b3* Modify the regular expression to match the whole of each word containing 'tom'.

Bonus: Try repeating a11 and/or a12 using flex. Try to output just the particular words you want, then try to output the whole line containing each word you want. You should be able to do this with a single pattern. Try to simplify either version by using several patterns with different actions - you may get some hints by looking at the next exercise. (Use the command "make bonus" to compile and run your program on "$CS2111/p*/ex1/text".)


Demonstrate your regular expressions for the parts marked with a '*' to a demonstrator or lab supervisor. You will also be expected to answer two questions about regular expressions picked at random. You must "labmail" your "my_grep.l" for part b3.