UP NEXT

Ex. 2121.1 : REGULAR EXPRESSIONS

DURATION: 1 lab session

AIMS

To introduce you to describing pieces of text using regular expressions.

LEARNING OUTCOMES

On successful completion of this exercise a student will:
Be able to write regular expressions that describe simple patterns in text.
Be able to make simple use of two tools that use regular expressions, namely egrep and flex.

SUMMARY

Write a series of patterns (regular expressions) suitable for use with egrep, trying out most of the available facilities. (A condensed version of the man page for egrep is available via the web pages.)

Investigate the differences between egrep and flex by rewriting some of these regular expressions for use with flex. (A description of the regular expressions used with flex is also available via the web pages.)

You are advised to attempt each part in sequence, but only the parts indicated by a '*' are actually marked.

DESCRIPTION

Remember to create a directory CS2121, and directories "ex1" to "ex5" inside it, and to work in the "ex1" directory, as usual.

a) egrep

The simplest way of using egrep is to type the whole command on a line.

        egrep pattern datafile
lists every line in the datafile that contains the pattern (i.e. regular expression). Try typing:
        egrep tom /usr/share/dict/words

However, if we use more complicated patterns, the shell (e.g. bash) will grab characters like "|" and "*" before egrep sees them, so we will put each pattern in a file to protect it from the shell. Create a file called "regexp" that contains a single line, consisting of:

        tom
Do not put any extra quote characters (' or ` or ") or white-space characters (space or tab, or extra newlines other than the one immediately after "tom") into the file! Then try typing:
        egrep -f regexp /usr/share/dict/words
You should get 97 matches - pipe the output through wc to check:
        egrep -f regexp /usr/share/dict/words | wc -l
(If you get a different number, it may be that you have made a mistake, or it may be that "/usr/share/dict/words" has changed!)

For the rest of this exercise, you are going to be using a "makefile" that will reduce the amount of typing you need to do. Copy the makefile from "$CS2121/p*/ex1" i.e. make sure you are in your directory CS2121/ex1, and then type:

        cp $CS2121/p*/ex1/makefile .
and then copy over the other file you need by typing:
	make start
For this exercise, this just copies the file "b.l"

For part 1, create a pattern file called "1.regexp", and then just type:

        make 1
This will take commands from the makefile to run your pattern on the correct test data, and then compare the result with the expected answer. It will tell you if you got it right, or else it will tell you what your pattern matched that it shouldn't have and what it failed to match that it should have. If you want to see everything that your pattern matched, type:
        more 1.out
When your answer to part 1 is correct, proceed to part 2, creating "2.regexp", and then typing "make 2", and so on.

egrep has various other command line options that you might be tempted to use, such as '-i' to ignore the case of letters, or '-v' to invert the test. Do not modify the makefile to use them - I want you to learn how to use regular expressions in general, rather than the facilities of egrep, so that what you learn can be applied to flex in later exercises. Similarly, do not use the extra features available in Gnu egrep such as "[:digit:]" etc., just those described in the lectures.

Some of the questions have hints available. Try to do each part yourself, without referring to the hints, but if you get stuck you might find them useful.

For the first 9 parts the makefile will search in "$CS2121/p*/ex1/words", which contains one word per line.

1) Find all the words containing an apostrophe ('). (hint)

2) Find all the words containing a decimal digit. (hint)

3) Find all the proper names (i.e. starting with an upper-case letter - it doesn't matter if the rest of the line is letters or not). (hint)

4) Find all the single-character words. (hint)

5) Find any blank lines. (for hints, see part 4)

6) Find all the words containing a full-stop (.). (hint)

7) Find all the acronyms (i.e. words containing no lower-case letters). (hint)

8)* Using a single pattern, find all the words that are matched by part 1 or 2 or 3 or 4 or 5 or 6 or 7 above. (hint)

9) Find all the words containing exactly one decimal digit. (hint)

For the remaining parts, the makefile will search a different file, consisting of the words not matched by part 8 above.

10)* Find all the words that contain no vowels (i.e. aeiouAEIOU). (for hints, see part 7)

11)* Find all the words that consist only of (some or all of) the letters aeiouyAEIOUY. (for hints, see part 7)

12)* Find all the words that contain the five vowels in alphabetic order (and contain any other vowels and letters etc.). You may exploit the fact that the test data only includes lower-case letters. (hint)

b) flex

The flex file "b.l" you copied earlier when you typed "make start" has two important lines, between the lines consisting of "%%":

        %%
        your_pattern_goes_here  	{printf("%s\n", yytext);}
        \n|.       	       	       	{/*discard everything else*/}
        %%
For part 13, copy "b.l" to "b13.l" and edit it to replace "your_pattern_goes_here" by a regular expression. Then compile and run it on $CS2121/p*/ex1/words (as used for parts 1 to 9 above) by typing:
        make 13
The part of the input that matches the regular expression will be output by the printf command, and everything else will be thrown away. You must not put any spaces or tabs in front of your pattern!

13) Find all the single-characters words (like part 4 above).

14) Try using 'tom' as the regular expression. You should find that the correct number of 'tom's are located, but that you do not see the rest of the words. (remember to copy b.l to b14.l to start)

15)* Modify the regular expression to match the whole of each word containing 'tom'. (remember to copy b.l to b15.l to start) (for hints, see part 4)

c) Bonus work

This section contains problems that are more difficult than those in the previous sections, and where marks are harder to obtain. You should complete sections (a) and (b) before attempting this section. If you are running out of time, you would do better to start exercise 2.

16)* Use egrep to find all the words that contain each of the five vowels exactly once, and in alphabetic order. (This uses the same data file as parts 10 to 12 above.)

For parts 17 and 18, the makefile will search "$CS2121/p*/ex1/text". which is similar to the previous file, except that it has many words on each line. You will need to use the fact that spaces or newlines separate the words (i.e. no word contains a space or a newline). The words you are looking for can occur at the start or end of a line, as well as somewhere in the middle.

17)* This is similar to part 12 - use egrep to find all the lines on which there is a word that contains the five vowels in alphabetic order.

18)* This is similar to part 16 - use egrep to find all the lines on which there is a word that contains each of the five vowels exactly once, and in alphabetic order.

For the next 3 parts, you are going to repeat an earlier part, but using flex instead of egrep. You are expected to get exactly the same output as you did with egrep, so for the last two parts, your regular expression must match not just one word but the whole line it appears on. Start each part by pasting your regular expression from the egrep version into your flex code (created by copying "b.l", as in section (b)).

19)* Repeat part 12 using flex instead of egrep.

20)* Repeat part 17 using flex instead of egrep.

21)* Repeat part 18 using flex instead of egrep.

ASSESSMENT CRITERIA

This exercise is worth 10 marks in total. It is necessary to produce working expressions or programs to demonstrate that you have achieved the Learning Outcomes of this exercise. However, it is also important that you are able to show that you understand the details of what you have written. You are therefore required to demonstrate your work to a member of the laboratory staff and answer two questions about regular expressions.

MARKING PROCESS

Demonstrate your regular expressions for the *ed parts to a demonstrator or lab supervisor by typing "make marka" for section (a), "make markb" for section (b), and "make markc" for section (c). You will also be expected to answer two questions about regular expressions picked at random.

You should labmail the file "labmail_me", containing all your regular expressions for egrep and flex, which you should create by typing "make labmail".

Section (a) will be marked as follows:
1 mark each (4 marks total) - correct answer to each of parts 8, 10, 11, 12

Section (b) will be marked as follows:
1 mark - correct answer to part 15

Section (c) will be marked as follows:
0.5 mark (3 marks total) - correct answer to each of parts 16 to 21

The questions about regular expressions will be marked as follows:
1 mark each (2 marks total) - correct answer to each question


UP NEXT