Next: Debugging YACC grammars Up: CS2111: Design and Implementation Previous: YACC: Further usage Contents

Subsections

Inside LEX and YACC: State Machines

State Machine programs essentially consist of a switch statement inside a loop. Each time around the loop, the switch uses the current state to decide which piece of code to activate, each of which includes a calculation of the next state, ready for the next iteration of the loop.

The state machines used by LEX and YACC are usually implemented by combining data tables that depend on the user's grammar rules, with a general algorithm that depends only on which tool is in use. (Methods for calculating these tables are beyond the scope of this course unit, but will be covered in CS304 next year.)

Inside LEX: Finite State Automata (FSA)

The C program that is created by LEX from your regular expressions works something like this:


        state= 0; get next input character
        while (not end of input) {
           depending on current state and input character
              match: /* input expected */
                 calculate new state; get next input character
              accept: /* current pattern completely matched */
                 state= 0; perform action corresponding to pattern
              error: /* input unexpected */
                 state= 0; echo 1st character input after last accept or error;
                 reset input to 2nd character input after last accept or error;
        }

The decision as to which of the 3 options to choose, and the calculation of the new state after a match, are encoded in a simple table indexed by the current state and input character.

The particular regular expressions are converted into a Deterministic FSA (DFSA) e.g. for a(b|c)d*e+

$\begin{picture}(8,1.5) \put(0,1){\vector(1,0){0.65}}\put(1,1){\circle{1.0}}\put(... ... % 2 -> 3 \qbezier(6.8,0.7)(7,0)(7.2,0.7)\put(6.9,0.5){e} % 3 -> 3 \end{picture}$

which has states 0 to 3, where state 0 is the initial state and state 3 is an accept state that indicates a possible end of the pattern. This is the equivalent decision table:

	a	b	c	d	e	other
0	m(1)
1		m(2)	m(2)
2				m(2)	m(3)
3	a	a	a	a	m(3)	a

<>
a = accept; m(n) = match with new state n; blank = error

Inside YACC: Push Down Automata (PDA)

The parsers YACC generates are essentially finite state machines with a stack. They work by repeatedly inputing and shifting (pushing) symbols until a complete Right-Hand-Side (RHS) of a grammar rule is on the stack and then reducing (popping) the RHS to the corresponding (pushed) Left-Hand-Side (LHS). The parser algorithm used by YACC is known as LR(1), and it creates the decision table using the LALR(1) algorithm.

Although a simple stack of states is sufficient for the parsing algorithm, YACC's stack really consists of (state, value) pairs, where the value is set and used in the actions ($$, $1 etc.).


        initialise stack
        state= 0; get next input symbol
        while (not accept and not error) {
           depending on current state and input symbol
              shift: /* stack does not yet hold full RHS */
                 push current state
                 calculate new state; get next input symbol
              reduce: /* input symbol follows a RHS on the stack */
                 pop RHS /*current state represents right-most item*/
                 calculate new state using LHS symbol & state from stack
              accept: /* i.e. stack empty and end of input */
                 finish parsing
              error: /* input unexpected */
                 . . .
        }

LEX does not make its decision table visible, but YACC allows us to see it in the file y.output e.g. here is the (reformatted) y.output for list : id | list ',' id ;


        state 0 $accept : _list $end
                id shift 1        . error          list goto 2
        state 1 list : id_  (1)
                . reduce 1
        state 2 $accept : list_$end
                list : list_, id
                $end accept        , shift 3        . error
        state 3 list : list ,_id
                id shift 4        . error
        state 4 list : list , id_  (2)
                . reduce 2

(for each state, the _ in the first line(s) tell us where we are in recognising the grammar, and the last line(s) are the transitions to other states. $end means end of input)
and the corresponding PDA:

$\begin{picture}(8.5,2) % put(0,0)\{ framebox(8.5,2)\{\}\} \put(1,1.5){\circle{0.... ...id} % 0-1 \put(3,1.25){\vector(0,-1){0.2}}\put(2.8,1.1){','} % 2-3 \end{picture}$

and decision table:

state	terminal symbols			non-terminals
	id	','	$end	list
0	1			2
1	list:id	list:id	list:id
2		3	accept
3	4
4	list:list ',' id	list:list ',' id	list:list ',' id
	number = shift to new state			new state
	LHS:RHS = reduction rule			after reduction

and here are some example parses:

stack+current state input symbol+rest of input ($=end of input)

0 id$

$\rightarrow$

01 $

$\Rightarrow$

02 $

$\rightarrow$

0 id,id$

$\rightarrow$

01 ,id$

$\Rightarrow$

02 ,id$

$\rightarrow$

023 id$

$\rightarrow$

0234 $

$\Rightarrow$

02 $

$\rightarrow$

0 id,$

$\rightarrow$

01 ,$

$\Rightarrow$

02 ,$

$\rightarrow$

023 $

$\rightarrow$

error

0 id id$

$\rightarrow$

01 id$

$\Rightarrow$

02 id$

$\rightarrow$

error

<>
$\rightarrow$ = shift, $\Rightarrow$ = reduction

Each reduction pops the RHS to reveal a state which, with the LHS symbol, is used to decide the next state
(For this simple grammar always 0, list, and 2 respectively).

Left and Right recursion

Our version of YACC has a tracing facility that we can use to see how it treats left-recursive and right-recursive grammars - e.g. when recognising a , b , c as a list of identifiers. This would be input by a program generated by LEX (or similar), so the parser would see the symbols $\fbox {id}$ $\fbox {,}$ $\fbox {id}$ $\fbox {,}$ $\fbox {id}$ followed by a symbol marking the END of the list (e.g. $\fbox {;}$ ). Example parse trace for list : id | list ',' id ;


                  .... look ahead at id 'a'
        id <-- 'a'
        |
        list
        |                 .... look ahead at ','
        |     ',' <-- ','
        |      |                 .... look ahead at id 'b'
        |      |      id <-- 'b'
        +------+------+
        list
        |                 .... look ahead at ','
        |     ',' <-- ','
        |      |                 .... look ahead at id 'c'
        |      |      id <-- 'c'
        +------+------+
        list
        |                 .... look ahead at END

Example parse trace for list : id | id ',' list ;


                  .... look ahead at id 'a'
        id <-- 'a'
        |                 .... look ahead at ','
        |     ',' <-- ','
        |      |                 .... look ahead at id 'b'
        |      |     id <-- 'b'
        |      |      |                 .... look ahead at ','
        |      |      |     ',' <-- ','
        |      |      |      |                 .... look ahead at id 'c'
        |      |      |      |     id <-- 'c'
        |      |      |      |      |                 .... look ahead at END
        |      |      |      |     list
        |      |      +------+------+
        |      |     list
        +------+------+
        list
        |

The differences arise solely because of the differences in the grammars: as YACC looks for and reduces the RHSs of the actual grammar rules, with left-recursion the sub-lists must be on the left of the parse tree, whereas with right-recursion they must be on the right. Input is left-to-right, so in the first example the tree can be built as the input is being read, but in the second example all the input must be read before the tree can start to be built.

The right recursive grammar uses more stack space (the number of $\vert$ s on each line) than the left recursive grammar. This is always the case with LR grammar tools like YACC, so to minimise the risk of stack overflow we should use left recursive grammars wherever possible. There are other kinds of grammar recognising tools for which this is not the case, or which cannot use one or the other kind of recursion, so check this if you use a tool other than YACC.

Language design and implementation

Most computer languages can mostly be processed by LR(1) parsers like YACC. This is partly because the ability of humans to understand a piece of text is similar to that of YACC, and partly because significantly more complex languages tend to be disproportionately harder and slower to translate, and so are rarely used. However, humans are smarter than computers, and translators have to be able to cope with worst cases, so it can sometimes be necessary to cheat slightly to be able to completely recognise a language. This typically involves putting extra code into the actions of LEX and YACC.

For example, C is not quite LR(1), particularly in the area of declaration processing. One 'fix' is to check any identifier at the start of a statement to see if it is currently defined by a typedef, in which case the statement is a declaration!

Exercises

Put some grammars through YACC -v and construct the corresponding PDAs from the y.output files.
The simple translation of regular expressions to FSAs outlined above in 6.1 usually creates a non-deterministic automaton (NFSA), which then has to be converted to a DFSA. e.g. consider:
```
  while|if   /* keyword */
  [a-z]+     /* identifier */
```
Create an NFSA for these regular expressions, and then convert it to an equivalent DFSA that distinguishes between keywords and identifiers.

Readings

Capon & Jinks: chs. 6.4, 8.1, 8.4, 8.5
Aho, Sethi & Ullman: chs. 3.6-3.8, 4.5, 4.7
Johnson: $\S$ 4
Levine, Mason & Brown: p. 197

Next: Debugging YACC grammars Up: CS2111: Design and Implementation Previous: YACC: Further usage Contents

Pete Jinks
1999-09-30