Return to first page of the Compiler Theory article
Pascal contains 10 types of code structures:
We will not look at the last one here.
Each of these requires its own structure and procedure to parse it. These procedures are highly recursive. One nice clean way to handle parsing is to create a parent procedure that determines the type of code structure and calls the appropriate procedure. Many of the child procedures will then make recursive calls back to the parent procedure.
Procedure Parse_Code_Structure (var Code_Struct : Code_Struct_Ptr, var token token_type); begin if (token.type_of_token = glob_res_word) and (token.glob_res_word = gw_begin) then parse_code_block(Code_Struct, token); if (token.type_of_token = glob_res_word) and (token.glob_res_word = gw_end) then advance_token(token) else raise_syntax_error(token); if (token.type_of_token = reserved_sym) and (token.res_sym = rs_semicolon) then advance_token(token) else raise_syntax_error(token); return; if (token.type_of_token = identifier) then if (found_in_table(var_tab, token.cap_string)) begin parse_assignment_statement(Code_struct, token) else if (found_in_table (proc_tab, token.cap_string)) then parse_procedure_call (code_struct, token); if (token.type_of_token = reserved_sym) and (token.res_sym = rs_semicolon) then advance_token(token) else raise_syntax_error(token); return; if (token.type_of_token = glob_res_word) and (token.glob_res_word = gw_if) then parse_if_statement(Code_struct, token) if (token.type_of_token = reserved_sym) and (token.res_sym = rs_semicolon) then advance_token(token) else raise_syntax_error(token); return; . . .
This procedure is growing too large to show here completely, but that should be enough to get the idea. We also do not have enough room here to look at all the different structure parsing procedures. The if statement is probably the most interesting.
Before we look at parse_if_statement, we should look at the data structure needed to hold the if statement, and in fact we should look at all data structures needed to hold all statement types. There is some debate over the best structure to use. I will present the structure I prefer, which is straight forward and easy to handle. (see notes in the first page of this article)
Once again, a variant record would be ideal for holding code structures, but Pascal's variant records have not improved any in the past twenty pages, so once again we will use a set of pointers. A parent type is needed to hold the pointers. Here is a partial definition of code_struct:
Type code_struct_ptr = ^code_struct; code_block_ptr = ^code_block_stat_rec; assignment_ptr = ^assignment_stat_rec; proc_call_ptr = ^proc_call_stat_rec; if_rec_ptr = ^if_stat_rec; . . . code_struct_enum = (code_block, assignment, proc_call, if_stat, case_statement, for_loop, do_loop, repeat_loop, goto_statement); code_struct = record code_struct_type : code_struct_enum; code_blk_ptr : code_block_ptr; assign_ptr : assignment_ptr; call_ptr : proc_call_ptr; if_stat_rec_ptr : if_ptr; . . . end; {record}
The structure for an if statement is defined as follows:
if_stat_rec = record Boolean_Exp : Expression_Type_Ptr; True_Path, False_Path : code_Struct_Ptr; end; {record}
Note that we now need to expand our definition of an expression. Expressions need to contain boolean values (true and false), boolean operators (and, or), compare operators (>, <, =, etc), function calls, and other things. Another detail of expressions is that in order to enforce strong type checking each node of the expression tree must contain a type. If the type returned by any child node is not compatible with the parent node, there is a type error. By compatible we mean that an integer value can be assigned to real variable, but not the other way around. The two types do not have to be strictly the same.
To see how if_stat_rec works we need now to look at parse_if_statement.
procedure parse_if_statement (var Code_Struct : Code_Struct_Ptr, var token token_type); begin if (token.type_of_token = glob_res_word) and (token.glob_res_word = gw_if) then advance_token(token) else (* this procedure should not have been called *) raise_internal_error(token); code_struct := new (code_struct); code_struct.code_struct_type := if_stat; code_struct.if_stat_rec_ptr := new (if_rec_ptr); code_struct.^if_stat_rec_ptr.True_Path := NIL; code_struct.^if_stat_rec_ptr.FALSE_Path := NIL; parse_expression(Code_Struct.^if_stat_rec_ptr.^Boolean_Exp); if (token.type_of_token = glob_res_word) and (token.glob_res_word = gw_then) then advance_token(token) else raise_syntax_error(token); parse_code_structure (code_struct.^if_stat_rec_ptr.true_path, token); (* this part is important - study it closely *) if (token.type_of_token = glob_res_word) and (token.glob_res_word = gw_else) then begin parse_code_structure (code_struct.^if_stat_rec_ptr.false_path, token) if (token.type_of_token = reserved_symbol) and (token.glob_res_word = rs_semicolon) then begin advance_token(token); return; end end else if (token.type_of_token = reserved_symbol) and (token.glob_res_word = rs_semicolon) then begin advance_token(token); return; end else raise_syntax_error(token); end; {procedure parse_if_statement}
Now, if we want to parse the following code:
if (Bool_Var) then call_proc;
This is simple and straight forward. Parse_if_statement is entered, the keyword IF is consumed by advance_token, and the if statement structure is allocated. Next the expression Bool_Var is parsed by the expression parser. Then the semicolon is consumed is consumed and control is returned to calling procedure. This results in a structure that looks like figure 4.
Figure 4. Structure of an If Statement
Now lets look at a more complex example:
if (Bool_Var) then call_proc else call_different_proc;
Everyone always has a problem with this Pascal syntax. Where do semicolons belong and where do they NOT belong? This is handled by the piece of code with the comment (* this part is important - study it closely *). If we look at what goes on in the parser, this syntax makes a little more sense. Parsing takes place just as it did in the last example, up to the point where the call to call_proc is parsed. After that, the parser looks for a semicolon, and not seeing one, it knows that the keyword "else" must be next. So that keyword is found and consumed, then the call to call_different_proc parsed, the semicolon is consumed and control is returned to the calling function. Figure 5 shows how this structure should look.
Figure 5. Structure of an If-Else Statement
Now to get even more bizarre, think of the following examples:
if (Bool_Var) then if (Bool_Var2) then begin call_proc; end else call_different_proc;
Which means if Bool_Var is true and Bool_Var2 is true, then call call_proc, but if Bool_Var is false, then call call_different_proc. Nothing is called if Bool_Var is true and Bool_Var2 is false. According to Pascal syntax, can this be done without the Begin and End? Now look at:
if (Bool_Var) then if (Bool_Var2) then call_proc else call_different_proc;
Which means to that when Bool_Var is true and Bool_Var2 is true, we call call_proc, and when Bool_Var2 is false, we call call_different_proc. When Bool_Var is false, nothing is called. Figure 6a shows how the structure of the first example should look, and 6b shows the structure of the second example. It is a good exercise to trace through the parser code and see how these structures are created.
Figure 6. More Complex If Statements
The next thing to look at is how the parser handles an else if. The keyword "else" is consumed, then the parse_code_structure procedure is called. This, in essence, makes the if part of the else if an entirely separate statement, as if the programmer has written
if (Bool_Var) then call_proc; else if (Bool_Var2) then call_different_proc;
This results in the structure shown in figure 7.
Figure 7. If - Else If structure.
Note that some languages, particularly Ada, Modula-2 and LOOP force the handling of else-if as part of the if statement itself. This is accomplished by creating a single keyword, "elsif" in the case of Ada and Modula-2, "else_if" in the case of LOOP. [LOOP (Language for Object Oriented Programming) is currently under development. I hope to post an article on it this winter - perhaps in February - JAF]
It might also be a benefit to the student of compiler theory to see examples of how other statement types are stored and parsed. The code block statement will need a list of statements:
code_list_ptr = ^code_list_record; statement_list_record = record statement : code_struct_ptr; next_statment : code_list_ptr; end; {record} code_block_statement = record statement_list : code_list_ptr; end; {record}
This structure allows a code block to hold any number of statements of any type. This statement type is used any time the parser encounters the key word BEGIN. Here is the parser for it:
procedure parse_code_block (var Code_Struct : Code_Struct_Ptr, var token token_type); (* ordered linked lists are a pain *) last : statement_list_ptr; begin if (token.type_of_token = glob_res_word) and (token.glob_res_word = gw_begin) then advance_token(token) else (* this procedure should not have been called *) raise_internal_error(token); code_struct := new (code_struct); code_struct.code_struct_type := code_block; code_struct.code_blk_ptr := new (code_block_stat_rec); (* This creates an empty node at the start of the list *) (* it will need to be removed after the loop *) code_struct.^code_blk_ptr.statement_list := new (statement_list_record); last := code_struct.^code_blk_ptr.statement_list; while (token.type_of_token <> glob_res_word) or (token.glob_res_word <> gw_end) do begin parse (last.next, token); last := last.next; last.next := NIL; if (token.type_of_token = reserved_sym) and (token.res_sym = rs_semicolon) then advance_token(token) else raise_syntax_error(token); end; (* remove blank node at beginning of list *) last := code_struct.^code_blk_ptr.statement_list; code_struct.^code_blk_ptr.statement_list := code_struct.^code_blk_ptr.^statement_list.next; free (last); end; {procedure parse_code_block}
Note that the final semicolon after the keyword "end" is not consumed. This semicolon will not exist if the code block is in an if-else structure.
Assignment statements have the structure
assignment_stat_rec = record lvalue, rvalue : expression_type_ptr; lvalue_type, rvalue_type : decl_type_enum; end; {record}
Lvalue is the left hand side of the expression, and must be a variable. The EBNF can be created in such a way that this is insured, in which case the type of this field would actually be more narrowly defined. The rvalue is any legal expression. Lvalue_type and_Rvalue type are the types of each side of the assignment. This information is kept to ensure that the two types are compatible. The parsing code for this is quite straight forward. The magic here is in parsing expressions, and this has already been covered. So parse_assignment_statement will not be shown here. The icon for this paper is actually a diagramatic representation of the assignment statement structure (figure 8).
Figure 8. The Structure of an Assignment Statement
The structures for all three loop types are similar, so they will be shown here together:
for_loop_rec = record variable : Var_Decl; start_val, end_val : integer; code : code_struct_ptr; end; {record} while_loop_rec = record end_condition : expression_type_ptr; code : code_struct_ptr; end; {record} repeat_loop_rec = record continue_condition : expression_type_ptr; code : code_struct_ptr; end; {record}
By now the reader should be able to figure out him/herself what the parser code will look like for these constructs.
The last statement type we will look at is the procedure call. The structure to hold a procedure call is rather complex because a list of expressions is needed to hold parameters. The type proc_ptr is defined in the next section, Parsing Procedures and Scoping.
exp_list_ptr = ^exp_list; exp_list = record exp : expression_type_ptr; next : exp_list_ptr; end; {record} proc_call_stat = record proc : proc_ptr; Parm_List : exp_list_ptr; end; {record}
When the call is parsed the procedure is looked up in yet another table. If the procedure is found a pointer is set to the structure holding the procedure's code. If the procedure is not found it is a syntax error.
To parse the list of parameters, a loop is entered when the opening parenthesis is seen. The loop exits when the closing parenthesis is found, and the parameters themselves are parsed with calls to parse_expression.
The concepts behind parsing computer code can be quite deep, but the code itself is surprisingly small and compact. The entire parser for the eight code structures in Pascal is only a couple hundred lines long. If you were to print it out, it would probably take less paper than it would to print the section of this paper describing it!
Go to next page of Compiler Theory article
Return to start of this page
Return to Table of Contents of the Compiler Theory article
Return to High Tech page