From: greg@cee.hw.ac.uk
Newsgroups: comp.compilers
Subject: Re: Concrete to Abstract syntax
Date: 24 Jun 1996 11:00:19 -0400
Organization: Dept of Computing and Electrical Engineering, Heriot-Watt University

The following is from I course I used to give on language definition.
Greg Michaelson
--------------------------------------------------------------------

Abstract syntax

When we come to define semantics we  want  to  associate  meanings
with  syntactic  constructs.  However,  concrete  syntax  contains
information which is not directly relevant to semantics.

Consider the grammar for expressions once more:

     <expression> ::= <term> | <term> + <expression> | <term> - <expression>
     <term> ::= <factor> | <factor> * <term> | <factor> / <term>
     <factor> ::= <base> | - <base>
     <base> ::= <integer> | ( <expression> )

This grammar is chosen to reflect the operator precedence:

     ()   before
     -    before
     * /  before
     + -

Thus, the structure of the rules imposes  a  structure  on  symbol
sequences:  "+"  and  "-"  in  <expression>s qualify the preceding
<term>s and  following  <expression>s;  "*"  and  "/"  in  <term>s
qualify  the  preceding  <factor>s  and  following <term>s; "-" in
<factor>s qualify the following <base>s; "(" and  ")"  in  <base>s
qualify the enclosed <expression>s.

Consider parsing:

     (3+4)*5-6

using the above grammar for <expression>s:

                                  <expression>
                                       |
                       +---------------+----------------+
                       |               |                |
                    <term>             -           <expression>
                       |                                |
                +------+---------+                   <term>
                |      |         |                      |
            <factor>   *      <term>                <factor>
                |                |                      |
             <base>          <factor>                <base>
                |                |                      |
        +-------+-------+     <base>                <integer>
        |       |       |        |                      |
        (  <expression> )    <integer>                  6
                |                |
         +------+------+         5
         |      |      |
      <term>    + <expression>
         |             |
     <factor>       <term>
         |             |
      <base>       <factor>
         |             |
     <integer>       <base>
         |             |
         3         <integer>
                       |
                       4

This structure reflects  the  order  in  which  processing  is  to
proceed:  "3+4" before "(3+4)*5" before "(3+4)*5-6". However, from
the point of view of that processing, much  of  the  structure  is
irrelevant.  Thus,  to  calculate "3+4" we don't need to know that
"3" is a <integer> is a <base> is a <factor>: knowing that "3"  is
a <integer> rather than a sub-expression is sufficient. Similarly,
to calculate "(3+4)" the brackets are  irrelevant.  Similarly,  to
calculate "(3+4)*5" we don't need to know that "(3+4)" is a <base>
is a <factor> is a <term>: knowing that  it  is  a  sub-expression
rather than a <integer> is sufficient.

In general, provided that we know the order in which processing is
to  take  place,  all  we need to distinguish are <integer>s, sub-
expressions  and  operators.  Thus,  we  can  simplify  the  above
structure to:

                                     <expression>
                                          |
                          +---------------+----------------+
                          |               |                |
                    <expression>          -          <expression>
                          |                                |
                   +------+---------+                  <integer>
                   |      |         |                      |
            <expression>  *   <expression>                 6
                   |                |
           +-------+-------+    <integer>
           |       |       |        |
     <expression>  +  <expression>  5
           |               |
       <integer>       <integer>
           |               |
           3               4

This corresponds to the abstract syntax:

     <expression> ::= <integer> | <expression> <op> <expression> | - <expression>
     <op> ::= + | - | * | /

The most significant change is that  we  have  dropped  chains  of
single rules.  Thus:

     <expression> -> <term> -> <factor> -> <base> -> <number> -> 6

has become:

     <expression> -> <number> -> 6

This abstract syntax has lost precedence structure  and  brackets.
However,  its  purpose is not to recognise symbol sequences but to
identify the structure  of  symbol  sequences  relevant  to  their
meaning.

From a compiling point of view, concrete syntax tells  us  how  to
recognise  the  overall  structure  of  symbol sequences; abstract
syntax tells  us  what  aspects  of  that  structure  need  to  be
retained, say in a parse tree, for semantic processing.

We can find concrete syntax from abstract syntax by folding  rules
together.  Consider expressions yet again:

     <expression> ::= <term> | <term> + <expression> | <term> - <expression>
     <term> ::= <factor> | <factor> * <term> | <factor> / <term>
     <factor> ::= <base> | - <base>
     <base> ::= <integer> | ( <expression> )

First of all, replace the  single  <base>  in  <factor>  with  the
options  from  <base>  and  then replace the qualified <base> with
<factor>:

     <factor> ::= <integer> | ( <expression> ) | - <factor>

Next, replace the single <factor> in <term> with the options  from
<factor> and replace all qualified <factor>s with <term>:

     <term> ::= <integer> | ( <expression> ) | - <term> | <term> * <term> |
                <term> / <term>

Next, replace the single <term> in <expression> with  the  options
from <term> and replace all qualified <term>s with <expression>:

     <expression> ::= <integer> | ( <expression> ) | - <expression> |
                      <expression> * <expression> | <expression> / <expression> |
                      <expression> + <expression> | <expression> - <expression>

Next,  generalise   the   options   with   an   operator   between
<expression>s:

     <expression> ::= <integer> | ( <expression> ) | - <expression> |
                      <expression> <op> <expression>
     <op> ::= * | / | + | -

Next, drop the brackets from  the  bracketed  <expression>  as  to
process it we just process the <expression>:

     <expression> ::= <integer> | <expression> | - <expression> |
                      <expression> <op> <expression>
     <op> ::= * | / | + | -

Finally, drop the redundant single <expression>:

     <expression> ::= <integer> | - <expression> | <expression> <op> <expression>
     <op> ::= * | / | + | -

In general, for each rule Ri:

i)   for an option which is a single rule Rj, replace it with  the
     options for that rule

ii)  replace any other references to Rj with Ri

These two rules remove the  chains  of  single  rules  from  parse
trees.

iii) drop any symbols which are not significant for distinguishing
     meaning

This rule is partly a matter of style. In principle  we  can  drop
all  symbols except one to distinguish the construct. In practise,
we  may  leave  some  "redundant"  symbols  in   place   to   ease
understanding of the final abstract syntax.

iv)  drop any options which are just Ri

This rule removes redundant circularities like:

     <expression> ::= <expression>

For example, for statements:

     <statements> ::= <statement> | <statement> ; <statements>
     <statement> ::= <assign> | <input> | <output>
     <assign> ::= <name> := <expression>
     <input> ::= READ <name>
     <output> ::= WRITE <name>

First of all:

     <statement> ::= <name> := <expression> | READ <name> | WRITE <expression>

Next:

     <statements> ::= <name> ::= <expression> | READ <name> |
                      WRITE <expression> | <statements> ; <statements>

Note that we do not  just  fold  the  grammar  into  one  enormous
abstract   syntax  rule.  We  need  to  keep  separate  rules  for
structures which are semantically distinct. Thus,  here  we  might
keep   separate   rules  for  expressions,  which  return  values,
statements,  which  modify  variables,  and  declarations,   which
introduce variables.

It is usual to use a different notation for abstract syntax:

i)   use simple identifiers instead of full names in <>

ii)  use "->" instead of "::="

Thus, for expressions:

     e -> i | - e | e op e
     op -> + | - | * | /

Thus, for statements:

     s -> n := e | READ n | WRITE e | s ; s

We will use this notation when we come to describe semantics.

We also need to indicate how abstract syntax constructs correspond
to concrete syntax constructs. For example, for expressions:

     e in  <expression>
     op in  <operator>
     i in  <integer>

     e -> i | - e | e op e
     op -> + | - | * | /

For example, for statements:

     s in  <statement>
     e in  <expression>
     n in  <name>

     s -> n := e | READ n | WRITE e | s ; s

