From: greg@cee.hw.ac.uk Newsgroups: comp.compilers Subject: Re: Concrete to Abstract syntax Date: 24 Jun 1996 11:00:19 -0400 Organization: Dept of Computing and Electrical Engineering, Heriot-Watt University The following is from I course I used to give on language definition. Greg Michaelson -------------------------------------------------------------------- Abstract syntax When we come to define semantics we want to associate meanings with syntactic constructs. However, concrete syntax contains information which is not directly relevant to semantics. Consider the grammar for expressions once more: ::= | + | - ::= | * | / ::= | - ::= | ( ) This grammar is chosen to reflect the operator precedence: () before - before * / before + - Thus, the structure of the rules imposes a structure on symbol sequences: "+" and "-" in s qualify the preceding s and following s; "*" and "/" in s qualify the preceding s and following s; "-" in s qualify the following s; "(" and ")" in s qualify the enclosed s. Consider parsing: (3+4)*5-6 using the above grammar for s: | +---------------+----------------+ | | | - | | +------+---------+ | | | | * | | | | | | +-------+-------+ | | | | | ( ) 6 | | +------+------+ 5 | | | + | | | | | | | | 3 | 4 This structure reflects the order in which processing is to proceed: "3+4" before "(3+4)*5" before "(3+4)*5-6". However, from the point of view of that processing, much of the structure is irrelevant. Thus, to calculate "3+4" we don't need to know that "3" is a is a is a : knowing that "3" is a rather than a sub-expression is sufficient. Similarly, to calculate "(3+4)" the brackets are irrelevant. Similarly, to calculate "(3+4)*5" we don't need to know that "(3+4)" is a is a is a : knowing that it is a sub-expression rather than a is sufficient. In general, provided that we know the order in which processing is to take place, all we need to distinguish are s, sub- expressions and operators. Thus, we can simplify the above structure to: | +---------------+----------------+ | | | - | | +------+---------+ | | | | * 6 | | +-------+-------+ | | | | + 5 | | | | 3 4 This corresponds to the abstract syntax: ::= | | - ::= + | - | * | / The most significant change is that we have dropped chains of single rules. Thus: -> -> -> -> -> 6 has become: -> -> 6 This abstract syntax has lost precedence structure and brackets. However, its purpose is not to recognise symbol sequences but to identify the structure of symbol sequences relevant to their meaning. From a compiling point of view, concrete syntax tells us how to recognise the overall structure of symbol sequences; abstract syntax tells us what aspects of that structure need to be retained, say in a parse tree, for semantic processing. We can find concrete syntax from abstract syntax by folding rules together. Consider expressions yet again: ::= | + | - ::= | * | / ::= | - ::= | ( ) First of all, replace the single in with the options from and then replace the qualified with : ::= | ( ) | - Next, replace the single in with the options from and replace all qualified s with : ::= | ( ) | - | * | / Next, replace the single in with the options from and replace all qualified s with : ::= | ( ) | - | * | / | + | - Next, generalise the options with an operator between s: ::= | ( ) | - | ::= * | / | + | - Next, drop the brackets from the bracketed as to process it we just process the : ::= | | - | ::= * | / | + | - Finally, drop the redundant single : ::= | - | ::= * | / | + | - In general, for each rule Ri: i) for an option which is a single rule Rj, replace it with the options for that rule ii) replace any other references to Rj with Ri These two rules remove the chains of single rules from parse trees. iii) drop any symbols which are not significant for distinguishing meaning This rule is partly a matter of style. In principle we can drop all symbols except one to distinguish the construct. In practise, we may leave some "redundant" symbols in place to ease understanding of the final abstract syntax. iv) drop any options which are just Ri This rule removes redundant circularities like: ::= For example, for statements: ::= | ; ::= | | ::= := ::= READ ::= WRITE First of all: ::= := | READ | WRITE Next: ::= ::= | READ | WRITE | ; Note that we do not just fold the grammar into one enormous abstract syntax rule. We need to keep separate rules for structures which are semantically distinct. Thus, here we might keep separate rules for expressions, which return values, statements, which modify variables, and declarations, which introduce variables. It is usual to use a different notation for abstract syntax: i) use simple identifiers instead of full names in <> ii) use "->" instead of "::=" Thus, for expressions: e -> i | - e | e op e op -> + | - | * | / Thus, for statements: s -> n := e | READ n | WRITE e | s ; s We will use this notation when we come to describe semantics. We also need to indicate how abstract syntax constructs correspond to concrete syntax constructs. For example, for expressions: e in op in i in e -> i | - e | e op e op -> + | - | * | / For example, for statements: s in e in n in s -> n := e | READ n | WRITE e | s ; s