Semicolons or newlines as statement separator/terminator?

rpereda@wotangate.sc.ti.com (Ramon Pereda) wrote, in response to Larry Wall

Now, if what you meant is that newlines have syntactic meaning in addition to their ordinary whitespace meaning...
Icon makes an interesting distinction between whitespace and newlines. Newline automatically inserts a semicolon at the end of a line, if an "expression" ends on that line and the next line begins with another. For the sake of this discussion, just think of "expressions" as statements. As a result, the following three sets of statements are all equivalent:
i := 2;
j := 5;
k := 7;

i := 2
j := 5
k := 7

i := 2; j := 5; k := 7
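
As a rough illustration of the rule just described, here is a small Python sketch. It is not Icon's actual lexer: the token pattern and the two helper tests (can_end, can_begin) are simplified assumptions that only recognise identifiers, numbers and parentheses as things that can end or begin an expression.

```python
import re

# Hypothetical, much-simplified token pattern for the examples above.
TOKEN = re.compile(r":=|[A-Za-z_]\w*|\d+|[-+*/()]|;")

def can_end(tok):
    # Assumption: an identifier, a number or ")" can end an expression.
    return tok == ")" or tok[0].isalnum() or tok[0] == "_"

def can_begin(tok):
    # Assumption: an identifier, a number or "(" can begin an expression.
    return tok == "(" or tok[0].isalnum() or tok[0] == "_"

def insert_semicolons(source):
    """Return the token list with ';' inserted at qualifying line ends."""
    lines = [TOKEN.findall(line) for line in source.splitlines()]
    lines = [toks for toks in lines if toks]          # ignore blank lines
    out = []
    for cur, nxt in zip(lines, lines[1:] + [None]):
        out.extend(cur)
        # Insert only if this line ends an expression AND the next begins one.
        if nxt is not None and can_end(cur[-1]) and can_begin(nxt[0]):
            out.append(";")
    return out
```

Under these assumptions the second and third forms above yield the same token stream, and a line ending in an operator (for example "i := 2 +") is continued onto the next line rather than terminated.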
Forgetting a trailing semicolon is a common goof in a variety of languages such as Perl, Ada, Pascal, and C. I see pitfalls like this as flaws in the language, not the novice programmer. Perhaps his semi-conscious act of leaving out the semicolon has a natural elegance. Programming language designers should not beat this innate habit out of the novice programmer. I don't have any hard rules for justifying leaving out the semicolons. I just appeal to your intuitions (when you FIRST STARTED programming if you can think back that far). Isn't it simpler? Less clutter?

The semicolon pitfall is pervasive and very easy to remedy. Now, if only English were as easy to fix...

prechelt@i41s25.ira.uka.de (Lutz Prechelt) responded

Forgetting a trailing semicolon is a common goof in a variety of languages such as Perl, Ada, Pascal, and C. I see pitfalls like this as flaws in the language, not the novice programmer.
Indeed the semicolons can be a nuisance. And it is surprisingly simple (surprising only for us semicolon-grown programmers) to create a syntax that needs no semicolons.

In a language I designed, whose syntax is similar to Modula-2, I once tried to simply remove all semicolons from the syntax. It turned out that the only place where they were needed was at an optional keyword where one could write either

END;
or
END IF;
and both meant exactly the same.

Perhaps semicolons are just a habit?

to which ludemann@netcom.com (Peter Ludemann) replied

It is surprisingly simple to create a syntax that needs no semicolons.

Indeed: I can think of 3 examples.

  1. BCPL requires a semicolon only if you put multiple statements on a line. The end-of-line rule is especially elegant: a semicolon is implicitly inserted if it would make sense (ie, the semicolon is not inserted if the last thing on the line is an operator or if there is an open left parenthesis). Given that the designers of C (allegedly a BCPL-descendant) didn't like verbosity, I'm surprised that they didn't continue this fine tradition [BCPL's style of commenting wasn't continued either; but it's now in C++].
  2. REXX allows continuing a line by having the last character on the line as a comma. This works because comma has no intrinsic meaning in REXX (it can be used as an argument separator in procedure calls, if you want). I urge all scripting language designers to read Cowlishaw's book on REXX (my flame to Larry Wall: REXX provides about as much "power" as Perl, while having readability sufficiently good that IBM managers have been known to program in it; and it works well as an underlying language for editors (e.g., as a replacement for Emacs-lisp)).
  3. The new Arden syntax, where on a challenge I modified the grammar to remove all semicolons (now they aren't even needed if multiple statements appear on a line). It turned out to be surprisingly simple to do ... in most cases, I just changed the ";" token to <statement_end>, defined by:
    	<statement_end> ::= /* empty */
    	<statement_end> ::= ";"
    
    If anybody wants to see the grammar, it's in YACC-form
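
The end-of-line rule described for BCPL in item 1 can be sketched as follows. This is a Python illustration of the rule as stated in the post, not BCPL's real lexer; the token pattern, the OPERATORS set (which also treats a trailing comma as "expects more") and the function name are assumptions made for the example.

```python
import re

# Hypothetical token pattern and operator set, for illustration only.
TOKEN = re.compile(r":=|[A-Za-z_]\w*|\d+|[-+*/(),]|;")
OPERATORS = {":=", "+", "-", "*", "/", ","}

def terminate_lines(source):
    """Append ';' at each line end unless the statement clearly continues."""
    out, depth = [], 0
    for line in source.splitlines():
        toks = TOKEN.findall(line)
        if not toks:
            continue
        out.extend(toks)
        depth += toks.count("(") - toks.count(")")    # open parentheses so far
        continues = depth > 0 or toks[-1] in OPERATORS
        if not continues and toks[-1] != ";":
            out.append(";")
    return out
```

Unlike the Icon rule, this one never looks at the next line: a trailing operator or an unclosed parenthesis is enough to suppress the implicit semicolon.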

"Dr A. N. Walker" (anw@maths.nottingham.ac.uk) responded to Peter Ludemann

Here is a fourth [example]: throw away the semicolons in Pascal, and the only ambiguity is that you can't always see where the empty statements are. This can easily be cured by adding a visible representation [such as ";"] of such statements.

Whether this is a Good Thing from the point of view of the poor programmer trying to make sense of the error messages coming from the compiler is quite another matter.

everettm@walters.East.Sun.COM (Mark Everett) responded to Dr Walker

I was always under the impression that garbage characters like ';' were there to enable error recovery to provide meaningful error messages. Statement delimiters are a place where the compiler can "sync up" with what it is expecting. They are not needed in correct programs, but rather incorrect ones.
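
The "sync up" idea is usually called panic-mode recovery. A minimal Python sketch, assuming a toy language whose only statement is `name := number ;` (the function name and message format are invented for the example):

```python
def check_statements(tokens):
    """Check a token list made of `name := number ;` statements.

    On a syntax error, skip to just past the next ';' so that one
    mistake produces one message instead of a cascade.
    """
    errors = []
    i = 0
    stmt_no = 0
    while i < len(tokens):
        stmt_no += 1
        stmt = tokens[i:i + 4]
        well_formed = (
            len(stmt) == 4
            and stmt[0].isidentifier()
            and stmt[1] == ":="
            and stmt[2].isdigit()
            and stmt[3] == ";"
        )
        if well_formed:
            i += 4
            continue
        errors.append(f"syntax error in statement {stmt_no}")
        # Resynchronise: discard tokens up to and including the next ';'.
        while i < len(tokens) and tokens[i] != ";":
            i += 1
        i += 1
    return errors
```

For the input `a := 1 ; b = = 2 ; c := 3 ;` this reports a single error for the second statement and then checks the third normally, because the semicolon tells the checker where to resume.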

again, lwall@netlabs.com (Larry Wall) rose to the Perl/REXX bait :-), this time dangled by Peter Ludemann

REXX allows continuing a line by having the last character on the line as a comma. This works because comma has no intrinsic meaning in REXX (it can be used as an argument separator in procedure calls, if you want).
I dislike out-of-band bandaids like continuation characters. But I already said my piece about that.
I urge all scripting language designers to read Cowlishaw's book on REXX...
I have read it. I've even been to a REXX symposium. Mike and I get along great. It's our loyal followers that tend to rip each other's throats out.

For my part, I urge all scripting language designers not to emulate REXX's example in allowing umpty jillion divergent implementations.

REXX provides about as much "power" as Perl...
You have a funny definition of "about as much". Most of the folks at the REXX symposium thought that Perl was quite a bit more powerful. In fact, that was their main complaint. :-)

To be sure, both languages are equivalent to Turing machines, so they're theoretically equivalent in what is *possible* to do. However, just as with human languages, computer languages differ not so much in what it is possible to say, but in what it is easy to say.

while having readability sufficiently good that IBM managers have been known to program in it;
Stupider people than that have programmed in Perl. So there! :-)
and it works well as an underlying language for editors (e.g., as a replacement for Emacs-lisp)).
That's not too surprising, since that's the sort of thing REXX was designed for. To the first approximation, REXX is a macro assembler for a text processing engine. It's fine for folks who want little program structure beyond the statement, or data structure beyond the associative array. You can do things like that in Perl, but Perl's mandate is broader. Perl tries to make many other things easy too. Some would say that it tries to make too many other things easy... :-)

But every language optimizes for a different set of capabilities, so you can't really compare languages piecemeal. You can't compare a bat wing with a bee sting. The organism lives or dies as a whole. Every living language will fill its ecological niche (or try to) and will either prosper, or evolve into a new niche, or go extinct.

It's no accident that a semicolon looks like a claw, and terminates things.

lwall@netlabs.com (Larry Wall) also responded to Ramon's mention of Perl

Forgetting a trailing semicolon is a common goof in a variety of languages such as Perl, Ada, Pascal, and C.
If it's a common goof, it behoves the compiler writer to give a good diagnostic. Here, let's try it:
$ perl
foo(1,2,3)
bar(4,5,6);
Semicolon seems to be missing at - line 1.
syntax error at - line 2, near "bar"
Similarly for allowing newlines within a string constant:
$ perl
print "A string missing its trailing quote;
print "Another string";
Bare word found where operator expected at - line 2, near "print "Another"
 (Might be a runaway multi-line "" string starting on line 1)
Admittedly, many C compilers give you confusing diagnostics when you leave out a semicolon. This results from a conspiracy of ambiguity between the C grammar and yacc. It's one of the reasons I don't like dangling statements (and why Perl doesn't have them).

I guess this refers to the C grammar not being LALR(1), which is what yacc recognises. I think it is unfair to blame yacc - languages that are hard to parse automatically are usually hard for a human to read as well, and C (in all its queasy glory, as in the obfuscated C competition) certainly is. However, yacc's built-in parse-time diagnostics are pitiful. If blame should go anywhere, it is to the language designers (but the decisions they made were rational at the time) and implementors - yacc is just an implementation tool, and can (and should) be replaced.

I see pitfalls like this as flaws in the language, not the novice programmer. Perhaps his semi-conscious act of leaving out the semicolon has a natural elegance. Programming language designers should not beat this innate habit out of the novice programmer.
While I usually come down on the side of intuition, I must point out that every novice programmer has a multitude of bad habits that must be beat out of him or her. Our abhorrence of violence (which was beat into us by our culture) should not deter us from beating civility into those who need it.

The fact is, people think in terms of statement terminators. Every time you finish speaking a sentence, you change the intonation at the end of it, and insert a pause before the next sentence. This is pretty much universal in human languages.

The only way I can see you winning this argument is if you liken a program to a poem. Line structure is more important in poetry than in prose, and there are some forms of poetry that allow you to omit the punctuation. This often comes at the expense of other syntactic restrictions, however, such as the number of syllables allowed per line. On the other hand, there are poets that write unpunctuated free verse to get more of a stream-of-consciousness feel, or to keep the clausal bindings purposefully ambiguous. That's fine in poetry. But I don't think you can argue that this is intuitive to the novice. Experts are purposefully ambiguous; novices only accidentally so.

I don't have any hard rules for justifying leaving out the semicolons. I just appeal to your intuitions (when you FIRST STARTED programming if you can think back that far). Isn't it simpler? Less clutter?
Simpler? Less clutter? In some sense, maybe. Would you consider your life simpler and less cluttered if we removed all the stop signs from our streets?

Odd that you should appeal to my intuition. When I FIRST STARTED programming, I used a language like that, and I always felt that line-continuation markers were exceedingly ugly and unnatural. Similarly for any other cute line-continuation tricks depending on unfulfilled syntactic expectations, where you're allowed to use a newline in certain places in the grammar but not others. Ugh. I breathed a great sigh of relief when I discovered languages with statement terminators. Your mileage has obviously varied.

Note that C has the worst of both worlds here--ordinary statements are semicolon terminated, but macro definitions are terminated by the end of the line, and therefore need backslash as a continuation marker. Despite the fact that I write many more statements than macros, I forget the backslash *far* more often than the semicolon. This is because line continuations are (for me) conceptually outside the syntax of the language, while the statement terminators are very much an explicit part of it. I suspect most C programmers would agree here. Maybe even some shell programmers. :-)

The semicolon pitfall is pervasive and very easy to remedy.
No, it's not. You haven't considered all the ramifications. Zeroing in on one feature without considering it in context doesn't achieve any good in the long haul--it's just moving piles of complexity and risk around to the other end of the prison yard. The warden might enjoy it, but the prisoners certainly won't.
Now, if only English were as easy to fix...
It's easy enough to trash English, and I even laughed like I was supposed to, but I also note that you enjoy your English punctuation sufficiently to use at least three different statement terminators: "?", "." and "..."
And I don't see any place where you accidentally left one out. Perhaps statement terminators are intuitively obvious to some people...

leichter@zodiac.rutgers.edu (Jerry) made some interesting and novel points that I am going to have to mull over

Larry Wall argues that explicit line terminators are "natural" in programming languages because, after all, we use them in English (and other natural languages).

Let's continue his argument. In every natural language I know of, case distinctions have no non-redundant semantic content. In fact, in English the rules for capitalization are pretty much fixed, and if I write a sentence in all lower-case, you generally would have no problem determining what to capitalize. (There may be a few exceptional cases like acronyms, but they are quite rare.) The particular rules in other Western languages vary, but the same remains true: Capitalization may help you visually parse a sentence, but you'll have no trouble reading it in all caps, all lower-case, or in punky random mixed case.
I think this assumes that other punctuation, such as spaces to separate words and full stops to end sentences, is present.

Nevertheless, most programming languages today *do* maintain case distinctions. This argues strongly that identifying programming languages with natural languages misses the point, at least to some degree. Just because we use grammatical techniques originally developed by linguists studying natural languages doesn't mean programming languages are very much like natural languages! For another example, programming languages make a good deal of use of recursion, while natural language structures tend, in general, to be quite simple and of highly bounded depth. It's easy to find natural examples of code whose parse tree is 10 or 15 levels deep, but rather hard to construct a *natural* natural-language example of such a thing!
e.g. "I know an old lady who swallowed a ... dog, to catch the cat, to catch the bird, to catch the spider, to catch the fly..." is funny in part because of the unusual sentence structure.

Much of what's in programming languages, I contend, stems from a different source entirely: Mathematical formulas. Mathematical formulas *do* assign semantic meaning to letter case. They *do* have rather deep recursive structures, in exactly the same way (and for the same reasons) as expressions in programming languages. In fact, the expressions in almost all programming languages are pretty much direct cribs from traditional mathematical syntax. In many cases, the real semantic content of a program is in the expressions. The non-expression statements form an overall structure within which those expressions must fit. In this way, many programs are like math texts: English sentences and sentence fragments form a scaffolding within which the "real stuff" - the formulas - are embedded.

Finally - coming back to the original point: Mathematical formulas are almost always written with no explicit "end of formula" markers. Whitespace and line breaks are generally enough to provide an unambiguous interpretation. In "heavy" math, even the interpolated English text consists mainly of sentence fragments with minimal punctuation.

I'll bet almost all missing semicolons are missing at the end of expression statements, not at the end of other kinds of statements. Partly this is an artifact of language design - non-expression statements are often self-terminating and don't need a semicolon. But partly I suspect it's because even experienced programmers think of expressions as "like" mathematical formulas, and they apply the "natural laws" of that domain.


knishimo@cat.cce.usp.br (Kazuo Nishimoto) mentioned Eiffel

Most of the discussion about semicolon is centered in the languages C or C++. I think the Eiffel approach to this is very interesting. Below is a transcript of a section about syntax and the semicolon in the book "Eiffel: The Language" by Bertrand Meyer:

For syntax, some pragmatism does not hurt. A modern version of the struggle between big-endians and little-endians provides a good example. The programming language world is unevenly divided between partisans of the semicolon (or equivalent) as terminator and the Algol camp of semicolon as delimiter. Although the accepted wisdom nowadays is heavily in favor of the first approach, I belong to the second school. But in practice what matters is not anyone's taste but convenience for software developers: adding or forgetting a semicolon should not result in any unpleasant consequences.
semicolon as terminator = every statement (or whatever) ends in a ";"
semicolon as delimiter (I prefer the word "separator") = two adjacent statements are separated by a ";"
C uses the former, so we have to have a ";" before a "}", whereas Pascal uses the latter, so we must omit the ";" before an "else". However, Pascal has enough flexibility that the ";" before an "end" is optional.

In the syntax of Eiffel, the semicolon is theoretically a delimiter (between instructions, declarations, Index_term clauses, Parent parts); but the syntax was so designed as to make the semicolon syntactically redundant, useful only to improve readability; so in most contexts it is optional.

This tolerance is made possible by two syntactical properties: an empty construct is always legal; and the use of proper construct terminators (often end) ensures that no new component of a text may be mistaken for the continuation of the previous construct. For example in

x := y          (The recommended style suggests
a := b           including a semicolon after y.)
no construct may involve two adjacent identifiers, so that 'a' after 'y' must begin a new instruction, even without a semicolon.
This is not the case with functional languages like SML, but it also manages without semicolons.

On a separate point, daveb@perth.DIALix.oz.au (David Brooks) wrote

While I can see the merit of getting rid of the "one statement per line" habit, in terms of being closer to natural languages, I do see one possible drawback.
I have just been working with some fairly heavy-duty C++ code, and it can be quite baffling to get a (sometimes misleading) compiler diagnostic, which points at a line containing a really heavy compound statement. It's by no means always obvious what the compiler is objecting to.

A further point (this is really one of author discipline, rather than language design) is trying to find deeply embedded data declarations. Exactly what is "thingy" declared as? In Assembler (wash your mouth out with soap :) ) at least those declarations were perforce neatly lined up at the left margin. Now they may be buried levels deep in the code.

bevan@cs.man.ac.uk (Stephen J Bevan)

It can be quite baffling to get a (sometimes misleading) compiler diagnostic, which points at a line containing a really heavy compound statement.
Indeed, that's why I dislike compilers that only give line numbers. I much prefer character position or failing that line -and- column.
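
Reporting line and column costs very little if the compiler keeps the character offset of each token. A minimal Python sketch (the function name is made up for the example, and tabs are counted as one column, which is an assumption a real compiler might not make):

```python
def line_and_column(source, offset):
    """Convert a 0-based character offset into a 1-based (line, column)."""
    line = source.count("\n", 0, offset) + 1
    # rfind returns -1 when there is no earlier newline (first line),
    # which makes the subtraction below come out 1-based in both cases.
    last_newline = source.rfind("\n", 0, offset)
    column = offset - last_newline
    return line, column
```

A diagnostic can then point at the offending token inside a long compound statement rather than at the whole line.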

plong@perf.com (Paul Long)

One problem with specifying the character position is what to do in the presence of macro expansion. Do you indicate the position in the original source or in the expanded source? Either way can lead to confusion.

bevan@cs.man.ac.uk (Stephen J Bevan)

I'd suggest generally the former with the option of the latter in order to solve the really awkward problems.

stidev@gate.net (Solution Technology) brought a different viewpoint to the discussion

To diverge slightly; Algol68 took a slightly different approach and didn't use generic brackets. It used if...fi, case...esac, do...od. It initially looks ugly but you get used to it in a few days and it is easier than {...} and there are fewer mistakes. No "what the @#$% is this } supposed to match?" No missing brackets. Overloading of syntactic marks causes mistakes, like the comma, semicolon, less-than-greater-than, of C++.
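
One way to see why distinct closing keywords give fewer mysteries: a checker can name both the construct left open and the closer it found. The Python sketch below is only an illustration; the CLOSER table, function name and message wording are assumptions, and it looks at bare words rather than real Algol68 syntax.

```python
# Each opening keyword has its own closing keyword.
CLOSER = {"if": "fi", "case": "esac", "do": "od"}

def check_brackets(words):
    """Return None if all keyword brackets match, else a diagnostic string."""
    stack = []
    for pos, word in enumerate(words):
        if word in CLOSER:
            stack.append(word)
        elif word in CLOSER.values():
            if not stack:
                return f"word {pos}: '{word}' has no opener"
            opener = stack.pop()
            if CLOSER[opener] != word:
                return (f"word {pos}: expected '{CLOSER[opener]}' "
                        f"to close '{opener}', found '{word}'")
    if stack:
        return f"unclosed '{stack[-1]}'"
    return None
```

With a single generic closer such as "}", the mismatch branch can never fire, so a missing bracket is only noticed at the end of the input.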

and the moderator added

Comment That's a good point, but what about Algol68's extremely overloaded punctuation that let you write if a then b else c fi as (a|b|c) ?
tnemmoC

to which "Dr A. N. Walker" (anw@maths.nottingham.ac.uk) replied

What about it?

Firstly, allowing you to write "if... fi" in an alternate form is surely *under*loading. It allows the user to use "elegant variation" in styles to make meanings clearer -- you can use, for example, vertically aligned "if... fi" for major blocks of code, horizontally aligned "if... fi" for key single-line code, and "(...)" for trivia that you want to slip into place without drawing the reader's attention.

Secondly, the meat of your point [which ISTR you have made before] is presumably that the *compiler* [and hence the reader] finds it quite hard to tell whether a left parenthesis is meant to be "(", "[", "if", "case" or "begin". But it doesn't matter! They are all brackets; and that in itself is quite an insight for the reader -- one that many more recent languages have failed to notice [or to exploit].

and the moderator added

Agreed, it's not important whether it's hard to compile, but it seems to me that you could get some hard to find program errors by misplacing a vertical bar and turning a case into an if or v.v.

And finally, on a related but separate question, "Noel S. Gorelick" (ngorelic@speclab.cr.usgs.gov) asked

I am writing what is turning out to be a C-like interpreter. As an interpreter, the trailing semicolons are a nuisance, and seem kind of silly most of the time. I would like to be able to optionally use newlines as statement separators. My problem is shown below:
for (i=0 ; i<10 ; i=i+1) bar(i)		// works

for (i=0 ; i<10 ; i=i+1) {		// works
	bar(i)
}

for (i=0 ; i<10 ; i=i+1)		// doesn't work
{
	bar(i)
}
These three statements should all be equal, but the newline trailing the for() in the last one matches to an empty expression.

What I think I need to do is to conditionally ignore newlines. Only when trying to match a 'separator' rule, should a newline not be ignored. However, I'm not savvy enough with yacc/lex to know what I'm doing. Here are the rules I've got:

expression_statement
    : separator                        { $$ = NULL; }
    | expression separator             { $$ = $1; }
    ;
separator
    : '\n'
    | ';'
    ;
I tried some code after matching 'expression', to disable/enable eating of newlines, but this fails, causing the lexer to eat all newlines.
expression_statement
    : separator                                 { $$ = NULL; }
    | expression { eatNL = 0; } separator       { $$ = $1; eatNL = 1; }
    ;
Any help would be greatly appreciated.

and norvell@cs.toronto.edu (Theo Norvell) replied

I would like to be able to optionally use newlines as statement separators.
Good idea.
What I think I need to do is to conditionally ignore newlines.
This may not be necessary. Often a better idea is to follow the C/Algol/etc idea of treating blanks and newlines the same. You can use an unambiguous grammar that requires no statement terminators or separators. (As a few people have recently pointed out.)

For Pascal-like languages, the grammar can even be LL(1). Consider this language as an example:

Program --> Block eof
Block --> Statement OptSemi Block | Empty
OptSemi --> ; | Empty
Statement -->
	id := Exp
   |	id(ArgList)
   |	if Exp then Block else Block fi
   |	while Exp do Block od
   |	var id : Type
   |	proc id(ParamDecls) Block end
   |	fun id(ParamDecls) Block return Exp
	etc
Exp --> monop Exp | Exp binop Exp | (Exp) | id(ArgList) | id
Empty -->
This grammar can be made unambiguous, by giving precedence and associativity to the various unary and binary operators. In fact it is very similar to the grammar of the Turing language, which is LL(1).
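
To show that the optional semicolon needs only one token of lookahead, here is a recursive-descent sketch in Python for a cut-down version of the grammar above. It is an illustration under stated assumptions, not Turing's parser: only assignment, if and while are handled, expressions are reduced to operands joined by binary operators with no precedence, and the names Parser and parse are invented for the example.

```python
import re

KEYWORDS = {"if", "then", "else", "fi", "while", "do", "od"}
TOKEN = re.compile(r":=|[A-Za-z_]\w*|\d+|[-+*/();]")

class Parser:
    def __init__(self, text):
        self.toks = TOKEN.findall(text) + ["<eof>"]
        self.pos = 0

    def peek(self):
        return self.toks[self.pos]

    def take(self, expected=None):
        tok = self.toks[self.pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, found {tok!r}")
        self.pos += 1
        return tok

    def is_id(self, tok):
        return tok.isidentifier() and tok not in KEYWORDS

    def block(self):
        # Block --> Statement OptSemi Block | Empty
        stmts = []
        while self.peek() in ("if", "while") or self.is_id(self.peek()):
            stmts.append(self.statement())
            if self.peek() == ";":          # OptSemi --> ; | Empty
                self.take()
        return stmts

    def statement(self):
        if self.peek() == "if":
            self.take()
            cond = self.exp()
            self.take("then")
            yes = self.block()
            self.take("else")
            no = self.block()
            self.take("fi")
            return ("if", cond, yes, no)
        if self.peek() == "while":
            self.take()
            cond = self.exp()
            self.take("do")
            body = self.block()
            self.take("od")
            return ("while", cond, body)
        name = self.take()                  # id := Exp
        self.take(":=")
        return ("assign", name, self.exp())

    def exp(self):
        # Simplification: left-to-right binary operators, no precedence.
        left = self.operand()
        while self.peek() in ("+", "-", "*", "/"):
            op = self.take()
            left = (op, left, self.operand())
        return left

    def operand(self):
        tok = self.take()
        if tok == "(":
            inner = self.exp()
            self.take(")")
            return inner
        if self.is_id(tok) or tok.isdigit():
            return tok
        raise SyntaxError(f"unexpected {tok!r} in expression")

def parse(text):
    parser = Parser(text)
    tree = parser.block()
    parser.take("<eof>")
    return tree
```

Because every statement here starts with an identifier or a keyword and no expression can, the parser always knows where one statement stops and the next begins, so newlines, semicolons and plain spaces between statements all produce the same tree.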

Note that in assignment statements I allow only very simple left-hand sides. This could be extended to allow subscripting, but as soon as you allow a statement to begin with a parenthesis, you will lose all hope of an LL grammar. Consider the Block

a := foo
(*p) := bar
It begins too much like
a := foo(*p)
But perhaps LALR is salvageable.

When you allow whole expressions to be statements by themselves, even this hope fades, as any useful grammar will be ambiguous. Consider

x := f(x)
which could be parsed as x := f followed by (x). Also what about
x - y
which could be parsed as x followed by -y. Obviously allowing C's "empty statement" is also disastrous as the empty string can be parsed as any number of empty statements. But it is a useless statement, as you can always use {}, which need not cause a problem.
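
The two ambiguities can be counted mechanically. The Python sketch below assumes a toy grammar (Stmt is either `id := Exp` or a bare Exp; Exp is an identifier, a call `f ( Exp )`, a parenthesised Exp, unary minus or binary minus) and counts the ways a space-separated token string can be cut into a sequence of statements. The function names are invented for the example.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def is_exp(t):
    """True if the token tuple t forms one complete expression."""
    if len(t) == 1:
        return t[0].isidentifier()
    if len(t) >= 2 and t[0] == "-" and is_exp(t[1:]):                # unary minus
        return True
    if len(t) >= 3 and t[0] == "(" and t[-1] == ")" and is_exp(t[1:-1]):
        return True
    if (len(t) >= 4 and t[0].isidentifier() and t[1] == "("
            and t[-1] == ")" and is_exp(t[2:-1])):                    # call f ( x )
        return True
    return any(t[k] == "-" and is_exp(t[:k]) and is_exp(t[k + 1:])
               for k in range(1, len(t) - 1))                         # binary minus

def is_stmt(t):
    """A statement is a bare expression or an assignment id := Exp."""
    if is_exp(t):
        return True
    return (len(t) >= 3 and t[0].isidentifier()
            and t[1] == ":=" and is_exp(t[2:]))

def count_readings(text):
    """Count the ways text splits into one or more consecutive statements."""
    t = tuple(text.split())
    ways = [1] + [0] * len(t)          # ways[j]: readings of the first j tokens
    for j in range(1, len(t) + 1):
        ways[j] = sum(ways[i] for i in range(j) if is_stmt(t[i:j]))
    return ways[-1]
```

With separators gone, `x := f ( x )` and `x - y` each have two readings under this toy grammar, while `x := y` has one. The count is of statement boundaries only, not of different trees inside one expression.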

By changing the function call syntax (I use square bracket and let the type checker figure out the difference between subscripts and arguments) and segregating unary from binary operators you can get to LALR(1) as evidenced by the yacc grammar for a C-like language below.

In an interactive language there is still the problem that the end of a statement may not be recognized as such until the first token of the next statement is read. I suggest using some special token (I'll call it "done") that the user uses to request that the value of the preceding statement be printed. The nonterminal Program should be modified to read

Program : Block done {printVal();} Program | eof ;
Here is the yacc grammar:
%start Program
%token id if then else while do typename return eof
%right colonEqual
%left '+' '-'
%left '*' '/'
%left '~'	/* Unary minus */
%%
Program	:	Block eof
	;
Block	:	Stmt OptSemi Block
	|	Empty
	;
OptSemi	:	';'
	|	Empty
	;
Empty	:
	;
Stmt	:	'{' Block '}'
	|	Exp
	|	if Exp OptThen Stmt else Stmt
	|	while Exp OptDo Stmt
	|	typename id /* var declaration */
	|	typename id '[' PList ']' Block return Exp /* func decl */
	;
OptThen	:	then | Empty
	;
OptDo	:	do | Empty
	;
PList	:	typename id OptSemi PList
	|	Empty
	;
Exp	:	id OptArgList
	|	Exp colonEqual Exp
	|	Exp '+' Exp
	|	Exp '-' Exp
	|	Exp '*' Exp
	|	Exp '/' Exp
	|	'~' Exp
	|	'(' Exp ')'
	;
OptArgList :	ArgList | Empty
	;
ArgList	:	'[' ExpList ']'
	;
ExpList	:	Exp OptSemi ExpList
	|	Empty
	;