Now, if what you meant is that newlines have syntactic meaning in addition to their ordinary whitespace meaning... Icon makes an interesting distinction between whitespace and newlines. A newline automatically inserts a semicolon at the end of a line if an "expression" ends on that line and the next line begins with another. For the sake of this discussion, just think of "expressions" as statements. As a result, the following three sets of statements are all equivalent:
    i := 2;
    j := 5;
    k := 7;

    i := 2
    j := 5
    k := 7

    i := 2; j := 5; k := 7

Forgetting a trailing semicolon is a common goof in a variety of languages such as Perl, Ada, Pascal, and C. I see pitfalls like this as flaws in the language, not the novice programmer. Perhaps his semi-conscious act of leaving out the semicolon has a natural elegance. Programming language designers should not beat this innate habit out of the novice programmer. I don't have any hard rules for justifying leaving out the semicolons. I just appeal to your intuitions (when you FIRST STARTED programming if you can think back that far). Isn't it simpler? Less clutter?
The semicolon pitfall is pervasive and very easy to remedy. Now, if only English were as easy to fix...
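The newline rule described above for Icon can be sketched in a few lines. This is only an illustration under simplifying assumptions: the token set, and the `can_end` / `can_begin` helpers, are hypothetical stand-ins for the per-token "can end an expression" and "can begin an expression" properties, not Icon's actual lexer tables.

```python
import re

# Toy token set (an assumption for this sketch, far smaller than Icon's).
TOKEN = re.compile(r"\s*(:=|[A-Za-z_]\w*|\d+|[-+*/();])")

def tokenize(line):
    """Split one source line into identifiers, numbers and operators."""
    tokens, pos = [], 0
    line = line.rstrip()
    while pos < len(line):
        m = TOKEN.match(line, pos)
        if not m:
            raise SyntaxError(f"bad character at column {pos}: {line!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def is_operand(tok):
    return tok[0].isalnum() or tok[0] == "_"

def can_end(tok):
    """True if an expression may end with this token."""
    return is_operand(tok) or tok == ")"

def can_begin(tok):
    """True if an expression may begin with this token."""
    return is_operand(tok) or tok in ("(", "-")

def insert_semicolons(source):
    """Return the token stream with ';' inserted at qualifying line ends."""
    lines = [tokenize(l) for l in source.splitlines()]
    lines = [l for l in lines if l]          # ignore blank lines
    out = []
    for i, toks in enumerate(lines):
        out.extend(toks)
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        # Insert only if this line ends an expression AND the next begins one.
        if nxt and can_end(toks[-1]) and can_begin(nxt[0]):
            out.append(";")
    return out

without = insert_semicolons("i := 2\nj := 5\nk := 7")
with_semis = insert_semicolons("i := 2;\nj := 5;\nk := 7")
print(without == with_semis)
```

Under these assumptions a line ending in a binary operator (for example `x := a +`) is continued onto the next line without any continuation character, while a line that starts with `-` after a complete expression is treated as a new expression, which is the kind of surprise critics of the scheme point to.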
Forgetting a trailing semicolon is a common goof in a variety of languages such as Perl, Ada, Pascal, and C. I see pitfalls like this as flaws in the language, not the novice programmer.
Indeed, semicolons can be a nuisance. And it is surprisingly simple (surprising only for us semicolon-grown programmers) to create a syntax that needs no semicolons.
In a language I designed, whose syntax is similar to Modula-2, I once tried to simply remove all semicolons from the syntax. It turned out that the only place where they were needed was at an optional keyword where one could write either
    END;

or

    END IF;

and both meant exactly the same.
Perhaps semicolons are just a habit?
It is surprisingly simple to create a syntax that needs no semicolons.
Indeed: I can think of 3 examples.
    <statement_end> ::= /* empty */
    <statement_end> ::= ";"

If anybody wants to see the grammar, it's in YACC form.
Whether this is a Good Thing from the point of view of the poor programmer trying to make sense of the error messages coming from the compiler is quite another matter.
REXX allows continuing a line by having the last character on the line as a comma. This works because comma has no intrinsic meaning in REXX (it can be used as an argument separator in procedure calls, if you want).
I dislike out-of-band bandaids like continuation characters. But I already said my piece about that.
I urge all scripting language designers to read Cowlishaw's book on REXX...
I have read it. I've even been to a REXX symposium. Mike and I get along great. It's our loyal followers that tend to rip each other's throats out.
For my part, I urge all scripting language designers not to emulate REXX's example in allowing umpty jillion divergent implementations.
REXX provides about as much "power" as Perl...
You have a funny definition of "about as much". Most of the folks at the REXX symposium thought that Perl was quite a bit more powerful. In fact, that was their main complaint. :-)
To be sure, both languages are equivalent to Turing machines, so they're theoretically equivalent in what is *possible* to do. However, just as with human languages, computer languages differ not so much in what it is possible to say, but in what it is easy to say.
while having readability sufficiently good that IBM managers have been known to program in it;
Stupider people than that have programmed in Perl. So there! :-)
and it works well as an underlying language for editors (e.g., as a replacement for Emacs-lisp)).
That's not too surprising, since that's the sort of thing REXX was designed for. To a first approximation, REXX is a macro assembler for a text processing engine. It's fine for folks who want little program structure beyond the statement, or data structure beyond the associative array. You can do things like that in Perl, but Perl's mandate is broader. Perl tries to make many other things easy too. Some would say that it tries to make too many other things easy... :-)
But every language optimizes for a different set of capabilities, so you can't really compare languages piecemeal. You can't compare a bat wing with a bee sting. The organism lives or dies as a whole. Every living language will fill its ecological niche (or try to) and will either prosper, or evolve into a new niche, or go extinct.
It's no accident that a semicolon looks like a claw, and terminates things.
Forgetting a trailing semicolon is a common goof in a variety of languages such as Perl, Ada, Pascal, and C.
If it's a common goof, it behoves the compiler writer to give a good diagnostic. Here, let's try it:

    $ perl
    foo(1,2,3)
    bar(4,5,6);
    Semicolon seems to be missing at - line 1.
    syntax error at - line 2, near "bar"

Similarly for allowing newlines within a string constant:

    $ perl
    print "A string missing its trailing quote;
    print "Another string";
    Bare word found where operator expected at - line 2, near "print "Another"
      (Might be a runaway multi-line "" string starting on line 1)

Admittedly, many C compilers give you confusing diagnostics when you leave out a semicolon. This results from a conspiracy of ambiguity between the C grammar and yacc. It's one of the reasons I don't like dangling statements (and why Perl doesn't have them).
I guess this refers to the C grammar not being LALR(1), which is what yacc recognises. I think it is unfair to blame yacc - languages that are hard to parse automatically are usually hard for a human to read as well, and C (in all its queasy glory, as in the obfuscated C competition) certainly is. However, yacc's built-in parse-time diagnostics are pitiful. If blame should go anywhere, it is to the language designers (but the decisions they made were rational at the time) and implementors - yacc is just an implementation tool, and can (and should) be replaced.
I see pitfalls like this as flaws in the language, not the novice programmer. Perhaps his semi-conscious act of leaving out the semicolon has a natural elegance. Programming language designers should not beat this innate habit out of the novice programmer.
While I usually come down on the side of intuition, I must point out that every novice programmer has a multitude of bad habits that must be beat out of him or her. Our abhorrence of violence (which was beat into us by our culture) should not deter us from beating civility into those who need it.
The fact is, people think in terms of statement terminators. Every time you finish speaking a sentence, you change the intonation at the end of it, and insert a pause before the next sentence. This is pretty much universal in human languages.
The only way I can see you winning this argument is if you liken a program to a poem. Line structure is more important in poetry than in prose, and there are some forms of poetry that allow you to omit the punctuation. This often comes at the expense of other syntactic restrictions, however, such as the number of syllables allowed per line. On the other hand, there are poets that write unpunctuated free verse to get more of a stream-of-consciousness feel, or to keep the clausal bindings purposefully ambiguous. That's fine in poetry. But I don't think you can argue that this is intuitive to the novice. Experts are purposefully ambiguous; novices only accidentally so.
I don't have any hard rules for justifying leaving out the semicolons. I just appeal to your intuitions (when you FIRST STARTED programming if you can think back that far). Isn't it simpler? Less clutter?
Simpler? Less clutter? In some sense, maybe. Would you consider your life simpler and less cluttered if we removed all the stop signs from our streets?
Odd that you should appeal to my intuition. When I FIRST STARTED programming, I used a language like that, and I always felt that line-continuation markers were exceedingly ugly and unnatural. Similarly for any other cute line-continuation tricks depending on unfulfilled syntactic expectations, where you're allowed to use a newline in certain places in the grammar but not others. Ugh. I breathed a great sigh of relief when I discovered languages with statement terminators. Your mileage has obviously varied.
Note that C has the worst of both worlds here--ordinary statements are semicolon terminated, but macro definitions are terminated by the end of the line, and therefore need backslash as a continuation marker. Despite the fact that I write many more statements than macros, I forget the backslash *far* more often than the semicolon. This is because line continuations are (for me) conceptually outside the syntax of the language, while the statement terminators are very much an explicit part of it. I suspect most C programmers would agree here. Maybe even some shell programmers. :-)
The semicolon pitfall is pervasive and very easy to remedy.
No, it's not. You haven't considered all the ramifications. Zeroing in on one feature without considering it in context doesn't achieve any good in the long haul--it's just moving piles of complexity and risk around to the other end of the prison yard. The warden might enjoy it, but the prisoners certainly won't.
Now, if only English were as easy to fix...
It's easy enough to trash English, and I even laughed like I was supposed to, but I also note that you enjoy your English punctuation sufficiently to use at least three different statement terminators: "?", "." and "..."
Let's continue his argument. In every natural language I know of, case
distinctions have no non-redundant semantic content. In fact, in English the
rules for capitalization are pretty much fixed, and if I write a sentence in
all lower-case, you generally would have no problem determining what to
capitalize. (There may be a few exceptional cases like acronyms, but they
are quite rare.) The particular rules in other Western languages vary, but
the same remains true: Capitalization may help you visually parse a sentence,
but you'll have no trouble reading it in all caps, all lower-case, or in
punky random mixed case.
I think this assumes that other punctuation, such as spaces to
separate words and full stops to end sentences, is present.
Nevertheless, most programming languages today *do* maintain case
distinctions. This argues strongly that identifying programming languages
with natural languages misses the point, at least to some degree. Just
because we use grammatical techniques originally developed by linguists
studying natural languages doesn't mean programming languages are very much
like natural languages! For another example, programming languages make a
good deal of use of recursion, while natural language structures tend, in general,
to be quite simple and of highly bounded depth. It's easy to find natural
examples of code whose parse tree is 10 or 15 levels deep, but rather hard
to construct a *natural* natural-language example of such a thing!
e.g. "I know an old lady who swallowed a ... dog, to catch the cat, to
catch the bird, to catch the spider, to catch the fly..." is funny in part
because of the unusual sentence structure.
Much of what's in programming languages, I contend, stems from a different source entirely: Mathematical formulas. Mathematical formulas *do* assign semantic meaning to letter case. They *do* have rather deep recursive structures, in exactly the same way (and for the same reasons) as expressions in programming languages. In fact, the expressions in almost all programming languages are pretty much direct cribs from traditional mathematical syntax. In many cases, the real semantic content of a program is in the expressions. The non-expression statements form an overall structure within which those expressions must fit. In this way, many programs are like math texts: English sentences and sentence fragments form a scaffolding within which the "real stuff" - the formulas - are embedded.
Finally - coming back to the original point: Mathematical formulas are almost always written with no explicit "end of formula" markers. Whitespace and line breaks are generally enough to provide an unambiguous interpretation. In "heavy" math, even the interpolated English text consists mainly of sentence fragments with minimal punctuation.
I'll bet almost all missing semicolons are missing at the end of expression statements, not at the end of other kinds of statements. Partly this is an artifact of language design - non-expression statements are often self-terminating and don't need a semicolon. But partly I suspect it's because even experienced programmers think of expressions as "like" mathematical formulas, and they apply the "natural laws" of that domain.
For syntax, some pragmatism does not hurt. A modern version of the
struggle between big-endians and little-endians provides a good example. The
programming language world is unevenly divided between partisans of the
semicolon (or equivalent) as terminator and the Algol camp of semicolon as
delimiter. Although the accepted wisdom nowadays is heavily in favor of the
first approach, I belong to the second school. But in practice what matters
is not anyone's taste but convenience for software developers: adding or
forgetting a semicolon should not result in any unpleasant consequences.
semicolon as terminator = every statement (or whatever) ends in a ";"
semicolon as delimiter (I prefer the word "separator") = two adjacent
statements are separated by a ";"
C uses the former, so we have to have a ";" before a "}",
whereas Pascal uses the latter, so we must omit the ";" before an "else".
However, Pascal has enough flexibility that the ";" before an "end" is optional.
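The terminator/separator distinction above can be made concrete with a toy check. This is a sketch, not any compiler's real rule: each statement is abstracted to the placeholder token `"S"`, and the two function names are made up for the illustration. The separator check is the strict Algol-style rule and ignores Pascal's empty statement, which is what makes the ";" before "end" optional there.

```python
def valid_as_terminator(tokens):
    """C style: every statement is followed by its own ';'."""
    expected = ["S", ";"] * (len(tokens) // 2)
    return tokens == expected

def valid_as_separator(tokens):
    """Strict Algol style: ';' appears only between two statements."""
    n = (len(tokens) + 1) // 2
    expected = (["S", ";"] * n)[:-1]   # drop the final ';'
    return tokens == expected

trailing = ["S", ";", "S", ";"]   # like  a = 1; b = 2;
no_trailing = ["S", ";", "S"]     # like  a := 1; b := 2

print(valid_as_terminator(trailing), valid_as_separator(trailing))
print(valid_as_terminator(no_trailing), valid_as_separator(no_trailing))
```

Each sequence is accepted by exactly one of the two rules, which is the point of the complaint: under either strict convention, one stray or missing semicolon turns a legal statement list into an illegal one.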
In the syntax of Eiffel, the semicolon is theoretically a delimiter (between instructions, declarations, Index_term clauses, Parent parts); but the syntax was so designed as to make the semicolon syntactically redundant, useful only to improve readability; so in most contexts it is optional.
This tolerance is made possible by two syntactical properties: an empty construct is always legal; and the use of proper construct terminators (often end) ensures that no new component of a text may be mistaken for the continuation of the previous construct. For example in
    x := y
    a := b

(the recommended style suggests including a semicolon after y) no construct may involve two adjacent identifiers, so that 'a' after 'y' must begin a new instruction, even without a semicolon.
A further point (this is really one of author discipline, rather than language design) is trying to find deeply embedded data declarations. Exactly what is "thingy" declared as? In Assembler (wash your mouth out with soap :) ) at least those declarations were perforce neatly lined up at the left margin. Now they may be buried levels deep in the code.
It can be quite baffling to get a (sometimes misleading) compiler diagnostic, which points at a line containing a really heavy compound statement.
Indeed, that's why I dislike compilers that only give line numbers. I much prefer character position or, failing that, line -and- column.
Firstly, allowing you to write "if... fi" in an alternate form is surely *under*loading. It allows the user to use "elegant variation" in styles to make meanings clearer -- you can use, for example, vertically aligned "if... fi" for major blocks of code, horizontally aligned "if... fi" for key single-line code, and "(...)" for trivia that you want to slip into place without drawing the reader's attention.
Secondly, the meat of your point [which ISTR you have made before] is presumably that the *compiler* [and hence the reader] finds it quite hard to tell whether a left parenthesis is meant to be "(", "[", "if", "case" or "begin". But it doesn't matter! They are all brackets; and that in itself is quite an insight for the reader -- one that many more recent languages have failed to notice [or to exploit].
    for (i=0 ; i<10 ; i=i+1) bar(i)    // works

    for (i=0 ; i<10 ; i=i+1) {         // works
        bar(i)
    }

    for (i=0 ; i<10 ; i=i+1)           // doesn't work
    {
        bar(i)
    }

These three statements should all be equivalent, but the newline trailing the for() in the last one is matched as an empty expression statement.
What I think I need to do is to conditionally ignore newlines. Only when trying to match a 'separator' rule, should a newline not be ignored. However, I'm not savvy enough with yacc/lex to know what I'm doing. Here are the rules I've got:
    expression_statement
        : separator                 { $$ = NULL; }
        | expression separator      { $$ = $1; }
        ;

    separator
        : '\n'
        | ';'
        ;

I tried some code after matching 'expression', to disable/enable eating of newlines, but this fails, causing the lexer to eat all newlines.

    expression_statement
        : separator                 { $$ = NULL; }
        | expression { eatNL = 0; } separator
                                    { $$ = $1; eatNL = 1; }
        ;

Any help would be greatly appreciated.
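One way to approach the question above is to let a filter between lexer and parser decide which newlines matter, instead of toggling a flag from grammar actions. The sketch below is written in Python rather than lex/yacc and rests on assumptions: the `CONTINUES` set, the `lex` helper and the "drop a newline before `{`" heuristic are all made up for the illustration and are not taken from the poster's language.

```python
import re

# Tokens after which a newline cannot end a statement (assumed set).
CONTINUES = {"(", "{", ";", ",", "+", "-", "*", "/", "=", "<"}

def lex(src):
    """Toy lexer: newlines, identifiers, numbers, single punctuation chars."""
    return re.findall(r"\n|[A-Za-z_]\w*|\d+|[^\s\w]", src)

def filter_newlines(tokens):
    """Keep a newline token only where it can act as a statement separator."""
    out, depth = [], 0
    for i, tok in enumerate(tokens):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
        if tok != "\n":
            out.append(tok)
            continue
        # Look ahead to the next token that is not a newline.
        nxt = next((t for t in tokens[i + 1:] if t != "\n"), None)
        if (depth > 0                    # inside parentheses
                or not out               # leading blank line
                or out[-1] in CONTINUES  # previous token expects more input
                or out[-1] == "\n"       # collapse runs of newlines
                or nxt == "{"            # brace placed on its own line
                or nxt is None):         # trailing newline at end of input
            continue
        out.append("\n")
    return out

same_line = "for (i=0 ; i<10 ; i=i+1) {\n    bar(i)\n}\n"
next_line = "for (i=0 ; i<10 ; i=i+1)\n{\n    bar(i)\n}\n"
print(filter_newlines(lex(same_line)) == filter_newlines(lex(next_line)))
```

With this filter both brace placements reach the parser as the same token stream. In a lex-generated scanner the same idea would need a parenthesis counter, the last token returned, and one token of buffered lookahead. The lookahead rule for `{` is a heuristic: it also swallows the separator between an expression statement and a following free-standing block, so it only suits a language where that combination does not occur.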
I would like to be able to optionally use newlines as statement separators.
Good idea.
What I think I need to do is to conditionally ignore newlines.
This may not be necessary. Often a better idea is to follow the C/Algol/etc idea of treating blanks and newlines the same. You can use an unambiguous grammar that requires no statement terminators or separators. (As a few people have recently pointed out.)
For Pascal-like languages, the grammar can even be LL(1). Consider this language as an example:
    Program   --> Block eof
    Block     --> Statement OptSemi Block
                | Empty
    OptSemi   --> ;
                | Empty
    Statement --> id := Exp
                | id(ArgList)
                | if Exp then Block else Block fi
                | while Exp do Block od
                | var id : Type
                | proc id(ParamDecls) Block end
                | fun id(ParamDecls) Block return Exp
                etc
    Exp       --> monop Exp
                | Exp binop Exp
                | (Exp)
                | id(ArgList)
                | id
    Empty     -->

This grammar can be made unambiguous by giving precedence and associativity to the various unary and binary operators. In fact it is very similar to the grammar of the Turing language, which is LL(1).
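To see why one token of lookahead is enough for a grammar of this shape, here is a hand-written recursive-descent sketch for a small subset (assignment, `while`, `if`). It is an illustration under assumptions, not the Turing grammar: the `Parser` class and its tuple output are invented here, operands are limited to identifiers, numbers and parenthesised expressions, and `Exp binop Exp` is rewritten as a loop that ignores operator precedence.

```python
import re

KEYWORDS = {"if", "then", "else", "fi", "while", "do", "od", "eof"}
BINOPS = {"+", "-", "*", "/"}

def lex(src):
    # Blanks and newlines are treated alike; unknown characters are skipped.
    return re.findall(r":=|[A-Za-z_]\w*|\d+|[-+*/;()]", src) + ["eof"]

def is_id(tok):
    return (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS

class Parser:
    def __init__(self, src):
        self.toks, self.pos = lex(src), 0

    def peek(self):
        return self.toks[self.pos]

    def take(self, expected=None):
        tok = self.peek()
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def program(self):                       # Program --> Block eof
        stmts = self.block()
        self.take("eof")
        return stmts

    def block(self):                         # Block --> Statement OptSemi Block | Empty
        stmts = []
        while is_id(self.peek()) or self.peek() in ("if", "while"):
            stmts.append(self.statement())
            if self.peek() == ";":           # OptSemi --> ; | Empty
                self.take()
        return stmts

    def statement(self):
        if self.peek() == "while":
            self.take()
            cond = self.exp()
            self.take("do")
            body = self.block()
            self.take("od")
            return ("while", cond, body)
        if self.peek() == "if":
            self.take()
            cond = self.exp()
            self.take("then")
            yes = self.block()
            self.take("else")
            no = self.block()
            self.take("fi")
            return ("if", cond, yes, no)
        name = self.take()
        self.take(":=")
        return ("assign", name, self.exp())

    def exp(self):
        left = self.operand()
        while self.peek() in BINOPS:         # stops at the next statement's first token
            op = self.take()
            left = (op, left, self.operand())
        return left

    def operand(self):
        tok = self.take()
        if tok == "(":
            inner = self.exp()
            self.take(")")
            return inner
        if tok.isdigit() or is_id(tok):
            return tok
        raise SyntaxError(f"unexpected {tok!r} in expression")

print(Parser("i := 2 j := 5; k := 7").program())
```

The parser never needs a separator because no statement in this subset can begin with a token that could also continue the preceding expression: an expression is extended only by a binary operator, and statements begin only with an identifier or a keyword.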
Note that in assignment statements I only allow very simple left-hand sides. This could be extended to allow subscripting, but as soon as you allow a statement to begin with a parenthesis, you will lose all hope of an LL grammar. Consider the Block

    a := foo
    (*p) := bar

It begins too much like

    a := foo(*p)

But perhaps LALR is salvageable.
When you allow whole expressions to be statements by themselves, even this hope fades, as any useful grammar will be ambiguous. Consider
    x := f(x)

which could be parsed as x := f followed by (x). Also what about

    x - y

which could be parsed as x followed by -y. Obviously allowing C's "empty statement" is also disastrous, as the empty string can be parsed as any number of empty statements. But it is a useless statement, as you can always use {}, which need not cause a problem.
By changing the function call syntax (I use square bracket and let the type checker figure out the difference between subscripts and arguments) and segregating unary from binary operators you can get to LALR(1) as evidenced by the yacc grammar for a C-like language below.
In an interactive language there is still the problem that the end of a statement may not be recognized as such until the first token of the next statement is read. I suggest using some special token (I'll call it "done") that the user uses to request that the value of the preceding statement be printed. The nonterminal Program should be modified to read

    Program : Block done {printVal();} Program
            | eof
            ;

Here is the yacc grammar:

    %start Program
    %token id if then else while do typename return eof
    %right colonEqual
    %left '+' '-'
    %left '*' '/'
    %left '~'                  /* Unary minus */
    %%
    Program    : Block eof
               ;
    Block      : Stmt OptSemi Block
               | Empty
               ;
    OptSemi    : ';'
               | Empty
               ;
    Empty      :
               ;
    Stmt       : '{' Block '}'
               | Exp
               | if Exp OptThen Stmt else Stmt
               | while Exp OptDo Stmt
               | typename id                                  /* var declaration */
               | typename id '[' PList ']' Block return Exp   /* func decl */
               ;
    OptThen    : then
               | Empty
               ;
    OptDo      : do
               | Empty
               ;
    PList      : typename id OptSemi PList
               | Empty
               ;
    Exp        : id OptArgList
               | Exp colonEqual Exp
               | Exp '+' Exp
               | Exp '-' Exp
               | Exp '*' Exp
               | Exp '/' Exp
               | '~' Exp
               | '(' Exp ')'
               ;
    OptArgList : ArgList
               | Empty
               ;
    ArgList    : '[' ExpList ']'
               ;
    ExpList    : Exp OptSemi ExpList
               | Empty
               ;