What if C++ looked more like Python or CoffeeScript?
I’m quite fond of languages with minimal syntax. Not only is code in these languages easier to read and write, but minimal syntax also provides an opportunity to reduce errors (both at compile time and at run time), since every character in a program can cause an error by being misread or misplaced. In addition, long, dense lines of code littered with punctuation increase the cognitive burden on the programmer.
In this post, I would like to sketch out an idea for a modified C++ syntax with reduced noise. I don’t make claims about the practicality of implementing such syntax, and it’s by no means a complete solution. However, I think it’s an interesting thought experiment, and I’d enjoy a simpler syntax if it were available.
Another thing I should mention is that it isn’t meant to be a backward-compatible syntax change which could be incorporated into, say, C++17. What I’m exploring is a dialect of C++ with Python-like syntax but with most C++ constructs and semantics intact.
The basic idea is very straightforward, and of course already applied in other languages. Remove semicolons and curly braces used to denote code blocks, make parentheses in control structures optional, make indentation significant (as in Python and CoffeeScript) - voila, less noise.
The code could look like this:
Generally speaking, the newline becomes the end of a statement, and indentation indicates a code block. Curly braces can still be used in initialization constructs.
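A rough sketch of how this could look. The spelling of the dialect in the comment is an assumption made for illustration (no compiler accepts it), and the function name `sum_positive` is made up for the example. The regular C++ below the comment is what the dialect would correspond to.

```cpp
#include <vector>

// Hypothetical dialect (an assumption, not an existing language):
//
//   int sum_positive(const std::vector<int>& values)
//       int total = 0
//       for int v : values
//           if v > 0
//               total += v
//       return total
//
// The same function in regular C++:
int sum_positive(const std::vector<int>& values) {
    int total = 0;
    for (int v : values) {
        if (v > 0) {
            total += v;
        }
    }
    return total;
}
```

The regular version spends four lines on closing braces alone, which is the kind of noise the dialect is meant to remove.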
Some statements (like template declarations) will still require multiple lines. This kind of syntax requires some constraints to be placed on how statements can be split across multiple lines, but I think these can align pretty well with existing good code style.
The rest of this post is about dealing with various details, e.g. where this syntax can be ambiguous.
Newline and semicolon as statement separators
Generally, end of the line means the end of a statement:
The semicolon can be repurposed to combine multiple statements on one line:
The semicolon continues to be used in the for loop, so it’s mostly unchanged:
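One way a front end might apply these rules is sketched below in regular C++. It is only an illustration under stated assumptions: the helper names `trim` and `split_statements` are hypothetical, a line whose first word is `for` is assumed to keep its semicolons, and semicolons inside string or character literals are not handled.

```cpp
#include <string>
#include <vector>

// Remove leading and trailing spaces and tabs.
static std::string trim(const std::string& s) {
    const auto first = s.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Split one source line of the hypothetical dialect into statements.
// Assumption: a line starting with `for` keeps its semicolons, because
// they belong to the loop header rather than separating statements.
std::vector<std::string> split_statements(const std::string& line) {
    const std::string text = trim(line);
    if (text.empty()) return {};
    if (text.compare(0, 4, "for ") == 0 || text.compare(0, 4, "for(") == 0) {
        return {text};
    }
    std::vector<std::string> statements;
    std::string::size_type start = 0;
    while (start <= text.size()) {
        auto end = text.find(';', start);
        if (end == std::string::npos) end = text.size();
        const std::string piece = trim(text.substr(start, end - start));
        if (!piece.empty()) statements.push_back(piece);  // skip empty pieces
        start = end + 1;
    }
    return statements;
}
```

With this sketch, `split_statements("a = 1; b = 2")` yields two statements, while a `for` header is returned as a single statement.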
Multiline statements
Some statements can’t practically fit on one line and need to be recognized as multiline. One example is templates:
The template keyword and type list must be followed by a class or function declaration on the subsequent line.
What about a function declaration which requires more than one line? For example the return type might need to go on a separate line:
This can be parsed by introducing a set of non-terminating symbols, i.e. symbols which can’t end a statement. The arrow can be one of those.
Sometimes it will also be necessary for function parameters or arguments to span multiple lines:
This could be parsed in two ways: one is that the parser keeps concatenating lines until it finds the closing parenthesis; the other is that commas become part of the non-terminating symbol set, and the parser concatenates the next line as long as the current line ends in a comma. Arguments passed when calling a function can be treated in the same way.
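Both approaches can be combined in a small line-joining pass, sketched here in regular C++. The function names and the exact symbol list are assumptions for the sketch. Operators such as `+` and `-` are left out of the list because a line ending in `i++` would otherwise be misread, and string literals and comments are not handled.

```cpp
#include <string>
#include <vector>

// Return true if `line` ends with one of the non-terminating symbols.
// The symbol list is an assumption chosen for this sketch.
static bool ends_with_non_terminator(const std::string& line) {
    static const std::vector<std::string> symbols = {"->", ",", "&&", "||", "="};
    const auto last = line.find_last_not_of(" \t");
    if (last == std::string::npos) return false;
    const std::string text = line.substr(0, last + 1);
    for (const std::string& symbol : symbols) {
        if (text.size() >= symbol.size() &&
            text.compare(text.size() - symbol.size(), symbol.size(), symbol) == 0) {
            return true;
        }
    }
    return false;
}

// Merge physical lines into logical statements: keep appending the next line
// while parentheses are unbalanced or the text so far ends in a
// non-terminating symbol.
std::vector<std::string> join_continuation_lines(const std::vector<std::string>& lines) {
    std::vector<std::string> statements;
    std::string current;
    int depth = 0;  // number of currently open parentheses
    for (const std::string& line : lines) {
        const auto first = line.find_first_not_of(" \t");
        if (first == std::string::npos) continue;  // skip blank lines
        if (current.empty()) {
            current = line;  // keep the indentation of the first line
        } else {
            current += ' ';
            current += line.substr(first);
        }
        for (char c : line) {
            if (c == '(') ++depth;
            if (c == ')') --depth;
        }
        if (depth <= 0 && !ends_with_non_terminator(current)) {
            statements.push_back(current);
            current.clear();
            depth = 0;
        }
    }
    if (!current.empty()) statements.push_back(current);
    return statements;
}
```

A declaration split over three lines because of an open parenthesis and a trailing arrow comes out as one logical statement, and the indentation of its first line is preserved so that block structure can still be derived from it.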
Finally, what about multiline string literals? For example:
I think it should be OK to concatenate them just like it’s currently done in C++. Even though this could be interpreted as separate literals on subsequent lines, such an interpretation wouldn’t be of much use.
Optional parentheses in control structures
The parentheses in control structures are optional:
The handling of the while in do-while is similar to how this is handled in normal C++:
Sometimes the expressions have to be quite long:
There are two ways of handling this situation. The set of non-terminating symbols includes the operators, so lines will be concatenated as long as they end in operators.
Alternatively, the condition can be enclosed in parentheses, which provides the option of breaking it up freely, just like in normal C++:
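The two options might look as follows. The dialect in the comment is a hypothetical spelling, and the function names are made up for the example; the regular C++ underneath is the equivalent code.

```cpp
// Hypothetical dialect (an assumption for illustration):
//
//   bool in_range(int value, int low, int high)
//       if value >= low &&      // line ends in an operator, so it continues
//          value <= high
//           return true
//       return false
//
//   bool out_of_range(int value, int low, int high)
//       if (value < low         // parenthesized, so it can break anywhere
//           || value > high)
//           return true
//       return false
//
// Regular C++ equivalents:
bool in_range(int value, int low, int high) {
    if (value >= low &&
        value <= high) {
        return true;
    }
    return false;
}

bool out_of_range(int value, int low, int high) {
    if (value < low
        || value > high) {
        return true;
    }
    return false;
}
```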
Declarations vs definitions
Suppose we have a function declaration and a definition:
After removing semicolons and braces they become ambiguous:
A backslash can be used to distinguish a function with an empty body:
The only requirement is to have a blank line following the backslash as that will be considered the function body. One problem with the backslash is that using it this way conflicts with the currently existing translation rules, so perhaps another symbol is necessary. But in any case, the point is that I need some way of specifying an empty block instead of {}.
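The conflict with the translation rules comes from line splicing: in regular C++ a backslash immediately followed by a newline is deleted early in translation, and the two physical lines are joined. A minimal demonstration in regular C++ (the function name `spliced_literal` is made up for the example):

```cpp
#include <string>

// The backslash-newline pair is removed before the literal is tokenized,
// so the two physical lines below form one string literal.
std::string spliced_literal() {
    return "empty \
body";
}
```

The function returns `"empty body"`. A lone backslash at the end of a function header would be consumed in the same way before the parser ever saw it, which is why a different marker may be needed.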
The same principle applies to classes:
Method access specifiers
There are two options for formatting access specifiers. The obvious one is to indent them to the same level as the rest of the class body:
However, the prevalent existing style appears to be to align access specifiers with the class specifiers, so this could be allowed as a special case instead:
Lambdas
A lot of the syntax can be translated in a straightforward manner. However, lambda expressions are one of the trickier cases. Let’s take an example:
I’ve removed the braces around the body of the lambda given to for_each. There was actually a proposal to allow a lambda body to be an expression, which suggested introducing this kind of syntax into C++14, so this is possible to parse, subject to some restrictions.
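As an illustration, here is one possible spelling of a brace-free lambda passed to `std::for_each`. The dialect line in the comment is an assumption, not an accepted C++ feature, and `sum_with_lambda` is a made-up name.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical dialect, with the lambda body written as a bare expression:
//
//   std::for_each(values.begin(), values.end(), [&total](int v) total += v)
//
// Regular C++ equivalent:
int sum_with_lambda(const std::vector<int>& values) {
    int total = 0;
    std::for_each(values.begin(), values.end(), [&total](int v) { total += v; });
    return total;
}
```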
However, lambda expressions can be a lot more complex than this, with multiple statements in the body and so on.
In a more complex scenario, lambda bodies could be demarcated via indentation:
While we’re at it, here is another potentially problematic situation: a function taking two callbacks as arguments. I could pass two lambdas to it:
Note that an empty lambda is now reduced to the introducer. What if I want something more elaborate? For example:
If I try to use indentation for lambda bodies, it looks pretty ugly and confusing. Instead, it’s better to allow an unindented leading comma (this is the CoffeeScript solution):
What about a lambda which is defined and called immediately?
After removing semicolons and braces, I get this:
This is potentially ambiguous, so the lambda needs to be wrapped in parentheses:
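Wrapping the lambda in parentheses is already legal in regular C++, so the requirement costs nothing in terms of compatibility. A small sketch with made-up function names:

```cpp
// Regular C++: a lambda that is defined and called immediately.
int immediately_invoked() {
    const int answer = [] { return 6 * 7; }();
    return answer;
}

// The same call with the lambda wrapped in parentheses. This is the form
// a brace-free dialect would require to avoid the ambiguity.
int immediately_invoked_wrapped() {
    const int answer = ([] { return 6 * 7; })();
    return answer;
}
```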
Standalone code blocks
Standalone code blocks are occasionally useful for scoping. But without curly braces, code blocks like this would merge into one:
Relying on indentation isn’t going to be enough here; I need to tell the parser that a new block has started. A backslash on its own line can be used for that (with the same caveat as before: another character may be better):
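A sketch of the idea, with the hypothetical dialect in the comment and the regular C++ equivalent below it. The function name `scoped_blocks` is made up, and the exact placement of the block marker is an assumption.

```cpp
#include <string>

// Hypothetical dialect (an assumption for illustration):
//
//   std::string scoped_blocks()
//       std::string log
//       \                      <- a lone backslash opens a new block
//           int n = 1
//           log += std::to_string(n)
//       \                      <- and another one opens the second block
//           int n = 2
//           log += std::to_string(n)
//       return log
//
// Regular C++ equivalent:
std::string scoped_blocks() {
    std::string log;
    {
        int n = 1;
        log += std::to_string(n);
    }
    {
        int n = 2;  // a separate variable: the first n is out of scope
        log += std::to_string(n);
    }
    return log;
}
```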
Things I don’t know how to handle
One thing I don’t know how to handle is unnamed types:
I don’t see an easy way of delimiting the type definition from the variable without adding a new keyword or a symbol to mark an anonymous type. Personally, I’d be happy to live without anonymous types in exchange for the other benefits, but this may not be a good tradeoff for other people.
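For reference, this is the construct in regular C++ (the function name `unnamed_type_demo` is made up for the example). The closing brace is the only thing separating the type definition from the variable declared with it.

```cpp
// Regular C++: an unnamed struct type and a variable of that type
// declared in a single statement.
int unnamed_type_demo() {
    struct {
        int x;
        int y;
    } point = {3, 4};
    return point.x + point.y;
}
```

Without braces, a dedented `point = {3, 4}` after the member list would read as a new statement rather than as the declarator of the unnamed type.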
Syntactic constructs that wouldn’t work in C++
There are a few other constructs which are useful in other languages, so I considered them briefly, but decided they wouldn’t work or wouldn’t add much value in the C++ context.
Treating everything as an expression and optional returns
Another thing CoffeeScript and Ruby do is treat everything as an expression:
Consequently, methods return the result of the last expression, and the return keyword is optional:
However, I think this kind of change would be too drastic, and it has performance implications which make it undesirable in C++.
Array comprehensions
CoffeeScript’s for is a mechanism for array comprehensions rather than a simple loop:
While this is nice, I don’t think this would add that much to the range-based for syntax and functional style iteration already available in C++.
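To make the comparison concrete, here is a sketch of how a simple comprehension maps onto the two existing C++ options. The CoffeeScript line in the comment and the function names are illustrative.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// CoffeeScript: squares = (x * x for x in values)

// Range-based for loop:
std::vector<int> squares_loop(const std::vector<int>& values) {
    std::vector<int> squares;
    squares.reserve(values.size());
    for (int x : values) {
        squares.push_back(x * x);
    }
    return squares;
}

// Functional style with std::transform:
std::vector<int> squares_transform(const std::vector<int>& values) {
    std::vector<int> squares;
    squares.reserve(values.size());
    std::transform(values.begin(), values.end(), std::back_inserter(squares),
                   [](int x) { return x * x; });
    return squares;
}
```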
No distinction between definition and assignment
CoffeeScript doesn’t make a distinction between variable definition and assignment, so a = 10 can mean either. It makes the syntax simpler, but it also makes shadowing impossible:
This is a deliberate choice on the part of CoffeeScript’s creator. I think it’s dubious even in CoffeeScript, and would be worse in C++.
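A short sketch of what the distinction buys in regular C++ (the function name `shadowing_demo` is made up). Because the inner line is a declaration, it cannot silently overwrite the outer variable.

```cpp
// Regular C++: the declaration in the inner block introduces a new variable
// that shadows the outer one, so the outer `a` is left untouched.
int shadowing_demo() {
    int a = 10;
    {
        int a = 20;  // a new variable, not an assignment to the outer a
        a += 1;      // modifies only the inner a
    }
    return a;  // still the outer a
}
```

If definition and assignment were spelled the same way, the inner line would modify the outer `a` instead.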
A longer example
Finally, here is a longer example of the new syntax and the normal C++ syntax side by side. This is based on some code I borrowed from the MongoDB project. In addition to using the new syntax on the left side, I made a few C++11/14 style changes.
I think the code on the left is much easier to understand thanks to reduced clutter. It also means that bugs would be easier to spot, and it would be faster to scan through, e.g. when looking for a particular function.
Another thing to note is that the new syntax results in substantial savings in terms of line count. If I don’t include the large comment at the top, the normal C++ code is 25% longer than my syntax. That’s despite the fact that it uses Java-style formatting for curly braces. If opening curly braces were on their own lines, the difference would be even more significant.
I believe this vertical compression is valuable, because it can make a difference between seeing a whole function or only part of it, a whole algorithm or only a portion. Seeing the whole thing at once makes it easier to understand it and reason about it.
The end
Are there other problems with parsing this syntax which I haven’t thought of? This is C++, so I’m sure there are! Even without trying to make significant changes to the syntax, C++ parsing is a complicated affair. Plus of course, C++ was never designed with significant whitespace in mind.
My intent was to see if this kind of change would be at all plausible, and what it would make the code look like. It seems that it could work, but it would likely require constraints to be placed on some of the things which can be written in regular C++.
If you spotted a particular problem, feel free to point it out in the comments.