Programming Language Reinvention
The best programming language and environment I’ve used is C# with Visual Studio (version 2005 as I write this). However, this pair still has its shortcomings, and so I created this page to enumerate those shortcomings along with my ideas on how to improve them with a reinvented language.
I’m going to refer to the new language as “Jargon” because I need to give it a name. I kinda like that name so I may keep it.
The initial target of the new language is to build .NET assemblies, but the language itself will be designed in such a way that it should also be quite usable to generate JVM-compatible binaries or even native code.
I have some code examples herein. The Jargon syntax is not defined, so I use C#-like syntax that is sometimes not valid C# because C# doesn’t support the features I’m suggesting. This should not be taken to be an indication of the Jargon syntax, necessarily, but just to illustrate how a particular feature might look on its own.
Goal
The goal of this project is a development environment that allows more efficiency in both inception and maintenance. It is not simply a slicker editor, or a better compiler, or a new language. It is these things, but not merely these things. The goal is to come up with the components of a full environment that beats current alternatives.
Principles
I’m going to attempt to list these principles in order of priority where possible. So basically if two principles contradict each other in some context, the principle that is higher on the list trumps the other.
- Performance: Resulting compiled code should be at least as fast as if written in C#.
- Readable code: Code should not only be as naturally readable as possible; the syntax rules should also make obfuscation as difficult as possible.
- Concise syntax: Expressing a thought should require as little code as possible.
- Defaults where reasonable and intuitive: Where possible the developer should never have to code something specifically if a reasonable and intuitive default can be inferred instead.
- Enforced consistent convention: Two pieces of code that are syntax-identical must also be character-identical, and this must be enforced by the language, not left to convention.
- Human readability of code is the responsibility of the editor and not the storage format.
- The storage format should reflect (as closely as possible in plain text) the format portrayed by editors, and be friendly to differencing tools such as those used in source control.
Shortcomings of C#
First, I’ll point out that although C# will never beat C/C++’s raw speed for certain processor intensive tasks, creating something that does compete with C/C++ in that field is not a goal of this project.
- No “readonly ref” (or “const ref”)
- There are two reasons to pass a variable by reference. The first is to make the parameter effectively in/out. The second is to improve performance when passing large value types. In C# you have only one option, “ref”, so you get both effects or neither: if you pass by reference because you’re working with a large value type, it also becomes modifiable by the callee, which is often not desired. C++ provides a “const” keyword for this. The C#-y way to do this would be a “readonly” keyword.
- Note that a “readonly” keyword might not be necessary, as it might make sense to have passing by reference simply be an automatic optimization when passing large value types. This approach is much more complicated because the callee’s implementation would have to have code emitted to make a copy if the value were to be modified therein.
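For comparison, C# 7.2 and later have an `in` parameter modifier that behaves much like the “readonly ref” described above: the argument is passed by reference, but the callee may not assign to it. A minimal sketch, where `BigValue` and `Sum` are made-up names used only for illustration:

```csharp
using System;

// Hypothetical large value type; copying it on every call would be wasteful.
struct BigValue
{
    public long A, B, C, D, E, F, G, H;
}

static class Program
{
    // "in" passes v by reference but treats it as read-only inside the method.
    static long Sum(in BigValue v)
    {
        // v.A = 0; // would not compile: v is a read-only reference
        return v.A + v.B + v.C + v.D + v.E + v.F + v.G + v.H;
    }

    static void Main()
    {
        var big = new BigValue { A = 1, B = 2, H = 3 };
        Console.WriteLine(Sum(in big)); // prints 6
    }
}
```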
- No “with” construct
- A “with” construct lets you reference methods of an object while only referencing the object itself once.
- Besides just being a syntactic shortcut, a with construct is also an optimization as it doesn’t require loading the same reference on the stack for every member call. This is especially true if modifying several members of a value type that’s in an array. For example, the emitted code for the following is horrible:
    array[0].val1 = 1;
    array[0].val2 = 2;
    array[0].val3 = 3;
- I suggest a syntax much closer to
    with array[0]
    {
        val1 = 1;
        val2 = 2;
        val3 = 3;
    }
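In the absence of a “with” construct, a ref local (available from C# 7.0) covers part of the array case: the element is located once and then modified in place through the reference. This is only an approximation of the idea, and `Item` is a hypothetical struct invented for the sketch:

```csharp
using System;

// Hypothetical value type with the three members used in the text.
struct Item
{
    public int val1, val2, val3;
}

static class Program
{
    static void Main()
    {
        var array = new Item[4];

        // Take a reference to the element once...
        ref Item first = ref array[0];

        // ...then set several members through that single reference.
        first.val1 = 1;
        first.val2 = 2;
        first.val3 = 3;

        Console.WriteLine(array[0].val1 + array[0].val2 + array[0].val3); // prints 6
    }
}
```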
- Constructors are differentiated only by a parameter type
- A constructor is basically a static function with no name where “this” is implied and references a newly allocated but unconstructed object. Having constructors be unnamed, though, seems to be a relic inherited from C++ or possibly its ancestor. This is a problem because the only way to differentiate constructor calls is by the list of parameter types they take, yet it is often desirable to construct an object in different ways even though the types of the input parameters are the same.
- In my proposed solution, you would create static methods on the class which return an instance. Language syntax aids would allow you to specify that the static method was to be treated as a constructor. This would give you the implied “this” context, and could also cause the generation of constructor metadata for compatibility with C#. In order to use classes created in C#, the nameless constructors would be visible in Jargon as static methods named “new”.
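The closest thing in plain C# is the static factory method pattern, sketched below. Both factories take a single string, which two unnamed constructors could not do. `Customer`, `FromName`, and `FromEmail` are hypothetical names, and this shows only the workaround, not the proposed Jargon syntax:

```csharp
using System;

sealed class Customer
{
    public string Name { get; private set; }
    public string Email { get; private set; }

    // The unnamed constructor is hidden; callers must use a named factory.
    private Customer() { }

    // Two "constructors" with identical parameter types, told apart by name.
    public static Customer FromName(string name)
    {
        return new Customer { Name = name };
    }

    public static Customer FromEmail(string email)
    {
        return new Customer { Email = email };
    }
}

static class Program
{
    static void Main()
    {
        Customer a = Customer.FromName("Ada");
        Customer b = Customer.FromEmail("ada@example.com");
        Console.WriteLine(a.Name);  // prints Ada
        Console.WriteLine(b.Email); // prints ada@example.com
    }
}
```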
- Verbose delegate instantiation code
- For example to start a thread, I have to pass in a delegate of type ThreadStart. To do this I have to use a “new” construct to instantiate a delegate of type ThreadStart and pass to that a method name. Alternatively, I could just create an anonymous delegate which allows me to do so without any superfluous “new” construct.
- I propose that the anonymous delegate ability be kept, and that you also be able to simply pass a function name as a parameter where a delegate is required and the compiler will take care of emitting the “new” construct. (i.e. if the parameter type is ThreadStart and you pass in a method of type void(), then it creates a delegate of type ThreadStart from the void() method.)
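The three forms can be put side by side as follows. Note that the third form, a bare method name, is accepted by C# compilers from version 2.0 onward through method group conversion, so at least this part of the proposal appears to be available already. `Work` is a placeholder method for the sketch:

```csharp
using System;
using System.Threading;

static class Program
{
    static void Work()
    {
        Console.WriteLine("working");
    }

    static void Main()
    {
        // 1. Explicit delegate instantiation with "new".
        var explicitForm = new Thread(new ThreadStart(Work));

        // 2. Anonymous delegate.
        var anonymousForm = new Thread(delegate() { Work(); });

        // 3. Bare method name; the compiler creates the ThreadStart delegate.
        var methodGroupForm = new Thread(Work);

        foreach (var thread in new[] { explicitForm, anonymousForm, methodGroupForm })
        {
            thread.Start();
            thread.Join();
        }
    }
}
```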
- Weak reference support
- Weak references already exist in .NET via a general-purpose library class (System.WeakReference).
- I propose that support for weak references be added as a feature of the syntax.
- Cannot return a reference to a field declaration
- Take a look at the following example:
    struct SomeStruct
    {
        public int PublicField;
    }

    class ExampleClass
    {
        private SomeStruct _privateField;

        public SomeStruct PublicProperty
        {
            get { return _privateField; }
            set { _privateField = value; }
        }
    }

    class Program
    {
        public static void Example(ExampleClass obj)
        {
            obj.PublicProperty.PublicField = 1; // this can't be done
        }
    }

The above can’t be done because the property getter returns a copy of the structure; you cannot return the structure as a reference. I see no reason to be unable to do this, however, except that returning references to variables would break if you tried to pass back a reference to a local variable, or some other variable in unmanaged storage. But this could be done if the compiler simply forbade the illegal usage.
- The other problem might be that “reference” to a type is not recognized as a type, itself. So there would be no way to define the actual signature of the function in metadata. If the function were private, however, this wouldn’t be an issue. Plus, it might be possible if the compiler automatically rewrote the construct and calls to it to function as an out or ref parameter instead of a return value.
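For reference, C# 7.0 and later support ref returns, which behave much as described: the compiler rejects attempts to return a reference to a local variable. A minimal sketch reusing the names from the example above (`Current` is an extra helper added here only to show the result):

```csharp
using System;

struct SomeStruct
{
    public int PublicField;
}

class ExampleClass
{
    private SomeStruct _privateField;

    // Returns a reference to the field instead of a copy of its value.
    public ref SomeStruct PublicProperty
    {
        get { return ref _privateField; }
    }

    // Helper for the demonstration only.
    public int Current
    {
        get { return _privateField.PublicField; }
    }
}

static class Program
{
    static void Main()
    {
        var obj = new ExampleClass();
        obj.PublicProperty.PublicField = 1; // modifies the field in place
        Console.WriteLine(obj.Current);     // prints 1
    }
}
```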
- Unable to specify different implementations in generics based on whether the type is a reference type vs. a value type.
- This is unfortunate because there are several cases where the fundamental implementation would change based on that criterion. You can constrain a generic to either value types or reference types, but you cannot currently give the two constrained versions the same name.
- Interfaces and enumerations can’t act as namespaces for other items
- So it is impossible for interfaces (or enumerations) to have nested classes, or to have static functions, but I see no good reason why they shouldn’t. In fact, I see no reason why enumerations shouldn’t be able to have instance methods where “this” is simply a reference to the value. It might even be possible to add instance methods on interfaces where “this” would be of the interface type, as long as these methods were only referenced explicitly (i.e. would not be implemented implicitly in the implementing classes). Interfaces could still have no member variables, but implementing one might bring along some added functionality without causing multiple inheritance problems.
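Extension methods (C# 3.0 and later) give a rough approximation of this: a static method whose first parameter is marked `this` can be called as if it were an instance method on an enumeration or an interface, without the implementing classes having to supply it. The types and method names below are hypothetical:

```csharp
using System;

enum Color { Red, Green, Blue }

interface IShape
{
    int Area { get; }
}

static class Extensions
{
    // Reads like an instance method on the enumeration; "color" plays the role of "this".
    public static string ToHex(this Color color)
    {
        switch (color)
        {
            case Color.Red: return "#FF0000";
            case Color.Green: return "#00FF00";
            default: return "#0000FF";
        }
    }

    // Reads like an instance method on the interface; implementers do not write it.
    public static string Describe(this IShape shape)
    {
        return "shape with area " + shape.Area;
    }
}

sealed class Square : IShape
{
    public int Area { get { return 4; } }
}

static class Program
{
    static void Main()
    {
        Console.WriteLine(Color.Red.ToHex());      // prints #FF0000
        Console.WriteLine(new Square().Describe()); // prints shape with area 4
    }
}
```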
- No automation of field-wrapping by a property
- It is so common to take a private field and make a public property out of it, that I see no reason why it shouldn’t be possible to simply specify that any particular field should have an automatic property emitted to wrap it.
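C# 3.0 later added auto-implemented properties, which cover the simplest version of this. The sketch below shows the manual form next to the automatic one; `Person` and its members are illustrative names:

```csharp
using System;

class Person
{
    // Manual wrapping: a private field plus a public property around it.
    private string _name;
    public string Name
    {
        get { return _name; }
        set { _name = value; }
    }

    // Auto-implemented property: the compiler generates the hidden backing field.
    public int Age { get; set; }
}

static class Program
{
    static void Main()
    {
        var p = new Person { Name = "Ada", Age = 36 };
        Console.WriteLine(p.Name + " " + p.Age); // prints Ada 36
    }
}
```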
- No novel operators
- C# provides a specific set of operators, and lets you overload many of them. (Note: this could be its own discussion, but quickly, I’ll say that operator overloading is a necessity in order to allow reasonable string concatenation or to implement math libraries. This isn’t without its cons, but I know of no way around them.)
- What C# doesn’t let you do is define your own operators. So, for instance, if I want to do a case-insensitive compare on a string, I have to use a clumsy Equals() method. This is a common need, so an operator would make it much easier. But “==” is already used for case-sensitive comparison, so it would be nice to define something like “%==” to mean case-insensitive compare (and even “%>=” and “%<=” for sorting).
- My proposition is to cordon off a set of characters that can be used in operators (e.g. & | + - * . ~ $ % ^ / ? = > < !) and set up the lexer to recognize any group of these characters as an operator.
- It is more complicated than that, however, because in order to automatically assign variable types (without superfluous declarations, see “Automatic Static Typing” in the notes section) the compiler (and really, also the IDE, for good visual feedback) needs to be able to differentiate assignments from expressions. (Note: assignments are expressions in some languages, but not in Jargon.) If you imagine an operator definition as a function definition (as it is in most languages), then this can be achieved by declaring the left parameter as “output” to mark it as an assignment.
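The lexer rule can be sketched as below: any maximal run of characters from the reserved set becomes one operator token. This is an assumption-laden toy, not a Jargon lexer; the character set is the example list from the text, and the handling of identifiers and other characters is invented for the sketch.

```csharp
using System;
using System.Collections.Generic;

static class OperatorLexer
{
    // Reserved operator characters, taken from the example list in the text.
    const string OperatorChars = "&|+-*.~$%^/?=><!";

    public static List<string> Tokenize(string source)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < source.Length)
        {
            char c = source[i];
            if (char.IsWhiteSpace(c)) { i++; continue; }

            int start = i;
            if (OperatorChars.IndexOf(c) >= 0)
            {
                // Any group of operator characters forms a single operator.
                while (i < source.Length && OperatorChars.IndexOf(source[i]) >= 0) i++;
            }
            else if (char.IsLetterOrDigit(c) || c == '_')
            {
                // Identifiers and numbers: a run of letters, digits, or underscores.
                while (i < source.Length && (char.IsLetterOrDigit(source[i]) || source[i] == '_')) i++;
            }
            else
            {
                i++; // anything else is a single-character token
            }
            tokens.Add(source.Substring(start, i - start));
        }
        return tokens;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" | ", Tokenize("name %== other")));
        // prints: name | %== | other
    }
}
```

One consequence of the "any group" rule is that unspaced input such as `x=-1` would lex `=-` as a single operator, so a rule like this leans on operators being separated by whitespace.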
- No ability to customize compiler operation
- There is no way to customize the compilation process with your own code. I would like it if a compiler (and probably IDE also) would recognize certain attributes as being custom implementations of compiler functions or precompilers. In fact things like field-wrapping could actually be accomplished this way.
- Need a sort of “plugin” class
- I want to be able to implement functionality in one class that is to be imported into other classes. This is normally done by deriving from the class, and I want it to work that way, but that isn’t always possible. The most common example I run into is with things like ListViewItem. You can store your own custom class in a ListViewItemCollection if you derive from ListViewItem. But you can’t derive from ListViewItem if your custom class is already derived from another class which doesn’t derive from ListViewItem. It might make sense then to implement ListViewItem as an interface rather than a class, but ListViewItem needs member variables for its operation, which interfaces can’t have. I propose a sort of multiple inheritance much like the explicit implementation of interfaces. It might work by automatically adding a member field to the implementing class and then emitting code to work with it.
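Written out by hand, the code such a feature might generate could look roughly like the following: an interface, a helper class that carries the state, a hidden field in the implementing class, and explicit interface members that forward to it. All names here (`ISelectable`, `SelectableState`, `Document`, `Invoice`) are hypothetical.

```csharp
using System;

// The "plugin" contract.
interface ISelectable
{
    bool Selected { get; set; }
    void Toggle();
}

// The plugin's implementation, including the state an interface cannot hold.
sealed class SelectableState : ISelectable
{
    public bool Selected { get; set; }
    public void Toggle() { Selected = !Selected; }
}

class Document
{
    public string Title { get; set; }
}

// Already derives from Document, so it cannot also derive from a selectable base class.
class Invoice : Document, ISelectable
{
    // The member field the compiler might add automatically.
    private readonly SelectableState _selectable = new SelectableState();

    // The forwarding code the compiler might emit (explicit implementation).
    bool ISelectable.Selected
    {
        get { return _selectable.Selected; }
        set { _selectable.Selected = value; }
    }

    void ISelectable.Toggle() { _selectable.Toggle(); }
}

static class Program
{
    static void Main()
    {
        ISelectable item = new Invoice();
        item.Toggle();
        Console.WriteLine(item.Selected); // prints True
    }
}
```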
- No “true” templates
- Generics are not C++ templates, and in some ways this is good. On the other hand there are several shortcomings which make it quite annoying and force the developer into duplicating code.
- I would like to see the creation of a sort of meta-class with templates that work more like C++’s in that the classes are expanded at compile time into concrete classes. The meta classes would be necessarily “internal” and only their concrete instances would be exportable (as they will be true classes).
- Poor internal support for flags
- An attribute can be supplied to specify that an Enum is meant to be treated as flags, but a lot of what could be done automatically because of this attribute is not (such as automatic picking of the values, and automatic testing for flags)
- Note that for really good support of flags, simple incrementing of values by a power of two is not enough because often there are “masked” parts of the enumeration which should actually operate as sub-enumerations. A good language should be able to nest enumerations to achieve an effect like this.
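To make the manual work concrete, here is how a flags enumeration with a “masked” sub-enumeration is typically written in C# today. Every value, the mask, and both tests are spelled out by hand. `Style` and its members are invented for the sketch:

```csharp
using System;

[Flags]
enum Style
{
    None = 0,

    // Independent single-bit flags; the powers of two are picked manually.
    Bold = 1 << 0,
    Italic = 1 << 1,

    // Bits 2 and 3 form a sub-enumeration: exactly one alignment at a time.
    AlignMask = 3 << 2,
    AlignLeft = 0 << 2,
    AlignCenter = 1 << 2,
    AlignRight = 2 << 2,
}

static class Program
{
    static void Main()
    {
        Style s = Style.Bold | Style.AlignRight;

        // Testing a single-bit flag requires a manual mask-and-compare.
        Console.WriteLine((s & Style.Bold) != 0); // prints True

        // Reading the sub-enumeration requires masking first.
        Console.WriteLine((s & Style.AlignMask) == Style.AlignRight); // prints True
    }
}
```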
Other Notes and Ideas
- Syntax extensibility
- I alluded above to the desire for the ability to extend the compilation process with custom code, much like you can already create custom attributes that other programs which use reflection can make use of. I want to do something similar but extend it beyond the compiler to also make parsing extensible so that you could actually embed other syntax directly in your code (provided you had code elsewhere capable of parsing that syntax). This would make it possible to provide inline data-access-friendly syntax later on, or even in the short term to make inlining SQL very clean and at the same time, understandable by the IDE.
- Some of these extensions I expect to be included in a default installation of the Jargon language. A couple examples would be inline C# and inline IL assembly. So I would expect it to be possible to implement a function in inline assembly, or simply copy a function implemented in C# directly into your Jargon file and have the IDE still recognize the syntax, and the compiler still be able to compile it.
- Superfluous definition
- Java enforces the convention that a public class be found in a file of the same name. There is no such rule for file-scope classes (which can only be seen from within the file anyway). I’m in favor of enforcing this convention also, but extending it to every class, enumeration, interface, and struct (everything under the namespace level), and additionally enforcing the convention that folders/directories match namespaces. Namespace declarations, and even class name declarations, become superfluous. Definitions for classes, structs, and interfaces then need to specify no name, only whether they are a class, struct, interface, etc.
- Note: This may seem strange at first for delegates which have such a concise definition, but there’s no great loss in making every delegate its own tiny file. Any clumsiness this may seem to cause should be overcome in the IDE.
- Note: For case-mangling operating systems, such as Windows 95/98/Me, this simply won’t work, but I’m favoring sacrificing compilability on these platforms.
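Under this convention, a type's full name can be derived from its location alone. A minimal sketch of that mapping, assuming a path given relative to the source root and a made-up `.jargon` file extension:

```csharp
using System;
using System.IO;

static class Program
{
    // Maps a source-root-relative path to a namespace-qualified type name.
    static string TypeNameFromPath(string relativePath)
    {
        // Passing null removes the extension, including the dot.
        string withoutExtension = Path.ChangeExtension(relativePath, null);

        // Directories become namespace segments; the file name becomes the type name.
        return withoutExtension.Replace('/', '.').Replace('\\', '.');
    }

    static void Main()
    {
        Console.WriteLine(TypeNameFromPath("Acme/Billing/Invoice.jargon"));
        // prints Acme.Billing.Invoice
    }
}
```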
- Superfluous/formatting syntax
- Curly braces and semicolons
- With curly braces and semicolons, it is possible to construct a C# program on one single line. This would appear to be the most worthless feature of the language. Its pros are nil and its cons are: 1) it allows every programmer to invent his own convention for formatting which, as it turns out, is more unique than a fingerprint, and 2) it simply requires extra typing (and with curly braces, requires the combination of fairly obscure keys and a shift modifier).
- Whitespace
- This is distinct from indentation which I cover separately below
- Sometimes used to align things into a sort of ad-hoc columns
- This is fairly rare, and not particularly useful, so the ability to do this doesn’t appear to be much of a pro. The cons are 1) that it makes later editing more annoying (you need to adjust spaces to realign, or worse, sometimes adjust hundreds of lines if one “column” needs to become wider), and 2) that it makes the use of variable-width fonts impossible.
- Note: my preference would be to have the IDE try to align things based on rules created by the users, but if I were to implement anything like manual alignment, I would make the use of a tab character be required (as opposed to space) and would make every tab act as a column break rather than act as a skip-to-next-stop.
- Also sometimes it is just used inconsistently.
- Consider the following example:
    if ( someValue==someFunction( param1, param2 ) )
    if(someValue == someFunction(param1, param2))
    if (someValue == someFunction(param1,param2))
    if(someValue==someFunction(param1,param2))
    if ( someValue== someFunction( param1, param2) )

These all mean the same thing and merely reflect different programming styles; every developer wants to do it differently, and most don’t even follow their own style consistently. The last one is an exaggerated example of how obscure this can become.
- Also consider the following example:
amount*value;
Does this mean multiply “amount” by “value”, or does it mean declare a “value” variable of type “amount*”? One needs to scan around for context to see which is intended. For instance it could have been the second line of this:
    int total =
        amount*value;

That requires you to find the line above to make sense of the line below.
- I propose making whitespace significant and thereby enforcing a format.
- This is better in that it simply makes all developers get used to reading in the same format. Strict formatting like this also makes text based searches within code much more useful.
- Besides indentation (see indentation below), in the normal statement part of the syntax there are two places, so far, where whitespace will be required, and no other place where it will be allowed. These are:
- A single space on either side of a binary operator
- A single space following a comma separating items in a list
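As a rough illustration of the two rules, the sketch below strips all whitespace from a line and then puts back a single space around `==` and after each comma. It is deliberately naive: it knows only one operator and would mangle string literals and keywords that need separating, so it shows the target format rather than a real formatter.

```csharp
using System;
using System.Linq;

static class Program
{
    // Naive canonical form: no whitespace except around "==" and after commas.
    static string Normalize(string line)
    {
        string compact = new string(line.Where(c => !char.IsWhiteSpace(c)).ToArray());
        return compact.Replace("==", " == ").Replace(",", ", ");
    }

    static void Main()
    {
        string[] variants =
        {
            "if ( someValue==someFunction( param1, param2 ) )",
            "if(someValue == someFunction(param1, param2))",
            "if (someValue == someFunction(param1,param2))",
            "if ( someValue== someFunction( param1, param2) )",
        };

        // Every variant prints: if(someValue == someFunction(param1, param2))
        foreach (string variant in variants)
        {
            Console.WriteLine(Normalize(variant));
        }
    }
}
```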
- Indentation
- Since braces and indentation serve identical purposes in nearly all conventions in use, there’s no need for both, so I propose discarding block delimiters like braces and using significant indentation instead (like Python). I also suggest that tabs be required for indentation, and that blocks be differentiated from their parents by exactly one tab. I propose this at least by default. I can see the possibility that this could be relaxed with some kind of option that makes indentation interpretation work like Python’s, but if that is allowed at all, I want to make it the exception rather than the rule.
- Also, because there are cases where tabs just don’t work right, I propose that in lieu of tabs, spaces can be used, but that if spaces are used at all, the indentation MUST be 4 spaces per tab, and must be the only method of indentation used throughout the whole file.
- For example, tabs will not be rendered correctly in HTML, but HTML is often used to display code. In such a case, using 4-space replacements is permissible. If a developer then cuts and pastes from HTML into the document, the IDE is expected to replace each group of four indentation spaces with a tab automatically.
- Also if spaces are used in a file for some reason, the IDE should treat each group of four spaces as a single tab character during cursor movements, etc.
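The paste-time conversion could be as simple as the sketch below, which turns each group of four leading spaces into one tab and rejects anything that is not a multiple of four. The method name and the choice to throw on bad input are assumptions made for the example:

```csharp
using System;

static class Program
{
    // Converts 4-space indentation to tabs, one tab per group of four spaces.
    static string SpacesToTabs(string line)
    {
        int spaces = 0;
        while (spaces < line.Length && line[spaces] == ' ') spaces++;

        if (spaces % 4 != 0)
        {
            throw new FormatException("Indentation must be a multiple of 4 spaces.");
        }

        return new string('\t', spaces / 4) + line.Substring(spaces);
    }

    static void Main()
    {
        string converted = SpacesToTabs("        x = 1");

        // Make the tabs visible for the demonstration.
        Console.WriteLine(converted.Replace("\t", "<TAB>")); // prints <TAB><TAB>x = 1
    }
}
```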
- Automatic static typing
- C# uses static typing which means that a variable’s type is determined when you declare it and cannot change. C#’s static typing is also manual which means that the developer must manually specify what the type of the variable is. Recently C# also introduced the “var” keyword which alleviates some of the redundancy of manually specifying type by allowing type to be implied by assignment.
- Python uses dynamic typing, which means that a variable can be any type at any time. To be fair, C# is capable of manual dynamic typing to a large degree by declaring variables as “object”. With Python, it is possible for any occurrence of a variable to potentially be any type at runtime. However, this makes IDE features like context-sensitive code suggestions (“intellisense”) much more difficult to implement and much less effective when implemented.
- I propose a hybrid of these two in which variables have static type as in C#, but their types are determined automatically (like with C#’s var keyword, only without the need for the keyword). As in Python, they would automatically be declared in the scope where they’re first used. Their type, as in C#, is deterministic, but unlike with C#’s var keyword, the type is calculated by examining all uses of the variable and not simply its initial usage.
- Note that the type assigned to the variable will be the first common base of all the candidate types derived from its usage, and without the IDE it may not be obvious at first glance what type a variable will be. Additionally, the scope or initialization of a variable might also be non-obvious (as is the case with Python). The IDE should find a way to present both the type and the scope of the variable.
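The “first common base” rule can be sketched with reflection: walk up the base-class chain of one candidate until reaching a type the other candidate can be assigned to, then fold that over all candidates. This is only an illustration of the rule for classes; it ignores interfaces, and the stream types are just convenient stand-ins for types gathered from a variable's uses.

```csharp
using System;
using System.IO;
using System.Linq;

static class Program
{
    // Nearest class in a's base chain that b can also be assigned to.
    static Type CommonBase(Type a, Type b)
    {
        for (Type t = a; t != null; t = t.BaseType)
        {
            if (t.IsAssignableFrom(b)) return t;
        }
        return typeof(object);
    }

    static void Main()
    {
        // Candidate types, as if collected from three assignments to one variable.
        Type[] candidates =
        {
            typeof(MemoryStream),
            typeof(FileStream),
            typeof(BufferedStream),
        };

        Type inferred = candidates.Aggregate(CommonBase);
        Console.WriteLine(inferred); // prints System.IO.Stream
    }
}
```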
- Duck typing
- I don’t yet see a way to build good support for a sort of duck typing. At this time I have no plans to include something like it in my draft, but I’m mentioning it here because I don’t want to lose sight of the possibility of adding it elegantly. It may be possible to add it using some sort of extensibility later.
- Assignments As Expressions
- In C++ descendants such as C# and Java, assignments are expressions that return the assigned value. This allows some code shortcuts such as assignment chaining (e.g. x = y = z = 1) and while loop shortcutting (e.g. while (null != (line = reader.ReadLine())) )
- Since other syntactic constructs could be provided which synthesize the same benefit, and since assignments as expressions makes for harder to read code, Jargon assignments will not be expressions.
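For a sense of what is given up, here is the read loop written both ways in C#. The second form keeps the assignment as a plain statement, at the cost of an explicit break. It is one possible rewrite, not a proposed Jargon construct.

```csharp
using System;
using System.IO;

static class Program
{
    static void Main()
    {
        // C# style: the assignment is an expression inside the loop condition.
        using (var reader = new StringReader("one\ntwo"))
        {
            string line;
            while (null != (line = reader.ReadLine()))
            {
                Console.WriteLine(line);
            }
        }

        // The same loop with the assignment kept as a statement.
        using (var reader = new StringReader("one\ntwo"))
        {
            while (true)
            {
                string line = reader.ReadLine();
                if (line == null) break;
                Console.WriteLine(line);
            }
        }
    }
}
```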
- Ternary Operator
- The conditional ternary operator is useful in that it works like an if but operates as an expression, so it can make for shorter code and even remove duplication: compare if (b) { x = y; } else { x = z; } with x = b ? y : z. The latter doesn’t repeat the “x = ”. This can be both useful and abused; when abused it is much harder to read.
- I propose an end to the ternary operator, but only if a way can be developed to avoid the code duplication that comes from not using it.
Notes
- I am going to use the C# syntax for statements/expressions as the basis for Jargon’s syntax minus things I’ve already noted I’d change (e.g. no braces, no semicolons, strict whitespace, etc). I have not yet decided on the syntax for metadata (classes/structs/function definitions etc).