Some thoughts on designing a computer language.

Designing a computer language is an interesting exercise.

Remember first, the target of a computer language is a microprocessor or a microcontroller. And microprocessors or microcontrollers are stupid: they only understand integers, memory addresses (which are just like integers; think of memory as organized as an array of bytes), and if you’re lucky, floating point numbers. (And even those are handled much like integers paired with a position for the binary point, with the whole thing stored, of course, as a pattern of bits just like an integer.)

Because of that, most modern computer languages rely on a run-time library. Even C, which is as close to writing binary code for microprocessors as most of us will ever get, relies on a run-time library to handle certain abstract constructs the microprocessor can’t. For example, a ‘long’ integer in C is generally assumed to be at least 32-bits wide–but if you’re on a processor that only understands 16-bit integers, any 32-bit operation on a long integer must be handled with a subroutine call into a run-time library. And heck, some microcontrollers don’t even know how to multiply numbers, which means a * b has to translate internally into __multiply(a,b).
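To make the run-time library idea concrete, here is a minimal sketch in C of what a helper like __multiply(a,b) might do on a processor with no multiply instruction. The name soft_multiply and the choice of 16-bit operands are my own assumptions for illustration; a real compiler's helper will differ in name and detail.

```c
#include <stdint.h>

/* Hypothetical run-time helper: multiply using only shifts and adds,
   the way a library routine might on a CPU without a multiply instruction. */
uint16_t soft_multiply(uint16_t a, uint16_t b)
{
    uint16_t product = 0;

    while (b != 0) {
        if (b & 1u) {
            /* The lowest bit of b is set, so add the current multiple of a. */
            product = (uint16_t)(product + a);
        }
        a = (uint16_t)(a << 1);  /* a doubles on every round */
        b >>= 1;                 /* move on to the next bit of b */
    }
    return product;              /* wraps modulo 2^16, like a 16-bit register */
}
```

The compiler would then quietly turn a * b into a call such as soft_multiply(a, b).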

For most general-purpose programming languages (like C, C#, C++, Java, Objective-C, Swift, Ada and the like), the question becomes “procedural programming” or “object-oriented programming.” That is, which paradigm will you support: procedures (like C), or objects (like Java)?

Further, how will you handle strings? How will you handle text like “Hello world?” Remember: your microprocessor only handles integers–not strings. And under the hood, every string is basically just an array of integers: “Hello world?” is stored in ASCII as the array of numbers [ 72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 63 ], either marked somewhere with a length, or terminated with an end of string marker, 0.
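The two storage schemes can be sketched in C as follows. The names hello, terminated_length and counted_string are made up for this illustration, and the byte values assume ASCII.

```c
#include <stddef.h>

/* The bytes of "Hello world?" in ASCII, followed by the 0 end-of-string marker. */
static const unsigned char hello[] = {
    72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 63, 0
};

/* Scheme 1: terminated string. Walk the array until the 0 marker is found. */
size_t terminated_length(const unsigned char *text)
{
    size_t count = 0;
    while (text[count] != 0) {
        count++;
    }
    return count;
}

/* Scheme 2: the length is stored next to the bytes, so no marker is needed. */
struct counted_string {
    size_t length;
    const unsigned char *bytes;
};

size_t counted_length(struct counted_string text)
{
    return text.length;
}
```

The trade-off is roughly this: the terminated form costs one extra byte and a scan to find the length, while the counted form knows its length immediately but has to carry it around.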

In C, a string is simply an array of bytes. The same in C++, though C++ provides the std::string class which helps manage the array. In Java, all strings are translated internally into an array of bytes which is then immediately wrapped into a java.lang.String object. (It’s why in Java you can write:

"Hello world?".length()

since the construct “Hello world?” is turned into an object.) Objective-C turns the string declaration @”Hello world?” into an NSString, and Swift into a String type, which is related to NSString.

Declarations also become interesting. In C, C++ and Objective-C, you have headers, which force your language to provide a mechanism for representing external linkage. Those three languages also provide a mechanism for representing abstract types, meaning for every variable declaration:

int *a;

which represents the variable a that points to an integer, you must be able to write:

int *

which represents the abstraction of a variable which points to an integer.

And for every function:

int foo(int a, int *b, char c[5]) {...}

You need:

extern int foo(int, int *, char[5]);

But Java does not provide headers, so it has less need for header declarations–but then adds the need to mark methods as “public”, “protected” or “private” so we know the scope of methods and variables which can be hidden in C by simply omitting the declaration from the header.

This means Java’s type declaration system can be far simpler than C’s.

And while we’re at it, what types are you going to allow? Most languages have integer types, floating point types, structure or object types (which basically represent a record containing multiple different internal values), array types, and pointer or reference types. But even here there are differences:

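C allows the use of unsigned values, like unsigned integers. Java, however, does not. But really, addition, subtraction and multiplication produce the same bits either way; the operations where signed and unsigned integers actually differ are right shifts, comparisons, and division (with its remainder). Java works around the first with the unsigned right shift (‘>>>’) operator, and since Java 8 covers the others with library methods such as Integer.compareUnsigned and Integer.divideUnsigned.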
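A small sketch in C shows where the same 32 bits behave differently depending on whether you read them as signed or unsigned. The function names are hypothetical, and the shift count is assumed to be between 0 and 31.

```c
#include <stdint.h>

/* Shift right and fill with zeros, which is what Java's >>> does on an int. */
uint32_t logical_shift_right(int32_t value, unsigned count)
{
    return (uint32_t)value >> count;   /* count must be 0..31 */
}

/* Compare two 32-bit patterns as unsigned numbers. */
int unsigned_less_than(int32_t a, int32_t b)
{
    return (uint32_t)a < (uint32_t)b;
}

/* Compare the same two patterns as signed numbers. */
int signed_less_than(int32_t a, int32_t b)
{
    return a < b;
}
```

With these, -1 is less than 1 when compared as signed, but not when compared as unsigned, since the bit pattern of -1 reads as the largest 32-bit unsigned value.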

C also represents arrays as simply a chunk of memory; C is very low level this way. But Java represents arrays as a distinct fundamental type, alongside basic types (like integers or floating point values) and objects.
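To see how low level C's arrays are, consider this sketch (sum_ints is a made-up name). The function never receives an "array" at all, just the address of the first element, and the length has to travel separately.

```c
#include <stddef.h>

/* An array argument arrives as a pointer to its first element;
   the function has no way to ask the array how long it is. */
int sum_ints(const int *values, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; i++) {
        total += *(values + i);   /* identical in meaning to values[i] */
    }
    return total;
}
```

Because it is only an address, passing data + 2 instead of data simply starts the sum two elements in. A Java array, by contrast, carries its own length and checks every index.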

And pointers or references can be explicit or implicit: C++ makes this explicit by requiring you to indicate in a function if an object or structure is passed by value (that is, the entire object is copied onto the stack), or by reference (that is, a pointer is passed on the stack). This makes a difference because updating an object passed by value has no effect on the caller. But when passed by reference, changes to the object can affect the caller’s copy–since there really is only one copy in memory.

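Java, on the other hand, passes objects and arrays by reference, always. (Strictly speaking, the reference itself is copied onto the stack, while primitive types like int are passed by value.)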

This passing by reference makes the ‘const’ keyword (or its equivalent) very important: it can forbid the function being called from modifying the object passed to it by reference.
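Here is a rough sketch of the three cases in C, where a pointer stands in for a reference since plain C has no reference types. The struct and function names are invented for the example.

```c
struct point {
    int x;
    int y;
};

/* By value: the whole struct is copied, so the caller never sees this change. */
void move_copy(struct point p)
{
    p.x += 10;
}

/* By pointer: only an address is passed, so the caller's object is updated. */
void move_in_place(struct point *p)
{
    p->x += 10;
}

/* const pointer: reading is allowed, but writing p->x here would not compile. */
int read_x(const struct point *p)
{
    return p->x;
}
```

Calling move_copy leaves the caller's point alone, move_in_place changes it, and read_x promises in its signature that it will not.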

On the flip side, Java does not have the concept of a ‘pointer’.

And let’s consider for(...) loops. The C language introduces the three-part for construct:

for (initializer; comparator; incrementer) statement

which translates into:

        initializer
loop:   if (!comparator) goto exit;
        statement
        incrementer
        goto loop;
exit:
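The translation above can be checked with real C. In this sketch (function names are mine), both functions add up the numbers 1 through n, one using for and one using the goto form spelled out by hand.

```c
/* The loop as you would normally write it. */
int sum_with_for(int n)
{
    int total = 0;
    for (int i = 1; i <= n; i++)
        total += i;
    return total;
}

/* The same loop, translated by hand into tests and jumps. */
int sum_with_goto(int n)
{
    int total = 0;
    int i = 1;                     /* initializer */
loop:
    if (!(i <= n)) goto done;      /* comparator */
    total += i;                    /* statement */
    i++;                           /* incrementer */
    goto loop;
done:
    return total;
}
```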

But Java and Objective-C also introduce different for loop constructs, such as Java’s

for (type variable: container) statement

which iterates the variable across the contents of the container. Internally it is implemented using Java’s Iterator interface, and the for loop above translates as:

        Iterator<type> iterator = container.iterator();
loop:   if (!iterator.hasNext()) goto exit;
        type variable = iterator.next();
        statement
        goto loop;
exit:

Of course this assumes container implements the Iterable interface. (Pro-tip: If you want to create a custom class which can be used as the container in a for loop, implement the Iterable interface.)
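The has-next / next protocol is not specific to Java. As a sketch of what the language is doing for you, here is the same idea written by hand in C over a plain int array. Everything here (int_iterator and its functions) is hypothetical and only mirrors the shape of Java's Iterator.

```c
#include <stddef.h>

/* Hypothetical iterator over an int array, mirroring hasNext() and next(). */
struct int_iterator {
    const int *items;
    size_t count;
    size_t position;
};

struct int_iterator int_iterator_begin(const int *items, size_t count)
{
    struct int_iterator it = { items, count, 0 };
    return it;
}

int int_iterator_has_next(const struct int_iterator *it)
{
    return it->position < it->count;
}

int int_iterator_next(struct int_iterator *it)
{
    return it->items[it->position++];
}

/* The for-each loop, written out as the has-next / next protocol. */
int sum_all(const int *items, size_t count)
{
    int total = 0;
    struct int_iterator it = int_iterator_begin(items, count);

    while (int_iterator_has_next(&it)) {
        int value = int_iterator_next(&it);
        total += value;
    }
    return total;
}
```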

While we’re at it, if your language is object oriented, do you allow multiple inheritance, like C++ where an object can be the child of two or more parent objects? Or do you implement an “interface” or “protocol” (which specifies methods that are required to be implemented but provides no code), and have single inheritance, where objects can have no more than one parent object but can have one or more interfaces, such as in Java or Objective-C?

Do you make exceptions a first-class citizen of your language, such as in Java or C++? Or is it a library, such as C’s setjmp/longjmp calls? Or is it even available? Early versions of Pascal did not provide for exception handling: instead, you must either explicitly handle problems yourself, or you must check to make sure that things don’t go haywire: that you don’t divide by zero, for example.
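For a feel of the library approach, here is a minimal sketch of exception-like handling built on C's setjmp/longjmp. The function names and the single global jmp_buf are simplifications of my own; real code would need to worry about nesting and cleanup.

```c
#include <setjmp.h>

/* Where to jump back to when something goes wrong. */
static jmp_buf on_error;

/* "Throws" by jumping back to the saved point instead of dividing by zero. */
static int checked_divide(int a, int b)
{
    if (b == 0) {
        longjmp(on_error, 1);
    }
    return a / b;
}

/* Returns 1 and stores the quotient on success, 0 if the "exception" fired. */
int try_divide(int a, int b, int *result)
{
    if (setjmp(on_error) != 0) {
        return 0;                 /* we arrived here through longjmp: the "catch" */
    }
    *result = checked_divide(a, b);   /* the "try" body */
    return 1;
}
```

The setjmp call plays the role of try and catch at once: it returns 0 when first called, and returns again with a nonzero value when longjmp is invoked.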

And we haven’t even dived into more advanced features. We’ve just stuck with the stuff that most general purpose languages implement. Ada has built-in support for parallel processing by making threads and synchronization part of the language. (Languages like C or Swift require a library, generally based on POSIX Threads, for parallel processing, though the availability of multi-threaded programming in those languages is optional.)

Other languages have built-in handling of mathematical vectors and matrices, or of string comparison and regular expressions. Some languages (like Java or LISP) provide support for lambda functions. And other languages combine domain-specific features with general purpose computing, such as PHP, which allows general-purpose programs to be written but is designed for web pages.

Pushing farther afield, we have languages such as Prolog, a declarative language which defines the formal logic rules of a program without declaring the control flow to execute the rules.

(Prolog defines the relationships between a collection of rules, and performs a search through the rules in response to a query. Such a language is useful if we wish to, for example, provide a list of conditions that may be symptoms of a disease; a Prolog query would then list the symptoms, and after execution provide a list of diseases which correspond to those symptoms.)

But let’s ignore stuff like this for now, since my interest here is either procedural or object-oriented programming. (One could consider object-oriented programming as basically procedural programming performed on objects.)

The design of a programming language is quite interesting.

And how you answer questions like these (and other questions that may come up) really determines the simplicity of learning versus the expressive power of the language. Sadly, expressive power can become confusing and harm learning: just look at the initial promise of Swift as an easy and painless language to learn. That promise has since been retracted, since Swift is neither a stable language (Swift 1 does not look like Swift 4) nor a simple one. Things like the type safety constructs ? (optional) or ! (forced) are hard to understand, since they rely on the concept of “pointers” and the safety (or lack thereof) of dealing with null pointers (that is, pointers to memory address 0, which typically means “not initialized” or “undefined”).

Or just look at how confusing the C type system can become to a beginner. I mean, it’s easy for a beginner to understand:

int foo[5];

That’s an array of 5 integers.

But what about:

char *(*(**foo[][8])())[];

What the hell???

Often you find C programmers avoiding the “expressive power” of C by using typedefs instead; declaring each component of the above as an individual type.
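Read from the inside out, that declaration says foo is an array of arrays of 8 pointers to pointers to functions returning pointers to arrays of pointers to char. Here is one way the typedef approach might look. The type names, get_names and first_name are all invented for the sketch, and I have written (void) where the original has an empty parameter list.

```c
/* Array of pointers to char, size left open. */
typedef char *StringArray[];

/* Function taking no arguments and returning a pointer to such an array. */
typedef StringArray *ArrayFn(void);

/* Pointer to a pointer to such a function. */
typedef ArrayFn **FnHandle;

/* The original one-line declaration (with (void) in place of the empty ()). */
extern char *(*(**foo[][8])(void))[];

static char *names[] = { "alpha", "beta" };
static StringArray *get_names(void) { return &names; }
static ArrayFn *get_names_ptr = get_names;

/* The same object defined through the typedefs. The compiler accepts this
   next to the extern line only because both spell the same type. */
FnHandle foo[][8] = { { &get_names_ptr } };

/* Unwinds every layer: index twice, dereference twice, call, dereference, index. */
char *first_name(void)
{
    return (*(**foo[0][0])())[0];
}
```

Each typedef names one layer, so a reader can take the type apart one step at a time instead of all at once.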

It is in large part C’s “expressive power” (combined with its terse syntax) that allows contests like the International Obfuscated C Code Contest to exist: notice we don’t see an “obfuscated Java contest”.

Behold, a runner up in that contest.

But at least it isn’t APL, a language once described to me as a “write-only programming language” because of how hard it is to read, making use of special symbols rarely found on older computers:

(~R∊R∘.×R)/R←1↓⍳R

This is the Wikipedia example of an APL program which finds all prime numbers from 1 to R.

No, I have no clue how it works, or what the squiggly marks mean.

Simplicity, it seems to me, forgoes expressive power. Java, for example, cannot express the idea of an array of pointers to functions returning pointers to arrays, since Java does not have the concept of a pointer to a function (that’s handled by the reflection API), nor does Java have the concept of pointers. Further, Java does not permit the declaration of complex anonymous structures: first, everything is a class; and second, classes are either explicitly named or implicitly named as part of an anonymous declaration. It’s hard to declare something like the following C++ declaration; you’re forced to break down each component into its own declaration.

struct Thing {
    struct {
        int x;
        int y;
    } loc;
    struct {
        int w;
        int h;
    } size;
};

And it’s just as well; this makes more sense if you were to write:

struct Point {
    int x;
    int y;
};

struct Size {
    int w;
    int h;
};

struct Thing {
    Point loc;
    Size size;
};

It becomes clear that “Thing” is a rectangle with a location and a size.

But then, people often complain that Java requires a lot more typing to express the same concept.

It’s a balance. It’s what makes all this so fascinating.

Development Chaos Theory

An on-going conversation about stuff I find interesting
