miaoYu

miaoYu正在翻译

How to implement a programming language 6(Wrapping up)

原文链接: lisperator.net

Wrapping up

In the interest of actually finishing this book howto, I moved forward and added a few (non-essential, but good-to-have) features and fixes to the λanguage without describing their implementation here. Download the full code below.

Download lambda.js

The program is runnable with NodeJS. It reads the program in λanguage from STDIN and compiles and executes it. It produces the compiled result at STDERR and the program output at STDOUT. Example usage:

cat primitives.lambda program.lambda | node lambda.js

The final touches are:

  • Support for the negation operator (!). It's non-essential because it can be easily implemented as a function:
    not = λ(x) if x then false else true;
    not(1 < 0)        # ⇒ true

However, having a dedicated AST node for it allows us to generate more efficient code (a function call implies the GUARD-s and whatnot).

  • I added a js:raw keyword. Using it you can insert arbitrary JavaScript code in the output (only complete expressions, not statements). Example use case is to easily define primitives from the λanguage:
    length = js:raw "function(k, thing){
      k(thing.length);
    }";

    arrayRef = js:raw "function(k, array, index){
      k(array[index]);
    }";

    arraySet = js:raw "function(k, array, index, newValue){
      k(array[index] = newValue);
    }";

    array = js:raw "function(k){
      k([].slice.call(arguments, 1));
    }";

I configured my syntax highlighter to use a reddish color for this keyword, because it's dangerous. Maybe not quite, but you have to know the implications of using it: the code you pass to js:raw will make it untouched and unchecked into the output JS. For instance the optimizer won't be able to see that you're accessing the local variable below, and it'll drop x = 10:

    dumb = λ() {
      let (x = 10) js:raw "x + x";
    };

    dumb = function(β_K1) {
      β_K1(x + x);
    };

That's not a bug. It's supposed to happen.

  • Generate better code for boolean expressions — make_js, as we wrote it, could easily produce output like (a < 0 ? true : false) !== false, but that can obviously be simplified as just a < 0. This probably doesn't add speed, but at least the output code looks less stupid.

  • Fixed the semantics of && and || operators such that false is the only falsy value in our language.

  • It generates var declarations for globals and evaluates the final code under "use strict" (there appears to be a 5-10% speed improvement in strict mode).

Are we even close to a real language?

Believe it or not, λanguage is pretty close to Scheme. I promised you we won't be implementing a Lisp, and I kept my word. But the bigger part of the job is done: we have decent CPS transformer and optimizer. If we wanted to implement Scheme, all that's left is writing a Scheme parser that produces a compatible AST, and a pre-compiler pass to macro-expand. And a bunch of primitives for the core library. That shouldn't be too much work.

TODO

If we'd like to continue working on λanguage, the following should be on the radar.

Variable names

The JavaScript generator leaves variable names as they are, but that's generally not a good idea (not to mention it's a plain bug, since we allow in identifier names characters that JavaScript does not). We at least should prefix globals and replace illegal characters. The prefix I'm thinking about is “λ_”.

Variable arguments lists

Any practical language will need something like JavaScript's arguments. We could easily add some syntax for it, for example:

foo = λ(first, second, rest...) {
  ## rest here is an array with arguments after second
};

but wait, what even _is_ an array in our λanguage? (that's next on the list).

It might seem tempting to think that we can already use the “arguments” name (since we keep the same variable names in JS), but that won't work properly: both to_cps and the optimizer will assume it's a global, possible mess resulting.

To implement the syntax above without sacrificing much code size, we could use the GUARD function. Example output:

foo = function CC(first, second) {
  var rest = GUARD(arguments, CC, 2); // returns arguments starting at index 2
};

Related to this feature, we also need an equivalent to Function.prototype.apply.

Arrays and objects

These are easy to define as primitive functions, as we did with js:raw above. However, implementing them at the syntax level would allow generating more efficient code, as well as giving us a familiar syntax; a[0] = 1 is no doubt nicer than arraySet(a, 0, 1).

One thing I'd like to avoid is ambiguity. For instance, in JavaScript the curly brackets are used both for representing bodies of code, and literal objects. The rule is “when open curly bracket occurs in statement position, then it's a code block; when in expression position, it's an object literal”. But we don't have statements in λanguage (which is actually a feature), and curly brackets denote a sequence of expressions. I'd like to keep it that way, so I wouldn't use the { ... } notation for object literals.

The “dot” notation

Related to the previous one, we should support the dot notation for accessing object properties. That's too ubiquitous to be ignored.

I'd expect it to be somewhat challenging to support “methods” like JavaScript does (i.e. the this keyword). The reason for this is that all our functions are transformed to CPS (and all function calls will insert the continuation as first argument). If we were to support JS-like methods, how could we know that we're calling a function in CPS (i.e. written in λanguage) and not a straight function (i.e. from some JS library)? This requires some thought.

Syntax (separators)

The requirement to separate expressions with semicolons in "prog" nodes is a bit too tight. For example the following is syntactically invalid, according to the current parser rules:

if foo() {
  bar()
}                  # ← error
print("DONE")

The problem is on the marked line. Even though it ends with a closing curly bracket, there should be a semicolon following it (because the if is really an expression, and it ends at that closing bracket). That's not quite intuitive when coming from JavaScript; it might seem preferable to relax the rules and make the semicolon optional after an expression that ends with a curly bracket. But it's tricky because the following is also a valid program:

a = {
  foo();
  bar();
  fib
}
(6);

Result of which would be to call foo(), bar() and then put the result of fib(6) into the variable a. Silly syntax, but you know what, most infix languages suffer from such weirdness, for example the following is syntactically a valid JS program; you get no parse error if you try it, although there will be obviously a run-time error when you call foo():

function foo() {
  a = {
    foo: 1,
    bar: 2,
    baz: 3
  }
  (10);
}

Exceptions

We could provide an exception system on top of reset and shift operators, or other primitives.

Moving on…

Even without the features listed in TODO, our λanguage is pretty powerful and I'll conclude this document with some samples comparing how you'd implement trivial programs in NodeJS versus λanguage. Read on.

© Mihai Bazon 2012 - 2017

Proudly NOT powered by WordPress.

Humbly powered by Common Lisp.