Tuesday, March 26, 2013

General Approach to Writing Assembly Code

Based on my previous plan for three layers of laziness, I've come up with the following general approach to writing the instructions for a function body.

First, by the time we are ready to write the function body code, we should have parsed the expression returned by the function into an expression tree, the nodes of which are function calls of some kind, and the branches of which are parameters to those function calls.

We will process the branches first in such a way that, once processing is finished, the result of each parameter's evaluation sits on the stack in order.  When all of the branches are finished, we'll write the function call itself.

Prior to beginning to write the assembly code, we'll check the tree for duplicate function calls--calls where all of the same parameters are passed to the same function.  Usually these will be unevaluated expressions passed in as parameters, or known to the function as environment variables.  For each duplicate we will allocate two local variables, one to store the result of the function evaluation, and the second to store a flag indicating whether the function has already been evaluated.

If we think that the result of the function is to be used later, we'll set the corresponding flag, pop it off of the stack after evaluation, store it as a local variable, then push it back onto the stack.  Later, rather than reevaluating the function call, we'll simply push the local variable onto the stack.

At the end, of course, well write a ret (return) operation.

Down the road, it might be worthwhile to consider including function calls inline, since there is essentially no overhead except for the larger size of the executable.

Saturday, March 23, 2013

Three Layers of Laziness

In the last post I walked through the CIL disassembly for a simple function.  I discovered that one of my mechanisms intended to implement laziness was really causing more work than was necessary.

As a brief recap, the problem is this:  In a standard programming language, expressions that are parameters to functions are evaluated before being passed to the function, regardless of whether they are ultimately needed within the function.

I was hoping to get around this by passing in a reference to the expression, and only evaluate it if necessary.  However, in my sample function I used one of my parameters three times, and the C# compiler ended up evaluating the expression three times.

This makes perfect sense in the standard programming world, because the function might not return the same value each time.  However, if the function is pure, the expression only needs to be evaluated once.

In the end, I hope I can implement three layers of laziness: 1) Within a function, let each expression be evaluated only if it is needed;  2) If an expression is evaluated, let is only be evaluated once, and let its value be stored in a local variable if it is needed more than once; 3) Let a thunk remember its last evaluated value, and return it directly rather than reevaluating.

Having said that, here is a rough sketch of what the CIL for the AddAndMultiply function should have looked like.  (I'm not 100% certain how to calculate the maxstack value yet...working on that).

.maxstack 7
.locals init (
  int32 V_0,
  int32 V_1)
// Get the value of x
IL_0000:  ldarg.1
IL_0001:  callvirt instance !0 class ToylModel20130322.ISignature`1<int32>::Value()
// Store x in a local variable
IL_0006:  stloc.0
// If x >= 0 then goto IL_0019
IL_0007:  ldloc.0
IL_0008:  ldc.i4.0
IL_0009:  bge IL_0019
// Throw exception
IL_000e:  ldstr "x must not be less than zero"
IL_0013:  newobj instance void class [mscorlib]System.Exception::'.ctor'(string)
IL_0018:  throw
// Get the value of y
IL_0019:  ldarg.2
IL_001a:  callvirt instance !0 class ToylModel20130322.ISignature`1<int32>::Value()
// y is still on the stack, push x back on and add the two together
IL_0021:  ldloc.0
IL_0022:  add
// Push x on the stack again and multiply
IL_0023:  ldloc.0
IL_0024:  mul
// Return the result
IL_0025:  ret

Fig. 12.1: Ideal CIL code
One geeky observation that not many people would make:  The syntax of assembly language is very much like the syntax of Altaic languages--the verb always comes last.

First Leisurely Walk through CIL Code

For this project I have been alternating between Microsoft Visual Studio 2010 on a Windows 7 PC and Xamarin Studio on a Mac.  The CIL code I'm looking at in this project was compiled on the Mac using mono and disassembled using monodis, so it may be a little different from what the Microsoft csc and ildasm would produce.

Right now, I'm just going to look at the AddAndMultiply function, and see how lazy evaluation was handled.  Here is the CIL code:
.maxstack 11
IL_0000:  ldarg.1
IL_0001:  callvirt instance !0 class ToylModel20130322.ISignature`1<int32>::Value()
IL_0006:  ldc.i4.0
IL_0007:  bge IL_0017
IL_000c:  ldstr "x must not be less than zero"
IL_0011:  newobj instance void class [mscorlib]System.Exception::'.ctor'(string)
IL_0016:  throw  
IL_0017:  ldarg.2
IL_0018:  callvirt instance !0 class ToylModel20130322.ISignature`1<int32>::Value()
IL_001d:  ldarg.1
IL_001e:  callvirt instance !0 class ToylModel20130322.ISignature`1<int32>::Value()
IL_0023:  add
IL_0024:  ldarg.1
IL_0025:  callvirt instance !0 class ToylModel20130322.ISignature`1<int32>::Value()
IL_002a:  mul
IL_002b:  ret

Fig. 11.1: CIL code for AddAndMultiply
Let's just walk through this.

ldarg.1
This loads the function's first argument (ISignature<int> x) onto the stack.

Stack now has: [x]

callvirt instance !0 class ToylModel20130322.ISignature`1<int32>::Value()
This calls the Value() method of the ISignature<int32> interface.  This will pop the x off of the stack and push the result of the evaluation back on the stack.  This is how we implemented lazy evaluation.

Stack now has: [x.Value]

ldc.i4.0
This pushes the constant value 0 onto the stack.

Stack now has: [x.Value, 0]

bge IL_0017
This takes the last two values from the stack.  If the next-to-top value is greater than or equal to the top value (i.e. if 13 >= 0) then it will branch to the operation at IL_0017.  This pops both values off the stack.

Stack now has: []

The next few steps would be followed if we had passed in an x that evaluated to a negative value.

ldstr "x must not be less than zero"
This loads the exception message string onto the stack.

Stack now has: ["x must not be less than zero"]

newobj instance void class [mscorlib]System.Exception::'.ctor'(string)
This creates a new instance of the Exception class.  The constructor takes one parameter, so it pops the string off the stack.  The new exception is pushed back onto the stack.

Stack now has: [Exception]

throw
This pops the exception off the stack and throws it.  Execution ends here.

The following steps would be executed for non-negative values of x.

ldarg.2
callvirt instance !0 class ToylModel20130322.ISignature`1<int32>::Value()
This loads the function's second argument (ISignature<int> y) onto the stack and calls the Value() method.

Stack now has: [y.Value]

ldarg.1
callvirt instance !0 class ToylModel20130322.ISignature`1<int32>::Value()
This again loads the function's first argument (ISignature<int> x) onto the stack and calls the Value() method.

Stack now has: [y.Value, x.Value]

add
This adds the top two values on the stack and pushes the result back on.

Stack now has: [(y.Value + x.Value)]

ldarg.1
callvirt instance !0 class ToylModel20130322.ISignature`1<int32>::Value()
For the third time, this loads the function's first argument (ISignature<int> x) onto the stack and calls the Value() method.

Stack now has: [(y.Value + x.Value), x.Value]

mul
This multiplies the top two values on the stack and pushes the result back on.

Stack now has: [(y.Value + x.Value) * x.Value]

ret
This returns the value on the top of the stack.

The main lesson I learned from looking at this is that I ended up calling the Value() method four times, when I only really needed it twice.  It probably would have been better if the C# code looked like this:
int _x = x.Value ();
if (_x < 0) {
  throw new Exception("x must not be less than zero");
} else {
  int _y = y.Value();
  return (_y + _x) * _x;
}

Fig. 11.2: More efficient C# code
This code only calls Value() once and stores the result locally.

Friday, March 22, 2013

Lessons Learned

A preliminary C# expression for the Toyl model in Fig. 10.1 is now posted on github.

I rewrote this sucker about five times before I landed on the revision currently posted.  What did I learn?

First, I found ways to use delegates to represent lambda expressions instead of creating separate classes for each one.  That made the code a lot smaller and a lot tidier.

Second, in the case of currying and creating higher order functions, I found a way to define "delegate factories" that would generate the delegates I needed.

I'm actually pretty happy with the resulting code.  In the next post, I will disassemble it and see what it looks like in CIL.

Walking through Compilation

I have to admit, now that I am at this point, I am a little uncomfortable with the idea that every lambda expression will be turned into a class.  It seems like I will end up with a lot of tiny classes, and that seems wasteful somehow.

However, until I see real evidence that it is a problem, I'm not going to worry about it. I can think of a couple of different ways that I might reduce the number of classes.

I'm going to try my current model out by creating a model Toyl program, then working out how it would be expressed in C#.  Then, I'll compile the C# program and look at how it is expressed in CIL.

So here's the model program.  It includes a curried function, a higher order function, and lazy evaluation.  This would compile into a DLL, because there is no impure code here to interact with the sullied world.
namespace ToylModel20130322 {
    AddAndMultiply = int function(int x, int y) {
      x < 0: Exception("x must not be less than zero");
      otherwise: (y + x) * x;
    };

    C13 = 13;

    AddCAndMultiplyByC = AddAndMultiply(C13);

    ApplyTwice = <T> function(<T>function(<T>) f, <T> v) {
      f(f(v))
    };

    TestFunction = int function() {
      ApplyTwice(AddCAndMultiplyByC, 72)
    };
}

Fig. 10.1: A model Toyl program
There is something going on here that I haven't explicitly mentioned.  I am using the colon (:) and semicolon (;) as a ternary operator meaning "if...then...else". It makes the "case" blocks within function bodies look nice, but it might come back to haunt me since I am also using the semicolon as the declaration delimiter. Right now I think the two operators should exist in different domains within the code, but we'll see if I run into trouble as the process goes on.

The full C# code will probably be pretty long, so I'll work on it throughout the day and post it on github, and the next post will have my lessons learned.

Thursday, March 21, 2013

Syntax Outline

The basic unit of a Toyl program will be an expression, which may be a constant value, lambda expression, or else involve function calls and operators.  The full description of expression syntax covers almost everything else I currently envision in Toyl:
expression ::= { reference | constant | evaluable | lambda }

  reference ::= identifier
  evaluable ::= { function-call | operator-expression } [where-phrase]
    function-call ::= function-name "(" parameters ")"
      function-name ::= identifier
      parameters ::= expression ["," parameters]
    operator-expression ::= { unary-operation | binary-operation | trinary-operation }
      unary-operation ::= operator expression
      binary-operation ::= expression operator expression
      ternary-operation ::= expression operator expression operator expression
    where-phrase ::= "where" "{" declarations "}"
      declarations ::= declaration [";" declarations]
        declaration ::= identifier "=" expression
  lambda ::= return-type function-type "(" parameter-declarations ")" "{" expression "}"
    function-type ::= "function" | "process"
    parameter-declarations ::= parameter-declaration ["," parameter-declarations]
      parameter-declaration ::= type identifier

Fig. 9.1: Expression syntax
Identifiers may be organized in namespaces, which (like the where-phrase above) are just collections of identifiers:
namespace ::= "namespace" "{" declarations "}"
Fig. 9.2: Namespace syntax

This looks like a good start.  In the next post, I'll try to walk through the process of translating Toyl expressions into the underlying object structures and see where my ideas break.

Wednesday, March 20, 2013

Language Design

In this post, I'll start to lay out how Toyl syntax will work.

The main purpose served by programming language syntax is to unambiguously describe the structure of the program.  In addition to this main purpose, programming languages have "syntactic salt" that helps to prevent unintentional errors, and "syntactic sugar" to make code easier to read and write.

In my experience, functional programming languages tend to follow different syntactic traditions than imperative languages (which usually fall into the Pascal or C traditions).  Part of this is a cultural matter, but part of it is practical as well.  Let us take, for example, a function that returns the nth Fibonacci number:  It could easily be written in C-like syntax as follows:
int fibonacci(int n) {
  return
    (
    n <= 1
    ? n
    : fibonacci(n - 2) + fibonacci(n - 1)
    );
}

Fig. 8.1: C-like syntax
In Lisp, it could be written like this:

(defun fibonacci (n)
  (if (< n 2)
    n
    (+ (fibonacci (- n 1)) (fibonacci (- n 2)))))

Fig. 8.2: Lisp syntax
In Haskell, one way to write it would be:
fibonacci :: Integer -> Integer
fibonacci n
    | n == 0 = 0
    | n == 1 = 1
    | n > 1 = fibonacci (n-1) + fibonacci (n-2)

Fig. 8.3: Haskell syntax
None of these is functionally superior to the others.  The C-like syntax is more "salty" than the others because it requires parentheses to enclose the function parameters, commas to delimit them, and a semicolon to end the declaration.  This is extra work for the programmer, but the payoff is that it gives the compiler more opportunities to identify programmer mistakes in the code.

The Lisp syntax is a little less salty than the C-like syntax, but it is still clear and unambiguous to the compiler.  However, both the Lisp and C-like syntax allow the indentation to "lie" about the structure of the code, because the indentation is optional.  In a large program written in either language the indentation could easily become out of sync with the actual structure of the program and conceal a flaw in the logic.

In Haskell, the indentation is syntactically important, and conveys the same information as the block-enclosing tactics of Lisp and C.  When I first encountered a language that treated indentation as meaningful, it made me a little uncomfortable because I had been accustomed to whitespace being ignored by the compiler, and so free for the programmer to use as a form of documentation.  In particular, I had worked a lot in Unix where some programmers might use tabs and others might use spaces for indentation, and different editors assign different relative sizes to the tab as compared to the space.

However, I love indentation as documentation of program structure. I think I will design a syntax for Toyl that will work with or without indentation, but let the compiler warn, by default, if the indentation is lying about the program structure. Here is a preliminary prototype of what the Fibonacci function might look like in Toyl:
fibonacci = int function (int n) {
    n < 0: Exception("n may not be negative");
    n < 2: n;
    otherwise: recurse(n - 1) + recurse(n - 2);
}

Fig. 8.4: Preliminary Toyl prototype
In the next post I'll outline the basics of Toyl syntax using Backus-Naur form.

Tuesday, March 19, 2013

Where Pure Meets Impure

Lazy evaluation within Toyl functions is great, because Toyl will be built to expect it.  However, any time we interact with a library written in a standard imperative language, or with the standard .NET architecture, we will need to know how to correctly cross the gap between the Toyl world and the real world.

There are three sides to this:  First, from a practical perspective, we won't be able to pass ISignature<T> objects into standard .NET methods that expect T objects, so we'll have to evaluate any value before we pass it into a standard method.

Trickier, though, is the question of side-effects.  I will eventually implement some functionality that allows a thunk (a parameterless closure) to remember its evaluated value, which should not change for pure functions.  However, if a method has side-effects in the world, we might need to call it again, rather than remembering the result we obtained from calling it previously.  Likewise, if a method draws information from the world, we might need to call it again to get updated information.

One example of the former type of method is Console.WriteLine, which writes a line to the console.  If we lazily refuse to evaluate it a second time with a particular string, then we will be unable to ever print the same string to the console twice in the same session!

An example of the latter type of method is Console.ReadLine, which reads a line from the console.  This takes no parameters, so our aggressively lazy system would by default assume that anything the user typed at the console was the same thing that the user previously typed.

There are several ways to skin this cat.  The simplest approach would be to treat all external methods as being impure until they have been declared pure within Toyl somewhere.  I would then implement a rule that says: "Any Toyl function that relies on an impure function is itself impure."

Another approach, which would be interesting from a theoretical perspective, would be to treat impure functions the same as pure functions, but pass in something that represents the changing external universe.  No two values of the Universe parameter would be equal, so no two function calls would be considered to operate on the same inputs.  So, we would have Console.ReadLine(Universe) and Console.WriteLine(Universe).  The Universe value would not be globally available--it would have to be passed in to any function that needs it--so any Toyl function that needs it needs to take it as a parameter itself.

That's kind of a tidy solution, because it makes it explicit in the code which functions are pure and which are impure.  However, neither this solution nor the assignment of purity to functions mandates a particular order of evaluation, which can be important when there are side-effects involved.  This is a serious question, since I previously said that I was going to implement aggressive laziness by allowing the system to decide which order to evaluate expressions in.

As an example of the type of issue I am talking about, consider a program that needs to ask a user a question using Console.WriteLine(), then get the user's response using Console.ReadLine().  All things being equal, there is nothing that says the question needs to be evaluated before the response is retrieved, but really it needs to happen in that order.

Continuing the Universe model, what we might want could be something like this:
(Universe, string) PromptUser(Universe U0, string question) {
  Universe U1 = Console.WriteLine(U0, question);
  (Universe U2, string answer) = Console.ReadLine(U1);
  return (U2, answer);
}

Fig. 7.1: Universes in, and universes out
In this case, since we change the universe by asking the question, we want to pass that changed universe into the Console.ReadLine().  There is only one order that this expression can be evaluated in, and that is to ask the question first, then get the answer.

The difficulty here is that our Console.ReadLine() itself should really return two values--it should return the line read from the console, as well as the new universe.  We can handle that in a number of ways, but for now, in pseudo-code, I'll just write two outputs to the left of the assignment operator.

But hopefully we will not need C#-like pseudocode much longer.  In my next post, I'll start talking about what Toyl will look like, and we can start using Toyl code instead.

Lazy Evaluation

The basic idea of lazy evaluation is that the machine doesn't do any work unless we know it needs to.  Modern imperative programming languages almost always have a certain amount of lazy evaluation in expressions, but if you give the machine a list of instructions in an imperative program, the machine will execute them all, regardless of whether they are needed for the final result.

Part of the reason for this is that the compiler can't hope to keep track of all of the side-effects of imperative commands and determine whether a command is necessary.  In functional programming, however, there is the idea of "pure" functions, which have no side effects, so we can say categorically that they need not be evaluated unless we know we need their results.

When we say that a pure function has no side-effects, we not only mean that it does not impact the outside world, but also that it does not draw input from the outside world.  In other words, the result of a pure function depends on nothing other than the content of the function and its inputs.

We will implement aggressive laziness in Toyl by letting the system choose which expressions to evaluate, and allowing it to evaluate them in whatever order is necessary.  Not only that, but within the body of the function, we will evaluate bound variables and parameters only when (or if) we need them.

Going back to our sample closure (Fig. 4.1), and continuing to develop this idea using C# syntax, we can make the environment variables z and k into attributes whose underlying expressions are not evaluated until they are needed, as follows:
class MyFunction {
    // Environment Part
    public ISignature<int32> z;
    public int32 z {
        get {
            return this._z.Value();
        }
        set {}
    }

    public ISignature<int32> k;
    public int32 k {
        get {
            return this._k.Value();
        }
        set {}
    }


    ...
}

Fig. 6.1: A closure with lazy environment variables
Doing this with the parameters x and y is not as slick in C# syntax, so here is one area where we will begin to depart from what we can easily express in C#.  To implement lazy evaluation of the parameters, we would need syntax like this:
// Control Part
public DateTime Value(ISignature<DateTime> x, ISignature<int32> y) {
    return x.Value().AddYears(int.Parse(y.Value().ToString()) + z * k);
}

Fig. 6.2: Lazy evaluation of parameters
In the next post, I'll address how to handle a mixed environment where not all functions can be pure.

Monday, March 18, 2013

Function Signatures

If this is going to be a statically typed language, then that should apply to functions when they are passed as arguments to other functions.  If we are going to pass a function to something resembling Haskell's map, then the compiler should verify that the function we are passing takes two parameters, and they match the types of other arguments appropriately.

My first inclination is to do that with interfaces: Let each lambda expression that shares a particular signature implement an interface unique to that signature.  With some additional class definitions, we ought to be able to define a single set of intermediate curried states for all functions that share the same signature, since the control part exists only at the end of the curry chain.

And lastly, if we can use something like the generic type parameters in C#, we can boil down our separate definitions to one for every number of parameters, as follows:
interface ISignature<T1, T2, TReturn> {
    TReturn Value(T1 p1, T2 p2);
    ISignature<T2, TReturn> Curry(T1 p1);
}

class Curried<T1, T2, TReturn> : ISignature<T1, T2, TReturn> {
    ...
}

class MyFunction : ISignature<DateTime, int32, DateTime> {
    ...
}

Fig. 5.1: An interface representing a function signature and some matching lambda expressions as classes

I think this approach, in general, will work.  In the next post I'll attack lazy evaluation.

Currying

In the last post, I worked out how the C# compiler represents lambda expressions as classes.  Naturally, then, curried functions should be represented as classes, too.  There are several different ways to handle this.  The method sketched out in pseudo-code here is only one of them, where I have tried to limit redundancy and memory at the possible expense of performance.  The reason I am doing it this way is because I have a motto: "Make it first, then make it fast".  Right now I can see the memory and redundancy tradeoff, but I don't even know how much performance I am buying by trying to optimize. Later, if I encounter performance problems, then I can come back and evaluate how much memory is worth how much speed.

The approach here is to declare the uncurried function as a class like the one we saw in Fig. 3.1, then to declare a chain of classes to represent the stages of currying.

// This is the uncurried function
class MyFunction {
    // Environment Part
    public int32 z;
    public int32 k;

    // Control Part
    public DateTime Value(DateTime x, int32 y) {
        return x.AddYears(int.Parse(y.ToString()) + z * k);
    }

    // Currier
    public MyFunctionCurried1 Curry(DateTime x) {
        return new MyFunctionCurried1(this, x);
    }
}

// This is the function with one parameter curried
class MyFunctionCurried1 {
    private MyFunction _curriedFrom;
    private DateTime x;

    public MyFunctionCurried1(MyFunction curriedFrom, DateTime x) {
        this._curriedFrom = curriedFrom;
        this.x = x;
    }

    // Control Part
    public DateTime Value(int32 y) {
        return this._curriedFrom.Value(x, y);
    }

    // Currier
    public MyFunctionCurried2 Curry(int32 y) {
        return new MyFunctionCurried2(this, y);
    }
}

// This is the function with both parameters curried
class MyFunctionCurried2 {
    private MyFunction _curriedFrom;
    private int32 y;

    public MyFunctionCurried2(MyFunctionCurried1 curriedFrom, int32 y) {
        this._curriedFrom = curriedFrom;
        this.y = y;
    }

    // Control Part
    public DateTime Value() {
        return this._curriedFrom.Value(y);
    }
}

Fig. 4.1: A curry chain (C#-like pseudocode)

So far, so good.  In the next post I'll introduce function signatures as interfaces, so I can pass functions (and their curried variants) around to higher order functions.

Disassembled Expression

For this post, I took the sample function body from Fig. 1.3 and compiled it using a C# 2010 compiler (version 4.0.30319.1), then disassembled the executable into IL.

The result was pretty interesting, and not too different from what I imagine happens in a modern compilable functional language.

The most interesting thing, to me, is that the lambda expression was turned into a class (with the automatically generated name <>c__DisplayClass1).  Translating from CIL into C#-like pseudo-code, the class looks something like this:
class <>c__DisplayClass1 {
    public int32 z;
    public int32 k;

    public DateTime <Main>b__0(DateTime x, int32 y) {
        return x.AddYears(int.Parse(y.ToString()) + z * k);
    }
}

Fig. 3.1: Pseudo-code representation of an internally generated class for a C# lambda expression
This class basically represents the lambda expression.  The environment part is represented by the public members z and k, and the method <Main>b__0 represents the control part.  To invoke this lambda expression, first the environment part needs to be filled out by creating an instance of the class, then setting the values in the environment part (creating a closure).  The closure could then be passed around and eventually evaluated by calling the method <Main>b__0 with the parameters x and y.

So, for me, the next question is going to be this:  If we generate classes like this to represent lambda expressions, how will we curry functions, and how will we create higher order functions?  Those are likely to be the topics of the next post.

Expression Structure

Expressions are trees.  Well, they can be networks if you've removed redundancies, but let's start out as seeing them as trees.  Each node in the expression tree is either a constant node (representing a fixed value) or a function node (representing a function call).  Function nodes have parameters, which are the branches of the tree, and parameters point at other expression nodes.

In the previous post, we were working on creating a closure.  The expression at the heart of the closure was this:
x.AddYears(int.Parse(y.ToString()) + z * k)
Fig. 2.1: The control part of a lambda expression
The expression tree that we can build out of this is like the following:
Expression {
    Function: DateTime.AddYears(int);
    Parameters {
        [0]: x;
        [1]:Expression {
            Function: int.+(int);
            Parameters {
                [0]: Expression {
                    Function: int.Parse(string);
                    Parameters {
                        [0]: null; /* Static Method */
                        [1]: Expression {
                            Function: string.ToString();
                            Parameters {
                                [0]: y;
                            }
                        }
                    }
                }
                [1]: Expression {
                    Function: int.*(int);
                    Parameters {
                        [0]: z;
                        [1]: k;
                    }
                }
            }
        }
    }
}
Fig. 2.2: The tree structure of the expression
Note that we have two types of open variables here.  The variables x and y are parameters, and we will require them to have values before the expression can be evaluated.  The variables z and k are drawn from the lexical environment, and by pointing to their own expression trees, we can effectively insert them directly into the tree.

In the next post, I will write the C# code to encapsulate an expression node, and see how that disassembles into CIL.

Closure

I'm going to start with creating a closure.  Landin [1] defined a closure as a lambda expression whose open variables have been closed by lexical environment, creating a closed expression.  So, let's start with a lambda expression as represented in C#:

(x, y) => x.AddYears(int.Parse(y.ToString()) + z * k)
Fig. 1.1: A lambda expression in C#

There is a lot we don't know for certain about this lambda expression.  We can infer that x is a DateTime, but that isn't necessarily the case, and y could be almost anything.  This lambda expression at least needs type declarations for these variables, so let's provide them:

Func<DateTime, int, DateTime> f =
    (x, y) => x.AddYears(int.Parse(y.ToString()) + z * k);

Fig. 1.2: A lambda expression in C#, partially disambiguated by its environment

Here, we have implicitly defined the types of x and y by assigning the lambda expression to a delegate whose type has been made explicit.  Now, we have a lambda expression with two open variables, z and k.  In order to create a closure, we need to close the variables z and k using the lexical environment of the expression.

Suppose the expression appears in the following environment:

DateTime g() {
    int z = DateTime.Now.Year;
    int k = 13;
    Func<DateTime, int, DateTime> f =
        (x,y) => x.AddYears(int.Parse(y.ToString()) + z * k);
}

Fig. 1.3: A lamba expression in its lexical environment

In this environment, we now have all of the information we need to populate the closure.  In Landin's terms, a closure has an "environment part" and a "control part".  Using those terms, our closure record (for the lambda expression only) could look like this:

Environment Part {
    Symbol {
        Name: z;
        Type: int;
        Value: DateTime.Now.Year;
    }
    Symbol {
        Name: k;
        Type: int;
        Value: 13;
    }
}
Control Part {
    Parameter {
        Name: x;
        Type: DateTime;
    }
    Parameter {
        Name: y;
        Type: string;
    }
    Value Expression: x.AddYears(int.Parse(y.ToString()) + z * k);
}
Fig. 1.4: A preliminary mockup of a closure record
In the next post, I'll unpack the structure of the value expression.

[1] P. J. Landin (1964), The mechanical evaluation of expressions

Methodology (and Madness)

I don't have a CS degree, and I've never taken a class in compiler design, so this experiment ought to be pretty interesting.  I'm going to create a functional programming language (called Toyl, for "Toy Language"), as well as the compiler needed to compile a Toyl program into a .NET executable.

I will do this by first developing C# code that is equivalent to units of Toyl code, then I will disassemble the C# code, and figure out how to translate the Toyl code into equivalent IL assembly code.

So there you have it.  Laugh if you want to, weep if you pity the fool that attempts to do something like this.  It ought to be fun.