Introduction

Hello, and welcome to part two of this language implementation series. In part one, a language most simple was defined in an attempt to introduce a method to parse source code into an Abstract Syntax Tree.

This article assumes that you have read part one so that we can push on with some more difficult concepts, specifically the introduction of variables, if statements and loops, and binary operators (+, -, *, /). Once again, we will be converting the source code into JavaScript, so that you can try executing it in your browser. In part one, the created language controlled a rather useless JavaScript image effect. In this article, such frivolities are dumped in favour of a straight business problem.

The Sample Program

In one of my previous jobs, it seems we were implementing e-commerce websites every other month. In my opinion, the biggest challenge was always calculating the freight cost of an order, especially when multiple items of different sizes and weights were being sent, mixed in with order-regions to affect price. Just when you thought the calculations were pretty good, one order would come along causing a ridiculous freight cost to be calculated. This meant that the code to calculate costs could be complicated, and changed often. What if instead, the rules for calculating the freight cost could be edited directly by the owner of the site?

In order to write this business-rule engine, we need to understand the domain objects, and decide what data will be available for the language. An Order object has a delivery region, customer type ("normal" or "VIP"), and a list of OrderItems. An OrderItem has a unit price, quantity, and weight:

ObjectDiagram.png

An example of a freight policy could be to charge a fixed rate per KG, where the rate depends on the delivery region. VIP customers get a 30% freight discount. The language FL (Freight-Language) will allow this by having conditional statements, loops, variables, and access to the order data. The following section describes the language in English.

The Language Definition

An FL program is made up of 0 or more statements, and must end by declaring the total cost with the line "Freight cost is x", where x is a number or variable. In the statements, variables can be declared and set (e.g., "set price to 0"), conditional statements can be used (e.g., "if region is "asia" [", followed by 0 or more statements, and ending with "]"), and it is possible to loop through each item in the order (e.g., "loop through order [", then statements where keywords "weight" and "quantity" can be used, followed by "]"). Every statement ends in a semi-colon.

Here is an example:

Set pricePerKG to 2.5;
If region is "asia" [
  set pricePerKG to 2.2;
]
Set totalWeight to 0;
loop through order [
  Set totalWeight to totalWeight + (quantity * weight);
]
Set freightCost to totalWeight * pricePerKG;
if customer is "VIP" [
  set freightCost to freightCost * 0.7;
]
Freight cost is freightCost;

A few of the things which can be achieved in this language is determining the minimum and maximum weights in the order (the < and > operators are also allowed), and nested loops and conditional statements are allowed to any depth. As ever, we need to define this language formally, and a slightly non-standard version of Backus-Naur Form (BNF) will be used as an interim step before defining this language using Irony:

<Program> ::= <StatementList> <FreightDeclaration>
<StatementList> ::= <Statement>*
<Statement> ::= <SetVariable> ";" | <IfStatement> | <OrderLoop> | <Expression> ";"
<SetVariable> ::= "set" <variable> "to" <Expression>
<IfStatement> ::= "if" <Expression> "[" <StatementList> "]"
<OrderLoop> ::= "loop" "through" "order" "[" <StatementList> "]"
<FreightDeclaration> ::= "freight" "cost" "is" <Expression> ";"
<Expression> ::= <number> | <variable> | <string> | 
                 <Expression> <BinaryOperator> <Expression> | "(" <Expression> ")"
<BinaryOperator> ::= "+" | "-" | "*" | "/" | "<" | ">" | "<=" | ">=" | "is"

There are many ways to define a grammar, and often a bit of intuition is used, which means it gets a little easier each time you do it. The following points can be said about this definition:

  • A Statement can be an IfStatement, which can itself contain Statements. It is this recursive declaration which allows the nesting of conditional statements and loops to any level with little difficultly.
  • Boolean expressions (e.g., "i < 4") and number expressions (e.g., "i + 4") are treated in the same way. This has been done to keep this language simple for the purposes of this article; however, it does allow some nonsensical statements (e.g., "if i + 4 then []"). An alternative would be to declare an Expression as a NumberExpression or a BooleanExpression.
  • An Expression contains the recursive definition "(" <Expression> ")", which allows all kinds of complicated expressions, such as: (1 + ((2 * variable) - 3) ) / anotherVariable.
  • A Statement can be just an expression. While perhaps not required in this language, this is a very common feature in languages (e.g., a method call with a return value can be used as an expression or, when not capturing the returned value, a statement).

Defining the Language in Irony

As was illustrated in part one, the transition from BNF to an Irony compiler is not very difficult. It involves creating a class that extends Irony.Compiler.Grammar, and creating a constructor to declare each of the terminals, and then defining the grammar:

// define all the non-terminals
var program = new NonTerminal("program");
var statementList = new NonTerminal("statementList");
var statement = new NonTerminal("statement");
var ifStatement = new NonTerminal("ifStatement");
// etc…
 
// define the grammar
 
//<Program> ::= <StatementList> <FreightDeclaration>
program.Rule = statementList + freightDeclaration;
 
//<StatementList> ::= <Statement>*
statementList.Rule = MakeStarRule(statementList, null, statement);
 
//<Statement> ::= <SetVariable> ";" | <IfStatement>
//                      | <OrderLoop> | <Expression> ";"
statement.Rule = setVariable + ";" | ifStatement | orderLoop | expression + ";";
 
//<SetVariable> ::= "set" <variable> "to" <Expression>
setVariable.Rule = Symbol("set") + variable + "to" + expression;
 
//<IfStatement> ::= "if" <Expression> "[" <StatementList> "]"
ifStatement.Rule = Symbol("if") + expression + "[" + statementList + "]";

For the sake of brevity, quite a few pieces were left out in that code snippet. While you should see the attached source code for the full version, I would like to point cast your attention to the following points:

  • All variables (including the globals "region", "customer", "weight", and "quantity") are declared as instances of the Irony-included IdentifierTerminal class. This covers most alphanumeric variable names, and provides a few customisation options. You can also define completely new variables if, for example, your language dictated that each variable name was enclosed in '$' signs. The keywords of the language which could be misinterpreted as variable names (e.g., "loop", "if", etc.) are specified on the instance of the terminal.
  • The actual values to the region and customer terminals can ("asia", "vip", etc.) use the Irony-provided StringLiteral class, which provides many options for defining how strings are defined in your language (for example, instead of enclosing in quotes, you may wish to enclose all strings in backslashes).

Parsing the Source Code

The source code can be parsed into an Abstract Syntax Tree with the following code:

FLGrammar grammar = new FLGrammar();
LanguageCompiler compiler = new LanguageCompiler(grammar);
AstNode program = compiler.Parse("the source code as a string");

Assuming the source code is valid, program is now the root node of a tree which can be traversed using the ChildNodes property. In order to visualise this, cast your eyes over the following simple program, which sets the freight cost to $5, unless the delivery is for Africa, in which case a $1 discount is given:

Set total to 5;
If region is "Africa" [
Set total to 4;
]
Freight cost is total;

Because we are ignoring irrelevant terminals such as "[" and "]" by defining those terminals as punctuation, the above program will be parsed into the following Abstract Syntax Tree:

FreightLanguageTree.png

While that may appear complicated, each part of the diagram is rather simple. You can see that a Program is a StatementList and a FreightDeclaration; the expression of the IfStatement includes a BinaryOperator (the equality keyword "is") to compare two values; etc.

Getting to the point where your language can be parsed into an abstract syntax tree is a major milestone. Now all we need to do is generate code from this thing!

Generating Code

In part one of this series, the language was so simple that we could simply loop through the nodes and extract useful information from the tree. Now that we've got recursive loops and the language is a little more complex, it is going to be much more elegant and maintainable if we create a class for each node in the tree, and have each node in the tree take responsibility for its own little piece of code.

As an example of the type of work that each node needs to do, take a look at the IfStatement node in the diagram above. How can this node generate JavaScript for itself and all of its child nodes? It's quite simple really: it needs to create a string with the text "if (" then let the first child node add its own code to this string, then concatenate ") {", then let the second node, which is a StatementList, to add its code, then finish by adding "}". And how does a StatementList generate code? Well, each child node is a Statement node, so it just needs to loop through each child, and get each Statement to generate its own code. Hopefully, you can see that an IfStatement nested within another IfStatement will automatically generate the correct, nested JavaScript code. Deeply nested FL code can be fed into the compiler, and correct, complex code will be generated from very simple code within each tree node. This, for me, is a true example of elegant code.

So, for each non-terminal, we will create a class which extends Irony.Compiler.AstNode, with properties exposing each of its child nodes:

class IfStatementNode : AstNode {

    // This constructor pattern is required. The base class
    // sets the ChildNodes property, used in the code below.
    public IfStatementNode(AstNodeArgs args)
        : base(args) {}
 
    private ExpressionNode Condition {
        get { return (ExpressionNode)ChildNodes[1]; }
    }
 
    private StatementListNode StatementList {
        get { return (StatementListNode)ChildNodes[2]; }
    }

}

We need each node to have a method to generate code from. One way would be to have a method defined in a base class or an interface which takes in a StringBuilder and writes code to it. An interface will do:

interface IJavascriptGenerator
{
    void GenerateScript(StringBuilder builder);
}

Here is the method to generate code for the 'if' statement:

public void GenerateScript(StringBuilder builder) {
    builder.Append("if (");
    Condition.GenerateScript(builder);
    builder.AppendLine(") {");
    StatementList.GenerateScript(builder);
    builder.AppendLine("}");
}

You must take a little time to think about in which context the generated code will be run in. In this example, JavaScript is being generated, and so it makes sense for a program to be converted to a JavaScript function. The function will have parameters such as region and items that the code can use. The ProgramNode will generate the function declaration, and then get its ChildNodes to add their own code to the function. With that in mind, here is the code generation function for OrderLoopNode:

public void GenerateScript(StringBuilder builder) {
    builder.AppendLine("for (var __i = 0; __i < items.length; __i++) {");
    builder.AppendLine("var weight = items[__i].weight;");
    builder.AppendLine("var quantity = items[__i].quantity;");
    StatementList.GenerateScript(builder);
    builder.AppendLine("}");
}

Just as the C# compiler generates some strange class names behind the scenes, __i is used for the counter in order to not clash with the user's own variable names. The variables global and weight are not garbled as the point of their existence is to allow the user's code to access these values. Note that although the grammar allows nested loops, the generated JavaScript would not work only because you would have re-declarations of the __i, weight, and quantity variables. It would be possible to generate JavaScript in such a way to avoid this problem, however that is outside the scope of this article (note though that the IfStatementNode will successfully create nested JavaScript).

An Expression may contain non-terminals and terminals, and so this must be handled in its own way. It basically checks whether or not the child node(s) implement IJavaScriptGenerator or not. If they don't, then the text is added to the program verbatim. This works because all terminals in FL (the variables, strings, and numbers) are exactly the same in FL and JavaScript. If a string in the source language was different from a string in the target language, then some transformation would need to take place.

So, we've got these classes which should generate code when they are part of the Abstract Syntax Tree, but you may well be wondering how these classes come to be part of the tree in the first place. Well, the answer lies in the non-terminal declarations in the FLGrammar constructor. Here is the IfStatement declaration:

var ifStatement = new NonTerminal("ifStatement");

It turns out we can specify the type of class to use in the tree for this non-terminal by specifying the type in an overloaded NonTerminal constructor:

var ifStatement = new NonTerminal("ifStatement", typeof(IfStatementNode));

For this to work, the specified class must inherit AstNode, and provide a suitable constructor which calls its base constructor:

public SomeNonTerminalNode(AstNodeArgs args)
    : base(args) { }

We can now generate JavaScript code as follows:

var grammar = new FLGrammar();
var compiler = new LanguageCompiler(grammar);
IJavascriptGenerator program =
    (IJavascriptGenerator)compiler.Parse(sourceCode);

// (snipped out the error handling)
StringBuilder js = new StringBuilder();
program.GenerateScript(js);

Assuming valid input, the js variable now contains a JavaScript function that returns the freight cost for a given input. This function is then included in the page and run when necessary. See FLCompiler.js for the full code, including the error handling, and see Default.aspx to see the code that calls the generated function.

You can try writing your own FL programs, seeing the generated JavaScript, and executing that JavaScript on the demo page.

An Exercise for the Reader

As pointed out earlier in the article, even a well thought-out algorithm can have strange results with certain inputs. Assuming that the source code was input in a textbox in some administration area, an excellent idea would be to look at the generated Abstract Syntax Tree and determine which execution paths are possible, and then generate test data on the fly and show examples of freight costs for different orders, in order to give confidence that the algorithm makes sense. For example, if there is a statement such as if region is "asia", then have two sets of input: one for Asia, and one for the rest of the world. While this is outside the scope of this article, it obviously involves traversing the generated tree and looking at each node (this is what the C# compiler is doing when it warns that there is unreachable code, etc.).

Adding this extra level of intelligence could be the difference between a really annoying FL language that is more trouble than it's worth and an amazingly useful tool which can prevent ridiculous business rules seeing the light of day.

Summary

I hope these two articles have shown you that creating a simple language is really quite simple (a complicated language can still be complicated, though). In particular, I hope this article has made the following points:

  • Allowing nested code is achieved by recursively defining non-terminals (e.g., when a non-terminal which is a type of statement that contains child nodes which may contain statements).
  • Giving each type of node the responsibility to generate code for only the area that is concerned with makes the generation of code, even deeply nested, spaghetti code, relatively simple.
  • Irony provides a lot of help, for example, by providing pre-defined terminals such as string literals. There are still many features of Irony that I haven't touched on, and you should always have a look at the Irony source code and forums on CodePlex to see if Irony has already provided the functionality you need.

But wait a minute, I know what you're thinking...

"Actually, this entire article was useless"

Well, you may have a point there. Under what circumstances would you want to execute freight cost calculations in JavaScript? The fact that this was an example for illustration purposes notwithstanding, there are many cases where rather than generating code, you just want to execute it. In other words, you want to write your first interpreter. Oh dear, I sense another article coming on...

Useful Links