A Simple Compiler - Part 6: Control Statements

Introduction

In this installment, we will add control statements to Teeny, thus allowing us to write simple programs that do something interesting.

To whet your appetite just a little, here is a Teeny v0.06 program to compute all the primes between 3 and 100:

  program primes;
  const limit = 100;
  var i, j, prime;
  begin
    i := 3;
    while i <= limit do begin
      j := i div 2;
      prime := 1;
      while (prime <> 0) and (j > 1) do begin
        if i mod j = 0 then
          prime := 0;
        j := j - 1;
      end;
      if prime <> 0 then
        writeln(i, ' is prime');
      i := i + 1;
    end;
  end.

Control Statements

We will add the 'if' and 'while' control statements to Teeny. The 'if' and 'while' statements are ubiquitous, appearing in virtually all mainstream imperative programing languages. As with the rest of Teeny, we will implement them in the same fashion as they are implemented in the language Pascal.

Before we actually add the 'if' and 'while' statements, we need to make another slight change to the grammar. In .05 we had:

  CmpndStmt = 'begin' [Stmt {';' Stmt }] 'end' .

  Stmt      = Writeln
            | ident ':=' Expr
            .

In .06 we need to add CmpndStmt to Stmt, as follows:

  CmpndStmt = 'begin' [Stmt {';' Stmt }] 'end' .

  Stmt      = Writeln
            | ident ':=' Expr
            | CmpndStmt
            .

The stmt routine will now check for 'begin', and if found, call CmpndStmt.

With that out of the way, we can proceed with the business at hand.

The while statement

We will tackle the 'while' statement first, since it is slightly simpler than the 'if' statement.

Examples of the 'while' statement:

 while expr do
     stmt;

 while expr do begin
     stmt;
     stmt;
 end;

As you can see, the 'begin' and 'end' pairs are only needed when there is more than one statement that follows the 'do'.

This is what we will add to our grammar to allow the 'while' statement:

 'while' expr 'do' stmt

This says, if 'while' is found, it must be followed by an expr, followed by the keyword 'do', followed by stmt. And as can be seen from the Stmt grammar above, stmt can be 1 or more statements.

The Stmt production now looks like this:

  Stmt      = Writeln
            | ident ':=' Expr
            | 'while' expr 'do' stmt
            | CmpndStmt
            .

Parsing the 'while' statement turns out to be extremely simple - just follow the grammar rules:

 if accept(whilesym) then
     expr()
     expect(do)
     stmt()
 end

Of course we have left out code generation and any needed semantic checks. More on these later.

The if statement

The 'if' statement is only slightly more complex than the 'while' statement, due to the presence of the optional 'else' statement.

Examples of the 'if' statement:

 if expr then
     stmt;

 if expr then begin
     stmt;
     stmt;
 end;

Examples of the 'if' - 'else' statement:

 if expr then stmt else stmt;

 if expr then begin
     stmt;
     stmt;
 end else begin
     stmt;
     stmt;
 end;

And this is what we will add to our grammar to allow the 'if' statement:

 'if' expr 'then' stmt ['else' stmt]

This says if 'if' is found, if must be followed by an expr, followed by the keyword 'then', followed by stmt. Optionally, an 'else' can occur, which is followed by stmt.

You should note that this rule forces the 'else' to bind to the closest occurring 'if'. This is the accepted standard for all mainstream programming languages.

Here is how Stmt looks with the 'if' statement added:

  Stmt      = Writeln
            | ident ':=' Expr
            | 'while' expr 'do' stmt
            | 'if' expr 'then' stmt ['else' stmt]
            | CmpndStmt
            .

The 'if' statement is a little harder to parse than the 'while' statement, because of the optional part. How would you parse thte 'if' statement? Following the rules of our grammar, we obtain the following for the 'if' statement (Remember that '[]' enclose optional items in our grammar rules):

 if accept(ifsym) then
     expr()
     expect(thensym)
     stmt()
     if accept(elsesym) then
         stmt
     end
 end

As you can see from this, if we have a well formed grammar, parsing really is easy.

Boolean Expressions

Something you may have noticed is the lack of boolean operators in the current version of Teeny. We are going to have to add these in order to be able to use our new 'if' and 'while' statements. The boolean operators we will add are:

We will also need to update our productions for expressions. The new production rules are:

  Expr      = SimpExpr
              [ ('=' | '<>' | '<' | '<=' | '>' | '>=' )
              SimpExpr ] .
  SimpExpr  = Term { ('+' | '-' | 'or') Term } .
  Term      = Factor { ('*' | 'div' | 'mod' | 'and') Factor } .
  Factor    = number
            | ident
            | ('+' | '-') Factor
            | '(' Expr ')'
            .

Precedence

What does precedence mean? Simply put, it has to do with which expression operator will be performed first. Operators with a higher precedence are performed before operators with lower precedence. Ideally, we want our precedence to follow standard algebraic conventions, so that the programmer will encounter no surprises. For instance:

 2 + 3 * 4

The standard rules of algebra cause us to evaluate the multiplication first, and then the addition, since multiplication has higher precedence than addition. A programmer would not be happy if the answer to the above expression was 20 instead of 14, when compiled by the Teeny compiler.

The precedence that our expression grammar forces is the following:

This is fairly standard, except for our placement of 'or' and 'and'. In C-like languages (C++, Java, C# and many others) 'and' and 'or' have a lower precedence than the relational operators. This allows us to use 'and' and 'or' without parenthesis, as in the following:

 if ch >= 'a' and ch <= 'z' then

Because 'and' has a lower precedence than the relational operators, we can be confident that the relational operations will be performed first.

However, in Pascal, 'and' is given the same precedence as as multiply, and 'or' is given the same precedence as addition. The result of this is that you must always place expressions containing 'and' or 'or' in parenthesis, as in the following:

 if (ch >= 'a') and (ch <= 'z') then

Because Teeny is intended to be a Pascal-like language, we will use the Pascal precedence, rather than the C precedence.

More about Boolean Expressions

Now that we have added relational and logical operators to Teeny, we have a language dilemma. What is the result of the following statement?

Assume that 'a' and 'b' have been declared to be of type integer.

  a := 10 > 5;
  b := 15 > 20;

In you are a C or C++ programmer, you would say that 'a' is assigned the value 1, and 'b' is assigned the value 0.

However, if you are a Java, C# or Pascal programmer, you would say that "this will not compile". Why? Because in the Java/C#/Pascal world, only boolean types can be assigned a boolean value. "10 > 5" is a boolean expression, and therefor can only be assigned to a boolean type. C and C++ loosen things up, allowing virtually any type of expression to be assigned to an integer. While C++ does add a boolean type, it is really just for documentation purposes.

Which approach is best? Personally, I prefer the Java/C#/Pascal way. It actually does not make a very big impact on actual code, but I have found that it is yet another device to help protect the programmer from certain mistakes.

Why this discussion? In Teeny, I had to decide if I was going to support the C/C++ semantics or the Java/C#/Pascal semantics. Since Teeny is an almost true subset of Pascal, I decided to follow the Java/C#/Pascal semantics in regards to boolean expressions. However, this does add complications to the compiler. As well as accepting valid language statements, compilers also much reject invalid language statements. Even though the current version of Teeny has only integer types, we must code the compiler in such a way as to disallow assigning the results of boolean expressions to integer variables. Note that this semantic check is not expressed in our grammar. While there are specification systems that can express these types of things, they are not nearly so easy to use as EBNF.

Grammar for teeny v0.06

We are now ready to present the complete grammar for Teeny v0.06:

  Teeny     = 'program' ident ';' {Decl} CmpndStmt '.' .
  Decl      = 'const' Const {',' Const } ';'
            | 'var' ident {',' ident} ';'
            .
  Const     = ident '=' number .
  CmpndStmt = 'begin' [Stmt {';' Stmt }] 'end' .
  Stmt      = Writeln
            | ident ':=' Expr
            | 'if' expr 'then' Stmt ['else' Stmt]
            | 'while' expr 'do' Stmt
            | CmpndStmt
            .
  Writeln   = 'writeln' '('
              [(string|Expr) {',' (string|Expr) }] ')' .
  Expr      = SimpExpr
              [ ('=' | '<>' | '<' | '<=' | '>' | '>=' )
              SimpExpr ] .
  SimpExpr  = Term { ('+' | '-' | 'or') Term } .
  Term      = Factor { ('*' | 'div' | 'mod' | 'and') Factor } .
  Factor    = number
            | ident
            | ('+' | '-') Factor
            | '(' Expr ')'
            .

Changes

Changes to the Scanner

In trying to make Teeny a true Pascal subset, I changed the divide operator from '/' to 'div'. This was simply a matter of adding 'div' to the list of keywords.

Additionally, the scanner needs to recognize the new operators. The single character operators are simple to scan, so I will not say any more about them.

The multiple character operators take a little bit of thought, so I will illustrate the most complicated, the less than or equal operator, '<='.

The complication arises in that when a '<' is found, it must be determined if less than, less than or equal, or not equal has been found. A solution in pseudo code is:

  when '<'
    read_ch()           -- skip '<'
    if ch = '=' then
      read_ch()
      return less than  -- '<=' found
    end
    if ch = '>' then
      read_ch()         -- skip '>'
      return not equal  -- '<>' found
    end
    return less than    -- just '<' found

Similar code is used for the other multiple character operators.

Changes to the Virtual Machine

We need to add the following binary operators to the Teeny Virtual Machine:

These are all implemented in a similar way to the arithmetic operators.

Each binary operator expects two integers on the stack. The operator in question is performed on the two stack items, the two items are popped off the stack, and the result is then pushed onto the stack.

For instance, the add operator is:

  pop a
  pop b
  temp = a + b
  push temp

The less than (lss) operator is:

  pop a
  pop b
  temp = a < b
  push temp

add is implemented as:

  add: top--; stack[top] += stack[top + 1]; break;

lss is implemented as:

  lss: top--; stack[top] = stack[top] <  stack[top + 1]; break;

Two other instructions added to the Virtual Machine are jump (jmp) and jump-if-zero (jz).

Both of them have the target address as their operand. jmp is unconditional, the jump is always taken. jz is conditional. If the top word on the stack is 0, the jump is taken, otherwise it is not. In either case, the stack is popped.

Changes to the code generator

We added the backpatch method. This is used to patch the operand of jump instructions, that were not know at the time the code had to be generated.

Changes to the parser

This is where the major changes to the compiler occurred for this installment. And many of those changes had to do with semantic checks.

All expression routines had to be changed to return the type of binary expression that was parsed. The routines return whether they parsed a boolean or an integer expression.

The Expr production rule returns a boolean type, while all the other production rules (SimpExpr, Term, Factor) return an integer type. Additionally, each expression parsing routine checks to make sure that the binary operand types matches the allowable operator type. For instance, 'and' and 'or' require boolean operands, while the arithmetic operators require integer operands. Invalid expressions such as:

 (10 > 1) + 5

must be caught and flagged with a compilation error.

Additionally, the 'if' and 'while' statements insure that the 'expr' part of the statement is of boolean type. And the assignment statement insures that the result of 'expr' is the same type as the left hand side of the ':='.

As an example of the changes to expression parsing, we will look at the term method. Term previously looked like this:

  term()
    factor()
    while is_mulop() do
      op = sym
      getsym()
      factor()
      gen(op)
    end

Now, in order to do type checking, term() is changed to:

  term()
    int x, y
    x = factor()          -- save the type
    while is_mulop() do
      op = sym
      if sym = 'and' then
        mustbe(bool, x)   -- type check
      else
        mustbe(int, x)    -- type check
      getsym()
      y = factor()
      mustbe(x, y)        -- type check
      gen(op)
    end
    return x

factor() returns the type it found. The operator is processed, then depending on the operator, the type is checked to make sure it is allowable.

Code generation for the 'while' statement

The 'while' statement looks like:

  while expr do
    stmt

This means while the result of expr is true, do the following statements. The first time expr is false, skip to the statement immediately following the end of the loop.

To see how we would implement code for this, let us think about wow would we code this in a lower level language. Something like the following (this is how we coded 'while' loops in very old versions of Basic):

 l1:
   if not expr goto l2
   stmt
   goto l1
 l2:

And that is just the code we will generate for the 'while' statement.

From the above, we can see that there are two jumps that will need to be generated, one conditional, and one unconditional. Additionally, one of the jumps will be generated at a time when we do not know the target address. So, two addresses must be saved for later use: the top of the loop address, and the address of the conditional jump instruction, so that we can later patch it with the correct address.

  if accept('while') then
  -- get and save the address of the start of the expression
    loop_top = gen.getcx()
  -- generate the code for the expression
    x = expr()
  -- type check
    mustbe(bool, x)
  -- now save the address where the next instruction goes
    patch = gen.getcx()
  -- generate a jump out of the 'while' loop
    gen.jump_false()
    expect('do')
  -- generate code for the statement part
    stmt()
  -- jump back to the expression
    gen.jump(loop_top)
  -- and patch the 'jz' we generated above
    gen.backpatch(patch)

Summary

We have come a long way. We now have a compiler that can compile a simple language. But this language includes a full compliment of operators, enforces the proper precedence and type checking. It includes assignment, console output, and the 'if' and 'while' statements. By the way, 'if' and 'while' statements can be nested to virtually any reasonable level. Our target language, even though dead simple and still just a toy, can process simple algorithms, such as a brute force prime number generator. And all of this in only 750 lines of source code.

This installment covered control statements, and what is involved in compiling them.

We had a sojourn about type checking and boolean expressions, and how different language families allow or do not allow certain constructs.

I hope that we have learned just how easy lexical analysis and parsing can be. And I hope this lesson has taught you that semantic analysis and code generation is where most of the complication in compiling lurks.

Finally, I have tried to emphasize that it is a must to follow a grammar when writing a compiler.

References