A Simple Compiler - Part 2: Syntax analysis.

Like Lexical analysis, Syntax analysis (or so-called parsing) is a highly analyzed and well understood part of compiling. Basically, parsing takes the symbols returned by the scanner and makes sure that they form sentences (or productions) that are legal, according to the languages grammar.

There are many different parsing algorithms, some more flexible than others. Some are more suited to automated parser generators than others. Well known parser generators include YACC, BISON (the GNU version of YACC), Antlr, and Coco/R. We will cover one algorithm - called recursive descent - which is well suited to a hand-crafted compiler. But before we get there, we'll need to talk about grammars.

A grammar is a formal way of specifying the syntax of a language. Developing a well defined non-ambiguous grammar is one of the most important steps you can take to make writing your compiler successful. The particular grammar format we'll use is called EBNF, for Extended Backus Naur Form.

An EBNF grammar consists of:

A production in a grammar takes the following form:

non-terminal = terminal non-terminal ....

For example:

 sentence = subject predicate "." .
 subject = "Peter" | "Paul" .
 predicate = "writes compilers" | "crashes compilers" .

This grammar has 3 productions. Sentence is the start symbol, subject and predicate are non-terminals, and "Peter", "Paul", "writes compilers", "crashes compilers" are terminals. Additionally, '|' is used for alteration. Note how all non-terminals have to eventually be defined by terminals.

Valid sentences in this language are:

 Peter writes compilers.
 Peter crashes compilers.
 Paul writes compilers.
 Paul crashes compilers.

To specify a function call in a C-like language, we could write:

 function_call = ident '(' [parm_list] ')' .
 parm_list = ident {',' ident} .

The grammar above has two productions: function_call and parm_list. ident, '(', ',', and ')' are terminals, and parm_list is a non-terminal. [] and {} are meta-symbols, meaning zero-or-one and zero-or-more of what they enclose, respectively. Other meta-symbols are '|' for 'or', and '(' and ')' for grouping. As you can see from the above simple grammar, non-terminals in the production to the right of the '=' have to also be listed to the left of the '='.

Note that by convention, ident and number are treated as terminals, and are usually specified by something similar to:

 ident = letter { letter | digit }
 number = digit { digit }

Another common terminal is String, which can be defined as:

 string = "'" {no_single_quote} "'" .
 no_single_quote = AnyChar - "'" .
 AnyChar = Chr(32) .. Chr(255) .

function_call can be read as requiring an identifier, an opening '(', an optional parameter list, and a closing )'. parm_list can be read as an identifier, optionally followed by zero or more ',' followed by another identifier.

According to the above grammar, the following are valid function calls:

 factor()
 term(a)
 expr(tx, prec, tptr)

The following are invalid function calls:

 1fun()  --an identifier must start with an alphabetic character
 fun     --a function call must have '(' and ')'
 myfun(a b c)    --parameters must be separated by ','

Another simple example. We'd like to write a teeny compiler for a subset of Pascal. Our subset will only support one statement - writeln. Here is a grammar for teeny Pascal:

  Program = 'program' ident ';' 'begin' Stmtseq 'end' '.' .
  Stmtseq = [Stmt {';' Stmt }] .
  Stmt    = 'writeln' '(' String ')' .

This grammar has 3 productions. Program is the start symbol. Pascal, Stmtseq and Stmt are non-terminals. Note how each of the non-terminal symbols is substituted for until it leads to a terminal.

  A teeny Pascal example program:

  program one;
  begin
    writeln('hello, world!');
    writeln('goodbye.');
  end.

  A erroneous teeny Pascal program:

  program;    -- ident missing after program
  begin
    write     -- unknown identifier
    writeln() -- writeln requires a string
  end;        -- terminating 'end' requires a '.'

Can you see how the grammar above can be used to see if a program is syntactically correct?

So how would you write a parser for the teeny Pascal grammar?

It turns out that the parsing closely follows the grammar. We assume the existence of a routine getsym() that fetches the next symbol from the source, and stores it in the global variable sym. We also assume the existence of a routine expect(), which verifies that the passed symbol is equal to the current symbol, and if so, retrieves the next symbol by calling getsym(). Further, we assume that getsym() has already been called at least once.

To parse the Pascal production:

 Program = 'program' ident ';' 'begin' Stmtseq 'end' '.' .

 program()
   expect(programsym)
   expect(idsym)
   expect(semi)
   expect(beginsym)
   stmtseq()
   expect(endsym)
   expect(period)
 end

Apparently, for terminals, we simply make sure they are there, and in the proper order.

For non-terminals we call a function that handles the work.

To parse Stmtseq:

  Stmtseq = Stmt {';' Stmt } .

 stmtseq()
   stmt();
   while sym == semi
     getsym()    -- consume ';'
     stmt()
   end
 end

Remember, {} specifies symbols that appear zero or more times.

The above could also be written as:

 stmtseq()
   loop
     stmt()
     if sym <> semi break
     getsym()
   end
 end

Finally, to parse Stmt:

  Stmt = ['writeln' '(' String ')'] .

 stmt()
   if sym == writelnsym
     getsym()
     expect(lparen)
     expect(stringsym)
     expect(rparen)
   endif
 end

Of course I am grossly oversimplifying here, but I hope you get the general idea. I also hope you see how important it is to have a grammar, and how easy it can be to develop a parser from that grammar.

A simple example implementation of the above grammar follows. The example can run the above teeny Pascal program.

Next time: More on grammars.

References

File: teeny.h

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <ctype.h>

 enum {Ident_max = 16, Max_str = 200};
 typedef char Ident[Ident_max+1];

 typedef enum {wrc=1, wrl, hlt} Opcode;

 typedef struct {
   Opcode opcode;
   int operand;
 } Instruction;

 void interpret();

 class Msg {
 public:
   Msg() {err_line = 0; err_col = 0;}
   void fatal(const char msg[]);
   void warn(const char msg[]);
   void set_line(int line) { err_line = line; }
   void set_col(int col) { err_col = col; }
 private:
   int err_line, err_col;
 };

 class VirtualMachine {
 public:
   VirtualMachine() { code_index = 0; }
   void gen(Opcode opcode, int operand);
   const Instruction *getcode() { return code; }

 private:
   enum {MAX_CODE=500};
   Instruction code[MAX_CODE];
   int code_index;
 };

 class Scanner {
 public:
   typedef enum {
     nul, lparen='(', rparen=')', times='*', plus='+', period='.',
     semi=';', numbersym, idsym, litstrsym, program, begin, end,
     writeln, eofsym
   } Symbol;

   bool init_scanner(const char fn[]);
   Symbol getsym();
   int getval() { return val; }
   const char *getid() { return id; }
   const char *getlitstr() { return lit_str; }

 private:
   void read_ch();
   Symbol get_ident();
   Symbol get_number();
   Symbol get_string(int delimiter);

   int val;
   Ident id;
   char lit_str[Max_str];
   int ch;
   int cur_line, cur_col;
   FILE *f;
   static char *key_words[];
   static Symbol key_syms[];
 };

 class Parser : Scanner {
 public:
   void parse(char fn[]);
 private:
   Symbol sym;

   bool accept(Symbol s);
   void expect(Symbol s);
   bool stmt();
   void stmtseq();
 };

File:teeny.cpp

 /*************************************************************
   Grammar for teeny Pascal:

   Pascal  = 'program' id ';' 'begin' Stmtseq 'end' '.' .
   Stmtseq = Stmt {';' Stmt } .
   Stmt    = ['writeln' '(' String ')'] .

   A teeny Pascal example program:

   program one;
   begin
     writeln('hello, world!')
   end.
  *************************************************************/

 #include "teeny.h"

 VirtualMachine vm;
 Msg error;

 void Msg::fatal(const char msg[]) {
   if (err_line == 0)
     printf(msg);
   else
     printf("line %d col %d %s\n", err_line, err_col, msg);
   exit(2);
 }

 void Msg::warn(const char msg[]) {
   printf("line %d col %d %s\n", err_line, err_col, msg);
   exit(2);
 }

 /---------------- Virtual Machine interpreter ----------------/
 void interpret() {
   const Instruction *pc = vm.getcode();
   for (;;) {
     switch (pc->opcode) {
       case wrc: printf("%c", pc->operand); break;
       case wrl: printf("\n"); break;
       case hlt: return;
       default: error.fatal("whoops! unknown opcode");
     }
     pc++;
   }
 }

 /---------------- Code generator -----------------------------/
 void VirtualMachine::gen(Opcode opcode, int operand=0) {
   if (code_index < MAX_CODE) {
     code[code_index].opcode = opcode;
     code[code_index].operand = operand;
     code_index++;
   } else error.fatal("program too long");
 }

 /---------------- Lexical analyzer ---------------------------/
 char *Scanner::key_words[] = {"begin", "end", "program",
     "writeln", ""};
 Scanner::Symbol Scanner::key_syms[] = {begin, end, program,
     writeln, idsym};

 void Scanner::read_ch() {
   if (ch == '\n') {
     cur_line++;
     cur_col = 0;
   }
   ch = fgetc(f);
   if (ch != EOF)
     cur_col++;
 }

 Scanner::Symbol Scanner::get_string(int delimiter) {
   int i = 0;
   for (;;) {
     read_ch();
     if (ch == delimiter) {
       read_ch();
       break;
     }
     if (ch == EOF) {
       error.warn("eof in string");
       break;
     }
     if (ch < ' ') error.warn("illegal character in string");
     if (i < Max_str) {
       lit_str[i] = (char)ch;
       i++;
     }
   }
   lit_str[i] = '\0';
   return litstrsym;
 }

 Scanner::Symbol Scanner::get_ident() {
   int i = 0;
   do {
     if (i < Ident_max) {
       id[i] = (char)ch;
       i++;
     }
     read_ch();
   } while (ch != EOF && isalnum(ch));
   id[i] = '\0';
   for (i = 0;
       key_words[i][0] != '\0' && strcmp(id, key_words[i]) != 0;
       i++)
     ;
   return key_syms[i];
 }

 Scanner::Symbol Scanner::get_number() {
   val = 0;
   do {
     val = 10 * val + (ch - '0');
     read_ch();
   } while (ch != EOF && isdigit(ch));
   return numbersym;
 }

 Scanner::Symbol Scanner::getsym() {
   int save_ch;
   while (ch != EOF && ch <= ' ')
     read_ch();
   error.set_line(cur_line); error.set_col(cur_col);
   save_ch = ch;
   switch (ch) {
     case EOF: return eofsym;
     case '+':case '*':case '(':case ')':case '.':case ';':
       read_ch(); return (Symbol)save_ch;
     case '\'': return get_string(ch);
     default:
       if (isalpha(ch)) return get_ident();
       if (isdigit(ch)) return get_number();
       error.warn("illegal character");
       read_ch();
       return getsym();
   }
 }

 bool Scanner::init_scanner(const char fn[]) {
   if ((f = fopen(fn, "r")) == NULL) return false;
   error.set_line(1);
   error.set_col(0);
   cur_line = 1;
   cur_col = 0;
   ch = ' ';
   read_ch();
   return true;
 }

 /---------------- Parser -------------------------------------/
 bool Parser::accept(Symbol s) {
   if (s == sym) {
     sym = getsym();
     return true;
   }
   return false;
 }

 void Parser::expect(Symbol s) {
   if (!accept(s)) error.warn("syntax error");
 }

 // Stmt = ['writeln' '(' String ')'] .
 bool Parser::stmt() {
   if (accept(writeln)) {
     expect(lparen);
     if (sym != litstrsym) error.warn("string expected");
     const char *lit_str = getlitstr();
     for (int i = 0; lit_str[i] != '\0'; i++)
       vm.gen(wrc, lit_str[i]);
     vm.gen(wrl);
     sym = getsym();
     expect(rparen);
     return true;
   }
   return false;
 }

 // Stmtseq = Stmt {';' Stmt } .
 void Parser::stmtseq() {
   do {
     stmt();
   } while (accept(semi));
 }

 // Pascal = 'program' id ';' 'begin' Stmtseq 'end' '.' .
 void Parser::parse(char fn[]) {
   if (!init_scanner(fn)) error.fatal("no source");
   sym = getsym();
   expect(program);
   expect(idsym);
   expect(semi);
   expect(begin);
   stmtseq();
   expect(end);
   vm.gen(hlt);
   expect(period);
 }

 /---------------- Main driver --------------------------------/
 int main(int argc, char *argv[]) {
   Parser parser;
   parser.parse(argv[1]);
   interpret();
   return 0;
 }