May 1994/Designing an OOP Compiler

Object-Oriented Programming

Designing an OOP Compiler

Louis Tsai

Louis (Kam-wai) Tsai was born in Hong Kong where he earned his B.S. in Computer Engineering. He came to the United States in 1986 and earned his MBA in Computer Information Systems. He is a senior programmer/analyst at WORLDSPAN in Atlanta, GA. He can be reached at 969 Merrimac Way, Lawrenceville, GA 30244. 404-381-7735.

Overview
Most marketed compilers are written in C or another conventional programming language. Adding a new command to the language may require changing a lot of code in your compiler. This increases its maintenance cost, as well as the time and effort needed for each new release. Also, a programmer must fully understand the design of a compiler before making changes to it.
In this article, I will discuss an object-oriented compiler. The compiler was originally written in C using a DOS-based Windows API. My job was to rewrite the compiler using Borland C++ and Microsoft Windows while maintaining the same functionality. To run the product on Microsoft Windows and to fully utilize the advantages of porting to Windows (such as adding buttons, message boxes, and window operations), some Windows commands had to be added in our language. With object-oriented programming, we satisfied our requirements by creating an object class to implement a command. To add or delete commands in the language, we simply add or delete objects.

Architecture of a Compiler
In begin with a brief review of a simple compiler. I have not changed the basic architecture, just implemented it with an objectoriented design. A compiler consists of three components (see the block diagram in Figure 1) :

a tokenizer

a parser

a code generator
Tokenizing a program is the first step. The compiler reads a character string from an input file and matches the string against entries in a keyword table. If it finds a match, it returns a token with an appropriate token type (an integer value). In our project, a keyword corresponds to a command. A character string not recognized as a keyword is tokenized as the name of a variable (the token is treated as a variable). Our tokenizer needs to support several functions: getToken(), lookToken(), and peekToken(). Function getToken() moves the file pointer to the next token and returns the new token from the file. Function lookToken() returns the current token in a program without moving the file pointer. Function peekToken() returns a look-ahead token without moving the file pointer.
Our parser calls the above mentioned token-retrieval functions and checks command syntax. It also outputs syntax-error messages to users. For example, the syntax of an IF statement is:
IF expression
   THEN statement ....
   {ELSE statment ....} ENDIF
If the token IF is encountered, the parser expects an expression, followed by THEN, and so on, Figure 2 shows a parse tree for an IF statement. As you can see, the diagram of a parse has branches like a tree, so we call it a parse tree. A conventional compiler builds a parse tree using linked lists or similar data structures.
In addition to checking syntax of commands, our parser creates and updates a symbol tables, to store the variables declared inside a program. A symbol table is created and updated in the variable declaration section of a program. It is output to the object file at code generation time and referenced during execution. My project's symbol tables also contain the information needed to reference branch statements (e.g. GOTO, IF, and WHILE statments) and literals (all constants in the program, such as constant numbers or literals).
The purpose of the code generator is to produce the appropriate instructions (instruction objects or machine language). My project's parser produces a pseudo file that contains all the instructions that are executed at run time. The code generator reads from the pseudo file and generates the appropriate instruction objects to an object file (output from the compiler). Figure 3 shows the data flow of the compiler.

Language Syntax
The scripting language that the compiler compiles is similar to BASIC. It is designed to allow personal computer users to create and modify fill-in screens which, when combined, form scripts. Theses scripts guide the PC user through a task in a Computer Reservation System (CRS). The product is mainly used for airline, rental car, and hotel reservations. Instead of typing a long cryptic format command to the host computer, a travel agent can run a script program and enter data, retrieve airline schedules, book tickets, etc. Listing 1 shows the syntax of a sample script program.
To pop up a dialog box, you have to include a command, such as
DIALOG ("Test.dlg", RET_CODE, parm1, parm2)
in your script program. Test.dlg is the name of the dialog file, which contains control statements. Listing 2 shows a dialog file.
Parsing a dialog statement involves another set of dialog commands (as shown in Listing 2) . To parse a dialog command, the compiler has to open a dialog file and compile it. The operation is just like compiling any other programming language. For details of the script and dialog language, please refer to Reference 1.
As I mentioned earlier, the compiler handles two sets of commands: one for the script program and another for the dialog file. The compiler behaves as though you are running two compilers simultaneously. So to reduce duplication of code, I created a parent class called Tokenizer. This class is inherited by the two classes Asl and Dlg, which handle the separate translations.

Inside the Tokenizer
Listing 3 shows the definition of the Tokenizer class. Listing 4 shows the definitions of classes Asl and Dlg. Asl is the tokenizer class for script programs and Dlg is for dialog files. Compiling either a script or a dialog file will use the same lookToken(), getToken(), and peekToken() functions in the parent class.
The only difference between classes Asl and Dlg is the function initialize(). In Listing 5, the virtual function initialize() is overridden in classes Asl and Dlg. The function initializes a keyword set gKeywordSet with different command sets (delimiter and operator sets are common to both languages). This way, I can compile two different languages with a minimum of code. I add the array kcommands to my keyword table in Asl. And I add the arrays kdatatypes and kdialogcmds in Dlg.
All the strings stored in kcommands, kdatatypes, and kdialogcmds are defined in a keyword header file. This technique eases adding or deleting keywords in the languages. Also, if I want to change spellings in the language, I can just change keyword tables. This will be useful when I want to port products to foreign languages.

Implementing the IF Command
Instead of using linked lists for the parse tree, I use an object-oriented approach to handle the parsing. I created a parse node for each command. Listing 6 defines a parse node class for the command IF, which I call a Pif class. Inside the Pif class, the function parse() (shown in Listing 7) parses the syntax of an IF command. Figure 2 shows the linkage of an IF parse node.
Inside the parse() function, I instantiate an expression object (class Pexpression) and check for out of memory. Then I call the function Pexpression::parse() to check for the syntax of an expression. The expression
pTemp = gpTokenizer->getToken()
assigns a new token from the input file to pTemp. I check the token type of pTemp. The next token I expect is a THEN. If it is a THEN, I expect statements. Otherwise, it must be an error. I output an error message to an error file.
Before I enter into the while loop, I call gpTokenizer->lookToken() to check for the current Token. If it is not ELSE or ENDIF, it should be a statment. I instantiate a statement object (class Pstatement). Then, I call Pstatment::parse(), which is a function to differentiate commands in the languages. This function contains all the commands supported in the languages. The loop continues until the look ahead token is an ELSE or ENDIF. If the while loop ends with an ENDIF, the Pif::parse() ends and returns to the calling function. However, if it is an ELSE, another while loop (similar to the one in Listing 7) is executed to handle statements after the token ELSE. The gpErrorFile is a class which handles error messages.
If I need to add or delete commands, I simply change Pstatement::parse(). Listing 8 shows the function Pstatement::parse(). PARSE is a macro that calls the function parse() and then cleans up the object. The function Pstatement::parse() starts by getting a token from the input file. A switch statement differentiates the type of parse node to be created. For example, when a CALL statement is followed by the token THEN, a Pcall parse node will be created. The function Pcall::parse() is called to handle the syntax of a CALL statement. Pstatement::parse() terminates and returns to Pif::parse().

Nested IF Statements
The compiler supports a nested IF, which occcurs when another IF token follows the token THEN. Another Pif parse node is created inside Pstatement::parse() and the function Pif::parse() is called to handle the new IF input tokens. It is like building a parse tree (see Figure 2) . Each node inside a parse tree is a parse node. Each parse node's parse function contains the grammar for that command. Without using any explicit linked list, I can link each parse node with the function call parse() (see Listing 7 and Listing 8) . The following displays the sequence of function calls for a nested IF:
Pif::parse()
   --> new Pstatement
   --> Pstatment::parse()
   --> new Pif

   --> another Pif::parse()
Using the same technique, I created a PdialogStatement class for my dialog language. Listing 9 shows the syntax of the parse function. Pedit and Plistbox parse functions check the syntax of the dialog commands EDIT and LISTBOX as shown in Listing 2.

Implementing the WHILE Command
Another good example is the WHILE command, which has the following syntax:

WHILE expression DO statement .... ENDWHILE
First, I instantiate an expression parse node. Then, I call the expression's parse function. A DO token is expected after the expression. Following the DO token, I have a loop to create Pstatement parse nodes and call the parse function. The loop ends when it find an ENDWHILE token. The algorithm is similar to that for IF. Figure 4 shows the parse tree for a WHILE node. Listing 10 shows the code of a WHILE parse node.

Flexibility
Parse nodes for different commands are quite similar. To implement commands using parse-node classes is easy and the code is reusable. I created more than thirty parse-node classes in two days using copy and paste functions. You can also move objects from one language to another. Some parse nodes — such as for IF, WHILE, FOR, and REPEAT — are universal commands. You can reuse them in any language. Also, you can add or delete objects easily. As a matter of fact, you can collect your parse-node classes in a library and share them for different languages. If you don't need these commands, simply exclude them from your Pstatement parse function.
To design a new command, draw a parse tree for the command. Then, you can look at your parse-node library and reuse the relevant parse nodes in your code. In these examples, we reuse the expression and statement parse nodes. Also, if we put all the parse nodes in a DLL library, the size of the compiler's executable is significantly reduced.
In my project, I had to interface with other pieces in the reservations system. Since we used object-oriented design, we broke the whole project into different classes before starting. We defined the classes, member functions and data members in header files. Everyone involved with the project reviewed the class definitions with which they needed to interface. We didn't need to know the details of how to implement those functions, just how to use them and to be sure we got the interface functions we needed.
This is the beauty of object-oriented design. In my compiler, I used a symbol-table class, a code-generation class, and a dialog-manager class. The symbol-table class handles all the variables used in a script. My output file contains symbol tables used at run time. The symbol-table class has read and write functions that can read and write symbol tables to and from files. To improve the performance at run time, we output table objects to an object file (output file of a compiler) during compilation and read in the objects at run time.
After the parsing, the compiler instantiates a code-generation class. The code-generation class looks at a file (we called it a pseudo file) generated by the parser that contains instructions. It instantiates instruction objects and calls the write function of the instruction object. The instruction objects were written to the object file. Similar to the symbol-table class, each instruction object has the capabilities to read and write itself to a file. The dialog-manager class handles control commands for the dialog file. The code-generation class calls functions inside the dialog-manager class to output dialog control objects.
Without these classes, I would not have been able to finish my project. Also, these classes are reused in other projects. Once again, I have taken advantage of an important aspect of object-oriented design — reusability.

Advantages of the Design
When I first looked at the compiler and started my design work, I found that the old tokenizer used sets to distinguish different commands. I looked at some container classes supported by Borland, Microsoft, and Rogue Wave. I found that Rogue Wave's and Borland's set classes were very useful. My keyword set was implemented using the container class. I also used the Borland file-stream classes to handle my input and output. I began my design process by looking first at some existing classes in the market. Then, I defined classes and member functions that needed to be implemented in a header file. As the header file became too large to manage, I separated it out based on functionality.
In the old parser, parsing functions were put in different files. Linking them together required a lot of extern statements, and it was time consuming to locate the functions. So I decided to build a parse-node class for each command. This had several benefits:

I eliminated a lot of extern statements.

I put functions in groups and shared some member functions by putting them a parent class.

The parse-node classes are similar and are easy to duplicate. This can reduce the maintenance cost of the compiler.

I built a parsing engine which includes an expression parse-node class, symbol-table classes and the tokenizer. I can delegate some parse-node classes to someone else in order to reduce the development cycle.
In the old compiler, the code generation was embedded in the parser, making the parser more complicated. I separated the code generator into a class. There are several advantages of this approach:

It makes programs more modular. I can plug in a different code generator in case I need to generate something different and I can reuse my parser in other projects.

It can simplify the coding and make it easy to maintain.

We can delegate the code generator to someone else to reduce the development cycle.

We can add optimization in our object file.

It is more flexible when incorporating enhancements.
Using object-oriented design can reduce the development cycle of a project. A lot of code can be reused in other projects. It is cost effective and easy to maintain. It may take some time to familiarize yourself with the object-oriented concept and the tools, but once you go through the learning curve, you will love it and won't want to go back to conventional C.

Acknowledgements
I would like to thank the management of WORLDSPAN's Distributed Systems Development department for the support I have received. Special thanks to Alan Greenberg for his valuable input to this project and Cynthia Petrie for her editing. Thanks also to my co-workers Bruce Bettis for creating the symbol-table class, Scott Kelley for the code-generation class, and Ron Zepp for the dialog-manager class.

References
1. WORLDSPAN Scripting Language Manual, July 1992.
2. Object-oriented Software Construction, Bertrand Meyer, Prentice Hall, 1988.