Parsing Expressions in Java

Dr. Dobb's Journal, January 1999

Specifying rules for common operations

Cliff Berg

Cliff, vice president and chief technology officer of Digital Focus, can be contacted at cliffbdf@digitalfocus.com.

Many work-process applications let users specify business rules for operations that are to be performed again and again. A spreadsheet, for example, defines a set of business rules for performing calculations. Business-rule specifications commonly involve expressions -- formulas a business-rule system is to apply. Work-process applications thus need to let users enter and compose expressions; interpret, compile, and execute them; and save them for reuse.

Parsing expressions involves two cooperating activities: lexical recognition, which groups input characters into tokens, and parsing proper, which recognizes the language constructs that those tokens form.

A lexical recognizer is often referred to as a "tokenizer" because it divides input into tokens -- groups of characters that constitute words and special symbols. A tokenizer is usually a distinct software module, constructed to scan an incoming text stream and produce tokens as output. The parser then makes requests for the next available token.

A tokenizer works hand-in-hand with a parser. The parser recognizes constructs defined by the language. A parser typically sits in a loop, asking the tokenizer for tokens. Depending on what tokens precede it, the parser makes decisions about what language constructs to apply to recognize the input. The parser's architecture is, therefore, highly dependent upon the nature of the input language.

When the parser receives a token that is some kind of identifier, it must determine what role that identifier plays in the language. For example, the identifier might be the name of a function or program variable. The role played by an identifier affects how it is handled when parsing is done and what action must be taken based on the input. When an expression is parsed, the action to be taken might be to compute the actual value represented by the expression -- in this case, variable identifiers must be substituted with the actual value they represent.

For its part, Java provides the built-in tokenizer classes java.io.StreamTokenizer and java.util.StringTokenizer -- but not a parser. Consequently, to complement these classes, this article presents a Java expression parser. The complete source code for the parser is available electronically; see "Resource Center," page 5, or the Digital Focus web site (http://www.digitalfocus.com/).

Grammars and Symbol Recognition

To construct a parser, you first define a grammar to be parsed. Although there are many ways of representing a grammar, I'll use the grammar-specification format popularized by the UNIX yacc program.

The yacc format includes lists of the tokens produced by the tokenizer, and lists of productions (syntactic groups that make up the language). Each production consists of a left-side name, followed by a right-side list of symbols. The left and right sides are separated by a colon, and the right side is terminated by a semicolon. For example, this production defines a symbol called NumericLiteral:

NumericLiteral : IntLiteral ;

It is possible (and common) to have a left side defined more than once, indicating that it can be composed of alternative groupings of symbols. You could, for example, have:

NumericLiteral : FloatLiteral ;

Yacc format provides a shorthand for this, such that if two productions have the same left side, they can share the same definition, and the right sides are merely separated by a vertical bar, indicating an OR relationship:

NumericLiteral : IntLiteral | FloatLiteral ;

A grammar usually has a top-most symbol, which aggregates the total set of things that can occur within the language. For this expression grammar, I have defined an Expression symbol for this purpose.

Ways of Parsing Expressions

One common technique for parsing an expression is to use a table-driven parser, and encode the recognition rules in a table. Starting from an initial state, each time the parser gets a new token, it looks that token up in the table, and based on the current state, advances to a new state. The table can also tell the parser to obtain a new token, or continue using the current one. It can also tell the parser to group a collection of recently recognized symbols into a larger group, representing a higher-level grammar construct. When the parser is done, it will have recognized the top-most construct -- an entire expression. Constructing such a table usually requires a tool such as yacc.
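As a minimal sketch of the state-table idea (my own illustration -- real yacc-generated tables also encode shift and reduce actions and maintain a symbol stack, which this finite-state sketch omits), consider a table that recognizes the token pattern NUM (OP NUM)*:

```java
// Table-driven recognition sketch: rows are parser states, columns are
// token kinds, and each entry gives the next state (or ERR to reject).
public class TableDrivenDemo {
    static final int NUM = 0, OP = 1;   // token-kind column indices
    static final int ERR = -1;

    // state 0: a number is expected; state 1: an operator (or end) is expected
    static final int[][] next = {
        { 1, ERR },
        { ERR, 0 },
    };

    static boolean accepts(int[] tokens) {
        int state = 0;
        for (int t : tokens) {
            state = next[state][t];
            if (state == ERR) return false;
        }
        return state == 1;              // input must end just after a number
    }

    public static void main(String[] args) {
        System.out.println(accepts(new int[]{ NUM, OP, NUM }));   // true
    }
}
```

Extending such a table by hand quickly becomes impractical, which is why tools such as yacc generate it from the grammar instead.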

Another approach is to write a routine for each grammar symbol, and call those routines recursively. Since an expression is made up of a known set of symbol kinds, and those in turn are composed of others, the process can be decomposed into a set of recursive function calls. In cases where it is not possible to tell which routine to call next (for example, if the current token can begin more than one of the alternative productions), a token look-ahead method peeks at the token waiting in the input, and branches based on that. With expressions, however, it is usually possible to write the grammar so that all such ambiguities are removed. In particular, no right side of a production may begin with the same symbol that is defined on its left side; a production that does is said to have "left recursion," and a left-recursive grammar cannot be parsed by straightforward recursive descent. The expression grammar I present here (available electronically) has left recursion removed.
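For example (this rewriting is an illustration of my own, in the same yacc-style notation used earlier), a naturally left-recursive definition of addition:

Expression : Expression Plus Term | Term ;

can be rewritten without left recursion by factoring the operator into a trailing symbol that may reduce to null:

Expression : Term TermOp ;

TermOp : Plus Term TermOp | /* null */ ;

This trailing-operator shape is exactly the one the expression grammar presented here uses.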

The Tokenizer

For recognizing expressions, I define the token kinds that must be recognized by the lexer and returned to the parser: Ident, IntLiteral, FloatLiteral, Plus, Hyphen, Asterisk, Slash, OpenParen, CloseParen, and Comma (these correspond to the terminal entries at the top of Listing Seven). The tokenizer must be able to group input characters into tokens of these kinds.

Listing One defines the interface for the tokenizer, while Listing Two defines an abstract type Token.

You can use Java's built-in tokenizers to implement the tokenizer -- just wrap one in a class that implements the Lexer interface. Alternatively, you can write your own if you need functionality that isn't available from the built-in tokenizers (such as the character position of tokens).
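As a minimal sketch of this approach (my own illustration, not the article's electronic source; the "KIND:value" string format is invented for the demo -- a production Lexer would return Token objects as in Listings One and Two), java.io.StreamTokenizer can be made to produce expression tokens like this:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Wraps java.io.StreamTokenizer to produce expression tokens.
public class StreamLexerDemo {
    static List<String> tokenize(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.ordinaryChar('/');   // by default '/' begins a comment; make it an operator
        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_NUMBER: tokens.add("NUM:" + st.nval);   break;
                    case StreamTokenizer.TT_WORD:   tokens.add("IDENT:" + st.sval); break;
                    default:                        tokens.add("OP:" + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);   // cannot happen with a StringReader
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [IDENT:rate, OP:*, OP:(, IDENT:base, OP:+, NUM:2.5, OP:)]
        System.out.println(tokenize("rate * (base + 2.5)"));
    }
}
```

A wrapper like this would also map each ttype onto the parser's token-kind constants before handing the token over.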

The Expression Grammar

Since this expression grammar contains no left recursion, it does not require look-ahead to the next token. A grammar of this type is known as a "predictive" grammar -- the parser can always determine which production to use based on the current token and parse state. In the case of a recursive-descent parser, the state is represented by the method call stack.

Figure 1 illustrates the grammar I use. The top-most nonterminal kind is Expression. An expression is composed of a Term, followed by a TermOp. A TermOp can be a plus sign ("+"), a minus sign ("-"), or it can reduce to null, which covers the case when there is no "+" or "-" operation following a Term.

A Term is composed of a Factor and FactorOp, and a Factor is composed of a Primary and PrimaryOp. This nested chain of definitions implements the operator precedence rules of the expression language: "*" has higher precedence than "+" and so on. Primary is the top of the chain. A Primary can be an identifier, number (NumericLiteral), parenthesized expression, or signed expression (a Primary with a "+" or "-" in front of it, to alter the Primary's sign).

The grammar also allows use of function notation, consisting of an identifier followed by a comma-separated list of expressions in parentheses. During parsing, the parser can choose to resolve the names of functions that are encountered, and associate them with defined numeric functions, such as SQRT() and so on. A function's parenthesized parameter list is treated in the grammar as an operator: It is recognized by the PrimaryOp production. Thus, syntactically, any expression may be followed by a parenthesized parameter list. Whether this makes sense depends on whether the expression language supports the notion of function-valued expressions, as is the case in C (a dereferenced function pointer). If not, the parser must add a check to ensure that the Primary preceding a nonnull PrimaryOp is an identifier corresponding to a known function.

Graph Types

Depending on your requirements, an expression parser can evaluate expressions as it parses them. Alternatively, you may want it to construct a graph representing the expression. In either case, the design of the parser is essentially the same. The expression parser presented here constructs a graph.

I define the base types for the nodes in the graph in Listing Three. The parser parses an expression string and returns a graph representing the abstract structure of the expression. This graph can be handed over to other code to either evaluate or store in an object database for future evaluation. All nodes in the graph will be of type SymbolInstance, and will extend the abstract class Expression. I also define subtypes for each kind of expression, including IdentExpression for identifiers, IntLiteralExpression for integer-type literals, FunctionCallExpression for function calls, TermExpression for "+" and "-" operations, and so on. These expression subtypes correspond to a logical model of expressions, and do not necessarily correspond to the symbols defined in the grammar. The logical model defines the output; the grammar symbols define the input. Listing Four shows the expression-tree subtypes I have defined.

Each kind of expression either stands on its own (as in the case of an IdentExpression and any of the literal expressions) or represents some kind of operation involving one or more operands. For example, a FunctionCallExpression represents a function operation, involving all the function arguments as operands. Similarly, a TermExpression is either an addition or subtraction operation, involving a left- and right-side argument.
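A minimal sketch of this operand relationship (Node, NumberNode, and TermNode are invented stand-ins for the article's Expression classes, and evaluate() here returns a double rather than an Object):

```java
// Each graph node evaluates itself by first evaluating its operands.
interface Node { double evaluate(); }

class NumberNode implements Node {
    final double value;
    NumberNode(double v) { value = v; }
    public double evaluate() { return value; }
}

// Mirrors TermExpression: additive == true means "+", false means "-".
class TermNode implements Node {
    final Node lhs, rhs;
    final boolean additive;
    TermNode(Node l, Node r, boolean add) { lhs = l; rhs = r; additive = add; }
    public double evaluate() {
        return additive ? lhs.evaluate() + rhs.evaluate()
                        : lhs.evaluate() - rhs.evaluate();
    }
}

public class GraphEvalDemo {
    public static void main(String[] args) {
        Node sevenMinusTwo = new TermNode(new NumberNode(7), new NumberNode(2), false);
        System.out.println(sevenMinusTwo.evaluate());   // 5.0
    }
}
```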

Using a Stack to Manage Output

One difficulty in parsing is that the symbols that make up the grammar do not necessarily correspond one-to-one to the graph node kinds that are being constructed. In fact, when a grammar is rewritten to remove left recursion, some of the natural correspondence is usually removed in favor of being able to parse the input more easily. As a result, a separate mechanism is often needed to maintain the state of the output, while the parser also maintains the state of the input. The most common mechanism for maintaining the output state is a stack. In this approach, items that are constructed as intermediate results are pushed onto a stack, so that they are available to subsequent stages during parsing. The stack acts as a holding area. At the end of parsing, the stack may contain a single item: the root node of a graph representing an expression. Alternatively, the stack may contain an entire program that represents the operations specified in the original input, but translated into the instruction set of a stack- or register-based machine.
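The stack discipline can be sketched as follows (my own illustration; string labels stand in for the Expression nodes the real parser pushes):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Operands are pushed as they are recognized; recognizing an operator
// pops its two operands and pushes the combined node back.
public class OutputStackDemo {
    static final Deque<String> stack = new ArrayDeque<>();

    static void pushOperand(String operand) { stack.push(operand); }

    static void applyOperator(String op) {
        String rhs = stack.pop();   // the right operand was pushed most recently
        String lhs = stack.pop();
        stack.push("(" + lhs + " " + op + " " + rhs + ")");
    }

    public static void main(String[] args) {
        pushOperand("a");
        pushOperand("b");
        applyOperator("+");
        System.out.println(stack.peek());   // (a + b) -- the single root node
    }
}
```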

For this expression parser, I use a simple stack that pushes and pops the expression graph nodes that I construct. It is a Singleton object, and has the methods in Listing Five.

Name and Type Resolution

My parser implementation also defines the base types in Listing Six. These types support type and name resolution. For example, when the parser recognizes a function name, it can presumably obtain from somewhere the return type of the function, and use that to determine the type of the final result of the expression. At each step in the recognition process, the parser determines the type of each subexpression, until it is able to determine the type of the final result. To determine the result type of an operation, the parser has a method resolveType() with the signature:

protected Type resolveType(Type lhsType, int op, Type rhsType) throws ParseException;

It implements the rules of the language governing what results from an operation involving two operands of the same or different types. For example, in this expression language, an operation involving two integer operands results in an integer, whereas an operation involving a float and an integer results in a float.
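The promotion rule just described might look like this in isolation (SimpleType and the two-argument form are simplifications of my own; the article's resolveType() also receives the operator and works against the Type interface of Listing Six):

```java
// Result-type resolution for binary arithmetic: two ints yield an int;
// if either operand is a float, the result is a float.
public class TypeResolver {
    enum SimpleType { INT, FLOAT }

    static SimpleType resolveType(SimpleType lhs, SimpleType rhs) {
        return (lhs == SimpleType.FLOAT || rhs == SimpleType.FLOAT)
                ? SimpleType.FLOAT
                : SimpleType.INT;
    }

    public static void main(String[] args) {
        System.out.println(resolveType(SimpleType.INT, SimpleType.INT));     // INT
        System.out.println(resolveType(SimpleType.FLOAT, SimpleType.INT));   // FLOAT
    }
}
```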

The interface Region defines an abstract interface for a symbol table -- an object that provides a named lookup capability for things in the language that have names, such as defined functions. For languages such as C, this would also include identifiers, types, and so on. It might also support the ability to nest regions, concatenate them, or define scope based on position (for languages in which symbols cannot be used prior to the point of declaration). Correspondingly, a Symbol has a name and can be looked up in a Region, and Symbol defines the interface for obtaining information about the named entity. In my expression language, defined functions are the only things that have names and can appear in symbol tables.

For performing symbol-name resolution, the parser has a method with the signature:

protected Symbol resolve(Region immediateRegion, String s, boolean create) throws UnresolvedSymbolException, SymbolConflictException;

It implements the name-scoping rules of the language. In this case, it merely looks in a list of defined function names.

Computing the FIRST Set

The FIRST set of a grammar symbol is the set of all terminals that can begin that symbol. For example, in the expression grammar, the FIRST set of NumericLiteral is { IntLiteral, FloatLiteral }. This is because a NumericLiteral always begins with either an IntLiteral or FloatLiteral token and nothing else.

Now consider a Primary, whose production is as in Figure 2. A Primary can begin with any of { Ident, Literal, OpenParen, SignedExpression }. Of these, Literal and SignedExpression represent productions of other symbols. A Literal can start with any of { TextLiteral, NumericLiteral }, and a SignedExpression may start with any of { Hyphen, Plus }. In turn, NumericLiteral can start with any of { IntLiteral, FloatLiteral }, which, at last, are tokens and cannot be reduced further. Thus, you now know all the terminals that can start a Primary, and they are { Ident, IntLiteral, FloatLiteral, TextLiteral, OpenParen, Plus, Hyphen }. This is the FIRST set of Primary.

I used a program to generate the FIRST set of my expression grammar. However, for a moderate-sized grammar, it is relatively simple to construct the FIRST set by hand for each symbol. For the ExpressionParser, I have defined the Java array constant in Listing Seven to contain the FIRST sets of all the grammar symbols. Within the code, I make use of the method in Listing Eight to inquire if a specified terminal belongs to a symbol's FIRST set.
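Such a generator can be sketched in a few lines for a grammar of this shape (my own sketch, not the program I used; it assumes the grammar has no left recursion and that the first symbol of an alternative is never itself nullable):

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Computes FIRST sets by recursing on the first symbol of each alternative.
// Symbols absent from the production map are treated as terminals.
public class FirstSets {
    static final Map<String, List<List<String>>> grammar = new LinkedHashMap<>();
    static {
        // A fragment of the expression grammar; an empty list would mark a null production.
        grammar.put("NumericLiteral", List.of(
                List.of("IntLiteral"), List.of("FloatLiteral")));
        grammar.put("Primary", List.of(
                List.of("Ident"),
                List.of("NumericLiteral"),
                List.of("OpenParen", "Expression", "CloseParen")));
    }

    static Set<String> first(String symbol) {
        if (!grammar.containsKey(symbol)) {
            return Set.of(symbol);              // a terminal begins with itself
        }
        Set<String> result = new LinkedHashSet<>();
        for (List<String> rhs : grammar.get(symbol)) {
            if (rhs.isEmpty()) {
                result.add("NONE");             // the symbol can reduce to null
            } else {
                result.addAll(first(rhs.get(0)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(first("Primary"));
    }
}
```

The recursion terminates precisely because the grammar has no left recursion: no symbol's FIRST computation ever revisits itself through a first position.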

Recursive Descent of Expressions

To perform recursive descent recognition of expressions, there must be a routine for each symbol defined in the grammar, or at least a routine for each left-side symbol. Thus, for the top-level symbol Expression, there must be a routine to recognize Expression; see Listing Nine. Expression is defined by the production:

Expression: Term TermOp ;

Therefore, an Expression always consists of a Term followed by a TermOp -- it cannot be anything else. The routine parse_Expression() therefore calls parse_Term(), followed by parse_TermOp(), and returns. You might wonder how the expression graph gets returned. The parser builds the graph on a stack it maintains, so the calling routine simply calls pop() to get the graph.

Parsing a Term is similar. The routine parse_Term() simply calls the methods that parse the components of a Term: parse_Factor() and parse_FactorOp(). Parsing a TermOp is a little different, however. A TermOp consists of the production in Figure 3. To parse a TermOp, you have to allow for the fact that a TermOp can reduce to null -- the last production has nothing in it. You also have to use the recognition of an operator as an opportunity to generate output. Since operation results correspond to Expression types, I generate the appropriate kind of Expression node whenever an operation is recognized. For a TermOp, this is a TermExpression.

In Listing Ten, the method calls pop() twice to get the left- and right-side operands of the current operation. It is expected that these were pushed in previous calls; in fact, one was pushed in the parse_Term() call occurring in this method. After constructing the expression tree for a TermOp, I push it onto my stack, so that it is available to other operations. I then call parse_TermOp() recursively, to handle the TermOp (which may reduce to null) that occurs at the end of the current production. I also call the method resolveType(), which determines the result type, based on the type of the input operands.

Parsing Terminals

At some point, identifiers and numbers have to be recognized. The tokenizer produces tokens that correspond to Ident, IntLiteral, and FloatLiteral. The parser still has to perform semantic actions based on these, however, including identification of defined function names and type resolution of identifiers and numeric literals (float versus fixed, for instance). The parser therefore has methods that are called for recognizing identifiers and literals; see the parse_Ident() method in Listing Eleven.

Conclusion

Expression parsing is an important part of many applications. A Java-based expression parser is potentially a reusable component of any developer's toolkit. The techniques presented here can also be applied to parsing other kinds of languages instead of expressions, or extended to handle more complex expressions, such as those including arrays and strings, as well as other kinds of operators.

DDJ

Listing One

public interface Lexer{
    public Token getNextToken() throws LexicalException;
    public String getTokenValue();
}


Listing Two

public interface Token{
    public String getValue();
    public int getTokenKind();
}


Listing Three

public interface SymbolInstance{
}
public abstract class Expression
implements SymbolInstance
{
    public abstract String toString();
    public void setType(Type t) { type = t; }
    public Type getType() { return type; }
    public Object evaluate(EvaluationContext evaluation_context) 
    { return null; }
    public void setSyntacticallyGrouped(boolean b) { sg = b; }
    public boolean syntacticallyGrouped() { return sg; }
    private boolean sg = false;
    private Type type;
}


Listing Four

Expression     IdentExpression
    LiteralExpression 
        NumericLiteralExpression 
            IntegerLiteralExpression
            FloatLiteralExpression
    FunctionCallExpression
    ArithmeticExpression 
        TermExpression
        FactorExpression
    SignedExpression


Listing Five

protected void push(Expression ex){
    stack.insertElementAt(ex, 0);
}
protected Expression pop()
{
    Expression e = (Expression)(stack.elementAt(0));
    stack.removeElementAt(0);
    return e;
}


Listing Six

public interface Type{
    public boolean isInstanceOf(Type t);
}
public interface Symbol
{
    public String getName();
    public void setType(Type t);
    public Type getType();
}
public interface Region
{
    public Symbol lookup(String identifier);
    public Region getContainingRegion();
    public abstract Symbol createNewSymbol(String s);
}


Listing Seven

private static final int[][] first = {
    { k_NONE },
    { k_IntLiteral },
    { k_FloatLiteral },
    { k_Ident },
    { k_Asterisk },
    { k_Slash },
    { k_Plus },
    { k_Hyphen },
    { k_OpenParen },
    { k_CloseParen },
    { k_Comma },

    // k_Expression:
    { k_IntLiteral, k_FloatLiteral, k_Plus, k_Hyphen, k_Ident, k_OpenParen },
    // k_Term:
    { k_FloatLiteral, k_IntLiteral, k_Hyphen, k_Plus, k_OpenParen, k_Ident },
    // k_TermOp:
    { k_NONE, k_Hyphen, k_Plus },
    // k_Factor:
    { k_IntLiteral, k_FloatLiteral, k_Plus, k_Hyphen, k_Ident, k_OpenParen },
    // k_FactorOp:
    { k_NONE, k_Slash, k_Asterisk },
    // k_Primary:
    { k_FloatLiteral, k_IntLiteral, k_Hyphen, k_Plus, k_OpenParen, k_Ident },
    // k_PrimaryOp:
    { k_NONE, k_OpenParen },
    // k_EList:
    { k_FloatLiteral, k_IntLiteral, k_Hyphen, k_Plus, k_OpenParen, k_Ident, k_NONE },
    // k_NumericLiteral:
    { k_FloatLiteral, k_IntLiteral },
    // k_SignedExpression:
    { k_Plus, k_Hyphen },
    // k_IntExpression:
    { k_FloatLiteral, k_IntLiteral, k_Hyphen, k_Plus, k_OpenParen, k_Ident },
    // k_MoreEList:
    { k_NONE, k_Comma }
};


Listing Eight

protected static boolean tokenIsInFirstOf(int termkind, int symkind){
    int[] firstSet = first[symkind];
    for (int i = 0; i < firstSet.length; i++)
    {
        if (termkind == firstSet[i]) return true;
    }
    return false;
}


Listing Nine

protected void parse_Expression()throws ParseException, LexicalException
{
    if (tokenIsInFirstOf(getTokenKind(), k_Term))
    {
        parse_Term();
        parse_TermOp();
        return;
    }
    throw createSyntacticException(k_Expression);
}


Listing Ten

protected void parse_TermOp()throws ParseException, LexicalException
{
    if (tokenIsInFirstOf(getTokenKind(), k_Plus))
    {
        parse_Plus();
        parse_Term();
            
        TermExpression e = new TermExpression();
        Expression rhs = pop();
        Expression lhs = pop();
        e.setLHS(lhs);
        e.setRHS(rhs);
        e.setAdditive(true);
        e.setType(resolveType(lhs.getType(), PLUS, rhs.getType()));
        push(e);

        parse_TermOp();
        return;
    }
    if (tokenIsInFirstOf(getTokenKind(), k_Hyphen))
    {
        parse_Hyphen();
        // ...same as for k_Plus...
        e.setAdditive(false);
        // ...same as for k_Plus...
    }
    // Handle the null production case
    {
        if (getTokenKind() == k_NONE) wasReducing(k_TermOp);
        return;
    }
}


Listing Eleven

protected void parse_Ident()throws ParseException, LexicalException
{
    if (getTokenKind() != k_Ident) throw createSyntacticException(k_Ident);
    IdentExpression e = new IdentExpression();

    Symbol s = null;
    try
    {
        s = resolve(getCurrentRegion(), getTokenValue(), false);
        e.setSymbol(s);
        e.setType(s.getType());
    }
    catch (UnresolvedSymbolException ex1)
    {
        // Symbol not found - that's ok at this level
    }
    catch (SymbolConflictException ex2)
    {
        throw new RuntimeException("Should not happen");
    }
    getNextToken();
    push(e);
    return;
}



Copyright © 1999, Dr. Dobb's Journal