Regular Expressions


Building a finite automaton to represent a regular expression amounts to parsing the regular expression and constructing a directed graph that describes the order in which characters are to occur.

Constructing this graph requires the regular expression parser to recognize the elements of a regular expression, create graph fragments for these elements, and attach these fragments to the automaton graph in the proper place. The algorithm used is described in the first "Dragon Book" [2].

The implementation described in the article handles the following set of regular expression syntax:

The regular expression elements are:

A subexpression is represented by a start node, an end node, and one or more transitions between the nodes for each character event. Each subexpression graph is contained between a pair of outer nodes and empty transitions. These outer nodes are referred to as the outer start and end nodes.

Although these outer nodes introduce unnecessary transitions into simple expression graphs, they are needed to prevent ambiguity in the interpretation of more complex expressions involving wildcards adjacent to the OR operator. Without these outer nodes, an expression like a*|b might be misinterpreted as (a|b)* when it should be treated as (a*)|b.

Using these outer nodes also allows the conversion to be performed in one pass of the regular expression.

Subexpressions joined by logical operators are connected to the parent graph through their outer nodes with empty transitions. For two subexpressions ANDed together, an empty transition is inserted between the outer end node of the first subexpression and the outer start node of the second (Figure 1a) . Two subexpressions joined by an OR have their outer start and end nodes connected to each other using empty transitions (Figure 1b) .

Applying a wildcard character to a subexpression adds empty transitions (i.e., transitions without an event) between the inner start and end nodes. For the "any character" wildcard ?, an empty transition is inserted from the inner start to inner end node; for the "one or more" character +, an empty transition is added from the end to the start; for the "zero or more" character *, empty transitions are added from the end to the start and from the start to the end (Figures 1c, 1d, 1e) .

A complete example of a finite automaton graph derived from a regular expression is shown in Figure 2 (the initial and terminal nodes are filled in). o