May 1998/A Regular Expression Class Library/Sidebar

Regular Expressions

Building a finite automaton to represent a regular expression amounts to parsing the regular expression and constructing a directed graph that describes the order in which characters are to occur.
Constructing this graph requires the regular expression parser to recognize the elements of a regular expression, create graph fragments for these elements, and attach these fragments to the automaton graph in the proper place. The algorithm used is described in the first "Dragon Book" [2].
The implementation described in the article handles the following set of regular expression syntax:

wildcards: ? (for any character), + (for one or more of something), * (for zero or more of something)

sets of characters: characters enclosed in square brackets will be treated as an option set, although character ranges have not been implemented. That is, [abc] is valid, [a-c] is not.

logical OR: subexpressions may be ORed together.

parenthesized subexpressions: a regular expression may be enclosed within parentheses and will be treated as a unit.

escape characters: sequences such as \t, \n, etc. will be substituted for an equivalent single character. \\ represents the backslash.

The regular expression elements are:

subexpressions: either a single character (including escape characters), or a set of characters (enclosed between square brackets), or a parenthesized regular expression

wildcard characters: to modify a subexpression's meaning

logical operators: these are used to control how a subexpression graph is connected to the parent automaton graph. The OR operator | is the only explicit logical operator recognized, but any sequence of subexpressions is connected by an implicit AND operator.

A subexpression is represented by a start node, an end node, and one or more transitions between the nodes for each character event. Each subexpression graph is contained between a pair of outer nodes and empty transitions. These outer nodes are referred to as the outer start and end nodes.
Although these outer nodes introduce unnecessary transitions into simple expression graphs, they are needed to prevent ambiguity in the interpretation of more complex expressions involving wildcards adjacent to the OR operator. Without these outer nodes, an expression like a*|b might be misinterpreted as (a|b)* when it should be treated as (a*)|b.
Using these outer nodes also allows the conversion to be performed in one pass of the regular expression.
Subexpressions joined by logical operators are connected to the parent graph through their outer nodes with empty transitions. For two subexpressions ANDed together, an empty transition is inserted between the outer end node of the first subexpression and the outer start node of the second (Figure 1a) . Two subexpressions joined by an OR have their outer start and end nodes connected to each other using empty transitions (Figure 1b) .
Applying a wildcard character to a subexpression adds empty transitions (i.e., transitions without an event) between the inner start and end nodes. For the "any character" wildcard ?, an empty transition is inserted from the inner start to inner end node; for the "one or more" character +, an empty transition is added from the end to the start; for the "zero or more" character *, empty transitions are added from the end to the start and from the start to the end (Figures 1c, 1d, 1e) .
A complete example of a finite automaton graph derived from a regular expression is shown in Figure 2 (the initial and terminal nodes are filled in). o