Features


A Regular Expression Class Library

Duncan Ellis and Sameer Udeshi

Regular expressions are a great way to describe patterns in text, with lots of applications.


Introduction

Regular expressions are a concise notation for describing patterns of strings. They're used in text editors and utilities such as grep, which allow users to search for a pattern in text. Compilers use regular expressions for token recognition. This article presents a set of classes which will enable you to add regular expression handling to your code.

The aims of the class design are as follows:

Regular Expressions as Finite Automata

It is possible to represent any regular expression as a finite automaton (see the sidebar "Regular Expressions" for a description). Representing a regular expression in this way provides a simple interface for input matching. The finite automaton conducts a search by sequencing through a series of internal states while reading characters from the input string. The automaton jumps from one state to the next based on its current state and the character it receives from its input. The match succeeds if the automaton reaches its accept state.

The following classes support this interface (refer to Listings 1, 2, 3a, 3b, 4a, 4b, and 5 for class definitions):

My first design of the finite automaton had a flaw, which is explained as follows. The finite automaton provides two major operations:

The flaw was that a program that needed only traversal functionality would be forced to include conversion functionality as well, and vice versa. Conversion and traversal functionality might need to be separated if the regular expressions were precompiled, for instance in a command-line interpreter for a utility or embedded system.

In redesigning the class structure, I have used the Strategy pattern [1] to achieve this separation and the ability to override automaton information storage. Strategy lets the algorithm vary independently of clients that use it. There are two classes, RegularExpression and FiniteAutomaton, which serve as clients of the strategy; CFABuild and CFAQuery are the abstract base classes that represent the strategy.

Implementing Input Matching

This section describes how the above classes may be used for implementing some common input matching requirements.

The CAutomatonState class encapsulates state information pertaining to a finite automaton. A CAutomatonState instance changes state as input is fed to it. Although CAutomatonState implements a nondeterministic finite automaton, the class effectively hides this from the user. It appears as a deterministic finite automaton (see the sidebar "Finite Automata" for description of these terms). Input matching code typically feeds character input to the CAutomatonState from a stream, checking for conditions such as:

Before any input matching can occur, the regular expression must be converted into a finite automaton representation. Below I describe some input matching functions that take finite automata as arguments. Before using any of these functions you must precede them with some code equivalent to the following:

// SimpleFA inherits from both
// CFABuild and CFAQuery
// argument: const char *regExp
// to construct a regular expression
// automaton
RegularExpression re;
SimpleFA fabq;
re.SetFABuild( &fabq );
re.ConvertToFiniteAutomaton( regExp );
 
// to inform the automaton handler
// of the query interface object
FiniteAutomaton fa;
fa.SetFAQuery( &fabq );

You can then call the match function with the automaton and search string:

 ... match function... (fa, srchStr);

As mentioned above, the CFAQuery object could have been precompiled. Using the fa object declared here, some input matching functions might be:

1. Exact match — Check whether the input string exactly matches the given regular expression. This function feeds successive input characters to the automaton until it reaches the end of the input or an invalid state is reached. An accept state if the first of these conditions is met implies an exact match.

//returns true if the input string
//exactly matches // the regular expression encoded in the
// automaton bool ExactMatch (FiniteAutomaton &fa, const char* srchText) { bool retval = false; CAutomatonState astate( fa );

  for( ;*srchText && astate.IsValid() ;
       srchText++)  
  {
    astate.Next(*srchText);
  }
    
  if( astate.IsAccept( ) &&
      (*srchText == 0))
    retval = true;
  return retval; }

2. Substring match — Check whether a substring (prefix) of the given input string matches the regular expression. The following function returns true if a substring that matches the regular expression is found. matchLen returns the length of the substring. This function feeds successive input characters to the automaton until it reaches an accept state.

bool SubStr(FiniteAutomaton &fa,
    const char* srchText, int& matchLen)
{
  bool retval = false;
     
  CAutomatonState astate( fa );
     
  for(matchLen=0;
      *srchText && astate.IsValid();
      srchText++)
  {
    astate.Next(*srchText);
    matchLen++;
    if( astate.IsAccept( ))
    {
      //the automaton is in accept state, means that the input
      //read so far matches the regular expression.
       retval = true;
       break;
    }
  }
  return retval;
}

3. Substring search — The input string contains an embedded string that matches the regular expression. This is the requirement for programs like grep — the substring match function can be used to implement this behavior. This function returns true if a substring that matches the regular expression is found in the input string. matchStr and matchLen return the position and length of the substring found.

bool StrStr(FiniteAutomaton &fa,
    const char* srchText,
    const char*& matchStr, int& matchLen)
{
  bool retval = false;
  for( matchStr = srchText; *matchStr
       ;matchStr++ )
  {
      if( SubStr(fa , matchStr ,
                 matchLen ))
      {
        retval = true;
        break;
      }
  }
  return retval;
}

4. Longest substring match — Look for the longest substring that matches the regular expression. Such a match is used by lexical analyzers to resolve ambiguity in identifying token values. For example, suppose the regular expression for variable names is [a-zA-Z]+[a-zA-Z0-9]*. If the input stream is "fre1 = 112" then the substrings "f", "fr", "fre", and "fre1" all match the regular expression for a variable name, but the lexical analyser should choose "fre1".

bool LongestSubStr(FiniteAutomaton &fa,
    const char* srchText, int& matchLen)
{
  bool retval;
  CAutomatonState astate( fa );
  int len = 0;
  for(len = 0;*srchText && astate.IsValid() ; srchText++ )
  {
     len++;
     if(astate.IsAccept())
    {
      retval = true;
      matchlen = len;
    }
  }
     
  return retval;
}

Overriding CFAQuery and CFABuild

As mentioned previously, the CFABuild and CFAQuery classes are the strategies used by the RegularExpression and FiniteAutomaton client classes. The client classes use the strategy classes' abstract interfaces; thus, the strategy implemented can vary without affecting the clients in any way. Examining the code for the SimpleFA class (Listing 6) will give you an idea of what is required from the implementation.

SimpleFA is not an optimal implementation. It supports a limited number of automaton transitions in the automaton, and its implementation of the ToStatesForEvent function is inefficient. If you want to use more complex regular expressions than SimpleFA can reasonably support, you can write your own implementation of the strategy classes.

Note that this code was developed with Visual C++ v4.1 and uses the MFC collection classes.

Conclusion

Many text processing applications/utilities can use regular expressions. A class library that provides support for a simple interface to use them can be very useful, especially since it hides the underlying implementation details.

Acknowledgements

The authors would like to thank Jitinder Sethi for reviewing this article.

Bibliography

[1] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns, Elements of Reusable Object-Oriented Software (Addison-Wesley, 1995).

[2] Alfred V. and Ullman Aho. Principles of Compiler Design (Addison-Wesley, 1977).

Duncan Ellis is a software developer in the CASE tools group at Oracle in the United Kingdom. He has ten years of experience in embedded systems, telecommunications, and applications development. He can be reached at dellis@uk.oracle.com.

Sameer Udeshi is a software developer working with Mastek India Ltd. He has seven years of experience in software development. Object-Oriented Design and C++ programming are his main interests.