WRITING AWK-LIKE EXTENSIONS TO C

When neither AWK nor C can do the job, try a combination of the two

Jim Mischel

Jim Mischel is a former financial systems programmer and database consultant. He can be reached at 20 Stewart St., Durango, CO 81301 or on CompuServe: 73717,1355.


The process of string searching is an important part of many computer applications. Text editors, database managers, spreadsheet programs, and countless utility programs use string searching to one degree or another. Unfortunately, other than the common case-insensitive search, most programs require that the search string be specified exactly. Very few programs allow a search on strings that are described by such information as "all strings that begin with a capital letter, contain at least one digit, and end with a period." This restriction isn't very surprising, considering the inadequacy of the string search functions in most popular languages.

The AWK programming language is unique with respect to string search functions. Along with many other interesting features, AWK contains the facilities to search a string for substrings that match a regular expression (see the accompanying sidebar on regular expressions). AWK is a fantastic tool for in-house use, but there are drawbacks to using it for commercial software development. First, AWK is interpreted, which makes its programs execute more slowly than compiled code. In addition, the interpreter is an extra-cost item that must be included in order for the program to run. Second, AWK provides little access to either the operating system or the underlying hardware. In the following discussion, some familiarity with the AWK programming language will be helpful, but is not required.

AWK's Functions

AWK is designed around an implicit processing loop that automatically reads each input record, splits the record into fields, and then performs programmer-specified actions upon the record. AWK also provides global variables that contain several bits of information about program status: the number of records read, fields in the current record, and so on. This functionality can be nearly duplicated in C by means of a simple controlling loop and the AWKLIB getline( ) function, as shown in the QSTRING.C program in Listing One. (getline( ) and the other AWKLIB functions are declared in the header file AWKLIB.H, which is shown in Listing Two. AWKLIB.H must be included in any program that uses AWKLIB.)

The awk_init( ) function, which must be called before any of the other routines are used, initializes the AWKLIB global variables. Failure to call this function may produce some interesting (and probably erroneous) results.

getline( ) has the same calling sequence as fgets( ) but returns EOF on end-of-file or error, rather than returning NULL. (This change in return values was a purely selfish design decision on my part.) getline( ) reads the next input record (line) into memory, splits that record into fields on the global field separator FS (discussed later), stores the fields in the FIELDS[] array, and updates the field counter (NF). Unlike AWK routines, these C routines do not allow the record separator to be changed, and they assume that the record separator is a newline \ n.

The match( ) function searches the supplied string for substrings that match the supplied regular expression. This function returns a pointer to the beginning of the matched substring, or else returns NULL if no match is found. match( ) also updates the RSTART and RLENGTH variables. The following example returns a pointer to the s in she:

     char *c;
      c = match ("We followed wherever she went", "s?he");

In this example, RSTART is 22 and RLENGTH is 3.

The functions sub( ) and gsub( ) provide string substitution capabilities. sub( ) substitutes the replacement string for the first occurrence of a specified substring, and gsub( ) performs the substitution for every occurrence of the substring. Both of these functions return the number of substitutions that were made. For example, the following returns 1, and s reads "William took Bill's book."

     char s[80] = "Will took Bill's book";,
     sub ("[WB]ill", "William", s);

The next line then returns 2, and s reads "William took William's book:"

     gsub ("[WB]ill", "William", s);

Notice that gsub( ) makes only one pass through the string, and it doesn't try to substitute strings that have already been substituted (which is a good thing, because such a step would send the above example an infinite loop).

Both sub( ) and gsub( ) allow you to use the "&" operator to reference the matched substring in the replacement string. The following example appends "ed" to occurences of the word "want" in a sentence by modifying s to read "I wanted to go home!"

     char s[80] = "I want to go home!";
   gsub ("[Ww]ant", "&ed", s);

In order to include the "&" character literally in a string, preceed that character with the escape character. The following code modifies s to read "I &ed to go home!"

     gsub ("[Ww]ant", "\ &ed", s);

Neither sub( ) nor gsub( ) affect any of the global variables.

The split( ) function allows you to split a record into fields on a specified field separator, which can be any regular expression. (The global field separator, FS, defaults to "[ \ t]+," which means "one or more spaces or tabs." FS may be changed by using the setfs( ) function.) The function getline( ) uses split( ) and the global field separator in order to split input lines into the FIELDS[] array. split( ) can also be called by the application program, which may specify its own fields array or else use the global FIELDS. split( ) updates NF and then returns NF. The following example sets NF to 4 and returns 4:

char *a[128]; /* enough room for 127 fields */
split ("Will::took:Bill's:book.", a, ":+");

In this example, the a[] array is:

   a[0] - "Will::took:Bill's:book."
   a[1] - "Will"
   a[2] - "took"
   a[3] - "Bill's"
   a[4] - "book."

Note that the first element of the array always contains the unmodified string. When calling split( ), make sure that the fields array contains enough spaces for all of the fields in the record. Failure to include enough spaces will crash the program because split( ) handles allocation of memory for each field but cannot allocate memory for more fields.

The Functions at Work

The AWKLIB routines (Listing Three) are divided into three groups:

1. the compiler (makepat( ) and related functions) 2. the interpreter (match( ) and related functions) 3. the user interface (match( ), split( ), sub( ), gsub( ), and their re_...alternates).

The compiler is based upon a simple recursive-descent design. (Because this article is not about compilers, I won't cover the compiler specifics.) For more information on how compilers are built, consult the references listed at the end of this article. Input to the compiler consists of a human-readable regular expression. The compiler outputs the regular expression in prefix form, which is used more easily by the matching functions. The compiled form is very much like the form presented in Kernighan and Plauger's Software Tools in Pascal (which, by the way, is an excellent book).

The simplest form of regular expression, which is a literal character such as "a," is compiled into an expression such as "cae," which means "Match literal character 'a' and then end." A string literal is compiled into a series of literal characters. For example, "hi" becomes "chcie." The way that the interpreter handles this process will be discussed shortly.

The process of compiling character classes is also fairly simple. A compiled character class is represented simply by a "[" (or "]" if the character class is negated), followed by a count of characters, followed by the characters themselves. For example, the character class "[0-9A-Fa-f]," which specifies a hexadecimal digit, is represented by "['22'0123456789ABCDEFabcdefe." Note that the 22 is a single byte value, chr(22).

(This encoding scheme restricts the lengths of character classes and closure expressions -- neither may be longer than 255 characters. This restriction presents only a minor problem in the case of character classes because there are only 256 character codes. Any character class that contains N characters can be expressed as a negated character class of 256-N characters, or a maximum of 128.)

This process of handling alternation is a little more complex. The regular expression "a | b" compiles to "| caecbee." In this compiled expression, the 'e's tell the matching routines where to stop scanning. Alternation was the last item that I included in my compiler, and it created some problems that I hadn't anticipated. This operator forced the inclusion of the end-of-term ('e') characters in the compiled strings.

The compiled form of a closure is much like the compiled form of a character class. The compiled representation of a closure consists of the closure operator (?, *, or +), followed by the length of the expression, followed by "e." For example, "(a | b)*" is encoded as "*'7' | caecbee." Note again that the 7 actually is chr(7). With the above discussion in mind, let's take a look at how the matching functions work.

In a real sense, the matching functions, starting with match_term( ), make up an interpreter --these functions run a program (the compiled regular expression) against input data (the string to be searched) in order to produce output (the matched substring). The interpreter is based upon a fairly simple design (thanks to the pre-processing done by makepat( )) and holds very few tricks. In fact, the only pieces that gave me any real trouble were the alternation processing (" | " operator) and the closure matching.

The heart of the interpreter is the match_term( ) function, which determines what is to be matched and then, depending upon the operator, either attempts to match the next character or else dispatches to the proper routine for further processing. "Further processing" is either a character class, alternation, or a closure operation.

Character classes are easy to match by using match_ccl( ), which looks for a match by scanning the list of characters. match_ccl( ) then returns the proper value, depending upon if a match was found and whether that match is a negated character class.

More Details.

Alternation matching was the most difficult part for me to implement correctly. There is probably a better way to handle this step, but my solution works well and is fast enough for my purposes. The match_or( ) function breaks the remaining pattern into two patterns (one pattern for each branch), searches both branches, and then returns the longest substring that matched, if any matches occurred. For example, the match_or( ) breaks the regular expression "(b | c)d," which reads "| cbeccecde" when compiled into the two patterns "cbcde" and "cccde," and then uses each pattern to search the string. The step of splitting the patterns is performed by the skip_term( ) function.

Closure is handled by generating the longest possible substring that matches the closure operator, then backtracking to discover where the closure should stop in order to correctly match the rest of the pattern. Because of the amount of backtracking that's involved, closure can become very inefficient. This is especially true when nested closure operators, such as (a*b)*, are involved. Since closure, by definition, can match a null string, making this process work correctly for all cases was a little difficult.

I had originally designed the match_closure( ) function to return a Boolean value, but ran into trouble during testing when I (inadvertently) tried a pattern that contained a double closure operator ((a*)*). According to the automata theory books, "(a*)*" is the same as "a*" --but my program didn't know that. Consequently, match_closure( ) kept looping between the failure of the outer closure and the success of the inner closure. After spending several fruitless days attempting to have the compiler optimize the expressions (which is quite a bit more difficult than it looks), I finally hit upon the simple solution. Why not add another return to match_closure( ) to indicate that the closure matches a null? Sometimes the simplest solutions are the most difficult to come up with.

Using the Functions

The QSTRING.C program (in Listing One) locates and outputs all quoted strings into a C source file. This is the first version of a utility I've been working on; although QSTRING.C is primitive, it illustrates the use of the AWKLIB routines.

The most important point here is that awk_init( ) should always be called at the beginning of any program that uses the AWKLIB functions. Failure to do so will cause the functions to act erratically.

The makepat( ) function has been provided so that the program can compile a regular expression and then save the compiled form of the expression for later processing. The routines re_match( ), re_sub( ), re_gsub( ), and re_split( ) use a compiled pattern in lieu of a regular expression in order to speed processing. The two program fragments shown in Example 1 produce the same output, but Program A executes much faster because it only compiles the regular expression once. In contrast, Program B compiles the regular expression each time that match( ) is called. Typically, a program will attempt to match multiple strings against one or two regular expressions. In this case, the program executes much faster if the regular expressions are compiled first with makepat( ), and then re_match( ) (rather than match( )) is used inside the processing loop. This increase in program speed also holds true with respect to use of the other major functions: split( ) (use re_split( )), sub( ) (use re_sub( )), and gsub( ) (use re_gsub( )).

Example 1: Program A executes faster because it only compiles the regular expression once, while Program B compiles the regular expression each time that match( ) is called.

Program A                                  Program B

...                                        ...
while (getline (line, 80, f) != EOF)  {   makepat ("a+bc", pat);
  if (match (line, "a+bc") != NULL)       while (getline (line,
    puts (line);                             80, f) != EOF) {
}                                           if (re_match (line, pat)
                                              != NULL) puts (line);
...                                        }
                                           ...

Notice in the example program that gets( ), rather than getline( ), is used to read the file. Because getline( ) must split the input record into fields, it's much slower than gets( ). Use getline( ) only when the record must be split into fields on every (or most) input records.

The RLENGTH global variable extracts the matched substring from the rest of the string and updates the string pointer for the next match.

Conclusion

This entire project has been quite an education in compilers, interpreters, and automata theory. Construction of the compiler was made much easier by a tutorial series on compiler design that I found in the Computer Language Forum on CompuServe. While attempting to make the compiler optimize regular expressions, I had to delve into automata theory -- and I intend to pursue that interesting subject further.

Bibliography

Aho, A.; Kernighan, B.; Weinberger, J. The AWK Programming Language. Reading, Mass.: Addison-Wesley, 1988. Aho, A.; Ullman, J. Principles of Compiler Design. Reading, Mass.: Addison-Wesley, 1977.

Crenshaw, Jack W. "Let's Build a Compiler," Parts I-VII, unpublished papers, (available for electronic download from CompuServe, Computer Language Forum) Reading, Mass.

Hopcroft, J.; Ullman, J. Introduction to Automata Theory, Languages, and Computation. Reading, Mass.: Addison-Wesley, 1979.

Kernighan, B.; Plauger, P. Software Tools in Pascal. Reading, Mass.: Addison-Wesley, 1981.

Availability

All source code for articles in this issue is available on a single disk. To order, send $14.95 (Calif. residents add sales tax) to Dr. Dobb's Journal, 501 Galveston Dr., Redwood City, CA 94063, or call 800-356-2002 (from inside Calif.) or 800-533-4372 (from outside Calif.). Please specify the issue number and format (MS-DOS, Macintosh, Kaypro).

Regular Expressions

The regular expressions used by AWK and similar programs are derived from the notation used in automata theory to describe formal languages and finite state machines. Regular expressions consist of normal characters (operands), such as "a," "0", and "," plus meta-characters (operators), such as "+," "|," and "[." A regular expression, which is similar to familiar arithmetic expressions, can be either a basic expression or a complex expression that is formed by applying operators to several basic expressions.

The regular expression meta-characters and their uses are:

  \

The escape character is used in escape sequences to specify characters which would otherwise have no representation. These escape sequences are similar to those used in the C language:

  \b    Backspace
  \t    Horizontal tab
  \n    Newline or linefeed
  \f    Newpage or formfeed
  \r    Carriage return
  \ddd  Octal value "ddd."  ddd can
         be one to three digits long.
  \c    "c" can be any character
         that is taken literally

^ The carat matches the beginning of a string. For example, "^a" matches only those strings that have "a" as the first character. When "^" is the first character in a character class, the character denotes a negated character class.

$ The dollar sign matches the end of a string. For example, "z$" matches only those strings that have "z" as the last character.

. The period matches any character. Be careful with this one ".*" matches everything.

[ The open bracket denotes the beginning of a character class.

] The closed bracket denotes the end of a character class.

| This character is the alternation operator. "a | b" matches either "a" or "b".

( ) Parenthesis are used to group expressions in much the same manner in that they are used with arithmetic expressions.