Jim Mischel is a former financial systems programmer and database consultant. He can be reached at 20 Stewart St., Durango, CO 81301 or on CompuServe: 73717,1355.
The process of string searching is an important part of many computer applications. Text editors, database managers, spreadsheet programs, and countless utility programs use string searching to one degree or another. Unfortunately, other than the common case-insensitive search, most programs require that the search string be specified exactly. Very few programs allow a search on strings that are described by such information as "all strings that begin with a capital letter, contain at least one digit, and end with a period." This restriction isn't very surprising, considering the inadequacy of the string search functions in most popular languages.
The AWK programming language is unique with respect to string search functions. Along with many other interesting features, AWK contains the facilities to search a string for substrings that match a regular expression (see the accompanying sidebar on regular expressions). AWK is a fantastic tool for in-house use, but there are drawbacks to using it for commercial software development. First, AWK is interpreted, which makes its programs execute more slowly than compiled code. In addition, the interpreter is an extra-cost item that must be included in order for the program to run. Second, AWK provides little access to either the operating system or the underlying hardware. In the following discussion, some familiarity with the AWK programming language will be helpful, but is not required.
AWK is designed around an implicit processing loop that automatically reads each input record, splits the record into fields, and then performs programmer-specified actions upon the record. AWK also provides global variables that contain several bits of information about program status: the number of records read, fields in the current record, and so on. This functionality can be nearly duplicated in C by means of a simple controlling loop and the AWKLIB getline( ) function, as shown in the QSTRING.C program in Listing One. (getline( ) and the other AWKLIB functions are declared in the header file AWKLIB.H, which is shown in Listing Two. AWKLIB.H must be included in any program that uses AWKLIB.)
The awk_init( ) function, which must be called before any of the other routines are used, initializes the AWKLIB global variables. Failure to call this function may produce some interesting (and probably erroneous) results.
getline( ) has the same calling sequence as fgets( ) but returns EOF on end-of-file or error, rather than returning NULL. (This change in return values was a purely selfish design decision on my part.) getline( ) reads the next input record (line) into memory, splits that record into fields on the global field separator FS (discussed later), stores the fields in the FIELDS[] array, and updates the field counter (NF). Unlike AWK routines, these C routines do not allow the record separator to be changed, and they assume that the record separator is a newline \ n.
The match( ) function searches the supplied string for substrings that match the supplied regular expression. This function returns a pointer to the beginning of the matched substring, or else returns NULL if no match is found. match( ) also updates the RSTART and RLENGTH variables. The following example returns a pointer to the s in she:
char *c;
c = match ("We followed wherever she went", "s?he");In this example, RSTART is 22 and RLENGTH is 3.
The functions sub( ) and gsub( ) provide string substitution capabilities. sub( ) substitutes the replacement string for the first occurrence of a specified substring, and gsub( ) performs the substitution for every occurrence of the substring. Both of these functions return the number of substitutions that were made. For example, the following returns 1, and s reads "William took Bill's book."
char s[80] = "Will took Bill's book";,
sub ("[WB]ill", "William", s);The next line then returns 2, and s reads "William took William's book:"
gsub ("[WB]ill", "William", s);Notice that gsub( ) makes only one pass through the string, and it doesn't try to substitute strings that have already been substituted (which is a good thing, because such a step would send the above example an infinite loop).
Both sub( ) and gsub( ) allow you to use the "&" operator to reference the matched substring in the replacement string. The following example appends "ed" to occurences of the word "want" in a sentence by modifying s to read "I wanted to go home!"
char s[80] = "I want to go home!";
gsub ("[Ww]ant", "&ed", s);In order to include the "&" character literally in a string, preceed that character with the escape character. The following code modifies s to read "I &ed to go home!"
gsub ("[Ww]ant", "\ &ed", s);Neither sub( ) nor gsub( ) affect any of the global variables.
The split( ) function allows you to split a record into fields on a specified field separator, which can be any regular expression. (The global field separator, FS, defaults to "[ \ t]+," which means "one or more spaces or tabs." FS may be changed by using the setfs( ) function.) The function getline( ) uses split( ) and the global field separator in order to split input lines into the FIELDS[] array. split( ) can also be called by the application program, which may specify its own fields array or else use the global FIELDS. split( ) updates NF and then returns NF. The following example sets NF to 4 and returns 4:
char *a[128]; /* enough room for 127 fields */
split ("Will::took:Bill's:book.", a, ":+");In this example, the a[] array is:
a[0] - "Will::took:Bill's:book." a[1] - "Will" a[2] - "took" a[3] - "Bill's" a[4] - "book."
Note that the first element of the array always contains the unmodified string. When calling split( ), make sure that the fields array contains enough spaces for all of the fields in the record. Failure to include enough spaces will crash the program because split( ) handles allocation of memory for each field but cannot allocate memory for more fields.
The AWKLIB routines (Listing Three) are divided into three groups:
1. the compiler (makepat( ) and related functions) 2. the interpreter (match( ) and related functions) 3. the user interface (match( ), split( ), sub( ), gsub( ), and their re_...alternates).
The compiler is based upon a simple recursive-descent design. (Because this article is not about compilers, I won't cover the compiler specifics.) For more information on how compilers are built, consult the references listed at the end of this article. Input to the compiler consists of a human-readable regular expression. The compiler outputs the regular expression in prefix form, which is used more easily by the matching functions. The compiled form is very much like the form presented in Kernighan and Plauger's Software Tools in Pascal (which, by the way, is an excellent book).
The simplest form of regular expression, which is a literal character such as "a," is compiled into an expression such as "cae," which means "Match literal character 'a' and then end." A string literal is compiled into a series of literal characters. For example, "hi" becomes "chcie." The way that the interpreter handles this process will be discussed shortly.
The process of compiling character classes is also fairly simple. A compiled character class is represented simply by a "[" (or "]" if the character class is negated), followed by a count of characters, followed by the characters themselves. For example, the character class "[0-9A-Fa-f]," which specifies a hexadecimal digit, is represented by "['22'0123456789ABCDEFabcdefe." Note that the 22 is a single byte value, chr(22).
(This encoding scheme restricts the lengths of character classes and closure expressions -- neither may be longer than 255 characters. This restriction presents only a minor problem in the case of character classes because there are only 256 character codes. Any character class that contains N characters can be expressed as a negated character class of 256-N characters, or a maximum of 128.)
This process of handling alternation is a little more complex. The regular expression "a | b" compiles to "| caecbee." In this compiled expression, the 'e's tell the matching routines where to stop scanning. Alternation was the last item that I included in my compiler, and it created some problems that I hadn't anticipated. This operator forced the inclusion of the end-of-term ('e') characters in the compiled strings.
The compiled form of a closure is much like the compiled form of a character class. The compiled representation of a closure consists of the closure operator (?, *, or +), followed by the length of the expression, followed by "e." For example, "(a | b)*" is encoded as "*'7' | caecbee." Note again that the 7 actually is chr(7). With the above discussion in mind, let's take a look at how the matching functions work.
In a real sense, the matching functions, starting with match_term( ), make up an interpreter --these functions run a program (the compiled regular expression) against input data (the string to be searched) in order to produce output (the matched substring). The interpreter is based upon a fairly simple design (thanks to the pre-processing done by makepat( )) and holds very few tricks. In fact, the only pieces that gave me any real trouble were the alternation processing (" | " operator) and the closure matching.
The heart of the interpreter is the match_term( ) function, which determines what is to be matched and then, depending upon the operator, either attempts to match the next character or else dispatches to the proper routine for further processing. "Further processing" is either a character class, alternation, or a closure operation.
Character classes are easy to match by using match_ccl( ), which looks for a match by scanning the list of characters. match_ccl( ) then returns the proper value, depending upon if a match was found and whether that match is a negated character class.
Alternation matching was the most difficult part for me to implement correctly. There is probably a better way to handle this step, but my solution works well and is fast enough for my purposes. The match_or( ) function breaks the remaining pattern into two patterns (one pattern for each branch), searches both branches, and then returns the longest substring that matched, if any matches occurred. For example, the match_or( ) breaks the regular expression "(b | c)d," which reads "| cbeccecde" when compiled into the two patterns "cbcde" and "cccde," and then uses each pattern to search the string. The step of splitting the patterns is performed by the skip_term( ) function.
Closure is handled by generating the longest possible substring that matches the closure operator, then backtracking to discover where the closure should stop in order to correctly match the rest of the pattern. Because of the amount of backtracking that's involved, closure can become very inefficient. This is especially true when nested closure operators, such as (a*b)*, are involved. Since closure, by definition, can match a null string, making this process work correctly for all cases was a little difficult.
I had originally designed the match_closure( ) function to return a Boolean value, but ran into trouble during testing when I (inadvertently) tried a pattern that contained a double closure operator ((a*)*). According to the automata theory books, "(a*)*" is the same as "a*" --but my program didn't know that. Consequently, match_closure( ) kept looping between the failure of the outer closure and the success of the inner closure. After spending several fruitless days attempting to have the compiler optimize the expressions (which is quite a bit more difficult than it looks), I finally hit upon the simple solution. Why not add another return to match_closure( ) to indicate that the closure matches a null? Sometimes the simplest solutions are the most difficult to come up with.
The QSTRING.C program (in Listing One) locates and outputs all quoted strings into a C source file. This is the first version of a utility I've been working on; although QSTRING.C is primitive, it illustrates the use of the AWKLIB routines.
The most important point here is that awk_init( ) should always be called at the beginning of any program that uses the AWKLIB functions. Failure to do so will cause the functions to act erratically.
The makepat( ) function has been provided so that the program can compile a regular expression and then save the compiled form of the expression for later processing. The routines re_match( ), re_sub( ), re_gsub( ), and re_split( ) use a compiled pattern in lieu of a regular expression in order to speed processing. The two program fragments shown in Example 1 produce the same output, but Program A executes much faster because it only compiles the regular expression once. In contrast, Program B compiles the regular expression each time that match( ) is called. Typically, a program will attempt to match multiple strings against one or two regular expressions. In this case, the program executes much faster if the regular expressions are compiled first with makepat( ), and then re_match( ) (rather than match( )) is used inside the processing loop. This increase in program speed also holds true with respect to use of the other major functions: split( ) (use re_split( )), sub( ) (use re_sub( )), and gsub( ) (use re_gsub( )).
Program A Program B
... ...
while (getline (line, 80, f) != EOF) { makepat ("a+bc", pat);
if (match (line, "a+bc") != NULL) while (getline (line,
puts (line); 80, f) != EOF) {
} if (re_match (line, pat)
!= NULL) puts (line);
... }
...Notice in the example program that gets( ), rather than getline( ), is used to read the file. Because getline( ) must split the input record into fields, it's much slower than gets( ). Use getline( ) only when the record must be split into fields on every (or most) input records.
The RLENGTH global variable extracts the matched substring from the rest of the string and updates the string pointer for the next match.
This entire project has been quite an education in compilers, interpreters, and automata theory. Construction of the compiler was made much easier by a tutorial series on compiler design that I found in the Computer Language Forum on CompuServe. While attempting to make the compiler optimize regular expressions, I had to delve into automata theory -- and I intend to pursue that interesting subject further.
Aho, A.; Kernighan, B.; Weinberger, J. The AWK Programming Language. Reading, Mass.: Addison-Wesley, 1988. Aho, A.; Ullman, J. Principles of Compiler Design. Reading, Mass.: Addison-Wesley, 1977.
Crenshaw, Jack W. "Let's Build a Compiler," Parts I-VII, unpublished papers, (available for electronic download from CompuServe, Computer Language Forum) Reading, Mass.
Hopcroft, J.; Ullman, J. Introduction to Automata Theory, Languages, and Computation. Reading, Mass.: Addison-Wesley, 1979.
Kernighan, B.; Plauger, P. Software Tools in Pascal. Reading, Mass.: Addison-Wesley, 1981.
All source code for articles in this issue is available on a single disk. To order, send $14.95 (Calif. residents add sales tax) to Dr. Dobb's Journal, 501 Galveston Dr., Redwood City, CA 94063, or call 800-356-2002 (from inside Calif.) or 800-533-4372 (from outside Calif.). Please specify the issue number and format (MS-DOS, Macintosh, Kaypro).
The regular expressions used by AWK and similar programs are derived from the notation used in automata theory to describe formal languages and finite state machines. Regular expressions consist of normal characters (operands), such as "a," "0", and "," plus meta-characters (operators), such as "+," "|," and "[." A regular expression, which is similar to familiar arithmetic expressions, can be either a basic expression or a complex expression that is formed by applying operators to several basic expressions.
The regular expression meta-characters and their uses are:
\
The escape character is used in escape sequences to specify characters which would otherwise have no representation. These escape sequences are similar to those used in the C language:
\b Backspace
\t Horizontal tab
\n Newline or linefeed
\f Newpage or formfeed
\r Carriage return
\ddd Octal value "ddd." ddd can
be one to three digits long.
\c "c" can be any character
that is taken literally
^ The carat matches the beginning of a string. For example, "^a" matches only those strings that have "a" as the first character. When "^" is the first character in a character class, the character denotes a negated character class.
$ The dollar sign matches the end of a string. For example, "z$" matches only those strings that have "z" as the last character.
. The period matches any character. Be careful with this one ".*" matches everything.
[ The open bracket denotes the beginning of a character class.
] The closed bracket denotes the end of a character class.
| This character is the alternation operator. "a | b" matches either "a" or "b".
( ) Parenthesis are used to group expressions in much the same manner in that they are used with arithmetic expressions.
Character classes are "shorthand" for matching one of several characters. For example, [AaBb] is the same as (A | a | B | b) and matches "A," "a," "B," or "b." There are also character ranges, such as [A - Z], which match any uppercase alpha character. Negated character classes act in the opposite way, and they specify characters that should not be matched. For example, [^A-Z] matches everything except uppercase alpha characters.
With the above in mind, let's examine some regular expressions.
"^[a-z]$" matches any line that consists of only lowercase letters.
"[A - Za-z_][A-Za - z_0-9]*" matches a C variable anywhere on a line.
"(soft | loud) (classical | rock| country) music" matches six strings from "soft classical music" to "loud country music."
"^[0-9].*[^0-9]$" matches any line that begins with a digit and doesn't end in a digit.
"80[123]?8[86] | 680[01234]0" matches any of the more popular microprocessors, plus a few that never made it.
Writing AWK-like Extensions to C
by Jim Mischel
[LISTING ONE]
Copyright © 1989, Dr. Dobb's Journal
/*
* qstring.c - finds and outputs all quoted strings in a C source file.
*
* Copyright 1988, Jim Mischel
*/
#include <stdio.h>
#include <string.h>
#include "awklib.h"
void main (void) {
char pat[MAXPAT];
char buff[MAXSTR];
char s[MAXSTR];
char *c;
awk_init ();
if (makepat("\"[^\"]*\"", pat) == NULL) {
fprintf (stderr, "Error compiling pattern string.\n");
return;
}
while (gets (buff) != NULL) {
c = buff;
while ((c = re_match (c, pat)) != NULL) {
strncpy (s, c, RLENGTH);
s[RLENGTH] = '\0';
puts (s);
c += RLENGTH;
}
}
}
[LISTING TWO]
/*
* awklib.h - defines, global variables, and function prototypes for
* AWKLIB routines.
*
* Copyright 1988, Jim Mischel. All rights reserved.
*
*/
extern int RSTART;
extern int RLENGTH;
extern int NF;
extern char *FS;
extern char *FS_PAT;
extern char *FIELDS[];
#define MAXSTR 1024
#define MAXPAT 2*MAXSTR
void pascal awk_init (void);
char * pascal setfs (char *fs);
char * pascal match (char *s, char *re);
char * pascal re_match (char *s, char *pat);
char * pascal makepat (char *re, char *pat);
int pascal split (char *s, char **a, char *fs);
int pascal re_split (char *s, char **a, char *pat);
int pascal getline (char *s, int nchar, FILE *infile);
int pascal sub (char *re, char *replace, char *str);
int pascal re_sub (char *pat, char *replace, char *str);
int pascal gsub (char *re, char *replace, char *str);
int pascal re_gsub (char *pat, char *replace, char *str);
char * pascal strins (char *s, char *i, int pos);
char * pascal strcins (char *s, int ch, int pos);
char * pascal strdel (char *s, int pos, int n);
char * pascal strcdel (char *s, int pos);
char * pascal strccat (char *s, int c);
[LISTING THREE]
/*
* awklib.c - C callable routines that provide field splitting and regular
* expression matching functions much like those found in AWK.
*
* Copyright 1988, Jim Mischel. All rights reserved.
*/
#include <stdio.h>
#include <string.h>
#include <alloc.h>
#include <process.h>
#define TRUE -1
#define FALSE 0
#define ALMOST 1 /* returned when a closure matches a NULL */
#define ENDSTR '\0'
#define EOL '$'
#define BOL '^'
#define NEGATE '^'
#define CCL '['
#define NCCL ']'
#define CCLEND ']'
#define ANY '.'
#define DASH '-'
#define OR '|'
#define ESCAPE '\\'
#define LPAREN '('
#define RPAREN ')'
#define CLOSURE '*'
#define POS_CLO '+'
#define ZERO_ONE '?'
#define LITCHAR 'c'
#define END_TERM 'e'
#define FS_DEFAULT "[ \t]+"
#define MAXSTR 1024
#define MAXPAT 2*MAXSTR
#define MAXFIELD 128
/*
* AWKLIB global variables. These variables are defined in AWKLIB.H and may
* be accessed by the application.
*/
int RSTART; /* start of matched substring */
int RLENGTH; /* length of matched substring */
int NF; /* number of fields from most current split */
char * FS; /* global field separator */
char * FS_PAT; /* compiled field separator */
char * FIELDS[MAXFIELD]; /* contents of fields from most current split */
/*
* Internal function prototypes.
*/
char * pascal re_match (char *s, char *pat);
char * pascal makepat (char *re, char *pat);
int pascal re_split (char *s, char **a, char *pat);
char * pascal do_sub (char *str, int inx, int len, char *replace);
char parse_escape (void);
/*
* These string routines, while designed specifically for this application,
* may be useful to other programs. Their prototypes are included in the
* AWKLIB.H file.
*/
char * pascal strins (char *s, char *i, int pos);
char * pascal strcins (char *s, int ch, int pos);
char * pascal strdel (char *s, int pos, int n);
char * pascal strcdel (char *s, int pos);
char * pascal strccat (char *s, int c);
/*
* Initialize AWKLIB global variables. This routine MUST be called before
* using the AWKLIB routines. Failure to do so may produce some strange
* results.
*/
void pascal awk_init (void) {
int x;
char * pascal setfs (char *fs);
FS = FS_PAT = NULL;
setfs (FS_DEFAULT);
RSTART = RLENGTH = NF = 0;
for (x = 0; x < MAXFIELD; x++)
FIELDS[x] = NULL;
} /* awk_init */
/*
* Sets the field separator to the regular expression fs. The regular
* expression is compiled into FS_PAT. FS_PAT is returned. NULL is returned
* on error and neither FS or FS_PAT is modified.
*/
char * pascal setfs (char *fs) {
char pat[MAXPAT];
pat[0] = ENDSTR;
if (makepat (fs_DEFAULT, pat) == NULL)
return (NULL);
if (FS != NULL)
free (FS);
if (FS_PAT != NULL)
free (FS_PAT);
FS = strdup (fs);
FS_PAT = strdup (pat);
return (FS_PAT);
} /* setfs */
/*
* makepat() - "compile" the regular expression re into pat and return a
* a pointer to the compiled string, or NULL if the compile fails.
*
* Performs a recursive descent parse of the expression.
*/
char *_re_ptr; /* global for pattern building */
char * pascal makepat (char *re, char *pat) {
char *t;
char * parse_expression (void);
_re_ptr = re;
if ((t = parse_expression ()) == NULL)
return (NULL);
else if (*_re_ptr != ENDSTR) {
free (t);
return (NULL);
}
else {
strcpy (pat, t);
free (t);
return (pat);
}
} /* makepat */
/*
* parse_expression() - Parse and translate an expression. Returns a pointer
* to the compiled expression, or NULL on error.
*/
char * parse_expression (void) {
char pat[MAXPAT];
char *arg1;
char * parse_term (void);
pat[0] = ENDSTR;
if ((arg1 = parse_term ()) == NULL) /* get the first term */
return (NULL);
while (*_re_ptr == OR) { /* parse all subsequent terms */
strccat (pat, OR);
strcat (pat, arg1);
strccat (pat, END_TERM);
free (arg1);
_re_ptr++;
if ((arg1 = parse_term ()) == NULL)
return (NULL);
}
strcat (pat, arg1);
strccat (pat, END_TERM);
free (arg1);
return (strdup (pat));
} /* parse_expression */
/*
* parse_term() - parse and translate a term. Returns a pointer to the
* compiled term or NULL on error.
*/
char * parse_term (void) {
char *t;
char pat[MAXPAT];
int isfactor (char c);
char * parse_factor (void);
pat[0] = ENDSTR;
if (*_re_ptr == BOL)
strccat (pat, *_re_ptr++);
do {
if ((t = parse_factor ()) == NULL)
return (NULL);
else {
strcat (pat, t);
free (t);
}
} while (isfactor (*_re_ptr)); /* parse all factors of this term */
return (strdup (pat));
} /* parse_term */
/*
* isfactor() - returns TRUE if c is a valid factor character
*/
int isfactor (char c) {
static char nfac_chars[] = "^|)]+?*";
return (strchr (nfac_chars, c) == NULL) ? TRUE : FALSE;
} /* isfactor */
/*
* parse_factor() - parse and translate a factor. Returns a pointer to the
* compiled factor or NULL on error.
*/
char * parse_factor (void) {
char pat[MAXPAT];
char *t;
char * parse_expression (void);
int parse_closure (char *pat, char c);
char * parse_ccl (void);
pat[0] = ENDSTR;
switch (*_re_ptr) {
case LPAREN : /* parenthesised expression */
_re_ptr++;
t = parse_expression ();
strcat (pat, t);
free (t);
if (*_re_ptr++ != RPAREN)
return (NULL);
break;
case CCL : /* character class */
_re_ptr++;
t = parse_ccl ();
strcat (pat, t);
free (t);
if (*_re_ptr++ != CCLEND)
return (NULL);
break;
case ANY : /* '.' or '$' operators */
case EOL :
strccat (pat, *_re_ptr++);
break;
case ESCAPE : /* ESCAPE character */
_re_ptr++;
strccat (pat, LITCHAR);
strccat (pat, parse_escape ());
break;
case CLOSURE :
case POS_CLO :
case ZERO_ONE :
case NEGATE :
case CCLEND :
case RPAREN :
case OR : /* not valid characters */
return (NULL);
default : /* literal character */
strccat (pat, LITCHAR);
strccat (pat, *_re_ptr++);
break;
}
/*
* check for closure
*/
if (*_re_ptr == CLOSURE || *_re_ptr == ZERO_ONE || *_re_ptr == POS_CLO)
if (parse_closure (pat, *_re_ptr++) == FALSE)
return (NULL);
return (strdup (pat));
} /* parse_factor */
/*
* parse_escape () - returns ASCII value of character(s) following ESCAPE
*/
char parse_escape (void) {
unsigned char ch;
switch (*_re_ptr) {
case 'b' : _re_ptr++; return ('\b'); /* backspace */
case 't' : _re_ptr++; return ('\t'); /* tab */
case 'f' : _re_ptr++; return ('\f'); /* formfeed */
case 'n' : _re_ptr++; return ('\n'); /* linefeed */
case 'r' : _re_ptr++; return ('\r'); /* carriage return */
case '0' : /* 0-7 is octal constant */
case '1' :
case '2' :
case '3' :
case '4' :
case '5' :
case '6' :
case '7' :
ch = *_re_ptr++ - '0';
if (*_re_ptr >= '0' && *_re_ptr < '8') {
ch <<= 3;
ch += *_re_ptr++ - '0';
}
if (*_re_ptr >= '0' && *_re_ptr < '8') {
ch <<= 3;
ch += *_re_ptr++ - '0';
}
return (ch);
default : /* otherwise, just that char */
return (*_re_ptr++);
}
} /* parse_escape */
/*
* parse_closure() - place closure character and size before the factor
* in the compiled string.
*/
int parse_closure (char *pat, char c) {
int len;
memmove (pat+2, pat, strlen (pat) + 1);
pat[0] = c;
len = strlen (pat + 2);
if (len > 255)
return (FALSE); /* closure expression too large */
else {
pat[1] = len;
return (TRUE);
}
} /* parse_closure */
/*
* parse_ccl() - parse and translate a character class. Return pointer to the
* compiled class or NULL on error.
*/
char * parse_ccl (void) {
char pat[MAXPAT];
int first = TRUE;
int len;
char * parse_dash (char *pat, char ch);
strcpy (pat, "[ ");
if (*_re_ptr == NEGATE) { /* if first character is NEGATE */
pat[0] = NCCL; /* then we have a negated */
_re_ptr++; /* character class */
}
/*
* parse all characters up to the closing bracket or end of string marker
*/
while (*_re_ptr != CCLEND && *_re_ptr != ENDSTR) {
if (*_re_ptr == DASH && first == FALSE) { /* DASH, check for range */
if (*++_re_ptr == NCCL)
strccat (pat, DASH); /* not range, literal DASH */
else
parse_dash (pat, *_re_ptr++);
}
else {
if (*_re_ptr == ESCAPE) {
_re_ptr++;
strccat (pat, parse_escape ());
}
else
strccat (pat, *_re_ptr++);
}
first = FALSE;
}
len = strlen (pat+2);
if (len > 255)
return (NULL); /* character class too large */
else {
pat[1] = len; /* store CCL length at pat[1] */
return (strdup (pat));
}
} /* parse_ccl */
/*
* parse_dash() - fill in range characters.
*/
char * parse_dash (char *pat, char ch) {
int ch1;
for (ch1 = pat[strlen (pat) - 1] + 1; ch1 <= ch; ch1++)
strccat (pat, ch1);
return (pat);
} /* parse_dash */
/*
* match() - Return a pointer to the first character of the left-most longest
* substring of s that matches re or NULL if no match is found. Sets
* RSTART and RLENGTH. This routine compiles the regular expression re and
* then calls re_match to perform the actual matching.
*/
char * pascal match (char *s, char *re) {
char pat[MAXPAT];
pat[0] = ENDSTR;
if (makepat (re, pat) == NULL)
return (NULL);
return (re_match (s, pat));
} /* match */
/*
* re_match() - Return a pointer to the first character of the left-most
* longest substring of s that matches pat, or NULL if no match is found.
* Sets RSTART and RLENGTH. The != FALSE test below must NOT be changed
* to == TRUE. match_term() can return TRUE, FALSE, or ALMOST. Both TRUE
* and ALMOST are considered TRUE by this routine.
*/
char *_s_end; /* global points to last character matched */
char * pascal re_match (char *s, char *pat) {
char *c = s;
int pascal match_term (int inx, char *s, char *pat);
_s_end = NULL;
while (*c != ENDSTR) {
if (match_term (c-s, c, pat) != FALSE) {
RSTART = c-s;
RLENGTH = _s_end - c;
return (c);
}
c++;
}
RSTART = RLENGTH = 0;
return (NULL);
} /* re_match */
/*
* Match a compiled term. Returns TRUE, FALSE, or ALMOST.
*/
int pascal match_term (int inx, char *s, char *pat) {
int pascal match_or (int inx, char *s, char *pat);
int pascal match_ccl (char c, char *pat);
int pascal match_closure (int inx, char *s, char *pat, char *clopat);
int pascal match_0_1 (int inx, char *s, char *pat);
_s_end = s;
if (*pat == ENDSTR)
return (FALSE);
do {
switch (*pat) {
case BOL : /* match beginning of line */
if (inx != 0)
return (FALSE);
pat++;
break;
case LITCHAR : /* match literal character */
if (*s++ != *++pat)
return (FALSE);
pat++;
break;
case END_TERM : pat++; break; /* skip end-of-term character */
case ANY : /* match any character ... */
if (*s++ == ENDSTR) /* ... except end of string */
return (FALSE);
pat++;
break;
case OR : return (match_or (inx, s, pat));
case CCL : /* character class requires */
case NCCL : /* special processing */
if (*s == ENDSTR)
return (FALSE);
if (!match_ccl (*s++, pat++))
return (FALSE);
pat += *pat + 1;
break;
case EOL : /* match end of string */
if (*s != ENDSTR)
return (FALSE);
pat++;
break;
case ZERO_ONE : return (match_0_1 (inx, s, pat));
case CLOSURE :
case POS_CLO : {
char clopat[MAXPAT];
strncpy (clopat, pat+2, *(pat+1));
clopat[*(pat+1)] = ENDSTR;
return (match_closure (inx, s, pat, clopat));
break;
}
default :
/*
* If we get to this point, then something has gone very wrong.
* Most likely, someone has tried to match with an invalid
* compiled pattern. Whatever the case, the only thing to do
* is abort the program.
*/
fputs ("In match_term: can't happen", stderr);
exit (1);
break;
} /* switch */
_s_end = s;
} while (*pat != ENDSTR);
return (TRUE);
} /* match_term */
/*
* match_or() - Handles selection processing.
*/
int pascal match_or (int inx, char *s, char *pat) {
char workpat[MAXPAT];
char *t1, *t2, *junk;
int pascal match_term (int inx, char *s, char *pat);
char * pascal skip_term (char *pat);
/*
* The first case is build into workpat. Second case is already there.
* Both patterns are searched to determine the longest matched substring.
*/
workpat[0] = ENDSTR;
pat++;
junk = skip_term (pat);
strncat (workpat, pat, junk-pat);
strcat (workpat, skip_term (junk));
t1 = (match_term (inx, s, workpat) != FALSE) ? _s_end : NULL;
/*
* The second pattern need not be searched if the first pattern results
* in a match through to the end of the string, since the longest possible
* match has already been found.
*/
if (t1 == NULL || *_s_end != ENDSTR) {
t2 = (match_term (inx, s, junk) != FALSE) ? _s_end : NULL;
/*
* determine which matched the longest substring
*/
if (t1 != NULL && (t2 == NULL || t1 > t2))
_s_end = t1;
}
return (t1 == NULL && t2 == NULL) ? FALSE : TRUE;
} /* match_or */
/*
* Skip over the current term and return a pointer to the next term in
* the pattern.
*/
char * pascal skip_term (char *pat) {
register int nterm = 1;
while (nterm > 0) {
switch (*pat) {
case OR : nterm++; break;
case CCL :
case NCCL :
case CLOSURE:
case ZERO_ONE:
case POS_CLO:
pat++;
pat += *pat;
break;
case END_TERM: nterm--; break;
case LITCHAR: pat++; break;
}
pat++;
}
return (pat);
} /* skip_term */
/*
* Match the ZERO_ONE operator. First, this routine attempts to match the
* entire pattern with the input string. If that fails, it skips over
* the closure pattern and attempts to match the rest of the pattern.
*/
int pascal match_0_1 (int inx, char *s, char *pat) {
char *save_s = s;
if (match_term (inx, s, pat+2) == TRUE)
return (TRUE);
else if (match_term (inx, save_s, pat+2+*(pat+1)) == FALSE)
return (FALSE);
else
return (ALMOST);
} /* match_0_1 */
/*
* Match CLOSURE and POS_CLO.
* Match as many of the closure patterns as possible, then attempt to match
* the remaining pattern with what's left of the input string. Backtrack
* until we've either matched the remaing pattern or we arrive back at where
* we started.
*/
int pascal match_closure (int inx, char *s, char *pat, char *clopat) {
char *save_s = s;
if (match_term (inx, s, clopat) == TRUE) {
save_s = _s_end;
if (match_closure (inx, save_s, pat, clopat) == TRUE)
return (TRUE);
else
return (match_term (inx, save_s, pat+2+*(pat+1)));
}
else if (*pat != CLOSURE)
return (FALSE); /* POS_CLO requires at least one match */
else if (match_term (inx, save_s, pat+2+*(pat+1)) == TRUE)
return (ALMOST);
else
return (FALSE);
} /* match_closure */
/*
* Match a character class or negated character class
*/
int pascal match_ccl (char c, char *pat) {
register int x;
char ccl = *pat++;
for (x = *pat; x > 0; x--)
if (c == pat[x])
return (ccl == CCL);
return (ccl != CCL);
} /* match_ccl */
/*
* Substitue 'replace' for the leftmost longest substring of str matched by
* the regular expression re.
* Return number of substitutions made (which in this case will be 0 or 1).
*/
int pascal sub (char *re, char *replace, char *str) {
if (match (str, re) != NULL) {
free (do_sub (str, RSTART, RLENGTH, replace));
return (1);
}
else
return (0);
} /* sub */
/*
* Substitue 'replace' for the leftmost longest substring of str matched by
* the compiled regular expression pat.
* Return number of substitutions made (which in this case will be 0 or 1).
*/
int pascal re_sub (char *pat, char *replace, char *str) {
int pascal re_sub (char *pat, char *replace, char *str);
if (re_match (str, pat) != NULL) {
free (do_sub (str, RSTART, RLENGTH, replace));
return (1);
}
else
return (0);
} /* re_sub */
/*
* Substitute 'replace' globally for all substrings in str matched by the
* regular expression re.
* Return number of substitutions made.
*
* This routine uses makepat() to compile the regular expression, then calls
* re_gsub() to do the actual replacement.
*
* NOTE: gsub() makes only 1 pass through the string. Replaced strings
* cannot themselves be replaced.
*/
int pascal gsub (char *re, char *replace, char *str) {
int pascal re_gsub (char *pat, char *replace, char *str);
char pat[MAXPAT];
pat[0] = ENDSTR;
if (makepat (re, pat) == NULL)
return (0);
return (re_gsub (pat, replace, str));
} /* gsub */
/*
* Substitute 'replace' globally for all substrings in str matched by the
* compiled regular expression pat.
* Return number of substitutions made.
*
* NOTE: gsub() makes only 1 pass through the string. Replaced strings
* cannot themselves be replaced.
*/
int pascal re_gsub (char *pat, char *replace, char *str) {
char *m = str;
int nsub = 0;
char *p;
while ((m = re_match (m, pat)) != NULL) {
p = do_sub (m, 0, RLENGTH, replace);
nsub++;
m += strlen (p);
free (p);
}
return (nsub);
} /* re_gsub */
/*
* remove 'len' characters from 'str' starting at position 'inx'. Then insert
* the replacement string at position 'inx'.
*/
char * pascal do_sub (char *str, int inx, int len, char *replace) {
char *p;
char * pascal makesub (char *replace, char *found, int len);
p = makesub (replace, &str[inx], len);
strdel (str, inx, len);
strins (str, p, inx);
return (p);
} /* do_sub */
/*
* Make a substitution string.
*/
char * pascal makesub (char *replace, char *found, int len) {
char news[MAXSTR];
char *c = replace;
int x;
news[0] = ENDSTR;
while (*c != ENDSTR) {
if (*c == '&')
for (x = 0; x < len; x++)
strccat (news, found[x]);
else if (*c == '\\') {
_re_ptr = c+1;
strccat (news, parse_escape ());
c = _re_ptr - 1;
}
else
strccat (news, *c);
c++;
}
return (strdup (news));
} /* makesub */
/*
* split - split the string s into fields in the array a on field separator fs.
* fs is a regular expression. Returns number of fields. Also sets the global
* variable NF. This routine compiles fs into a pattern and then calls
* re_split() to do the work.
*/
int pascal split (char *s, char **a, char *fs) {
char pat[MAXPAT];
pat[0] = ENDSTR;
makepat (fs, pat);
return re_split (s, a, pat);
} /* split */
/*
* re_split() - split the string s into fields in the array on field seperator
* pat. pat is a compiled regular expression (built by makepat()). Returns
* number of fields. Also sets the global variable NF.
*/
int pascal re_split (char *s, char **a, char *pat) {
int rstart = RSTART; /* save RSTART and RLENGTH */
int rlength = RLENGTH;
char *c = s;
char *oldc = s;
NF = 0;
if (a[0] != NULL)
free (a[0]);
a[0] = strdup (s);
while (*oldc != ENDSTR) {
while ((c = re_match (oldc, pat)) == oldc)
oldc += RLENGTH;
if (*oldc != ENDSTR) {
if (c == NULL)
c = &oldc[strlen (oldc)];
if (a[++NF] != NULL)
a[NF] = realloc (a[NF], c-oldc+1);
else
a[NF] = malloc (c-oldc+1);
memcpy (a[NF], oldc, c-oldc);
a[NF][c-oldc] = ENDSTR;
oldc = c;
}
}
RSTART = rstart; /* restore globals */
RLENGTH = rlength;
return (NF);
} /* re_split */
/*
* Reads a line from infile and splits it into FIELDS. Returns EOF on
* end-of-file or error.
*/
int pascal getline (char *s, int nchar, FILE *infile) {
char *c;
if (fgets (s, nchar, infile) == NULL)
return (EOF);
if ((c = strchr (s, '\n')) != NULL)
*c = ENDSTR; /* look for and replace newline */
re_split (s, FIELDS, FS_PAT);
return (0);
} /* getline */
/*
* add a character to the end of a string
*/
char * pascal strccat (char *s, int ch) {
register int len = strlen (s);
s[len++] = ch;
s[len] = ENDSTR;
return (s);
}
/*
* removes the character at pos from the string.
*/
char * pascal strcdel (char *s, int pos) {
memcpy (s+pos, s+pos+1, strlen (s) - pos);
return (s);
} /* strcdel */
/*
* inserts the character ch into the string at position pos. Assumes there
* is room enough in the string for the character.
*/
char * pascal strcins (char *s, int ch, int pos) {
memmove (s+pos+1, s+pos, strlen (s) - pos + 1);
s[pos] = ch;
return (s);
}
/*
* removes n characters from s starting at pos
*/
char * pascal strdel (char *s, int pos, int n) {
memcpy (s+pos, s+pos+n, strlen(s)-pos-n+1);
return (s);
}
/*
* inserts the string i into the string s at position pos. Assumes there
* is sufficient memory in s to hold i.
*/
char * pascal strins (char *s, char *i, int pos) {
char *p = s+pos;
int ilen = strlen (i);
memmove (p+ilen, p, strlen (s) - pos + 1);
memcpy (p, i, ilen);
return (s);
}
Example 1. The Program A executes faster because it only
compiles the regular expression once, while Program B compiles it
each time match() is called.
Program A Program B
... ...
while (getline(line,80,f) != EOF) { makepat("a+bc",pat);
if (match(line,"a+bc") != NULL) while (getline(line,80,f) != EOF) {
puts(line); if (re_match(line,pat) != NULL)
} puts(line);
... }
...