The Perl-Compatible Regular Expressions Library

C/C++ Users Journal October, 2005

Perl-strength regular expressions in native apps

By Ethan McCallum

Ethan McCallum is a freelance technology consultant who spends a lot of time with Linux, C++, and Java. He can be reached at ethanqm@penguinmail.com.

Perl's regular expression (regexp) muscle makes it a favorite for text processing. If your project spec calls for C or C++, however, your choices are few: Call specialized Perl scripts with popen() (ugly), use Perl's C interface (uglier), or use the Perl-Compatible Regular Expressions (PCRE) library [1]. Philip Hazel's PCRE is a native library that implements Perl-style regexp support. It offers Perl's extended regexp semantics as well as its ability to extract matched substrings for you. PCRE is used in several projects, including the Apache httpd, the BlueFish HTML editor, and the nmap scanning tool. Its BSD-style license permits its use in both free and commercial software.

In this article, I examine PCRE, including text matching and substring extraction. I wrap up with a stub of a log-processing tool that uses PCRE as the basis of pattern-matching objects. Familiarity with Perl's regexp rules is not required, though it may help you understand some of the concepts presented here. The sample code was tested under Fedora Core 3, PCRE 4.5/5.0, and GCC 3.4.2. See the "Version Compatibility" sidebar if you are using a different version of PCRE. The complete source code is available at http://www.cuj.com/code/.

Make the Switch

Your first use of PCRE may involve converting an existing program. PCRE provides a straightforward migration path from the POSIX regexp toolkit: Its compatibility layer wraps PCRE calls in functions named for their POSIX counterparts.

Migrating existing code, then, is a two-line operation:

Change an #include<> statement in your source file.
Link against a different library in your makefile.

Calls to the familiar regcomp() and regexec() work as before, even though they call PCRE functions behind the scenes. This is demonstrated with step1.cc, a stub program available at http://www.cuj.com/code/. The makefile builds both POSIX and PCRE versions, based on the USING_PCRE preprocessor constant that I pass in the makefile:

#ifdef USING_PCRE
  #include<pcre/pcreposix.h>
#else
  #include<regex.h>
#endif

(This isn't to suggest that you maintain both POSIX and PCRE compatibility in your apps, but simply to demonstrate the ease of conversion.) If this example fails to build on your system, see the sidebar "Feeding the Linker and Preprocessor."

Despite their difference in linkage, the programs step1-posix and step1-pcre function identically. Pass the programs a regular expression on the command line, for example:

$ ./step1-posix 'foo.*bar'

and feed them lines via standard input. (Single-quote the regexp or the shell misinterprets it.) The programs report whether each line matches the supplied regexp.
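The per-line test can be pictured with a minimal sketch like the one below. It is not the article's step1.cc: the function name lineMatches and the REG_EXTENDED flag are assumptions made for illustration, and the code is written against the system <regex.h>. Under the migration described above, only the #include line and the link library would change.

```cpp
#include <regex.h>   // POSIX API; the PCRE build would include <pcre/pcreposix.h>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true when `line` matches `pattern`, and
// false when it does not match or when the pattern fails to compile.
bool lineMatches(const char* pattern, const std::string& line) {
    assert(pattern != NULL);
    regex_t re;
    // REG_EXTENDED is an assumption; the article does not say which
    // flags step1.cc passes to regcomp().
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return false;
    }
    // An nmatch of 0 asks only for a yes/no answer, not substring offsets.
    const int rc = regexec(&re, line.c_str(), 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

A caller would read lines from standard input with std::getline() and report lineMatches(argv[1], line) for each. A real program would compile the pattern once, before the loop, rather than once per line as this short sketch does.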

Performing a Match

Of course, there's more to PCRE than cloning regcomp() and regexec(). The file step2.cc is a rewrite of step1.cc that demonstrates matching using PCRE's main API.

PCRE requires that you compile a regexp using pcre_compile() (line 73) before you use it to test strings:

pcre* pcre_compile(
  const char *pattern,
  int options,
  const char **errptr,
  int *erroffset,
  const unsigned char *tableptr
);

The regexp is supplied as the string pattern. Pass a set of bitwise-OR'ed constants as options to alter matching behavior. Examples of such constants include PCRE_CASELESS (case-insensitive matching), PCRE_UTF8 (assume UTF-8 encoded data), and PCRE_MULTILINE (^ and $ also match at newlines inside the test string). The pattern in step2.cc defaults to no modifiers, or 0. (You can also set some options within the pattern itself.)

If pattern compilation fails, pcre_compile() returns NULL and reports the problem through errptr and erroffset (errorMessage and errorOffset in the sample code): the first receives the error message, the second the index of the offending character. tableptr is an optional set of character tables. The example passes NULL so it uses the default tables. The resultant pcre* object should be stored for maximum efficiency: It's wasteful to repeatedly recompile the regexp for each use.

Equally wise, then, is to study the pattern to yield faster matching. The function pcre_study() (line 66) returns a pcre_extra* object as the result of its analysis:

pcre_extra *pcre_study(
  const pcre *code,
  int options,
  const char **errptr
);

As with pcre_compile(), errors are stored in the provided char** parameter. The options parameter is currently unused, but exists for forward compatibility. pcre_study() returns NULL if it cannot further optimize matching. Functions that take pcre_extra pointers gracefully handle this condition, though, so the sample code doesn't test pcre_study()'s return value.

PCRE uses an int[] as a work area for matching. This array's size is based on the pattern's total number of potential substring matches. Lines 72-79 take advantage of GCC's support for runtime-sized arrays (a C99 feature that GCC accepts in C++ as an extension; it is not part of standard C++) and call pcre_fullinfo() to extract the number of potential matches from the pcre* variable re:

int totalMatches ;
pcre_fullinfo(
  re ,
  reStudy ,
  PCRE_INFO_CAPTURECOUNT ,
  &totalMatches
) ;

When matching against several patterns in a single program (or single thread), it is more memory efficient to reuse a shared work area sized for the pattern with the greatest number of matches. The work area for PCRE 4 and 5 is slightly larger than that of Version 3. Lines 109-115 use preprocessor macros to determine the compile-time library version and size the work area accordingly. (An excerpt of this routine is shared in the sidebar "Version Compatibility.")
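As a rough sketch of the sizing arithmetic, the helpers below follow the rule given in the pcreapi man page: three ints per substring, with the whole match counted as substring 0. The function names are hypothetical, and the sample code's own per-version sizing is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Slots needed for one pattern: three ints per substring, counting the
// whole match as substring 0. `captureCount` is the value that
// pcre_fullinfo() reports for PCRE_INFO_CAPTURECOUNT.
int workAreaSize(int captureCount) {
    assert(captureCount >= 0);
    return (captureCount + 1) * 3;
}

// Slots needed for a work area shared by several patterns: size it for
// the pattern with the most capturing groups.
int sharedWorkAreaSize(const std::vector<int>& captureCounts) {
    int largest = 0;
    for (std::size_t i = 0; i < captureCounts.size(); ++i) {
        if (captureCounts[i] > largest) {
            largest = captureCounts[i];
        }
    }
    return workAreaSize(largest);
}
```

For three patterns with 2, 5, and 1 capturing groups, the shared area needs (5 + 1) * 3 = 18 ints. A std::vector<int> of that size is a portable alternative to a runtime-sized array.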

pcre_exec() tests an input string for a match (lines 134-143):

int pcre_exec(
  const pcre *code,
  const pcre_extra *extra,
  const char *subject,
  int length,
  int startoffset,
  int options,
  int* ovector,
  int ovecsize
);

The parameters code and extra are the compiled regexp and study results, respectively. subject and length are the string to test against the pattern and its length, respectively. It's possible to test starting from an arbitrary point in the subject, but for simple matches against the entire string, the startoffset is 0 (the beginning). ovector and ovecsize refer to the work area and its size, respectively.

pcre_exec() returns one more than the number of substring matches—parenthesized patterns within the regexp—in the subject string. A successful match against a regexp with no substrings thus yields 1. Return codes less than 1 indicate a problem. 0 means the work area is too small: Perhaps you missized it, or there's an error in your regexp that causes extra matching to occur. (Check especially for over-escaped parentheses.) PCRE_ERROR_NOMATCH indicates that the subject did not match the regexp. Several other error constants are described in detail in the pcreapi man page.
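The return-value rules above can be summarized in a small sketch. To keep it free of library dependencies it uses a stand-in constant in place of PCRE_ERROR_NOMATCH (which <pcre.h> defines as -1); the enum and function names are hypothetical, and real code should compare against the library's own constants.

```cpp
#include <cassert>

// Stand-in for PCRE_ERROR_NOMATCH from <pcre.h>; illustration only.
const int kNoMatchSketch = -1;

enum ExecOutcome { MATCHED, WORK_AREA_TOO_SMALL, NO_MATCH, OTHER_ERROR };

// Sorts a pcre_exec() return value into the cases the text describes.
ExecOutcome classifyExecResult(int rc) {
    if (rc >= 1) return MATCHED;
    if (rc == 0) return WORK_AREA_TOO_SMALL;
    if (rc == kNoMatchSketch) return NO_MATCH;
    return OTHER_ERROR;  // see the pcreapi man page for the full list
}

// Number of parenthesized substrings reported by a successful match.
int substringCount(int rc) {
    assert(rc >= 1);
    return rc - 1;
}
```

A caller would typically switch on classifyExecResult(rc) and look at substrings only in the MATCHED case.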

Finally, lines 180-181 clean up the pcre* and pcre_extra* objects allocated earlier. Behind the scenes, PCRE allocates this memory using pcre_malloc(), and you must free it using pcre_free(). These call plain malloc() and free() by default. You can install a custom allocator by assigning your own functions to the global function-pointer variables pcre_malloc and pcre_free.

Unlike step1.cc, step2.cc supports the more powerful Perl-style regexps. For example:

$ ./step2 'January (\S+) 2005'

matches any contiguous set of nonspace characters between the words "January" and "2005." You can also use the POSIX-style character classes inside a bracket expression. For example, [[:alnum:]] matches any alphanumeric character, so the following expression would match January of any year:

$ ./step2 'January [[:alnum:]]*'

Refer to the sidebar "Prototyping Regexps" for hints on how to prototype your regexps.

Setting Options Within the Pattern

PCRE and Perl support the (?N) operator to set matching options within the pattern itself. This is an alternative to hard-coding an option (such as PCRE_CASELESS) in the call to pcre_compile().

Replace N with one or more option letters, such as (?i) for a caseless match. The setting applies from that point to the end of the enclosing parenthesized group, or to the end of the pattern if it appears outside any parentheses. For example:

(?i)foobar

matches the string foobar with any capitalization; whereas in the regexp:

foo((?i)bar)baz

only bar is matched in a case-insensitive manner.

The option letters match their Perl counterparts:

(?i) (PCRE_CASELESS), case-insensitive matching.
(?m) (PCRE_MULTILINE), ^ and $ also match at line breaks inside the test string.
(?s) (PCRE_DOTALL), lets "." match even newlines in test strings.
(?x) (PCRE_EXTENDED), permits space and comments in regexps.

You may specify multiple modifiers, such as (?ix) for a caseless, extended-format regexp.

The (? prefix also introduces other PCRE constructs, such as callouts and named substring matches, and can be considered a general control sequence.

Extracting Substrings

POSIX regexp support boils down to a simple question: "Does string X match pattern Y?" Perl and PCRE let you mark and extract specific substrings in the matching text segments.

pcre_get_substring_list() hands you the matched substrings as an array:

int pcre_get_substring_list(
  const char *subject,
  int* ovector,
  int stringcount,
  const char ***listptr
);

Here, subject is the string tested by pcre_exec(). ovector is the work area, and stringcount is the number of matched strings, that is, the value pcre_exec() returned. PCRE stores the captured matches in listptr. Free the listptr array using pcre_free_substring_list(). Note that array index 0 holds the text matched by the entire pattern, while 1 is the first matched substring.

The functions pcre_copy_substring() and pcre_get_substring() return individual matched substrings based on their numeric position in the regexp:

int pcre_copy_substring(
  const char *subject,
  int *ovector,
  int stringcount,
  int stringnumber,
  char *buffer,
  int buffersize
);
int pcre_get_substring(
  const char *subject,
  int *ovector,
  int stringcount,
  int stringnumber,
  const char **stringptr
);

pcre_get_substring() allocates memory for a new string; call pcre_free_substring() to release it. By comparison, pcre_copy_substring() copies the string to the user-supplied buffer array, so it does not need to be explicitly freed.
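It may help to see what these functions read from the work area. Per the pcreapi man page, substring n occupies the offset pair ovector[2n] (start) and ovector[2n+1] (one past the end) within the subject. The sketch below mimics that lookup with a hypothetical helper, substringAt(); it stands in for the library calls and is not part of PCRE.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns substring `n` of a match, or "" if that
// group did not take part in the match. `rc` is the value pcre_exec()
// returned and must be at least 1.
std::string substringAt(const std::string& subject, const int* ovector,
                        int rc, int n) {
    assert(ovector != 0 && rc >= 1);
    assert(n >= 0 && n < rc);
    const int start = ovector[2 * n];      // offset of first character
    const int end = ovector[2 * n + 1];    // offset one past the last
    if (start < 0 || end < start) {
        return std::string();              // unset group
    }
    return subject.substr(start, end - start);
}
```

With the subject "January 15 2005" and the pattern January (\S+) 2005, a successful match would leave the offsets 0, 15, 8, 10 at the front of the work area: substring 0 is the whole subject and substring 1 is "15".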

PCRE uses the same match-counting rules as Perl: You calculate a match position from outer to inner parentheses, then from left to right. Consider the following regexp excerpt:

... ((\S+) (\d+)) ...

If the outermost parentheses bound match N, then (\S+) is match N+1, and (\d+) is match N+2.
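Put another way, match numbers follow the order of the opening parentheses, read left to right. The sketch below applies that rule to a pattern string. It is a simplified illustration with hypothetical function names, not PCRE's parser: it skips escaped characters and bracket expressions, treats (?P<name>...) as capturing and every other (?...) group as non-capturing, and does not handle corner cases such as a literal ] at the start of a bracket expression.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the pattern offsets of capturing "(" characters in match-number
// order: element 0 belongs to substring 1, element 1 to substring 2, ...
std::vector<std::size_t> captureGroupOffsets(const std::string& pattern) {
    std::vector<std::size_t> offsets;
    bool inClass = false;  // inside [...], where "(" is a literal
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c == '\\') {           // skip the escaped character
            ++i;
            continue;
        }
        if (inClass) {
            if (c == ']') inClass = false;
            continue;
        }
        if (c == '[') {
            inClass = true;
            continue;
        }
        if (c != '(') continue;
        const bool special = i + 1 < pattern.size() && pattern[i + 1] == '?';
        const bool named = pattern.compare(i, 4, "(?P<") == 0;
        if (!special || named) offsets.push_back(i);
    }
    return offsets;
}

// Match number of the capturing group that opens at `offset`, or 0 if no
// capturing group opens there.
int groupNumberAt(const std::string& pattern, std::size_t offset) {
    assert(offset < pattern.size());
    const std::vector<std::size_t> offsets = captureGroupOffsets(pattern);
    for (std::size_t i = 0; i < offsets.size(); ++i) {
        if (offsets[i] == offset) return static_cast<int>(i) + 1;
    }
    return 0;
}
```

For the excerpt ((\S+) (\d+)) taken on its own, the outer group is match 1, (\S+) is match 2, and (\d+) is match 3.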

Simplify code maintenance by storing symbolic names for substring matches in an enum. Alternatively, you can tag your matches and fetch them by name. Inside a substring match's parentheses, precede the pattern of interest with ?P<name>. Pass name as the stringname parameter of pcre_copy_named_substring() and pcre_get_named_substring(). (These operate similarly to pcre_copy_substring() and pcre_get_substring(), respectively.) For example:

pcre_get_named_substring( code , subject , ovector , stringcount , "Foo" , &stringptr )

fetches the text matching the regexp fragment:

(?P<Foo>\d+)

The stub programs step3.cc and step4.cc demonstrate extracting substrings by numeric index and name, respectively.

Callouts

Perl's (?{ ... code ... }) blocks fire code as parts of a string are matched against the regexp. The callout is the PCRE equivalent. Mark a regexp portion using (?CN) syntax, where N is a number from 0 to 255. For example:

(?C1)(foo)(?C2)(bar)(?C3)

PCRE calls the function assigned to the variable pcre_callout each time matching reaches one of these markers in the pattern. This function has the signature:

int function( pcre_callout_block* )

As pcre_callout is a global variable, there can be only one callout function per program. In turn, pcre_callout_block.callout_number identifies the callout marker (it's the N in (?CN)) such that pcre_callout can distinguish between callout points. pcre_callout can thus be a simple switch() block that calls other functions based on the callout's number.

C++ users take note: The callout function must be a global or static (class) function; object member functions are not permitted. Furthermore, the function must be exposed with C linkage using an extern "C" declaration. There's nothing to stop you from using the callout function as a pass-through to an object, though. You can assign an arbitrary object to the callout_data member of the regexp's pcre_extra block (and set the PCRE_EXTRA_CALLOUT_DATA bit in its flags member so that PCRE honors it). That object will be available in the callout function as the pcre_callout_block.callout_data member. (You must cast it from void* to your expected object type.) To not use callouts, (re)set the value of pcre_callout to NULL. The stub program step5.cc (available at http://www.cuj.com/code/) uses a callout to print the last substring match made in the subject string.
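The switch-and-pass-through arrangement might look like the sketch below. To stay self-contained it declares a stand-in struct carrying only the two pcre_callout_block fields discussed here; real code would include <pcre.h>, take a pcre_callout_block*, and assign the function to pcre_callout. The MarkerLog class, the function name, and the labels are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the two pcre_callout_block fields used in this sketch.
// Real code uses the library's own struct from <pcre.h> instead.
struct CalloutBlockSketch {
    int callout_number;   // the N in (?CN)
    void* callout_data;   // the object the caller attached to the match
};

// Hypothetical receiver object that records which markers fired.
class MarkerLog {
public:
    void record(const std::string& label) { labels_.push_back(label); }
    const std::vector<std::string>& labels() const { return labels_; }
private:
    std::vector<std::string> labels_;
};

// Free function with C linkage that forwards to the attached object.
// Returning 0 tells PCRE to carry on matching.
extern "C" int dispatchCallout(CalloutBlockSketch* block) {
    assert(block != 0 && block->callout_data != 0);
    MarkerLog* log = static_cast<MarkerLog*>(block->callout_data);
    switch (block->callout_number) {
        case 1:  log->record("before foo");          break;
        case 2:  log->record("between foo and bar"); break;
        case 3:  log->record("after bar");           break;
        default: log->record("unknown marker");      break;
    }
    return 0;
}
```

The case labels correspond to the markers in (?C1)(foo)(?C2)(bar)(?C3). Keeping the dispatcher this thin leaves the real work in ordinary member functions.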

Putting It All Together

The sample application (named "app" and available online) uses the techniques described here to postprocess netfilter/iptables firewall logs. A Matcher object represents a regular expression. Its operator() member function calls PCRE code to test strings against the regexp. On success, a MatchInfo* is returned. MatchInfo is a lightweight wrapper around PCRE data types that lets calling code fetch substring matches by descriptive name or numeric index. As an alternative, callers may specialize the template version of PCREMatch::operator() to work directly with the raw PCRE match data.

The PacketInfo class holds source and destination host/port pairs. Its accessor member functions are used by Output objects to further process that data. For example, the supplied TextOutput class prints the info to an output stream. Output could also be subclassed to export the data to a database or XML. PacketInfo and Output objects meet inside a Processor object, which receives logfile data from main().

Matcher and MatchInfo classes wrap PCRE calls. They are generic and may thus be copied to other apps. By comparison, PacketInfo, Output, and Processor are specific to the sample app.

The Wrap-Up

PCRE lets you bring Perl functionality into your native code without resorting to unpleasant methods. Without PCRE to do the heavy lifting, PCREMatch and MatchInfo would have hidden some very ugly code behind their interfaces.