Listing 2: Regex Position and Interpolation - The Perl Journal, Fall 1998

Tuomas J. Lukka

Parsing VRML

When you match with the /g modifier in scalar context, Perl remembers where each successful match on that string ended. The zero-width assertion \G anchors the next match at that position (or at the start of the string if nothing has matched yet). Furthermore, the new /c modifier leaves the position intact even if the match fails, instead of resetting it. Taken together, these features make it easy to implement a parser in Perl, as you can now perform as much or as little lookahead as you like.
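As a minimal sketch of how \G and /gc combine, here is a toy tokenizer. It is not the VRML parser from the article; the tokenize() sub and its token labels are hypothetical and exist only for illustration. Each branch tries one alternative at the current position, and a failed attempt leaves the position where it was, so the next branch can try again from the same spot.

```perl
use strict;
use warnings;

# Hypothetical toy tokenizer: numbers, words, and single punctuation characters.
sub tokenize {
    my ($src) = @_;
    my @tokens;
    pos($src) = 0;                       # start scanning at the beginning
    while (pos($src) < length $src) {
        # Every attempt is anchored at the current position by \G.
        # Because of /c, a failed attempt does not reset that position.
        if    ($src =~ /\G\s+/gc)            { next }                      # skip whitespace
        elsif ($src =~ /\G([0-9]+)/gc)       { push @tokens, "NUM:$1" }
        elsif ($src =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "WORD:$1" }
        elsif ($src =~ /\G(.)/gcs)           { push @tokens, "PUNCT:$1" }  # any other character
    }
    return @tokens;
}

my @tokens = tokenize("foo 42 [bar]");
print join(" ", @tokens), "\n";   # WORD:foo NUM:42 PUNCT:[ WORD:bar PUNCT:]
```

Without /c, the first failed alternative would reset the position to the start of the string and the loop would never make progress.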

My parser also relies on the fact that Perl passes subroutine arguments by reference: the elements of @_ are aliases for the caller's variables.

sub foo { $_[0] =~ /\G[a-z]/gsc; }
foo($text);

This advances the position of $text with every successful call to foo(). If you defined foo() as

sub foo {
    my($str) = @_;
    $str =~ /\G[a-z]/gsc;
}

then foo() would match against a private copy, the position of $text would never move, and the parser would never get anywhere. The position at which \G will match next is available to your program via the pos function. The article uses this to display an error message showing the location of the parse failure in the VRML source code.
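The difference between the two definitions can be checked directly. The following is a small sketch, with hypothetical sub names eat_alias() and eat_copy() and a made-up error message format, that compares the two calling styles and then uses pos to report where matching stopped.

```perl
use strict;
use warnings;

# Matches one lowercase letter at the current position of the caller's string.
# $_[0] is an alias for the caller's variable, so the position of that variable moves.
# scalar() forces scalar-context /g behavior however the sub is called.
sub eat_alias { return scalar($_[0] =~ /\G[a-z]/gc) }

# Same match, but on a private copy; the caller's position is never touched.
sub eat_copy {
    my ($str) = @_;
    return scalar($str =~ /\G[a-z]/gc);
}

my $text = "ab1";

eat_copy($text);
my $pos_after_copy = pos($text);      # undef: $text itself was never matched

eat_alias($text);                     # consumes "a"
eat_alias($text);                     # consumes "b"
my $ok = eat_alias($text);            # fails on "1"; /c keeps the position
my $pos_after_alias = pos($text);     # 2

# Report the failure location, in the spirit of a parser error message.
my $error = sprintf "parse error at offset %d, near '%s'",
    $pos_after_alias, substr($text, $pos_after_alias);
print "$error\n";                     # parse error at offset 2, near '1'
```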

INTERPOLATION

Regex interpolation is a very useful—and underused—technique. Let's say that you have a file format where buzzwords (matched below by the regex /FOO|BAR|BAZ|QUUX/) occur in various contexts. For instance, in context A you are expecting a digit followed by a buzzword followed by a letter, and in context B you are expecting the word "extremely" followed by whitespace and a buzzword. In each context, you need to find out which buzzword matched.

You could write these regexes as

/\G[0-9](FOO|BAR|BAZ|QUUX)[a-zA-Z]/gsc;
/extremely\s+(FOO|BAR|BAZ|QUUX)/gsc;

However, the more regexes you have, the more times you have to write the buzzword, increasing the chance of mistyping one of the buzzwords and causing hard-to-track parser bugs. Perl to the rescue: you can write this as

$buzz = 'FOO|BAR|BAZ|QUUX';
/\G[0-9]($buzz)[a-zA-Z]/ogsc;
/extremely\s+($buzz)/ogsc;

There are two things to notice here. First, the /o modifier tells Perl that the variable won't change, so Perl can interpolate and compile the regex once and be done with it, which makes matches speedier (if the variable does change later, the regex silently keeps using the old value). Second, I have placed the parentheses outside the variable $buzz, even though that means repeating them everywhere. If you have the regex /$a$b$c$d$e/og, where any of those five scalars might contain parentheses, it is fairly difficult to know where the value of $3 (what matched inside the third pair of parentheses) comes from. Taking the parentheses outside the variables, as in /($a)$b($c)($d)$e/og, makes things easier.

Regex interpolation is used in my parser to abstract VRML identifiers. Also see the perlre and perlop documentation.
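To see the interpolated patterns in action, here is a short sketch that runs both contexts against one string. The input string and the @found array are made up for illustration. The context B pattern adds \G\s* (an assumption, not part of the patterns above) so that it continues exactly where context A stopped. The sketch leaves out /o, which does not change the result here.

```perl
use strict;
use warnings;

my $buzz = 'FOO|BAR|BAZ|QUUX';   # alternation only; no parentheses inside

# Hypothetical input: context A followed by context B.
my $input = "7BARx extremely QUUX";
my @found;

# Context A: digit, buzzword, letter. The buzzword lands in $1.
if ($input =~ /\G[0-9]($buzz)[a-zA-Z]/gc) {
    push @found, "A:$1";
}

# Context B: optional whitespace, "extremely", whitespace, buzzword.
# Again the buzzword is in $1, because the parentheses are written in the pattern.
if ($input =~ /\G\s*extremely\s+($buzz)/gc) {
    push @found, "B:$1";
}

print "@found\n";   # A:BAR B:QUUX
```

The parentheses do double duty in this case: besides capturing, they confine the alternation, so that the | characters inside $buzz cannot split the surrounding pattern.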