|
When you match multiple
occurrences of a pattern with the /g
modifier, the \G assertion anchors the next match
at the position where the previous match ended. Furthermore, the
new /c modifier (used together with /g) leaves that position intact
even if the match fails. Taken together, these features
make it easy to implement a parser in Perl, as you
can now perform as much or as little lookahead as you
like.
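As a minimal sketch of how \G and /gc combine into a tokenizer loop (the input string and the two token classes are invented for illustration and are not taken from the VRML parser):

```perl
use strict;
use warnings;

# Hypothetical input; the token classes below are illustrative only.
my $text = "abc 123 def";
my @tokens;

while (1) {
    # Each alternative is tried at the current position. Because of /c,
    # a failed attempt leaves pos($text) where it was, so the next
    # alternative starts from the same place.
    if    ($text =~ /\G\s+/gc)      { next }                     # skip whitespace
    elsif ($text =~ /\G([a-z]+)/gc) { push @tokens, "WORD:$1" }
    elsif ($text =~ /\G([0-9]+)/gc) { push @tokens, "NUM:$1" }
    else                            { last }                     # end of input, or no rule applies
}

print join(" ", @tokens), "\n";
```

Trying several patterns in turn at the same position is what gives you the "as much or as little lookahead as you like" behavior: nothing is consumed until one of them succeeds.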
My parser also uses the fact
that subroutine arguments are passed by reference (the
elements of @_ are aliases for the caller's variables):
sub foo { $_[0] =~ /\G[a-z]/gsc; }
foo($text);
This advances the position of $text with every
successful call to foo(). If you instead defined foo() as
sub foo {
my($str) = @_;
$str =~ /\G[a-z]/gsc;
}
then $text would not be touched by
foo() and the parser would never get anywhere.
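The difference can be checked directly with pos. The following is a small sketch; the subroutine names by_alias and by_copy are hypothetical stand-ins for the two versions of foo():

```perl
use strict;
use warnings;

# Matches against $_[0], which is an alias for the caller's variable.
sub by_alias { $_[0] =~ /\G[a-z]/gc }

# Matches against a private copy; the caller's position is untouched.
sub by_copy {
    my ($str) = @_;
    $str =~ /\G[a-z]/gc;
}

my $text = "abc";

by_copy($text);
my $after_copy = pos($text);     # undef: no match has ever run on $text itself

by_alias($text);
by_alias($text);
my $after_alias = pos($text);    # two single-letter matches have been consumed

print defined $after_copy ? $after_copy : "undef", " $after_alias\n";
```

The match position belongs to the variable being matched, not to the string value, so copying the argument into $str starts again from the beginning on every call.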
The position at which \G will match next is
available to your program via the pos function. This
is used in the text to display an error message that
shows the location of the parse failure in the VRML
source code.
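A sketch of that kind of error report, assuming a toy grammar of lowercase words separated by whitespace (the input and the message format are invented here and are not the parser's own):

```perl
use strict;
use warnings;

# Hypothetical input: lowercase words and spaces, with one stray character.
my $src = "alpha beta ?gamma";

# Consume valid tokens for as long as possible. With /c, the position
# stays at the end of the last successful match when the loop stops.
1 while $src =~ /\G(?:[a-z]+|\s+)/gc;

my $where = defined pos($src) ? pos($src) : 0;
my $message = '';
if ($where < length $src) {
    # Show the offset and a short excerpt starting at the failure point.
    $message = sprintf "parse error at offset %d, near '%s'",
        $where, substr($src, $where, 10);
}
print "$message\n";
```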
INTERPOLATION
Regex interpolation is a very
useful—and underused—technique. Let's say
that you have a file format where buzzwords (matched
below by the regex /FOO|BAR|BAZ|QUUX/) occur
in various contexts. For instance, in context A you
are expecting a digit followed by a buzzword followed
by a letter, and in context B you are expecting the
word "extremely" followed by whitespace and a
buzzword. In each context, you need to find out which
buzzword matched.
You could write these regexes as
/\G[0-9](FOO|BAR|BAZ|QUUX)[a-zA-Z]/gsc;
/extremely\s+(FOO|BAR|BAZ|QUUX)/gsc;
However, the more regexes you have, the more times
you have to write the buzzword, increasing the chance
of mistyping one of the buzzwords and causing
hard-to-track parser bugs. Perl to the rescue: you
can write this as
$buzz = 'FOO|BAR|BAZ|QUUX';
/\G[0-9]($buzz)[a-zA-Z]/ogsc;
/extremely\s+($buzz)/ogsc;
There are two things to notice here. First, the
/o modifier tells Perl that the interpolated variable won't
change, so Perl can compile the regex once and be
done with it, which makes matches speedier. Second,
I have placed the parentheses
outside the variable $buzz,
even though they are repeated everywhere. This is
because if you have the regex
/$a$b$c$d$e/og, where any of those five
scalars might contain parentheses, it is fairly
difficult to know where the value of $3
(what matched inside the third pair of parentheses) comes
from. Taking the parentheses outside the
variables, as in /($a)$b($c)($d)$e/og, makes things
easier. Regex interpolation is used in my parser to
abstract VRML identifiers. Also see the
perlre and perlop
documentation.
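The two buzzword contexts can be tried out with a short sketch. Note the assumptions: it uses qr// (available from Perl 5.005) in place of the /o modifier, it anchors both patterns with \G, and the sample input is invented.

```perl
use strict;
use warnings;

# The buzzword alternation is written exactly once. qr// precompiles it;
# this is a swapped-in alternative to the /o modifier used in the text.
my $buzz = qr/FOO|BAR|BAZ|QUUX/;

# Hypothetical input containing one instance of each context.
my $text = "7BARx extremely QUUX";
my @found;

# Context A: digit, buzzword, letter. The parentheses sit outside
# $buzz, so the buzzword is always $1.
push @found, $1 if $text =~ /\G[0-9]($buzz)[a-zA-Z]/gc;

# Skip the separating whitespace.
$text =~ /\G\s+/gc;

# Context B: the word "extremely", whitespace, buzzword.
push @found, $1 if $text =~ /\Gextremely\s+($buzz)/gc;

print "@found\n";
```

Because $buzz itself contains no capturing parentheses, the capture numbering in each pattern can be read straight off the pattern text.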
|