The Replacements
Randal L. Schwartz
If you've used Perl longer than 15 minutes, you've no doubt seen (and probably typed) the extremely useful substitute operation, typically appearing as s/old/new/. Let's look at some of the things you may already know, and perhaps a few things that you don't know yet about this very common operation.
The most important thing to notice about the substitute operation is that it acts by default on our friend, the $_ variable:
$_ = "hello";
s/ell/ipp/; # $_ is now "hippo"
The "left side" of a substitute is a regular expression, so all of the rules about regular expressions apply:
$_ = "hello";
s/e.*l/ipp/; # $_ is now "hippo"
Here, the .* portion matches any number of (nearly) any character, taking the longest run that still permits the rest of the expression to match. In this case, that was the single l character between the e and the second l. Had we instead opted for the lazy version, .*?, we'd stop at the closest l instead:
$_ = "hello";
s/e.*?l/ipp/; # $_ is now "hipplo"
Like a regular expression match, we can steer the substitution away from $_ and toward some other location, using the =~ construct. Unlike the match operation though, we have to specify an lvalue (such as a variable name), not an rvalue (result of an expression):
my $text = "hello";
$text =~ s/ell/ipp/; # $text is now "hippo"
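Because the left side of =~ has to be an lvalue, one common way to keep the original string intact is to copy it first and run the substitution on the copy. A minimal sketch of that idiom (the name $copy is only for illustration):

```perl
use strict;
use warnings;

my $text = "hello";

# The parenthesized assignment yields $copy as an lvalue,
# so the substitution edits the copy and leaves $text alone.
(my $copy = $text) =~ s/ell/ipp/;

print "$text\n";   # hello
print "$copy\n";   # hippo
```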
I occasionally find that some people are confused about the return value of a substitution. After all, if $text now has a new value, isn't that also what I'll see if I put this replacement in a larger context?
my $text = "hello";
my $result = ($text =~ s/ell/ipp/);
And the answer is no. Although the substitution does alter $text here, what it returns is a true/false value indicating whether the substitution happened. In this case, $result is true. This success value is handy when we're performing conditional operations:
if (s/foo/bar/) { # if foo was found, it's now bar, and...
    # ... we do the code here ...
} else {
    # ... we didn't find foo, and $_ is unchanged ...
}
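The true value is a little more specific than it looks: per the perlop documentation, a successful substitution returns the number of replacements made, which only becomes interesting once the g suffix is involved. A small sketch, with variable names made up for the example:

```perl
use strict;
use warnings;

my $text = "hello";

# Without g, at most one replacement happens, so success returns 1.
my $single = ($text =~ s/l/p/);        # $text is now "heplo"

# With g, the return value counts every replacement.
my $fields = "aaa,bbb,ccc";
my $count  = ($fields =~ s/,/;/g);     # $fields is now "aaa;bbb;ccc"

# A failed substitution returns false and leaves the string alone.
my $missed = ($fields =~ s/zzz/yyy/);

print "single=$single count=$count missed=",
      ($missed ? "true" : "false"), "\n";
# prints: single=1 count=2 missed=false
```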
The replacement is performed in the first possible place:
$_ = "hello";
s/l/p/; # $_ is now "heplo";
s/l/p/; # $_ is now "heppo";
To repeat the substitution on all non-overlapping matches, we add a g suffix:
$_ = "hillo";
s/l/p/g; # $_ is now "hippo";
The important word there is non-overlapping. Perl looks for each new match after the end of the previous match. So, the result of a substitution like this may at first be surprising:
$_ = "aaa,bbb,ccc,ddd,eee,fff,ggg";
s/,.*?,/,XXX,/g; # replace all fields with XXX (no!)
When we check the result, we see:
aaa,XXX,ccc,XXX,eee,XXX,ggg
Oops! Why did it do every other entry? On the first match, we matched ,bbb, and replaced that with ,XXX,. Good so far. But we can't now look at the comma there as the beginning of ,ccc,, because these have to be non-overlapping!
We can fix that by making the trailing comma merely a lookahead:
$_ = "aaa,bbb,ccc,ddd,eee,fff,ggg";
s/,.*?(?=,)/,XXX/g; # replace all fields with XXX (almost...)
Now, the trailing comma is not considered part of the match, so it's not ripped out, and it's not skipped past to find the next match. Note that I also had to change the replacement string so it doesn't add a comma back in. Now we're getting closer:
aaa,XXX,XXX,XXX,XXX,XXX,ggg
Hmm. We're still missing the beginning. That's understandable, because we're requiring a comma before the letters. And we're also missing the end, because we demand a trailing comma, even though we're not considering it part of the match. We can fix both of those problems with a bit more work:
$_ = "aaa,bbb,ccc,ddd,eee,fff,ggg";
s/(^|(?<=,)).*?((?=,)|$)/XXX/g; # replace all fields with XXX
OK, this is starting to look ugly. As with a regex match, we can pull that apart with a trailing x, which lets us add whitespace and comments inside the pattern:
s/
  (
    ^       # either beginning of string
    |       # or
    (?<=,)  # a single comma to the left
  )
  .*?       # as few characters as possible
  (
    (?=,)   # a single comma to the right
    |       # or
    $       # end of string
  )
/XXX/gx;
That's much easier to read (relatively speaking).
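To check that the spread-out pattern really does replace every field, here is a short runnable sketch. It is a variation rather than the exact pattern above: it uses non-capturing groups, (?: ... ), since the parentheses are only there for the alternation, and it works on a copy in a variable I've called $masked.

```perl
use strict;
use warnings;

my $fields = "aaa,bbb,ccc,ddd,eee,fff,ggg";

(my $masked = $fields) =~ s/
    (?: ^ | (?<=,) )   # start of string, or a comma just to the left
    .*?                # as few characters as possible
    (?: (?=,) | $ )    # a comma just to the right, or end of string
/XXX/gx;

print "$masked\n";   # XXX,XXX,XXX,XXX,XXX,XXX,XXX
```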
Like a regular expression match, we can use an alternate delimiter for the left and right sides of the substitution:
$_ = "hello";
s%ell%ipp%; # $_ is now "hippo"
The rules are a bit complicated, but it works precisely the way Larry Wall wanted it to work. If the delimiter chosen is not one of the special characters that begins a pair, then we use the character twice more to both separate the pattern from the replacement and to terminate the replacement, as the example above showed.
However, if we use the beginning character of a paired character set (parentheses, curly braces, square brackets, or even less-than and greater-than), we close off the pattern with the corresponding closing character. Then, we get to pick another delimiter all over again, using the same rules. For example, these all do the same thing:
s/ell/ipp/;
s%ell%ipp%;
s;ell;ipp;; # don't do this!
s#ell#ipp#; # one of my favorites
s[ell]#ipp#; # [] for pattern, # for replacement
s[ell][ipp]; # [] for both pattern and replacement
s<ell><ipp>; # <> for both pattern and replacement
s{ell}(ipp); # {} for pattern, () for replacement
No matter what the closing delimiter might be for either the pattern or the replacement, we can include the character literally by preceding it with a backslash:
$_ = "hello";
s/ell/i\/n/; # $_ is now "hi/no";
s/\/no/res/; # $_ is now "hires";
To avoid backslashing, pick a distinct delimiter:
$_ = "hello";
s%ell%i/n%; # $_ is now "hi/no";
s%/no%res%; # $_ is now "hires";
Conveniently, if a paired character is used, the pairs may be nested without invoking any backslashes:
$_ = "aaa,bbb,ccc,ddd,eee,fff,ggg";
s((^|(?<=,)).*?((?=,)|$))(XXX)g; # replace all fields with XXX
Note that even though the pattern contains closing parentheses, they are all paired with opening parentheses, so the pattern ends at the right place.
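Paired delimiters are especially pleasant for text that is full of slashes, such as file paths. The sketch below uses braces for both halves; Perl also permits whitespace between the two bracketed parts, so the replacement can sit on its own line. The paths are invented for the example.

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";

# Braces delimit both halves, so none of the slashes need a backslash.
$path =~ s{^/usr/local/bin/}
          {/opt/perl/bin/};

print "$path\n";   # /opt/perl/bin/perl
```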
The right side of the substitution operation is generally treated as if it were a double-quoted string: variable interpolation and backslash interpretation are performed directly:
$replacement = "ipp";
$_ = "hello";
s/ell/$replacement/; # $_ is now "hippo"
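Since the replacement behaves like a double-quoted string, the usual double-quote tools are available there too. The sketch below goes slightly beyond the example above: it assumes the capture variable $1 (set by parentheses in the pattern) and the case escapes \u and \U...\E, all of which are standard Perl but not otherwise covered here.

```perl
use strict;
use warnings;

my $greeting = "hello world";
my $name     = "perl";

# \u uppercases the next character; \U...\E uppercases a whole run.
# $1 holds whatever the parentheses in the pattern matched.
$greeting =~ s/(hello) world/\u$1, \U$name\E!/;

print "$greeting\n";   # Hello, PERL!
```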
The left side of a substitution is also treated as if it were a double-quoted string (with a few exceptions), and this interpolation happens before the result is evaluated as a regular expression:
$pattern = "ell";
$replacement = "ipp";
$_ = "hello";
s/$pattern/$replacement/; # $_ is now "hippo"
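One consequence of interpolating before the regex is compiled: any regex metacharacters in the variable keep their special meaning. If the variable holds plain text that should match literally, \Q...\E (or the quotemeta function) escapes it. A minimal sketch with made-up data:

```perl
use strict;
use warnings;

my $literal = "1+1";          # "+" is a quantifier unless escaped
my $sum     = "1+1 is 2";

# Interpolated as-is, the pattern means "one or more 1s, then a 1",
# which does not occur in the string, so nothing changes.
my $plain     = $sum;
my $plain_hit = ($plain =~ s/$literal/two/);

# \Q...\E escapes the metacharacters, so the text matches literally.
my $quoted     = $sum;
my $quoted_hit = ($quoted =~ s/\Q$literal\E/two/);

print "plain:  $plain\n";    # plain:  1+1 is 2
print "quoted: $quoted\n";   # quoted: two is 2
```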
Using this form of pattern, Perl is forced to compile the regular expression at runtime. If this happens in a loop, Perl may need to recompile the regular expression repeatedly, causing a slowdown. We can give Perl a hint that the pattern is really a regular expression by using a regular expression literal:
$pattern = qr/ell/;
$replacement = "ipp";
$_ = "hello";
s/$pattern/$replacement/; # $_ is now "hippo"
The qr operation creates a Regexp object, which interpolates into the pattern with minimal fuss and maximal speed.
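In a loop, that might look like the following sketch: the pattern is built once with qr before the loop, then reused for every string. The word list and the $changed counter are invented for the example.

```perl
use strict;
use warnings;

my $pattern     = qr/ell/;    # compiled once, outside the loop
my $replacement = "ipp";
my @words       = ("hello", "yellow", "bell", "world");

my $changed = 0;
for my $word (@words) {
    # $word aliases the array element, so the edit lands in @words;
    # the return value is true only for strings that were altered.
    $changed++ if $word =~ s/$pattern/$replacement/;
}

print "@words ($changed changed)\n";
# prints: hippo yippow bipp world (3 changed)
```

A qr pattern can also carry its own modifiers, such as qr/ell/i, which travel with it wherever it is interpolated.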
I hope you've enjoyed this brief overview of the replacement operation, although it's no replacement (ugh) for the manpages, such as perlre, perlretut, perlrequick, and perlreref. Check those out for more details, and until next time, enjoy!
Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software design, system administration, security, technical writing, and training. He has coauthored the "must-have" standards: Programming Perl, Learning Perl, Learning Perl for Win32 Systems, and Effective Perl Programming. He's also a frequent contributor to the Perl newsgroups, and has moderated comp.lang.perl.announce since its inception. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc.