Escaping Metacharacters in Perl Regex Character Classes
Christopher Lane
At a code review, I became concerned about escaped metacharacters inside Perl pattern match character classes: m/[\.]/.
My misunderstanding was that since period didn't need to be escaped, I thought the backslash would be interpreted as part of the class (i.e., '\' or '.') Of course, I was wrong about the backslash -- it is safe to use with a punctuation character that doesn't need to be escaped.
When should you use the backslash escape and when should you not use it, inside a pattern match character class? Perl reference material isn't always clear about this, but after reviewing several sources and writing my own test scripts, I came up with a dozen rules that I will outline in this article. All of these rules are in the context of a character class, m/[ ]/, although a few apply to patterns in general:
1. Most pattern-match metacharacters lose their special meaning inside a character class. You can escape them, but it is not necessary:
| ( ) [ { } * + ? .
m/[.]/ is the same as m/[\.]/
2. The escape character and some Perl sigils keep their special meaning inside character classes and must be escaped if you simply want the character:
\ $ @
To match \ you must escape it with itself: m/[\\]/
3. Other Perl sigils lose their meaning inside character classes because neither hash arrays nor functions interpolate into strings and patterns:
% &
4. Caret loses its pattern-match anchor meaning inside a character class and may take on the meaning of character set negation, depending on position:
^
m/[a^]/ is the same as m/[\^a]/ or m/[a\^]/ but not m/[^a]/
In the initial position, you must escape it if you just want a caret.
5. Hyphen becomes the range metacharacter inside a character class, depending on position in the set, and may need to be escaped if you just want a hyphen:
-
m/[a\-z]/ is the same as m/[-az]/ or m/[az-] but not m/[a-z]/
6. Close bracket becomes a pattern-match metacharacter inside a character class and must be escaped if you want to match a close bracket:
]
Escape it to avoid closing the character set prematurely: m/[ab\]]/
7. Predefined character class abbreviations that match actual characters retain their meaning when escaped inside a character class:
\d \D \p{...} \P{...} \s \S \w \W
m/[\d\s]/ matches one digit or one character of whitespace.
8. Predefined character classes that match positions rather than actual characters, and a couple of definitions provided for Unicode character handling (or avoidance), lose their meaning. "Use warnings" (-w) will complain when they are escaped inside a character class:
\A \B \C \G \X \z \Z
m/[\B]/ generates the warning, "Unrecognized escape
\B in character class passed through in regex".
9. The word boundary predefined character class reverts to being the string backspace escape sequence inside a character class:
\b
("\b" =~ m/[\b]/) is not the same as ("\b" =~ m/\b/)
10. String escape sequences that represent characters keep their special meaning inside a character class when escaped:
\a \e \f \n \r \t \cX \0NN \xHH \x{HHHH} \N{name}
("\t" =~ m/[\t]/) is the same as ("\t" =~ m/\t/)
11. String escape sequences that modify case keep their special meaning when escaped inside a character class:
\l \L \u \U \Q \E
However, the pattern match logic doesn't see these. They work as expected when used directly inside a pattern match:
m/[\UABC\LDEF]/
Because they are handled in a compilation pass, they might not always work as expected. For example, from inside a variable:
my $pat = "\UABC\L"; m/[${pat}DEF]/ # not the same as above!
12. All other punctuation characters have no special meaning inside a character class:
! " # % ' , / : ; < = > _ ` ~
Many of these punctuation characters can become special when used as the pattern match delimiter and need to be escaped inside the pattern itself:
m#[\#]# or m,[\,], or m/[\/]/ or m:[\:]: or m<[\>]> etc.
Conclusion
The general rule, "when in doubt, precede it with a backslash" seems to work in most situations with punctuation characters inside a character class. Assuming that you have Perl warnings enabled, unusual character class situations that you are likely to encounter will involve the \b boundary/backspace class and interpolating string escape sequences that modify case. Because you cannot expect to remember all 12 rules, you should tuck a copy of this technical note between the pages of the regular expression section of your favorite Perl reference.
Christopher David Lane received his Bachelor's degree in Computer Science from Yale University and began his career 25 years ago writing articles for Commodore hobbyist magazines. He has worked in research computing, primarily at Stanford University, as a programmer, systems administrator and lecturer, teaching students the joys of programming. He is currently programming personal projects in which two of his passions, Perl and music, meet head-on. He can be reached at: cdl@cdl.best.vwh.net.
|