Better Find and Replace on HTML Content
Charles M. Dalsass
How many times have you run into situations in which you need
to find and replace HTML content across a large Web site? For example,
you need to swap out headers and footers with SSIs (server side
includes) on a large static Web site for better maintainability,
or you need to find a header image across a site and replace it
with another header image. The best tools for performing Find and
Replace against HTML content are bundled with commercial HTML editing
software. These tools are GUI-based, normally run on Windows or
Macintosh, and include the ability to cut multi-line chunks
of HTML and replace all pages in a folder or site with another multi-line
chunk of HTML. Most allow you to use regular expressions in your
search.
However, there are limitations to GUI software when performing
find-and-replace tasks across large numbers of Web pages. The most
obvious limitation is not having access to an environment with these
tools installed, such as a Unix or server environment. This article
describes these limitations and shows how to perform similar find-and-replace
operations on the command line by using a few scripts and standard
Unix command-line utilities. This article should be as relevant
to sys admins as it is to Web developers. If you've ever spent time
cleaning up a Web site that has been corrupted by a virus, you will
agree.
My company, Neptune Web, Inc., needed a practical, command-line
alternative to GUI search-and-replace facilities. I've always been
told that Perl is great for matching text, but found that it's difficult
to perform find and replace without writing a new program every
time. I also wanted the interface to be as intuitive as possible.
Finally, I wanted to fix the other shortfalls of GUI-based find
and replace:
- GUI tools lack the ability to track changes (revision control)
and undo them if you make a mistake.
- GUI tools normally don't allow you to script out multiple searches
and replacements (e.g., if a first match doesn't work, try a second
match etc.).
- It's difficult to record searches and replacements that were
performed in the past, since everything is GUI-based. Options
such as "regular expression" or "ignore whitespace" can't be written
down since they only exist as checkboxes in the GUI. This can
be particularly irksome if you have made a mistake at some point
in the process of reworking hundreds of documents and have since
made other changes.
- With GUI find and replace, it's difficult to choose subsets
of files or folders, apart from "everything in the site", "everything
in the folder", etc.
- GUI tools lack the ability to do "reverse" matches, where you
grab content from one known spot and work your way backwards
to perform the replacement. This can be a useful way to grab content
within nested tables.
- In most GUI find-and-replace tools, the regular expressions
have been simplified. For example, many tools don't have the ability
to control greediness (i.e., no ability to match random stuff
between two known tags that may repeat beyond the match in which
you are interested, such as table tags.) Nested tables, for example,
become difficult to match if you are looking for a particular
table.
Our company has written a Perl program called "dwreplace.pl" (see
Listing 1) that, in combination with some "old school" Unix utilities,
does all of the above. We've tried to make the command-line interface
as simple as possible, with real examples in the output, so you
won't forget how to use it.
In a command-line environment, such as the bash shell, it's possible
(but difficult) to pass large strings of text to programs (e.g.,
grep), since most programs take line-based options and newline
characters in the options cause problems. To simplify passing complex
arguments to the program, the program uses what I call an "rfile"
(replacement file). The rfile defines a list of possible matches
and replacements to make against the Web pages, defined in a (Perl)
array format (run the program with -e to see an example rfile).
The rfile also serves as a history of your searches, which allows
you to run them again if something went wrong.
The program is run like this:
dwreplace (other options) --rfile=rfile file1 [file2 ... filen]
where "rfile" is the name of the rfile, and file1..filen are
the files on which the search and replace is performed.
The program works by converting the HTML text you've pasted into
the rfile into a Perl regular expression that is "whitespace insensitive".
It also escapes out all characters that have special meanings within
the regular expression. For example, if you have a block of text
like this:
<br>
<br>$5.00 per item*
</p>
the program converts that into a Perl regular expression like this
(\s* matches 0 or more whitespace characters) by escaping all
special characters and turning whitespace into \s*:
<br>\s*<br>\$5\.00\s*per\s*item\*\s*<\/p>\s*
Let's start with an easy example. Suppose I have 1000 HTML pages with
a background color in a table cell and a tag that I'd like to remove.
The old HTML looks like this:
<td bgcolor="#000000" id="termbodybox">
or possibly this:
<td bgcolor="#000000" id="bodybox">
And I'd like to replace the tags with this HTML snippet:
<td id="termbodybox">
I realize this would be a very trivial replacement to do in the GUI
tools -- you'd just have to run the replacement two times. This example
would also be trivial running sed or perl -pi two times.
To enable "rollback" and revision control of changes, I'm going
to check in all the files using RCS (revision control system). Although
any revision control system could be used, RCS is a great choice because
of its availability (on *NIX systems) and simplicity. RCS will create
a revision history of the file next to the file you are working
on (it's stored in a [filename],v file). The ,v file
can be deleted or removed with no effect on the overall project.
(Keep in mind that it is important to delete any ancillary files
such as "rfiles" or ,v files when you have finished a
find and replace. By-products left on publicly accessible Web folders
could reveal information about your system to hackers.) RCS has
been around forever, so it's pre-installed on almost all Unix/Linux
distributions. You can read more about RCS here:
http://www.gnu.org/software/rcs/rcs.html
or by running man rcsintro.
I'm going to combine an RCS "initial" check-in with a find
command to check in all the .html files in the site. The -i -t-"initial"
tells ci to create the ,v files associated with each HTML page,
using the string "initial" as a starter comment. The -u
ensures that the files don't disappear (odd as it sounds). I also
use find's -print0 option (with xargs -0) so that filenames containing
spaces or quotes won't cause an error:
find . -type f -name "*.html" -print0 | xargs -0 ci -u -i -t-"initial"
Using my dwreplace.pl, I define an rfile (run dwreplace.pl -e
to see a sample rfile), called, for example, "removetablecolors.rfile".
The rfile defines the "OLD" and "NEW" arrays, which are the find and
replace strings, respectively. The code in the configuration file
is standard Perl, so anything goes here (including HERE documents
[shown below], variables, functions, or single- or double-quoted strings).
Matches are performed linearly through the array, against each file
passed to the program:
$OLD[0] = <<'END_OF_TEXT';
<td bgcolor="#000000" id="termbodybox">
<p>
END_OF_TEXT
In this example and in the documentation embedded in the dwreplace.pl
program, I've used Perl's single-quoted HERE documents to define the
boundaries of the $OLD[0] string. I often paste a half page
or more of HTML between the END_OF_TEXT lines. I use this notation
so that I don't have to escape special characters found in the HTML
blocks, such as single quotes.
Since the rfile is standard Perl, any Perl quoting syntax works
fine, for example:
$OLD[0] =
'<td bgcolor="#000000" id="termbodybox">
';
The notation above can be easier to read for shorter strings. If you
use this notation, be sure to escape any single quotes that occur
in the HTML snippet or the Perl code in the rfile will not compile
properly.
The $NEW array contains the text to be substituted in place
of the $OLD:
$NEW[0] = <<'END_OF_TEXT';
<td id="termbodybox">
END_OF_TEXT
Now we use the array index to define the next find and replace. You
can add as many matches as you like:
$OLD[1] = <<'END_OF_TEXT';
<td bgcolor="#000000" id="bodybox">
<p>
END_OF_TEXT
$NEW[1] = <<'END_OF_TEXT';
<td id="bodybox">
END_OF_TEXT
That completes the rfile. Back to RCS and the command line, I check
out and lock the HTML files using RCS like this:
find . -type f -name "*.html" -print0 | xargs -0 co -l
Then I run dwreplace.pl like this:
dwreplace.pl --rfile=removetablecolors.rfile `find . \
-name "*.html" -print`
Note that at this time the program does not have a -null option,
so I run it using shell backtick notation instead. This assumes a
limited number of files (or the shell will complain that the argument
list is too long) and that your filenames don't contain characters
such as spaces and quotes. You can always reduce the number of
candidate input files by using grep -l.
I can always see my changes with RCS by running rcsdiff:
find . -type f -name "*.html" -print0 | xargs -0 rcsdiff | less
If I don't like my changes, I can undo (rollback) using this command:
find . -type f -name "*.html" -print0 | xargs -0 co -l -f
If I'm sure the replacement worked, I can check in (commit a revision)
and comment my files as well. Keeping revisions makes it easy for
me to revert to a given version:
find . -type f -name "*.html" -print0 | xargs -0 ci -u \
-m'removed table color'
Of course, the program allows you to use regular expressions in the
rfile, instead of cut-and-pasted HTML. For each match that you define,
you can pass options such as:
- isregexp: the match is a regular expression
- interpolator: interpolate backreferences in the right-hand
side (i.e., $1, $2, etc.)
- eflag: execute the right-hand side (the right-hand side is
normally a function or expression)
- reversefile: reverse the file prior to performing the match
- casesensitive: perform a case-sensitive match (default
is case insensitive)
For my second example, I want to replace everything between the
two comment tags (see below) with an include file to make the site
easier to maintain. I know that there is some variation in content
between the comments, so I need to use a regular expression to match
everything between the tags:
<!-- Menu Starts --> any random HTML with variable whitespace <!-- Menu Ends -->

to become:

<!-- Menu Starts -->
include("righthandnav.php");
<!-- Menu Ends -->
Let's put together a regular expression to match the HTML text containing
the comments. In Perl, the regular expression to match the comment
tags would look sort of like this:
<!-- Menu Starts -->.*?<!-- Menu Ends -->
But, you need to be sure to escape the parts of the regular expression
that you don't want treated specially (such as "?" or "."). Since
this is HTML, you also want to be sure that you accept strings with
variable whitespace as a match. I've extracted a function from dwreplace.pl
into a standalone program called "escape_perl_regular_expression"
(Listing 2), which replaces existing whitespace with \s* and
takes out the funny characters:
[cdalsass@house dev]$ escape_perl_regular_expression
<!-- Menu Starts -->.*?<!-- Menu Ends -->
(the regular expression above was pasted in to yield)
<!--\s*Menu\s*Starts\s*-->\.\*\?<!--\s*Menu\s*Ends\s*-->\s*
Of course I do want the .*? part of the regular expression,
so I un-escape it to yield this:
<!--\s*Menu\s*Starts\s*-->.*?<!--\s*Menu\s*Ends\s*-->\s*
That completes the regular expression. Next, I define an rfile (run
dwreplace.pl -e to see a sample rfile), called, for example
"rightnav.rfile":
# Example rfile, rightnav.rfile:
# Note: use a normal Perl single quoted string here, since it's a
# single line regular expression
$OLD[0] = '<!--\s*Menu\s*Starts\s*-->.*?<!--\s*Menu\s*Ends\s*-->\s*';
$NEW[0] = <<'END_OF_TEXT';
<!-- Menu Starts -->
include("/rightnav.inc");
<!-- Menu Ends -->
END_OF_TEXT
# use the isregexp option in the rfile.
$OPTIONS[0] = {
isregexp => 1,
};
An important thing to remember about Perl's regular expressions is
greediness versus non-greediness. Most GUI tools do not support a
greediness meta-character, but modern Perl regular expressions do.
When matching HTML content, non-greediness is always represented by
a question mark, usually like this .*?. In this case, it only
matters if there are other "Menu Ends" comment tags after the first.
(A non-greedy match will grab only up to the first "Menu Ends" tag,
while the greedy match will grab all content up to the final "Menu
Ends" comment tag.)
For my third example, I'd like to show a "reverse search and replace".
In this type of search, you reverse the contents of the file and
regular expression, perform the match, and then un-reverse the file.
Regular expression algorithms move linearly forward through the
match-string as different match attempts are made. For matches where
your known text is ahead of repeating, known text, this forward
moving behavior can be undesirable.
To illustrate the general case of how reversing the match string
can simplify your regular expressions, consider writing a regular
expression to find the file extension in a filename such as "theory.of.int.design.edoc"
or "case.b.doc", where "doc" or "edoc" are both valid results. It
is easier to reverse the filename and search forward for all characters
up to the first period and reverse the results than to concoct a
regular expression using "look behind" regular expressions.
The same is often true when matching HTML content. For example,
suppose you know the phrase "Click Here to" exists in a page, and
you want to grab all HTML from "Click Here to", backwards to the
next two table tags. (The following HTML code shows the structure
of the HTML; note that table tags above the "Click Here to" may
be nested and non-distinguishable from other table tags, which makes
this more complicated than having a single or uniquely identifiable
table above the text. You may also want to generalize the regular
expression by ignoring unique table tags so that you can get more
input files to match.) I prefer to do this with a reverse search,
because Perl regular expressions work forward through the match-string.
The reverse search keeps your regular expressions simpler. As an
example, consider this HTML code:
<table><!-- random html -->
<table>
<table>
random html
<!-- grab everything from this point (two table tags before \
"Click here to") to the text "Click here to" -->
<table>
random html
<table>
random html
Click Here to
<!-- grab everything above this point, to the previous 2 table \
tags -->
To perform this search, I combine the isregexp and reversefile
options for the match. I also reverse the text of the regular expression.
The rfile looks like this:
$OPTIONS[0] = {
isregexp => 1,
reversefile => 1,
};
$OLD[0] = 'ot ereh kcilC.*?>elbat<.*?>elbat<';
$NEW[0] = '<p><center>{$sendbutton}</center></p>';
There are a few optional parameters that can be passed to dwreplace.pl.
They are:
--s -- Search only, do not replace.
--e -- Show an example rfile.
--v -- Verbose mode, shows detailed matching info and highlighted
text for changes (under development).
--ni -- Case-sensitive mode (not insensitive).
It takes a combination of tricks to improve on GUI search-and-replace
tools. This article should help you perform these operations more
effectively, and to do them from the command line. Finally, the
dwreplace.pl program is a nice tool to keep in your toolkit, whether
you are a Web developer or a sys admin.
Charles M. Dalsass is the CTO of Neptune Web, Inc., a Boston
area full-service, Web design and development company. Neptune Web,
Inc. provides open source consulting services as well as content
management software, creative services, development and hosting.
Charles can be reached at charles.dalsass@neptuneweb.com.