Better Find and Replace on HTML Content
Charles M. Dalsass
How many times have you run into situations in which you need
to find and replace HTML content across a large Web site? For example,
you need to swap out headers and footers with SSIs (server side
includes) on a large static Web site for better maintainability,
or you need to find a header image across a site and replace it
with another header image. The best tools for performing Find and
Replace against HTML content are bundled with commercial HTML editing
software. These tools are GUI-based, normally run on Windows or
Macintosh, and include the ability to cut multi-line chunks
of HTML and replace all pages in a folder or site with another multi-line
chunk of HTML. Most allow you to use regular expressions in your
search.
However, there are limitations to GUI software when performing
find-and-replace tasks across large numbers of Web pages. The most
obvious limitation is not having access to an environment with these
tools installed, such as a Unix or server environment. This article
describes these limitations and shows how to perform similar find-and-replace
operations on the command line by using a few scripts and standard
Unix command-line utilities. This article should be as relevant
to sys admins as it is to Web developers. If you've ever spent time
cleaning up a Web site that has been corrupted by a virus, you will
agree.
My company, Neptune Web, Inc., needed a practical, command-line
alternative to GUI search-and-replace facilities. I've always been
told that Perl is great for matching text, but found that it's difficult
to perform find and replace without writing a new program every
time. I also wanted the interface to be as intuitive as possible.
Finally, I wanted to fix the other shortfalls of GUI-based find
and replace:
- GUI tools lack the ability to track changes (revision control)
and undo them if you make a mistake.
- GUI tools normally don't allow you to script out multiple searches
and replacements (e.g., if a first match doesn't work, try a second
match etc.).
- It's difficult to record searches and replacements that were
performed in the past, since everything is GUI-based. Options
such as "regular expression" or "ignore whitespace" can't be written
down since they only exist as checkboxes in the GUI. This can
be particularly irksome if you have made a mistake at some point
in the process of reworking hundreds of documents and have since
made other changes.
- With GUI find and replace, it's difficult to choose subsets
of files or folders, apart from "everything in the site", "everything
in the folder", etc.
- GUI tools lack the ability to do "reverse" matches, where you
grab content from one known spot and work your way backwards
to perform the replacement. This can be a useful way to grab content
within nested tables.
- In most GUI find-and-replace tools, the regular expressions
have been simplified. For example, many tools don't have the ability
to control greediness (i.e., no ability to match random stuff
between two known tags that may repeat beyond the match in which
you are interested, such as table tags.) Nested tables, for example,
become difficult to match if you are looking for a particular
table.
Our company has written a Perl program called "dwreplace.pl" (see
Listing 1) that, in combination with some "old school" Unix utilities,
does all of the above. We've tried to make the command-line interface
as simple as possible, with real examples in the output, so you
won't forget how to use it.
In a command-line environment, such as the bash shell, it's possible
(but difficult) to pass large strings of text to programs (e.g.,
grep), since most programs take line-based options and newline
characters in the options cause problems. To simplify passing complex
arguments to the program, the program uses what I call an "rfile"
(replacement file). The rfile defines a list of possible matches
and replacements to make against the Web pages, defined in a (Perl)
array format (run the program with -e to see an example rfile).
The rfile also serves as a history of your searches, which allows
you to run them again if something went wrong.
The program is run like this:
dwreplace (other options) --rfile=rfile file1 [file2 ... filen]
where "rfile" is the name of the rfile, and file1..filen are
the files on which the search and replace is performed.
The program works by converting the HTML text you've pasted into
the rfile into a Perl regular expression that is "whitespace insensitive".
It also escapes out all characters that have special meanings within
the regular expression. For example, if you have a block of text
like this:
<br>
<br>$5.00 per item*
</p>
the program converts that into a Perl regular expression like this
(\s* matches 0 or more whitespace characters) by escaping all
special characters and turning whitespace into \s*:
<br>\s*<br>\$5\.00\s*per\s*item\*\s*<\/p>\s*
Let's start with an easy example. Suppose I have 1000 HTML pages with
a background color in a table cell and a tag that I'd like to remove.
The old HTML looks like this:
<td bgcolor="#000000" id="termbodybox">
or possibly this:
<td bgcolor="#000000" id="bodybox">
And I'd like to replace the tags with this HTML snippet:
<td id="termbodybox">
I realize this would be a very trivial replacement to do in the GUI
tools -- you'd just have to run the replacement two times. This example
would also be trivial running sed or perl -pi two times.
To enable "rollback" and revision control of changes, I'm going
to check in all the files using RCS (revision control system). Although
any revision control system could be used, RCS is a great choice because
of its availability (on *NIX systems) and simplicity. RCS will create
a revision history of the file next to the file you are working
on (it's stored in a [filename],v file). The ,v file
can be deleted or removed with no effect on the overall project.
(Keep in mind that it is important to delete any ancillary files
such as "rfiles" or ,v files when you have finished a
find and replace. By-products left on publicly accessible Web folders
could reveal information about your system to hackers.) RCS has
been around forever, so it's pre-installed on almost all Unix/Linux
distributions. You can read more about RCS here:
http://www.gnu.org/software/rcs/rcs.html
or by running man rcsintro.
I'm going to combine an RCS "initial" check-in with a find
command to check in all the .html files in the site. The -i -t-"initial"
tells ci to create the ,v files associated with each HTML page,
using the string "initial" as a starter comment. The -u
ensures that the files don't disappear (odd as it sounds). I also
use find's -print0 option (with xargs -0) so that filenames containing
spaces or quotes won't cause an error:
find . -type f -name "*.html" -print0 | xargs -0 ci -u -i -t-"initial"
Using my dwreplace.pl, I define an rfile (run dwreplace.pl -e
to see a sample rfile), called, for example, "removetablecolors.rfile".
The rfile defines the "OLD" and "NEW" arrays, which are the find and
replace strings, respectively. The code in the configuration file
is standard Perl, so anything goes here (including HERE documents
[shown below], variables, functions, or single- or double-quoted strings).
Matches are performed linearly through the array, against each file
passed to the program:
$OLD[0] = <<'END_OF_TEXT';
<td bgcolor="#000000" id="termbodybox">
<p>
END_OF_TEXT
In this example and in the documentation embedded in the dwreplace.pl
program, I've used Perl's single-quoted HERE documents to define the
boundaries of the $OLD[0] string. I often paste a half page
or more of HTML between the END_OF_TEXT lines. I use this notation
so that I don't have to escape special characters found in the HTML
blocks, such as single quotes.
Since the rfile is standard Perl, any Perl quoting syntax works
fine, for example:
$OLD[0] =
'<td bgcolor="#000000" id="termbodybox">
';
The notation above can be easier to read for shorter strings. If you
use this notation, be sure to escape any single quotes that occur
in the HTML snippet or the Perl code in the rfile will not compile
properly.
The $NEW array contains the text to be substituted in place
of the $OLD:
$NEW[0] = <<'END_OF_TEXT';
<td id="termbodybox">
END_OF_TEXT
Now we use the array index to define the next find and replace. You
can add as many matches as you like:
$OLD[1] = <<'END_OF_TEXT';
<td bgcolor="#000000" id="bodybox">
<p>
END_OF_TEXT
$NEW[1] = <<'END_OF_TEXT';
<td id="bodybox">
END_OF_TEXT
That completes the rfile. Back to RCS and the command line, I check
out and lock the HTML files using RCS like this:
find . -type f -name "*.html" -print0 | xargs -0 co -l
Then I run dwreplace.pl like this:
dwreplace.pl --rfile=removetablecolors.rfile `find . \
-name "*.html" -print`
Note that at this time the program does not have a -null option,
so I run it using shell backtick notation instead. This assumes a
limited number of files (or the shell will complain that the argument
list is too long) and that your filenames don't contain characters
such as spaces and quotes. You can always reduce the number of
candidate input files by using grep -l.
I can always see my changes with RCS by running rcsdiff:
find . -type f -name "*.html" -print0 | xargs -0 rcsdiff | less
If I don't like my changes, I can undo (rollback) using this command:
find . -type f -name "*.html" -print0 | xargs -0 co -l -f
If I'm sure the replacement worked, I can check in (commit a revision)
and comment my files as well. Keeping revisions makes it easy for
me to revert to a given version:
find . -type f -name "*.html" -print0 | xargs -0 ci -u \
-m'removed table color'
Of course, the program allows you to use regular expressions in the
rfile, instead of cut-and-pasted HTML. For each match that you define,
you can pass options such as:
- isregexp: the match is a regular expression
- interpolator: interpolate backreferences in the right-hand
side (i.e., $1, $2, etc.)
- eflag: execute the right-hand side (the right-hand side is
normally a function or expression)
- reversefile: reverse the file prior to performing the match
- casesensitive: perform a case-sensitive match (default
is case insensitive)
For my second example, I want to replace everything between the
two comment tags (see below) with an include file to make the site
easier to maintain. I know that there is some variation in content
between the comments, so I need to use a regular expression to match
everything between the tags:
<!-- Menu Starts --> any random HTML with variable whitespace <!-- Menu Ends -->

to become:

<!-- Menu Starts -->
include("righthandnav.php");
<!-- Menu Ends -->
Let's put together a regular expression to match the HTML text containing
the comments. In Perl, the regular expression to match the comment
tags would look sort of like this:
<!-- Menu Starts -->.*?<!-- Menu Ends -->
But, you need to be sure to escape the parts of the regular expression
that you don't want treated specially (such as "?" or "."). Since
this is HTML, you also want to be sure that you accept strings with
variable whitespace as a match. I've extracted a function from dwreplace.pl
into a standalone program called "escape_perl_regular_expression"
(Listing 2), which replaces existing whitespace with \s* and
takes out the funny characters:
[cdalsass@house dev]$ escape_perl_regular_expression
<!-- Menu Starts -->.*?<!-- Menu Ends -->
(the regular expression above was pasted in to yield)
<!--\s*Menu\s*Starts\s*-->\.\*\?<!--\s*Menu\s*Ends\s*-->\s*
Of course I do want the .*? part of the regular expression,
so I un-escape it to yield this:
<!--\s*Menu\s*Starts\s*-->.*?<!--\s*Menu\s*Ends\s*-->\s*
That completes the regular expression. Next, I define an rfile (run
dwreplace.pl -e to see a sample rfile), called, for example
"rightnav.rfile":
# Example rfile, rightnav.rfile:
# Note: use a normal Perl single quoted string here, since it's a
# single line regular expression
$OLD[0] = '<!--\s*Menu\s*Starts\s*-->.*?<!--\s*Menu\s*Ends\s*-->\s*';
$NEW[0] = <<'END_OF_TEXT';
<!-- Menu Starts -->
include("/rightnav.inc");
<!-- Menu Ends -->
END_OF_TEXT
# use the isregexp option in the rfile.
$OPTIONS[0] = {
isregexp => 1,
};
An important thing to remember about Perl's regular expressions is
greediness versus non-greediness. Most GUI tools do not support a
greediness meta-character, but modern Perl regular expressions do.
When matching HTML content, non-greediness is always represented by
a question mark, usually like this .*?. In this case, it only
matters if there are other "Menu Ends" comment tags after the first.
(A non-greedy match will grab only up to the first "Menu Ends" tag,
while the greedy match will grab all content up to the final "Menu
Ends" comment tag.)
For my third example, I'd like to show a "reverse search and replace".
In this type of search, you reverse the contents of the file and
regular expression, perform the match, and then un-reverse the file.
Regular expression algorithms move linearly forward through the
match-string as different match attempts are made. For matches where
your known text is ahead of repeating, known text, this forward
moving behavior can be undesirable.
To illustrate the general case of how reversing the match string
can simplify your regular expressions, consider writing a regular
expression to find the file extension in a filename such as "theory.of.int.design.edoc"
or "case.b.doc", where "doc" or "edoc" are both valid results. It
is easier to reverse the filename and search forward for all characters
up to the first period and reverse the results than to concoct a
regular expression using "look behind" regular expressions.
The same is often true when matching HTML content. For example,
suppose you know the phrase "Click Here to" exists in a page, and
you want to grab all HTML from "Click Here to", backwards to the
next two table tags. (The following HTML code shows the structure
of the HTML; note that table tags above the "Click Here to" may
be nested and non-distinguishable from other table tags, which makes
this more complicated than having a single or uniquely identifiable
table above the text. You may also want to generalize the regular
expression by ignoring unique table tags so that you can get more
input files to match.) I prefer to do this with a reverse search,
because Perl regular expressions work forward through the match-string.
The reverse search keeps your regular expressions simpler. As an
example, consider this HTML code:
<table><!-- random html -->
<table>
<table>
random html
<!-- grab everything from this point (two table tags before \
"Click here to") to the text "Click here to" -->
<table>
random html
<table>
random html
Click Here to
<!-- grab everything above this point, to the previous 2 table \
tags -->
To perform this search, I combine the isregexp and reversefile
options for the match. I also reverse the text of the regular expression.
The rfile looks like this:
$OPTIONS[0] = {
isregexp => 1,
reversefile => 1,
};
$OLD[0] = 'ot ereh kcilC.*?>elbat<.*?>elbat<';
$NEW[0] = '<p><center>{$sendbutton}</center></p>';
There are a few optional parameters that can be passed to dwreplace.pl.
They are:
--s -- Search only, do not replace.
--e -- Show an example rfile.
--v -- Verbose mode, shows detailed matching info and highlighted
text for changes (under development).
--ni -- Case-sensitive mode (not insensitive).
It takes a combination of tricks to improve on GUI search-and-replace
tools. This article should help you perform these operations more
effectively, and to do them from the command line. Finally, the
dwreplace.pl program is a nice tool to keep in your toolkit, whether
you are a Web developer or a sys admin.
Charles M. Dalsass is the CTO of Neptune Web, Inc., a Boston
area full-service, Web design and development company. Neptune Web,
Inc. provides open source consulting services as well as content
management software, creative services, development and hosting.
Charles can be reached at charles.dalsass@neptuneweb.com.