The Perl Journal May 2003
The simplicity of HTML is sometimes deceiving. Sure, it's pretty easy for your average Perl hacker to set up a web-based bulletin board system, allowing people to come along and write comments. It's even tempting to allow those comments to contain HTML rather than being escaped into monospaced <pre> purgatory. But "there be dragons there," as the old maps used to say.
The problem is that arbitrary HTML permits arbitrary activities to be triggered by merely visiting the site, thanks to these fancy scriptable browsers. As reported in the security journals, these attacks are generally known as "cross-site scripting." They usually come in the form of a JavaScript chunk embedded in a web page where at least part of the content can be controlled by arbitrary visitors, such as a guestbook or a web-based message system. Left unchecked, such attacks can unknowingly leak a person's credentials (such as cookies) to the bad guys, and that can lead to some pretty bad stuff.
Even without the issue of cross-site scripting, we still have to watch out for arbitrary HTML and JavaScript that can trigger browser bugs, which can again lead to denial-of-service attacks or usurped credentials. While keeping up with the latest browser release usually prevents this, most people I know don't upgrade at the first notice, leading to a vulnerability window.
And then there are the just plain annoyances. People who put HTML "start bold" tags in without the end bold. Or worse yet, including a start comment marker without the matching end comment. This isn't always a malicious act: It could happen just as easily by accident.
Because there are so many ways to go wrong, people tend to forbid all HTML, escape everything through an entity escaper, and leave it at that. But how do you permit some "safe" HTML while being very careful not to let "dangerous" HTML or comments into your code? For example, what if inline images were deemed to be annoying? How do you ensure that you are stripping all img elements?
This month, I'll present, in detail, an HTML stripper program to tackle this problem. The program depends on an HTML filter module that I've written. Next month, I'll give a line-by-line breakdown of the filter module. (Both listings are shown at the end of this column, and will be repeated next month.)
I've seen a few solutions to tidy up HTML, usually based on a series of regular-expression replacements (such as HTML::Sanitizer in the CPAN). But these often fail to consider the matching-tag or the implicit close-tag problems of HTML. For example, consider the valid HTML of:
<table><tr><td><b>foo<td>bar</table>
In this case, the bolding really does end at the end of foo, so bar should be rendered as unbolded. But to know that, you have to know that the td element closes off the previous td element, and therefore also the b element as well. That's a bit hard to get into the regular expressions.
One all-encompassing solution is HTML::TreeBuilder from the CPAN. This code understands the nesting and optional closing tags of HTML, and wraps itself around HTML::Parser to find the tags and other syntax of an HTML document. Once we have a nice clean tree of properly parsed and nested HTML elements, we merely need to walk through the tree, throwing away the dangerous elements. As long as we don't mangle the tree, we should get properly nested tags out of the mix as well.
The problem with a solution based around HTML::TreeBuilder is that it is too expensive to use repeatedly (such as every time a page is reloaded). While HTML::Parser is pretty fast, HTML::TreeBuilder has to build a lot of heavily connected heavy Perl objects, at least one for every element of the tree. This kind of tree is slow to create and slow to discard, so a heavily hit web site would be bogged down in short order.
But, from the XML realm (of all places) comes another interesting solution, in the form of XML::LibXML, which is a wrapper around the GNOME libxml2 parser. Although it can be a bit finicky to install, many interesting things become possible once you've got it there.
The XML::LibXML library can parse things in HTML mode, not just XML mode. In HTML mode, missing close tags are automatically deduced, HTML entities are optional and error-corrected, and quotes around attribute values are optional. All of these would be fatal to a normal XML parser. The result of an HTML parsing is an in-memory Document Object Model (DOM) that can then be accessed with XPath or DOM APIs. The advantage is that the DOM stays in the library (C code) side of the picture until requested, rather than in a bag of Perl objects.
In my time trials, regardless of whether the HTML file was small or huge, an HTML parse with XML::LibXML was 10 to 20 times faster than the equivalent parse with HTML::TreeBuilder. This is good news because most of the time is spent recognizing the data and building the tree, so reducing that gives us a big win.
So, once we build the DOM, it's a matter of walking the DOM, removing the forbidden elements and attributes, and then spitting the result out as HTML. And I've constructed a proof-of-concept module for that, which I'll describe next month.
To test my code, I needed a list representing a typical web-based community system's permitted HTML elements and attributes. Since I frequent the Perl Monastery at http://www.perlmonks.org/, I decided to grab their list of approved HTML for typical questions or answers. I extracted the list, and put it into the center of Listing 1.
Lines 1-3 of this program begin nearly every program I write, and turn on compiler restrictions and disable the normal STDOUT buffering.
Line 5 pulls in the My_HTML_Filter module, containing my HTML filtering code. This module is expected to be somewhere within my @INC path. Because I was always invoking this program from the current directory, I put the .pm file in the same directory for testing. In a production system, I might have had to alter @INC to access the locally installed module.
Line 7 gives the URL from which these elements and attributes are extracted. Lines 9-49 create the hash of permitted elements and attributes, as a nested hash. The first level of the hash has a key for every valid element. The corresponding value is a hashref, pointing to a second hash of where the keys represent every valid attribute for that element. The corresponding values for those keys are simply the number 1, permitting a truth test rather than an existence test for when we finally want to check for validity.
The code to create this hash from the "here document" is in lines 10 and 11. First, the data is split on newline, and then for each line, a further split on whitespace puts the first word of the line into $k, and the remaining words into @v. Then, two elements are generated for each input element: the $k value, and a hashref of a hash where the keys are all the @v elements and the values are all 1.
The list of elements and attributes given here is by no means promised to be safe. It just happens to be what is in use at the moment at the Perl Monastery, and has evolved over time.
Line 51 and beyond create a Test::More document, usually used in testing a module within a distribution, but handy here while I was developing and understanding the module. The no_plan in line 51 indicates that Test::More will count the number of tests and put the "plan" for the tests at the end of the output rather than the beginning.
Line 53 creates a filter object $f, passing it the permitted elements and attributes hash. Line 54 tests $f to ensure that it's actually an object of the intended type.
Lines 56-88 illustrate some of the transformations of this HTML stripper. Each is in the form of:
is($trial_text, $reference_text, $explanation)
The $trial_text comes from running the filter on the given string, resulting in some HTML output. This is compared to the $reference_text, which is what we are hoping the output resembles. The $explanation describes the particular test. A sample run of this part of the code looks like:
ok 1 - The object isa My_HTML_Filter ok 2 - basic text gets paragraphed ok 3 - bogons gets stripped ok 4 - links are permitted ok 5 - attributes get quoted ok 6 - bad attributes get stripped ok 7 - comments get stripped ok 8 - tags get balanced ok 9 - b/i tags get balanced ok 10 - b/i tags get nested properly ok 11 - tags get lowercased ok 12 - br comes out as HTML not XHTML
This test list is by no means a full suite of tests that I would use for a production module, but it shows the basics. Bad attributes and comments are removed, bad elements are stripped (and their contents pulled up inline), close tags are automatically added according to HTML rules, and generally, life is good. The resulting HTML could be inserted into an output page safely.
And then the fun partlines 90-97 show me just how fast or slow this code actually can be. I placed the home page for http:// www.stonehenge.com/ into a local file, then brought the contents into $homepage in line 91 (using the autovivified filehandle mechanism new to Perl 5.8). I then ran the stripper on the text (about 8K as I'm testing this) until a CPU second passed, and reported the number of passes per second that can be achieved. On an 8K chunk of HTML (much larger than a typical question or answer at the Monastery), I see about 40 to 50 results per second on my 1-GHz laptop. This is well within reasonable bounds, assuming we cache the result in some nice place on a high-performance website. Thus, the code is useful.
So that's the HTML stripper. You can examine the code for the My_HTML_Filter module, which is shown in Listing 2; and next month, I'll walk through that module in detail.
TPJ
(Listings are also available online at http://www.tpj.com/source/.)
=0= ###### LISTING ONE (main program) ######
=1= #!/usr/bin/perl
=2= use strict;
=3= $|++;
=4=
=5= use My_HTML_Filter;
=6=
=7= ## from http://www.perlmonks.org/index.pl?node_id=29281
=8=
=9= my %PERMITTED =
=10= map { my($k, @v) = split; ($k, {map {$_, 1} @v}) }
=11= split /\n/, <<'END';
=12= a href name target class title
=13= b
=14= big
=15= blockquote class
=16= br
=17= center
=18= dd
=19= div class
=20= dl
=21= dt
=22= em
=23= font size color class
=24= h1
=25= h2
=26= h3
=27= h4
=28= h5
=29= h6
=30= hr
=31= i
=32= li
=33= ol type start
=34= p align class
=35= pre class
=36= small
=37= span class title
=38= strike
=39= strong
=40= sub
=41= sup
=42= table width cellpadding cellspacing border bgcolor class
=43= td width align valign colspan rowspan bgcolor height class
=44= th colspan width align bgcolor height class
=45= tr width align valign class
=46= tt class
=47= u
=48= ul
=49= END
=50=
=51= use Test::More qw(no_plan);
=52=
=53= my $f = My_HTML_Filter->new(\%PERMITTED) or die;
=54= isa_ok($f, "My_HTML_Filter");
=55=
=56= is($f->strip(qq{Hello}),
=57= qq{<p>Hello</p>\n},
=58= "basic text gets paragraphed");
=59= is($f->strip(qq{<p><bogus>Thing}),
=60= qq{<p>Thing</p>\n},
=61= "bogons gets stripped");
=62= is($f->strip(qq{<a href="foo">bar</a>}),
=63= qq{<a href="foo">bar</a>\n},
=64= "links are permitted");
=65= is($f->strip(qq{<a href=foo>bar</a>}),
=66= qq{<a href="foo">bar</a>\n},
=67= "attributes get quoted");
=68= is($f->strip(qq{<a href=foo bogus=place>bar</a>}),
=69= qq{<a href="foo">bar</a>\n},
=70= "bad attributes get stripped");
=71= is($f->strip(qq{<p>What do <!-- comment -->you say?}),
=72= qq{<p>What do you say?</p>\n},
=73= "comments get stripped");
=74= is($f->strip(qq{<table><tr><td>Hi!}),
=75= qq{<table><tr><td>Hi!</td></tr></table>\n},
=76= "tags get balanced");
=77= is($f->strip(qq{<b><i>bold italic!}),
=78= qq{<b><i>bold italic!</i></b>\n},
=79= "b/i tags get balanced");
=80= is($f->strip(qq{<b><i>bold italic!</b></i>}),
=81= qq{<b><i>bold italic!</i></b>\n},
=82= "b/i tags get nested properly");
=83= is($f->strip(qq{<B><I>bold italic!</I></B>}),
=84= qq{<b><i>bold italic!</i></b>\n},
=85= "tags get lowercased");
=86= is($f->strip(qq{<h1>hey</h1>one<br>two}),
=87= qq{<h1>hey</h1>\n<p>one<br>two</p>\n},
=88= "br comes out as HTML not XHTML");
=89=
=90= use Benchmark;
=91= my $homepage = do { open my $f, "homepage.html"; join "", <$f> };
=92=
=93= timethese
=94= (-1,
=95= {
=96= strip_homepage => sub { $f->strip($homepage) }
=97= });
=0= ###### LISTING TWO (My_HTML_Filter.pm) ######
=1= package My_HTML_Filter;
=2= use strict;
=3= require XML::LibXML;
=4= my $PARSER = XML::LibXML->new;
=5=
=6= sub new {
=7= my $class = shift;
=8= my $permitted = shift;
=9= return bless { permitted => $permitted }, $class;
=10= }
=11=
=12= sub strip {
=13= my $self = shift;
=14= my $html = shift;
=15=
=16= my $dom = $PARSER->parse_html_string($html) or die "Cannot parse";
=17= my $permitted = $self->{permitted};
=18=
=19= my $cur = $dom->firstChild;
=20= while ($cur) {
=21= my $delete = 0; # default to safe
=22=
=23= ## I really really hate switching on class names
=24= ## but this is a bad interface design {sigh}
=25= if (ref $cur eq "XML::LibXML::Element") {
=26= ## "that which is not explicitly permitted is forbidden!"
=27= if (my $ok_attr = $permitted->{$cur->nodeName}) {
=28= ## so this element is permitted, but what about its attributes?
=29= for my $att ($cur->attributes) {
=30= my $name = $att->nodeName;
=31= $cur->removeAttribute($name) unless $ok_attr->{$name};
=32= }
=33= ## now descend if any kids
=34= if (my $next = $cur->firstChild) {
=35= $cur = $next;
=36= next; # don't execute code at bottom
=37= }
=38= } else {
=39= ## bogon - delete!
=40= ## we must hoist any kids to be after our current position in
=41= ## reverse order, since we always inserting right after old node
=42= my $parent = $cur->parentNode or die "Expecting parent of $cur";
=43= for (reverse $cur->childNodes) {
=44= $parent->insertAfter($_, $cur);
=45= }
=46= ## and flag this one for deletion
=47= $delete = 1;
=48= ## fall out
=49= }
=50= } elsif (ref $cur eq "XML::LibXML::Text"
=51= or ref $cur eq "XML::LibXML::CDATASection") {
=52= ## fall out
=53= } elsif (ref $cur eq "XML::LibXML::Dtd"
=54= or ref $cur eq "XML::LibXML::Comment") {
=55= ## delete these
=56= $delete = 1;
=57= ## fall out
=58= } else {
=59= warn "[what to do with a $cur?]"; # I hope we don't hit this
=60= }
=61=
=62= ## determine next node ala XPath "following::node()[1]"
=63= my $next = $cur;
=64= {
=65= if (my $sib = $next->nextSibling) {
=66= $next = $sib;
=67= last;
=68= }
=69= ## no sibling... must try parent node's sibling
=70= $next = $next->parentNode;
=71= redo if $next;
=72= }
=73= ## $next might be undef at this point, and we'll be done
=74=
=75= ## delete the current node if needed
=76= $cur->parentNode->removeChild($cur)
=77= if $delete;
=78=
=79= $cur = $next;
=80= }
=81=
=82= my $output_html = $dom->toStringHTML;
=83= $output_html =~ s/.*\n//; # strip the doctype
=84=
=85= return $output_html;
=86= }
=87=
=88= 1;