The Perl Journal June 2003
Last month I presented an HTML stripper that tackled the problem of removing unwanted HTML from posts in web-based bulletin board systems. I outlined some of the risks in allowing arbitrary HTML in such posts. Left unfiltered, such HTML can be exploited to allow cross-site scripting and the triggering of browser bugs, which can lead to denial-of-service attacks or usurped credentials, or it can simply mangle the display of your pages.
The code for the stripper that I described last month is in Listing 1. This month, I'll explain the code for the My_HTML_Filter module, on which the HTML stripper depends. So how does My_HTML_Filter work? Let's go to Listing 2 for the details.
Line 3 pulls in the XML::LibXML module and line 4 creates a $PARSER object, using the default settings. This parser can be shared among many individual filters, so we've made it a class variable.
Lines 6 through 10 provide the constructor, which simply captures the $permitted parameter as an instance (member) variable, and returns the blessed object.
The real meat begins in line 12: the strip instance method. Lines 13 and 14 grab the instance variable and the input HTML, respectively.
Line 16 parses the HTML into a DOM. That's it. The result is an XML::LibXML::Document object from which we can get nice, clean HTML, or even XHTML if we choose. But we'll want to strip out the ugly stuff first. Line 17 caches the $permitted hashref into a simple scalar for quick access.
Line 19 establishes the "cursor" or "current node," conveniently and ambiguously called $cur. We'll do a walk of the DOM tree by moving this pointer around. We'll start by dropping down to the first child of the document node, and we'll end when the value hits undef, dropping out of the loop beginning in line 20.
Line 21 establishes a $delete flag. If this flag is true at the bottom of this loop, the current node must be deleted after we've computed the next node.
The comments in lines 23 and 24 reflect my feelings about how this is a bad design. Any design that requires you to ask an object for its type is generally a maintenance nightmare. Instead, a common protocol should have been established, where I'd merely have to query the properties and abilities of each thing within the tree using a Boolean-returning query method. But not here. So I grit my teeth and use ref a lot, hoping that some future version fixes all of this before it breaks all of this.
If it's an element, we'll note that in line 25, and then proceed to see whether the element type (the nodeName in line 27) is one of our permitted elements. If so, $ok_attr will then be a hashref of the permitted attributes, and we'll continue into line 28.
Lines 29 to 32 remove any attribute that is not permitted. Each attribute is queried for its nodeName, which is then checked against a list of permitted attributes, and removed if not permitted.
Line 34 moves our cursor down into the first child node, if it exists. For example, if we're looking at a td element, it will almost certainly have some content that we then have to scan. Some permitted elements (like br) won't have any content, so $next will be undef, and we'll use the normal "move forward" logic at the bottom of the loop.
That handles the permitted elements, but when we have a forbidden element, we need to remove it and reparent the orphaned children, using the code beginning in line 39.
Line 42 caches the parent node of the node to be deleted. Lines 43-45 move all of the node's children up to follow the node. This must be done in reverse order so that the order is retained following the current node, since we're always inserting immediately following the current node.
Finally, line 47 notes that the $cur node must be deleted after we've computed the following node at the bottom of the loop.
Lines 50 through 52 retain any existing text or cdata sections. The cdata section results when a script tag is used. Although our permitted list will probably cause the script element to be removed, the hoisted data is in a cdata element, and should be treated like text.
Lines 53 through 57 flag comments and DTDs as needing to be deleted. The former is dangerous (possibly hiding JavaScript). The latter is unnecessary, since we'll likely be including this text as part of a larger HTML page anyway.
Lines 58 to 60 attempt to flag anything else that I didn't see in my testing. I have no idea if I covered all of the nodes permitted in an HTML document, but I'm hoping I did.
Lines 62 to 73 compute the "next" node to be visited. I want the equivalent of the XPath expression following::node()[1], without paying the price of parsing an XPath each time through the loop. This expression looks for the next node of any type, either at the same level or at any higher level. Child nodes are not considered.
Line 63 initially sets this "next" node to be the current node. Lines 65 to 68 determine if the next sibling node is available. If so, that's our selection, and we drop out of the "naked block" defined in lines 64 through 72.
If the node has no next sibling, then we need to pop up a level in the tree and look for that node's following node. Line 70 tries this, restarting the naked block if successful. If we're already at the top-level node, then the undef value is left in $next, which will end the outer loop started in line 20.
Lines 76 and 77 delete the current node if needed by requesting that the parent node forget about the current node. Line 79 advances the current node to the next node as the last step of this outer loop.
All that's left now is to spit out the modified DOM as HTML, in line 82, and then toss away the initial DTD in line 83. (In my tests, this was always the first line up to a newline, but this may change in future releases of the library, so this is a bit risky.)
Line 88 provides the mandatory true value for all files brought in with require or use.
And there you have it! A configurable HTML stripper that is fast and thorough. Now there's no excuse for letting someone start a comment tag or bold tag in your guestbook, messing up the rest of your display. Until next time, enjoy!
TPJ
=0= ###### LISTING ONE (main program) ######
=1= #!/usr/bin/perl
=2= use strict;
=3= $|++;
=4=
=5= use My_HTML_Filter;
=6=
=7= ## from http://www.perlmonks.org/index.pl?node_id=29281
=8=
=9= my %PERMITTED =
=10= map { my($k, @v) = split; ($k, {map {$_, 1} @v}) }
=11= split /\n/, <<'END';
=12= a href name target class title
=13= b
=14= big
=15= blockquote class
=16= br
=17= center
=18= dd
=19= div class
=20= dl
=21= dt
=22= em
=23= font size color class
=24= h1
=25= h2
=26= h3
=27= h4
=28= h5
=29= h6
=30= hr
=31= i
=32= li
=33= ol type start
=34= p align class
=35= pre class
=36= small
=37= span class title
=38= strike
=39= strong
=40= sub
=41= sup
=42= table width cellpadding cellspacing border bgcolor class
=43= td width align valign colspan rowspan bgcolor height class
=44= th colspan width align bgcolor height class
=45= tr width align valign class
=46= tt class
=47= u
=48= ul
=49= END
=50=
=51= use Test::More qw(no_plan);
=52=
=53= my $f = My_HTML_Filter->new(\%PERMITTED) or die;
=54= isa_ok($f, "My_HTML_Filter");
=55=
=56= is($f->strip(qq{Hello}),
=57= qq{<p>Hello</p>\n},
=58= "basic text gets paragraphed");
=59= is($f->strip(qq{<p><bogus>Thing}),
=60= qq{<p>Thing</p>\n},
=61= "bogons gets stripped");
=62= is($f->strip(qq{<a href="foo">bar</a>}),
=63= qq{<a href="foo">bar</a>\n},
=64= "links are permitted");
=65= is($f->strip(qq{<a href=foo>bar</a>}),
=66= qq{<a href="foo">bar</a>\n},
=67= "attributes get quoted");
=68= is($f->strip(qq{<a href=foo bogus=place>bar</a>}),
=69= qq{<a href="foo">bar</a>\n},
=70= "bad attributes get stripped");
=71= is($f->strip(qq{<p>What do <!-- comment -->you say?}),
=72= qq{<p>What do you say?</p>\n},
=73= "comments get stripped");
=74= is($f->strip(qq{<table><tr><td>Hi!}),
=75= qq{<table><tr><td>Hi!</td></tr></table>\n},
=76= "tags get balanced");
=77= is($f->strip(qq{<b><i>bold italic!}),
=78= qq{<b><i>bold italic!</i></b>\n},
=79= "b/i tags get balanced");
=80= is($f->strip(qq{<b><i>bold italic!</b></i>}),
=81= qq{<b><i>bold italic!</i></b>\n},
=82= "b/i tags get nested properly");
=83= is($f->strip(qq{<B><I>bold italic!</I></B>}),
=84= qq{<b><i>bold italic!</i></b>\n},
=85= "tags get lowercased");
=86= is($f->strip(qq{<h1>hey</h1>one<br>two}),
=87= qq{<h1>hey</h1>\n<p>one<br>two</p>\n},
=88= "br comes out as HTML not XHTML");
=89=
=90= use Benchmark;
=91= my $homepage = do { open my $f, "homepage.html"; join "", <$f> };
=92=
=93= timethese
=94= (-1,
=95= {
=96= strip_homepage => sub { $f->strip($homepage) }
=97= });
=0= ###### LISTING TWO (My_HTML_Filter.pm) ######
=1= package My_HTML_Filter;
=2= use strict;
=3= require XML::LibXML;
=4= my $PARSER = XML::LibXML->new;
=5=
=6= sub new {
=7= my $class = shift;
=8= my $permitted = shift;
=9= return bless { permitted => $permitted }, $class;
=10= }
=11=
=12= sub strip {
=13= my $self = shift;
=14= my $html = shift;
=15=
=16= my $dom = $PARSER->parse_html_string($html) or die "Cannot parse";
=17= my $permitted = $self->{permitted};
=18=
=19= my $cur = $dom->firstChild;
=20= while ($cur) {
=21= my $delete = 0; # default to safe
=22=
=23= ## I really really hate switching on class names
=24= ## but this is a bad interface design {sigh}
=25= if (ref $cur eq "XML::LibXML::Element") {
=26= ## "that which is not explicitly permitted is forbidden!"
=27= if (my $ok_attr = $permitted->{$cur->nodeName}) {
=28= ## so this element is permitted, but what about its attributes?
=29= for my $att ($cur->attributes) {
=30= my $name = $att->nodeName;
=31= $cur->removeAttribute($name) unless $ok_attr->{$name};
=32= }
=33= ## now descend if any kids
=34= if (my $next = $cur->firstChild) {
=35= $cur = $next;
=36= next; # don't execute code at bottom
=37= }
=38= } else {
=39= ## bogon - delete!
=40= ## we must hoist any kids to be after our current position in
=41= ## reverse order, since we always inserting right after old node
=42= my $parent = $cur->parentNode or die "Expecting parent of $cur";
=43= for (reverse $cur->childNodes) {
=44= $parent->insertAfter($_, $cur);
=45= }
=46= ## and flag this one for deletion
=47= $delete = 1;
=48= ## fall out
=49= }
=50= } elsif (ref $cur eq "XML::LibXML::Text"
=51= or ref $cur eq "XML::LibXML::CDATASection") {
=52= ## fall out
=53= } elsif (ref $cur eq "XML::LibXML::Dtd"
=54= or ref $cur eq "XML::LibXML::Comment") {
=55= ## delete these
=56= $delete = 1;
=57= ## fall out
=58= } else {
=59= warn "[what to do with a $cur?]"; # I hope we don't hit this
=60= }
=61=
=62= ## determine next node ala XPath "following::node()[1]"
=63= my $next = $cur;
=64= {
=65= if (my $sib = $next->nextSibling) {
=66= $next = $sib;
=67= last;
=68= }
=69= ## no sibling... must try parent node's sibling
=70= $next = $next->parentNode;
=71= redo if $next;
=72= }
=73= ## $next might be undef at this point, and we'll be done
=74=
=75= ## delete the current node if needed
=76= $cur->parentNode->removeChild($cur)
=77= if $delete;
=78=
=79= $cur = $next;
=80= }
=81=
=82= my $output_html = $dom->toStringHTML;
=83= $output_html =~ s/.*\n//; # strip the doctype
=84=
=85= return $output_html;
=86= }
=87=
=88= 1;