Turning HTML into an RSS Feed

The Perl Journal November 2002

By Sean M. Burke

Sean is a long-time contributor to CPAN, and is the author of Perl & LWP from O'Reilly & Associates. He can be contacted at sburke@cpan.org.

In the September 2002 issue of The Perl Journal, Derek Valada's article "Parsing RSS Files with XML::RSS" sang the praises of RSS feeds, and showed how even if you don't have any RSS client programs or don't use a web site that aggregates them for you, it takes just a bit of Perl to write your own little utility for viewing the RSS content in your web browser. I can testify that once you get used to having RSS feeds from a few sites, you want all your favorite sites to have them. This article is about what to do when a site doesn't have an RSS feed: Make one for yourself by writing a little Perl tool to get content from the site's HTML.

Sites Without RSS Feeds

Some wonderful web sites provide RSS feeds that make it easy for us to find out when there's something interesting at their site, without actually having to go to that web site first. But some web sites just haven't gotten around to providing an RSS feed. Sometimes this is because the site's programmers (if there are any!) happen not to have heard about RSS yet. Or sometimes the people maintaining the site don't know that so many people actually use RSS and would appreciate an RSS feed: the main sysadmin of one of the Internet's larger web-logging sites recently told me, "I've considered it a bit but haven't found any real compelling reason yet, and no one else has seemed very interested in it."

Hopefully, all the larger sites will come around to providing RSS feeds; but I bet there will always be routinely updated content on the Web that lacks them. In that case, we have to make our own.

The first step in making an RSS feed for a remote site is checking that they don't already have one. Unlike a /robots.txt file, the RSS feed for a site doesn't have a predictable name, nor is there even any one place on a site where it's customary to mention the URL of the RSS feed. Some sites mention the URL in their FAQ, and recently some sites have started mentioning the URL in an HTML link element in the site's HTML, like so:

<link rel="alternate" type="application/rss+xml"

   href="http://that_rss_url" >

If you can't find an RSS feed that way, search Google for "sitename RSS" or "sitename RDF"; you'd be surprised how effective that is. And if that doesn't get you anywhere, e-mail the site's webmaster and ask for the URL of their RSS feed. If they get enough such messages, they'll make a point of more clearly stating the RSS feed's URL if they have one, or of setting one up if they don't.
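That first check, scanning a page for a link element advertising a feed, is easy to automate. Here's a rough sketch (my own, not code from the listings below), run against a made-up sample page; in real use you'd fetch the HTML with LWP::Simple's get():

```perl
use strict;
use warnings;

# A made-up sample page standing in for one fetched from the Web.
my $html = <<'END_HTML';
<html><head>
<title>Some Weblog</title>
<link rel="alternate" type="application/rss+xml"
   href="http://example.com/feed.rss" >
</head><body>...</body></html>
END_HTML

# Look for a link element that advertises an RSS feed, and pull out its href.
# (This assumes the type attribute comes before href, as in the sample above.)
my $feed_url;
if( $html =~ m{<link[^>]*type="application/rss\+xml"[^>]*href="([^"]+)"}i ) {
  $feed_url = $1;
}

print defined($feed_url) ? "Feed: $feed_url\n" : "No feed advertised.\n";
# Feed: http://example.com/feed.rss
```

A character class like [^>] matches newlines too, so this works even when the link element is wrapped across lines.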

But if absolutely none of these things work out, then it's time to roll up your sleeves and write an RSS generator that extracts content from the site's HTML.

Scanning HTML

Processing HTML to find the bits of content that you want is one of the black arts of modern programming. Not only is each web page different, but as its content varies from day to day, it may exhibit unexpected changes in its template, which your program is hopefully robust enough to deal with. In fact, most of my book Perl & LWP is an explanation of the bag of tricks that you should learn to use for writing really robust HTML-scanner programs. You also need a bit of practice, but if you don't have any experience at scanning HTML, then doing so to write an RSS is a great place to start. In this article, I'll stick to just using regular expressions instead of more advanced approaches involving HTML::TokeParser or HTML::TreeBuilder.
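For comparison, here is a quick sketch of what the token-based approach looks like with HTML::TokeParser (my own illustration on made-up HTML, not code from this article):

```perl
use strict;
use warnings;
use HTML::TokeParser;  # part of the HTML-Parser distribution on CPAN

# A made-up scrap of HTML standing in for a fetched page.
my $html = '<a href="/story/1">First headline</a> <a href="/story/2">Second headline</a>';

# Walk the token stream, collecting each link's URL and link text.
my $p = HTML::TokeParser->new(\$html);
my @items;
while( my $tag = $p->get_tag('a') ) {
  my $url = $tag->[1]{'href'} or next;     # element [1] is the attribute hash
  my $title = $p->get_trimmed_text('/a');  # text up to the closing </a>
  push @items, $title, $url;
}
print "title: {$items[0]}\nurl: {$items[1]}\n";
# title: {First headline}
# url: {/story/1}
```

Token parsing is less brittle than a regexp when attribute order or whitespace shifts, at the cost of a bit more code.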

The basic approach is this:

use LWP::Simple;

my $content_url = 'http://whatever.url.to/get/from.html';

my $content = get($content_url);

die "Can't get $content_url" unless defined $content;  

...then extract things from $content...

So, for example, consider http://freshair.npr.org/, the web site for National Public Radio's interview program Fresh Air. One page on the site has the listings for the current program, with HTML like this:

<A HREF="http://www.npr.org/ramfiles/fa/20020920.fa.01.ram">Listen to

   <FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="3">

   <B> John Lasseter            </B> 

</FONT></A>

 ...

<A HREF="http://www.npr.org/ramfiles/fa/20020920.fa.02.ram">Listen to

   <FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="3">

   <B> Singer and guitarist Jon Langford    </B>

</FONT></A>

 ... plus any other segments ...

The parts that we want to extract are:

http://www.npr.org/ramfiles/fa/20020920.fa.01.ram

John Lasseter


http://www.npr.org/ramfiles/fa/20020920.fa.02.ram

Singer and guitarist Jon Langford

We can get the page and match the content with Listing 1, whose regular expression we arrive at through a bit of trial-and-error. When run, this happily produces the following output, showing that it's properly matching the three segments in that page (at time of writing):

url: {http://www.npr.org/ramfiles/fa/20020920.fa.01.ram}

title: {John Lasseter}


url: {http://www.npr.org/ramfiles/fa/20020920.fa.02.ram}

title: {Singer and guitarist Jon Langford}


url: {http://www.npr.org/ramfiles/fa/20020920.fa.03.ram}

title: {Film critic David Edelstein}

We can later comment out that print statement and add some code to write @items to an RSS file.

Now consider this similar case, where we're scanning the HTML in the Guardian's web page for breaking news:

...

 <A HREF="/worldlatest/story/0,1280,-2035841,00.html">Unsolved
 Crimes Vex Afghanistan</A><BR><B>6:50 am</B><P>
 <A HREF="/worldlatest/story/0,1280,-2035838,00.html">Christians
 Show Support For Israel</A><BR><B>6:40 am</B><P>
 <A HREF="/worldlatest/story/0,1280,-2035794,00.html">Schroeder's
 Party Wins 2nd Term</A><BR><B>5:30 am</B><P>

...

It's a great big bunch of unbroken HTML (which I've put newlines into just for readability), but look at it a bit and you'll see that each item in it reads like this:

<A HREF="url">headline</A><BR><B>time</B><P>

You'll also note that items follow each other, one after another, with no intervening "</p><p>" tags or newlines.

So we cook up that pattern into a regular expression and put it into our aforementioned code template, as shown in Listing 2. When we run that, the code correctly produces this list of items:

url: {/worldlatest/story/0,1280,-2035841,00.html}

title: {Unsolved Crimes Vex Afghanistan}


url: {/worldlatest/story/0,1280,-2035838,00.html}

title: {Christians Show Support For Israel}


url: {/worldlatest/story/0,1280,-2035794,00.html}

title: {Schroeder's Party Wins 2nd Term}


 ...and a dozen more items...

We're ready to make both of these programs write their @items to an RSS feed, except for one thing: URLs in an RSS feed should really be absolute (starting with "http://..."), not relative URLs like the "/worldlatest/story/0,1280,-2035794,00.html" we got from the Guardian page. Luckily, the URI.pm class provides a simple way to turn a relative URL into an absolute one, given a base URL:

URI->new_abs($rel_url => $base_url)->as_string
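For instance, resolving one of the Guardian's relative URLs by hand (a standalone check, assuming URI.pm is installed):

```perl
use strict;
use warnings;
use URI;

# Resolve a relative URL against the page it came from.
my $base_url = 'http://www.guardian.co.uk/worldlatest/';
my $rel_url  = '/worldlatest/story/0,1280,-2035841,00.html';

my $abs = URI->new_abs($rel_url => $base_url)->as_string;
print "$abs\n";
# http://www.guardian.co.uk/worldlatest/story/0,1280,-2035841,00.html
```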

We can use this by adding a use URI; to the start of our program, and changing the end of our while loop to read like so:

    $url = URI->new_abs($url => $content_url)->as_string;

    print "url: {$url}\ntitle: {$title}\n\n";

    push @items, $title, $url;

  }

With that change made, our program emits absolute URLs, such as these:

url: {http://www.guardian.co.uk/worldlatest/story/0,1280,-2035841,00.html}

title: {Unsolved Crimes Vex Afghanistan}


url: {http://www.guardian.co.uk/worldlatest/story/0,1280,-2035838,00.html}

title: {Christians Show Support For Israel}


url: {http://www.guardian.co.uk/worldlatest/story/0,1280,-2035794,00.html}

title: {Schroeder's Party Wins 2nd Term}

 ...and a dozen more items...

Basic RSS Syntax

An RSS file is a kind of XML file that expresses some data about the site in general, and then lists the details of each story item at that feed. While RSS actually has many more features than I'll discuss here (especially in later versions than the 0.91 version here), a minimal RSS file starts with an XML header, an appropriate doctype, and some metadata elements, like this:

<?xml version="1.0"?>

<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"

  "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91"><channel>

  <title> title of the site </title>

  <description> description of the site </description>

  <link> URL of the site </link>

  <language> the RFC 3066 language tag for this feed's content </language>

Then there are a number of item elements like this:

<item><title>...headline...</title><link>...url...</link></item>

And then the document ends like this:

</channel></rss>

So the RSS file that we would produce with our Fresh Air HTML scanner would look like Figure 1 (shown with a bit of helpful indenting).

We can break Figure 1 down into three main pieces of code: one for all the stuff before the first item, one for taking our @items and spitting out the series of <item>...</item> elements, and one to complete the XML by outputting </channel></rss>. But first there's one consideration: we can't just take the text we pulled out of the HTML and dump it into XML. The reason is that XML is much less forgiving than HTML, notably with &foo; codes, or "character entity references," as they're called. That is, the HTML could have this:

<a href="...">Proctor & Gamble to merge with H&R Block</a>

That's acceptable (if not proper) HTML, but it's strictly forbidden in XML, and will cause any XML parser to reject the document. In XML, if there's an &, it must be the start of a character entity reference; if you mean a literal "&", it must be expressed with just such a &foo; code, typically as &amp;, but also possibly as &#38; or &#x26;.

Moreover, just because something is a legal HTML &foo; code doesn't mean it's legal in an RSS file. For the sake of compatibility, the RSS 0.91 DTD (at the URL you see in the <!DOCTYPE...> declaration) defined the same &foo; codes as HTML, but that was the HTML of several years ago, back when there were just codes for the Latin-1 characters 160 to 255. This gets you codes like &eacute;, but if you try using more recent additions like &euro; or &mdash;, the RSS parser will fail to parse the document.

So just to be on the safe side, we should decode all the &foo; codes in the HTML, and then reencode everything using numeric codes (like &#123;), since those are always acceptable to XML parsers. And while we're at it, we should kill any tags inside that HTML, in case the text that we captured happens to contain some <br>s, which would be malformed as XML. To do the &foo; decoding, we can use the ever-useful HTML::Entities module (available on CPAN as part of the HTML-Parser distribution). Then we do a little cleanup and use a simple regexp to replace each unsafe character (like & or é) with a &#number; code. The eminently reusable routine for doing this looks like Listing 3.

Hooking It All Together

Once we've got the xml_string routine as previously defined, we can then use it in a routine that takes the contents of our @items (alternating title and URL), and returns XML as a series of <item>...</item> elements, as in Listing 4.

We can test that routine by doing this:

print rss_body("Bogodyne rockets > 250&frac12;/share!","http://test");

Its output is this:

<item>

<title>Bogodyne rockets &#62; 250&#189;/share!</title>

<link>http://test</link>

</item>

This is correct, since &#62; is XMLese for "a literal > character," and &#189; is for "a literal 1/2 character." Since this is all working happily, we can make another routine for the start of the XML document, as shown in Listing 5. Then spitting out the bare-bones RSS XML that we're after is just a matter of making the call shown in Listing 6.

Run the program, and it indeed spits out the valid XML shown in Figure 2. To do the same for our Fresh Air program, we just append the same code, changing the parameters to rss_start, as in Listing 7. Running that program returns the RSS expression of our @items as shown in Figure 3.

And because we put everything through our xml_string routine, the XML text is always properly escaped, even if our original HTML-scanner regexp happens to trap an HTML tag or a malformed &foo; code.

That's all there is to making a basic RSS generator program. The only question left is how to have it run.

Running via CGI or via Cron

There are two main ways to use an RSS generator program: it can run as a CGI and send its output to the browser on demand, or it can run periodically via cron and save its output to a file that can be accessed at some URL.

The mechanics are simple. If the program is to run as a CGI, just start its output out with a MIME header like so:

print "Content-type: application/rss+xml\n\n",

  rss_start(

  ... and the rest, as before ...

If you want to save the output to a file, instead do this:

my $outfile = '/home/jschmo/public_html/freshair.rss';

open(OUTXML, ">$outfile")

  || die "Can't write-open $outfile: $!\nAborting";

print OUTXML

  rss_start(

  ... and the rest, as before ...

The more complex issue is: Under what conditions would you want to do it one way or the other way? If the program runs as a CGI, it will connect to the remote server to get the HTML as many times as there are requests for the RSS feed. If this is an RSS feed that only you know about, and you access it only a few times a day at most, then having it run as a CGI is just fine.

But if the RSS feed might be accessed often, then it would be more efficient for your server, as well as for the remote server, if you have the RSS updater run periodically via cron, as with a crontab line like this:

13 6-17 * * 1-5 /home/jschmo/make_fresh_air_rss

That will run the program at 13 minutes past the hour, from 6:13 am to 5:13 pm, Monday through Friday, and those are the only times that it will request the HTML from the server, no matter how many times the resulting RSS file gets hit. Implicit in those crontab settings is the assumption that we don't really need absolutely up-to-the-minute information (or else we'd set it to run more often, or just go back to using the CGI approach) and that there's no point in accessing the RSS data outside of those hours. Since Fresh Air is produced only once every weekday, I've judged that it's very unlikely that their HTML listings page will change outside of those hours.

You should always be considerate of the remote web server, so you should request its HTML only as often as necessary. Not only does this approach go easy on the remote server, it also goes easy on your server (which is running the RSS generator). This is fitting, considering that the whole point of an RSS file is to bring people to the content they're interested in, as efficiently as possible, from the points of view of the people and of the web servers involved.
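If you do go the cron route, one further courtesy (a sketch of my own, not part of the listings below) is to use LWP::Simple's mirror() function instead of get(). mirror() saves the page to a local file and sends an If-Modified-Since header based on that file's timestamp, so the remote server only transmits the page when it has actually changed; the cache path here is hypothetical:

```perl
use strict;
use warnings;
use LWP::Simple qw(mirror);
use HTTP::Status qw(RC_NOT_MODIFIED);

# Hypothetical cache location -- adjust to suit your setup.
my $content_url = 'http://www.guardian.co.uk/worldlatest/';
my $cache_file  = '/tmp/worldlatest.html';

# mirror() does a conditional GET: it returns 304 (RC_NOT_MODIFIED)
# when the page hasn't changed since the cache file was written.
my $status = mirror($content_url, $cache_file);
if( $status == RC_NOT_MODIFIED ) {
  print "Unchanged since last fetch; leaving the old RSS file alone.\n";
} else {
  # ...otherwise read $cache_file and regenerate the feed as before...
  print "Fetched a fresh copy; regenerating the feed.\n";
}
```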

TPJ

Listing 1

use LWP::Simple;

my $content_url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';
my $content = get($content_url);
die "Can't get $content_url" unless defined $content;
$content =~ s/(\cm\cj|\cj|\cm)/\n/g; # nativize newlines

my @items;
while($content =~ m{
\s+<A HREF="([^"\s]+)">Listen to
\s+<FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="3">
\s+<B>(.*?)</B>
}g) {
  my($url, $title) = ($1,$2);
  print "url: {$url}\ntitle: {$title}\n\n";
  push @items, $title, $url;
}


Listing 2

use LWP::Simple;

my $content_url = 'http://www.guardian.co.uk/worldlatest/';
my $content = get($content_url);
die "Can't get $content_url" unless defined $content;
$content =~ s/(\cm\cj|\cj|\cm)/\n/g; # nativize newlines

my @items;
while($content =~
  m{<A HREF="(/worldlatest/.*?)">(.*?)</A><BR><B>.*?</B><P>}g
) {
  my($url, $title) = ($1,$2);
  print "url: {$url}\ntitle: {$title}\n\n";
  push @items, $title, $url;
}


Listing 3

use HTML::Entities qw(decode_entities);

sub xml_string {
  # Take an HTML string and return it as an XML text string

  local $_ = $_[0];

  # Collapse and trim whitespace
  s/\s+/ /g;  s/^ //s;  s/ $//s;

  # Delete any stray HTML tags
  s/<.*?>//g;

  decode_entities($_);

  # Substitute or strike out forbidden MSWin characters!
  tr/\x91-\x97/''""*\x2D/;
  tr/\x7F-\x9F/?/;

  # &-escape every potentially unsafe character
  s/([^ !#\$%\x28-\x3B=\x3F-\x7E])/'&#'.(ord($1)).';'/seg;

  return $_;
}


Listing 4

sub rss_body {
  my $out = '';
  while(@_) {
    $out .= sprintf
     "  <item>\n\t<title>%s</title>\n\t<link>%s</link>\n  </item>\n",
     map xml_string($_),
         splice(@_,0,2); # get the first two each time
  }
  return $out;
}


Listing 5

sub rss_start {
  return sprintf q[<?xml version="1.0"?>
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
  "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91"><channel>
  <title>%s</title>
  <description>%s</description>
  <link>%s</link>
  <language>%s</language>
],
    map xml_string($_),
      @_[0,1,2,3];  # Call with: title, desc, URL, language!
}

...and for the end:

sub rss_end {
  return '</channel></rss>';
}


Listing 6

print
  rss_start(
    "Guardian World Latest",
    "Latest Headlines from the Guardian",
    $content_url,
    'en-GB',  # language tag for UK English
  ),
  rss_body(@items),
  rss_end()
;


Listing 7

print
  rss_start(
    "Fresh Air",
    "Terry Gross's interview show on National Public Radio",
    $content_url,
    'en-US',  # language tag for US English
  ),
  rss_body(@items),
  rss_end()
;


