RSS, Ruby, & the Web

Dr. Dobb's Journal January, 2005

Libraries that make RSS and web work easy

By Dave Thomas

Dave is a principal in The Pragmatic Programmers and author of Programming Ruby. He can be contacted at http://www.pragmaticprogrammer.com/.

Really Simple Syndication (RSS) is the protocol of choice for disseminating news on the Internet, and with good reason. As a consumer of information, you get the convenience of having all the data you receive consolidated into a single, easy-to-scan format. And if you're creating or publishing information, RSS is an easy-to-implement way of getting your data onto people's desktops.

Many people use RSS to read summaries of weblogs and other news sources. This is natural: these sites typically provide an RSS feed that can be read by RSS clients (known as "aggregators"). But RSS is far more than "Dear Diary, Yesterday I had pizza." With some simple tools, you can extract information from non-RSS sources, convert it to RSS, and summarize it alongside the other feeds in your aggregator. You can use RSS to collect and summarize information from your projects and from your life.

For example, in my publishing business, I work with a number of author teams, each using a part of our CVS repository. I use a trivial script to extract information every time someone commits a change to the repository, summarizing the change in an RSS feed. I'm also interested in the Amazon sales ranks of all our titles, so another simple script downloads the information and converts it into another RSS feed. I also extract and convert ToDo items from my calendar application. Throughout the day, my RSS aggregator notifies me when this information changes. It's like having a synoptic display of important information, tailored to your needs, all with very little work.

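To make this happen, you need some tools. The recently released 1.8.2 version of Ruby (http://www.ruby-lang.org/) comes complete with a full set of libraries that make working with the Web and RSS easy. To introduce these, in this article I show how to write a simple standalone weblog, publish its content as an RSS feed, and generate an RSS summary of commits to a CVS repository.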

Finally, I look at the other side of the coin—taking an RSS feed and converting it into an HTML fragment suitable for embedding in a web page's sidebar.

Basic Blogging

At its simplest, a weblog (or blog) is a web page that displays a set of articles in reverse chronological order (newest at the top). When the writer adds a new story, it appears at the head of the list—the oldest story might drop off if the list reaches a given size.

There are many software packages and dedicated sites that implement fully fledged blogging solutions. However, as an introduction to both Ruby and its web libraries, I implement a very basic blog from scratch.

This blog runs in a web server. I could have written something that runs in Apache, but instead I make it standalone. Listing One is a complete web server that handles multiple MIME types, directory indexing, CGI execution, user directories, partial content fetches, and more. Not bad for four lines of code. Of course, the WEBrick library handles the real work. You load it into the program, then create a new web-server object. I run the server on port 2000, and tell it to look for documents to serve in the directory html/ (in a real application, you'd probably want to make this an absolute pathname). The next line tells the Ruby interpreter to shut the web server down cleanly when users interrupt it by typing Ctrl-C on the console used to start it. Finally, you tell the server to start executing requests.

You can run this web server from the command line:

ruby webserver1.rb

Point a browser at http://localhost:2000/ and you should see a document served from the html/ directory.

Adding Weblog Capabilities

For illustrative purposes, I create a basic weblog that displays the documents in or below the web server's document root directory, showing the 10 most recent documents, newest first. I start by writing a class to represent a single article (Listing Two). The initialize() method is the class's constructor. It takes the name of a file containing the article and reads in the contents. It looks for a title, searching for either the contents of the HTML <title> element or the first <h1> element. If neither is found, the title is set to the name of the file. Because the articles are regular HTML files, it also removes everything but the content of the <body> element. The resulting title, filename, and body are saved in instance variables (with names that start with @ signs). These are made accessible outside the class via the attributes declared near the top of the class.
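The title and body extraction can be tried on an in-memory string. The following is a minimal sketch in the style of Listing Two; the helper name extract_title_and_body and the sample HTML are made up for illustration, and the body pattern is anchored at the start of the string so that text preceding the <body> tag is dropped as well.

```ruby
# Hypothetical helper: the extraction logic of Article#initialize,
# applied to a string instead of a file.
def extract_title_and_body(html, fallback_title)
  title = fallback_title
  # Prefer the <title> element; fall back to the first <h1>.
  if html =~ %r{<title.*?>(.*?)</title}m ||
     html =~ %r{<h1.*?>(.*?)</h1}m
    title = $1
  end
  # Keep only the text between <body ...> and </body>. The leading \A.*?
  # also discards everything that precedes the <body> tag.
  body = html.sub(%r{\A.*?<body.*?>(.*)</body.*}m) { $1 }
  [title, body]
end

sample = "<html><head><title>Pizza Day</title></head>\n" \
         "<body class=\"post\"><p>Yesterday I had pizza.</p></body></html>"
title, body = extract_title_and_body(sample, "pizza.html")
# title => "Pizza Day"
# body  => "<p>Yesterday I had pizza.</p>"
```

If the document has neither a <title> nor an <h1>, the fallback (here, the file name) is returned as the title.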

The method Article.list (Listing Three) returns an array of up to 10 Article objects, sorted in reverse chronological order. It illustrates some nice features of Ruby, including three uses of Ruby's blocks.

Ruby blocks are chunks of code (between do/end keywords or braces) that are associated with a method call. The method can invoke the block multiple times, and when the code in the block completes, control is returned to the associated method. The list() method uses three blocks. The first occurs on the call to Dir.chdir(). This method makes the given directory the process's current working directory and invokes the associated block (which, in this case, contains the rest of the body of the method). When the block finishes, control returns to the chdir() method, which resets the current directory back to the value it had previously.
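To make the block mechanics concrete, here is a small sketch. The method three_times is a hypothetical example written for this illustration; Dir.chdir and Dir.mktmpdir are standard Ruby.

```ruby
require 'tmpdir'

# A method that invokes its block several times (hypothetical example).
def three_times
  results = []
  3.times { |i| results << yield(i) }  # each yield runs the caller's block
  results
end

squares = three_times { |i| i * i }    # => [0, 1, 4]

# Dir.chdir with a block: the working directory changes only for the
# duration of the block and is restored when the block finishes.
before = Dir.pwd
inside = nil
Dir.mktmpdir do |tmp|
  Dir.chdir(tmp) { inside = Dir.pwd }
end
after = Dir.pwd                        # same directory as before
```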

Inside this block, Dir.glob returns an array of filenames that match a given pattern. The ** part of this pattern tells the glob to match in the current directory and in its subdirectories, and *.html matches filenames ending with ".html".
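The effect of the ** pattern is easy to check against a throwaway directory tree. In this sketch the file and directory names are invented for the example.

```ruby
require 'tmpdir'
require 'fileutils'

found = nil
Dir.mktmpdir do |root|
  # Build a small tree: one page at the top, one two levels down,
  # plus a non-HTML file that the pattern should skip.
  FileUtils.mkdir_p(File.join(root, "2004/12"))
  FileUtils.touch(File.join(root, "index.html"))
  FileUtils.touch(File.join(root, "2004/12/post.html"))
  FileUtils.touch(File.join(root, "2004/12/notes.txt"))

  # Paths come back relative to the current directory.
  Dir.chdir(root) { found = Dir.glob("**/*.html").sort }
end
# found => ["2004/12/post.html", "index.html"]
```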

The next line sorts this list of names according to the modification time of the associated file. This uses the new (in Ruby 1.8) sort_by() method that lets you sort a collection based on some value derived from the elements in that collection (in this case, the file's modification time).

Finally, the sorted list of names is reversed and up to 10 entries are extracted. This list is passed to the map() method, which creates a new array by passing each entry in turn to the given block and populating the result array with the value returned by the block. In this case, the map() construct converts the list of filenames into an array of Article objects.
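The sort, reverse, slice, and map steps can be followed on in-memory data. This sketch replaces the file system with a hash of hypothetical names and modification times.

```ruby
# Hypothetical data: file names paired with modification times.
mtimes = {
  "old.html"    => Time.utc(2004, 1, 1),
  "newest.html" => Time.utc(2004, 12, 1),
  "middle.html" => Time.utc(2004, 6, 1),
}

sorted = mtimes.keys.sort_by { |name| mtimes[name] }  # oldest first
recent = sorted.reverse[0, 2]                         # newest two
titles = recent.map { |name| File.basename(name, ".html").capitalize }
# titles => ["Newest", "Middle"]

# Asking for more entries than exist simply returns what is there.
all = sorted.reverse[0, 10]
# all.size => 3
```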

It's interesting to compare the length of the method with the length of its description. Ruby code can be remarkably compact and yet (once you're familiar with the language) very readable.

With this infrastructure in place, you're ready to write a blog. You do this by adding a servlet to the existing WEBrick web server. A servlet is a chunk of code that handles requests directed to a given path on the web server. When the client browser sends a request to a URL that includes this path, the servlet is invoked. It can optionally extract information from the request, then provide data for the response to be sent back to the browser.

Class BlogServlet (Listing Four) is the servlet implementation of the blog. The < sign in the class definition specifies that the BlogServlet class is a subclass of WEBrick's AbstractServlet—you inherit all the basic servlet functionality. To handle incoming requests, the servlet implements a do_GET() method, which takes request and response objects as parameters.

The blog servlet doesn't need information from the request, and simply sets the response content to the 10 most recent files in the document tree. Notice here that I use map() again, this time to convert the array of Article objects to an array of the corresponding article bodies. I then use the method join() to convert this list into a single string with an <hr> element between each article.
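The page assembly itself needs no web server to try out. In this sketch, Post is a hypothetical stand-in for the Article class.

```ruby
Post = Struct.new(:body)   # hypothetical stand-in for Article

posts = [Post.new("<p>Newer story</p>"), Post.new("<p>Older story</p>")]

# map extracts each body; join glues them together with a rule between.
content = posts.map { |post| post.body }.join("<hr />")
page    = "<html><body>#{content}</body></html>"
# content => "<p>Newer story</p><hr /><p>Older story</p>"
```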

You mount this servlet on the path /blog. Once you start the server running, you can access the blog using the URL http://localhost:2000/blog/.

Having written a basic blog, you can make its content available via an RSS feed.

RSS Summary

RSS is a simple XML-based messaging format. An RSS message contains a channel header followed by zero or more items; see Figure 1. The channel header identifies the source of the message, and includes a description, title, and the URL of the site serving the feed. Listing Five is the channel header for the BBC News RSS feed (which contains a summary of current news stories).

Following the channel header, an RSS message contains a set of items. Each item is intended to summarize an external piece of information—the RSS item contains a title, a brief description, and a link pointing to a fuller version of the information. To continue the BBC News example, you might find items each containing a synopsis of a news story and a link back to the BBC site where the full story is stored.
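To show the shape of a message, the following sketch assembles a channel header and its items by hand. It is for illustration only: the helper names and sample data are invented, the output is not validated against any RSS specification, and real code would more likely use the RSS library described in the next section.

```ruby
# Escape the characters that are special in XML text (minimal, illustrative).
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }

def xml_escape(text)
  text.to_s.gsub(/[&<>]/, XML_ESCAPES)
end

# Build a feed: one channel header followed by zero or more items.
def build_feed(channel, items)
  item_xml = items.map do |item|
    "    <item>\n" \
    "      <title>#{xml_escape(item[:title])}</title>\n" \
    "      <link>#{xml_escape(item[:link])}</link>\n" \
    "      <description>#{xml_escape(item[:description])}</description>\n" \
    "    </item>\n"
  end.join

  "<rss version=\"2.0\">\n" \
  "  <channel>\n" \
  "    <title>#{xml_escape(channel[:title])}</title>\n" \
  "    <link>#{xml_escape(channel[:link])}</link>\n" \
  "    <description>#{xml_escape(channel[:description])}</description>\n" \
  "#{item_xml}" \
  "  </channel>\n" \
  "</rss>\n"
end

feed = build_feed(
  { title: "Example News", link: "http://example.com/", description: "Q&A and more" },
  [
    { title: "First story",  link: "http://example.com/1", description: "A <b>synopsis</b>" },
    { title: "Second story", link: "http://example.com/2", description: "Another synopsis" },
  ]
)
```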

There are three major variants of RSS. Netscape created the original 0.90 specification. Later versions in the 0.9 series (0.91 through 0.94) were developed by UserLand Software. The RSS-DEV Working Group subsequently produced RSS 1.0, a total rethink that married RSS technology with RDF, a union that many felt was too complicated. As a reaction, UserLand produced RSS 2.0, which ignored the RDF flavor of 1.0 and instead built on the more successful 0.9 protocol. Unless you have a particular need for the data structuring capabilities of RDF, I suggest sticking with RSS 0.9x or 2.0.

Adding RSS Feeds

The RSS library supplied with Ruby 1.8.2 and later makes it easy both to parse and to create RSS information in 0.9, 1.0, and 2.0 formats. The library represents RSS data as a series of objects—for an RSS 0.92 feed, an RSS object contains a Channel object, and that channel object contains a collection of Item objects. The RSS-generating servlet (Listing Six) shows how these objects are used to create an RSS feed for this simple blog. You start by creating an RSS object—in this case, you ask for one that's compatible with the 0.9 specification. You then create a Channel object, populating it with the blog's title and a link to the top page of the web site. The URI class, also new in Ruby 1.8, makes this easy: You get the URI that was used to request the RSS feed and remove the path component, creating a link to the top-level page of the site. When this address is accessed, WEBrick automatically looks for an index.html file and displays its contents.

Each item in the RSS you generate corresponds to an article in the blog, so you can reuse the Article.list() method. The each() method iterates over this list, passing each article in turn to the associated block. Inside the block, you work out the URL for the article (by adding its path to the base URI of the site) and create and populate a new RSS Item object. This is then added to the list of items in the channel.
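The URI manipulation can be tried on its own. This sketch assumes a request for http://localhost:2000/rss and a hypothetical article path; unlike Listing Six, it works on copies of the URI so that the two results do not interfere.

```ruby
require 'uri'

request_uri = URI.parse("http://localhost:2000/rss")  # assumed request URL

# Removing the path leaves a link to the top-level page of the site.
site = request_uri.dup
site.path = ""
site_link = site.to_s        # => "http://localhost:2000"

# Adding an article's relative path gives the link for one item.
article = request_uri.dup
article.path = "/" + "2004/12/post.html"
article_link = article.to_s  # => "http://localhost:2000/2004/12/post.html"
```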

Finally, you fill in the response object, giving it an appropriate MIME type and the RSS data as a body.

Other Types of RSS

There's no rule that says RSS has to contain blog entries, or that RSS has to be dynamically generated. In fact, one of my favorite uses involves creating a flat file containing an RSS summary of the last 10 commits to a CVS repository.

Each CVS repository contains a number of control files in the special CVSROOT project. One of these files, loginfo, can be used to invoke an arbitrary program as files are stored back in the repository. Each noncomment line in the file contains a pattern and a command. If the pattern matches the directory being saved back into the repository, the corresponding command is executed. This command is passed information about the commit as parameters, and the log message supplied by users, along with a list of modified files, is passed to the command's standard input.

The Ruby program in Listing Seven is designed to be invoked by CVS as a loginfo script. It maintains a separate RSS file for each top-level repository in a site-wide directory. If this directory is accessible to a web server, the contents of this file can be disseminated over HTTP to RSS aggregators.

The CVS loginfo file that invokes this program contains a line similar to this:

ALL /usr/local/bin/ruby /path/to/commit2rss.rb %{}

This causes the Ruby script to be run every time a commit is made to the repository. The %{} parameter means that the path of the committed files is passed to the script. At the top of the script, you extract the first component of this pathname. This is the top-level CVS module being committed, and you use this to generate the name of the RSS file in the drop directory.
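The mapping from a committed path to an RSS filename is a two-line computation. In this sketch, the helper name rss_file_for, the sample path, and the drop directory are all hypothetical.

```ruby
# Derive the RSS file name from the path CVS passes to the script.
def rss_file_for(committed_path, drop_dir)
  repo = committed_path.split(%r{/})[0]   # first component: top-level module
  File.join(drop_dir, repo + ".rss")
end

rss_file_for("Bookshelf/ruby/chapters", "/var/www/rss")
# => "/var/www/rss/Bookshelf.rss"
```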

The script keeps the 10 most recent commits in the RSS file. This means that information on the latest commit must potentially be merged with any existing data. Fortunately, Ruby's RSS library can parse RSS data, so you pass it the contents of any existing RSS file. Because this fails the first time the program is run, you read the file inside an exception handler, creating a new RSS object if you can't populate one from existing data. Just to add some topical interest to the example, I've used Version 2.0 of RSS in this code.
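The "read it if it exists, otherwise start fresh" pattern looks like this in isolation. This is a sketch with a hypothetical helper, load_previous, that works on plain lines of text rather than RSS objects; it also rescues only file-system errors, whereas the bare rescue in Listing Seven catches parse errors too.

```ruby
require 'tmpdir'

# Hypothetical loader: returns the lines of an existing file, or an empty
# list when the file cannot be read (for example, on the very first run).
def load_previous(path)
  File.read(path).split("\n")
rescue SystemCallError   # Errno::ENOENT, Errno::EACCES, and similar
  []                     # nothing to merge yet: start from scratch
end

first_run = second_run = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "commits.txt")
  first_run = load_previous(path)         # file does not exist yet
  File.write(path, "commit one\ncommit two\n")
  second_run = load_previous(path)        # now there is data to merge
end
# first_run  => []
# second_run => ["commit one", "commit two"]
```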

After setting up the channel information, the code reads the CVS log message from standard input. Lines ending with a colon are headings, so I make them bold. I use the formatted log message to construct a new RSS item, which gets added to the channel. I then concatenate up to nine items from the original RSS file before writing the result back to disk. The net effect is that the RSS file now contains the new log message as the most recent item, followed by up to nine previous items. Figure 2 shows what this RSS feed looks like in the RSS aggregator, NetNewsWire. The panel on the left contains a list of RSS feeds that I read. At the top are several feeds that I generate from external sources using Ruby scripts. The CVS log summary for the Bookshelf repository is selected, and the list of recent commits is shown in the panel at the top-right. Below it is the log message for one recent commit. As authors commit changes to books, the panel updates, so I can see at a glance all activity in the repository.
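The log formatting and the keep-the-latest-ten merge can be exercised without CVS or the RSS library. In this sketch, format_log and merge_items are hypothetical helpers that mirror the corresponding steps of Listing Seven, and plain strings stand in for RSS items.

```ruby
# Turn a CVS log message into an HTML fragment. Lines such as
# "Modified Files:" are treated as headings and made bold.
def format_log(message)
  desc = String.new
  message.each_line do |line|
    line = line.chomp
    if line =~ /^[A-Z].*:\s*$/
      desc << "<p /><b>" << line << "</b>"
    else
      desc << line
    end
    desc << "<br />"
  end
  desc
end

# Put the newest entry first and keep at most `max` entries overall.
def merge_items(new_item, old_items, max = 10)
  [new_item] + old_items[0, max - 1]
end

html   = format_log("Modified Files:\n  intro.rb\n")
# html => "<p /><b>Modified Files:</b><br />  intro.rb<br />"
merged = merge_items("commit 11", (1..10).map { |n| "commit #{n}" })
# merged has 10 entries: "commit 11" followed by "commit 1" through "commit 9"
```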

RSS to HTML

It is occasionally useful to be able to go the other way—taking information from an RSS feed and converting it into HTML. For example, a portal site might want to include a side box with a list of recent headlines from a news site. Because I don't want to fetch the RSS data every time the portal page is displayed, I generate the HTML for the side box periodically, and upload it into a known location on our web server. There, it can be included in the portal's page as a server-side include. The Ruby program in Listing Eight does this. Along the way, it also illustrates several other libraries now distributed as standard with Ruby.

The open-uri library illustrates the dynamic nature of Ruby's classes. When you include the library, it modifies the behavior of Ruby's standard open() method, letting it open URIs as if they were regular files. You use this capability to open an RSS feed (in this case, from my personal blog—http://pragprog.com/pragdave/). It reads the RSS into a string, then uses the RSS library to parse it.

Ruby 1.8 comes with a simple templating system that I use to convert each of the RSS items into an HTML fragment. This templating system takes a string containing a template and a hash containing key/value pairs. It looks for instances of %name% in the string, and replaces each with the value in the hash whose key is name. It can also generate repetitive entries: if the template contains lines between START:name and END:name, then the value corresponding to name in the hash is assumed to be an array, and the lines in the template are repeated for each entry in this array. I use this capability in the program to generate a list entry for each item in the RSS feed. The map() call takes the RSS item array and returns a new array, where each entry is a hash, and each hash contains the keys title, link, and description. I then use the template library to generate HTML from this, writing the result to a temporary file. (I could have taken a different approach to solve this problem, perhaps using XSLT to transform the RSS into HTML. However, doing this work in Ruby lets me showcase a few additional Ruby libraries.)
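The substitution rules just described are simple enough to sketch in a few lines. What follows is a hand-rolled stand-in written for this illustration, not the template library's own code: the methods substitute and expand are hypothetical, unknown names are left untouched, and no HTML escaping is attempted.

```ruby
# Replace each %name% with the matching value; leave unknown names alone.
def substitute(text, values)
  text.gsub(/%(\w+)%/) { values.fetch($1) { $& }.to_s }
end

# Expand top-level names first, then repeat every START:name ... END:name
# section once per entry in the array stored under that name.
def expand(template, values)
  text = substitute(template, values)
  text.gsub(/^START:(\w+)\n(.*?)^END:\1\n/m) do
    name, section = $1, $2
    Array(values[name]).map { |entry| substitute(section, entry) }.join
  end
end

template = "<ul>\n" \
           "START:entries\n" \
           "<li><a href=\"%link%\">%title%</a></li>\n" \
           "END:entries\n" \
           "</ul>\n"

html = expand(template, 'entries' => [
  { 'title' => 'One', 'link' => 'http://example.com/1' },
  { 'title' => 'Two', 'link' => 'http://example.com/2' },
])
# html contains one <li> line per entry, wrapped in <ul> ... </ul>
```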

The last five lines in the listing use Ruby's Net::FTP library to connect to the portal web site and upload the HTML I just generated into the portal/sideboxes directory. Notice how the library makes use of a Ruby block: The FTP connection is passed in to the block as a parameter, and the block uses it to transfer the file. When the code exits the block, the connection is automatically closed.
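The open-with-a-block idiom is straightforward to build into your own classes. The sketch below uses a hypothetical DemoConnection class that only records what it is asked to send; it is not Net::FTP, but it shows how an ensure clause guarantees that the resource is closed when the block exits, even if the block raises an exception.

```ruby
class DemoConnection
  attr_reader :sent

  def initialize
    @sent = []
    @closed = false
  end

  # Create a connection, lend it to the block, and always close it.
  def self.open
    conn = new
    begin
      yield conn          # hand the live connection to the caller's block
    ensure
      conn.close          # always runs, even if the block raises
    end
  end

  def put(name)
    raise IOError, "connection closed" if @closed
    @sent << name
  end

  def close
    @closed = true
  end

  def closed?
    @closed
  end
end

conn_seen = nil
DemoConnection.open do |conn|
  conn_seen = conn
  conn.put("topfive")
end
# conn_seen.closed? is now true, and conn_seen.sent is ["topfive"]
```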

Conclusion

It has been four years since I wrote the first article on Ruby for DDJ. In that time, the language itself has been relatively stable. The libraries, however, have matured tremendously. Out of the box, Ruby's libraries make it a mature player on today's Internet. Add to that third-party frameworks such as Ruby on Rails (http://www.rubyonrails.org/), and Ruby has become a serious contender as a language for knitting together the Internet.

DDJ



Listing One
require 'webrick'
server = WEBrick::HTTPServer.new(:Port => 2000, :DocumentRoot => "html")
trap("INT") { server.shutdown }
server.start


Listing Two
class Article 
  attr_reader :file_name, :title, :body
  def initialize(file_name)
    body  = File.read(file_name)
    title = file_name
    if body =~ %r{<title.*?>(.*?)</title}m ||
       body =~ %r{<h1.*?>(.*?)</h1}m 
      title = $1
    end
    # Keep only the contents of <body>; \A.*? also drops the text before it
    body.sub!(%r{\A.*?<body.*?>(.*)</body.*}m) { $1 }
    @file_name, @title, @body = file_name, title, body
  end
end


Listing Three
def Article.list(dir)
  Dir.chdir(dir) do
    file_list = Dir.glob("**/*.html")
    sorted_list = file_list.sort_by {|name| File.stat(name).mtime }
    sorted_list.reverse[0, 10].map do |file_name|
      Article.new(file_name)
    end
  end
end


Listing Four
class BlogServlet < WEBrick::HTTPServlet::AbstractServlet
  HEAD = "<head><title>Simple Blog</title></head>"
  def do_GET(req, res)
    articles = Article.list(@server.config[:DocumentRoot])
    content  = articles.map{|a| a.body}.join("<hr />")
    res.body = "<html>#{HEAD}<body>#{content}</body></html>"
  end
end
server.mount("/blog", BlogServlet)


Listing Five
<rss version="0.91">
  <channel>
    <title>BBC News | News Front Page | World Edition</title>
    <link>
      http://news.bbc.co.uk/go/click/rss/0.91/public/-/2/hi/default.stm
    </link>
    <description>
      Updated every minute of every day - FOR PERSONAL USE ONLY
    </description>
    <language>en-gb</language>
    <lastBuildDate>Tue, 21 Sep 04 21:46:23 GMT</lastBuildDate>
    <copyright>
      Copyright: (C) British Broadcasting Corporation,
      http://news.bbc.co.uk/1/hi/help/3281849.stm
    </copyright>
    <docs>http://www.bbc.co.uk/syndication/</docs>
    <image>
      <title>BBC News</title>
      <url>
    http://news.bbc.co.uk/nol/shared/img/bbc_news_120x60.gif
      </url>
      <link>http://news.bbc.co.uk</link>
    </image>
    <!-- list of items ... -->
  </channel>
</rss>


Listing Six
require 'rss/0.9'
class RssServlet < WEBrick::HTTPServlet::AbstractServlet
  def do_GET(req, res)
    my_uri = req.request_uri
    rss = RSS::Rss.new("0.9")
    chan = RSS::Rss::Channel.new
    chan.title = "My Blog"
    my_uri.path = ""          # link back to the top-level
    chan.link = my_uri.to_s
    rss.channel = chan
    Article.list(@server.config[:DocumentRoot]).each do |article|
      item = RSS::Rss::Channel::Item.new
      my_uri.path = "/" + article.file_name
      item.link  = my_uri.to_s
      item.title = article.title
      item.description = article.body
      chan.items << item
    end
    res['Content-Type'] = "text/xml"
    res.body = rss.to_s
  end
end
server.mount("/rss", RssServlet)


Listing Seven
require 'rss/2.0'
require 'etc'
DROP_DIR = "/var/www/pragprog/data/rss" # Where we create our RSS files
MAX_TO_KEEP = 10 # Max entries in the RSS file
# Just need the top-level project name
dir = ARGV.shift
repo = dir.split(%r{/})[0]
FILENAME = File.join(DROP_DIR, repo + ".rss")
# Read in existing rss (we'll write it out again to the new file)
begin
  existing_data = RSS::Parser.parse(File.read(FILENAME), false)
rescue
  existing_data = RSS::Rss.new("2.0")
end
# write out the new item and up to 9 old items
rss = RSS::Rss.new("2.0")
chan = RSS::Rss::Channel.new
chan.title = chan.description = "Commit Summary: #{repo}"
rss.channel = chan
# Read in the loginfo msg and highlight the headings
desc = ""
while line = STDIN.gets
  line.chomp!
  if line =~ /^[A-Z].*:\s*$/
    desc << "<p /><b>" << line << "</b>"
  else
    desc << line
  end
  desc << "<br />"
end
# new top item is current loginfo
item = RSS::Rss::Channel::Item.new
item.title = Time.now.strftime("%b %d, %H:%M") + " " + dir
item.pubDate = Time.now
item.description = desc
chan.items << item
# Then up to `n' items from old data
chan.items.concat existing_data.items[0, MAX_TO_KEEP-1]

File.open(FILENAME, "w") {|f| f.puts(rss.to_s) }


Listing Eight
# Suck down the top 'n' articles from an RSS feed and summarize for a web site
require 'open-uri'
require 'rss/0.9'
require 'rdoc/template'
require 'net/ftp'

TMP_FILE = "/tmp/topfive"
BLOG_URL = 'http://pragprog.com/pragdave/synopsis.rss?count=5'

TEMPLATE = %{
<ul>
START:entries
<li><a href="%link%">%title%</a>%description%</li>
END:entries
</ul>
}
open(BLOG_URL) do |http|
  result = RSS::Parser.parse(http.read, false)
  # Convert an array of RSS items into an array
  # of hashes
  entries = result.items.map do |item|
    { 
      'title'       => item.title,
      'link'        => item.link,
      'description' => item.description
    }
  end
  File.open(TMP_FILE, "w") do |f|
    t = TemplatePage.new(TEMPLATE)
    t.write_html_on(f, 'entries' => entries)
  end
end
Net::FTP.open('www.pragprog.com') do |ftp|
  ftp.login('username', 'password')
  ftp.chdir('portal/sideboxes')
  ftp.put(TMP_FILE, 'topfive', 1024)
end