The Perl Journal August 2003
In my last article, I mentioned that I maintain a blosxom-based weblog. blosxom is a very clever bit of Perl written by Rael Dornfest, and is designed to be simple, self-contained, and easy to get up and running. I like its simplicity, I like the way it uses flat text files, I like the basic idea of it, but there are some things about it I don't actually like. So after my last article got me thinking about blogging, I decided the time was right to write my own blog software. That weekend, I heard someone speak about "blossoms and briars," and a new project was born.
A lot of what I've been doing recently might be categorized as reinventing wheels: first the Email:: project, and now this. But sometimes, it's necessary to reinvent wheels. If we didn't, to paraphrase Gary Burgquist, this would be The Cobol Journal, we'd have leeches all over our bodies, and we'd be listening to just 8-track tapes. Wheel reinvention is the only way we get smoother wheels. But we don't do so lightly and we don't do it just for the sake of it. When we're reinventing wheels, it's important to think about what problems we're trying to solve and how we're going to improve on the existing technology.
So, while the basic ideas of blosxom were great, it had two big problems. First, the code is monolithic, difficult to follow, and difficult to extend; if, for instance, you want to shift from flat files to a database-backed storage system, you're basically out of luck and reduced to tearing up major parts of the program. Second, because the software is self-contained, the templates for critical parts of the output live in a DATA section at the end of blosxom.cgi (at least for the 1.x series of blosxom; I'm told things get slightly better in the new 2.x series, but I was already fed up with blosxom by this point). Editing these to customize the output can be awkward, and having done so makes upgrading difficult.
It's also important to realize that these things aren't defects in blosxom at all; they come about because blosxom is conforming to a particular set of design goals, and they are goals that I agree with and think are suitable for most people. Keep it simple, self-contained, easy to install, and easy to use. But I also felt that I had outgrown the boundaries of what I could do with blosxom, and so it was time for something new.
Let's first look at a user's perspective on the result of all this deliberation, and then we'll take a look at how some of it was done; in the next article, we'll take a deeper look at the construction of Bryar, and how it can be extended. Along the way, we'll learn a little about blogging with Bryar, program design, finding files with File::Find::Rule, and efficient list handling.
Installing Bryar is relatively simple. Not quite as simple as installing blosxom, but close. First, we have to grab all the modules we need from CPAN, and the best way to do this is with the CPANPLUS Perl module:
perl -MCPANPLUS -e 'install Bryar'
Once you've got CPANPLUS up and running (an essential component of any serious Perl site), the aforementioned command should install the Bryar code and its dependencies, and it should also tell you at some point:
You probably want to run bryar-newblog in a likely home for your blog once we've finished installing.
Let's do that now.
# mkdir -p /opt/blog
# chown simon:simon /opt/blog
% cd /opt/blog
% bryar-newblog
Setting up a Bryar blog in this directory
Done. Now you want to probably customize 'bryar.conf'. You should probably also customize template.html, head.html and foot.html Then point your browser at bryar.cgi, and get blogging!
Now, just as with blosxom, we need to tell the web server to serve up that directory and treat bryar.cgi as a CGI script and the directory's index. Once we've done that, we can point our browser at the relevant location (perhaps http://localhost/blog/) and we should see something like Figure 1.
It's not pretty, but it works. As the message says, we should probably customize template.html, head.html, and foot.html to make it a bit prettier. Figure 2 shows what my blog looks like after a tiny bit of customization, deliberately constructed to make it very similar to blosxom's default format. By default, Bryar uses Template Toolkit to format posts, so regular readers of this column should know how to deal with the HTML files generated by bryar-newblog.
To make blog posts, we simply add files called something.txt in our data directory:
Title
<P>HTML goes here</P>
Bryar automatically picks these up, sorts them in date order, and formats them appropriately. In fact, I use a little script to ensure that I have a unique post ID every time I want to make a blog entry; here's my bin/blog command:
#!/usr/bin/perl
my $path = "/opt/blog/".shift()."/";
$blog[$_]++ for map { /(\d+)\.txt/&&$1} <$path/*txt>;
system("vim",$path.@blog.".txt");
The first line of code decides where to look for posts. The second is a sneaky way to find a new post number. My posts are all entered numerically (6065.txt, and so on). This line looks at all of the ".txt" files and builds an array of currently existing posts; if the highest numbered post we have currently is 1234.txt, it will set $blog[1234] to 1. It'll also assign to a bunch of other elements of that array, but we don't care about those; we only care about the highest number in existence at the moment.
If $blog[1234] is the highest numbered element, @blog itself will have 1235 elements, and thanks to the wonderful Mr. Cantor and his diagonal theorem, post 1235 is guaranteed not to be in use at the moment. So we start an editor on /opt/blog/1235.txt, a brand new blog post.
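The array-length trick is easier to see in isolation. The following is a minimal sketch rather than part of Bryar: next_post_id is a hypothetical helper, and the filenames are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: given a list of filenames, return an unused post number.
sub next_post_id {
    my @names = @_;
    my @blog;
    # Mark every number that appears in a "<digits>.txt" filename.
    for my $name (@names) {
        $blog[$1]++ if $name =~ /(\d+)\.txt\z/;
    }
    # An array whose highest index is N has N+1 elements, so the
    # element count is one more than the highest post number seen.
    return scalar @blog;
}

print next_post_id("6064.txt", "6065.txt", "notes.txt"), "\n";   # prints 6066
print next_post_id(), "\n";                                      # prints 0
```

Unlike the one-liner, this version skips files that don't match instead of bumping element zero, but the answer is the same: the size of the array is always one past the highest post number.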
For those of you who have embraced the RSS generation, Bryar can also generate an RDF file by adding "/xml" to the end of the URL. The POD documentation to Bryar.pm details all the other things that can be done with Bryar URLs.
Now that we have an idea of what Bryar does, let's have a look at how it does it.
I designed Bryar to have four major areas of operation; it turns out that there are possibly some more things it needs to do, but these can be worked out over time. By splitting Bryar's operation into these distinct areas, we make it highly customizable by letting people replace any or all of the classes that implement these operations.
The four things that Bryar has to do can be described as "interface," "retrieval," "collation," and "formatting." But since I didn't think in those terms when I was designing the code, we'll call them the Frontend, DataSource, Collector, and Renderer classes.
Frontend: The Frontend class deals with the interaction between Bryar and the outside world. As with blosxom, the primary interface mechanism is through the Common Gateway Interface; the URL and other options are determined from the execution environment, the final output is written to standard output, and so on. We can equally conceive of a component which implements Bryar::Frontend as a mod_perl handler, for instance, or as a stand-alone program taking options from the command line.
DataSource: Blog entries have to come from somewhere; the DataSource class finds entries, optionally only those fulfilling particular criteria, and turns them into a set of Bryar::Document objects that abstract postings away from the data source and into a common interface. Since we're emulating blosxom, the default DataSource class reads all the .txt files in a given data directory, and uses filesystem properties such as the last-modified time and the owner as post metadata. We could also take blog entries from a database, and we'll show an example of this in the next article.
Collector: The Collector class has the job of interpreting the options obtained from the Frontend, turning them into a search query, and asking the DataSource class for all the documents that fulfill the query. For instance, we might return the last 20 posts (the default operation), the posts in a particular month or day, or posts containing a given phrase. This is possibly the most "stationary" class in Bryar, since it's difficult to imagine a good reason for changing the default behavior.
Renderer: Once the Collector has decided on a set of Bryar::Document objects to be displayed, they are handed off to the Renderer class. As mentioned earlier, we use the Template Toolkit by default, but there's no reason why we couldn't create Renderer classes that use HTML::Template or HTML::Mason.
Bryar's operation can be summarized in the flowchart shown in Figure 3.
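To make the division of labor concrete, here is a toy sketch of the four roles wired together. None of these are Bryar's real classes or method names: the Toy:: packages, the "/id_42" URL shape, and the hash-based posts are all assumptions made purely for illustration.

```perl
use strict;
use warnings;

# Toy stand-ins for the four roles; none of these are Bryar's real classes.
package Toy::Frontend;
# Turn a request path such as "/id_42" into search options.
sub parse { my ($class, $path) = @_; return $path =~ m{/id_(\d+)} ? (id => $1) : () }

package Toy::DataSource;
my @posts = ( { id => 41, title => "Older" }, { id => 42, title => "Newer" } );
# Return the posts matching the options (all of them if no id is given).
sub search {
    my ($class, %opt) = @_;
    return grep { !defined $opt{id} || $_->{id} == $opt{id} } @posts;
}

package Toy::Collector;
# Ask whichever data source we were given for the matching posts.
sub collect { my ($class, $source, %opt) = @_; return $source->search(%opt) }

package Toy::Renderer;
# Turn the collected posts into (very plain) HTML.
sub render { my ($class, @docs) = @_; return join "", map { "<h2>$_->{title}</h2>\n" } @docs }

package main;

# Each stage only talks to the next through a method call, so any
# class name here could be swapped for another implementation.
my %opt  = Toy::Frontend->parse("/id_42");
my @docs = Toy::Collector->collect("Toy::DataSource", %opt);
print Toy::Renderer->render(@docs);    # prints <h2>Newer</h2>
```

The point of the sketch is only that the Collector never needs to know whether its data source is a directory of text files or a database, and the Renderer never needs to know where the documents came from.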
Other satellite classes that turned out to be useful were Bryar::Config, which encapsulates the configuration, and Bryar::Comment, a subclass of Bryar::Document used for encapsulating comments on a blog posting.
We'll take a detailed look at two of these areas, those that are most likely to be customized: the data source and the front end.
As I've mentioned, the job of the DataSource class is to turn our raw data into Bryar::Document objects. There are three methods we need to provide in order to do this: all_documents should return absolutely everything, search should return those documents matching specified search criteria, and add_comment should record a comment against an article.
As it turns out, if we implement all_documents, we don't need to implement search but can instead inherit from Bryar::DataSource::Base. This base class provides a very dumb search facility that looks at every single Bryar::Document and sorts out those that match the search terms. This is very inefficient, though, since it's usually faster to do some kind of searching at the data-source level; for instance, if our posts are stored in a SQL database, we might as well use the capabilities of SQL SELECT to find posts within a certain time period or for a particular ID, rather than plough through individual posts.
Similarly, if we implement search, we don't actually have to implement all_documents, as nothing in Bryar calls it (yet). On the other hand, it may be useful to do so, both for completeness and as a warm-up for writing search. It's also conceivable that, in the future, one might add extensions to, say, format an entire journal for printing as a PDF file.
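The "dumb" fallback search is simple enough to sketch. What follows is an assumption about the general shape of such a base class, not the code of Bryar::DataSource::Base: the Toy:: package names are hypothetical, plain hashes stand in for Bryar::Document objects, and the Bryar object argument is left out for brevity. The criteria names follow the id, since, and before parameters of the search method.

```perl
use strict;
use warnings;

# Hypothetical base class: search() is written purely in terms of all_documents().
package Toy::DataSource::Base;

sub search {
    my ($class, %params) = @_;
    my @docs = $class->all_documents;
    # Filter in memory, one criterion at a time.
    @docs = grep { $_->{id} eq $params{id} }       @docs if defined $params{id};
    @docs = grep { $_->{epoch} > $params{since} }  @docs if defined $params{since};
    @docs = grep { $_->{epoch} < $params{before} } @docs if defined $params{before};
    return @docs;
}

# A subclass only has to say where the documents come from.
package Toy::DataSource::Memory;
our @ISA = ('Toy::DataSource::Base');

sub all_documents {
    return (
        { id => 1, epoch => 1_000 },
        { id => 2, epoch => 2_000 },
        { id => 3, epoch => 3_000 },
    );
}

package main;

my @hits = Toy::DataSource::Memory->search(since => 1_500, before => 3_500);
print join(",", map { $_->{id} } @hits), "\n";    # prints 2,3
```

Every document is built before any is rejected, which is exactly the cost a data source avoids by overriding search with something that filters at the source.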
Let's start by looking at all_documents in our flat file data source. With the obscene comments removed, this looks like:
# Assumed to be loaded at the top of the module:
use Carp;               # croak
use Cwd;                # cwd
use File::Find::Rule;

sub all_documents {
    my ($self, $bryar) = @_;
    croak "Must pass in a Bryar object"
        unless UNIVERSAL::isa($bryar, "Bryar");
    my $where = cwd;
    chdir($bryar->{config}->datadir);
    my @docs = map { $self->make_document($_) }
               File::Find::Rule->file()
                               ->name("*.txt")
                               ->maxdepth($bryar->{config}->depth)
                               ->in(".");
    chdir($where);
    return @docs;
}
We're called as a class method and passed a Bryar object; this stores the configuration and current state of the blog, and essentially draws everything together. First, we need to determine where our data files live; this is stored in the Bryar config, which we access via $bryar->{config}. Once we've found this, we change to that directory and start looking for files. I've used Richard Clamp's wonderful File::Find::Rule module which, as we'll see later, turns out to be almost purpose built for what we're about to do.
File::Find::Rule is a nicer way of finding files than the usual File::Find. The idea is that we chain together rules that describe what we're looking for, and finally tell it where to look. This acts a little like the UNIX find(1) command. So in this case, we look for all the *.txt files in the current directory and subdirectories below, up to a maximum depth specified in the blog configuration. This allows us to have entries categorized into subdirectories included in the blog. We call the helper function make_document on each one, which does the heavy work of turning a filename into a Bryar::Document object, and we return all these results.
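For comparison, here is roughly what the same search costs with the core File::Find module. This is a sketch under stated assumptions, not code from Bryar: find_txt_files is a hypothetical helper, the depth counting is my own reading of what a maximum depth should mean (1 being files directly inside the starting directory, as with find -maxdepth), and the directory layout is created in a temporary directory just for the demonstration.

```perl
use strict;
use warnings;
use File::Find ();
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: return the *.txt files under $dir, descending at most
# $maxdepth levels (1 means "files directly inside $dir").
sub find_txt_files {
    my ($dir, $maxdepth) = @_;
    my @found;
    File::Find::find({
        no_chdir => 1,    # keep $_ as the full path
        wanted   => sub {
            my $rel   = File::Spec->abs2rel($_, $dir);
            my @parts = $rel eq '.' ? () : File::Spec->splitdir($rel);
            my $depth = @parts;
            # Do not descend into directories that are already too deep.
            $File::Find::prune = 1 if -d $_ && $depth >= $maxdepth;
            push @found, $_ if -f $_ && $depth <= $maxdepth && /\.txt\z/;
        },
    }, $dir);
    return sort @found;
}

# Build a small throwaway blog layout to search.
my $dir = tempdir(CLEANUP => 1);
mkdir "$dir/perl"      or die $!;
mkdir "$dir/perl/deep" or die $!;
for my $file ("1.txt", "notes.html", "perl/2.txt", "perl/deep/3.txt") {
    open my $fh, '>', "$dir/$file" or die $!;
    close $fh;
}

my @hits = find_txt_files($dir, 2);
print scalar(@hits), "\n";    # prints 2 (1.txt and perl/2.txt)
```

All of the bookkeeping in the wanted callback is what the chained file, name, and maxdepth rules express declaratively.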
make_document is fairly uninteresting, being concerned with extracting metadata from a file; we saw in my last article how to do this for blosxom-style files. Now we're warmed up and ready to go, let's move on to the more interesting search method, which stretches File::Find::Rule a little more.
The search method takes a Bryar object as before, but it also takes a hash of things to look for. In particular, id finds a document with a particular ID; this is used to provide permanent links for articles, especially in the case of articles referenced from an RDF feed. since and before look for documents after and before particular UNIX epoch times; content finds documents containing a particular word, and subblog looks for documents in a given sub-blog or category. Finally, the limit parameter is used to set a maximum number of entries to return. For instance, by default, the front page of a blog will show the 20 most recent articles.
Thanks to File::Find::Rule, we can actually construct a search that covers all but one of these search terms in one single call. Naturally, the more specific we can make a single search, the fewer passes over the data we need to do, and therefore, the more efficient our data source class turns out to be.
The first few lines of search should be reasonably obvious:
sub search {
    my ($self, $bryar, %params) = @_;
    croak "Must pass in a Bryar object"
        unless UNIVERSAL::isa($bryar, "Bryar");
    my $was = cwd;
    my $where = $bryar->{config}->datadir."/";
    if ($params{subblog}) { $where .= $params{subblog}; }
    chdir($where);
We find our data directory as before, and this time, if we're given a subdirectory name to look in, we start searching from there instead of the document root.
Now we start putting our query together. The most obvious thing to look for is the document ID. If we're trying to find blog post number 89, then it's quite dull to look for "*.txt" and extract "89.txt"; we might as well just go straight for "89.txt":
my $find = File::Find::Rule->file();
if ($params{id}) { $find->name("$params{id}.txt") }
else { $find->name("*.txt") }
As before, we restrict our search to a maximum depth:
$find->maxdepth($bryar->{config}->depth);
And now the clever stuff starts. File::Find::Rule allows us to specify bounds for the last-modified time of a file, something that will help us to find entries within a given period:
if ($params{since}) { $find->mtime(">".$params{since}) }
if ($params{before}) { $find->mtime("<".$params{before}) }
Notice what's happening here: we're simply modifying the $find object by adding constraints to it. It isn't hitting the filesystem at all yet; it's simply building up a data structure that determines how to perform the search.
Finally, we can look for items that contain a particular word or phrase; File::Find::Rule allows us to grep through the contents of a file using the aptly named grep method:
if ($params{content}) { $find->grep(qr/\b\Q$params{content}\E\b/i) }
Are you comfortable with that regular expression? It says that we're looking for a word break, then the literal contents of $params{content}, then another word break. The word breaks are there to ensure that a search for "pie" only finds articles that talk about "pie," and not those which talk about being "occupied" or similar; the \Q and \E make sure that we don't treat $params{content} as a regular expression, but rather as literal text.
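If you'd like to convince yourself, the pattern can be tried outside Bryar. This is a small sketch; word_pattern is a hypothetical helper and the sample sentences are invented.

```perl
use strict;
use warnings;

# Build the same kind of pattern as the search method: word break,
# the search term taken literally, word break, case-insensitive.
sub word_pattern {
    my ($term) = @_;
    return qr/\b\Q$term\E\b/i;
}

my $pie = word_pattern("pie");
print "match\n"    if "I had Pie for lunch"   =~ $pie;   # whole word, any case
print "no match\n" if "The seat was occupied" !~ $pie;   # "pie" inside a word

# Without \Q...\E the dot would match any character.
my $literal = word_pattern("v1.0");
print "match\n"    if "released v1.0 today" =~ $literal;
print "no match\n" if "released v1x0 today" !~ $literal;
```

One consequence worth knowing: because \b only matches between a word character and a non-word character, a term that ends in punctuation, such as "C++", will not match when it is followed by a space.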
Why don't we let the users search using regular expressions? Although this would undeniably be powerful, we want to keep this relatively simple. Some data sources, such as a SQL back end, won't support searching by regular expression, so we deliberately restrict the search to make it as efficient as possible.
Now we are ready to actually launch our search and grovel around the filesystem. The only search term we have not dealt with is limit, restricting the number of documents returned. The problem with this is that we can't force File::Find::Rule to return the results in any particular order. When we want at most 20 blog entries, we actually want the most recent 20, not just the first 20 we come across on the disk; these may not turn out to be the same. So, to implement limit, we need to do a little trickery. Let's first dispatch the case where there is no limit given:
if (!$params{limit}) {
    @docs = map { $self->make_document($_) } $find->in(".");
}
We simply start the search in the current directory, turn all the found files into documents, and we're done. That was easy. When there's a limit, we have to be a bit more careful:
@docs = map { $self->make_document($_) }
        (
            map  { $_->[0] }
            sort { $b->[1] <=> $a->[1] }
            map  { [$_, ((stat $_)[9]) ] }
            $find->in(".")
        ) [0..$params{limit}-1];
What's going on here? Well, the core of it is the same as the unlimited case: we find the files in the current directory and turn some of them into Bryar::Documents. In the middle, though, we first have a well-known Perl sort technique, the Schwartzian transform:
@sorted_files =
    map  { $_->[0] }
    sort { $b->[1] <=> $a->[1] }
    map  { [$_, ((stat $_)[9]) ] }
    @files;
This is an efficient way to sort a list of files newest-first. We could, of course, just say this:
@sorted_files = sort {
    my $a_modification = (stat $a)[9];
    my $b_modification = (stat $b)[9];
    $b_modification <=> $a_modification;
} @files;
Unfortunately, this makes two stat calls for every comparison, and every student of algorithms knows that Quicksort uses roughly n log n comparisons. This means that, for 100 files, we have about 1000 system calls. This isn't great. Instead, we use the Schwartzian transform to reduce that to only 100 stats, one per file; we can't get much better than that. For more details of how the Schwartzian works, see http://www.cpan.org/doc/FMTEYEWTK/sort.html.
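The difference is easy to measure. In this sketch a counting lookup (fake_mtime, a hypothetical stand-in for stat) replaces the real system call, and the 100 filenames and timestamps are invented, so the numbers show the shape of the saving rather than anything about a real filesystem.

```perl
use strict;
use warnings;

# Stand-in for stat: a lookup that counts how often it is called.
# The filenames and timestamps are invented for the example.
my %mtime = map { ("$_.txt" => $_ * 60) } 1 .. 100;
my $calls = 0;
sub fake_mtime { $calls++; return $mtime{ $_[0] } }

my @files = sort keys %mtime;    # string order, not time order

# Naive sort: two lookups per comparison.
$calls = 0;
my @naive = sort { fake_mtime($b) <=> fake_mtime($a) } @files;
my $naive_calls = $calls;

# Schwartzian transform: one lookup per file, then compare cached values.
$calls = 0;
my @st = map  { $_->[0] }
         sort { $b->[1] <=> $a->[1] }
         map  { [ $_, fake_mtime($_) ] }
         @files;
my $st_calls = $calls;

print "naive: $naive_calls lookups, transform: $st_calls lookups\n";
print "newest first: $st[0]\n";    # prints newest first: 100.txt
```

The transform always performs exactly one lookup per file; the naive count depends on how many comparisons the sort happens to make for a given input, but it can never be fewer than two per comparison.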
Now that we have the sorted list of filenames, we use a slice to pull out the required number, and then pass these on to make_document:
@docs = map { $self->make_document($_) }
        @sorted_files[0..$params{limit}-1];
However, as we've seen, most of the heavy lifting has already been done by File::Find::Rule, so this turns out to be a reasonably efficient search routine, given that we're using a filesystem-based data source.
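One detail of slicing is worth a quick experiment: an array slice returns one element per index, so asking for 20 entries from a blog with only three posts produces 17 undefined values. Whether that matters in Bryar depends on how make_document treats an undefined filename, which we haven't looked at here; the sketch below simply shows one way to clamp the slice. first_n is a hypothetical helper, not part of Bryar.

```perl
use strict;
use warnings;

# Hypothetical helper: return at most $limit items from the front of a list.
sub first_n {
    my ($limit, @items) = @_;
    return @items if !$limit || $limit >= @items;
    return @items[ 0 .. $limit - 1 ];
}

my @sorted_files = ("3.txt", "2.txt", "1.txt");    # already newest-first

my @padded  = @sorted_files[ 0 .. 19 ];            # plain slice past the end
my @clamped = first_n(20, @sorted_files);

print scalar(@padded),  "\n";    # prints 20 (17 of them undef)
print scalar(@clamped), "\n";    # prints 3
```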
We've said that Bryar::DataSource::FlatFile is reasonably efficient. But of course, we can make the whole process of retrieving posts even more efficient. In our next article, we'll look at how we can implement the same DataSource interface with a SQL back end using Class::DBI to retrieve posts from a database, and how this affects the way we can search for posts.
We'll also look at extending some of the other parts of Bryar. It's a tribute to Bryar's component-based design that we'll be able to turn it from a CGI-based flat-file blog tool using the Template Toolkit to a database-backed Apache mod_perl blog formatted by HTML::Mason. Tune in next time to find out how we do it, and in the meantime, happy blogging!
TPJ