Really Lazy Persistence

The Perl Journal November 2002

By Simon Cozens

Simon is a freelance programmer and author, whose titles include Beginning Perl and Extending and Embedding Perl. He's the creator of over 30 CPAN modules, a former Parrot pumpking, and an obsessive player of the Japanese game of Go. Simon can be reached at simon@simon-cozens.org.

It comes up fantastically often—you've got some piece of data stored in a variable; maybe a cache or a hash full of CGI session data. You want the data in the variable to be available next time you run your Perl program. This is the well-known "data persistence problem."

Thankfully, the data persistence problem has a huge number of solutions; if you want to make lots and lots of data persistent, you can look at SQL-backed object databases or Perl modules such as Tangram, Alzabo, Pixie, DBIx::SearchBuilder, and the like. Or for the much more common small cases where you don't need a full-blown RDBMS to store your data, you can use one of the DBM libraries.

DBM is an ancient UNIX database file format, which stores key-value pairs, much like hashes. There are a bunch of libraries around that implement DBM: The Berkeley database from Sleepycat, DB_File, is the most flexible and reliable, but NDBM and GDBM are two others.

So, we get out DB_File, choose some filenames, tie the variables we need tied, and that's it, right? Problem solved.

Well, not exactly, because not everyone has DB_File installed; so there's a proxy module called "AnyDBM_File" that tests to see which DBM libraries are actually available. Great.

Actually, there's another snag. One limitation of the DBM format is that it doesn't understand complex data structures—it only stores keys and values as strings. No big deal—the multilevel DBM (MLDBM) Perl module sits between Perl and the DBM library, and automatically serializes and unserializes references, making it possible for you to use data structures in your persistent variables.
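To make the serialization step concrete, here is a minimal sketch of the work a layer like MLDBM has to do on every store and fetch. It is not MLDBM's own code: it assumes the core SDBM_File module as the DBM back end and Storable's freeze and thaw as the serializer, and the file name and session data are made up for illustration.

```perl
use strict;
use warnings;
use Fcntl;                                  # O_RDWR, O_CREAT
use SDBM_File;                              # core DBM library, used as a stand-in
use Storable qw(freeze thaw);               # assumed serializer
use File::Temp qw(tempdir);
use File::Spec::Functions qw(catfile);

my $dir  = tempdir(CLEANUP => 1);
my $file = catfile($dir, "sessions");       # hypothetical database name

{
    # First "run": flatten the nested structure into a string by hand,
    # because the DBM file can only hold string values.
    # (SDBM limits each key/value pair to about 1 KB.)
    tie my %db, 'SDBM_File', $file, O_RDWR | O_CREAT, 0666
        or die "Couldn't tie $file: $!";
    $db{alice} = freeze({ visits => 3, pages => [ "index", "about" ] });
    untie %db;
}

# Second "run": read the string back and rebuild the structure.
tie my %db, 'SDBM_File', $file, O_RDWR | O_CREAT, 0666
    or die "Couldn't tie $file: $!";
my $session = thaw($db{alice});
untie %db;

print "$session->{visits} visits, last page: $session->{pages}[-1]\n";
```

MLDBM hides the freeze and thaw calls behind the tie, which is why the rest of the program can treat the values as ordinary references.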

The Real Problem

Although we've found a workable solution, the upshot of all this is that in order to get a hash to persist, we need to think up a filename, find a safe place to put the file, load the AnyDBM_File library, load MLDBM, remember the syntax to tie, and so on; you end up with code looking like this:

# Prefer DB_File; this must be set before AnyDBM_File is loaded
BEGIN { @AnyDBM_File::ISA = qw(DB_File GDBM_File NDBM_File) }
use AnyDBM_File;
use MLDBM qw(AnyDBM_File);
use File::Spec::Functions qw(catfile tmpdir);
my (%hash, @array);

# Must remember to make sure nobody else is using
# these file names!
my $hash_file = catfile(tmpdir(), "hashdatabase");
my $array_file = catfile(tmpdir(), "arraydatabase");

tie %hash, "MLDBM", $hash_file;
tie @array, "MLDBM", $array_file;

That's nine lines of real code just to make two variables persist; and we still have to make sure that no other application decides to choose the same names for its database files. What happened to making easy things easy and hard things possible?

This is the real data persistence problem: We have a solution, but it forces us to do a lot of work, and we'd rather be lazy.

The Solution

Enter Attribute::Persistent. Here's the same code as above, written to use Attribute::Persistent.

     use Attribute::Persistent;
     my (%hash, @array) :persistent;

As if that wasn't enough, you get an additional guarantee that no other application will corrupt your files. Isn't that more like what we should expect from Perl?

So, you're probably wondering how this works and what it really does. We'll start by looking at Perl's attributes system, where we'll find a story that reflects the "real" data persistence problem.

A Bit About Attributes

Attributes appeared in Perl 5.005 via the now-deprecated attrs module. These let you specify attributes on a subroutine, although they were restricted to "locked," which put an exclusive lock around the sub in threaded Perl, and "method." Perl 5.005 attributes also had the following ugly syntax:

sub blast {
    use attrs qw(locked method);
    ...
}

5.6 tidied this up a little, to the now-familiar syntax:

sub blast :locked :method {

}

and added user-definable attributes and the ability to apply attributes to variables as well as subroutines.

Unfortunately, the mechanism for doing anything with attributes is rather obscure. If you say

my $pencil :color(blue);

then a call is made to a class method called MODIFY_SCALAR_ATTRIBUTES in the current package. This method receives a reference to the variable or subroutine that is having an attribute applied, and then all the attribute names in full; it's then expected to return the names of the attributes it couldn't do anything with.

So you ended up with something like

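sub UNIVERSAL::MODIFY_SCALAR_ATTRIBUTES {
    my ($self, $ref, @attribs) = @_;
    my @unknown;

    for (@attribs) {
        # Capture the argument so that $1 holds the color name
        if (/^color\((\w+)\)$/) {
            tie $$ref, "String::Colored", $1;
        } else {
            push @unknown, $_;
        }
    }

    # Anything returned here is reported as an invalid attribute
    return @unknown;
}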

This is OK, but it's not very friendly; it's especially problematic if you want to handle more than one attribute name in your class, or indeed if you want several modules to each chip in a bunch of attributes they can handle.
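The raw mechanism can be tried without any helper module. The following is a minimal sketch rather than code from any CPAN module: the %color_of hash and the :Shiny attribute are made up for illustration, and the attribute is spelled Color because perl can warn that all-lowercase package attributes may clash with future reserved words.

```perl
use strict;
use warnings;

my %color_of;   # hypothetical registry: variable reference => color

# Called by perl for every attribute list on a scalar declared in package main.
sub MODIFY_SCALAR_ATTRIBUTES {
    my ($package, $ref, @attribs) = @_;
    my @unknown;

    for my $attr (@attribs) {
        if ($attr =~ /^Color\((\w+)\)$/) {
            $color_of{$ref} = $1;       # we understand this one
        } else {
            push @unknown, $attr;       # hand it back as unrecognized
        }
    }
    return @unknown;
}

my $pencil :Color(blue);
my $color = $color_of{ \$pencil };
print "pencil is $color\n";

# An attribute the handler returns as unknown makes the declaration die.
my $ok    = eval q{ my $pen :Shiny; 1 };
my $error = $@;
print "rejected: $error" unless $ok;
```

Every attribute name has to be matched and parsed by hand inside this one method, which is the unfriendliness described above.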

So Damian Conway produced a module called Attribute::Handlers, which makes the process a little more transparent. Now all you need to do is create a subroutine with the same name as the attribute you're trying to define, and declare that this is an attribute handler. How do you declare that? With an attribute, naturally!

use Attribute::Handlers;
sub color :ATTR(SCALAR) {
    my ($self, $symbol, $referent, $attr, $data, $phase) = @_;
    tie $$referent, "String::Colored", $data;
}

As you can see, this is much easier to deal with, and you also get a lot more information about the attribute: the symbol table entry (typeglob) and the reference that are having the attribute applied; the attribute name, any data ("blue" in our example) passed to the attribute, and at what stage of Perl's parsing of the program the attribute was applied.
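Here is a self-contained sketch of such a handler in action. The My::Colored class is a hypothetical stand-in for String::Colored, the @calls array exists only so the handler's arguments can be inspected afterwards, and the attribute is capitalized to sidestep perl's warning about lowercase attribute names. The sketch also assumes that $data may arrive either as a plain string or as an array reference, depending on the Attribute::Handlers version.

```perl
use strict;
use warnings;
use Attribute::Handlers;

# Hypothetical tie class: prefixes every fetched value with its color.
{
    package My::Colored;
    sub TIESCALAR { my ($class, $color) = @_; return bless { color => $color }, $class }
    sub STORE     { my ($self, $value) = @_; $self->{value} = $value }
    sub FETCH     { my ($self) = @_; return "[$self->{color}] $self->{value}" }
}

my @calls;   # record of what the handler was told

sub Color :ATTR(SCALAR) {
    my ($package, $symbol, $referent, $attr, $data, $phase) = @_;

    # Cope with both a bare string and an array reference of arguments.
    my $color = ref $data eq 'ARRAY' ? $data->[0] : $data;

    push @calls, { package => $package, attr => $attr, color => $color };
    tie $$referent, 'My::Colored', $color;
}

my $pencil :Color(blue);
$pencil = "2B";

print "$pencil\n";
print "$calls[0]{attr} was applied in package $calls[0]{package}\n";
```

The handler never has to parse the attribute string itself; the name and its argument arrive already separated in $attr and $data.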

Since we have a typeglob, we can even get at the variable's name, using a little-known typeglob trick:

my $varname = *{$symbol}{NAME};
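The trick can be tried in isolation on any package variable. In this small sketch the variable $pencil and the glob reference are set up by hand, where a real handler would receive the reference as $symbol:

```perl
use strict;
use warnings;

our $pencil = "2B";

# A reference to the symbol table entry that holds $pencil.
my $symbol = \*pencil;

# Globs know their own name and the package they live in.
my $varname = *{$symbol}{NAME};
my $package = *{$symbol}{PACKAGE};

printf "%s::%s\n", $package, $varname;   # prints "main::pencil"
```

Only package variables have a glob to ask; a variable declared with my has no symbol table entry at all.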

Of course, using an attribute to tie a variable, as this code does, is not an uncommon idea, and so Attribute::Handlers gets even friendlier:

use Attribute::Handlers autotie => { color => "String::Colored" };

All this infrastructure has made it very easy to implement a load of cool, attribute-based modules: Attribute::Memoize, Attribute::Types, and Attribute::Util being the most interesting three.

But not Attribute::Persistent; that had yet another problem. I wanted Attribute::Persistent to choose names for the databases that were related to the name of the variable being made persistent. Unfortunately, even though Attribute::Handlers provides a lot of information to developers, it can't provide the variable name for lexical variables—they don't live in a symbol table, so we can't use the *glob{NAME} trick on the symbol table entry to find the variable name.

To get around this obstacle, I used another Damian module—Attribute::Handlers::Prospective. This used a source filter to pass on more information about the attribute, including—thankfully for me—even the name of lexicals.

The Finished Product

So now we can take a peek inside Attribute::Persistent and see how it was done.

The first few lines should not be any surprise to anyone.

package Attribute::Persistent;
use strict;
our $VERSION = "1.0";

Now we needed a means to identify the user's program uniquely, so that other programs on the same system using Attribute::Persistent don't stomp over its databases. We call this the key.

my $key;

You might think that we should generate this from the filename; I thought that too, but then I realized that it's quite likely to have two different foo.pls lying around. So if possible, we take an MD5 sum of the source file, which reduces the chance of an accidental collision to about 1 in 2^128 (an MD5 digest is 128 bits long); if MD5 collides, we have bigger problems anyway.

require Digest::MD5;
local *IN;
if (-e $0 and open IN, $0) {
    local $/;
    my $x = <IN>;
    $key = Digest::MD5::md5_hex($x);
    close IN;
} else {
    $key = "Persistent$0";
}
1;

This means, of course, that when you change the source code to a program, the MD5 sum changes, and now you can't find your old data any more. Just think of it as a way of automatically flushing the cache when you update your code...
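The effect is easy to demonstrate. The sketch below is not the module's code: program_key is a hypothetical helper that hashes a file the same way, falling back to a name-based key when the file can't be read, and the temporary file stands in for your program's source.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Temp qw(tempfile);

# Hypothetical helper: derive a key from the contents of a source file.
sub program_key {
    my ($path) = @_;
    open my $in, '<', $path or return "Persistent$path";   # fallback key
    binmode $in;
    my $key = Digest::MD5->new->addfile($in)->hexdigest;
    close $in;
    return $key;
}

# A stand-in for the program's source file.
my ($fh, $script) = tempfile(UNLINK => 1);
print {$fh} "# version 1\n";
close $fh;
my $key_v1 = program_key($script);

# Any edit to the source, however small, yields a different key.
open $fh, '>>', $script or die "Couldn't append to $script: $!";
print {$fh} "# a trivial edit\n";
close $fh;
my $key_v2 = program_key($script);

print "before: $key_v1\nafter:  $key_v2\n";
```

Because the key is part of the database filename, the edited program simply starts over with fresh, empty files.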

And that's all we do in the Attribute::Persistent package. Because we want this attribute to be available to all packages, we have to put the real meat into the UNIVERSAL package.

package UNIVERSAL;
use Attribute::Handlers::Prospective;

Now comes all that rubbish you saw before when handling the DBM file:

use File::Spec::Functions (':ALL');
BEGIN { @AnyDBM_File::ISA = qw(DB_File GDBM_File NDBM_File) }
use AnyDBM_File;
use MLDBM qw(AnyDBM_File);
no strict; # Attributes do evil things

And finally, we can define the attribute:

sub persistent :ATTR(RAWDATA) {

The first thing we want to do is find the name of the variable we've been passed. We also demand that this is a lexical:

    *{$_[1]}{NAME} =~ /LEXICAL\((.*)\)/ or do {
        require Carp;
        # require does not import croak, so call it by its full name
        Carp::croak("Can only define :persistent on lexicals");
    };

And now we have the variable name, complete with sigil, in $1. We need to use this in two ways: We need to keep the original name, for reporting errors, and we need to munge it into something suitable for use as part of a filename:

    my $name = $1;
    my $origname = $name;

We need to know what type the variable is, so that we can dereference its reference correctly; this also helps us set up the filename so that %foo comes out as H-foo and @foo comes out as A-foo. This way, we can have two different persistent foo variables and they'll be kept separate:

$name =~ s/^\%/H-/ and $type = '%';
$name =~ s/^\@/A-/ and $type = '@';

However, we also allow the user to explicitly give their own name to a variable as an argument to the :persistent attribute, which we'll use as part of the filename:

if ($_[4] ne "undef") { $name = $_[4]; }

Now we make it filesystem-safe and put it in the temporary directory:

$name =~ s/\W+/-/g;
my $filename = catdir(tmpdir(),"$key-$_[0]-$name");
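Pulled together, the naming steps can be sketched as one small function. This is an illustration, not the module's code: db_filename and its parameters are hypothetical names, the key "abc123" is a placeholder for the MD5 sum, and catfile is used here because the result names a file.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec::Functions qw(catfile tmpdir);

# Hypothetical helper mirroring the naming steps described above.
sub db_filename {
    my ($key, $package, $varname, $explicit) = @_;

    my $name = $varname;
    $name =~ s/^%/H-/;                        # %foo becomes H-foo
    $name =~ s/^\@/A-/;                       # @foo becomes A-foo
    $name = $explicit if defined $explicit;   # a user-supplied name wins
    $name =~ s/\W+/-/g;                       # keep it filesystem-safe

    return catfile(tmpdir(), "$key-$package-$name");
}

print basename(db_filename("abc123", "main", '%foo')), "\n";
print basename(db_filename("abc123", "main", '@foo')), "\n";
print basename(db_filename("abc123", "main", '%foo', "session cache")), "\n";
```

A hash and an array that share a name end up in different files, and two programs with different keys never share a file even if their variables are named identically.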

Finally, we can do the tie by correctly dereferencing the reference we were passed:

tie (($type eq "%" ? %{$_[2]} : @{$_[2]}), "MLDBM", $filename)
    or do { require Carp; Carp::croak("Couldn't tie $origname to $filename - $!") };

And that, essentially, is all of Attribute::Persistent.

Philosophical Aside

We've seen two stories here: the story of Attribute::Persistent, and the story of attributes in Perl. There's a way in which these two stories are very similar: We started off with a great idea that got implemented a long time ago, but that suffered from having an interface that didn't allow us to be as lazy as we'd like. So, someone bit the bullet and wrote higher level modules to simplify the interface and make things either more powerful or less cluttered for the end programmer.

In the case of attributes themselves, the basic functionality lay unused for quite a while due to the clumsiness of the interface. However, once Damian produced Attribute::Handlers and did most of the work, an explosion of impressive uses for the feature sprang into being soon afterward.

If you look at other parts of Perl, you'll see the same story; source filters, for instance, were a "little known feature" back in 1996. Paul Marquess's Filter::Util::Call module did a great job of exposing the C interface to source filters, but still the feature languished in obscurity. Then Damian again came up with a higher level module, Filter::Simple, which truly was simple, and suddenly a wealth of filter-based modules appeared. This happens again and again—with pluggable optimizers, the iThreads system, XS versus Inline, and so on.

What can we learn from this? First, interface design is much more important than you might think in terms of "selling" an advanced feature. It may be fantastically powerful, but even though you're writing for other programmers, if nobody's interested in getting their heads around the interface, it'll be largely ignored. Laziness is a Perl virtue, and if you do something clever with Perl, it's important to take the time to allow other people to be lazy with it.

Second, we can learn that there are a bunch of very interesting things going on in Perl or in Perl modules that nobody knows about. Source filters were in Perl for over four years before they came into vogue. Take a look around, and you too may be able to create a "middle man" module that does the hard work, allowing others to be lazy in playing around with the next great feature.

Third, we can be thankful that there are people like Damian Conway who are willing to do the hard work of putting obscure and powerful features into easy-to-use packages. As we saw with the history of using DBMs, there's a trend towards more and more abstractions making the end-programmer's life easier. Writing the abstraction code that does the heavy work so another programmer doesn't have to is unglamorous and generally thankless, but it means you'll be doing the biggest favor possible to your fellow developers—you'll be saving them time.

TPJ