The Perl Journal April 2003
Long-time readers of my columns will know that I have two particular interests when it comes to Perl programming: mail handling, as evidenced by my articles on Mail::Miner and Mail::Audit, and making things as simple as possible for the programmer, but no simpler.
Until recently, I have to admit, these two interests have been a little at war with one another because, as it turns out, mail handling in Perl is anything but simple.
This is quite a shame because mail handling as an abstract concept is very simple indeed. Nine times out of ten, you want to look at a piece of mail, get or set some of its headers, and look at its body. And that's it. Unfortunately, this abstract concept turns out to be anything but simple when it's turned into reality.
Let's first look at the options available, and then what I've proposed to do about them. In the process, we'll see what lessons we can learn about object-oriented (OO) and module design, and the way I've approached the redesign of some of my own modules. This means that this article will turn out rather more philosophical than practical, but that's OK. I promise I'll make up for it next time.
There are two main mail message handling libraries in Perl. The most commonly used of these is Mail::Internet, and it's not so horrendous to use:
use Mail::Internet;
my $mi = Mail::Internet->new([split /\n/, $mail]);
print $mi->as_string;
Each message object has an associated Mail::Header object, and you can get headers by looking at that:
my $from = $mi->head->get("From");
Wait a minute. Why is this? The Mail::Header object is not entirely useful on its own; it's only really useful in the context of the mail it comes from. What's happened here is that an implementation decision (putting header parsing and handling into its own class) has leaked out into the user interface to the module. I shouldn't need to care how the header handling is implemented. Getting headers is, as far as I'm concerned, just a part of looking at mail. Sure, you get a bit of extra flexibility by this implementation decision, but not enough to warrant exposing it to the user.
If you do want to do it this way, using a separate Mail::Header object, that's fine. You can hide the implementation decision from the user by means of OO delegation. What's really going on here is that Mail::Internet has what's called a HAS-A relationship with Mail::Header.
Unlike the usual vertical IS-A relationships, HAS-A relationships are horizontal. Mail::Internet doesn't inherit from Mail::Header, or the other way around, but it contains one, encapsulates it, and uses it within itself.
When you see a HAS-A relationship, you also often see delegation. Delegation is an OO principle by which you direct methods through a HAS-A relationship. We should be able to call something like head_get on Mail::Internet, and it should pass the request on to the header object that it has. Again, this avoids exposing the implementation detail of the Mail::Header class in the first place. Sadly, Mail::Internet doesn't support delegation.
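To make the idea concrete, here is a minimal sketch of delegation through a HAS-A relationship. The Sketch::Header and Sketch::Message classes are hypothetical stand-ins invented for illustration; they are not Mail::Header or Mail::Internet, and the head_get name is simply the one suggested above.

```perl
use strict;
use warnings;

# Hypothetical header class, standing in for something like Mail::Header.
package Sketch::Header;

sub new {
    my ($class, %fields) = @_;
    return bless { %fields }, $class;
}

sub get {
    my ($self, $name) = @_;
    return $self->{$name};
}

# Hypothetical message class that HAS-A header object.
package Sketch::Message;

sub new {
    my ($class, %args) = @_;
    my $self = {
        head => Sketch::Header->new(%{ $args{headers} || {} }),
        body => $args{body},
    };
    return bless $self, $class;
}

# Delegate: forward the call to the contained header object, so callers
# never need to know that a separate header class exists.
sub head_get {
    my ($self, @args) = @_;
    return $self->{head}->get(@args);
}

package main;

my $msg = Sketch::Message->new(
    headers => { From => 'alice@example.com' },
    body    => "Hello\n",
);
print $msg->head_get('From'), "\n";
```

The caller only ever talks to the message object; whether headers live in their own class becomes a private matter.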
But let's back up a step. Why bother having a separate Mail::Header anyway? This was supposed to be a simple problem. Before we move on to looking at the other solution, let's just tot up a quick score for Mail::Internet; see Table 1.
The main competitor to Mail::Internet, at least until a couple of days ago, was Mail::Message, written by Mark Overmeer, who also, coincidentally, now maintains Mail::Internet. This is part of the Mail::Box suite of libraries.
If you think that Mail::Internet was overkill, then you probably want to avert your eyes. Mail::Box is a full-featured mail handling suite, comprising around 90 full-featured classes and over 14,000 full-featured lines of code.
Mail::Message similarly splits off header handling to Mail::Message::Head, but does provide some delegate methods that access fields. These return Mail::Message::Field objects in most cases, but sometimes return Mail::Address objects in the case of headers such as "From," "Cc," and so on. These two classes magically stringify to the value you're expecting if used in string context.
All of these classes inherit from a single Mail::Reporter class that handles error reporting, and some of these classes have special subclasses that are used to "lazy-load" for speed. For instance, the header can originally be returned as a Mail::Message::Head::Delayed, which doesn't do any parsing of the header, and this is then turned into a parsed Mail::Message::Head::Complete when a field is requested. Speed is very important in the design of this module, which explains in part why it is so horrendously slow compared to the much simpler Mail::Internet.
For example, I benchmarked reading an e-mail into a Mail::Message and Mail::Internet object, respectively, and retrieving its "From" header:
Benchmark: timing 10000 iterations of internet, message
internet: 59 wallclock secs (58.85 usr + 0.02 sys = 58.87 CPU) @ 169.87/s (n=10000)
message: 122 wallclock secs (117.97 usr + 0.41 sys = 118.38 CPU) @ 84.47/s (n=10000)
And to be fair, in another test, I read in an e-mail and spat it back out as a string:
Benchmark: timing 10000 iterations of internet, message, simple...
internet: 60 wallclock secs (59.17 usr + 0.05 sys = 59.22 CPU) @ 168.86/s (n=10000)
message: 128 wallclock secs (124.27 usr + 0.54 sys = 124.81 CPU) @ 80.12/s (n=10000)
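Output in this format comes from Perl's core Benchmark module. The harness used for the figures above isn't reproduced here, so the following is only a sketch of the general shape of such a test. The two subroutines are hypothetical stand-ins written in plain Perl rather than calls to the real mail modules, and the negative count asks Benchmark to run each case for at least one CPU second instead of a fixed number of iterations.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

my $mail = "From: alice\@example.com\nSubject: Hi\n\nHello\n";

# Stand-in 1: parse every header into a hash, then look one up.
sub from_via_full_parse {
    my ($text) = @_;
    my ($head) = split /\n\n/, $text, 2;
    my %header;
    for my $line (split /\n/, $head) {
        my ($name, $value) = split /:\s*/, $line, 2;
        $header{$name} = $value;
    }
    return $header{From};
}

# Stand-in 2: pull out just the one header with a regex.
# (Naive: it does not stop at the end of the header block.)
sub from_via_regex {
    my ($text) = @_;
    return $text =~ /^From:\s*(.*)$/m ? $1 : undef;
}

# Time both approaches doing the same job on the same input.
timethese(-1, {
    full_parse => sub { from_via_full_parse($mail) },
    regex      => sub { from_via_regex($mail) },
});
```

To compare real modules, each anonymous subroutine would instead construct the relevant object from $mail and fetch the header from it.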
There's an important lesson here. The object-oriented model is good, like vintage wine is good. But if you drink several gallons of vintage wine in a sitting, you're liable to end up getting a little confused. You end up with what's called "lasagna code," the object-oriented equivalent of spaghetti code; your inheritance tree becomes so towering it's nearly impossible to tell which classes you're really using and where their methods are coming from.
Then you find yourself having to optimize your code, which is by now extremely complex, by adding more complexity, whereas the rules of optimization tell you that you optimize by taking complexity away.
Our example of loading up a message and looking at its "From" header, which took twice as long as the Mail::Internet version, used the following Perl modules: Mail::Address, Mail::Box::Parser, Mail::Box::Parser::Perl, Mail::Message, Mail::Message::Body, Mail::Message::Body::File, Mail::Message::Body::Lines, Mail::Message::Body::Multipart, Mail::Message::Body::Nested, Mail::Message::Construct, Mail::Message::Field, Mail::Message::Field::Fast, Mail::Message::Head, Mail::Message::Head::Complete, Mail::Message::Part, and Mail::Reporter, for a total score shown in Table 2.
Ouch.
Let's go back to solving the nine-out-of-ten case: getting the body and setting the headers. Once we've got this case nailed down, then we can start adding complexity. I'm not a massive fan of Extreme Programming, but one of its doctrines is that you start as simply as possible, and only add more complex functionality when you need it. I like that idea. With Perl modules, as with writing, the time to stop is not when there is nothing more to add but when there is nothing more to take away.
So I decided to set out and reinvent the mail-handling wheel, but at least try to make it less square this time and forgo the ornamental carvings, fuzzy dice, and the attachment for getting stones out of horses' hooves. I wanted to write a Perl mail-handling library that was stunningly simple in every way, even at the cost of a little flexibility later.
In a fit of pique, I decided that the whole Mail::* namespace was rotten to the core (especially the bits of it that I wrote), and if we were going to have a fresh start at mail handling, we should start afresh in a different namespace. Sometimes heresy is an important part of innovation.
I started the design of Email::Simple by working out what methods I would need. I came up with six, which I still think is probably too many. We want to create a new object; we want to get and set a header; we want to get and set the body; we want to be able to output the whole mail as a string again.
After a lot of consultation and argument with peers in the Perl community, I decided upon having separate accessor and setter methods. I could have cut my method count down to four without this (a pleasing thought), but I would lose a lot in the process.
First, I wanted to stick to UNIX design principles: Small, single-purpose tools. Every module, every method, every line of code, should do one thing and do it well. Having combined accessors smacks of doing two things. It loses regularity because the same method does different things depending on whether or not you give it an argument, and it loses symmetry.
Second, constructing and examining mail are usually two very distinct operations. Generally, you're either examining existing mail or making up new mail. These are separate concepts that deserve not to be confused, and hence have separate methods to distinguish them.
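As a rough illustration of what a six-method interface with separate readers and writers can look like, here is a small self-contained sketch. It is not Email::Simple's source: the Sketch::Mail class and its method names are invented for this example, and it deliberately ignores folded header lines and repeated headers.

```perl
use strict;
use warnings;

package Sketch::Mail;

# Constructor: takes the whole message as a single scalar.
sub new {
    my ($class, $text) = @_;
    my ($head, $body) = split /\n\n/, $text, 2;
    my @headers;
    for my $line (split /\n/, $head) {
        my ($name, $value) = split /:\s*/, $line, 2;
        push @headers, [ $name, $value ];
    }
    $body = '' unless defined $body;
    return bless { headers => \@headers, body => $body }, $class;
}

# Reader: first value of the named header (case-insensitive), or undef.
sub get_header {
    my ($self, $name) = @_;
    for my $pair (@{ $self->{headers} }) {
        return $pair->[1] if lc $pair->[0] eq lc $name;
    }
    return undef;
}

# Writer: replace the header if present, otherwise append it.
sub set_header {
    my ($self, $name, $value) = @_;
    for my $pair (@{ $self->{headers} }) {
        if (lc $pair->[0] eq lc $name) {
            $pair->[1] = $value;
            return;
        }
    }
    push @{ $self->{headers} }, [ $name, $value ];
    return;
}

sub get_body { my ($self) = @_; return $self->{body} }
sub set_body { my ($self, $body) = @_; $self->{body} = $body; return }

# Output: headers in their original order, a blank line, then the body.
sub as_string {
    my ($self) = @_;
    my $head = join '', map { "$_->[0]: $_->[1]\n" } @{ $self->{headers} };
    return $head . "\n" . $self->{body};
}

package main;

my $text = "From: alice\@example.com\nSubject: Hi\n\nHello\n";
my $mail = Sketch::Mail->new($text);
print $mail->get_header('From'), "\n";
$mail->set_header(Subject => 'Re: Hi');
print $mail->as_string;
```

Each method does one thing, and none of them changes behavior depending on how many arguments it receives.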
In the same way, while it is possible to use Email::Simple to create a new e-mail from scratch, this is discouraged. Creating mail is a separate action and needs a separate module. Small, single-purpose tools.
This also led me to rethink how I would lay out the code into subroutines; again, I tried to think of subroutines as small, single-purpose tools that do one thing and do it well. This means that most of the subroutines in Email::Simple are four or five lines long, and must fit in one screen in the absolute worst case.
I also decided that Email::Simple was doing such a fundamental and simple task that it should not use any external modules. Not because I'm not a fan of code reuse (this is a library, after all!) but because I wanted to minimize dependencies, making this portable, easy to install, and easy to use. In the end, I caved in while writing the as_string method and used the core module Text::Wrapper to fold long header lines. Pragmatism must beat principles every time.
As well as being simple to implement, it's fairly important that this module is simple to use. By trimming down the number of classes and methods to the bare minimum, I think I've achieved this:
use Email::Simple;
my $es = Email::Simple->new($email);
print "From ", $es->get_header("From");
print "\n\n";
print $es->body;
The e-mail is expected to be a single scalar; the other modules can happily take a glob, an array of lines, an IO::File object, and who knows what else. I chose not to do that because another design principle for this project is predictability. "Do What I Mean" is very useful when it does do what you mean, but causes all sorts of fun when it doesn't.
While many of my other modules are perfectly happy to try their hardest to work out what you actually mean, and do that, this one is different. In my opinion, high-level magic belongs in high-level modules, not in fundamental libraries like this one. Clever is for high-level stuff; low-level modules should be as dumb as possible.
For the same reasons, Email::Simple objects don't automatically stringify, or indeed do anything without you asking specifically for it. If you use an Email::Simple object, you know exactly how it's going to behave in every situation. (Some languages call this the "principle of least surprise," and in my opinion, overloading goes against this in most cases: you don't expect random objects to stringify, so they should avoid doing so unless there's a very good reason.)
The minimalist interface and the defined behavior standards are small enough to fit inside your head. Ideally, you should only need to read the Email::Simple manual page once.
What about speed? I'm a fervent, passionate believer that if you follow the design principles I've outlined, with good, clean algorithms and simple design, you won't need to worry about speed; it'll just fall neatly out. The best way to optimize is to remove complexity, not to add it, and if you design your code to have very little extraneous complexity anyway, you'll find your modules already optimized!
And, in fact, that's how it turns out: because Email::Simple is so simple, because it does nothing extraneous, and because it's cleanly designed, it's very, very fast:
Benchmark: timing 10000 iterations of internet, message, simple...
internet: 59 wallclock secs (58.85 usr + 0.02 sys = 58.87 CPU) @ 169.87/s (n=10000)
message: 122 wallclock secs (117.97 usr + 0.41 sys = 118.38 CPU) @ 84.47/s (n=10000)
simple: 9 wallclock secs ( 9.17 usr + 0.00 sys = 9.17 CPU) @ 1090.51/s (n=10000)
Naturally, it's much faster than the other modules because it only does a single job and does it well; but it's the job that most users of these modules will want to be doing. Oh, and to total it up, see Table 3.
I apologize for it containing so many lines of code, but I wanted to quote extensively from RFC 2822 in the comments, to help keep it standards-compliant during whatever small amount of maintenance it might need. But nevertheless, I think we have a winner.
Now we have a foundation module, and we can start to build on this foundation.
I've done a lot of ragging on other people's code in this article, so the next module I wanted to raze to the ground and replace was one of my own: Mail::LocalDelivery.
Mail::LocalDelivery grew organically out of the Mail::Audit mail filter; one of the major aspects of Mail::Audit is that it delivers mail into your mailboxes. Once upon a time, this was a relatively simple process, with a bit of locking, and only a few lines of code, and it worked fine inside of Mail::Audit.
But then I made a stunningly stupid mistake, one I hope to never repeat. I accepted a patch that added maildir support. OK, that wasn't the mistake; the mistake was that I didn't then take the opportunity to refactor the code before adding the patch. I just added in an if statement that separated maildir from mailbox, and the code continued to grow organically again. And grow, and grow...
Before I realized what was going on, Mail::Audit's accept method was the bulk of the module and was almost impossible to follow. So a golden lesson was learned there: Once your code grows two separate ways to achieve a task, always modularize at that point, at the very least.
And while you're about it, take the opportunity to see if you can split the whole thing off to a separate module or procedure. One of the design principles I've learned from my wise boss is that if it looks like a problem is getting complex, try adding another layer of abstraction. It's a very generic rule, but it can help this sort of situation. (Of course, it must be coupled with another principle: If it doesn't look like your problem is going to get complex, don't add another layer of abstraction, or you end up with Mail::Box. We're still trying to keep things as simple as possible, but no simpler.)
I eventually came to my senses when I was reminded that local delivery was a useful thing to be doing outside of the context of Mail::Audit, and I split out the accept logic to a new module, Mail::LocalDelivery. But I still had the problem that the multiple delivery methods weren't abstracted out at all, and although I planned to rewrite Mail::Audit's accept method in terms of Mail::LocalDelivery, to tell the truth, I was scared to do so because the code had grown too hairy and I didn't know what would break if I touched it. And Mail::Audit, which handles every piece of mail I receive, is rather important to me. I didn't want to touch it if things might break.
So I took the same design principles that guided me in Email::Simple and applied them to the local delivery problem. I was very happy when designing this module since I could reduce the number of methods down to one: deliver will deliver a message to a bunch of mailboxes.
Why, then, did I make it OO anyway? Surely if you're just providing a single function, you could export it. Well, I could, but I wanted to implement it in terms of methods that could be inherited from, in case someone wanted to come along with a more featureful version in the future.
This is something that has to be learned the hard way. Although I've been claiming that you shouldn't unnecessarily abstract things out before a definite need arises, the decision whether or not to go OO is one that must be made right at the start of your planning. For one thing, you will want to avoid changing your user interface from procedural to OO style as a result of changing your implementation, something we talked about with respect to delegation.
It seemed appropriate to have a single front-end that delivered a message, but to use separate back-end modules to implement mailbox, maildir, and other delivery mechanisms. Again, we're keeping the implementation details away from the interface.
Mail::LocalDelivery was not badly designed. It had, for instance, separate deliver_to_mbox and deliver_to_maildir back ends, which called a unified write_message, and so on. So I tried to keep this design in my rewrite.
Unfortunately, I learned something that you can only learn when you rebuild something from scratch: Most of the "shared code" in the unified method wasn't actually shared at all. The power of write_message was that it handled locking a mailbox, opening it for append, writing the message to the end, unlocking, and so on. But maildir delivery doesn't do any of these things: There's no need for locking, and each message is written from scratch to its own file, not appended to an existing one. The original write_message used arguments and if statements to decide whether to lock or unlock, which essentially removed any of the abstraction that it was designed to provide. Oops.
It turned out that there was nothing to be shared between the various back ends anyway. This made me happy. Inheritance is a good thing when used sparingly, or to extend a given class with replaced or additional functionality. When used to excess, it can lead to the unhappy situation where you need to grovel through five or six levels to find out where a given method is defined in order to debug or understand it. Walking a class inheritance tree is a great thing for a computer to do, but not so great for a human.
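The arrangement described above, one front end dispatching to back ends that share no code, might be sketched as follows. Everything here is hypothetical: the Sketch::Deliver classes are not the real modules, the rule that a trailing slash selects maildir is an assumption made for the example, and the mbox writer is simplified (a fixed "From " separator line, no escaping of body lines).

```perl
use strict;
use warnings;

# Hypothetical mbox back end: lock the file, append, and close.
package Sketch::Deliver::Mbox;
use Fcntl qw(:flock);

sub deliver {
    my ($class, $mail, $path) = @_;
    open my $fh, '>>', $path or die "Cannot open $path: $!";
    flock $fh, LOCK_EX or die "Cannot lock $path: $!";
    print {$fh} "From sketch\@localhost\n", $mail, "\n";
    close $fh or die "Cannot close $path: $!";   # closing releases the lock
    return $path;
}

# Hypothetical maildir back end: no locking, one new file per message.
package Sketch::Deliver::Maildir;
use File::Path qw(make_path);

my $counter = 0;

sub deliver {
    my ($class, $mail, $dir) = @_;
    $dir =~ s{/+$}{};
    make_path("$dir/tmp", "$dir/new", "$dir/cur");
    my $name = sprintf '%d.%d_%d.localhost', time, $$, ++$counter;
    open my $fh, '>', "$dir/tmp/$name" or die "Cannot write $dir/tmp/$name: $!";
    print {$fh} $mail;
    close $fh or die "Cannot close $dir/tmp/$name: $!";
    # Write under tmp/ first, then move into new/ once complete.
    rename "$dir/tmp/$name", "$dir/new/$name" or die "Cannot rename: $!";
    return "$dir/new/$name";
}

# Front end: a single method that picks a back end for each mailbox.
package Sketch::Deliver;

sub deliver {
    my ($class, $mail, @boxes) = @_;
    my @delivered;
    for my $box (@boxes) {
        my $backend = $box =~ m{/$} ? 'Sketch::Deliver::Maildir'
                                    : 'Sketch::Deliver::Mbox';
        push @delivered, $backend->deliver($mail, $box);
    }
    return @delivered;
}

package main;
use File::Temp qw(tempdir);

my $root    = tempdir(CLEANUP => 1);
my @written = Sketch::Deliver->deliver(
    "Subject: Hi\n\nHello\n", "$root/inbox", "$root/Maildir/",
);
print "$_\n" for @written;
```

The two back ends have nothing in common beyond the name of their one method, which is all the front end needs to know about them.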
Once again, the principle of building one to throw away is vindicated: By rewriting the module, I managed to remove a load of extraneous code, and even though it's now split across multiple back-end modules, it weighs in much lighter than the original Mail::LocalDelivery, and it's much, much easier to understand. All the accumulated cruft that had grown organically on the module got swept away, and this cut down the line count and the complexity of the code dramatically:
The old version:
% wc -l LocalDelivery/LocalDelivery.pm
534 LocalDelivery/LocalDelivery.pm
The new version:
% wc -l LocalDelivery.pm LocalDelivery/*pm
77 LocalDelivery.pm
86 LocalDelivery/Maildir.pm
70 LocalDelivery/Mbox.pm
233 total
Less than half the size.
Inspired by this, I felt it was finally time to reinvent the venerable Mail::Audit.
In the same way as Email::Simple, I wanted to cut out as much of the extraneous functionality as possible, and pare it to the essentials. This means that its replacement, Email::Filter, doesn't support any logging, any local options, or anything of the like.
However, there's a problem here: these things, especially logging, are actually very useful. We want to be able to support them somehow. One possibility was to design the module to be easily subclassable, so that people would be able to write Email::Filter::Logging very easily, and then it'd be up to the end user.
But subclassing's a pain; I don't want to be writing a new class to accommodate my particular foibles about what logging should be done and how. There ought to be a nice, easier way to allow a user to customize a class's behavior. Thankfully, there is, and it's implemented by the Class::Trigger module. This provides simple, inheritable "trigger points" where the end user can attach callbacks. For instance, if I say
$item->add_trigger( ignore => sub { warn "This mail was ignored\n" } );
then the subroutine will be called every time the ignore method is invoked. All I need to do in my class is to use Class::Trigger and this gives me a new method to call in my own class:
sub ignore {
my $self = shift;
$self->call_trigger("ignore");
...
}
But wait, where did these two methods come from? We don't inherit from Class::Trigger, we only use it. What's happening is that Class::Trigger imports the methods into the caller's namespace, just as an ordinary, non-object-oriented module would export its functions. We gain the two methods, as do any of our subclasses, but we don't have an inheritance dependency on Class::Trigger itself. This technique is called using a "mix-in," and it's quite a popular one in Ruby and Python for adding functionality to a class without inheritance.
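The mechanics of a mix-in are easy to show in miniature. The sketch below is not Class::Trigger's implementation; Sketch::Trigger and Sketch::Filter are hypothetical, and this toy version stores triggers on each object only. It shows how an import routine can install methods directly into the calling class without touching its inheritance tree.

```perl
use strict;
use warnings;

package Sketch::Trigger;

# import() runs when a class says "use Sketch::Trigger"; it copies two
# subroutines into the calling package, where they act as methods.
sub import {
    my $caller = caller;
    no strict 'refs';
    *{"${caller}::add_trigger"}  = \&add_trigger;
    *{"${caller}::call_trigger"} = \&call_trigger;
}

# Attach a callback to a named trigger point on this object.
sub add_trigger {
    my ($self, $name, $callback) = @_;
    push @{ $self->{_triggers}{$name} }, $callback;
    return;
}

# Run every callback attached to the named trigger point.
sub call_trigger {
    my ($self, $name, @args) = @_;
    $_->($self, @args) for @{ $self->{_triggers}{$name} || [] };
    return;
}

package Sketch::Filter;

# Equivalent to "use Sketch::Trigger;" if the mix-in lived in its own file.
BEGIN { Sketch::Trigger->import }

sub new { my ($class) = @_; return bless {}, $class }

sub ignore {
    my $self = shift;
    $self->call_trigger('ignore');
    return 'ignored';
}

package main;

my @events;
my $filter = Sketch::Filter->new;
$filter->add_trigger(ignore => sub { push @events, 'This mail was ignored' });
$filter->ignore;
print "$_\n" for @events;

# The methods were imported, not inherited.
print 'inherits from mix-in: ',
    (Sketch::Filter->isa('Sketch::Trigger') ? 'yes' : 'no'), "\n";
```

Because the subroutines are copied into the class's own symbol table, any subclass of Sketch::Filter would pick them up through ordinary inheritance from Sketch::Filter itself.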
In a bizarre way, I've found that some of the restrictions in Email::Filter caused by trying to make it as simple as possible have led to interesting ways of doing things that I wouldn't have otherwise considered. For instance, in Mail::Audit, there's a pipe method that dispatches the mail off to an external program. I modified this in Email::Filter so that it returns the standard output from the program. This means that, for instance, in the absence of direct support for Mail::SpamAssassin, you can say:
$mail->simple(Email::Simple->new($mail->pipe("spamassassin")));
In other words, pipe the mail to the spamassassin command-line program, which outputs a marked-up mail message; take this mail message and turn it into an Email::Simple object, and then use that as the underlying object for the Email::Filter entity.
Email::Filter currently weighs in at 255 lines of code compared to Mail::Audit's 1053, but to be fair, this is because Email::Filter isn't anywhere near finished yet.
Email::Simple and Email::LocalDelivery are currently released on the CPAN; I wanted to hold them back until I had finished a few more modules in the Email::* project, but I got a few requests for them to be released early and, to be honest, there were a couple of bug reports in Mail::LocalDelivery that I couldn't be bothered to fix.
Right now, I'm working on Email::Filter, and after that will be the next big challenge: Email::MIME. This will be a relatively high-level library, but built around Email::Simple and many of its design principles, including, of course, simplicity; it's planned that this will be another "reader" module, concentrating on separating out the parts of a MIME message, rather than creating a new one from scratch. As such, it'll probably only add one or two methods to Email::Simple: "parts" to return a list of attachments in some format, and probably something to get some additional MIME-related metadata about the message.
After these reader modules are done, it will be time to start the creator modules, Email::Creator and Email::MIME::Creator. Who knows what I'll be working on after that; I have my beady eye on Mail::SpamAssassin. (People keep telling me that LWP is in dire need of a rewrite, but there just isn't enough time and coffee in the world.)
But I'd like to commend my design principles to you: simplicity, single-purpose tools, predictability, and not being afraid to sacrifice a little bit of functionality to achieve the nine-out-of-ten-cases solution. As we've seen, they lead to clean, maintainable code that often turns out to be quite a lot faster than the all-encompassing solution anyway.
TPJ