Article jun2007.tar

From Mbox to Maildir -- One Sys Admin's Journey

Phil Hollenback

One day a user at my company complained about lost email. I researched the issue, examined alternate solutions, and developed a plan. This article is the result of my investigation. One result of that work was a script that I developed. Another result was a more reliable mail infrastructure. Although converting from mbox to Maildir for mail storage is conceptually simple, there are a number of techniques and pitfalls to consider. I hope that by sharing my results, I can help others who may be planning a similar migration.

Overview

No computer system is perfect and the skill of the systems administrator often lies in finding the right balance between theoretical ideals and practical realities. My company's mail system is a good example of this balancing act. It has worked well for many years, and users are generally satisfied with it. As with all real-world systems, it grew organically over time and within the limitations of our network connectivity, hardware, and software. One result of this evolution is the fact that our mail store is NFS-mounted across various Unix and Linux machines. Users like the flexibility of being able to access their mail in a consistent manner on any of our systems. All mail is stored in the traditional mbox format.

Storing mbox-format mail on an NFS share does create the possibility of mailbox corruption if file locking doesn't work exactly right. However this setup worked fine for us for many years. Users ran mail clients, such as pine, locally or over SSH connections and tended to only run one mail client. Also, NFS-mounted shared mail storage gave several advantages, such as simplified mail delivery and ease of centralized backups.

Problems began to occur during the past year, however, as users began to expect more of their mail, including remote mail access over IMAP. A pre-1.0 version of Dovecot was set up as an IMAP mail server, so users could read mail from home without having to SSH into work. Users began to complain of mysteriously vanishing mail. Their mail would be delivered, but when they checked again it would be gone. I was unsure what was happening at first, but by tracing the logs and speaking with users, I saw that this was only happening with IMAP mail users. I could verify through the logs that messages were being delivered but at some point later were no longer in the mailboxes.

Examination of user mail folders soon showed the problem: nulls in the files. The text of the mailboxes appeared Ok, but when run through cat -tve, long strings of nulls (ASCII character 0) were seen embedded in the files. I wrote the following Perl one-liner to check mailboxes and find these strings of nulls:

perl -ne 'm/\0\0\0/ and print "found nulls in $ARGV at $.\n"; \
  close ARGV if eof' /var/mail/*

I found that almost all IMAP user inboxes contained this corruption. If the nulls showed up in the middle of a message, the mail server (Dovecot) would ignore them and the mail would appear correct to the user. But, if the nulls damaged a mail message header, the message would "disappear" from a user's mailbox because the mail server couldn't figure out how to process it.

Clearly, this corruption had to be fixed, but I couldn't easily take the mail storage off NFS, because our entire infrastructure was built around that. Nor could I immediately change the mail server software, because this would be very disruptive for users. Additionally, both of these system-wide changes would require months of careful planning.

The best short-term answer was to convert user mail storage to a more robust format, and, after some Web searching, I chose to convert our user mail to Maildir. However, I had to make this conversion in a way that minimized user inconvenience and could be gradually rolled out. It was too risky to convert all users at once, because there could be problems that I would not discover until after the conversion was performed.

Maildir

The Maildir format was originally developed by Dan Bernstein (creator of Qmail) as a solution to the problems of the standard mbox mail file format. The key difference between Maildir and mbox is that mbox stores entire mail folders in one file, while Maildir stores each mail message in a separate file. This makes file locking much easier, because the system only has to lock one small mail file, not the entire folder. Additionally, Maildir uses a system of hard links and temporary files to make file locking theoretically completely unnecessary. Although some limitations of modern file systems mean that file locking is actually needed, in general, mail stored in the Maildir format is much less likely to be corrupted by system crashes or other processes. For more information on Maildir, see:

http://en.wikipedia.org/wiki/Maildir

The next step was to determine whether all the software in the mail delivery chain supported Maildir. The mail delivery agent (sendmail) did not, but procmail did, so I could filter all incoming user mail through procmail to convert it to Maildir. The IMAP mail server (Dovecot) supported Maildir, but I was unsure whether it could be configured to support different mailbox formats for different users. Without this ability I couldn't perform a staggered rollout, which was essential. I couldn't just convert everyone over a weekend and cross my fingers on Monday morning. So, I upgraded to a newer version of Dovecot, because it was clear that it could support multiple mailbox formats concurrently.

On the mail client end, the IMAP users were already covered, because Dovecot handled Maildir natively and served it over IMAP. However, we have a large contingent of pine users, and I couldn't just tell them to switch to something else. There is an experimental patch to add Maildir support to pine:

http://www.math.washington.edu/~chappa/pine/info/maildir.html

However, other users on the Web reported that this patch was incomplete and caused additional problems. Pine does support IMAP, so I decided the pine users would just have to change their mail access method. The handful of mutt users at my company (myself included) were okay, because mutt has full native Maildir support.

Dovecot and Procmail Changes

First, I adjusted the Dovecot configuration file (dovecot.conf) to check a new private password file before the system password file. For this to work, the new passdb entry must occur before the check of /etc/passwd in the "auth default" section, as shown:

# new per-user password file
passdb passwd-file {
   args = /usr/local/etc/dovecot.passwd
}
# fall back to system password file
passdb passwd {
}

Similarly, I added a new userdb entry before the existing one:

userdb passwd-file {
   args = /usr/local/etc/dovecot.passwd
}

userdb passwd {
}

I left the "default_mail_env" setting pointing to the old mbox-style inbox:

default_mail_env = mbox:%h/mail:INBOX=/var/mail/%u

Thus, by default, users would continue to access their old mbox inboxes.

With this in place, I could add a user to /etc/dovecot.passwd (one per line) in this format:

username:password:uid:gid::homedir::userdb_mail=maildir:~/Maildir

Then, Dovecot would check /home/<user>/Maildir for mail instead of /var/mail/<user>. Note that most of these entries, including the hashed password, can be taken directly from /etc/passwd. I also made sure to make /usr/local/etc/dovecot.passwd readable by only the Dovecot user and it contained the user password hashes.

At this point, Dovecot could retrieve mail from $HOME/Maildir and serve it to all IMAP clients. The next problem was mail delivery; sendmail does not know about Maildir, so it would continue to deliver mail to /var/mail/<user>, meaning that converted Dovecot users would never see any new mail.

Procmail is the usual answer when you want to mangle mail delivery, and I found that procmail has full Maildir support, including delivery to Maildir directories. Several changes are necessary, however, to convince procmail to deliver to Maildir directories instead of the default of mbox files. To begin with, the names of all the mailboxes references in a user's ".procmailrc" must be converted to Maildir format by adding a "." at the beginning and a forward slash at the end. Thus, the delivery recipe:

:0:
* From.*william@somewhere.edu
petcompil

becomes:

:0:
* From.*wiliams@somewhere.edu
.petcompil/

Next, you have to set "$MAILDIR" and "$ORGMAIL" as follows in "$HOME/.procmailrc":

MAILDIR=$HOME/Maildir/
ORGMAIL=$HOME/Maildir/

Note that the trailing slash is essential. Without that, procmail will dump messages into the Maildir directory one per file with the name "msg.XXX" instead of putting them into the correct Maildir format. By default, if procmail doesn't deliver a message using any of the recipes in your .procmailrc, it will deliver it to $MAIL, which means back to the default /var/mail/<user>. To protect against this, add a delivery recipe to the end of the .procmailrc:

:0
$ORGMAIL

This will ensure all mail will be delivered to your new Maildir mailboxes.

Converting Existing Mailboxes

These changes would be enough for a new user: procmail delivers mail to your Maildir folders, and Dovecot retrieves it from the same folders. Unfortunately, this setup does nothing to address existing user mbox mail folders. To accomplish this conversion, I wrote a Perl script to automate the modification of the mail delivery and transform existing mbox-format mail to Maildir format.

I had several criteria for my script:

1. I had to be able to convert users one at a time.

2. The conversion had to be done in such a way that the user never lost incoming mail.

3. The conversion had to happen as quickly as possible to minimize email downtime.

4. The conversion had to happen as safely as possible with as many sanity checks, data backups, and redundancies as possible.

Clearly, I had to write a script to do all this -- the steps to convert an individual user were complex enough that manual conversion could lead to errors. But first, I had determine how to convert email from mbox to Maildir format.

Conceptually, the conversion is fairly simple; it requires splitting the mbox file into individual messages and placing each message in a separate file. This could be accomplished with formail, for example. However, I did some checking and found that someone had already written a tool called mb2md.pl to handle this conversion:

http://batleth.sapienti-sat.org/projects/mb2md/

I incorporated this script into my own, so I wouldn't have to re-implement the conversion of the mailbox files. I still had work to do, however, because this script would only convert the actual mail from mbox to Maildir format. The mb2md.pl script doesn't do any sanity checking (e.g., to see whether the conversion would cause a user to go over quota). It also doesn't update any configuration files, such as the Dovecot password file or the user .procmail file.

The script I wrote incorporating mb2md.pl is called Mbox2Maildir.pl, and the complete version can be found on the Sys Admin Web site:

http://www.sysadminmag.com

or my own Web site:

http://www.hollenback.net/writings/Mbox2Maildir

In the rest of this article, I will highlight various parts of the script and attempt to explain why I designed it the way I did.

Script Breakdown

My script started as a wrapper around mb2md. I've since added significantly to the logic, so that wrapper mechanism is only one part of what the script does. Above all, the script operates in the most atomic and reliable fashion I could devise. Mail delivery is critical, so I structured the script to have a point of no return. The actual body of the script starts on line 398, but it isn't until line 428 that an irreversible action occurs.

That action is the conversion of the user .procmailrc to a Maildir-friendly format. Once that occurs all new mail will be delivered to $HOME/Maildir. I will discuss that mechanism later, but I want to make my methodology as clear as possible: Perform as many sanity checks and backup operations as possible before committing to anything. Also, I always try to determine whether actions will fail as early as possible in the logical flow. For example, I check the user quota early in the script. If it looks as if there won't be enough room to do this work, the script aborts before the first irreversible action occurs.

Initial Setup

Lines 1 through 28 of the script are the standard sort of initial setup you see in any Perl script: I call the interpreter (/usr/local/bin/perl), load libraries, and set as many variable as possible. Note that I always run my systems administration Perl scripts with the -w flag to enable as many warnings as possible. That way, if something goes wrong, I uncover it early. Here are the Perl modules used:

 6     use Getopt::Std;
 7     use File::stat;
 8     use File::Copy;
 9     use File::Temp  qw/ tempfile /;
10     use File::Path;
11     use Term::ReadKey;
12     use Quota;

These modules reveal a lot about my script, including one key thing about me: I'm lazy. Why re-implement something like option handling when you can use Getopt::Std? Also, if I do it myself, there's a much larger risk I will screw something up. This is especially important with things like temporary file handling, which I guarantee I cannot do as well as File::Temp can.

Similarly, I think it is important to do as much as possible within Perl as opposed to using the system() call to execute external commands. I usually begin developing these sorts of system administration scripts as shell scripts and convert them to Perl when things get too complicated. Intermediate versions of my scripts tend to use system a lot for operations like copying files, because this is the most direct conversion from a shell script. Every sys admin knows how to call:

system("cp file1 file2");

because that uses the familiar layout of shell program execution. However, I recommend strongly that you convert constructs like this to real Perl whenever possible. In this case, use File::Copy instead:

copy(file1, file2);

I admit that it isn't really any more difficult, but it does take time to convert your shell knowledge to the Perl way of doing things. In intermediate versions of my Perl scripts, I leave those system() calls in place, so I can write the script quickly and don't have to stop to look up Perl syntax.

The most important reason to use the Perl version of external commands wherever possible is that error handling is so much better. Part of the problem is that Unix return codes are reversed from Perl error values, and they are not very descriptive. The proper way to write the system version of the copy command above is:

system("cp file1 file2") == 0 or die "system cp failed: $!";

This at least does basic error checking but is just plain ugly. Because the Perl versions of commands already do proper Perl error handling, you can just check the return value directly:

copy(file1, file2) or die "copy failed: $!";

This is much cleaner and easier to read. This method is also more portable, because it avoids calling a system tool that may not be present (i.e., the script has a better chance of actually working on Windows if it doesn't directly call Unix commands like cp).

There isn't too much to discuss about the other modules I use. My company's mail storage is on a file system with quotas enabled, so the Quota module is important to check that the conversion will have enough space to write the new mail files. This is also why I first copy the existing mbox-format mail to a temporary location and then copy it back to the user home directory during the Maildir conversion.

Sanity Checking

I am a strong believer in paranoid programming. You always have to check return codes. Of course, my script contradicts that because I don't check the return for filehandle close() calls, so do as I say not as I do. Similarly, I always make backup copies of any files that are modified, using RFC 3339 date strings of the form YYYYMMDD in the filenames for portability and ease of sorting. The script actually makes a cpio copy of the user's mail on line 105 -- again very early in the process and before any permanent changes are made. This would serve to recover the user mail to a known good state if something went horribly wrong during the conversion.

Again, scripts should anticipate as many failure modes as possible as early as possible. If your script is copying large amounts of data on the system, you should always check that the receiving file system has enough available space. My script takes user quotas into account when converting from mbox to Maildir. I found through casual testing that this conversion caused a user's disk usage to grow by a few percent, presumably because of two things:

1. The overhead of all the extra files in the Maildir setup versus one file per mailbox with mbox.

2. The slightly less efficient use of disk space as each mail file wastes a partial block at the end of message.

This would cause the conversion to fail if the user was near their quota beforehand. I strive to avoid this at all costs, because backing out a partial conversion is painful.

The checking of user quotas starts on line 31 of the script in the subroutine CheckQuota(). I arbitrarily selected 10 percent as the quota headroom based on a few sample conversions. If the user's usage plus 10 percent exceeds their limit, the script aborts very early on -- before any changes are made to any files. This is all accomplished via the Perl Quota module, which makes the check portable between different systems.

To recap: Assume all system calls can fail and check return codes. Make backups of all files. Check resource usage as early as possible and exit before permanent changes are committed.

Converting Procmail

With the various setup, sanity check, and backup procedures out of the way, the script reaches the commit point at line 428 with the call to ProcessProcmail(). Users may or may not have an existing .procmailrc file. The conversion to Maildir makes procmail a requirement, so ProcessProcmail() either creates a brand new .procmailrc with some default settings or converts an existing file.

Creating a new one is straightforward via a "here document". Converting an existing one is more complicated, because the format of a .procmailrc file is complex. I considered doing an automatic conversion but then realized a manual conversion was the best way to go. The script dumps the existing file into a temporary file and adds the new pieces in comments. It is up to the person running the script to merge these into a coherent whole.

Following my goal of performing the conversion in as atomic a manner as possible, the script first copies the existing .procmailrc to a temporary file (or creates the tempfile from a template) and then dumps the user into their favorite editor to perform the changes. I use the Perl File::Temp module to ensure I don't create insecure temporary files.

One glaring hole in this file handling is that it doesn't deal well with multiple "include" files in a user's .procmailrc. A few users are affected by this, and to convert them I had to edit about 20 procmail files per user. This was painful, but I decided to live with it because these were a minority of my users. Still, you can see that this case breaks the atomic nature of the conversion. With just one .procmailrc, the minute the changes to that file are complete, the system starts delivering all mail to the user via Maildir. With multiple procmail files, mail could still be delivered to an old location before all the files are changed.

Once the user's .procmailrc is edited and converted to deliver to Maildir, all incoming email starts flowing to ~/Maildir. Procmail creates this directory structure automatically if it doesn't already exist. A nice feature of procmail is that if delivery to ~/Maildir fails (e.g., because the user is over quota), it will fall back to delivering to the old user mbox file in /var/mail/<user>. Thus, going over quota on /home does not automatically cause mail to be bounced. Of course, the user could still go over quota in /var/mail and, at that point, mail would no longer be delivered.

Converting Existing Mail

Once procmail is properly set up, the incoming new mail is taken care of and no more mail will be delivered to the old user mailbox. Then, I move the user's old mbox folder directory ~/mail to scratch space (via the MoveMbox() subroutine on line 113). This directory contains all the user's mail except for the spool file in /var/mail/<user>. I move the user's mail directory to a temporary location to avoid hitting the user quota on /home. If you don't have quotas, then you could skip that step.

This is one time I don't rely on Perl modules, instead I use cpio, mainly because cpio is very reliable and preserves symlinks and permissions. This could probably be done with a few Perl modules (starting with File::Find), but cpio is simple to use and works very well here.

Next, I run the actual mb2md.pl script twice, first for the spool file and next for all other mbox mail folders (from the backup copy of ~/mail). This has to be done in two steps because of the awkward way that Maildir (specifically Maildir++) handles subfolders. I realize that I could have incorporated the entire mb2md script inside my script, but it didn't seem worth the extra effort. Instead, I just call it externally with the system command. After these two conversion runs, all user mail (the spoolfile and user mbox directories) are stored in ~/Maildir.

The obvious question here is what happens if mb2md.pl is writing messages to the Maildir folder at the same time procmail is delivering messages. Potentially, there could be a conflict if procmail and mb2md.pl chose the same filename for a new file. The Maildir specification:

http://cr.yp.to/proto/maildir.html

avoids filename conflicts by combining the results of calls to unix_sequencenumber (returns a number that increases by one every time it is called) and unix_bootnumber (the number of times that the system has been booted).

However, these calls don't exist on any modern Unix such as FreeBSD or Linux. Thus, other sources of semi-unique data are combined in both procmail and mb2md.pl to create (hopefully) unique filenames. The mb2md.pl script uses:

<seconds since the epoch>.<message count in current mbox>.mbox

while procmail uses:

<seconds since the epoch>.<pid>_<delivery attempts>.<hostname>

Although neither of these exactly follows the original Maildir specification, the key point is that it is impossible for mb2md.pl and procmail to generate identical filenames because they use different filename formats. Thus, it is safe for both programs to write messages to the Maildir at the same time.

Another way this could potentially go awry is if a user were reading their mail and making changes while this conversion process was running. Thus, I ask users to exit their mail programs while I run the conversion. While all of this isn't perfectly safe, it's adequate for my needs. Converting even a large user mailstore takes well under an hour.

Converting Pine

The user's mail is now all in Maildir format, but the only way the user can now access it is with a mail program that understands Maildir. Mutt for example works very well, especially with the new header-caching logic in the 1.5 development series. However, the majority of my users use pine or Thunderbird to access their mail. Pine doesn't support Maildir, so as I mentioned previously, I converted each user's pine setup to use IMAP instead. That way Dovecot can do all the Maildir handling, and users still get to run pine in essentially the same way as before the conversion.

The pine conversion process is entirely automatic, mostly because the pine configuration file (.pinerc) is simpler than the procmail configuration file. The conversion all takes place in the ConvertPinerc() subroutine, which starts at line 251. The logic is similar to what happens with the user .procmailrc -- if an old one exists, it is copied to a temporary file and, if it doesn't, a template is copied to the temporary file. Then, the script loops through the entire user .pinerc checking for the relevant config options. Each one is commented out and replaced with an updated line. The template .pinerc contains all the pine configuration options, so I handle it the same way as an existing user .pinerc. If for some reason any of the configuration options aren't found in the existing file, the script adds them to the end of the file. This isn't the prettiest way to parse the file, but it works and is reliable.

Tweaking Dovecot

Almost all the pieces are now in place: Mail is being delivered to the Maildir folder and pine is talking over IMAP. One thing is missing, however; the IMAP server (Dovecot) still doesn't know to check ~/Maildir for mail instead of /var/mail/user. As I explained previously, the Dovecot server at my site by default looks in /var/mail/user. I change this on a user-by-user basis via a custom /usr/local/etc/dovecot.passwd file. Dovecot reads this file first and, if a user is found in it, the default settings are overridden. The GenDoveCotPasswdEntry() subroutine starting on line 133 takes care of this conversion. That is one part of the script you would need to customize to match your own authentication setup. With a few tweaks, this subroutine could be adapted to work with regular Linux passwd files, NIS, LDAP, or any other authentication mechanism you might use.

In my setup Dovecot operates on a separate server, so I can't just automatically add the resultant entry to the private Dovecot password file. Thus, the solution is to display it and tell the person running the script to cut and paste. This at least cuts down on typos in the passwd file, because my script generates the entire entry.

Note also that another Dovecot conversion occurs slightly earlier in the script in the ConvertSubscriptions() subroutine on line 367. The IMAP server tracks which IMAP folders a user is subscribed to using the subscriptions file. The syntax of this file is different for mbox folders or Maildir folders, so some regex munging is needed. After all this, the script is essentially done. Dovecot checks the custom password file on every IMAP login so no restart of Dovecot is required. The last thing the script does is remove the temporary mail folders.

Summary

At the time of writing, I have converted approximately 25% of the users at my site from mbox to Maildir using this script. The effort to develop the script has definitely proven worthwhile. I guarantee that I would make mistakes performing this conversion manually for each user -- there are simply too many variables to remember. This script is the best way I have found to perform this task in a reliable and repeatable manner. These qualities are particularly important when dealing with mail, because it is absolutely indispensable in the modern office and users expect it to always work perfectly. Ideally, I would convert all users to Maildir at once, but this scripted approach is the next best thing.

I hope this article illustrates that real systems administration programming is a series of compromises. You never have the time to develop the perfect script. You must simply weigh the tradeoffs carefully, think over the problem, and do the best that time and other constraints allow.

Phil Hollenback is a Linux System Administrator at Schrodinger (www.schrodinger.com) in New York City. He holds a B.S. in Computer Science with an emphasis in AI. Outside of work, he tries to avoid getting run over while skateboarding the streets of Manhattan. He can be reached via his website at www.hollenback.net.