Mining Mail

The Perl Journal December 2002

By Simon Cozens

Simon Cozens is a freelance programmer and author, whose titles include Beginning Perl (Wrox Press, 2000) and Extending and Embedding Perl (Manning Publications, 2002). He's the creator of over 30 CPAN modules, a former Parrot pumpking, and an obsessive player of the Japanese game of Go. Simon can be reached at simon@simon-cozens.org.

In my article, "Filtering Mail with Mail::Audit and News::Gateway" (TPJ, Summer 2000), I discussed a relatively simple way to help manage e-mail by filtering into mail folders and gatewaying to private news groups. In this article, I'll discuss the "next generation" of mail handling and mail mining, and demonstrate the utility of a Mail::Miner setup.

Data Mining

The "formal" definition of data mining from the Free Online Dictionary of Computing states that it is "analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data." However, in this article we use the term in a much looser sense: data mining is the automated extraction of core pieces of information from a mass of data, and its filing in such a way as to make querying and retrieval relatively easy.

One convenient source of a massive corpus of data suitable for data mining is the mass of e-mail that arrives at our system every day.

E-mail is a surprisingly interesting data format. It contains a lot of structured, regular data in the form of mail headers, which is easy for a computer to parse. Unfortunately, the utility of the mail headers is pretty hit and miss. While things like "To" and "Subject" are always going to be important, in the vast majority of cases, many of the headers are almost useless when the message has been delivered, filed, and read.

The structure of the body is also relatively easy to parse; there may be binary or textual attachments, which may or may not contain useful data. Finally, there's usually one reasonably large unstructured part, the textual body of the message.

Just like the mail headers, this is pretty hit and miss, too. There could be things that we will want to remember later: phone numbers, dates, place names, addresses, snippets of code, and so on. But interspersed with that, we find line after line of small talk, signature files, flames, and all kinds of other noninformation that is almost useless when the message has been delivered, filed, and read.

The idea behind mail mining is to provide a means for separating the wheat from the chaff. I want to be able to find out where and when I'm meeting someone, without much concern for the state of the weather in western Japan one particular Friday afternoon. So our goal, then, is to produce a mechanism for extracting useful information from both the structured and unstructured portions of a mail message, preferably without human intervention, and provide a means for retrieving that information quickly and easily.

To put it in extremely human terms, I want a tool that lets me say "Show me the mail I got around three weeks ago from Nat that had something to do with web services and had an interesting snippet of code in it."

And this is precisely what the Mail::Miner module does.

The Mail::Miner Method

Mail::Miner, as its name implies, is a module, rather than a complete application. To be precise, it's a collection of modules arranged in the framework shown in Figure 1.

The mail comes in at the top of the diagram and is converted by your Mail::Audit filter into a MIME::Entity, which is handed to Mail::Miner::Message by whatever your delivery process happens to be. Right at that moment, an entry is created for the e-mail in a relational database, storing the from address, subject, and other useful but trivial metadata. Any attachments are stripped off and filed separately in the database, associated with the mail message in question. A notification is added to the body of the e-mail, of the form:

    [ text/x-perl attachment test.pl detached - use
             mm --detach 821
     to recover ]

Notice the format of this text. To retrieve the attachment, I just cut and paste the middle line onto a shell prompt, and the attachment will be dumped into the current directory. (If there's already a test.pl there, mm, the Mail::Miner command-line utility, will prompt before overwriting.)
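As a minimal sketch of how such a notice might be generated (this is not Mail::Miner's own code, and the helper name detach_notice is hypothetical), the function below takes the attachment's MIME type, its filename, and its database ID, and lays the text out so that the middle line is a complete command:

```perl
use strict;
use warnings;

# Hypothetical helper: build the placeholder left in the body where an
# attachment used to be. The layout copies the notice shown above.
sub detach_notice {
    my ($type, $filename, $id) = @_;
    return "[ $type attachment $filename detached - use\n"
         . "         mm --detach $id\n"
         . " to recover ]\n";
}

print detach_notice('text/x-perl', 'test.pl', 821);
```

Keeping the command alone on its own line is what makes the cut-and-paste trick work.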

Then Mail::Miner locates and calls any Mail::Miner recognizers. We'll come back to those in a second.

After this point, the mail, with its newly flattened body, can be filed into the database. Once that's done, the message can leave the Mail::Miner system, ready for delivery to the user's inbox.

Notice that here, even with no cleverness, we have a system for managing attachments, plus a searchable database of old e-mail. However, the real power of Mail::Miner comes in its recognizers.

Recognizers and Queries

Recognizers are simply modules that look for things that may be considered interesting in an incoming message ("assets"), file them away for later, and provide an interface to query for them.

Let's take a tour of the currently implemented Mail::Miner recognizers.

The first recognizer isn't really a recognizer at all, but it does provide a query mechanism—the Mail::Miner::Message module itself can be used to query the From address of messages in the database. Recognizers declare command-line options that they can fulfill, so I can now say:

  % mm --from tpj.com --summary
  26 matched
  766:2002-03-29:       Kevin Carlson <kcarlson@tpj.com>: The Perl Journal
  768:2002-03-29:       Kevin Carlson <kcarlson@tpj.com>:Re: The Perl Journal
  769:2002-03-29:       Kevin Carlson <kcarlson@tpj.com>:Re: The Perl Journal
  770:2002-03-29:       Kevin Carlson <kcarlson@tpj.com>:Re: The Perl Journal
  783:2002-04-03:       Kevin Carlson <kcarlson@tpj.com>:Re: The Perl Journal
  ...

If I hadn't specified the summary option, mm would have returned a dump of all of the above messages in a UNIX mailbox format, which gives us virtual folders, à la Evolution (http://www.ximian.com/products/evolution/).

Let's add some more intelligent recognizers. Now, I'm a firm believer in 80-percent solutions. Most of the time, the effort required to make an algorithm "perfect" isn't worth it. Edge cases are, well, edge cases; and if you don't expect the algorithm to get everything right, getting it right 80 percent of the time is just fine.

So when I say "intelligent" recognizers, I'm not referring to artificial intelligence; in fact, what the recognizers do is more along the lines of artificial stupidity, leaving the human (who is supposed to have some sort of natural intelligence) to do some elementary top-level filtering. Computers are good at grinding data, so we'll leave them to do that, and humans are good at top-level filtering, so we'll leave them to do that.

Think of the recognizers, then, as a production line of trained monkeys. When these monkeys find something interesting or shiny, they file it away in the database, as an "asset." Assets know which mail message they were found in, and which monkey discovered them.
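The production line can be pictured with a small sketch. Everything here is an assumption made for illustration: the two toy recognizers, the find_assets function, and the field names of the asset hashes are invented, and Mail::Miner's real recognizers are separate modules rather than entries in a hash.

```perl
use strict;
use warnings;

# Hypothetical recognizers ("monkeys"): each takes the message text and
# returns a list of interesting strings. These two are toy stand-ins.
my %recognizers = (
    email => sub { my ($text) = @_; return $text =~ /([\w.+-]+\@[\w.-]+\.\w+)/g },
    url   => sub { my ($text) = @_; return $text =~ m{(https?://\S+)}g },
);

# Run every recognizer over one message and return the assets found.
# Each asset records its message and the recognizer that found it.
sub find_assets {
    my ($message_id, $text) = @_;
    my @assets;
    for my $name (sort keys %recognizers) {
        for my $value ($recognizers{$name}->($text)) {
            push @assets,
                { message => $message_id, recognizer => $name, asset => $value };
        }
    }
    return @assets;
}

my @found = find_assets(42, 'Mail simon@example.org or see http://example.com/docs');
printf "%d %s %s\n", @{$_}{qw(message recognizer asset)} for @found;
```

Because every asset carries both a message ID and a recognizer name, a later query can ask for "whatever the url monkey found in message 42" without rescanning the mail.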

For instance, when a recognizer attempts to discover any phone numbers in a message, it throws up anything that it can find that looks even remotely like a phone number. This naturally produces one or two false positives—although not that many, since most of the long sequences of numbers, parentheses, and hyphens found in mail messages turn out to be phone numbers anyway—but that's "Officially OK" by the Mail::Miner design philosophy. After all, if you can very quickly scan through 500 MB of e-mail and produce three candidate phone numbers for someone, two of which are obviously bogus, I'd call that a sufficiently big win.
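Here is a rough sketch of that "anything that looks remotely like a phone number" approach. The pattern and the seven-digit threshold are my own guesses for illustration, not the ones Mail::Miner uses, and phone_candidates is a hypothetical name.

```perl
use strict;
use warnings;

# Deliberately greedy: a run of digits, spaces, dots, hyphens, and
# parentheses that may open with "+" or "(" and must end in a digit.
# This pattern is an illustration, not the one Mail::Miner uses.
sub phone_candidates {
    my ($text) = @_;
    my @found;
    while ($text =~ /([+(]?\d[\d ().-]{5,}\d)/g) {
        my $candidate = $1;
        my $digits = () = $candidate =~ /\d/g;    # count the digits
        push @found, $candidate if $digits >= 7;  # keep only long-enough runs
    }
    return @found;
}

print "$_\n" for phone_candidates(
    "Call me on (555) 123-4567 before 2002-10-31, or try room 12."
);
```

On this sample text the sketch reports the real number and also the date 2002-10-31, which is exactly the kind of obviously bogus candidate a human can discard at a glance.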

The phone number recognizer is actually a slightly interesting example, because using it alters the output format of mm. Essentially, there are two types of recognizer: those that help to find particular messages, and those that actually store "hard" information. The former type of recognizer produces a mailbox full of messages that match; the latter type of recognizer just dumps out the asset in question.

What does this mean in practice? Well, when you're asking mm about phone numbers, it's more than likely that you don't want messages containing phone numbers, but you actually want to get at the numbers themselves. So, for instance, if I ask Mail::Miner for Tim O'Reilly's phone number:

  % mm --from "Tim O'Reilly" --phone
 Phone numbers found in message 2863 from "Tim O'Reilly" <tim@oreilly.com>:
 (555) 123-4567

(Notice there that I've used both the simple "From" query tool and the query tool provided by the phone-number recognizer to form an additive filter.)

Of course, if I want to check this out, I can get a copy of the message in question:

  % mm --id 2863
  From mail-miner-2863@localhost Thu Oct 31 18:58:53 2002
  Received: from rock.oreilly.com ([209.204.146.34] helo=smtp.oreilly.com)
  ...

Or just find out what the mail was actually about:

 % mm --id 2863 --summary
 1 matched
 2863:2002-08-31: "Tim O'Reilly" <tim@oreilly.com>:Re: Mail::Miner update

And speaking of what mails are about, let's move on to another recognizer, the keywords recognizer.

One interesting aspect of Mail::Miner from my point of view is that it has sparked the development of a few other neat little modules. The first stemmed from the fact that, as discussions tend to drift, a once-relevant "Subject:" line is now completely irrelevant to a particular e-mail. How do I find important details about a forthcoming business trip if they're hidden in a thread entitled "Return from caller"?

To solve this problem, I came up with the amazingly simple Lingua::EN::Keywords module to extract a set of salient keywords from a block of text. The keywords recognizer runs this module over a mail message, files each keyword returned from Lingua::EN::Keywords as an asset attached to the message, and provides the (synonymous) --about and --keyword command-line query functions.
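To give a feel for what keyword extraction involves, here is a naive stand-in that simply counts words and drops common ones. It is not the algorithm inside Lingua::EN::Keywords; the function naive_keywords and its tiny stopword list are assumptions made for this sketch.

```perl
use strict;
use warnings;

# A tiny stopword list; a real one would be far longer.
my %stop = map { $_ => 1 } qw(
    the and but for with from are was were this that will have has not
);

# Return up to $limit words, most frequent first (ties broken alphabetically).
sub naive_keywords {
    my ($text, $limit) = @_;
    my %count;
    for my $word (map { lc $_ } ($text =~ /\b([A-Za-z]{3,})\b/g)) {
        $count{$word}++ unless $stop{$word};
    }
    my @ranked = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
    $#ranked = $limit - 1 if @ranked > $limit;
    return @ranked;
}

my $body = "The flight to Tokyo leaves Friday. Book the Tokyo hotel before "
         . "the flight, and send the hotel details to Tokyo office.";
print join(", ", naive_keywords($body, 3)), "\n";
```

Each word returned would then be filed as one asset against the message, so that a query such as --about tokyo finds the mail whatever its Subject line says.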

The other recognizer that is currently implemented is Mail::Miner::Recognizer::Address. This recognizes physical addresses by looking for something that looks a bit like a postcode or U.S. zip code and state, and filing the entire paragraph.
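A sketch of that heuristic follows. The two patterns are rough approximations written for this example, and address_paragraphs is a hypothetical name; the real recognizer may well look for different shapes.

```perl
use strict;
use warnings;

# Rough patterns, assumed for illustration: a two-letter state code
# followed by a five-digit zip, or something shaped like a UK postcode.
my $us_zip      = qr/\b[A-Z]{2}\s+\d{5}(?:-\d{4})?\b/;
my $uk_postcode = qr/\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b/;

# Return every paragraph that contains something zip- or postcode-like.
sub address_paragraphs {
    my ($text) = @_;
    return grep { /$us_zip/ || /$uk_postcode/ } split /\n\s*\n/, $text;
}

my $mail = join "\n",
    "Hi Simon,",
    "",
    "Send the proofs to:",
    "123 Example Street",
    "Springfield, IL 62701",
    "",
    "Hope the weather is better there.";

print "$_\n" for address_paragraphs($mail);
```

Filing the whole paragraph, rather than trying to parse out street and city, is the 80-percent solution again: the human reading the result can tell where the address starts and ends.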

Database Digression

There now follows a technical digression that you may skip if you're not particularly interested in reading a rant about relational databases.

As we've seen above with --from ... --phone, queries can be combined. As an interesting technical point, I wanted each query to be represented by a single SQL statement for efficiency. How this works in practice is that we generate "SELECT * FROM messages WHERE", each recognizer takes its argument and generates a suitable WHERE clause, and all the WHERE clauses get ANDed together.

When the SELECT statement returns a set of messages, another function in each of the recognizers is called to allow them to post-process and filter out inapplicable messages on a more fine-grained basis than can be achieved with raw SQL alone. This is an elegant and efficient design.

Unfortunately, most of the interesting recognizers are looking for messages that contain assets of a particular form. Hence, the WHERE clause that they return is actually a subselect along the lines of

  EXISTS (
     SELECT * FROM assets
     WHERE message = messages.id
       AND recogniser = 'me'
       AND asset LIKE '%something%'
  )

Of course, the most popular open source relational database does not support subselects. Hence, for the moment, Mail::Miner requires the second most popular open source relational database, PostgreSQL, until either MySQL gets its act together or I am persuaded that there's an equally elegant design that doesn't use subselects.
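The query assembly described in this section can be sketched as follows. This is only an illustration of the idea: the builder table, the build_query function, and column names such as from_address are assumptions, and the code only constructs the SQL string and bind values without touching a database.

```perl
use strict;
use warnings;

# Hypothetical query builders: each turns one command-line argument into
# a WHERE fragment plus the values to bind to its placeholders.
my %builders = (
    from  => sub { ("from_address LIKE ?", '%' . $_[0] . '%') },
    about => sub {
        ("EXISTS (SELECT 1 FROM assets WHERE assets.message = messages.id"
          . " AND assets.recogniser = 'keywords' AND assets.asset = ?)",
         $_[0])
    },
);

# AND together the fragments for every option the user supplied.
sub build_query {
    my (%options) = @_;
    my (@clauses, @binds);
    for my $name (sort keys %options) {
        my $builder = $builders{$name} or die "no recognizer handles --$name\n";
        my ($clause, @values) = $builder->($options{$name});
        push @clauses, $clause;
        push @binds,   @values;
    }
    my $sql = "SELECT * FROM messages";
    $sql .= " WHERE " . join(" AND ", @clauses) if @clauses;
    return ($sql, @binds);
}

my ($sql, @binds) = build_query(from => 'tpj.com', about => 'perl');
print "$sql\n@binds\n";
```

Passing the user's text as bind values rather than pasting it into the SQL keeps each recognizer's fragment fixed, which makes the fragments safe to join with AND in any order.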

It's About the User, Stupid

In the grand tradition of this sort of article, I shall talk at length about currently unimplemented features as though they were up and running.

The other module sparked by the Mail::Miner project came out of the need to express timeframes in a human manner. As well as being a big fan of 80 percent solutions, I'm also a big fan of fuzzy input. Fuzzy input means that the computer, which has an awful lot of processing power when it comes to precise operations, tries its best to understand the human, who isn't all that great at precise operations. This requires admitting that computer programs are there for the user's benefit and not the programmer's, but that's a different story and will be told at a different time.

So, given that the whole premise of Mail::Miner is that I only vaguely know what I'm looking for, I don't want to have to specify explicit dates and times in order to narrow down searches. If I knew the date and time of the e-mail, I wouldn't need Mail::Miner in the first place!

Instead, I want to be able to say "find me the e-mail I got from Adam sometime around a week ago." The Date::PeriodParser module was written to solve this problem: Given a "fuzzy" date expressed in English, produce a pair of UNIX time values that are likely to bracket the date.
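The shape of that interface (a phrase goes in, a pair of epoch times comes out) can be shown with a toy parser. This is not how Date::PeriodParser works internally; toy_period is a hypothetical function that understands only phrases like "3 days ago" or "around a week ago", and the window widths are arbitrary choices.

```perl
use strict;
use warnings;

my $DAY = 24 * 60 * 60;

# Toy stand-in for a fuzzy date parser. It understands only
# "[around] N|a day(s)|week(s) ago" and returns a (from, to) pair of
# epoch seconds centred on that moment. Without "around" the window is
# one unit wide; with it, two units wide.
sub toy_period {
    my ($phrase, $now) = @_;
    $now = time unless defined $now;
    my ($around, $n, $unit) =
        $phrase =~ /^\s*(around\s+)?(\d+|a)\s+(day|week)s?\s+ago\s*$/i
        or return;
    $n = 1 if lc($n) eq 'a';
    my $span   = lc($unit) eq 'week' ? 7 * $DAY : $DAY;
    my $centre = $now - $n * $span;
    my $half   = $around ? $span : $span / 2;
    return ($centre - $half, $centre + $half);
}

my $now = 1_036_100_000;   # arbitrary fixed "now", so runs are repeatable
print scalar(gmtime $_), "\n" for toy_period("around a week ago", $now);
```

The important design point is the return value: a bracket of two times, not a single instant, so the caller can turn it directly into a "between" condition on the message date.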

For instance, as I write this on a Thursday night:

  % perl -MDate::PeriodParser -le 'print scalar localtime $_ for
    parse_period("around the morning of the day before yesterday")'

  Mon Oct 28 22:00:00 2002
  Tue Oct 29 14:00:00 2002

"Around the morning of the day before yesterday" translates to "between very late on Monday evening to early Tuesday afternoon." Mail::Miner's --date option will give an interface to this.

It's Not Just About the User

So far in our explorations of Mail::Miner, we've seen how it can be used to extract salient information from the incoming e-mail of a single user. However, if the system were to be deployed across an organization with multiple users, this would require each user to have their own database.

Or would it? The natural extension to filing information about how you communicate by e-mail is to file information about how an organization communicates internally and with its clients.

A planned future phase of Mail::Miner is to have it sit on the main mail gateway to a company and file every single incoming and outgoing e-mail. From this simple idea, we can develop a customer relationship management utility—now you know where your clients live, what they do, and more importantly, who's been speaking to them about what.

So What Next?

The future of Mail::Miner lies in three distinct developments: first, the development of more specific recognizers; second, the development of this idea of Mail::Miner as an organization-wide tool; and third, the extension of Mail::Miner from a purely search and retrieval tool to an integrated part of an e-mail client.

In the first category, I intend to work on recognizers that detect and extract place names, dates and times, code snippets, human languages used in an e-mail, and much more.

In the second, I see far more possibilities. Adding a user-defined asset recognizer and asset-management tool will allow you to specify more accurately the details that you want to record from an e-mail; combine this with the idea of restructuring the assets system so that it can be applied either to a particular e-mail or a particular recipient, and you have a system that can store the fact that a particular client has expressed an interest in playing golf, or is going on holiday for the next two weeks, or many other things that a computer cannot pick out, no matter how many monkeys are involved.

Similarly, there could be recognizers that attempt to divine the relationships between correspondents on an e-mail message—if I always cc Joe when I e-mail Amy, then Mail::Miner should take notice of this.
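Since this recognizer does not exist yet, the following is purely a sketch of one way it might start: count how often each cc address accompanies each recipient, and report the ones that are always there. The function cc_habits and the message hash layout (to and cc as array references) are assumptions for the example.

```perl
use strict;
use warnings;

# Count how often each cc address accompanies each To address.
# Messages are plain hashes here; the field names are assumptions.
sub cc_habits {
    my (@messages) = @_;
    my (%sent, %together);
    for my $msg (@messages) {
        for my $to (@{ $msg->{to} }) {
            $sent{$to}++;
            $together{$to}{$_}++ for @{ $msg->{cc} || [] };
        }
    }
    # Keep pairs where the cc appears on every message to that recipient.
    my %always;
    for my $to (keys %together) {
        my @cc = grep { $together{$to}{$_} == $sent{$to} }
                 sort keys %{ $together{$to} };
        $always{$to} = \@cc if @cc;
    }
    return %always;
}

my %always = cc_habits(
    { to => ['amy@example.com'], cc => ['joe@example.com'] },
    { to => ['amy@example.com'], cc => ['joe@example.com', 'bob@example.com'] },
    { to => ['bob@example.com'] },
);
print "$_: @{ $always{$_} }\n" for sort keys %always;
```

A real version would probably want a threshold ("usually" rather than "always") and a minimum number of messages before drawing any conclusion.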

This ties in with the third category, which combines the data-retrieval capabilities of Mail::Miner with the ordinary e-mail client. I've already intimated that Mail::Miner can be used to generate virtual mail folders. Imagine if your favorite e-mail client could search read mail based on language or a vague description of when you read it! On this front, I expect to work on IMAP proxies that can use an ordinary mail client to perform Mail::Miner searches, as well as integration with some of the more common UNIX mail clients.

However, for me, Mail::Miner is and always has been a way to make sure I never need to remember anything again. Now, who do I have to send this article to? I'm sure I got an e-mail from someone at tpj.com a week or so ago...
