Tracking Finances with WWW::Mechanize and HTML::Parser

The Perl Journal May 2003

By Simon Cozens

Simon is a freelance programmer and author, whose titles include Beginning Perl (Wrox Press, 2000) and Extending and Embedding Perl (Manning Publications, 2002). He's the creator of over 30 CPAN modules and a former Parrot pumpking. Simon can be reached at simon@simon-cozens.org.

One of the things I like about my bank is that it has a very comprehensive online banking service. I can fire up a web browser, log in, and transfer money, pay bills, check balances, and so on.

One of the things I really hate about it, though, is that this requires me to fire up a web browser, log in, and so on. So one of the first things I did when dealing with the bank was to write an LWP::UserAgent-based module that handled all the quirks of logging in and doing trivial tasks like checking balances and getting statements. I had Finance::Bank::LloydsTSB, and I was happy.

Unfortunately, the bank recently made matters worse by insisting on another layer of security. Now, don't get me wrong, I'm not averse to more security on my bank account per se, but I am averse to having my nice labor-saving module break.

However, I realized that this would give me the opportunity to rewrite F::B::LloydsTSB in terms of the WWW::Mechanize module, and also give me an opportunity to tell you how I did it.

WWW::Mechanize

But first, what is WWW::Mechanize, and why is it better than LWP::UserAgent? WWW::Mechanize, written by Andy Lester, is a fork of Kirrily Robert (Skud)'s WWW::Automate. Skud wrote WWW::Automate in order to help test web-based applications. The idea was to have a subclass of LWP::UserAgent that did more work behind the scenes, which would enable it to feel like it was emulating a web browser, rather than just a dumb web client.

For instance, it knows about the forms on a page, and can help you fill them in; it knows about following links, hitting a back button, reading the title of a page, and so on.

Let's take a simple example. We'll visit the Perl home page, http://www.perl.com/, follow the first link we find, and see where we end up:

use WWW::Mechanize;
my $agent = WWW::Mechanize->new();
$agent->get("http://www.perl.com/");
$agent->follow(0);
print $agent->title, " : ", $agent->uri, "\n";

(As it happens, it's a link back to www.perl.com, but hey.)

As WWW::Mechanize is a subclass of LWP::UserAgent, all the familiar methods are there—new creates a new user agent and get downloads a page.

However, there's also the new follow method, which takes the number of a link on the page, starting from zero; 0 is the first link. Mechanize also stores the URL of the page it's currently visiting, and we can retrieve that with the uri method. It also knows how to parse the HTML for the page and extract the HTML title, which we retrieve with title.
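
In releases of WWW::Mechanize newer than the one used here, follow has given way to follow_link, which counts links from 1 rather than 0. Here is a minimal sketch of the same walk-through, assuming a current WWW::Mechanize is installed. The two local files, index.html and next.html, are made up for the example so that it runs without a network connection:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;
use URI::file;
use WWW::Mechanize;

# Two throwaway local pages stand in for a real site (hypothetical content).
my $dir   = tempdir(CLEANUP => 1);
my %pages = (
    'index.html' => '<html><head><title>Start</title></head>'
                  . '<body><a href="next.html">Next</a></body></html>',
    'next.html'  => '<html><head><title>Next page</title></head>'
                  . '<body>Done</body></html>',
);
for my $name (keys %pages) {
    my $path = File::Spec->catfile($dir, $name);
    open my $fh, '>', $path or die "Can't write $path: $!";
    print {$fh} $pages{$name};
    close $fh or die "Can't close $path: $!";
}

my $agent = WWW::Mechanize->new();
my $start = URI::file->new(File::Spec->catfile($dir, 'index.html'));
$agent->get($start->as_string);
$agent->follow_link(n => 1);    # n counts from 1, so this is the first link

my $title = $agent->title;
print $title, " : ", $agent->uri, "\n";
```

Against a live site, only the URL passed to get would change.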

Examining the Lloyds Site

To work out how we'd make Mechanize automate access to our web site, we first need to look at how we'd do so in a browser. As it turns out, we'll find some unpleasant secrets along the way.

We start by going to the site's entry URL, https://online.lloydstsb.co.uk/customer.ibc/. This gives us the main login box, prompting us for a username and password. We fill this in, and post the form back to the site (see Figure 1).

Once we've done that, we come up against the new "memorable information" page, presumably created in order to stop people like us writing programmatic interfaces to log into the system. As well as a password, we've supplied the bank with a nine-character piece of information, and the new page presents us with a form with three drop-down menus, and asks us to select certain characters from the memorable information phrase (see Figure 2).

Once we post that back, we're finally logged in. From the post-login page, the account balances are stored in a table that we'll parse out using HTML::Parser—more on that later. First, we'll translate this complicated login sequence into WWW::Mechanize code.

Mechanizing the Login

The first phase is obvious—we want to get the front page, fill in the form, and click the "log in" button. Here's what this code looked like originally:

my $orig_r = $ua->get("https://online.lloydstsb.co.uk/customer.ibc");
croak $orig_r->error_as_HTML unless $orig_r->is_success;

my $orig = $orig_r->content;
my $key;

$orig =~ /name="Key" type="HIDDEN" value="(\d+)"/
    or croak "Couldn't parse key!";
$key = $1;
my $check = $ua->post("https://online.lloydstsb.co.uk/customer.ibc", {
    Key       => $key,
    LOGONPAGE => "LOGONPAGE",
    UserId1   => $opts{username},
    Password  => $opts{password},
});

As you can see, a lot of this is handling the hidden form fields that the front page provides. With Mechanize, all of that is taken away as the forms are preparsed and the form values persist nicely:

$ua->get("https://online.lloydstsb.co.uk/customer.ibc");
croak $ua->res->error_as_HTML unless $ua->res->is_success;
$ua->field(UserId1  => $opts{username});
$ua->field(Password => $opts{password});
$ua->click;

Much more Perlish! This is what I like about WWW::Mechanize; it allows us to code at the appropriate level—we're simply saying what we want done, instead of how we want to do it. We don't want to get bogged down in the mechanics, we just want to fill in two fields and click the button.

Another useful feature is that we don't need to explicitly store the HTTP::Response object returned by get; Mechanize automatically stashes that away for us, as well as the current HTTP::Request.

Unfortunately, this doesn't work. It would, were it not for this little snippet of HTML:

<input type="UserId1" name="UserId1" ...

Mechanize uses HTML::Form, which quite rightly gets very distraught with the idea of a UserId1-type input. Most browsers render this as a text input, and we should do the same. Hence we have to get a little dirty with the HTML::Form object. First, we find the object that represents the user ID form input:

my $input = $ua->current_form->find_input("UserId1");

This is currently an ignored input, an instance of HTML::Form::IgnoreInput. We want to change it into a text input:

$input->{type} = "text";
bless $input, "HTML::Form::TextInput";

With the input repaired, the field and click calls shown earlier go through, and we should find ourselves at a page with the three drop-down menus on it. If not, we've probably failed to log in. The positions of the three characters it wants are stored in ResponseKey0 through ResponseKey2, so we pick the matching characters out of our memorable information phrase and put them into ResponseValue0 through ResponseValue2:

for (0..2) {
    my $key;
    eval { $key = $ua->current_form->find_input("ResponseKey$_")->value; };
    croak "Couldn't log in; check your password and username" if $@;
    my $value = substr(lc $opts{memorable}, $key-1, 1);
    $ua->field("ResponseValue$_" => $value);
}
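
The character-picking step in that loop can be tried on its own, away from the bank. The following is only a sketch: memorable_chars is a helper name invented here, and the phrase and positions are made up. It keeps the loop's conventions, so positions count from 1 and the phrase is lowercased first:

```perl
use strict;
use warnings;

# Return the characters found at the given 1-based positions of a phrase,
# lowercased in the same way as the login loop does.
sub memorable_chars {
    my ($phrase, @positions) = @_;
    return map { substr(lc $phrase, $_ - 1, 1) } @positions;
}

# Hypothetical nine-character phrase and requested positions.
my @chars = memorable_chars('Sausages1', 2, 5, 9);
print join(',', @chars), "\n";
```

The $_ - 1 is there because substr counts from 0, while the positions the form asks for count from 1.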

Now this happens to return a redirect (it's funny how bank sites tend to bring out all the interesting edge cases in screen scraping...), so we go to the location it specifies:

my $response = $ua->click;
$ua->get($response->{_headers}->{location});
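
Reaching into $response->{_headers} depends on how HTTP::Response happens to store its headers internally. The class also offers a public header method, which should be a safer way to read the redirect target. Here is a small sketch using a hand-built response in place of the one click returns. The URL is a placeholder, not the bank's real redirect:

```perl
use strict;
use warnings;
use HTTP::Response;

# A hand-built 302 stands in for the response returned by $ua->click.
# The Location value is a placeholder, not a real bank URL.
my $response = HTTP::Response->new(
    302, 'Found', [ Location => 'https://bank.example/accounts' ],
);

# Only look for a target if the server really did send a redirect.
my $next;
if ($response->is_redirect) {
    $next = $response->header('Location');
}
print "Would fetch: $next\n" if defined $next;
```

In the login code, the final step would then read $ua->get($response->header('Location')).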

And now, finally, we're logged in! Now what?

Extracting the Account Balances

The account balances are stored in a table on the front page. We need to parse the HTML for the table and extract the account name, number, and the balance. Thankfully, Jonathan Stowe has written a very neat little HTML::Parser-based table parser in his collection of HTML::Parser examples (http://www.gellyfish.com/htexamples/), and so we steal his code pretty much wholesale.

HTML::Parser works by calling various methods every time it sees certain elements of the HTML document—a method for an opening tag, one for a closing tag, and one for any text in the middle.

We want to collect the data out of a table, so we want to be concerned with the table, tr, and td tags. In particular, when we see a tr tag, we want to start a new row; when we see a td tag, we start a new entry:

sub start {
   my ($self,$tag,$attr,$attrseq,$orig) = @_;
   if ($tag eq 'table') { $intable++;  $self->{Table} = []; }
   if ($tag eq 'tr')    { $inrecord++; $self->{Row}   = []; }
   if ($tag eq 'td')    { $infield++;  $self->{Field} = ''; }
}

HTML::Parser actually passes in a lot of things we don't really care about, such as the attributes; we're only interested in storing data structures for the table, row, and cell.

Now, if we're inside a table and a row and a cell, we want to collect any text we see into the current field:

sub text {
   my ($self,$text) = @_;
   if ($intable && $inrecord && $infield) { $self->{Field} .= $text; }
}

We have to concatenate onto $self->{Field}, as the cell may contain other tags. If we have:

<TD> Hello, <B>esteemed</B> visitor</TD>

then text will actually be called three times, and start and end will be called for the B tag as well. Because we want all three pieces of text instead of just the last one, we concatenate them all together.

The real hard work is done by the end tag processor. We've been gathering text into $self->{Field}, and when we come to the end of our cell, the </td> tag, we push the contents we've accumulated into the current row; similarly, when we come to the end of a row, we push that row onto the table:

sub end {
   my ($self,$tag) = @_;
   if ($tag eq 'table') { $intable--; }
   if ($tag eq 'td')    { $infield--;
                          push @{$self->{Row}}, $self->{Field}; }
   if ($tag eq 'tr')    { $inrecord--;
                          push @{$self->{Table}}, $self->{Row}; }
}

And, well, that's it; once this package inherits from HTML::Parser like so:

package TableThing;
use strict;
use vars qw(@ISA $infield $inrecord $intable);
use base 'HTML::Parser';

we will be able to use it to parse our tables. Once we've logged in, we can say:

my $table_parser = TableThing->new;
$table_parser->parse($ua->content);

and $table_parser will now contain a data structure representing the table of accounts. Extracting the balances is now a matter of ordinary Perl data-structure munging.
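
To see the three callbacks working together, here is a self-contained sketch run against a small made-up table. It assumes HTML::Parser is installed and departs from the listing above in two ways: the package is called TableSketch (a name invented here), and the nesting counters live in the parser object rather than in package variables, so that two parsers would not share state:

```perl
package TableSketch;
use strict;
use warnings;
use base 'HTML::Parser';

# Opening tags: start a fresh table, row, or cell and note that we are inside it.
sub start {
    my ($self, $tag) = @_;
    if ($tag eq 'table') { $self->{in_table}++; $self->{Table} = []; }
    if ($tag eq 'tr')    { $self->{in_row}++;   $self->{Row}   = []; }
    if ($tag eq 'td')    { $self->{in_cell}++;  $self->{Field} = ''; }
}

# Text: append to the current cell, since a cell may arrive in several pieces.
sub text {
    my ($self, $text) = @_;
    $self->{Field} .= $text
        if $self->{in_table} && $self->{in_row} && $self->{in_cell};
}

# Closing tags: store the finished cell in its row, and the finished row in the table.
sub end {
    my ($self, $tag) = @_;
    if ($tag eq 'table') { $self->{in_table}--; }
    if ($tag eq 'td')    { $self->{in_cell}--; push @{ $self->{Row} },   $self->{Field}; }
    if ($tag eq 'tr')    { $self->{in_row}--;  push @{ $self->{Table} }, $self->{Row}; }
}

package main;

# Made-up sample input, including a cell with nested markup.
my $html = <<'HTML';
<table>
<tr><td>Account</td><td>Balance</td></tr>
<tr><td> Hello, <b>esteemed</b> visitor</td><td>12.50 CR</td></tr>
</table>
HTML

my $parser = TableSketch->new;
$parser->parse($html);
$parser->eof;    # flush anything the parser is still buffering

for my $row (@{ $parser->{Table} }) {
    print join(' | ', @$row), "\n";
}
```

The second row's first cell comes out as a single string, even though the B tag splits it into three text events.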

Putting It Together

Let's now put this lot together into a module, Finance::Bank::LloydsTSB. We'll start with the ordinary module preamble, and set up our browser object:

package Finance::Bank::LloydsTSB;
use strict;
use Carp;
our $VERSION = '1.2';
use WWW::Mechanize;
our $ua = WWW::Mechanize->new(
    env_proxy => 1,
    keep_alive => 1,
    timeout => 30,
);

Our constructor will be called check_balance since it will return a bunch of account objects. We make sure that it has the requisite parameters, and bless that into an object:

sub check_balance {
    my ($class, %opts) = @_;
    croak "Must provide a password" 
        unless exists $opts{password};
    croak "Must provide a username" 
        unless exists $opts{username};
    croak "Must provide memorable information" 
        unless exists $opts{memorable};

    my $self = bless { %opts }, $class;

And now comes the Mechanize code we established before:

$ua->get("https://online.lloydstsb.co.uk/customer.ibc");
my $field = $ua->current_form->find_input("UserId1");
$field->{type} = "text";
bless $field, "HTML::Form::TextInput";
$ua->field(UserId1  => $opts{username});
$ua->field(Password => $opts{password});
$ua->click;

for (0..2) {
    my $key;
    eval { $key = $ua->current_form->find_input("ResponseKey$_")->value; };
    croak "Couldn't log in; check your password and username" if $@;
    my $value = substr(lc $opts{memorable}, $key-1, 1);
    $ua->field("ResponseValue$_" => $value);
}

my $response = $ua->click;
$ua->get($response->{_headers}->{location});

Getting the data out of the table turns out to be slightly tricky; first, we extract those table rows that contain nonwhitespace:

my $table_parser = TableThing->new;
$table_parser->parse($ua->content);
my @table = @{$table_parser->{Table}};
@table = grep { grep { s/&nbsp;//g; s/\s{2,}//g; /\S/ } @$_ } @table;

The top row is a header, so we get rid of that:

shift @table;

And now we look for cells that contain nonwhitespace:

my @accounts;
for (@table) {
    my @line = grep /\S/, @$_;

The balance is the last cell in the line and is specified as a number, followed by either CR for credit or DR for overdrawn, so we fix that up to be a real number:

    my $balance = pop @line;
    $balance =~ s/ CR//;
    $balance = -$balance if $balance =~ s/ DR//;

We can extract the other components of the table directly and bless them into the Finance::Bank::LloydsTSB::Account class. We'll also throw in a link to the current $self because, although it's not used at the moment, we can later use this for reconfirming the password when we make transfers or payments:

    push @accounts, (bless {
        balance    => $balance,
        name       => $line[0],
        sort_code  => $line[1],
        account_no => $line[2],
        parent     => $self
    }, "Finance::Bank::LloydsTSB::Account");
}
return @accounts;
}
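
The row clean-up and the balance conversion can be exercised without logging in anywhere. The sketch below uses made-up rows shaped like the parser's output. signed_balance is a helper name invented here, and the accounts are plain hashes rather than blessed objects:

```perl
use strict;
use warnings;

# Turn "250.00 CR" or "42.10 DR" into a signed number; overdrawn is negative.
sub signed_balance {
    my ($text) = @_;
    $text =~ s/ CR//;
    my $overdrawn = ($text =~ s/ DR//);
    return $overdrawn ? -$text : $text + 0;
}

# Hypothetical parser output: a header, a spacer row, and two accounts.
my @table = (
    [ 'Account', 'Sort code', 'Number', 'Balance' ],
    [ '&nbsp;', '  ', '', '&nbsp;' ],
    [ 'Current', '&nbsp;', '12-34-56', '01234567', '250.00 CR' ],
    [ 'Savings', '12-34-56', '07654321', '&nbsp;', '42.10 DR' ],
);

# Strip padding from every cell, then keep rows with at least one real cell.
for my $row (@table) {
    for (@$row) { s/&nbsp;//g; s/\s{2,}//g; }
}
@table = grep { grep { /\S/ } @$_ } @table;
shift @table;    # the first surviving row is the header

my @accounts;
for my $row (@table) {
    my @line    = grep { /\S/ } @$row;    # drop the now-empty cells
    my $balance = signed_balance(pop @line);
    push @accounts, {
        name       => $line[0],
        sort_code  => $line[1],
        account_no => $line[2],
        balance    => $balance,
    };
}

printf "%s %s %s %.2f\n", @{$_}{qw(name sort_code account_no balance)}
    for @accounts;
```

Dropping the empty cells before reading positions 0 to 2 means stray padding cells in either account row do not shift the fields.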

And that's basically our module. All that remains is to provide accessors for the name, sort_code, account_no, and balance, and we do this extremely lazily:

package Finance::Bank::LloydsTSB::Account;
our $AUTOLOAD;
sub AUTOLOAD { my $self = shift; $AUTOLOAD =~ s/.*:://; $self->{$AUTOLOAD} }
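
A one-line AUTOLOAD like this answers any method call at all, including Perl's own DESTROY call and misspelled accessor names. For comparison, here is a slightly less lazy sketch; AccountSketch and its new constructor are names invented for the example, not part of the module. It only answers for the four known fields:

```perl
package AccountSketch;
use strict;
use warnings;
use Carp;

our $AUTOLOAD;
my %is_field = map { $_ => 1 } qw(name sort_code account_no balance);

# Hypothetical constructor, only here so the example can build an object.
sub new { my ($class, %args) = @_; return bless {%args}, $class }

sub AUTOLOAD {
    my $self = shift;
    (my $name = $AUTOLOAD) =~ s/.*:://;    # strip the package qualifier
    return if $name eq 'DESTROY';          # object clean-up is not an accessor
    croak "No such method: $name" unless $is_field{$name};
    return $self->{$name};
}

package main;

my $account = AccountSketch->new(name => 'Current', balance => 250);
print $account->name, ": ", $account->balance, "\n";
```

The cost is a list of field names to maintain. In exchange, a typo such as $account->balence dies with a clear message instead of quietly returning undef.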

So before we know it, we've written an interface to our online banking system. Interestingly, even though we added another screen to go through in the form of our memorable information page, the module ended up being four lines shorter than the previous incarnation—this was directly due to changing from LWP::UserAgent to WWW::Mechanize and programming at a more appropriate level.

I've found WWW::Mechanize useful for hacking up all kinds of screen-scraping code, from simple tests of web-based services right up to full-featured CPAN modules as we've seen in this article. Finance::Bank::LloydsTSB is available from CPAN, and has spawned several other online banking access modules, many of which switched to Mechanize much earlier than I did. I hope from this article you've gained some impression of how to go about writing something to interface to your own banking service, and an idea of how to use WWW::Mechanize in order to automate web access from Perl.

TPJ