Determining List Relationships with List::Compare

The Perl Journal May, 2004

By James E. Keenan

James founded Perl Seminar New York in 2000, and is Treatment Team Leader for the New York State Office of Mental Health at Kingsboro Psychiatric Center. He can be contacted at jkeenan@cpan.org.

Two years ago, I was hired by the continuing education department of a local college to teach a six-week introductory course in Perl. The circumstances of my hiring were a bit abrupt: The person originally hired for the course quit the week before the course began. I was hired two nights before my first session, so I needed to develop course materials, and fast.

Perl, of course, came to my rescue. I decided to use Randal L. Schwartz's Learning Perl (O'Reilly & Associates, 2001) as the course text and to create an HTML-based slideshow as my curriculum. I was somewhat familiar with such slideshows because, at the first meeting of one of our local user groups, Perl Seminar New York, my colleague Mark Miller had demonstrated such a slideshow. But that show displayed photos; I wanted to display text. I did a quick hack on Mark's script so that I could accomplish the following:

Write a separate plaintext file, written in Perl's Plain Old Documentation (POD) format, for each slide in the show. Each such plaintext file's name ended in .slide.txt:

chomp.slide.txt
defined.slide.txt
# false.slide.txt  # commented-out for illustrative purposes
hello.slide.txt
intro1.slide.txt
print.slide.txt

Place the name of each slide in a master slidelist file.
Control the order in which slides appear in the show by simply moving their names up or down within slidelist. For instance, a more logical order for the slides above might be:

intro1.slide.txt
hello.slide.txt
print.slide.txt
chomp.slide.txt
defined.slide.txt

A First Attempt at Using Seen-Hashes

I had to make certain that, for each slide named in slidelist, the corresponding POD file actually existed in a particular subdirectory, texts, where I was writing such files. Put another way, I had to compare the list of filenames in slidelist with the list of actual files in texts and make sure that the first list was a mathematical subset of the second list. To code this, I first turned to Tom Christiansen and Nat Torkington's Perl Cookbook (O'Reilly & Associates, 2003) and its discussion of what I have since come to call "seen-hashes"; i.e., hashes that record whether certain strings have been "seen" in particular lists. (These can be thought of as a kind of lookup table.)

In Listing 1, I show the code I wrote before refactoring and optimizing. I first read the filenames in slidelist (lines 5-13); next read the POD files in texts (lines 15-17); then construct seen-hashes for each (lines 19-21). To determine whether @selections is a proper subset of @sources, I declare $subset_status and set its initial value to "1." I loop through the elements in @sources and if I encounter one that is not also a key in %seen_selections, I set $subset_status to a false value of "0" and exit the loop (lines 23-29).

At this point, I wanted to be warned if I had included a filename in slidelist that did not actually exist in texts. Lines 31-40 accomplish this and shut down the script early if those files are missing.

Next, I decided it would be nice to get a list of those slides for which I had created files but was not currently including in slidelist. Lines 50-60 accomplish this. So at this point, I had a validated list of slides I was ready to turn over to the slidebuilder mechanism itself (code not shown here).

Along with my slideshow, I was developing some basic Perl scripts that I distributed to my students in a printed handout. As not all these scripts made it into the final version of the handout, I created another master file, this time called scriptslist, with which I could control the order in which the selected scripts appeared in the handout—just as I did for the slides with slidelist. Except for some variable names, the code was exactly the same as in Listing 1.

Refactoring

At that point, I heard Mark Jason Dominus's voice crying out, "Repeated code is a mistake!" Beginning with Yet Another Perl Conference::North America in Pittsburgh in 2000, I had attended several of Mark's "Perl Program Repair Shop" talks and had absorbed two key points:

Code that is repeated within a single script should be refactored into a subroutine.
Code that is used in more than one script should be refactored into a module.

The fact that I was using the same code to manage scriptslist as I was for slidelist meant that I clearly had the makings of a module. But what functions were to go into that module?

I reread the relevant recipes in the Perl Cookbook and noticed that Tom and Nat used seen-hashes to derive several interesting relationships among two lists:

The union of the two lists.
The intersection of the two lists.
Items unique to the first of two lists (what Tom and Nat call "simple difference").
The symmetric difference of the two lists: items found in one or the other of two lists, but not both.

In Listing 1, the list described by keys %seen_selections in lines 33-37 was clearly a list of items unique to the first of two lists. In addition, I had used seen-hashes to derive two relationships not explicitly described in the Cookbook:

Items unique to the second of two lists were found in @unused, derived between lines 42 and 50.
The "left-to-right subset status" derived between lines 23 and 29.

Wouldn't it be nice, I thought, to have a function that, when provided with two lists as inputs, returned all the most interesting relationships between the lists?

The Birth of List::Compare

And thus, I had the inspiration for List::Compare. During the next month, when I wasn't busy writing slides for my students, I was creating an object-oriented interface in which:

1. References to arrays holding two lists are passed to the constructor:

use List::Compare;
$lc = List::Compare->new(\@selections, \@sources);

2. The constructor's initializer computes all of the most interesting relationships between the two lists (see Listing 2).

3. The results of those computations are stored in a hash that is blessed into the List::Compare object.

4. List::Compare methods simply return the various relationships:


@union        = $lc->get_union;
@intersection = $lc->get_intersection;
@Lonly        = $lc->get_unique;
@Ronly        = $lc->get_complement;
@LorRonly     = $lc->get_symmetric_difference;
$LR           = $lc->is_LsubsetR;

(For additional interlist relationships not described here, see the List::Compare documentation.)

Using List::Compare, I can rewrite Listing 1, as shown in Listing 3. What once was a 60-line script is now only 41 lines. Moreover, should I need at this point additional relationships between the two lists (e.g., their symmetric difference), I could simply call the appropriate List::Compare method and get the result without additional computation.

A More Mature List::Compare

I first showed List::Compare to other Perl hackers at the May 2002 meeting of Perl Seminar New York and its first CPAN version was uploaded the next month just before YAPC::NA::2002 in St. Louis. Since that time, I have worked on both its interface and internals in an attempt to allow potential users maximum flexibility and, where possible, speed improvements. Here are the most significant improvements:

Accelerated Mode. Shortly after List::Compare's first presentation, Perl Seminar NY member Josh Rabinowitz argued that if a user wanted to compute only one comparison between two lists (e.g., just their intersection), List::Compare should not spend time computing any other relationships between the two lists.

In response, I decided to offer the user the option of an "accelerated" mode in which the user passes -a as the first argument to the constructor:

$lca = List::Compare->new('-a', \@selections, \@sources);
@intersection = $lca->get_intersection;

Internally, the constructor calls an initializer, which populates the object with references to the lists submitted as arguments rather than with the results of computations of set relationships. In the accelerated mode, it is the individual method called that does the computation, not the initializer.

Preliminary benchmarking showed that if a user indeed wanted to extract only one comparison between two lists, the accelerated mode was faster. However, it was no faster than the regular mode if the user wanted two comparisons, and the regular mode pulled ahead if the user wanted three or more comparisons.

Multiple Mode. Not long after List::Compare's CPAN debut, I wondered: Why should a user be restricted to comparisons between just two lists? Why shouldn't a user be able to compute, say, the intersection of three or more lists at a time? So I set out to develop a "multiple" mode in which a user would simply pass additional lists as arguments to the constructor; see Example 1.

Comparing three or more lists at a time, however, requires a more careful specification of certain comparisons. When calling the get_unique() method, for example, I wanted the user to be able to get those items unique to, say, @Carmen and not just to the first list passed by reference to the constructor. Therefore, I designed that method's interface so the user could specify the index position of the targeted list as an argument to the method. Since Perl starts counting at zero, @Carmen's index position is "2" and a list of items unique to that array is generated like this:

@unique_Carmen = $lcm->get_unique(2);

Similarly, the items not found in @Don would be generated like so:

@complement_Carmen = $lcm->get_complement(3);

Enabling List::Compare to handle more than two lists also enables us to define two new relationships among the various lists: the complement of their intersection and the complement of their symmetric difference. But since that's very verbose, I have simplified their names for the purpose of method calls and documentation into nonintersection and shared.

List::Compare defines the nonintersection of several lists as those found in any of the lists passed to the constructor that do not appear in all of the lists (i.e., all items except those found in the intersection of the lists):

@nonintersection = $lcm->get_nonintersection;

List::Compare defines items shared among several lists as those which appear in more than one of the lists passed to the constructor (i.e., all items except those found in their symmetric difference):

@shared = $lcm->get_shared;

At first, List::Compare's multiple mode did not have an "accelerated" variant for faster calculation of just one relationship among three or more lists, but this limitation has been eliminated in the last year. To choose accelerated calculation, pass the -a option as the first argument to the constructor:

$lcm = List::Compare->new(
		   '-a', \@Al, \@Bob, \@Carmen, \@Don, \@Ed
	   );
@intersection = $lcm->get_intersection;

Returning Array References

Later in 2002, I had occasion to use List::Compare in some work on my day job for the New York State Office of Mental Health. I was using the get_union method to return a list, but then had to immediately pass a reference to the array holding that list to another function.

@union = $lc->get_union;
some_other_function(@union);

Since having subroutines receive arguments by—and return results by—references is faster than by receiving/returning whole lists, why not allow List::Compare to return just a reference to a list? This led to a new set of List::Compare _ref methods:

$unionref = $lc->get_union_ref;
some_other_function($unionref);

After this revision had been uploaded to CPAN, Glenn Maciag of Perl Seminar NY noted that if I had simply rewritten the get_union() method to use Perl's wantarray function to examine the method's calling context, I could have dispensed with a separate get_union_ref() method:

@union    = $lc->get_union;
$unionref = $lc->get_union;  # NOT IMPLEMENTED

However, since this approach was slightly less self-documenting and since the _ref methods were already out there on CPAN, I didn't implement his suggestion—though I may in the future.

Four Membership Methods

In a subsequent revision, I gave List::Compare the ability to answer these questions:

Given a string, determine to which of the lists passed to the constructor the string belongs.

@memb_arr = $lcm->is_member_which('golfer');

The list returned by is_member_which() is a list of the indexes in the constructor's argument list in which golfer is found; i.e.:

( 0, 1, 2 )

Given several strings, determine to which of the lists passed to the constructor the various strings belong. Do this by passing to the are_members_which() method a reference to an array holding references to the various lists under examination. Get a hash of array references as the return value:

$memb_hash_ref = $lcm->are_members_which(
			    [ qw| abel baker fargo hilton zebra | ]
		        );

"$memb_hash_ref" will be:

{
	 abel     => [ 0             ],
	 baker    => [ 0, 1          ],
	 fargo    => [ 0, 1, 2, 3, 4 ],
	 hilton   => [    1, 2       ],
	 zebra    => [               ],
};

Given a string, determine whether it can be found in any of the lists passed as arguments to the constructor. Return 1 if a specified string can be found in any of the lists and 0 if it cannot:

$found = $lcm->is_member_any('abel');

In the example above, $found will be 1 because abel is found in one or more of the lists passed as arguments to new().

Given several strings, determine if each such string can be found in any of the lists passed as arguments to the constructor. Do this by passing to the are_members_any() method a reference to an array holding references to the various lists under examination. Get a hash reference as the return value where the value of each element is either 1 if the particular string can be found in any of the lists under examination and 0 if it cannot.

$memb_hash_ref = $lcm->are_members_any(
			    [ qw| abel baker fargo hilton zebra | ]
		        );

"$memb_hash_ref" will be:

{
	 abel     => 1,
	 baker    => 1,
	 fargo    => 1,
	 hilton   => 1,
	 zebra    => 0,
};

Passing Seen-Hashes to the Constructor

In the course of a Perl script, if a user has already created two or more lookup tables in the form of seen-hashes and needs to determine set relationships among the lists implied by those seen-hashes, it seemed to me that he or she should be able to pass references to those seen-hashes directly to the constructor. This is now possible; see Example 2.

C<@intersection> will contain the following:

qw( baker camera delta edward fargo golfer)

The constructor figures out for itself whether it has been passed arrays or seen-hashes.

Returning Unsorted Lists

When I first wrote List::Compare, all comparison lists were sorted in ASCII-betical order by default. Since sorting imposes a cost in speed, a later revision enables a user who does not need a presorted list to pass an unsorted option to the constructor:

$lc = List::Compare->new('-u', \@Llist, \@Rlist);

$lc = List::Compare->new('—unsorted', \@Llist, \@Rlist);

I haven't benchmarked this approach, yet, but my impression is that the speed boost is largely marginal. It's there for you to use; TMTOWTDI.

A Functional Interface

In the autumn of 2003, I began to wonder if an object-oriented interface to List::Compare was always necessary. What would I have to do to enable a user to pass references to several lists directly to a function such as get_union() rather than first passing them to a constructor?

This is now possible with List::Compare::Functional:

use List::Compare::Functional qw( get_union get_complement );

# same 5 lists as above:  @Al, @Bob, @Carmen, @Don, @Ed

@union = get_union( [ \@Al, \@Bob, \@Carmen, \@Don, \@Ed ] );

Note that, as with many Perl modules that employ a functional interface rather than an object oriented one, the user must specifically import the function(s) he or she later wishes to call. (Import tag groups are available; see the documentation.)

Note also that with List::Compare::Functional, the first argument passed to the function is a reference to an array (anonymous or named), which, in turn holds a list of references to the lists under examination. This proved necessary to distinguish the arguments representing the lists under examination from particular lists that are passed to certain functions. For example, if we wish to find those items not found in @Don, we need to pass a second argument (also via reference) like this:

@complement_Don = get_complement(
		  	     [ \@Al, \@Bob, \@Carmen, \@Don, \@Ed ],
		   	     [ 3 ]
		 );

Since, on the inside, much of List::Compare::Functional's code works like that of List::Compare's accelerated mode, and since it incurs no overhead for object creation, it should provide faster results than List::Compare. But the interface to each List::Compare::Functional function is necessarily less elegant than that to List::Compare's methods.

Conclusion

List::Compare offers a variety of ways to compare two or more lists. The Perl code around which it is built is old, well-tested, and certainly not mine. It is, at its best, a sound implementation of a good interface. It is not rocket science.

It should be noted that, with the exception of List::Compare's get_bag() method (not discussed in this article), all comparisons conducted by List::Compare only ask whether a given string was seen in a given list at all. In general, List::Compare ignores how many times a string was seen in a given list. If you need to make decisions based on how many times a string was seen in a given list, don't use List::Compare; look elsewhere. You will find, however, many situations in which List::Compare or its Functional variants can save you from typing that Perl Cookbook code over again.

TPJ

Listing 1

#!/usr/bin/perl
use strict;
use warnings;

my ($raw, @selections, @sources);
open LIST, 'slidelist' or die "Can't open slidelist for reading:  $!";
while ($raw = <LIST>) {
        next if ($raw =~ /^\s+$/ or $raw =~ /^#/); # no comments or blanks
        next unless ($raw =~ /\.slide\.txt$/);
        chomp $raw;
        push(@selections, "texts/$raw");
}
close LIST or die "Cannot close slidelist:  $!";

opendir DIR, 'texts' or die "Couldn't open texts directory: $!";
@sources = map {"texts/$_"} grep {/\.slide\.txt$/} readdir DIR;
closedir DIR;

my (%seen_selections, %seen_sources);
$seen_selections{$_}++ foreach @selections;
$seen_sources{$_}++    foreach @sources;

my $subset_status = 1;
foreach (@selections) {
        if (! exists $seen_sources{$_}) {
                        $subset_status = 0;
                        last;
        }
}

unless ($subset_status) {
        my (%selections_only);
        foreach (keys %seen_selections) {
                        $selections_only{$_}++ if (! exists $seen_sources{$_});
        }
        print "These files, though listed in 'slidelist', are not found in 'texts' directory.\n\n";
        print "    $_\n" foreach (keys %selections_only);
        print "\nEdit 'slidelist' as needed and re-run script.\n";
        exit (0);
}

my (%intersection, %difference);
foreach (keys %seen_selections) {
        $intersection{$_}++ if (exists $seen_sources{$_});
}
foreach (keys %seen_sources) {
        $difference{$_}++ unless (exists $intersection{$_});
}

my @unused = keys %difference;
open UNUSED, ">unused" or die "Could not open unused for writing: $!";
print UNUSED "Files currently unused:\n";
if (@unused) {
        print UNUSED "    $_\n" foreach (sort @unused);
        print "There are unused files; see 'unused'.\n";
} else {
        print UNUSED "    [None.]\n";
        print "There are no unused files.\n";
}
close UNUSED or die "Could not close unused: $!";

Back to article

Listing 2

# Computation of Relationships between Two Lists in List::Compare's Regular Mode
# Regular Mode => Compares only two lists; computes all relationships at once;
# stores the results in a hash which is blessed into the List::Compare object.
# Below:  _init(), which is called by List::Compare's constructor and which returns
# a reference to the hash which the constructor will bless.

sub _init {
        my $self = shift;
        my ($unsortflag, $refL, $refR) = @_;
        my (%data, %seenL, %seenR);
        my @bag = $unsortflag ? (@$refL, @$refR) : sort(@$refL, @$refR);

        my (%intersection, %union, %Lonly, %Ronly, %LorRonly);
        my $LsubsetR_status = my $RsubsetL_status = 1;
        my $LequivalentR_status = 0;

        foreach (@$refL) { $seenL{$_}++ } 
        foreach (@$refR) { $seenR{$_}++ } 

        foreach (keys %seenL) {
                $union{$_}++;
                if (exists $seenR{$_}) {
                          $intersection{$_}++;
                } else {
                          $Lonly{$_}++;
                }
}

        foreach (keys %seenR) {
                $union{$_}++;
                $Ronly{$_}++ unless (exists $intersection{$_});
        }

        $LorRonly{$_}++ foreach ( (keys %Lonly), (keys %Ronly) );

        $LequivalentR_status = 1 if ( (keys %LorRonly) == 0);

        foreach (@$refL) {
                if (! exists $seenR{$_}) {
                        $LsubsetR_status = 0;
                        last;
                }
        }
        foreach (@$refR) {
                if (! exists $seenL{$_}) {
                        $RsubsetL_status = 0;
                        last;
                        }
        }

        $data{'seenL'}                = \%seenL; 
        $data{'seenR'}                = \%seenR; 
        $data{'intersection'}        = $unsortflag ? [    keys %intersection ] 
                                                                               : [ sort keys %intersection ];
        $data{'union'}                = $unsortflag ? [    keys %union ]
                                                                                                         : [ sort keys %union ];
        $data{'unique'}        = $unsortflag ? [    keys %Lonly ] 
                                                         : [ sort keys %Lonly ];
        $data{'complement'}        = $unsortflag ? [    keys %Ronly ] 
                                                         : [ sort keys %Ronly ];
        $data{'symmetric_difference'} = $unsortflag ? [    keys %LorRonly ] 
                                                         : [ sort keys %LorRonly ];
        $data{'LsubsetR_status'}        = $LsubsetR_status;
        $data{'RsubsetL_status'}        = $RsubsetL_status;
        $data{'LequivalentR_status'}  = $LequivalentR_status;
        $data{'bag'}                = \@bag;
        return \%data;
}

Back to article

Listing 3

#!/usr/bin/perl
use strict;
use warnings;
use List::Compare;

my ($raw, @selections, @sources);
open LIST, 'slidelist' or die "Can't open slidelist for reading:  $!";
while ($raw = <LIST>) {
        next if ($raw =~ /^\s+$/ or $raw =~ /^#/); # no comments or blanks
        next unless ($raw =~ /\.slide\.txt$/);
        chomp $raw;
        push(@selections, "texts/$raw");
}
close LIST or die "Cannot close slidelist:  $!";

opendir DIR, 'texts' or die "Couldn't open texts directory: $!";
@sources = map {"texts/$_"} grep {/\.slide\.txt$/} readdir DIR;
closedir DIR;

my $lc = List::Compare->new(\@selections, \@sources);
my $LR = $lc->is_LsubsetR;
unless ($LR) {
        my @slidelist_only = $lc->get_unique();
        print "These files, though listed in 'slidelist', are not found in 'texts' directory.\n\n";
        print "    $_\n" foreach (@slidelist_only);
        print "\n";
        print "Edit 'slidelist' as needed and re-run script.\n";
        exit (0);
}

my @unused = $lc->get_complement;
open(UNUSED, ">unused") or die "Could not open unused for writing: $!";
print UNUSED "Files currently unused:\n";
if (scalar(@unused)) {
        print UNUSED "    $_\n" foreach (sort @unused);
        print "There are unused files; see 'texts/unused'.\n";
} else {
        print UNUSED "    [None.]\n";
        print "There are no unused files.\n";
}
close(UNUSED) or die "Could not close unused: $!";

Back to article