Sorting
with the Schwartzian Transform
Randal L. Schwartz
It was a rainy April in Oregon more than a decade ago when I saw
the Usenet post made by Hugo Andrade Cartaxeiro on the now defunct
comp.lang.perl newsgroup:
I have a (big) string like that:
print $str;
eir 11 9 2 6 3 1 1 81% 63% 13
oos 10 6 4 3 3 0 4 60% 70% 25
hrh 10 6 4 5 1 2 2 60% 70% 15
spp 10 6 4 3 3 1 3 60% 60% 14
and I like to sort it with the last field as the order key. I know
perl has some features to do it, but I can't make 'em work properly.
In the middle of the night of that rainy April (well, I can't
remember whether it was rainy, but that's a likely bet in Oregon),
I replied, rather briefly, with the code snippet:
$str =
join "\n",
map { $_->[0] }
sort { $a->[1] <=> $b->[1] }
map { [$_, (split)[-1]] }
split /\n/,
$str;
And even labeled it "speaking Perl with a Lisp". As I posted that
snippet, I had no idea that this particular construct would be named
and taught as part of idiomatic Perl, for I had created the Schwartzian
Transform. No, I didn't name it, but in the follow-up post from
fellow Perl author and trainer Tom Christiansen, which began:
Oh for cryin' out loud, Randal! You expect a NEW PERL PROGRAMMER
to make heads or tails of THAT? :-) You're postings JAPHs for solutions,
which isn't going to help a lot. You'll probably manage to scare
these poor people away from the language forever? :-) BTW, you have
a bug.
Tom eventually went on to describe what my code actually did.
Oddly enough, the final lines of that post end with:
I'm just submitting a sample chapter for his perusal for inclusion
the mythical Alpaca Book :-)
It would be another 8 years before I would finally write that
book, making it the only O'Reilly book whose cover animal was known
that far in advance.
On the next update to the "sort" function description in the manpages,
Tom added the snippet:
# same thing using a Schwartzian Transform (no temps)
@new = map { $_->[0] }
sort { $b->[1] <=> $a->[1]
||
$a->[2] cmp $b->[2]
} map { [$_, /=(\d+)/, uc($_)] } @old;
Although the lines of code remain in today's perlfunc manpage,
the phrase now lives only within perlfaq4. Thus, the phrase
became the official description of the technique.
So, what is this transform? How did it solve the original problem?
And more importantly, what was the bug?
Like nearly all Perl syntax, the join, map, sort, and split functions
work right-to-left, taking their arguments on the right of the keyword,
and producing a result to the left. This linked right-to-left strategy
creates a little assembly line, pulling apart the string, working
on the parts, and reassembling it to a single string again. Let's
look at each of the steps, pulled apart separately, and introduce
variables to hold the intermediate values.
First, we turn $str into a list of lines (four lines for
the original data):
my @lines = split /\n/, $str;
The split rips the newlines off the end of the string. One of my students
named the delimiter specification as "the deliminator" as a way of
remembering that, although I think that was by accident.
Second, we turn the individual lines into an equal number of arrayrefs:
my @annotated_lines = map { [$_, (split)[-1]] } @lines;
There's a lot going on here. The map inserts each element of
@lines into $_, then evaluates the expression, which
yields a reference to an anonymous array. To make it a bit clearer,
let's write that as:
my @annotated_lines = map {
my @result = ($_, (split)[-1]);
\@result;
} @lines;
Well, only a bit clearer. We can see that each result consists of
two elements: the original line (in $_), and the value of that
ugly split-inside-a-literal-slice. The split has no arguments, so
we're splitting $_ on whitespace. The resulting list value
is then sliced with an index of -1, which means "take the last
element, no matter how long the list is". So for the first line, we
now have an array containing the original line (without the newline)
and the number 13. Thus, we're creating @annotated_lines to
be roughly:
my @annotated_lines = (
["eir 11 9 2 6 3 1 1 81% 63% 13", "13"],
["oos 10 6 4 3 3 0 4 60% 70% 25", "25"],
["hrh 10 6 4 5 1 2 2 60% 70% 15", "15"],
["spp 10 6 4 3 3 1 3 60% 60% 14", "14"],
);
Notice how we can now quickly get at the "sort key" for each line.
If we look at $annotated_lines[2][1] (15) and compare it with
$annotated_lines[3][1] (14), we see that the third line would
sort after the fourth line in the final output. And that's the next
step in the transform -- we want to shuffle these lines, looking at
the second element of each list to decide the sort order:
my @sorted_lines = sort { $a->[1] <=> $b->[1] } @annotated_lines;
Inside the sort block, $a and $b stand in for two of
the elements of the input list. The result of the sort block determines
the before/after ordering of the final list. In our case, $a
and $b are both arrayrefs, so we dereference them looking at
the second item of the array (our sort key), and then compare then
numerically (with the spaceship operator), yielding the appropriate
-1 or +1 value to put them in ascending numeric order. To get a descending
order, I could have swapped the $a and $b variables.
As an aside, when the keys are equal, the spaceship operator returns
a 0 value, meaning "I don't care what the order of these lines in
the output might be". For many years, Perl's built-in sort operator
was unstable, meaning that a 0 result here would produce
an unpredictable ordering of the two lines. Recent versions of Perl
introduced a stable sort strategy, meaning that the output
lines will be in the same relative ordering as the input for this
condition.
We now have the sorted lines, but it's not exactly palatable for
the original request, because our sorted data is buried within the
first element of each sublist of our list. Let's extract those back
out, with another map:
my @clean_lines = map { $_->[0] } @sorted_lines;
And now we have the lines, sorted by last column. Just one last step
to do now, because the original request was to have a single string:
my $result = join "\n", @clean_lines;
And this glues the list of lines together, putting newlines between
each element. Oops, that's the bug. I really wanted:
$line1 . "\n" . $line2 . "\n" . $line3 . "\n"
when in fact what I got was:
$line1 . "\n" . $line2 . "\n" . $line3
and it's missing that final newline. What I should have done perhaps
was something like:
my @clean_lines_with_newlines = map "$_\n", @clean_lines;
my $result = join "", @clean_lines_with_newlines;
Or, since my key-extracting split would have worked even if I had
retained the trailing newlines, I could have generated @lines
initially with:
my @lines = $str =~ /(.*\n)/g;
but that wouldn't have been as left to right. To really get it to
be left to right, I'd have to resort to a look-behind split pattern:
my @lines = split /(?<=\n)/, $str;
But we're now getting far enough into the complex code that I'm distracting
even myself as I write this, so let's get back to the main point.
In the Schwartzian Transform, the keys are extracted into a readily
accessible form (in this case, an additional column), so that the
sort block be executed relatively cheaply. Why does that matter?
Well, consider an alternative steps to get from @lines to
@clean_lines:
my @clean_lines = sort {
my $key_a = (split ' ', $a)[-1];
my $key_b = (split ' ', $b)[-1];
$key_a <=> $key_b;
} @lines;
Instead of computing each key all at once and caching the result,
we're computing the key as needed. There's no difference functionally,
but we pay a penalty of execution time.
Consider what happens when sort first needs to know how the line
ending in 13 compares with the line ending in 25. These relatively
expensive splits are executed for each line, and we get 13 and 25
in the two local variables, and an appropriate response is returned
(the line with 13 sorts before the line with 25). But when the line
ending with 13 is then compared with the line ending with 15, we
need to re-execute the split to get the 13 value again. Oops.
And while it may not make a difference for this small dataset,
once we get into the tens or hundreds or thousands of elements in
the list, the cost of re-computing these splits rapidly dominates
the calculations. Hence, we want to do that once and once only.
I hope this helps explain the Schwartzian Transform for you. Until
next time, enjoy!
Randal L. Schwartz is a two-decade veteran of the software
industry -- skilled in software design, system administration, security,
technical writing, and training. He has coauthored the "must-have"
standards: Programming Perl, Learning Perl, Learning
Perl for Win32 Systems, and Effective Perl Programming.
He's also a frequent contributor to the Perl newsgroups, and has
moderated comp.lang.perl.announce since its inception. Since 1985,
Randal has owned and operated Stonehenge Consulting Services, Inc. |