Sorting with the Schwartzian Transform

Randal L. Schwartz

It was a rainy April in Oregon more than a decade ago when I saw the Usenet post made by Hugo Andrade Cartaxeiro on the now defunct comp.lang.perl newsgroup:

I have a (big) string like that:

print $str;
eir      11   9   2   6   3   1   1   81%   63%   13
oos      10   6   4   3   3   0   4   60%   70%   25
hrh      10   6   4   5   1   2   2   60%   70%   15
spp      10   6   4   3   3   1   3   60%   60%   14

and I like to sort it with the last field as the order key. I know perl has some features to do it, but I can't make 'em work properly.

In the middle of the night of that rainy April (well, I can't remember whether it was rainy, but that's a likely bet in Oregon), I replied, rather briefly, with the code snippet:

$str =
join "\n",
map { $_->[0] }
sort { $a->[1] <=> $b->[1] }
map { [$_, (split)[-1]] }
split /\n/,
$str;

And even labeled it "speaking Perl with a Lisp". As I posted that snippet, I had no idea that this particular construct would be named and taught as part of idiomatic Perl, for I had created the Schwartzian Transform. No, I didn't name it, but in the follow-up post from fellow Perl author and trainer Tom Christiansen, which began:

Oh for cryin' out loud, Randal! You expect a NEW PERL PROGRAMMER to make heads or tails of THAT? :-) You're postings JAPHs for solutions, which isn't going to help a lot. You'll probably manage to scare these poor people away from the language forever? :-) BTW, you have a bug.

Tom eventually went on to describe what my code actually did. Oddly enough, the final lines of that post end with:

I'm just submitting a sample chapter for his perusal for inclusion the mythical Alpaca Book :-)

It would be another 8 years before I would finally write that book, making it the only O'Reilly book whose cover animal was known that far in advance.

On the next update to the "sort" function description in the manpages, Tom added the snippet:

# same thing using a Schwartzian Transform (no temps)
@new = map { $_->[0] }
    sort { $b->[1] <=> $a->[1]
                   ||
           $a->[2] cmp $b->[2]
    } map { [$_, /=(\d+)/, uc($_)] } @old;

Although the lines of code remain in today's perlfunc manpage, the phrase now lives only within perlfaq4. Thus, the phrase became the official description of the technique.

So, what is this transform? How did it solve the original problem? And more importantly, what was the bug?

Like nearly all Perl syntax, the join, map, sort, and split functions work right-to-left, taking their arguments on the right of the keyword, and producing a result to the left. This linked right-to-left strategy creates a little assembly line, pulling apart the string, working on the parts, and reassembling it to a single string again. Let's look at each of the steps, pulled apart separately, and introduce variables to hold the intermediate values.

First, we turn $str into a list of lines (four lines for the original data):

my @lines = split /\n/, $str;

The split rips the newlines off the end of the string. One of my students named the delimiter specification as "the deliminator" as a way of remembering that, although I think that was by accident.

Second, we turn the individual lines into an equal number of arrayrefs:

my @annotated_lines = map { [$_, (split)[-1]] } @lines;

There's a lot going on here. The map inserts each element of @lines into $_, then evaluates the expression, which yields a reference to an anonymous array. To make it a bit clearer, let's write that as:

my @annotated_lines = map {
  my @result = ($_, (split)[-1]);
  \@result;
} @lines;

Well, only a bit clearer. We can see that each result consists of two elements: the original line (in $_), and the value of that ugly split-inside-a-literal-slice. The split has no arguments, so we're splitting $_ on whitespace. The resulting list value is then sliced with an index of -1, which means "take the last element, no matter how long the list is". So for the first line, we now have an array containing the original line (without the newline) and the number 13. Thus, we're creating @annotated_lines to be roughly:

my @annotated_lines = (
  ["eir  11   9   2   6   3   1  1  81%  63%  13", "13"],
  ["oos  10   6   4   3   3   0  4  60%  70%  25", "25"],
  ["hrh  10   6   4   5   1   2  2  60%  70%  15", "15"],
  ["spp  10   6   4   3   3   1  3  60%  60%  14", "14"],
);

Notice how we can now quickly get at the "sort key" for each line. If we look at $annotated_lines[2][1] (15) and compare it with $annotated_lines[3][1] (14), we see that the third line would sort after the fourth line in the final output. And that's the next step in the transform -- we want to shuffle these lines, looking at the second element of each list to decide the sort order:

my @sorted_lines = sort { $a->[1] <=> $b->[1] } @annotated_lines;

Inside the sort block, $a and $b stand in for two of the elements of the input list. The result of the sort block determines the before/after ordering of the final list. In our case, $a and $b are both arrayrefs, so we dereference them looking at the second item of the array (our sort key), and then compare then numerically (with the spaceship operator), yielding the appropriate -1 or +1 value to put them in ascending numeric order. To get a descending order, I could have swapped the $a and $b variables.

As an aside, when the keys are equal, the spaceship operator returns a 0 value, meaning "I don't care what the order of these lines in the output might be". For many years, Perl's built-in sort operator was unstable, meaning that a 0 result here would produce an unpredictable ordering of the two lines. Recent versions of Perl introduced a stable sort strategy, meaning that the output lines will be in the same relative ordering as the input for this condition.

We now have the sorted lines, but it's not exactly palatable for the original request, because our sorted data is buried within the first element of each sublist of our list. Let's extract those back out, with another map:

my @clean_lines = map { $_->[0] } @sorted_lines;

And now we have the lines, sorted by last column. Just one last step to do now, because the original request was to have a single string:

my $result = join "\n", @clean_lines;

And this glues the list of lines together, putting newlines between each element. Oops, that's the bug. I really wanted:

$line1 . "\n" . $line2 . "\n" . $line3 . "\n"

when in fact what I got was:

$line1 . "\n" . $line2 . "\n" . $line3

and it's missing that final newline. What I should have done perhaps was something like:

my @clean_lines_with_newlines = map "$_\n", @clean_lines;
my $result = join "", @clean_lines_with_newlines;

Or, since my key-extracting split would have worked even if I had retained the trailing newlines, I could have generated @lines initially with:

my @lines = $str =~ /(.*\n)/g;

but that wouldn't have been as left to right. To really get it to be left to right, I'd have to resort to a look-behind split pattern:

my @lines = split /(?<=\n)/, $str;

But we're now getting far enough into the complex code that I'm distracting even myself as I write this, so let's get back to the main point.

In the Schwartzian Transform, the keys are extracted into a readily accessible form (in this case, an additional column), so that the sort block be executed relatively cheaply. Why does that matter? Well, consider an alternative steps to get from @lines to @clean_lines:

my @clean_lines = sort {
  my $key_a = (split ' ', $a)[-1];
  my $key_b = (split ' ', $b)[-1];
  $key_a <=> $key_b;
} @lines;

Instead of computing each key all at once and caching the result, we're computing the key as needed. There's no difference functionally, but we pay a penalty of execution time.

Consider what happens when sort first needs to know how the line ending in 13 compares with the line ending in 25. These relatively expensive splits are executed for each line, and we get 13 and 25 in the two local variables, and an appropriate response is returned (the line with 13 sorts before the line with 25). But when the line ending with 13 is then compared with the line ending with 15, we need to re-execute the split to get the 13 value again. Oops.

And while it may not make a difference for this small dataset, once we get into the tens or hundreds or thousands of elements in the list, the cost of re-computing these splits rapidly dominates the calculations. Hence, we want to do that once and once only.

I hope this helps explain the Schwartzian Transform for you. Until next time, enjoy!

Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software design, system administration, security, technical writing, and training. He has coauthored the "must-have" standards: Programming Perl, Learning Perl, Learning Perl for Win32 Systems, and Effective Perl Programming. He's also a frequent contributor to the Perl newsgroups, and has moderated comp.lang.perl.announce since its inception. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc.