Formatting Reports with Template Toolkit
Randal L. Schwartz
Recently, a Usenet posting (yes, I still read that)
discussed using Perl to determine the disk block usage for various users on
the system. Apparently, the poster had noticed a discrepancy between the
number of blocks reported by du and the total blocks returned by summing Perl's -s function applied to each
file. I replied that you can't use -s, because that's merely the length of the file, and
thanks to sparse Unix files, the value wasn't necessarily the number
of blocks used. Also, a large file consumes indirect blocks to locate the blocks as needed.
Because of sparse files and indirect blocks, the
accurate way to determine the actual cost of a file requires calling stat and using the blocks
value (element index 12). As I was thinking about this, I thought
it'd be interesting to profile my own disk usage organized according
to logarithmic buckets. Once I had that working, I started tinkering some
more (as I often do) and wanted a nice HTML table output, organized by
user, of the disk blocks and file count for a given directory hierarchy.
And the result is shown in Listing 1.
Line 5 of the listing provides a constant to help me
scale a given block count into the relevant bucket. If I take the natural
log of the block count and divide it by the natural log of 2, then truncate
that downward to an integer, I'll get buckets for 1 block, 2-3
blocks, 4-7 blocks, 8-15 blocks, and so on, in nice powers of two. Rather
than keep dividing by the natural log of 2, I'll use the reciprocal
and multiply, which in hindsight is a completely unnoticed
micro-optimization. Oh well.
Lines 7 and 8 pull in the File::Find module (found in the core distribution)
and Template Toolkit module (found in the CPAN).
Line 10 defines the summary data structure, which will
hold the results of our data gathering. The hash is keyed by user ID
number, pointing at an array organized by logarithmic block buckets, each
holding a hashref that accumulates the number of blocks and the count of
files.
Line 14 defaults the directories to be searched to
include the current directory.
Line 16 creates a hash to enable us to ignore the
subsequent times that we see the same multiply-linked file. This hash has
to be defined outside the subroutine used by File::Find.
Lines 18 to 28 recurse through the directories using File::Find. The inline
anonymous subroutine is called for every entry: it's our job to
return quickly if the entry is not needed. Hence, lines 19 and 20 return
for symbolic links or items that are not either files or directories.
(Directories consume disk blocks; thus, I chose to count those in my
storage summary.)
Lines 21 to 23 pick out the relevant parts of the stat call, including the
device and inode number, the user ID of the file's owner, and the
number of blocks. If the device and inode number pair have been seen
before, line 23 rejects the entry quickly.
If we make it to line 25, it's the first time
we've seen a given file, so we next need to determine the bucket in
which this entry should be accumulated. If the block count is 0, we
can't compute the logarithm, so I decided to arbitrarily put that in
with the entries with a block count of 1. Otherwise, the natural log of the
block count is divided by the log of 2. We don't need to take the
integer value of the result because we're using it as an array index,
which automatically truncates the value.
Lines 26 to 27 accumulate the information for this
file into the appropriate bucket.
Once the data is gathered, I create two auxiliary data
structures to help the rendering a bit. Line 30 creates a mapping from user
ID to usernames. Lines 31 to 36 create the top-of-table column headers by
computing the powers of 2 as needed. The @labels array is compared to each user's data, and if
needed, extended until it's long enough.
For the rendering, I reach for my favorite templating
system, Template Toolkit. I need to pass the data into the template using a
hash, which I create in lines 38 to 41. Inside the template, sums refers to my %sums hashref, and so on.
Lines 43 and 44 create the top-level Template object, configured to
invoke post-chomping (to avoid spurious newlines in the output), and invoke
the template on my data. The result appears on STDOUT.
This is the end of the Perl code. The rest of the file
is a Template Toolkit template in a giant "here" document. Line
45 declares that the blocks returned by stat are in 1K blocks, which I determined by
experimentation.
Lines 46 to 49 are a slight tweak to make the table
cells look reasonable in my browser. I'm no CSS expert, so if this is
broken, just remember that I never claimed to be "Just another CSS
hacker". It does seem to do what I want though.
Lines 51 to 67 define a macro to take a number and
format it nicely into 4 characters or fewer (up to a terabyte, and after
that, you're on your own, you lucky dog). I tried using Template::Plugin::Number::Format and
it didn't do what I wanted, so I wrote my own. I use this both for
the bytes and for the file count, simply by applying the BLOCKSIZE scaling
factor as needed.
Lines 68 to 70 define a macro to show bytes and file
count in a consistent way from column to column, on separate lines in the
HTML.
Lines 71 to 107 define the main loop, walking through
the results user by user.
We need to preface the first user's output with
all of the table header information, including the column headers. This is
detected in lines 72 to 84.
Lines 74 to 83 produce the column heads by walking
through the labels array.
Ahead of the first column, I insert the username column (line 76), and
after the last column, I append the totals column (line 80). The remaining
labels are scaled by the block size and displayed using the nice number
formatting routine (line 78).
Once the headers are displayed, it's time to
display each row. Lines 85 to 103 iterate over the columns of a given row,
using the labels to define the columns over which we will iterate.
As with the column headers, we'll insert the
username before the first data column and initialize the total accumulators
(lines 88 to 91). And after the data is complete, the totals are dumped
(lines 99 to 102).
For each data cell, we'll pull out the detailed
hashref in line 87. If this cell exists (checked in line 92), then we
should show the data (likes 93 to 95). Otherwise, we need to include a
non-breaking space so that the browser renders the cell anyway (line 97).
Okay, this started out being a 10-line program, but
there it is. A sample run on my laptop's library directory as
rendered by my browser is shown in Figure 1. I hope you find this tool
useful and that you learned a few things about creating data reduction
tools. Until next time, enjoy!
Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software
design, system administration, security, technical writing, and training.
He has coauthored the "must-have" standards: Programming Perl, Learning Perl, Learning Perl for Win32 Systems, and Effective Perl
Programming. He's also a frequent
contributor to the Perl newsgroups, and has moderated
comp.lang.perl.announce since its inception.
Since 1985, Randal has owned and operated Stonehenge Consulting Services,
Inc.
|