Dr. Dobb's Journal October, 2005
Data Crunching: Solving Everyday Problems Using Java, Python, and More
Greg Wilson
Pragmatic Bookshelf, 2005
188 pp., $29.95
ISBN 0974514071
A few weeks ago, a customer handed me some files that I had requested. We had previously agreed upon a convention for storing date information in these files and had decided that every file would include the date as its first line in the format YY-MM-DD, such as 05-06-23. When the files were given to me, however, the first line was in the form DD-Month-YYYY, such as 23-June-2005, and sometimes the file contained a blank line or two before the date. There were hundreds of files, so fixing this problem was too big a job to do by hand, but it wasn't a task that I planned on repeating several times, so it didn't warrant a lengthy design and development cycle.
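A one-off fix for a task like this can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions, not the script actually used and not code from the book: the function name fix_date_header is hypothetical, month names are assumed to be full English names in any capitalization, and all years are assumed to be four digits.

```python
import re

# Full English month names mapped to two-digit month numbers (assumed spelling).
MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
MONTHS = {name: f"{number:02d}" for number, name in enumerate(MONTH_NAMES, start=1)}

# Matches dates such as 23-June-2005: day, month name, four-digit year.
DATE_RE = re.compile(r"^(\d{1,2})-([A-Za-z]+)-(\d{4})$")


def fix_date_header(text):
    """Hypothetical helper: drop leading blank lines and rewrite the
    first remaining line from DD-Month-YYYY to YY-MM-DD."""
    lines = text.splitlines()

    # Some files have a blank line or two before the date; skip them.
    while lines and not lines[0].strip():
        lines.pop(0)
    if not lines:
        raise ValueError("no date line found")

    match = DATE_RE.match(lines[0].strip())
    if match is None:
        raise ValueError(f"unrecognized date line: {lines[0]!r}")

    day, month, year = match.groups()
    month_number = MONTHS.get(month.lower())  # tolerate unexpected capitals
    if month_number is None:
        raise ValueError(f"unknown month name: {month!r}")

    # Keep the last two digits of the year and zero-pad the day.
    lines[0] = f"{year[2:]}-{month_number}-{int(day):02d}"
    return "\n".join(lines) + "\n"
```

Raising an error on anything unexpected, rather than guessing, means a malformed file is reported instead of being silently rewritten. Looping the function over the files in a directory is all that remains.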
This type of small data manipulation task falls right into the realm of a recent addition to the Pragmatic Bookshelf: Greg Wilson's Data Crunching. The book promises a pragmatic look at some of the most useful data-crunching techniques, and its delivery on this promise is stellar. From cover to cover, Data Crunching provides an exceptionally practical look at how to save time and effort when it comes to doing that "other stuff" that seems to creep up on every project. (In the spirit of full disclosure, I need to mention that Greg Wilson is an Adjunct Professor in Computer Science at the University of Toronto, where I am an undergraduate student. Wilson is also a DDJ contributing editor.)
Wilson's clear, concise (and often humorous) writing makes it easy to read the book's 188 pages straight through, but its examples and structured layout make it equally valuable as a reference text. Wilson dedicates a chapter to each of the most common aspects of data crunching: text files, regular expressions, XML, binary files, and relational databases. There's also a chapter on unit testing, dates and times, encoding, and other "horseshoe nails" that are described as "apparently trivial things that can bring the whole system crashing down when they go wrong."
Though all of the important data-crunching techniques and idioms are included in this remarkably succinct book, its real strength comes from the fact that it never leaves the real world behind. Real-world programmers have to work on multiple platforms, and so the examples are appropriately platform independent. Real-world programmers code in dozens of different languages, and though the book's examples are mostly Java and Python, the wisdom behind them transcends any one specific programming language. And, most importantly, real-world data is messy, and so Data Crunching continuously reminds you that users will add capital letters where you didn't expect them, and that edge cases can't be forgotten. Incomplete and unexpected data are realities of data crunching, and rather than avoid the issue, the author jumps right into it and explains how to deal with it.
Each of the book's topics is taught through clear, practical examples. The examples and code are simple enough to be understandable upon first read but never feel trivial. Each one is powerful enough to be directly reused or altered slightly to solve some future problem. The author does a remarkable job of justifying his decisions at each point along his examples' narratives, and always favors the practical approach over one that might be found in other data-manipulation texts. For example, he covers not only unit testing, but also some simpler alternatives for when an entire testing infrastructure would be overkill.
The listed skill range for this book is beginner to intermediate, but I would argue that it's appropriate for every developer, whether as a first-time instructional tool or a reference guide for the seasoned professional. Wilson's text and examples have been woven together into 188 succinct pages of wisdom and pragmatic advice for programmers of all levels.
My copy of Data Crunching lives on top of my computer monitor, where it's within arm's length at all times. Regardless of what computing field you're in, you'll find this book to be valuable. Data manipulation tasks won't ever go away, but this book provides the strategy and mindset necessary to spend less time on data crunching and more time on the rest of your programming.
DDJ