Data Is What It Is All About

Dr. Dobb's Journal September 2000

By Doug Reilly

Doug is the author of Inside Server-Based Applications, moderator of the Database topic area on http://www.ddj.com/, and owner of Access Microsystems. He can be contacted at doug@accessmicrosystems.com.



Guru's Guide to Transact-SQL

Ken Henderson
Addison-Wesley, 2000
576 pp., $49.95
ISBN 0-201-61576-2

Mastering Data Mining
Michael J.A. Berry and Gordon S. Linoff
John Wiley & Sons, 2000
494 pp., $44.99
ISBN 0-471-33123-6

Transact-SQL (T-SQL) is the version of SQL used in Microsoft SQL Server. While there certainly is a place for books that cover ANSI Standard SQL (like Joe Celko's SQL For Smarties), Guru's Guide to Transact-SQL, by Ken Henderson, is a welcome addition for developers who are dedicated to Microsoft SQL Server, as well as developers who need to know what is SQL Server specific and what is generic SQL.

Guru's Guide to Transact-SQL begins with an introduction to SQL that is not particularly vendor specific. First, ANSI standards and compliance are covered. Creating tables and databases as well as selecting and modifying data is covered next, including filtering and grouping, as well as joining multiple tables. Most people reading this book will know this, but the description here is worth at least a skim, even by experienced SQL programmers.

There is not much in T-SQL that trips up beginner programmers more than the way that dates and times are stored. A discussion of the T-SQL datetime datatype leads off Chapter 2's discussion of datatype nuances. Strings, numerics, blobs, and bits are covered as well, along with coverage of some datatypes unique to T-SQL, such as uniqueidentifier (a type to hold a GUID) and the somewhat misnamed timestamp type used to determine if rows in a database have changed. You might expect that the string discussion would be somewhat pedestrian, but you would be wrong. In addition to the expected, there is a stored procedure that improves upon SQL Server's built-in SOUNDEX function.

No book covering SQL would be complete without weighing in on the controversy over the use and abuse of NULL. Guru's Guide to Transact-SQL does not take a firm stand on the issue, but covers the implementation details associated with NULL. A switch in SQL Server determines how NULLs are treated, and while all of Chapter 3 is devoted to NULLs, throughout the balance of the book wherever NULLs might complicate things, for instance, when aggregate functions are used, the author gives insight into the implications of NULLs.

One feature programmers coming from traditional general-purpose languages miss about SQL is the lack of a standard array type. Chapter 10 discusses how to circumvent this limitation, if you really need to, aided by large string variables allowed in current versions of SQL Server. Programmers new to SQL also tend to think in procedural terms rather than the set-oriented view of the world that SQL favors. Chapter 13 lets you do what you need to do using cursors, including coverage of system-stored procedures that can be useful in debugging applications using cursors, along with good advice (on page 251) about avoiding them:

The advice I usually give people who are thinking about using cursors is not to. If you can solve a problem using Transact-SQL's many set-oriented tools, do so. It is rare (but not impossible) for a cursor-based solution to outperform a set-based approach.

At the heart of any nontrivial system is the need to ensure that a set of operations either runs to completion or makes no change whatsoever. This is what transactions are for, and this is the topic of an entire chapter. T-SQL-specific transaction isolation modes are described in detail, along with coverage of transaction nesting and optimizing transaction code.

Stored procedures and triggers are the key to creating scalable systems that will work even when more than a single type of client operates on the database, for instance a web-based client and a traditional fat client. Triggers and stored procedures are an area where the details of T-SQL matter more than most places, and the book comes through with a chapter covering these within a T-SQL perspective. Performance optimization and undocumented DBCC commands each get their own chapter, along with coverage of the underused and poorly understood OLE Automation functions.

I have one complaint about the production quality of Guru's Guide to Transact-SQL. Code listings are printed using a fairly bold monotype font that I found distracting. This did not prevent me from enjoying the book, and I did get used to it, but it took a chapter or two.

I came to Michael Berry and Gordon Linoff's Mastering Data Mining not as a data mining expert, but rather as someone looking for assistance and direction in digging through the pile of data generated by a consumer-oriented web site. Fortunately, the book did not presume any previous data mining background, and it provided a good background for a rookie. If I were already deep in the data mining trenches, I might not have been as pleased.

Mastering Data Mining begins with a couple of introductory chapters, including one discussing the alternatives to developing data mining expertise in-house. This is probably a good chapter to read. Much of data mining involves knowing how to analyze your data. What are you looking to do with the data? Determine what has happened, or predict what will? A combination of the two? In any event, what you expect to do with the data, how much work you expect to invest in the effort, and how much money you have to spend buying talent or canned solutions will greatly influence how much you personally need to know.

The next part of the book has more to do with understanding what to look for than the technical details of how to analyze the data. I personally wanted to just dive into a data warehousing product, but the chapters dealing with the nontechnical issues should not be passed up. While much of the information is not rocket science, sometimes the common sense of it might be lost in the race to use the latest technical trick. For instance, there is a section that prompts you to think about who exactly the customer is. This might seem like a silly question, but in reality sometimes the customer might be broader than your current definition. For instance, an example of a bank where a problem on the consumer side of the business has a serious ripple effect on the business side of the business is outlined. Once again, these are not purely technical discussions, but then the subtitle of the book is "The Art and Science of Customer Relationship Management."

Part Two of the book is entitled "The Three Pillars of Data Warehousing." The three pillars are:

This section of the book is perhaps the most technical. Clustering, decision trees, and neural networks are the primary data mining techniques discussed, with reasonable coverage of each. The chapter on data is useful even if data mining is not in your future. Thinking about how data might be used in a data mining project might help you when you design your next database. For example, after reading this section, I thought about another project I am working on. There are two columns in one table, a Patient Incident table, that both really refer to the same attribute of an incident -- whether it is closed. There is a bit flag field called CLOSED, as well as an IncidentTerminatedDate field that should only be filled in if the incident is closed. Well, wouldn't you know it, a careful analysis of the live data indicated that not every record that was CLOSED had a terminated date set, and not every record with a terminated date set was closed. This is the kind of problem that data warehousers must deal with all the time, and reading the chapter on data in this book will give some pointers on how to deal with such potentially contradictory data.

The various ways to model data is perhaps the technical core of Mastering Data Mining. While much of the rest of the book is worthwhile reading, you might naturally come to similar conclusions about the balance of the topics after a lot of hits and misses. That cannot be said about the data modeling chapter. It is unlikely you will come upon this knowledge easily through the natural course of doing a data mining project.

Most of the balance of the book describes several case studies in industries such as banking and telecommunications. These are good case studies, but in some ways are simply elaborations on the earlier minicase studies in earlier chapters.

Overall, Mastering Data Mining was useful for me as a beginner. If you are just starting out, or in the midst of your first large-scale data mining project, Mastering Data Mining is a worthwhile investment. If you are looking for a pure technical read (and I don't think you should be) this book is clearly not for you.

DDJ