How do you encapsulate design decisions? One at a time.
Copyright © 1999 by Dan Saks
Several months ago, I wrote a couple of columns on using classes to partition large programs into simpler components. (See "C++ Theory and Practice: Partitioning with Classes," CUJ, February 1999 and "C++ Theory and Practice: Trimming Excess Fat," CUJ, March 1999.) Rather than dwell on basic class features, which I assumed you already knew, I focused on the judicious use of more advanced features in an effort to express the program's design more clearly.
In both articles I used xr, my trusty old (that's trusty, not crusty or rusty) cross-reference generator program, as my example for illustrating the techniques. I was making pretty good progress refining xr until I got hung up on the problem that arises from the shallowness of the const qualifier. I devoted the next three columns to developing a class template called deep_pointer<T> to cure the problem. (See "C++ Theory and Practice: Thinking Even Deeper," CUJ, July 1999.) Okay, okay, so maybe I got a little obsessed with what might seem to some of you to be a minor problem, but I thought it was pretty cool. I expect to get considerable mileage from deep pointers in the future.
Anyway, now that I have a seemingly viable deep pointer template, I'd like to get back on the previous track. But, before we immerse ourselves in specific programming issues, let me restate my motives for this exercise.
Why Are We Doing This?
C++ has numerous features designed specifically to support the development of large systems. Just about everything having to do with classes, including access control, derived classes, virtual functions, and operator overloading, exists primarily to support large systems. In fact, most of C++ that isn't C is arguably there for the same reason.
The difficulty in developing a large program is not so much a function of the program's size (in source lines or characters) but rather a function of the number of details you must grasp in order to work on the program. It's all those details input and output values, algorithms, data structures, and physical resources that make the program conceptually complex. Reliability and maintainability suffer at the hands of such complexity. In other words, complex programs are hard to get right and to keep right.
How, then, do C++ features such as classes support large-scale software development? By making it easier to decompose large, complex programs into collections of smaller, simpler components.
Of course, just breaking a big program into smaller pieces (functions, classes, or namespaces) doesn't necessarily make the overall program structure simpler. It does only if you break it up into appropriate pieces. Each piece should be an abstraction. That is, it should hide some of the program's details from the other pieces.
I think Grady Booch put it very well when he wrote:
The task of the software development team is to engineer the illusion of simplicity [1].
My intent in this and succeeding articles is to discuss some of the fine points of using C++ to enhance the illusion of simplicity. For this, I need a concrete programming example that's big enough to demonstrate, at least in some small ways, the problems you can run into when developing and maintaining really big programs. The program has to be complicated enough so that it leaves something to simplify, but not so complicated that it drowns us in details. I think the cross-reference generator fills that bill quite nicely.
I started with an existing program rather than one that I wrote from scratch because I didn't want to spend time developing algorithms and data structures. Rather, my focus is on wrapping those algorithms and data structures into tidy abstract bundles.
Where We Are
xr reads text from standard input and writes a cross-reference to standard output. The cross-reference output is an alphabetized list of the words (identifiers as in C++) that appeared in the input. Each line in the output contains one word followed by the sequence of unique line numbers on which that word appeared in the input.
For example, if you use one of xr's source files (Listing 1) as input to xr, the cross-reference output should contain the following line for the identifier token:
token: 15 17 18 19 20token actually appears twice on line 18, but the output only mentions 18 once.
xr uses garden-variety data structures, such as binary trees and linked lists. If you're not familiar with these structures, you can find out about them in just about any book on data structures ever written. Specifically, xr uses a binary tree to maintain the words in alphabetical order. Each node in the tree holds a word and its corresponding sequence of line numbers. More precisely, each node in the tree contains a pointer to the null-terminated character array representation of a word, and pointers to the head and tail of a linked list representing the sequence of line numbers for that word.
In my first rewrite of the program, I created a class called cross_reference_table, which encapsulates that tree. (See "C++ Theory and Practice: Partitioning with Classes," CUJ, February 1999.) When I created the class, I split the program from one source file into two source files and one header file:
- xr.cpp: the main part of the application, including the input processing
- table.h: the cross_reference_table class and inline member function definitions
- table.cpp: the remaining definitions for the cross_reference_table class
These three files appear as Listings 1 through 3, respectively.
Listings 2 and 3 employ the deep pointer class template, which I developed over the past few months. The deep pointer class definition appears in Listing 4. For the most part, a deep_pointer<T> behaves just like a T *, but a deep_pointer<T> const behaves like a T const *const.
The tree implementation employs a structure type, tree_node, which was originally global. Now it's a private member of the cross_reference_table class. The cross_reference_table class definition in Listing 2 does not need the complete type definition for tree_node. Therefore, it declares tree_node as an incomplete member type. The complete type definition appears in Listing 3.
The cross_reference_table class employs two functions, add_tree and put_tree, which were also originally global. Now they are private static member functions. (See "C++ Theory and Practice: Trimming Excess Fat," CUJ, March 1999.)
Some of you may have noticed that I could use some of the STL containers (the generic containers in the Standard C++ library) instead of the special-purpose data structures already in this program. For example, I could declare the cross_reference_table as:
map< string, set<unsigned> > table;I could use a vector, list, or deque instead of a set. I could use a multimap instead of a map, although I suspect a multimap would be noticeably less convenient than a map. I will consider these possibilities eventually, but not just yet.
Although the STL containers are very handy for many common problems, they don't eliminate the need to write your own data structures. When you do write you own, you should give careful thought to how you package those structures as abstract types. It's those thoughts that I want to consider now.
Identifying Arbitrary Decisions
I wrote the cross_reference_table class to create an illusion of simplicity inside the main function. It seems to have worked. The class effectively hides from main all traces of the trees and lists inside the cross-reference table. There isn't much left there to simplify.
Most of the complexity of this program is now in the cross_reference_table class. As a very crude measure, I ran the MKS wc (word count) utility on the program files. No matter which metric you choose (lines, words, or characters), there's about twice as much source code together in table.h and table.cpp as there is in xr.cpp.
Probably a better indicator that there's complexity lingering in the cross_reference_class is that all the linked data structures, list_nodes as well as tree_nodes, are in the code for that class. Even though the body of add_tree (Listing 3) is about the same length as the body of get_token (Listing 1), add_tree deals with much more complicated data. I mean, just look at all those -> operators in add_tree.
Yet another way to look at the complexity of the cross_reference_table class is that it embodies not one, but three, design decisions:
1. The words in the table are alphabetized in a binary tree of tree_nodes. Each tree_node pairs a word with a sequence of line numbers.
2. Each word is represented as a pointer to the zeroth character of a null-terminated character string stored in a dynamically allocated array of char.
3. Each sequence of line numbers is represented as a pair of pointers, one to the first node and one to the last node in a singly-linked, null-terminated list of list_nodes. Each list_node contains one number of its sequence.
Each of these design decisions is somewhat arbitrary in that two programmers of comparable skill could look at the same set of requirements and yet quite reasonably choose different data structures to accomplish the chore. Or, a programmer who chose the data structures for an early version of the program might replace any or all of those structures in a later version.
Let's consider some alternatives:
- The binary tree in the current implementation doesn't balance itself. You might replace it with a balanced tree, such as an AVL or red-black tree. You might use a multi-way tree such as a b-tree. Or, you could use a linear structure such as a skip list.
- You might use a compression algorithm that will yield a more compact string representation for the words in the tree. You might use a memory management algorithm that causes less fragmentation of the free store.
- You might try using a circular list for the line number sequences. With a circular list, you need only a pointer to the tail of the list. The next pointer of the tail node leads back to the head node, eliminating the need for a separate pointer to the head of each list.
I think you get the idea.
Though each of the three design decisions contributes to the structure of the cross-reference table, each one really is a distinct decision. It's not hard to imagine wanting to change just one of them. Unfortunately, the code that implements each decision runs together inside the cross_reference_table class, making it harder to change one decision without disturbing the others.
The blurring of the design decisions is evident in several places in the code. For example, the tree_node structure (from Listing 2):
struct cross_reference_table::tree_node { deep_pointer<char> word; deep_pointer<list_node> first, last; deep_pointer<tree_node> left, right; };combines aspects of all three design decisions:
- The pointer word represents a character string.
- Together, the pointers first and last represent a line number sequence.
- The pointers left and right each represent a subtree of the binary tree.
The design decisions also run together in the bodies of add_tree and put_tree. For example, add_tree (in Listing 3) contains the following code to create a new tree_node for word w found on line number n:
if (t == NULL) { t = new tree_node; // (1) t->word = new char [strlen(w) + 1]; strcpy(t->word, w); // (2) t->first = new list_node; t->first->number = n; t->first->next = NULL; t->last = t->first; // (3) t->left = t->right = NULL; }I've inserted comments here to separate the code corresponding to each design decision. The code between comments (1) and (2) fills in the word for the node. The code between comments (2) and (3) places the first line number into the corresponding line number sequence. The code below (3) ensures that the left and right subtrees are empty. Again, without these comments the code just runs together.
A Simple Rule of Thumb
I think it's pretty evident that the next refinement of the program should make the distinction between the design decisions more apparent. This will make the code more readable and improve the odds that we'll be able to change any one design decision without corrupting the code for another.
Even before object-oriented techniques were all the rage, Parnas [2] suggested a simple rule to guide this effort:
- Each encapsulation unit should hide one design decision.
When Parnas suggested this, hardly anyone knew what object-oriented programming was. Designers spoke in terms of encapsulation units such as modules or functions. For programmers using C++, or any language with classes, I think the updated guideline is simply:
- Hide each design decision in a separate class.
We've already encapsulated the first decision (the tree structure) as the cross_reference_table class. We can isolate each of the other two decisions (the representations for words and for sequences of line numbers) as separate classes. The transformation of the program starts with the data structures.
Once again, a tree_node looks like:
struct cross_reference_table::tree_node { deep_pointer<char> word; deep_pointer<list_node> first, last; deep_pointer<tree_node> left, right; };Here:
deep_pointer<char> word;is merely the current implementation for the abstract concept of a variable-length character string. We could use the standard library's string class or write our own if the standard string's performance isn't up to snuff. In either event, the declaration of the word member becomes something like:
string word;which is more abstract than a pointer member.
Similarly, the pointer members:
deep_pointer<list_node> first, last;constitute the implementation du jour for a line number sequence. By crafting a class line_number_sequence, we could replace both pointers with a single member object:
line_number_sequence lines;which is also more abstract.
Applying these abstractions to tree_node we get:
struct cross_reference_table::tree_node { string word; line_number_sequence lines; deep_pointer<tree_node> left, right; };This definition describes the program's design much better than the previous tree_node definition. Even without any comments, it says "each node in the cross-reference table's tree associates a string with a corresponding line number sequence."
A Line Number Sequence Class
Rather than introduce both classes at once, I think it's more instructive (and less risky) to introduce one new class at a time. Let's start with the line_number_sequence class, for which we'll need new header and source files, sequence.h and sequence.cpp, respectively.
For the most part, you can create sequence.h and sequence.cpp by stealing code from table.h and table.cpp. Parts of the cross_reference_table class definition become parts of the line_number_sequence class definition. Parts of the cross_reference_table member function bodies become the bodies of the line_number_sequence member functions. I show you the details of the transformation next time around.
References
[1] Grady Booch. Object-Oriented Design with Applications (Benjamin/Cummings, 1991).
[2] David Parnas. "On the Criteria to be Used in Decomposing Systems into Modules," The Communications of the ACM, vol. 15, no. 12, December 1972.
Dan Saks is the president of Saks & Associates, which offers training and consulting in C++ and C. He is active in C++ standards, having served nearly seven years as secretary of the ANSI and ISO C++ standards committees. Dan is coauthor of C++ Programming Guidelines, and codeveloper of the Plum Hall Validation Suite for C++ (both with Thomas Plum). You can reach him at 393 Leander Dr., Springfield, OH 45504-4906 USA, by phone at +1-937-324-3601, or electronically at dsaks@wittenberg.edu.