To err is human. To prepare for all possible errors is the mark of a good programmer.
Come on in. Shut the door; it's getting cold in here. Welcome to The Journeyman's Shop. Pour yourself a cup of coffee if you like. Sorry the room is such a mess; as soon as things calm down a little, I'm going to clean the place up and get organized. Warm your hands by the woodstove, and let's talk a bit about software development before we get down to work.
Software projects always start out with architectural decisions: where the software will get its data, what it will do to the data, and how it will report its results. Even if you work alone and hack code all day long, you still have to make all of those architectural decisions. You may not have written up an architectural specification, but you have it there in your head, and you're updating it as you go.
Here at The Journeyman's Shop we don't develop architectural specifications. We leave that to the application architect. Our job is to build the pieces that fit together to become the application. We take the blueprint that the architect supplies, we make parts according to the blueprint, and we assemble the parts to make the application. Of course, we don't have to be blind to problems that we see in the blueprint, and we should call any serious problems to the architect's attention. But for the most part those high-level decisions have already been made, and our job is to implement them.
Don't worry, though: there's lots of room for creative innovation in what we do. That architect hasn't worked out all the details. If you're into categorizing the steps in the development process, what the architect does is high-level design. What we do in The Journeyman's Shop is low-level design and implementation. In more mundane terms, we make the application work. To do that we must understand a wide range of programming disciplines. We must be able to divide a task into a coherent set of functions and design the interfaces between those functions. We must know how to choose the best sorting algorithm given the amount of data to be sorted, the time constraints, the amount of available memory, and whether the data is in memory or on disk. Another important part of what we do is error handling. In fact, error handling is the main topic of this installment of The Journeyman's Shop. We'll get into the specifics in a few minutes. In general, though, our job requires that we be able to design and implement error handling within the code that we're writing, and make sure that our error handling strategy integrates easily into the application's error handling strategy. To sum up our capabilities as low-level designers and implementors, we must be able to break a task down into smaller pieces, make those pieces work efficiently, and then assemble those pieces into a larger component that satisfies the application's architectural requirements. These design decisions affect the maintainability, efficiency, and correctness of the application.
Over the years I've often heard from programmers who made a design decision that didn't quite work out and then asked for help in coding their way out of trouble. They seem to think that they're like the cartoon character who can paint himself into a corner, paint a door in the wall, open the door, step through, and reach back and paint the part of the floor where he was standing. Trouble is, we're not cartoon characters. When we paint that door on the wall, no matter how hard we try, it's not going to open. There's a good side to this, though: it's not really paint, and we can go back and start fresh without messing up our shoes. Part of being a good programmer is being able to recognize when an earlier design decision isn't working out, and having the courage to abandon what we've done and start over. It hurts our pride, and it may produce pitying looks from our colleagues, but sometimes it's the best solution.
Don't think, though, that I'm urging you to simply give up when things get hard. Far from it. That drive to complete what we started, the desire to tear down all obstacles, and the thrill that we get when we succeed in solving what looked like an intractable problem all contribute to the excitement and sense of satisfaction that we get from our profession. On the other hand, the frustration of continuing to slog along through a maze of twisty little passages, all alike, as we struggle to implement a design that we know, deep down inside, won't work, contributes to burnout and job hopping. We must all learn to recognize when we've run into a dead end, and be willing to discard what we've done and start over.
Error Handling: An Integral Requirement
In the application that we're building, however, we usually don't have that flexibility. The code that we've written is out there in the field, and our customers expect it to work. If something goes wrong we can't expect to be called in to fix the problem. Like it or not, our code must be written to anticipate difficulties and to handle them in appropriate ways. This aspect of program design is, of course, what's usually referred to as "error handling," and it's a subject that many of us try to postpone thinking about or avoid altogether. We'd much rather focus on getting the code to produce results than on figuring out how our code can go wrong. But producing an application that is robust and reliable requires that we pay attention to error handling. We must design error handling into the application, and not try to retrofit it after we've done the parts that we like better. Otherwise we'll leave footprints all over the painted floor.
Simply stated, an error occurs when a function is unable to produce a required result. When we are designing the code to perform some computation we must look at the ways that the code we're about to write could fail, and with that information, we must make decisions about how to handle those potential failures. That doesn't mean that every design should incorporate explicit error checking and handling: the C character classification functions, for example, accept without complaint values that are not valid character representations. As we'll see later, in some cases deciding to ignore an error is a reasonable choice. In others it is not. In all cases, however, failing to decide how to handle an error is itself an error. One thing we should always do before we conclude that a function we've written is finished is to ask ourselves whether we've considered all of the possible ways that our function could fail to produce a required result.
There are four broad categories of errors we should consider:
1) Our function can be called with arguments that are outside the range it is designed to handle.
2) The function may be unable to get resources needed for the computation.
3) The function may be violating the application's security policies.
4) The function may contain a coding error.

There's a certain amount of overlap in these categories. For example, if we made a coding error that inflates the length of a string to a couple of gigabytes, we'll probably find that we can't get the resources needed to create that string. This makes a coding error look like a resource failure. When we're identifying possible errors we need to focus on where errors come from, because that makes it easier to spot them in our code. Once we know where they can come from, we can figure out how to detect them. When we're debugging, on the other hand, we need to focus on what effects an error will produce, that is, how to recognize it. Once we've found out what's going wrong it's much easier to track down the source of the problem. At the moment we're talking about identifying possible ways that our code can fail, so we should think in terms of how errors can arise.
The first category of errors is range errors. A range error is a call to a function with arguments outside the range it is designed to handle. For example, think about the C function
double sqrt(double x);

Since this function returns a double and not a complex value, it cannot be used to calculate the square root of a negative number. Calling sqrt with a negative argument is an error. If we're writing the sqrt function, one of the error cases we have to think about is its being called with a negative argument.
The second category of errors is resource errors, specifically, situations in which a necessary resource is not available. The obvious example of a scarce resource is memory. As C programmers, we've been told ever since we started programming that we must always check the return value of malloc to see whether it is NULL. If it is, the runtime library was unable to allocate the memory that we requested, and our computation probably cannot continue without taking corrective action. More generally, we should be careful about anything that we have to ask for before we can use it. It might not be available.
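That habit can be sketched in a few lines. The helper name and interface here are invented for illustration; the point is simply that the result of malloc is checked before it is used:

```c
#include <stdlib.h>
#include <string.h>

/* Duplicate a string, checking that malloc succeeded.
   Returns NULL if the memory could not be allocated. */
char *dup_string(const char *s)
{
    char *copy = malloc(strlen(s) + 1);
    if (copy == NULL)          /* allocation failed: report it */
        return NULL;
    strcpy(copy, s);
    return copy;
}
```

The caller, in turn, must check dup_string's return value, so the decision about how to handle the failure moves up to code that has enough context to make it.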
The third category of errors is security violations. Java has made programmers more aware of the significance of security policies in programming, but security didn't start with Java. For example, Windows NT has, from the start, supported security controls in applications. In any event, if the application we are writing is subject to any sort of security control, we must consider possible security violations when we design our code. If our application or its user does not have sufficient security rights to do what has been asked, our code cannot produce a meaningful result. This is an error that we must consider in our design.
The fourth category of errors is coding errors. These happen to all of us at one time or another. If we've done our jobs well these errors don't make it past our internal reviews and testing. Still, we must consider the possibility that coding errors will invalidate the results of our computation.
Once we've identified the possible sources of errors in the function we're writing we should look at our specification once again, to decide whether these possible errors are covered by the specification. If the specification does not address them it may need to be revised. For example, suppose we've been asked to write a function that takes a pointer to a null-terminated array of char as a parameter. The array is supposed to contain characters representing the digits 0 through 9, and the function is supposed to translate those characters into an integral value. Our first pass at writing this function might look like this:
int translate(const char *str)
{
    int val = 0;
    while (*str != '\0')
        val = val * 10 + *str++ - '0';
    return val;
}

When we look at this code we see several possible error conditions. First, our code returns 0 when the first character in the array is the null terminator. That's the most natural way to implement this function, but it may not be what the writer of the specification intended. Second, if the array contains non-digit characters we will incorporate them into our result anyway, producing a value that doesn't make sense. Third, str might be a null pointer. Fourth, the value represented by the digits in the array might be too large to store in an int.
The first three errors are actually covered by our specification if we read it literally: they are not valid inputs to this function. On the other hand, if our specification is an informal description of capabilities, the spec writer might have overlooked these possibilities. In that case we must ask what would be appropriate actions to take for these input conditions. This calls for an exercise of judgment: if the architect always says things precisely and accurately, then reading the specification literally is usually the right thing to do. We can't fall back on a literal reading of the specification as a defense for writing code that is obviously flawed, however. Application development is a cooperative effort by a team of programmers. While our role is primarily to implement what the specification describes, we may be in a better position than the architect to see some problems and to recommend changes to the specification.
The fourth problem was one of calculating a result too large to fit in an int. This clearly points to a failure in the specification. We cannot fix this problem ourselves, because the solution affects how the function is used. For example, if our function simply drops the high bits as the value overflows, we produce the wrong value. That's acceptable if the specification tells users of the function what the maximum allowable value is. If, instead, our function somehow indicates that an error occurred, then the caller of our function must be prepared to handle that error indication. We must ask the architect what to do if this error occurs, perhaps suggesting what we think is an appropriate solution, and we must make sure that the answer becomes part of the specification. That way users of this function will know what to expect.
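As a sketch of one answer the architect might approve (the status-code convention here is an assumption on my part, not part of the original specification), the function could report bad input and overflow through its return value and deliver the result through an out parameter:

```c
#include <limits.h>
#include <stddef.h>

/* A possible revision of translate: report a null pointer, an empty
   string, non-digit characters, and overflow through the return value
   (0 for success, -1 for failure), and deliver the translated value
   through an out parameter. */
int translate_checked(const char *str, int *result)
{
    int val = 0;
    if (str == NULL || *str == '\0')
        return -1;                            /* invalid input */
    while (*str != '\0') {
        if (*str < '0' || *str > '9')
            return -1;                        /* non-digit character */
        if (val > (INT_MAX - (*str - '0')) / 10)
            return -1;                        /* result would overflow */
        val = val * 10 + (*str++ - '0');
    }
    *result = val;
    return 0;                                 /* success */
}
```

With this convention the caller must check the return value, which is exactly the kind of obligation that belongs in the specification.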
The next step is to decide what to do with each of the above possible errors. There are two broad answers here: do nothing, or detect the error and handle it. Doing nothing has several advantages: it is simple to understand, it introduces no additional control flows into our code, and it adds nothing to the size of our application. In some cases doing nothing is a reasonable approach.
For example, if it is easy for the caller to avoid using invalid arguments there is less need for the function itself to check for them. One example is the function sqrt that we talked about earlier. The rule for the caller is simple: don't call sqrt with a negative value. It's easy for the caller to ensure that sqrt is never called with a negative value. The caller can explicitly check for a negative value or it can rely on logic that always deals with non-negative values. For example:
for (i = 0; i < 10; ++i)
    printf("%d: %f\n", i, sqrt(i));

In this code sqrt can never be called with a negative value. If we can rely on the user to never call our function with invalid values then we need never check for those values.
On the other hand, if recognizing invalid input is hard to do, we shouldn't push that burden onto our users. Consider a function that computes the two real roots of a quadratic equation:
void quad(double a, double b, double c, double *r1, double *r2)
{
    double disc = b*b - 4*a*c;
    *r1 = (-b + sqrt(disc))/(2*a);
    *r2 = (-b - sqrt(disc))/(2*a);
}

If disc is less than zero the roots are complex numbers, and cannot be represented by double values. We could document that this function fails if the roots are not real, but that's hard for users to recognize. We could go a step further, and document that the function fails if b*b-4*a*c is less than 0, but that isn't much of an improvement. This is a case where we probably should not ignore the problem. It's too hard for users to avoid it.
If we decide that we need to check for errors in the code we're writing, we must add code to our function to check for these errors. That's pretty straightforward: just insert if statements in the appropriate places. However, a certain vocabulary has grown up around error checking that you should be familiar with. When we write code at the top of a function to check for invalid argument values we're testing a precondition. When we write code at the end of a function to check for a correct result we're testing a postcondition. Preconditions and postconditions, together, are the constituents of the notion of "programming by contract." The contract is: if you call this function with arguments that satisfy the preconditions, the function will return with results that satisfy the postcondition. In the case of sqrt, the precondition is that the argument is not negative. The postcondition is that the square of the result is equal to the argument. If we write code to check both the precondition and the postcondition, the function looks something like this:
double sqrt(double x)
{
    double res;
    assert(0 <= x);                          /* precondition */
    /* some lengthy computation */
    assert(fabs(res*res - x) < MAX_ERROR);   /* postcondition */
    return res;
}

While we're working on the code in sqrt the postcondition test is very useful: it tells us if we've produced an incorrect result. The precondition test, on the other hand, is more useful to users of our code: it tells them they have called our code with an invalid value. Once the entire application has been completed and thoroughly tested we may decide to remove the explicit precondition and postcondition tests. That is, we might change our design decision about checking for these errors. This should be done cautiously, however, because recompiling the application with these tests removed can bring out symptoms of problems that were masked before. Be sure to allow sufficient time for retesting and debugging after removing such tests.
If we find that an error has occurred, we must decide what to do with it. There are four possibilities here: abort, avoid, protect, and report.
Aborting program execution when an error occurs is, of course, not at all appropriate in, say, a pacemaker. There are times, though, when it's the best thing to do. In particular, if the error indicates that the application's internal data structures are so hopelessly corrupted that there is no way that the program can continue to run at all, the best thing to do is to quit before we do any further damage. We should give the user the best possible description we can of what's wrong, and then stop. In less hopeless situations, there may, nevertheless, be nothing that our function can do to make sense of the data that it has been passed. As an extreme example, consider the case of a compiler being asked to compile a file that actually holds data from a spreadsheet. This simply won't work, and most compilers have a limit on the number of coding errors they will report before they decide it's time to quit.
Avoiding the problem usually means trying a different approach. For example, some of the algorithms in the Standard C++ library can be implemented to run faster if there is extra memory available for storing intermediate results. It could be an error for the implementor of one of these algorithms to simply assume that the extra memory is available and use only the fast version. If there is a possibility that the extra memory won't be available, the code should check whether it can get the extra memory. If it can, the code can then use the faster version. If it cannot, the code can fall back on the slower version. Another example of avoiding a problem occurs in user interface code. The code asks the user for input, then checks whether that input is acceptable. If not, it asks again.
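The fall-back idea can be sketched like this. The function names are illustrative, and qsort merely stands in for the faster, memory-hungry algorithm:

```c
#include <stdlib.h>
#include <string.h>

/* Slow but in-place: needs no extra memory. */
static void insertion_sort(int *a, size_t n)
{
    size_t i, j;
    for (i = 1; i < n; ++i) {
        int key = a[i];
        for (j = i; j > 0 && a[j - 1] > key; --j)
            a[j] = a[j - 1];
        a[j] = key;
    }
}

static int cmp_int(const void *p, const void *q)
{
    int x = *(const int *)p, y = *(const int *)q;
    return (x > y) - (x < y);
}

/* Sketch of the "avoid" strategy: try to get scratch memory for the
   fast algorithm; if the allocation fails, fall back to the slower
   in-place algorithm instead of failing outright. */
void sort_ints(int *a, size_t n)
{
    int *scratch = malloc(n * sizeof *scratch);
    if (scratch != NULL) {
        /* extra memory available: use the fast path */
        memcpy(scratch, a, n * sizeof *a);
        qsort(scratch, n, sizeof *scratch, cmp_int);
        memcpy(a, scratch, n * sizeof *a);
        free(scratch);
    } else {
        insertion_sort(a, n);   /* no memory: slower, but still correct */
    }
}
```

Either path produces the required result; the error has been avoided rather than reported.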
Protecting ourselves from an error may seem like an odd notion, but it's actually a fairly common practice. C++'s iostreams do exactly this: when an operation fails, all subsequent operations on that stream will also fail, without attempting to perform any actual input or output. Every stream has a data member that can be examined to determine whether the stream is in a usable state, and every operation on that stream checks that flag before doing any actual work. When an operation fails, it sets this flag, so no further attempts will be made to use this stream. This means that the stream's user doesn't have to check every stream operation for successful completion; she can wait until a logically related set of operations have been performed, and check for success at the end.
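The same sticky-flag idea can be sketched in C. The struct and function names here are invented for illustration; they are not iostreams' actual machinery:

```c
#include <stdio.h>

/* Sketch of the "protect" strategy: every operation checks a failure
   flag before doing real work, so one failure disables all later
   operations on the same stream. */
struct out_stream {
    FILE *fp;
    int   failed;   /* sticky error flag */
};

void stream_put(struct out_stream *s, const char *text)
{
    if (s->failed)                  /* already failed: do nothing */
        return;
    if (fputs(text, s->fp) == EOF)
        s->failed = 1;              /* record the failure */
}

int stream_ok(const struct out_stream *s)
{
    return !s->failed;              /* check once, at the end */
}
```

The user writes a whole sequence of stream_put calls and tests stream_ok once at the end, rather than checking each call individually.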
Another example occurs in what will probably become the new C standard in a year or so. In C as it exists today, floating-point operations that produce values too large to represent in a floating-point number produce the value HUGE_VAL, which is defined in <math.h>. The actual value of HUGE_VAL is not specified by the C standard, but it is often a large but finite floating-point value. If floating-point code doesn't check for HUGE_VAL, it may end up performing some operation on it (such as dividing it by 1,000) that produces a reasonable-looking value but just happens to be dead wrong.
The new C standard's working paper proposes a solution to this problem, which is to adopt the IEEE specification for floating-point computations. This specification adds three values to the usual range of floating-point values: positive infinity, negative infinity, and NaN (not a number). Instead of producing HUGE_VAL, operations that produce values too large to represent produce the value positive infinity. Unlike HUGE_VAL, dividing positive infinity by a number greater than zero produces the value positive infinity. Dividing it by a negative value produces negative infinity. Dividing positive infinity by positive infinity produces NaN, as does dividing zero by zero. Further, arithmetic operations involving NaN result in NaN. If you think it through, you can see that once we've obtained one of these special values in some computation, we won't ever get back to a normal value. This means that floating-point code can defer checking for errors until the computation is finished. If an error of this sort occurred anywhere in the calculation the result will be one of these special values, and the code can recognize immediately that something went wrong.
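Under those IEEE rules the deferred check looks something like this sketch, using the classification macros isinf and isnan from the new standard's <math.h>:

```c
#include <math.h>

/* Sketch of deferred error checking with IEEE special values: run the
   whole computation with no error tests, then examine the result once
   at the end. The formula itself is arbitrary, for illustration only. */
double combine(double a, double b)
{
    double r = a / b;        /* may produce infinity or NaN */
    r = r * 2.0 + 1.0;       /* special values propagate through */
    return r;
}
```

A caller computes combine(a, b) and then tests isnan(r) || isinf(r) once, instead of checking after every individual operation.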
In both of the above examples, protecting ourselves from an error does not solve the underlying problem. It does permit us to simplify our code, because we don't have to deal with error handling throughout. However, when we adopt a strategy such as this we still must report back to the calling code that something went wrong.
Finally, if we can't ignore the problem and we can't avoid it and we can't just quit, we've got to report it. We've run out of things that we can do in our own part of the code, so we've got to pass the responsibility on to the code that called our code. There are many techniques for telling the calling code that our function was unable to do what it was asked to do. That's a big topic, and we'll dig into it next month.
Pete Becker is Technical Project Leader for Dinkumware, Ltd. He spent eight years in the C++ group at Borland International, both as a developer and manager. He is a member of the ANSI/ISO C++ standardization committee. He can be reached by email at petebecker@acm.org.