Value Lattice Static Analysis

Dr. Dobb's Journal March 2001

A new approach to static analysis finds errors that slip through the cracks

By William Brew and Maggie Johnson

William is director of R&D at Reasoning and can be reached at bill.brew@reasoning.com. Maggie is a professor of computer science at Stanford University and can be contacted at johnson@cs.stanford.edu.

Most software developers consider testing a necessary evil. We don't really want to hear about our mistakes. We don't like going back to fix bugs when we are working on the solution to a new and more interesting problem. But we all know it must be done. The software we create is incredibly complex — more complex than the considerable human intellect can comprehend. Consequently, we rely on software testing techniques to help manage this complexity by exercising the software in ways that we did not envision — and by finding the problems we overlooked in the rush to meet a deadline.

The two primary types of test design — white-box and black-box testing — focus on path execution and functional requirements. Both help to uncover a large number of errors (see, for instance, "White-Box Testing," by Oliver Cole, DDJ, March 2000). But these techniques do not do enough. We still release software with crash-causing bugs to customers. How can this happen if we carefully tested the software?

Bugs happen because it is not feasible to test all possible paths in a complex piece of software. The number of paths grows exponentially as we add control structures to a program. In testing, we tend to focus on the paths that will most likely execute under normal conditions. We may check some standard and some unusual cases, but you have to draw the line somewhere. The problem with this approach is that many of the most devastating bugs we encounter — null pointer dereferences, memory leaks, out-of-bounds array accesses, uninitialized variables — can occur on infrequently executed paths. These are the bugs that can bring a system down, and the infrequently executed paths are the ones that standard testing often misses.

Fortunately, there are automated tools that can detect some of these errors. A static analyzer parses and analyzes source code to find possible faults and anomalies, or to check that coding standards are being enforced. Lint-like tools are examples of static analyzers. (For more on lint-like tools, see "Examining C++ Program Analyzers," by Scott Meyers and Martin Klaus, DDJ, February 1997.) They do not execute a program — they just analyze the source code using one of several different analysis algorithms.

If you have ever worked with lint-like tools, you know about the high noise level these tools produce. A run on a moderately sized program can generate hundreds of warnings, with only a few representing real defects. This high number of false positives is a serious issue with many static analyzers and a primary reason why programmers tend not to use them.

In this article, we'll describe a new and effective approach to static analysis using a "value lattice." This technique finds the most dangerous defects, those that tend to slip through testing, with an accuracy that is much higher than other static analyzers. Also, many parts of the approach are language independent. Currently, the value lattice (VL) system can analyze C/C++, but versions for other languages are fairly straightforward to implement. The VL system is being implemented in C/C++ inspection technology at Reasoning Inc. (the company for which one of the authors works, http://www.reasoning.com/).

Static Analysis

Static analyzers perform many different tasks. Some calculate complexity metrics by path analysis. Others check coding standards by determining the length of functions, depth of nesting, or number of variables in an expression. The static analyzers that find defects use a number of different analysis algorithms ranging from path and value analysis to pattern-matching.

A first step for many static analyzers is the creation of a parse tree and control-flow graph. A parse tree is an efficient representation of the source code derived from an application of grammar rules. A control-flow graph shows the different paths of execution through a program. Both of these abstractions are illustrated in Figure 1. These abstractions are commonly used by compilers to parse and optimize.

In addition to these abstractions, we need a mechanism for storing what values each variable can have at different points in a program. Consider the code in Example 1. You can see that the variable z will be returned uninitialized if x is less than or equal to 2. Most lint-like tools will find this bug. Static analyzers detect this bug by tracking the possible values of the variables, given the paths through the function. As you can imagine, there is a lot of information that must be tracked at each line in the source code. This is another exponential explosion that analysis algorithms must handle.

Now take a look at Example 2. At certain points of execution you can see that the variable z in the function bar is not initialized. The value of z is dependent on the values of x and y in the function foo, which come in via formal parameters from main. Most lint-like tools will not find this error because there are limitations in how much analysis they do over function boundaries. The value lattice technique, however, does find this error.

The Value Lattice Approach

In the value lattice approach to program analysis and defect detection, the initial steps include building a parse tree and a control-flow graph. In addition, a table that cross-references definitions and declarations and the call graph for the program must be built. Another important step is type modeling. All types in the program are modeled using a language-neutral abstraction. This abstraction is just a tree that captures the structure of the type providing an accurate depiction of memory layout. During this initial phase, each program file is processed independently, making parallel processing possible (see Figure 2).

The next phase cross-references the abstract types to identify which ones are the same. This is an essential step that supports the cross-function analysis that will occur later in the process. Note that only the parsing and the resulting parse tree are language dependent. The control-flow graph and type abstraction are language independent.

It's important to recognize that defect detection is basically a problem of data-flow. You need to have a clear picture of what values each entity in the program can contain at each point in the program. To create such a picture, you need to do several things:

Understand the dependencies between data elements. For instance, in Example 2, the value of the z variable in foo is dependent on the values of x and y in the guard condition of the if statement.
Track all the values that can flow to a particular data element. In Example 2, x and y (in the function foo) can take on a number of different values based on the for loop iteration in main. Models of all these values must be maintained along with specific dependencies.
Uncertainty about values must be captured. In Example 2, the possible values of z in the bar function are unknown until further analysis on the caller of bar is done.

All this information must be generated and stored for defect detection. These are difficult tasks to perform within the limits of feasible running time and space conditions.

We have found value lattices, in conjunction with computation analysis graphs (CAGs), to be very efficient in handling these tasks. In a nutshell, a VL is a data structure that represents a partial order. For each type present in the program (int, double, or whatever), a VL is created that contains all the values in the program of that type.

For each function in the program, a CAG is created. A CAG is a data structure that specifies the flow of values through a computation. This structure is based solely on data dependencies rather than control flow. Figure 3 illustrates a simple CAG for a sequential set of instructions. Nodes contain values and operators. Directed edges indicate data flow, and thus data dependencies. A more interesting example in Figure 4 shows how a control construct can be converted to a data-flow representation. We do this by introducing a new type of node called a "selector." Notice that the selector takes on a value based on the value of the decision variable.

The first step of the analysis phase builds a VL for each type, and a CAG for a routine from the structures already created. The algorithm constructs each VL in such a way that the bottom of the lattice contains data elements whose values are known. Further up in the lattice are elements whose values are less certain. At the top, all you really know is that the element has a value. The knowledge you have about values for elements in the middle may be, for example, that you know it lies in a particular range or is one of a set of values. Perhaps all you know is that the element contains a valid or invalid value.

Once each routine's CAG is constructed and optimized, an abstract interpreter begins its processing. At this point, you know about each of the operations in the routine, and you have some knowledge about the values of the data elements. The interpreter performs a symbolic evaluation where each operation in the program is executed and an abstract value is computed from the VL for each node in the CAG.

For example, suppose you have an addition operation on variables x and y. What you know about x from the VL is that it is a valid integer at this program point. You also know that y equals 42. A valid integer plus 42 will always yield another valid integer. Thus, you can conclude that the operation does not produce a bad value.

When the abstract interpreter completes its evaluation, a set of detection rules are fired to see if any bad values were produced. The current VL implementation handles the values needed to detect null pointer dereferences, uninitialized variables, memory leaks, and out-of-bounds array accesses. The typical scenario during symbolic evaluation is some number of values are resolved to be valid, some are invalid, and the rest are too uncertain to ascertain. Information about bad values is recorded for later reporting.

To handle the uncertain values, begin by evaluating the program inside-out; see Figure 5. You begin by building the CAG for function A, then run the abstract interpreter. The uncertain values that remain, a summary of the side effects, and return value are spliced into the CAG for the caller of function A. You then run the abstract interpreter on the caller of A. This splicing continues until you reach the top of the program, which is the main function. At that point, most of the values are resolved and information about any bad values recorded.

The values that might remain unresolved include any external inputs coming into the program. There may be other values that get lost due to approximations built into the algorithm to keep the sizes of the data structures reasonable. These approximations are controlled by user-defined parameters that define certain aspects of the algorithm. In general, the more information spliced in from a called CAG, the more precise the analysis — however, the structure size and running time will increase. The less information spliced in from a called CAG, the more likely some values will remain unresolved when the analysis completes; but the analysis will run more efficiently.

As an example of the entire process, consider the source code in Example 2. The algorithm begins by building a CAG for an individual routine, and a VL for each type. Then, the abstract interpreter is applied to the CAG. Notice in the bar function the variable z coming in as a formal parameter. You do not have enough context to resolve it within the CAG for bar. You need to process the calling function. The CAG for bar gets spliced into the CAG for foo, the calling function. Now, you can obtain more specific information about z, but you see that its value is dependent on the formals x and y. So the parts of foo's CAG pertaining to x, y, and z must get passed up and spliced into main. At this point, you do have enough context to discover that z indeed is uninitialized given certain values of x and y.

The final phase of the process is reporting. A large amount of information about the program is stored during the course of the phase. Over 30 different tables of information in a database are generated that describe all aspects of the program. The most important table, of course, is the one with the defects.

Conclusion

The value lattice and computation analysis graph approaches to static analysis have many advantages over other static analysis techniques. They can find dangerous defects across function boundaries. The number of spurious warnings is an order of magnitude lower than in lint-like tools.

Language independence is another important feature of the implementation. Only the parser and the resulting parse tree are language dependent. The rest of the structure, including the type abstraction, the control-flow graph, the VLs, and the CAGs are language independent. This allows for a relatively easy application of the approach to other programming languages.

Finally, the system has been designed throughout to take as much advantage of parallel processing as possible. This is essential in an application that must deal with potential exponential explosion in both structure size and path tracing.

Suggestions for Further Reading

Aho, A., R. Sethi, and J. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, 1986.

Muchnick, S. Advanced Compiler Design and Implementation. Morgan Kaufman, 1997.

Weise, D. et al. "Value Dependence Graphs: Representation Without Taxation." http://www.research.microsoft.com/scriptpubs/view.asp?TR_ID=MSR-TR-94-03. Microsoft Research, 1995.

DDJ