Robert Metzger is manager of Advanced Optimizers at CONVEX Computer Corp. He is responsible for the development of CONVEX's interprocedural optimizer and has worked with vectorizing/parallelizing FORTRAN and C compilers. He has a B.A. in mathematics from the State University of New York and an M.Div from Gordon Conwell Theological Seminary. You may contact Robert at CONVEX Computer Corp., 3000 Waterview Parkway, P.O. Box 833851, Richardson, TX 75083.
A hurdle to efficient numerical C applications is the definition of pointers. Pointers are not associated with the specific objects pointed at, but with all of memory (in common-usage C) or with entire storage classes (in ANSI C). Interprocedural C compilers must, therefore, be conservative if they are to preserve a program's correctness. This conservative bias leads to reduced application performance. Without changes in the language, interprocedural compilers must have new sources of information if they are to perform more aggressive optimization.
Traditional (Procedure) Compilers
Procedure compilers have some significant limitations. If writers edited papers the way procedure compilers process programs, they would check each page of the paper thoroughly. They would not, however, check that the last sentence on one page made a logical connection with the first sentence of the following page. Nor would they check whether a page had dangling references or undefined terms that no other page resolved. In fact, every time they picked up a new page, they would completely forget everything they had read on the previous page.

Procedure compilers must make worst-case assumptions about the side effects of procedure calls. They must assume that:
- Every global variable used in the calling procedure is used and assigned by the called procedure;
- Every variable whose address is passed is both used and assigned by the called procedure;
- Every memory location that might be pointed at by pointers used by the called procedure is both used and assigned.

The result of making such assumptions is over-cautious optimization and applications that don't execute as fast as they might.
Procedure compilers generally assume procedure invocations are correct. Some languages (such as Ada) require the programmer to declare the proper way to call a procedure so that procedure calls can be checked. ANSI C provides the means to check procedure calls via function prototypes, but doesn't require their use. Some of the most commonly used languages (FORTRAN, common-usage C) provide no means at all to check the correctness of procedure calls, even when the programmer wants the compiler to do so. Compilers for languages that don't require procedure-call checking cannot find common errors, such as omitting arguments or passing arguments of the wrong type.
Procedure compilers are unable to find subtle programming errors that require an analysis of the entire application. They cannot determine whether global variables used in one procedure are initialized somewhere else. Nor can they determine at compile time that an argument pointer points at freed storage when it is dereferenced.
Interprocedural Compilation
Interprocedural analysis gives interprocedural C compilers the information they need to perform aggressive optimization. Interprocedural compilers process functions individually, as traditional compilers do, but they do so based upon information about the application as a whole.

Such compilers compile programs the way writers really edit papers. In addition to checking each page carefully, they check the connections between pages, and the references to terms and ideas used on multiple pages. They arrange the sections and pages in the order that is most profitable for the reader.
Compilers that use interprocedural analysis don't make assumptions about the side-effects of procedure calls. Instead, they develop precise information on all the variables that a procedure might use or assign, and on any procedures it may call.
Compilers that use interprocedural analysis don't assume that procedure invocations are correct. They inspect the arguments transmitted at each call site.
Dependence Analysis
Two kinds of dependencies occur in computer programs: control dependencies and data dependencies. Both kinds constrain the order in which a program's parts may execute. If the dependencies are ignored, incorrect results will be obtained. Control dependencies arise from flow-of-control constructs like if-then-else. The code blocks that are executed in the then and else parts are control-dependent on the if test.

Data dependencies arise from the flow of information within a program. For example, if a variable is assigned and subsequently used, there is a data dependence between the assignment and the use. Data dependencies come in four varieties:
- true: an assignment must precede a use
- anti: a use must precede an assignment
- output: an assignment must precede another assignment
- input: a use must precede another use
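As a concrete illustration (hypothetical code, not from the article's listings), all four varieties can occur within a few statements:

```c
/* Each labeled pair of statements exhibits one of the four
   varieties of data dependence. */
int dependence_demo(void)
{
    int x, y, t, u;

    x = 1;              /* true (flow): this assignment of x ...   */
    y = x + 1;          /* ... must precede this use of x          */

    t = y;              /* anti: this use of y ...                 */
    y = 0;              /* ... must precede this reassignment of y */

    u = 2;              /* output: this assignment of u ...        */
    u = 3;              /* ... must precede this final assignment  */

    return y + t + u;   /* input: this use of y follows the use of
                           y in "t = y" above                      */
}
```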
The sequential order of text in a source file imposes many apparent dependencies. Dependence analysis identifies the dependencies that are necessary to preserve the program's meaning. It eliminates apparent dependencies that are merely artifacts of the way the program was encoded in text.
The dependencies that involve scalar variables are relatively easy to analyze. The hard work comes in analyzing the dependencies between array references, particularly those that occur in loops. In C, dependencies also occur between pointer references. Because of the relationship between pointers and arrays in C (a[i] == *(a+i)), these pointer-reference dependencies can be treated as if they were array dependencies.
Figure 1 shows a loop that has no dependencies that prevent vectorization or parallelization. If you start at any operator and trace a path, you will find that the path is self-contained. Each arithmetic operation can be performed independently, without needing the results of any other operation. Compiler designers describe the dependencies between the elements of the operands as loop-independent.
Figure 2 shows a loop that has a dependence that prevents vectorization or parallelization. If you start at any operator and trace a path, you will find that the path leads to the next operator. The entire diagram is a spiral with a single thread. Each arithmetic operation is dependent on values computed on previous iterations, and results are carried forward in the elements of the array. Compiler designers describe the dependencies between the elements of the operands as loop-carried.
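Figures 1 and 2 are not reproduced here, but a minimal pair of loops with the same two properties might look like the following sketch (the array names are illustrative, not taken from the figures):

```c
#define N 8

/* Only loop-independent dependencies: each iteration reads and
   writes distinct elements, so the iterations may execute in any
   order, or all at once (vectorizable/parallelizable). */
void independent(double a[N], const double b[N], const double c[N])
{
    int i;
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}

/* A loop-carried dependence: iteration i reads a[i-1], which was
   written by iteration i-1, chaining the iterations into a single
   sequential thread. */
void carried(double a[N], const double b[N])
{
    int i;
    for (i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];
}
```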
For more information on why dependence analysis is important, and how it is performed, see [2] and [3].
Pointer Tracking
The function in Listing 1 contains a loop that is not vectorized by procedure compilers, but is by an interprocedural compiler.

A procedure compiler may claim that there is no induction variable, despite the existence of the obvious candidate i. In the absence of pointer-tracking information, the compiler assumes that pointers can point to any location in memory that has been addressed. If no information has been gathered on which global variables have been addressed, it must assume that the assignment through the pointer p can change the value of the global variable g. By definition, an induction variable must have a constant limit. The procedure compiler concludes that the loop has no induction variable, and loops without induction variables cannot be vectorized.
In contrast, an interprocedural compiler does vectorize this loop. It can do so because it keeps a record of every variable whose address is taken throughout the entire application. If g is never addressed, it knows that the assignment through p cannot change the value of the loop limit. An interprocedural compiler concludes that the loop has an induction variable, and since there are no other impediments to vectorization, vectorizes the loop.
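Listing 1 itself is not shown here; a minimal function with the same hazard, assuming a global loop limit g and a pointer parameter p as the text describes, might look like:

```c
int g = 100;   /* loop limit: a global variable */

/* Without whole-program information, a compiler must assume the
   store through p might modify g; then g is not a constant limit
   and i is not recognized as an induction variable.  If pointer
   tracking proves that g's address is never taken anywhere in the
   application, the store cannot alias g and the loop vectorizes. */
void scale(double *p)
{
    int i;
    for (i = 0; i < g; i++)
        p[i] = p[i] * 2.0;
}
```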
The function in Listing 2 contains three loops. The first two can be vectorized; the third cannot. A procedure compiler vectorizes none of them. An interprocedural compiler vectorizes the first two, but not the last.
In this function, all the pointer variables are local variables. A procedure compiler could perform pointer tracking within this function and determine that it was safe to vectorize the first loop. It doesn't need to perform dependence analysis on the pointer offsets (array subscripts) for the first loop, because the pointers never point at the same storage block.
Pointer tracking provides necessary, but not sufficient, information to determine whether the second loop can be vectorized or parallelized. The use dereference of pointer q and the assign dereference of pointer p both point at the same block of storage. Dependence analysis on the pointer offsets clearly shows that the dependence between these two operations is loop-independent, and the loop can be vectorized.
A compiler that performs pointer tracking within a procedure would know that in the last loop the p and q pointers point at the same storage block. In this case, however, the pointer offsets create a dependence from one iteration to the next, and thus vectorization is not possible.
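Listing 2 is not reproduced here; the sketch below is a hypothetical function with the same three-loop structure the text describes (all pointers local, names invented):

```c
#define N 64

/* All pointers are local, so even procedure-level pointer tracking
   knows exactly which block each one addresses. */
void three_loops(double x[N], double y[N])
{
    double *p, *q;
    int i;

    /* Loop 1: p and q address different blocks; no dependence
       analysis on the offsets is needed -- vectorizable. */
    p = x;
    q = y;
    for (i = 0; i < N; i++)
        p[i] = q[i] + 1.0;

    /* Loop 2: p and q address the same block, but p[i] and q[i]
       name the same element on each iteration, so the dependence
       is loop-independent -- still vectorizable. */
    p = x;
    q = x;
    for (i = 0; i < N; i++)
        p[i] = q[i] * 2.0;

    /* Loop 3: p and q address the same block one element apart, so
       iteration i reads what iteration i-1 wrote -- a loop-carried
       dependence; not vectorizable. */
    p = x + 1;
    q = x;
    for (i = 0; i < N - 1; i++)
        p[i] = q[i] + 1.0;
}
```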
The code in Listing 3 contains two loops; the first can be vectorized, but the second cannot. Neither loop is vectorized by a procedure compiler. An interprocedural compiler does vectorize the first, but not the second. Procedural pointer tracking is not sufficient in this case; interprocedural pointer tracking is required.
When a procedure compiler compiles the function f3, it has no way of knowing what the results of pointer-returning functions point at. In the case of the function getblock1, each invocation returns a pointer to a new and distinct block of storage. A procedure compiler must nevertheless assume that the pointers point to the same block of storage, and even to different offsets within that block. The getblock2 function exhibits precisely this worst-case behavior. Such assumptions inhibit optimization.
In contrast, an interprocedural compiler knows that getblock1 returns pointers to new storage. In the first loop, it doesn't even need to perform dependence analysis on the pointer offsets (array subscripts) because the pointers always point at different storage blocks. This knowledge enables the compiler to vectorize the first loop.
Pointer tracking enables an interprocedural compiler to know that getblock2 returns offsets into the same storage block. When it applies dependence analysis to the pointer offsets, it finds a dependence that prevents vectorization or parallelization.
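Listing 3 is not reproduced here; the sketch below assumes, as the text describes, that getblock1 allocates a fresh block on every call while getblock2 returns offsets into one shared block (the signatures and the array name are invented):

```c
#include <stdlib.h>

#define N 32

/* Returns a pointer to a new, distinct block of storage on every call. */
double *getblock1(void)
{
    return calloc(N, sizeof(double));
}

/* Returns offsets into one shared block: the worst-case aliasing
   behavior for a compiler that cannot track pointers across calls. */
double shared[N + 1];

double *getblock2(int off)
{
    return shared + off;
}

void f3(void)
{
    double *a = getblock1(), *b = getblock1();
    double *c = getblock2(0), *d = getblock2(1);
    int i;

    if (a == NULL || b == NULL)
        return;

    /* Loop 1: a and b are always distinct blocks, so no dependence
       analysis on the offsets is needed; vectorizable. */
    for (i = 0; i < N; i++)
        a[i] = b[i] + 1.0;

    /* Loop 2: d[i] aliases c[i+1] in the shared block, creating a
       loop-carried dependence; not vectorizable. */
    for (i = 0; i < N; i++)
        d[i] = c[i] + 1.0;

    free(a);
    free(b);
}
```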
The code in Listing 4 has two loops in two separate functions. The first can be vectorized, but the second cannot. Neither loop is vectorized by a procedure compiler. An interprocedural compiler does vectorize the first, but not the second. This might be confusing at first glance, since the text of the two sub-functions is identical except for their names. Procedural pointer tracking is not sufficient in this case; interprocedural pointer tracking is necessary.
The difference between the loops is the context in which they are used. When a procedure compiler compiles the sub-functions, it has no way of knowing to what memory locations argument pointers point. In the case of the function f4sub1, each argument points to a distinct block of storage. A procedure compiler must assume that argument pointers point to the same blocks of storage and even to different offsets within the same block. The call to function f4sub2 exhibits this worst-case behavior. Such assumptions inhibit optimization.
On the other hand, an interprocedural compiler knows that whenever f4sub1 is called, the arguments point to distinct storage blocks. It doesn't need to perform dependence analysis on the pointer offsets because the pointers always point to different storage blocks. This makes it possible to vectorize the first loop.
When an interprocedural compiler analyzes the second loop, pointer tracking determines that two of the arguments point to the same storage block. When it performs dependence analysis, it determines that a loop-carried dependence exists, and thus it cannot vectorize or parallelize the loop.
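Listing 4 is not reproduced here; a hypothetical version consistent with the text's description (the names f4sub1 and f4sub2 and their calling contexts are as described, everything else is invented) might look like:

```c
#define N 16

double v1[N], v2[N], v3[N];

/* The two sub-functions are textually identical except for name;
   only the contexts in which they are called differ. */
void f4sub1(double *p, double *q)
{
    int i;
    for (i = 0; i < N - 1; i++)
        p[i] = q[i] + 1.0;
}

void f4sub2(double *p, double *q)
{
    int i;
    for (i = 0; i < N - 1; i++)
        p[i] = q[i] + 1.0;
}

void f4(void)
{
    f4sub1(v1, v2);     /* arguments always point at distinct blocks:
                           the loop in f4sub1 is vectorizable        */
    f4sub2(v3 + 1, v3); /* arguments overlap one element apart:
                           loop-carried dependence, sequential       */
}
```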
These examples show that pointer tracking permits an interprocedural compiler to optimize C code that cannot safely be optimized otherwise.
Multi-Dimensional Arrays
The lack of a method to declare argument arrays with varying dimensions discourages some people from using C for scientific programming. In contrast, FORTRAN has adjustable and assumed-size arrays, and Pascal has conformant arrays.

The problem is stated very well in Numerical Recipes in C (p. 17) [1]: "The systems programmer rarely deals with two-dimensional arrays, and almost never deals with two-dimensional arrays whose size is variable and known only at runtime. Such arrays are, however, the bread and butter of scientific computing. Imagine trying to live with a matrix-inversion routine which could work with only one size of matrix!"
C programmers have coped with the omission of varying-dimension arrays in several ways. These include explicit indexing, array objects, and arrays of pointers. The code implementing these approaches has been written so that the equivalent macros can be substituted for one another with a minimum number of changes.
The first approach stores all multi-dimensional arrays as vectors and uses macros to generate the address-calculation code (Listing 5). The second approach uses structures to represent array objects of different types; macros generate the address-calculation code (Listing 6). The third approach uses arrays of pointers to represent arrays with more than one dimension; matrix index calculations are replaced by a second indirection (Listing 7).
The first method works well with pointer tracking, but it is clumsy and not very maintainable because the programmer must embed information about the size of the array in the code. The second method also works well with pointer tracking. The third method obstructs the effectiveness of pointer tracking by introducing a second level of indirection. This extra indirection may also slow down code compiled without pointer tracking.
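Listings 5 through 7 are not reproduced here; the macro and helper below are a hypothetical sketch of the first and third approaches (names invented), to make the difference in indirection concrete:

```c
#include <stdlib.h>

/* Approach 1 (explicit indexing): the matrix lives in one vector
   and a macro generates the address arithmetic.  The column count
   must travel with the data. */
#define IDX(m, ncols, i, j)  ((m)[(size_t)(i) * (ncols) + (j)])

/* Approach 3 (array of pointers): indexing becomes a second
   indirection, which is convenient but defeats pointer tracking. */
double **make_rows(double *m, int nrows, int ncols)
{
    double **rows = malloc((size_t)nrows * sizeof *rows);
    int i;
    if (rows == NULL)
        return NULL;
    for (i = 0; i < nrows; i++)
        rows[i] = m + (size_t)i * ncols;
    return rows;
}
```

Both forms address the same underlying vector, so code written against one can be converted to the other by swapping macros.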
An Implementation
CONVEX Computer Corp. has developed a language-independent interprocedural optimizer for users of its C-series supercomputers. This optimizer is packaged with the existing CONVEX C and FORTRAN compilers in a product called the Application Compiler.

The C front end accepts ANSI Standard source (ANSI X3.159-1989). It also provides optional compatibility with common-usage C. The FORTRAN front end accepts ANSI Standard source (ANSI X3.9-1978), and provides a high degree of compatibility with the extensions made by DEC or Cray, at the user's option.
The driver program for the Application Compiler determines which source files must be (re)compiled, and applies the appropriate language front end to all such source files. The front end performs lexical analysis, parsing, and semantic analysis. It writes to disk the compiler's internal representation of each function.
After all source files that must be (re)compiled have been processed, the driver invokes the interprocedural optimizer. It performs the analyses, answering the questions in Table 1 in the order listed. Each analysis algorithm reads from the program database and writes the information it generates back to that database.
After the interprocedural algorithms finish, the driver program invokes the CONVEX common compiler back end for each procedure in the application. The back end consists of a machine-independent optimizer and a machine-dependent code generator. The Application Compiler's optimizer makes use of information gathered by the interprocedural analysis phase, rather than making worst-case assumptions about procedure side effects. The optimizer performs procedure-wide scalar optimizations, such as those found in C compilers for PCs and workstations. In addition, the optimizer performs automatic vectorization and parallelization of loops.
After the back end processes all procedures, the Application Compiler driver invokes the linker, which creates an executable image from the generated objects.
Recommendations
There are several ways that C programmers can improve the effectiveness of pointer tracking. Following these recommendations will improve the performance of numerical applications written in C on computers whose compilers perform such optimizations.
Avoid global pointers.
This is just basic modular programming. Systems whose functions communicate with arguments and results are simpler to maintain than those whose functions communicate via global variables. Pointer tracking is both more efficient and more effective when analyzing the usage of arguments rather than global variables.
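A minimal illustration of the difference (hypothetical functions, not from the article's listings):

```c
/* Communication through a global pointer: at every call site the
   compiler must reason about everything gp could possibly address. */
double *gp;

void scale_global(int n)
{
    int i;
    for (i = 0; i < n; i++)
        gp[i] = gp[i] * 2.0;
}

/* Communication through an argument: the pointer's possible targets
   are visible at every call site, so pointer tracking is both
   cheaper and more precise. */
void scale_arg(double *p, int n)
{
    int i;
    for (i = 0; i < n; i++)
        p[i] = p[i] * 2.0;
}
```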
Use standard library functions for memory allocation.
A compiler that performs interprocedural pointer tracking will know that certain functions allocate and deallocate heap storage. If you use your own allocator, this information will not be available. If you use a special memory-allocator library for debugging, replace it with calls to the standard library functions once the application is working.
Use prototypes for functions that return pointers.
Using ANSI C prototypes is good defensive programming practice, and may even aid optimization, since pointer tracking must collect information about all carriers of addresses, both variables and function returns. If a function appears to return an integer (the default), the compiler may not collect all the information it needs prior to interprocedural pointer tracking.
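For example (a hypothetical allocator wrapper, not taken from the listings):

```c
#include <stdlib.h>

/* With no prototype in scope, a pre-ANSI compiler assumes the call
   returns int, and the address the function carries can escape the
   pointer tracker.  The prototype declares the pointer result
   explicitly at every call site that includes it. */
double *new_vector(size_t n);   /* prototype: result is a pointer */

double *new_vector(size_t n)
{
    return calloc(n, sizeof(double));
}
```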
Represent multi-dimensional arrays without multiple indirections.
This is probably the most controversial of the recommendations. Some compiler/hardware combinations will give better performance with explicit address arithmetic, and others will do better with the extra indirections introduced by vectors of vectors. The safest approach is to use a set of compatible macros that provide the different implementations, such as those presented here. You can use the preprocessor to select the implementation that best suits the compiler and hardware in use.
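One hypothetical way to set this up (the macro and symbol names are invented):

```c
/* The numerical code uses only the access macro M(); the
   preprocessor selects its implementation at compile time. */
#ifdef USE_EXPLICIT_INDEXING
  /* matrix stored as a single vector */
  #define M(a, ncols, i, j)  ((a)[(i) * (ncols) + (j)])
#else
  /* vectors of vectors: ncols is carried for call-compatibility
     but unused by this implementation */
  #define M(a, ncols, i, j)  ((a)[(i)][(j)])
#endif
```

Compiling with -DUSE_EXPLICIT_INDEXING (or not) switches representations without touching the numerical code.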
Conclusion
C is becoming more and more popular for numerical applications, hosted on platforms ranging from workstations to supercomputers. As C programmers learn to use a programming style that assists, rather than hinders, automatic optimization, C applications will achieve the same performance as those coded in FORTRAN.
Bibliography
[1] Press, W., Flannery, B., Teukolsky, S., Vetterling, W. Numerical Recipes in C, Cambridge University Press, 1988.
[2] Wolfe, M. Optimizing Supercompilers for Supercomputers, The MIT Press, 1989.
[3] Zima, H., and Chapman, B. Supercompilers for Parallel and Vector Computers, Addison-Wesley, 1991.