Automatically Exploiting Implicit Parallelism

Dr. Dobb's Journal July 2001

By Aart J.C. Bik, Milind Girkar, Paul M. Grey, and Xinmin Tian

The authors are researchers at Intel's Microcomputer Software Laboratories. They can be contacted at aart.bik@intel.com.

Recent trends in processor design have introduced new ways for programmers to exploit parallelism. Some of these trends provide SIMD instructions that let multiple functional units operate simultaneously on so-called packed data elements (that is, relatively short vectors that reside in memory or registers). The Pentium processor with 64-bit MMX technology, for example, supports integer operations on eight packed bytes, four packed words, or two packed dwords. Here, a single instruction such as paddb provides eight-way SIMD parallelism by simultaneously adding 8 bytes in the source operand to 8 bytes in the destination operand. The Pentium III and Pentium 4 introduced 128-bit Streaming-SIMD-Extensions (SSE and SSE2), which provide similar SIMD support for floating-point operations on four single-precision floating-point numbers or two double-precision numbers, respectively. In addition, SSE2 widens the integer MMX technology into 128-bit technology. The reduced cost of processor manufacturing has made another trend possible, namely the introduction of shared-memory multiprocessors, such as dual- or quad-Pentium III systems, in mainstream computing.

Explicitly exploiting parallelism in a program, however, is cumbersome and error prone. It may require the use of inline assembly to generate the appropriate SIMD instructions or the use of a complicated threading library to take advantage of the computing power available on a multiprocessor. Although such explicit techniques can be extremely effective, they also greatly complicate program development and maintenance. An alternative approach is to let a compiler do (at least part of) this exploitation automatically. In this approach, a program written in a sequential programming language, such as Pascal, C, or Fortran, is supplied to a compiler, which analyzes the program for implicit opportunities for parallelism and generates code that takes advantage of this implicit parallelism. An example of such a compiler is the Fortran/C++ compiler developed at Intel's Microcomputer Software Laboratories (where we work). This compiler automatically finds opportunities to generate SIMD instructions (a process called "auto-vectorization") and multithreaded code ("auto-parallelization"). In this article, we will examine some of the techniques used by this compiler.

Data Dependencies

Suppose that you have written Example 1(a) in a sequential programming language such as C. The sequential semantics of C impose the following execution order on the statements: First execute S1 (giving a the value 100) and then execute S2 (giving b the value -10 if we assume that the initial value of c is 0). In Example 1(a), however, executing S2 before S1 (or even executing S1 and S2 simultaneously) has no effect on the final values of the variables a or b. In contrast, suppose you have written Example 1(b). Now, the sequential semantics must be respected or else the final outcome of the code changes. If the initial value of a is 0, for example, then interchanging the execution order of the statements would yield the value -10 for b (rather than 90). This difference from the first example is caused by the fact that there exists a flow (read-after-write) dependence between S3 and S4. Other data-dependence relations that prohibit interchanging the execution order of two statements are the anti (write-after-read) dependence and the output (write-after-write) dependence.

Central to any compiler that wants to exploit implicit parallelism in a sequential program is the observation that the execution order of statements can be changed without affecting the meaning of these statements, as long as all data-dependence relations are respected. (Here we assume that I/O statements are internally represented by assignments to variables that represent open files, so that the resulting output dependencies prohibit changing the execution order of these statements.) Mathematically speaking, sequential programming languages impose a total order on the execution of statements, but only the partial order induced by data dependencies is relevant for the final outcome. Finding the data-dependence relations in a program (a process called "data-dependence analysis") is relatively simple for straight-line code (as in the previous examples), but becomes more complex when loops are involved.

In Example 1(c), for instance, the first complication arises from the fact that the loop causes statement S5 to be executed many times, which gives rise to different instances of the same statement (which we will denote by S5(1), S5(2), S5(3), and so on). The second complication arises from the use of an array rather than a scalar variable. Testing for a flow dependence, for example, consists of determining whether an instance S5(i') that precedes another instance S5(i") (viz. i' < i") can write to the same element of array a that is read by the latter instance (i' = i" - 1). Mathematically, this problem is equivalent to solving the following system for i' and i": i' = i" - 1, where 1 <= i' < i" < 100.

In this case, it is relatively simple to see that there is a flow dependence between every instance S5(i') and the instance S5(i'+1) in the next iteration (the solutions for (i',i") are (1,2), (2,3),...,(98,99)). In general, however, even testing for the existence of data dependencies can be as involved as solving an integer linear programming problem. All compilers, therefore, trade off some accuracy for efficiency during data-dependence analysis. The solvers must be conservative, however. That is, it is safe to report data dependencies when dependence does not really exist, but independence must never be reported incorrectly since that could result in an invalid conclusion on implicit parallelism.

Auto-Vectorization

Data-dependence information gives a compiler more insight into where implicit parallelism can be exploited. For example, the simplest code fragments that can be directly implemented with SIMD instructions are loops without any data dependencies.

In the loop in Example 2(a), data-dependence analysis reveals that no instance of statement S6 is data dependent on any other instance of this statement, which implies that the statement instances S6(0), S6(1),...,S6(98), S6(99) may be executed in any order without affecting the final outcome of this code fragment. In particular, this means that the compiler can implement this double-precision floating-point loop with two-way SIMD instructions that simultaneously execute two instances of S6 in each iteration. (That is, S6(0) and S6(1) are executed simultaneously in the first iteration of the SIMD loop, followed by S6(2) and S6(3) in the second iteration, and so on.) Example 2(b) is the assembly-language version of this implementation.

Vectorization is not limited to loops without data-dependence relations, however. In the loop in Example 2(c), a flow dependence exists between every instance S7(i') and S7(i'+2). Although the data dependence implies that the instances of statement S7 cannot be executed in an arbitrary order without affecting the final values that are stored back into the elements of array a, this loop can still be executed two steps at a time, as in Figure 1, where data-dependence relations are depicted as arrows. As a general rule, we can say that N-way SIMD parallelism cannot be exploited in a loop if there is a (transitive) data-dependence relation between two instances of the same statement with a distance less than N in the iteration space of the loop. So, Example 2(d) cannot be vectorized due to the flow dependencies with distance=1.

However, there are exceptions to this rule. For instance, although Example 2(e) strongly resembles the first loop, there is a small but subtle difference. In this loop, the elements of array x are read before they are written, causing an antidependence between every instance pair S9(i') and S9(i'+1). If the compiler translates this single-precision floating-point loop into the four-way instructions in Example 3(a), then it is easy to verify that these data dependencies are preserved, because the four-way load of elements of array x still occurs before the partially overlapping four-way store back into the array.

If there are several statements in a loop, then the compiler may have to reorder statements to make all data dependencies lexically forward; see Example 3(b). If the double-precision floating-point loop at the left side were naively translated into two-way SIMD instructions, then the pair-wise load of elements of array b in S10 would precede the pair-wise store into b in S11, which violates the flow-dependence relations between instances of S11 and S10. After the statements are textually interchanged, however, the store is generated before the partially overlapping load, which respects all data-dependence relations.

Finally, Example 3(c) is an example of a loop that can be translated into SIMD instructions despite self-flow dependencies with distance=1. Accumulating the product of two short arrays, u and v, into a scalar w causes flow, anti, and output dependencies between all instances of statement S12, which clearly prevents straightforward vectorization of this loop. Such reduction idioms can usually still exploit some SIMD parallelism, however, by computing partial results in parallel. On the Pentium 4, the Intel compiler implements this particular mixed-type reduction as in Example 3(d).

After this eight-way SIMD loop, the accumulator xmm0 contains four partial sums that must be added into the final result w (the exact code sequence has been omitted). To illustrate the performance gains that can result from auto-vectorization, Figure 2 shows the performance (in millions of integer operations per second) of a sequential (SEQ) and vectorized (VEC) implementation of the previous loop for different values of the array length N on a 1.5-GHz Pentium 4 processor (a cleanup loop must be used to handle any remaining iterations in case 8 does not evenly divide the array length). The corresponding speedup is depicted in Figure 2 as (S).

Auto-Parallelization

Data-dependence information also gives a compiler more insight into where implicit parallelism can be exploited to take advantage of the processing power available in shared-memory multiprocessors like dual- or quad-Pentium III systems. Due to the overhead associated with creating threads, the granularity of parallelism that can be effectively exploited is usually coarser than for auto-vectorization. Therefore, auto-parallelization typically focuses more on outermost loops.

Example 4(a), for instance, computes the product of a matrix and a vector. The double loop gives rise to N^2 different instances of the statement S. Data-dependence analysis reveals that there are only data-dependence relations between statement instances that occur in the same iteration of the i-loop, as in Figure 3. Consequently, the compiler can generate multithreaded code for the i-loop without affecting the final outcome of Example 4(a) (provided, of course, that each thread executes the j-loop sequentially).

First, the original double-nested loop is replaced by Example 4(b), a single run-time library call that forks a number of threads, each of which invokes the function _matvec_par_loop0. This function has the form illustrated in Example 4(c). When a thread invokes this function, another run-time library call to _static_init is first made, which assigns a subset of iterations from the original set [0,n] to the thread based on the thread ID tid, using the local variables _lower, _upper, and _stride. (This lets us implement a variety of scheduling policies like block scheduling or cyclic scheduling.) Subsequently, each thread executes its subset of iterations. Finally, the threads join again using the library call _static_fini.

Figure 4 shows the performance (in MFLOPS) of this code for different values of N on a 550-MHz quad-Pentium III system with auto-parallelization disabled (SEQ) and enabled (PAR). The corresponding speedup (S) goes up to 3.2, which means that about 80 percent of the available computing power is effectively utilized during the matrix-vector product.

Further Reading

Banerjee, Utpal. Dependence Analysis. Kluwer Academic Publishers, 1997.

Wolfe, Michael. High-Performance Compilers for Parallel Computing. ISBN 0805327304. Addison-Wesley, 2000.

Information on high-performance compilers for Intel architectures: http://developer.intel.com/software/products/.

DDJ