Stephen is president of Microway, P.O. Box 79, Research Park, Kingston, MA 02364; 508-746-7341.
The Pentium is the first member of the Intel x86 family that requires RISC-style instruction scheduling to achieve its full potential. For the last ten years, each succeeding Intel x86 generation has delivered more efficient numerics, which meant that increasingly mediocre compilers could get by without addressing the issues that have confronted RISC vendors for years. In this article, I'll discuss the issues that pertain to Pentium floating-point performance and the tools needed to get full throughput from a Pentium.
Many 486 owners using hand-me-down 16-bit compilers copied from old systems report that the Pentium does not provide the increase in speed they were expecting, so we decided to investigate. We did this by using Microsoft Fortran 5.0 to recompile the benchmarks we ran for my company's NDP Fortran 4.5 Pentium-aware compiler. MS Fortran 5.0 is a 1990-vintage compiler representative of 16-bit technology which produces large- and huge-model x86 code. In performing these tests, we learned a number of interesting things about Pentium performance. For example, 16-bit numeric programs speed up by only 50 to 80 percent when moved from a 486 to a Pentium--even though the Pentium is capable of running four to five times faster than a 486. We also learned that programs translated with 16-bit compilers run two to eight times slower on a Pentium than properly optimized 32-bit codes! The Pentium does not hit full speed running 486-optimized 32-bit code either. Code optimized specifically for the Pentium runs 10 to 100 percent faster than 486-optimized code. In general, the smallest Pentium optimization speedups occur with scalar programs, while the largest are found with vector codes like LINPACK. Pentium systems are also sensitive to issues like alignment. Mistakes in data placement can reduce speed by up to a factor of four.
When it comes to numerics, the main interest in the Pentium ought to be in the changes made to the 486 that enable the Pentium to do floating-point calculations up to five times faster. The Intel i860 runs up to 20 times faster than a 486, so there's plenty of numerics power on the cutting-room floor in Santa Clara, where bits and pieces get glued together into new CPUs. Yes, the Pentium does have an extra integer pipe, but you won't see much benefit from it if you are doing engineering or science. The Pentium is neither fish nor fowl, but rather a combination of the best of RISC and CISC. In fact, the line between RISC and CISC is very thin for both the Pentium and i860.
The Pentium numerics units contain an i860-like adder and multiplier, less the interstage pipeline registers. This omission speeds up scalar operations (three cycles on the i860 but only two on the Pentium) but eliminates single- or half-cycle vector operations. Part of the reason for deleting vector pipeline operations is that to take full advantage of them, you need to be able to issue both an integer and a floating-point instruction on every cycle. This was done on the i860 using dual-instruction mode, which does not map at all onto the x86 instruction set. We suspect that Intel will reintroduce a similar capability in the future, either automatically or by way of the VLIW (very long instruction word) technology that it purchased from Multiflow. The Pentium's two-cycle scalar latency means that it gets reasonable speed when driven by ordinary scalar compilers (like NDP Fortran) without special vectorization tools or libraries. The scalar units' two-cycle speed also means that for code to hit full speed, stalls caused by instructions waiting for data or for prior instructions to complete must be eliminated. Scheduling now becomes important.
Another i860 feature which found its way into the Pentium is the i860XP's 64-bit memory interface unit. Signal for signal, the Pentium's data bus looks just like an XP. One of the differences between RISC and CISC shows up here. When the hardware detects a problem in a RISC processor, a TRAP is generated and a TRAP handler that "solves" the problem gets invoked. In the case of an i860, if you use an 8-byte load instruction and the item you are loading does not fall on an 8-byte boundary, the processor invokes your handler. The handler branches to a section that does a pair of 8-byte loads to solve the problem. RISC exception handlers take hundreds of cycles to execute! On the Pentium, if you attempt to load a REAL*8 and it does not lie on an 8-byte boundary, the 64-bit data-bus interface unit encounters the same problem. But because it is a CISC device, it doesn't have to call a routine--it uses a built-in circuit (and maybe some microcode) to divide the access into a pair of 8-byte loads, taking four bytes from each. This costs time. We observed four cycles, although Intel claims three cycles. However, four cycles is still much less expensive than a TRAP. The Pentium's CISC heritage is a clear winner here, and this is an example of the difference between RISC and CISC. CISC came out of houses like Intel that were rich in die space and manpower, and where problems get solved with circuits. RISC devices need to run well-scheduled code on aligned data to make up for their lack of electronic helpers. The Pentium hits its peak speed running RISC scheduled code. However, it also has a handicap--it must be able to execute code originally written for 16- and 32-bit processors in a 64-bit environment. The bottom line is that the Pentium needs a full bag of RISC tricks to help it overcome the 12-year software legacy it must deal with.
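The misalignment penalty just described is easy to avoid in software. The sketch below is my own illustration, not Microway's code; the helper name alloc_aligned8 is hypothetical. It shows the standard C trick of over-allocating and rounding the pointer up so that every 8-byte double starts on an 8-byte boundary and never triggers the split-access circuit:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch: force 8-byte alignment of a double array by
   over-allocating and rounding the pointer up. Many early-1990s
   mallocs guaranteed only 4-byte alignment, so a REAL*8/double could
   straddle an 8-byte boundary and pay the extra bus cycles described
   above. The caller frees *raw, not the returned pointer. */
double *alloc_aligned8(size_t n, void **raw)
{
    *raw = malloc(n * sizeof(double) + 7);   /* 7 spare bytes to round up */
    if (*raw == NULL)
        return NULL;
    uintptr_t p = ((uintptr_t)*raw + 7) & ~(uintptr_t)7;
    return (double *)p;                      /* first 8-byte boundary at or after *raw */
}
```

A compiler or runtime that places all REAL*8 data this way sidesteps the three-to-four-cycle split-access cost entirely.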
The two best-known features of RISC devices are single-cycle instructions and a lot of registers. The Pentium is capable of one- and two-cycle performance but has only six general-purpose registers, fewer than the 32 typical of RISC devices. To make up for this deficiency, Intel engineered the on-chip caches so that they could be accessed in a single cycle. This effectively created a register file that can hold 2000 32-bit items, which is very large by RISC standards. The same sort of ploy was used in the i860, which was designed to mimic a Cray. Where the Cray had a Vector Register, the i860 employed a data cache to emulate one. The i860 is a RISC device that also has many CISC-like instructions. Again at Intel, the boundary between RISC and CISC theology becomes blurred in favor of market-driven solutions.
To benchmark the Pentium, we performed a sequence of tests using several 60- and 66-MHz systems. The results were virtually identical from system to system. For all programs tested, the data sets stayed in the on-chip or secondary caches, so we did not see the benefits of fast external memory. The two exceptions were Whetmat and LINPACK, which demonstrated that dot products and DAXPY are sensitive to the speed of external memory.
First, some comments. These results are preliminary but reasonably accurate. Some users have made claims about Pentiums that we cannot duplicate. When running a mark like LINPACK, it is very easy to get results that are 20 percent off simply because the Pentium is so fast: If you don't adjust the repeat counts in these marks, the measured times will be wrong. When you are looking for effects as small as 10 percent, timing errors become important. In addition, we discovered that resetting the machine before running a benchmark can improve things by as much as 10 percent--some Pentium systems seem to slow down as time progresses. In the early days of the 386, we had similar problems, which we attributed to code moving from fast 32-bit memory into slower 16-bit memory located in the I/O channel. We don't understand how this effect comes about on a Pentium system.
The first set of marks run is a suite of programs we distribute with NDP Fortran that date back to the 386 in 1986. This suite tests what we considered then to be the three important characteristics of floating-point devices. Two of the three marks were derived from the Whetstone; the third is a simple matrix multiply. The first of our derived marks is named the "Whetscale." It contains all of section 1 of the Whetstone and measures the speed of x87 floating-point stack operations. This mark provides an upper limit on the scalar speed of a register machine, since interregister operations are always the fastest. It is also scalar bound: It does not benefit from scheduling because it cannot be rescheduled (a characteristic of scalar-limited code). The Whetscale does not benefit from loop unrolling because it already contains 21 numeric instructions in its inner loop. Its units are megaflops and, true to form, the benefits of optimization for this mark are minimal, yet it runs at the full speed of the CPU. On the 486, this mark does not hit full speed if the variables do not end up in the x87 stack. This is also the case on the Pentium, but only for double precision. If all programs had these characteristics, there would be no need for optimizing compilers. I think it is possible to write an even more trivial mark for the Pentium that has fewer load/store dependencies, can be scheduled, and achieves higher interregister speed.
The second derived mark, the "Whettrans," measures the speed of transcendentals such as sin, cos, atan, exp, log, and sqrt. Like the Whetstone, it has no meaningful unit of measurement. The Whetmat does a simple matrix multiply on 100x100 arrays, and its results are reported in megaflops. We studied a fourth mark, LINPACK, which we think is very important because it measures how caches and memory systems affect floating-point speed. Anyone who has ever worked with a supercomputer knows that memory bandwidth can be more important than floating-point speed. Memory bandwidth and data-cache management become crucial when large arrays are being processed.
The Whetstone and the single- and double-precision variations of the three marks are listed in Table 1. Each column represents the results of turning on a different optimization switch. These switch settings range from no optimizations to full Pentium optimizations, which we have labeled "ALL." These marks were taken with an experimental compiler that used our old switch nomenclature; on the Pentium compiler, you get the effect of ALL when you select --OLM. Going across the columns, you discover that the Pentium runs very well in situations where everything fits in the cache and there is no ancillary integer footwork. This is not surprising--most of the marks here were designed to measure the speed of hardware, and things were kept simple to make that possible.
The Whetscale was developed partly to determine the efficiency of vector routines. If you compare the Whetmat with the Whetscale running on the 486 under Microsoft Fortran 5.0, you will discover that the matrix multiply runs at 1.51 megaflops while the Whetscale hits 5.85 megaflops. Our interpretation is that the vector mark was running at about 25 percent efficiency. The same ratio can be computed from the NDP results, and varies from about 10 to 80 percent as optimizations get turned on. These two marks actually tell us something about the Pentium and the code that drives it: The efficiency of vector code varies greatly with the compiler, but probably never exceeds 80 percent. The typical efficiency we see with 486s executing good vector code is 50 percent. The only device we know of that hits 100 percent is the i860, and this occurs not because the i860 is a RISC device, but because Intel developed CISC-like load and store instructions which do two things at once. In the inner loop of a dot product, an i860 can issue one of these dual-purpose integer instructions along with a floating-point instruction on every cycle. As a result, there is no indexing overhead, and the processor is able to run as fast as if it were executing nonvector instructions.
The Whetmat demonstrates why correct code is crucial. The double-precision Whetmat running on a 66-MHz Intel Express motherboard sees a 20-to-1 speed-up going from no optimizations to full optimizations. This is the largest variation in speed we have ever seen on any processor. The argument can be made that the code in the "NONE" column must be awful, but we also note that half the marks in this column beat the Microsoft Fortran 5.0 marks. If we had more time, we could determine why this code runs so slowly; it is probably related to the fact that small mistakes on the Pentium can make a big difference in speed. More interesting is that the Whetmat doubles in speed when we transition from the --OLM --n2 --n3 column to the "ALL" column. This is the combined effect of Pentium scheduling and loop unrolling, and is the largest code-related speed-up that we've seen for the Pentium. The loop unrolling is required to get this speed but, by itself, makes less of an improvement on the Pentium than it does on a 486. Note that the only marks in Table 1 which benefit from scheduling in a big way are the two Whetmats. All the other marks keep variables in registers or the cache, or are not involved in flowing data through the processor. The less indexing and movement of data that goes on, the faster the Pentium runs. Indexing costs cycles because extra integer instructions have to be executed; fetches cost cycles because the item may lie outside the cache or in a memory page not currently being accessed.
The Whetstone is now of little use, since it is dominated by transcendentals, as can be seen by comparing it with the Whettrans to see where the big breakthroughs come from. The big "breakthrough" here is the use of inline x87 transcendental operations like fsin. Two other interesting effects are hidden in these marks. The first involves x87 stack storage (the --n3 switch). It pays off in double precision for the Whetmat and Whetscale. We expected no payoff here, which is indeed the case for the single-precision versions of these marks. My guess is that this effect is associated with data-cache thrashing: If your cache is thrashing, there is still a benefit to storing things in registers. The second hidden effect is the use of inline transcendentals (--n2), which paid off handsomely for both single and double precision.
The trend we see developing for the Pentium is that it runs at 20 megaflops at 60 MHz and 23--25 megaflops at 66 MHz when running good "Pentium" code. It doesn't always take good code to run fast, however, so depending on your application, you may luck out with bad code. But if your program massages data stored in off-chip cache or memory, it pays to run Pentium-scheduled code.
All RISC-like processors--including the Pentium--are extremely sensitive to the code used to drive them. If you access data from external cache or memory when it can be avoided, or execute instructions which depend on values still being computed by other units, you will add cycles (called "stalls") to your overall cycle count and time. When things that were supposed to take two cycles suddenly start taking four, your program slows down by a factor of 2! CPUs such as the 486 take 10--12 cycles to do floating-point operations and benefit much less from scheduling because they are relatively insensitive to one- or two-cycle mistakes.
In the NDP compilers, floating-point operations are scheduled when the number of stores and loads in a block makes it worthwhile to determine a better sequence, or schedule, of instructions. The compiler helps this along with loop unrolling, which turns frequently used small loops into long ones that do four times as much work. In heavily used inner loops, two or three numeric operations are often performed in succession on pieces of data brought in from memory. At the beginning of such a loop, the data is usually loaded into registers, and at the end of the loop it gets stored. Normally, there are several operations between the loads and stores, each of which must execute in order for the program to translate properly. This often forces one numeric unit to wait for another, which costs cycles. It is possible to start the adder unit while the multiplier unit is doing an operation, as long as the adder does not need the multiplier's current result as its input. When one unit must wait for another's result in this way, we say there is a "load store" conflict, or stall. If it is possible to execute the iterations of a loop out of order (the definition of a vectorizable loop) and the loop gets unrolled, then it's often possible to swap numeric instructions so that the numeric units stay busy. This technique is called "scheduling."
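To make the unrolling step concrete, here is a C sketch of a DAXPY loop unrolled by four, the factor mentioned above. This is my own illustration (the function name daxpy_unrolled4 is hypothetical); the compilers discussed in this article perform the transformation automatically:

```c
#include <stddef.h>

/* Sketch of DAXPY unrolled by 4. Each trip through the main loop now
   contains four independent multiply/add chains, which gives a
   scheduler room to interleave instructions so the adder never waits
   on the multiplier's current operation. */
void daxpy_unrolled4(size_t n, double da, const double *x, double *y)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {     /* four iterations' work per trip */
        y[i]     += da * x[i];
        y[i + 1] += da * x[i + 1];
        y[i + 2] += da * x[i + 2];
        y[i + 3] += da * x[i + 3];
    }
    for (; i < n; i++)               /* leftover iterations */
        y[i] += da * x[i];
}
```

Because the four statements in the main body have no dependencies on one another, their loads, multiplies, and adds can be legally reordered--exactly the freedom the scheduler exploits.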
The scheduler spends most of its time figuring out which instructions in a sequence can be legally swapped. (In C, this is often impossible, because of the aliasing of information by pointers.) Next, it does a bubble-like sort, which examines several instructions at a time to see if the code can be improved by swapping instructions locally. This process goes on until no more improvements can be made. There is no unique way to do scheduling, and everyone has their favorite approach. The basic goal is to reduce time lost to units waiting for each other. For example, if an add immediately follows a multiply and one of its inputs is the output of the multiply, a 1-cycle stall occurs. Similarly, if a multiply immediately follows a multiply, there is a 1-cycle loss because multiplies take two cycles. If the second multiply depends on the output of the first, add another lost cycle. If you look at the scheduled code in Example 1, you'll see that multiplies (fmuls) usually get padded with loads (flds). Loads are 1-cycle instructions that pair with either adds or multiplies, so they make good filler. You will also see that by the time the first add (fadd) gets used, the multiply that precedes it is working on the second iteration, whose result will not be needed until the second add is started--that is, a "load store" dependency between numeric operations does not exist. A block of code that has been unrolled by four will normally end up having a rhythm to it, in which most of the loads float up to the top and most of the stores sink to the bottom. In the middle are combinations of adds and multiplies interspersed with loads and stores. Note that the NDP code in Example 1(e) is quite different from that recommended by Intel, yet provides equivalent performance. Normally there are dozens of equivalent ways to schedule a sequence.
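The legality test at the heart of such a scheduler can be shown with a toy model. This is a deliberately simplified sketch of my own, not the NDP heuristic: each instruction is reduced to one destination register and two source registers, and two adjacent instructions may trade places only if neither reads or writes what the other writes:

```c
#include <stdbool.h>

/* Toy dependence check for local instruction swapping. A real
   scheduler applies such a test repeatedly, bubble-sort fashion,
   accepting only swaps that reduce the estimated stall count. */
typedef struct { int dest, src1, src2; } Instr;

bool can_swap(Instr a, Instr b)
{
    /* b must not read a's result (true dependence),
       a must not read b's result (anti-dependence),
       and they must not write the same register (output dependence). */
    if (b.src1 == a.dest || b.src2 == a.dest) return false;
    if (a.src1 == b.dest || a.src2 == b.dest) return false;
    if (a.dest == b.dest) return false;
    return true;
}
```

In this model, a multiply followed by an add that consumes its result may not be swapped, but a multiply followed by an unrelated load may--which is exactly why loads make good filler between dependent numeric operations.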
The routine that we chose as our example is DAXPY (a primitive defined in LINPACK). Its inner loop reads:
DO i = 1,n
   Y(i) = Y(i) + DA*X(i)
END DO
This loop is one of the most studied pieces of code on the planet because it is the key to Gauss Elimination. What's so special about it is that the solution of linear systems used to account for 30 percent of the CPU time at a typical scientific establishment, and this algorithm is the key to those solutions when the arrays used are dense. What makes DAXPY tricky is the amount of I/O--for every iteration of DAXPY executed, the system must read and write between 16 and 24 bytes of data, depending on whether X can be cached or not. On old mainframes with relatively slow numerics, the bottleneck was floating point. As machines got faster, the bottleneck switched from numerics to memory bandwidth. This is exactly what happens when you switch from a 486 to a Pentium. Execute this loop 10 million times and you will have to do between 160 and 240 megabytes of I/O. The 10 LINPACK megaflops we measured correspond to a bandwidth of 120 Mbytes/sec. Intel thought DAXPY so important that they chose it as the primary floating-point example in their Pentium compiler notes, and I've written at length on how to optimize this code on the i860.
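The bandwidth arithmetic here is easy to check. Each DAXPY iteration performs 2 flops (one multiply, one add) and, when X misses the cache, moves 24 bytes (read X(i), read Y(i), write Y(i)); with X cached, 16 bytes. The helper below is a hypothetical illustration of the calculation, not code from the article:

```c
/* Convert a DAXPY megaflop rate into memory traffic in Mbytes/sec.
   Assumes 2 flops per iteration; bytes_per_iter is 24 when X is
   uncached (read X, read Y, write Y) or 16 when X stays cached. */
double daxpy_bandwidth_mbytes(double megaflops, int bytes_per_iter)
{
    double iters_per_sec = megaflops / 2.0;   /* 2 flops per iteration */
    return iters_per_sec * bytes_per_iter;    /* millions of bytes/sec */
}
```

Plugging in the measured 10 LINPACK megaflops with uncached X gives 5 million iterations/sec times 24 bytes, or the 120 Mbytes/sec quoted above.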
Example 1(a) is the pseudocode executed by a loop that has not been unrolled. The corresponding assembly listing is shown in Example 1(b). The x87 floating-point instructions always reference the x87 stack top, st0. When I say load DA, I mean place the constant DA on the stack, or into st0. This is followed by multiply X(i), which means take X(i) out of memory, multiply it by st0 (that is, DA), and leave the result in st0. This is followed by the add of Y(i), whose result ends up in st0, and a store which takes the result out of st0 and places it back in the location of Y(i). There are seven other registers that aren't used in Example 1(b) but are used in Examples 1(c) and 1(d). The store instruction has a postfix P attached to it, which results in the stack getting popped, or cleaned up. Examples 1(c) and 1(d) were taken from the Intel Pentium Compiler Writer's Guide. Example 1(b) shows what transpires in a single iteration of the algorithm; note the characteristic ld, mul, add, and stp operations just discussed.
The pattern used in Example 1(b) gets repeated three times in Example 1(c)--that is, Intel is unrolling by 3. At the end of each loop, the indexes are updated, and the process repeats. Example 1(d) shows what happens when Pentium scheduling is applied. The three iterations unrolled by Intel suddenly get mixed together. At the start is a sequence of load/multiplies. The scheduler begins by multiplying DA by the first and second X(i)s. It then does the first add, followed by the last multiply, and completes with the last two adds. Between the numeric operations, you will find a load at the beginning, three stores toward the end, and a number of fxchs. The fxchs are old x87 instructions that have been revamped on the Pentium so that instead of costing three cycles, they cost zero cycles. In Example 1(c), note that each of the three iterations cleans up the stack before the next one starts. This doesn't happen in Example 1(d), because the operations have been rescheduled to improve speed. Getting things back to the TOS (top of stack) requires fxchs, which swap the specified register (st1..st7) with the TOS (st0). So fxchs make it possible to write scheduled code, but they also produce code that resembles a "shell game." It is nearly impossible to simply examine the code and determine its meaning unless you "play computer." This, of course, is what code generators are designed to do.
Example 1(e) is the code produced by NDP Fortran. It is very similar to that produced by Intel, except that it is unrolled by 4 and uses a slightly simpler addressing mode. This is an example of machine-generated and scheduled code. Notice how our scheduling algorithm broke the problem into pairs of multiplies followed by pairs of adds. This arrangement does not affect performance, as the one-cycle "cracks" between multiplies are filled with loads. Our code appears to have one extra cycle of overhead per loop compared to the Intel sequence, but the extra increment is the exception to the earlier comment about superscalar integer operations not helping floating point. When you divide this benefit over four loops, it produces a 1/4-cycle improvement out of 6, roughly a 4 percent increase in speed. It's nice to see the integer pipes playing a role, but the role they play is a factor of 25 less than that played by properly scheduling the floating-point units. We used cycle counts produced by Intel for Examples 1(b), (c), and (d). The best Pentium code beats the 486 by a factor of 5 to 6.
In Fortran programs, the proper scheduling of lower-level code is a feasible task because it is possible to associate memory references with Fortran arrays, and thus to figure out if it is legal to swap instructions. In C, scheduling is often an impossible task because you can't always tell whether arrays are aliased. This is just one reason why Fortran is the preferred language for numeric optimizations. Another is that algorithms expressed in Fortran map better to most systems' hardware. In his article, "Optimizing Matrix Math on the Pentium" (DDJ, May 1994), Harlan W. Stockman discusses the Pentium scheduling of DAXPY and DDOT (routines in LINPACK). He examines three different ways to write a matrix multiply in C to get it to perform optimally, then takes the best and hand schedules it. His best results on a 60-MHz Pentium are roughly 16 megaflops. The best value we got for the Whetmat doing a simple compilation of a Fortran matrix multiply using our built-in Pentium scheduler was 20 megaflops. Harlan also observed a 12-to-1 speed variation across his dot-product routines; we observed a 20-to-1 speed variation over our program compiled with different switches. Harlan then went on to improve the code produced by a production-grade C compiler for DAXPY to demonstrate how to rewrite code by hand to get good Pentium results for LINPACK. I join Harlan in lamenting the lack of Pentium compilers that make hand coding unnecessary. He also points out that the code that results from translating Fortran to C with a tool like f2c is quite bad: It executes three times more slowly than the original because of translation-associated problems. If you want fast numeric code, stick with Fortran. Most mainframes were built around it, and it comes closest to the equations of science. If you want to get fancy, buy a copy of Fortran-90.
If you compare Harlan's DAXPY code with the DAXPY suggested by Intel and produced by NDP Fortran, you'll see that the Stockman code did not do as many fxchs as the machine-generated code, partly because it takes a machine to schedule instructions. Harlan's code is more readable than the compiler-produced code, mainly because heuristics do not concern themselves with readability.
The real test of any primitive routine is not how well it runs in a concocted test program, but how well it runs doing the real thing. The Whetmat benchmark tests a dot product doing the real thing--a matrix multiply. Because it was written in Fortran, new columns are fetched from memory in order and the reused column lies in the cache. As a result, a well-scheduled Whetmat will run close to the theoretical scalar speed of the Pentium up to the point where the reused column no longer fits into the cache. Even then, it is possible to hit full speed using a vectorization trick known as "strip mining." Conditions for DAXPY are quite a bit harsher when it is used in the real world to do Gauss Elimination. Whether or not it hits its peak performance of 6 cycles per loop is a function of the size of the arrays, the bandwidth of the external memory system, and the operating system's ability to turn off the caching of Y(i).
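Strip mining itself is simple to sketch. In this hypothetical C example (my own illustration, not NDP library code), two passes over a long vector are broken into cache-sized strips, so the second pass over each strip finds its data still resident instead of streaming the whole vector through the cache twice:

```c
#include <stddef.h>

#define STRIP 512   /* strip length chosen to fit the on-chip cache; tune per machine */

/* Illustrative strip mining: fuse two passes (scale, then accumulate)
   strip by strip. Data loaded into the cache by pass 1 over a strip
   is still there when pass 2 runs over the same strip, instead of
   having been flushed by the rest of the vector. */
void scale_then_accumulate(size_t n, double a, double *x, double *sum)
{
    *sum = 0.0;
    for (size_t s = 0; s < n; s += STRIP) {
        size_t end = (s + STRIP < n) ? s + STRIP : n;
        for (size_t i = s; i < end; i++)   /* pass 1: scale the strip */
            x[i] *= a;
        for (size_t i = s; i < end; i++)   /* pass 2: reuse the cached strip */
            *sum += x[i];
    }
}
```

The same idea applies to the Whetmat: processing the reused column in strips that fit the 8K cache keeps it from being washed out between uses.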
We measured DAXPY under the ideal condition that vectors X and Y were in the Pentium's cache. This resulted in a 20-megaflop value quite close to the one we obtained for the Whetmat. We plotted the results of running DAXPY and LINPACK in Figure 1. The DAXPY test is similar to testing an engine on a dynamometer: You learn facts about shaft horsepower and torque as a function of engine RPM, but you won't be able to predict the engine's performance on a track until you put it into a chassis and try it out. LINPACK could be thought of as a Formula 1 track. To run it, a system requires not only good numerics power, but a very fast memory system and the ability to keep X in the cache and Y out of it. All of these issues were taken into account in the XP and Microway's ArrayPRO/XP card. This is quite apparent in Figure 1, where a 50-MHz XP beats a 60-MHz Pentium by factors that range from 3:1 to 6:1.
The upper two arcs of data in Figure 1 are for LINPACK and DAXPY running on the ArrayPRO/XP card, which features a single-cycle memory system that is among the fastest running on a desktop. It has a 288-bit-wide data bus which drives four leaves of 64-bit memory at 400 Mbytes/sec. DAXPY is the topmost curve, and we can see that it parallels LINPACK, running approximately 10 percent above it. DAXPY peaks on the plot at 31.5 megaflops, while LINPACK asymptotically approaches 28. These two plots are typical of real vector systems running vector codes. Vector algorithms running on vector machines always have a break-even length, somewhere in the vicinity of vector lengths 10--20, above which it pays to use vector code instead of scalar code. The reason there is a break-even point is that properly coded vector primitives have overhead associated with setting up their inner loops; as the vector length gets much larger than the break-even length, the overhead becomes less important. This explains the case of the XP. If we were to plot the XP results out beyond lengths of 2000, we would see them fall off. That would correspond to the XP's cache no longer being able to hold the X vector. The reason LINPACK stays below the DAXPY mark is twofold. First, DAXPY sets the upper limit, since it is the rate-limiting activity. Second, the balance of LINPACK has to do things like back substitution, which are not as vectorizable.
The Pentium has the same data bus as the XP, and at 60 MHz it is capable of 480 Mbytes/sec if run with a single-cycle (no wait state) memory system. Unfortunately, such a system adds expense and does not benefit Windows, so don't expect to see a supercomputer-style memory running on a Pentium anytime soon. The LINPACK and DAXPY results for the Pentium essentially demonstrate what happens when things don't fit into the cache. Our DAXPY test program quickly rises to 20 megaflops for vector lengths up to 300. It then rolls over as the vectors climb out of the cache, an effect which begins at lengths of 330. LINPACK shows an even faster rollover and only peaks at 10 megaflops, a factor of 2 below the DAXPY peak. This is a result of no longer working with vectors that fit into the cache, but dealing instead with columns of a 200x200 array whose locations in the array are always shifting. Since a 200x200 array of doubles takes 320,000 bytes, the 8K on-chip cache is now useless. As a consequence, both the Xs and Ys are now mostly fetched from external cache or memory. In addition, on the Pentium, you cannot count on X being cached (even though it changes only rarely) because there is no way to prevent the Ys from flushing the Xs out of the cache. To make matters worse, where the XP only had to worry about reading Ys, the Pentium has to worry about switching back and forth between Xs and Ys, which means that memory pages will switch every time it accesses an X after accessing a Y. Since this happens very often, and most systems today use paged DRAMs that take a hit when pages change, the memory system takes a further hit in performance. Adding up this little house of access horrors, you can now see why our DAXPY engine loses half its steam at the peak and, for most lengths of interest, only produces a LINPACK rating on the order of 6 megaflops. The peak in LINPACK occurs when X is coming partly out of the on-chip cache.
When this no longer occurs, we see a slower linear drop-off that follows the gradual washout of Y from the external cache. LAPACK is now starting to replace LINPACK because it is based on dot products instead of DAXPY, making it less sensitive to the memory bottleneck observed here.
At the bottom of the plot is the DAXPY of Microsoft Fortran 5.0. We see 3.3 megaflops for MS Fortran 5.0 running on the Pentium and 2.04 running on the 486. Note that the MS Fortran results do not have an obvious peak, although we did see a 0.5 percent increase in the same area where the NDP Pentium marks exhibited a peak. The 16-bit codes run so poorly that the problems with memory and cache usage never appear.
The LINPACK in Figure 1 exhibits three different types of system behavior. At one extreme, we have the 486 demonstrating constant speed independent of vector length. This behavior is characteristic of old-style microprocessors and mainframes. At the top, we see behavior typical of supercomputers and devices that combine vector numerics, vector registers, and very large memory bandwidth. In the middle, we see the Pentium, which has a fast scalar unit but no way of controlling what stays in its cache, and no single-cycle memory system like the XP's.
After writing this article, we revisited our benchmarks because we hadn't really pinned down how to produce the best numeric Pentium code and didn't understand all the side effects that appeared to determine Pentium speed. We started off investigating some "mysterious" data, discovering that our scalar anomalies were reproducible--they weren't simply mistakes. We also discovered that it's difficult to track down speed issues with complex benchmarks--a benchmark-uncertainty principle exists which prohibits you from reducing a complex benchmark to its components and getting agreement between the components and the complete mark. This is especially true for double-precision marks compiled so that they do not allocate certain variables to x87 registers. We found that the speed of double-precision codes which did not use register coloring could vary by a factor of 2! We suspected cache dependencies among the doubles in stacks and heaps. These dependencies could also make it difficult to get accurate times--benchmarks fluctuate significantly between runs even though our repeat counts are high enough to guarantee accurate timings. We discovered one inner loop that ran faster with one Fortran run time than it did with another, even though no run time was being called.
Consequently, I set up an experiment to find out what was going on. First, I looked at what happened to my matrix-multiply benchmark when we improved the code. Using larger unrolled loops resulted in faster speeds. On a 60-MHz Intel Express motherboard, I was able to hit 23 megaflops unrolling by 4, and 24.8 megaflops unrolling by 16. While a 7 percent improvement was okay, it didn't match the 15 percent improvement our scheduler's statistics predicted, and it was nowhere near the 30+ megaflops the benchmark ought to get. For the problem in question, an i860XR hits 70 megaflops and an XP hits 94 megaflops, so the problem wasn't with the bus-interface unit. The i860XR hits this mark with half the data coming from the cache, and the XP will hit it with all the data coming from memory. For our original Pentium mark, half the data should be coming from memory and half from cache--but we didn't expect that at 25 megaflops the system would be memory bound. To explore just how bound it was, I switched to a mark which did dot products with arrays that slightly overflowed the Pentium's cache--arrays 4000 elements long in single precision. The results are shown in Table 2.
As Table 2 indicates, we were suddenly hitting 35 megaflops, which is under two cycles per operation--marks that scaled nicely with our scheduler predictions. In fact, the Pentium was beating the scheduler, which is something we thought might happen and probably results from discrepancies in our reading of the Intel Compiler Writer's Guide, not to mention the level of effort we put into integer scheduling (which might have saved us one cycle at best). After examining the code and data, we decided our code was nearly identical to that produced by the highly touted Intel code generator. Also, when running out of the on-chip cache, it is possible to write a simple formula for the time that it takes to execute a dot-product inner loop: #cycles=4(+2)+n*3 (where (+2) is a code-uncertainty factor that appears in the last loop and might not exist at all). In general, it takes three cycles to execute a single dot product after the unrolling factor becomes large. In fact, for the Pentium it is easy to show that the speed of a dot product running out of the cache can asymptote to 40 megaflops. "Asymptote" implies that we have unrolled the loop to the point where the four to six cycles of overhead per loop are no longer important; the problem runs long enough to average down the time required to switch from one dot product to the next. For most RISC devices, the sum of these two overheads is around 10 percent, and you only exceed 90 percent of a chip's theoretical capabilities at large vector lengths. We've seen these effects with XPs running large dot products, where the benchmark tells you 94 megaflops but a logic analyzer indicates the CPU can cruise at 100 megaflops until the inner loop ends and the next dot product starts.
Next, we explored what happens when vector lengths increase. Figure 2 shows single- and double-precision dot products run with arrays dimensioned to 4000 elements, while Figure 3 shows results of the same program with the arrays dimensioned out to 128,000 elements. (The source code and data that generated these plots are available electronically; see "Availability," page 3.) For each of these points, we performed 150,000,000 floating-point operations; each point takes anywhere from 5 to 30 seconds to run. Note that the two plots differ: There is a disadvantage to using large arrays to hold small amounts of data. That disadvantage stems from the fact that while the cache is large enough to hold both arrays, there aren't enough tag bits to keep the two arrays from squabbling over the lower address bits. You see in Figure 2 that the Pentium starts off around 30 megaflops, increases to 35 megaflops as the vector overhead washes out, and falls away to 26.7 and 24.0 megaflops when the marks leave the cache. These fall-offs occur at vl=1000 in single precision and vl=500 in double precision. We actually see a small drop-off at the saturation point itself and a more precipitous fall at the points where vl=600 and vl=1200. The rates after the fall-off correspond to the speed the Pentium can muster running out of the second-level cache. Figure 2 includes our estimate of the effective bandwidth of the L2 cache for each of these marks. The double-precision figure is larger than the single-precision one, which indicates that what is holding the chip back is not absolute cache bandwidth, but the ability of the cache to switch back and forth between reads and writes to different addresses in the L2 cache.
Figure 3 is more complicated. Instead of getting two asymptotic performance levels, we get three. The Pentium just barely makes it to 35 megaflops in single precision and then falls off to its L2 asymptote, which is lower than the L2 asymptote in Figure 2. This is followed by another fall when L2 fills up, corresponding to data coming from external memory. We discovered that the transition between Figures 2 and 3 occurs on the Intel Express motherboard when we go from 64,000-element arrays to 128,000 elements. The external-memory limits correspond to bandwidths of 75 (single precision) and 85 (double precision) Mbytes/sec. Again, we get more bandwidth in double precision, but not that much more, which indicates we are probably measuring the speed of a single process. My guess is that these effective bandwidths are well below the peak bandwidth the motherboard can achieve, as evidenced by our LINPACK mark, which required 156 Mbytes/sec from the system to hit 13.5 megaflops. The real problem with memory accesses is likely the architecture. True RISC devices have enough registers to load 64 bytes from one array before switching to the other. They also make it possible to protect the contents of the cache. On the Pentium, there is no way to avoid driving the cache-memory system crazy with small random requests that alternate back and forth between arrays and the pages in memory which hold them. Ironically, a chip which can hit 35 megaflops running out of its internal cache slows down by a factor of 5 when running in double precision from external memory.
In conclusion, a 60-MHz Pentium is capable of delivering 9 to 35 megaflops doing dot products and 19 to 25 megaflops doing matrix multiplies on matrices of order 100. In a 100x100 matrix multiply, 100 elements come out of the cache and 100 elements come out of memory. This is a hybrid situation midway between the lower and upper asymptotes in Figure 3. With LINPACK, it hits 12.25 megaflops at vl=100 and 13.5 at vl=150 before falling off rapidly. Pentium performance depends heavily on the code it runs and on the motherboard it runs on. The best code allocates variables to x87 registers, unrolls by factors of 8 or more, and uses Pentium scheduling. The Pentium is much more sensitive to code quality than the 486: the same pair of binaries that show a 12:1 spread on the Pentium show a spread of only 3.3:1 on a 486.
All the Pentium motherboards we examined had memory systems whose performance fell below what the chip is capable of. At 60 MHz, a properly supported Pentium can read/write memory at 480 Mbytes/sec (almost as fast as the latest HP PA-RISC technology). However, Intel doesn't provide the technology needed to interface the chip to memory systems that run this fast. For example, the Intel Express motherboard we used to run our DDOT experiments (which happens to employ one of the best cache controllers in the industry) appears to have an effective bandwidth that is a factor of 6 below the Pentium's theoretical bandwidth! We often found with Pentium motherboards that the higher the clock frequency, the poorer the memory performance. The same LINPACK that runs at 12 megaflops on the 60-MHz Intel Express falls off to 7 megaflops on a less-expensive 90-MHz motherboard that uses a well-known 486 chipset adapted to the Pentium.
The Pentium is not alone in this quandary. Numerous RISC devices capable of incredible performance out of the cache fall off the mark quickly as you leave their design points. High performance out of the cache is important if you are running problems with a small number of variables that fit into the cache. However, these are often the sorts of problems where unrolling cannot be employed, because they contain scalar dependencies which prevent loads and stores from being swapped. In DDOT, we lost a factor of 2 in performance running out of the cache when we went from unrolling by 4 to no unrolling. This also means that the elusive 35-megaflop peak we finally achieved by caching both vectors is not an issue with most scalar problems. The typical performance we see for such problems ranges from 17 to 23 megaflops. Problems that do fit into the cache and are not too memory bound include DSP applications involving FFTs, 3-D graphics, rendering, and neural-net backpropagation. However, the i860 still beats the Pentium in these areas, so don't expect any quick wins in the high-end data-processing markets, where the fastest part usually wins. In the end, the Pentium will go down as a very good try, whose legacy may end up being "don't expect a general-purpose part designed around Windows to blow away the numerics world of FFTs, DAXPY, and DDOT."
I'd like to thank Mahesh Srinivasan for adapting our i860 scheduler to the Pentium and Mark Barrenechea and David Livshin for numerous intellectual discussions.
(a)
loop:   load DA
        multiply X(i)
        add Y(i)
        store Y(i)
        increment i
        check end of loop
        loop if necessary

(b)
loop:   fld   [esp+8]
        fmul  [ebx+eax*4]
        fadd  [ecx+eax*4]
        fstp  [ecx+eax*4]
        inc   eax
        cmp   eax,ebp
        jle   loop

(c)
loop:   fld   [esp+8]
        fmul  [ebx+eax*4]
        fadd  [ecx+eax*4]
        fstp  [ecx+eax*4]
        fld   [esp+8]
        fmul  [ebx+eax*4+4]
        fadd  [ecx+eax*4+4]
        fstp  [ecx+eax*4+4]
        fld   [esp+8]
        fmul  [ebx+eax*4+8]
        fadd  [ecx+eax*4+8]
        fstp  [ecx+eax*4+8]
        add   eax,3
        cmp   eax,ebp
        jle   loop

(d)
loop:   fld   [esp+8]
        fmul  [ebx+eax*4]
        fld   [esp+8]
        fmul  [ebx+eax*4+4]
        fxch  st1
        fadd  [ecx+eax*4]
        fld   [esp+8]
        fmul  [ebx+eax*4+8]
        fxch  st2
        fadd  [ecx+eax*4+4]
        fxch  st1
        fstp  [ecx+eax*4]
        fxch  st1
        fadd  [ecx+eax*4+8]
        fxch  st1
        fstp  [ecx+eax*4+8]
        add   eax,3
        cmp   eax,ebp
        jle   loop

(e)
loop:   fld   [esi]
        fmul  [ebx]
        fld   [esi+8]
        fmul  [ebx]
        fxch  st1
        fadd  [eax]
        fld   [esi+16]
        fxch  st2
        fadd  [eax+8]
        fxch  st2
        fmul  [ebx]
        fld   [esi+24]
        fmul  [ebx]
        fxch  st1
        fadd  [eax+16]
        fxch  st2
        fstp  [eax]
        fadd  [eax+24]
        fxch  st2
        fstp  [eax+8]
        fstp  [eax+16]
        fstp  [eax+24]
        add   esi,32
        add   eax,32
        dec   ecx
        jne   loop
Table 2: Dot-product results by optimization set.

Optimization set       Measured speed    Measured cycles    Scheduler-predicted
(NDP Fortran 4.5)      in mflops         per loop           cycles/stalls per loop
-OLM -onrc -sc         17.06
-OLM -on -ur=4         29.7/29.64        16                 19/2
-OLM -on -ur=8         33.7/33.2         28                 31/2
-OLM -on -ur=16        35.47/34.95       54                 56/3

Figure 1: DAXPY and LINPACK versus vector length.
Figure 2: Single- and double-precision dot-product speed versus vector length.
Figure 3: Dot-product speed for vectors dimensioned to 128,000 elements.
Copyright © 1995, Dr. Dobb's Journal