Optimizing IA-64 Performance

Dr. Dobb's Journal July 2001

If you use the right features, you can accelerate performance on the IA-64

By Sverre Jarp

Sverre is a scientific staff member at CERN, the European Organization for Nuclear Research in Geneva, Switzerland (http://cern.ch/sverre/SJ.html). He can be reached at sverre.jarp@cern.ch.

The speed of computers has increased tremendously over the last decade. Both AMD and Intel have crossed the frequency barrier of 1 GHz and there is every indication that multigigahertz systems are just around the corner. Compare these clock rates to that of the Ferranti-Mercury, which CERN exploited in the late 1950s that had a cycle time of 60 microseconds. The ratio is almost five orders of magnitude!

Even at current speeds, users continue to demand faster computers. The Accelerated Strategic Computing Initiative (ASCI) program (http://www.llnl.gov/asci/) intends to improve performance by three (or more) orders of magnitude in the next decade, hoping to go from teraflops to petaflops of combined performance for solving a given problem. At the European Organization for Nuclear Research (CERN, http://public.web.cern.ch/Public/), we want to increase our computing capacity from thousands of SPECint95 to a few million when the Large Hadron Collider is switched on in 2005.

In this context of endless demand for higher and higher performance, I've closely examined the key features of the Itanium Processor Architecture and microarchitecture. The Itanium, originally known as the IA-64 (and codenamed "Merced" even before that), is a 64-bit processor designed by Hewlett-Packard and Intel. In addition to the obvious performance gains that 64-bit addressing brings, the Itanium also supports performance-enhancing techniques such as predication, speculation, rotating registers, a wide parallel execution core, high clock speed, fast bus architecture, multiple execution units, and the like. Moreover, the Itanium is designed from the ground up around parallelism and uses a new kind of instruction set based on the Explicit Parallel Instruction Computing (EPIC) specification. EPIC has been described as an approach that merges CISC and RISC computing onto a single processor, thereby allowing the processing of Windows-based and UNIX-based applications, among other features. Operating-system support for the IA-64 has been announced for 64-bit Windows, HP-UX, various flavors of Linux, and AIX 5l.

In this article, I show how to achieve optimal code generated by a compiler or generate optimized sequences of IA-64 assembly code to ensure top speed.

Key Itanium Performance Features

The Itanium architecture contains features that make programming the corresponding microprocessors very powerful. While some of the features have been available in previous architectures, the IA-64 formalizes them in its architectural definition:

Predication. Over 90 percent of the instructions can be controlled by predicated registers to determine on the fly whether they should be executed. This is a more powerful capability than, for instance, conditional move instructions and should lead to more compact code in larger blocks and fewer branches.
Software pipelining. Loop-controlling registers, predication, and a concept of rotating registers (programmable register renaming) let loops be staged to optimize performance. Through correct use of predication, there is no code bloat to represent the prologue (while building up all the loop stages) or the Epilogue (required for draining the loop correctly).
Vectorization (single instruction, multiple data, or SIMD). Both integer and floating-point (FP) data can be acted upon by using a large set of SIMD instructions. You can, for instance, operate on 8 bytes of integer data in a parallel add instruction. This feature is available in many modern architectures and it is particularly important for multimedia programs.
Wide execution. The IA-64 architecture groups three instructions together in a bundle and the Itanium processor allows two such bundles to be executed together. This means that a total of six instructions can be executed per cycle, but for this to happen, a certain number of rules must be followed.
Memory programming. Itanium has architectural definitions for both prefetching data and instructions. The lfetch instruction lets data be prefetched into a certain cache level (L1, L2, or L3) and the branch predict instruction (brp), among others, allows future branches to be signaled to the instruction fetch unit so that instruction fetches are activated correctly.

Targeting Peak Performance

Ideally, programmers only have to worry about the design and the high-level language implementation of an application, leaving everything else to optimizing compilers and/or performance libraries. Alas, life is unfortunately a bit more complicated. Based on my experience, here is my strategy for achieving peak performance on any modern microprocessor:

Profile the application to find out where the time is spent and what kind of optimization would be needed/useful.
Examine the compiler-generated code in the performance-critical region(s) to see if key performance features are being deployed in the performance-intensive area.
Use the compiler option (highest level of optimization, normally O3 or O4), or Profile-Guided Optimization (if/when available) to try to improve performance.
Check if optimized libraries are available and if they can do the job correctly.
If all of the aforementioned fails and it is clear that a lot of performance can still be gained, decide if assembly is the right option.

Here, I use an example coded in assembly language to illustrate how you can achieve speed increases. This example is a simple search for a value (the key) in an array of integers (defined as int). The example is valid for both Windows and Linux because both define int as a 32-bit integer. The example is analogous to the find method defined for vectors in the C++ Standard Library. Programmed in C, the code might look like Listing One. In assembly language, each implementation of this routine is completed in three steps: the assembler directives, the entry/exit code of the routine, and the loop itself.

A Counted Loop

The most straightforward approach in IA-64 is to code the search routine as a sequentially programmed loop exploiting the IA-64's loop count register (AR.LC) for counting.

Step 1: Assembler Directives. First, establish the minimal set of assembler directives as in Listing Two. They are self-explanatory and can be found in the IA-64 Assembler Reference Guide (http://devresource.hp.com/devresource/Docs/Refs/IA64ISA/index.html).

Step 2: Entry/Exit Code. Next, write the code needed for initialization and termination of the routine as in Listing Three. I use the preprocessor to define names for save registers. The full explanation of the IA-64 calling conventions can be found in the IA-64 Software Conventions and Runtime Architecture Guide (http://www.ia64.hp.com/infolibrary/architecture/ia64rt-gen.pdf). I've written this code with an emphasis on readability. This part of the example does not use the optimization features of IA-64. Each section (entry or exit) only requires a few cycles to execute. The meaty part is all in the loop.

The alloc instruction in Listing Three reconfirms the three input registers — in0, the key; in1, the array size; and in2, the array pointer. The compare instruction, which ensures that there is at least one element in the array, can be executed in parallel with the corresponding branch. The first return register, ret0, is initialized to -1 and updated inside the loop.

In the exit code, it is necessary to restore the loop count register, but not the "Previous Function State." Branch register zero (rp=b0) contains the return address, so all you have to do is to use it.

Step 3: The Loop. Listing Four is the loop itself. In this implementation I use five instructions (in two instruction groups) and therefore pay the full price of the load latency for every element. If the entire array is in L1 cache, the latency is two cycles and the loop runs in three. If it is in L2 cache, the loop runs in 7(6+1) cycles.

The combined procedure plus a simple test program (ready for compilation and execution on either IA-64 Linux or Windows) are available electronically (see "Resource Center," page 5). (Refer to "find1.S" and "find2.S" in the tar file "jarp.tar.")

The Software Pipelined Loop

While software pipelining improves loop performance, it also increases the complexity of both the assembler directives and the entry/xxit code. Nevertheless, the routine remains readable and manageable.

Step 1: Additional Directives. Again, software pipelining relies on the use of the rotating registers. To assist in writing code for such rotations, the assembler introduces an array-like syntax via .rotx directives, where the x is either p (predicates), r (general registers), or f (FP registers). All you need are the additional directives in Listing Five.

All the additional directives depend on (and change automatically with) the two critical latencies in the loop, load latency, and compare latency. I exploit this later on.

Step 2: Entry/Exit Code. Things become more complicated in Listing Six, mainly because the rotating integer registers occupy the same "space" as the input registers. This forces you to add move instructions to clear the input registers before the loop. The alloc instruction now allocates five local registers in addition to the three input ones. This lets you define a set of eight rotating registers for use in the loop. Eight is the minimum set you can use. The .rotr directives must be put behind the alloc instruction. You must also save the predicate registers and initialize the Epilogue register (AR.EC), which is the second application register used for controlling loops. In the exit code, you have to restore the predicates but also the Previous Function State. This is, strictly speaking, not necessary for leaf routines.

Step 3: The Improved Loop. The pipelined loop can now be presented as a one-cycle loop, as in Listing Seven. This is possible when there are six or fewer instructions (of the right kind) in the loop. The pipelined loop consists of three stages: the load stage, compare stage, and branch stage. As already mentioned, the last two collapse into one on the Itanium since there is no latency between the compare and the corresponding branch.

As with the first example, the entire procedure plus a simple test program (ready for compilation and execution on either IA-64 Linux or Windows) can be found online. The new procedure is now much more satisfactory. On the whole it uses slightly over one cycle per array element, the exact formula is [1+(Epilogue-1)/M], which approaches 1 when the effective number of loop trips (M) is large compared to the Epilogue.

Modifying the Load Latency

The additional assembler directives are written with the load and compare latency values as parameters. A neat little trick is to extend the load latency from 2 (L1 latency) to 6 (L2 latency). On the Itanium, the L1 data cache is only 16 KB in size, whereas the combined L2 cache is six times larger. Therefore, you have a better chance of finding a medium-sized array in L2. Everything else in the procedure remains unchanged, as in Listing Eight. The important point is that in a pipelined loop, you should avoid the assumption that data is always in L1 cache. By assuming L2 latency as in this example (or even L3 latency), I pay only once when increasing the Epilogue value, whereas I obtain an insurance against the data not being in L1, which is valid per each loop iteration.

Explicit Bundling and SIMD

Although the speed is by now quite respectable, improvements can still be made. However, I will first introduce the notion of explicit (or manual) bundling. This is done mainly for the sake of clarity when pushing performance. I mentioned in the overview of the performance features that IA-64 defines a bundle as a set of three instructions and that two bundles may execute in parallel. The architecture currently defines 12 bundle formats (actually 24, since they are defined with or without an encoded "stop" bit). With explicit bundling, the example looks like Listing Nine.

The usefulness of this approach is related to the microarchitecture. The Itanium has a certain number of execution units (more correctly expressed as "execution ports") and its instruction-dispersal network (see Figure 1) dispatches up to six instructions to six different execution ports if it can. If it cannot, it takes additional cycles to dispatch the rest. Explicit bundling helps you express parallelism in a correct fashion as in Figure 1.

The real sophistication comes from adding yet another performance feature, namely SIMD vectorization. Because the routine searches for a 32-bit integer, you can exploit the fact that a 64-bit register can be used to check two array elements at a time. The loop is modified as in Listing Ten. The main modification is the introduction of a parallel compare instruction in the slot that was empty. Since this instruction does not set a predicate, but creates a mask, you need to preserve the original compare instruction to see if a nonzero mask was created, and then trigger the branch. The value PCL corresponds to the latency between the parallel compare and the normal compare. As you would expect, the Epilogue is now equal to LL+PCL+CL+1.

There is an additional cost in the entry/exit part of the routine when increasing the level of sophistication of the loop. When deploying vectorization, you need to worry about data alignment (ld8 must have an 8-byte alignment), and you also need to worry about whether the loop count (N) is even or odd. The positive outcome is, of course, that you have practically doubled the performance. When M is large, each element will only need half a cycle for checking.

Prefetching

To this point, I have been assuming that the vector to be searched resides in L1 or L2 cache. For large vectors, you may not be able to count on data being in cache at all, in which case you need to insert a prefetch instruction. The advantage of the introduction of explicit bundling is that I know in detail which instructions I have coded versus which execution units are available. My problem for guaranteeing top speed is then the following — both bundles are full and I cannot find a free instruction slot!

Well, as every programmer knows, never say never. You can almost always fit one extra instruction into a sequence (without being detrimental). This exercise is left to the reader: An M slot needs to be freed up to issue the prefetch instruction that should be coded by using the postincrement option (in the same way as the load instruction). Here are a couple of hints (since there is more than one way to solve the issue):

What happens if the loop is unrolled once more?
Are all the instructions equally important in the existing loop?

Conclusion

The Itanium Processor Architecture and microarchitecture contain features that can help improve performance dramatically when deployed correctly. In this article, I've presented ways to check if the compilers have done their job. When compilers don't do a perfect job, performance suffers. Performance-hungry programmers may consequently want to inspect the results from compilations of compute-intensive parts of their programs and may even want to try some highly optimized assembly code. By deploying software pipelining, worrying about wide execution (and how it matches on to the available execution ports), adding vectorization, and finally adding prefetching, you can greatly enhance the Itanium's performance. In more complex examples where, for instance, predication can play a major role, the improvements could be even higher.

This has led to assembly-code implementations of routines for random-number generation, encryption/security algorithms, histogramming/pattern recognition, kernel programming (including IP checksumming), memory copy/string manipulation, multimedia, transcendental scalar/vector math libraries, and vector/matrix routines. This is already an impressive list and there could definitely be more to come.

DDJ

Listing One

int find(int key, int N, int* array)
{
   int i;
   for (i=0; i<N; ++i)
   {   if (key == array[i]) return i;  // Found   }
   return -1; // Not found
}

Back to Article

Listing Two

 #define Name find
 .text
 .global Name
 .type Name,@function 
 .proc   Name
Name:
//   Initialization code comes here
//   Loop comes here
//   Exit code comes here
 .endp

Back to Article

Listing Three

#define s_pfssave  r9
#define s_lcsave    r10
            alloc   s_pfssave=ar.pfs,3,0,0,0
            mov    s_lcsave=ar.lc

            cmp.le.unc p6,p0=in1,r0
(p6)        br.cond.dpnt.few  notfound
;;
            add     in1=-1,in1   // loop count - 1
;;
            mov     ret0=-1      // initial index count
            mov     ar.lc=in1    // initial loop count
;;
// The loop comes here   
notfound:  mov     ret0=-1   ;;  // Not found
found:     mov    ar.lc=s_lcsave
             br.ret.sptk.many rp
 .endp

Back to Article

Listing Four

#define s_temp      r31 
cntloop:
            ld4        s_temp=[in2],4
            add       ret0=1,ret0       // tracking of index
;;    
           cmp4.eq.unc    p6,p0=s_temp,in0
(p6)       br.cond,dptk.few  found
           br.cloop.dptk.few     cntloop
;;

Back to Article

Listing Five

 #define LL 2  // L1 Load latency
 #define CL 0  // Compare latency
 .rotr   array[LL+1] 
 .rotp   pc[LL+1],qc[CL+1]

 #define Epilogue   LL+CL+1

Back to Article

Listing Six

Name:          
           alloc   s_pfssave=ar.pfs,3,5,0,8
 .rotr   array[LL+1]
            mov    s_prsave=pr
            mov    s_lcsave=ar.lc
            cmp.le.unc p6,p0=in1,r0
(p6)        br.cond.dpnt.few  notfound
            mov    s_key=in0      // move the key
            mov    s_parray=in2  // move the array ptr
            brp.loop.imp modloop,modloop+16
            mov    ar.ec=Epilogue
            add     in1=-1,in1 // loop count - 1
            mov    pr.rot=1<<16 // initialise pr16
;;
            mov     ret0=-1       // initialise the index count
            mov     ar.lc=in1     // initialise the loop count
;;
// The loop comes here   
notfound:   mov    ret0=-1   ;;   //Not found
found:      mov    ar.lc=s_lcsave
            mov    ar.pfs=s_pfssave
            mov    pr=s_prsave,-1
            br.ret.sptk.many    rp

Back to Article

Listing Seven

modloop:
(pc[0])      ld4     array[0]=[s_parray],4
(pc[LL])     add     ret0=1,ret0   // easy tracking of index    
(pc[LL])     cmp4.eq.unc qc[0],p0=array[LL],s_key
(qc[CL])     br.cond.dpnt.few      found
              br.ctop.dptk.few    modloop   
;;

Back to Article

Listing Eight

 #define LL 6  // L2 Load latency
 #define CL 0  // Compare latency
 .rotr   array[LL+1] 
 .rotp   pc[LL+1],qc[CL+1]

 #define Epilogue   LL+CL+1

Back to Article

Listing Nine

modloop:
{ .mii

(pc[0])       ld4       array[0]=[s_parray],4
(pc[LL])      add       ret0=1,ret0          // easy tracking
(pc[LL])      cmp4.eq.unc  qc[0],p0=array[LL],s_key
}
{ .mbb
              nop.m 0
(qc[CL])      br.cond.dpnt.few      found
              br.ctop.dptk.few modloop   
;; }

Back to Article

Listing Ten

modloop:
{ .mii
(pc[0])          ld8        array[0]=[s_parray],8
(pc[LL])         pcmp4.eq   array[LL]=array[LL],s_key
(pc[LL+PCL])     add        ret0=2,ret0
 }
 { .mbb
(pc[LL+PCL])     cmp.ne.unc qc[0],p0=array[LL+PCL],r0
(qc[CL])        br.cond.dpnt.few      found
                 br.ctop.dptk.few    modloop  
;; 
}

Back to Article