Debra joined Intel in 1989 after completing her Master's and Engineer's degrees in EE/CS at MIT. She can be reached through the DDJ offices.
The requirements for a powerful graphics processor include fast spatial transformations and fast rendering. Fast transformations, which demand fast matrix manipulations, are necessary because typical three-dimensional graphics applications translate and rotate objects on a screen in three-dimensional space. Fast rendering is necessary because applications typically store an object to be displayed as a collection of vertex locations for polygons (commonly triangles) which describe the object's surface. Each vertex is stored along with its corresponding information--color and depth (Z-value), for example. The set of vertices and their associated information is also called a "display list."
Before an object can be displayed on a screen, a processor must first flesh out the attributes for pixels between vertices. This fleshing out of a complete pixel-by-pixel representation from the vertex information stored in the graphical database is accomplished by interpolating the vertex attributes (that is, color and depth values) for all the pixels inside the polygons determined by the vertices in the database. One of the most common algorithms for interpolating color values, "Gouraud shading," is simply linear interpolation.
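To make the linear interpolation concrete, here is a minimal Python sketch for one color component along one scan line. The function name and the use of floating-point numbers are illustrative assumptions; they are not part of the i860 instruction set discussed later.

```python
def interpolate_scanline(c_start, c_end, n):
    """Linearly interpolate one color component across n pixel steps.

    Returns n + 1 values: the two end values and the evenly spaced
    values between them.
    """
    if n <= 0:
        return [c_start]
    step = (c_end - c_start) / n          # per-pixel color increment
    return [c_start + k * step for k in range(n + 1)]


# One component (say, red) running from 10 to 50 over four pixel steps.
reds = interpolate_scanline(10, 50, 4)
```

The same routine would be called once each for red, green, and blue, and once more for depth.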
Due to its highly parallel architecture, Intel's i860 processor excels at matrix manipulations. For one thing, the processor can run in dual instruction mode (DIM), whereby a core instruction (a load or store) can execute at the same time as a floating-point instruction (a multiply). On the 50-MHz i860 XP CPU, the 128-bit-wide data-cache-to-register-file path provides sustainable data throughput of 800 Mbytes per second for accesses to data in the 16-Kbyte on-chip data cache. For accesses that miss the cache, a 64-bit-wide burst bus shuttles data in and out of the processor at up to 400 Mbytes/sec.
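The two bandwidth figures follow from the path widths and the clock rate. The short check below assumes one full-width transfer per clock and takes a Mbyte as 10^6 bytes; both are assumptions made here to reproduce the quoted numbers.

```python
CLOCK_HZ = 50_000_000            # 50-MHz i860 XP, from the text


def peak_bytes_per_second(path_width_bits, clock_hz=CLOCK_HZ):
    # Assumes one full-width transfer on every clock.
    return (path_width_bits // 8) * clock_hz


cache_path = peak_bytes_per_second(128)   # data cache to register file
burst_bus = peak_bytes_per_second(64)     # external burst bus
```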
While the wide data paths keep the floating-point units continuously fed, the pipelined architecture ensures that the data is processed just as quickly. The floating-point multiplier sustains one result every clock in single-precision mode (32-bit results), or one result every other clock in double-precision mode (64-bit results). The floating-point adder keeps pace, producing one result per clock in either single- or double-precision mode.
Furthermore, using its dual-operation instructions, the i860 CPU can perform both floating-point adder and multiplier operations simultaneously. The multiplier and adder can each sustain a throughput of one result per clock in single-precision pipelined mode, so the net throughput of dual-instruction mode and dual-operation mode together is up to three results per clock: one from the core instruction, one from the multiplier, and one from the adder.
Display-list processing requires two types of interpolation: color interpolation and depth interpolation. With color interpolation, pixel colors are stored and fed to DACs as red, green, and blue (RGB) values, and each color component must be interpolated separately. The i860 CPU employs the faddp.d (floating-point add pixel, double-word length) instruction to accomplish this. Faddp.d always operates on 64 bits' worth of pixels at a time; how those 64 bits are interpreted depends on the software-supplied setting of the PS (pixel size) field of the PSR control register.
The i860 architecture incorporates hardware support for 8-, 16-, and 24- or 32-bit pixels (with 24- and 32-bit pixels being treated identically). For simplicity, let's illustrate using 32-bit pixels. Sixteen-bit pixels are similar to 32-bit pixels; 8-bit pixels are more confusing, because there is no standard for representing or processing them.
We begin by calculating the blue (B), green (G), and red (R) intensities for the first two pixels (i and i+1) in a triangle scan line; see Figure 1. Each color intensity is represented as an 8-bit integer portion and a 24-bit binary fraction for purposes of calculation. Let's assume that we've calculated the total color delta over the current triangle scan line for each color component R, G, and B (for example, B_color_delta = B[i+n]-B[i]) and have divided that color delta by the number of pixels to be interpolated across (pixel_delta = n). The result of this division, also represented as an 8-bit integer and a 24-bit fraction, is the incremental color delta.
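The fixed-point arithmetic just described can be sketched in a few lines of Python. The helper names are hypothetical, and holding a negative delta as a 32-bit two's-complement value is an assumption of this sketch rather than something the article specifies.

```python
FRAC_BITS = 24                   # 8-bit integer part, 24-bit fraction
MASK32 = 0xFFFFFFFF


def to_fixed(intensity):
    """Convert an integer intensity (0..255) to 8.24 fixed point."""
    return (intensity << FRAC_BITS) & MASK32


def integer_part(value):
    """Recover the 8-bit integer portion of an 8.24 value."""
    return (value & MASK32) >> FRAC_BITS


def incremental_delta(c_first, c_last, pixel_delta):
    """color_delta / pixel_delta, kept as a 32-bit 8.24 value."""
    color_delta = (c_last - c_first) << FRAC_BITS
    return (color_delta // pixel_delta) & MASK32


# Blue rises from 16 to 48 across eight pixels: four intensity steps per pixel.
b_step = incremental_delta(16, 48, 8)
```

Adding b_step to to_fixed(16) once gives the 8.24 value for the next pixel, whose integer part is 20.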
Now we're ready to recursively add the incremental color delta for each color component to the initial values, so that each successive pixel's RGB values along the triangle scan line are calculated. Here's where faddp.d helps out by automating and speeding up the process.
First let's calculate B values for the next two pixels (i+2 and i+3) in the triangle scan line. To do that, just put the initial B values for the first two pixels of the triangle scan line (i and i+1) into faddp.d's 64-bit op1, side by side in the op1 register pair, as shown in Figure 2. You'll need to use the predefined format of eight bits integer portion and 24 bits fractional portion to use the instruction properly. Then let op2 = two instances of 2*(B_color_delta)/pixel_delta, again side by side in the op2 register pair. The reason the interpolant value is 2*(B_color_delta)/pixel_delta, rather than simply B_color_delta/pixel_delta, is that you are interpolating from pixel i to pixel i+2 in one half of the register pair, and from pixel i+1 to pixel i+3 in the other half.
In one clock faddp.d adds the color fields, generating the B values for the next two pixels. (In fact, like most i860 CPU instructions, all the graphics instructions execute in just one clock.) The result is placed in the fdest register pair so that it can be used as the op1 next time around, in order to generate the B values for pixels i+4 and i+5.
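The two-pixels-per-add scheme can be modeled in Python as follows. This is a sketch under stated assumptions: pixel i is placed in the low 32 bits of the pair, the two halves are added independently with no carry between them, and all function names are hypothetical. The MERGE side effect is left out here.

```python
FRAC_BITS = 24
MASK32 = 0xFFFFFFFF


def pack2(lo, hi):
    """Place two 32-bit values side by side in one 64-bit operand."""
    return ((hi & MASK32) << 32) | (lo & MASK32)


def halves(pair):
    return pair & MASK32, (pair >> 32) & MASK32


def add_pixel_pairs(op1, op2):
    """Add two 64-bit operands as two independent 32-bit halves (assumed)."""
    a_lo, a_hi = halves(op1)
    b_lo, b_hi = halves(op2)
    return pack2(a_lo + b_lo, a_hi + b_hi)


def blue_scanline(b_first, delta, count):
    """Integer B values for `count` pixels, generated two per add."""
    op1 = pack2(b_first, b_first + delta)   # pixels i and i+1
    op2 = pack2(2 * delta, 2 * delta)       # each half advances two pixels
    out = []
    while len(out) < count:
        lo, hi = halves(op1)
        out += [lo >> FRAC_BITS, hi >> FRAC_BITS]
        op1 = add_pixel_pairs(op1, op2)     # result becomes the next op1
    return out[:count]


blues = blue_scanline(16 << FRAC_BITS, 4 << FRAC_BITS, 6)
```

Because each result feeds the next add as op1, the loop body is a single add per pixel pair once the operands are set up.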
In addition, when PS is set for 32-bit pixels, faddp.d shifts the MERGE register right by eight bits and then updates certain MERGE fields with the integer portions of the faddp.d result. That's so that after three applications of faddp.d--once for R values, once for Gs, and once for Bs--the RGB values for two pixels will be consolidated ("merged") in the MERGE register in precisely the arrangement (packed-pixel format) that graphics hardware typically requires.
After three iterations of faddp.d, one 8-bit field is left unused in the MERGE register. That field can have any other attribute (such as texture) ORed into it with the form (floating-point or with merge) instruction. Form also transfers the MERGE register contents into a floating-point register pair in preparation for storing to the frame buffer, and it clears the MERGE register for the next set of interpolations.
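The shift-and-merge behavior for 32-bit pixels can be modeled as below. The article does not say which MERGE bits are updated, so the field positions (the top byte of each 32-bit half), the independent 32-bit adds, and the class and method names are all assumptions made for this sketch.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1
# Assumed field positions: the 8-bit integer part at the top of each 32-bit half.
FIELDS_32 = 0xFF000000_FF000000


def pair(lo_int, hi_int):
    """Two 8.24 values with zero fractions, side by side in 64 bits."""
    return (hi_int << 56) | (lo_int << 24)


class PixelUnit:
    def __init__(self):
        self.merge = 0

    def faddp(self, op1, op2):
        """Add both halves, then shift MERGE right 8 and load the integer parts."""
        lo = ((op1 & MASK32) + (op2 & MASK32)) & MASK32
        hi = ((op1 >> 32) + (op2 >> 32)) & MASK32
        result = (hi << 32) | lo
        shifted = (self.merge >> 8) & ~FIELDS_32 & MASK64
        self.merge = shifted | (result & FIELDS_32)
        return result

    def form(self, attribute=0):
        """OR an attribute into MERGE, return the result, and clear MERGE."""
        out = (self.merge | attribute) & MASK64
        self.merge = 0
        return out


unit = PixelUnit()
step = pair(2, 2)
unit.faddp(pair(48, 49), step)      # blue  -> 50, 51
unit.faddp(pair(98, 99), step)      # green -> 100, 101
unit.faddp(pair(198, 199), step)    # red   -> 200, 201
packed = unit.form(attribute=0x7F)  # fill the unused low byte of one pixel
```

In this model the first component applied ends up lowest in each pixel, so the order of the three adds determines the packed layout.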
With the RGB values for pixels i, i+1, i+2, and i+3 calculated, the next op1 of the faddp.d instruction will be the B values of pixels i+2 and i+3; the B interpolants in op2 remain the same as they were in the first set of B interpolations. Likewise, after the B values for pixels i+4 and i+5 are obtained, their G and R values are interpolated. In this way, the RGB values for all pixels within a triangle scan line can be quickly and efficiently calculated.
Sixteen-bit pixels are handled similarly to 32-bit pixels, except that for purposes of calculation, colors are represented by an integer portion (for example, Int[Bi]) of six bits and a fractional portion (Frac[Bi]) of ten bits. As illustrated in Figure 3, one faddp.d sums two sets of four pixels' color fields (blue, for instance), updates four 6-bit fields of the MERGE register, and shifts MERGE right by six bits. After two more such instructions, one for green and one for red, the MERGE register contains RGB values for four pixels and is ready to be stored out. One difference for 16-bit pixels, however, is that because there is not room in a 16-bit pixel for six bits each of R, G, and B intensities, two fields (normally for R and G) are allocated six bits each, while the third field (for B) is truncated to just four bits during shifting of the MERGE register. The bits are allocated this way because the human eye is significantly less sensitive to differing shades of blue than of red or green.
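The truncation of the third field falls out of the shifting, as this sketch of the MERGE update alone (without the add) shows. The field positions (the top six bits of each 16-bit lane) and the blue-green-red order of application are assumptions chosen to match the 6-6-4 allocation described above.

```python
MASK64 = (1 << 64) - 1
# Assumed field positions: the top six bits of each 16-bit lane.
FIELDS_16 = 0xFC00_FC00_FC00_FC00


def lane_value(int6, frac10=0):
    """One 16-bit lane: 6-bit integer part above a 10-bit fraction."""
    return ((int6 & 0x3F) << 10) | (frac10 & 0x3FF)


def four_lanes(values):
    """Pack four 6-bit intensities into the four lanes of a 64-bit result."""
    out = 0
    for k, v in enumerate(values):
        out |= lane_value(v) << (16 * k)
    return out


def merge_step_16(merge, result):
    """Shift MERGE right by six bits, then load the four 6-bit fields."""
    shifted = (merge >> 6) & ~FIELDS_16 & MASK64
    return shifted | (result & FIELDS_16)


blue = [55, 12, 63, 0]
green = [1, 2, 3, 4]
red = [32, 31, 0, 63]

merge = 0
for channel in (blue, green, red):      # blue first, so blue loses two bits
    merge = merge_step_16(merge, four_lanes(channel))

pixels = [(merge >> (16 * k)) & 0xFFFF for k in range(4)]
```

After the third step each pixel holds six bits of red, six of green, and only the top four bits of blue; the two low blue bits are shifted out of the lane.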
Because 8-bit pixels are a nonstandard format, color interpolation for them is often platform dependent. However, because the i860 CPU pixel interpolation instructions only define operand field sizes, and not their uses, the 8-bit faddp.d instruction can be easily adapted to a wide variety of implementations.
In 3-D graphics applications, objects' surfaces, and the pixels that represent these surfaces, have depth (Z-values) associated with them. Just like color values, however, Z-values are only given explicitly for triangle vertices on objects' surfaces. Z-values for pixels on or inside the triangles must be interpolated from the vertex values.
Z-values can be either 16 or 32 bits long. To accelerate interpolations, the graphics instruction faddz (floating-point add with Z merge) interpolates two 16-bit Z-values at a time. Just as in color interpolation, a Z-value interpolant is recursively added to initial Z-values from pixels at one end of a triangle scan line to generate the Z-values of pixels along the scan line.
As shown in Figure 4, the interpolation results are stored in a floating-point register pair. Additionally, the MERGE register is shifted right 16 bits and then updated with the integer portions of the interpolation sums. That way, after two successive faddz instructions, the MERGE register contains 16-bit Z-values for four pixels in a row.
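A comparable model for 16-bit Z interpolation is sketched below. It assumes each 32-bit half holds a Z-value as a 16-bit integer part over a 16-bit fraction, that the updated MERGE fields are the upper 16 bits of each half, and that the halves are added independently; none of these details is spelled out in the text.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1
# Assumed field positions: the upper 16 bits of each 32-bit half.
Z_FIELDS = 0xFFFF0000_FFFF0000


def z_pair(z_lo, z_hi):
    """Two Z-values with zero fractions, side by side in 64 bits."""
    return (z_hi << 48) | (z_lo << 16)


def faddz(merge, op1, op2):
    """Add both halves, shift MERGE right 16, load the integer parts."""
    lo = ((op1 & MASK32) + (op2 & MASK32)) & MASK32
    hi = ((op1 >> 32) + (op2 >> 32)) & MASK32
    result = (hi << 32) | lo
    merge = ((merge >> 16) & ~Z_FIELDS & MASK64) | (result & Z_FIELDS)
    return merge, result


merge = 0
merge, _ = faddz(merge, z_pair(100, 300), z_pair(10, 10))
merge, _ = faddz(merge, z_pair(200, 400), z_pair(10, 10))
z_fields = [(merge >> (16 * k)) & 0xFFFF for k in range(4)]
```

After the second add, all four 16-bit fields of the modeled MERGE register are occupied, which is the state the article describes as ready to use.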
Because 32-bit Z-buffer calculations require more bits of precision than can be accommodated with faddz, they are more efficiently interpolated using the 64-bit integer add instruction, fiadd.dd.
When displaying a 3-D object, not all of its surfaces are to be displayed simultaneously, or the back of the object (with respect to a viewer) might overwrite the front. Likewise, in a scene consisting of multiple objects, some objects' surfaces may obscure other objects. This is why we calculate Z-values during rendering: once Z-values have been calculated for all the different objects' surfaces, those Z-values can be used to decide which surfaces to display. Selecting which pixels to display is known as "hidden surface removal."
One popular method of hidden surface removal is the Z-buffer approach. The Z-buffer, an area of main memory, holds the Z-value of each pixel currently displayed. The Z-buffer serves as a reference against which newly computed pixels' Z-values can be checked.
If a newly computed pixel's Z-value is smaller (closer to the viewer) than the Z-value of the pixel already displayed at that pixel's (x,y) coordinates, then the newly computed pixel is displayed instead of the previous one, and the Z-buffer is updated with the new pixel's Z-value. If the newly computed pixel's Z-value is larger than the Z-value of the pixel already displayed at that pixel's (x,y) coordinates, then the newly computed pixel is not displayed at all, and the Z-buffer retains its value for the given pixel location.
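In software, the Z-buffer test amounts to one comparison per pixel. The sketch below uses hypothetical names, plain Python lists for the frame buffer and Z-buffer, and an assumed "farthest" clear value of 0xFFFF. The text does not say what happens when the two Z-values are equal; this sketch keeps the existing pixel in that case.

```python
FAR = 0xFFFF   # assumed clear value for a 16-bit Z-buffer


def make_buffers(width, height, background=0):
    frame = [[background] * width for _ in range(height)]
    zbuf = [[FAR] * width for _ in range(height)]
    return frame, zbuf


def plot(frame, zbuf, x, y, color, z):
    """Draw the pixel only if it is closer than what is already there."""
    if z < zbuf[y][x]:
        frame[y][x] = color
        zbuf[y][x] = z
        return True
    return False


frame, zbuf = make_buffers(4, 2)
first = plot(frame, zbuf, 1, 0, "red", 500)     # closer than FAR: drawn
second = plot(frame, zbuf, 1, 0, "blue", 900)   # behind red: rejected
third = plot(frame, zbuf, 1, 0, "green", 100)   # in front of red: drawn
```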
The i860 CPU has two kinds of special graphics instructions, fzchks/fzchkl (floating-point Z-buffer check short/long) and pst.d (double-word pixel store), which expedite the Z-value comparison and subsequent store operations.
Fzchks compares four pairs of 16-bit Z-values in one operation. Normally one of the sets of four Z-values is from newly computed pixels; the other set is from the Z-buffer. Fzchks first shifts the contents of the 8-bit PM (pixel mask) field in the PSR control register right by four bits. Then, for each of the four comparisons, it sets one of the four high-order bits of PM if the newly computed pixel has a smaller Z-value than the corresponding one stored in the Z-buffer.
PM is shifted right so that the results of two successive fzchks instructions accumulate in the 8-bit PM field. The PM field is used by the pst.d instruction, which examines the contents of PM and stores to the frame buffer only those pixels within its 64-bit register pair operand that correspond to set bits in PM. Thus only those pixels which need to be updated in the frame buffer are actually written out.
Fzchkl (l for long) is identical to fzchks (short) except that it compares two pairs of 32-bit Z-values at a time, shifts PM right by only two bits, and only updates the two high-order bits of PM corresponding to the results of the two 32-bit comparisons.
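The effect of both instructions on PM can be captured in one small model. Which PM bit receives which comparison result is an assumption here (comparison k sets bit 8-n+k, where n is the number of comparisons), as are the function name and the treatment of equal Z-values as "not closer."

```python
def fzchk(pm, new_z, old_z):
    """Model fzchks (four Z pairs) or fzchkl (two Z pairs) on the 8-bit PM."""
    n = len(new_z)
    if n not in (2, 4) or len(old_z) != n:
        raise ValueError("expected two or four Z-value pairs")
    pm = (pm & 0xFF) >> n                 # make room in the high-order bits
    for k in range(n):
        if new_z[k] < old_z[k]:           # new pixel is closer to the viewer
            pm |= 1 << (8 - n + k)
    return pm


# Two fzchks operations accumulate eight comparison results in PM.
pm = fzchk(0, [5, 9, 3, 7], [6, 6, 6, 6])
after_first = pm
pm = fzchk(pm, [1, 1, 9, 9], [6, 6, 6, 6])
```

After the second call, the low four bits of PM hold the first group's results and the high four bits hold the second group's.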
Here's the only potentially confusing piece of the puzzle. Although PS and PM are both used by pst.d, they are unrelated. That is, the number of bits allotted to a pixel and the number allotted to a Z-value are independent of each other. You can have an 8-bit pixel with a 32-bit Z-buffer, a 32-bit pixel with a 16-bit Z-buffer, or any other combination you please.
Pst.d stores 64 bits at a time, which represents eight pixels if your pixel size is 8 bits, but only four pixels if your pixel size is 16 bits, or two pixels if your pixel size is 32 bits. Although PM presumably has eight bits (8 pixels' worth) of information in it from multiple fzchks/l instructions, pst.d only examines the appropriate number of low-order bits of PM. (The "appropriate" number depends on the pixel size as described in the next section.) Pst.d also shifts PM right by 8/pixel_size_in_bytes bits, where pixel_size_in_bytes is determined by PS. That sets up PM for the next pst.d. Multiple pst.d instructions are executed until eight pixels in a row have been stored to the frame buffer (or not stored, depending on the contents of PM).
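A masked store of this kind can be modeled as follows. The frame buffer is a flat Python list, the function name is hypothetical, and the mapping of PM bit k to the k-th pixel of the 64-bit operand is an assumption. The shift count equals the number of pixels per store, which is the same as 8/pixel_size_in_bytes.

```python
def pst_d(pm, pixels, frame, base):
    """Store only the pixels whose PM bit is set; return the shifted PM."""
    n = len(pixels)                       # 8, 4, or 2 pixels per 64 bits
    if n not in (8, 4, 2):
        raise ValueError("64 bits hold eight, four, or two pixels")
    for k, value in enumerate(pixels):
        if (pm >> k) & 1:                 # low-order bits select the pixels
            frame[base + k] = value
    return (pm & 0xFF) >> n               # set up PM for the next store


frame = [0] * 8
pm = 0b00110101                           # eight accumulated Z-check results
pm = pst_d(pm, [11, 12, 13, 14], frame, 0)    # four 16-bit pixels
pm = pst_d(pm, [15, 16, 17, 18], frame, 4)    # next four pixels
```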
Assume your pixel size is 8 bits (as determined by the PS field of PSR) and your Z-values are 16 bits. In order to generate eight pixels' worth of Pixel Mask information, you must perform two fzchks instructions, which compare four Z-value pairs at a time. Then you must execute one pst.d, which stores (or doesn't store, depending on PM) eight 8-bit pixels, exploiting all eight bits of PM. All eight bits of PM have been "used up," so you must then proceed to the next round of fzchks instructions before executing another pst.d. This correlates with the fact that one pst.d shifts PM right by 8/1 = 8 bits--that is, effectively shifts all eight bits out.
Alternatively, say your pixel size is 16 bits, and your Z-values are 32 bits. Set up PM with four consecutive fzchkl instructions, each of which compares two Z-value pairs at a time. Then, because one pst.d only stores (potentially) four pixels, exploiting only the low-order four bits of PM, you'll need to execute two pst.d instructions in a row before proceeding to the next fzchkl instructions. Again, this makes sense because pst.d with 16-bit pixels shifts PM by 8/2 = 4 bits.
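The bookkeeping in these two examples generalizes to any combination of pixel size and Z-value size. The helper below, with a hypothetical name, simply counts how many of each instruction are needed per group of eight pixels, using the per-instruction figures given in the text.

```python
def instructions_per_eight_pixels(pixel_bits, z_bits):
    """Return (fzchk count, pst.d count) needed to process eight pixels."""
    if pixel_bits not in (8, 16, 32) or z_bits not in (16, 32):
        raise ValueError("unsupported pixel or Z-value size")
    compares_per_fzchk = 4 if z_bits == 16 else 2   # fzchks vs. fzchkl
    pixels_per_pst = 64 // pixel_bits               # pixels in one 64-bit store
    return 8 // compares_per_fzchk, 8 // pixels_per_pst


first_example = instructions_per_eight_pixels(8, 16)     # two fzchks, one pst.d
second_example = instructions_per_eight_pixels(16, 32)   # four fzchkl, two pst.d
```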
Because they provide hardware support for rendering as well as fast transformations, the i860 CPUs are optimal solutions for demanding graphics applications. Scientific visualization, CAD/CAM, animation, and other graphics-oriented applications can all benefit from the i860 CPUs' graphics features, enjoying performance improvements of up to ten times compared to conventional integer operations.
Copyright © 1992, Dr. Dobb's Journal