Bill is the author of the 386BSD operating system and can be contacted at wjolitz@cardio.ucsf.edu.
While the events surrounding the Pentium floating-point bug continue to play out, it is clear that this incident has focused attention on the potential for serious flaws in our computers--even though nearly all of the computers ever produced have had equally serious microprocessor flaws. Why is a flaw in the fifth digit of a number significant enough to make the nightly news? To answer this, we need to understand a bit about why these bugs occur in the first place.
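The flaw itself is easy to demonstrate. The widely circulated division check divides two carefully chosen integers and then multiplies the quotient back; on a correct IEEE divider the residual is exactly zero, while on a flawed Pentium the quotient came out wrong around the fifth significant digit. A minimal sketch (in Python, which performs the arithmetic in IEEE double precision on any modern machine):

```python
# The canonical Pentium FDIV check.  On a correct FPU, multiplying the
# quotient back reproduces the numerator exactly, so the residual is 0.0.
# On a flawed Pentium, the quotient was wrong around the fifth significant
# digit, leaving a residual of roughly 256.
x = 4195835.0
y = 3145727.0
residual = x - (x / y) * y
print(residual)   # 0.0 on a correct FPU
```

These particular operands are chosen so that the rounding errors of the divide and multiply cancel exactly, which is what makes the test a clean pass/fail probe for the hardware.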
It is not surprising that errors might arise in floating-point arithmetic, as this area is notorious for its complexity and resistance to diagnosis. In fact, prior to the now-standard IEEE floating-point formats, the numeric results from different brands of computers were rarely comparable, and computations in certain numeric ranges might produce completely different answers. For example, IBM mainframes used base-16 exponents and thus had effectively smaller, more coarsely granular mantissas than the base-2 IEEE formats in use today, while CDC mainframes used reciprocal approximations (instead of divide units) for speed. Consequently, programmers occasionally needed to check published error-deviation charts to ensure adequate precision. These floating-point "deviations" were not considered to be of great significance, since floating point was considered the province of scientists and engineers, who could be relied upon to track loss of precision.
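The base-16 effect can be sketched numerically. The helpers below are a simplification, assuming an IBM System/360-style fraction of 6 hex digits; because hex normalization only guarantees a nonzero leading hex digit, up to three leading bits of the fraction can be zero, so the spacing between representable values "wobbles" by a factor of up to 8 relative to an always-normalized binary format:

```python
import math

def hex_float_ulp(x, frac_hex_digits=6):
    # Simplified S/360-style hex float: 6-hex-digit fraction, normalized
    # so the leading *hex digit* is nonzero -- up to three leading bits
    # of the fraction may therefore be zero.
    e = math.floor(math.log(abs(x), 16)) + 1   # hex exponent: x/16**e in [1/16, 1)
    return 16.0 ** (e - frac_hex_digits)       # spacing of representable values near x

def ieee_single_ulp(x):
    # IEEE single precision: 23 fraction bits after an always-nonzero leading bit.
    e = math.floor(math.log2(abs(x)))
    return 2.0 ** (e - 23)

# Just above a power of 16, hex normalization wastes three bits of precision:
print(hex_float_ulp(16.0) / ieee_single_ulp(16.0))   # 8.0
# Just below it, the two formats space values identically:
print(hex_float_ulp(15.9) / ieee_single_ulp(15.9))   # 1.0
```

This wobble is exactly why pre-IEEE results in "certain numeric ranges" could diverge: the precision you got depended on where your operands happened to fall relative to a power of the exponent base.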
After the KCS/IEEE floating-point standardization work (and its widespread adoption by Intel, Motorola, and Sun), floating-point usage became as commonplace for mainstream computer users as ordinary arithmetic had always been. An immense number of floating-point coprocessors have been sold; however, it's hard to believe that there are that many budding scientists and engineers out there.
As a result of this widespread usage, manufacturers have been forced to deal with this problem more methodically, through the development and implementation of elaborate numeric analysis and test procedures specifically designed to localize these problems. Since there are dozens of other comparable chips that use floating point (SuperSPARC, PowerPC, and so forth), and that have successfully (so far) run this gauntlet, it came as a shock to many insiders that Intel fell short.
Problems with floating point fall into two categories: the pathological and the systematic. Oddly enough, what appears to be a pathological case may turn out to be systematic in a specific application. For example, in the early 1970s I used HP programmable calculators to compute trigonometric functions that translated VOR navigational fixes into map coordinates for aircraft navigation.
During the first operational test of the system, we were shooting an approach into NASA/Ames and plotting fixes onto the map, and a new fix suddenly went wide--right into the coastal mountains. This remarkable error was later diagnosed as a bug in one of the trigonometric functions: For a certain range of narrow angles, it would return the value for a complementary angle instead. Since HP had notified customers of the exact problem from the outset, the bug was easily worked around. However, intrigued by how easily we had stumbled across it, we went back and examined other runway approaches and found that the problem was not uncommon, since navigational aids were frequently located on airport fields in such a way that the angles in question were often observed in routine flight conditions. Thus, the characteristics of the application turned a pathological exception into a common case with large safety implications.
Another renowned pathological case, which actually was quite widespread, involved a flaw in the floating-point accelerator of the venerable VAX computer. Like the Pentium, this flaw resulted in an instruction (polyd) losing precision. Unlike the Pentium, the fix would have required a complete redesign of the accelerator since the hardware did not implement enough guard bits to ensure precision. DEC finessed this problem by "unaccelerating" the instruction using microcode (loaded from media, thankfully) to perform the instruction "slowly." (This fix was akin to solving the Pentium bug by using floating-point emulator software, running 20 times slower.) However, this fix annoyed customers, since this instruction was marketed as being crucial to speeding up transcendental functions. (DEC later "fixed" this concern by removing the instruction.)
Intel itself is no stranger to chip bugs in its x86 family. Over 100 significant bugs have been acknowledged in the x86's lifetime--some requiring replacement of units. Many of these bugs only showed up in protected mode, and were dealt with in the operating system by vendors. (386BSD ran into, and compensated for, many of these in the notorious "sigma-sigma" 386DX chips in the 1980s.) It is commonplace for a manufacturer to deal with these bugs by providing work-arounds, external circuitry, or compiler modifications. Frequently this information is confidential, since manufacturers don't want competitors to exploit the information in ads or news reports.
One interesting aspect of Intel's current dilemma is that the company's success has been founded on compatibility and precision, allowing it to command market share and dictate pricing. Because its flagship processor is 99 percent compatible with past chips, Intel simply cannot afford a recall, which could eat into its technology lead over the competition.
This experience has left many users with a number of unanswered questions: Is this the last bug to be found in this processor family? Who is responsible for solving the problem? When is a bug significant enough to warrant recall? And finally, who is liable for errors in calculations performed by the chip?
While the fallout from this spectacle may eventually answer some of these questions, we know one thing for certain--Intel has learned the hard way that it cannot bury its head in silicon and deny the problem, nor can it ignore its customer base's demands that the problem be solved. Unlike the old days, when engineers and physicists were forgiving of the flaws in their computers, the marketplace must now be served in a comprehensive and simple manner, or else it will abandon Intel for its competitors.