Al is the author of DOS 6: A Developer's Guide (M&T Books, 1993) and DOS and Windows Protected Mode (Addison-Wesley, 1993). You can reach Al at 310 Ivy Glen Court, League City, TX 77573 or on CompuServe at 72010,3574.
It's no secret that floating-point calculations can slow program execution. For those PCs that have them, coprocessors can make a difference but, when compared to floating point, integer arithmetic is still faster. Libraries that emulate coprocessors are also available; however, these emulators are notoriously slow.
In this article, I present an alternative approach to floating-point math that takes advantage of 32-bit instructions. Even though I use the 80386 as an example platform, these techniques can be applied to other 32-bit processors as well. (You could even adapt the routines to run on 16-bit processors, but you would see some loss of performance.)
To use FPM (floating-point math), you'll need MASM or TASM to assemble the main source file. You can then use FPM from assembly, C, or C++ programs. Since FPM takes advantage of several 386 instructions, you'll also need a 386- or 486-based PC. One caveat: FPM is not as accurate as a coprocessor (or an emulation library) that uses an 80-bit word. Still, it is accurate enough for many programs.
When you do arithmetic by hand, you don't work directly with floating-point numbers. Instead, you calculate an integer answer and decide where to put the decimal point. To add, for example, you align the two numbers' decimal points, then add. The answer's decimal point remains in the same position. When multiplying by hand, you treat the numbers as integers, and place the answer's decimal point by counting the number of digits to the right of the decimal point in the two multiplicands. FPM uses similar procedures on binary numbers.
Using 32-bit integer arithmetic, the 386 can easily perform real-number math with reasonable accuracy. Since FPM uses binary numbers, there is no "decimal" point. To avoid confusion, I'll refer to the point in a FPM number as the "radix point."
Figure 1 shows how FPM stores floating-point numbers. The val field stores a 32-bit binary number. The scale field determines the location of the assumed radix point. The sign field is 0 for positive numbers, or 0xFF for negative numbers.
FPM uses two routines to adjust numbers. RADJ shifts numbers right until bit 0 is a 1, and LADJ shifts them left until the leftmost bit is a 1. These routines update the scale field by the amount of the shift so that the number's value doesn't change--just its representation.
Before adding two numbers, FPM must line up their radix points using the algorithm in Figure 2(a). FPM uses RADJ on both numbers to allow for as many significant digits as possible; see Figure 3. If the scales are then equal, the addition proceeds. If they are not, FPM adjusts the number with the smallest scale to the left until the scales are equal, or until a left shift would cause a 1 bit to fall off the end of the word. If the scales are still not equal, FPM will right shift the number with the largest scale. While some bits on the right may be lost, these bits have less significance than the bits on the left.
(a)
if (x.value==0) return y
if (y.value==0) return x
right_adjust(y)
right_adjust(x)
if x.scale!=y.scale then
A=number with larger scale
B=number with smaller scale
sdiff=A.scale-B.scale
idx=left_bit_index(B) /* find index of leftmost '1' bit */
if 31-idx>=sdiff then
B.value=B.value<<sdiff
B.scale+=sdiff
else
B.value=B.value<<31-idx
B.scale+=31-idx
A.value=A.value>>sdiff-(31-idx)
A.scale-=sdiff-(31-idx)
endif
endif
if bit 31 of A.value or B.value is set then
A.value=A.value>>1
B.value=B.value>>1
A.scale--
B.scale--
endif
if A.sign==0xff A.value=-A.value
if B.sign==0xff B.value=-B.value
result.value=A.value+B.value
if result.value<0 then
result.value=-result.value
result.sign=0xff
endif
result.scale=A.scale
(b)
if y.value!=0 y=-y
jump to add routine
(c)
right_adjust(x)
right_adjust(y)
if x.value==0||y.value==0 return 0
product=x.value*y.value /* product is 64 bit */
shift product left until top 32 bits are all zero
x.scale=x.scale-amount_shifted
if bit 31 of product set then
product=product>>1
x.scale--
endif
result.value=product
result.scale=x.scale+y.scale
result.sign=x.sign^y.sign /* exclusive-or */
(d)
left_adjust(x)
right_adjust(y)
if x.value==0 return 0
if y.value!=1 then
idx=left_bit_index(y) /* find index of leftmost '1' bit */
divx=x.value /* divx is 64 bit */
divx=divx<<idx
x.scale+=idx
result.value=divx/y.value
else
result.value=x
endif
result.scale=x.scale-y.scale
result.sign=x.sign^y.sign /* exclusive-or */
Problem: Add 10110.100 and 10010.000
(both binary) using 8 bits.
With no adjustment:
10110.100
+ 10010.000
___________
101000.100
Overflow!
With right adjustment:
0010110.1
+ 0010010.0
___________
0101000.1
Subtraction is simple; see Figure 2(b). FPM inverts the sign of the second argument, calls the addition routine, and restores the second argument's original sign.
The 386's multiply instruction can directly act on the two numbers in the val fields. The scale of the result is the sum of the two original scales. Figure 2(c) outlines the multiplication algorithm. The MUL instruction places a 64-bit result in the EDX:EAX register pair. FPM shifts any significant bits out of EDX to the EAX register (and adjusts the answer's scale, of course). The DIV instruction divides a 64-bit number (in EDX:EAX) by a 32-bit number. To maximize precision, FPM shifts the divisor to the right until its bit 0 is a 1 (adjusting the scale, of course). If the divisor is equal to 1, the division is complete. Otherwise, FPM shifts the 64-bit dividend left by the number of significant bits in the divisor. This assures a 32-bit result with the maximum number of significant digits.
Next, the DIV instruction divides the two numbers. FPM then subtracts the divisor's scale from the dividend's scale to locate the result's radix point. Figure 2(d) illustrates this process.
Listing One (page 76) contains FPM.ASM, the core functions for FPM. FPM.H (Listing Two, page 78) and FPMC.C (Listing Three, page 78) simplify using FPM from C programs.
The faddxy(), fsubxy(), fmulxy(), and fdivxy() functions provide the basic mathematical operations. To round out the library, two routines (fixtofloat() and floattofix()) perform conversions between FPM numbers and standard C floats.
Listing Four (FPMGRF.C, page 80) shows an example of FPM at work and pits FPM against the standard floating-point library. FPMGRF plots a line twice using a real-number coordinate system.
First, the standard library plots the line, then fpmline() plots the same line. Variations in color show where FPM and the math library compute different results. Table 1 contains the results of running FPMGRF.C on different machines.
Math Library FPM
---------------------------------
25-MHx 386
without 80387 26 7
25-MHz 386
with 80387 4 7
FPM takes full advantage of the 386 instruction set. Of course, the 32-bit register width is a significant advantage, but other 386 instructions help, too.
FPM often shifts words until their rightmost (or leftmost) bit is a 1 (see LADJ and RADJ in Listing One). A conventional DOS program might use the code in Example 1(a) to right-adjust EAX. The 386-bit scan instructions, BSF and BSR, can considerably speed up this operation. BSF starts at bit 0 and returns the index of the first 1 bit that it finds. BSR is similar to BSF, but it starts with bit 31 (for a 32-bit operand). If there are no 1 bits in the word, the bit-scan instructions set the zero flag (ZF). Example 1(b) shows how to rewrite the right-adjust routine using BSR and BSF. This avoids the expensive looping and jumping that the first method requires.
(a)
SLOOP: TEST EAX,1
JZ DONE
SHR EAX,1
JMP SLOOP
DONE:
(b)
BSF ECX,EAX
SHR EAX,ECX
FPM provides a C-language interface to the bit-scan instructions (bitscan() in Listing One). The prototype is: int bitscan(int dir, unsigned long lword);. If the dir flag is 0, bitscan() uses the BSF instruction. Otherwise, it uses BSR. The lword parameter is a 32-bit integer to scan. Unlike BSF and BSR, bitscan() numbers the bits from 1 (rightmost) to 32 (leftmost). If no bits are set in lword, bit-scan()returns 0.
FPM uses other 386 instructions in several places. For instance, Example 2 shows how the SHLD instruction shifts an operand left, filling the rightmost bits from its other operand and leaving 10FAH in EAX.
MOV EAX, 10H MOV EBX, 0FA000000H SHLD EAX, EBX,8
If you are a C++ programmer, you have probably already thought about writing a class wrapper for FPM. By using C++'s operator overloading, you could make FPM calls with conventional C operators. In fact, an FPM class is simple to construct. Unfortunately, the constructor overhead that C++ imposes degrades FPM's performance. I have experimented with several class wrappers for FPM--all ran more slowly than the normal C float type. If FPM isn't faster than the math library, there is little point in using it.
FPM can provide your programs with a fast alternative for real-number arithmetic. It also shows that you don't have to run in protected mode to benefit from the power of a 386 or 486 CPU.
_32-BIT FLOATING-POINT MATH_ by Al Williams[LISTING ONE]
;##########################################################
;# File: FPM.ASM #
;# 386 Floating point package by Al Williams #
;##########################################################
.MODEL SMALL,C
; enable 386 instructions
.386
.DATA
; Number format
FPM_NUM STRUC
SIGN DB ? ; 0 for + FF for -
SCALE DB ? ; position of decimal point
NUM DD ? ; unsigned magnitude
FPM_NUM ENDS
.CODE
; C routines
PUBLIC faddxy,fsubxy,fmulxy,fdivxy,bitscan
;************************************************
; Helper for C routines bitscan(dir,lword)
; if dir==0 do BSF on num
; if dir!=0 do BSR on num
; Returns index 1-32 or 0 if word was zero
bitscan PROC
ARG DIR:WORD, LWORD:DWORD
MOV AX,DIR
OR AX,AX
MOV EAX,LWORD
JZ SHORT SCANFWD
BSR EAX,EAX
JMP SHORT BITSC0
SCANFWD: BSF EAX,EAX
BITSC0: JNZ SHORT BITSC1
XOR AX,AX
RET
BITSC1: INC AX
RET
bitscan ENDP
;************************************************
; Adjust temp register @BX so that rightmost bit is 1
RADJ PROC
; Find last set bit
BSF ECX,[BX].NUM
; If all zeros, jump to ZOUT
JZ SHORT ZOUT
; Shift right and adjust scale
SHR [BX].NUM,CL
SUB [BX].SCALE,CL
RET
ZOUT:
MOV [BX].SCALE,0
RET
RADJ ENDP
;************************************************
; Adjust temp register @BX so leftmost bit is 1
LADJ PROC
; Find first bit set
BSR ECX,[BX].NUM
; If all zero's goto zout (above in proc RADJ)
JZ ZOUT
; 31-Bit_index -> # places to shift
MOV CH,31
SUB CH,CL
MOV CL,CH
; Shift number and adjust scale
SHL [BX].NUM,CL
ADD [BX].SCALE,CL
RET
LADJ ENDP
;************************************************
; Add x+y->x
faddxy PROC USES SI
ARG ANSOFF:WORD,ANSSEG:WORD,XARG:FPM_NUM,\
YARG:FPM_NUM
; Check for zero addition (x=0)
MOV EAX,XARG.NUM
OR EAX,EAX
JNZ SHORT YZA
MOV EAX,YARG.NUM
MOV CH,YARG.SIGN
MOV CL,YARG.SCALE
JMP DOADD3
; Check for y=0
YZA: MOV EAX,YARG.NUM
OR EAX,EAX
JNZ SHORT NZA
MOV EAX,XARG.NUM
MOV CH,XARG.SIGN
MOV CL,XARG.SCALE
JMP DOADD3
NZA:
LEA BX,YARG
CALL RADJ
LEA BX,XARG
CALL RADJ
; if eq then ok
MOV CL,XARG.SCALE
MOV CH,YARG.SCALE
CMP CL,CH
JE SHORT DOADD
; if xscale>yscale ...
LEA SI,YARG
JG SHORT PLUSADJY
XCHG CH,CL
XCHG BX,SI
PLUSADJY:
; amt=xscale-yscale
SUB CL,CH
; find top bit in y
BSR EAX,[SI].NUM
MOV CH,31
SUB CH,AL
CMP CH,CL
; if 31-topbit>=amt shift y left by amt; yscale+=amt
JL SHORT PLUSELSE
SHL [SI].NUM,CL
ADD [SI].SCALE,CL
JMP SHORT DOADD
PLUSELSE:
; else ...
; shift y left by 31-topbit; yscale+=31-topbit
XCHG CH,CL
SHL [SI].NUM,CL
ADD [SI].SCALE,CL
XCHG CH,CL
SUB CL,CH
; shift x right by amt-(31-topbit); xscale-amt-(31-topbit)
SHR [BX].NUM,CL
SUB [BX].SCALE,CL
DOADD:
; Make certain bit 31 of each number is off and keep scales equal
MOV EAX,XARG.NUM
MOV EBX,YARG.NUM
MOV CL,XARG.SCALE
MOV EDX,EAX
OR EDX,EBX
JNS SHORT SETSIGN
SHR EAX,1
SHR EBX,1
DEC CL
SETSIGN:
XOR DH,DH ; Set positive flag
MOV DL,XARG.SIGN
OR DL,DL
JZ SHORT DOADD1
NEG EAX
INC DH
DOADD1: MOV DL,YARG.SIGN
OR DL,DL
JZ SHORT DOADD2
NEG EBX
INC DH
DOADD2:
; Assume positive sign in CH
XOR CH,CH
ADD EAX,EBX
JNS SHORT DOADD3
OR DH,DH
JZ SHORT DOADD3 ; Large positive number
NEG EAX
NOT CH
DOADD3:
PUSH DS
LDS BX,DWORD PTR ANSOFF
; Store sign, scale, and value
MOV [BX].SIGN,CH
MOV [BX].SCALE,CL
MOV [BX].NUM,EAX
POP DX
RET
faddxy ENDP
;************************************************
fsubxy PROC
PUSH BP
MOV BP,SP
; If X-0 then compute X+0
MOV EAX,[BP+14].NUM
OR EAX,EAX
JZ SHORT SUBBY0
; else compute X+-1*Y
NOT [BP+14].SIGN
SUBBY0:
POP BP
; Let + do the work
JMP faddxy
fsubxy ENDP
;************************************************
fmulxy PROC
ARG ANSOFF:WORD,ANSSEG:WORD,XARG:FPM_NUM,\
YARG:FPM_NUM
; Right adjust X
LEA BX,XARG
CALL RADJ
; Right adjust Y
LEA BX,YARG
CALL RADJ
; Check for X*0 or 0*Y
MOV EAX,XARG.NUM
OR EAX,EAX
JZ SHORT RETZM
MOV EBX,YARG.NUM
OR EBX,EBX
JZ SHORT RETZM
; Do it
MUL EBX
TSTZLP:
; Find bits in EDX (overflow of 32 bits)
BSR ECX,EDX
; If none, goto DXZ
JZ SHORT DXZ
; Shift bits to EAX and adjust scale
INC CL
SHRD EAX,EDX,CL
SUB XARG.SCALE,CL
DXZ:
OR EAX,EAX
JNS SHORT NOOF
; keep answer positive
SHR EAX,1
DEC XARG.SCALE
NOOF:
; store answer
PUSH ES
LES BX,DWORD PTR ANSOFF
MOV ES:[BX].NUM,EAX
; compute new scale
MOV AL,YARG.SCALE
ADD AL,XARG.SCALE
MOV ES:[BX].SCALE,AL
; compute new sign
MOV AL,YARG.SIGN
XOR AL,XARG.SIGN
MOV ES:[BX].SIGN,AL
POP ES
RET
; Jump here to return a zero
; shared by MUL & DIV
RETZM: PUSH ES
LES BX,DWORD PTR ANSOFF
XOR EAX,EAX
MOV ES:[BX].NUM,EAX
MOV ES:[BX].SCALE,AL
MOV ES:[BX].SIGN,AL
POP ES
RET
fmulxy ENDP
;************************************************
fdivxy PROC
ARG ANSOFF:WORD,ANSSEG:WORD,XARG:FPM_NUM,\
YARG:FPM_NUM
; LEFT ADJUST X
LEA BX,XARG
CALL LADJ
; RIGHT ADJUST Y
LEA BX,YARG
CALL RADJ
; CHECK FOR DIVIDE BY ZERO
MOV EAX,XARG.NUM
OR EAX,EAX
JZ RETZM
; CHECK FOR DIVIDE BY 1
MOV EBX,YARG.NUM
CMP EBX,1
; DON'T DIVIDE BY 1 -- BESIDES IF WE ROTATE EDX:EAX LEFT,
; / BY 1 WOULD RESULT IN OVERFLOW
JZ SHORT DODIV
; SHIFT X BITS INTO EDX BASED ON SIZE OF Y
XOR EDX,EDX
BSR ECX,EBX
SHLD EDX,EAX,CL
SHL EAX,CL
; ADJUST SCALE
ADD XARG.SCALE,CL
DIV EBX
DODIV:
PUSH ES
LES BX,DWORD PTR ANSOFF
MOV ES:[BX].NUM,EAX
; COMPUTE ANSWER'S SCALE
MOV AL,XARG.SCALE
SUB AL,YARG.SCALE
MOV ES:[BX].SCALE,AL
; FIND ANSWER'S SIGN
MOV AL,YARG.SIGN
XOR AL,XARG.SIGN
MOV ES:[BX].SIGN,AL
POP ES
RET
fdivxy ENDP
END
[LISTING TWO]
/***************************************************
* FPM.H - C language header for FPM *
***************************************************/
#ifndef FPMHEADER
/* fixed point type */
typedef struct
{
char sign;
char scale;
unsigned long num;
} FIXED;
/* Scale factor
use - 1000.0 for 3 places, 100.0 for 2 places etc. */
#define SFACTOR (10000.0)
#define CVTTYPE float
#ifdef __cplusplus
extern "C"
{
#endif
/* Assembly core functions (from FPM.ASM) */
extern FIXED faddxy(FIXED,FIXED),fsubxy(FIXED,FIXED),
fmulxy(FIXED,FIXED),fdivxy(FIXED,FIXED);
extern int bitscan(int dir,unsigned long lword);
/* C helpers (from FPMC.C) */
FIXED floattofix(CVTTYPE f);
CVTTYPE fixtofloat(FIXED f);
int fixtoint(FIXED f);
#ifdef __BORLANDC__
/* this is the same */
extern FIXED faddxy (FIXED,FIXED),
FIXED fsubxy (FIXED,FIXED),
FIXED fmulxy (FIXED,FIXED),
FIXED fdivxy (FIXED,FIXED);
#else
/* Microsoft definitions */
FIXED fpmmsc_rv;
extern void faddxy (FIXED _far *, FIXED,FIXED),
fsubxy (FIXED _far *, FIXED,FIXED),
fmulxy (FIXED _far *, FIXED,FIXED),
fdivxy (FIXED _far *, FIXED,FIXED);
#define faddxy(a,b) (faddxy(&fpmmsc_rv,a,b),fpmmsc_rv)
#define fsubxy(a,b) (fsubxy(&fpmmsc_rv,a,b),fpmmsc_rv)
#define fmulxy(a,b) (fmulxy(&fpmmsc_rv,a,b),fpmmsc_rv)
#define fdivxy(a,b) (fdivxy(&fpmmsc_rv,a,b),fpmmsc_rv)
#endif
#ifdef __cplusplus
}
#endif
#endif
[LISTING THREE]
/***************************************************
* FPMC.C - C language routines for FPM *
***************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include "fpm.h"
/* convert fpm to float */
FIXED floattofix(CVTTYPE f)
{
FIXED ans;
unsigned long i,whole,frac=0L;
CVTTYPE fp=.5;
ans.sign=0;
/* special case */
if (f==0.0)
{
ans.scale=0;
ans.num=0L;
return ans;
}
if (f<0.0)
{
ans.sign=0xff;
f=-f;
}
whole=(unsigned long)f;
f-=whole;
/* scale is number bits in whole */
ans.scale=32-bitscan(1,(unsigned long)whole);
ans.num=(unsigned long)whole<<ans.scale;
if (ans.scale)
/* compute fractional part */
for (i=1L<<(ans.scale-1);i!=0&&f!=0.0;fp/=2)
{
if (f>=fp)
{
f-=fp;
frac|=i;
}
if (!frac&&!whole)
ans.scale++;
else
i>>=1;
}
ans.num|=frac;
return ans;
}
/* convert FPM number to float */
/* Uses SFACTOR for rounding (see FPM.H) */
CVTTYPE fixtofloat(FIXED f)
{
unsigned long i;
CVTTYPE fp=0.5,res,sg=1.0;
if (f.sign)
{
sg=-1.0;
}
if (f.scale<0)
res=(CVTTYPE)(f.num<<-f.scale);
else
{
res=(CVTTYPE)(f.num>>f.scale);
while (f.scale>>5)
{
f.scale--;
fp/=2.0;
}
for (i=1L<<f.scale-1;i!=0;i>>=1,fp/=2)
if (f.num&i) res+=fp;
}
return floor(sg*res*SFACTOR+.5)/SFACTOR;
}
/* convert FPM number to int (truncate fraction) */
int fixtoint(FIXED f)
{
int r;
if (f.scale>=0) r=f.num>>f.scale; else r=0;
if (f.sign) r=-r;
return r;
}
[LISTING FOUR]
/***************************************************
* FPMGRF.C - Graphical demo of FPM with benchmark *
***************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <dos.h>
#include <conio.h>
#include <time.h>
#include "fpm.h"
/* Macro to call BIOS */
#define INT10(r) int86(0x10,&r,&r)
/* Parameters for line (y=Mx+B) */
#define M .291 /* Slope */
#define B .08 /* y-intercept */
main(int argc,char *argv[])
{
clock_t s,e;
long fpmtime,floattime;
/* Set CGA mode */
setmode(4);
/* mark time */
s=clock();
/* draw line with math library */
floatline();
e=clock();
/* compute time */
floattime=e-s;
/* wait for a key */
printat(1,10,"Press any key to continue");
if (!getch()) getch();
printat(1,10," ");
/* mark time again */
s=clock();
/* draw line with FPM */
fpmline();
e=clock();
/* compute time */
fpmtime=e-s;
/* wait for a key */
printat(1,10,"Press any key to continue\n");
if (!getch()) getch();
setmode(3);
/* print results */
printf("Float time=%ld ticks\nFPM time=%ld ticks\n",floattime,fpmtime);
}
/* Set video mode */
setmode(int n)
{
union REGS r;
r.x.ax=n;
INT10(r);
return 0;
}
/* plot a floating point pixel */
floatplot(float x,float y)
{
int x1,y1;
x1=x*320;
y1=y*200;
plot(x1,y1,2);
return 0;
}
/* ploat a point with FPM numbers */
fpmplot(FIXED x,FIXED y)
{
static FIXED xscale;
static FIXED yscale;
static int first=0;
if (!first)
{
xscale=floattofix(320.0);
yscale=floattofix(200.0);
first=1;
}
plot(fixtoint(fmulxy(x,xscale)),fixtoint(fmulxy(y,yscale)),3+128);
}
/* plot a hardware pixel */
plot(int x,int y,int color)
{
union REGS r;
r.h.ah=0x0c;
r.h.al=color;
r.h.bh=0;
r.x.cx=x;
r.x.dx=y;
INT10(r);
return 0;
}
/* Floating point library line drawing routine */
floatline()
{
float x,y;
for (x=0.0;x<1.0;x+=.001)
{
y=M*x+B;
floatplot(x,y);
}
}
/* FPM line drawing routine */
fpmline()
{
FIXED x,y,m,b,incr;
m=floattofix(M);
b=floattofix(B);
x=floattofix(0.0);
incr=floattofix(.001);
while (fixtoint(x)==0)
{
y=faddxy(fmulxy(m,x),b);
fpmplot(x,y);
x=faddxy(x,incr);
}
}
/* print string via BIOS at x,y */
printat(int x,int y,char *s)
{
union REGS r,r1;
r.h.ah=2;
r.h.bh=0;
r.h.dh=y;
r.h.dl=x;
INT10(r);
r.h.ah=0xe;
r.h.bl=7;
while (r.h.al=*s++)
int86(0x10,&r,&r1);
}
Copyright © 1993, Dr. Dobb's Journal