January 1994/A Short Floating-Point Type in C++

Portability

A Short Floating-Point Type in C++

William Smith

William Smith is the engineering manager at Montana Software, a software development company specializing in custom applications for MS-DOS and Windows. You may contact him by mail at P.O. Box 663, Bozeman, MT 59771-0663.

Introduction
Even though a typical microcomputer can have up to ten times the memory of one just a few years ago, there are still programming problems where memory is a limiting factor. I frequently bump into memory limitations in embedded and data acquisition applications. Numerous times I have had to work with a large quantity of floating-point numbers in a confining space. A common situation is the acquisition of large amounts of data through a 14-bit (or smaller) A-to-D (analog-to-digital) converter.
Storing these numbers as 32-bit floats always seemed like overkill to me and a waste of space. This was especially annoying when I had to store tens of thousands of points in an array and would hit some kind of a memory limitation such as a segment boundary, physical memory limit, or even a file or disk size limit. The standard float type works, but it represents a poor match to the problem to be solved. Matching the floating-point size to what an application needs can result in significant memory savings in data-intensive programs.
I really only needed a 16-bit floating-point type instead of the native 32-bit float. At first, I played some games and stored the data as short int. But this forced me to convert the data to float to do anything useful with it. I wanted a short floating-point type. I even implemented one, albeit crudely, in C. C allowed me to do it, but the conversion process never was clean or transparent. With C++, I was finally able to do what I wanted. I was able to create a short floating-point type that I could use naturally in my applications. C++ can hide all the dirty work, such as conversions.
The new type, which I call sfloat, even allowed me to control range and precision. Some situations called for a floating-point type that ranged between 0 and 10.0 and maximized the precision within that range. Other situations required a larger signed range but less precision. Being able to tailor the characteristics of the type to meet an application's needs was a practical feature I built into sfloat.
I implemented the sfloat type in "Standard C++" (if there is such a beast). The code works with Microsoft C++ and Borland C++ under MS-DOS and MS-Windows. It has some dependencies on the size of the standard types float, unsigned short int, and long. It assumes that:

a float is 32 bits

an unsigned short int is 16 bits

a long is 32 bits
It also assumes that the float type is that defined by the IEEE standard for 32-bit floating-point values. Table 1 gives the IEEE details. As long as a compiler and operating system conform to these restrictions, the code for sfloat will probably work in other environments.
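These assumptions can be checked at compile time. The following is a minimal sketch in present-day C++ (static_assert and the <cstdint> header postdate this article), and the helper names are hypothetical rather than part of the listings. Note that long is 64 bits on many current 64-bit Unix compilers, so a port would likely substitute std::uint32_t wherever the code needs exactly 32 bits.

```cpp
#include <climits>
#include <cstdint>
#include <limits>

// Hypothetical helpers (not from the article's listings) that report whether
// the platform matches the assumptions sfloat relies on.
constexpr bool float_is_ieee_32bit() {
    return sizeof(float) * CHAR_BIT == 32 &&
           std::numeric_limits<float>::is_iec559;
}

constexpr bool ushort_is_16bit() {
    return sizeof(unsigned short) * CHAR_BIT == 16;
}

constexpr bool long_is_32bit() {
    return sizeof(long) * CHAR_BIT == 32;
}

// Fail the build early instead of producing silently wrong conversions.
static_assert(float_is_ieee_32bit(), "sfloat assumes a 32-bit IEEE float");
static_assert(ushort_is_16bit(), "sfloat assumes a 16-bit unsigned short");

// long is deliberately not asserted here: where long_is_32bit() is false,
// std::uint32_t is an exact 32-bit replacement.
static_assert(sizeof(std::uint32_t) * CHAR_BIT == 32,
              "need an exact 32-bit integer type");
```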

Implementation
Listing 1, sfloat.hpp, defines a C++ class called sfloat. The class has numerous private static members, one protected member and numerous public member functions. There are even some non-member functions prototyped in sfloat.hpp.
The static data provides a workspace for conversion between sfloat and float. This static data is class specific. All instances, or objects, of class sfloat share the same static data. The protected member s is the only object instance data. This member is unique to each instance of sfloat. In fact, the sizeof operator will report the size of sfloat to be the size of this member, 2 bytes.
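The effect of static members on object size can be seen in a stripped-down stand-in. This is only a sketch: the class name layout_demo and its member names are hypothetical, and the real declarations are in Listing 1.

```cpp
// Hypothetical stand-in for the shape of class sfloat: shared static
// workspace plus a single per-object data member.
class layout_demo {
    static float work;            // one copy shared by every object
    static unsigned short bits;   // likewise shared, not stored per object
protected:
    unsigned short s;             // the only per-object storage
public:
    layout_demo() : s(0) {}
};

// Static data members need exactly one definition outside the class.
float layout_demo::work = 0.0f;
unsigned short layout_demo::bits = 0;

// Static members add nothing to the size of each object.
static_assert(sizeof(layout_demo) == sizeof(unsigned short),
              "object size should equal the size of member s");
```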

Constructors
One of the most elemental functions for a C++ class is the constructor. A constructor has the same name as its class. Furthermore, you can overload the constructor to provide construction from (conversion from) different types. The sfloat class has three constructors.

sfloat();
sfloat(float f);
sfloat(sfloat& sf);
sfloat() defines the "default" construction of an sfloat object, such as on the stack. The compiler generates a default constructor automatically only when a class declares no constructors at all, so sfloat, which declares others, must specify this one. sfloat(float f) converts a floating-point number to an sfloat to initialize the stored value. sfloat(sfloat& sf) initializes the new object by making a copy of another sfloat object. These three constructors provide the functionality needed to support the following declarations using sfloat.

sfloat sf1;         // uses sfloat();
sfloat sf2 = 1.0f;  // uses sfloat(float f);
sfloat sf3 = sf2;   // uses sfloat(sfloat& sf);
These three types of construction and initialization cover the minimum required to use the sfloat type naturally. The code for the constructor functions resides in Listing 2, sfloat.inl. sfloat() and sfloat(sfloat& sf) are very simple. On the other hand, sfloat(float f) has to do a bit of work. It has to convert a float to an unsigned short and assign it to the object instance data member s.
The conversion process used in sfloat(float f) truncates the mantissa bits to a lower precision. It also lowers the range of the exponent by discarding higher-order bits. The conversion process utilizes some of the static data members of class sfloat as a work space and to hold intermediate values. The bitwise shift operators << and >> move the bits that will be kept from the float value into place before they are packed into an unsigned short.
Since none of the constructor functions allocate memory on the heap (free store) using new, there is no need to define a destructor function. C++ will provide a default destructor that does nothing.

Conversion to float
We also need a way to convert an sfloat object to a float. To use conventional notation, we need to define the operator function

sfloat::operator float()
Listing 2, sfloat.inl, contains the definition of this function. You will notice that its logic is just the reverse of sfloat::sfloat(float f). The shift operators once again move the bits of the sfloat into the proper locations in the 32 bits of a float. The extra bits are filled with zeros.
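The shifting and zero-filling described here can be illustrated without reproducing Listing 2. The sketch below is an illustration under stated assumptions, not the article's code: the names float_fields, split, join, and truncate_mantissa are hypothetical, it uses memcpy to reach the bits of a 32-bit IEEE float, and it leaves the exponent untouched so that only the mantissa truncation is shown.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// The three fields of a 32-bit IEEE float (1 sign, 8 exponent, 23 mantissa bits).
struct float_fields {
    std::uint32_t sign;
    std::uint32_t exponent;   // biased exponent
    std::uint32_t mantissa;   // fraction bits, without the implicit leading 1
};

// Pull the fields apart with shifts and masks.
inline float_fields split(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    float_fields out;
    out.sign = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.mantissa = bits & 0x7FFFFFu;
    return out;
}

// Shift the fields back into place to rebuild a float.
inline float join(const float_fields& in) {
    std::uint32_t bits = (in.sign << 31) | (in.exponent << 23) | in.mantissa;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Keep only the top mantBits of the mantissa, as a short type would, then
// shift them back so the discarded low-order bits are filled with zeros.
inline float truncate_mantissa(float f, unsigned mantBits) {
    assert(mantBits >= 1 && mantBits <= 23);
    float_fields parts = split(f);
    const unsigned drop = 23 - mantBits;
    const std::uint32_t kept = parts.mantissa >> drop;  // what would be stored
    parts.mantissa = kept << drop;                      // zero-fill on the way back
    return join(parts);
}
```

With seven mantissa bits, truncate_mantissa(3.14159f, 7) yields 3.140625: the sign and exponent survive, but everything past the seventh mantissa bit is lost.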

Overloaded Operators
Operator overloading is one of the features of C++ that allow you to use newly defined types just like the standard existing types. Operator overloading is not so much an object-oriented feature as a convenience. Table 2, an extract from Listing 1, lists the operator functions defined for sfloat. This list includes all the operators that one commonly uses on floating-point numbers. These operator functions allow you to use objects of the class sfloat just like you would a standard floating-point type.
Operator overloading is a fairly straightforward feature of C++ and is covered well elsewhere. I recommend the "Stepping Up To C++" series of articles on "Operator Overloading" by Dan Saks (see CUJ January, March, May, and July 1992). I took a very simple approach to implementing these operators. I convert to float, use the predefined operations, then convert back to sfloat. For example, here is the code for the add-assignment operator:
inline sfloat &sfloat::
              operator+=(sfloat sf)
     {
     float f = (float)*this;
     f += (float)sf;
     *this = (sfloat)f;
     return ( *this );
     }    // operator+=
This technique is not the most efficient (it has to do three type conversions), but it sure is simple. My needs for the sfloat type were data-size driven, not code-speed or code-size driven. Consequently I can live with the overhead of all those conversions. If you cannot, you could rewrite some of these routines to operate directly on the sfloat type.
I would like to emphasize that you can get trapped into inefficiency with operator overloading. If you are not careful, your operator overloading can force unneeded object construction and destruction, especially for the operators +, -, *, and /. One trick to avoid this is to use the corresponding assignment operators (such as +=) with a reference return type to define the other math operators. This technique results in the interesting side effect that the operators +, -, *, and / are neither member nor friend functions.
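A sketch of that trick follows. The class mini_sfloat is a hypothetical stand-in, not the article's sfloat: it stores only the upper 16 bits of a float and marks its conversions explicit (a feature of later C++) to keep overload resolution unambiguous. The point is the shape of operator+, which is an ordinary non-member function built on the member operator+=.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for sfloat using the default layout
// (sign bit, 8 exponent bits, 7 mantissa bits).
class mini_sfloat {
    std::uint16_t s;
public:
    mini_sfloat() : s(0) {}
    explicit mini_sfloat(float f) {
        std::uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);
        s = static_cast<std::uint16_t>(bits >> 16);   // keep the upper half
    }
    explicit operator float() const {
        std::uint32_t bits = static_cast<std::uint32_t>(s) << 16;
        float f;
        std::memcpy(&f, &bits, sizeof f);
        return f;
    }
    // Member operator: updates the object in place and returns a
    // reference to it rather than a new object.
    mini_sfloat& operator+=(mini_sfloat rhs) {
        *this = mini_sfloat(static_cast<float>(*this) + static_cast<float>(rhs));
        return *this;
    }
};

// Non-member, non-friend: it needs only the public operator+=.
// The left operand arrives by value, so it doubles as the result object.
inline mini_sfloat operator+(mini_sfloat lhs, mini_sfloat rhs) {
    lhs += rhs;
    return lhs;
}

// Usage: 1.5 + 0.25 is exactly representable in seven mantissa bits.
inline void operator_demo() {
    mini_sfloat sum = mini_sfloat(1.5f) + mini_sfloat(0.25f);
    assert(static_cast<float>(sum) == 1.75f);
}
```

Because operator+ touches only the public interface, it needs no access to the private representation, which is why it can live outside the class entirely.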

Inlining
In implementing the sfloat class, I chose to inline the overloaded operator functions and the conversion functions. Inlining a function means that its code gets inserted into your compiled program each time the function is called. This can cause your program to bloat in size unexpectedly. If you find this happening, I recommend you do not inline at least the two conversion functions sfloat::sfloat(float f) and sfloat::operator float(). Both are fairly long. But experiment first. To discontinue inlining for a function, remove the inline modifier from its function definition and move the definition from the file sfloat.inl (Listing 2) to sfloat.cpp (Listing 3).
Some of the operator functions are very short. You may wonder why I did not include their definitions with the class definition in the file sfloat.hpp. Instead I grouped all the inline functions in the file sfloat.inl. This is not quite standard, but I have to agree with Walter Bright, one of the C++ compiler pioneers. Inline function bodies appearing in the class body clutter the class definition (C++ Report October 1992).
Including inline functions with the class definition also violates the separation of the implementation of a class's member functions from the class definition. For maintenance purposes, it is a good technique to isolate the two. The class definition is the class interface and should change less than the member function implementation.

Controlling Range and Precision
The function sfloatrange, Listing 3 (sfloat.cpp), provides a way to adjust the range, signedness, and precision of the sfloat type:
friend void sfloatrange(
   unsigned short sfNumExpBits,
   unsigned short sfSigned);
The first parameter is the number of exponent bits. This can be any number from 1 to 8. The higher the number, the larger the range of values that sfloat can represent. Eight bits is the same as for the standard float type. Table 3 shows the maximum value that sfloat can represent for each of the possible numbers of exponent bits.
The second parameter determines whether or not sfloat is a signed value. If the value is signed, sfloat reserves one of its bits as a sign bit. The number of mantissa bits is the remaining bits out of 16 not used by the exponent or the sign. That number can range from a minimum of seven to a maximum of 15:

The minimum of 7 results from specifying eight exponent bits and designating sfloat as signed.

The maximum of 15 results from specifying one exponent bit and designating sfloat as unsigned.
Table 4 lists all the possible numbers of mantissa bits and the corresponding (minimum) number of significant decimal digits.
I have encountered a requirement to have an unsigned floating-point representation that needs only four significant digits and a range of four orders of magnitude (0 to 10^4). A combination of 11 mantissa bits, five exponent bits and no sign bit worked fine.
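The bit budget behind these combinations is simple arithmetic. As a sketch (the function name mantissa_bits is hypothetical and not part of sfloatrange):

```cpp
#include <cassert>

// Mantissa bits left over in a 16-bit sfloat after the exponent
// and the optional sign bit have been allocated.
inline unsigned mantissa_bits(unsigned expBits, bool isSigned) {
    assert(expBits >= 1 && expBits <= 8);   // range of exponent bits the article allows
    return 16u - expBits - (isSigned ? 1u : 0u);
}
```

mantissa_bits(8, true) gives 7, mantissa_bits(1, false) gives 15, and the unsigned case with five exponent bits gives mantissa_bits(5, false), or 11.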
The defaults, if you do not call sfloatrange, are eight exponent bits, a sign bit, and seven mantissa bits. This yields the same range as the standard float, but with much less precision. These values make the conversions between sfloat and float particularly easy. You can just use a union of a float and two unsigned shorts. To convert from a float, just store in the float member of the union and extract the second unsigned short. To convert from an sfloat, you reverse the process. Notice that the conversion functions do this for the special default range situation.
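For the default layout, the conversion amounts to keeping or restoring the upper half of the float. The sketch below assumes a 32-bit IEEE float; the function names are hypothetical, and it uses memcpy and a shift where the article describes a union, so that the result does not depend on which unsigned short of the union holds the high-order bits on a given machine.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// float -> 16 bits: keep the sign, the 8 exponent bits,
// and the top 7 mantissa bits.
inline std::uint16_t pack_default(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return static_cast<std::uint16_t>(bits >> 16);
}

// 16 bits -> float: restore the upper half; the low 16 mantissa
// bits become zero.
inline float unpack_default(std::uint16_t s) {
    std::uint32_t bits = static_cast<std::uint32_t>(s) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Usage: values needing at most 7 mantissa bits survive the round trip exactly.
inline void default_layout_demo() {
    assert(pack_default(1.0f) == 0x3F80);
    assert(unpack_default(pack_default(1.5f)) == 1.5f);
}
```

A value that needs more precision is truncated: 3.14159f comes back from the round trip as 3.140625.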
There are limitations with range setting. The sfloat class uses static data to preserve the range information. This prevents you from tailoring the range individually for each instance (object) of the class. In other words, once you set the range, all sfloat objects have the same range. You could have each instance retain information about the number of exponent, mantissa, and sign bits, but this would require each object to store information about range and defeat the desire to save space. Use of static data also helps to speed up conversions.
Use of static data to store the range information has repercussions in multitasking or multithreaded environments. Static data prevents the code for sfloat from being re-entrant. You cannot preserve different range information between tasks if the tasks are sharing the same code such as a Windows DLL (Dynamic Link Library).
To keep sfloat small and make it re-entrant would require eliminating the size adjustability. This would force you to create a different class for each of the different range combinations used. Some real time or multitasking situations may demand you eliminate the range adjustability.
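One way to get a separate class per range combination without writing each by hand is a class template, a feature not assumed by the original listings. The sketch below is hypothetical (the name fixed_sfloat and its members are illustrative only) and shows just the layout bookkeeping, not the conversions: the range lives in the type, so there is no shared static state to protect and each object is still two bytes.

```cpp
#include <cstdint>

// Hypothetical: the layout is fixed per type via template parameters,
// so no mutable static range data is shared between tasks.
template <unsigned ExpBits, bool Signed>
class fixed_sfloat {
    static_assert(ExpBits >= 1 && ExpBits <= 8, "exponent bits must be 1..8");
    std::uint16_t s;   // the only per-object storage
public:
    static constexpr unsigned exponent_bits = ExpBits;
    static constexpr unsigned sign_bits = Signed ? 1u : 0u;
    static constexpr unsigned mantissa_bits = 16u - ExpBits - sign_bits;

    fixed_sfloat() : s(0) {}
    std::uint16_t raw() const { return s; }
};

// Two unrelated types with different ranges can coexist in one program.
typedef fixed_sfloat<5, false> unsigned_5exp;   // 11 mantissa bits
typedef fixed_sfloat<8, true> signed_8exp;      // 7 mantissa bits

static_assert(sizeof(unsigned_5exp) == 2 && sizeof(signed_8exp) == 2,
              "each object is still two bytes");
```

The cost is the one the text names: values of different layouts are now different types, so mixing them requires an explicit conversion through float.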

Conclusions
Some of the basic features of C++ make the solution to specific problems elegant and easy compared to C. I have presented a short floating-point type sfloat that utilizes operator overloading for notational convenience. You can easily integrate this new type into your C++ applications. The sfloat type is a 16-bit (two-byte) floating-point representation that you can use instead of the standard four-byte float.
The sfloat type has appeal in applications that need only 16 bits for a floating-point type and require the storage of large amounts of data. If you have particular requirements on range, precision, and signedness, you can tailor this type to best match your needs. In this way, you can get as many as five significant decimal digits (only one less than the standard float) in the range 0.0 to 2.0. You can also trade precision for range to get the same range as the standard float but with only three significant digits.