February 1994/Coding with Printable Characters

Algorithms

Coding with Printable Characters

Thad Smith III

Thad Smith III is a consultant developing software for embedded measurement, control, and communication applications. He is current moderator of the Fidonet C Echo and chairman of the local chapter for ACM. He can be reached at (303)449-3322 or thadsmith@acm.org.

Introduction
I recently started exploring efficient ways to represent a binary stream with printable ASCII characters. While high efficiency is not needed for most applications, I found it a good exercise in exploring coding techniques. There are 95 printable ASCII characters, including space, which can be used, but I chose to exclude spaces, since spaces at the beginning or end of a line might get lost (or added!). That left 94 characters.
Ninety-four characters could easily represent six bits (64 values), but seemed wasteful. What could I represent with two characters? That would give 94*94 = 8,836 combinations, which would nicely hold 13 bits (2¹³ = 8,192) — a good fit. Then I asked the opposite question: how large a character set was needed to represent 13 bits in two characters? That would be the square root of 8,192, which is just over 90.5. I chose the 91 consecutive ASCII characters from ! through {. Listing 1 shows how the 13-bit values were coded into two characters. Listing 2 shows how the characters were translated back to the original 13 bits.
The harder part was grouping the input into 13-bit groups when encoding, then distributing the 13 bits into eight-bit bytes on output when decoding. The efficiency for this conversion is 13/16, since 13 bits are coded into 16 bits (assuming eight-bit characters), not counting the overhead of block start, block stop, and newlines. Later, someone suggested the challenge of writing an efficient encoder program. I knew that the 13/16 conversion program was fairly efficient, but wondered if I could do better. I could.
The most efficient implementation would use all of the combinations of the character set, which I restricted to the 94 graphic characters. The amount of information in one such character is log2(94) = ln(94)/ln(2) = 6.55458+ bits. I was using 8*13/16 = 6.5 bits per encoded character. Using the notation of input bits per output bits, the efficiency limit for 94 characters is log₂(94)/8 = 0.8193236+.
I decided to use continued fractions to help find a higher encoding ratio. Continued fractions are a tool which mathematically suggests an efficient rational approximation of any real value. Real values are expressed by a series of compounded fractions, as shown by the example in Figure 1. By truncating the series at any point, we get a rational approximation to the exact value, with more terms offering better approximations. In the case of p, we get the series 3/1, 22/7, 333/106, and 355/113 as the first four approximations, alternately above and below the true value.
The expansion of log₂(94)/8, our ideal encoding ratio, is shown in Figure 2. It expands to the series 0, 1, 4/5, 5/6, 9/11, 59/72, 68/83, ... We must chose a ratio from the "low" series 0, 4/5, 9/11, 68/83. I chose to use the ratio 9/11 = .81818..., since it was very close to the theoretical value, yet had a chance of reasonable implementation.

Converting the Data
With a 9/11 encoder, nine eight-bit bytes are converted to an eleven-character representation. While that could be done in a straightforward manner, using multi-precision arithmetic to work with a 72-bit number, it would require a fair amount of code and processing time.
I chose a method for which most of the input is converted in 32-bit chunks. Figure 3 shows a diagram of the way that the bits are redistributed on the way from nine bytes to eleven characters. A 32-bit value (from four consecutive input bytes) is broken into two parts: one part is converted into a four-character value (maximum of 94⁴); the second part, which I call a prefix, is in the range 0 to 55 (2³²/94⁴ = 55.01+). Doing this twice converts eight bytes to eight characters plus two values in the range 0 to 55.
After the two four-byte blocks are converted into eight characters, there are three remaining values to encode: two prefixes, in the range 0 to 55, and the ninth byte, in the range 0 to 255. Combining them into a single number yields 56*56*256 = 802,816 values, which is less than 94³ = 830,584, allowing them to be converted into three more characters, making eleven altogether. Whew!

Faster Math
Initially, the 32-bit value was separated into the prefix value and four-character encoded value by dividing by 94⁴ = 78,074,896, where the quotient was the prefix and the remainder was the value to be encoded into four characters.
I added a refinement by changing this divisor slightly. Keeping the maximum prefix at 55, I could lower the maximum value needed to be encoded into the four characters. Doing this, I chose a divisor of 1,171 * 2¹⁶ (76,742,656), with the advantage that the encoder's 32-bit division by 94⁴, which is slow on a 16-bit computer, was replaced by a 16-bit division.
The code in Listing 3 shows the calculation to encode the nine input bytes into the eleven output characters. The calculations are done slightly out of order, producing the two blocks of four characters first, even though they are last in the output buffer (I wanted to keep the output in the same significance order as the input).
Inside the for loop, the 32-bit variable block is first assigned to a value from four input bytes. It is then divided into the prefix value q and a remainder. The remainder is then further broken into two-character groups, which are in turn broken into base-94 digits which are translated to characters in the ASCII range ! through ~. The two values of q in the loop are combined into qb by a base-56 to binary conversion, resulting in a value 0 to 3,135.
After the loop, qb is combined with the ninth byte (from the first position) via shift and OR, then that result is divided into three output characters. Since the maximum value of a block on the last calculation is 802,815, it can be represented by two base-94 characters plus one base-91 character. The short range character is placed first in the 11-character block.

Encoding and Decoding
Listing 7 contains the function decode_11_to_9, which shows the code to convert from the eleven-character format, dubbed BAZ911, to the nine original bytes. It reverses the process of the encoding procedure by converting the first three characters into the two prefix codes and the first byte, then converts the two blocks of 4 characters, plus associated prefix, into 4 bytes each.
Routines ebaz_data and dbaz_data use the low-level block encode and decode routines to encode and decode a data stream of arbitrary length. They are called with successive blocks of data, of any size, which are combined and converted, with the output being passed off to the output function specified in the initialization call. This technique allows the stream encoding and decoding to be embedded into various applications.
These functions also form the encoded output into fixed-length lines (if desired), provide/detect a flag for the end of data, and generate/check a CRC for data integrity.
Once a conversion has started, some way is needed to designate the end of conversion. EBAZ. C uses an invalid leading character, }, as a flag to designate end of conversion. It is followed by a character whose encoded value is the number of following bytes, from 0 to 14. The actual number of encoded bytes following is two greater, since the 16-bit CRC is included.
To make the decoding process easier, the minimum size of encoded data is eleven characters, allowing a full block of eleven characters of encoded data to be assembled before testing for end of data. The encoded output is made just long enough to carry the needed bytes only data streams of four bytes or less need to be padded with unused information. By retaining at least five bytes for use of the next ebaz_data call, the final ensemble, starting with }, will be at least eleven characters long, thereby simplifying the decoding process.

The Code
Listing 4 shows BAZ.H, the header for the encoding and decoding functions. Listing 5 contains BAZCOM.H, a private header for DBAZ.C and EBAZ.C. Listing 6 shows EBAZ.C, which contains the stream-level and low-level block encoding. Listing 7 shows DBAZ.C, which contain the stream-level and low-level block decoding. Listing 8 and Listing 9 show CRC16.H and CRC16G.C, respectively, which implement CRC calculations. Finally, Listing 10 and Listing 11 show BAZ.C and ZAB.C, respectively, which demonstrate the use of EBAZ and DBAZ to encode and decode files. The programs should work in any Standard C environment using the ASCII character set and eight-bit characters.
I donate the code to the public domain, so it can be incorporated into your software, including commercial software, without need for licensing.

Sidebar: "Some Printable Encoding Background"