Dr. Dobb's Journal April 1997
There are two problems with compressing digital data -- modeling and entropy coding. Whatever the given data represents in the real world, it exists in digital form as a sequence of symbols, such as bits. The modeling problem is to choose a suitable symbolic representation for the data and to predict for each symbol of the representation the probability of the occurrence of each allowable value for that symbol. The entropy-coding problem is to code each symbol as compactly as possible. (In the realm of lossy compression, there is a third problem -- evaluating the relative importance of various kinds of errors.)
For example, suppose you want to transmit messages composed of the letters a, b, c, and d. A straightforward scheme for coding these messages in bits would be to represent a by 00, b by 01, c by 10, and d by 11. However, suppose that for any letter of the message (independent of all other letters), a occurs with probability 0.5, b with 0.25, and c or d with 0.125 each. You then might choose a shorter representation for a, at the cost of accepting longer representations for the other letters. You could represent a by 0, b by 10, c by 110, and d by 111. This representation is more compact on average; indeed, it is the most compact representation possible (though not uniquely so). In this example, modeling involves determining the probabilities for each possible symbol value; entropy coding involves determining the representations in bits from those probabilities. In short, the probabilities associated with the symbol values play a fundamental role in entropy coding.
Entropy coding is an abstract problem only weakly related to the type of data being compressed, while the modeling side of data compression depends intimately on the type of data. Entropy coding is well understood, while for many real-world types of data, the modeling problem remains mysterious. In this article, I'll address entropy coding and the most rudimentary aspect of modeling. The algorithm I describe here, which I call the ELS-coder (short for Entropy Logarithmic-Scale), allows rapid encoding and decoding. Though the ELS-coder has a pending patent, Pegasus Imaging licenses it on a royalty-free basis. (For more information, send e-mail to sarmstrong@jpg.com.)
The ELS-coder works only with a binary alphabet {0,1}. You can encode symbols from larger alphabets, but they must first be converted to a binary format. The programs COMPRESS and EXPAND (provided electronically; see "Availability," page 3) illustrate how this is done. This conversion is a disadvantage, but a binary alphabet facilitates rapid coding and rapid probability estimation.
The ELS-coder was developed for an application that would generally compress a dataset once and then decompress it many times, so decoding speed was more important than encoding speed. We therefore designed the decoder first, and the encoder to fit.
The decoder in Listing One does not provide any compression -- it simply extracts bits from a file (and not in the most efficient manner, either). The most-significant bits of each byte in the file are extracted first. It is presented here as an illustration of a decoder's operation. (I assume that a char has 8 bits of precision, a short 16, and a long 32.)
For the decoder to hold data, it must be furnished with several states. It is convenient to represent the state of the decoder by two components: b, indicating the quantity of data (in bits) held by the decoder; and x, indicating the content of that data. When called upon to decode a symbol (the procedure decode_bit()), the decoder is in one of several possible states. One subset of these states represents the symbol value 0, and another the symbol value 1. The decoder determines to which subset its state belongs (by comparing x to Threshold[b]). It then transitions to a new state to reflect the depletion of its store of data (by decrementing b and, if necessary, decreasing x). At times (when b reaches the value 0), the decoder must replenish its store of data by reading from an external file (decode_import()). This import operation modifies both x and b.
Enhancements to this example produce the ELS decoding algorithm. In Listing One, A[b] gives the number of allowable states of a system holding b bits of data. Thus A[b] = 2^b, and there are A[b] allowable values for x at any time. In arithmetic coding, this relationship between the number of bits of data in a system and the number of allowable states remains valid even when b is not an integer.
I apportioned the allowable values corresponding to 0 and 1, respectively, so that all of the former values were less than all of the latter. Thus you can determine to which subset of values the value of x belongs by comparing it to Threshold[b], which is the minimum value of x corresponding to 1. When the symbol value 0 is decoded, the number of allowable values for x necessarily becomes the value of Threshold[b] before decoding the symbol.
Listing Two is a full-fledged ELS decoder, although at a crude level of accuracy. (The procedures for initializing and terminating the decoding process are available electronically.)
There are differences between Listings One and Two. First, to compress symbols from a binary alphabet, you work with quantities of data smaller than a bit. You therefore define a "jot" to be a unit of data equal to 1/F of a byte, F being an integer larger than 8. In Listing Two, F is given the value 15; in practice, F normally takes larger values. Listing Two measures data in jots rather than bits. For example, A[j] now gives the number of allowable values for x for a given number of jots. This is determined by the same relation as previously: A[j] takes the value 2^(8j/F) with appropriate rounding. For example, 23 jots is equal to 23/15 bytes, or 8 x 23/15 bits. The number of corresponding allowable values is 2^(8 x 23/15), or about 4927.59. I round this to give a value of 4928 for A[23].
Second, decoding must now take into account the probability associated with each symbol value. Thus decode_bit() now has a parameter rung (whose relation to the probability will be made explicit later). This is used as an index into a table Ladder[] of structures with three elements: c0 and c1 (indicating the number of jots required to code 0 and 1, respectively), and Threshold, which (as before) is the lower bound for allowable values of x corresponding to the symbol value 1.
Third, unlike b in Listing One, j is not decreased by a single predictable quantity for each symbol decoded. Thus, you can't depend on j hitting the value 0 exactly as it is decreased. Therefore, I have expanded x from one byte to two; I call the higher-order byte the "working" byte and the lower-order byte the "reserve." Keep the reserve byte completely filled with data and the working byte at least partially filled with data -- that is, maintain at least 256 allowable values for x. I let j indicate the amount of data in the working byte of x, with negative values of j indicating that the reserve byte is at least partially depleted. Meaningful values for j range from -F when the decoder is completely empty of data (though in actual operation the minimum value attained by j is 1-F), to F when the decoder is completely full. The decoder calls decode_import() when the value of j dips to zero or lower. Moreover, j is never decreased by more than F in a single call to decode_bit(), lest you exhaust the reserve byte.
Fourth, importing a byte is complicated by the expansion of x to two bytes. It involves a shift and OR rather than a simple assignment (as in Listing One).
Consider as an example a call to decode_bit() with a value of 0 for rung, the value of j being 3. This indicates that x contains F+3=18 meaningful jots. Since A[18] = 776, the value of x must be one of 0,1,...,775. If the decoded symbol has the value 0, then j will decrease to 2 by subtracting Ladder[0].c0, and x will then contain F+2=17 meaningful jots. If the decoded symbol has the value 1, then j will decrease to -1 by subtracting Ladder[0].c1, and x will then contain F-1=14 meaningful jots. Since A[17] = 536 and A[14] = 177, out of the 776 allowable values for x, the 536 values 0,1,...,535 represent 0 and the 177 values 536,537,...,712 represent 1. Suppose the value of x is 600. The first step in decoding the symbol is to compare 600 to 536 (the value of Ladder[0].Threshold[3]) and see that the symbol has the value 1. Then subtract 4 from j, giving it the value -1. The allowable values for x are now 0,1,...,176; subtract 536 from x (making it 64) to bring it within this range.
Since -1 < 0, you have exhausted the data in the working byte of x; call decode_import() to import a byte from the external file. Suppose the next byte in the external file has the value 137. You update the value of x to (64 x 256)+137 = 16521 (actually accomplished by shifting and ORing) and update the value of j to -1+15 = 14. The number of allowable values for x is now A[14+15], or 45283.
Of the 776 allowable values for x, 536 represent 0 and 177 represent 1. The other 776-536-177 = 63 values are wasted. This is a defect of the decoder. Ideally, every allowable value should represent either 0 or 1, but the restriction of j and A to integer values makes such waste unavoidable, at least some of the time.
The definition of "allowable" states does not consider the future development of a state. Thus, some allowable states may not lead to allowable states in the future; such states are unusable in practice. The situation where such states exist is called a "data leak." In the presence of a data leak, there are possible states for the decoder that are not allowable. Another characteristic of a data leak is that some possible coded data streams are illegal or redundant.
Data leaks form one of two main sources of coding inefficiency in the ELS-coder, the other being inaccuracy in the proportion of values allocated to 0 and 1. However, for larger values of F, the inefficiency is quite modest. A good working value for F is 754; in this case, the data leak described causes a coding inefficiency of less than 0.008 bits per symbol.
A data leak exists when some allowable states of the decoder do not lead to allowable states of the decoder at a later time. Since these allowable states embody the data present in the decoder, a data leak implies that data is somehow dissipating into nothing. The situation is analogous to an engine failing to convert all the energy input into usable energy output. It is a virtue for an engine to convert a large portion of energy input to energy output. On the other hand, if you expect energy output to exceed energy input for any engine, you've made a fundamental error.
The analogous principle for the ELS-coder (or for any decoder viewed in the light of the paradigm) is that all allowable states of the decoder must be derivable from allowable states of the decoder at any earlier time. The name for a violation of this principle is "perpetual motion," which implies the existence of allowable states that are not possible. Data leaks and perpetual motion are dual evils; while the former is regrettable, the latter must be avoided at all costs.
For example, the components c0 and c1 for each entry of the array Ladder[] in Listing Two must satisfy the constraint that A[j-c0]+A[j-c1] <= A[j] for all values of j between 1 and F, inclusive; that is, of A[j] allowable values for x at any time, those representing 0 and those representing 1 can total no more than A[j].
The example decoding process illustrates a second data leak in the decoder, which may be less immediately apparent. This occurs while importing a byte from the external file. Immediately before importing the byte, x has 177 allowable values. Importing a byte makes the number of values available 256 x 177 = 45312. However, the number of allowable values after importing is specified as 45283; thus 29 allowable values have been lost. In constructing the table A[], avoid perpetual motion while importing a byte. Recall that A[j] takes the value 2^(8j/F) with "appropriate" rounding: "appropriate" means that A[j] takes the value 2^(8j/F) rounded to the nearest integer when F <= j < 2F, but for 0 <= j < F, to avoid perpetual motion, we calculate A[j] as (A[j+F] + 255)/256 (that is, rounded up).
The value of rung to be used in a call to decode_bit() is dictated by the probabilities that the symbol to be decoded will have value 0 or 1. Let p denote the probability that the symbol takes the value 1. For a given element of the array Ladder[], the expected number of jots used to code the symbol is c0 x (1-p) + c1 x p. For example, with a value of 0 for rung, the expected number of jots is (1-p)+4p = 1+3p; with a value of 1 for rung, the expected number of jots is 2(1-p)+2p = 2. For data compression, the smaller of these two values is preferable; you can solve the inequality 1+3p < 2 for p to find that the value 0 is preferable to 1 for rung if p < 1/3. With similar calculations, you find that 1 is the preferred value if 1/3 <= p <= 2/3 and 2 is the preferred value if 2/3 <= p. (At the boundary points between these intervals, you may choose the preferred value for either side.)
Incidentally, the theoretical optimum compression corresponds to c0 = -F log_256(1-p) and c1 = -F log_256(p); the corresponding expected number of jots, -((1-p) log_256(1-p) + p log_256(p))F, is known in data compression theory as the "entropy" of the symbol (usually measured in bits rather than jots).
All entries of the table Ladder[] are calculated so that the values of c0 and c1 satisfy the following "ladder" constraints:
1. A[j-c0]+A[j-c1] <= A[j] for any value of j between 0 and F-1, inclusive (to avoid perpetual motion).
2. c0 > 0 and c1 > 0 (so that each symbol value 0 or 1 corresponds to at least some allowable values).
3. c0 <= F and c1 <= F (to avoid running beyond the end of the reserve byte while decoding a symbol).
4. Subject to the above criteria, A[j]-(A[j-c0]+A[j-c1]) should be minimized (to minimize data leaks).
In all cases, Ladder[i].Threshold points to A + F - Ladder[i].c0, so that Ladder[i].Threshold[j] is A[F + j - Ladder[i].c0]. With F=15 there are three combinations of c0 and c1 satisfying all the aforementioned criteria; these yield the three entries of the table Ladder[] in Listing Two.
The ELS encoder uses knowledge of the decoder's inner workings to create a data stream that will manipulate the decoder into producing the desired sequence of decoded symbols. Listings Three and Four are procedures for encoding that are compatible with the decoding procedures of Listing Two. The tables A[] and Ladder[] and the macro F are identical to the decoder's.
The ELS-encoder operates by considering all possible coded data streams, gradually eliminating those inconsistent with the current state of the decoder ("current" and other adverbs of time refer to position in the data stream, rather than actual temporal time). For the decoder, the "allowable" values of the internal buffer form a reference point for discussion; for the encoder, you are concerned with the set of values for the coded data stream consistent with the current set of allowable values in the decoder.
The encoder need not actually consider the entire coded data stream at one time. You can partition the coded data stream into three portions. From end to beginning of the data stream they are: preactive bytes, which as yet exert no influence over the current state of the decoder; active bytes, which affect the current state of the decoder and have more than one consistent value; and postactive bytes, which affect the current state of the decoder and have converged to a single consistent value. Each byte of the coded data stream goes from preactive to active to postactive; the earlier a byte's position in the stream, the earlier these transitions occur. A byte is not actually moved to the external file until it becomes postactive. Only the active portion of the data stream need be considered at any time.
Since the internal buffer of the decoder contains two bytes, there are always at least two active bytes. The variable n counts the number of active bytes in excess of 2. In theory, n can take arbitrarily high values, but higher values become exponentially less likely. You number the active bytes 0, 1,...,n+1 from latest to earliest. The encoder has a variable j matching in value the decoder's variable j at the same point in the data stream.
You can think of the n+2 active bytes as forming a single unsigned number with n+2 bytes of precision, byte 0 being least significant and byte n+1 being most significant. The set of allowable values in the decoder forms a continuous range; the consistent values of the active bytes in the encoder likewise form a continuous range. Thus you can describe this range simply by its minimum value m and maximum value M. Each of these is a nonnegative integer with n+2 bytes of precision; write m_k or M_k to refer specifically to byte k of m or M, respectively. Moreover, since the number of elements of the range is given by A[F+j], only the minimum value need be specified.
Suppose n > 0. Consider the most-significant active byte, byte n+1. The minimum consistent value for this byte is m_(n+1) and the maximum is M_(n+1). Since this byte is active and not yet postactive, it has more than one consistent value; thus m_(n+1) < M_(n+1). The current number of allowable values for the decoder's internal buffer is A[F+j], and M = m+(A[F+j]-1). Recall that A[F+j] is a 2-byte value. If you consider the operation of adding (A[F+j]-1) to m byte by byte to obtain M, you see that a carry must occur from byte 1 to byte 2 and on upward to byte n+1 in order for byte n+1 to take differing values for m and M. You must also have m_(n+1)+1 = M_(n+1). Furthermore, if n > 1, then m_2, m_3,..., m_n must all take the maximum possible value, 255, in order for the carry to propagate from byte 1 to byte n+1. (Hence, M_2, M_3,..., M_n all take the minimum possible value, 0.)
This is convenient in that you need not separately store all n+2 bytes of m: You can make the 4-byte variable x_min serve. Bytes 0, 1, and 2 of x_min represent m_0, m_1, and (when n is positive) m_2, respectively. When n > 1, byte 3 of x_min represents m_(n+1). Since bytes m_2,..., m_n all hold the same value, x_min and n together suffice to describe m in its entirety. Moreover, most of the arithmetic operations to be performed on m can be performed on x_min in a completely ordinary way. The exceptions can be recognized in the code because they require the manipulation of n.
Now consider the arithmetic operations involved in encoding a symbol. Recall the sequence of events in the decoder attending the decoding of the symbol value 0. Initially, x has one of A[F+j] values. The decoder compares x to Ladder[rung].Threshold[j]; a 0 is decoded if x holds the lesser value. Then Ladder[rung].c0 is subtracted from j, reflecting the newly decreased number of allowable values for x. The allowable values eliminated are all taken from the top of the range (those greater than Ladder[rung].Threshold[j]). Thus m does not change; the encoder need not modify x_min or n but only j.
For encoding the symbol value 1, the chief difference in the sequence of events in the decoder as compared to the case for 0 is that those values representing 0, numbering Ladder[rung].Threshold[j], are eliminated from the bottom of the range of consistent values for the coded data stream. Additional values may be eliminated from the top of the range if there is a data leak. Thus the encoder, in addition to changing the value of j to track its value in the decoder, must add Ladder[rung].Threshold[j] to x_min to raise m. If the encoder is not representing M directly, the operation of eliminating consistent values from the top of the range takes care of itself. The unusual format by which x_min represents m never becomes an issue in encode_bit().
When the value of j dips to zero or lower, the encoder calls encode_export(). In contrast to decode_import(), which invariably reads a single byte from the external file, encode_export() writes from zero to several bytes in a single call. One of its tasks is to determine whether the most-significant active bytes have converged to a unique consistent value, thus becoming postactive and ready to be exported to the encoder's external file. These most-significant bytes may actually converge to a unique consistent value well before encode_export() is called, but there is no harm in waiting until then to check. On each call, encode_export() moves a single byte from the preactive to the active portion of the coded data stream. The format by which x_min represents m does become an issue in encode_export(); you must manipulate n and x_min together.
ELSCODER.H and ELSCODER.C (available electronically) contain declarations and definitions for an implementation of the ELS-coder, respectively. COMPRESS.C and EXPAND.C (also available electronically) use the ELS-coder to compress and expand a file using a 1-byte model.
Listings One through Four are designed more for didactic value than utility. The sample programs, on the other hand, emphasize utility. Note the following differences:
ELSCODER.C includes the straightforward procedures els_decode_start() and els_encode_start() for initializing, and els_decode_end() and els_encode_end() for terminating encoding and decoding.
The sample files also incorporate some enhancements to the basic ELS-coder. First is probability estimation. The user must categorize the bits to be coded into groups; the bits of each group share statistical properties. (We call each group a context for the given bit. The sample shell programs COMPRESS.C and EXPAND.C provide an example of this.) The ELS-coder then maintains its own probability estimate for each group, based on the previously coded bits for that group. It does this using a state machine that transitions to a state indicating a higher probability when the bit value 1 is coded or a state indicating a lower probability when the bit value 0 is coded. By defining the constant ALACRITOUS or not, you select one of two state machines. Define ALACRITOUS if coding efficiency is your first priority; omit the definition if speed is your first priority.
Second, although the entropy-coding algorithm works with units of data smaller than a bit, the source and destination files must have lengths that are multiples of a byte. Thus a file coded with ELS-coder is likely to end with a fraction of a byte not containing meaningful data. The sample ELS-coder attempts to make a virtue of necessity by using this empty portion to encode something akin to a checksum. When encoding is concluded, a certain number of consistent values remain in the encoder; these are used to store the value of j to whatever precision is possible. This is done automatically in the procedure els_encode_end(). If desired, users can call els_decode_ok() when finished decoding but before calling els_decode_end(); this verifies that the value of j matches that sent by the encoder. Any corruption of the compressed data will most likely profoundly alter the course of execution of the decoder; the probability of ending with the same value of j is quite small. Of course, this value is only encoded to the precision possible with the fraction of a byte remaining at the end of coding; thus the probability of hitting the correct value by chance ranges from 1 in 255 to 1, depending on what that fraction is.
I would like to thank Stephen Mann for reading an early draft of this article and proposing many improvements in both form and content.
DDJ
Listing One

int b;
int Threshold[10] = {0, 1, 2, 4, 8, 16, 32, 64, 128, 256};
int *A = Threshold + 1;  /* A[b] = 2^b; for expository purposes only */
FILE *external_in;
unsigned char x;

void decode_import(void);

int decode_bit(void){
    if (x >= Threshold[b]){
        x -= Threshold[b];
        if (--b <= 0)
            decode_import();
        return 1;
    }else{
        if (--b <= 0)
            decode_import();
        return 0;
    }
}
void decode_import(void){
    x = fgetc(external_in);
    b += 8;
}
Listing Two

#define F 15
unsigned short A[2*F] = {1, 2, 3, 4, 5, 7, 10, 14, 20, 28, 41, 59, 85, 123,
    177, 256, 371, 536, 776, 1123, 1625, 2353, 3405, 4928, 7132, 10321,
    14938, 21619, 31288, 45283};
struct{
    unsigned short *Threshold;
    short c0;
    short c1;
} Ladder[3] = {{A+14, 1, 4}, {A+13, 2, 2}, {A+11, 4, 1}};
unsigned short x;
int j;
FILE *external_in;

void decode_import(void);
int decode_bit(unsigned char rung){
if (x >= Ladder[rung].Threshold[j]){
x -= Ladder[rung].Threshold[j];
if ((j -= Ladder[rung].c1) <= 0)
decode_import();
return 1;
}else{
if ((j -= Ladder[rung].c0) <= 0)
decode_import();
return 0;
}
}
void decode_import(){
j += F;
x <<= 8;
x |= fgetc(external_in);
}
Listing Three

unsigned long x_min;
int j;
int n;
FILE *external_out;

void encode_export(void);
void encode_bit(unsigned char rung, int bit){
if (bit){
/* Encode a 1 */
x_min += Ladder[rung].Threshold[j];
j -= Ladder[rung].c1;
}else{
/* Encode a 0. */
j -= Ladder[rung].c0;
}
if (j <= 0)
encode_export();
}
Listing Four

void encode_export(void){
    unsigned long diff_bits;
    /* First check for bytes becoming postactive; diff_bits marks differences
       between max and min consistent values. */
    j += F;
    diff_bits = (x_min + A[j] - 1) ^ x_min;
    switch (n){
    default: /* 2 or greater */
        if (diff_bits & 0xFF000000)
            break;
        fputc((x_min >> 24) & 0xFF, external_out);
        while (--n > 1)
            fputc((x_min >> 16) & 0xFF, external_out);
    case 1:
        if (diff_bits & 0x00FF0000)
            break;
        fputc((x_min >> 16) & 0xFF, external_out);
        n--;
    case 0:
        if (diff_bits & 0x0000FF00)
            break;
        fputc((x_min >> 8) & 0xFF, external_out);
        n--;
    }
    /* Move a byte from preactive to active. */
    if (++n > 2)
        x_min = (x_min & 0xFF000000) | ((x_min & 0x0000FFFF) << 8);
    else
        x_min <<= 8;
}