A Huffman Coding Algorithm


A "Huffman list" holds all characters found in a given text in decreasing order of frequency. The list is broken down into two sublists. If a character is found in the first sublist, bit 0 is sent to the compressed file and the search continues in the first sublist. Otherwise, 1 is sent to the compressed file and the search continues in the second sublist. The process continues until a list holds only one element.

A list is always broken up into two lists in such a way that the first sublist is the smallest possible, but has at least the same probability (cumulative frequency count) as the second sublist.

Assume the following frequency list:

Character Frequency  Cumulative Frequencies
                     all     (d,r,s,t)  (r,s,t) (s,t)
e         10         10
d         4          14      4
r         2          16      6           2
s         1          17      7           3      1
t         1          18      8           4      2
Assume the character s has to be encoded.

Step 1: As the total cumulative frequency count is 18, the first character, e, which has a count of 10, has already a probability of more than 50%. So the list is broken up into two lists, one holding e, the other holding the remainder of the original list. As s is not in the first list, bit 1 is sent to output.

Step 2: The new list (d,r,s,t) is evenly divisible in two lists, each with a cumulative count of 4, one holding d, the other holding r,s,t. As s is not in the first list, bit 1 is sent to the output. The search continues in the second list.

Step 3: The list is divided into a list holding r and a residual list. As s is not in the first list, bit 1 is sent to the output.

Step 4: s is in the only character left in the first list, so send bit 0.

In the example, s is encoded in four bits, as 1110. This is how each element in the list would be encoded:

e     0
d     10
r     110
s     1110
t     1111
The word red would be encoded as 110010, in six bits. This is how the six-bit stream 110010 would be decoded:

First bit is 1. No match is made, can be any of d, r, s, t.

Second bit is 1. No match is made, can be any of r, s, t.

Third bit is 0. Match is made. First character is r.

First bit is 0. Match is made. Second character is e.

First bit is 1. No match is made, can be any of d,r,s, t.

Second bit is 0. Match is made. Third character is d.