C PROGRAMMING

CUA and Data Compression

Al Stevens

The East Coast version of Software Development '90 was held November 13-16 in the Omni Parker House hotel, the oldest hotel in Boston. SD is the annual Miller Freeman conference for programmers, and this was the first eastern edition. It featured exhibits, lectures, and workshops that interest programmers. SD East was small by comparison to the older, established West Coast conference. There were only about 30 exhibitors, and attendance was about the same as the first West Coast SD three years ago. But there were plenty of lectures and workshops to attend. Tom Plum, Robert Ward, Jack Purdum, and others presented topics that would interest C programmers at all levels. Larry Constantine and Ken Orr lectured on software development methodologies. P.J. Plauger lectured on heresies in programming and software management, one of which is that if you do not understand the latest trendy methodology, it is probably not your fault.

Intel announced their 386/486 C Code Builder Kit at SD East. This product combines a 32-bit C compiler, libraries, librarian, linker, make utility, DOS extender, and debugger. Although some of these products have been around for a while, this is the first time that Intel has marketed them as an integrated retail package. They emphasize that their customers prefer and will benefit from a single-vendor solution, but the package does not include an editor or profiler, so, until Intel adds them, you will need to look elsewhere for those capabilities. Intel does not yet support Windows development. They said that when they do, you might need the Microsoft Windows Software Development Kit because they have not decided whether they will develop a look-alike. The Intel compiler and libraries are compatible with Microsoft C. I think that their emphasis on the single-vendor solution as a marketing device is a mistake. There is no way that Intel will be able to offer every library, tool, and utility that programmers need, and programmers know that. If, by their marketing strategies, they make you believe that the strength of their product is that it is a single-vendor solution, then whatever they leave out will draw attention away from whatever its real strengths might be.

Borland showed Turbo Pascal 6.0, which includes a new package called Turbo Vision, something that Turbo C++ programmers should demand. Jeff Duntemann talked about TP 6.0 and Turbo Vision in his "Structured Programming" column (DDJ, December 1990). Turbo Vision is what C++ programmers would call a class library. It implements something close to the IBM Systems Application Architecture (SAA) Common User Access (CUA) interface. CUA is the standard with which Windows 3.0 and OS/2 Presentation Manager comply. The Turbo Vision library includes application windows, dialog boxes, radio buttons, command buttons, lists, text boxes, a 64K editor object, and so on. It implements these things in the IBM text mode with full mouse support, allowing DOS text-mode programs to resemble Windows programs. More importantly, Turbo Vision provides a way for DOS programmers to use the CUA interface, not just as a Windows look-alike but as a way to comply with an emerging standard. With Turbo Vision, you do not need to write code for the mouse and keyboard, you do not need to write menu code, you do not need to write a text editor, and so on. But what is better, you do not need to write user documentation or detailed help screens that describe your unique user interface. That is the number one advantage of a common user interface, both for programmers and users. Good or bad, the look and feel of one application is much the same among all complying applications. Inasmuch as CUA is descending upon us as a de facto standard, C and C++ programmers will soon need such libraries.

My imagination fired up by Turbo Vision, I prowled the exhibitions looking for such a text-mode CUA library for C or C++. I did not find one there, but several vendors said they are thinking about doing it. Two packages that come close are the Zinc Interface Library and Magma Software's Mewel 3, but neither company exhibited at SD.

The Zinc Interface Library

DDJ reviewed the Zinc Interface Library for Turbo C++ programmers in December 1990. Jeff Duntemann's column also referred to the Zinc library, calling it "SAA-compliant." It does not seem to be a complete implementation of SAA CUA, however. For example, I could not find support for radio buttons or clipboard operations, both of which are components of the SAA CUA. But because Zinc is a class library that works with Turbo C++, you can probably extend it to add the missing parts of CUA by deriving and adding new classes.

Mewel 3

C programmers still need a CUA function library, however, and Magma Software Systems has a product called Mewel 3 that works with Microsoft C and the C compiler component of Turbo C++. It is a C function library that uses the same API as Windows 3. To use it, you must know how to write Windows 3 programs, which is not easy. Your programs, however, will be reasonably portable between the DOS text-mode and the Windows 3 GUI platforms. This feature offers advantages to several different development environments. Windows programmers can port their applications to DOS with a minimum of fuss, thus expanding their potential user base. Developers of new programs can target them for both platforms. Windows developers can use the DOS platform for program development, avoiding the clumsier Windows debugging environment.

I ran the Mewel demonstration programs, and their look and feel approximates the Windows user interface. There are occasional minor differences -- for example, double-clicking the control box does not always close the window -- but these are small points. One thing that concerns me is the size of the executable programs. The simplest programs are well in excess of 100K, some exceeding 300K. Maybe there's a penalty for using CUA. Maybe it could be done better. We'll know when the other vendors offer their CUA libraries.

The Mewel documentation explains the library reasonably well. It is, however, replete with typographical, usage, and grammatical errors. The Mewel license concerns me. It is one of those pointless licenses that allows you to make only one backup copy and use the product on only one of your computers. Furthermore, you are not allowed to "create other works based on the Documentation," which is presumptuous; Mewel itself is based very closely on SAA CUA and the Windows API. Finally, for some oddball reason, the license prohibits you from using Mewel to develop a word processor or text editor program. I do not understand that limitation. A developer should clarify these points with Magma before launching into a project that uses Mewel as the interface library.

Because Microsoft Press now publishes the Windows 3 SDK Programmer's Reference, Guide to Programming, and Programming Tools as separate books, a programmer does not need to buy the SDK to get the API documentation. Vendors such as Magma who clone the Windows API have their work made easier for them. They do not need to develop extensive programmer documentation if their libraries are true clones of the SDK. Watch for my "Examining Room" article on Mewel 3 in an upcoming issue of DDJ.

Encryption and Compression and Who Owns What?

No single subject in this column has stirred more response than the two columns devoted to the Data Encryption Standard last year. It seems that encryption is a hobby for a lot of programmers. Many of you sent me your encryption programs not only for the DES algorithm but for other encryption methods as well. I now have an extensive library of encryption algorithms as a result of your contributions. Now if I only had some secrets.

Some readers expressed their concern that I was publishing an implementation of the DES algorithm, thinking that I was somehow violating national security interests. If the government can publish the details of the algorithm in a booklet that is available to everyone, then programmers should be free to write code that implements the algorithm. No guys in trench coats have been knocking on my door, so I guess I'm safe.

This question raises the specter of software patents again. DDJ has taken an interest in the issue. We published code that implements the LZW data compression algorithm in June, and that article lit a fuse. It seems that someone patented the algorithm. This causes me to wonder what is protected by such patents. If the details of an algorithm are public knowledge -- as are those of LZW and DES -- does the patent holder own all expressions of that algorithm, or does he or she own rights only to implementations of the algorithm? What is the difference? Is the publication of source code an implementation of an algorithm or merely an expression of how it works? Obviously if only the implementation is protected and if the publication of source code is not an implementation, then magazines can print the programs but readers cannot use them. What good would that be other than as an exercise in how algorithms work and how the patent system does not?

The League for Programming Freedom wrote the article "Software Patents" (DDJ, November 1990). They abhor the practice and say why. I chatted with some of the League's members in Boston. By coincidence, the November issue of Boston Magazine has an article that discusses the League and its founder, Richard Stallman. Most of us know Stallman as the author of the EMACS editor and GNU, the Unix look-alike. You have to buy GNU for $150 from Stallman's other group, the Free Software Foundation, but you are permitted to give away copies. Unix costs $900, and you may not give away copies. I've never read where GNU is at least one-sixth as good as Unix, so I don't know if it's a deal or not. According to the Boston Magazine article, Stallman, founder of the League for Programming Freedom and the Free Software Foundation, will write programs for you for $260 an hour as long as the program is not proprietary. Software should be free, but Stallman sure isn't.

I suppose that under the "not proprietary" restriction, Stallman's clients can sell the programs he writes but cannot prevent their buyers from giving copies away. Any takers? How about in-house stuff? Most clients who develop custom software strictly for internal use consider it to be proprietary, perhaps to keep it from their competitors. Hmm. I'd follow Richard's example, but I don't think I could live on $520 a year.

Huffman Compression Trees

I don't know if Huffman trees involve a proprietary algorithm. But my research for last month's "Memory Lane" column did not find the algorithm in the DDJ morgue, so I'll take the chance and address it now. Maybe I'll be writing next month's column from the federal country club at Eglin Air Force Base. That way I can talk to my lawyer, my banker, and some fellow pilots every day.

The Huffman algorithm is a form of data compression. Others are LZW, mentioned above, and run-length encoding (RLE), where strings of the same character are replaced by a character count and the character. I used RLE in one of the encryption programs in November. There have been other RLE articles in past issues of DDJ. Contemporary file compression utility programs use combinations of several compression algorithms, examining each file to see which algorithm yields the best compression ratio. You are not likely to write yet another general-purpose file compression utility program, but you might need to use compression in one of your applications. Developers who distribute software on diskettes often compress it and have the installation program decompress it for the user. If you use one of the commercial or shareware compression utilities, you might have to pay for a license to distribute the decompression program. At the very least, your installation procedure might have to display the copyright notice of the company or person who owns the program. With the programs in this column and the LZW program from last June, you can write your own compression/decompression programs.
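Run-length encoding is simple enough to sketch in a few lines. The three-byte run format and the marker value below are illustrative choices for this sketch, not taken from any particular utility or from the November programs:

```c
#include <stdio.h>

#define RLEFLAG 0xF0    /* arbitrary marker byte for this sketch */

/* ---- write one run to the output file ---- */
/* Short runs go out verbatim; runs of four or more (and any
   occurrence of the marker byte itself) become three bytes:
   RLEFLAG, count, character. */
static void putrun(FILE *fo, int ch, int count)
{
    if (count >= 4 || ch == RLEFLAG)    {
        fputc(RLEFLAG, fo);
        fputc(count, fo);
        fputc(ch, fo);
    }
    else
        while (count--)
            fputc(ch, fo);
}

/* ---- run-length encode fi into fo ---- */
void rle_encode(FILE *fi, FILE *fo)
{
    int c, prev = EOF, count = 0;

    while ((c = fgetc(fi)) != EOF)  {
        if (c == prev && count < 255)
            count++;
        else    {
            if (count)
                putrun(fo, prev, count);
            prev = c;
            count = 1;
        }
    }
    if (count)
        putrun(fo, prev, count);
}
```

A decoder reverses the process: an RLEFLAG byte announces a count and a character; anything else is literal data.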

The Huffman compression algorithm assumes that data files consist of patterns of characters where some bytes occur more frequently than others. This is true for English language text files. We use some letters more often than others. By analyzing a file, the algorithm builds an array that identifies the frequency of each character. Then it builds the Huffman tree structure from the frequency array. The purpose of the tree is to associate each character with a bit string. The more frequent characters get shorter bit strings; the less frequent ones get longer bit strings, and so the data in the file can be compressed.

To compress the file, the Huffman algorithm reads the file a second time, translates each character into the bit string assigned to it by the Huffman tree, and writes the bit strings to the compressed file.

The decompression algorithm reverses the process. It uses the frequency array that the compression pass created to build the Huffman tree. Some applications use a global constant frequency array based on an empirical understanding of the database. With this technique, the program does not have to write the array to the compressed file. Other applications allow each file to have a unique frequency array. The compression pass writes the unique array to the compressed file and the decompression pass reads it and rebuilds the tree before decompressing the bit strings. This approach can generate a larger compressed file than the original from a short file or from one with a relatively even character distribution.

Compression

To understand how Huffman compresses data, you must observe how it builds the tree. Consider an example file that contains this sentence: "now is the time to compress". To build the tree you first must build a frequency array, which tells you the frequency of each character in the file. The frequency array for the example file looks like Table 1.

Table 1: The frequency array for our example file.

  Character  Frequency
  --------------------

     `'          5
     e           3
     o           3
     s           3
     t           3
     i           2
     m           2
     c           1
     h           1
     n           1
     p           1
     r           1
     w           1

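Building a frequency array such as Table 1 takes only a few lines of C. This sketch counts a string in memory rather than a file; Listing Three does the same job with fgetc:

```c
#include <stdio.h>

/* ---- count the frequency of each byte value in a string ---- */
void countfreq(const char *s, unsigned freq[256])
{
    int i;

    for (i = 0; i < 256; i++)
        freq[i] = 0;
    while (*s)
        freq[(unsigned char)*s++]++;
}
```

Called with the example sentence, countfreq reports five spaces, three each of e, o, s, and t, and so on, matching Table 1.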
The next step is to build the Huffman tree. The tree structure contains nodes, each of which contains the character, its frequency, a pointer to a parent node, and pointers to left and right child nodes. The tree contains entries for all 256 possible characters and room for 255 parent nodes. At first there are no parent nodes. The tree grows by making successive passes through the existing nodes. Each pass searches for the two nodes that have not grown a parent node and that have the two lowest frequency counts. When the program finds those two nodes, it allocates a new node, assigns it as the parent to the two nodes, and gives the new node a frequency count that is the sum of the two child nodes. The next iteration of the search ignores those two child nodes but includes the new parent node. The passes continue until only one node with no parent remains. That node will be the root node of the tree.

Figure 1 illustrates how the Huffman tree for the example file might appear. Observe that only the original leaf nodes have meaningful character values. That field in the higher nodes is meaningless. The tree search algorithms distinguish leaves from higher nodes by the presence of pointers to child nodes. If the node has children, it is not a leaf. The node that has no parent is the root. The compression algorithm uses the tree to translate the characters in the file into bit strings. You can see in Figure 1 that if you begin at the root node and trace your way to the most frequent character, there are fewer nodes in the path than there are to the least frequent character. It is therefore a matter of assigning a bit value, 0 or 1, to each of the right or left branches from a parent node to its children. Figure 2 shows the example Huffman tree with the value 1 assigned to each left branch, and 0 assigned to each right branch. If you trace through the tree, you can see that the bit string to represent the space character, which is the most frequent, is 011, while the string for the 'n' is 01001. In a file with instances of many more characters, the longer strings will be much longer than the longest ones in this example. Because characters occupy only the leaves of the tree, no character's bit string is a prefix of another's -- only the space character's code begins with 011, for example -- and so there is no potential for the bit strings of two characters to confuse the decompression algorithm.

Compression, then, involves traversing the tree beginning at the leaf node for the character to be compressed and navigating to the root. This navigation iteratively selects the parent of the current node and sees whether the current node is the right or left child of the parent, thus determining whether the next bit is a one or a zero. Because you are proceeding from leaf to root, you are collecting bits in the reverse order in which you will write them to the compressed file.

The assignment of the 1 bit to the left branch and the 0 bit to the right branch is arbitrary. Also, the actual tree might not always look exactly like the one in the figures. It depends on the order in which the search proceeds while it builds parents and on where in the tree it inserts those parents. The right and left branches from the root node could be reversed without affecting the compression ratio, for example. Figure 1 and Figure 2 are representative. The tree built by the code in this month's column would differ slightly in its right and left child node assignments, but it would be more difficult to draw. The tree in the figures would compress "now is" into this bit stream:

     n      o     w     ''    i     s
   01001   101   1000   011  0001   110

Decompression

Decompression involves building the Huffman tree, reading the compressed file, and converting its bit streams into characters. You read the file a bit at a time. Beginning at the root node in the Huffman tree and depending on the value of the bit, you take the right or left branch of the tree and then return to read another bit. When the node you select is a leaf--that is, it has no right and left child nodes--you write its character value to the decompressed file and go back to the root node for the next bit.

Most definitions of Huffman trees treat the frequency values as fractional probabilities rather than whole counts. A high-frequency character might have a node value of .55, while a low-frequency character might have a value of .00002. The value in the root node would, therefore, be 1.0, the sum of all the probabilities. My implementation uses the actual counts for the values, so the value in the root node is the total number of bytes in the file. This approach avoids the use of floating point arithmetic.

The Huffman Programs

Listing One is htree.h. The BYTECOUNTER typedef defines the data type for the frequency count. I used an unsigned int, which means that the program can compress files of up to 64K bytes. You could change that to a long integer to compress bigger files. The htree structure defines the Huffman tree, and the buildtree prototype is for the function that the compression and decompression programs use to build a Huffman tree from a character frequency array.

Listing Two is htree.c, which contains the buildtree function. It assumes that the first 256 entries in a global tree are initialized with the leaf values of the tree. It scans the tree and initializes local pointers to the two nodes it finds that have the lowest frequency counts. It bypasses any node that already has a parent or that has a zero value in its frequency count. After each scan where two nodes are found, the function adds a node to the tree, placing the address of the new node into the parent member of both nodes that were found by the scan. It puts the sum of the two child frequency counts into the new node's frequency count, and it puts the addresses of the two child nodes into the new node's right and left child node pointer members. When the scan fails to find two nodes that do not have parents and that have nonzero frequency counts, the tree is complete, and the remaining node without a parent is the root node.

Listing Three is huffc.c, the file compression program. After opening the input and output files, the program reads through the input file, building the frequency array and counting the bytes. It also counts the number of distinct character values found in the file. When the program has read the complete file, it writes the byte count and the count of distinct character values to the output file. Then it writes the frequency array, which consists of each character value that occurred in the input file and the number of times it occurred.

There are other ways that the program could have recorded the frequency array. It could have written 256 values, with the position of each value in the array being the character it counted. The array would record all zero as well as significant counts. This might use less room in the compressed file than the method chosen, which writes a count of significant characters followed by each character value and its count. Entries for character values that do not appear in the input file are not in the compressed file. A file that has occurrences of most of the 256 characters would probably record a smaller array by using the former method. A file with fewer distinct character values--such as a 7-bit ASCII text file--would have a smaller array by using the method chosen. A really smart compression program would decide which form is smaller and write the shorter form with a control value that identifies it.
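The arithmetic behind that decision is simple. Assuming the same BYTECOUNTER type as the listings, a hypothetical smart compressor could compute both header sizes and emit the smaller (the function names here are illustrative):

```c
#include <stdio.h>

typedef unsigned int BYTECOUNTER;

/* full table: 256 counts, one per character value; the
   character values are implicit in the array positions */
static long fullsize(void)
{
    return 256L * sizeof(BYTECOUNTER);
}

/* sparse table: each of n distinct characters stored as the
   character value followed by its count */
static long sparsesize(int n)
{
    return (long)n * (1 + (long)sizeof(BYTECOUNTER));
}
```

With the 2-byte unsigned int of the 16-bit compilers, the full table is 512 bytes and each sparse entry is 3, so the sparse form wins whenever fewer than 171 distinct characters occur.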

After writing the frequency array, the compression program calls the buildtree function to build the Huffman tree. Then it rewinds the input file, reads each character, and calls the compress function. The recursive compress function sends the compressed bit stream representation for a character to the output file. It starts from the character's leaf node position in the tree and calls itself with the address of the node's parent until it gets to the root node. When it returns from the call to itself, it writes a zero or one bit to the compressed file, depending on whether it was called from a right or left child node.

The program calls the outbit function to write each bit of compressed data. The outbit function shifts bits into a byte until 8 bits have accumulated, whereupon it writes the byte to the file. The program calls outbit a last time with a -1 parameter to tell it to flush the final, possibly partial, byte to the file.

Listing Four is huffd.c, the file decompression program. It opens the input and output files and then reads the byte and frequency counts. The byte count tells the decompression program when it is done: the last byte in the compressed file usually contains padding bits beyond those that encode characters, so the byte count, rather than end-of-file, controls the number of bytes written.

The frequency count value tells the program how many entries to read into the frequency array. Each entry consists of a character value and its frequency count. This array, once loaded, becomes the Huffman tree when the program calls the buildtree function. Then the program decompresses by calling the decompress function to get each byte to write to the output file.

The decompress function calls the inbit function to read each successive bit value in the compressed file. The inbit function reads a byte from the compressed file and shifts bits out of it until it has returned all 8 bits, whereupon it reads the next byte. The decompress function uses the bits to walk down the Huffman tree starting at the root node. As long as the current node has child nodes--entries in the right and left pointers--the function gets another bit and moves to the left child node if the bit is a 1 and the right child node if the bit is a 0. When the current node has no child node it is a leaf, and the decompress function returns its character value.

Compression as Encryption

You can use Huffman compression to encrypt data files. To decrypt a file, you must know the algorithm that encrypted it and the encryption key value. A data file that you compress with a Huffman tree is reasonably well encrypted if you keep the frequency array private. The array becomes the key to the file's decryption. The file is further protected if the algorithm itself is part of the secret. A codebreaker would need to figure out in the first place that you used a Huffman tree to compress the file, and then, having made that determination, the spy would have to decipher the frequency array in order to decompress/decrypt it.




[LISTING ONE]


/* ------------------- htree.h -------------------- */
typedef unsigned int BYTECOUNTER;

/* ---- Huffman tree structure ---- */
struct htree    {
    unsigned char ch;       /* character value             */
    BYTECOUNTER cnt;        /* character frequency         */
    struct htree *parent;   /* pointer to parent node      */
    struct htree *right;    /* pointer to right child node */
    struct htree *left;     /* pointer to left child node  */
};
extern struct htree ht[];
extern struct htree *root;

void buildtree(void);







[LISTING TWO]


/* ------------------- htree.c -------------------- */
#include <stdio.h>
#include <stdlib.h>
#include "htree.h"

struct htree ht[512];
struct htree *root;

/* ------ build a Huffman tree from a frequency array ------ */
void buildtree(void)
{
    int treect = 256;
    int i;

    /* ---- build the huffman tree ----- */
    while (1)   {
        struct htree *h1 = NULL, *h2 = NULL;
        /* ---- find the two smallest frequencies ---- */
        for (i = 0; i < treect; i++)   {
            if (ht+i != h1) {
                if (ht[i].cnt > 0 && ht[i].parent == NULL)   {
                    if (h1 == NULL || ht[i].cnt < h1->cnt) {
                        if (h2 == NULL || h1->cnt < h2->cnt)
                            h2 = h1;
                        h1 = ht+i;
                    }
                    else if (h2 == NULL || ht[i].cnt < h2->cnt)
                        h2 = ht+i;
                }
            }
        }
        if (h2 == NULL) {
            root = h1;
            break;
        }
        /* --- combine two nodes and add one --- */
        h1->parent = ht+treect;
        h2->parent = ht+treect;
        ht[treect].cnt = h1->cnt + h2->cnt;
        ht[treect].right = h1;
        ht[treect].left = h2;
        treect++;
    }
}







[LISTING THREE]


/* ------------------- huffc.c -------------------- */
#include <stdio.h>
#include <stdlib.h>
#include "htree.h"

static void compress(FILE *fo, struct htree *h,struct htree *child);
static void outbit(FILE *fo, int bit);

int main(int argc, char *argv[])
{
    FILE *fi, *fo;
    int c;
    BYTECOUNTER bytectr = 0;
    int freqctr = 0;
    if (argc < 3)   {
        printf("\nusage: huffc infile outfile");
        exit(1);
    }

    if ((fi = fopen(argv[1], "rb")) == NULL)    {
        printf("\nCannot open %s", argv[1]);
        exit(1);
    }
    if ((fo = fopen(argv[2], "wb")) == NULL)    {
        printf("\nCannot open %s", argv[2]);
        fclose(fi);
        exit(1);
    }

    /* - read the input file and count character frequency - */
    while ((c = fgetc(fi)) != EOF)   {
        c &= 255;
        if (ht[c].cnt == 0)   {
            freqctr++;
            ht[c].ch = c;
        }
        ht[c].cnt++;
        bytectr++;
    }

    /* --- write the byte count to the output file --- */
    fwrite(&bytectr, sizeof bytectr, 1, fo);

    /* --- write the frequency count to the output file --- */
    fwrite(&freqctr, sizeof freqctr, 1, fo);

    /* -- write the frequency array to the output file -- */
    for (c = 0; c < 256; c++)   {
        if (ht[c].cnt > 0)    {
            fwrite(&ht[c].ch, sizeof(char), 1, fo);
            fwrite(&ht[c].cnt, sizeof(BYTECOUNTER), 1, fo);
        }
    }

    /* ---- build the huffman tree ---- */
    buildtree();

    /* ------ compress the file ------ */
    fseek(fi, 0L, SEEK_SET);
    while ((c = fgetc(fi)) != EOF)
        compress(fo, ht + (c & 255), NULL);
    outbit(fo, -1);
    fclose(fi);
    fclose(fo);
}

/* ---- compress a character value into a bit stream ---- */
static void compress(FILE *fo, struct htree *h,
                                struct htree *child)
{
    if (h->parent != NULL)
        compress(fo, h->parent, h);
    if (child)  {
        if (child == h->right)
            outbit(fo, 0);
        else if (child == h->left)
            outbit(fo, 1);
    }
}
static char out8;
static int ct8;

/* -- collect and write bits to the compressed output file -- */
static void outbit(FILE *fo, int bit)
{
    if (bit == -1)  {
        /* flush: left-justify and write any remaining bits */
        if (ct8 > 0)
            fputc(out8 << (8 - ct8), fo);
        ct8 = 0;
        return;
    }
    if (ct8 == 8)   {
        fputc(out8, fo);
        ct8 = 0;
    }
    out8 = (out8 << 1) | bit;
    ct8++;
}







[LISTING FOUR]


/* ------------------- huffd.c -------------------- */
#include <stdio.h>
#include <stdlib.h>
#include "htree.h"

static int decompress(FILE *fi, struct htree *root);

int main(int argc, char *argv[])
{
    FILE *fi, *fo;
    unsigned char c;
    BYTECOUNTER bytectr;
    int freqctr;
    if (argc < 3)   {
        printf("\nusage: huffd infile outfile");
        exit(1);
    }
    if ((fi = fopen(argv[1], "rb")) == NULL)    {
        printf("\nCannot open %s", argv[1]);
        exit(1);
    }
    if ((fo = fopen(argv[2], "wb")) == NULL)    {
        printf("\nCannot open %s", argv[2]);
        fclose(fi);
        exit(1);
    }

    /* ----- read the byte count ------ */
    fread(&bytectr, sizeof bytectr, 1, fi);

    /* ----- read the frequency count ------ */
    fread(&freqctr, sizeof freqctr, 1, fi);

    while (freqctr--)   {
        fread(&c, sizeof(char), 1, fi);
        ht[c].ch = c;
        fread(&ht[c].cnt, sizeof(BYTECOUNTER), 1, fi);
    }

    /* ---- build the huffman tree ----- */
    buildtree();

    /* ----- decompress the file ------ */
    while (bytectr--)
        fputc(decompress(fi, root), fo);
    fclose(fo);
    fclose(fi);
}
static int in8;
static int ct8 = 8;

/* ---- read a bit at a time from a file ---- */
static int inbit(FILE *fi)
{
    int obit;
    if (ct8 == 8)   {
        in8 = fgetc(fi);
        ct8 = 0;
    }
    obit = in8 & 0x80;
    in8 <<= 1;
    ct8++;
    return obit;
}

/* ----- decompress file bits into characters ------ */
static int decompress(FILE *fi, struct htree *h)
{
    while (h->right != NULL)
        if (inbit(fi))
            h = h->left;
        else
            h = h->right;
    return h->ch;
}

Copyright © 1991, Dr. Dobb's Journal