C PROGRAMMING

CUA and Data Compression

Al Stevens

The East Coast version of Software Development '90 was held November 13-16 in the Omni Parker House hotel, the oldest hotel in Boston. SD is the annual Miller Freeman conference for programmers, and this was the first eastern edition. It featured exhibits, lectures, and workshops that interest programmers. SD East was small by comparison to the older, established West Coast conference. There were only about 30 exhibitors, and attendance was about the same as the first West Coast SD three years ago. But there were plenty of lectures and workshops to attend. Tom Plum, Robert Ward, Jack Purdum, and others presented topics that would interest C programmers at all levels. Larry Constantine and Ken Orr lectured on software development methodologies. P.J. Plauger lectured on heresies in programming and software management, one of which is that if you do not understand the latest trendy methodology, it is probably not your fault.

Intel announced their 386/486 C Code Builder Kit at SD East. This product combines a 32-bit C compiler, libraries, librarian, linker, make utility, DOS extender, and debugger. Although some of these products have been around for a while, this is the first time that Intel has marketed them as an integrated retail package. They emphasize that their customers prefer and will benefit from a single-vendor solution, but the package does not include an editor or profiler, so, until Intel adds them, you will need to look elsewhere for those capabilities. Intel does not yet support Windows development. They said that when they do, you might need the Microsoft Windows Software Development Kit because they have not decided whether they will develop a look-alike. The Intel compiler and libraries are compatible with Microsoft C. I think that their emphasis on the single-vendor solution as a marketing device is a mistake. There is no way that Intel will be able to offer every library, tool, and utility that programmers need, and programmers know that. If, by their marketing strategies, they make you believe that the strength of their product is that it is a single-vendor solution, then whatever they leave out will draw attention away from whatever its real strengths might be.

Borland showed Turbo Pascal 6.0, which includes a new package called Turbo Vision, something that Turbo C++ programmers should demand. Jeff Duntemann talked about TP 6.0 and Turbo Vision in his "Structured Programming" column (DDJ, December 1990). Turbo Vision is what C++ programmers would call a class library. It implements something close to the IBM Systems Application Architecture (SAA) Common User Access (CUA) interface. CUA is the standard with which Windows 3.0 and OS/2 Presentation Manager comply. The Turbo Vision library includes application windows, dialog boxes, radio buttons, command buttons, lists, text boxes, a 64K editor object, and so on. It implements these things in the IBM text mode with full mouse support, allowing DOS text-mode programs to resemble Windows programs. More importantly, Turbo Vision provides a way for DOS programmers to use the CUA interface, not just as a Windows look-alike but as a way to comply with an emerging standard. With Turbo Vision, you do not need to write code for the mouse and keyboard, you do not need to write menu code, you do not need to write a text editor, and so on. But what is better, you do not need to write user documentation or detailed help screens that describe your unique user interface. That is the number one advantage of a common user interface, both for programmers and users. Good or bad, the look and feel of one application is much the same among all complying applications. Inasmuch as CUA is descending upon us as a de facto standard, C and C++ programmers will soon need such libraries.

My imagination fired up by Turbo Vision, I prowled the exhibitions looking for such a text-mode CUA library for C or C++. I did not find one there, but several vendors said they are thinking about doing it. Two packages that come close are the Zinc Interface Library and Magma Software's Mewel 3, but neither company exhibited at SD.

The Zinc Interface Library

DDJ reviewed the Zinc Interface Library for Turbo C++ programmers in December 1990. Jeff Duntemann's column also referred to the Zinc library, calling it "SAA-compliant." It does not seem to be a complete implementation of SAA CUA, however. For example, I could not find support for radio buttons or clipboard operations, both of which are components of the SAA CUA. But because Zinc is a class library that works with Turbo C++, you can probably extend it to add the missing parts of CUA by deriving and adding new classes.

Mewel 3

C programmers still need a CUA function library, however, and Magma Software Systems has a product called Mewel 3 that works with Microsoft C and the C compiler component of Turbo C++. It is a C function library that uses the same API as Windows 3. To use it, you must know how to write Windows 3 programs, which is not easy. Your programs, however, will be reasonably portable between the DOS text-mode and the Windows 3 GUI platforms. This feature offers advantages to several different development environments. Windows programmers can port their applications to DOS with a minimum of fuss, thus expanding their potential user base. Developers of new programs can target them for both platforms. Windows developers can use the DOS platform for program development, avoiding the clumsier Windows debugging environment.

I ran the Mewel demonstration programs, and their look and feel approximates the Windows user interface. There are occasional minor differences -- for example, double-clicking the control box does not always close the window -- but these are small points. One thing that concerns me is the size of the executable programs. The simplest programs are well in excess of 100K, some exceeding 300K. Maybe there's a penalty for using CUA. Maybe it could be done better. We'll know when the other vendors offer their CUA libraries.

The Mewel documentation explains the library reasonably well. It is, however, replete with typographical, usage, and grammatical errors. The Mewel license concerns me. It is one of those pointless licenses that allows you to make only one backup copy and use the product on only one of your computers. Furthermore, you are not allowed to "create other works based on the Documentation," which is presumptuous; Mewel itself is based very closely on SAA CUA and the Windows API. Finally, for some oddball reason, the license prohibits you from using Mewel to develop a word processor or text editor program. I do not understand that limitation. A developer should clarify these points with Magma before launching into a project that uses Mewel as the interface library.

Because Microsoft Press now publishes the Windows 3 SDK Programmer's Reference, Guide to Programming, and Programming Tools as separate books, a programmer does not need to buy the SDK to get the API documentation. Vendors such as Magma who clone the Windows API have their work made easier for them. They do not need to develop extensive programmer documentation if their libraries are true clones of the SDK. Watch for my "Examining Room" article on Mewel 3 in an upcoming issue of DDJ.

Encryption and Compression and Who Owns What?

No single subject in this column has stirred more response than the two columns devoted to the Data Encryption Standard last year. It seems that encryption is a hobby for a lot of programmers. Many of you sent me your encryption programs not only for the DES algorithm but for other encryption methods as well. I now have an extensive library of encryption algorithms as a result of your contributions. Now if I only had some secrets.

Some readers expressed their concern that I was publishing an implementation of the DES algorithm, thinking that I was somehow violating national security interests. If the government can publish the details of the algorithm in a booklet that is available to everyone, then programmers should be free to write code that implements the algorithm. No guys in trench coats have been knocking on my door, so I guess I'm safe.

This question raises the specter of software patents again. DDJ has taken an interest in the issue. We published code that implements the LZW data compression algorithm in June, and that article lit a fuse. It seems that someone patented the algorithm. This causes me to wonder what is protected by such patents. If the details of an algorithm are public knowledge -- as are those of LZW and DES -- does the patent holder own all expressions of that algorithm, or does he or she own rights only to implementations of the algorithm? What is the difference? Is the publication of source code an implementation of an algorithm or merely an expression of how it works? Obviously if only the implementation is protected and if the publication of source code is not an implementation, then magazines can print the programs but readers cannot use them. What good would that be other than as an exercise in how algorithms work and how the patent system does not?

The League for Programming Freedom wrote the article "Software Patents" (DDJ, November 1990). They abhor the practice and say why. I chatted with some of the League's members in Boston. By coincidence, the November issue of Boston Magazine has an article that discusses the League and its founder, Richard Stallman. Most of us know Stallman as the author of the EMACS editor and GNU, the Unix look-alike. You have to buy GNU for $150 from Stallman's other group, the Free Software Foundation, but you are permitted to give away copies. Unix costs $900, and you may not give away copies. I've never read where GNU is at least one-sixth as good as Unix, so I don't know if it's a deal or not. According to the Boston Magazine article, Stallman, founder of the League for Programming Freedom and the Free Software Foundation, will write programs for you for $260 an hour as long as the program is not proprietary. Software should be free, but Stallman sure isn't.

I suppose that under the "not proprietary" restriction, Stallman's clients can sell the programs he writes but cannot prevent their buyers from giving copies away. Any takers? How about in-house stuff? Most clients who develop custom software strictly for internal use consider it to be proprietary, perhaps to keep it from their competitors. Hmm. I'd follow Richard's example, but I don't think I could live on $520 a year.

Huffman Compression Trees

I don't know if Huffman trees involve a proprietary algorithm. But my research for last month's "Memory Lane" column did not find the algorithm in the DDJ morgue, so I'll take the chance and address it now. Maybe I'll be writing next month's column from the federal country club at Eglin Air Force Base. That way I can talk to my lawyer, my banker, and some fellow pilots every day.

The Huffman algorithm is a form of data compression. Others are LZW, mentioned above, and run-length encoding (RLE), where strings of the same character are replaced by a character count and the character. I used RLE in one of the encryption programs in November. There have been other RLE articles in past issues of DDJ. Contemporary file compression utility programs use combinations of several compression algorithms, examining each file to see which algorithm yields the best compression ratio. You are not likely to write yet another general-purpose file compression utility program, but you might need to use compression in one of your applications. Developers who distribute software on diskettes often compress it and have the installation program decompress it for the user. If you use one of the commercial or shareware compression utilities, you might have to pay for a license to distribute the decompression program. At the very least, your installation procedure might have to display the copyright notice of the company or person who owns the program. With the programs in this column and the LZW program from last June, you can write your own compression/decompression programs.
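Run-length encoding is simple enough to sketch in a few lines. The three-byte run format and the marker value below are illustrative choices for this sketch, not taken from any particular utility or from the November programs:

```c
#include <stdio.h>

#define RLEFLAG 0xF0    /* arbitrary marker byte for this sketch */

/* ---- write one run to the output file ---- */
/* Short runs go out verbatim; runs of four or more (and any
   occurrence of the marker byte itself) become three bytes:
   RLEFLAG, count, character. */
static void putrun(FILE *fo, int ch, int count)
{
    if (count >= 4 || ch == RLEFLAG)    {
        fputc(RLEFLAG, fo);
        fputc(count, fo);
        fputc(ch, fo);
    }
    else
        while (count--)
            fputc(ch, fo);
}

/* ---- run-length encode fi into fo ---- */
void rle_encode(FILE *fi, FILE *fo)
{
    int c, prev = EOF, count = 0;

    while ((c = fgetc(fi)) != EOF)  {
        if (c == prev && count < 255)
            count++;
        else    {
            if (count)
                putrun(fo, prev, count);
            prev = c;
            count = 1;
        }
    }
    if (count)
        putrun(fo, prev, count);
}
```

A decoder reverses the process: an RLEFLAG byte announces a count and a character; anything else is literal data.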

The Huffman compression algorithm assumes that data files consist of patterns of characters where some bytes occur more frequently than others. This is true for English language text files. We use some letters more often than others. By analyzing a file, the algorithm builds an array that identifies the frequency of each character. Then it builds the Huffman tree structure from the frequency array. The purpose of the tree is to associate each character with a bit string. The more frequent characters get shorter bit strings; the less frequent ones get longer bit strings, and so the data in the file can be compressed.

To compress the file, the Huffman algorithm reads the file a second time, translates each character into the bit string assigned to it by the Huffman tree, and writes the bit strings to the compressed file.

The decompression algorithm reverses the process. It uses the frequency array that the compression pass created to build the Huffman tree. Some applications use a global constant frequency array based on an empirical understanding of the database. With this technique, the program does not have to write the array to the compressed file. Other applications allow each file to have a unique frequency array. The compression pass writes the unique array to the compressed file and the decompression pass reads it and rebuilds the tree before decompressing the bit strings. This approach can generate a larger compressed file than the original from a short file or from one with a relatively even character distribution.

Compression

To understand how Huffman compresses data, you must observe how it builds the tree. Consider an example file that contains this sentence: "now is the time to compress". To build the tree you first must build a frequency array, which tells you the frequency of each character in the file. The frequency array for the example file looks like Table 1.

Table 1: The frequency array for our example file.

  Character  Frequency
  --------------------

     `'          5
     e           3
     o           3
     s           3
     t           3
     i           2
     m           2
     c           1
     h           1
     n           1
     p           1
     r           1
     w           1

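Building a frequency array such as Table 1 takes only a few lines of C. This sketch counts a string in memory rather than a file; Listing Three does the same job with fgetc:

```c
#include <stdio.h>

/* ---- count the frequency of each byte value in a string ---- */
void countfreq(const char *s, unsigned freq[256])
{
    int i;

    for (i = 0; i < 256; i++)
        freq[i] = 0;
    while (*s)
        freq[(unsigned char)*s++]++;
}
```

Called with the example sentence, countfreq reports five spaces, three each of e, o, s, and t, and so on, matching Table 1.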
The next step is to build the Huffman tree. The tree structure contains nodes, each of which contains the character, its frequency, a pointer to a parent node, and pointers to left and right child nodes. The tree contains entries for all 256 possible characters and room for 255 parent nodes. At first there are no parent nodes. The tree grows by making successive passes through the existing nodes. Each pass searches for the two nodes that have not grown a parent node and that have the two lowest frequency counts. When the program finds those two nodes, it allocates a new node, assigns it as the parent to the two nodes, and gives the new node a frequency count that is the sum of the two child nodes. The next iteration of the search ignores those two child nodes but includes the new parent node. The passes continue until only one node with no parent remains. That node will be the root node of the tree.

Figure 1 illustrates how the Huffman tree for the example file might appear. Observe that only the original leaf nodes have meaningful character values. That field in the higher nodes is meaningless. The tree search algorithms distinguish leaves from higher nodes by the presence of pointers to child nodes. If the node has children, it is not a leaf. The node that has no parent is the root. The compression algorithm uses the tree to translate the characters in the file into bit strings. You can see in Figure 1 that if you begin at the root node and trace your way to the most frequent character, there are fewer nodes in the path than there are to the least frequent character. It is therefore a matter of assigning a bit value, 0 or 1, to each of the right or left branches from a parent node to its children. Figure 2 shows the example Huffman tree with the value 1 assigned to each left branch, and 0 assigned to each right branch. If you trace through the tree, you can see that the bit string to represent the space character, which is the most frequent, is 011, while the string for the 'n' is 01001. In a file with instances of many more characters, the longer strings will be much longer than the longest ones in this example. Because characters occupy only the leaves of the tree, no character's bit string is a prefix of another's -- only the space character's code begins with 011, for example -- and so there is no potential for the bit strings of two characters to confuse the decompression algorithm.

Compression, then, involves traversing the tree beginning at the leaf node for the character to be compressed and navigating to the root. This navigation iteratively selects the parent of the current node and sees whether the current node is the right or left child of the parent, thus determining whether the next bit is a one or a zero. Because you are proceeding from leaf to root, you are collecting bits in the reverse order in which you will write them to the compressed file.

The assignment of the 1 bit to the left branch and the 0 bit to the right branch is arbitrary. Also, the actual tree might not always look exactly like the one in the figures. It depends on the order in which the search proceeds while it builds parents and on where in the tree it inserts those parents. The right and left branches from the root node could be reversed without affecting the compression ratio, for example. Figure 1 and Figure 2 are representative. The tree built by the code in this month's column would differ slightly in its right and left child node assignments, but it would be more difficult to draw. The tree in the figures would compress "now is" into this bit stream:

     n      o     w     ''    i     s
   01001   101   1000   011  0001   110

Decompression

Decompression involves building the Huffman tree, reading the compressed file, and converting its bit streams into characters. You read the file a bit at a time. Beginning at the root node in the Huffman tree and depending on the value of the bit, you take the right or left branch of the tree and then return to read another bit. When the node you select is a leaf--that is, it has no right and left child nodes--you write its character value to the decompressed file and go back to the root node for the next bit.

Most definitions of Huffman trees treat the frequency values as fractional probabilities rather than whole counts. A high-frequency character might have a node value of .55, while a low-frequency character might have a value of .00002. The value in the root node would, therefore, be 1.0, the sum of all the probabilities. My implementation uses the actual counts for the values, so the value in the root node is the total number of bytes in the file. This approach avoids the use of floating point arithmetic.

The Huffman Programs

Listing One is htree.h. The BYTECOUNTER typedef defines the data type for the frequency count. I used an unsigned int, which means that the program can compress files of up to 64K bytes. You could change that to a long integer to compress bigger files. The htree structure defines the Huffman tree, and the buildtree prototype is for the function that the compression and decompression programs use to build a Huffman tree from a character frequency array.

Listing Two is htree.c, which contains the buildtree function. It assumes that the first 256 entries in a global tree are initialized with the leaf values of the tree. It scans the tree and initializes local pointers to the two nodes it finds that have the lowest frequency counts. It bypasses any node that already has a parent or that has a zero value in its frequency count. After each scan where two nodes are found, the function adds a node to the tree, placing the address of the new node into the parent member of both nodes that were found by the scan. It puts the sum of the two child frequency counts into the new node's frequency count, and it puts the addresses of the two child nodes into the new node's right and left child node pointer members. When the scan fails to find two nodes that do not have parents and that have nonzero frequency counts, the tree is complete, and the remaining node without a parent is the root node.

Listing Three is huffc.c, the file compression program. After opening the input and output files, the program reads through the input file, building the frequency array and counting the bytes. It also counts the number of distinct character values found in the file. When the program has read the complete file, it writes the byte count and the count of distinct character values to the output file. Then it writes the frequency array, which consists of each character value that occurred in the input file and the number of times it occurred.

There are other ways that the program could have recorded the frequency array. It could have written 256 values, with the position of each value in the array being the character it counted. The array would record all zero as well as significant counts. This might use less room in the compressed file than the method chosen, which writes a count of significant characters followed by each character value and its count. Entries for character values that do not appear in the input file are not in the compressed file. A file that has occurrences of most of the 256 characters would probably record a smaller array by using the former method. A file with fewer distinct character values--such as a 7-bit ASCII text file--would have a smaller array by using the method chosen. A really smart compression program would decide which form is smaller and write the shorter form with a control value that identifies it.
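The arithmetic behind that decision is simple. Assuming the same BYTECOUNTER type as the listings, a hypothetical smart compressor could compute both header sizes and emit the smaller (the function names here are illustrative):

```c
#include <stdio.h>

typedef unsigned int BYTECOUNTER;

/* full table: 256 counts, one per character value; the
   character values are implicit in the array positions */
static long fullsize(void)
{
    return 256L * sizeof(BYTECOUNTER);
}

/* sparse table: each of n distinct characters stored as the
   character value followed by its count */
static long sparsesize(int n)
{
    return (long)n * (1 + (long)sizeof(BYTECOUNTER));
}
```

With the 2-byte unsigned int of the 16-bit compilers, the full table is 512 bytes and each sparse entry is 3, so the sparse form wins whenever fewer than 171 distinct characters occur.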

After writing the frequency array, the compression program calls the buildtree function to build the Huffman tree. Then it rewinds the input file, reads each character, and calls the compress function. The recursive compress function sends the compressed bit stream representation for a character to the output file. It starts from the character's leaf node position in the tree and calls itself with the address of the node's parent until it gets to the root node. When it returns from the call to itself, it writes a zero or one bit to the compressed file, depending on whether it was called from a right or left child node.

The program calls the outbit function to write each bit of compressed data. The outbit function shifts bits into a byte until 8 bits have accumulated, whereupon it writes the byte to the file. The program calls outbit a last time with a -1 parameter to tell it to flush the final, possibly partial, byte to the file.

Listing Four is huffd.c, the file decompression program. It opens the input and output files and then reads the byte and frequency counts. The byte count tells the decompression program when it is done: the last byte in the compressed file usually contains padding bits beyond those that encode characters, so the byte count, rather than end-of-file, controls the number of bytes written.

The frequency count value tells the program how many entries to read into the frequency array. Each entry consists of a character value and its frequency count. This array, once loaded, becomes the Huffman tree when the program calls the buildtree function. Then the program decompresses by calling the decompress function to get each byte to write to the output file.

The decompress function calls the inbit function to read each successive bit value in the compressed file. The inbit function reads a byte from the compressed file and shifts bits out of it until it has returned all 8 bits, whereupon it reads the next byte. The decompress function uses the bits to walk down the Huffman tree starting at the root node. As long as the current node has child nodes--entries in the right and left pointers--the function gets another bit and moves to the left child node if the bit is a 1 and the right child node if the bit is a 0. When the current node has no child node it is a leaf, and the decompress function returns its character value.

Compression as Encryption

You can use Huffman compression to encrypt data files. To decrypt a file, you must know the algorithm that encrypted it and the encryption key value. A data file that you compress with a Huffman tree is reasonably well encrypted if you keep the frequency array private. The array becomes the key to the file's decryption. The file is further protected if the algorithm itself is part of the secret. A codebreaker would need to figure out in the first place that you used a Huffman tree to compress the file, and then, having made that determination, the spy would have to decipher the frequency array in order to decompress/decrypt it.




[LISTING ONE]


/* ------------------- htree.h -------------------- */
typedef unsigned int BYTECOUNTER;

/* ---- Huffman tree structure ---- */
struct htree    {
    unsigned char ch;       /* character value             */
    BYTECOUNTER cnt;        /* character frequency         */
    struct htree *parent;   /* pointer to parent node      */
    struct htree *right;    /* pointer to right child node */
    struct htree *left;     /* pointer to left child node  */
};
extern struct htree ht[];
extern struct htree *root;

void buildtree(void);







[LISTING TWO]


/* ------------------- htree.c -------------------- */
#include <stdio.h>
#include <stdlib.h>
#include "htree.h"

struct htree ht[512];
struct htree *root;

/* ------ build a Huffman tree from a frequency array ------ */
void buildtree(void)
{
    int treect = 256;
    int i;

    /* ---- build the huffman tree ----- */
    while (1)   {
        struct htree *h1 = NULL, *h2 = NULL;
        /* ---- find the two smallest frequencies ---- */
        for (i = 0; i < treect; i++)   {
            if (ht+i != h1) {
                if (ht[i].cnt > 0 && ht[i].parent == NULL)   {
                    if (h1 == NULL || ht[i].cnt < h1->cnt) {
                        if (h2 == NULL || h1->cnt < h2->cnt)
                            h2 = h1;
                        h1 = ht+i;
                    }
                    else if (h2 == NULL || ht[i].cnt < h2->cnt)
                        h2 = ht+i;
                }
            }
        }
        if (h2 == NULL) {
            root = h1;
            break;
        }
        /* --- combine two nodes and add one --- */
        h1->parent = ht+treect;
        h2->parent = ht+treect;
        ht[treect].cnt = h1->cnt + h2->cnt;
        ht[treect].right = h1;
        ht[treect].left = h2;
        treect++;
    }
}







[LISTING THREE]


/* ------------------- huffc.c -------------------- */
#include <stdio.h>
#include <stdlib.h>
#include "htree.h"

static void compress(FILE *fo, struct htree *h,struct htree *child);
static void outbit(FILE *fo, int bit);

int main(int argc, char *argv[])
{
    FILE *fi, *fo;
    int c;
    BYTECOUNTER bytectr = 0;
    int freqctr = 0;
    if (argc < 3)   {
        printf("\nusage: huffc infile outfile");
        exit(1);
    }

    if ((fi = fopen(argv[1], "rb")) == NULL)    {
        printf("\nCannot open %s", argv[1]);
        exit(1);
    }
    if ((fo = fopen(argv[2], "wb")) == NULL)    {
        printf("\nCannot open %s", argv[2]);
        fclose(fi);
        exit(1);
    }

    /* - read the input file and count character frequency - */
    while ((c = fgetc(fi)) != EOF)   {
        c &= 255;
        if (ht[c].cnt == 0)   {
            freqctr++;
            ht[c].ch = c;
        }
        ht[c].cnt++;
        bytectr++;
    }

    /* --- write the byte count to the output file --- */
    fwrite(&bytectr, sizeof bytectr, 1, fo);

    /* --- write the frequency count to the output file --- */
    fwrite(&freqctr, sizeof freqctr, 1, fo);

    /* -- write the frequency array to the output file -- */
    for (c = 0; c < 256; c++)   {
        if (ht[c].cnt > 0)    {
            fwrite(&ht[c].ch, sizeof(char), 1, fo);
            fwrite(&ht[c].cnt, sizeof(BYTECOUNTER), 1, fo);
        }
    }

    /* ---- build the huffman tree ---- */
    buildtree();

    /* ------ compress the file ------ */
    fseek(fi, 0L, SEEK_SET);
    while ((c = fgetc(fi)) != EOF)
        compress(fo, ht + (c & 255), NULL);
    outbit(fo, -1);
    fclose(fi);
    fclose(fo);
}

/* ---- compress a character value into a bit stream ---- */
static void compress(FILE *fo, struct htree *h,
                                struct htree *child)
{
    if (h->parent != NULL)
        compress(fo, h->parent, h);
    if (child)  {
        if (child == h->right)
            outbit(fo, 0);
        else if (child == h->left)
            outbit(fo, 1);
    }
}
static char out8;
static int ct8;

/* -- collect and write bits to the compressed output file -- */
static void outbit(FILE *fo, int bit)
{
    if (bit == -1)  {
        /* flush: left-justify and write any remaining bits */
        if (ct8 > 0)
            fputc(out8 << (8 - ct8), fo);
        ct8 = 0;
        return;
    }
    if (ct8 == 8)   {
        fputc(out8, fo);
        ct8 = 0;
    }
    out8 = (out8 << 1) | bit;
    ct8++;
}







[LISTING FOUR]


/* ------------------- huffd.c -------------------- */
#include <stdio.h>
#include <stdlib.h>
#include "htree.h"

static int decompress(FILE *fi, struct htree *root);

int main(int argc, char *argv[])
{
    FILE *fi, *fo;
    unsigned char c;
    BYTECOUNTER bytectr;
    int freqctr;
    if (argc < 3)   {
        printf("\nusage: huffd infile outfile");
        exit(1);
    }
    if ((fi = fopen(argv[1], "rb")) == NULL)    {
        printf("\nCannot open %s", argv[1]);
        exit(1);
    }
    if ((fo = fopen(argv[2], "wb")) == NULL)    {
        printf("\nCannot open %s", argv[2]);
        fclose(fi);
        exit(1);
    }

    /* ----- read the byte count ------ */
    fread(&bytectr, sizeof bytectr, 1, fi);

    /* ----- read the frequency count ------ */
    fread(&freqctr, sizeof freqctr, 1, fi);

    while (freqctr--)   {
        fread(&c, sizeof(char), 1, fi);
        ht[c].ch = c;
        fread(&ht[c].cnt, sizeof(BYTECOUNTER), 1, fi);
    }

    /* ---- build the huffman tree ----- */
    buildtree();

    /* ----- decompress the file ------ */
    while (bytectr--)
        fputc(decompress(fi, root), fo);
    fclose(fo);
    fclose(fi);
}
static int in8;
static int ct8 = 8;

/* ---- read a bit at a time from a file ---- */
static int inbit(FILE *fi)
{
    int obit;
    if (ct8 == 8)   {
        in8 = fgetc(fi);
        ct8 = 0;
    }
    obit = in8 & 0x80;
    in8 <<= 1;
    ct8++;
    return obit;
}

/* ----- decompress file bits into characters ------ */
static int decompress(FILE *fi, struct htree *h)
{
    while (h->right != NULL)
        if (inbit(fi))
            h = h->left;
        else
            h = h->right;
    return h->ch;
}

Copyright © 1991, Dr. Dobb's Journal