Laurence is a freelance software engineer and author. He can be contacted at lva@telework.demon.co.uk.
The java.io package is Java's solution for shielding you from input/output-related device and platform dependencies. Of the 23 core java.io classes, 19 rely on the concept of a data stream (see Figure 1). A Java stream is simply a unidirectional, sequential flow of bytes that emanates from some concrete source (for input streams) or flows toward a concrete sink (for output streams). Input streams can only be read from, and output streams can only be written to.
In this article, I'll show how to subclass stream classes by designing and implementing I/O stream classes that support bit streams rather than byte streams. I'll then demonstrate these classes by deploying them in a Java implementation of the ubiquitous Lempel-Ziv-Welch (LZW) data compression and decompression algorithm.
public class InputStream extends Object {
public abstract int read() throws IOException;
public int read(byte b[]) throws IOException;
public int read(byte b[], int off, int len) throws IOException;
public long skip(long n) throws IOException;
public int available() throws IOException;
public void close() throws IOException;
public synchronized void mark(int readlimit);
public synchronized void reset() throws IOException;
public boolean markSupported();
}
The main methods are the three overloaded read() variants that let you read a single byte or a block of bytes at a time. The last three methods (mark(), reset(), and markSupported()) allow subclasses of InputStream to implement a multibyte "unread" function. Clients can drop a mark in the input stream and revert back to it at a later time if the class in question supports the unread feature. Most don't because, by default, InputStream returns false when its markSupported() method is called, and most subclasses inherit markSupported() without overriding it. Class InputStream is abstract because a single component is left unimplemented: the building-block read() method, on which the other read() methods and the skip() method rely.
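The mark/reset mechanism is easiest to see in action with a stream class that does support it, such as BufferedInputStream. The following is a minimal sketch (the MarkResetDemo class and its demo() method are my own, not from the article's listings):

```java
import java.io.*;

public class MarkResetDemo {
    // Reads two bytes, rewinds to the mark with reset(), then re-reads.
    public static String demo() {
        try {
            InputStream in = new BufferedInputStream(
                    new ByteArrayInputStream("abcd".getBytes()));
            StringBuffer sb = new StringBuffer();
            if (in.markSupported()) {        // BufferedInputStream: true
                in.mark(16);                 // remember this position
                sb.append((char) in.read()); // 'a'
                sb.append((char) in.read()); // 'b'
                in.reset();                  // revert to the mark
                sb.append((char) in.read()); // 'a' again
            }
            return sb.toString();
        } catch (IOException e) {
            return "";
        }
    }
}
```

Calling demo() yields "aba": the third read() re-delivers the byte already consumed before reset().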
OutputStream is equally straightforward. It mainly provides equivalent write() methods to send bytes to any output stream (see Example 2). The mark/reset mechanism is not present in OutputStream because it doesn't make any sense on output.
public class OutputStream extends Object {
public OutputStream();
public abstract void write(int b) throws IOException;
public void write(byte b[]) throws IOException;
public void write(byte b[], int off, int len) throws IOException;
public void flush() throws IOException;
public void close() throws IOException;
}
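The single-byte and block-oriented read() and write() variants pair naturally into a copy loop. Here is a minimal sketch of my own (not one of the article's listings) showing the idiomatic pattern:

```java
import java.io.*;

public class StreamCopy {
    // Copies every byte from in to out using the block read()/write() variants.
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {   // -1 signals EOF
            out.write(buf, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }
}
```

Because both parameters are the abstract stream types, this one method works unchanged for files, byte arrays, pipes, and network streams.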
Table 1 shows the data sources and sinks supported by the core input and output stream classes. In addition to these, classes in the java.net networking package produce input and output stream objects that connect to various network sources (like a TCP stream) and destinations. Unlike the input stream classes, the output stream classes do not support writing data to a String since Java Strings are read only. Furthermore, you can expect Sun's future Java API additions to support and rely on stream objects.
Table 1:

Data Type         Input Class                Output Class
external files    FileInputStream            FileOutputStream
Strings           StringBufferInputStream    (none)
byte arrays       ByteArrayInputStream       ByteArrayOutputStream
pipes             PipedInputStream           PipedOutputStream
You can further divide both the input and output branches of java.io's inheritance hierarchy into two types of classes: those that connect directly to a concrete data source or sink (files, byte arrays, pipes, and so on), and filter classes that take another stream as their source or sink and enhance it. The concrete-source classes are easy to use; class FileInputStream, for example, has three constructors:
public FileInputStream(String name) throws FileNotFoundException;
public FileInputStream(File file) throws FileNotFoundException;
public FileInputStream(FileDescriptor fdObj);
All three constructors let you specify a file whose contents will feed read requests to the input stream. On the output side of the java.io hierarchy, the same approach is used (ByteArrayOutputStream being an exception). Listing One shows an example Java program that uses class FileInputStream to calculate character frequencies.
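A minimal sketch in the spirit of Listing One (the CharFreq class, its countFrequencies() helper, and their signatures are my own, not the article's code):

```java
import java.io.*;

public class CharFreq {
    // Tallies how often each byte value (0..255) occurs in the stream.
    public static int[] countFrequencies(InputStream in) throws IOException {
        int[] freq = new int[256];
        int b;
        while ((b = in.read()) != -1)    // read() returns -1 at EOF
            freq[b]++;
        in.close();
        return freq;
    }

    public static void main(String[] args) throws IOException {
        int[] freq = countFrequencies(new FileInputStream(args[0]));
        for (int i = 0; i < freq.length; i++)
            if (freq[i] > 0)
                System.out.println((char) i + ": " + freq[i]);
    }
}
```

Note that countFrequencies() accepts any InputStream, so the same counting logic works on byte arrays or network streams as well as files.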
public class FilterInputStream extends InputStream {
protected InputStream in;
protected FilterInputStream(InputStream in);
public int read() throws IOException;
public int read(byte b[]) throws IOException;
public int read(byte b[], int off, int len) throws IOException;
public long skip(long n) throws IOException;
public int available() throws IOException;
public void close() throws IOException;
public synchronized void mark(int readlimit);
public synchronized void reset() throws IOException;
public boolean markSupported();
}
Class FilterInputStream looks almost exactly like InputStream. The crucial difference is the constructor's signature: The FilterInputStream constructor takes any other InputStream object as an argument. This simple difference (along with the protected instance variable in) lets you chain input streams together to form input-processing pipelines of arbitrary complexity, all built from concrete, reusable instances of FilterInputStream subclasses. The same powerful and flexible chaining mechanism is present in the output-stream class hierarchy.
Figure 1: Package java.io inheritance hierarchy.
Figure 2 illustrates a fictitious input-processing pipeline. In Step 1, unencoded, uncensored, and unformatted text originating from the Internet enters the pipeline. Step 2 buffers this data in large chunks, thereby enhancing the performance of the entire pipeline. Step 3 decodes the text into plain ASCII. Step 4 enforces the CDA and substitutes certain unspeakable words with sequences of Xs (I entirely support the Free Speech On-line campaign, so this example is purely tongue-in-cheek.) Finally, Step 5 reformats the text so it fits properly on cheap, low-resolution displays hooked up to the NCs of the future.
Figure 2 shows the general raison d'être for all filter stream subclasses: enhancing stream functionality.
Figure 2: A fictitious input- processing pipeline.
Chained input or output pipelines frequently consist of three or more stream objects, so nesting all the constructor calls in a single expression can quickly become unmanageable. The preferred way of coding your chains is to use one constructor per line and declare variables to hold every object; see the coding template in Example 4.
<StreamType> varNameSource = new <StreamType>(concreteSource);
<StreamType> varNameLink1 = new <StreamType>(varNameSource);
<StreamType> varNameLink2 = new <StreamType>(varNameLink1);
  :                           :
<StreamType> varNameLinkN-1 = new <StreamType>(varNameLinkN-2);
<StreamType> varNameLinkN = new <StreamType>(varNameLinkN-1);
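Instantiated with real java.io classes, the template might look like the sketch below (the ChainDemo class and its helper methods are my own; DataInputStream and BufferedInputStream are both standard FilterInputStream subclasses):

```java
import java.io.*;

public class ChainDemo {
    // One constructor per line, one variable per link, as in the template.
    // Any InputStream can serve as the concrete source at the head of the chain.
    public static DataInputStream openChain(InputStream source) {
        InputStream buffered = new BufferedInputStream(source);
        DataInputStream data = new DataInputStream(buffered);
        return data;
    }

    // Convenience wrapper: reads one big-endian int through the chain.
    public static int readFirstInt(byte[] bytes) {
        try {
            return openChain(new ByteArrayInputStream(bytes)).readInt();
        } catch (IOException e) {
            return -1;
        }
    }
}
```

Swapping in a FileInputStream or a socket's input stream as the source requires no changes to the rest of the chain; that interchangeability is the whole point of the filter design.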
Supporting 1- to 32-bit bitfields suggests two new methods: readBitField() for the input class, and writeBitField() for the output class. These two methods are the logical analogs of read() and write().
On the input side, the return mechanism for readBitField() needs to use the same trick as read() as defined in InputStream. read() can return either a single byte or the integer -1 (which represents EOF). Since the integer value -1 would otherwise be indistinguishable from the byte value 255, read() returns a 32-bit int rather than an 8-bit byte. For the sake of consistency, but not elegance, I use the same approach, returning a 64-bit long to allow 32-bit bitfields and -1 for EOF. It would be much cleaner to use the Java exception mechanism to trap EOF. This technique is, in fact, used by class DataInputStream, but this is an exception (no pun intended).
Aligning bitfields on their natural bitfield boundaries is handled transparently: No methods are required to support arbitrary bitfield alignment. Altering the bitfield size suggests that the method setBitFieldSize(int bfSize) is present in both classes. Examples 5 and 6 show the interface definitions for these two classes.
Example 5: Class BitStreamInputStream interface.
public class BitStreamInputStream extends FilterInputStream {
public BitStreamInputStream(InputStream in)
public BitStreamInputStream(InputStream in, int bitFieldSize)
public long readBitField() throws IOException
public void setBitFieldSize(int bits) throws IllegalArgumentException
public int getBitFieldSize()
}
Example 6: Class BitStreamOutputStream interface.
public class BitStreamOutputStream extends FilterOutputStream {
public BitStreamOutputStream(OutputStream out)
public BitStreamOutputStream(OutputStream out, int bitFieldSize)
public void writeBitField(int bitField) throws IOException
public void setBitFieldSize(int bits) throws IllegalArgumentException
public int getBitFieldSize()
}
The getBitFieldSize() method is provided for completeness and symmetry. Both classes provide two constructor options. The simpler constructor creates streams supporting 8-bit bitfields. In other words, it produces an everyday byte-aligned byte stream by default. The second and (more useful) constructor allows you to explicitly specify the initial bitfield size for the stream.
Since I built my bit stream classes on top of byte stream classes (and not vice versa), I need to pack and align variable-sized bitfields into bytes for writing and unpack and align these same fields for reading.
To implement method BitStreamInputStream.readBitField(), I chose to use a single-byte buffer sitting between the stream and the bitfield assembly code. While this might be less efficient than using a larger buffer (say, 8 bytes, capable of holding the largest bitfield, 32 bits, under all alignment circumstances), it greatly simplifies implementation and EOF handling. The basic structure of the algorithm is shown in Example 7 in pseudocode.
bitField = 0
bitsToRead = current field size
while bitsToRead > 0
    if read buffer is empty, re-fill byte read buffer (and possibly
        generate EOF here)
    extract largest possible subfield (possibly entire bitfield requested)
    gradually build bitField up by adding subfield to bitField (using
        shifts and an OR)
    decrement bookkeeping variable tracking the number of unread bits
        in buffer (bitsInCache)
    decrement bitsToRead by the number of bits extracted in this iteration
return bitField
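The pseudocode above translates fairly directly into Java. The following is a simplified, self-contained sketch of the reading algorithm (the BitReader class name and its field names are my own; the article's real class wraps an InputStream via FilterInputStream and supports changing the bitfield size mid-stream):

```java
import java.io.*;

public class BitReader {
    private final InputStream in;
    private final int bitFieldSize;  // 1..32 bits per field
    private int buffer;              // single-byte read buffer
    private int bitsInCache;         // unread bits left in buffer

    public BitReader(InputStream in, int bitFieldSize) {
        this.in = in;
        this.bitFieldSize = bitFieldSize;
    }

    // Returns the next bitfield, or -1 on EOF (hence the long return type).
    public long readBitField() throws IOException {
        long bitField = 0;
        int bitsToRead = bitFieldSize;
        while (bitsToRead > 0) {
            if (bitsInCache == 0) {           // re-fill the one-byte buffer
                buffer = in.read();
                if (buffer == -1)
                    return -1;                // EOF (a partial field is dropped)
            bitsInCache = 8;
            }
            int take = Math.min(bitsToRead, bitsInCache);
            // Extract the largest possible subfield from the left-aligned
            // end of the buffer, right-aligned into 'partial'.
            int partial = (buffer >> (bitsInCache - take)) & ((1 << take) - 1);
            bitField = (bitField << take) | partial;   // shift and OR it in
            bitsInCache -= take;
            bitsToRead -= take;
        }
        return bitField;
    }
}
```

For example, reading one 12-bit field from the bytes 0xAB, 0xCD assembles 0xAB from the first byte, then ORs in the top nibble of the second, yielding 0xABC.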
Instead of using an explicit bitstream pointer (the equivalent of a filepointer for byte streams) to track which bits have already been consumed, the code keeps the next bitfield left-aligned in the read buffer. Bitfield fragments (called "partials" or "subfields" in the code) are extracted from the read buffer and right-aligned before being shifted left the necessary number of bits and ORed into the bitfield construction variable (bitField).
The choice of alignment directions can be reversed without affecting the basic algorithm, although this would mean more than changing the few << and >> operators. The algorithm and code are rather hard to understand because of their generic and very low-level character: Any bitfield size (from 1 to 32 bits inclusive), aligned to any bit boundary, should be fair game to the algorithm. To convince yourself of the correctness of the algorithm, you should try pen-and-paper simulations using the following test cases:
The bitfield-writing algorithm requires bit-shuffling operations similar to those in the reading algorithm, except it will be the write buffer (again a single byte) that receives bitfield fragments. These need to be aligned to the bitstream's current bit alignment. Whenever the write buffer contains eight bits of data, the buffer is flushed to make room for more bitfield data (either whole bitfields, if the current bitfield size is between 1 and 8, or parts of bitfields if the size is between 9 and 32).
While the bitfield-reading algorithm had to deal with encountering and passing an EOF, the writing algorithm has to ensure that all bitfield data is written when the stream is closed. This is done by overriding the close() method: When the BitStreamOutputStream is closed, it checks if there are any unwritten bits in the write buffer, and flushes them if they exist.
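A simplified sketch of the writing side, including the flush-on-close behavior just described (the BitWriter class is my own simplification; it writes one bit at a time for clarity, where the real class moves whole subfields, and it is a standalone class rather than a FilterOutputStream subclass):

```java
import java.io.*;

public class BitWriter {
    private final OutputStream out;
    private final int bitFieldSize;  // 1..32 bits per field
    private int buffer;              // single-byte write buffer
    private int bitsInBuffer;        // bits accumulated so far

    public BitWriter(OutputStream out, int bitFieldSize) {
        this.out = out;
        this.bitFieldSize = bitFieldSize;
    }

    public void writeBitField(int bitField) throws IOException {
        for (int i = bitFieldSize - 1; i >= 0; i--) {  // most significant bit first
            buffer = (buffer << 1) | ((bitField >> i) & 1);
            if (++bitsInBuffer == 8) {                 // buffer full: flush it
                out.write(buffer);
                buffer = 0;
                bitsInBuffer = 0;
            }
        }
    }

    // close() flushes any bits still sitting in the write buffer,
    // left-aligned and zero-padded, before closing the underlying stream.
    public void close() throws IOException {
        if (bitsInBuffer > 0)
            out.write(buffer << (8 - bitsInBuffer));
        out.close();
    }
}
```

Writing a single 12-bit field 0xABC and closing produces the two bytes 0xAB and 0xC0: the final nibble is padded out to a byte by close().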
When subclassing a class, you should pay careful attention to the semantics of your new class to ensure that these are as logical and as close to the original as possible. When designing my bit-stream classes, I had to decide the exact semantics for the inherited read() and write() methods. For the read() method, my options were:

- let read() return a bitfield of the current bitfield size;
- let read() skip ahead to the next byte boundary and return a byte-aligned byte;
- let read() return an arbitrarily aligned 8-bit bitfield.
While the first option is elegant, I rejected it because it deviates too much from the byte bias of the java.io classes. The read() and write() methods of all the java.io classes implicitly deal with 8-bit bytes. The second option can discard data depending on the alignment of the last bitfield and/or its size, so that option is rejected too. I chose reading an arbitrarily aligned 8-bit bitfield as the new behavior for read(). Symmetry forced me to use analogous semantics for the write() methods. Similar reasoning led me to decide that skip() should skip multiples of 8 bits (and not an arbitrary number of bits).
I ignored Mark's C source code and used his pseudocode listings to implement the compress() and decompress() methods in class LZW. At one point, Mark refers to having the "implementation blues." However, using my bit-stream classes and the Hashtable class in Java's java.util package, I found coding to be refreshingly simple.
As Shawn clearly explained in his article, one optimization of the LZW algorithm is to use dynamic code sizing (that is, bitfield sizes). While both authors found this to be an additional obstacle (which Shawn tackled), I found that, because of the bit-stream classes' dynamic bitfield size capability coupled with class Hashtable's unlimited capacity, the Java implementation did not share these obstacles.
The design of the LZW class (Listing Four) uses a helper class, LZWDictionary (Listing Five), to encapsulate the core LZW data structure: a two-way code-to-string/string-to-code look-up structure. While the previous articles on LZW used fixed-sized arrays (in C) to implement this functionality, I simply use two Hashtable objects, one per look-up direction. This implementation, therefore, does not have any "table full" conditions: class java.util.Hashtable transparently grows whenever necessary.
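A sketch of how such a two-way dictionary can be structured (the TwoWayDict class and its method names are my own guesses at the shape of Listing Five, not the article's code; I also use generics, which postdate the article, for brevity):

```java
import java.util.Hashtable;

public class TwoWayDict {
    // One Hashtable per look-up direction; both grow transparently,
    // so there is never a "table full" condition.
    private final Hashtable<String, Integer> stringToCode = new Hashtable<>();
    private final Hashtable<Integer, String> codeToString = new Hashtable<>();
    private int nextCode = 0;

    // Assigns the next free code to s and records both mappings.
    public void add(String s) {
        Integer code = nextCode++;
        stringToCode.put(s, code);
        codeToString.put(code, s);
    }

    public boolean contains(String s) { return stringToCode.containsKey(s); }
    public int codeFor(String s)      { return stringToCode.get(s); }
    public String stringFor(int code) { return codeToString.get(code); }
}
```

Compression looks strings up via codeFor(), decompression via stringFor(); keeping both tables in step inside add() guarantees the two directions never disagree.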
Do the new bit-stream classes cause the bottleneck? If you run the LZWClient program with the -prof JVM option, then study the resulting profiling report generated in the file java.prof, you'll see that the bit-stream classes and their expensive-looking bit manipulation operations are not the cause. The top cycle guzzlers are Hashtable's containsKey() and put(), and StringBuffer's append() and toString(). Class StringBuffer's involvement without my program's explicit usage of this class might seem surprising. The explanation lies with the behind-the-scenes reliance of Java compilers on class StringBuffer to translate simple string concatenation expressions like stringA + stringB into ((new StringBuffer(stringA)).append(stringB)).toString(). Method LZW.compress() does, indeed, use the concatenation operator ('+') to append every newly read source character to the current string, hence the numerous append() and toString() invocations. Because the performance hot spots are not located in the bit-stream classes, I leave the optimization challenge (and fun) to you.
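One obvious starting point for that optimization: replace the per-iteration '+' concatenation with a single explicit StringBuffer. The sketch below (my own illustration, not the article's code) contrasts the two patterns; both build the same string, but the first allocates a fresh StringBuffer and String on every iteration:

```java
public class ConcatDemo {
    // '+' in a loop: each iteration is compiled to roughly
    // current = (new StringBuffer(current)).append(c).toString();
    public static String withConcat(char[] chars) {
        String current = "";
        for (int i = 0; i < chars.length; i++)
            current = current + chars[i];
        return current;
    }

    // Reusing one StringBuffer avoids the per-iteration allocations.
    public static String withBuffer(char[] chars) {
        StringBuffer current = new StringBuffer();
        for (int i = 0; i < chars.length; i++)
            current.append(chars[i]);
        return current.toString();
    }
}
```

Applying this change inside LZW.compress() would cut the append()/toString() traffic the profiler flags, though whether it dominates the Hashtable costs is something you would have to measure.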