Database Management and Java

Dr. Dobb's Journal May 1997

A Java class that provides one interface to multiple file formats

By Art Sulger

Art is a system administrator for the State of New York and is currently involved with, among other things, a large web project. He can be contacted at asulger@ibm.net.

In "Database Management in C++" (DDJ, November 1994), I described a class structure implemented in C++ that provided a single interface to multiple file formats. Since that time, the programming environment has become much more heterogeneous because of the growth of the Internet. I recently needed a method to interface to Health Level 7 (HL7) messages, an ANSI specification for transmitting health-related information within the Application layer of the OSI network model. The class structure I previously presented in DDJ implements both xBase and CData file formats, both of which are fixed-length record formats. (The CData file format is described in C Database Development, Second Edition, by Al Stevens, MIS Press, 1991.) By contrast, an HL7 message is a complex variable-length file consisting of a variable number of different records composed of variable-length fields. HL7 messages are sent over the Internet. Applications that send and receive HL7 messages run on various computer architectures. Compiled languages like C++ require substantial modifications between the various versions of Windows, OS/2, Solaris, and other operating systems on the Internet. Additionally, we have to distribute tailored versions to each platform.

Java, a platform-independent, interpreted language, addresses many of these portability problems. Could the original design be modified to add support for variable-length fields? This would be a good test of the robustness of the basic class design.

In this article, I'll describe the changes required to port the more than 2300 lines of C++ code (published with my previous article) to Java. I'll also describe the changes required to add variable record-length file formats such as HL7. What I won't do is provide an in-depth description of how to manipulate HL7 data, which consists of "messages" containing an assortment of record types. The HL7 code here is offered as "proof-of-concept" only. You can use these classes, however, to read a record containing variable-length columns. As with the previous article, my main point is that encapsulating column behavior in a separate class dramatically extends the capability of common relational file formats.

You can use this model to interface to standard-format files, and the file structures can manage extended data types that they ordinarily could not. These data types can be record sequence numbers, pointers to multimedia files, calculated fields, and the like. Using this model, the files can store any data that Java can recognize. You can also extend some of the built-in Java classes to be special column types in the database. This may sound like an object-oriented database, but it really is just fundamental relational design. The code includes examples of a file field that displays values from an internal list, and a field that does C-like date arithmetic.

Rows and Columns

Relational database design is based on rows and columns. The columns should be strongly typed, which reduces ambiguity and errors when storing, displaying, or comparing column values. Critics of relational databases often ignore this strong typing requirement. A strongly typed column, though, can manage complex data types. Separating column objects from the table object is how columns can be strongly typed, independent of the underlying file-storage mechanism. In Figure 1, which illustrates the essence of the design, a Table class "has" one or more columnizer objects. The columnizer is the mechanism that does the typing, or, in relational terms, it describes the domain.

In the C++ design, everything descended from a master database class, LogicalDBMS (see Figure 2). The blue objects -- LogicalDBMS, Table, and Column -- are abstract classes. The yellow objects can be instantiated. Complex data types extend from the Column class. They are declared as arrays in the Table descendants. The essence of this design is a polymorphic Column array instantiated within a Table class.

I planned to mimic the C++ design completely by making an abstract Column that could be extended to support the complex data types that a modern database may contain. However, I also wanted to take advantage of the rich set of built-in Java classes. There is a way to do this in Java with "interfaces" -- types that declare methods but supply no implementation. The usefulness comes when other classes that do real work "implement" the interface. You can manipulate these objects using either the interface name or the class name. Java avoids some of the name and initialization confusion found in C++ multiple inheritance by forcing every concrete class that implements an interface to define all of the interface's methods.

Tables contain Columns. Columns have contradictory objectives. On the one hand, a Column should be completely expressive -- a Column containing sound should be able to play the sound, edit the sound, and do all the things sound can do. At the same time, the Table object should treat Columns in a uniform manner, whether the Column contains sounds, pictures, or numbers. Tables should be able to declare arrays of Columns and perform similar operations on each array member. The Java model uses an interface object to describe everything important about a Column; see Figure 3. Here is how it works:

The abstract class Table (abstract classes are again shown in blue) contains a Columnizer array (interface classes usually have names that end in "able," "izer," or "izable" because they change the behavior of other, real, classes). It is an array of unassigned references, because an interface cannot be instantiated. How, then, does a Table own the Columns? Simple. Create your column objects directly, or derive them from your own classes or any of the extensible Java classes. Any Column object you make must "implement Columnizer." After you instantiate the Column by calling new, assign it to an element of the Columnizer array (shown in green). The Columnizer array always remains an array of references. However, it now references a real Column object. You can use an element of the Columnizer array as if it were the real column to which the element points; see Example 1.
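As a rough illustration of the pattern just described (this is not the published Example 1), the sketch below declares a tiny interface, two column classes that implement it, and an array of interface references that is filled with real objects. The two-method Columnizer shown here and the TextColumn, CountColumn, and ColumnizerDemo names are assumptions made for the sketch; the real interface has more methods.

```java
// Hypothetical, trimmed-down Columnizer; the method set is illustrative only.
interface Columnizer {
    String display();
    void assign(String value);
}

// A column that stores text as-is.
class TextColumn implements Columnizer {
    private String value = "";
    public String display() { return value; }
    public void assign(String v) { value = v; }
}

// A column that stores a whole number and ignores surrounding blanks.
class CountColumn implements Columnizer {
    private int value;
    public String display() { return Integer.toString(value); }
    public void assign(String v) { value = Integer.parseInt(v.trim()); }
}

class ColumnizerDemo {
    // The array starts as unassigned references; each element is then
    // pointed at a real column object created with new.
    static Columnizer[] makeColumns() {
        Columnizer[] cols = new Columnizer[2];
        cols[0] = new TextColumn();
        cols[1] = new CountColumn();
        return cols;
    }

    public static void main(String[] args) {
        Columnizer[] cols = makeColumns();
        cols[0].assign("Smith");
        cols[1].assign(" 42 ");
        // The loop treats every column uniformly through the interface.
        for (int i = 0; i < cols.length; i++) {
            System.out.println(cols[i].display());
        }
    }
}
```

The loop never needs to know which concrete class sits behind each element, which is the uniform treatment the Table requires.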

Using an interface class gives an added capability. You can extend some of the Java built-in classes. Unfortunately, though, the Java designers marked as "final," or nonextensible, many of the classes that would be useful to extend as Column types. These are the classes that wrap String, Integer, and Character. You can build wrappers around these final classes, or you can make your column classes extend any built-in classes not marked final. There are many possibilities, as long as the Column class implements Columnizer. The examples include an xDate column that extends the Java Date class. It lets you perform date arithmetic on an xBase field.
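To make the idea of extending a built-in class concrete, here is a minimal sketch of a date column that extends java.util.Date. It is not the xDate class from the code listings: the XDateSketch name, the two-method Columnizer stand-in, the daysUntil() and addDays() helpers, and the choice of GMT for the arithmetic are all assumptions, and the sketch uses current Java library calls. It relies only on the fact that xBase stores dates as eight characters in YYYYMMDD form.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical stand-in for the real Columnizer interface.
interface Columnizer {
    String display();
    void assign(String value);
}

// A date column that is also a java.util.Date, so it inherits
// getTime(), before(), after(), and the rest of the Date behavior.
class XDateSketch extends Date implements Columnizer {
    private static final long MS_PER_DAY = 24L * 60 * 60 * 1000;

    XDateSketch(String yyyymmdd) { assign(yyyymmdd); }

    // Parse the eight-character xBase form into midnight GMT of that day.
    public void assign(String yyyymmdd) {
        int y = Integer.parseInt(yyyymmdd.substring(0, 4));
        int m = Integer.parseInt(yyyymmdd.substring(4, 6));
        int d = Integer.parseInt(yyyymmdd.substring(6, 8));
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("GMT"));
        cal.clear();
        cal.set(y, m - 1, d);   // Calendar months are zero-based
        setTime(cal.getTimeInMillis());
    }

    // Format back to the eight-character form.
    public String display() {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("GMT"));
        cal.setTime(this);
        return String.format("%04d%02d%02d", cal.get(Calendar.YEAR),
                cal.get(Calendar.MONTH) + 1, cal.get(Calendar.DAY_OF_MONTH));
    }

    // C-like date arithmetic: whole days between two dates.
    long daysUntil(Date other) {
        return (other.getTime() - getTime()) / MS_PER_DAY;
    }

    // Move this date forward (or backward, if negative) by whole days.
    void addDays(int days) {
        setTime(getTime() + days * MS_PER_DAY);
    }
}
```

Working in GMT keeps every day exactly 24 hours long, so the division in daysUntil() is not disturbed by daylight-saving changes.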

The Java version of the Column class keeps track of domain information in a similar way to the C++ class. Domain information includes whether the column is a string, a date, or a number, and the like. In general, the more precisely you can define a column's domain, the easier it is to write programs that use that data. This is the heart of expanding relational designs to include nonprimitive data types, such as multimedia. The column is the object. The column description, which is the domain, should precisely describe everything it can about the object -- how it is displayed, the paradigm in which it exists, and even the possible instances of the object. There is an HL7 column (HL7ColumnT0001) in the code listings that has a domain of only four values -- Male, Female, Other, and Unknown.

Column domains often consist of a combination of basic domain types. For example, a column in the NUMBER domain could also be in the FLOAT domain, as well as the RIGHT-JUSTIFIED domain. These are all valid possibilities. The C++ classes handled this by defining a column using a series of bits as an enumerated type. The enumerated types can be combined, if, for instance, a Column was a multitype instance, like FLOAT and NUMBER, or INTEGER and CURRENCY. You would then write the code to "express" the particular domain of the column. Java does not have enumerated types (yet), but the java.util BitSet object provides an easy way to do bit-level ANDs and ORs; see Example 2.
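The following sketch shows one way combined domains could be expressed with java.util.BitSet. It is not the published Example 2: the bit positions, the DomainSketch class, and the domainOf() and isIn() helpers are assumptions chosen for illustration, and the varargs syntax is from later Java versions.

```java
import java.util.BitSet;

class DomainSketch {
    // Hypothetical bit positions, one per basic domain.
    static final int NUMBER = 0;
    static final int FLOAT = 1;
    static final int RIGHT_JUSTIFIED = 2;
    static final int DATE = 3;

    // Build a domain by setting one bit per basic domain type.
    static BitSet domainOf(int... flags) {
        BitSet bits = new BitSet();
        for (int flag : flags) {
            bits.set(flag);
        }
        return bits;
    }

    // True when the column's domain includes every bit in 'required'.
    // BitSet.and() modifies its receiver, so work on a copy.
    static boolean isIn(BitSet columnDomain, BitSet required) {
        BitSet common = (BitSet) required.clone();
        common.and(columnDomain);
        return common.equals(required);
    }

    public static void main(String[] args) {
        BitSet price = domainOf(NUMBER, FLOAT, RIGHT_JUSTIFIED);
        System.out.println(isIn(price, domainOf(NUMBER, FLOAT)));
        System.out.println(isIn(price, domainOf(DATE)));
    }
}
```

Because and() and or() change the BitSet they are called on, cloning before a test keeps the column's stored domain intact.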

The Table Class

The Table is the superclass of all file formats. Tables handle such things as the underlying file I/O, if needed, and hold information like the number of rows in the table and, for fixed-length records, the length of the row. I was torn between naming this class "Row" and "Table." An examination of the methods convinced me to call it Table rather than Row. It makes more sense to "open" a Table, or get the "next" thing in a Table.

Extend this class into a specific file format that you want to implement. Each file will have specific constructors and specific open and close methods. For example, an xBase class will need to read the header information located in the first 32 bytes of the file, then read the column information located in subsequent 32-byte chunks. A class that implements JDBC must, on the other hand, create a java.sql.ResultSet object by executing a query, and also map the Table methods to the JDBC methods in Example 3.
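For the xBase case, a sketch of reading the 32-byte column descriptors might look like the following. It works on a byte array already read from the file and assumes the commonly documented dBASE III descriptor layout (name in bytes 0-10, type in byte 11, length in byte 16, decimal count in byte 17, and a 0x0D byte ending the list); check those offsets against your own files. XBaseFieldSketch and readFields() are hypothetical names, not part of the published classes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical holder for one 32-byte xBase column descriptor.
class XBaseFieldSketch {
    static final int CHUNK = 32;         // size of the header and of each descriptor
    static final int TERMINATOR = 0x0D;  // assumed end-of-descriptors marker

    final String name;
    final char type;
    final int length;
    final int decimals;

    XBaseFieldSketch(byte[] file, int pos) {
        // The name is NUL-padded within its 11-byte slot.
        int end = pos;
        while (end < pos + 11 && file[end] != 0) {
            end++;
        }
        name = new String(file, pos, end - pos, StandardCharsets.US_ASCII);
        type = (char) (file[pos + 11] & 0xFF);
        length = file[pos + 16] & 0xFF;    // mask: Java bytes are signed
        decimals = file[pos + 17] & 0xFF;
    }

    // Skip the 32-byte file header, then read descriptors until the
    // terminator byte or the end of the supplied bytes.
    static XBaseFieldSketch[] readFields(byte[] file) {
        int count = 0;
        for (int pos = CHUNK;
             pos + CHUNK <= file.length && (file[pos] & 0xFF) != TERMINATOR;
             pos += CHUNK) {
            count++;
        }
        XBaseFieldSketch[] fields = new XBaseFieldSketch[count];
        for (int i = 0; i < count; i++) {
            fields[i] = new XBaseFieldSketch(file, CHUNK + i * CHUNK);
        }
        return fields;
    }
}
```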

The HL7Table and the xTable use a StringBuffer object and a byte array to store data read by the RandomAccessFile object. I expected to have trouble accommodating Java's lack of pointers. After all, the original C++ code often contained constructs like Example 4.

At times, Java's strong typing did raise some hurdles. For instance, the Column object needs to know where the data is in the byte array. A C programmer would hand off a pointer and length indicator to the Column. To Java, however, a String is not an array of bytes, and because of Unicode support, even an array of bytes is not an array of characters. So the classes go through the bit of skullduggery in Example 5 to get the Columns to know where the data is. The readBuffer byte array is copied to a StringBuffer object. The Table is responsible for telling each column the offset and length of its data in the StringBuffer.
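The hand-off can be sketched roughly as follows; this is not the published Example 5. The RowBufferSketch class and its method names are hypothetical, and the sketch assumes single-byte text, decoding with ISO-8859-1 so that each byte becomes exactly one character and byte offsets remain valid as character offsets.

```java
import java.nio.charset.StandardCharsets;

class RowBufferSketch {
    // Copy the first 'count' bytes of the read buffer into a StringBuffer.
    // ISO-8859-1 maps every byte to one char, preserving offsets.
    static StringBuffer toStringBuffer(byte[] readBuffer, int count) {
        return new StringBuffer(
                new String(readBuffer, 0, count, StandardCharsets.ISO_8859_1));
    }

    // What a column would do with the offset and length the table gives it.
    static String columnText(StringBuffer row, int offset, int length) {
        return row.substring(offset, offset + length);
    }

    public static void main(String[] args) {
        // Made-up fixed-length row: a 1-byte delete flag, a 10-byte name,
        // and an 8-byte date.
        byte[] raw = " SMITH     19970501".getBytes(StandardCharsets.ISO_8859_1);
        StringBuffer row = toStringBuffer(raw, raw.length);
        System.out.println(columnText(row, 1, 10).trim());
        System.out.println(columnText(row, 11, 8));
    }
}
```

With a multibyte encoding the byte and character offsets would diverge, so the table would have to translate offsets after decoding rather than pass them through.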

I encountered another problem with Java's strong type checking. The xBase 4-byte integer is stored in little-endian byte order, and Java uses big-endian byte order. A C programmer would easily deal with this using unions, pointers, or arrays. Example 6 presents the rather wordy (even with several lines of error checking removed) Java equivalent.
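A compact way to do this conversion, shown here as a sketch rather than as the published Example 6, is to mask each byte and shift it into place. The LittleEndianSketch class and readIntLE() name are hypothetical.

```java
class LittleEndianSketch {
    // Assemble a 32-bit int from four bytes stored least-significant first.
    // Each byte is masked with 0xFF because Java bytes are signed.
    static int readIntLE(byte[] buf, int offset) {
        return (buf[offset] & 0xFF)
             | (buf[offset + 1] & 0xFF) << 8
             | (buf[offset + 2] & 0xFF) << 16
             | (buf[offset + 3] & 0xFF) << 24;
    }

    public static void main(String[] args) {
        // 10000 decimal is 0x00002710, stored as 10 27 00 00.
        byte[] recordCount = { 0x10, 0x27, 0x00, 0x00 };
        System.out.println(readIntLE(recordCount, 0));
    }
}
```

On later Java versions, java.nio.ByteBuffer with ByteOrder.LITTLE_ENDIAN performs the same conversion without hand-written shifts.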

Some Things are Easier

Multiplatform coding in Java usually is simpler because it runs on a Java machine that is identical across platforms. For example, the main purpose of the C++ topmost class, LogicalDBMS, was to ensure portability between various operating environments. Java eliminates this coding by providing a uniform operating environment. Another difficult problem in multiplatform support is how to signal error conditions. Java also makes this problem easier with its extensible error and exception classes. It also forces you to think about the various array-overrun type errors and provides a mechanism for expanding on the provided error description. Therefore, many of the Column methods that formerly returned the enum SYSERR data type now return true or false, or are declared void. The error is instead "thrown" to a DatabaseException class in the exception chain; see Example 7.

You can customize error messages to better fit the run platform. The example merely invokes the parent Exception class or prints to the Java console. If you use the database classes in a windowed environment, the DatabaseException class can invoke a dialog. If you use the classes in a batch mode, the class can emit to standard error or a log file.
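The shift from return codes to exceptions can be sketched as below. The DatabaseException shown is a hypothetical stand-in for the class of the same name in the code listings, and CheckedColumnSketch is invented for the illustration; neither reproduces the published Example 7.

```java
// Hypothetical stand-in: it simply hands the message to the parent Exception.
class DatabaseException extends Exception {
    DatabaseException(String message) {
        super(message);
    }
}

// A column whose assign() is void and reports problems by throwing,
// instead of returning an error code that callers might ignore.
class CheckedColumnSketch {
    private final int maxLength;
    private String value = "";

    CheckedColumnSketch(int maxLength) {
        this.maxLength = maxLength;
    }

    void assign(String v) throws DatabaseException {
        if (v.length() > maxLength) {
            throw new DatabaseException("value of length " + v.length()
                    + " exceeds column length " + maxLength);
        }
        value = v;
    }

    String display() {
        return value;
    }

    public static void main(String[] args) {
        CheckedColumnSketch col = new CheckedColumnSketch(5);
        try {
            col.assign("abc");
            col.assign("much too long");
        } catch (DatabaseException e) {
            // A windowed program might show a dialog here instead.
            System.err.println(e.getMessage());
        }
    }
}
```

Because DatabaseException is a checked exception, the compiler makes every caller of assign() either handle the failure or declare it.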

Changes for Variable-Length Fields

You can read and write xBase and CData files very quickly. Seeks go directly to any record offset; new records are added to the end of the file; deleted records are never actually removed, only marked. Variable-length records like the Observation Request (OBR) and Patient Identification (PID) segments in HL7, on the other hand, require a calculation of each row length to travel to the next record. Random reads are not possible unless you maintain a separate row-location index.

Some optimizations are possible, though. In this implementation, each HL7 row starts with a row-length indicator. There is also an array of offsets that give the location of each column. Given such an array of offsets, the private parseRow() method passes the location of the raw data to the column object when you want the column to display itself, or want to assign data to it. Whenever a column is updated, you need to update the row's offset array as well as the row-length indicator, because the column length might change. If you didn't have the array, you would have to build it dynamically the first time a column from a row needed to be accessed, either for displaying or updating.

This is one of many ways to handle a variable-length record. Another way would be to expand each column to maximum length and treat records as fixed length, which, of course, they then would be. The method I used is a preliminary design and may not necessarily be the best for all circumstances. However, it does show that a physical database design treating the row object separately from the Column is easy to extend. In fact, HL7 table classes required about the same coding as fixed-length formats. The primary difference is that the columns must be initialized each time the row is accessed, whereas the fixed-length format classes only required this processing during the open method.
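One way to build such an offset array is to scan the row once for the field separator. The sketch below assumes the usual HL7 field separator, the vertical bar, and uses a made-up PID-like row; VariableRowSketch, fieldOffsets(), and field() are hypothetical names, and the real parseRow() may well work differently.

```java
class VariableRowSketch {
    // Return the start offset of every field, plus one sentinel entry
    // (row length + 1) so each field's end is "next offset - 1".
    static int[] fieldOffsets(String row, char separator) {
        int count = 1;
        for (int i = 0; i < row.length(); i++) {
            if (row.charAt(i) == separator) {
                count++;
            }
        }
        int[] offsets = new int[count + 1];
        int n = 0;
        offsets[n++] = 0;
        for (int i = 0; i < row.length(); i++) {
            if (row.charAt(i) == separator) {
                offsets[n++] = i + 1;   // next field starts after the separator
            }
        }
        offsets[n] = row.length() + 1;  // sentinel
        return offsets;
    }

    // Extract one field (zero-based) using the precomputed offsets.
    static String field(String row, int[] offsets, int index) {
        return row.substring(offsets[index], offsets[index + 1] - 1);
    }

    public static void main(String[] args) {
        String row = "PID|1||12345|DOE^JOHN";  // illustrative data only
        int[] offsets = fieldOffsets(row, '|');
        for (int i = 0; i < offsets.length - 1; i++) {
            System.out.println(i + ": [" + field(row, offsets, i) + "]");
        }
    }
}
```

After a column is rewritten with a value of a different length, every later offset shifts, which is why the array has to be rebuilt or adjusted on update.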

Using the Classes

Listing One creates an xBase file that is a subset of an existing xBase file. The complete code, available electronically (see "Availability," page 3), includes some short xBase and HL7PID records for testing. Additionally, the HL7PIDTable.java source includes a main section that scans through the test data. The Java code includes extensive, built-in documentation that you can compile using javadoc.

Conclusion

Java required half as many statements as the C++ code. Java lags behind in speed. The C++ sample application found strings in xBase files at least ten times faster than a Java application. I am not daunted by the lack of speed. Java is typically used in a client/server application, where network speed might be an order of magnitude slower than file I/O. Furthermore, large amounts of data can be stored in a fast SQL database and these classes will be responsible for dealing with subsets of that data via JDBC calls.

DDJ

Listing One

//---------------------begin source file--------------
import java.util.*;
import java.io.*;
import java.lang.*;

public class testX
   {
   public static void main(String [] args)
      {
      try{
       DataInputStream in = new DataInputStream(System.in);
       int i;
       String fname, outName;
       if (args.length < 1) 
          {
          System.out.println
          ("this extracts rows from an xBase file\n" +
           "you will be asked for a filename, column number and value,\n" +
           "the program will create a new xBase file composed of rows\n" +
           "that match the search\n");
          System.out.print("enter xBase file to read:");
          System.out.flush();
          fname = in.readLine();
          } 
       else 
          fname = args[0];
       Table t = new xTable(fname,true); 
       System.out.print("enter xBase file to create:");
       System.out.flush();
       outName = in.readLine();
       xColumn[]xcols = new xColumn[t.columnCount()];
       int iOffset = 1;
       for (i = 1;i <= t.columnCount();i++)
          {
          xcols[i - 1] = new xColumn(new StringBuffer(),iOffset,
          t.columnLength(i),t.columnName(i),
          t.columnDomain(i),
          t.columnDecimals(i)," ");
          iOffset += t.columnLength(i);
          }
       Table newX = new xTable(outName, xcols);
       for (i = 1; i <= t.columnCount(); i++)
          {
          System.out.print(i + ": " + t.columnName(i) + "\t");
          }
       System.out.print("\nenter column number to search on,\n" +
       "or enter 0 (zero) to skip the search test: ");
       System.out.flush();
       int C = 0;
       try{
          C = Integer.parseInt(in.readLine());
          }
       catch(NumberFormatException e)
          {
          System.out.println
             ("Invalid Number Format:" + e.getMessage());
          System.exit(0);
          }
       if (C != 0) 
          {
          if(!t.isColumn(C))
             {                  
             System.out.println("Invalid column number " + C);
             System.exit(0);
             }
          System.out.print
          ("enter string or string prefix to search for: ");
          System.out.flush();
          String searchTerm = in.readLine();
          System.out.println(new Date());
          String cName = t.columnName(C);
          System.out.println
          ("looking for " + searchTerm + " in " + cName);
          System.out.flush();
          try{
             while(t.next())
                {
                if (t.isMatch(C, searchTerm))
                   {
                   newX.newRow();
                   for (int j = 1; j <= t.columnCount(); j++)
                      {
                      System.out.print(t.display(j) + "\t");
                      newX.assign(j, t.display(j));
                      }
                   newX.write();
                   System.out.println(" ");
                   System.out.flush();
                   }
                }
             t.close();
             newX.close();
             System.out.println(new Date());
             System.out.flush();
             }
          catch (DatabaseException e)
             {System.out.println(e.getMessage());}
          }
       }
      catch (DatabaseException e)
         {System.out.println(e.getMessage());}
      catch (IOException e)
         {System.out.println(e.getMessage());}
      }
  } // end of class
