Extensible Data Processing Without Inheritance

C/C++ Users Journal August, 2005

Processor objects don't need to be related through class hierarchies

By Geoffrey C. Wedig and Stephen Gross

Geoffrey Wedig is a senior developer at Case Western Reserve University, designing statistical modeling software for genetic research. He can be contacted at wedig@darwin.epbi.cwru.edu. Stephen Gross also works at CWRU. He can be contacted at sgross@darwin.epbi.cwru.edu.

Statistical Analysis for Genetic Epidemiology, or S.A.G.E., is a suite of software tools for researchers in the field of genetic epidemiology (http://darwin.cwru.edu/). These tools provide statistical analysis of genetic data with the goal of identifying the genes underlying complex diseases. At present, S.A.G.E. includes a dozen different programs. Each can perform one or several different analyses on its data, and all are built on a common set of libraries and tools.

Analysis definitions are supplied to the programs through parameter files. These text-based files consist of a list of analysis configurations to be performed. Each analysis type is identified by a name ("foo_analysis," "bar_analysis," and so on) and contains the analysis-specific settings. When a program in the suite executes, it parses each of the parameter file's analysis configurations according to analysis-specific syntax rules. Each S.A.G.E. program recognizes a different set of analysis configurations.

Traditionally, each program was required to create its own custom structures for parsing and containing the data and analysis configurations associated with a run. This was simplified somewhat by a common base class that stores the data common to all programs; applications derive from it and add whatever features are unique to their needs. Even so, each program still needed a lot of effort to customize the parsing routines. This led to a great deal of redundant code that, because it was specific to the analysis type, could not be easily generalized. This was particularly troubling when writing component tests, each of which needed component-specific versions of the same routines.

We wanted to create a new library for parsing and containing analysis definitions, requiring a minimal effort from the application using the library. At most, we wanted the application to have to define the specific analysis and its parameters and how to parse it (the parsing itself is also generalized, but is not the subject of this article). We didn't want to require a lot of boilerplate code to make it work.

The requirements of the new system were:

Provide a methodology for specifying new analysis configurations and the parsing rules thereof.
Be able to add these new configurations when needed, without extensive coding of derived types.
Have the system classify each analysis configuration it encounters and process it correctly using the appropriate parsing rules.

If the list of allowable analysis configurations were fixed, this would not be a difficult problem. We could write a parser for each analysis type and then, in some ParsingManager class, iterate across the contents of the input file and use a switch to choose which fooParser to use on a particular analysis configuration. Unfortunately, the requirement that we be able to add new analysis types as needed makes things more difficult.

Our first idea was an inherited parser schema, with a BaseParser class containing a virtual processInput() function that could be overridden. These derived parsers could be registered with the primary processor, which would classify each analysis and pass it to the appropriate parser. This didn't involve the same level of boilerplate that inheriting from the processor itself would, but it still required a fixed layout and was therefore hard to extend should unforeseen parsing needs emerge. We preferred a system in which the parsing routines were required to support only a single member function, one that takes the argument to be processed and returns an object of the correct analysis type. But how could we make the processor accept a number of parsers that do not share a familial relationship through inheritance? To make the problem even more complex, we wanted the solution to be general: usable whenever we have an iterable list of data elements, each of which must be processed based upon some classification criterion.
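For comparison, the inherited parser schema that we set aside might look roughly like the sketch below. Only BaseParser, processInput(), and ParsingManager are names taken from the text; the InputObj fields, FooParser, registerParser(), and the std::string return value are assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical input record: a category name plus the raw text to parse.
struct InputObj {
    std::string category;
    std::string text;
};

// Every parser is forced to derive from this base class.
class BaseParser {
public:
    virtual ~BaseParser() {}
    virtual std::string processInput(const InputObj& in) = 0;
};

class FooParser : public BaseParser {
public:
    std::string processInput(const InputObj& in) { return "foo:" + in.text; }
};

// Classifies by category name and dispatches through the virtual function.
class ParsingManager {
public:
    // Hypothetical registration call; the manager does not own the parser.
    void registerParser(const std::string& name, BaseParser* p) { parsers_[name] = p; }

    std::string process(const InputObj& in) {
        std::map<std::string, BaseParser*>::iterator it = parsers_.find(in.category);
        if (it == parsers_.end()) return "unhandled";
        return it->second->processInput(in);
    }
private:
    std::map<std::string, BaseParser*> parsers_;
};

inline std::string demoInheritedParser() {
    FooParser foo;
    ParsingManager mgr;
    mgr.registerParser("foo_analysis", &foo);
    InputObj in = {"foo_analysis", "x=1"};
    std::string result = mgr.process(in);
    assert(result == "foo:x=1");
    return result;
}
```

The dispatch itself is simple, but every parser must fit the BaseParser layout, which is the constraint the rest of the article works to remove.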

The Solution

Our solution required an interesting mix of templates and function pointers. To begin, consider a simple, nonclass-based solution to the requirement that processor objects need not be derived from a common base. What if we write the processing logic in standalone functions, then store pointers to those functions? Each function must follow a predefined signature and is registered under the name of the input category it handles. The ProcessorMgr (available at http://www.cuj.com/code/) then stores a map of strings to function pointers; see Listing 1.
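The published listing is not reproduced in this text, so here is a minimal sketch of the function-pointer idea. The InputObj layout, the std::string result, and the parseFoo()/parseBar() functions are assumptions; only ProcessorMgr and the map of strings to function pointers come from the description above.

```cpp
#include <cassert>
#include <map>
#include <string>

struct InputObj { std::string text; };  // hypothetical input type

// The predefined signature every standalone processor function must follow.
typedef std::string (*ProcessorFunc)(const InputObj&);

std::string parseFoo(const InputObj& in) { return "foo:" + in.text; }
std::string parseBar(const InputObj& in) { return "bar:" + in.text; }

class ProcessorMgr {
public:
    void addProcessor(const std::string& name, ProcessorFunc f) { funcs_[name] = f; }

    // Returns false when no function is registered under 'name'.
    bool processInput(const std::string& name, const InputObj& in, std::string& out) const {
        std::map<std::string, ProcessorFunc>::const_iterator it = funcs_.find(name);
        if (it == funcs_.end()) return false;
        out = it->second(in);
        return true;
    }
private:
    std::map<std::string, ProcessorFunc> funcs_;
};

inline std::string demoFunctionPointers() {
    ProcessorMgr mgr;
    mgr.addProcessor("foo_analysis", &parseFoo);
    mgr.addProcessor("bar_analysis", &parseBar);
    InputObj in = {"x=1"};
    std::string out;
    bool ok = mgr.processInput("bar_analysis", in, out);
    assert(ok);
    (void)ok;  // keeps the variable used when assertions are compiled out
    return out;
}
```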

This certainly works, although we've eliminated all the advantages of object-oriented design from our solution. If we could store member function pointers in the ProcessorMgr, we could let users write complete processor classes rather than standalone processor functions. Unfortunately, member function pointers are bound to specific classes, and there is no way to store a container of member functions that are bound to a mixture of classes. What if we require that the processing member function have a specific name instead? That works, but we need a way to store the processors internally. Because they share no ancestry, the only thing we can store is a void*. We'd like the code to look something like Listing 2.

Listing 2 is a nice idea, but it is missing crucial information to make it work, namely the types of the processors. Could we, perhaps, templatize processInput() so that it knows how to cast the void pointer it finds (Listing 3)?
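A sketch of this intermediate step, under the assumption that the required member function is operator() (as it is later in the article) and that processors return a std::string, could look as follows. Note that the caller has to name the processor type at every call, and naming the wrong type is undefined behavior.

```cpp
#include <cassert>
#include <map>
#include <string>

struct InputObj { std::string text; };  // hypothetical input type

// Two processors that share no base class.
struct FooParser {
    std::string operator()(const InputObj& in) { return "foo:" + in.text; }
};
struct BarParser {
    std::string operator()(const InputObj& in) { return "bar:" + in.text; }
};

class ProcessorMgr {
public:
    // The static type is lost here: only a void* is stored (non-owning).
    template <class PROCESSOR_TYPE>
    void addProcessor(const std::string& name, PROCESSOR_TYPE* p) { processors_[name] = p; }

    // The caller must supply the exact type that was registered under 'name'.
    template <class PROCESSOR_TYPE>
    std::string processInput(const std::string& name, const InputObj& in) {
        std::map<std::string, void*>::iterator it = processors_.find(name);
        if (it == processors_.end()) return "unhandled";
        PROCESSOR_TYPE* p = static_cast<PROCESSOR_TYPE*>(it->second);
        return (*p)(in);
    }
private:
    std::map<std::string, void*> processors_;
};

inline std::string demoCallerNamesType() {
    FooParser foo;
    BarParser bar;
    ProcessorMgr mgr;
    mgr.addProcessor("foo_analysis", &foo);
    mgr.addProcessor("bar_analysis", &bar);
    InputObj in = {"x=1"};
    std::string result = mgr.processInput<BarParser>("bar_analysis", in);
    assert(result == "bar:x=1");
    return result;
}
```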

That would certainly work, but it places the burden of determining which processor to use on users, rather than having ProcessorMgr figure it out, which makes you wonder why the object is there at all. We have to make ProcessorMgr figure out what type to cast the void* to. What if ProcessorMgr had a member function, templatized on the processor type, that could do this cast for us? How would we make sure the right version of this templatized proxy function is called? Merging the previous examples shows us how. Because ProcessorMgr knows its own type, it can store pointers to its own member functions along with the processor to be cast (Listing 4).

Now, the addProcessor() function is properly templatized on the processor type. When a processor is added, it is still stored as a void*, but in addition, a pointer to the proxyFunc() member function is stored as well. This is a function that is templatized on the processor type, and therefore, can correctly cast a void pointer back to its original PROCESSOR_TYPE, and then invoke the operator() on that object. When an InputObj is processed using processInput(), the function locates the ProcessorInfo corresponding to the given input category name and invokes the correctly templatized proxy function, which recasts our processor. Using function pointers, we have effectively templatized the correct function call in advance of its invocation. You are now able to write your own processor object in whatever manner is needed. As long as that class has an operator() function with the correct signature, the processor works.
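The mechanism described above can be sketched as follows. The names ProcessorMgr, ProcessorInfo, addProcessor(), proxyFunc(), processInput(), and PROCESSOR_TYPE follow the text; the InputObj layout, the std::string result, the "unhandled" fallback, and the two parser classes are assumptions for illustration, not the published listing.

```cpp
#include <cassert>
#include <map>
#include <string>

struct InputObj { std::string text; };  // hypothetical input type

class ProcessorMgr {
    // Pointer to a member of ProcessorMgr itself: the type-restoring proxy.
    typedef std::string (ProcessorMgr::*ProxyFunc)(void*, const InputObj&);

    struct ProcessorInfo {
        void*     processor;  // type-erased processor (non-owning)
        ProxyFunc proxy;      // instantiation of proxyFunc that knows the real type
    };

    // One instantiation per processor type; casts back and invokes operator().
    template <class PROCESSOR_TYPE>
    std::string proxyFunc(void* p, const InputObj& in) {
        return (*static_cast<PROCESSOR_TYPE*>(p))(in);
    }

public:
    // The processor type is known here, so the matching proxy is chosen now.
    template <class PROCESSOR_TYPE>
    void addProcessor(const std::string& name, PROCESSOR_TYPE* p) {
        ProcessorInfo info;
        info.processor = p;
        info.proxy = &ProcessorMgr::proxyFunc<PROCESSOR_TYPE>;
        infos_[name] = info;
    }

    // No template parameter needed: the stored proxy performs the cast.
    std::string processInput(const std::string& name, const InputObj& in) {
        std::map<std::string, ProcessorInfo>::iterator it = infos_.find(name);
        if (it == infos_.end()) return "unhandled";
        return (this->*(it->second.proxy))(it->second.processor, in);
    }

private:
    std::map<std::string, ProcessorInfo> infos_;
};

// Unrelated processor classes; each only needs a suitable operator().
struct FooParser {
    std::string operator()(const InputObj& in) { return "foo:" + in.text; }
};
struct BarParser {
    std::string operator()(const InputObj& in) { return "bar:" + in.text; }
};

inline std::string demoProxy() {
    FooParser foo;
    BarParser bar;
    ProcessorMgr mgr;
    mgr.addProcessor("foo_analysis", &foo);
    mgr.addProcessor("bar_analysis", &bar);
    InputObj in = {"x=1"};
    std::string result = mgr.processInput("foo_analysis", in) + "|" +
                         mgr.processInput("bar_analysis", in);
    assert(result == "foo:x=1|bar:x=1");
    return result;
}
```

The cast is safe because the proxy pointer and the void* are always stored together in addProcessor(), at the one point where the real type is known.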

There are still a few problems, however. First, the processInput() function must have the classification fed to it from an external source. We'd like ProcessorMgr itself to determine which processor to use. We'd also like ProcessorMgr not to be restricted to strings as a classification method. Listing 5 shows something like what we want: it uses an arbitrary classification schema and never tells the processor manager what to do with a specific InputObj.

To make this work, we templatize ProcessorMgr on a CLASSIFIER_TYPE—a functor that takes an InputObj and returns that InputObj's category as a CLASSIFIER_TYPE::return_type. Then, in processInput(), we search for the correct ProcessorInfo based on the classification of the input type (Listing 6).
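As a rough illustration of this step, the sketch below classifies inputs by an integer code rather than a string. The kind field, KindClassifier, and the parser classes are hypothetical; the requirement that the classifier expose a return_type and be callable on an InputObj follows the description above.

```cpp
#include <cassert>
#include <map>
#include <string>

struct InputObj { int kind; std::string text; };  // hypothetical input type

// Hypothetical classifier: the category is the integer 'kind' field.
struct KindClassifier {
    typedef int return_type;
    return_type operator()(const InputObj& in) const { return in.kind; }
};

template <class CLASSIFIER_TYPE>
class ProcessorMgr {
    typedef typename CLASSIFIER_TYPE::return_type Category;
    typedef std::string (ProcessorMgr::*ProxyFunc)(void*, const InputObj&);
    struct ProcessorInfo { void* processor; ProxyFunc proxy; };

    template <class PROCESSOR_TYPE>
    std::string proxyFunc(void* p, const InputObj& in) {
        return (*static_cast<PROCESSOR_TYPE*>(p))(in);
    }

public:
    explicit ProcessorMgr(const CLASSIFIER_TYPE& c = CLASSIFIER_TYPE()) : classifier_(c) {}

    template <class PROCESSOR_TYPE>
    void addProcessor(const Category& cat, PROCESSOR_TYPE* p) {
        ProcessorInfo info;
        info.processor = p;
        info.proxy = &ProcessorMgr::template proxyFunc<PROCESSOR_TYPE>;
        infos_[cat] = info;
    }

    // The manager classifies the input itself; the caller passes no category.
    std::string processInput(const InputObj& in) {
        typename std::map<Category, ProcessorInfo>::iterator it =
            infos_.find(classifier_(in));
        if (it == infos_.end()) return "unhandled";
        return (this->*(it->second.proxy))(it->second.processor, in);
    }

private:
    CLASSIFIER_TYPE classifier_;
    std::map<Category, ProcessorInfo> infos_;
};

struct FooParser {
    std::string operator()(const InputObj& in) { return "foo:" + in.text; }
};
struct BarParser {
    std::string operator()(const InputObj& in) { return "bar:" + in.text; }
};

inline std::string demoClassifier() {
    FooParser foo;
    BarParser bar;
    ProcessorMgr<KindClassifier> mgr;
    mgr.addProcessor(1, &foo);
    mgr.addProcessor(2, &bar);
    InputObj a = {1, "x"};
    InputObj b = {2, "y"};
    std::string result = mgr.processInput(a) + "|" + mgr.processInput(b);
    assert(result == "foo:x|bar:y");
    return result;
}
```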

So far, we have been using InputObj as the standard input type to a processor object. Again, to make things general, we modify the code so that ProcessorMgr is also templatized on the input type. In addition, we provide a default classifier object that simply returns the input itself, so we can process any sortable type without having to define a new classification function (Listing 7).
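A minimal sketch of this generalization follows. IdentityClassifier, INPUT_TYPE, and the handler classes are hypothetical names, and the std::string result is again an assumption. Because categories are used as std::map keys here, "sortable" means the category type must support operator<.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical default classifier: the input is its own category.
template <class INPUT_TYPE>
struct IdentityClassifier {
    typedef INPUT_TYPE return_type;
    return_type operator()(const INPUT_TYPE& in) const { return in; }
};

template <class INPUT_TYPE, class CLASSIFIER_TYPE = IdentityClassifier<INPUT_TYPE> >
class ProcessorMgr {
    typedef typename CLASSIFIER_TYPE::return_type Category;
    typedef std::string (ProcessorMgr::*ProxyFunc)(void*, const INPUT_TYPE&);
    struct ProcessorInfo { void* processor; ProxyFunc proxy; };

    template <class PROCESSOR_TYPE>
    std::string proxyFunc(void* p, const INPUT_TYPE& in) {
        return (*static_cast<PROCESSOR_TYPE*>(p))(in);
    }

public:
    template <class PROCESSOR_TYPE>
    void addProcessor(const Category& cat, PROCESSOR_TYPE* p) {
        ProcessorInfo info;
        info.processor = p;
        info.proxy = &ProcessorMgr::template proxyFunc<PROCESSOR_TYPE>;
        infos_[cat] = info;
    }

    std::string processInput(const INPUT_TYPE& in) {
        typename std::map<Category, ProcessorInfo>::iterator it =
            infos_.find(classifier_(in));
        if (it == infos_.end()) return "unhandled";
        return (this->*(it->second.proxy))(it->second.processor, in);
    }

private:
    CLASSIFIER_TYPE classifier_;
    std::map<Category, ProcessorInfo> infos_;
};

// Hypothetical handlers keyed directly on the input string.
struct StartHandler {
    std::string operator()(const std::string&) { return "starting"; }
};
struct StopHandler {
    std::string operator()(const std::string&) { return "stopping"; }
};

inline std::string demoDefaultClassifier() {
    StartHandler start;
    StopHandler stop;
    ProcessorMgr<std::string> mgr;  // uses IdentityClassifier<std::string>
    mgr.addProcessor("start", &start);
    mgr.addProcessor("stop", &stop);
    std::string result = mgr.processInput("start") + "|" +
                         mgr.processInput("stop") + "|" +
                         mgr.processInput("pause");
    assert(result == "starting|stopping|unhandled");
    return result;
}
```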

At this point, the majority of the system is in place. We can take any input, classify that input based upon a function, and process it, while placing only minimal restrictions on the interface and implementation of the processors. All that is left to do is some cleanup. We added default processing for unclassified types, replaced the processor pointers in the addProcessor() function with a functor-style interface, and so on. Internally, we use shared pointers to void to clean up our memory management. The result was an easily extensible method for classifying data and processing it based upon that classification.
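The cleanup steps might be sketched as below. This is a guess at the shape of the final interface, not the published code: setDefaultProcessor(), makeInfo(), IdentityClassifier, and the demo functors are hypothetical, std::shared_ptr stands in for whatever shared pointer the original used, and returning an empty string when nothing matches is an arbitrary choice.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

template <class INPUT_TYPE>
struct IdentityClassifier {
    typedef INPUT_TYPE return_type;
    return_type operator()(const INPUT_TYPE& in) const { return in; }
};

template <class INPUT_TYPE, class CLASSIFIER_TYPE = IdentityClassifier<INPUT_TYPE> >
class ProcessorMgr {
    typedef typename CLASSIFIER_TYPE::return_type Category;
    typedef std::string (ProcessorMgr::*ProxyFunc)(void*, const INPUT_TYPE&);

    struct ProcessorInfo {
        std::shared_ptr<void> processor;  // owns a copy; remembers the real deleter
        ProxyFunc proxy;
        ProcessorInfo() : proxy(0) {}
    };

    template <class PROCESSOR_TYPE>
    std::string proxyFunc(void* p, const INPUT_TYPE& in) {
        return (*static_cast<PROCESSOR_TYPE*>(p))(in);
    }

    // Copies the functor onto the heap and pairs it with its proxy.
    template <class PROCESSOR_TYPE>
    ProcessorInfo makeInfo(const PROCESSOR_TYPE& proc) {
        ProcessorInfo info;
        info.processor = std::make_shared<PROCESSOR_TYPE>(proc);
        info.proxy = &ProcessorMgr::template proxyFunc<PROCESSOR_TYPE>;
        return info;
    }

public:
    // Functor-style interface: processors are passed by value, not by pointer.
    template <class PROCESSOR_TYPE>
    void addProcessor(const Category& cat, const PROCESSOR_TYPE& proc) {
        infos_[cat] = makeInfo(proc);
    }

    // Handles any input whose category has no registered processor.
    template <class PROCESSOR_TYPE>
    void setDefaultProcessor(const PROCESSOR_TYPE& proc) { default_ = makeInfo(proc); }

    std::string processInput(const INPUT_TYPE& in) {
        typename std::map<Category, ProcessorInfo>::iterator it =
            infos_.find(classifier_(in));
        ProcessorInfo& info = (it != infos_.end()) ? it->second : default_;
        if (!info.proxy) return std::string();  // no match and no default
        return (this->*(info.proxy))(info.processor.get(), in);
    }

private:
    CLASSIFIER_TYPE classifier_;
    std::map<Category, ProcessorInfo> infos_;
    ProcessorInfo default_;
};

struct Labeler {
    std::string label;
    explicit Labeler(const std::string& l) : label(l) {}
    std::string operator()(const int&) const { return label; }
};
struct Fallback {
    std::string operator()(const int&) const { return "unclassified"; }
};

inline std::string demoCleanup() {
    ProcessorMgr<int> mgr;
    mgr.addProcessor(1, Labeler("one"));
    mgr.setDefaultProcessor(Fallback());
    std::string result = mgr.processInput(1) + "|" + mgr.processInput(99);
    assert(result == "one|unclassified");
    return result;
}
```

A shared pointer to void still destroys the stored object correctly, because the deleter for the real type is captured when the pointer is created.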

Acknowledgment

S.A.G.E. (Statistical Analysis for Genetic Epidemiology) is supported by a U.S. Public Health Service Resource Grant (RR03655) from the National Center for Research Resources.