November 2002 C++ Experts Forum/The (B)Leading Edge

C++ Experts Forum

The (B)Leading Edge: Finishing Up IndexedFile

by Jack W. Reeves

In this (hopefully) final installment on using XDRStream to implement an IndexedFile class, I am going to take another high-level look at the IndexedFile/BtreeIndex class interfaces and then go back over some lessons learned.

BtreeIndex/IndexedFile

I have now been using — well actually testing — my IndexedFile class library enough to have figured out some things that I think need to be changed in the interface.

Iterators

In one case, I got hoist by my own petard — so to speak — by copying the Standard C++ library map/multimap interface a little too closely. With those classes, the erase function has a void return. This is acceptable since erasing an element in those classes does not invalidate any iterator except the one being erased. Therefore, some simple idiom like erase(it++) will let you erase one element, but keep a valid iterator into the container. IndexedFile on the other hand invalidates all of its iterators whenever any item is erased. This means that there is no way to step through the elements in an indexed file and erase some of them. I quickly changed the erase signature to be like those in the standard library vector and other sequential container classes — it now returns the iterator of the element immediately after the one that was erased — or end. Changing the interface to provide this was not difficult, nor was finding where in the code the new iterator value needed to be computed. Actually getting the iterator correct in all the cases where pages could be empty or consolidated with other index pages turned out to be a pain. Of course that just proves that it is something the BtreeIndex class needs to be responsible for doing.

This was the only case (so far) of something that I had to change, but I had been developing a growing dissatisfaction with a few other aspects of the design. Finally, I decided that my idea of returning a proxy object when an iterator was dereferenced was not working out very well. My original intention was to allow the regular STL-like idiom for updating an element:
(*it).second = new_value;
even though dereferencing an BtreeIndex/IndexedFile iterator does not return a reference to the underlying value. This worked, but I quickly discovered that certain other idioms did not work. For example, suppose an IndexedFile's Item contained an element such a "name." Traditionally, you would expect to be able to write:
std::string name = (*it).second.name;
to retrieve it. This does not work however because (*it).second is a proxy for Item, not the item itself. For a while I thought this was actually a positive side effect. Since retrieving the item from the external store is expensive, you want the client to make a copy and then access the elements in the copy rather than repeatedly dereference the iterator. Nevertheless, I finally realized that I was running into what I call the "like a duck" syndrome. You know — if something walks like duck, quacks like a duck, and is found where ducks are found, then reasonable people can be excused for expecting it to behave like a duck. I had gone to some trouble to make my BtreeIndex/IndexedFile iterators look like other STL container iterators, but unfortunately they do not have the performance characteristics of an STL iterator. There were things you could do with them that you really should not be doing. I caught myself making this mistake several times. I decided it was appropriate to step back from the STL iterator idioms a little way. To do this, I removed the dereference operator from the iterators and replaced it with a get_value function. The BtreeIndex::Iterator version now returns the BtreeIndex::value_type (std::pair<Key,size_t>) object directly. The IndexedFile::Iterator version requires that an IndexedFile::value_type (std::pair<Key,Item>) object be passed to it by reference to be filled in. I deliberately made the IndexedFile iterator require the client supply an Item object rather than just returning one so that it is impossible for the client not to make an copy of the item once it is read from the external store. I actually tried doing this for the BtreeIndex::Iterator as well, but quickly discovered that having the key returned as a function value was too useful to give up. Both Iterator types now also have an update_value function that does the job of the previous proxy object and allows a client to update the item value associated with a key. I quickly discovered that these changes made things feel much better. The classes are only slightly less convenient to use than before, and now I know that the client code is not accidentally doing something horrendously inefficient.

On the assumption that one good change deserves another, I decided to take care of something else that was bothering me. I had made the BtreeIndex page size a template argument on the premise that is was something the client should be allowed to change. But the page size is only important when a BtreeIndex is created; after that, it is stored in the external file itself and retrieved when the file is opened. I decided to get rid of the template argument, but I still needed a way for the client to specify the page size when an IndexedFile or BtreeIndex is first created. Rather than add a parameter to the open function, I decided to add a create function instead. It turned out that this allowed me to kill several minor problems with one stone, uh change. First, as planned create had a pageSize parameter that allowed me to get rid of pageSize as a template argument. Second, as I described in a previous column, I had discovered problems with the traditional iostream openmode parameter. To recapitulate briefly, if you specify an openmode of in or in|out, then the file has to exist or you get an open error. To create a new file needs an openmode of just out, (or possibly in|out|truncate), but that didn't make a lot of sense. Originally I was handling this problem internally, but by adding the create function I was able to remove the openmode parameter from the open function altogether. The create function does not "open" the file, it just creates a new one. This will destroy any existing file, so it has to be used judiciously, but now the open function can get by with just a Boolean flag to indicate whether the file should be read-only.

Thirdly, I discovered in my testing that most indices need unique keys. I had decided early on that BtreeIndex would support non-unique keys, and it would be up to the client to handle the case where unique keys were desired. The alternative that I considered was to support only unique keys, which would require the client to make keys unique somehow if they were not already. I still think support for non-unique keys is necessary, but I discovered that forcing the client to check for a key's existence before every insert imposed considerable overhead. I decided to have the BtreeIndex support either unique or non-unique keys. This became another flag to create, and I even went so far as to make the default be unique keys.

So adding create allowed me improve the interface in several ways. Of course, with the support for unique keys came the requirement that insert functions now return both an iterator and a Boolean flag to indicate whether the insertion was successful. This was not too big a problem. All in all, these changes took me several days of spare time. But the result quickly felt like a much better design.

Finally, I went back through the code and discovered that I had left the BtreeIndex::reindex function and the IndexedFile::compact function marked as TBD. I quickly set out to fill them in. The BtreeIndex::reindex function is provided so that a client can rebuild an index that might have gotten grossly unbalanced because of a series of key removals that has left a lot of mostly empty pages in the B-tree. All reindex does is copy the index values into a new temporary file and then delete the old file and rename the new file to the original index's file name. I will leave it up to the reader to convince himself that even a straight in-order copy does a pretty good job of recreating a balanced B-tree.

IndexedFile::compact turned out to be a completely different can of worms. Originally, I provided it because removing an element from an IndexedFile just removes its key in the index without reclaiming the space in the data file. When I tried to write compact, however, I discovered that it was impossible to do without imposing restrictions on the type of data that could be stored as the Item in an IndexedFile — restrictions that I had deliberately worked hard to keep from imposing. So, I quickly (although reluctantly) decided that I had to remove the compact function. Now, compaction is left up to the client as a higher-level function.

I do not like doing this. First, it feels like laziness on the part of the designer. Second, we all have had way too much exposure to functions and libraries where the user has to know too much in order to use the function effectively. The classic example of this is where a function expects the user to provide a buffer "of the required size" (e.g., sprintf). Good design avoids these kinds of restrictions. In this case, however, I could not see any alternative. Fortunately, copying the elements in one IndexedFile into another is a trivial program to write, so I don't feel like it is too much a burden on the client, but it is inefficient compared to just rewriting the data file and updating the elements of the index.

That about wraps up my overview of the BtreeIndex/IndexedFile design. I continue to test the class when I have time, which really means that I continue to debug it. Nevertheless, it is getting close to being usable. I have run a test case with 3,000,000 polymorphic items, and look-up times seemed the same as for a 3,000 item file. I am still concerned that the error handling is not yet acceptable, however. In particular, IndexedFile only throws exceptions to signal an error, when it probably should give the client the option of an exception or some other method. Now that I have gotten this far, I am starting to think about Fred Brook's oft-quoted law: "plan to throw one away; you will anyway." I think the external interface is pretty close to being a "keeper," but the internals may need a major overhaul before I am satisfied.

Recommendations and Lessons Learned

If you have stuck with me to this point, I really appreciate it. I will try to make this quick and to the point, but after this much time, it would be surprising if there were not a number of things that fall into the lessons learned category. Recall that I set out to create an XDRStream class that took advantage of the fact that the Standard C++ IOStreams library is now template based and parameterized by character type. My goal was to build a binary stream capability using XDR as a platform-independent storage protocol. The BtreeIndex/IndexedFile classes were then designed to provide a higher-level storage mechanism that used the XDRStream classes for their underlying storage/retrieval.

Even though IndexedFile is intimately related to XDRStream, it turns out that the lessons learned fall rather cleanly into separate categories depending upon whether it is XDRStream that is the subject or IndexedFile.

The broadest lesson learned is unfortunately a negative one: I have managed to break every compiler and library that I have tried these classes on — so far. I must admit that I have not tried it on all that many, but that is only partially because I don't have very many available. Even before I lost access to several of the platforms that I once used, I knew they couldn't handle the problem. Now I am actively looking for a platform that can handle these classes. I have found a couple of compilers that seem to compile the code, but either the library fails to instantiate properly, or the code does not run correctly. (My debugging efforts have led me to believe in bad code generation.)

On the other hand, I have managed to work around the problems on the platforms that I have access to and gotten things to work at least well enough to be able to say that the ideas are sound and draw some specific conclusions about how things like this should be done when we get ISO Standard C++ compliant platforms.

XDRStream

I really like this class. It just plain works. As I was developing IndexedFile, I pretty much forgot about XDRStream. I had to pay attention when I started debugging, and it turned out that I had made some mistakes in my internally derived versions of XDRStreams, but that was fairly quickly fixed. Those readers who have followed this from the outset will recall that I went through a couple of iterations of the XDRStream interface while I was developing it, but in retrospect it settled down rather quickly. Now I simply use it pretty much like I would any other mature/stable class. In one sense, this is hardly surprising since I deliberately modeled XDRStream on the existing IOStream classes. I also had the XDR RFC to work from. On the other hand, this says a lot about how mature the IOStream class design really is. The biggest problems with XDRStream were in making sure that all the i's were dotted and the t's crossed to meet the Standard's requirements for a character type used to specialize the basic_streambuf template. Let me hit those requirements and make a few recommendations to anyone contemplating doing something similar.

Make your char type a POD (Plain-Old-Data) struct. The Standard says that a "character" type has to be a POD type. Originally, I chose to make my XDR_Char a POD struct. I changed my mind after running into a problem and switched it to be a typedef for a long int. I have now switched back. While it is convenient to use a built-in type during testing, I don't recommend it for a general library release. Recall that a typedef is just a synonym, whereas a struct introduces a new type. Doing the latter lets you take full advantage of the compiler's type system and keeps you from accidentally coming in conflict with somebody else's use of the same type.

Explicitly specialize std::char_traits for your char type. Specializing basic_streambuf requires both a character type and a traits type to go with it. Originally, I started out providing a specialization of std::char_traits<XDR_Char>. I switched to a stand-alone XDR_Char_Traits class when I switched to using a built-in type for XDR_Char. This was necessary since the Standard is clear that providing an explicit specialization of a template in the std namespace for a built-in type yields undefined behavior. The Standard is also clear that you can add explicit specializations of std templates for user-defined types. For char_traits, that is what I recommend. This is probably not very important, but I simply feel that it is clearer if you use the same naming convention as the Standard for your own traits class.

Beware the ctype<> facet. This is not really a recommendation, but something to watch out for. I never intended that my XDRStream classes would need any locale information, but I discovered that during initialization of basic_ios, which XDRStream uses as a base class, certain library implementations were expecting there to be a ctype<XDR_Char> facet in the global locale. Initially, my reading of the Standard said they were correct. My error has since been pointed out to me. Before that happened, however, I set out to work around the problem by installing a ctype<XDR_Char> facet into the global locale. I ran into the fact that locale and standard facet implementations lag even farther behind the Standard than do IOStreams implementations. The real recommendation is that if you do not need any local information in your class, then find an implementation that does not require you to have any. If you do need locale information, in particular if you need a cytpe<> facet for your "character" type, I am afraid at this point that you are going to have to figure out how your platform works on your own. Some platforms might just let you instantiate the ctype<> template and then derive from it. (The ctype facet has virtual functions.) Others might require that you provide specializations of the base class virtual functions themselves. Others might require that you provide the entire ctype<> facet as an explicit specialization. I am afraid that there are a number of people who apparently feel that each of these approaches — or maybe all of them — is compliant with the Standard.

Be careful with the IOStreams error-handling conventions. Again, this recommendation is broader than I would like, but if you want your stream class to have the same behavior as required by the IOStream classes, you have to take some care with how you handle exceptions. To be more specific, most IOStream classes do not throw exceptions by default. If you want your stream-like class to work the same way, you have to be sure and catch all exceptions thrown by lower-level classes. Once caught, you have to make sure the error is captured in some internal state. If your class is based on basic_ios, you can take advantage of the stream state flags it provides, but you have to be aware that setting those flags can cause an exception to be thrown. If setting a basic_ios state flag would cause an exception, then you want to re-throw the exception you originally caught. This gets a little tricky, but once you understand the idiom, it is pretty straightforward — you just have to make sure you do it everywhere it is required.

BtreeIndex/IndexedFile

While most of the lessons learned in developing XDRStream were related to how to use the standard library, BtreeIndex/IndexedFile are just ordinary development efforts. Still, there are some specific recommendations:

Use traits. The char_traits idiom is broadly applicable for any type of template. A traits type is a collection of types and static functions. At one point, I figured I would just require my IndexedFile clients to provide the operator<< and operator>> functions for the types used to specialize IndexedFile. Using a traits type to provide encode and decode functions instead provides more flexibility.

Take advantage of the standard library for its interfaces as well as its implementations, but beware of the underlying requirements. Stated another way, it is usually a good idea to model your classes as close as possible to those in the standard library, but no closer. A good example of this point is my iterator discussion above. Iterator is an excellent design pattern all by itself, and making a BtreeIndex iterator look like an STL iterator seems like a good idea. But STL iterators have some implicit and explicit performance requirements. I decided that the expected constant-time performance for a dereference operator was sufficiently important that I removed that operation from my iterators since I could not meet it. On the other hand, I decided that increment and decrement operators were too useful an idiom even though their performance might not be quite the same as an STL container. Obviously you will have to make these choices depending upon your own circumstances. I simply point out the fact that you will need to put some thought into it.

Well that just about wraps it up for now. I am sure I will probably discover some more things down the road, but I will burn that bridge when I come to it, as the saying goes.

Jack W. Reeves is an engineer and consultant specializing in object-oriented software design and implementation. His background includes Space Shuttle simulators, military CCCI systems, medical imaging systems, financial data systems, and numerous middleware and low-level libraries. He currently is living and working in Europe and can be contacted via jack_reeves@bleading-edge.com.