March 2000/An STL-Based N-Way Set

Features

An STL-Based N-Way Set

Mark L. Smith

Databases support access with multiple keys. Why not allow the same latitude with an STL-style container?

Introduction

STL defines associative containers, set and map (and their more lenient cousins, multiset and multimap), to organize data that needs to be quickly searched. Unfortunately, these containers are limited to searching on just one key, or index. In many instances, it's convenient to have a collection of data with multiple indexes. A multiple-key lookup can be modeled using several traditional STL containers glued together, but this approach becomes tedious and error prone when the data being stored is dynamic (requiring updates to several containers) or when the exact number and type of indexes to maintain isn't known at design time.

A better solution is a wrapper class that hides the disparate containers behind a consistent, familiar interface. This is what class nwayset endeavors to accomplish. nwayset allows the insertion and removal of data items and the insertion and removal of indexes on those items. Additionally, iterators over the data items and indexes are exposed. nwayset also provides a way to automatically iterate over all indexes and search them for a particular data item. Finally, nwayset has searching performance characteristics very close to that of an STL multiset.

Here is an example. A DNS server maintains an in-memory database of host information indexed on hostname and IP address. It searches both indexes to satisfy name resolution queries from clients. A DNS server might, conceivably, read in a host database file into an STL list of CDNSRecord instances at server startup. Then, two STL sets of CDNSRecord pointers (one that sorts on CDNSRecord::m_strAddress and the other that sorts on CDNSRecord::m_strName) might be populated from the list. As requests came in, the sets could be searched for hostnames and IP addresses and the result dereferenced to gain access to the appropriate CDNSRecord instance. Many DNS servers also support dynamic updates to the host database as hosts connect and disconnect from the network (such as Dynamic DNS with DHCP). In that dynamic update scenario, the DNS server would have to update the list of CDNSRecords and then update each set.

In this example, the interaction between the master list and auxiliary sets is tedious and could be difficult to maintain. If a new variety of query type is added to the DNS server (such as mail-exchange information), an additional set (for the new mail-exchange index) has to be added. Additionally, the logic sections that handle reading in the host database file, adding new host information, and removing host information that is no longer valid all have to be changed. Clearly, a more self-managing container (such as nwayset) is called for.

Architecture

nwayset is defined in the header nwayset.h, shown in Listing 1. It maintains two member variables: an STL list of data items and an STL list of indexes. The list of data items (m_lData) is a simple list of the type that is being managed (template parameter T). The list of indexes (m_lIndexes) is a list of TIndex structures. A TIndex is a multiset of a special iterator class (CDataListIterator) that points into m_lData. The comparison predicate for the multiset is another special helper class (CCompareHelper) that contains a pointer to the real compare function. The real compare function must have the signature:
bool func(const T &, const T&, bool)
and must have an address. (The function cannot be inlined.)

One important subtlety to highlight is the choice of the list as the container for both indexes and data items. On casual inspection, it might look like a deque or vector would work just as well, but keep in mind that iterators are being stored and handed out. Since each TIndex essentially holds a list::iterator from m_lData for each data item managed by nwayset, iterators need to remain valid as items are added and removed from m_lData. This same behavior for indexes is desirable as well so that once a user of nwayset has obtained an iterator to an index, the user doesn't have to worry about that iterator suddenly becoming invalid because of some unrelated mutating operation. Additionally, a list iterator is sufficiently complex to be a separately defined class and not a fundamental type (like a pointer). This is important since nwayset derives from list::iterator. (Fundamental types cannot be derived from.)

Special Iterators

To inherit as much functionality as possible from the underlying STL classes and to automatically obscure some of the magic that goes on inside nwayset, two special iterator derivatives have been defined.

CDataListIterator is the data-item pointer from m_lData that is stored by TIndex. CDataListIterator is based on list<T>::iterator and therefore can be resolved into the real data item in m_lData (which is necessary for TIndex to be able to sort). CIterator is the publicly accessible iterator type that users of nwayset utilize to iterate over an index, use as a pointer value for data-item removal, and receive as the return value from find operations. In fact, CIterator is a typedef for iterator for convenience. There is also a const cousin of CIterator called CConstIterator that is an alias for const_iterator.

Deriving from a base iterator class accomplishes three things. First, all of the iterator functionality (including the myriad of operators supported by iterators) stays intact and does not have to be duplicated, resulting in much less code to implement or to wrap. Second, operator* can be overridden to act more appropriately depending on the context. When CIterator is dereferenced using operator*, the more desirable behavior is to go ahead and return the underlying data item and not the internal nwayset class CDataListIterator. To accomplish this, operator* for CIterator essentially does a double dereference — one to get from a CIterator to a CDataListIterator and another to get from there to a T&. In actuality, a third dereference occurs — to convert from the this pointer to a CIterator. But that happens transparently with a normal iterator dereference as well.

The final favor that deriving from the base iterator class affords is the ability to build an iterator around an item to search for. Since a TIndex manages a CDataListIterator, it can only search using a CDataListIterator as the criterion. The preferred search-item data type is T&, so a CDataListIterator can be constructed directly from an item of type T&. The idea is that the compiler will see that it has a type T& but needs a CDataListIterator and will see that a CDataListIterator can be constructed from a type T&. It will then generate a temporary CDataListIterator in place. While the concept of building a mock iterator is peculiar (the derived iterator classes really acts like two disparate classes fused together), it does allow for easier searching.

As mentioned before, one item of note is that neither list::iterator nor multiset::iterator can be a fundamental data type (such as a pointer). The restriction stems from the derivation we want to perform; fundamental data types cannot be derived from. Fortunately, given the nature of linked lists and binary trees (which list and multiset are supposed to model, respectively), all implementations should have list::iterator and multiset::iterator defined as classes to deal with the multiple pointers of a doubly linked list and the left and right children of a binary tree.

CCompareHelper is a wrapper class to bind the CDataListIterator to the comparison index functions (typedefed as TFncCompare for convenience). While developers of comparison function don't need to be concerned about the existence of CDataListIterator, a TIndex that the comparison function will be plugged into manages a CDataListIterator. The CCompareHelper wraps itself around the user-defined comparison function and then dereferences a CDataListIterator for that function when TIndex needs some values compared. Since CCompareHelper is a general class that calls the comparison function through a pointer, the comparison function must have an address. This means that the function cannot be inlined and that the function must actually exist. Keep in mind that instances of function objects, such as less<string> objCompare, do have an address.

Index Operations

One of the most powerful features of nwayset is the ability to manage indexes on the fly: an ability that is really more akin to SQL databases or xBase than to STL containers. nwayset automatically keeps all indexes synchronized as data items are added or deleted. This necessarily means that mutating data operations perform more slowly as more indexes are added. Additionally, every time an index is added to nwayset, each item in m_lData must be touched. For an nwayset that contains a large number of data items, adding indexes can be costly.

Indexes can be added to nwayset using the member function index_insert. index_insert takes a function pointer as an argument and uses that as the sort function for the TIndex multiset. The return value is an index_iterator that can be retained to search the newly added index or remove the index later. To remove the index, a member function index_erase is exposed that takes an index_iterator and removes it from the m_lIndexes list.

Iterating over all indexes is accomplished by using index_begin and index_end. Each of these functions returns an index_iterator. They are obvious analogs to begin and end in many other STL contexts. When an index_iterator is dereferenced, it returns a reference to a TIndex. Since a TIndex is really a multiset, the normal multiset member function find behaves just as expected. The return value from a TIndex::find is CIterator compatible (using our hijacked iterator classes) so that TIndex::find fits in with the rest of the nwayset paradigm.

Data Operations

nwayset exposes a small set of member functions to perform list data operations. insert takes an item as an argument, adds it to the data list (m_lData), and updates the indexes to reflect the new item. erase takes a CIterator as an argument, removes an item from m_lData, and deletes the corresponding item from the indexes by iterating over m_lIndexes, searching for the item to delete and then finding the exact iterator that matches the CIterator passed in to erase. (This means that a particular index can have duplicate entries, and the correct entry will be removed.) The number of items managed by nwayset can be determined by calling size, which simply rebroadcasts m_lData.size().

nwayset also exposes a powerful searching member function called find. It takes a data item (of type T) as an argument and searches all indexes looking for a match. In essence, the data item passed to find is very much like an SQL query object or xBase expression, where each data member (or data member combination) that an index is built on is a column to search. Each data member value is the column criterion that is ORed together with the other column criteria and then submitted. If a match is found, a CIterator designating that data item is returned. Otherwise, the value end() is returned to signal that no match was found.

A DNS Search Utility

The capabilities of nwayset can be demonstrated using an example similar to the DNS server example above. Using a combination of the standard nslookup command and nwayset, an interactive host information utility can be built. The utility takes a host database file generated by nslookup, reads it into an nwayset (indexing as the file is read), and allows searches to be performed against it.

The nslookup command is a utility for connecting to a DNS server and asking it questions related to name resolution. When nslookup is presented with either a hostname or an IP address, it passes it to a DNS server and returns the opposite of what was passed (a hostname for IP address and vice versa). nslookup can also be used to transfer the host database to the local machine. The following example parses a host database file, generated by nslookup, and allows interactive lookups much like nslookup does, except that it doesn't connect to a DNS server.

To generate a host database file, run nslookup. Various DNS related messages should scroll by. Once the nslookup prompt (>) appears, type:
ls > host.db
This should create a file named host.db (see Figure 1 for an example) that contains the guts of the DNS that nslookup is connected to. If nslookup complains about zone transfers not being allowed or prints some other error, the DNS that nslookup defaults to is too secure to allow its host database to be transferred. Either switch to a less paranoid DNS using the command:
server NEW_DNS_SERVER_ADDRESS
or make up a host database file.

One other caveat: different nslookups produce different results for the ls command. The following example expects the output to look like the results from the Microsoft Windows 9x/NT nslookup. Unix nslookups tend to generate a different format.

Listing 2 shows the file hostdb.cpp, which contains the host database search utility. The first nwayset is declared passing the CDNSRecord class as the template parameter T:
nwayset<CDNSRecord> nsDNSLookup;
Next two indexes are added (on hostname and on IP address) while nwayset is empty.
nsDNSLookup.index_insert (DNSNameCompare);
nsDNSLookup.index_insert (DNSAddressCompare);
DNSNameCompare is a plain function that takes two CDNSRecord instances and compares their m_strName member. DNSAddressCompare performs a similar operation on the m_strAddress member.

After some file opening and parsing magic, CDNSRecord instances are inserted into nwayset:
nsDNSLookup.insert(CDNSRecord
   (pszName, pszType, pszAddress));
Now that the host database is in our nwayset, it can be searched. The console is polled for a string to search for. That string is used to build a CDNSRecord search object:
CDNSRecord objSearch (szLine,"",szLine);
Using the search object, the nwayset::find member function is used so that both the hostname and IP address indexes are searched:
nwayset<CDNSRecord>::const_iterator it =
   nsDNSLookup.find(objSearch);
In this example utility, the most difficult part of the whole process was parsing the file to begin with. All of the strange iterator hijacking and layered containers are sufficiently insulated so that the code is clean, straightforward, and consistent with STL style and naming conventions. Later on, if the program logic dictates that additional indexes are needed (perhaps indexes that search CDNSRecord::m_strType or case-insensitive versions of CDNSRecord::m_strAddress and CDNSRecord::m_strName), they can be added without losing the contents of nwayset and forcing the host database file to be reprocessed.

Conclusion

nwayset hides the messy details of managing a collection of data and indexes using STL while pushing most of the real work off on the STL containers themselves. This results in a small code implementation that performs quickly and with very little overhead.

nwayset could be extended to more closely mimic a database table by supporting keyed indexes (all values unique). One way to accomplish this is to add a Boolean uniqueness indicator, along with each index, that is checked when new values are pumped in. A further improvement would be to extend this class to support disk-based lists, perhaps through a disk-based allocator for m_lData, with in-memory indexes.

Further Reading

For more on nslookup and the internal workings of DNS servers, see Paul Abitz and Cricket Liu, DNS and BIND (O'Reilly and Associates, 1998).

For an excellent STL tutorial and discussions on mutating operations and their effect on iterators, see Leen Ammerall, STL for C++ Programmers (John Wiley and Sons, 1997).

Mark L. Smith is head of aiWorks, a firm specializing in software toolkits and full software applications for document and forms processing with Adobe Acrobat/PDF and Lotus Domino. Mark has been developing software for 15 years (8 of those with C++) and feels that STL is the biggest advance in C++ programming this decade. Feel free to contact Mark at msmith@aiworks.com or check on what he is up to at www.aiworks.com.