C++ Experts Forum


The (B)Leading Edge: Building an Indexed File Using XDRStream, Part 3

by Jack W. Reeves


Introduction

In this installment of "The (B)Leading Edge," I am (finally) going to start walking through the implementation of the IndexedFile class that I have designed. This is going to take several installments; I am sure most of you probably look forward to a code walkthrough with much the same enthusiasm as you do a visit to the dentist. Nevertheless, I want to do this for a couple of reasons.

First, there is a lot of C++ here, and that is what this column is all about. Most of the code is pretty much classical C++ — with the obvious exception of the templates. A lot of it is also built around the STL and uses some seldom-seen functions of the standard library, so it may be new to many readers. There are also a few patterns used here, and I think that is important. In any case, the code gives me the opportunity to show how various features of C++, both old and new, work together in a fairly complicated design.

Second, walking through the design in detail gives me a chance to explain and discuss several trade-offs that I had to make. I mentioned a few of these in my last column, but actually seeing what impact they have on the code (or could have) should make things a lot clearer. These trade-offs are what makes software design difficult. They are also what makes it interesting. As it is, I do not intend to go through all of the code, just the important parts. Unfortunately, when you get right down to it, there is usually no such thing as unimportant code.

BtreeIndex Overview

Listing 1 [1] contains the definition of the class IndexedFile and its associated classes.

The following classes are defined in the IndexedFile header. The nesting applies to the class definitions as well.

XDR_Traits
BtreeIndex
    Iterator
    StreamPosProxy
    Page
    RootPage
    PageCache
    XIODataStreambuf
    XInDataBuf
    XOutDataBuf
    KeyInfo
IndexedFile
    ItemProxy
    Iterator

In this column, I am going to concentrate on the internal BtreeIndex classes Page, RootPage, XInDataBuf, and XOutDataBuf.

The heart of IndexedFile is its use of a B-tree as the index. While B-trees (devised by Bayer & McCreight) are described in a number of books, my reference is Sedgewick [2]. Fundamentally, a B-tree consists of nodes that are stored externally in fixed-size pages. (I will simply refer to pages from now on.) Each page contains several keys plus links to other pages. Figure 1 shows a simple depiction of a B-tree that uses integers for keys. Typically, when keys are fixed size, each page can hold M keys. The requirement for a B-tree is that every page (except the root) must contain at least M/2 keys. As the tree grows, this requirement is maintained by splitting a page when it is full. When the tree shrinks, meeting this requirement requires shuffling keys between pages to ensure that each has at least M/2 keys, and/or consolidating pages whenever two adjacent pages contain less than M keys between them. As I discussed in my introduction to IndexedFile [3], it is often useful to relax the requirement for having at least M/2 keys in a page to simplify the process of removing an entry from the B-tree.

The above description is for a classical B-tree, but, in my implementation, keys are not required to be fixed-size. The only requirement is that it be possible to store the key in an XDRStream. So I have made the following changes to B-tree requirements. In class BtreeIndex, an internal page (node) is stored externally as a fixed-size page that is encoded using XDR. Internally, a page must be able to hold at least two keys (actually key/link pairs). Stated another way, the encoded representation of a key must take up less than half the available space in a page. A page is split whenever adding another key to it would cause the encoded length of the page to exceed the specified PageSize. Splitting a page involves finding the key in the page where encoding the keys up to that point would use more than half the available space and then moving all the rest of the keys to a new page.

With these changes, I can no longer say that a page will hold exactly M keys when full, or that every page will contain at least M/2 keys, but on average the requirements still hold.

Encoding a Page

Since a page is the heart of a B-tree, let us start with that. Again, let me emphasize that a page has both an internal and an external representation. Externally, a page is fixed size, and its XDR encoding is described like this:

Page {
    unsigned int      num_keys;
    unsigned int      prev_page;
    unsigned int      next_page;
    unsigned int      dummy;      // always 0
    opaque            data[PageSize - 16];
};

Remember this is XDR notation here, not C, even though it looks pretty much the same. The PageSize constant is the fixed size of a page (in bytes) as specified by the template argument. The data portion of the page — the part that actually contains the keys — is treated as an opaque array whose length is the PageSize minus the encoded lengths of the previous members. Those members indicate the number of keys actually encoded in the data area, and the page reference IDs of this page's siblings in the B-tree. A page's ID is the same as its XDRStream offset. A page ID of -1 is considered invalid and indicates that there is no such page. Note that if the page is empty — which means it has had all the keys it once contained removed — it will be placed on a linked list of similar pages. In this case, next_page references the next empty page on the list. The dummy member will be explained below.

In addition to regular pages, there is certain information about the B-tree index as a whole that needs to be stored externally. I decided to store that information in the root page. The XDR encoding of the root page is shown here:

RootPage {
    unsigned int      depth_page_size;
    unsigned int      total_keys;
    unsigned int      num_keys;
    unsigned int      free_page;
    opaque            data[PageSize - 16];
};

This is just a little trickier. The first element is actually a combination of the depth of the tree and the page_size. I assume that an ordinary short int (C++ type) is sufficient to represent either of these values, so I encode them together into a single unsigned integer (XDR type). The next value (total_keys) is the total number of keys currently held in the BtreeIndex. The num_keys member is the same as for a regular page — the number of keys in the data portion. Finally, the free_page member is the page ID of the first empty page on the free list. If this member is -1, there are no free pages in the index. Although it was not strictly necessary, I added a dummy member to the Page structure so that the data portion of both Page and RootPage would have the same length.

Finally, for completeness' sake, I note that the data portion of either page can be thought of as an encoded XDRStream containing zero or more of the following:

KeyInfo {
    Key                key;
    unsigned int       offset;
};

In this case, Key is the template argument, and it is encoded/decoded using the functions supplied in the XDR_Traits template argument. The offset member is a stream position expressed in terms of XDR_Chars. If the page is an internal node in the BtreeIndex, then the offset represents the page ID of another page in the index. If the page represents a leaf node in the tree, then the offset is the location in the data file of the actual item for that key.

Now, let us look at class Page itself. Internally, a Page contains the data members you will see initialized in the constructor shown later: the const _pageId, the _dataLen of the encoded data area, the _prevPage and _nextPage sibling links, and the _dirty flag.

Finally, a Page contains a _keyList member, which is a collection of KeyInfo objects.

The KeyInfo struct is internal to Page and contains the expected Key object and the link to either another page or the data. In addition, KeyInfo contains a data member that holds the length of the object when it is encoded.

Choosing the KeyList container involved the first of several trade-offs that I mentioned above. The obvious implementation would use a vector for the container. Unfortunately, a vector also has some obvious drawbacks: specifically whenever a new key is added to the index or a key is removed, then an insert or a delete has to be performed on the vector. As everyone knows, inserting or removing an element from a vector (anywhere except the end) is an expensive (O(n)) operation. This is even truer when a page has to be split or consolidated since that would involve copying a number of elements to a new vector and then erasing the same elements in the original vector. The expense of all this copying depends upon how expensive it is to copy a Key object — which is a user supplied type that I have no control over. As a result, I decided to use a list container instead of a vector. Besides not having to make any copies when an element is inserted or deleted from the container, the standard list container provides several splice functions, which take the elements from one list and move them to another list by simply updating internal pointers. This eliminates the need to copy elements between containers at all.

The downside of using a list is that my BtreeIndex iterators carry a slot number representing the actual KeyInfo item within a page. It is necessary to use this index instead of something like an iterator (or pointer) because the latter could be invalidated if a Page is flushed out of the cache. This means looking up an item involves scanning through the list to find the correct slot. This is not an especially complex or time-consuming operation, but it is more complex and time consuming (O(n)) than the constant time (O(1)) access a vector provides. Nevertheless, I decided to accept this extra overhead because it is an overhead that is relatively well controlled. Yes, it will vary somewhat by the number of items in a page, which in turn can vary by the PageSize specified by the user, but I figured that was much less likely to cause major fluctuations in performance than what might otherwise happen when trying to copy large numbers of Key objects.

Having said all that, I note two things: first, all this depends upon pages being in memory. If a page is not in the cache, it has to be fetched from the external store, and it is assumed that this operation completely dominates all other performance considerations. Secondly, I will admit that I designed the Page from the outset to use a list primarily because I thought it was cool to use the splice functions. Only later, when I switched (briefly) to using a vector, did I come up with the rationalizations above. I mention this because it also is part of the reality of software design — sometimes you get things right, but for the wrong reasons. It never hurts to check out the alternatives just to be sure, however.

Member Functions

For member functions, Page contains the usual suspects of constructor, destructor, member access functions, and a few member update functions. In addition it contains the following special operations:

bool          must_split() const;
size_t        add_key(BtreeIndex&,
                  const std::pair<const Key, size_t>&);
bool          remove_key(int slot);
void          update_key(size_t slot, size_t link);
void          insert_page_link(KeyList::iterator kit, Page* p);
void          update_page_link(Page* p);
bool          remove_page_link(Page* p);

// virtual functions
virtual Page*              split(BtreeIndex&); 
virtual Page*              consolidate(BtreeIndex&);
virtual Page*              shuffle(BtreeIndex&); 
virtual oXDR_Stream&   encode(oXDR_Stream&);
virtual iXDR_Stream&   decode(iXDR_Stream&);

I will go through each of these (and the constructor) below.

Page Constructor

The constructor is shown below. It takes one argument and just initializes the member variables. You will note that the member _pageId is marked as a const member, so it has to be initialized by the constructor. The pageId is the link between the internal page structure and the location of the external page and can never change. You will also note that the copy constructor and copy assignment operator are declared private and not implemented. This is the typical C++ idiom to prevent copying of an object. It doesn't make any sense to have more than one internal representation of a single external object floating around, so I disallowed copying.

BtreeIndex<Key,Compare,KeyTraits,PageSize>::Page::Page(
    size_t pid)
 : _pageId(pid),
    _dataLen(0),
    _prevPage(-1),
    _nextPage(-1),
    _dirty(false),
    _keyList()
{
    // empty
}

Encode/Decode

Encoding and decoding a page provide the connection between the internal representation and the externally stored version. In a sense, the Page::decode member function is the real constructor. So let me get these two functions out of the way. They are both shown below:

oXDR_Stream&
BtreeIndex<Key,Compare,KeyTraits,PageSize>::Page::encode(
    oXDR_Stream& xstrm)
{
    xstrm.seekp(_pageId);

    xstrm << _keyList.size();
    xstrm << _prevPage;
    xstrm << _nextPage;
    xstrm << 0;

    char data[PageSize - HeaderSize];
    XOutDataBuf xdatastrm(xstrm, data, sizeof(data));

    KeyList::iterator it = _keyList.begin(); 
    for (; it != _keyList.end(); ++it) {
        KeyTraits::encode(xdatastrm, (*it).key) << (*it).link;
    }
    xstrm.put_opaque(data, sizeof(data));

    _dirty = false;
    return xstrm;
}


iXDR_Stream&
BtreeIndex<Key,Compare,KeyTraits,PageSize>::Page::decode
    (iXDR_Stream& xstrm)
{
    xstrm.seekg(_pageId);

    unsigned int numKeys;
    unsigned int dummy;
    xstrm >> numKeys;
    xstrm >> _prevPage;
    xstrm >> _nextPage;
    xstrm >> dummy;

    char data[PageSize - HeaderSize];
    xstrm.get_opaque(data, sizeof(data));

    XInDataBuf xdatastrm(xstrm, data, sizeof(data));

    _dataLen = 0;
    _keyList.clear();
    while (numKeys-- > 0) {
        KeyInfo info;
        info.xlen = short(xdatastrm.tellg()); // the start
        KeyTraits::decode(xdatastrm, info.key);
        xdatastrm >> info.link;
        info.xlen = short(xdatastrm.tellg()) - info.xlen;
        _dataLen += info.xlen;
        _keyList.push_back(info);
    }
    return xstrm;
}

There is nothing too special about either of these functions, but a couple of points should be noted. First, they both begin with a seek to position the stream to the correct offset for this page. Second, the external representation stores the number of KeyInfo items in the data area, but I let the container handle that information in the internal version. On the other hand, the internal version keeps track of the total encoded length of the data area, but I do not actually encode that information. Finally, note that encode clears the dirty flag as its last operation. This indicates that the page no longer has changes that have not been stored externally.

The major aspect of both functions is the use of classes XOutDataBuf and XInDataBuf to actually encode/decode the KeyInfo objects. I will concentrate on the decode function since it is slightly more complicated.

After decoding the header information from an external page, the function then reads the KeyInfo data into a buffer using the get_opaque function. Recall that this is an XDRStream function that simply reads octets without any interpretation. This information is read into a fixed-sized array that is declared to be the size of a page minus the header information. At this point, all that is left is to decode the key/link information from the buffer. Unfortunately, it is at this point that some more of those design trade-offs crop up.

On the one hand, a BtreeIndex is not an XDRStream — in the classical OO design sense of inheriting from class XDR_Stream. On the other hand, I require the client to provide XDR_Stream encode and decode functions for a Key. From the client's standpoint, or at least from the standpoint of encoding/decoding a Key, a BtreeIndex must be an XDR_Stream. There are certain XDR_Stream functions that the Key encode/decode functions have access to that may need to be set externally by the client in order for the functions to work correctly. (If we were talking about regular IOStream operations, what I mean is something like the client being able to set the locale for the stream before certain insert/extract operations will work correctly.) Likewise, there are functions such as xalloc, iword, and pword, which the encode/decode functions might use and expect to act as if they came from the same stream each time. You will note that the XInDataBuf constructor takes as an argument the current XDRStream object. This is precisely so that I can make a copy of all the pertinent information when the XInDataBuf stream is created. I will discuss XInDataBuf (and XOutDataBuf) below, but for now let me address some higher-level questions.

The first question is, "Why not inherit BtreeIndex from XDR_Stream and be done with it?" More specifically, why not have BtreeIndex be an XDRStream and just use it as such whenever needed? The simple answer is that I do not consider a BtreeIndex a substitute for an XDRStream. To me, this is an example of the classic OO design problem of whether you can inherit class Square from class Rectangle. Is a Square a Rectangle? Yes, mathematically, but not (necessarily) in a software design sense. The BtreeIndex "is-a" XDRStream question is complicated considerably by the fact that the external representation of a BtreeIndex is a file that is (deliberately) an XDR data stream. Likewise, there are a number of XDRStream functions that BtreeIndex has to provide (not to mention ordinary IOStream functions like open), so why shouldn't the internal representation reflect this commonality? I will try to answer this question from a couple of different perspectives.

The first is that XDRStream does not care about the data in the stream, as long as it meets the minimum criteria of being an integral number of XDR_Chars in length. BtreeIndex does care about the data in the stream. It requires that the stream contain stored Page objects, with the additional requirements that the first Page has to be a stored RootPage, and that every file representing a BtreeIndex has to have at least the RootPage stored in it. There are other criteria about how the pages relate to each other, but the point is that BtreeIndex is a higher-level abstraction than XDRStream; it is not simply a special form of XDRStream.

Another way of looking at it is to ask, "If I use a BtreeIndex somewhere an XDR_Stream is expected, can XDR_Stream operations destroy the integrity of my BtreeIndex?" Again, the answer is complicated by the fact that we must consider both internal and external representations. Obviously, if we treat a BtreeIndex as an ordinary XDR_Stream, we can do things to the file that will destroy the index, but is this relevant to the design of the internal objects? Could I invalidate my BtreeIndex object by applying XDR_Stream operations to it? To me, the internal (object) representation of a BtreeIndex and its external (XDR file stream) representation are linked, and as such it should not be possible to use the internal object in such a way that it would destroy the integrity of the external file. Having said all that, I will admit that it is still something of a judgment call.

So, if I choose not to inherit BtreeIndex from XDR_Stream, I still have the question of why I use a separate stream (XInDataBuf) to decode the KeyInfo objects instead of just reading them with the main XDR_Stream member object. The answer to this is simpler. Whenever a new key is added to the index, I have to figure out how long its encoded form will be. Rather than write it to the external file, it seems much better to just encode it into a string and then ask how long the string is. I had an XIOStringStream class already created that works much like the standard stringstream class, but I realized that I needed both more functionality than XIOStringStream provided, and less. So I originally created an XIODataBuf class to use whenever I needed to encode a key to find out its encoded length. Once I had the class (or at least its XIODataStreambuf), I figured that I might as well use it for other things. I ended up with two BtreeIndex internal classes: XInDataBuf and XOutDataBuf. These two classes, along with their XIODataStreambuf class, are also declared in header IndexedFile (Listing 1). Their definitions are shown in Listing 2. Let me describe them here.

XIODataStreambuf

The class XIODataStreambuf wraps an XDR_Streambuf around an existing data buffer. The key point is that it does not do any buffer management of its own. The constructor takes three parameters: the buffer pointer, the size of the buffer, and the length of the currently used portion of the buffer. It uses these values to set up the get and put areas in the base class. Once the base class has been initialized, the default implementations of all the XDR_Streambuf functions are acceptable — except for the two seek functions. By default, XDR_Streambuf does not implement the seek functions; they have to be provided by derived classes. Since the seek functions are used by certain public XDRStream functions like seekg and tellp, I have to provide them.

As you can see, the definitions for XInDataBuf and XOutDataBuf are even simpler than XIODataStreambuf. The important thing to note in their case, however, is that the constructors take as a first argument a reference to another XDRStream object. The constructor and destructor for XInDataBuf (used in decode) are shown below.

BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::XInDataBuf::XInDataBuf(
    iXDR_Stream& xstrm,
    char* buf, 
    size_t len)
  : _refstrm(xstrm),
    _xstreambuf(buf, len, len)
{
    // Initialize the base class
    init(&_xstreambuf);

    // Copy the formatting and polymorphic
    // support information
    copyfmt(xstrm);
    id2obj()  = xstrm.id2obj();
    id2type() = xstrm.id2type();
}

BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::XInDataBuf::~XInDataBuf()
{
    // copy the formatting data back
    _refstrm.copyfmt(*this);
    _refstrm.id2obj() = id2obj();
    _refstrm.id2type() = id2type();
}

The constructor saves the iXDR_Stream reference that is passed to it for later use by the destructor. Then it initializes the XIODataStreambuf object. In the constructor body proper, it first calls the base class init function to establish the streambuf. Then it copies the formatting information from the reference XDRStream object into this one. This involves invoking the copyfmt member function in the base class (I'll bet you haven't seen that one used very often before) and then directly copying the IdToObj and IdToType objects that are part of an iXDR_Stream class. The destructor of XInDataBuf copies the same information back to the reference stream. This permits the client-defined decode operation for Key to have access to any client-defined formatting data that is stored in the main BtreeIndex stream. Copying it back in the destructor means that any changes made to the formatting data are persistent across different instantiations of XInDataBuf objects.

Obviously, XOutDataBuf's constructor and destructor do similar things, the only difference being that they copy ObjToId and TypeToId objects.

These two classes reveal a shortcoming in my existing XDRStream classes. Since the iXDR_Stream and oXDR_Stream classes provide the internal data structures to allow clients to store certain information in the stream for use by client-defined functions, then it also makes sense that they should provide a means to copy this information. In other words, oXDR_Stream, iXDR_Stream, and XDR_Stream need their own versions of the basic_ios::copyfmt function. I will make this change to XDRStream in a later version.

One final note about copyfmt seems appropriate. This copying is done whenever a new version of XInDataBuf or XOutDataBuf is created or destroyed. XInDataBuf is used only in the Page::decode function, but XOutDataBuf is used both in Page::encode and by the Page::add_key function (to get the encoded length of a new KeyInfo object). The last is of some concern since it means that a copyfmt operation is done (twice) whenever a new key is added to the index. Most of the time, there will be nothing in the structures being copied, so it can be assumed that the copying will be a very quick operation. The question arises about what happens in the special case where there is some complex formatting information stored in the stream. In particular, the base class copyfmt operation will invoke any registered callback functions as part of its operation, and these can do just about anything they want.

I speculated for a while about ways to avoid doing these copies all the time. The obvious thing is to have a XIODataBuf object be a member of BtreeIndex and just reuse it as needed. I have put that on my "things to do in the next version" list, but the current implementation probably works just fine for 99 percent of the cases.

Listing 2 contains the code for all the internal XDR stream functions.

Now that we understand what XInDataBuf is all about, we can finish our examination of Page::decode. After we have read the KeyInfo information into a buffer using get_opaque, we wrap a XInDataBuf stream around it and decode the individual KeyInfo objects. Since we store only the Key and link information in the stream, we have to compute the encoded length from information provided by the stream. As you will see, this information is used later if we have to split a page.

The Page::encode function is the dual of Page::decode, but is simpler since it does not have to calculate the encoded length for each KeyInfo object. Note that Page::encode does no checking to make sure that the encoded page fits within the PageSize limit. It is assumed that higher levels have checked this. I will admit that this is one area where a little more paranoia might be a good idea. An exception here would quickly reveal a logic error somewhere else in the implementation. Of course, like every other programmer, I am reluctant to add code that tests for conditions that "should never happen."

Page add_key/remove_key/update_key

Now let me turn to the three main functions that maintain the KeyInfo in a Page. Page::add_key is called whenever a new key is inserted into the BtreeIndex. remove_key is invoked whenever a key is deleted. update_key is used to change the link value associated with a key. It is invoked when a client updates the link through the iterator.

None of these functions are virtual, so they work equally well on either the root page or any other page. As a final note, these three functions are only applied to leaf pages of the index. Internal pages of the index hold links to other index pages. The keys on these pages are manipulated by the page-link functions described later.

add_key is shown below:

size_t
BtreeIndex<Key,Compare,KeyTraits,PageSize>::Page::add_key(
    BtreeIndex& btree,
    const std::pair<const Key, size_t>& x)
{
    // Encode the <key,link> to see how long it is
    char buf[ (PageSize - HeaderSize) / 2 ];
    XOutDataBuf xstrm(btree._xstream, buf, sizeof(buf));

    KeyTraits::encode(xstrm, x.first) << x.second;
    if (xstrm.fail()) 
        throw std::ios_base::failure("Key too large");

    KeyInfo info(x.first, x.second, short(xstrm.tellp()));
    
    // Now put the KeyInfo object into the right
    // place in the list
    int slot = 0;
    KeyList::iterator kit = _keyList.begin();
    for (; kit != _keyList.end(); ++kit) {
        if ( btree._comp(x.first, (*kit).key) ) break;
        ++slot;
    }
    kit = _keyList.insert(kit, info);
    _dataLen += info.xlen;
    _dirty = true;
    return slot;
}

Conceptually there are two sections. First, encode the <key,link> into a buffer to see how long the encoded form is. Note that we require an encoded key to take up less than half of the available PageSize. Stated another way, we require that PageSize be large enough to hold at least two keys in the worst case. This may seem like an arbitrary restriction — and perhaps it is. I do it because it simplifies certain other routines as we will see below.

After the encoded length of the <key,link> pair is determined, we create the actual KeyInfo object. Then we figure out where to insert it in the KeyList. To do this, we apply the Compare functor repeatedly until we find a key in the list that is greater than the new key. In the process, we keep track of the index position in the list. After inserting the new key in its correct place, we add its encoded length to the overall length of the data portion of the page, mark the page as "dirty," and return the index position where the new key is located. The latter is used to create the BtreeIndex::Iterator that refers to this newly inserted key.

remove_key is simpler:

bool
BtreeIndex<Key,Compare,KeyTraits,PageSize>::Page::remove_key(
    int slot)
{
    if (size_t(slot) >= _keyList.size()) 
        throw std::ios_base::failure("Invalid slot for erase");
    bool rtn = _dataLen < (PageSize - HeaderSize) / 2;
    // Find the slot
    KeyList::iterator kit = _keyList.begin();
    while (slot > 0) {
        ++kit;
        --slot;
    }
    _dataLen -= (*kit).xlen;
    _keyList.erase(kit);
    _dirty = true;
    return rtn;
}

Note first that it is based on slot number. This is what is carried in a BtreeIndex::Iterator. First we validate the slot number. I could have omitted this test, with the usual caveat about "undefined behavior" if the slot number is invalid. Since the test is simple, however, I go ahead and do it.

The next statement needs a little explanation. One of the "requirements" for a B-tree is that all pages (except root) be at least half full. To fulfill this requirement, when a key is deleted and a page becomes less than half full, we might have to move a key from a sibling page or even consolidate two pages. Since our BtreeIndex allows variable length keys, the trick is to determine when a page is actually less than half full. We check the total encoded length of the keys against the available space, but we do it before we actually remove the key. This way, if the page was more than half full before we remove the key, we think of it as being at least half full after we remove the key. If the page was less than half full before we remove the key, then we know it is really less than half full afterward. The result of the test is the function return value. This signals higher-level routines to handle any necessary key shuffling or page consolidation. You will note that the restriction of a key to a maximum length less than half the available buffer size facilitates this test.

Finally, we locate the key by its slot number and then erase it.

update_key is the simplest of all:

void
BtreeIndex<Key,Compare,KeyTraits,PageSize>::Page::update_key(
    size_t slot, 
    size_t link)
{
    if (slot >= _keyList.size()) 
        throw std::ios_base::failure("Invalid slot for update");
    // Find the slot
    KeyList::iterator kit = _keyList.begin();
    while (slot > 0) {
        ++kit;
        --slot;
    }
    (*kit).link = link;
    _dirty = true;
    return;
}

Again, I test the slot index for validity. After that, we just find the key and then change its link field.

Page insert_page_link/update_page_link/remove_page_link

Whereas the three functions described above are used when the target page is a leaf page, these three functions perform analogous operations on internal pages. insert_page_link is called whenever a new page has been created as the result of a page split. remove_page_link is invoked when a page is removed because it is empty. update_page_link is used when the first key on a page is changed. This can happen when a new key is inserted in the index that is less than all the already existing keys, when a key is deleted, or when we shuffle keys between pages to keep them all more or less half full. When the first key on a page changes, the key that links to that page at a higher level in the index has to be changed to correctly reflect what key values are available on the lower-level page.

Now, since all pages — even the root — hold <key,link> pairs where the link is just the XDRStream offset, you might wonder why the previous three functions cannot handle all the functionality that these functions provide. The simple answer is that they could, but that would have required more work by higher-level routines. In a sense, the xxx_page_link functions are utilities that take advantage of certain internal knowledge. Let me dive into insert_page_link and explain what I mean:

void
BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::Page::insert_page_link(
    KeyList::iterator kit, 
    Page* p)
{
    // Build a KeyInfo object from the link page
    KeyInfo info = p->key_list().front();
    info.link = p->page_id();
    _dataLen += info.xlen;
    _keyList.insert(kit, info);
     _dirty = true;
}

insert_page_link is called whenever a lower-level page is split and creates a new page. You will have noted above that the add_key function did not test whether it should split itself. This test is performed by the BtreeIndex function that actually calls add_key (and in a few other places).

I will go into this in more detail when I show the code for BtreeIndex itself, but for now it suffices to know that most operations in BtreeIndex are recursive. Each takes a pointer to the page that it is to operate upon and either finds the next lower page and invokes itself or stops when it reaches a leaf page. This process effectively defines the path from the root to a specific key in the index. As the call stack unwinds, any operations that need to propagate up the tree are performed. This includes (amongst other things) putting references to new pages into the parent page. insert_page_link is therefore invoked by a BtreeIndex function that has first located a lower-level page, invoked an operation on said lower-level page, and then split that lower-level page. The BtreeIndex function has both a pointer to the lower-level page and to that page's parent.

You may have also noticed that Page does not have a find_key function. Instead, it provides a key_list function that allows clients direct access to the KeyInfo container. This allows clients of Page to search the container themselves and to actually update the container. Obviously this violates data hiding and encapsulation rules, and I am not real happy about it. Since Page is an internal type, I have decided I can live with it for now, but just barely. In this case, the BtreeIndex function holds the KeyList::iterator that linked to the page that was just split. This makes it easy to tell the parent page where to insert the new link. This is important because insert_page_link can be necessary under circumstances where the page itself would have difficulty determining the proper place to insert the new link.

The function takes advantage of certain information already in the tree. Unlike add_key above, it does not have to encode the key to determine its length. The key for a new page is always the same as the first key on the page, so this function just copies that KeyInfo object (which already contains the encoded length) and updates the link member (which doesn't change its encoded length). It then inserts the new KeyInfo object at the point where it has been instructed, updates the overall data length, and marks the page as dirty.

remove_page_link is slightly more complex than insert_page_link. It is shown here:

bool
BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::Page::remove_page_link(Page* p)
{
    bool rtn = _dataLen > (PageSize - HeaderSize) / 2;
    // Find the page reference
    KeyList::iterator kit = _keyList.begin();
    for (; kit != _keyList.end(); ++kit) {
        if ((*kit).link == p->page_id()) break;
    }
    if (kit == _keyList.end()) 
        throw std::ios_base::failure("Invalid page");
    // Remove it
    _dataLen -= (*kit).xlen;
    _keyList.erase(kit);
    _dirty = true;
    return rtn;
}

This function is obviously used whenever a descendent page is removed from the index. Like the remove_key function above, it first tests itself to see if it is half full. It then locates the key with the page reference to be removed. Remember that a page's ID is the same as its offset in the index file, and that this is the value held in the link portion of the KeyInfo object. Since remove_page_link is an internally invoked function, we should never encounter the error condition that I test for, but in this case I was cautious.

update_page_link is shown below:

void
BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::Page::update_page_link(Page* p)
{
    // Find the page reference
    KeyList::iterator kit = _keyList.begin();
    for (; kit != _keyList.end(); ++kit) {
        if ((*kit).link == p->page_id()) break;
    }
    if (kit == _keyList.end()) 
        throw std::ios_base::failure("Invalid page");
    // Carefully replace the keyinfo at this slot
    _dataLen -= (*kit).xlen;
    (*kit) = p->key_list().front();
    (*kit).link = p->page_id();
    _dataLen += (*kit).xlen;
    _dirty = true;
}

As noted above, this function is necessary whenever the first key on a page changes. The tricky thing about this function is that even though the number of keys on the page is not changing, the encoded length of the page may very well change. So, we have to treat the length as if we first removed the existing key and then inserted the new one. We simply copy the actual key from the lower-level page and then change the link.

Page must_split

The final non-virtual function in Page is the must_split function. This simply tests whether the combined encoded length of the keys held by the page exceeds the space available to actually encode them. It is a one-line function and probably could be inlined, but I didn't bother.

bool
BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::Page::must_split() const
{
    return _dataLen > (PageSize - HeaderSize);
}

Page Virtual Functions

The seven functions described above apply to all pages. The encode/decode functions are virtual since the root page contains different information in its header. Because the differences are trivial, I will not show the RootPage encode and decode functions. This leaves three other virtual functions defined by Page: split, consolidate, and shuffle. I will show both the Page and the RootPage versions of these below, but first a few general comments. You will note that each of these functions returns a pointer-to-Page. For split, the returned pointer is the new page that needs to be added to the parent. For consolidate, it is the page that needs to be removed from the parent. Finally, for shuffle, it is the page that has had its first key changed and thus needs to have its key updated in the parent. Naturally, the RootPage versions of the functions do not need to return anything since root has no parent, but the signatures remain the same. (shuffle is meaningless for the root, so it does not really need to be a virtual function of Page. I may fix this in the next version.)

Page split

BtreeIndex<Key,Compare,KeyTraits,PageSize>::Page*
BtreeIndex<Key,Compare,KeyTraits,PageSize>::Page::split(
    BtreeIndex& btree)
{
    // Go through the page and find where to split it
    KeyList::iterator kit = _keyList.begin();
    size_t len = 0;
    while (len < (PageSize - HeaderSize) / 2) {
        len += (*kit).xlen;
        ++kit;
    }
    
    // Now move the rest of the elements to a new page
    Page* newp = btree.get_free_page();
    KeyList& kl = newp->_keyList;
    kl.splice(kl.end(), _keyList, kit, _keyList.end());
    newp->_dataLen = _dataLen - len;
    _dataLen = len;

    // link the page in after this one
    newp->next_page(_nextPage);
    newp->prev_page(_pageId);
    if (Page* nextp = btree.get_page(_nextPage))
        nextp->prev_page(newp->page_id());
    _nextPage = newp->page_id();

    // make sure both pages get flushed
    _dirty = newp->_dirty = true;

    // Return the new page
    return newp;
}

Splitting a regular page involves first figuring out where to split it. Since keys can vary in size, we need to do something slightly more intelligent than just dividing the number of keys on the page by two. My criterion is to go through the keys and add their encoded lengths until the result is greater than half the available space. This means that after the split the original page will be more than half full, and the new page will be what we define as exactly half full. You will note that because we require that the encoded <key,link> length be no greater than half the available space, we are always assured that we can split a page into just two pages. Without the restriction on encoded key length, we run the risk of a pathological case that would require us to split a single page into three pages.
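The scan and its guarantee can be checked in isolation. In this sketch, CAPACITY stands in for (PageSize - HeaderSize) and a vector of xlen values stands in for the key list; both names are illustrative, not part of BtreeIndex:

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for (PageSize - HeaderSize).
const std::size_t CAPACITY = 100;

// Walk the encoded lengths until at least half the capacity is
// accounted for; that many keys stay on the original page.
std::size_t split_point(const std::vector<std::size_t>& xlens)
{
    std::size_t len = 0, keep = 0;
    while (len < CAPACITY / 2) {
        len += xlens[keep];
        ++keep;
    }
    return keep;
}
```

Because every encoded key is capped at CAPACITY / 2, the scan stops no more than one key past the halfway mark, which is why the remainder always fits on a single new page.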

The next step is to get a new page. As I described in my last column [4], this involves more than just a call to new. Empty index pages are linked together through a member of the root page. Whenever a new page is needed, this linked list is first checked. If an empty page is available, it is reused. If not, a new page is created. In either case, the page is added to the page cache. All this is handled by the BtreeIndex::get_free_page function. A reference to the BtreeIndex itself is passed to the function partially to allow for this.

Once we have a new page, the necessary keys have to be moved to it. This is done using the splice function of the standard list container. splice is a little like insert — it takes an iterator where the splice is to be done — but it takes another list container and two iterators as its additional arguments. The other container has to be of the same type. The function then moves (not copies) the range represented by the iterators from their original list into the new one by changing pointer links. No data copies are made. After calling splice to move the keys, we just have to update the _dataLen member in both pages.
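The range form of splice can be demonstrated in a few lines; note that the spliced elements leave the source list entirely. The helper name here is mine:

```cpp
#include <cstddef>
#include <iterator>
#include <list>

// The range form of std::list::splice relinks nodes from one list
// into another; no elements are copied and no allocation occurs.
void move_tail(std::list<int>& from, std::list<int>& to,
               std::size_t firstMoved)
{
    std::list<int>::iterator it = from.begin();
    std::advance(it, firstMoved);
    to.splice(to.end(), from, it, from.end());
}
```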

After getting the keys moved, we link the new page into the double linked list of sibling pages that is maintained at each level in the index. These links will be discussed more when I describe the BtreeIndex functions. For now, we have to fetch the page that was the next page (if it exists) so that we can update its prev page link.

Finally, after updating everything, both pages are marked as dirty so that the cache will flush them when it should. The pointer to the new page is returned.

The RootPage version of split is similar.

BtreeIndex<Key,Compare,KeyTraits,PageSize>::Page*
BtreeIndex<Key,Compare,KeyTraits,PageSize>::RootPage::split(
    BtreeIndex& btree)
{
    // Find where to split the page
    KeyList::iterator kit = _keyList.begin();
    size_t len = 0;
    while (len < (PageSize - HeaderSize) / 2) {
        len += (*kit).xlen;
        ++kit;
    }
    // Splitting the root page means creating two new pages
    Page* newp1 = btree.get_free_page();
    Page* newp2 = btree.get_free_page();

    // move the first part to new page 1
    KeyList& kl1 = newp1->key_list();
    kl1.splice(kl1.end(), _keyList, _keyList.begin(), kit);
    newp1->data_len(len);

    // move what's left to the second new page
    KeyList& kl2 = newp2->key_list();
    kl2.splice(kl2.end(), _keyList);
    newp2->data_len(_dataLen - len);
    _dataLen = 0;

    // link the two pages together
    newp1->next_page(newp2->page_id());
    newp2->prev_page(newp1->page_id());

    // put the page links into the now empty root page
    insert_page_link(_keyList.end(), newp1);
    insert_page_link(_keyList.end(), newp2);
    _depth += 1;
    _dirty = true;
    return 0;
}

Like before, we start by figuring out where to split the page. Splitting the root page involves creating a new root page and increasing the depth of the index tree. Since the root page is a different kind of page object, I don't bother to create a new one; I simply create two new internal pages and move all the keys into them. Using the list splice function makes this very easy. The second form of splice here takes only another container. It moves all the elements from that container into the specified one at the requested insertion point. Once we have moved all the keys to the lower-level pages, we link them together (they are currently the only two pages at that level of the tree), and then we update the root page with the new links. In this case, the root page updates itself — it is the parent.
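The whole-list form used for the second new page is equally brief. A standalone illustration (the function name is mine, not part of BtreeIndex):

```cpp
#include <list>

// The single-argument form of splice moves every element of the
// source list into the destination at the given position, leaving
// the source empty. This is just what RootPage::split needs for the
// second new page.
void drain_into(std::list<int>& from, std::list<int>& to)
{
    to.splice(to.end(), from);
}
```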

Page consolidate

Consolidation is effectively the opposite of splitting. Whereas the test of whether a page should be split was factored into a separate function, the consolidate function actually determines whether any consolidation is possible. (This is a slight design inconsistency that I may go back and clean up at some point — by putting the must_split test into the split function.) In this case, consolidation is possible if a page is now less than half full and one of its siblings is no more than half full. Now you see why the functions that remove keys from a page return a Boolean reporting whether the page is still at least half full: their invoking functions use it to decide whether to attempt consolidation at all. Note that functions such as remove_key do not themselves call consolidate, because deciding to attempt consolidation is a strategy function of the BtreeIndex as a whole. So, functions like remove_key return an indication of whether consolidation might be possible, and higher-level functions then determine whether or not it should be attempted.

The functions themselves are fairly straightforward. Page::consolidate is shown here:

BtreeIndex<Key,Compare,KeyTraits,PageSize>::Page*
BtreeIndex<Key,Compare,KeyTraits,PageSize>::Page::consolidate(
    BtreeIndex& btree)
{
    // Get the prev and next pages - we will need them both
    Page* prevp = btree.get_page(_prevPage);
    Page* nextp = btree.get_page(_nextPage);

    // See if we can consolidate with the prev page
    // consolidate
    if (prevp && prevp->_dataLen + _dataLen <= (
        PageSize - HeaderSize)) {
        KeyList& kl = prevp->_keyList;
        kl.splice(kl.end(), _keyList);
        prevp->_dataLen += _dataLen;

        // make the current page free
        prevp->next_page(_nextPage);
        if (nextp)
            nextp->prev_page(_prevPage);

        // make sure the surviving page gets flushed
        prevp->_dirty = true;
        prevp->next_page(_nextPage);
        if (nextp)
            nextp->prev_page(_prevPage);
        btree.add_free_page(this);
        return this;
    }

    // See if we can consolidate with the next page
    if (nextp && nextp->_dataLen + _dataLen <= (
        PageSize - HeaderSize)) {
        KeyList& kl = nextp->_keyList;
        _keyList.splice(_keyList.end(), kl);
        _dataLen += nextp->_dataLen;

        // free the next page and keep the sibling links consistent
        next_page(nextp->_nextPage);
        if (Page* p = btree.get_page(nextp->_nextPage))
            p->prev_page(_pageId);

        // make sure the surviving page gets flushed
        _dirty = true;
        Page* p = btree.get_page(nextp->_nextPage);
        if (p) {
            p->prev_page(_pageId);
            next_page(p->_pageId);
        }
        btree.add_free_page(nextp);
        return nextp;
    }
     
    // No consolidation possible
    return 0;
}

The function first checks whether the current page can be consolidated into the previous sibling page (assuming it exists). If so, the keys are spliced into that page, the data length is updated, and the current page is moved to the free page list. A pointer to the freed page is returned.

If consolidation with the previous sibling is not possible, the next sibling is checked. You will note that consolidation always involves adding keys to the end of the previous page. This means that the higher-level BtreeIndex function just has to remove the empty page link from the parent page — no other keys have to be updated.

The consolidation function for RootPage is shown below. It is actually much simpler.

BtreeIndex<Key,Compare,KeyTraits,PageSize>::Page*
BtreeIndex<Key,Compare,KeyTraits,PageSize>::RootPage::consolidate(
    BtreeIndex& btree)
{
    // consolidating the root page means eliminating
    // the lower level page
    Page* oldp = btree.get_page(_keyList.front().link);
    _keyList.clear();

    // move the keys from the lower level page
    _keyList.swap(oldp->key_list());
    _dataLen = oldp->data_len();
       
    btree.add_free_page(oldp);
    _depth -= 1;
    _dirty = true;
    return oldp;
}

The root page can only be consolidated when it contains a single <key,link> element. Furthermore, the depth of the tree must be greater than one. These tests are done by the higher-level BtreeIndex function and are not repeated in the actual RootPage::consolidate function. We just move all of the single lower-level page's keys into the root and then free that page. We also decrement the depth of the tree by one.
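The swap trick in RootPage::consolidate deserves a note: std::list::swap exchanges the internal pointers of the two containers in constant time, so the child's entire key list becomes the root's without copying a single element. A quick illustration (the helper name is mine):

```cpp
#include <list>

// std::list::swap exchanges the two lists' contents in O(1) by
// swapping internal pointers. After the call, 'root' holds what
// 'child' held, and vice versa.
void adopt_keys(std::list<int>& root, std::list<int>& child)
{
    root.clear();        // discard the old root contents first
    root.swap(child);    // now root owns the child's elements
}
```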

Finally, there is the shuffle function. This function is invoked when a page has become less than half full, and consolidation with a sibling page is not possible. In order to keep the current page as close to half full as possible, we need to move a key from one of the sibling pages. While actually moving the key is not hard (yay for splice), one page is going to have its first key value changed. (This may not actually be true if we have duplicate keys, but we don't check for that case.) This means that the link to that page in its parent has to be changed. This makes things complicated for the BtreeIndex as a whole, but fortunately, shuffle is easy.

BtreeIndex<Key,Compare,KeyTraits,PageSize>::Page*
BtreeIndex<Key,Compare,KeyTraits,PageSize>::Page::shuffle(
    BtreeIndex& btree)
{
    // See if we can get a key from the previous page
    Page* prevp = btree.get_page(_prevPage);
    if (prevp) {
        KeyList& kl = prevp->_keyList;
        _keyList.splice(_keyList.begin(), kl, --kl.end());
        prevp->_dataLen -= _keyList.front().xlen;
        _dataLen += _keyList.front().xlen;
        // make sure both pages get flushed
        _dirty = prevp->_dirty = true;
        return this;
    }

    // Otherwise see if we can get a key from the next page
    Page* nextp = btree.get_page(_nextPage);
    if (nextp) {
        KeyList& kl = nextp->_keyList;
        _keyList.splice(_keyList.end(), kl, kl.begin());
        nextp->_dataLen -= _keyList.back().xlen;
        _dataLen += _keyList.back().xlen;
        // make sure both pages get flushed
        _dirty = nextp->_dirty = true;
        return nextp;
    }
       
    // No shuffle possible
    return 0;
}

Note that here we see the third and final version of splice used. This version just moves one element from one list to another.
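For completeness, here is the single-element form in isolation; it takes the destination position, the source list, and an iterator to the one element to move. The helper name is mine:

```cpp
#include <list>

// The single-element form of splice moves exactly one node. Moving
// the last element of 'from' to the front of 'to' mirrors what
// shuffle does when it borrows a key from the previous sibling.
void borrow_last(std::list<int>& from, std::list<int>& to)
{
    to.splice(to.begin(), from, --from.end());
}
```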

Since the root page does not have any siblings, there is no shuffle function defined for it.

That wraps up the look at BtreeIndex::Page and BtreeIndex::RootPage plus some associated classes. Next time I will review the PageCache class, and (hopefully) the rest of the BtreeIndex functions.

Source Code

All of the IndexedFile source code, along with the necessary XDRStream source code, is included in reeves.zip. This code differs somewhat from the code shown in this column. In order to facilitate testing on compilers that are less than friendly to nested classes defined within template classes, I moved all the nested class definitions to the main file level. The functionality remains exactly the same. What I have shown in this column is how the code should look (in my opinion) if developed with a truly Standard C++ compiler.

Notes

[1] The code shown in Listing 1 is how I think it should look. The subordinate classes of BtreeIndex are declared as nested classes. The code available for download from the CUJ website (reeves.zip) has the same functionality, but has been reorganized to make it friendlier for certain compilers.

[2] Robert Sedgewick. Algorithms in C++, Third Edition, Parts 1-4 (Addison-Wesley, 1998).

[3] Jack Reeves. "The (B)Leading Edge: Building an IndexedFile Class using XDR_Stream," C/C++ Users Journal C++ Experts Forum, March 2002, <www.cuj.com/experts/2003/reeves.htm>.

[4] Jack Reeves. "The (B)Leading Edge: Musings on the IndexedFile Design", C/C++ Users Journal C++ Experts Forum, May 2002, <www.cuj.com/experts/2005/reeves.htm>.

Jack W. Reeves is an engineer and consultant specializing in object-oriented software design and implementation. His background includes Space Shuttle simulators, military CCCI systems, medical imaging systems, financial data systems, and numerous middleware and low-level libraries. He currently is living and working in Europe and can be contacted via jack_reeves@bleading-edge.com.