Introduction
In this installment of "The (B)Leading Edge," I am going to (hopefully) complete the walkthrough of the BtreeIndex class that underpins my IndexedFile class. Just to recap a tiny bit (one of the really great things about writing this column on the Web is that I can assume readers have access to the previous columns, so I don't have to do detailed summaries), the BtreeIndex class contains the following nested classes:
BtreeIndex Iterator StreamPosProxy Page RootPage PageCache XIODataStreambuf XInDataBuf XOutDataBuf KeyInfoIn the last column [1], I went through PageCache, Iterator, and StreamPosProxy. In the column before that [2], I discussed Page, RootPage, KeyInfo, and the three XIODataBuf classes. Now I get to put them all together. As before, Listing 1 is the definition of the BtreeIndex class. (I have removed the actual definitions of the nested classes to make things easier to read.) At the end of the class definition, you will see 12 private functions defined. All of the actual work of the BtreeIndex is handled by these utility functions. Let us start with the last four functions, since these are the very heart of the BtreeIndex functionality.
lower_bound/upper_bound
As you know, a BtreeIndex consists of a collection of pages arranged as a tree in memory. Each page contains one or more keys, plus a link. The link is either to another page in the index or to the actual data the key represents. The actual data is presumed to be stored in a separate file, but the link which is just a 32-bit integer can reference anything. It can even be empty, in which case the key itself is the actual data. Figure 1 shows how a tiny section of BtreeIndex with integer keys might be organized. Just about any operation on a BtreeIndex involves searching the tree from the root to a leaf node. As you might suspect, this is a perfect situation for a recursive algorithm.
Each of the four internal utility functions, lower_bound, upper_bound, insert, and erase, corresponds to an identically named function in the public interface of BtreeIndex. If these names seem familiar (they should), remember that I modeled the public interface of BtreeIndex on the STL multimap container. Let's take a look at the public lower_bound function and its implementation via the private lower_bound function. The code for the public function is:
template<typename Key, typename Compare, typename KeyTraits, size_t PageSize> typename BtreeIndex<Key,Compare,KeyTraits, PageSize>::Iterator BtreeIndex<Key,Compare,KeyTraits, PageSize>::lower_bound(const Key& k) { if (empty()) return end(); Iterator it = lower_bound(_root, _root->depth(), k); _pageCache.reset(); return it; }It's pretty simple, but a few points need to be made. First, I have the top-level functions deal with boundary conditions. This seems like a fairly arbitrary decision, but it turns out that it is much easier to handle the recursion in the lower-level function and everything peripheral to the recursion one level removed. This is one of those things that you never seem to stop and think about consciously when doing a design, and, with a little experience, you will get them right most of the time. If you get it wrong, however, it can really make things complicated. In this case, the boundary conditions are to check for an empty index before searching the tree and to reset the page cache afterwards. Recall that this function (PageCache::reset) tells the page cache that it can remove any pages in excess of its maximum size [1]. The heart of this function is the call to the internal version of lower_bound, which actually does all the work. Besides the key to be looked up, it is passed a pointer to the root page and the depth of the tree. Now the work really begins.
template<typename Key, typename Compare, typename KeyTraits, size_t PageSize> typename BtreeIndex<Key,Compare,KeyTraits, PageSize>::Iterator BtreeIndex<Key,Compare,KeyTraits,PageSize>::lower_bound (Page* p, int depth, const Key& key) { if (depth == 0) { typename Page::KeyList& kl = p->key_list(); typename Page::KeyList::iterator it = kl.begin(); size_t slot = 0; for (; it != kl.end(); ++it, ++slot) { if ( not _comp((*it).key, key) ) break; } if (it != kl.end()) { return Iterator(this, p->page_id(), slot); } else if (p->next_page() == size_t(-1)) { return end(); } else { return Iterator(this, p->next_page(), 0); } } else { // find the next page to look on typename Page::KeyList& kl = p->key_list(); typename Page::KeyList::iterator it = kl.begin(); for (++it; it != kl.end(); ++it) { if ( not _comp((*it).key, key) ) break; } --it; return lower_bound(get_page((*it).link), depth-1, key); } }As you will see in the other functions, every one of the internal utilities follows this same pattern: if the depth is zero, then the page argument is a leaf page and one algorithm is performed; otherwise the page is an internal page, and a different approach is taken, which always involves a recursive call to the same function. For lower_bound, if the page is a leaf page, then the keys are searched using the lower-bound test (k <= page[i] for some slot number i on the page) . At the end of that search, one of three conditions holds: a key has been found that represents the lower bound (its iterator is returned), no key has been found on this page and there is no next-page (this must be the last leaf page in the tree, so the end() iterator is returned), or there is a next-page link. In the last case, we know from the way the tree is organized that its keys are all higher than the requested key, so we just return an iterator to the first key on that page.
If the page is not a leaf page, then we do the same search almost. In the second case, we are not trying to find the actual lower bound, but to determine which lower-level page to search. In this case, when we have found the lower bound, or run off the end of the list, we know we have gone one slot too far. So we back up one and then recursively call the same function. In the recursive call, we pass it the next-level page we have found a link to, and we reduce the depth by one. Note carefully that because we always back up one slot, we start the search with the second slot on the page. Since every page has at least one key, this works. This is the kind of thing that causes off-by-one errors and makes this stuff tricky. It is also the kind of thing that if you get right nobody else ever has to worry about.
The code for upper_bound, both public and internal functions, is virtually identical to lower_bound except that a different test is performed, so I will skip directly to insert. The public version of insert is:
template<typename Key, typename Compare, typename KeyTraits, size_t PageSize> typename BtreeIndex<Key,Compare,KeyTraits, PageSize>::Iterator BtreeIndex<Key,Compare,KeyTraits, PageSize>::insert(const value_type& x) { Iterator it = insert(_root, _root->depth(), x); if (_root->must_split()) { _root->split(*this); _pageCache.min_size(_root->depth()); } _root->incr_total(); flush(_uStrategy); return it; }In this case, all the "boundary" conditions are after we do the recursive insert call, so let us look at that first, and then I'll come back to this version.
template<typename Key, typename Compare, typename KeyTraits, size_t PageSize> typename BtreeIndex<Key,Compare,KeyTraits, PageSize>::Iterator BtreeIndex<Key,Compare,KeyTraits,PageSize>::insert (Page* p, int depth, const value_type& x) { if (depth == 0) { // leaf page size_t slot = p->add_key(*this, x); return Iterator(this, p->page_id(), slot); } else { // internal page Iterator it; Page* nextp; typename Page::KeyList& kl = p->key_list(); typename Page::KeyList::iterator kit = kl.begin(); if ( _comp(x.first, (*kit).key) ) { // Boundary condition -- // key is less than first key on page nextp = get_page((*kit).link); it = insert(nextp, depth-1, x); p->update_page_link(nextp); } else { // find which page to put the new key in for (++kit; kit != kl.end(); ++kit) { if ( _comp(x.first, (*kit).key) ) break; } --kit; nextp = get_page((*kit).link); it = insert(nextp, depth-1, x); } if (nextp->must_split()) { Page* newp = nextp->split(*this); p->insert_page_link(++kit, newp); } return it; } }In this case, the leaf page case is trivial. All the real work of inserting a new key into a leaf page is handled by the page itself via its add_key function. This was described in [2], but to summarize, it encodes the key to see how long its encoded form is and then creates a KeyInfo object that it adds to the pages' key list in the right place. Once the new key has been added, the Iterator referring to that key is created and returned. This Iterator will percolate up the call stack and eventually be returned to the user as the return value of the public insert function, but there are a few things that have to happen along the way.
The non-leaf page section of insert is where we take advantage of the fact that recursion basically builds a path from the root page to the page where we insert the new key. There are two separate if statements here. The first tests whether the key is less than the first key on the page. This is a boundary condition where the key being inserted is less than all the other keys in the tree. In this case, this test will succeed for every page as we walk down the tree, and we will eventually insert the key at the beginning of the first leaf page in the tree. After we do that, as the stack unwinds, we need to adjust the reference for every page all the way back up to the root. The update_page_reference function handles this. It is part of the Page class and was described previously. Again to summarize, every key in an internal page is the same as the first key on the lower-level page that it links to. (The first key in the root page is the first key in the index.) Whenever the first key on a page changes, usually because that key was removed, then the link to that page in the parent page must be updated. In this special case, the first key in the entire index is changing, so potentially several pages have to be updated.
If the key is greater than the first key in the list, then we will find an existing page to insert the key into. The else clause of the first if is exactly the same as the upper_bound function. In other words, it looks just like lower_bound except with a different test condition for ending the loop. I did it this way so that when an index has multiple identical keys, the items indexed will remain in the order that they were inserted. I did not have to do it this way, but it is one of those little things that users seem to expect.
The second if statement in the non-leaf branch deals with what happens after a key has been inserted into a lower-level page. Visualize that the stack has now been unwound one level. The lower-level page has a new key in it. That key may have just caused it to exceed the specified size of an external page. When that happens, the page must be split into two pages. So that is what we do here. Note that we do it one level up the call stack where we have the reference to the parent page that we will need to insert the new page reference into if we split the lower-level page. Of course, if we insert a new page reference into a page, then that page could become too large also, so this test is applied at every level as we unwind the stack. Note that the test also must be applied even if the only change was to update a page reference. In that case, a key may have a longer encoded length than the previous key, which could still cause the page to exceed its maximum size.
As you can see, this stuff has its tricky aspects, but it is not all that difficult once you understand the basic concept: walk down the tree until you find where to insert the new key, letting the call stack build a path of those pages visited in the search. Then insert the key and unwind the stack, splitting pages and inserting new page references as necessary in the process.
When we have completely unwound the recursive call stack, we return to the public insert function. At this point, we have to check whether we need to split the root page. If so, then that increases the depth of the tree, so we notify the page cache that its minimum size has increased. Finally, the public insert function increments the total key count in the root page and then calls flush to handle any necessary writing of changed pages to memory. Note that the flush function takes as its parameter the current update strategy. This is something that the client can change, and I will discuss it more when I describe flush itself.
erase
Erasing a key is the most difficult part of the BtreeIndex functionality. Most textbooks leave it as an exercise for the reader, but I am not going to do that to you. Here is the public erase function:
template<typename Key, typename Compare, typename KeyTraits, size_t PageSize> void BtreeIndex<Key,Compare,KeyTraits, PageSize>::erase(Iterator pos) { // The 'pos' does not contain the key, // but we need it to chase the links Key k = (*pos).first; // Find a path to the leaf page and remove the key erase(_root, _root->depth(), k, pos); _root->decr_total(); // See if the changes perculated to the root if (_root->depth() > 0 and _root->num_keys() == 1) { _root->consolidate(*this); _pageCache.min_size(_root->depth()); } else if (_root->must_split()) { _root->split(*this); _pageCache.min_size(_root->depth()); } flush(_uStrategy); }As with STL erase functions, it takes an iterator to indicate what should be erased. Since the iterator contains the page reference and slot number where the external key is located, a naïve first approach to this function would be to simply fetch the page and remove the key. That will not cut it, however, for several reasons. The most obvious is that a page could become empty. When that happens, we need to remove the page's reference from its parent. Alternatively, we might remove the first key on a page, in which case we need to update the page's reference in its parent. In an alternate universe (usually another one of those exercises left to the reader), this last step could be ignored. In such a design, internal pages might not contain the same key as the first key on the lower-level page, but rather a key guaranteed to be less than or equal to the first key on the lower-level page. Since you have to deal with the empty page problem anyway, I figured it was not much more effort to ensure that the keys stayed in sync as well.
In order to handle any necessary updates in parent pages, we must have a link to the parent. As a result, erase is pretty similar to insert. First we build the path to the key to be erased by recursively walking down the tree; then we unwind the call stack and update any parent pages as we go. The one thing that makes erase trickier than insert is that we can have multiple identical keys in the index. Therefore, we have to not only know what key we are trying to erase, but specifically which slot. For that reason, we must pass the iterator itself down the call stack. You will note that I make a copy of the key's value and also pass it as a parameter to the recursive erase function. Naturally, it is passed as a reference-to-const. I do this so that each iteration of erase does not have to make a copy of the key. Since Key is a user-defined type, I do not know how expensive it may be to copy the key. This may be a worthless premature optimization, but it really can't hurt anything either.
The internal erase function is:
template<typename Key, typename Compare, typename KeyTraits, size_t PageSize> std::pair<bool,bool> BtreeIndex<Key,Compare,KeyTraits,PageSize>::erase (Page* p, int depth, const Key& key, Iterator pos) { using std::pair; using std::make_pair; // we have to find the correct path to // the page in pos if (depth == 0) { if (pos._page != p->page_id()) return std::make_pair(false, false); // Ok - remove the key bool notHalfFull = p->remove_key(pos._slot); return std::make_pair(true, notHalfFull); } else { // find the link we think will lead to pos typename Page::KeyList& kl = p->key_list(); typename Page::KeyList::iterator kit = kl.begin(); for (++kit; kit != kl.end(); ++kit) { // use lower bound test if (not _comp(key, (*kit).key)) break; } --kit; // back up one link Page* nextp = get_page((*kit).link); pair<bool,bool> rtn = make_pair(false, false); for (;;) { rtn = erase(nextp, depth-1, key, pos); if (rtn.first) break; ++kit; if (kit == kl.end()) return make_pair(false, false); nextp = get_page((*kit).link); } // if rtn.first we got to the right page and // updated it now see what we have to do at // this level if (nextp->num_keys() == 0) { add_free_page(nextp); bool notHalfFull = p->remove_page_link(nextp); return make_pair(true, notHalfFull); } // See if the page link is still correct if ( _comp((*kit).key, nextp->key_list().front().key) ) { // we removed or updated the first key // in the lower page p->update_page_link(nextp); // might grow the page } if (rtn.second) { // the lower page is less than half full if (_dStrategy != None) { Page* oldp = nextp->consolidate(*this); if (oldp) { bool notHalfFull = p->remove_page_link(oldp); return make_pair(true, notHalfFull); } if (_dStrategy == Shuffle) { Page* op = nextp->shuffle(*this); p->update_page_link(op); // might grow the page } } } else if (nextp->must_split()) { // the lower page might have grown Page* newp = nextp->split(*this); p->insert_page_link(++kit, newp); } return make_pair(true, false); } }Like the previous functions, it has two sections: one for a leaf page and one for an internal page. When we get to a leaf page, and the page is the one specified in the iterator, then we remove the key. The return value from erase is a pair of Boolean values. The first one is necessary because we might have multiple identical keys that might span more than one leaf page. If it is false, then the key was not on the page specified. It tells the higher level to continue the search on the next page at that level of the tree. The second Boolean is used when the key is removed from a page. It indicates whether the page is now less than half full. Again to recap basic Btree design, if two adjacent pages both become less than half full, we can combine them into one page. In order to determine whether we should check for this possibility, the erase function returns this Boolean.
The non-leaf section of erase has several sections: first we find the lower bound of the key we are looking for. Then we iteratively search each page that might contain the key. When we find it, we move to the next step. If we don't find it, we report that failure to the next higher-level that is sitting in its loop doing exactly the same thing.
After we find the right-leaf page, we start unwinding the stack. First we check if the lower-level page is empty. If so, we stick the page on the free list and remove its link at this level. As with insert and page splitting, we check for the empty page one level up in the tree/call stack because at that point we have the reference to the parent page that might need to be updated.
If the lower level page is not empty, then we check to see if its first key was removed or updated. If this happens, we need to update the page link in the parent.
Finally, if the descendent page is less than half full, and the update strategy calls for it, then we see if we can consolidate pages. If that does not work, and again if the update strategy requires it, we shuffle keys to try and keep pages as close to half full as possible.
You will note that erase contains a check for must_split, which might seem misplaced. It is there because when a page has its first key removed or updated, that change will cause an update in the parent page, which might actually increase the size of the parent.
When we unwind the stack back to the public version of erase, we decrement the number of keys in the index and then see if the root page needs to be consolidated (or split). Note that the test for root page consolidation is different. Since the root page has no siblings, it cannot be consolidated that way. Instead, if the root page contains only one key, and it is a link to a lower-level page, then the root page can be eliminated and the depth of the tree reduced one level.
That is basically the entire functionality of BtreeIndex. Naturally, it all depends on supporting functionality in classes like Page/RootPage, but that is why they were created in the first place. Of course, I didn't get all of this right the first time through either, but it is getting pretty close by now. A few more pieces, and we will be done.
get_page/get_free_page/add_free_page
These functions are basically wrappers at the BtreeIndex level for lower-level functions. get_page invokes the similar PageCache function. add_free_page puts an empty page on the list of free pages linked together from the root page. The get_free_page function retrieves a free page from the linked list, if there is one. Otherwise it creates a new Page from the free store. In the latter case, it also assigns the page to the page cache. get_free_page has one tricky aspect: when it creates a new Page from the free store, it immediately flushes that empty page to the external store. I did not want to do this at first, but since a page's ID is its streampos location in the external store, this is the only way to ensure that the page's ID is immediately saved. This is necessary because splitting the root page makes two consecutive calls to get_free_page.
flush
I mentioned above that flush takes an argument that indicates how it works. The code is still pretty straightforward:
template<typename Key, typename Compare, typename KeyTraits, size_t PageSize> void BtreeIndex<Key,Compare,KeyTraits, PageSize>::flush(UpdateStrategy s) { _pageCache.reset(); // remove excess pages if (s == Safe) { _pageCache.flush(); if (_root->is_total_dirty() or _root->is_dirty()) _root->encode(_xstream); } else if (s == MostlySafe) { _pageCache.flush(); if (_root->is_dirty()) _root->encode(_xstream); } }First, we reset the page cache. This happens no matter what. Then, if the update strategy is Safe, we flush any remaining pages in the page cache and flush the root page if it has had any changes. If the update strategy is MostlySafe, we do almost exactly the same thing except that we don't flush the root page if the only change has been to update the total number of keys that it maintains. Finally, although it is not shown here, if the update strategy is Unsafe, then we don't do anything other than reset the page cache.
get_value
The functions get_value, prev_iter, and next_iter are used by the Iterator class to implement its functions. You will recall that dereferencing a BtreeIndex::iterator returns a pair consisting of the key and a StreamposProxy object that contains a link to the actual data. This is the job of get_value. Given what you have already seen, it is very straightforward:
template<typename Key, typename Compare, typename KeyTraits, size_t PageSize> std::pair<const Key, typename BtreeIndex<Key,Compare, KeyTraits,PageSize>::StreamposProxy> BtreeIndex<Key,Compare,KeyTraits, PageSize>::get_value(size_t page, size_t slot) { Page* p = get_page(page); // Find the slot typename Page::KeyList::iterator kit = p->key_list().begin(); while (slot > 0 and kit != p->key_list().end()) { ++kit; --slot; } if (kit == p->key_list().end()) throw std::ios_base::failure ("Invalid slot"); return std::make_pair((*kit).key, StreamposProxy((*kit).link, this, page, slot)); }Unlike erase, we don't need the path to the leaf page, so we just fetch it directly from its ID in the Iterator.
prev_iter
template<typename Key, typename Compare, typename KeyTraits, size_t PageSize> void BtreeIndex<Key,Compare,KeyTraits, PageSize>::prev_iter(Iterator& it) { if (it._slot > 0) { --it._slot; } else if (it == end()) { // Walk down the tree and find the // last slot on the last page int depth = _root->_depth; Page* p = _root; while (depth > 0) { p = get_page(p->key_list().back().link); --depth; } it._page = p->page_id(); it._slot = p->num_keys() - 1; } else { Page* p = get_page(it._page); Page* prevp = get_page(p->prev_page()); if (prevp) { it._page = prevp->page_id(); it._slot = prevp->num_keys() - 1; } } }prev_iter (and next_iter) are greatly simplified by the fact that pages at the same level in the tree are linked together. As you will see below, next_iter is pretty simple. prev_iter is pretty simple also, except for the boundary condition of when we try to decrement the end iterator. The end iterator is a special value with an invalid page ID (-1). In order to increment backwards to the last valid key in the index, we have to actually search the tree for that key. In this case, I don't bother to recursively walk the tree because I do not need to save the context. Instead I just loop chasing the last link on every page until I get to a leaf node. Then the last link on that page is the one I want. Once I have a valid iterator other than end, decrementing said iterator is a simple matter of decrementing the slot number, provided it is greater than zero. If it is zero, then we get the last key on the previous page.
Note that if there is no previous page, we have reached the begin iterator. Technically, trying to decrement begin yields undefined behavior, so it does not matter what I do in that case. What I actually do is nothing at all. This is probably not the best design. Since I make the test anyway, I can just as easily throw an exception. That is my preferred form of "undefined behavior," so I will probably change this in the next version.
next_iter
template<typename Key, typename Compare, typename KeyTraits, size_t PageSize> void BtreeIndex<Key,Compare,KeyTraits, PageSize>::next_iter(Iterator& it) { Page* p = get_page(it._page); if (p) { if (++it._slot < p->num_keys()) return; else { it._page = p->next_page(); it._slot = 0; } } }next_iter is simpler than prev_iter because it does not have the problem boundary condition. Assuming the iterator is valid, which means that the page ID is valid, then we simply increment the slot number and check to see if it is still on the page. If not, get the first slot of the next page. If there is no next page, this automatically sets the iterator to be the end iterator. Again, if I have an invalid iterator, I can and probably should throw an exception.
Well, that just about wraps it up. All of the other public functions are implemented in terms of these functions, or they represent typical I/O functions such as open and close. I will mention the clear function, however, without showing its code. Rather than erase all the keys in the index, it simply creates a new external file. This is a lot faster.
On that note, I want to say something about the code files that are available for download. First off, as I said when I started this series of columns, this is a work in progress. When I started it, I actually thought I had a potential application where I would be able to test things. That has not happened, so my testing has been pretty haphazard. Every time I have sat down and done some, I have discovered a few more bugs. As a result, I have to warn everyone that this code is experimental and probably rates a version number of about 0.8. Nevertheless, I have managed to run some serious test cases, so I know the basic concepts are pretty much correct.
reeves.zip now contains two complete copies of source code. One set, in the "Std" directory, is what I have been describing in these columns. It is code designed to compile and run on a C++ platform (compiler and library) that is ISO Standard compliant. Unfortunately, those are still rather scarce in fact non-existent is my working assumption although to be fair I certainly do not have access to every possible platform. In any case, the platforms that I do have access to have some shortcomings, although they are getting steadily better. The one platform that I have constant access to is Microsoft's Visual C++. Unfortunately (again) this is not their latest platform, but rather Visual C++ v6.0. This version has some restrictions on nesting classes within templates amongst other problems. In order to be able to test on Visual C++, I have created a separate version of the code that moves all the nested classes up to namespace scope. This version is in the "Msvc" subdirectory. As you will see, I have done my best to keep the actual code as close to identical as possible. This is actually much closer than you might think. I am not happy about this situation, but I thought it was better to provide something that worked, even if it wasn't the ISO Standard version that I prefer.
As always, feedback is welcomed.
References
[1] Jack W. Reeves. "The (B)Leading Edge: Building an IndexedFile Using XDRStream, Part 4," C/C++ Users Journal C++ Experts Forum, August 2002, <www.cuj.com/experts/2008/reeves.htm>.
[2] Jack W. Reeves. "The (B)Leading Edge: Building an Indexed File Using XDRStream, Part 3," C/C++ Users Journal C++ Experts Forum, July 2002, <www.cuj.com/experts/2007/reeves.htm>.
Jack W. Reeves is an engineer and consultant specializing in object-oriented software design and implementation. His background includes Space Shuttle simulators, military CCCI systems, medical imaging systems, financial data systems, and numerous middleware and low-level libraries. He currently is living and working in Europe and can be contacted via jack_reeves@bleading-edge.com.