August 2002 C++ Experts Forum/The (B)Leading Edge

C++ Experts Forum

The (B)Leading Edge: Building an IndexedFile Using XDRStream, Part 4

by Jack W. Reeves

Introduction

In this installment of "The (B)Leading Edge," I am going to continue my walkthrough of the code for the BtreeIndex of my IndexedFile system. The editors of CUJ Online have been kind enough to run this column with only one month's gap so I can maintain the continuity — and get it over with faster.

As in previous columns, Listing 1 is my IndexedFile.h header. At the top level, there is my namespace declaration — in this case namespace Util. Within the namespace are declared three classes: XDR_Traits, BtreeIndex, and IndexedFile itself. The nested classes declared with BtreeIndex are:
BtreeIndex
  Iterator
  StreamPosProxy
  Page
  RootPage
  PageCache
  XIODataStreambuf
  XInDataBuf
  XOutDataBuf
  KeyInfo
In the last installment [1], I described the classes XIODataStreambuf, XInDataBuf, XOutDataBuf, KeyInfo, Page, and RootPage (mostly — I missed something that I will cover in this installment). In this installment, I am going to describe the PageCache and Iterator classes. Next time, I will (hopefully) wrap up everything with the details of the BtreeIndex class itself.

Class PageCache

I knew from the beginning that I would need some type of caching mechanism for the pages being read from external memory. It was simply not acceptable to read a page in every time it was needed. As it turned out, PageCache is a fairly simple class, but there are some issues that came up in its design that I want to describe.

First, from a data standpoint, you will note that PageCache contains a reference to an XDR_Stream. This is a reference, so it has to be initialized by the constructor. As you can probably guess, this is the actual external XDR_Stream that contains the pages of the B-tree index. It turns out that, except for opening and closing the file, all I/O operations on external storage are handled by the cache. There are two data members of type size_t(_minSize and _maxSize), which I will describe below. Finally, there is a list<Page*> that is the actual cache.

Ordinarily, the basic operation of a cache is fairly simple — if you need something that is not in the cache and the cache is full, then you throw something away to make room for the new item. In the case of PageCache, I keep the list of pages ordered by which was the most recently requested. This means that if I need to make room in the cache, I could simply take the page at the end of the linked list and delete it (flushing it to external storage if needed). Unfortunately, as I was developing the code for BtreeIndex, I realized that I cannot let PageCache be totally responsible for its own actions. The problem is that several places in BtreeIndex I have to access three or even four pages at once. This is in addition to the pages higher up in the tree that might have been accessed in order to get to the current level. I realized that if PageCache was set to its minimum size, it was quite likely that I might actually be accessing more pages than the cache could technically hold. If PageCache were allowed to arbitrarily delete an old page to make room for a new request, I could very easily end up with a dangling pointer. Dangling pointers fall into the category of Really Bad Things to have happen in C++.

As a result, I realized that PageCache needed some external help in order to do its job. Specifically, it needs to be told when all page references are finished and it can safely delete those that are excess. The function reset is called by higher-level BtreeIndex routines when they are finished. It tells the cache that if it has exceeded its maximum size, it can now flush pages until it is back within its limit.

I noted the two data members _minSize and _maxSize above. It is pretty obvious what _maxSize is for, but _minSize probably needs some explanation. In simple terms, the minimum useful size for the cache is the depth of the tree. Any B-tree index operation is going to have to walk down the tree from the root to a leaf. It makes no sense for the cache to ever hold less than one complete root-to-leaf traversal. (At least it makes no sense to me, and I wrote the silly thing.) Yet again, this means that PageCache must have some external help. It has to be notified to change its minimum size whenever the depth of the tree increases or decreases. These are not especially tricky things to do, but it is one more little item that has to be done correctly or the entire BtreeIndex does not work properly.

Now that you have the basic theory, let me go through the individual functions.

PageCache::constructor

The PageCache constructor (shown below) is really trivial. Its primary responsibility is to initialize the _xdrstream member with its argument. Naturally, it also initializes the other data members.
template<typename Key, typename Compare, 
    typename KeyTraits, size_t PageSize>
BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::PageCache::PageCache(XDR_Stream& xstrm)
  : _xstream(xstrm), _minSize(0), _maxSize(0), _pageList()
{
    // empty
}
PageCache::destructor

The PageCache destructor is as trivial as the constructor — it just calls the clear member function to empty the cache.
template<typename Key, typename Compare, 
    typename KeyTraits, size_t PageSize>
BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::PageCache::~PageCache()
{
    clear();
}
PageCache::get_page

Function get_page is really the heart of PageCache. It is shown here and described below.
template<typename Key, typename Compare, 
    typename KeyTraits, size_t PageSize>
BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::Page*
BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::PageCache::get_page(size_t page)
{
    // See if the page is in the cache
    PageList::iterator pit = _pageList.begin();
    for (; pit != _pageList.end(); ++pit) {
        if ((*pit)->page_id() == page) break;
    }
    Page* p = 0;
    if (pit != _pageList.end()) {    
        // Ok - its in the cache, 
        // move it to the front
        p = (*pit);
        pageList.splice(_pageList.begin(),
            _pageList, pit);
    } else { 
        // Fetch the page from external store
        p = new Page(page);
        p->decode(_xstream);
        _pageList.push_front(p);
    }
    return p;
}
This routine basically splits into two parts. First, the list of pages already in the cache is searched to see if one of them is the requested page. This is done by checking the requested ID. You will recall from last time that a page's ID is set when it is created and cannot be changed. This is a linear search (as you can see), but one of the hoped for benefits of keeping the page list in most recently used order is that many times the page being requested will be one of the first few pages in the list. Naturally, if the requested page is not in the list, that will not be determined until the entire list has been searched. For what I would expect to be normal-sized caches, I do not see this linear search being a problem. Do not forget that if a page is not in the cache, then it will have to be read from external storage, and one of the principal assumptions about B-tree performance is that loading a page from external storage completely swamps all other considerations.

If the page is in the cache, then it is moved to the beginning of the list. The list member function splice is used here. If you did not read last month's column, this is a little known, but very useful, member function of the standard list container that moves an element from one list to another by simply changing internal pointers. In this case, the destination list is the same as the source (which is permitted for this version of splice). Using splice in this manner guarantees that no calls to the allocator are necessary. This would not be true if I coded the same functionality as an erase followed by a push_front (as I did in an earlier version). This is a fairly minor consideration, but the whole point of PageCache is to eliminate unnecessary overhead once the page is in memory.

If the page is not in the cache, then a new Page object is created and initialized by reading its contents from the external store. Once created, it is pushed onto the front of the list. Note that in the later case this routine does no checking to make sure the page is read from the external store correctly. This is perhaps a little too optimistic, but I have not yet decided whether to address the issue.

PageCache::add_page

The add_page routine is trivial. It is used by higher-level routines when a page is split to give the PageCache control of the new page.
template<typename Key, typename Compare,
    typename KeyTraits, size_t PageSize>
void
BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::PageCache::add_page(Page* page)
{
    _pageList.push_front(page);
}
PageCache::clear

Function clear is another straightforward routine. It simply goes through all the pages in the cache, flushes them to external storage if needed, and deletes them. Besides being called by the PageCache destructor, this function is called whenever the BtreeIndex is closed.
template<typename Key, typename Compare,
    typename KeyTraits, size_t PageSize>
void
BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::PageCached::clear()
{
    // Remove all pages
    while (_pageList.size() > 0) {
        Page* p = _pageList.front();
        if (p->is_dirty()) p->encode(_xstream);
        delete p;
        _pageList.pop_front();
    }
}
PageCache::reset

Function get_page may be the heart of PageCache, but reset is almost as important. This is the function that is called by higher-level BtreeIndex routines to remove excess pages from the cache. Note that this routine will not reduce the size of the cache below the minimum size, even if the minimum exceeds the maximum size.
template<typename Key, typename Compare, 
    typename KeyTraits, size_t PageSize>
void
BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::PageCache::reset()
{
    // Remove any excess pages
    int siz = std::max(_minSize, _maxSize);
    while (_pageList.size() > size_t(siz)) {
        Page* p = _pageList.back();
        if (p->is_dirty()) p->encode(_xstream);
        delete p;
        _pageList.pop_back();
    }
}
PageCache::flush

Function flush is something of an auxiliary routine. What it does is fairly obvious; it makes sure that there are no dirty pages in the cache.
template<typename Key, typename Compare, 
    typename KeyTraits, size_t PageSize>
void
BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::PageCache::flush()
{
    PageList::iterator pit = _pageList.begin();
    for (; pit != _pageList.end(); ++pit) {
        if ((*pit)->is_dirty()) (*pit)->encode(_xstream);
    }
}
That takes care of PageCache. Like I said, not very complicated, but essential to the overall performance of a BtreeIndex.

Class RootPage Revisted

While I am on the subject of PageCache, let me clear up a minor point about RootPage that I did not mention last time. As you can see, PageCache writes pages to external storage only if the page is marked dirty. If a page has not been changed since it was loaded into memory, then it doesn't need to be written out. This same reasoning applies to the root page as well as any other, but the root page contains some extra information. Besides the usual set of <key,links>, the root page maintains the head of the free page list and the total number of keys in the BtreeIndex. This last item was of some concern to me. The _totalKeys element of RootPage will be updated whenever a new key is inserted, or a key is deleted, from the BtreeIndex. This means that the root page in technically dirty for every change to the BtreeIndex. This could basically double the I/O involved when the BtreeIndex was changed — assuming that most insertions and/or deletions normally affect only one page.

I decided to separate this particular change to the root page from the more substantial ones and gave RootPage a separate flag to indicate that the total was dirty. Now, in the default case, an ordinary page will be flushed to external storage whenever it is updated, but the root page will not be flushed if the only thing changed is the totalKeys element. If the client wishes, they can change the update strategy from MostlySafe to Safe and cause the root page to be flushed also whenever any other changes take place. The client can also opt for Unsafe in which case dirty pages will only be flushed when they fall out of the page cache.

Class Iterator

Now, let's look at class Iterator and its helper StreamposProxy, and we will be through with the internal classes of BtreeIndex.

Looking at Listing 1, you will see that class Iterator is quite small. In fact, even though it is not shown in Listing 1, all of the member functions of Iterator are inline. I discussed some design issues of Iterator in a previous column [2], so I will not go into detail here. I will summarize the following points however. First a BtreeIndex::Iterator has three data elements: a pointer to its BtreeIndex, the page ID, and the slot number of the item in the page. The Iterator does not directly carry a pointer to the page because the page might fall out of the cache and be deleted from memory, in which case the Iterator would be left with a dangling pointer (see comment above). If I put the actual Page* into an Iterator, then I would have to ensure that once referenced a page could never be removed from memory until something happened that would invalidate the iterators. This seemed a little difficult, so I decided not to do it. Hopefully, a page will still be in the cache if an Iterator needs it, in which case the page ID should retrieve it quickly enough.

Since an Iterator carries only the page's ID, it has to have a pointer to the BtreeIndex in order to access the page. Finally, the slot number is used for the same reason as the page ID — it stays valid if the page is flushed from the cache. As I discussed last time, the elements are stored in a Page using a list instead of a vector. Accessing an element of a list by slot number means stepping through the list in linear time. While I am not real happy about this, I will wait until I have firm data indicating it is a problem before I start optimizing.

The three functions of real interest in Iterator are shown below:
template<typename Key, typename Compare, 
    typename KeyTraits, size_t PageSize>
inline std::pair<const Key, BtreeIndex<Key,
    Compare,KeyTraits,PageSize>::StreamposProxy >
BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::Iterator::operator*()
{ return _btree->get_value(_page, _slot); }

template<typename Key, typename Compare, 
    typename KeyTraits, size_t PageSize>
inline BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::Iterator&
BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::Iterator::operator++()
{ _btree->next_iter(*this); return *this; }

template<typename Key, typename Compare, 
    typename KeyTraits, size_t PageSize>
inline BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::Iterator&
BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::Iterator::operator--()
{ _btree->prev_iter(*this); return *this; }
As you can see, dereferencing, incrementing, or decrementing an Iterator just invokes a private (friend) function of the BtreeIndex itself. I will discuss these functions when I get to the BtreeIndex's main functions. For now, the only important thing to note is that the dereference operator returns a pair<> that consists of the key used by the BtreeIndex and a StreamposProxy object. As the name suggests, the StreamposProxy is an example of the Proxy pattern [3]. In this case, the object acts as a proxy for the data item associated with the key in the BtreeIndex. This item is (by definition — if not in fact) an XDR stream position where the real data of an IndexedFile can be found — hence, the name StreamposProxy.

If you look at StreamposProxy in Listing 1, you will observe that it contains all the same data elements as Iterator, plus an element named _link. The latter is the actual data that the proxy refers to. The reason that I do not just return the data when I dereference an Iterator is that I wanted to be able to update the link for a key without having to delete the key and reinsert it into the BtreeIndex. In other words, assuming that btit in a BtreeIndex::Iterator, I wanted to be able to write:
��� (*btit).second = new_link;
as well as
��� size_t link = (*btit).second;
These are the standard idioms for use with STL iterators, and I wanted my Iterators to follow the same convention.

Ordinary STL iterators return a reference to the element in the container when dereferenced, however. The BtreeIndex Iterator cannot return a reference because any such reference would be to a (possibly) transient Page element. Instead of returning a reference, Iterator returns a proxy. The proxy provides an assignment operator that knows how to find the item in the BtreeIndex and how to update the item in place. Naturally, it does all this by calling a friend function of BtreeIndex itself.
template<typename Key, typename Compare, 
    typename KeyTraits, size_t PageSize>
inline void
BtreeIndex<Key,Compare,KeyTraits,
    PageSize>::StreamposProxy::operator=(size_t newlink)
{ _btree->update_link(_page, _slot, newlink); }
Of course the proxy can convert itself to the data object if that is what is needed.
template<typename Key, typename Compare, 
    typename KeyTraits, size_t PageSize>
inline BtreeIndex<Key,Compare,
    KeyTraits,PageSize>::StreamposProxy::operator size_t()
{ return _link; }
Without the use of the Proxy pattern, the standard update idiom of ordinary assignment could not be used. Instead, a client that wanted to update an item would have to call a BtreeIndex member function directly. This may seem like a small thing, but my experience says that it is just such little things that make the difference between something being intuitively useful (and hence used), and just another ho-hum, "have to look it up in the manual if I want to use it so maybe I won't bother" class. In the case of BtreeIndex, I have deliberately tried to keep it as close to the STL map class as possible. This was a conscious decision to try to capitalize on the collective knowledge base represented by the STL. If that means introducing a StreamposProxy to hide the BtreeIndex member function that actually does an update, that is fine with me.

That pretty much wraps up all the internal classes used in the implementation of BtreeIndex. In the next installment (next month), I will tackle the main functions of BtreeIndex itself.

If you have read this far, I want to thank you for your patience. I don't like code walkthroughs any more than anybody else, and I assume that most of the readers of this column are quite capable of reading the code in the listings or even in handling the occasional "left as an exercise for the reader." On the other hand, I am sure I am not the only person who has come across some article about an algorithm or a class that I was interested in, and then found the article to be totally inadequate in explaining the accompanying code. We all know that any design involves lots and lots of little details. Often what separates a good, robust design from something else are little details that really do not have much to do with the main algorithms or data structures under discussion. I am not going to claim that I have a good, robust class design here (not yet), but I am trying to make it such, and I think it is useful to actually expose the underlying decision process that I have gone through. I hope you agree.

Notes

[1] Jack W. Reeves. "The (B)Leading Edge: Building an Indexed File Using XDR_Stream, Part 3," C/C++ Users Journal C++ Experts Forum, July 2002, <www.cuj.com/experts/2007/reeves.htm>.

[2] Jack W. Reeves. "The (B)Leading Edge: Musings on the IndexedFile Design", C/C++ Users Journal C++ Experts Forum, May 2002, <www.cuj.com/experts/2005/reeves.htm>.

[3] Erich Gamma et al. Design Patterns (Addison-Wesley, 1995).

Jack W. Reeves is an engineer and consultant specializing in object-oriented software design and implementation. His background includes Space Shuttle simulators, military CCCI systems, medical imaging systems, financial data systems, and numerous middleware and low-level libraries. He currently is living and working in Europe and can be contacted via jack_reeves@bleading-edge.com.