Optimizing Custom Memory Allocators

Sebastien Marc

Yet another success story for modern template techniques -- this time optimizing a custom memory scheme.

It all started innocently. We had just finished an XP (eXtreme Programming) iteration of our Linux financial risk engine and were doing some load testing. We had to handle tens of thousands of financial positions in a matter of minutes, so scalability was important. The first big report got through, but the engine's memory shot up to 500 MB. The second run worried me: the engine climbed to 700 MB for the same report. The third run was a disaster. The server swapped, spending most of its time in an I/O wait state and taking forever to compute the profit and loss of each position.

Our application was suffering from unacceptable memory fragmentation.

Fortunately, C++ provides the tools to fight memory fragmentation: custom memory allocators. We began to implement custom allocators that map (via the mmap system call) the critical chunks of memory needed during the computation of a report, and we implemented custom containers using these allocators. The solution worked well. Memory went down at the end of each report, and the fragmentation disappeared. But the code was very inefficient: the time required for a report nearly doubled. After profiling the engine with Quantify on Solaris (a pity Quantify does not exist on Linux), we realized most of the time was spent calling the destructors of the objects in our containers.

Below is part of the implementation of a specific container:

template <class T>
void destroyit(T* tPtr)
{
  tPtr->~T();
}

template <class Contained,
  unsigned int size>
class Container
{
public:
  Container()
      : holderPtr_(NULL),
        bytes_(size * sizeof(Contained))
    {
      // zerofd is a descriptor opened
      // on /dev/zero elsewhere.
      holderPtr_ =
        (Contained*) mmap(0, bytes_,
          PROT_READ|PROT_WRITE,
          MAP_PRIVATE, zerofd, 0);
      // Use placement new to
      // initialize our objects.
      for(unsigned int i=0; i<size; i++)
        new (holderPtr_ + i) Contained();
    }
  virtual ~Container()
    {
      long nelements =
        bytes_ / sizeof(Contained);
      for(long i=0; i<nelements; i++)
        destroyit(holderPtr_ + i);
      munmap(holderPtr_,bytes_);
    }
  
private:
  Contained* holderPtr_;
  unsigned long bytes_;
};
This container memory-maps its own storage, which reduces heap fragmentation by moving part of the memory needed by the application into a separate segment. Therefore, uses like:

{
  // Uses the heap.
  std::auto_ptr<MyClass> mPtr
    (new MyClass());
  // Time to allocate prices. Do it
  // using our container because we
  // do not want to fragment the
  // heap when prices are deleted.
  // Container holding 100 prices.
  Container<PriceClass, 100> c;
  // Do other things.
}
significantly optimize memory usage by reducing fragmentation.

That was great, but what about:

{
  // Container holding 100 prices.
  Container<double, 100> c;
}
Stop and think about it for a minute. You might expect this code not to compile, with an error when destroyit<double> is instantiated. However, the code really does compile. See section 12.4, paragraph 15 of the ANSI C++ Standard: "The notation for explicit call of a destructor can be used for any scalar type name." This rule means that calling a destructor on a double, or on any fundamental type, is allowed. It does nothing, but it compiles and links.

So, when the container goes out of scope, its destructor enters a dreadful loop, calling the (nonexistent) destructor on each of the 100 doubles! That is not a model of efficiency, and it was definitely the main performance bottleneck of our risk engine. (Believe it or not, this appears to be how the standard STL vector container is implemented on the latest SPARCworks compiler.)

Here is the design challenge: how to design the container to call the destructors of contained classes but avoid entering a loop when your container contains fundamental data types.

Modern C++ design and some of the template techniques described by Alexandrescu will help. Fasten your seat belts, here is the code to handle the case of double. I'll then generalize the solution to other types.

Start by defining two static types:

struct Yes_type {};
struct No_type {};
The compiler uses these two empty types at compile time to decide which overload to instantiate: one that destroys an array of objects by calling each destructor, or one that calls no destructors at all, thereby skipping the useless loop.

I can then have the following functions:

template <class T>
void destroyit(T* tPtr, long bytes, No_type)
{
}

template <class T>
void destroyme(T* tPtr)
{
  tPtr->~T();
}

template <class T>
void destroyit(T* tPtr, long bytes, Yes_type)
{
  long nelements = bytes / sizeof(T);
  for(int i=0; i<nelements; i++)
    destroyme(tPtr + i);
}
destroyit destroys an array of objects in both cases. tPtr points to the beginning of the array; bytes is the size in bytes of the memory pointed to by tPtr. No_type and Yes_type are just there to let me overload the function. If I pass No_type (for a type whose destructor does nothing), I get rid of the loop!

Now the tricky part is to call the appropriate destroyit function in the destructor of the container. You can call the appropriate destroyit function by introducing yet another static type:

template <class T> struct Container_type_traits
{
  typedef Yes_type has_destructor;
};
and a specialization for double (note the template <> prefix that introduces an explicit specialization):

template <>
struct Container_type_traits<double>
{
  typedef No_type has_destructor;
};
In plain English, Container_type_traits<T>::has_destructor is Yes_type for any type T except double; for double, the specialization makes it No_type.

The destructor for the container is:

virtual ~Container()
{
  destroyit(holderPtr_, bytes_,
    Container_type_traits<Contained>
      ::has_destructor());
  munmap(holderPtr_, bytes_);
}
It is fairly simple to tie everything up now. I am just instantiating a Yes_type or No_type structure as the third parameter of destroyit, depending on the type of the contained object. If the object is a double, the structure is No_type; otherwise the structure is Yes_type.

I can now complete the implementation by specializing Container_type_traits structures for all the types I know do not need their destructor to be called (all the basic types qualify).

You can generalize this approach to many different applications. This technique allowed me to reduce memory fragmentation while maintaining a high level of performance.

About the Author

Sebastien Marc has spent the last eight years developing financial applications for Reuters in France and New York City. He is now working as a consultant for Cambridge Technology Partners in Switzerland.