C++ String Performance

Dr. Dobb's Journal October 2003

Multithreaded apps make special demands

By Lev Kochubeevsky

Lev is a principal engineer at Netscape/AOL and can be contacted at levk@netscape.com or lev@smartneighborhood.net.

It's easy to imagine the reaction of seasoned C++ programmers upon seeing yet another string class implementation: "So why do we need another string class? We had too many in the pre-standard C++ Library world, and now that we finally have std::string, why not just use it?"

I agree, but only if the standard implementation doesn't introduce significant performance problems. In this article, I show how the standard string implementation isn't always good enough, particularly for multithreaded applications that create lots of temporary strings. That said, it is worth noting that there's no such thing as a standard string implementation—they all differ. (For examples of the different implementations, see item 15 in Scott Meyers's Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library, Addison-Wesley, 2001.) Here, I analyze one of the more popular string implementations—the one that comes with g++ 2.96 on Red Hat Linux 7.x. To this end, I create a number of multithreaded tests showing the Standard C++ string implementation (which comes with g++ 2.96 on Red Hat Linux 7.3) can have real performance problems for multithreaded applications that need to create lots of unrelated temporary string objects. To address these performance problems, I present a string class that gives you the speed and efficiency of automatic C strings, plus the convenience of C++ string objects that are STL friendly.

std::string Class Design

All string implementations—including g++ 2.96—usually share the same design principles:

Memory for the string value is always allocated on the heap. (Okay, Scott Meyers describes one implementation that doesn't allocate memory for the very short strings.) Think about it. You have a string object on the stack, but there is always a heap allocation for the value of the string unless it is initialized through a copy constructor and it's reference counted (which generates other problems).
String values are reference counted; so if you assign string s1 to s2, their internal pointers point to the same location in memory. It's apparently done to make copy constructors and assignment operators very fast. They don't need to allocate memory and copy the value, just set the pointer and increment a reference counter. The same logic makes a destructor very fast. In most cases, it just decrements the reference counter.
Closely related to reference counting is copy-on-write (COW) optimization that allocates/copies when one of the string objects pointing to the same value changes.

The first item is usually a bad idea when you have lots of temporary strings as automatic variables in a multithreaded application. The heap is shared between threads, so some locking is inevitable; otherwise, the heap's internal data is corrupted. This approach defeats the idea of the fast non-blocking creation/access of the stack variables in multithreaded applications (every thread has its own nonshared stack). It looks deceptive to programmers, too. For instance, in Example 1, mystring is an automatic (stack) variable with no indication that memory for "Some C value" is allocated on the heap (unfortunately, that is exactly what happens).

Likewise, the second and third items may sound good until you realize they're a bad fit for multithreaded applications. Value and reference counters are shared between multiple string variables and potentially between multiple threads that absolutely require access serialization (locking) to the reference counter. Atomic increment/decrement/check can be used if these operations are supported by the CPU's command set; see "Optimizations that Aren't (In a Multithreaded World)," by Herb Sutter, C/C++ Users Journal, June 1999.

Listing One (teststring1.cpp) creates 100,000 temporary std::string objects on the stack (most of the memory is allocated on the heap). It has only one thread running, but was built as a multithreaded application, so it does all the required locking (refer to the Makefile, available electronically; see "Resource Center," page 5). I ran this and subsequent tests on a 2-GHz Dell desktop with 1 GB of RAM under Red Hat Linux 7.3. Example 2(a) shows the results.

Listing Two (teststring2.cpp) is the same test, but with 10 threads running—each creating/destroying 100,000 automatic string objects. (Again, every one of them allocates memory on the heap and employs some locking and copying.) Example 2(b) presents the less-than-encouraging results of the teststring2.cpp test. With just 10 threads, average automatic string object handling (creation/destruction) increased approximately 100 times (10000/108). Since these results seem unacceptable for high-load multithreaded applications, I decided to find out where time is being spent, then create a new string class that doesn't have those deficiencies. Possible performance culprits include locking introduced by the reference counting optimization, heap allocations (and it's locking) and copying of the value.

A New String Class

In creating a new string class, I wanted to be able to:

Avoid heap allocation and copying completely; just to set an internal pointer to some C string.
Avoid heap allocation for the nonheap variables; just copy the value to the buffer inside the object.
Force heap allocation and copying for any string object (as with a std::string implementation, minus reference counting/COW).

Since it supported the last two of these three requirements, I used SString (see Thinking In C++, by Bruce Eckel; Prentice-Hall, 1995, ISBN 0-13-917709-4) as the basis for my new string class. I then implemented the fastest and most important mode, which doesn't do any allocation or copying. It also occurred to me that the string class features should be selectable by programmers who instantiate the string objects, not the string implementation. Consequently, it's up to users of this class (see Listing Three) to instantiate particular string objects in particular modes. The complete implementation of this new CGenStringT<> class (again, Eckel's SString was used as a basis) is in util.h (also available electronically).

The mode of the string being created in Listing Three is set by the combination of two parameters: template parameter bufsize, and constructor parameter bAlloc. CGenStringT is a template with one template parameter, bufsize, that governs the size of the internal buffer. In addition to specifying the size of the internal buffer, this parameter defines the mode of the string in conjunction with a bAlloc constructor's parameter. The possible combinations are:

Mode 1: bufsize == 0 && bAlloc == 0. No heap allocation, no copying, just setting the pointer (fastest).
Mode 2: bufsize > 0. No heap allocation, copying into the internal buffer (slower).
Mode 3: bufsize == 0 && bAlloc == 1. Heap allocation and copying (slowest).

Once you understand the idea, the implementation is straightforward: Constructor, copy constructor, assignment operator, and destructor are doing the right thing for the three different modes (like allocation, copying, deleting), or none of them. For example, the copy constructor in Listing Four is a good example of handling all three modes (only ASCII null-terminated strings are currently supported). Right after the CGenStringT class definition there's a bunch of useful typedefs and a define:

typedef CGenStringT<0> CGenString;

typedef CGenStringT<100> CStackString100;

#define FAST_STRING(id,str) CGenString id(str,0);

CGenString is a Mode 3 string (allocation and copying), CStackString100 is a Mode 2 string (copying) with an internal buffer of 100 bytes, and FAST_STRING defines a Mode 1 string variable ID (no alloc, no copying) and sets its value to str.

The "stack" word is being used in comments throughout the CGenStringT implementation and in CStackString100 to denounce Mode 2 strings (with an internal buffer). In general, this term is incorrect, because the internal buffer is on the stack only when the string variable itself is a stack variable. If this variable is created on the heap (using new()), then the buffer is on the heap as well. So, in general, the buffer as a part of the string object is always allocated in the same class of memory as the string object that owns it.

Performance Testing

To gauge performance, I start by testing the slowest mode (forced heap allocation and copying) to see if the removal of reference counting helps. Teststring3.cpp (available electronically) tests one thread, while teststring4.cpp (also available electronically) tests 10 threads. They are the same, respectively, as teststring1.cpp and teststring2.cpp, except that std::string is replaced with CGenString.

Example 3(a) shows the result of running teststring3, and Example 3(b) the results for teststring4. Compared to std::string, this isn't bad at all for the slowest version—it's 45 percent faster with one thread (59 versus 108) and it's more or less 100(!) times faster with 10 threads running (~100 versus ~10000). And since the heap is still being hit, it looks like std::string reference counting hurts big time!

Moving to the faster Mode 2 strings (using automatic string variables so a buffer for every string is on the stack: no heap allocations, just copying), teststring5.cpp and teststring6.cpp (both available electronically) are the same tests as before—but using CStackString100.

Example 4(a) shows the results of running teststring5, while Example 4(b) shows the results of running teststring6. By eliminating the need to allocate on the heap, you get significantly faster performance: for one thread, 23 versus 59 (Mode 3) versus 108 (std::string), and for 10 threads, ~40 versus 100 (Mode 3) versus 10,000 (std::string).

Teststring7.cpp and teststring8.cpp, both available electronically, test the fastest mode (Mode 1: no heap allocation or copying, just setting a pointer). Example 5(a) is the result for teststring7, and 5(b) for teststring8. As expected, it's blazingly fast. Table 1 summarizes the performance test results.

Mode 1 (fastest) is 35 times faster than std::string in a one-thread application and 2500(!) times faster for the application with just 10 threads. A no-frills Mode 3 string with heap allocation and copying is still 100 times faster than std::string for a 10-thread application thanks to the absence of the reference counting/COW optimization.

Real-World Example

In the real world, many C++ applications get their input from C network protocols or database drivers. In the case of strings, one popular use is as keys in the STL map. Listing Five does exactly that—it accepts parameters as C strings, converts them to C++ strings, and uses them as keys for the STL map. Now take a closer look at the function calls ci=m_mymap.find(str); and m_mymap.erase (str);.

Listing Five looks benign until you realize that every time you do a lookup or erase, you create a temporary string object (with heap allocation and value copying). This temporary string object is created through the std::string(char*) constructor. This object's lifespan is just the duration of the map::find/erase call: It's destroyed right before return (with another heap hit, of course). It sounds expensive—and it is. So what can you do? The first solution may be to use pointers to the strings instead of the strings themselves. It would solve a creation-of-objects problem, but burdens you with memory management and comparison issues (remember a map is a balanced tree that always keeps its contents sorted).

The preferred solution would be to have all the beauty of using C++ string objects as keys, but with the C speed of conversion between C strings and C++ strings for the temporary strings. That's where my new Mode 1 (FAST_STRING) shines. To achieve the speed of temporary C strings and still be able to use C++ strings for the lookup, take a look at Listing Six and testmap.cpp (available electronically). Mode 3 (value allocated on the heap) is used only when the new element is added. For the lookup and erase, the fastest Mode 1 strings are used (just setting the pointer).

There is no allocation and no copying, which means that you have a real automatic (stack) string variable with all the performance of C strings and all power/convenience of the STL map. The constructor and destructor are extremely simple for the fastest (Mode 1) strings—jsut setting a pointer in the constructor, and no-op in the destructor.

Possible Improvements

One area of improvement would be adding support for multibyte characters. A good way to do this would be to follow STL string design (where string is typedefed to basic_string <char>) and add another template parameter for the actual character type.

DDJ

Listing One

#include <pthread.h>
#include "util.h"
unsigned int createStrings ( int n )
{
 static char *samples[]={"something1","something2","something3","something4"};
 printf ("Thread <%d>: Creating/destroying %d strings...\n",pthread_self(),n);
 unsigned int start = utilGetTickCount ();
 for ( int i=0; i<n; i++ )
 {
    string s ( samples[i%4] );
 }
 unsigned int end = utilGetTickCount ();
 printf ( "Thread <%d>: total time=%d millisecs, average=%f\n",
           pthread_self(), end-start, (float)(end-start)/n );
 return ( end - start );
}
int main ( int argc, char **argv )
{
  if ( argc > 2 )
  {
    printf ( "teststring1 <n_of_strings_to_create> \n" );
    return 0;
  }
  int n = 100000;
  if ( argc > 1 )
    n = atoi ( argv[1] );
  createStrings ( n );
  exit(0);
}

Back to Article

Listing Two

#include <pthread.h>
#include "util.h"
void *createStrings ( void *n_void )
{
 int n = (int) n_void;
 static char *samples[]={"something1","something2","something3","something4"};
 printf ("Thread <%d>: Creating/destroying %d strings...\n",pthread_self(),n);
 unsigned int start = utilGetTickCount ();
 for ( int i=0; i<n; i++ )
 {
    string s ( samples[i%4] );
 }
 unsigned int end = utilGetTickCount ();
 printf ( "Thread <%d>: total time=%d millisecs, average=%f\n", 
           pthread_self(), end-start, (float)(end-start)/n );
 return NULL;
}
int main ( int argc, char **argv )
{
  if ( argc > 3 )
  {
    printf ( 
       "teststring1 [<n_of_strings_to_create>] [<n_threads>]\n" );
    return 0;
  }
  int n = 100000;
  int n_thr = 10;
  if ( argc > 1 )
    n = atoi ( argv[1] );
  if ( argc > 2 )
    n_thr = atoi ( argv[2] );
  pthread_attr_t attr;
  pthread_attr_init ( &attr );
  pthread_t *tid = new pthread_t[n_thr];
  int i;
  // create threads
  for  ( i=0; i<n_thr; i++ )
  {
    if ( pthread_create ( &tid[i], NULL, createStrings, (void *) n ) )
    {
      printf ( "ERROR: ptread_create for thread <%d>\n", i );
      exit(-1);
    }
  }
  // wait for all threads to terminate
  for  ( i=0; i<n_thr; i++ )
    pthread_join ( tid[i], 0 );
  exit(0);
}

Back to Article

Listing Three

template<int bufsize>
class CGenStringT
{
public:
  CGenStringT ( const char *S = "", int bAlloc = 1 ) : m_s(m_buf)
  {
    if ( !bufsize )
 ...
private:
  char *m_s;
   int   m_bAlloc;
  char  m_buf[bufsize+1];
};

Back to Article

Listing Four

 ...
  CGenStringT ( const CGenStringT& rv ): m_s(m_buf)
  {
    if ( !bufsize )
    {
      if ( m_bAlloc = rv.m_bAlloc )
      {
        // make on heap
        m_s = new char [ strlen(rv.m_s)+1 ];
        strcpy ( m_s, rv.m_s );
      }
      else
        // just set a pointer
        m_s = rv.m_s;
    }
    else
    {
      // make on stack
      m_buf[bufsize] = 0;
      strncpy ( m_s, rv.m_s, bufsize );
    }
#ifdef MEM_ALLOC_DEBUG
    printf ("CGenString copy constructor called: this=%x allocation=%s"
            " ptr=%x\n", this, (!bufsize && m_bAlloc ) ? "yes":"no", m_s );
#endif
  }
 ...

Back to Article

Listing Five

class CMyMap
{
private:
  map < string, int > m_mymap;
public:
  void add ( const char *str, int i )
  {
    m_mymap[str] = i;
  }
  void remove ( const char *str )
  {
    m_mymap.erase ( str );
  }
  bool find ( const char *str, int *res )
  {
    map < string, int >::const_iterator ci;
    ci = m_mymap.find(str);
    bool found = ( ci != m_mymap.end() );
    if ( found )
       *res = ci->second;
    return found;
  }
};

Back to Article

Listing Six

class CMyMap
{
private:
  map < CGenString, int > m_mymap;
  typedef map < CGenString, int > GenStringIntMap;
public:
  void add ( const char *str, int i )
  {
    m_mymap[str] = i;
  }
  void remove ( const char *str )
  {
    FAST_STRING(s,str);
    m_mymap.erase ( s );
  }
  bool find ( const char *str, int *res )
  {
    map < CGenString, int >::const_iterator ci;
    FAST_STRING(s,str);
    ci = m_mymap.find(s);
    bool found = ( ci != m_mymap.end() );
    if ( found )
      *res = ci->second;
    return found;
  }
};

Back to Article