Features


Making C++ Safe for Threads

Dave Plauger


Dave has over 20 years' experience with the design and development of operating systems. His main areas of expertise include real-time and multiprocessing systems. Dave currently serves as Technical Editor of POSIX 1003.14 (Multiprocessing Profile) and has been a member of various POSIX working groups since the days of /usr/group. Dave may be reached at Mercury Computer Systems, Lowell MA. The phone number is (508) 458-3100; email djp@mc.com.

Introduction

Multithreading is a powerful technique that can speed up application programs on both multiprocessors and uniprocessors. On multiprocessors there is a potential speedup due to parallel computation. On uniprocessors there is a potential speedup due to the overlap of I/O with computation.

The speedup is only potential because, in the real world, there are always trade-offs. To get computational parallelism you need synchronization, which adds overhead. To overlap I/O with computation you again need synchronization, in the form of I/O completion events.

Multithreading helps by hiding (but not eliminating) the complexity and overhead associated with both kinds of parallelism. In the many cases where parallelism does help, programming with threads is a convenient way to exploit it.

Issues of Thread-Safety

The main problem that arises when adding threads to a program is synchronizing updates to shared variables, because updating a data structure often takes several machine steps. These individual operations may be interleaved at random among multiple processors, which can cause corrupted data structures, storage overwrites, and even more unpleasant problems. Even on a uniprocessor, threads may be time-shared, which means they effectively run in parallel.

High-level languages generally let you ignore the underlying machine, but parallel programming drags you back into worrying about what is really happening. On the other hand, you do have enough information in the declarations to help spot the potential problems. Variables that are declared at file scope and static data within functions clearly need to be protected from uncoordinated updates by multiple threads.

Less obviously, control data structures used by library packages need careful attention as well. Examples of these are the FILE structures used in the Standard C library and the dirent structures used in UNIX and POSIX. These data structures hold state information between calls. Problems may arise if they are used in more than one thread at the same time, or even alternately.

The same issues arise with member data in C++ classes. In the C library, something has to be done for errno, stdin, stdout, and stderr (at least), to make them thread-safe. In C++, the analogous structures are cin, cout, cerr, and clog.

Two Solutions

There are two general approaches to dealing with a global shared variable in a multi-threaded program: add locking around every access to the variable, or give each thread a private copy of it.

There are many mechanisms for enforcing locking. Some examples are semaphores, monitors, and message queues. The POSIX threads package (IEEE P1003.4a, Draft 6) provides the functions pthread_mutex_lock and pthread_mutex_unlock.
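As a minimal sketch of how these mutex calls bracket a multi-step update, consider guarding a shared counter. The counter and the helper function here are invented for illustration; only the pthread calls come from the POSIX interface.

```cpp
#include <pthread.h>

// A shared counter guarded by a statically initialized POSIX mutex.
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_count = 0;

// The lock/unlock pair brackets the multi-step read-modify-write
// of the shared variable, so concurrent callers cannot interleave.
long add_to_count(long n)
{
    pthread_mutex_lock(&count_lock);
    shared_count += n;
    long result = shared_count;
    pthread_mutex_unlock(&count_lock);
    return result;
}
```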

Locks should be associated with the data, not the code. Locking code that happens to access shared variables may overly limit parallelism. For example, in an operating system one has to lock the global run queue or task data during updates, but one should avoid having a scheduler lock. The latter approach results in a system that may be thread-safe, but it will be slower than the original, single-threaded version. The trade-off one makes is to implement enough fine-grained locking to maximize parallelism without adding too much overhead.

Besides adding to the overhead of even a single-threaded program, locking introduces the problem of deadlock. Deadlock can be avoided by recognizing the hierarchical structure of the data, and obtaining locks in a strict order. However, this requires global knowledge of all data, which may be impossible in a modular system, such as a third-party library.
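One common ordering discipline is to sort the locks by some fixed criterion, such as their addresses, so that every thread acquires them in the same order no matter how it was called. The account variables and the transfer function below are invented for illustration.

```cpp
#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static long balance_a = 100, balance_b = 0;

// transfer may be called with the accounts in either order; sorting
// the mutexes by address ensures every thread locks them the same
// way, so two transfers in opposite directions cannot deadlock.
void transfer(pthread_mutex_t *m1, long *from,
              pthread_mutex_t *m2, long *to, long amount)
{
    pthread_mutex_t *first  = (m1 < m2) ? m1 : m2;
    pthread_mutex_t *second = (m1 < m2) ? m2 : m1;
    pthread_mutex_lock(first);
    pthread_mutex_lock(second);
    *from -= amount;
    *to += amount;
    pthread_mutex_unlock(second);
    pthread_mutex_unlock(first);
}
```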

Thread-private data may be supported explicitly by the compiler, if you're lucky. Failing that, every threads package provides routines for thread-private data, because it is indispensable. In POSIX threads, the routines are pthread_keycreate, pthread_getspecific, and pthread_setspecific. These routines implement an associative memory in which a shared key is associated with a thread-specific pointer.

The easiest form of thread-private storage to use is the stack: each thread has its own stack, so all automatic variables are thread-safe. You can also place a data structure in thread-private storage by using malloc (in C) or new (in C++) to allocate the storage, then saving the pointer in thread-specific data. When a new thread is created, its thread-specific pointers are set to null. Knowing this, the program can tell at runtime when to allocate and initialize another instance of the data structure, so not every thread needs a copy of every data structure. pthread_keycreate registers a destructor function to be called for each key when a thread exits; this function can free the storage and otherwise clean up.
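The lazy-allocation pattern described above can be sketched as follows. The buffer type, the helper names, and the explicit init function are invented for illustration, and the sketch uses the modern POSIX spelling pthread_key_create rather than the Draft 6 pthread_keycreate.

```cpp
#include <pthread.h>
#include <cstdlib>

// One shared key; each thread gets its own pointer slot under it.
static pthread_key_t buf_key;

// The destructor runs at thread exit for every thread whose slot
// holds a non-null pointer, freeing that thread's private copy.
static void free_buf(void *p) { std::free(p); }

struct ThreadBuf { char data[256]; };

// Called once, before any thread uses the buffer.
void init_buf_key() { pthread_key_create(&buf_key, free_buf); }

// Lazily allocate this thread's private buffer on first use: a new
// thread's slot starts out null, which tells us to allocate.
ThreadBuf *get_thread_buf()
{
    void *p = pthread_getspecific(buf_key);
    if (p == 0) {
        p = std::malloc(sizeof(ThreadBuf));
        pthread_setspecific(buf_key, p);
    }
    return static_cast<ThreadBuf *>(p);
}
```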

It's best to make variables thread-private whenever possible. The access cost of thread-specific data is much lower than lock/unlock overhead, and the associated code can be written using the simpler, uniprocessor model.

Other Methods

One can dedicate a thread to particular functions, which makes data and code thread-private by convention. For example, one could arrange the program so that all calls to X Window System routines occur in a particular thread. The initial thread, the one that runs main, is the safest one to dedicate to older libraries. For backward compatibility, many threads packages associate the initial thread with global resources such as errno.

Sometimes global shared variables can be used in a multi-threaded program without additional protection. For example, a Boolean variable used as a flag can be set or cleared without synchronization. The code sequence:

lock;
flag = 1;
unlock;
is silly. But if two or more variables must change "simultaneously," locking may be needed to protect the program state by disallowing certain intermediate states.

Statistics counters employ another kind of update that may go unprotected. An expression like ++count is a multi-step update of a global variable, and thus a candidate for locking. But if all you ever care about is whether the count is zero or non-zero, no harm results even if multiple processors collide on an update. (The value will be at least 1.) This is worth mentioning because test coverage tools often insert such counters, and it is seldom worth the trouble to make those tools thread-safe.

A Question of Semantics

Choosing whether to add locking or to make data thread-private is more than a matter of mechanics. The bigger question concerns the thread model to be used. Are threads to be independent or cooperating? The answer to this question determines the implementation strategy.

If the threads are to be cooperating, meaning they produce or consume the same stream of data, then the data must have internal locking and external synchronization. In this model, the POSIX opendir/readdir facility, for example, would operate with a shared data structure. An application would want to have threads alternately read or write directory entries.

Each independent thread, on the other hand, would produce or consume its own stream of data. In this case the buffering would be thread-private, and there would be no need for internal locking. Either approach is viable; it just depends upon the application. But what is the library writer supposed to do? The answer may be to supply both.

A Thread-Safe C++ Library

If the underlying C implementation is thread-safe, then C++ is very close to it. However, there are a few special considerations. First, the I/O streams cin, cout, cerr, and clog are static objects provided by the library. Nothing prevents two threads from simultaneously writing to the same output stream, and thus corrupting its internal state.

Second, C++ uses an internal list to keep track of the sizes of arrays of objects. This is because C++ lets you write

delete [] array-ref
which implies that the compiler and library have to determine the number of elements. (For various reasons, most implementations can't store the size in the same chunk of memory as the array itself.) Since an array may be allocated on one thread and freed on another, this internal list has to be global. Therefore it has to be protected during updates.

An important issue is whether the I/O streams should be thread-private or shared. Given the nature of the interface, which is to perform I/O by the character, word, or line, it does little good to provide locking within the interface functions. For example, one thread may print hello and another print world, but the screen output will be jumbled anyway unless the threads coordinate their output somewhere outside the I/O library. Adding locking within each I/O primitive call (inside the interface) will make the I/O library thread-safe; indeed, this was the approach taken for POSIX threads. The problem is that it introduces a lot of overhead in the interest of safety.

Alternatively, you could establish the rule that I/O streams are thread-private. Cooperating threads will need to synchronize outside of the interface. (Maybe they would have to anyway.) Independent threads would read or write streams as efficiently as in the single-threaded case.

Given this rule, the library only has to concern itself with the streams it provides initially, namely: cin, cout, cerr, and clog. In C++, it's fairly easy to provide a "wrapper class" that can operate on a thread-private version of the original class.

Wrapper-Class Implementation

The wrapper class defines a small object that holds the thread-specific data key, initial file descriptor, and other stream-initialization parameters. The declared object has a static initializer which takes care of allocating the key. Each of the pre-defined streams is both declared and replaced with a macro, as follows:

extern ostream_withassign cout;      // stream declaration
extern t_ostream_withassign __tcout; // stream wrapper
#define cout (__tcout.ref())
Any textual reference to cout is replaced with a call to the wrapper class, which returns a reference to the thread-private copy of the stream. The thread-private stream is not allocated and initialized until the first call to the ref member function. This introduces the limitation that &cout may not be used in an initializer list, because the address is no longer a constant but must be computed at runtime. The initial thread points at the pre-defined streams, so old binaries will access the initial thread's I/O streams. This preserves backward compatibility with single-threaded code.

The wrapper is a very small class that defines a static object (such as __tcout) which is shared by all threads. The static initializer allocates the thread-specific data key and saves the stream-initialization parameters.

Static initializers are called from the initial thread before main executes.

A pointer to the wrapper object is stored in thread-private storage. This pointer is later passed as an argument to the destructor function called during thread-exit processing, so the thread-private wrapper object and its stream object are deleted when the thread exits.

The ref member function uses the key stored in the static object to look up the thread-private pointer to the wrapper. The first time a new thread calls any I/O function, ref finds that its thread-private pointer is null. When the pointer is null, ref allocates a new wrapper object and a new stream, then stores the pointer to the wrapper object in thread-private storage. Subsequently, ref returns a reference to the stream object associated with the wrapper object, both of which are private to the thread.
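A simplified sketch of this scheme follows. The names here are invented (the real wrapper holds an ostream_withassign, and the article's version stores a pointer to the wrapper object rather than to the stream itself), and a plain struct stands in for the stream class.

```cpp
#include <pthread.h>

// Illustrative stand-in: "Stream" plays the role of the real
// ostream_withassign stream class.
struct Stream { int fd; };

// The shared wrapper object: it holds the key and the parameters
// needed to build a thread-private Stream on demand.
class t_stream {
public:
    t_stream(int fd) : init_fd(fd) { pthread_key_create(&key, destroy); }

    Stream &ref()
    {
        Stream *s = static_cast<Stream *>(pthread_getspecific(key));
        if (s == 0) {                    // first I/O call in this thread
            s = new Stream;
            s->fd = init_fd;
            pthread_setspecific(key, s); // destructor frees it at exit
        }
        return *s;
    }
private:
    static void destroy(void *p) { delete static_cast<Stream *>(p); }
    pthread_key_t key;
    int init_fd;
};

// Every textual use of tcout then yields the thread-private stream.
static t_stream __tcout(1);
#define tcout (__tcout.ref())
```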

With this implementation, only threads that use stream I/O get copies of stream objects. Threads can be created at any time, since I/O stream allocation occurs on demand. The main cost added to implement thread-private streams is one lookup in thread-private storage for every I/O call. This is much cheaper than adding lock/unlock to every I/O call.

A C++ Wish List

It would have been nicer to redefine cout as an object of the wrapper class instead of having to use a macro. In other words, you would want to say

extern t_ostream_withassign cout;
Then you could rely on the user-defined type conversion facility to yield a reference to the desired class. That is, you would simply define

operator ostream_withassign&() {return ref();}
in the wrapper class.

I tried this, and it mostly works, but I found the following limitations:

Invocations of member functions didn't work. For example, cout.flush() had to be transformed to ((ostream_withassign&)(cout)).flush(). Here is a case where it would be nice to be able to overload the dot operator. Then you could say:

ostream_withassign& operator.()
   {return ref();}
The wrapper class could then substitute a reference to the desired class, which defines the member functions.

The compiler complains when the expression in a return statement does not match the function's declaration. For example:

ostream_withassign& f()
   {return cout;}
would work only if the compiler applied the user-defined type conversion.

There is a similar warning when the type of an initializer does not match the type of the variable being initialized. For example,

ostream_withassign &os = cout;
would work only if the compiler applied the user-defined type conversion. If the user-defined type conversion were applied uniformly and it were possible to overload the dot operator, then it would be easier to implement a very useful form of indirection for any object.

Arrays of Objects

As I stated above, the other change needed in the C++ library was to add locking around an internal list. This change is not visible to the library's user, but the locking mechanics are worth discussing.

The list is protected with simple mutual exclusion, using pthread_mutex_lock and pthread_mutex_unlock in the underlying implementation. The list-handling file has a definition at file scope like

static thread_lock_def list_lock;
where thread_lock_def defines a static initializer that allocates and initializes a pthread_mutex_t.

The body of the code where mutual exclusion is needed has the syntax:

{ thread_lock L(list_lock);
   // do something to the list
} // end of critical region
The braces control the scope of L, the lock object. The thread_lock class acquires the pthread mutex in its constructor and releases it in its destructor, so the scope of L determines the bounds of the critical region.

This could be considered just another stupid C++ trick, except that if there is a return or an exception within the region, the destructor for L is still called, releasing the lock. Locking is thus integrated with returns and exceptions.
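A minimal version of such a lock class might look like the sketch below; the list variable and the grow_list operation are hypothetical.

```cpp
#include <pthread.h>

// thread_lock acquires a pthread mutex in its constructor and
// releases it in its destructor, so a block scope bounds the
// critical region, even on early return or exception.
class thread_lock {
public:
    thread_lock(pthread_mutex_t &m) : mutex(m) { pthread_mutex_lock(&mutex); }
    ~thread_lock() { pthread_mutex_unlock(&mutex); }
private:
    pthread_mutex_t &mutex;
    // Copying a lock guard would double-unlock; declare (but don't
    // define) the copy operations to forbid it, pre-C++11 style.
    thread_lock(const thread_lock &);
    thread_lock &operator=(const thread_lock &);
};

static pthread_mutex_t list_mutex = PTHREAD_MUTEX_INITIALIZER;
static int list_len = 0;

int grow_list()                  // hypothetical list operation
{
    thread_lock L(list_mutex);   // lock held until L goes out of scope
    return ++list_len;           // the return still releases the lock
}
```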

There is a quiet gotcha in this implementation. Should you forget to supply a variable name (in this case L) and write thread_lock(list_lock), C++ will lock and then immediately unlock list_lock, leaving the critical region unprotected.

The reason there is no warning has to do with the great debate over the lifetime of temporaries. In a simple expression like X + Y, the compiler may have to make temporary copies of intermediate results, which one would expect to be deleted at the end of the expression, not at the end of the current scope. See your local language lawyer for details.

Conclusion

I successfully converted the Hewlett-Packard C++ library to a thread-safe form. Along the way, I observed: