Dave Schmitt was a founder and president of Lattice, the well-known C compiler company, from 1983 until 1990. Prior to that, he worked at Bell Telephone Laboratories where he was involved in the design of fault-tolerant operating systems, including a nonstop version of UNIX. He is currently a free-lance author and consultant. You may contact him at Pivot, 708-469-2235.
As more C programmers begin to work with applications that involve multitasking, the weaknesses of C in this area become more apparent. One particularly vexing problem arises when you use threads in OS/2 and other operating systems. This article describes the problem and proposes a language extension that solves it. This extension has been implemented in recent versions of the Lattice C Compiler, and has greatly simplified our OS/2 programs.
Multi-Threading
Most modern operating systems are process-oriented. A process is an entity for which the operating system can allocate dynamic resources, such as the processor, memory, and I/O devices. Static resources, such as disk files, are usually allocated to a higher-level entity known as a user, while human interfaces such as the keyboard and screen are allocated to a collection of processes known as a session.

The concepts of user, session, and process are easy to visualize and understand because they relate to "real world" objects. In some sense, a user is a person, a session is a terminal, and a process is a program. Threads are a bit more difficult to understand because they don't have this real-world aspect. A thread exists only as an ephemeral object called an execution stream.
Execution Streams
A process is created when the operating system loads a program into memory and schedules it for execution by placing its starting address and initial register settings into an execution queue or list. Eventually the process percolates to the top of the execution list, where it obtains control of the processor for its allotted time period or until it pauses for an I/O operation. At that point, the operating system makes an entry in the execution list so that the process can continue when it is ready. This is a simplified description of the way most operating systems manage the primary execution stream, or thread, of an active process.

In a single-thread system, each process can have only one entry in the execution list, while a multi-thread system allows a process to have several scheduling entries. That is, multi-threading means that the operating system manages two or more execution streams for the same program.
Perhaps the best way to visualize multi-threading is to think of the initial execution stream as the parent thread and to view the child threads as functions that execute concurrently with the parent. You create a thread by telling the operating system, "Transfer control to this function, but return to me before the function is finished." Then every so often you ask the operating system, "Is that function finished yet? If so, give me its return code."
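This create-now, collect-later pattern can be sketched with POSIX threads. (The article's OS/2 examples would use thstart or DosCreateThread instead; the function names below are illustrative, not part of any real library.)

```c
#include <pthread.h>
#include <assert.h>

/* The child thread: a function that executes concurrently with the parent. */
static void *child(void *arg)
{
    long n = (long)arg;
    return (void *)(n * 2);          /* the child's "return code" */
}

/* "Transfer control to this function, but return to me before the
   function is finished" -- then later ask for its result. */
long run_child(long n)
{
    pthread_t tid;
    void *result;

    pthread_create(&tid, NULL, child, (void *)n);  /* returns immediately */
    /* ... the parent continues with its own work here ... */
    pthread_join(tid, &result);      /* "is that function finished yet?" */
    return (long)result;
}
```

pthread_join plays the role of the "give me its return code" query: it blocks only if the child has not yet finished.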
Using Threads
Consider a program that maintains a representation of a spreadsheet in memory. The program's primary purpose is to allow its user to enter data and formulas into the spreadsheet and to display the current spreadsheet information in row-column format. Each time the user changes a spreadsheet cell, the program calls a function to compute the related changes in other cells.

As the spreadsheet grows, the computation takes longer and longer. At some point the delays begin to annoy the user, whose terminal becomes sluggish or locked out while the function executes. A multi-threading system like OS/2 solves this problem easily: the program calls the computation function as a child thread running at a lower priority than the parent. The computation then goes about its business in the background while the parent thread continues to interact with the user.
Notice that it is easier and more efficient to treat the computation function as a thread instead of creating a separate process for it. Since the only resource that this procedure needs is the spreadsheet data, it would be wasteful to have the operating system go through all the work of building I/O and other resource management structures that would never be used.
In other words, if you want to perform some operation using the resources already allocated for your process, it is usually more efficient to create a thread instead of another process. When you ask the operating system to create a thread, it need only make another entry in the execution list. Process creation, on the other hand, involves a lot of time-consuming resource allocation operations. This distinction has led some people to refer to threads as lightweight processes.
Interference Among Threads
Any software system that employs shared resources is prone to interference among the sharers, be they users, sessions, processes, or threads. For instance, if two processes are sharing the same data memory, they must carefully coordinate their actions so that the data remains consistent. Similarly, if two users are accessing the same file, they must coordinate their activities so that the data in the file does not get corrupted.

Operating systems usually provide a variety of interprocess communication (IPC) techniques to facilitate this coordination. In addition, most operating systems require explicit declarations of shared resources at the user, session, and process levels. Two processes may share memory or I/O devices only if they both inform the operating system of their intention to do so. Two users may share a file only if the owner agrees and so informs the operating system. Several application programs may share the same terminal session only if they work through some arbiter, like the Presentation Manager, which prevents the screen appearance from confusing the user.
At the thread level, coordination problems are more subtle and more severe. The threads of a process share all of that process's resources, making the chance of unwanted interference very high. For example, consider the error code errno that C programmers use to determine if a library function has failed. Suppose one thread calls a library function, which takes an I/O break. During the break, a second thread calls a library function which places an error code into errno. But before the second thread can read errno, the first thread resumes and changes it to a different value. The second thread has been compromised and may make an incorrect decision because the threads have not coordinated their use of errno, which is a shared resource.
This problem would not occur if all threads used a semaphore or other IPC mechanism to control their access to errno. In fact, the ANSI C committee decreed that programmers should not assume that errno is simply a memory location, but should code it as a macro that invokes a function returning the appropriate value. In other words, a C compiler could implement errno as
#define errno get_errno()

Then each reference to errno would actually call the get_errno function, which would return the copy that is appropriate for the current thread.

The trouble with this approach is that the typical C compiler/library system is chock full of these subtly shared memory areas. For instance, what about the static work areas used by many floating-point libraries? What about the stream I/O data structures and buffers? What about the static data areas used by the ANSI-defined time and date functions? The library would be much slower and larger if each of these shared objects had to be protected by a semaphore.
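A modern sketch of this idea uses C11's _Thread_local, the standardized cousin of the private keyword discussed later in this article. Real C libraries typically expand errno to a dereferenced function call so that it remains assignable; the names here (get_errno_ptr, my_errno) are illustrative only:

```c
#include <assert.h>

/* Each thread gets its own copy of the error code. */
static _Thread_local int errno_storage;

/* Returns a pointer to the calling thread's copy. */
int *get_errno_ptr(void)
{
    return &errno_storage;
}

/* Dereferencing keeps my_errno usable on both sides of an assignment. */
#define my_errno (*get_errno_ptr())
```

Each thread that reads or writes my_errno touches only its own storage, so the interference scenario described above cannot occur.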
Furthermore, my experiences with OS/2 demonstrated that it is very easy for an application programmer to unwittingly define data items that other threads corrupt. This situation often occurs when additional threads are introduced into a debugged program in order to make it more efficient. Suddenly the program behaves strangely because the threads are not coordinating their access to shared objects which were previously used in a serial fashion.
Eliminating Thread Interference
When adapting the Lattice C compiler to OS/2, we decided that a C extension would be an appropriate solution to the problems of shared access. If we could inform the compiler about possible thread interference, it could automatically generate code to avoid this problem. Three approaches looked promising:

- Tell the compiler which data objects must be replicated for each thread.
- Tell the compiler that a certain sequence of statements should not be interrupted by another thread.
- Tell the compiler that a certain sequence of statements should be controlled by a semaphore.

We decided that all three approaches were useful and that the first two were relatively easy to implement via two new keywords: private and critical. The third approach is more complex and involves two other new keywords: prolog and epilog. The Lattice v6 compiler for DOS and OS/2 supports private and critical, but doesn't implement the third approach.
The Private Keyword
The private keyword describes a data object which must be replicated for each thread. In other words, each thread needs a private copy of the object. Whenever the program uses a private object, the compiler generates code to access the appropriate instance.

The programmer defines and declares private objects using the same syntax as near, far, and huge, which were introduced by Microsoft in order to handle pointers of various sizes. These keywords, together with const and volatile defined by ANSI, specify the access method for data objects. The other data definition keywords, such as int and float, are used to specify an object's "shape".
Defining Private Data
You can use the private keyword to define data and pointers.
(1) private int errno;
(2) int private counter = 200;
(3) char private buffer[512];
(4) char *(private ptr1);
(5) int private *ptr2;
(6) float private *(private ptr3);

The first two examples define private integers named errno and counter, while the third defines a private buffer of 512 bytes. In these three cases, you can write the int and private keywords in any order. You can also initialize data objects when defining them as private, as in example two.

Using the private keyword with pointers is a bit more complicated, although no more so than using near, far, and huge. When forming such a definition, you must ask yourself, "What is private? The pointer itself, the object being pointed to, or both?" Examples four, five, and six illustrate each of these cases. Example four defines a pointer that will be replicated for each thread. Notice that the private keyword appears closest to the object name, ptr1. Example five defines a normal pointer to a private integer, and example six defines a private pointer to a private floating-point number. You could omit the parentheses in these three examples, but they do make the statements easier to understand.
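For comparison, C11 later standardized the replication idea as _Thread_local (restricted to objects with static storage duration). A sketch of the first few definitions above in that notation:

```c
#include <assert.h>

/* C11 analogues of the article's private definitions (a sketch). */
static _Thread_local int counter = 200;   /* cf. int private counter = 200; */
static _Thread_local char buffer[512];    /* cf. char private buffer[512];  */
static _Thread_local char *ptr1;          /* a per-thread pointer           */

int bump_counter(void)
{
    ptr1 = buffer;        /* each thread sees its own ptr1 and buffer */
    return ++counter;     /* and its own counter, starting at 200     */
}
```

Each thread that calls bump_counter increments only its own counter, starting from the initial value 200.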
I must point out a terminology problem that Microsoft created in the documentation of the near, far, and huge keywords. Consider the two statements:
char far *ppp;
char *(far qqq);

Microsoft calls ppp a far pointer when it actually resides in the default data area and only points to a far object. In other words, although ppp refers to a far object, the pointer itself is not a far object. The second statement defines a pointer qqq that actually resides in the far area, and so seems more properly called a far pointer. Despite our misgivings, we decided to propagate Microsoft's terminology when using private. So, in the statements
char private *ppp;
char *(private qqq);

ppp is called a private pointer even though it is really a pointer to a private object. It's not clear what to call qqq.
Declaring Private Data
Using the private keyword in external data declarations couldn't be easier; you just place the extern keyword in front of the definition and strip off any initializers. For example, the external declarations for the preceding definitions are:
(1) extern private int errno;
(2) extern int private counter;
(3) extern char private buffer[512];
(4) extern char *(private ptr1);
(5) extern int private *ptr2;
(6) extern float private *(private ptr3);

It is extremely important to declare private data correctly so that the compiler will generate the proper accessing instructions. This is also true with near, far, huge, volatile, const, and any other access-method keywords. Later I'll explain a technique that the Lattice compiler and linker use to catch situations in which private data is not declared consistently within a group of modules.
Pointer Conversion Rules
The general rule when working with access method keywords such as private is not to make an assignment that would cause the compiler to "forget" how to access an object. For example, consider the following definitions and assignment statements:
int ni, private pi;
int *np, private *pp;

(1) np = &ni;   (Correct)
(2) np = &pi;   (Incorrect, perhaps)
(3) pp = &pi;   (Correct)
(4) pp = &ni;   (Incorrect, perhaps)

You can assign the address of the normal integer ni to the normal pointer np, but the compiler should complain if you attempt to place the address of the private integer pi into np. Similarly, the private pointer pp can hold the address of the private integer pi but not the address of the normal integer ni; the compiler should warn you about this questionable assignment.

In practice, a compiler would probably not complain about the fourth statement because most computer addressing methods can cope with this type of assignment. Similarly, the second statement would usually be accepted in the large memory model because normal pointers are large enough to hold the addresses of private objects. Similar loopholes exist in Microsoft's implementation of near, far, and huge. Nonetheless, as the concept of C access methods becomes more widely accepted, you will be ahead of the game if you use the stricter (and therefore more portable) rule stated earlier.
Private Data Initializers
Initializers can be used when defining private data objects, as shown in the earlier example where we initialized the private object counter. However, you are not allowed to take the address of a private object in a static initialization. Consider the following example:
int private counter = 200;
int private *ppp = &counter;    /*** WRONG ***/

foo(void)
{
    int private *qqq = &counter;
    int private *rrr;

    rrr = &counter;
    *rrr -= 20;
}

All statements in this example are correct except the initialized definition of pointer ppp. Static initializations like this are performed at compile time, but the address of a private object cannot be determined until run time. Definitions such as that for qqq are allowed because the initialization occurs at run time. And, of course, you can always take the address of a private object in an assignment statement.
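The same restriction survives in C11: the address of a _Thread_local object is not a compile-time constant, so it can only be taken at run time. A small sketch:

```c
#include <assert.h>

static _Thread_local int counter2 = 200;

/* static int *bad = &counter2;   -- rejected: not a constant expression */

int shrink(void)
{
    int *rrr = &counter2;    /* fine: the address is taken at run time */
    *rrr -= 20;
    return counter2;
}
```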
Intel 16-Bit Implementation
Lattice's implementation of private on the Intel 80286 is simple and efficient. The compiler includes each private object in the PRIVATE linking class, and it generates code to access these objects via the Stack Segment (SS) register. The linker gathers all private objects into a contiguous area that is placed at the lower end of the STACK segment. Each time the program starts a thread via the thstart function from the Lattice OS/2 library, a new stack is allocated, and the private area in the current stack is copied to it. This method implies that a child's private objects have values identical to the parent's at the point thstart was called.

This technique works in the Lattice D, L, and H memory models because SS can take on different values in those models. Since in the S and P models SS must always point to the default data segment, DGROUP, the private keyword is not allowed in those models. Given that most OS/2 programs use the large model, this is not a serious restriction.
Earlier I mentioned that you must use the private keyword consistently. If an object is defined as private, then all external declarations of that object must also use the keyword. In order to catch inconsistent declarations at link time, the Lattice compiler automatically appends an @ character to each private name. Thus, the declaration
private int counter;

actually defines counter@ as a public object. If another program then declares counter without the private keyword, it will refer to the global symbol counter, which should cause the linker to complain about an unresolved reference. This technique is not foolproof, but it goes a long way towards eliminating inconsistent declarations.

The Lattice implementation of private will not work with Microsoft's compiler because Microsoft requires SS to contain the DGROUP address in all memory models, an unfortunate design decision on Microsoft's part. It not only makes multi-thread and re-entrant C programming more difficult, but it also limits the stack size to substantially less than the 64K allowed by Lattice and other compilers.
Nonetheless, the private keyword could be implemented in the Microsoft environment by using the more general technique that Lattice employs in the 80386 32-bit mode.
Intel 32-Bit Implementation
OS/2 v2 runs on the 80386 in the 32-bit "flat address" mode. Flat addressing means that segmentation is not used, and the data area is accessed via simple byte addresses. This addressing method is similar to that of the Motorola 68000 family and other true 32-bit processors.

In the flat addressing mode, the stack segment register SS does not change for each thread. Instead, OS/2 allocates the thread stacks within the 32-bit address space and adjusts the stack pointer register ESP for each thread. This means that we cannot use SS as the base address for private data; we use EBX instead. The compiler then generates EBX-relative addressing to access private data objects. Each time thstart is called, it allocates a new private data area, copies the current thread's private data to it, and sets EBX to that area for the new thread.
Although reserving a base register for private data might seem wasteful, it does not seriously impair the compiler's ability to generate good 386 code. The flat address model eliminates many of the inefficiencies caused by segmentation, and the 386's 32-bit registers make pointer manipulation and floating point computations much easier. In light of these major architectural improvements, the reservation of the EBX register is a small price to pay for the advantages of the private keyword.
The critical Keyword
The critical keyword describes a block of statements or a function that must not be interrupted by some other program that might interfere with it. In other words, the designated block or function forms a "critical section" of the program. Listing 1 shows how you define a critical function.

In Listing 1, the entire function foo is protected from interference. Furthermore, any functions that foo calls are also protected, even if they are not defined as critical.
Suppose that only the work associated with type code 2 is prone to interference. Then you would use a critical block instead of a critical function, as shown in Listing 2.
Preceding a bracketed block of statements with critical informs the compiler that those statements should not be interrupted. Functions called from within the bracketed code are also protected.
You might conclude that critical is a frivolous language extension, since the C programmer can easily create a critical section by calling the appropriate lock and unlock routines. However, I've seen many cases where a programmer forgets to call the unlock routine at every function return point. The critical keyword makes the compiler do this automatically.
Listing 3 shows how our previous critical function would look if we had to explicitly call the lock and unlock routines.
The function is very messy without the critical keyword. You must call the unlocking routine before each return statement, or else use goto statements to jump to a common exit point for each type of return. It's easy to forget to call the unlocking routine, especially in if statements like the one in the default case.
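What the critical keyword saves you from writing can be sketched in plain C. The lock and unlock stand-ins below just maintain a counter so the sketch is self-contained; a real version would call DosEnterCritSec and DosExitCritSec or a semaphore pair:

```c
#include <assert.h>

static int lock_depth;                        /* stand-in for real locking */
static void lock(void)   { ++lock_depth; }
static void unlock(void) { --lock_depth; }

/* What `critical int foo(int type)` effectively expands to: the compiler
   inserts lock() on entry and unlock() before every return point. */
int foo(int type)
{
    int rc;

    lock();
    switch (type) {
    case 1:  rc = 1;  break;
    case 2:  rc = 2;  break;
    default: rc = -1; break;
    }
    unlock();                                 /* never forgotten */
    return rc;
}
```

With a single exit point the compiler (or a careful programmer) pairs every lock with exactly one unlock, so the lock depth always returns to zero.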
Lattice v6.0 Implementation
The initial implementation of the critical keyword for DOS and OS/2 allows you to define critical functions only. Critical blocks may be supported in a future release. Furthermore, the Lattice implementation assumes that you are concerned only with interference from the other threads within your process. You can easily change this assumption by providing your own lock and unlock routines.

When you define a critical function, the compiler generates a special prolog and epilog, which call library routines named _CXENTRY and _CXEXIT, respectively. _CXENTRY uses the OS/2 Application Program Interface (API) service named DosEnterCritSec, and _CXEXIT uses DosExitCritSec.
Since _CXENTRY and _CXEXIT are just library functions, you can replace them with versions that are appropriate for another environment. For example, when writing an I/O driver, you probably want critical sections to actually block interrupts. Likewise, in a shared database scenario where threads may interfere with other processes, the critical section routines could use a global semaphore.
Note that critical functions can be nested. The protection begins when the first (i.e., outermost) critical function is entered and ends when that function returns. Nested critical functions may require that some versions of the _CXENTRY and _CXEXIT routines maintain an up-down counter.
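Such an up-down counter version of _CXENTRY and _CXEXIT might look like this (a sketch; only the outermost entry and exit would invoke the real OS/2 services):

```c
#include <assert.h>

static int crit_depth;

void cx_entry(void)
{
    if (crit_depth++ == 0) {
        /* DosEnterCritSec() would be called here under OS/2 */
    }
}

void cx_exit(void)
{
    if (--crit_depth == 0) {
        /* DosExitCritSec() would be called here under OS/2 */
    }
}
```

Nested critical functions then simply raise and lower the counter; protection ends only when the outermost critical function returns.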
prolog and epilog Keywords
Although the critical keyword can be used to prevent interference among threads that are sharing data, it is deficient in two ways. First, it blocks all other threads in the current process, even those that are not likely to interfere. Second, it does not block threads in other processes that may be sharing the data, unless you replace the critical section lock and unlock routines as described earlier. Most programmers employ semaphores to provide finer-grained control of individual resources. Local semaphores can coordinate the threads of a process, while global semaphores provide inter-process coordination.

With this method a semaphore controls a related set of shared data objects. You then request the operating system to give you control of the appropriate semaphore before accessing any of the controlled objects. If no other process or thread is using the semaphore, it is assigned to you; otherwise the operating system suspends your thread until whoever owns the semaphore releases it. After finishing with the data, you again call the operating system to release the semaphore.
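The request/use/release protocol looks like this in plain C. A flag stands in for a real semaphore so the sketch is self-contained; under OS/2 the calls would be along the lines of DosSemRequest and DosSemClear:

```c
#include <assert.h>

static int  sem_held;         /* stand-in for a real semaphore       */
static long shared_total;     /* the data the semaphore controls     */

static void sem_request(void) { sem_held = 1; }   /* cf. DosSemRequest */
static void sem_release(void) { sem_held = 0; }   /* cf. DosSemClear   */

void add_to_total(long n)
{
    sem_request();            /* suspend until the semaphore is ours */
    shared_total += n;        /* no other sharer can touch the total */
    sem_release();            /* let the next waiter proceed         */
}
```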
Although this semaphore protocol is simple, it's tedious to code each time you access shared data. The prolog and epilog keywords place this burden on the compiler in a way that should make your programs easier to understand and maintain.
Listing 4 defines a general locking prolog called lock and a companion epilog called unlock. These would normally be placed in a header file just like structure definitions. We then code an application function named update which updates a shared buffer whose address is passed as its first argument. Access to the buffer is controlled via a semaphore whose handle is passed as the second argument.
The first line of the update function is the usual prototype. Then, before defining the body of the function, we specify that function's prolog and epilog calls. The body of the function is coded in the normal way. The compiler, in effect, inserts the prolog before the first statement in the body, and the epilog before each return point.
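Listing 4 itself is not reproduced here, but the net effect just described can be sketched in plain C: the compiler behaves as if the prolog call were the first statement of the body and the epilog call preceded each return. The lock and unlock stand-ins below are illustrative:

```c
#include <assert.h>

static int  sem_locks;                            /* counts lock/unlock pairs */
static void lock(long sem)   { (void)sem; ++sem_locks; }
static void unlock(long sem) { (void)sem; --sem_locks; }

/* How update() behaves once the compiler inserts its prolog and epilog. */
int update(char *buffer, long sem)
{
    lock(sem);               /* inserted prolog                      */
    buffer[0] = '!';         /* the body: modify the shared buffer   */
    unlock(sem);             /* inserted epilog, before every return */
    return 0;
}
```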
prolog and epilog form a powerful extension to the C language and address a problem that is not currently handled by C++ or other extensions. However, the Lattice compiler does not yet support this feature because its syntax and semantics are still being discussed.
Summary
Programmers have used C in real-time and multitasking applications for many years, and yet the language has not been extended to handle the basic problems of these environments. When I worked with UNIX at Bell Laboratories, I remember that the kernel was littered with explicit calls to critical section control routines and semaphore handlers.

Perhaps the lack of features was acceptable in the days when only a few skilled programmers dealt with multitasking and shared data. Now, however, the high-level API of OS/2 allows the "average" programmer to exploit these advanced operating system features. In addition, more and more programmers need these features in their applications, whether under OS/2, UNIX, or an embedded system real-time executive.
Recognizing this, Lattice has taken a tentative step towards evolving C so that it can better support multitasking. The company welcomes comments and criticisms, with the hope that this discussion will lead to an enhanced ANSI standard.
Acknowledgements
I would like to thank my friends and associates at Lattice and SAS Institute for their assistance with this article. Special thanks to John Pruitt and Glenn Musial, who have primary responsibility for the Lattice C Compiler and are keenly concerned about the evolution of the C language.