April 2002/An Efficient and Flexible Tracing Technique

Testing & Debugging

An Efficient and Flexible Tracing Technique

Dmitri Leman

This extensible tracing framework tames the dreaded multithreaded debugging demon.

Suppose you need to track down a problem in a multithreaded or event-driven application. An interactive debugger is of little help, since it shows only a snapshot of your process where it stopped on a breakpoint. Single stepping is of no help either since other threads continue to work and distort the picture. The usual approach then is to sprinkle the program with trace operators and record a log of program execution prior to the problem. There is no shortage of different tracing implementations ranging from simple printf to complicated frameworks capable of collecting traces from multiple applications and computers. Often disappointment awaits the developer when, after the insertion of enough trace operators into the program, the bug disappears. This is especially common in debugging thread-synchronization problems since most tracing routines use thread synchronization inside. For a discussion of debugging multithreaded applications, see [1].

This article presents a utility that uses a circular memory buffer for efficient trace recording. It uses the Pentium interlocked XADD (Exchange and Add) instruction to synchronize access to the buffer without the overhead of critical sections. Originally it was written in C++ for Win32 and used memory-mapped files to collect traces from several applications. Later I ported it to Windows NT kernel mode to debug applications and drivers working together. The version included with this article also was ported to Linux and provides a Java JNI interface, which helps in debugging Java as well as C++ applications. The JNI tracing is not as efficient as C++, however. I also include an application for viewing the trace buffer without affecting the debuggee.

Design

There are two major ways to trace programs: you can manually insert trace statements in the source code, or you can use automated instrumentation of source or binary code. For quick exploration of a program when source code is unavailable or unfamiliar, the second way works best. However, the first method works best when trace statements are written and maintained along with the code. The tracer presented in this article will implement the first approach and provide macros, which should be inserted in the application source code. These macros are similar to printf and trace statements from other tracing frameworks. It should be easy to convert printf or trace statements in existing source files.

Many programmers stop using trace when an application becomes large and too many trace operators cause too much noise and reduce performance. Frequent commenting of printf operators in and out is a common practice, but it is too tiresome. Many other tracing facilities (such as syslog on Unix) use a trace level as the first parameter of trace statements to help reduce output volume. Unfortunately, during debugging, it is usually necessary to increase the trace level to maximum, which causes noise to hide the important data. The tracer presented here will use a bit mask or a keyword to achieve finer selectivity of filtering than a simple trace level. The trace statements should be divided into groups according to their membership in different functional components, such as message loops, networking, drivers, error messages, etc. (A similar technique is presented in [5].) Up to 32 group constants may be defined in enum TraceGroups in htrace.h, each having a single bit set on one of 32 positions. For larger programs, 32 groups are insufficient. In these cases, string keywords should be given as the first parameter of HTRACEK statements. This allows creation of any number of groups, but causes larger overhead for string comparison on the first call to each trace statement. Other approaches for noise reduction are proposed in [2], [3], and [4].

This article focuses on using a memory buffer to collect trace messages, but the tracer also includes routines for other types of trace output such as debug monitor, file, console, event log, and message box. All have some advantages and cannot be replaced by any single kind of output. Table 1 lists some common output methods and their approximate overhead. I’ve seen programs that called several tracing routines like printf and fprintf one after another. To avoid such a mess, the tracer assigns a 32-bit mask or a keyword to each trace output at run time. A TRACE statement is printed to all outputs that has a mask matching the trace group. For HTRACEK statements, a keyword search is performed on the first call after starting or after a keyword change.

To help simultaneously debug several applications and drivers, the tracer should allow the collecting of all trace messages in a single shared memory buffer. To achieve this, it should have routines for managing shared memory and for communicating with a driver. Figure 1 shows a component diagram with several C++ and Java applications and drivers. This is an extreme case, however. Most of the time, I use the tracer in a single application.

Implementation

The tracer implementation consists of two files: htrace.h and htrace.cpp. The format of the memory buffer is declared in GlobalTraceBufferHeader and GlobalTraceBufferFooter found in htrace.h (see Figure 2). The buffer is divided into header, footer, and text areas. The header contains a 4-byte signature and the 12-byte size of the text area in string format. The signature allows identifying the trace buffer in memory and on disk. The size string should be converted to an integer to determine the size of the text area and location of the footer. The reason for keeping the size in a string form is to allow text editors to open and view the trace buffer file as text. In earlier versions, I kept the size as an integer, which caused some text editors to open the file in binary format. Therefore, I moved all other integers to the footer.

The first field of the footer keeps the total number of bytes written to the buffer. This number is updated by using the Pentium atomic XADD instruction, which requires the field to be a 4-byte integer aligned on a 4-byte boundary. Since the buffer is circular, the number of written bytes may exceed the size of the text area. Therefore, the index of a byte next to the last written byte is equal to the total number of bytes written modulo the text area size.

The purpose of the next field, “Stop after threshold,” is to discard all new traces after the total number of bytes written reaches that threshold. This lets a program freeze the trace buffer when enough information before and after an error occurrence is recorded. The third field is an indication that the buffer is already frozen. The last three fields of the footer are used to filter the trace output by mask or keyword as described in the “Usage” section below.

At the heart of memory buffer tracing implementation is AddToTraceBuffer (see Listing 1). It expects two parameters: a pointer to a string and a string length. In the beginning, this routine retrieves a pointer to the memory buffer and its size and verifies that the given string fits into the buffer. Then it calls InterlockedExchangeAdd to increase the “Total number bytes written” field by the length of the string and to get the original value. This reserves a space in the buffer for the new string. This operation is atomic: even if a thread switch occurs or another processor executes this function at the same time, each thread will reserve a separate space. If the reserved space is beyond the “Stop after threshold” value, the function returns. In order to translate the reserved space into an index in the circular text area, its modulus is taken by the size of the text area. Then the routine checks whether the given space wraps around the end of the buffer. If it does not, it simply copies the trace string into the text area. Otherwise, two memcpy calls are needed.

AddToTraceBuffer does not protect its whole body with a critical section or a spin lock unlike other similar techniques (i.e., printk on Linux). Unlike a critical section, the Pentium XADD instruction is very efficient and cannot block the calling thread. Absence of thread synchronization calls also allows tracing drivers and OS internals even at interrupt time. For an example of tracing Windows NT kernel API calls, interrupts, and port accesses, see [6]. A problem with this approach arises from a possible thread switch after the space is already reserved, but before (or during) the memcpy. Eventually the first thread will resume and will finish writing to the reserved space. But if other threads produce a large volume of output before the first thread resumes, they may fill the buffer, wrap around, and start filling the space reserved by the first thread. When the first thread resumes, it will garble the output. To work around this problem, it usually suffices to allocate a large enough buffer to keep at least a few dozen seconds of output without wrapping around. Of course, if the first thread has a low priority, it can be suspended for a very long time and still garble the buffer.

Other routines in htrace.cpp provide several ways of allocating and sharing the trace buffer between applications and drivers. The simplest and most portable way is to use malloc. A buffer allocated by malloc cannot be shared with other applications on Windows or Linux, and it will vanish and take the valuable trace with it in the case of an application crash. But it is a good choice for an embedded environment with shared application memory. htrace.cpp includes a function to print a circular memory buffer to a file. On Windows and Linux, I usually use memory-mapped files instead of malloc. They can be shared between applications, survive an application crash, and can be viewed at run time by a simple text editor or a dynamic viewer utility (available for download from the CUJ website at <www.cuj.com/code>). Occasionally, I have needed to debug applications and drivers together on Windows NT. To support this, I have added routines to allocate memory from non-paged kernel mode space and then have mapped it to the application space. After the first driver allocates a buffer, several applications and drivers can use IOCTL calls to retrieve a pointer to the buffer and produce a combined trace output.

Usage

Like many of the other tracing methods, htrace.h defines macros HTRACE and HTRACEK, which receive a formatting string and a list of parameters like printf. HTRACE expects a trace group constant (usually declared in enum TraceGroups in htrace.h) as the first argument, while HTRACEK needs a string keyword constant. Compilation of these macros, as well as the rest of the tracing engine, is controlled by the TRACE_ compiler definition. Normally TRACE_ should not be defined in a program’s release version, and the compiler will exclude these macros from the resulting code. But the trace support is not restricted to the debug version of a program: if necessary TRACE can be enabled in a release version if a problem is not reproducible in the debug build. (See [7] for another tracing approach for release builds.)

If a program already has a lot of trace statements like MFC TRACE or syslog, it may be possible to redefine them in a header file to use HtraceImpl, which is the workhorse of the tracing engine. The only significant difference is that syslog uses a priority as the first parameter, while Htrace uses bit masks or keywords.

To use the trace, an application should call TraceInitialize at the beginning and call TraceUnInitialize before termination. The second step is to redirect different trace groups to various output streams. TraceAssignGroupsToStream allows turning on and off individual group flags defined in enum TraceGroups for each trace output specified by constants in enumTraceOutputs. TraceAssignGroupKeyWordToStream is used to assign group keywords to output streams. Figure 3 shows an example of trace initialization, as well as HTRACE and HTRACEK macros. htrace.h contains a detailed description of all tracing routines.

Several routines are provided for managing the trace memory buffer. TraceAllocateBuffer uses malloc or another platform-specific API to allocate a local buffer. TraceDumpBufferToFile prints a circular trace buffer to a file. TraceUseMemMapFileBuffer creates a memory-mapped file to share the trace buffer among several applications. TraceSetExternalBuffer and pGetTraceBuffer conveniently let one application or driver use a buffer allocated by another. TraceAttachToNTDriverBuffer works only on Windows NT and uses IOCTL calls to retrieve the buffer from a driver. TraceFreezeBufferAfter allows freezing the trace buffer after a certain amount of data (measured in a percentage of the buffer size) has been added from the moment of the call. It is valuable to make sure that the traces immediately preceding or following a bug report will not be lost. For example, calling TraceFreezeBufferAfter (50) ensures that the last trace will be in the middle of the trace buffer after the freeze. Calling TraceFreezeBufferAfter (100) at the beginning of execution effectively transforms a circular buffer into a linear one. A linear buffer is better than circular one when debugging early stages of system initialization because it prevents the very first traces from being overwritten.

Portability

htrace.cpp requires a C++ compiler. I used Visual C++ 6.0 for Win32 and Windows NT kernel mode and g++ version egcs-2.91 for Red Hat Linux 6.1. htrace.h can be included by C modules as well as C++. Java support is provided using JNI 1.1. htrace.cpp includes JNI routines, which are declared in hTrace.java.

Memory buffer tracing routines are probably the most easily portable. The minimal solution includes AddToTraceBuffer and TraceAllocateBuffer. AddToTraceBuffer uses InterlockedExchangeAdd available on Win32 and Windows NT kernel mode. When porting to Linux, I implemented InterlockedExchangeAdd using the Pentium XADD instruction. There should be no problem with compiling it on other operating systems running on Pentium.

To port to other processors, I suggest first looking into the OS API for interlocked or atomic routines. If found, it is necessary to make sure that the API does not use any critical sections, semaphores, or other synchronization objects inside. If nothing is found, I would look for an appropriate instruction in the processor manual. Modern processors usually have some support for atomic operations. It may be a single instruction (like on Pentium) or a well-defined sequence (like on Alpha — see “Atomic Update of a Single Datum” in Alpha Architecture Handbook). Extra care should be taken with multiprocessor systems. On Pentium, for example, it is necessary to add the LOCK prefix before the XADD instruction.

Recently I needed to port the trace to the StrongArm processor and didn’t find any support for atomic updates except for a simple exchange instruction. So far, I have simply used a non-atomic update, which requires at least three instructions to load, update, and store the variable. Since there is only one processor, there is a small risk of a thread switch between these instructions, which will cause one or more missing trace strings. If this will be a problem, I may try to disable interrupts for the duration of these instructions. Disabling interrupts is a common practice in simple embedded environments where all code executes with administrative privileges.

TraceAllocateBuffer requires support for malloc, which is not available on some platforms. For example in Windows NT kernel mode, I used ExAllocatePool instead. Memory-mapped file support routines are more difficult to port. I ported them to Linux using the mmap API. The biggest difficulty was to count how many processes use the file to ensure that only the first process performs initialization of the buffer. The solution was to use a counting semaphore.

htrace.cpp has a lot of other platform-specific routines to provide trace to the console, debugger, file, etc. Most of them have some kind of analog on different platforms. For example, OutputDebugMonitor is implemented as OutputDebugString on Win32, as DbgPrint in the Windows NT kernel API, and as syslog on Linux.

When porting to a new environment, I first find a platform-specific compiler definition, such as WIN32 or __linux__. If nothing unique is defined, I declare my own (i.e., TRACER_NTDRIVER_) in the compiler settings. It is also necessary to define TRACE_ to enable the trace. Then I add the most common platform-specific headers under #ifdef at the top of htrace.cpp, such as windows on Windows or "unistd.h" on Linux. From the beginning, it may not be clear which other more specific headers, such as sys/mman.h, to add. I will do it later when adding support for a particular trace output. Next I try to compile to make sure that the common trace initialization and formatting routines, such as HtraceImpl and TraceAssignGroupsToStream, can compile and link. I had the most trouble with the vsnprintf C library routine, which is called differently on Linux and Windows and not implemented on other operating systems. A quick solution is to use vsprintf instead and remember to never print a single trace statement larger than the temporary buffer inside HtraceImpl (512 bytes). Unlike vsnprintf, vsprintf does not check for buffer overwrites. Then I port a minimal set of output routines. The usual candidates are AddToTraceBuffer, TraceAllocateBuffer, WriteCircularBufferToFile, and OutputConsole. The initial porting usually takes a few hours to enable the trace and start adding HTRACE macros to a program.

The tracing framework allows easy addition of new output streams. For example, a requirement may be to dump all errors and warnings into a database, send them by email, or redirect to another computer. If the application already has trace statements with appropriate group constants, it is necessary only to add an output constant to enum TraceOutputs, write an output routine to htrace.cpp, and add a line to array s_Outputs. Then the selected trace groups may be redirected to the new output by changing appropriate masks, which are usually stored in an .ini file or the registry.

Tracing Thread Context Switches

To show an example of using the trace with applications and drivers, I wrote a small Windows NT kernel-mode driver, which traces thread switches. Seeing the exact moment in the trace file when the thread switch happens may help to find some thread-related bugs. The driver places a hardware-assisted breakpoint in the location fs:[0x124], which holds a pointer to the current thread in all versions of Windows NT 4.0 and 2000. This technique is not perfect. The tracing may stop if some other application or debugger modifies the debug breakpoint register. The machine may crash if a system-level debugger installs an interrupt handler while the tracer driver is working. Other methods of thread switch tracing are discussed in [8].

The example also includes a simple application BankAcct that implements a common “bank account” routine, which modifies the account value. When called from multiple threads without proper synchronization of the account, it will, sooner or later, corrupt the account value, which will drop below zero. Similar examples are often used to illustrate thread-related bugs. With the trace, it is possible to see exactly when the thread switch happened that caused the corruption. An excerpt from the trace file is shown in Figure 4. For details about running this sample see readme.txt included with the source code.

Conclusion

Many programmers avoid using trace because it is too tiresome to add printf operators and then comment them out. Also the large volume of output may slow down the application. I found that having a flexible and efficient tracing framework in place made it easy to start adding trace statements during the initial program development. As complexity of the system grows, additional traces may be added and trace groups refined, which allows more detailed views of the system behavior. It is very rarely necessary to remove or comment out trace statements, since overhead is negligible and no output is produced unless the corresponding group is enabled. Trace improves visibility of the inner working of applications and drivers and helps locate the toughest bugs.

Source Code

The complete source code along with the example application is available for download from the CUJ website at <www.cuj.com/code>.

Notes

[1] Peter Horwood, Shlomo Wygodny, and Martin Zardecki. “Debugging Multithreaded Applications,” Dr. Dobbs Journal, March 2000.

[2] John Robbins. “Bugslayer,” Microsoft Systems Journal, December 1997.

[3] John Robbins. “Bugslayer,” Microsoft Systems Journal, June 1999.

[4] Rainer Storn. “A Debug/Trace Tool,” Dr. Dobbs Journal, February 1997.

[5] Daniel Pilat. “An Improved TRACE() Macro,” Windows Developers Journal, July 1999.

[6] Dmitri Leman. “Tracing Kernel API Functions,” Windows Developers Journal, April 2000.

[7] Boris Bromberg. “Tracing an Application in Release Configuration”, C/C++ Users Journal. September 2000.

[8] Jacob R.Lorch and Alan Jay Smith. “The Vtrace Tool,” MSDN Magazine, October 2000.

Dmitri Leman is a software engineer in Silicon Valley. He has been developing Windows, DOS, and Linux applications, and device drivers for 10 years. He can be reached at DmitriL@Forwardlab.com.