Tracing Application Execution

C/C++ Users Journal June, 2004

Tracing without hindering performance

By Tomer Abramson

Tomer Abramson is a software engineer for cti2. He can be contacted at tomeracti2.com.

The most common and effective way to debug and get information about applications is to use some form of tracing. However, dumping traces to output devices (files or screens) can dramatically reduce an application's performance. In this article, I'll present a technique that transforms tracing from a "heavy" to "light" operation, so that traces can be used without interfering with the application's main tasks. I've used this "light trace" technique in applications ranging from telecommunications systems to semiconductor diagnostic tools. The utility is written in C++ and compiled with Visual C++ 6 under Windows 2000. To simplify the code, error checking isn't performed.

A traditional trace is dumped when the application sends the trace directly to the output device (see Figure 1). This form of tracing affects the performance of applications, especially if traces are heavily used, and creates problems in real-time applications. With light traces, the trace is dumped to an internal buffer instead of dumping directly to output devices. A thread, which runs in low priority, retrieves the trace from the buffer, then dumps it to the output device; see Figure 2.

If the working thread runs in high priority and the tracer thread in low priority, the tracer thread gets CPU time only when the working thread does not use the CPU. In this way, the tracer thread has almost no affect on the performance of the application, even if the application uses traces heavily.

To realize another advantage of this technique, assume you are using a conventional trace in a multithreaded environment with, say, 100 threads that are dumping traces at the same time to the log file. Because of the need to synchronize the access to the log file, only one thread at a time can write to the file, while all other threads are waiting. With the light trace, the threads are dumping the traces to the trace buffer. Each time a thread is dumping a trace to the buffer, there is no need to lock the entire buffer—only a specific entry. This way, the number of threads that can access the buffer at the same time is equal to the number of entries in the trace buffer.

Light-Trace Design

The light-trace utility contains three main components:

The basic flow for using the tracer is straightforward: You initialize the tracer thread in the entry point of the application. When the application produces a trace, the trace goes into the trace buffer, where it waits until the tracer thread retrieves the trace and dumps it to the output device. At the exit point of the application, the tracer thread dumps any traces left in the buffer before terminating the application.

That's it.

Light-Trace Implementation

The component responsible for most of the work in the light-trace utility is the trace buffer, implemented in the CtraceBuffer class (Listing 1). The class is multithread safe, and supports concurrent insertions/retrievals of traces to/from the buffer from different threads (although in the light-trace utility, there is only one thread that retrieves traces).

The buffer is implemented as a cyclic queue: When inserting traces to the buffer, the first trace goes into entry 0, the next into entry 1, and so on, until the last entry in the buffer is filled. When the last entry is filled, the next trace goes again into entry 0 (if the trace in this position was already dumped to the log file).

Retrieving from the buffer starts from entry 0 towards the last entry. After retrieving the trace from the last entry, the next retrieval is from position 0 (if it is full). Each entry in the buffer is a struct of type TraceInfo, declared in TraceBuffer.h (available at http://www .cuj.com/code/), which encapsulates the trace information. The buffer that holds the traces is allocated statically and has a fixed size. I strongly recommended that you not use any form of dynamic buffer or queue implementation (such as STL) because an average application produces hundreds of traces per hour. Performing this amount of allocations causes fragmentation and leads to reduced performance.

The CtraceBuffer class contains two semaphores:

The class also contains two members, m_nInsertPos and m_nRetrievePos, which are holding the next entry for insertion/retrieval from the queue, respectively.

The heart of the CtraceBuffer is its Insert() and Retrieve() methods: When calling the Insert() method, you first check if there is a free entry in the buffer by waiting on m_hNumFreeEntriesSemaphore. Each successful wait decrements the semaphore value by 1. There are no free entries in the buffer when the semaphore's value is 0. The next wait operation blocks until an entry is freed.

After validating that the buffer has a free entry, find the position of the free entry by calling InterlockedIncrement(&m_nInsertPos). The return value is the entry to which you insert the trace. After inserting the trace to the buffer, I update the number of traces in the buffer, which are waiting to be retrieved. This is done by calling ReleaseSemaphore(m_hNumItemsToCollectSemaphore).

How do you get the position of the free entry in the buffer? The straightforward way is something like this:

EnterCriticalSection()
If (m_nInsertPos == BUFFER_SIZE)
	   m_nInsertPos = 0;
else
           m_nInsertPos++;
nInsertPos = m_nInsertPos;
LeaveCriticalSection()
m_buffer[nInsertPos] = traceInfo;     


Although this code is OK, it can be improved:


nInsertPos = InterlockedIncrement(&m_nInsertPos);
	m_buffer[nInsertPos%TRACE_BUFFER_SIZE] = traceInfo;

This code increments m_InsertPos unconditionally. m_nInsertPos is defined as long, so its maximum value is 0xffffffff. When m_InsertPos equals 0xffffffff and is being incremented, its new value will be 0. nInsertPos gets the value of m_InsertPos after the increment. The real insert position in the buffer is nInsertPos%TRACE_BUFFER_SIZE. For this to work, (0xffffffff+1)%TRACE_BUFFER_SIZE must be equal to 0. This improved implementation reduces the number of operations made inside a critical section from 3 to 1, and replaces the use of CRITICAL_SECTION with a faster synchronization function, InterlockedIncrement().

These improvements, although small, make a difference when the application runs under heavy load. The Retrieve() method implements a logic similar to Insert(): Wait for the buffer to contain at least one full entry, find the entry in the buffer to retrieve from, retrieve the trace, and finally increment the number of free entries in the buffer. The Retrieve() method returns the number of traces left in the buffer. The returned value is used by the tracer thread to decide when the thread can exit.

The trace buffer acts as a pipe between two parts:

The thread is started by calling the Run() method and finishes when either the application sends a trace with a severity level of abort or the application reaches an exit point. In the second case, the destructor of the trace thread object waits for the thread to finish retrieving all the traces from the buffer before exiting.

You can have only one instance of TracerThreadObj in your application. Creating more than one instance forces you to synchronize the operation of writing to the log file.

When using the light-trace utility, you should never use the following style to exit the application:

If (someError)
{
LightTrace(eTraceLevelWarn, "exit from app");
{

Because dumping traces is an asynchronous operation, most likely the thread that dumps the traces will not have a chance to retrieve and dump all traces left in the buffer before the application terminates. Therefore, the proper way to exit the application is:

If (someError)
{
LightTrace(eTraceLevelAbort, "exit from app");
{

Using the abort trace level verifies that all the traces in the buffer are flushed to the log file before aborting the application. What happens when you try to insert a trace to the buffer and the buffer is already full? This line (taken from CtraceBuffer::Insert()) waits for at least one free entry in the buffer:

WaitForSingleObject(m_hNumFree			EntriesSemaphore, dwTimeout);

The question is what timeout should be waiting for a free entry in the trace buffer? A timeout of 0 causes the tracer to drop the trace when there's no free entry in the buffer and the trace is lost. An infinite timeout waits until an entry is freed, making sure that the trace is logged to the log file. However, this can reduce the performance of the application, which is exactly the opposite from what you want to achieve. The solution is to set the timeout according to the level of the trace: For abort trace level (meaning you want to abort the application), use an infinite timeout. For information trace level, use a timeout of 0 (meaning that you can lose a few traces when your application is running under heavy load). For warning and critical trace levels, I use a timeout of 50 and 100 milliseconds, respectively, to minimize the possibility of losing important traces during heavy load. These timeouts are suggestions only. You can tune them according to your application needs. If you see that you lose more than a few traces, it might be a good idea to define a bigger buffer and/or raise the priority of the tracer thread.

Integration

To integrate the light trace into your application, you need to perform three steps:

The application is now ready to produce light traces.

Conclusion

Main.cpp (Listing 2) tests to ensure that the tracer improves performance. The test program's TestTrace() function creates 64 threads, each dumping 250 traces to a file. I call the TestTrace application once in a conventional mode and again in a light mode, then check the times.

In my tests, the average time for the conventional trace to complete its task was 140 milliseconds, and the time for the light trace was about 90 milliseconds. A more complete test might check for another aspect: Measure the amount of time each thread used to accomplish its task, then compare the times of the threads with worst performance in each group (light versus conventional). This is a test where the conventional trace may show a real weakness because of the fact that when using this form of tracing, only one thread can access the log file at any given time, while the light trace allows access from multiple threads to the trace buffer. This test is far from simulating a commercial application, so it is hard to get a fill of the benefits that you can get from the light-trace technique.