Debugging Real-Time Production Software

C/C++ Users Journal June, 2004

Quick for debugging in unfriendly environments

By Bill Pyritz

Bill Pyritz has worked for Lucent Technologies (formerly AT&T Bell Labs) for over 17 years designing and developing real-time telecommunication systems. He can be contacted at pyritzlucent.com.

Debugging real-time software is difficult. Many times, a problem occurs only under special circumstances or under a particular system load. Debugging these types of problems typically requires data to be examined under certain conditions; for instance, 99 percent of the time it works, but I need to look at the value of x and if it's greater than 54, I want to dump the abc structure and take a look at it. Most importantly, you want to do this in a running production system without stopping and/or disrupting the application.

In this article, I present a solution to this problem. This approach is currently being used to add essential field debugging capabilities to a real-time 3G wireless application. In the process, I introduce a debug thread responsible for coordinating the setting, enabling, disabling, and removing of breakpoints in an application process. The breakpoints themselves are written in assembly language and assembled on the production machine. This implies that all the debugging can be done on the target machine, eliminating the need for a special library to be created on a development machine, shipped to the target machine, and installed. This facilitates quick turnaround times for debugging problems, especially in unfriendly environments. The breakpoints themselves cause minimal disruption to the running system and, therefore, can be enabled/disabled while the application is running under load. Although I use a Solaris SPARC target for the example implementation, the general principles can be applied to other architectures as well.

Background

There are many tools available for debugging. Many offer sophisticated GUIs and advanced capabilities for walking through code and setting breakpoints directly from source listings. These tools are wonderful and make debugging much easier, but aren't without problems when using most of these tools with real-time applications:

They disrupt the normal flow of an application by typically requiring the stopping of the process when a breakpoint is hit. This is fine if you are walking through the code on your desktop step by step, but when you are trying to debug a problem that only occurs under a system load, these tools break down. Critical timing and sequencing is disrupted when the process is stopped by breakpoint(s).
It is not practical to assume that these tools will be available for use on production machines. Typically, you use these tools either on the desktop or in a controlled lab environment. Production problems are usually debugged using one or a combination of the following methods: log file analysis, reproducing the problem in a controlled environment, and delivering special software to the customer site to gather more data. Correctly designed, log file data can be useful in debugging problems. However, it is far from comprehensive and represents a best guess by developers of the information that will be needed to debug an almost infinite number of possible error scenarios. Trying to reproduce problems discovered in production is, in many cases, a futile effort influenced by environmental conditions, system load characteristics, and the phase of the moon.
Shipping software patches to the customer site solely for debugging purposes is inefficient and not well received by most customers.

A better solution is one that allows the setting of low overhead, self-contained breakpoints on running target systems. Most debuggers require the process to be stopped, which is unacceptable for real-time applications. With the approach I present here, the process is not stopped and the overhead is simply whatever you need to do in the breakpoint—you can think of it like inserting code into your application dynamically. In short, latency is minimal. Other benefits include the ability to gather specific data on the target machine where and when the problem occurs without needing to ship special software. Additionally, temporary fixes could be installed using breakpoints in the interim of time between finding the fix and installing the official software update. What follows is a description of one possible means of realizing this type of debugging capability.

Debug Thread

The solution I propose calls for adding a debug thread to each application process. Figure 1 presents a high-level view of this. The debug thread receives commands from the user-debugging process (most likely a command-line program) and carries them out within the context of the application process. The following commands are supported:

Get Symbol Address. Given a process ID and symbol name, the address of the symbol within the process's virtual address space will be returned. Many times it will be necessary for the breakpoint assembly code to call an application function. To do this, it must determine the function's address using this facility.
Set Breakpoint. Given a process ID, breakpoint index, function name, and offset within the function, a breakpoint will be set in the disabled state. The code for the breakpoint is implicitly assumed to be in an object file named brkpt<breakpoint index>.o (brkpt0.o, for example), located in a predefined breakpoint directory (defined by an environment variable, registry, and the like). The debug thread loads the breakpoint object file into the process address space at this point. The breakpoint index is a user-specified value within a range (for example, 0 to 49) that allows multiple breakpoints to be defined and administered within the system.
Enable Breakpoint. Given a process ID and a previously set breakpoint index, the breakpoint is enabled. Code associated with the breakpoint executes within the process from this point forward. The process is not stopped when the breakpoint is executed; it is essentially a function call within the thread of execution.
Disable Breakpoint. Given a process ID and a previously set breakpoint index, the breakpoint is disabled. The application executes normally as if the breakpoint never existed.

When enabled, the debug thread overwrites the instruction at the breakpoint location with a jump instruction to the breakpoint code. The breakpoint code is responsible for duplicating the overwritten instruction(s), thereby facilitating an enter/exit option at the discretion of the user.

The debug thread lies dormant within the process until a command is issued. Commands are sent from users to the debug thread via UDP messages. A database (which could simply be shared memory, the registry, and so on) maintains a mapping from the application process ID to the UDP port. The user process can be designed as simple or complex as necessary, ranging from a message-passing routine to a system-wide breakpoint administrator. In the administrative role, it may coordinate breakpoints within a complex of target systems.

Breakpoints are defined in assembly language and assembled on target machines. This is clearly for expert users only. (Later, I'll discuss some ways to make these tools available to a wider group of users.) Users must disassemble the application code to determine exactly where to set the breakpoint. Most good compilers offer an option that displays the assembly code interleaved with the source code, making it easier to find and understand correct breakpoint locations. With a little practice, you will find it almost pleasurable to engage the program at this level of detail.

A Solaris SPARC Implementation

There are three main components to the Solaris SPARC implementation of the approach I've just described:

Shared memory for breakpoint administration.
UDP-based communication mechanism used for communicating with the debug thread.
Debug thread message handling and operation.

The entire prototype code for this implementation is available for download at http://www.cuj.com/code/.

Figure 2 shows the shared memory configuration. As you can see, it is organized into a set of process and breakpoint blocks. This separation allows the same breakpoint to be applied to multiple processes. Each debug thread is assigned a process block and a UDP port for communication. The breakpoint is defined by the function name and offset and has two states: enabled or disabled (see the SharedBreakPointDataInterface in the code package for details).

Figure 3 is a high-level view of the UDP-based communication mechanism. The debug thread sits on a UDP port and waits for commands. The command-line process acquires the correct UDP port of the debug thread by examining the shared memory block. Commands are sent to the debug thread using the UDPClient::write() method. The client then switches roles and becomes a server waiting for a response from the debug thread. The debug thread processes the command and sends a reply to the user process. Using UDP has many advantages including its simplicity and the ability to construct a network of target systems that can be debugged from a central location without requiring nailed-up connections.

Consider this application:

void func1(int y)
{
    if( y != 3 ) 
        printf("y = %d\n", y);
}
int main(int argc, char** argv)
{
    while(1)
    {
        func1(3);
        sleep(2);
    }
}

Now take a look at the disassembly of function main:

main()
   save   %sp, -104, %sp
   st     %i1, [%fp + 72]
   st     %i0, [%fp + 68]
   mov    3, %l0
   call   func1
   or     %l0, %g0, %o0
   ...

Suppose you want to set a breakpoint in the main while loop that modifies the argument passed to func1 such that the printf() executes and outputs the value 75. The following steps are taken:

Determine that you want the breakpoint placed at the call func1 instruction of main. From the disassembly, the function offset is determined to be 16 (that is, main + 16 bytes, where each instruction is 4 bytes). I use breakpoint index 0 since this is the first breakpoint I am setting on the target.
To call func1 from within your breakpoint code, you need to determine its virtual address within the application by sending a Get Symbol Address command to the debug thread. Its address is 0x11700.
Create the file brkpt0.s that contains the breakpoint assembly code:

	jmptobrkpt0:
	    ! advance the register window
	    save    %sp,-112,%sp
	    ! calling func1() addr 0x11700
	    sethi   %hi(0x11700), %o0
	    or      %o0, 0x300, %o0  
	    jmpl    %o0, %o7 
	    ! set the single argument to '75'
	    mov     75,%o0 
	    ret
	    restore

It turns out that the o or output registers are used to carry function call arguments. Further, the jmpl instruction always executes the instruction following it before executing the jump. You can take advantage of these details by executing a mov instruction after the jmpl that effectively changes the integer parameter passed to func1 to 75.
Next, assemble the code on the target and move the resultant object file into the predefined breakpoint directory. In this case, the object file is named brkpt0.o.
You are now ready to set the breakpoint and issue the Set Breakpoint command to the debug thread. In addition to the process ID, I specify breakpoint index 0, function main, and offset 16. The debug thread loads the object file brkpt0.o into the application's address space and calculates and saves in shared memory the address within main where the breakpoint is to be set. You can update the assembly code at any time and reload it into the application. That is, once the breakpoint is set, it isn't locked in the application forever. You can instruct the debug thread to set the same breakpoint as many times as necessary. Each time, the debug thread closes the existing object file and loads the updated version.
Finally, enable the breakpoint by issuing an Enable Breakpoint command to the debug thread. The debug thread overwrites the address in main with a call instruction to the breakpoint code. From this point forward, when the code in main is executed, the breakpoint executes and the value of the argument to func1 is 75 rather than 3 and the printf() executes.

Admittedly, this example is simple and not very practical, but demonstrates the potential of the mechanism. Debugging at the assembly-code level is powerful. When you couple that with a supporting infrastructure as described here, solving problems in real-time environments can be greatly simplified.

Navigating the Code Package

I wrote the code accompanying this article using a test-first approach. Therefore, the best way to understand the code is to start with the tests, which can be found in the file GenericUtilitiesTest.C. You will find acceptance and unit-level tests defined for all the major functionality. Start at the end of the file where main() is defined and work your way through the tests. Hopefully, this gives you a good basis for navigating through the package and understanding the implementation.

The code is organized to take advantage of some code-generation techniques. You will find that the major functional components are defined in <name>Impl.h files. A code generator is run against these files and produces a <name>.h and <name>.C file for each, respectively. These files contain the public interfaces only and provide a loosely coupled interface to the implementation code using redirection. I wrote the code generator (gen.rb in the package) using the Ruby language (http://www.rubycentral.com/).

Future Considerations

What I have been describing is a raw interface that requires expert knowledge of the underlying target architecture. It would be desirable to build an additional layer of abstraction on the described infrastructure that enables a larger group of users to utilize the capabilities. One of the ways to achieve this would be to define a domain-specific language that abstracts the details of the underlying architecture. Users could then specify breakpoints using symbolic names rather than addresses and registers. The aforementioned example breakpoint might be described as follows:

whenenter main line 5
{
    call func1(75)
    goto line 6 // skip func1(3) call
}

This code would be compiled into assembly code similar to what I wrote in this example.

As a start, I have built a simple debugging environment that lets users specify breakpoints and send commands to the debug thread using a simple language that I created. I developed a Ruby script to provide the user interface and to perform all the parsing (Ruby has powerful parsing capabilities). Ruby provides a nice interface to external C/C++ code, which lets me directly access all the C++ interface classes that I developed for communication with the debug thread. The language I developed still requires the user to understand architecture details such as registers and addresses, but would automatically generate the assembly code and produce the object files. The goal is to eventually let users specify breakpoints using symbols and line numbers. You can take a look at the grammar in the code package (the file is named "grammar").

Most of the ideas presented here are still evolving. Take a look at the implementation package at http://www.cuj.com/code and feel free to send me any comments, suggestions, or questions that you might have.