Debugging Heterogeneous Distributed Applications

Dr. Dobb's Journal November, 2005

A reliable logging system for recording software events

By Stephen B. Jenkins

Stephen is the senior programmer/analyst at the Aerodynamics Laboratory of the Institute for Aerospace Research, National Research Council of Canada. He can be contacted at http://www.erudil.com/.

Debugging applications is hard—especially complex, asynchronous, heterogeneous, distributed applications. Many years ago, shortly after I became a member of the Wind Tunnel Support Group at the Aerodynamics Laboratory, I inherited the responsibility for maintaining and upgrading just such an application. At that time, the computers used at the 2m×3m wind tunnel consisted of a QNX real-time system for data collection, a Silicon Graphics IRIX system for data visualization and file serving, an Aerotech multiaxis motion controller connected to a Windows NT-based PC for model attitude control, and several other PCs with various versions of Microsoft Windows. Most of the code running on the PCs was written in LabVIEW (a graphical programming language for creating "virtual instruments"), while the programs on the two UNIX-based machines were written in either C or Perl. Together, this rather Byzantine combination of technologies made up a single application that was the data acquisition, reduction, and visualization system used to perform aerodynamic experiments. Needless to say, it was a frustratingly difficult system to debug; a veritable "maze of twisty little passages." Worse, some types of errors would cause cascade failures that rippled through multiple processes running on several computers. Due to the asynchronous nature of those processes, it was often impossible to identify and understand the sequence of events that precipitated such a failure, making a trip to the dentist for a root canal seem preferable to a debugging session. What I needed was a way of collecting and recording information from all of the processes involved—in other words, an event-logging system for heterogeneous distributed applications. Since I didn't know of any existing software tool that could do the job, I decided to write one.

Design Decisions

My original design criteria were that the event-logging system had to be cross platform, support multiple programming languages, be accessible from any computer in the control room, be easy to use for both developers and end users, and be easily expandable. In addition, because of time and budgetary constraints, it had to be developed quickly and cheaply. All in all, a rather daunting list of requirements. To accommodate the OS, language, and expandability criteria, I decided on a simple client-server model using standard TCP/IP for the logging part of the system, with a separate program to view the events. This way, client routines and programs could be written in several different languages (independent of the server implementation language) making it easy for the other developers at the Laboratory to modify their code to report events to the log server.

The requirements to be low cost and cross platform, as well as my own professional dislike of being trapped by proprietary technologies, pointed to the use of open-source software wherever possible. Because speed of execution was not a significant issue —disk access times and network latency were expected to be the primary performance bottlenecks—but development time was, I decided to use Perl for the server. Perl was also the obvious choice for the event viewer program, since the requirements to be easily accessible and user-friendly suggested a web-based interface, an approach that would eventually become the standard at the Laboratory (see my article "A Web-Based Environment to Support Aerodynamic Testing," IEEE Aerospace and Electronic Systems Magazine, January 2004) and result in a collection of programs called the "Internet Software Tools for Aerospace Research Installations" (ISTARI).

Although I briefly considered using a database to store the event information, I quickly abandoned this idea. Instead, I decided on a simple, flat-file approach such as the type used by the Apache web server for its log files, with one line of text per entry. This seemed to be the best approach because, like web logs, I expected this to be a "write often, read rarely" situation. I opted to have a separate log file created for each week, after estimating as best I could the expected size and frequency of event messages. Unlike web servers, I elected not to rotate or delete log files after a specified time. Given the rather low cost of disk space, I felt that it was more important to retain the old log files as a permanent audit trail, should it ever be needed.

After running almost flawlessly for nearly five years, I decided it was time to move the system off of our, by then, antiquated SGI machine, onto a new Linux box. While I was at it, I couldn't resist the temptation to take the time to make a few improvements. In the original design, the viewer program always read the event messages from the log files, but after seeing how the system was being used, I realized that the most common query, by an order of magnitude, was to show all events from the past hour. To optimize for this request, I rewrote the server to cache the past hour's events in memory, and modified the viewer program to request this information directly, rather than read it from disk. Other improvements to the server included the creation of CGI programs to retrieve its status information and to dynamically modify its configuration. I also wrote an additional program to read the log files and extract a summary of the number of occurrences of certain types of events. Figure 1 is the block diagram of my new and improved event-logging system.

Description of an Event

In the course of discussions with the other software developers at the Laboratory, we settled on the following definition of an event message. Each would contain the names of the reporting computer, process, and subprocess, as well as a severity level and a string of text describing the event. For the severity level, I decided on a five-tiered approach: Notice, Warning, Error-Benign, Error-Fatal, and Error-Catastrophic. While the Notice and Warning levels are rather self explanatory, the three types of error levels are less so.

Benign errors are defined as errors from which the process can recover.
Fatal errors are those errors that force the individual process to terminate, but allow the other processes to continue.
Catastrophic errors are defined as those errors that require the entire wind tunnel application to be halted.

Although the content of the string describing the event can be whatever text the developer wishes to include, I have made a few recommendations for string content:

For security reasons, aerodynamic data is not to be placed in the log.
In production code, important warnings and all errors should contain a special tag that tells the server to automatically send a copy of the event to the Dynamic Data Display. This is a program I wrote to let the wind tunnel operations staff monitor the experiment from a web browser, and it includes real-time updates and audible alerts.
Fatal and catastrophic errors should contain another special tag followed by the e-mail address of that code's author. That way, when the error occurs, the server can forward a copy of the event to that developer.

The Event Log Server and Its Clients

As well as accepting standard event messages from its clients and acting on the two special tags mentioned in the previous section, I wrote the server to understand four simple commands: PASTHOUR, SYNC, STATUS, and CONFIG. The PASTHOUR command causes the server to send a copy of all of the events that occurred within the past hour to the requesting client. I included the SYNC command as a way for processes to force the server to write to the log file immediately (normally, the server buffers events to reduce disk traffic). The STATUS and CONFIG commands were added to provide a way of viewing and modifying some of the server's configuration parameters without requiring it to be shutdown and restarted. I created separate CGI programs for each of these last two commands so developers could view and change, via the Web, things such as the buffer length and disk-write timing (see my article "Making Servers Dynamically Configurable," The Perl Journal, January 2005).

I originally wrote event client subroutines in C and Perl; a LabVIEW client VI was later created by another developer here at the Laboratory. To modify a program to participate in the event-logging system, all that is needed is a single-line addition to the code. In Perl, after loading my event-system module called "Notabene.pm," that line can be as simple as:

eventgenerator('routinename',
$severitylevel, $message);

To allow manually generated events, I wrote a simple CGI client program that presents a web form containing a text box to enter a message, as well as dropdowns to select a severity level for, and a staff member to be notified of, the event. This lets users put comments such as "Heard a strange whistling noise coming from the model at this wind speed" into the log file.

The Event Log Viewer

Again, I decided to write a CGI program to let users easily retrieve and view events. This way, the log files could be interrogated from any computer with a browser and web access to the system. Figure 2 is the query form generated by the viewer program. Events can be selected based on the time of occurrence, the name of the reporting computer, and the severity level. I also provided a text box so that users can supply a regular expression to match against, allowing them to search the messages for specific strings. You'll note that the default settings display all of the events from the past hour, so users merely hit Enter after the page loads, if that is what's desired. The only other form element on the query page is a checkbox that enables the data-related events on the output page to be highlighted with a distinctive background color.

Figure 3 is the output of a small section of the log as presented by the event viewer program. I used icons and colors as visual queues to allow important events to be identified quickly, and sorted the events in reverse chronological order. Although not shown in Figure 3, the title attributes of the HTML image tags provide additional information (a string of the form "Event #11 of 16 Warning #1") when users place their cursor over any of the icons. The Download button creates a CSV file from the displayed portion of the event log and sends it to the user for further manipulation. This is especially useful to those who wish to format the output for inclusion in formal reports.

The Summary Program

The event summary program reads the log files and generates a web page containing the number of occurrences of certain types of events during the previous day, week, month, quarter, and year. Figure 4 shows the top portion of a typical summary. Other than the obvious decision to report the number of events for each of the five severity levels, I wrote the code to search for the number of wind tunnel runs and tares, the number of data points, the number of model movement commands, and the number of times the system was rebooted. I also decided to include a Top 10 list of event-generating computers, processes, and subprocesses. The summary program is run every night, via cron, and generates not only the daily web page, but also a master index page (in the form of a calendar) that has links to every one of the available daily summary pages.

Using It All

Everything I've described so far has been rather abstract, so you may be wondering, "How well does it work in the real world?" Although completely atypical, the output in Figure 3 is an accurate account of one hour of an actual wind tunnel test, captured many years ago while the logging system was still being beta tested. I chose to show this selection because it serves as a very concise example of how even limited event reporting can be used during a postmortem analysis.

At the bottom of the web page in Figure 3, you can see that from 16:01:05 to 16:01:26, the Dispatch program—written in C and running on the QNX system—reported that the model was moved and data taken twice. At 16:01:31, the operator pressed the END RUN button and eight seconds later, a Perl daemon on the UNIX system named "jabba" reported that it received a request for end-of-run plotting. Approximately nine minutes later, the LabVIEW-based model attitude control computer, m2-aerotech, generated an error saying that a hardware limit switch had been activated. Since this error had not been preceded by a motion request, the model must have been commanded to move using the manual override. Between 16:32:49 and 16:53:19 a plot was produced, and a configuration file, Run696.cfg, was modified twice. At 16:56:26, the LabVIEW system m2-wtdas-inf warned of a yaw angle of 1599.88 degrees. As that system was under development at the time, the event suggested the existence of a bug in the YawPitchToAeroTek.vi function. Testing resumed at 16:58:27 with a tare measurement and seven seconds later, a motion request that caused the attitude control system to report that the pitch-axis position exceeded a software limit at 16:58:45. Another tare measurement was begun at 16:58:49 followed by a motion request seven seconds later.

Over the course of one hour, six processes, written in three languages, running on four computers, using three dissimilar operating systems, generated just 16 events, and revealed three distinctly different types of problems. The first was an operator error: a manual override of the model position. The second was a software bug: an error in the subroutine responsible for calculating the yaw angle. The third kind of problem was an experiment setup mistake: the attitude control system was instructed to move the model to a point beyond the safe operating envelope.

A few weeks later, after many more event-generating calls had been added, I discovered a fourth type of problem in the wind tunnel data system: a complex interprocess timing error. Although space does not permit me to show the individual event entries that led me to find this bug, I will describe what happened in a few sentences. While looking through a large selection of events, I noticed a recurring, and seemingly unnecessary, five-second delay between the time the attitude control system had finished moving the model, and the time the data system began its preacquisition process. As this interval was hidden within the time required to acquire and reduce a data point, typically a 45-50 second procedure, it was not readily noticeable. After speaking with the developers responsible for each of the programs involved, we discovered that there was a communication glitch between two processes that caused a five-second TCP/IP timeout to occur before the connection was automatically retried. Although this timeout should have been trapped and a warning generated, this had not been done. After a few small changes, the 5-second delay had been eradicated resulting in a 10-percent decrease in the time required to capture aerodynamic data. Finding this single bug alone has resulted in a savings of many times the cost of developing the event-logging system.

Conclusion

The only significant problem I've had with the system occurred about two years after it was first deployed. A developer had placed an event-generating function call within a section of code that would occasionally go into a loop, generating hundreds of nearly identical events per second (sometimes for several hours before being noticed). Dubbed the "babbling idiot" problem by the developer responsible, we solved it by him ensuring that the event text was identical each time through the loop, and by me adding a little code to the server—a filter to discard consecutive, identical events that occur more than once per second.

Other than this single difficulty, the logging system has been quietly and reliably creating a permanent record of software events at an average rate of about 150,000 per year, for over five years now. Although it's not unusual to have many weeks pass without ever making use of the event viewer, it is always available if needed. I like to think of it as the Laboratory's software fire extinguisher: It's there to be used whenever programs, people, or procedures cause things to crash and burn.

Acknowledgments

Thanks to Steven Totolo for implementing the LabVIEW client software.

DDJ