March 2002/A Technique for Efficient Logging and Monitoring of Real-Time Systems

Features

A Technique for Efficient Logging and Monitoring of Real-Time Systems

Bill Pyritz

An ingenious approach to tracing execution using ideas from Aspect-Oriented Programming that leaves your code as clean as can be.

Introduction

Current methods and tools for analyzing real-time systems often do not provide adequate access and fail to produce the essential data efficiently. Real-time systems are extremely difficult to debug even under the friendliest conditions. Since we have yet to design systems that can tell us the exact nature of the problems they are experiencing, real-time applications must be extreme extroverts. By building extroverted systems that openly reveal their inner problems, you can save time, money, and loads of frustration.

This article presents a detailed description of an innovative yet simple technique for profiling and monitoring software states and data in a real-time system using C++. The technique uses code generation, generative and aspect-oriented programming, and simple application-level constructs to “open up” the code for offline analysis. Many systems today are difficult to debug in production because the data for analysis is lacking or nonexistent. The approach described in this article will address this problem and propose a solution.

Background

Current real-time systems are typically developed in a closed environment. That is, the system is built with limited access to its internal functional components. This difficult reality is exacerbated by the lack of integration of the system components themselves. Detailed knowledge of software states and other data is typically not available in a “natural” way. This mode of developing systems is the de-facto industry standard.

Many extraordinary tools for producing and debugging software are available. These tools offer integration, GUI interfaces, and many other features that make a developer’s job much easier. However, most of the better tools are expensive, require a steep learning curve, lock you into a specific development approach, and make it difficult to change directions or use a different tool in the future. In addition, most tools have limitations and constraints that reduce their effectiveness. I have yet to come across a tool that provides the same access and features on a production machine running at a customer site as those I can get in the lab under controlled conditions. It’s been my experience that most of the “easy” problems in a given system are discovered and resolved before the product is shipped. The most difficult, insidious, and perplexing problems turn up in a production setting, where all the fancy tools are useless. A better approach is to use simple debugging “tools” and methods that are available throughout the system’s lifetime and are independent of the current stage of system development. The approach described in this article does not restrict you from using the fancy tools. In this system, the fancy tools become an enrichment to certain stages rather than an end-to-end solution.

Logging

Most system designers view logging as a means of reporting errors. Developers who are debugging software enable trace statements that are subsequently disabled in production code due to their excessive overhead. (I will return to this topic shortly.) In most cases, the system logs are the only historical artifacts of a production system that debugging personnel have at their disposal for problem analysis. Issues such as race conditions, timing, and sequencing contribute to an environment where debugging is very difficult. Worse yet, log files are usually relegated to a small portion of the disk and have the terrible tendency to roll over just when you need the earlier data. Lots of data is typically generated around the time of a failure, making file rollover even more likely. Why bother having a system log if it will not contain the critical records that are needed for complete analysis?

Some of the most insidious real-time problems are side effects whereby a component leaves a resource in an invalid state due to a software bug. The resource’s next user reports the error with no clue as to the real source of the problem. Great lengths are taken to resolve these types of problems, which invariably lead to high costs for all involved.

In my solution, logging is not just the reporting of errors but an integral part of the system. Log records are continuously collected even in the production system. The benefits of this approach include the relative ease of problem isolation and resolution compared with more typical approaches. I refer to this approach as “inside/out,” where the system reveals itself via records in the log store.

You can think of this type of logging as similar to a “black box” in a commercial airliner, but it is much more. It collects all the essential data necessary to reconstruct the events leading to a disaster. By using simple embedded constructs in the application code, developers can insert log records in the log store (the black box) that are useful for debugging and reconstructing the events leading to a failure. Unlike current logging methods, developers will not use printf-like statements to create the log records. Rather, this method uses simple unobtrusive constructs in the application code. The records are efficiently collected in the log store, which is a simple database with a well-defined interface for adding and accessing the individual records. A command-line interface is available for accessing the records in whole or based on user-specified criteria (i.e., a filter).

This solution would not be possible without an efficient means of logging the potentially large amount of data generated by the system. It is also necessary for system designers to accept upfront the additional CPU and disk-space overhead required to collect the necessary data. Consider this more of a psychological than technical distinction. Instead of thinking of it as overhead, think of it as normal system operation. Furthermore, the data can be used for other purposes besides debugging (e.g., performance analysis, automated script generation, traffic analysis, etc.).

There are two major approaches to system log design:

Pre-evaluation requires that the decision to create a log record and its final contents/form is decided at the time of occurrence.

Lazy evaluation delays the decision of whether or not the record will be evaluated until the record is actually needed (e.g., when a user specifically requests to view the record). The essential contents of the record are saved in the log store, but the actual evaluation/formatting of the record is delayed until it is requested for viewing. Lazy evaluation allows efficient use of both processing time and storage space [1].

By utilizing lazy evaluation and a binary log store, you can minimize system overhead while at the same time generating comprehensive log data needed for system analysis and debugging. Further, by adding a few embedded language constructs, you can simplify the application developer’s job in specifying the contents of log records.

Implementation

I will take you through an implementation of a binary log store and its application code interfaces. The entire code is available for download at <www.cuj.com/code>. The four major parts of the system are:

The application code interface

Error handling with AOP (Aspect Oriented Programming)

The binary log store

The user interface to the binary log store

I use a simple telephone call scenario as the basis for the following discussion and examples. The example centers on a SimpleCall class that is implemented as a state machine (see the code package at <www.cuj.com/code> for details). As the stages of the call proceed, various logging activities will take place that demonstrate the ideas discussed above.

The Application Code Interface

The application code interface is simple, flexible, and highly efficient. It uses embedded constructs to specify variables or expressions of interest at particular places in the code [2]. You may be used to seeing code similar to Listing 1, which is a typical example of how most systems log and handle errors. Note that the application developer is mired in logging and error-handling details. Further, the code is clearly more complex to maintain in several dimensions. First, the formatted strings are full of implementation details, such as class names and attributes, making it necessary to remember to update these strings when data is added, removed, or modified. Second, the essence of the program is getting overwhelmed with the logging and error handling making it difficult to understand and maintain. Using the alternative suggested in this article, the same code now looks like Listing 2.

Believe it or not, Listing 2 produces as much if not more information than Listing 1. However, it does it an order of magnitude faster and with much less storage space. Where did all that formatting go? Where did the error handling go? There are no smoke and mirrors here.

The construct WATCH@ produces a log record that is inserted into the binary log store behind the scenes.

A critical part of the application interface is a code generator, written in the Ruby language, which acts as a preprocessor (available at <www.cuj.com/code>). The code generator processes the application code and replaces the embedded constructs with logging-specific code. As a result, the embedded constructs are simple and expressive, while at the same time unobtrusive. For more information on the Ruby language, visit <www.rubycentral.com>.

But that’s not all. Ideally, I would like to allow the application programmer to develop an application with little or no concern for unrelated but distinct aspects of the system, such as logging or error handling. The more these unrelated aspects of the system become integrated in the application code, the more likely it is that errors will be introduced, and perhaps more importantly, the intention of the code becomes muddled making it difficult to maintain and understand. Note that Listing 1 is dominated by tracing and error-handling code as opposed to Listing 2, which is mostly application code.

My solution utilizes the ideas introduced by AOP. By extending the programming language, AOP makes it possible for developers to segregate certain aspects that cut across the system (e.g., logging). The application is written as if these aspects were not a concern, resulting in much cleaner and expressive code. As a separate task, the aspects are specified using the language extensions.

AOP works well with my problem since logging and error handling are horizontal aspects of the system (i.e., logging is not application specific; all applications need it). Note that I consider error handling a related part of the application code interface since logging is a critical dimension of error handling (i.e., you keep logs to help debug errors). Consider Listing 2. Where did the error handling code go? There’s nothing magical; it was simply defined elsewhere as an aspect of the system. The code generator mentioned earlier is also responsible for generating the error-handling code that is specified in an “aspects” file that I will discuss in detail below.

An implementation of AOP, AspectJ, has been developed for Java (see <www.aspectj.org> for more information). No such extensions exist for C++ to date. However, it is possible to use simple preprocessing techniques to simulate AOP. My solution uses this approach and, as mentioned, the code generator is responsible for this task in concert with an aspect file that is specified by the developer.

As mentioned earlier, I will use a simple telephone call scenario to illustrate the ideas discussed above. The scenario centers on a class called SimpleCall, which is implemented as a state machine (available at <www.cuj.com/code>). The scenario traces the telephone call through various states from origination to termination. Listing 3 presents the main program (a complete listing is available at <www.cuj.com/code>).

The application programmer uses the embedded WATCH@ construct to specify variables or expressions of interest at particular places in the code. For example, when the call completes, the RunCall::run method is interested in “watching” the SimpleCall object named call. The result is a binary log record being generated that includes the contents of this object. Figure 1 shows the actual formatted record [3].

The ERROR@ construct shown in Listing 3 is used to specify error conditions [4]. In the example, if RunCall::run throws a CallStateException exception, a binary log record would be added to the log store with a criticality of MAJOR with the unformatted string specified in the exception as the data. Both constructs produce a log record that is inserted into the binary log store behind the scenes.

After the code generator is run against the source in Listing 3, the code shown in Listing 4 is produced.

Note that the generated code uses no embedded string representations of the variables and requires no formatting (lazy evaluation). Clearly the application programmer is relieved of the burdensome and error-prone responsibility of specifying the particular output record format and can concentrate on what is important in the context of the application. Simplicity and expressiveness goals can be checked off.

As mentioned, the WATCH@ construct is similar to applying a trace statement in the code. As with trace statements, the WATCH@ statements can be turned on and off at compile or run time. The simplest way to turn off the WATCH@ statements is to not generate the associated code (i.e., don’t run the code generator). Note that more sophisticated control can be automatically provided for the WATCH@ statements since the programmer is not directly specifying the form of the record or its final content (i.e., a separate mechanism perhaps implemented in shared memory could be implemented that allowed specific WATCH@ statements to be enabled/disabled via a command-line interface).

There is a two-step process for generating a log record:

Create a LogRecord object.

Insert associated data into the LogRecord object and store it.

We’ll examine each of these steps in turn starting with creating a log record.

Creating a LogRecord Object

The LogRecord class has the following interface:
class LogRecord
{
public:
  LogRecord( Criticality  criticality,
             unsigned int  type,
             unsigned int  tag,
             unsigned int  component,
             unsigned int  file,
             unsigned int  line);
        ....
};
You may be asking why a majority of the parameters to the constructor are of type integer and how those integer numbers could possibly translate to meaningful data in a log record. The code generator is also responsible for generating a file called LogRecordSpec.c (available at <www.cuj.com/code>). This file is compiled in a library available to the code that formats the records and contains an array of strings [5]. The code generator uses indexes from the string array to indirectly identify string values associated with the LogRecord parameters. To better understand this, let’s examine the formatted log record for the @WATCH@ call statement from Listing 3 and see how it translates to the associated generated code in Listing 4.

Generated code:
// Done with call
LogRecord lrecWATCH1( criticality = INFO,
                      type = 1, tag = 2,
                      component = 53,
                      file = 16,
                      line = __LINE__);
Generic log record:
<criticality> <type> <tag> <date> <time>
ProcessId: <process id>
File: <file>
Line: <line>
Component: <component> <WATCH|ERROR>
<watch|error index>
    <component data>
Formatted log record from example:
INFO simple WATCH 05/05/01 14:50:41 11
ProcessId: 9613
File: /home/aspects/SimpleCallImpl.h
Line: 93
Component:
SimpleCallImpl::SimpleCallImpl WATCH 1
        m_callid (int): 1
        m_callingNumber (string): 
          630-979-2000
        m_calledNumber (string): 
          630-979-2001
        *(m_state.get()) (CallState):
                getStringStateName() 
(string): Initializing
As you can see, the integer values in the binary record translate to string values in the formatted log record. For example, the type field of the log record above is the string simple. However, the binary log record stores 1 for this field. When the user formats the record for viewing, 1 is translated to the string simple by indexing the generated string array found in the generated file LogRecordSpec.c. Why would you go to such lengths? To save space and time. Why store potentially long strings in the log store for each record when a simple integer value will do? Further, the number of bytes to copy is reduced significantly, saving time.

Each LogRecord object has an associated buffer where all its data is stored and manipulated. The buffer is assigned an initial size that can grow if required. When the record is inserted in the log store, only valid data is copied. For example, if a log record has a buffer size of 4KB and its data is 200 bytes, only the 200 bytes will be copied to the log store for that record minimizing the overall size of the binary log store. After the record is successfully inserted in the log store, the LogRecord object is destroyed. In fact, as you can see from the generated code example in Listing 4, the LogRecord object is an automatic variable on the stack.

Inserting Associated Data into LogRecord and Storing It

The lrecInsert method is responsible for inserting the data portion of the log record into the record in binary form. It uses a name/value pair to store the data. Once again, the name is represented by an integer value that is later translated to a string via the generated string array in LogRecordSpec.c. The generic form of the method is as follows:
template <class TYPE>
void lrecInsert( LogRecord& lrec,
                 unsigned int name,
                 const TYPE& value );
The basic types of integer, bool, string, etc., use specialized lrecInsert templates that copy the variable data into the LogRecord object’s buffer based on the size of the value data type (e.g., sizeof(int)). Other types that are not basic (e.g., the watch on the class SimpleCall from Listing 3) recursively call lrecInsert for each of their data members until a basic type is reached. Notice how the formatted log record from the example above prints out the SimpleCall object. The code generator generates all lrecInsert methods for the non-basic types.

Finally, the LogRecord object is inserted into the binary log store via an interface provided by the LogStore Singleton [6] class described below.

Error Handling with AOP

As mentioned earlier, logging and error handling go hand in hand. When errors occur, a log record must be generated to record the failure event. Error handling, like logging, is a horizontal aspect of the system (i.e., all applications require error handling). By utilizing the application interface described so far in addition to ideas taken from AOP, you can make your system much simpler and easier to maintain.

Recall the discussion regarding Listings 1 and 2. Listing 2 is much cleaner than Listing 1 because it allows the programmer to concentrate on the application without being distracted with logging and error-handling concerns. This section focuses on how to achieve this.

An additional responsibility of the code generator is to examine an “errors” file where the error-handling aspects are defined. An example of an entry in the error file associated with Listing 2 might be:
MyClass::my.?()
SomethingWentWrongException
{
    // ERROR@ MAJOR sc e.string()
    // other recovery code
}
};
Several interesting things are going on here. First, instead of specifying MyClass::myFunc explicitly, a wildcard is used: my.? [7]. The code generator recognizes the wildcard characters .? and will match any member function of MyClass that starts with my and insert the associated error-handling block. This is the power of AOP. One specification can be used in multiple places, even across classes (e.g., you could specify .?::.? to apply the specification to all class methods in your application). Second, the ERROR@ construct is used rather than the explicit log code as in Listing 1. Clearly this code is much cleaner and easier to maintain. It also displays the ability of the code generator to recognize the embedded application constructs (e.g., ERROR@) within an aspect specification.

The generated code for the AOP specification above is shown in Listing 5.

Binary Log Store

The log store is the repository/database where the log records are stored in binary format. The log store can be as simple as an in-core chunk of heap memory (as implemented in the code package) or as complex as an actual database [8]. In all cases, the application interface and the concepts are the same.

The log store is implemented as a Singleton class [6], and its interface is shown in Listing 6.

The LogStore class is also an abstract base class that must be specialized [9]. As mentioned earlier, many different types of log stores could be implemented. In this article, I use a very simple implementation that uses heap memory. Although not very practical, its simplicity allows us to focus on the important ideas without getting mired in implementation details. The specialized class is named LogStoreMemory.

To insert a LogRecord object into the log store, the LogStore::store interface is used. Listing 7 shows part of the LogStoreMemory class [10].

In this implementation, a simple character buffer is used as the log store. Log records are inserted sequentially into the buffer. As you can see, the store method works with the LogRecord object to insert the record. The LogRecord object is passed a character buffer that it uses to copy its data. In this way, the LogStore object does not need to know about the exact format of LogRecord or even how it’s stored, minimizing dependencies. The size of the record is saved followed by its data in whatever format the LogRecord class decides. Other implementations of LogStore would work similarly. For example, a memory-mapped file implementation would pass the next address in the mapped file as the buffer to the LogRecord::store method [11].

The other interfaces the LogStore class provides are the output and outputAll methods. Listing 7 only shows the outputAll implementation. This method is used to select log records for formatting and insertion into the specified output stream. As you can see from Listing 7, outputting a record is simply the inverse of the insertion. Once again, the LogRecord object is responsible for extracting its data from the log store memory buffer. In addition, the LogRecord object is responsible for formatting its data (the print method). The good news is that the code generator generates all of the formatting code for us (see the code package for details).

The output method allows the user to specify a Filter object that is used by LogStore as a criterion for the selection of records. For example, the user may only be interested in records that were inserted between a certain start and stop time. The Filter object would reject all records falling outside that timeframe. Other types of Filter objects could be specified as well (e.g., display only ERROR entries, display ERROR entries with a criticality of MAJOR, display all WATCH entries with the simple type, etc.). The Filter object can be as simple or complex as required. The key point is that the output interface using the Filter object provides an optimized means of accessing the binary log store to get the information required quickly and efficiently.

User Interface to the Log Store

As mentioned earlier, the log store can be thought of like the black box from a commercial airliner. It contains the information needed to reconstruct events leading up to a disaster. The application code is responsible for inserting enough information into the log store, via records, such that problems can be debugged and/or system trends can be analyzed. Up to now I have been discussing the application interface in some detail. It is time to turn to the user interface and look at how this massive amount of data can be accessed for debugging and analysis purposes.

As far as I am concerned, I should have as much if not more data available in a production environment as compared with a lab or development environment. Therefore, I will make no distinction between where or how I would use the log store data in the discussion that follows. This allows me to assume that I have the necessary data to debug a problem wherever it may occur and that the same techniques can be used. Further, the log store is designed to be “transplantable.” That is, it can be bundled and moved to another system for analysis (similar to the black box).

A command-line interface is provided for the user to access the log store [12]. It takes the following form:
logcat [-r <records> | -s <start time> [-e <end time>] | -A] 
           [-f] [(] [-c <criticality>,...] [-t <tag>,...] [-text]
        [-p <processed>,...] [-T <type>,...] [-o | -a] [...] [)]
where:
-r indicates the number of records to be output (e.g., logcat 
   -r 100) 
-s/-e indicate start and stop time intervals (e.g., logcat 
   -s 12:00:00 -e 12:00:05) 
-A output all records in the log store 
-f tail the output (i.e., continuously output records in real
   time) 
-t output records that match <tag> 
-text output records that match <text>: (similar to 'grep') 
-p output records that match <process ID> 
-T output records that match <type> 
-o/-a allows logical groupings of the search parameters
As you can see, there are many options available. The options can be combined with each other to form a complex search of the log store.

The logcat tool simply uses the LogStore output interface to search for the specified records. Filter objects are created to represent the user-specified search parameters.

Figure 2 shows the output that would be generated from the command logcat -A (display all records) if it where run against the log store generated by the application shown in Listing 3.

In the example from Figure 2, the binary log store required 794 bytes of storage space as opposed to the 2,676 bytes required to store the ASCII text. That is a 70-percent savings in storage space. Clearly you can store more data using this approach.

Note from the example that a call identifier (m_callobj->getCallId) is output with each log record. A unique call identifier is assigned to each call. Imagine thousands of calls occurring simultaneously and each creating similar log records. It would be very difficult to extract the log records associated with a particular call without some sort of key for searching. In this case, the call identifier can be used to obtain the data associated with a particular call in an efficient manner. Only the records of interest are formatted for user consumption (lazy evaluation).

As noted above, the binary log record contains no formatted strings. This greatly reduces the CPU overhead required to process a log record. Depending on the log store implementation, creating and storing a log record amounts to little more than a memory copy of bytes from one location to another [13].

Summary

This article presents a simple but flexible logging and monitoring technique for real-time systems. The technique utilizes code generation and embedded application code constructs to achieve verbose log record generation by the system with low overhead. Utilizing these techniques, you can have a system with comprehensive black box data available that can be used to debug and analyze your most difficult problems whether in the lab or in production.

Many of the ideas presented here are still evolving. One area that I didn’t explicitly mention is the embedding of an API in the application code that will allow a test program to communicate and in some cases control the application program. I have successfully implemented such a scheme and was very pleased with the outcome. I hope that the ideas presented here provide a seed that can help you advance your work in some way. You can take a look at the entire package of code that I used in the example at <www.cuj.com/code>.

Notes

[1] Believe it or not, formatting a log record (e.g., printf) is typically the most costly processing overhead associated with logging. Pre-evaluation uses such an approach; lazy evaluation does not.

[2] This might be where a TRACE statement would otherwise have been specified with a printf-like construct.

[3] The log record string displayed in Figure 1 is not what is stored in the log store. It is the formatted log record that has been produced from the binary record stored in the log store.

[4] A criticality indicator (e.g., INFO, MINOR, MAJOR, CRITICAL) is specified as well. The system uses criticality to help identify how serious the problem being logged might be. Some systems use this information to set alarms and/or escalate recovery of system resources.

[5] The application itself never needs to access or reference this string array. It is only needed to format the log records from their binary form to a human-readable form. This further reduces the executable size as unnecessary strings do not need to be linked with the application.

[6] A Singleton class ensures that it has only one instance and provides a global point of access to it. See Erich Gamma, et al. Design Patterns (Addison-Wesley, 1995).

[7] Clearly this simple example does not require a wildcard specification. I added it to display the power of AOP.

[8] The author has implemented binary log stores using shared memory and memory-mapped files on both Unix and Windows NT platforms. Unless you have a screamingly fast database, it is unlikely to be a suitable option given the need for speed.

[9] The GetInstance static method actually uses a LogStoreFactory (see the code package) object to create the specialized LogStore object.

[10] Error handling and other member functions and attributes are not shown for clarity.

[11] In more complex implementations of the LogStore class, you might include a cross-reference table that allows for quick record lookup.

[12] A GUI interface could also be designed that simply uses the same interface to the log store as the command-line interface.

[13] See the LogStoreMemoryImpl class available at <www.cuj.com/code>.

Bill Pyritz has worked for Lucent Technologies (formerly AT&T Bell Laboratories) for over 15 years architecting and developing real-time telecommunication systems. Feel free to contact him at pyritz@lucent.com.