C/C++ Users Journal April 2004

Lies, Misdirection, and Real-Time Measurements

Sometimes you have to read between the lines

By Cort Dougan and Zwane Mwaikambo

It's easy to get best-case numbers or good real-time performance when the system is idle and just waiting for interrupts. However, getting accurate worst-case performance numbers for real-time operating systems is difficult. If you need to know the worst-case performance of your system when operating under expected conditions, then it's important to look closely at what is being measured. In this article, we'll describe some common real-time performance measurements, how they can be misleading, and how you can recognize them. We'll then detail how to obtain and interpret useful real-time performance numbers on your own by describing how to measure the performance of RTLinux, developed by FSMLabs (the company we work for). In short, this article is intended to serve as a guide when evaluating real-time operating systems (RTOS) and the claims made by their vendors.

There are several definitions and uses of the phrases "real-time operating system" and "real-time performance." As used in this article, a hard real-time system guarantees an upper bound on the delay between an event and the application's response; missing that bound is a failure. A soft real-time system aims only for good statistical timing behavior, where an occasional miss degrades quality rather than causing outright failure.

Here, we discuss hard real time since that is where measurements are most critical: if an operating system cannot guarantee execution of an application within an upper bound on delay, the application fails. Hard real-time systems are also important because they are often necessary to implement soft real-time systems. A limit on the worst-case delay is needed to allocate computing resources so as to obtain good statistical response in a soft real-time system (for example, deciding which video frame to drop and which to send through when serving multiple streams). The hard real-time component of a soft real-time system lets applications offer guarantees about "fairness" or quality of service.

Misleading Numbers

To illustrate how numbers can be, and are, used to mislead people, consider an example given at a conference. An RTOS vendor was quoting what it claimed were worst-case performance numbers. The values it gave were surprisingly good for the design being used. When asked how the test was performed and how long it was run, the response was, "These measurements were taken over 700,000 interrupts." That sounds impressive ("it's a really big number, so it must be okay") until you look at the numbers closely.

Even given only timer interrupts on a common UNIX configuration, that's an interrupt every 10 ms; 700,000 interrupts take 7000 seconds, or just under 2 hours. That's only timer interrupts, though. Under a more likely, but still conservative, interrupt load (the laptop we're writing this article on has averaged an interrupt every 2 ms over the last 45 minutes), those 700,000 interrupts take 1400 seconds, or about 23 minutes. That's not long at all for software that must always meet its performance constraints over a running time of possibly months or years at a stretch.

If you do actual work on the system rather than leaving it idle, such as a simple file transfer across the network, the interrupt rate jumps much higher. One interrupt every 100 µs is not uncommon for this kind of load, which works out to 10,000 interrupts every second. At that rate, 700,000 interrupts take 70 seconds, just over a minute. One minute is not very long to run a test of a real-time system.
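The arithmetic above is simple enough to capture in a helper. This small function (ours, not part of any vendor's suite) converts an interrupt count and mean interrupt interval into elapsed test time, and reproduces the article's three scenarios:

```c
/* How long does an N-interrupt test actually run at a given interrupt
   rate?  duration_sec = interrupts * interval_usec / 1e6.
   700,000 interrupts at one every 10 ms (idle timer tick) -> 7000 s;
   at one every 2 ms (light load) -> 1400 s;
   at one every 100 us (file transfer) -> 70 s. */
static double test_duration_sec(long interrupts, double interval_usec)
{
    return interrupts * interval_usec / 1e6;
}
```

The point of the exercise: a "700,000 interrupt" test may be anywhere from two hours to one minute of wall-clock time, depending entirely on load.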

You need to be careful when performance numbers are thrown around. Worst-case interrupt latency and worst-case periodic task-scheduling jitter are sometimes reported with no additional measurements. This should raise concern, since it is not difficult to sacrifice one aspect of system performance to improve another. For example, RTLinux provides a TIMER_ADVANCE facility that improves periodic task jitter to well under 15 µs. This mechanism wakes a thread before its deadline and puts it into a busy-wait loop until the deadline arrives. The busy-wait improves one aspect of performance (worst-case scheduling jitter) for one application, but the CPU spins uselessly while waiting for the deadline. This directly worsens other measurements: interrupt latency grows with the length of the busy-wait loop, and the usable CPU on the system decreases. This is a great advantage for some types of applications, but a critical problem for others.

This mechanism might be useful for a system that only requires high-performance periodic threads with low jitter but can tolerate very large interrupt latency. Conversely, it would be a disaster for a system that requires both low interrupt latency and low periodic task jitter.

If the measurement conditions are not clearly stated, they're probably not what you think they are (or what you are being led to believe they are). Most quoted measurements are on idle systems — but shouldn't be. Real-time benchmarking differs from other types in that normal benchmarks involve an idle system to get accuracy and reproducibility. This is the worst situation for a real-time test. For real-time tests, you don't want reproducibility, but rather worst-case values. This usually means a heavily loaded system.

One important feature of a hard real-time system is that it provides worst-case guarantees. Rather than average, mean, or best-case performance, designers are most interested in worst-case values. Eventually, every system encounters a combination of worst-case scenarios and the system must be designed to tolerate it. This is the situation that hard real-time systems are designed to handle correctly.

If an application requires real time, then it probably requires hard real time. When people explain that their applications only need soft real time because they're "only" running at 100 Hz and so don't need microsecond accuracy, they are really saying, "I have a hard real-time requirement that must be met, and it is very large." What matters is that an RTOS guarantees a certain response time, not always how fast that response is. If an application requires deterministic response, then it needs a guarantee. Even if the application only needs a 20-ms response time and the operating system typically delivers far better than that, the guarantee is still necessary. General-purpose operating systems offer no guarantees of timing. If your application must run with some determinism, even with a large period or a tolerance for fairly large jitter, it still has a hard real-time requirement.

Common and Useful Measurements

Each of our measurements for RTLinux includes a warm- and cold-cache version. Instruction and data caches have greatly improved the throughput of modern computer systems, but they have created a problem for determinism. When an interrupt occurs, the RTOS must run the scheduler and handlers, reset timers, and eventually begin executing an application; along the way, the cache must be reloaded. This takes a great deal of time, so it is important to make measurements that characterize this cold-cache behavior.

Warm-cache measurements can be useful to evaluate how much of the worst-case performance comes from cache behavior and how much comes from other factors. Some applications can be locked into cache during their execution, so performance with a warm cache is useful in determining how much of a performance boost this can create.

Clock resolution is simply informational and reports the clock resolution used by RTLinux for the timer and clock reporting. The value is reported both as an interval and a frequency (referred to in some tests as rtl_clock_tick_rate). The interval is the period between clock ticks for the hardware timer, while the frequency is the number of ticks per second.

Listing 1 is the RTLinux measurement suite report of clock resolution. This value gives the precision of timer intervals for scheduling, reporting current time, and any measurements performed. This is a hardware limit so it cannot be improved on by software. It is useful to know this value since it is the upper limit on precision for software.

The context switch, Listing 2(a), is the time to switch between two threads. The RTLinux test measures the time interval between suspending thread A and executing thread B. Listing 2(b) is the output.

One of the most commonly quoted numbers is the vague "context-switch time." Vendors of Windows CE and VxWorks quote these numbers because they're usually very good (often under 10 µs), but they're not very useful. This is the time it takes to switch from one process context into another — it's a real-time task switch. All that is involved in this operation is saving the state of the current task and restoring the state of the next task to be run. The timing for this operation is simply a factor of the hardware and nothing else. It is not a measurement of the RTOS at all. The incorrect meaning that is often attributed to this number (the one marketing leads you to believe) is that this is how quickly your code will begin running once an interrupt has occurred. In fact, that value is actually: hardware delay + OS exception entry code + context-switch time.

In short, know the hardware — know what is possible from it. No software can deliver better than the hardware can deliver.

Tests for interrupt latency, Listing 3(a), measure the largest observed time between when the interrupt line is asserted to the processor and an application interrupt handler begins executing. This delay is largely operating-system dependent, as it is caused by enabling and disabling interrupts during critical system operations. It is also possible for applications that enable or disable interrupts to increase this number.

This is a fairly straightforward measurement to make. Set the timer to interrupt at a specific time and attach a handler to that interrupt. When the handler executes, it checks the current time. The difference between the current time and when the interrupt was expected is the interrupt latency. Listing 3(b) is the output.

Scheduling jitter is the difference between when processes or threads wish to begin execution and when the system actually allows them to execute. The jitter test, Listing 4(a), measures the scheduling deadline overshoot for a thread. This is often a limiting factor for real-time applications since it determines how much precision can be expected. This provides a guarantee on the largest delay that your application will experience due to system performance.

This test creates one thread for each available CPU. These threads are pinned so that each runs on a different CPU. Each of these threads runs at 1 kHz (period of 1 ms) and computes the difference between when it was scheduled to wake up and the time it is actually woken up. This value is usually referred to as "periodic task-scheduling jitter." The worst (largest) value observed for each thread is recorded, and a low-priority thread prints these values. It is important to note that on multiprocessor systems our tests introduce a time skew so that the thread wakeups are not in lock step with each other. Listing 4(b) is the output.

The interrupt thread latency test, Listing 5(a), measures the time from an interrupt handler being suspended to the time execution of a real-time thread resumes. The wakeup mechanism is via semaphores, which in the RTLinux implementation causes an immediate reschedule if there is a higher priority thread ready to run. Listing 5(b) is the output.

The thread yield test, Listing 6(a), measures the time from when one thread yields execution via the sched_yield() call and a second thread begins execution. The test completes an iteration by signaling the yielding thread (thread A) to commence execution. Since that thread runs at a higher priority, it forces an immediate reschedule via sem_post(). Listing 6(b) is the output.

The thread cancellation test, Listing 7(a), measures the time to cancel and join a specific number of threads and reports the worst observed time for one thread. The join operation provides no real-time guarantees, but this test does measure the worst-case observed time for this function to complete. Listing 7(b) is the output.

The semaphore latency tests, Listing 8(a), measure the time between semaphore post and the execution of the thread waiting on the semaphore. Listing 8(b) is the output.

The uncontested mutex acquisition test, Listing 9(a), measures the time to relinquish and acquire a mutex that is not being held by another thread. Listing 9(b) is the output.

With contested mutex measurement, Listing 10(a), we measure the time for thread B to acquire a mutex that is being held by thread A. A semaphore is used to force a schedule of thread B causing it to attempt a failed mutex acquisition and sleep. Thread A will then run, relinquish the mutex, and allow thread B to proceed with a successful mutex acquisition. This is a useful measurement because it can illustrate trade-offs in mutex performance that either optimize for the failed or successful case. Listing 10(b) is the output.

Priority inversion recovery tests, Listing 11(a), measure the priority inversion recovery time for mutexes initialized with PTHREAD_PRIO_PROTECT. This operation uses the priority-ceiling protocol for the mutex. The recovery time is the time required for the acquisition (by high-priority thread A) of a mutex that is being held by a medium-priority thread (B), which is in turn waiting on a resource from a lower priority thread (C). This measurement excludes the total runtime of threads B and C. Listing 11(b) is the output.

The pthread conditional variable latency test, Listing 12(a), measures the time between signaling a thread via the pthread_cond_signal() call and the thread waking up. Listing 12(b) is the output.

Spinlock, Listing 13(a), measures the time to relinquish and acquire a spinlock that is not being held by another thread of execution. Listing 13(b) is the output.

Contested spinlock, Listing 14(a), measures the time for a thread (B) to acquire a spinlock that is being held by another thread (A). A semaphore is used to force a schedule of thread B, causing it to attempt a failed lock acquisition and block. Thread A will then run and relinquish the spinlock, allowing thread B to proceed with a successful spinlock acquisition. Listing 14(b) is the output.

RTLinux provides a communication mechanism called RT FIFO. This is analogous to UNIX pipes. We perform a number of measurements on these systems to determine any latencies that using them may cause as well as any latencies they may experience due to other operations on the system. Listing 15(a) is the FIFO test, while Listing 15(b) is the output of the test.

There is no true scientific guide for making these kinds of measurements, but it is advisable to test as much and in as many combinations as possible. If during your critical operation some system activity prevents your operation from completing in time, your entire application may fail. If this is a concern, then it's important to know what kind of tests have been conducted to ensure that your critical operations are safe from delay.

Toward this end, we measure FIFO latency, Listing 16(a). This is the time required to transfer data over a FIFO, and the time required to simply enter and exit the routines that send the data. The test is run for both the rtf_put() and rtl_write() methods of sending data and for varying data sizes. The FIFO handler latency test measures the invocation time for a FIFO handler installed as a signal handler (rtl_sigaction()) to service NULL FIFO rtl_write() requests. Listing 16(b) is the output.

Interpreting Measurements

When examining measurements regarding RTOS performance, watch out for words such as "typical" and the absence of "worst-case." Even raw numbers and examples of what's being tested can't always provide all the necessary information to evaluate system-performance impact on your application.

It's useful to look at what kind of real-time performance a perfect RTOS would be limited to by the hardware. A perfect RTOS would never disable interrupts, so the software would never directly cause interrupts to be delayed. This optimal RTOS still suffers delays from the hardware and the real-world fact that some code must be executed before a real-time event can be processed. That code takes time; it can miss in the cache, cause other events in the hardware that cause delays, and sometimes it's necessary to communicate with off-processor hardware to determine the type of event. All these can be very slow and there's little that can be done about them, short of improving hardware design.

Measurements alone don't tell the whole story. RTLinux virtualizes the interrupt controller for the general-purpose operating system (GPOS) to prevent it from disabling interrupts and preventing real-time applications from running. This is done by marking interrupts as disabled when the GPOS requests it, but not actually disabling them. When a real interrupt does occur but a real-time application is running or the GPOS wants interrupts disabled, the interrupt must be disabled on the hardware and then control must be returned to the previously running application.

This mechanism allows RTLinux to prevent a general-purpose operating system from disabling interrupts and delaying real-time applications for long amounts of time — but this is not free. The performance trade-off assumes that interrupts are rare while running real-time code and, to a lesser degree, that they are also rare while interrupts are disabled by the general-purpose operating system.

The scenario of most concern can be illustrated by the following situation. Assume a system with N nonreal-time interrupts handled by the general-purpose operating system (disk, mouse, keyboard, video, network, and so on), and let ti be the cost (in time) to receive interrupt i, disable it, and return while it is "soft" disabled. The worst-case delay of a real-time periodic thread or real-time interrupt can then be expressed by:

t0 + t1 + ... + tN-1

A common number of interrupts on stock Linux-based Intel x86 computers today is about six, and a fast processor can receive and disable an interrupt in about 2 µs, so our worst case can be pushed up by about 12 µs. This delay can occur scattered about while running real-time applications, assuming the application does not disable interrupts in the hardware. It can also happen at the same time as the assertion of an interrupt that needs to be serviced by a real-time application, because higher priority interrupts may need to be disabled before the real-time interrupt is handled.

It's always important not to accept even a very long test as a complete guarantee. Some worst-case scenarios are very hard to trigger, and understanding that is what lets you make useful evaluations of claims about software.


Cort Dougan is director of engineering and Zwane Mwaikambo is an engineer at FSMLabs. They can be contacted at cort@fsmlabs.com and zwane@fsmlabs.com, respectively.