Performance Verification

Cache, RISC, and embedded systems

Roger Crooks

Roger can be contacted at Tektronix, P.O. Box 460 DS-92-688, Beaverton, OR 97076.


Designers of high-performance embedded systems look for performance gains wherever they can be found. Although you can always increase performance with faster components, you'll also increase the system's cost--and embedded systems are generally cost-sensitive, size-constrained, and power-limited at the outset. Consequently, it's critical that the embedded software function as optimally as possible before you begin adding faster components.

Because real-time embedded systems are event driven, the design engineer must verify that the software reacts to events within a specified amount of time. Measurement tools, such as performance analyzers, can help improve system performance without adding cost to your system. This article examines how you can use performance-analysis tools to debug the time-domain aspects of embedded software in a RISC-based system that uses cache memory.

As processing power becomes less expensive, many designers are looking at using higher speed RISC technology in new designs. If you're considering using RISC, be aware that it adds new problems to the already difficult task of debugging embedded software systems. Higher clock rates, expanded code, caches, large register sets, sophisticated compilers, and complex assembly programming all make the move to RISC a non-trivial decision. To underscore the complexity of debugging RISC-based systems, I'll examine one component of a RISC system--the use of caches.

Many high-performance RISC and CISC microprocessors incorporate high-speed cache memory to achieve maximum performance. One of the fundamental aspects of RISC is that its execution units must be kept busy. This means that one or more instructions must be loaded by the processor on each clock cycle. The only way to achieve this performance at a reasonable system cost is to add high-speed cache between main memory and the microprocessor. Primary caches are typically integrated on the microprocessor; secondary caches reside between main memory and the primary cache for added performance. Figure 1 shows a typical system and where a logic analyzer is connected to monitor data. Regardless of the type or size of cache, the impact on embedded systems is similar: caches can drastically affect the time-domain aspects of your software.

Caches and Embedded-system Performance

It's generally accepted that adding cache to a system will improve performance. While true for most systems, there are cases when an embedded system's performance may actually decrease with the addition of a cache. But whether your embedded system's performance increases or decreases, there's no question that the time-domain behavior of your embedded software will be less deterministic.

The function of a cache is to store a portion of main memory--which uses slower RAM--in a smaller, high-speed RAM that can feed the microprocessor at its maximum clock rate. When code is resident in cache, performance will be optimal. Without a cache, every instruction must be fetched from the slower main memory, which may take multiple bus cycles per instruction. In this case, performance will be slower but deterministic.

How Caches are Controlled

There are different types of caches and different algorithms for controlling them. Likewise, there are many theories on obtaining maximum performance out of a cache design, as well as theories on how to determine the optimal amount of cache. Ultimately you have to measure your system to determine which method is optimal for your application, because the best method for one application might be the worst for another application. The same is true for the optimal size of cache; it can be very application dependent. Since most embedded systems are designed for a single application, you can determine the best cache algorithms and optimal cache size by using performance-analysis tools to measure system performance.

When the CPU fetches an instruction, the system first determines if that instruction is in cache. If so, it's fetched and executed. If not, the cache is flushed and filled from main memory. You pay a performance penalty to initially fill the cache (the larger the cache, the larger the penalty), but hopefully this will be offset by the improvement gained by operating out of cache. Sequential software will benefit the most, whereas code that contains many calls to different portions of memory will suffer. In short, the order in which you link your functions can impact the performance of cache-based systems.

These factors are what cause cache-based software to be nondeterministic. If you're writing mission-critical software, you have to write for the worst case. But what is the worst-case situation--cache off or cache on? Theoretically, a situation can occur where worst-case performance occurs with the cache on.

How Caches Impact Deterministic Software

The problem with using a cache is that the software won't always be in cache when needed, causing a time lag before the software can be executed. This time lag can vary, making your software less deterministic. Figure 2 depicts three typical scenarios in which the cache can jeopardize the deterministic aspects of your embedded software.

Case 1: If a specific routine (say, an interrupt handler) is in cache when needed, it will execute very fast. In this case you get the optimal performance measurement.

Case 2: If the specific routine is not in cache, the cache must be flushed and then loaded. This scenario gives you a second performance measurement.

Case 3: If the specific routine is partially in cache, it will execute until it reaches an instruction that is not in cache, then flush the cache and fill it with the rest of the routine. Performance in this case is not very deterministic, since the amount of code initially in cache can vary.

Maximizing System Performance with Performance Analysis

Performance analysis is a method for determining where your software is spending most of its time. There are two types of performance analysis: traditional performance analysis and single-event mode.

Traditional performance analysis measures the amount of time spent by multiple events simultaneously, as shown in Figure 3. The most common use of traditional performance analysis is to determine which functions, if optimized, will result in the biggest improvement in overall system performance. For example, in Figure 3, making Addr_range_3 execute 10 percent faster will have a greater impact on overall system performance than improving Addr_range_4 by 50 percent.

You can optimize functions a number of ways: by recoding them in assembler, using different algorithms, or possibly locking the function to cache (available on certain microprocessors). Locking a function to cache means that once the function is loaded, it always remains in cache. This limits the use of the cache for other functions, but it may be worth the price if overall system performance improves. Again, you can experiment with different techniques and measure the results with the performance analyzer.

Single-event mode is an alternate measurement capability of performance analyzers. Single-event mode measures the duration of an event each time it executes and displays the ranges of time it took to execute as in Figure 4. It is the most efficient way to determine the impact of caches on system performance. Although you could use traditional performance analysis to measure your whole program while it runs repeatedly, it won't provide much information on where a problem might be. A more practical and useful measurement is to use single-event mode to measure your cache performance.

You can also verify other time-critical functions, such as interrupt handlers and data-processing functions, with single-event mode. Systems are designed around the expectation that certain functions will execute in a specified amount of time, otherwise data will be lost. With single-event mode, you can profile these critical routines under worst-case conditions over long periods of time.

When you are finished running your tests, single-event mode will display the timing results in a histogram display. From this display, you can determine the minimum, maximum, and average time it took to execute your targeted routine as in Figure 4. If even one occurrence violates your system specification, an error will eventually occur.

You can also vary other system parameters such as the cache-control algorithm or the size of the cache to determine its impact on overall system performance.

Measuring a Data-dependent Algorithm

If your data-processing algorithm is data dependent, meaning that the time it takes to execute depends on the data it has to process, you'll want to monitor that single function over long periods of time. For example, if you're writing software to read compressed video data off a CD-ROM drive and display it on your monitor, your data flow might look like Figure 5, where the two critical routines are the uncompress-data routine (t1) and the display-data routine (t2). What you want is for the overall time (t3) to be fast enough to avoid display flicker caused by excessive time between frames.

If t1 is too slow, there'll be additional disk rotations between reads of successive blocks of data. On a CD-ROM drive, this time can be excessive. Slow video memory or adding special effects to the video as it's being displayed can impact t2. Certain effects may take more processing time and cause flicker or jerky motion.

The first task is to optimize each function. Factors that might impact t1 include the compression method used and the size of the input buffer that reads data off of disk. Single-event mode can be used to determine the optimal compression method for your type of data. The reason you need a performance analyzer for timing more than just a few frames is that most compression methods are data dependent. With single-event mode, you can run your whole video clip and see the min/max/avg time for the decompression routine over the whole application. This is more efficient and reliable than timing selected frames of data.

You may not have the option of changing the compression method, so you may want to experiment with the size of the data buffer read off disk. You'll want to determine the optimal amount of data to process on one pass. Some processors have data caches that can deliver a broad range of performance. By experimenting with different buffer sizes and running single-event mode, you can zero in on the right buffer size for optimal performance.

One of the ways manufacturers of video-display adapters differentiate themselves from competitors that use the same hardware is by writing more efficient firmware. Performance-analysis tools can be invaluable in fine-tuning firmware to get the maximum amount of performance out of the hardware.

Once you have optimized the individual routines, the next step is to optimize how they interact with each other. In this example, after doing an analysis of the display routine, you'll want to look at how the two routines work together. Ideally, the t1 and t2 times should be relatively balanced (see Figure 6). It doesn't do your system any good if t1 is optimal but passes more data than t2 can handle efficiently. Here you'll want to look for the combination of t1 and t2 that minimizes the total time t3.

A similar condition occurs in a dual-processor system where you need to look at "load balancing" between the t1 processor and the t2 processor. You may find that by optimizing t1, t2 is excessively idle and can't keep up with the data when it's sent. You might achieve better system performance by having t1 run at a less than optimal rate and passing data to the video display processor more frequently.

However, without actually running the system and measuring performance, you can only make an educated guess at how to design the system software. By having two logic-analyzer acquisition cards with performance-analysis capabilities monitoring the two processors, you can balance your system's processing capability between the two tasks.

How to Debug Cache-based Software

There's an inherent conflict in debugging cache-based software. First, to test the system correctly, the cache must be turned on to accurately reflect how the system will ultimately perform. However, for maximum visibility of data for debugging, the cache needs to be disabled (see Figure 1). This ensures all executed instructions are fetched off the bus and can be captured by a logic or performance analyzer.

Most debug tools--emulators, debug monitors, and so on--are intrusive to the cache. When a breakpoint is reached with these tools, the cache is flushed and refilled when execution starts again, changing how your system ultimately runs. A logic analyzer is a passive device, only monitoring data, making it non-intrusive.

For debugging the logic of the software, the best strategy is to turn the cache off. The cache won't affect what the software does, but will affect how the software executes in the time domain. Once the logic of the program is debugged, you should then turn the cache on to debug the time-domain aspects of the software.

If your software is failing inside cache, you'll need to trace the execution. Although traditional printf statements are usually not an option for embedded systems, you can use a similar method with a logic analyzer that supports performance analysis. One method is to insert dummy write instructions that write data to a location in memory. These debug statements have a slight impact on your program's size and execution speed, but provide an easy way to trace your program. The logic analyzer can be set to monitor these memory locations to time a function or to simply trace execution.

There are three similar methods for monitoring your program via dummy write instructions. The first method, shown in Figure 7, is the simplest for higher-level languages. It performs a dummy write to an unused portion of memory monitored by the logic analyzer.

The second method is similar and is a bit easier when programming in assembler. With this method, you write the contents of the program counter to a single memory location. The logic analyzer is set to monitor writes to that location and display the last value of the program counter. Although this method provides you with trace information, it doesn't tell you why something doesn't run as expected.

The third method yields more debug information by writing intermediate values of a critical calculation to unused memory locations. For example, if you were performing an iterative calculation, such as processing data, you could write intermediate register values or variables to the test location. Again the logic analyzer would record the data for analysis.

It's critical that you run your final verifications with frozen software. Beyond the obvious reasons, each time you recompile your code, your critical routines may reside at different locations in memory. If routines shift across different boundaries, how they are loaded into cache--and therefore their timing--will change.

Figure 1: A typical cache-based system with the optional secondary cache.

Figure 2: Depending on what portion of a routine is in cache when it is about to be executed, the time to execute will vary.

Figure 3: Traditional performance analyzers will measure defined routines by monitoring the address bus of the microprocessor. Each time an address appears on the bus, the performance analyzer will "bin" that address and display the bins in a histogram format.

Figure 4: Single-event mode measures one event (or function) repetitively and displays the range of execution times in a histogram format. Minimum, maximum and average times are also displayed.

Figure 5: A simplified example of displaying compressed video data on a display.

Figure 6: Performance analysis can measure how tasks are split in a single-processor system. Here t1 and t2 are evenly split. An additional 4 percent of the total time is spent outside of these two critical routines.

Figure 7: Monitoring dummy write instructions

Loop_Begin_1
  Write FF to Test_Location_1      <- Test Instruction
  code
  code
  code
  code
  If not done goto Loop_Begin_1 else goto Loop_Begin_2
Loop_Begin_2
  Write FF to Test_Location_2     <- Test Instruction
  code
  code
  code
  code
  If not done goto Loop_Begin_2 else goto End
End
  Write FF to Test_Location_3    <- Test Instruction

Copyright © 1993, Dr. Dobb's Journal