Performance Verification

Cache, RISC, and embedded systems

Roger Crooks

Roger can be contacted at Tektronix, P.O. Box 460 DS-92-688, Beaverton, OR 97076.


Designers of high-performance embedded systems look for performance gains wherever they can be found. Although you can always increase performance with faster components, you'll also increase the system's cost--and embedded systems are generally cost-sensitive, size-constrained, and power-limited at the outset. Consequently, it's critical that the embedded software function as optimally as possible before you begin adding faster components.

Because real-time embedded systems are event driven, the design engineer must verify that the software reacts to events within a specified amount of time. Measurement tools, such as performance analyzers, can help improve system performance without adding cost to your system. This article examines how you can use performance-analysis tools to debug the time-domain aspects of embedded software in a RISC-based system that uses cache memory.

As processing power becomes less expensive, many designers are looking at using higher speed RISC technology in new designs. If you're considering using RISC, be aware that it adds new problems to the already difficult task of debugging embedded software systems. Higher clock rates, expanded code, caches, large register sets, sophisticated compilers, and complex assembly programming all make the move to RISC a non-trivial decision. To underscore the complexity of debugging RISC-based systems, I'll examine one component of a RISC system--the use of caches.

Many high-performance RISC and CISC microprocessors incorporate high-speed cache memory to achieve maximum performance. One of the fundamental aspects of RISC is that its execution units must be kept busy. This means that one or more instructions must be loaded by the processor on each clock cycle. The only way to achieve this performance at a reasonable system cost is to add high-speed cache between main memory and the microprocessor. Primary caches are typically integrated on the microprocessor; secondary caches reside between main memory and the primary cache for added performance. Figure 1 shows a typical system and where a logic analyzer is connected to monitor data. Regardless of the type or size of cache, the impact on embedded systems is similar: caches can drastically affect the time-domain aspects of your software.

Caches and Embedded-system Performance

It's generally accepted that adding cache to a system will improve performance. While true for most systems, there are cases when an embedded system's performance may actually decrease with the addition of a cache. But whether your embedded system's performance increases or decreases, there's no question that the time-domain behavior of your embedded software will be less deterministic.

The function of a cache is to store a portion of main memory--which uses slower RAM--in a smaller, high-speed RAM that can feed the microprocessor at its maximum clock rate. When code is resident in cache, performance will be optimal. Without a cache, every instruction must be fetched from the slower main memory, which may take multiple bus cycles per instruction. In this case, performance will be slower but deterministic.

How Caches are Controlled

There are different types of caches and different algorithms for controlling them. Likewise, there are many theories on obtaining maximum performance out of a cache design, as well as theories on how to determine the optimal amount of cache. Ultimately you have to measure your system to determine which method is optimal for your application, because the best method for one application might be the worst for another application. The same is true for the optimal size of cache; it can be very application dependent. Since most embedded systems are designed for a single application, you can determine the best cache algorithms and optimal cache size by using performance-analysis tools to measure system performance.

When the CPU fetches an instruction, the system first determines if that instruction is in cache. If so, it's fetched and executed. If not, the cache is flushed and filled from main memory. You pay a performance penalty to initially fill the cache (the larger the cache, the larger the penalty), but hopefully this will be offset by the improvement gained by operating out of cache. Sequential software will benefit the most, whereas code that contains many calls to different portions of memory will suffer. In short, the order in which you link your functions can impact the performance of cache-based systems.

These factors are what cause cache-based software to be nondeterministic. If you're writing mission-critical software, you have to write for the worst case. But what is the worst-case situation--cache off or cache on? Theoretically, a situation can occur where worst-case performance occurs with the cache on.

How Caches Impact Deterministic Software

The problem with using a cache is that the software won't always be in cache when needed, causing a time lag before the software can be executed. This time lag can vary, making your software less deterministic. Figure 2 depicts three typical scenarios in which the cache can jeopardize the deterministic aspects of your embedded software.

Case 1: If a specific routine (say, an interrupt handler) is in cache when needed, it will execute very fast. In this case you get the optimal performance measurement.

Case 2: If the specific routine is not in cache, the cache must be flushed and then loaded. This scenario gives you a second performance measurement.

Case 3: If the specific routine is partially in cache, it will execute until it reaches an instruction that is not in cache, then flush the cache and fill it with the rest of the routine. Performance in this case is not very deterministic, since the amount of code initially in cache can vary.

Maximizing System Performance with Performance Analysis

Performance analysis is a method for determining where your software is spending most of its time. There are two types of performance analysis: traditional performance analysis and single-event mode.

Traditional performance analysis measures the amount of time spent by multiple events simultaneously, as shown in Figure 3. The most common use of traditional performance analysis is to determine which functions, if optimized, will result in the biggest improvement in overall system performance. For example, in Figure 3, making Addr_range_3 execute 10 percent faster will have a greater impact on overall system performance than improving Addr_range_4 by 50 percent.

You can optimize functions a number of ways: by recoding them in assembler, using different algorithms, or possibly locking the function to cache (available on certain microprocessors). Locking a function to cache means that once the function is loaded, it always remains in cache. This limits the use of the cache for other functions, but it may be worth the price if overall system performance improves. Again, you can experiment with different techniques and measure the results with the performance analyzer.

Single-event mode is an alternate measurement capability of performance analyzers. Single-event mode measures the duration of an event each time it executes and displays the ranges of time it took to execute as in Figure 4. It is the most efficient way to determine the impact of caches on system performance. Although you could use traditional performance analysis to measure your whole program while it runs repeatedly, it won't provide much information on where a problem might be. A more practical and useful measurement is to use single-event mode to measure your cache performance.

You can also verify other time-critical functions, such as interrupt handlers and data-processing functions, with single-event mode. Systems are designed around the expectation that certain functions will execute in a specified amount of time, otherwise data will be lost. With single-event mode, you can profile these critical routines under worst-case conditions over long periods of time.

When you are finished running your tests, single-event mode will display the timing results in a histogram display. From this display, you can determine the minimum, maximum, and average time it took to execute your targeted routine as in Figure 4. If even one occurrence violates your system specification, an error will eventually occur.

You can also vary other system parameters such as the cache-control algorithm or the size of the cache to determine its impact on overall system performance.

Measuring a Data-dependent Algorithm

If your data-processing algorithm is data dependent, meaning that the time it takes to execute depends on the data it has to process, you'll want to monitor that single function over long periods of time. For example, if you're writing software to read compressed video data off a CD-ROM drive and display it on your monitor, your data flow might look like Figure 5, where the two critical routines are the uncompress-data routine (t1) and the display-data routine (t2). What you want is for the overall time (t3) to be fast enough to avoid display flicker caused by excessive time between frames.

If t1 is too slow, there'll be additional disk rotations between reads of successive blocks of data. On a CD-ROM drive, this time can be excessive. Slow video memory or adding special effects to the video as it's being displayed can impact t2. Certain effects may take more processing time and cause flicker or jerky motion.

The first task is to optimize each function. Factors that might impact t1 include the compression method used and the size of the input buffer that reads data off of disk. Single-event mode can be used to determine the optimal compression method for your type of data. The reason you need a performance analyzer for timing more than just a few frames is that most compression methods are data dependent. With single-event mode, you can run your whole video clip and see the min/max/avg time for the decompression routine over the whole application. This is more efficient and reliable than timing selected frames of data.

You may not have the option of changing the compression method, so you may want to experiment with the size of the data buffer read off disk. You'll want to determine the optimal amount of data to process on one pass. Some processors have data caches that can deliver a broad range of performance. By experimenting with different buffer sizes and running single-event mode, you can zero in on the right buffer size for optimal performance.

One of the ways manufacturers of video-display adapters differentiate themselves from competitors that use the same hardware is by writing more efficient firmware. Performance-analysis tools can be invaluable in fine-tuning firmware to get the maximum amount of performance out of the hardware.

Once you have optimized the individual routines, the next step is to optimize how they interact with each other. In this example, after doing an analysis of the display routine, you'll want to look at how the two routines work together. Ideally, the t1 and t2 times should be relatively balanced (see Figure 6). It doesn't do your system any good if t1 is optimal but passes more data than t2 can handle efficiently. Here you'll want to look for the combination of t1 and t2 that minimizes the total time t3.

A similar condition occurs in a dual-processor system where you need to look at "load balancing" between the t1 processor and the t2 processor. You may find that by optimizing t1, t2 is excessively idle and can't keep up with the data when it's sent. You might achieve better system performance by having t1 run at a less than optimal rate and passing data to the video display processor more frequently.

However, without actually running the system and measuring performance, you can only make an educated guess at how to design the system software. By having two logic-analyzer acquisition cards with performance-analysis capabilities monitoring the two processors, you can balance your system's processing capability between the two tasks.

How to Debug Cache-based Software

There's an inherent conflict in debugging cache-based software. First, to test the system correctly, the cache must be turned on to accurately reflect how the system will ultimately perform. However, for maximum visibility of data for debugging, the cache needs to be disabled (see Figure 1). This ensures all executed instructions are fetched off the bus and can be captured by a logic or performance analyzer.

Most debug tools--emulators, debug monitors, and so on--are intrusive to the cache. When a breakpoint is reached with these tools, the cache is flushed and refilled when execution starts again, changing how your system ultimately runs. A logic analyzer is a passive device, only monitoring data, making it non-intrusive.

For debugging the logic of the software, the best strategy is to turn the cache off. The cache won't affect what the software does, but will affect how the software executes in the time domain. Once the logic of the program is debugged, you should then turn the cache on to debug the time-domain aspects of the software.

If your software is failing inside cache, you'll need to trace the execution. Although traditional printf statements are usually not an option for embedded systems, you can use a similar method with a logic analyzer that supports performance analysis. One method is to insert dummy write instructions that write data to a location in memory. These debug statements have a slight impact on your program's size and execution speed, but provide an easy way to trace your program. The logic analyzer can be set to monitor these memory locations to time a function or to simply trace execution.

There are three similar methods for monitoring your program via dummy write instructions. The first method, shown in Figure 7, is the simplest for higher-level languages. It performs a dummy write to an unused portion of memory monitored by the logic analyzer.

The second method is similar and is a bit easier when programming in assembler. With this method, you write the contents of the program counter to a single memory location. The logic analyzer is set to monitor writes to that location and display the last value of the program counter. Although this method provides you with trace information, it doesn't tell you why something doesn't run as expected.

The third method yields more debug information by writing intermediate values of a critical calculation to unused memory locations. For example, if you were performing an iterative calculation, such as processing data, you could write intermediate register values or variables to the test location. Again the logic analyzer would record the data for analysis.

It's critical that you run your final verifications with frozen software. Beyond the obvious reasons, each time you recompile your code, your critical routines may reside at different locations in memory. If routines shift across different boundaries, how they are loaded into cache--and therefore their timing--will change.

Figure 1: A typical cache-based system with the optional secondary cache.

Figure 2: Depending on what portion of a routine is in cache when it is about to be executed, the time to execute will vary.

Figure 3: Traditional performance analyzers will measure defined routines by monitoring the address bus of the microprocessor. Each time an address appears on the bus, the performance analyzer will "bin" that address and display the bins in a histogram format.

Figure 4: Single-event mode measures one event (or function) repetitively and displays the range of execution times in a histogram format. Minimum, maximum and average times are also displayed.

Figure 5: A simplified example of displaying compressed video data on a display.

Figure 6: Performance analysis can measure how tasks are split in a single-processor system. Here t1 and t2 are evenly split. An additional 4 percent of the total time is spent outside of these two critical routines.

Figure 7: Monitoring dummy write instructions

Loop_Begin_1
  Write FF to Test_Location_1      <- Test Instruction
  code
  code
  code
  code
  If not done goto Loop_Begin_1 else goto Loop_Begin_2
Loop_Begin_2
  Write FF to Test_Location_2     <- Test Instruction
  code
  code
  code
  code
  If not done goto Loop_Begin_2 else goto End
End
  Write FF to Test_Location_3    <- Test Instruction

Copyright © 1993, Dr. Dobb's Journal