A Heisenberg Compensator For Measuring Software Performance

Dr. Dobb's Journal June, 2004

Measure performance without affecting it

By Gary Carleton and Charles Spirakis

Gary and Charles are engineers for Intel. They can be contacted at gary.carleton intel.com and charles.spirakisintel.com, respectively.

The principle of uncertainty is a serious concern when trying to measure software performance. As soon as a tool is inserted into the system to gather performance data, the performance of the system is changed. Many software performance tools in use today inject extra code into the application or system being measured, which affects the performance itself. The issue of how to measure software performance without affecting the performance of the system being measured remains.

This is actually a long-recognized problem in other environments as well. In physics, the Heisenberg Uncertainty Principle states that, for any subatomic particle, there is no way to know its position and momentum at the same time. The broader interpretation of this is that, in order to make a measurement in any system, the system must be disturbed.

In the TV show Star Trek: The Next Generation, there was an episode in which the transporter had broken down. When asked what was wrong, the technician said, "Well, it might be a problem in the Heisenberg Compensator." This was an amusing acknowledgment that some type of compensator would need to exist to be able to scan the object being transported accurately enough to be able to reconstruct it. Interestingly, when the show's writers were asked if they knew how the Heisenberg Compensator worked, they responded that if they did, they would be off collecting their Nobel Prizes in Sweden, not working on a TV show (http://www.bbc.co.uk/cult/st/interviews/bormanis/ page38.shtml).

In this article, we examine ways to minimize the intrusion that measuring software performance naturally causes. We also examine which types of software performance data can be trusted. Given that performance can't be measured perfectly, we can at least try to come close.

Sampling

Some time ago, we began looking at ways to take occasional samples of program execution context, as opposed to generating extra code to look at every function or source-code line that executed. The question was whether this would provide data that could really be trusted.

In terms of general statistical sampling, the normal problem you face is determining how many samples to take. In the case of measuring software performance on a particular system, the problem is determining how frequently to take samples so that you'll end up with enough samples to produce reasonably accurate results, but not so many that the normal operation of the software is changed. We experimented with different sampling rates and compared the results with other instrumented methods of gathering the performance data to validate the sampling results. With these experiments, we came up with a sampling rate of 1000 samples/second as the best compromise between generating enough samples and beginning to affect the execution of the program.

An important thing to note is that the goal of this technology is to find the portions of a program that consume the most CPU execution time. As such, we are most concerned with finding those samples that occur most frequently. We are not trying to find which function took 0.01 percent of the total time: We want to find the places where the CPU spent a lot of time and focus our optimization efforts on those specific parts of the code.

Tradeoffs

There are a few tradeoffs involved in deciding to use sampling instead of the more traditional instrumented approach. Sampling only identifies specific bottlenecks in the executing software. Instrumentation can provide much more data including calling sequences between functions, cumulative execution times on a particular branch of the call tree, and (in the case of instrumented executable line numbers) the number of times each particular source line executes. Also, for sampling, the correct sampling rate needs to be selected. Too fast a sampling rate can bog down the system. Too slow a rate might not provide enough data for the results to be meaningful.

Another possible tradeoff that can occur with any type of sampling is "sampling bias." This results when the samples taken are not a truly random sampling of the population in general. In our type of execution address sampling, sampling bias can happen if there are regular, recurring events in the system that occur at a rate that is an exact integral multiple of the sampling frequency. This type of regular harmonic event is a very rare occurrence given the speed of modern microprocessors and the width of the sampling window. The event would have to repeatedly occur within a few clock cycles of the sampling interrupt. In our tool, users can simply change the sampling rate to overcome this issue.

Experimental Results

We ran some experiments in CPU execution address sampling using different sampling rates to determine how critical the rate is on the validity of the results. We used a range of sampling rates, from 1 sample/second to 100,000 samples/second, on a program that runs for 20 seconds. Each sampling rate was run three times to make sure the data was accurate and repeatable. The results reported here are the averages of the three runs. Of course, differences in system hardware or software design or configuration may affect actual performance. Engineers should consult multiple sources of information to evaluate the performance of the equipment or methodologies being considered.

The tool we used—VTune Analyzer from Intel (the company we work for)—generates hardware interrupts to gather the samples. The interrupt service routine that responds to the interrupt saves the interrupted execution address along with the current process and thread IDs, then immediately returns to the interrupted address. In these experiments, we used a simple, compute-bound application and measured its execution characteristics. Although VTune Analyzer is a commercial product, fully functional evaluation versions are available at no cost for both Windows and Linux (http://www.intel.com/software/products/vtune/).

Table 1 shows the resulting data reported by the VTune Analyzer. There are a number of interesting observations that can be drawn from the data. First, in terms of the validity of the data, most of the sample rates show that the norm2 and norm3 procedures are the most significant bottlenecks in the program. The 100,000/second sampling rate, however, had results that were radically different for certain data items. Presumably, this would mean that a 100,000/second sample rate is too fast—it is beginning to affect the normal operation of the program. The highlighted rate of 1000 samples/second would seem to be the best compromise. Also, the program itself is consuming most of the execution time, so we know that some other piece of software in the system (OS layer, device driver, and so on) isn't where the CPU is spending most of its time.

The other data items show some interesting properties, too. For the number of different processes, the number of threads, and the number of source code functions that are displayed, the faster sampling rates showed larger counts. This is because faster sampling rates have a better chance of catching items that rarely execute. This implies that the technology is not well-suited to reporting data for items that occur infrequently. It is best at finding those big offenders.

Enabling Technologies

Different processors are capable of different types of sampling; for example, time-based sampling (TBS) or event-based sampling (EBS). As long as you have a periodic time base that can interrupt the processor, you can do TBS (http://icl.cs.utk .edu/papi/index.html). If a processor provides for event counting, then TBS coupled with EBS can be used to analyze what is happening in different areas of the code. Finally, if the processor allows the event counter to be set with a starting value and to generate an interrupt, you can do EBS. If the Performance Monitoring Unit (PMU) overflow can be set as a nonmaskable interrupt, then you have the option of getting sampling information for interrupt handlers and other very low-level code in addition to applications and higher layers of the operating system.

With current technology, hardware architects have the ability to provide a large number of events to count, as well as a large number of counting registers. The limiting factor in this area appears to be the trend towards multiple-thread or multiple-core processors. In these processors, the EBS challenge comes from shared counters (events are counted on all hardware threads or cores) or hardware-thread or core-specific events (events that can only be counted on a single hardware thread or core) (http://www.intel.com/design/pentium4/manuals/245472.htm).

Finally, sampling—EBS or TBS—can be either system wide or context specific. Each has its own advantages and disadvantages. System-wide performance sampling can generally be implemented without many hooks in the OS and provides a global view of what is happening on the system. It lets you determine if a performance problem is related to your code, the OS, or some combination.

If, however, all you care about is your application, getting information on the rest of the system is irrelevant. In that case, context-specific sampling is preferred—but this type of sampling requires the EBS application or the OS to context switch the PMU registers as well as the general registers. Depending on the latency of the PMU registers, the context-switch time for the PMU's can significantly impact overall performance. For static feedback, this may not be an issue; but for real-time feedback, you need to overcome this performance degradation before you can actually see a benefit.

New Areas of Measurement

Currently, there is significant interest in using EBS to enhance performance in a number of areas including static feedback, real-time feedback, and architectural enhancements. For an example of static feedback, consider the usual compile/check performance/recompile loop. Normally, when you want to use feedback from an actual run, the compiler must insert extra code in the binary to measure a specific area of interest. In some cases, this can be a call for stack information (also known as gprof), and in other cases, it can be obtained through cache or branch mispredict measurements (via the MIPS compiler from years back). After a special compile is created, you would run the code, then recompile again based on the data generated.

Event-based sampling improves on this technique in two ways:

There is no need to create a special compile.
You can use EBS to verify that the compiler actually improved things based on the feedback.

For example, as the hardware architectural pipelines get longer, a mispredicted branch can have significant performance impact. By using EBS on an actual binary, you can statistically measure the number and location of mispredicted branches. This information can be fed back into the compiler and a new binary can be generated to handle the more common branch cases. Afterwards, you can again use EBS to verify that the new binary is an actual improvement over the old.

EBS can also be useful in providing real-time feedback on a running application. Currently, Java and .NET allow for new code generation when a hotspot is detected. They also allow a running application to dynamically shift to the new code. In this case, EBS can help detect hotspots that would be difficult or costly to detect in software. Feedback would then allow the runtime manager to implement new Just-In-Time (JIT) code. As an added bonus, since EBS would allow capturing information from the generated code as well as the interpreted code, a runtime manager could implement new JIT code to improve previously JIT-optimized code.

Making dynamic decisions based on EBS data is not limited to managed runtime code—operating systems can take advantage of this, too. For example, some architectures let the compiler generate speculation paths that are executed simultaneously with the regular instruction stream (http://www.intel.com/design/itanium2/manuals/ 251110.htm). There are flags in the hardware that can be used per process to indicate how aggressively the hardware should resolve speculation faults; for example, whether to allow a page fault on the speculation path. In such cases, EBS can be used to get information on how often the speculation path is taken and, if it is taken often enough, the OS can enable the hardware to be more aggressive on the speculation path for that particular process.

Process performance can also be dynamically impacted by the OS with a branch mispredict. For some instruction sets (Intel Itanium instructions, for example), the branch instruction contains a bit field hint for the hardware indicating whether the branch is going to be taken. Using EBS, the OS could watch for branch mispredicts and modify the code to make the hints more accurate. In this way, if the compiler guessed wrong or if the workload was significantly different, the OS could dynamically improve performance: The OS could decide whether to save changes when the code is paged out or simply let the EBS code change the binary again the next time the code is paged back in.

Finally, people designing hardware can take advantage of EBS data to better understand the utilization of their hardware. In the old days, benchmarks were small and could be simulated to see underlying memory-access patterns and pipeline utilization. Now, with benchmarks like Sysmark and Winbench, running the entire benchmark under a simulator is not very practical. Instead, EBS can be used to indicate where bottlenecks occur. For example, observing a cycles-per-instruction ratio is a good first-round approximation for determining architectural pipeline efficiency. If you have a low cycle-per-instruction ratio for the majority of a benchmark, you can assume that the architectural pipeline is being utilized fairly well. Conversely, when there is a high cycle-per-instruction ratio, you would do additional EBS sampling using other events to understand what is happening.

Conclusion

Sampling seems like a reasonable method of gathering CPU performance data, but you must understand its relative advantages and disadvantages in relation to other performance data-gathering technologies.

The biggest advantages are:

It is a relatively low-overhead technology.
It doesn't materially affect the performance of the software.
It doesn't require an extra instrumentation step in the build process or prior to the running of the test.

Disadvantages to watch for include:

There aren't as many different types of data produced as with other technologies (for example, method-to-method flow of control through the program).
The data is not exact and may only be accurate for frequently occurring instruction addresses.
There may be harmonic effects that could produce a bias in the resulting data.

While this may not be a perfect Heisenberg Compensator for measuring software performance, we can use modern microprocessor features to draw a fine line around the tradeoffs between measuring certain types of performance data and minimally affecting software performance. In terms of creating a real Heisenberg Compensator, we'll leave that to future Nobel Laureates and TV writers.

DDJ