Article jun2006.tar

An Interview with DTrace Guru Jarod Jenson

Peter Baer Galvin

The world is still in the early days of exploring and using DTrace, Sun's extraordinary new performance analysis tool. There are, however, some trends already revealing themselves. Specifically, all programs have performance problems, and (almost?) all performance problems on Solaris 10 systems can be revealed via DTrace. Jarod Jenson, Chief Systems Architect at Aeysis, spends a lot of his time using DTrace to track down performance problems.

I thought it would be interesting to talk with Jarod and hear his thoughts on DTrace. Read on and you will find that he does have many revealing things to say. Especially of interest is his information about common performance problems he's finding. Even if you don't use DTrace (and of course you should), those field findings should still be of use in helping to improve your programs and systems. Consider this column to be a case of free, remote performance tuning.

Q & A with Jarod

Peter Galvin: Tell us a little about yourself?

Jarod Jenson: I am currently the Chief Systems Architect at Aeysis. I spend the majority of my time traveling to two to three customers per week looking at performance problems on business-critical systems. I focus mostly on the Solaris platform because of DTrace, but I also look at Linux-based applications.

PG: How long have you been using DTrace?

DTrace for a little over four years now. I was using DTrace before its integration into Solaris and was the first Alpha customer of DTrace.

PG: What kinds of problems have you solved?

JJ: That is an extensive list. At two to three customers per week over about a two-year period (some are repeats), you can imagine how many applications I have seen and used DTrace on.

First of all, I would like to point out that currently there is a 100% success rate. In every case that DTrace has been used to look at a customer-written application, we have gotten a performance increase. Some are modest -- 10% or so, but others are massive -- up to 50x. There are also non-performance-related engagements, but those are much less frequent.

I have used DTrace on customer-written applications -- C/C++ and Java mostly (some Fortran) -- third-party applications including major databases, and, of course, the system itself (Solaris).

Here are some specific examples:

Recently, I was looking at a Java application with a financial firm. It was very well written, and they were having trouble identifying exactly why the performance was less than they expected. They had used almost every tool out there for profiling Java applications over many months with little success. I came in to look at the application with DTrace.

From the time my hands touched the keyboard until we had a solid hypothesis about the issue was approximately five minutes. So, why on an application that they had been looking at for months were we able to find something so quickly? The answer is simple -- the true systemic observability of DTrace. The problem was not in Java code per se, but in native code they didn't even know was there. With DTrace, the layers of software are irrelevant and easily observed because we have providers for every aspect of the system.

There were two issues at the core of the problem. In a third-party XML library, someone had the keen idea of creating a cache of string objects. The issue was that their usage was also causing Java to intern the strings -- something that happens in native code. This was causing a jump into JNI, many string comparisons, and memory allocations. The second issue was that in the logging framework, they were trying to get class, method, and line information for exceptions. This meant having the JVM fill in a stack trace for them. This too was interning strings and causing the same issues.

So, the bottom line was that they were seeing allocation contention in native code (this was heavily threaded, and the default Solaris memory allocator does not scale as well as needed) and were doing many string operations they didn't need to. We addressed both of these sources, and they saw a massive improvement in performance and scalability. No tools that exist solely in the JVM and rely on JVMTI or JVMPI could have found this issue -- none of them.

Here is another scenario that happens often. I was looking at a very old (more than ten years) application with a customer. They had aggressively jumped on the Solaris 10 bandwagon from a competing platform and were seeing stability and performance improvements across the board. However, they wanted to see if DTrace could do anything for them.

Their application was C and Fortran both and was pretty well optimized. They had one very old module that required them to execute a sub-process externally to complete a task. Clearly, they knew this was a sub-optimal solution, but for the reasons we all know too well with older code, it could not easily be addressed. They just avoided this area when looking for performance gains. Well, with DTrace something jumped right out at us, and it was something that no one really expected. The code was using system(3C) to execute this other task. This meant that they were fork/exec'ing /bin/sh, which in turn fork/exec'ed the actual application. There were many of these heavyweight processes, so the cost of all of this fork/exec activity was very high.

Since the code was old and the mindset of "just ignore it, we all know" had been passed down for development generations, no one had really thought to look in that source file. DTrace took us right to it and showed it was worse than expected (reviewing hundreds of thousands of lines of code is not fun -- let DTrace do it for you).

We just broke out our good friend posix_spawn(3C) that is in Solaris 10, and the fork/exec issue was almost directly cut in half. Not an earth-shattering example in terms of technical difficulty, but one that I think will hit home for many people when they start using DTrace daily -- simple things that have been overlooked for whatever reason are easily found with DTrace.

PG: When solving problems, do you still use the "usual suspect" tools (such as iostat(1M)) or is the best practice to go right to DTrace?

JJ: A DTrace invocation is never the first thing I do when I log into a box. The first command I run when I log into a system is mpstat(1M). Not uname(1) or anything else. Always mpstat(1M). This is the best tool to get a quick overview of the system.

All of my DTrace work at a customer is based on information I have gotten from mpstat(1M), iostat(1M), prstat(1M), etc. I can say with some certainty that if you neglect these standard tools, you will have difficulty formulating appropriate hypotheses early on.

PG: Have you categorized the kinds of problems you discover with DTrace? If so, what are the most common ones?

JJ: There are common mistakes that seem to be widespread. I address some of them in my ACM interview at:

http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=363)
	        
so I will mention some others here.

One issue I have run into several times lately is something that plagues C++ applications. They have an object constructor that memset(3C)'s some member to NULL. Then, in the method that creates the object, they memset(3C) that object again to NULL. Then they pass it to another method, that, you guessed it, memset(3C)'s it to NULL. The worst I have seen of this was the same character buffer memset(3C) a total of seven times before it was actually used. memset(3C) is fast, but not necessarily cheap.

I have to mention again the Java notifyAll() and pthread_cond_broadcast(3C) or cond_broadcast(3C) calls. These crop up about three times a month as a reason for high-context switch rates, lock contention, and non-deterministic latency. Don't use these unless you have to. This issue is discussed more in-depth in the ACM article.

From a system perspective, the biggest issues I see are related to I/O configuration, bad tuning, and most of all, over-monitoring.

For the I/O issue, people seem to have pretty lax rules for what should be considered good service times. For high-performance critical systems, just the fact you are doing I/O is an issue. When I see service times in the tens of milliseconds for writes, I know there is some gravy to speed up performance. It can be software, HW, or configuration that causes this slow I/O performance. For databases, make sure you cache as much in the DB as possible. Do everything you can to avoid I/O -- I/O is so much slower than memory. With DTrace's io provider, identifying and understanding the sources of I/O activity is trivial.

Then there is the bad tuning issue. /etc/system is not a portable file. You must understand everything in that file and the repercussions of the setting if it is incorrect. You should never carry an /etc/system from release to release. In fact, it should be re-examined even when patches are applied. The same applies for non /etc/system-related tuning -- nfs, I/O, TCP, etc. It is good to tune, but base the tuning on your workload, not a posting from Google. This is where DTrace is quite handy. You can precisely model a workload with DTrace. Many customers are creating DTrace scripts that serve as a baseline of their application. Then, as they make tuning or code changes, they re-enlist those scripts to truly understand what has changed (if anything) about the behavior of the application. DTrace makes for one heck of a regression-testing tool.

And the last one -- over-monitoring. I looked at one system recently that was "idle". That is, we had shut the application down. This was a twelve-way SunFire 6900, and it still showed 30% utilization. All that was running were the monitoring tools. Basically, it was a fork/exec bomb. Even commercial tools do this. They sit in some loop and constantly run mpstat(1M), iostat(1M), or whatever. Minor faults, xcalls, and kernel spins on mutexes can all be seen higher due to this. There are two better ways to do this. One is to create a pipe(2) and have the tool run at an interval -- like mpstat 10 10000. No more fork/exec and no missed windows. The best way though is to read the kstats directly. There is a Perl interface to the kstats, so this should be trivial to adapt for monitoring tools. For example, just running:

dtrace -n proc:::exec-success'{printf("%s (PID %d) by %d\n", \
  execname, pid, ppid)}'
	
has garnered a few raised eyebrows when people see what their monitoring tools are up to.

PG: Where is that silver bullet DTrace script?

JJ: I get some form of this question at least once a week. It is either, "Where is your DTrace script collection?" or "Mail me a script to tell me why my system is slow." There is no magical tell-all script (that I know of). In terms of a collection of scripts, I honestly don't have one. Any customer I have worked with will tell you I create all of the scripts on the fly (and leave them for them to play with). Now, I understand this is not for everyone, but here is what I have learned in four years. You almost never (99%) solve the problem with the first DTrace script. The first script gives you the information needed to formulate a good hypothesis. It is that second or third application-specific script that gets you the details you need.

Let's look at an example:

For a financial application I was working on, I used the profile provider to get a statistical sampling of hot code paths:

dtrace -n profile-1001'/arg1 && pid == <pid>/{@[arg1] = count()}' 
	
(This script will work on FCS S10 -- I use a modified version for later releases.)

This told me that getenv(3C) accounted for a significant portion of our CPU time. Now here is the question, is this use case valid or not? You cannot tell from this script. The next script I wrote showed me the environment variable being requested and the return value ("DEBUG" and NULL). This probably should not be in a hot code path. But you can't tell that from the first script. This is a simple example, but it shows that you have to create a refined script to answer the question of whether something is an anomaly that can be corrected, or whether it is required behavior.

PG: Thanks very much for all of the great insight, Jarod!

Magic Tell-All Script!

Okay, as Jarod mentioned, there is no such thing. However, there are some already-written DTrace scripts that you might find useful, especially in that first phase of forming a performance problem hypothesis.

I would be remiss if I didn't point out the scripts that come with Solaris 10, in /usr/demo/dtrace. These scripts are right out of the DTrace manual published at docs.sun.com. Rather than type them in as you read the manual, you can access them in the demo directory.

The other great DTrace example resource is Brendan Gregg's Web site:

http://users.tpg.com.au/adsln4yb/dtrace.html
Here you will find full, useful DTrace programs, as well as one-line commands and other useful things. While the programs are useful on their own, they are great examples to use while learning DTrace. Start from the scripts and modify them to suit your needs and to learn more about how DTrace works.

Other useful resources include scripts written by Richard McDougall for the Solaris Internals tutorial he teaches (with Jim Mauro) and from their forthcoming update to that book. Have a look at:

http://www.solarisinternals.com/si/dtrace/index.php
	
A good place to start on your path to DTrace nirvana is the BigAdmin DTrace page at:

http://www.sun.com/bigadmin/content/dtrace/
Finally, Sun has moved all DTrace discussion (including Sun kernel engineering discussion) to the Web. Read along or join in with questions and suggestions; it's all at:

http://www.opensolaris.org/os/community/dtrace/
Peter Baer Galvin (http://www.petergalvin.info) is the Chief Technologist for Corporate Technologies (www.cptech.com), a premier systems integrator and VAR. Before that, Peter was the systems manager for Brown University's Computer Science Department. He has written articles for Byte and other magazines, and previously wrote Pete's Wicked World, the security column, and Pete's Super Systems, the systems management column for Unix Insider (http://www.unixinsider.com). Peter is coauthor of the Operating Systems Concepts and Applied Operating Systems Concepts textbooks. As a consultant and trainer, Peter has taught tutorials and given talks on security and systems administration worldwide.