An Interview with DTrace Guru Jarod Jenson
Peter Baer Galvin
The world is still in the early days of exploring and
using DTrace, Sun's extraordinary new performance analysis tool.
There are, however, some trends already revealing themselves. Specifically, all programs have performance problems,
and (almost?) all performance problems on Solaris 10 systems can be revealed via DTrace. Jarod Jenson, Chief Systems
Architect at Aeysis, spends a lot of his time using DTrace to track down
performance problems.
I thought it would be interesting to talk with Jarod
and hear his thoughts on DTrace. Read on and you will find that he does
have many revealing things to say. Especially of interest is his
information about common performance problems he's finding. Even if
you don't use DTrace (and of course you should), those field findings
should still be of use in helping to improve your programs and systems.
Consider this column to be a case of free, remote performance tuning.
Q & A with Jarod
Peter Galvin: Tell us a little about yourself.
Jarod Jenson: I am
currently the Chief Systems Architect at Aeysis. I spend the majority of my
time traveling to two to three customers per week looking at performance
problems on business-critical systems. I focus mostly on the Solaris
platform because of DTrace, but I also look at Linux-based applications.
PG: How long have you been using DTrace?
JJ: I have been using DTrace for a little over four years now. I was using DTrace before its
integration into Solaris and was the first alpha customer of DTrace.
PG: What kinds of problems have you solved?
JJ: That is an extensive list. At two to three customers per week over about a two-year
period (some are repeats), you can imagine how many applications I have
seen and used DTrace on.
First of all, I would like to point out that currently
there is a 100% success rate. In every case that DTrace has been used to
look at a customer-written application, we have gotten a performance
increase. Some are modest -- 10% or so, but others are massive --
up to 50x. There are also non-performance-related engagements, but those
are much less frequent.
I have used DTrace on customer-written applications
-- C/C++ and Java mostly (some Fortran) -- third-party
applications including major databases, and, of course, the system itself
(Solaris).
Here are some specific examples:
Recently, I was looking at a Java application with a
financial firm. It was very well written, and they were having trouble
identifying exactly why the performance was less than they expected. They
had used almost every tool out there for profiling Java applications over
many months with little success. I came in to look at the application with
DTrace.
From the time my hands touched the keyboard, it took approximately five
minutes to form a solid hypothesis about the issue. So why, on an
application they had been looking at for months, were we able
to find something so quickly? The answer is simple -- the true
systemic observability of DTrace. The problem was not in Java code per se,
but in native code they didn't even know was there. With DTrace, the
layers of software are irrelevant and easily observed because we have
providers for every aspect of the system.
There were two issues at the core of the problem. In
a third-party XML library, someone had the keen idea of creating a cache of
string objects. The issue was that their usage was also causing Java to
intern the strings -- something that happens in native code. This was
causing a jump into JNI, many string comparisons, and memory allocations.
The second issue was that in the logging framework, they were trying to get
class, method, and line information for exceptions. This meant having the
JVM fill in a stack trace for them. This too was interning strings and
causing the same issues.
So, the bottom line was that they were seeing
allocation contention in native code (this was heavily threaded, and the
default Solaris memory allocator does not scale as well as needed) and were
doing many string operations they didn't need to. We addressed both
of these sources, and they saw a massive improvement in performance and
scalability. No tools that exist solely in the JVM and rely on JVMTI or
JVMPI could have found this issue -- none of them.
Here is another scenario that happens often. I was
looking at a very old (more than ten years) application with a customer.
They had aggressively jumped on the Solaris 10 bandwagon from a competing
platform and were seeing stability and performance improvements across the
board. However, they wanted to see if DTrace could do anything for them.
Their application was written in both C and Fortran and was
pretty well optimized. They had one very old module that required them to
execute a sub-process externally to complete a task. Clearly, they knew
this was a sub-optimal solution, but for the reasons we all know too well
with older code, it could not easily be addressed. They just avoided this
area when looking for performance gains. Well, with DTrace something jumped
right out at us, and it was something that no
one really expected. The code was using system(3C) to execute this
other task. This meant that they were fork/exec'ing /bin/sh, which in
turn fork/exec'ed the actual application. There were many of these
heavyweight processes, so the cost of all of this fork/exec activity was
very high.
Since the code was old and the mindset of "just
ignore it, we all know" had been passed down for development
generations, no one had really thought to look
in that source file. DTrace took us right to it and showed it was worse
than expected (reviewing hundreds of thousands
of lines of code is not fun -- let DTrace do it for you).
We just broke out our good friend posix_spawn(3C), which ships in Solaris 10,
and the fork/exec activity was cut almost in half. Not an
earth-shattering example in terms of technical difficulty, but one that I
think will hit home for many people when they start using DTrace daily -- simple
things that have been overlooked for whatever reason are easily found with
DTrace.
PG: When solving problems,
do you still use the "usual suspect" tools (such as iostat(1M)) or is the best
practice to go right to DTrace?
JJ: A DTrace invocation is never the first thing I do when I log into a box. The first command I
run is mpstat(1M). Not uname(1) or anything else. Always mpstat(1M). It is the best tool for getting a quick overview of the system.
All of my DTrace work at a customer is based on
information I have gotten from mpstat(1M), iostat(1M), prstat(1M), etc. I can say with some certainty that if you neglect these
standard tools, you will have difficulty formulating appropriate hypotheses
early on.
PG: Have you
categorized the kinds of problems you discover with DTrace? If so, what are
the most common ones?
JJ: There are common
mistakes that seem to be widespread. I address some of them in my ACM
interview at:
http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=363
so I will mention some others here.
One issue I have run into several times lately is
something that plagues C++ applications. They have an object constructor
that memset(3C)'s
some member to NULL. Then, in the method that creates the object, they memset(3C) that object again
to NULL. Then they pass it to another method
that, you guessed it, memset(3C)'s it to NULL. The worst
case I have seen was the same character buffer being memset(3C)'d a total of seven times before it was
actually used. memset(3C) is fast, but not necessarily cheap.
I have to mention again the Java notifyAll() and pthread_cond_broadcast(3C) or cond_broadcast(3C) calls.
These crop up about three times a month as a reason for high context-switch
rates, lock contention, and non-deterministic latency. Don't use
these unless you have to. This issue is discussed more in-depth in the ACM
article.
From a system perspective, the biggest issues I see
are related to I/O configuration, bad tuning, and most of all,
over-monitoring.
For the I/O issue, people seem to have pretty lax
rules for what should be considered good service times. For
high-performance critical systems, just the fact you are doing I/O is an
issue. When I see service times in the tens of milliseconds for writes, I
know there is some gravy to speed up performance. It can be software, hardware,
or configuration that causes this slow I/O performance. For databases, make
sure you cache as much in the DB as possible. Do everything you can to
avoid I/O -- I/O is so much slower than memory. With DTrace's io
provider, identifying and understanding the sources of I/O activity is
trivial.
Then there is the bad tuning issue. /etc/system is
not a portable file. You must understand
everything in that file and the repercussions of each setting if it is
incorrect. You should never carry an /etc/system from release to release.
In fact, it should be re-examined even when patches are applied. The same
applies to tuning outside /etc/system
-- NFS, I/O, TCP, etc. It is good to tune, but base the tuning on your
workload, not a posting from Google. This is where DTrace is quite handy.
You can precisely model a workload with DTrace. Many customers are creating
DTrace scripts that serve as a baseline of their application. Then, as they
make tuning or code changes, they re-enlist those scripts to truly
understand what has changed (if anything) about the behavior of the
application. DTrace makes for one heck of a regression-testing tool.
And the last one -- over-monitoring. I looked at
one system recently that was "idle". That is, we had shut the
application down. This was a twelve-way Sun Fire 6900, and it still showed
30% utilization. All that was running were the monitoring tools. Basically,
it was a fork/exec bomb. Even commercial tools do this. They sit in some
loop and constantly run mpstat(1M), iostat(1M), or whatever. Minor faults, xcalls, and kernel mutex spins
are all elevated as a result. There are two better approaches.
One is to create a pipe(2) and have the tool run at an interval -- like mpstat 10 10000. No more
fork/exec and no missed windows. The best way though is to read the kstats
directly. There is a Perl interface to the kstats, so this should be
trivial to adapt for monitoring tools. For example, just running:
dtrace -n proc:::exec-success'{printf("%s (PID %d) by %d\n", \
execname, pid, ppid)}'
has garnered a few raised eyebrows when people see
what their monitoring tools are up to.
PG: Where is that silver bullet DTrace script?
JJ: I get some form
of this question at least once a week. It is either, "Where is your
DTrace script collection?" or "Mail me a script to tell me why
my system is slow." There is no magical tell-all script (that I know
of). In terms of a collection of scripts, I honestly don't have one.
Any customer I have worked with will tell you I create all of the scripts
on the fly (and leave them behind for the customer to play with). Now, I understand this
is not for everyone, but here is what I have learned in four years. You
almost never (99%) solve the problem with the first DTrace script. The
first script gives you the information needed to formulate a good
hypothesis. It is that second or third application-specific script that
gets you the details you need.
Let's look at an example:
For a financial application I was working on, I used
the profile provider to get a statistical sampling of hot code paths:
dtrace -n profile-1001'/arg1 && pid == <pid>/{@[arg1] = count()}'
(This script will work on FCS S10 -- I use a
modified version for later releases.)
This told me that getenv(3C) accounted for a significant portion of our CPU time.
Now here is the question: is this usage valid or not? You cannot tell
from this script. The next script I wrote showed me the environment
variable being requested and the return value ("DEBUG" and
NULL). This probably should not be in a hot code path. But you can't
tell that from the first script. This is a simple example, but it shows
that you have to create a refined script to answer the question of whether
something is an anomaly that can be corrected, or whether it is required
behavior.
PG: Thanks very much for all of the great insight, Jarod!
Magic Tell-All Script!
Okay, as Jarod mentioned, there is no such thing.
However, there are some already-written DTrace scripts that you might find
useful, especially in that first phase of forming a performance problem
hypothesis.
I would be remiss if I didn't point out the
scripts that come with Solaris 10, in /usr/demo/dtrace. These scripts are
right out of the DTrace manual published at docs.sun.com. Rather than type
them in as you read the manual, you can access them in the demo directory.
The other great DTrace example resource is Brendan
Gregg's Web site:
http://users.tpg.com.au/adsln4yb/dtrace.html
Here you will find full, useful DTrace programs, as
well as one-line commands and other useful things. While the programs are
useful on their own, they are great examples to use while learning DTrace.
Start from the scripts and modify them to suit your needs and to learn more
about how DTrace works.
Other useful resources include scripts written by
Richard McDougall for the Solaris Internals tutorial he teaches (with Jim
Mauro) and from their forthcoming update to that book. Have a look at:
http://www.solarisinternals.com/si/dtrace/index.php
A good place to start on your path to DTrace nirvana
is the BigAdmin DTrace page at:
http://www.sun.com/bigadmin/content/dtrace/
Finally, Sun has moved all DTrace discussion
(including Sun kernel engineering discussion) to the Web. Read along or
join in with questions and suggestions; it's all at:
http://www.opensolaris.org/os/community/dtrace/
Peter Baer Galvin (http://www.petergalvin.info) is the Chief Technologist for Corporate Technologies (www.cptech.com), a premier
systems integrator and VAR. Before that, Peter was the systems manager for
Brown University's Computer Science Department. He has written
articles for Byte and other magazines, and previously wrote Pete's Wicked World, the
security column, and Pete's Super Systems, the systems management
column for Unix Insider (http://www.unixinsider.com). Peter is coauthor of the Operating
System Concepts and Applied Operating System Concepts textbooks.
As a consultant and trainer, Peter has taught tutorials and given talks on
security and systems administration worldwide.