Article

dec2006.tar

Learning DTrace -- Part 4: Solving Everyday Problems

Chip Bennett

In part three of this series on DTrace, I covered a lot of detail on the power of the D language. In this part, I'll use that power to look at two common uses for DTrace: application profiling and finding system performance problems.

Profiling Applications

Profiling is the process of analyzing an application's characteristics, usually for the purpose of improving its performance. Before you can profile a program, however, you need to give the D program the ability to restrict recorded events to those related to just one program. There are a couple of ways to accomplish this.

First, you can associate your D program with one or more processes by using either the -c option to specify a command to execute or the -p option to specify the process ID of an already running process. You can have multiple instances of both options, but the first one specified gets its process ID stored in a special macro variable, $target. (Macro variables get replaced with their associated value before the D program actually starts, so any place in your D program you use $target will get replaced with the process ID number of the first -c or -p process you specify.) As an example, you might use the $target macro in a predicate to restrict probe clause execution to just the one process. After all of the associated processes exit, the D program will exit too, displaying any default ending output or running your END probe.

Second, you can simply pass a process ID number as an argument to the D program. Any arguments, after any D program options, are passed to macro variables $1, $2, $3, etc. (The $0 macro is the name of your D program.) Since DTrace has no way of knowing that your arguments are processes, when using this method your D program doesn't exit when the processes exit.

A good first step, when profiling, is to gather information on where a process is spending its time. For my profiling experiment, I'm going to compare the cp command and the dd command for copying a file. If I start out with a 50M file called file1 and run time cp file1 file2, I get:

real        4.1 
user        0.0 
sys         0.9

However, if instead a copy the file using time dd if=file1 of=file2:

real        7.5 
user        1.0 
sys         4.6

The dd command took almost twice as long to copy the file and used more than four times as much system time [1]. Because the difference in system time is so high, let's do a comparison of the system calls that each command is executing. To gather a system call count, I'll use the following D program (scount.d):

syscall:::entry 
/ $target == pid / 
{ 
   @[probefunc] = count() 
}

As you may recall from part three of this series, this D program uses the count aggregating function to gather the total number of times each system call is invoked. For the syscall provider, probefunc is the system call name, so the count aggregation will be keyed on system calls. The predicate restricts our data collection to system calls made by the target process. Now I'll run each copy command against this D program:

# dtrace -s scount.d -c "cp file1 file2" 
dtrace: script 'scount.d' matched 226 probes 
dtrace: pid 2194 has exited 

  creat64                                    1 
  fstat64                                    1 
  getpid                                     1 
  getrlimit                                  1 
  open64                                     1 
  pathconf                                   1 
  rexit                                      1 
  acl                                        2 
  brk                                        2 
  memcntl                                    2 
  setcontext                                 2 
  open                                       3 
  resolvepath                                3 
  stat                                       3 
  close                                      5 
  stat64                                     5 
  mmap64                                     7 
  munmap                                     7 
  write                                      7 
  mmap                                      15 

# dtrace -s scount.d -c "dd if=file1 of=file2" 
dtrace: script 'scount.d' matched 226 probes 
102400+0 records in 
102400+0 records out 
dtrace: pid 2200 has exited 

  creat64                                    1 
  fstat64                                    1 
  getpid                                     1 
  getrlimit                                  1 
  open                                       1 
  open64                                     1 
  resolvepath                                1 
  rexit                                      1 
  stat                                       1 
  sysconfig                                  1 
  munmap                                     2 
  setcontext                                 2 
  sigaction                                  2 
  close                                      3 
  brk                                        4 
  mmap                                       4 
  read                                  102401 
  write                                 102406

Two differences between cp and dd stand out: the cp command only uses seven writes, whereas the dd command uses 102,406 writes to accomplish the same copy. Also, the cp command is suspiciously missing any reads at all. So then how is it reading file1 to copy it?

Clearly, even this high level of analysis gives us some notions as to what might be going on here. Since dd is making many more write calls than cp, I'm led to believe that it's probably writing smaller blocks than cp. Also, the lack of any read calls in cp makes me wonder if cp has a different, possibly better, way to read files.

Here is where we see the power of DTrace. The D programming language gives you the tools to look at both the forest and the trees, and any zoom level between. To prove or disprove our theories, we'll need to look at the system calls more closely. Let's take a look at the write system calls. The write system call has three arguments (see man -s 2 write). The first argument is the file descriptor, the second is a pointer to the buffer to be written, and the third is the number of bytes to be written. To focus on the write system calls only, we'll create a new D program (wcount.d):

syscall::write:entry 
/ $target == pid / 
{ 
   @[arg0] = quantize(arg2); 
}

For this example, I've introduced a new aggregating function, quantize. By applying quantize to the length of the write, it displays a frequency distribution of the write lengths. A separate distribution graph is displayed for each aggregation key, which in this case is the file descriptor of the write:

 
# dtrace -s wcount.d -c "cp file1 file2"
dtrace: script 'wcount.d' matched 1 probe 
dtrace: pid 2264 has exited 

              
                4 
           value  ------------- Distribution ------------- count 
         1048576 |                                         0 
         2097152 |@@@@@@                                   1 
         4194304 |                                         0 
         8388608 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       6 
        16777216 |                                         0 

# dtrace -s wcount.d -c "dd if=file1 of=file2"
dtrace: script 'wcount.d' matched 1 probe 
102400+0 records in 
102400+0 records out 
dtrace: pid 2266 has exited 

                2 
           value  ------------- Distribution ------------- count 
               1 |                                         0 
               2 |@@@@@@@@@@@@@                            2 
               4 |@@@@@@@@@@@@@                            2 
               8 |@@@@@@@@@@@@@                            2 
              16 |                                         0 
              
                4 
           value  ------------- Distribution ------------- count 
             256 |                                         0 
             512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 102400 
            1024 |                                         0

Each row in the output represents a "pail". (The Solaris Dynamic Tracing Guide [2] calls it a bucket, so I'll use pail.) Each pail contains values that are as big as or smaller than the value listed at the left of the row, but bigger than the value on the previous row (r_n < x <= r_n₊₁). The value r for each row increases in powers of two. For example, the second row of the third distribution says that there were 102,400 writes that were greater than 256 bytes but less than or equal to 512.

In this case, I suspect that most of the writes are the same length, so it really isn't sufficient to just know which pail the write falls in. We need to know exactly how many writes occurred for each buffer size. Even though quantize produces prettier output, to accomplish this I'm going back to using count (wcount2.d):

syscall::write:entry 
/ $target == pid / 

{ 
   @[arg0, arg2] = count(); 
}

Running this script against each copy command produced the following results for me:

# dtrace -s wcount2.d -c "cp file1 file2" 
dtrace: script 'wcount2.d' matched 1 probe 
dtrace: pid 13164 has exited 
              
        4          2097152               1 
        4          8388608               6 
# dtrace -s wcount2.d -c "dd if=file1 of=file2" 
dtrace: script 'wcount2.d' matched 1 probe 
102400+0 records in 
102400+0 records out 
dtrace: pid 13166 has exited 
              
        2               12               1 
        2               13               1 
        2                2               2 
        2                6               2 
        4              512               102400

The cp command only had seven writes: six 8M writes and one 2M write, for a total of 50M. The dd command wrote to two file descriptors. File descriptor 2 is standard error, so we conclude that these writes were used to display the record count messages. That leaves file descriptor 4 with 102,400 512 byte writes. These writes also add up to 50M of data. Our conclusion is correct: dd is doing a lot more small writes. It turns out that the dd command has a blocksize parameter that allows us to make dd use an 8M buffer for copying. If I run time dd if=file1 of=file2 bs= 8192k, I get the following results:

6+1 records in 
6+1 records out 

real        3.3 
user        0.0 
sys         1.3

The messages from dd clearly show that dd is now writing six 8M blocks and one 2M block. We could prove that with our D script, but I'll leave that for you to try on your own. The system call time is still almost 50% higher for dd than it is for cp. Perhaps this has something to do with cp not using the read system call. If so, how is cp reading the file?

The only other system call, in the call summary, that has to do with file IO, is mmap. This system call maps an area of memory over a file segment, so that the file becomes part of virtual memory. As the program accesses the memory, Solaris pages in sections of the file. There are no read system calls. There is still IO, of course, but it's all managed by the system's paging algorithms, which are very efficient. Sounds like a good theory. How do we prove it? Like we did with write, we can restrict the D program to just look at the mmap system call (mcount.d):

syscall::mmap*:entry 
/ $target == pid / 
{ 
   @[(int)arg4, arg1] = count(); 
}

The mmap system call has six arguments (see man -s 2 mmap), but only the size of the allocated area of memory (arg1) and the file descriptor (arg4) are of interest to us right now. I used a metacharacter on the probefunc name because there are two mmap system calls: mmap and mmap64. Also, note that I cast the aggregation key, arg4, into an int. The argn parameters are typed unsigned, and the file descriptor argument to mmap can be a -1, when the allocated area is anonymous, that is, with no existing file mapped behind it. Here's the output from the command dtrace -s mcount.d -c "cp file1 file2":

dtrace: script 'mcount.d' matched 2 probes 
dtrace: pid 14500 has exited 

        3              296               1 
        3             3260               1 
        3            13512               1 
        3            13613               1 
        3            55381               1 
        3          2097152               1 
       -1             5880               1 
       -1            16384               1 
       -1            24576               1 
       -1            81920               1 
       -1           147456               1 
       -1             8192               2 
        3             8192               3 
        3          8388608               6

There is a distribution around some anonymous mmaps (file descriptor -1), and a distribution around file descriptor 3. (Note that the counts for file descriptor 3 are split into two groups. This is because count sorts its output by the count value [3]). It's file descriptor 3 that is of more interest. However, there are several different memory size allocations. How do we know which ones have to do with reading file1? I'm suspicious that the mmaps of size 8M and 2M are the ones we're looking for, but we should prove it.

To do this, we need to restrict our output to mmap calls that are only associated with reading "file1". In our D program, we can set up a control variable that gets set once file1 is opened (mcount2.d):

BEGIN 
{ 
   fd = -1;  /* -1 means file1 not open */ 
} 

/* If file1 not open, save the file name pointer */ 

syscall::open*:entry 
/ $target == pid && fd == -1 / 
{ 
   fnp = arg0; 
} 

/* If file1 not open, open succeeded, 
      and filename is file1, save fd */ 

syscall::open*:return 
/ $target == pid && fd == -1 && arg0 != -1 / 
{ 
   fd = basename(copyinstr(fnp)) == "file1" ? arg0 : -1; 
} 

/* If file1 is open and the mmap fildes is file1's, 
      collect size data */ 

syscall::mmap*:entry 
/ $target == pid && fd != -1 && fd == arg4 / 
{ 
   @[arg4, arg1] = count(); 
} 

/* If close is for file1, show that file1 
      no longer open */ 

syscall::close*:entry 
/ $target == pid && arg0 == fd / 
{ 
   fd = -1; 
}

(Note that I used global variables rather than thread-local variables, since the cp command and the dd command are single threaded.) Once the D program sees an open system call for file1, it sets fd to the file descriptor from that open. The probe clause for the mmap probe is predicated on file1 being open, so we only collect data for mmaps associated with file1. (For information on why we need to handle the file name pointer from the open: entry in the open:return probe clause, see part 2 of this series.) Here's what we get from running the command dtrace -s mcount2.d -c "cp file1 file2":

dtrace: script 'mcount2.d' matched 8 probes 
dtrace: pid 14530 has exited 
              
        3          2097152                1 
        3          8388608                6

This clearly shows us that the cp command maps 8M of memory to file1 six times and 2M of memory to file1 one time. But just because file1 is being mmapped, does this prove that this is how file1 is being read? Because there are no other file IO system calls present, probably; but there is one more step we might take to further prove it. It would be difficult to trace the cp program actually accessing the mmapped memory locations, but we don't really need to. All we need to see is the kernel paging in the file data from memory. To this end, here is a change to mcount2.d (mcount3.d):

#pragma D option quiet 
BEGIN 
{ 
   fd = -1;  /* -1 means file1 not open */ 
   pc = 0;   /* Page-in count per mmap  */ 
} 
syscall::open*:entry 
/ $target == pid && fd == -1 / 
{ 
   fnp = arg0; 
} 
syscall::open*:return 
/ $target == pid && fd == -1 && arg0 != -1 / 
{ 
   fd = basename(copyinstr(fnp)) == "file1" ? arg0 : -1; 
} 
syscall::mmap*:entry 
/ $target == pid && fd != -1 && fd == arg4 / 
{ 
   printf ("Last mmap page-in count %d\n", pc); 
   pc = 0; 
} 
vminfo:::fspgin 
/ $target == pid && fd != -1 / 
{ 
   pc++; 
} 
syscall::close*:entry 
/ $target == pid && arg0 == fd / 
{ 
   fd = -1; 
} 
END 
{ 
   printf ("Last mmap page-in count %d\n", pc); 
}

I've added a new probe, vminfo:::fspgin. The vminfo probes fire when virtual memory (paging and swapping) events occur. The event we are interested in is the page in of filesystem data between mmap calls. To this end, the D program has a page counter (pc), which gets bumped every time the fspgin probe fires. When the mmap probe clause executes, indicating a new memory segment was mapped over file1, pc gets displayed, and then set back to zero. The first time we run dtrace -s mcount3.d -c "cp file1 file2", we see the following:

Last mmap page-in count 0 
Last mmap page-in count 67 
Last mmap page-in count 64 
Last mmap page-in count 69 
Last mmap page-in count 64 
Last mmap page-in count 65 
Last mmap page-in count 64 
Last mmap page-in count 18

The first count is zero, because this is the first mmap and no mmapped memory has been accessed yet. Every mmap call after that shows that 60 to 70 page-ins have occurred after each mmap call. The paging activity is fairly consistent and would seem to be a result of the cp command accessing the memory that was just mapped in. The last count is a lot smaller because the last mmapped file area was only 2M. With cache now loaded with the contents of file1, if we run the command again, we get this instead:

Last mmap page-in count 0 
Last mmap page-in count 0 
Last mmap page-in count 0 
Last mmap page-in count 0 
Last mmap page-in count 0 
Last mmap page-in count 0 
Last mmap page-in count 0 
Last mmap page-in count 0

Since the contents of file1 was still in the cache, no page-ins were needed to do the copy a second time [4].

Solving a System Performance Problem -- Anonymous CPU Time

Another common use of DTrace is to discover system problems that conventional tools have trouble with. In this example, I seem to be using a lot of CPU time, but there seem to be no processes to blame. vmstat 5 gives me the following output:

kthr      memory            page            disk          faults      cpu 
r b w   swap  free  re  mf pi po fr de sr dd dd s1 --   in   sy   cs us sy id 
0 0 21 2773568 815736 2  8  3  3  4  0 29  0  0  0  0  463   26  100  0  1 99 
0 0 37 2641992 768704 2474 5583 0 0 0 0 0  0 19  0  0  499 4493  268 55 45  0 
0 0 37 2641968 768928 2493 5616 0 0 0 0 0  0 19  0  0  499 4494  259 55 45  0

The system has no idle time left and is spending about half its time in system calls and about half in user code. Let's take a look at prstat to determine which processes are using up all this time:

  PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP 
27494 root     1576K 1120K sleep   49    0   0:00:00 0.7% callfast/1 
28891 root     4048K 2072K run     29    0   0:00:00 0.2% ld/1 
28041 root     4792K 4424K cpu0    59    0   0:00:00 0.2% prstat/1 
28887 root     1352K  960K sleep   19    0   0:00:00 0.1% gcc/1 
28890 root     1320K  928K run     19    0   0:00:00 0.0% collect2/1

The process time leaders don't seem to show any process using this kind of CPU time. Quite a bit of the CPU time is being spent on system calls, so let's take a closer look at exactly which system calls are being invoked. We'll need to modify our original scount.d script, since this time we're not looking at a specific target. Also, even though I could just collect data for a while and then interrupt my D program, this time I'd like for it to run exactly 5 seconds and exit on its own. To do this, I'll need to be creative with the profile provider. I'll call this new script sc.d:

#pragma D option quiet 
BEGIN 
{ 
   tick2 = 0; 
} 
tick-5s 
/ tick2 / 
{ 
   exit (0); 
} 
tick-5s 
/ !tick2 / 
{ 
   trace ("Collecting data\n"); 
   tick2 = 1; 
} 
syscall:::entry 
/ tick2 / 
{ 
   @[probefunc] = count(); 
}

The control variable, tick2, allows us to collect data on the second firing of the tick-5s probe. As you may recall, we can't rely on the profile probes waiting the specified time the first time they fire, so we want to collect data for 5 seconds on the first firing of tick-5s. The first time tick-5s fires, the probe clause sets tick2 = 1, which makes the predicate true for the sycall:::entry probe clause. The D program collects data for 5 seconds and then exits on the next firing of tick-5s. When I ran this D program against my busy system, I got the following output:

Collecting data 

  lwp_continue                                 1 
  lwp_create                                   1 
  lwp_exit                                     1 
  lwp_kill                                     1 
  nanosleep                                    1 
  stat64                                       1 
  pset                                         3 
  pollsys                                      4 
  gtime                                        7 
  lwp_park                                    10 
  vfork                                       25 
  chmod                                       26 
  rusagesys                                   26 
  open64                                      50 
  getgid                                      52 
  getuid                                      52 
  umask                                       52 
  sysconfig                                   75 
  times                                       75 
  schedctl                                    77 
  fork1                                      103 
  lwp_self                                   103 
  unlink                                     125 
  rexit                                      127 
  exece                                      128 
  fcntl                                      129 
  read                                       155 
  waitsys                                    178 
  fstat64                                    206 
  write                                      210 
  getrlimit                                  228 
  memcntl                                    231 
  setcontext                                 257 
  llseek                                     259 
  ioctl                                      281 
  lseek                                      333 
  dup                                        377 
  fstat                                      384 
  resolvepath                                640 
  munmap                                     665 
  getpid                                     860 
  lwp_sigmask                                865 
  access                                     869 
  sigaction                                 1173 
  close                                     1375 
  open                                      1793 
  mmap                                      2441 
  stat                                      2479 
  brk                                       2950

There is a lot of system call activity here, so we need to pick one of the systems calls to get more detail. There were 1793 open calls over a 5-second period. Let's find out which files the system was so busy opening (oc.d):

syscall::open:entry 
{ 
   self->fnp = arg0; 
} 
syscall::open:return 
/ self->fnp / 
{ 
   @[copyinstr(self->fnp)] = count(); 
   self->fnp = 0; 
}

The output I received from executing this script was rather lengthy, so I'll just show you the more important last 10 lines:

... 
  
  /usr/sfw/lib/gcc/sparc-sun-solaris2.10/3.4.3/libgcc.a       54 
  /usr/sfw/lib/gcc/sparc-sun-solaris2.10/3.4.3/libgcc.so      54 
  /usr/sfw/lib/gcc/sparc-sun-solaris2.10/3.4.3/libgcc_eh.a    54 
  /usr/sfw/lib/gcc/sparc-sun-solaris2.10/3.4.3/libgcc_eh.so   54 
  /usr/sfw/lib/gcc/sparc-sun-solaris2.10/3.4.3/crt1.o         55 
  /usr/ccs/lib/libc.a                                        108 
  /usr/ccs/lib/libc.so                                       108 
  /lib/libc.so.1                                             110 
  /platform/SUNW,UltraAX-i2/lib/libc_psr.so.1                138 
  /var/ld/ld.config                                          138

This file names appear to be related to the GNU C compiler. This is a lot of open calls for just one invocation of the C compiler. I wonder how many times the compiler is being invoked. Let's check that out (exc.d):

syscall::exec*:return 
{ 
   @[execname] = count(); 
}

Running this script for five seconds on my busy system produced the following output (make sure you wait until after DTrace tells you how many probes were matched before you start the 5 second count):

  cc1                                                        30 
  collect2                                                   30 
  gas                                                        30 
  gcc                                                        30 
  ld                                                         30

This is a lot of calls to the C compiler for a 5-second period. We need to find out who is initiating all this activity. The easiest way to do that is to list the execs and who is making them (exlist.d):

#pragma D option quiet 
syscall::exec*:entry 
{ 
   self->exn = execname; 
} 
syscall::exec*:return 
/ self->exn != "" / 
{ 
   printf ("%s -> %s\n", self->exn, execname); 
   self->exn = 0; 
}

Running this D program made it clear who was starting all the compiles:

callfast -> gcc 
gcc -> cc1 
gcc -> gas 
gcc -> collect2 
collect2 -> ld 
callfast -> gcc 
gcc -> cc1 
gcc -> gas 
gcc -> collect2 
collect2 -> ld 
...

A process named callfast is calling the C compiler repeatedly. In fact, callfast is a script that I wrote to do exactly that. The short processes generated by the C compiler were so short-lived, they weren't around long enough to make a significant impact on prstat. However, DTrace was able to find the problem easily.

Summary

There are a variety of problems that are difficult to solve using traditional tools and methods previously available. DTrace makes getting to the bottom of these issues much easier. In the next and final part to this series, I'll show some other ways to use DTrace and tie up some loose ends about other options and probe providers.

Endnotes

1. You need to know that I cheated a little here in order to get consistent results. The first time you run a command that reads a file, the file gets read into cache. Every time you read the file after that, the data comes from the cache rather than disk. Obviously, this would taint the results. To compensate for this, I wrote a C program to allocate enough heap space to use all available real memory. (I don't advise using this technique on a production system. Using up that much swap space could impact other applications.) Before a timed run, I ran this program to flush the file cache:

main() 
{ 
   int i = 0; 
   while ((char *)malloc(1024*1024) != 0) 
   { 
      printf ("%d ", ++i); 
   } 
   exit(0); 
}

You don't actually need to run the program until it exits (runs out of virtual memory). Once it starts allocating very slowly, you've invalidated all of the real memory pages that were being used for file cache.

2. http://docs.sun.com/app/docs/doc/817-6223

3. In a future release of Solaris, there will be options to change the sort order for count. These options are already available in Nevada, the OpenSolaris release (see http://opensolaris.org).

4. Before you get all excited about memory mapping being so much better than regular file reads, you need to know that the designers of Solaris are aware of this. That is why Solaris already implements a regular file read using the system paging algorithms. Memory mapping the file uses a little less system time, but the IO for both ways is just as efficient.

Chip Bennett (cbennett@laurustech.com) is a systems engineer for Laurus Technologies (http://www.laurustech.com), a Sun Microsystems partner. He has more than 20 years experience with Unix systems, and holds a B.S. and M.S. in Computer Science. Chip consults with Laurus and Sun customers on a variety of Solaris related topics, and he co-chairs OS-GLUG, the OpenSolaris Great Lakes User Group (Chicago).