Article nov2006.tar

Learning DTrace -- Part 3: Advanced Scripting and Aggregations

Chip Bennett

In part two of this series on DTrace, I covered the basics of writing D programs. In this part, I'll go into more detail about the D language and tell you about some additional probe providers. Along the way, I'll show you some examples, expanding on last month's example script and developing a new one.

This leg of our adventure will deal more with the D language, which is based on C. I'll explain some of the common features, but it's beyond the scope of these articles to teach all of the capabilities of C. If you're not familiar with the C language, there are a number of online references and tutorials that can be found by searching the Internet for something like "C programming tutorials."

Arithmetic Expressions

D supports most of the arithmetic expressions that ANSI C supports. These expressions can be used in predicates, recording actions like printf, functions like exit, or in a statement by themselves. Standalone D expressions, like C expressions, must have a side effect to be useful. The two sets of operators with side effects are the assignment operators and the increment/decrement operators. For example, the expression 3+1 is a valid standalone expression, but it doesn't change anything.

One particular expression operation that D borrowed from C is especially useful in D. It is the conditional expression (expression1 ? expression2 : expression3). This is a powerful operation in C, but it's even more useful in D, because D doesn't have flow control statements [1]. Since this operation is so important in D and often unknown to casual C programmers, I'll touch on it here.

First, expression1 is evaluated. If the result is true (non-zero), then expression2 is evaluated and returned as the overall expression result. Otherwise, the result of expression 3 is returned. For example:

  printf ("The value of x is %s\n", x == 0 ? "zero" : "non-zero"); 
            
I'll tell you more about the usefulness of conditional expressions a little later.

DTrace Provider

DTrace has a number of providers for creating different type of probes. The only provider you've seen, unless you've been reading the manual, is the syscall provider, which provides a way to instrument Solaris system call ingress and egress.

There are times when you need to be able to perform actions only once: at the start of the run, and at the end of the run. DTrace has a provider, actually called the dtrace provider, that gives you this capability. The dtrace provider only has three probes. (Note that capitalization is important. These probe names are in all uppercase.):

  • BEGIN -- Fires once at the start of the script, before any other probes fire
  • END -- Fires once at the end of the script, after all the other probes fire
  • ERROR -- Fires when a clause execution error occurs
To expand on the D program shown in part two (execs.d), suppose you'd like to add a header and a trailer to the output. You can use the BEGIN probe to print a header before any of the other probes fire. It also might be useful to keep some kind of tally of probe events, to be printed at the conclusion of the run. The dtrace provider's END probe can be used for this purpose:

#pragma D option quiet 
dtrace:::BEGIN 
{ 
   printf ("%-20s %s\n", "===============", "=============="); 
   printf ("%-20s %s\n", "Execing Process", "Execed Process"); 
   printf ("%-20s %s\n", "===============", "=============="); 
   total = 0; 
} 
syscall::exec*:entry 
{ 
   self->exn = execname; 
} 
syscall::exec*:return 
/ self->exn != NULL / 
{ 
   printf ("%-20s %s\n", 
   self->exn, execname); 
   self->exn = 0; 
   total++; 
} 
dtrace:::END 
{ 
   printf ("%-20s %s\n", "===============", "=============="); 
   printf ("Total # of execs = %d\n", total); 
} 
  
In practice, you usually only specify BEGIN and END without the rest of the probe tuple; the dtrace provider probes are unique and well known.

You'll notice that the variable total is not prefixed with self->. This is because total is not a thread-local variable (see part two of this series for an explanation of thread-local variables). Since total is not one of the built-in variables, the D compiler will make it a global variable. Global variables maintain their value for the entire execution, and there is only one copy shared by all threads. Like other variable types in D, the compiler gets the type of a global variable from an explicit declaration or an assignment statement. All explicit declarations must be outside of the probe clauses.

How does DTrace decide when to fire the END probe? DTrace has an exit function that can be called from any clause. Additionally, you could type ^C to interrupt execution. When the D program calls exit, receives a SIGINT, or receives a SIGTERM, it executes the END clause(es) before exiting. Your output should look something like this:

===============      ============== 
Execing Process      Execed Process 
===============      ============== 
sshd                 sh 
sh                   locale 
sshd                 pt_chmod 
sshd                 sh 
sh                   quota 
sh                   cat 
sh                   mail 
^C 
===============      ============== 
Total # of execs = 7
  
Structures

In D, structures are declared and referenced the same way they are in C. There are two main reasons why you would use structures in D -- to organize your own data and to allow easy access to the myriad of structures in the kernel.

Like other kinds of declarations in D, you may only declare structures outside of probe clauses. The declaration syntax is the same as C, as is the dot operator used to reference a structure member (mystruct.x). Also, like C, if you are dealing with a pointer to a structure, you can access a member using the arrow operator (pmystruct->x). Most kernel structure references you run into will be pointers. They come from basically two sources: built-in variables and external variables.

Some of the built-in variables are pointers to structures. For example, curpsinfo is a pointer to the current psinfo structure (see man -s 4 proc). An extremely useful built-in variable is the args array. This is an array that represents the arguments passed to the probe clause. The args array is referenced just like any array in C, using a 0-based index. However, the args array has been given a special capability -- the probe provider can make each element of the args array a different type. For example, one probe might define args[0] an integer and args[1] a pointer to a structure. Another probe might define args[0], args[1], and args[2] to be pointers to different structures.

This flexibility allows strongly typed access to the most useful data that each probe has to offer. Note that not all probes have the args array available. Some probes don't have any arguments. Others have arguments, but the args array still isn't available. In those cases, there's another set of built-in variables: arg0 through arg9. However, these variables are just integers. If they contain pointers to structures, the variables must be cast, and the structures to which they point may need to be declared, unless they're already declared by default (i.e., psinfo_t). The DTrace guide details which arguments are available for each provider and probe.

One of the providers that defines types for the args array is the io provider. It has a set of probes named start, which has various function names for different kinds of I/O. All of the start probes have three elements in their args arrays:

  • args[0] -- pointer to a bufinfo structure
  • args[1] -- pointer to a devinfo structure
  • args[2] -- pointer to a fileinfo structure

Each structure provides information about different aspects of an I/O request. The following are the actual structure definitions. These declarations are automatically provided to any clauses called from an io:::start probe: you do not include them in your D program. This information was extracted from the Solaris Dynamic Tracing Guide:

typedef struct bufinfo { 
    int b_flags;         /* flags */ 
    size_t b_bcount;     /* number of bytes */ 
    caddr_t b_addr;      /* buffer address */ 
    uint64_t b_blkno;    /* expanded block # on device */ 
    uint64_t b_lblkno;   /* block # on device */ 
    size_t b_resid;      /* # of bytes not transferred */ 
    size_t b_bufsize;    /* size of allocated buffer */ 
    caddr_t b_iodone;    /* I/O  completion routine */ 
    dev_t b_edev;        /* extended device */   
} bufinfo_t; 

typedef struct devinfo { 
    int dev_major;       /* major number */ 
    int dev_minor;       /* minor number */ 
    int dev_instance;    /* instance number */ 
    string dev_name;     /* name of device */ 
    string dev_statname; /* name of device + instance/minor */ 
    string dev_pathname; /* pathname of device */ 
} devinfo_t; 

typedef struct fileinfo { 
    string fi_name;      /* name (basename of fi_pathname) */ 
    string fi_dirname;   /* directory (dirname of fi_pathname) */ 
    string fi_pathname;  /* full pathname */ 
    offset_t fi_offset;  /* offset within file */ 
    string fi_fs;        /* filesystem */ 
    string fi_mount;     /* mount point of file system */ 
} fileinfo_t; 
  
With all this predefined power at hand, writing a D program to inspect I/O information becomes a much easier task. Here's an (ios.d) example that prints some of this information, as each I/O start occurs:

#pragma D option quiet 
io:::start 
{ 
   printf ("%5d %4d %4d %30s %30s\n", args[0]->b_bcount, 
           args[1]->dev_major, args[1]->dev_minor, 
           args[1]->dev_pathname, args[2]->fi_pathname); 
}
  
Your output should look something like this:

 8192  136  8 /devices/pci@1f,0/ide@d/dad@0,0:a  /usr/share/lib/termcap 
 8192  136  8 /devices/pci@1f,0/ide@d/dad@0,0:a                  <none>
32768  136  8 /devices/pci@1f,0/ide@d/dad@0,0:a       /var/tmp/ExP3aOLg 
16384  136  8 /devices/pci@1f,0/ide@d/dad@0,0:a       /var/tmp/ExP3aOLg 
16384  136  8 /devices/pci@1f,0/ide@d/dad@0,0:a       /var/tmp/ExP3aOLg 
34816  136  8 /devices/pci@1f,0/ide@d/dad@0,0:a                  <none>
No Control Constructs: Deal With It

The lack of control constructs in D is not as limiting as you might think. There are several methods for getting around this limitation:

1. Use the C preprocessor to implement static changes.

2. Utilize the ability to have more than one probe clause per probe.

3. Use conditional expressions to assign alternate values.

4. Drive your D script, or multiple D scripts, from a parent script or program.

C Preprocessor

As you may recall, when writing C programs, the C preprocessor (cpp), performs a first pass on the C source file and looks for directives that start with a # in column one. Based on the preprocessor directive, the source code is modified and then handed to the real compiler. For example, #if, followed by a logical expression, controls whether or not a block of code gets included in the compilation. You can invoke the C preprocessor for your D program using the -C flag on the DTrace command line. One use for this is to make the printf format a common definition that can be maintained in one place.

For example, here's how you might modify execs.d using the preprocessor directive #define:

#pragma D option quiet 
#define fmt "%-20s %s\n"
BEGIN 
{ 
  printf (fmt, "===============", "=============="); 
  printf (fmt, "Execing Process", "Execed Process"); 
  printf (fmt, "===============", "=============="); 
} 
  ... 
  
There are other DTrace command-line options available to control the preprocessor, such as -D to define a symbol or -I to add a path to the #include search. Any of these can also be included in the script preamble (#!/usr/sbin/dtrace). The preprocessor can be used to statically control which parts of the D program are given to the compiler.

Multiple Probe Clauses for a Probe

Conditional logic in the probe clause can be accomplished by dividing the clause into multiple clauses. The predicate is used to select which clause is executed when the corresponding probe fires. Also, since the size of a probe clause's trace record is static[1], this is the only way to generate a varying amount of trace data for a probe. For example, suppose I wanted to refine the printing of the header and trailer in execs.d. It would be useful to reprint the header every 20 lines of output, and omit the separator line at the end if there were no execs printed:

#pragma D option quiet 
BEGIN 
{ 
   total = 0; 
} 
syscall::exec*:entry 
{ 
   self->exn = execname; 
} 
syscall::exec*:return 
/ self->exn !=NULL && !(total % 20) / 
{ 
   printf ("%-20s %s\n", "===============", "=============="); 
   printf ("%-20s %s\n", "Execing Process", "Execed Process"); 
   printf ("%-20s %s\n", "===============", "=============="); 
} 
syscall::exec*:return 
/ self->exn != NULL / 
{ 
   printf ("%-20s %s\n", 
   self->exn, execname); 
   self->exn = 0; 
   total++; 
} 
END 
/ total / 
{ 
   printf ("%-20s %s\n", "===============", "=============="); 
} 
END 
{ 
   printf ("Total # of execs = %d\n", total); 
}
  
The exec return clauses are executed in order each time the probe fires. The first clause will print the header if total is a multiple of 20 (which of course includes 0). The beauty of this method is that if no execs happen, the header will never be printed. When the END probe fires, if total is zero, the ending separator is not printed.

Another good example exists in the ios.d script. In the output, you probably noticed that args[2]->fi_pathname is the string <none> much of the time. This is because many IO requests are not associated with any particular file. To pretty this up a bit, we could put the tracing of the file path in a different clause, and bypass it altogether if its value is <none>:

#pragma D option quiet 
io:::start 
{ 
   printf ("%5d %4d %4d %s\n", args[0]->b_bcount, 
           args[1]->dev_major, args[1]->dev_minor, 
           args[1]->dev_pathname); 
} 
io:::start 
/ args[2]->fi_pathname != "<none>" / 
{ 
   printf ("                %s\n", args[2]->fi_pathname); 
} 
  
The output should look something like this:

 1024  136    8 /devices/pci@1f,0/ide@d/dad@0,0:a 
 1024  136    8 /devices/pci@1f,0/ide@d/dad@0,0:a 
24576  136    8 /devices/pci@1f,0/ide@d/dad@0,0:a 
                /var/tmp/Extya4Mg 
 8192  136    8 /devices/pci@1f,0/ide@d/dad@0,0:a 
                /var/tmp/Extya4Mg 
Using Conditional Expressions

This technique allows your D program to choose different calculations to perform based on the probe data. As an alternative to the previous example (ios.d) of splitting the printed output into two probe clauses, a conditional expression can do the same thing in one probe clause:

#pragma D option quiet 
io:::start 
{ 
   printf ("%5d %4d %4d %s%s\n", 
      args[0]->b_bcount, args[1]->dev_major, 
      args[1]->dev_minor, args[1]->dev_pathname, 
      args[2]->fi_pathname == "<none>" ? 
        "" : strjoin("\n             ", args[2]->fi_pathname)); 
} 
For this modification, I've also introduced a string function, strjoin. The function returns a string with the two argument strings concatenated. The conditional expression tests args[2]->fi_pathname. If it's equal to <none>, it prints a null string. If it's not equal to <none>, it prints the pathname, but it does so with a newline and some extra white space. This formats the pathname, so that it's lined up on the next line, just like the split probe clause method [2]. The output should look just like the previous example.

Another use for conditional expressions is to present information in a more readable form. The kernel usually stores an enumerated list of choices as a group of bits. For example, the bufinfo structure has a b_flags field that contains a single bit that represents whether an I/O is a read or a write. In the ios.d script, we can mask out this bit and then use a conditional expression to print a value that is more descriptive than 1 or 0:

#pragma D option quiet 
io:::start 
{ 
   printf ("%c %5d %4d %4d %s%s\n", 
      args[0]->b_flags & B_READ ? 'R' : 'W', args[0]->b_bcount, 
      args[1]->dev_major, args[1]->dev_minor, args[1]->dev_pathname, 
      args[2]->fi_pathname == "<none>" ? 
         "" : strjoin("\n               ", args[2]->fi_pathname)); 
} 
  
Like C, the D language uses single quotes for character constants and double quotes for string constants. I chose to use a %c format item with character constants 'R' and 'W'. You could also have used %s with string constants "R" and "W". The choice is somewhat arbitrary [3]. This output should look like the previous examples, but with an R or W in the first column.

Driving the D Script from a Parent Script

In this method, you invoke the D script from another program, such as a shell script. In the parent script, you make all the decisions about how the D script is going to be invoked and then tailor the D script accordingly. The D program source can be contained within the shell script as one long command, or it can be invoked from a separate file. If it's embedded in the shell script, you can use shell variables to tailor the DTrace command line before it's invoked. There are some excellent examples of this technique in the DTrace Toolkit developed by Brendan Gregg at:

http://www.opensolaris.org/os/community/dtrace/dtracetoolkit/ 
  
If the D script is in a separate file, you can use D macro variables on the DTrace command line to tailor the D program. See the Solaris Dynamic Tracing Guide for more information about using macros.

Aggregations

One of the most powerful features of DTrace is aggregations, which provides the ability to automatically collect raw data and print statistical reports. An aggregating function is executed using the following general form:

@aggname[key1, key2, ...] = aggfunc(arg1, arg2, ...); 
where aggname is a variable used to uniquely identify this aggregation, each key is an expression that defines how the aggregation is grouped, aggfunc is the name of one of D's built-in aggregating functions, and each arg is an expression that aggfunc acts on. By default, an aggregating function collects data each time it's executed, and the resulting statistics are printed when the D program exits (either via the exit function or ^C).

As an example, let's suppose I want to modify ios.d to include a grand total of both the number of I/O starts and the number of bytes read and written. Also, I want to segregate the totals into read totals and write totals. To get a total number of I/O starts, I'll use the count function, and to get a total number of bytes read and written, I'll use the sum function:

#pragma D option quiet 
io:::start 
{ 
   self->rw = args[0]->b_flags & B_READ ? "R" : "W"; 
   printf ("%s %5d %4d %4d %s%s\n", self->rw, args[0]->b_bcount, args[1]->dev_major, 
      args[1]->dev_minor, args[1]->dev_pathname, 
      args[2]->fi_pathname == "<none>" ?
         "" : strjoin("\n             ", args[2]->fi_pathname)); 

   @blcktotal[self->rw] = count(); 
   @bytetotal[self->rw] = sum(args[0]->b_bcount); 

   self->rw = 0; 
} 
  
You'll note that I've added a thread local variable, self->rw, so that the conditional expression doesn't have to be repeated. The keys for these aggregations are the same as what is printed in the first column of the detail lines: "R" or "W". I had to make them strings instead of chars: keys with the char type, by default, are displayed as integers (which wouldn't look as nice). It's also worth noting that count and sum(1) yield the same results. Upon exit, the new ios.d will display four lines that look something like this:

  R                 107 
  W                 129 
  R              317440 
  W              418816 
  
We also have the ability to control the final output using the printa data recording action. This action works much like the printf action. The format string for printa is used to position and restrict how each result line is printed. Each regular format item corresponds to one of the keys in the aggregation. A special format item prefix, %@, corresponds to the aggregation result.

As an example, to get almost the same output as the default (except for the amount of white space), you could use the following in an END clause:

printa ("%s %@30d\n", @blcktotal); 
printa ("%s %@30d\n", @bytetotal); 
  
However, to make the output easier to read, you can change the field widths, add other constant data to the format, and surround the printa actions with printf actions:

#pragma D option quiet 
io:::start 
{ 
   self->rw = args[0]->b_flags & B_READ ? 'R' : 'W'; 
   printf ("%c %5d %4d %4d %s%s\n", self->rw, args[0]->b_bcount, 
      args[1]->dev_major, args[1]->dev_minor, args[1]->dev_pathname, 
      args[2]->fi_pathname == "<none>" ? 
         "" : strjoin("\n              ", args[2]->fi_pathname)); 

   @blcktotal[self->rw] = count(); 
   @bytetotal[self->rw] = sum(args[0]->b_bcount); 

   self->rw = 0; 
} 
END 
{ 
   printf ("Blocks: R/W     Total\n"); 
   printf ("=====================\n"); 
   printa ("        %c %@11d\n", @blcktotal); 
   printf ("\n"); 
   printf (" Bytes: R/W     Total\n"); 
   printf ("=====================\n"); 
   printa ("        %c %@11d\n", @bytetotal); 
} 
  
Note that I changed the result of the conditional expression back to a char, rather than a string. Now that I have control over the output format of the aggregations, I can use the char data type and actually print a character rather than its integer representation. The output at the end, after you enter ^C, should look something like this:

Blocks: R/W     Total 
===================== 
        R           4 
        W          36 
 Bytes: R/W     Total 
===================== 
        R        4096 
        W       50176
  
Profile Provider

The syscall provider is an example of an anchored provider, because its probes are triggered by events in the code that you are observing. DTrace also has unanchored probes, which are asynchronous to the observed code. The dtrace provider's probes are examples of unanchored probes. Another unanchored probe provider is the profile provider. The profile provider's probes fire on a time interval. In ios.d, suppose you want to print the aggregation results on a 15-second interval. You could use the profile provider's tick-15s probe, which fires every 15 seconds. I've added this probe to ios.d below:

   ...
 
END,tick-15s 
{ 
   printf ("\n"); 
   printf ("Blocks: R/W     Total\n"); 
   printf ("=====================\n"); 
   printa ("       %c %@11d\n", @blcktotal); 
   printf ("\n"); 
   printf (" Bytes: R/W     Total\n"); 
   printf ("=====================\n"); 
   printa ("       %c %@11d\n", @bytetotal);  
   clear (@blcktotal); 
   clear (@bytetotal); 
} 
  
The io:::start probe clause hasn't changed, but I added the tick-15s probe to the END clause. This will cause the aggregations to be printed every 15 seconds, in addition to after the script is interrupted. The clear functions zero out the result values for each aggregation, making the output the total for the last 15-second period. If, instead, you want a running total, just remove the clear functions.

The units that can be specified for the tick probes are: nanoseconds (nsec or ns), microseconds (usec or us), milliseconds (msec or ms), seconds (sec or s), minutes (min or m), hours (hour or h), days, (day or d), and hertz (hz). Hertz (the default, if no suffix is supplied) specifies the number of times per second the probe will fire.

The numeric value of the probe is completely arbitrary. You could have just as easily used tick-14s or tick-2s. One of the interesting things about profile probes is that they don't exist until they are used. If you run dtrace -lP profile before using the profile provider the first time, you'll see a handful of probes that get created by default, but probably not tick-15s. When you run a D program that uses this probe, it gets created. Once the probe is created, it remains on the probe list until the system is rebooted or the profile module is unloaded [4]. On top of that, if five different users are running a D program that happens to use the tick-15s probe, they will all be using the same probe.

As you may recall, each DTrace consumer has its own trace buffer. When the tick-15s probe fires, that event gets shared with all the buffers for all the consumers who have clauses associated with that probe. This also means that each DTrace consumer sees the firing of the probe at about the same time. (There is a slight delay because of the buffer.) As a result, you shouldn't write a D program that depends on the first firing of a profile probe to be a certain amount of time from the start of the program. If the same probe has been instrumented by another consumer, your timing will vary according to when you enter the cycle. This is easiest to see in an example. Try running the following command in two different windows at the same time. Once the command is entered in the first window, no matter when you start the second instance, you'll see that each consumer shows tick-15s firing at about the same time:

dtrace -n tick-1s -n tick-15s
  
Summary

In this article, I've taken you a lot deeper into the D language than I did in part two. I think you can begin to see the extreme flexibility of D, and the power of observability that DTrace has on your system. In part four, I'll present some real examples of problems and show you how to solve them using DTrace. Of course, along the way, I'll explain more about the D language and some other probe providers.

Endnotes

[1] There's a couple of very good reasons why the D language doesn't have flow control statements like if, while, or for. In part two, I mentioned that a D program is compiled into a format called DIF (D Intermediate Format). The DIF program for each probe clause is executed in the kernel, with interrupts disabled. If a DIF program got into a long or infinite loop, a processor would now be running a long-lasting, kernel resident, uninterruptible thread. To prevent this possibility, the D compiler doesn't allow backward branches in DIF code. This precludes any kind of loop construct.

Additionally, DTrace requires that each probe clause produce a fixed-length trace record. This means that the total number of bytes generated by all of the recording actions (trace, printf, etc.), for one clause, must be constant. Each clause can generate a different size record, but one clause must generate the same size record each time it is invoked. If the designers allowed flow control statements, the size of the trace record could vary, depending on whether the if condition was true.

[2] You may be asking how we are cheating the static trace record limitation with the conditional expression method. When the D program alternates between the two possible results, isn't the DIF program in the kernel sending a different-sized record to the buffer? Actually it isn't (as you've probably guessed by my leading question). In D, strings are always allocated at a pre-defined length. By default, this pre-defined length is 256. The intermediate string generated by the conditional expression is 256 bytes, regardless of how much data is shoved into it. In this example, three integers and two 256-byte strings are actually put into the buffer. When the DTrace consumer pulls the data out of the buffer, it formats it using the printf format string, and varies the length of the output based on the position of the terminating null character.

[3] I say "somewhat" because there is a distinct difference in how much data is put in the recording buffer, considering that strings are always recorded at their full length, which by default is 256 bytes.

[4] DTrace providers are Solaris-loadable modules that can be listed using modinfo. Any default probes for a provider are created when the module is loaded. The provider's probes become undefined when the module is unloaded (modunload -i module_id), or when the system is rebooted. So theoretically, profile provider probes, that you've created, could be removed by unloading the profile module, but I've never had success with this -- the module is always busy.

Chip Bennett (cbennett@laurustech.com) is a systems engineer for Laurus Technologies (http://www.laurustech.com), a Sun Microsystems partner. He has more than 20 years experience with Unix systems, and holds a B.S. and M.S. in Computer Science. Chip consults with Laurus and Sun customers on a variety of Solaris related topics, and he co-chairs OS-GLUG, the OpenSolaris Great Lakes User Group (Chicago).