Article oct2006.tar

Learning DTrace -- Part 2: Scripts and the D Language

Chip Bennett

In Part 1 of this series on DTrace, I covered the fundamentals of using DTrace from the command line. Once DTrace command arguments start getting complex, it's generally easier to write a script rather than make the command line longer. Here is the basic layout of a D program:

#!/usr/sbin/dtrace -s 

probe-description[, probe-description[, ...]] 
/ predicate / 
{ 
   action; [action; [...]] 
} 
            
Let's break this down one part at a time:

1. The first line is similar to what you would see in any shell script [1]. The -s option must be specified so that Dtrace knows it is being invoked in script mode rather than command mode.

2. A probe-description is any 4-tuple probe description, as discussed in Part 1. These specifiers must be of the -n type. That is, the D compiler assumes that a single word, or whatever is after the last colon, is the probe name. (i.e., a probe-description of a single colon will match all probes.) You can specify more than one probe-description by using a comma-separated list of probe tuples.

3. The predicate is a C-style relational expression. This line is optional. A missing predicate has the same effect as a predicate that is always true.

4. The actions enclosed in braces are the actions to be performed if (1) the instrumented probe fires, and (2) the predicate is true. The action body in braces is also in the style of C, but does not have any flow control statements and has a different set of available functions than ANSI-C [2]. The actions are also optional. An action block that is empty (just an open and close brace) performs only the default action, as described in Part 1.

The probe-descriptions, predicate, and action block, together make what is called a probe clause. There can be many probe clauses in one D program, and their probe-descriptions can overlap, (i.e., syscall:::entry, open:entry, and open: will all match the firing of the syscall::open:entry probe). When a probe fires, only the clauses that match that probe are executed, and they are executed in the order they appear in the D program. (Keep in mind that the corresponding predicates also have to be true before the clause's action block is executed.)

Example Scripts

So let's take a look at some example scripts to get an idea of what they're like. The absolute simplest script possible is the one that instruments all of the probes on your system.

#!/usr/sbin/dtrace -s 

: 
{ }
DTrace hint: You may only want to try this script on a lab or test system. Internally, this script is activating every available kernel probe, which is going to be on the order of tens of thousands. DTrace is extremely efficient, but even this script is a bit abusive. I've tried it many times without doing any more harm than using up a lot of CPU, but you should be cautious. Another problem you will probably see with this script is dropped data. Every time a probe fires, DTrace records data into a buffer (one for each consumer and each CPU). If the buffer becomes full, the data is dropped. You can see these drops by running dtrace -n : > /dev/null on a lab or test system. The normal trace output will be routed to /dev/null, but the drop messages are sent to standard error.

If you were to actually run this script, you would have to save it into a file, say "intense.d", turn on the execute permissions (chmod +x intense.d), and then execute it from the command prompt (./intense.d). Like any script, you can pass it as an argument to the interpreter, rather than turn it into an executable [3]. To invoke the script in this manner, you'd enter dtrace -s intense.d. If you use this method, then you don't need the preamble (#!/usr/sbin/dtrace -s), and you don't need to turn on the execute bits.

In the rest of my examples, I won't show the preamble, but keep in mind that you need it if you are executing your scripts by changing them into commands.

So now let's add a little more to our degenerate program so that it behaves more reasonably.

syscall::exec*:entry 
{ } 
Don't forget to invoke the script with the -l option initially to verify how many probes it's going to instrument (dtrace -ls sample.d). You should have matched two probes: exec and exece [4]. This is accomplished by using basic shell pattern matching when specifying probe tuples. Now try allowing the script to instrument (dtrace -s sample.d). How would you describe what this D program is displaying? (If you're not seeing any activity, try logging in from another window and type a few commands.)

You should be seeing default output for each time the exece system call is invoked. However, this script doesn't tell us which process is doing the exec. To display this information, we need a built-in variable and a way of printing to standard out.

Recording Actions and Built-in Variables

The D language provides an output action called trace(). It takes, as a single argument, a D expression. Additionally, D provides a set of built-in variables, one of which is execname, the name of the process that caused the probe to fire. In this case, that would be the process that called exec. The following modification to the script will show us this information, in addition to the default output:

syscall::exec*:entry 
{ 
   trace(execname); 
} 
     
To make the script complete, it would be nice to also print the name of the program being exec'd. Each type of system call has specific arguments passed to it. DTrace has a set of built-in variables (arg0 through arg9) that contain the system call arguments. In the case of exec, arg0 will contain the name of the program being exec'd. However, what's actually passed is a pointer to a string, so the arg0 variable has to be treated as a pointer. (The Solaris reference manual lists the arguments for each system call.)

Subroutines and String Variables

Our pointer has a problem: it refers to a string that is not in the same address space [5] where the D program is running. The D program runs in the kernel, while the address in arg0 points to a string in the process that called exec.

DTrace hint: Even though the dtrace command runs in your address space, the D program gets compiled into an internal form called DIF (D Intermediate Format). When a probe fires, any DIF programs associated with that probe are executed in the kernel. This is why the program can't access the pointer in arg0 directly. (The DIF program records data in the buffer. Later, the dtrace command gets the recorded data from the buffer.)

To access the string from userland, we need some kind of built-in function to get the data for us. DTrace has a limited set of built-in functions. One of these, copyinstr, takes a pointer as an argument and returns the null terminated string from the other address space. It does this by making a copy of the string in the probe clause scratch memory, which is automatically reclaimed when the probe clause ends.

But copyinstr doesn't just return a pointer to an array of characters. Unlike C, the D language has a real string type. This means that if you declare a variable as type string, wherever you reference this variable, your D program will use it as a string (not a pointer, a single character, or an array of characters). This also means that strings can be assigned using the assignment operator (=), and can be compared using the relational operators. The built-in variable execname is type string as is the returned value from copyinstr().

With the changes I've described so far, here is the new D program:

syscall::exec*:entry 
{ 
   trace(execname); 
   trace(copyinstr(arg0)); 
} 
    
The above program will work but will not produce very pretty output:

CPU     ID       FUNCTION:NAME 
  0    106       exece:entry   ksh      /usr/bin/ls 
  0    106       exece:entry   ksh      /usr/bin/who 
Let's go over a couple of more features to make things look a little nicer. First, instead of using the trace action, we'll use printf, which allows formatting of the output. For the most part, D's printf works like its counterpart in C. Second, to make the output even cleaner, we'll get rid of the default output. D has a quiet option, which can be specified with -q on the command line or using a pragma in the D program. I've used pragma in my example below:

#pragma D option quiet 

syscall::exec*:entry 
{ 
   printf ("%-20s %s\n", execname, copyinstr(arg0)); 
}
    
Thread-Local Variables

We are almost finished with the D program, but there is one more pointer issue that we need to deal with. It is possible that the address in arg0 points to a location in memory that is not in real memory (just on disk). This could happen because the string is a constant that has never been accessed or because the string got paged out at some point. DTrace can't access such addresses, because the kernel resident probe clause runs with interrupts disabled and can't wait around for the disk I/O to complete. The trick is that we need to defer access until the system call has completed; that way we know the string is in memory. This technique requires that we have another probe clause that gets executed when the exec system call returns.

However, when the return probe fires, arg0 will have a different value. What are we going to do with the address in arg0 from the entry probe? We need to save arg0 so that we can reference the address later in the return probe clause. But where should we save it? Since Solaris is a multi-threaded operating system, there could be many exec system calls going on at the same time. We need to save the pointer in such a way that will allow a unique save location for each thread.

DTrace has a variable type called thread-local. When a thread-local variable is declared, a copy gets created for each new thread that fires the probe. You reference it by placing self-> in front of the variable name. Thus, one thread, that fires the exec:entry probe and then later fires the exec:return probe, is accessing the same thread-local variables. Over the lifetime of a DTrace run, there could be hundreds of thousands of threads, so there has to be a way of getting rid of the variables when you know you don't need them anymore (i.e., when you don't care about the thread anymore). Assigning 0 to a thread-local variable de-allocates its storage. So, here's the final script:

#pragma D option quiet 

syscall::exec*:entry 
{ 
   self->prog = copyinstr(arg0); 
   self->exn = execname; 
} 

syscall::exec*:return 
/ self->prog != NULL / 
{ 
   printf ("%-20s %s\n", self->exn, self->prog); 
   self->prog = 0; 
   self->exn = 0; 
} 
    
DTrace gets the type of a thread-local variable implied from an assignment statement or explicitly from a declaration. If the type is implied, the assignment statement must be ahead of any use of the variable in your D code; however, if your probes fire in an order such that you reference a thread-local variable before it's initialized, the reference will return zero.

So, why do I check that self->prog has been set in the predicate of the second clause? If we didn't have this predicate, what would occur if I happen to start the DTrace program at the moment an exec call was in progress? For that thread, the first probe to fire would be the return probe. Since the entry probe hasn't fired yet, the thread-local variables have not been initialized, and the output would contain erroneous results (in this case, zeros). This type of check, where you don't know for sure which probe is going to fire first, is a common construction in DTrace.

Your output should look something like this:

sh               /usr/bin/ls 
sh               /usr/bin/csh 
csh              /usr/bin/pwd 
There's usually more than one way to solve a given problem. Here's an alternative method that produces almost the same output:

#pragma D option quiet 

syscall::exec*:entry 
{ 
   self->exn = execname; 
} 

syscall::exec*:return 
/ self->exn != NULL / 
{ 
   printf ("%-20s %s\n", self->exn, execname); 
   self->exn = 0; 
} 
On return from exec, execname contains the name of the program that was being exec'd, so I really don't need to save arg0 in the entry probe. However, if you try this script, you'll see the output is a little different from the output of the first script.

Summary

In these two articles, I've really just scratched the surface of the capabilities of DTrace and the D language. In the future, I'll be covering other probe providers, and I'll introduce you to an extremely useful statistical feature called aggregations. Ultimately, we'll look at some ways to use DTrace to solve system problems. In the meantime, if you can't wait, or would just like more information on other built-in variables, subroutines, probe providers, etc., I refer you to the Solaris Dynamic Tracing Guide at:

http://docs.sun.com/app/docs/doc/817-6223 
    
Endnotes

1. The rationale behind this construction is to allow the exec system call to determine how to invoke an executable. To exec, everything is an executable binary with a two-byte magic number at the beginning telling it what kind of executable. Normal executable binaries have codes that indicate whether the file is "ELF" format, whether the symbols have been stripped, etc. The #! two-byte code at the beginning of a script tells exec that this is a script and to actually invoke the executable whose path appears just after the #!, passing along any arguments and the rest of the script as parameters and data. This is the standard way all scripts are handled in Unix (sh, bash, csh, ksh, perl, etc.)

2. In the latest version of Solaris 10, or OpenSolaris, there have been some new functions added that are similar to the standard C string manipulation functions. However, documentation on these new features is slow in coming.

3. Invoking the script as an argument to the dtrace command actually makes more sense if you are going to invoke a new script with the -l option initially to verify how many probes you matched. The -s flag must be last, because dtrace expects anything after the -s flag to be the filename of your script. For example, to list the probes that the "intense.d" script would have instrumented, try dtrace -ls intense.d.

4. The exec system call is the mechanism by which one process calls another one. When you type an external command at a shell prompt, the shell forks (or replicates itself) to make two processes, and then the second (or child) process transforms itself into a different program by exec'ing that program. There are two basic exec system calls: exec and exece. The first form doesn't pass the environment variables to the new executable, and the second form does. You need to specify both in the probe tuple in order to make sure you capture all execs.

5. Address spaces work a lot like street addresses. Let's say you're looking for address 201. There's a 201 Main Street and a 201 Oak Street, but they're not the same location. So, when your probe clause receives a pointer as one of its arguments, the address is in the context of the process that caused the probe to fire. It may be address 201, but it's not an address on your street.

Chip Bennett (cbennett@laurustech.com) is a systems engineer for Laurus Technologies (http://www.laurustech.com), a Sun Microsystems partner. He has more than 20 years experience with UNIX systems, and holds a B.S. and M.S. in Computer Science. Chip consults with Laurus and Sun customers on a variety of Solaris related topics, and he co-chairs OS-GLUG, the OpenSolaris Great Lakes User Group (Chicago).