Article Figure 1 Listing 1 Listing 2 Table 1 aug2006.tar

System Call Tracing

Roy Butler

    "The world is the totality of facts, not of things." -- Ludwig Wittgenstein, Tractatus Logico-Philosopicus

As systems administrators, we're often faced with puzzles in application behavior in the absence of source, documentation, and especially time. Often, the problems are exceptions to the norm, and it's not pure application logic we're asked to troubleshoot but rather some aberrant interaction between the application and its current environment. In such cases, with a little knowledge of operating system kernel fundamentals, it's extremely effective to turn to system call tracing to diagnose the behavior at the application/kernel boundary. In this article, I'll present how this is done both with truss on Solaris and strace on Linux, including a couple of cases we've come across on flight missions here at Jet Propulsion Laboratory.

Overview

Applications consist of "internal calls" to self-contained functions, which incorporate unique logic to perform their tasks. For anything non-trivial, these in turn initiate "library calls" to a middle layer of external bundles of included code, providing convenience routines to often-used procedures. From here, when low-level access is required to the filesystem, network, memory, or other device, "system calls" are made to the operating system kernel itself. When dealing with an interpreted language, like Perl or Java, a translational execution unit is wedged in above the kernel as well. Figure 1 illustrates this call hierarchy. It is at the point when calls are made to the operating system where system call tracers intervene. They print out each system call, its arguments, and the return value, all transparently to the process.

For a simple example, refer to the sample code in Listing 1, which opens a log file, writes to it, and exits. To run this under a system call tracer, you'd use one of the following syntaxes, depending on platform:

solaris$ truss ./log_example
linux$ strace ./log_example
            
The complete output of this example on Solaris is presented in Listing 2, although the output is largely the same on Linux, both being POSIX-compliant or at least related, in most respects. Here's an overview of the highlights:

open("/usr/lib/libc.so.1", O_RDONLY)            = 3
[The standard C library is opened, which corresponds with our #include]

open("mylog", O_WRONLY|O_APPEND|O_CREAT, 0666)  = 3
[The log file is opened, which corresponds with our fopen()]

write(3, " S t a t e   =   1 0\n", 11)          = 11
[11 bytes are written out, which corresponds with our fprintf()]
For details on the arguments and return codes of individual system calls, you can consult section 2 of your system manual pages, for instance:

$ man -s 2 open
A brief description of return codes denoting error conditions returned by system calls can be found in /usr/include/sys/errno.h on Solaris and /usr/include/asm/errno.h on Linux. Some common examples of these appear in Table 1.

For further background material on operating system kernel design and implementation through which system calls are made available, my recommendations are Modern Operating Systems by Andrew Tanenbaum, Advanced Programming in the Unix Environment by W. Richard Stevens and Stephen Rago, and Understanding the Linux Kernel by Daniel Bovet and Marco Cesati. Now, let's move on to some cases where I've used these techniques to quickly and accurately diagnose somewhat elusive problems.

Case #1: The "Misconfigured" Environment

Configuration management is taken seriously throughout NASA, as (when done right) it locks down the variations, increases the odds of success, and provides a structured scenario to backtrack through in the event of failure. We often replicate workgroup environments to strict standards as project personnel branch out.

After one such deployment, it was reported that one of the in-house Perl applications was hanging when run in the new environment. After checking the Solaris workstations for obvious memory, CPU, and I/O issues, we decided to run the application under truss in both a working and this "misconfigured" environment. The end-user was pretty sure it must be a hardware issue, but obliged. After saving the output from the trial runs, we compared the results side by side and found the hanging application was calling alarm(), attempting to open() a file unsuccessfully in a group data directory not under configuration management control, receiving its alarm() timeout, and then looping to try again, like so:

alarm(10)                               = 0
open("datarun.out", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = -1 EACCES
--- SIGALRM (Alarm clock) @ 0 (0) ---
...repeated...
On the other hand, in the good case, the file was opened successfully and things proceeded normally. We checked the file permissions and found the group-write bit not to be set in the new environment's group data directory, corrected it, and the problem was solved. The application hang on failing to open a file is not at all standard Perl behavior (of which I'm a big advocate, as it's elegant and furthermore was developed here by Larry Wall in 1987). But, in this case, I decided not to dig into the application's logic, because it included several modules with which I was unfamiliar and I'd already gotten things working again within 10 minutes.

Case #2: The Pathological Transform

During the final stages of deploying the data warehousing Solaris server for a recent mission, our image processing group came to us with the gripe that their software running from Linux workstations over NFS was operating unacceptably slow. The claim was that it worked well in their testbed. To make matters worse, it needed to go into production within a few days -- a complication formed by the repeated server crashes we had experienced in our initial deployment. (This I diagnosed with adb, worked with the vendor, and resolved -- something I hope to cover in a future article).

When asked for more details, we learned that the slowdown seemed to occur when utilizing a third-party image transformation Java library, for which we were only provided byte-code. When the external development group was contacted, we learned that their lead programmer was on vacation. Was it time to brush up the resume?

All eyes were on the NFS server, whose previous manifestations had proven so faulty. We tried tuning nfsd thread counts, read/write sizes, and timeouts, but nothing produced a dramatic change to the situation. We were using some experimental file access control and logging methods, which we thought might be complicating the picture. But, then we ran some tests and demonstrated that file copies to and from the server in all standard cases were completely up to snuff, so it wasn't an underlying storage or network-saturation event.

At this point, I turned to strace on the clients, running the imaging software under it and using the tracer's -c option, which summarizes system call count and timing. The results showed that the library's read and write counts were much higher than expected for the file sizes it was working with. Running it again under strace with the defaults pinpointed this:

open("martianCity.dat", O_RDWR|O_CREAT|O_SYNC|O_LARGEFILE, 0666) = 12
fcntl64(12, F_SETFD, FD_CLOEXEC)        = 0
fstat64(12, {st_mode=S_IFREG|0644, st_size=10860, ...}) = 0
read(12, "\200", 1)                     = 1
_llseek(12, 0, [0], SEEK_SET)           = 0
write(12, "\201", 1)                    = 1
read(12, "F", 1)                        = 1
_llseek(12, 1, [1], SEEK_SET)           = 0
write(12, "G", 1)                       = 1
...repeated...
It may not be immediately clear, but what this represents is the application reading in one unbuffered (O_SYNC) byte at a time, incrementing it, rewinding, and writing it back out. The actual transform was somewhat more mathematically complex. My initial thought was that the byte-at-a-time behavior was the killer, so we suggested wrapping the library in a script that copies the file locally, works on it there, and then copies it back to the NFS server. This proved to be quick and effective, leaving all parties mostly satisfied.

Since then, I've dabbled in reproducing the problem and have found that unbuffered byte-at-a-time reads/writes actually take place more quickly over NFS than over local storage, at least in test cases. The situation is that it's faster to write this way to a socket, which buffers data up and sends it over the network in large blocks, than to the local file system, which more properly honors the unbuffered request resulting in inefficient disk I/O.

Further exploration turned up that the main culprit was the rewinding and writing of each byte back out, which causes NFS performance to plummet pathologically, just as we saw in production. Looking at the network traffic with tcpdump (whose output, just as with snoop, can be analyzed later with ethereal, by the way) revealed that each byte read or written causes a full NFS buffer size worth of data to be transferred -- due to cache invalidation, I suspect.

Conclusion

So, for tackling the hidden elements of underlying application behavior regardless of the high-level programming language used, I hope I've demonstrated the effectiveness of system call tracers and their right to a prime spot in the systems administrator's tool box. These techniques have opened new perspectives for me as I've learned them, and have saved me from countless misdiagnoses and busy work, not to mention headaches. I welcome correspondence on these and other advanced techniques.

System call tracers exist for most other modern platforms, as well, including truss for most SysVR4-based Unix releases, ports of strace to several Unix releases, ktrace for FreeBSD and Mac OS X, and even BindView's implementation of strace for Microsoft Windows (which is a wholly different beast, best when used with the book Windows NT/2000 Native API Reference by Gary Nebbett).

To confuse matters, some systems come with a program called "strace", which is used for STREAMS diagnosis and not for system call tracing in the manner I've covered here. A check with the manual pages should clear things up. It should also be noted that tools exist for library call tracing, too, fulfilled by truss on Solaris with its -u flag, and separately on Linux, with the ltrace utility.

Additionally, the use of system call tracers by privileged accounts can present both security benefits and concerns -- enabling monitoring of all data passed to and from an application. One positive implementation of its use in this arena is the "real-time" monitoring of applications by some anti-virus packages to root out worms and Trojan horses.

Roy is a Senior Ground Data Systems Administrator for NASA/Caltech's Jet Propulsion Laboratory, where he's worked on the Mars Exploration Rover and Mars Reconnaissance Orbiter projects. His previous experience includes systems architecture for a backbone ISP and vulnerability research for an international computer security firm. He holds a bachelor's degree in Philosophy from U.C. Irvine. He can be reached at: roy.butler@jpl.nasa.gov.