Tracking Down Killer Bugs

Dr. Dobb's Journal April 2002

By Eric McRae

Eric is a senior embedded systems engineer and the owner of Electro-Logic Machines in Port Townsend, Washington. He can be contacted at eric@elmi.com.

In the field of embedded system development, there occasionally arises a bug that defies all normal techniques for discovery. These "killer" bugs are characterized by unpredictability, an unclear hardware/software basis, and their proximity to a major project deadline. Such was the case for me recently. At the outset, I must say that the process of resolving this bug took five weeks, but the bug itself was spontaneously identified in the middle of the night during a vacation, two weeks into the problem.

I've been working in embedded systems since nearly the dawn of time. Ten years of my career were spent developing in-circuit emulators and studying embedded system debug techniques, so I spend more energy than most doing preventative design to minimize the need for debugging. Lately though, I've been doing a substantial amount of automotive engine control work using Motorola TPU coprocessors. This device's I/O is buried underneath a Ball Grid Array package and other hardware out of my design control, and its data interface is dual ported to a CPU that is extremely busy running my client's normal engine control algorithms.

If you aren't familiar with the TPU, it's a device for real-time control of engines, robots, and other I/O-intensive applications. TPU microcode provides a high degree of flexibility and power but is difficult to debug "in the target" since it executes at 16 MIPS and there are no in-target debuggers. (For more information on the TPU, see "Inside Motorola's TPU," by Richard Soja, DDJ, December 1996.) For development, I use a TPU simulator from AshWare (http://www.ashware.com/). Until the events related here, this simulator had been all I'd needed for program development and testing.

The Bug

The first indication of the bug came in an e-mail message from my client who reported that they were investigating an occasional "long pulse" on fuel injectors on one test vehicle. The fuel microcode in question had been running on numerous other engine applications for almost a year, so my reaction to the message was mild interest. I have high respect for my client's designers and rarely suspect their work, but the preponderance of known good performance temporarily pushed the finger of guilt in their direction and I went back to work on other things. A day or two later, when they delivered a data capture, the finger of guilt reversed. The symptoms were typical for a killer bug — my fuel injector control code running on a particular V8 engine, after idling at 600 RPM for an hour or two, would, on a random injector, generate a fuel pulse that was approximately 10 times as large as had been commanded by the CPU. After the pulse was delivered, the code reported accurately that it had delivered the large pulse, as though it had been commanded to do so by the CPU. Unfortunately, recorded data seemed to indicate that the CPU hadn't commanded such a long fuel pulse. After this singular event, the engine might idle for several more hours without incident.

I spent a great deal of time reviewing my code and trying to imagine conditions that would cause the known symptoms. Each time I ran a simulation, I saw nothing but desirable behavior. Nothing came out of this until the middle of a previously planned vacation. Something woke me in the wee hours and almost immediately, a possible cause of this problem came to me. I got up, composed an e-mail to my client telling them that I would verify the problem on my return to the office. My conjecture required that the CPU issue a Fuel Update command at just the right time. Confident, I returned to sleep and subsequently enjoyed the remaining days of vacation. Unfortunately, when I got home, a message had arrived indicating that Fuel Updates were synchronized to the up edge of a certain control signal, which my TPU code produces, and were thus being issued far away from the danger zone my bug required. I verified on the simulator that my bug was real, but it apparently wasn't the bug my client was experiencing. I didn't want to make any changes until I had my client's problem nailed down, so I deferred issuing a fix for the bug I knew about.

Building a Test Platform

I was facing a dilemma. AshWare's TPU simulator is excellent, but even on my workstation, it takes almost an hour of real time to simulate 20 seconds of real-time TPU operation. It is also difficult to model variations in input signals from the engine. I realized that the time had come to create a test environment based on real hardware running in real time. I wasn't going to have a real vehicle to work with, but I was going to spend some money.

First, I needed a capable single-board computer to supply all the signals that are normally generated by a running automobile engine. I already had a suitable device in the form of an MC68332-based development board from AshWare. The 68332 contains a TPU so I knew I'd be able to simulate engine signals accurately. Next, I needed a separate single-board computer based on the CPU my client used — the Motorola PowerPC-based MPC555 microcontroller. I would have to write and debug CPU code for both of these platforms and, therefore, wanted C compilers and high-level language debuggers for both. Since I didn't need to do hardware debug, I believed BDM-based debugging would be sufficient. AshWare's development board comes with a nice Windows-based HLL debugger. I much prefer to do development under Linux, but I have a Windows NT workstation available and I could run the 68332 debugger from there. I already had the GCC targeted for both the 68332 and the MPC555 installed on my Linux workstation, so I planned to do all code development there. I absolutely wanted to use GDB/DDD running under Linux for debugging the PowerPC target. Finally, I knew I had to look for a long fuel pulse on any one of eight fuel-injector channels and that I'd need to capture signal states leading up to the pulse. Obviously, I needed some sort of logic analyzer. I initially choked at the expected expense but realized that I didn't need nanosecond resolution, and that I had an available laptop that could be used as a front end to an analyzer if that would reduce the price.

The Web is a wonderful thing. I was delighted to find a $200 development board based on the MPC555, offered by Axiom Manufacturing (http://www.axman.com/). It had a BDM port and all MCU signals brought out to a pin array. The next purchase was a "Wiggler" BDM debugger from Macraigor Software (http://www.macraigor.com/). To support the Wiggler, I downloaded the GDB package for the PowerPC under Linux. Finally, I searched for PC-based logic analyzers and found one that seemed to stand out. It was GoLogic from NCI Logic Analyzers (http://www.nci-usa.com/). GoLogic is a PC-based device with a transitional analysis mode that appeared to be just the ticket for my problem. Total time for product search — about two hours. Money spent — about $3500, including taxes and shipping. Nothing to it. Sign here.

Next, I installed the Linux GDB package distributed by Macraigor. Linux RPM packages sure make things easy these days. A high-speed Internet connection helps too. Over on the 68332 side, I wrote some special TPU microcode to emulate the engine signals. Then I wrote some CPU code to vary the speed of that output in order to reproduce the major and minor idle speed fluctuations of a real engine. I was able to test this code right away since I had all the hardware I needed and an oscilloscope to look at signals.

CPU code running on the MPC555 would have to duplicate control interactions occurring on the real engine, so I sent a request to my client to define what the CPU was doing and when. I also began to write CPU code for the PowerPC to initialize the TPUs and configure them according to published specifications for my own engine control microcode. I didn't have to write any special microcode for the PowerPC side because I would be using exactly the same build my client was using.

Much to my relief, all the purchased products arrived within four business days of ordering. Unfortunately, I encountered problems trying to make things play together. AshWare's 68332 development board and debugger were a slam dunk. The debugger was fast, obvious to use, and it behaved well. The combination of Axiom Manufacturing's PB-0555-ED single-board computer and Macraigor's Wiggler BDM debugger did not play at all. I couldn't get the Wiggler to talk to the PowerPC board. Messages sent back through GDB kept complaining that the BDM cable was disconnected. This situation was immediately frustrating because I had no idea who was at fault. I considered that the MPC555 board may have been faulty, so I e-mailed Axiom's support and asked how I might confirm the BDM feature. The board did boot up in its serial-port-based monitor and I could run hand-assembled programs, so a lot was working. Also, the BDM port is wired directly to the MCU so not much can go wrong. I also e-mailed support at Macraigor and requested help. While I was waiting for responses, I verified that I could compile PowerPC source down to Motorola S-records and download those to the serial port monitor successfully. That would be a more tedious fallback if I required it.

About an hour later, Axiom responded and said that the serial monitor was programmed into the MPC555 using the BDM port. Since the monitor obviously worked, it was unlikely that the BDM port was the problem. Shortly after that, I received a more definitive response from Macraigor: "The Wiggler is not supported under Linux." Indeed, while I hadn't seen it in my haste, Macraigor's web site did state that fact. Now I had a decision to make. Macraigor's next-level product, the "Raven," was supported under Linux. On the other hand, I could use my Windows NT machine and run the CygWin debugger and the Wiggler from there. I much prefer to do development under Linux/UNIX so I chose to go ahead and order the Raven, which was $700 versus the $165 Wiggler. The money was not the issue for me, being able to develop on my favored platform was well worth the price. Macraigor kindly expedited a Raven to me.

Since I had a forced delay, I decided to learn about my new logic analyzer. I considered installing the GoLogic on the NT machine that had additional parallel ports. Unfortunately, it wouldn't run with the parallel I/O board I'd installed. NCI's web site was very clear about its need for a good EPP port and explicitly said the board I had wouldn't work with its product. Then I decided that I really wanted the analyzer to be driven from my laptop. I installed it there and it worked perfectly. NCI provides a server mode whereby I could still operate the GoLogic GUI from my NT machine while the probe is connected to the Windows 98 laptop. Unfortunately, that feature did not initially work. I collected all the symptom data I could and fired an e-mail to support@nci-usa.com. I received a response within an hour from the person responsible for the GUI software. He said he'd look into it. The next morning, another message arrived indicating that the server problem had been isolated and he'd send me a fix as soon as he'd tested it. The fix arrived a few hours later and worked correctly. There have since been a few other GUI issues and suggestions I've e-mailed to NCI. In each case I've received an immediate and satisfactory response. I consider NCI a model company in terms of its support. Both by phone and by e-mail, its responses were quick and clear.

When Macraigor's Raven arrived, I plugged it in and attempted to go to work. It didn't generate complaints about disconnected cables, but it wouldn't come up and appeared to hang during processing of the initialization file. The PowerPC has lots of features and its BDM protocol is quite a bit different from earlier versions. I compared the initialization files supplied by Axiom and those supplied by Macraigor and used the board-related data from Axiom and the debugger-related data from Macraigor. Lo and behold, I was able to boot up DDD/GDB under Linux attached to my beloved PowerPC target. There remained some reliability issues with this setup and it took several phone-tag exchanges with Macraigor before these were resolved. However, I was now able to download and debug PowerPC code.

As soon as I had a usable test platform (see Figure 1), I set up the logic analyzer to monitor one of the eight fuel channels for the faulty pulse. GoLogic's trigger mechanism didn't support simultaneous triggers on eight channels and I realized that I might have to be very lucky to catch a fault on any one channel. I therefore decided to write some additional TPU microcode for the 68332 that would detect an overly long fuel pulse on any of the MPC555's fuel output channels and produce a fault signal that would trigger the GoLogic. That microcode was easy and was verified by setting the fault threshold to five milliseconds. The moment I commanded any injector to go one microsecond over five mSec, the analyzer would trigger and capture a large window around the event.

The GoLogic transitional analysis feature is very nice. The analyzer will clock at a fixed rate (125 MHz in my case), but it will only consume a storage location if one of its inputs has changed state since the last recorded sample. Each recorded sample is timestamped. The fastest moving signal I had was only running at KHz rates. Thus, using this mode, a center trigger on a fuel error would capture many minutes of real-time signals on both sides of the event.

Experience has taught me that when you are closing in on a nasty bug, lab mishaps are more likely. Looking at my bench, I observed a tangle of boards, probes, a laptop, and a scope — probably about $10,000 worth of equipment but representing a value far greater in project delays if someone snagged any of the cables and dragged something to the floor. I took immediate action. Using plywood, screwnails, and cable ties, I fastened everything down securely and in accessible positions. All cables crossing the floor were strain relieved. In addition, I placed a high-intensity reading lamp nearby for use when probing signals on the boards. Commercial boards rarely come with suitable grounds for attaching oscilloscope ground clips and I've more than once seen ground clips pop off feeble attachments and magically land in contact with a power supply circuit. To avoid this, I soldered several large bare wire loops to the ground on both boards. Finally, I established suitable static protection around the area.

The MPC555 comes in a 272 BGA package. On the Axiom board, all MCU connections are brought out to an array of 272 pins surrounding the package. These are very convenient for attaching logic-analyzer probes, but you need a road map to find desired signals. In the past, I've generated graphic images and scripts for locating signals based on CAD data. The graphics were enlarged and placed conveniently for reference when probing pins. The scripts were a quick way of finding a particular signal without having to search the graphic. In this particular case, I didn't have the CAD data so I had to manually create a pin assignment graphic, but it's easily been worth the hour it took to create. Finally, because my eyes have been looking at embedded systems since nearly the dawn of time, I also keep a good magnifying glass handy to make sure I'm locating the correct pins.

Now I was ready to work. I configured and started my test system. Using GoLogic, I verified that everything was operating as expected. Then I let the system idle. After three days, there had been no long fuel pulses detected. To be sure my trigger setup was still working, I went in through GDB and again commanded a 5.001 Sec pulse on one fuel injector. The trigger and analyzer fired immediately, making me feel confident that the problem had not yet shown up in my lab. Since I couldn't reproduce it, this bug had now met enough requirements to qualify as a class "killer." It is not trivial for my client to collect data from a test vehicle. To facilitate debugging, I produced special microcode that would store CPU-accessible critical data at the start of each fuel pulse. A week later, they sent a huge data dump from my debug code. From this data, it became clear that when the fuel code started the injection, it had been commanded to produce a normal pulse. When the long pulse occurred, it was as though the injector had been updated with new data to produce a huge pulse. However, my client had also recorded the actual commanded pulse widths, and they were at all times completely normal. The only things notable were that the commanded pulse width had decreased slightly when the faulty pulse was produced and that the pulse lasted until just after the next injector's pulse ended. This was similar to the only other fault capture from the real engine. The bug I already knew about might create a symptom like this but only if the CPU updated a fuel parameter near the end of the fuel pulse. Based on the information supplied by my client, updates were phased far away from that position. I quickly adopted all the parameters from the data dump and let the test bed run. After five more days of continuous testing without failure, the finger of guilt seemed to be swinging back toward my client.

Ah Hah!

When all else fails, you can sometimes find a solution to a problem by emphatically declaring that everything you've done is correct and submitting this information to public inspection. This is a variation on solving problems by declaring them unsolvable in some newsgroup and staking your reputation on it. My approach was to summarize the current bug status for one of my client's engineers. The people I work with there are very sharp, and when we run into a difficult problem we will lean our heads together and explain the facts to each other. Often, the process of explaining is all it takes — one of us will see the light. So it was this time. As I was composing the message, I realized there was only one piece of data they had supplied that I could not confirm by the data dump collected from the engine. That was the phase of the control signal that was used to synchronize fuel updates. Saying to myself, "What if they are using the other edge?" I looked at a trace from my analyzer and my heart skipped a beat. The other edge was occurring right in the "danger zone" near the end of each fuel pulse.

The bug that I had previously conjectured required that an update had to diminish the commanded fuel pulse to within two microseconds of the time of the update. That was consistent with the previous evidence that the problem, in the recorded cases, occurred when the pulse width was decreasing. I set up my system to use the "other" control edge and carefully adjusted the CPU timing to produce the bug I knew about. It only took a minute before my trigger system fired and the analyzer captured a trace that looked exactly like the traces my client had sent me. It had been five weeks and thousands of dollars since the initial bug report, but now I had it. I deleted the e-mail I'd been composing and then asked my client to verify the control signal edge polarity. The immediate response was as I expected — their previously supplied information was wrong. The fix took 10 minutes of thought and five lines of microcode and my client has not reported a fuel problem since.

Lessons Learned

The bug itself is a lesson. In these days of multi-MIPS machines, we can get sloppy knowing that a bunch of instructions can be executed before a microsecond timer ticks over twice. I knew that the original critical thread would execute in less than a microsecond. Unfortunately, a subsequent modification to another region of the fuel code resulted in a duplication of instructions, and since I needed code space, I compressed the ends of the two threads into one and jumped into the new thread from the critical thread. Algorithmically, it was correct and it looked shorter due to being terminated by a jump, but the thread was actually longer — taking almost 1.5 microseconds to execute. Since the thread was setting a microsecond-based hardware match register, a tiny window was created in which the code may attempt to set a match for 1 microsecond in the future, but by the time the hardware is updated, the match time will have ticked into the past.

There are other lessons here, some obvious, some not:

Read the fine sales print whenever a purchase is critical to your progress. It might not hurt to call the company and ask if the product can be used the way you want to use it.
When a test equipment problem occurs, be aggressive, if necessary, to get a response.
Design your system for debugging, starting from the beginning. When you are creating a design, imagine the worst possible places for a bug and create a way to debug them. Lay the ground work ahead of time for debugging with your client or your product design team. Know what procedures you will use and what data you will have to collect.

Finally, if you have a really tough bug, explain the symptoms and all related information to another technical person, preferably someone who you'd rather not have discover a bug in your work.

DDJ