NASA develops robotic spacecraft that perform complex tasks in literally otherworldly conditions, producing hardware and software so dependable that only rare mistakes make headlines. As you saw last month, though, those headlines can arise from simple errors, ones you might make yourself, that somehow slip through verification and testing procedures far more rigorous than anything your projects will ever encounter.

This month I take a closer look at how errors evade detection and give you the opportunity to play code reviewer. Even if your blunders are neither so costly nor public, what you see here may look disturbingly familiar.

Do What We Mean

After the Mars Polar Lander (MPL) vanished, the NASA review board identified several potential mission-killer hardware and software errors, of which premature descent engine shutdown was deemed the most likely. Their review isolated that failure to one routine, then determined how it got there.

Each of MPL's three landing pads has a ground-contact probe attached to a Hall-Effect switch. Unlike those of science-fiction spacecraft, the three pads don't all touch down simultaneously, so engine shutdown occurs when the first probe contacts the surface. If that sensor fails, the signal from the second probe triggers the shutdown.

It is important to get the engine thrust terminated within 50 milliseconds after touchdown to avoid overturning the lander. The high-level requirements specified a 100-Hz sample rate, automatic rejection of a stuck-on sensor, and two successive "touchdown" readings from any single sensor to signal a valid landing. In addition, the use of the touchdown sensor data shall not begin until 12 [later changed to 40] meters above the surface [...] to protect against premature descent engine thrust termination in the event of failed sensors and possible transients.
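Those numbers leave less slack than you might think. Here is a quick check of the timing budget, assuming (my reading, not a statement from the report) that the worst case is a probe making contact just after a sample was taken, so the second confirming reading arrives nearly two full sample periods later. The variable names are mine.

```python
# Timing budget implied by the quoted requirements.
# Assumption (mine): worst case is contact just after a sample was taken.
SAMPLE_RATE_HZ = 100       # required sample rate
SHUTDOWN_BUDGET_MS = 50    # thrust must end within this time after touchdown
READINGS_REQUIRED = 2      # successive "touchdown" readings from one sensor

sample_period_ms = 1000 / SAMPLE_RATE_HZ
worst_case_detect_ms = READINGS_REQUIRED * sample_period_ms
margin_ms = SHUTDOWN_BUDGET_MS - worst_case_detect_ms

print(f"sample period:        {sample_period_ms:.0f} ms")
print(f"worst-case detection: {worst_case_detect_ms:.0f} ms")
print(f"left for shutdown:    {margin_ms:.0f} ms")
```

Under that assumption, detection can eat up to 20 ms of the 50 ms allowance, which leaves about 30 ms for everything downstream of the software.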

With those requirements in mind, test your code-debugging skills with the Pythonesque pseudocode I created from flowcharts in the MPL report describing how each of the three sensors works. Example 1(a) tells the monitoring code in Example 2 to begin sampling the contact sensor inputs. Unlike some real-time systems, Example 2 gets called every 10 ms whether or not it's enabled; this satisfies another requirement to not add sudden CPU loads at critical times.

(a)
def TouchdownMonitorStart() :
 IndicatorHealth = TRUE
 IndicatorState = FALSE
 EventEnabled = FALSE
 TouchdownMonitor = TRUE
 return

(b) 
def TouchdownMonitorEnable() :
 if TouchdownMonitor :
  if LastTouchdownIndicator and CurrentTouchdownIndicator :
   IndicatorHealth = FALSE
  EventEnabled = TRUE
 return

Example 1: (a) This routine starts the touchdown monitor code, which then samples the input sensors every 10 ms. (b) After this routine returns, the touchdown code is fully activated and can shut off the descent engines.

def TouchdownMonitorExecute() :
 if TouchdownMonitor :
  LastTouchdownIndicator = CurrentTouchdownIndicator
  CurrentIO = ReadIOSensors()
  if IOError() and not EventEnabled :
   CurrentTouchdownIndicator = FALSE
  else:
   CurrentTouchdownIndicator = CurrentIO
  if LastTouchdownIndicator and CurrentTouchdownIndicator :
   IndicatorState = TRUE
  if IndicatorState and IndicatorHealth and EventEnabled :
   DisableThrusters()
   TouchdownMonitor = FALSE
   EventEnabled = FALSE
 return

Example 2: Does this routine meet its specifications? Does it do so under all conditions?

By the way, if global variables give you the heebie-jeebies, you're just not cut out for this line of work.

The Entry, Descent, and Landing (EDL) control program calls the code in Example 1(b) at 40 meters to activate the touchdown monitor by setting EventEnabled. If the sensor has failed stuck-on, it will always read as active and IndicatorHealth will be cleared. There's no check for a stuck-off failure, which will be handled by simply waiting for the next probe to touch the surface.

Example 2 implements a straightforward two-in-a-row filter that ignores spurious single-sample events. When two successive samples indicate touchdown, the code shuts off the descent engines and disables further testing.

Now, for an eighth of a billion dollars, answer this simple question: Will it work? If not, why not and how would you fix it?

Here's a hint. The legs deploy when the lander is a kilometer or two above the surface, well before the EDL code calls TouchdownMonitorEnable() at 40 meters. However, the Start() routine has already set TouchdownMonitor and the Execute() routine has begun sampling the sensors, testing for two successive TRUE inputs from each contact switch.

Examine that code again: Well over 100 Ultra-Large rests on your analytic prowess!

Need another hint? [A] 2001 lander was used for two leg deployment tests ... . The first test resulted in transient times of 12, 26.5, and 7.3 milliseconds ... . The second test resulted in transient times of 16, 12, and 25 milliseconds ... .

Any transient longer than 10 ms after the code begins sampling can cause two successive "valid" samples, at which point Execute() sets IndicatorState. That variable is never turned off, so as soon as the EDL code calls Enable(), the engines shut off.
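To see how the measured transients stack up against a 10 ms sample clock, here is a small sketch. It assumes each transient is a single unbroken TRUE pulse that can begin at any phase relative to the sample clock; the report excerpts quoted here don't say that, and the function names are my own.

```python
import math

SAMPLE_PERIOD_MS = 10.0   # 100-Hz sampling, from the requirements

def samples_caught(duration_ms, period_ms=SAMPLE_PERIOD_MS):
    """Return (fewest, most) samples that can fall inside one pulse.

    Assumption: the transient is one continuous TRUE pulse whose start
    time is arbitrary relative to the sample clock.
    """
    ratio = duration_ms / period_ms
    return math.floor(ratio), math.ceil(ratio)

def verdict(duration_ms):
    """Say whether a pulse of this length yields two successive TRUE samples."""
    fewest, most = samples_caught(duration_ms)
    if fewest >= 2:
        return "always"
    if most >= 2:
        return "sometimes"
    return "never"

# Transient times from the two leg deployment tests quoted in the hint.
for test, durations in (("first", [12, 26.5, 7.3]), ("second", [16, 12, 25])):
    for d in durations:
        print(f"{test} test, {d:>4} ms transient: {verdict(d)} two in a row")
```

By this reckoning, five of the six measured transients could produce two successive TRUE samples, and two of them would do so no matter how they lined up with the clock.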

It took me a protracted pencil-and-paper session to verify the report's conclusions and reject a few false assumptions, but, yup, that's exactly how it works.
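If you'd rather let the computer push the pencil, the following sketch transcribes Examples 1 and 2 into runnable Python for a single sensor, flaw included. The class name, method names, the thrusters_on flag standing in for DisableThrusters(), and the sample sequence are all my inventions, and I assume no I/O errors occur.

```python
class TouchdownMonitor:
    """One sensor's monitor, transcribed from Examples 1 and 2 (flaw included)."""

    def __init__(self):
        self.monitor = False       # TouchdownMonitor
        self.health = True         # IndicatorHealth
        self.state = False         # IndicatorState
        self.enabled = False       # EventEnabled
        self.last = False          # LastTouchdownIndicator
        self.current = False       # CurrentTouchdownIndicator
        self.thrusters_on = True   # stand-in for the descent engines

    def start(self):               # Example 1(a)
        self.health = True
        self.state = False
        self.enabled = False
        self.monitor = True

    def enable(self):              # Example 1(b), called at 40 meters
        if self.monitor:
            if self.last and self.current:
                self.health = False
            self.enabled = True

    def execute(self, sensor, io_error=False):   # Example 2, every 10 ms
        if self.monitor:
            self.last = self.current
            if io_error and not self.enabled:
                self.current = False
            else:
                self.current = sensor
            if self.last and self.current:
                self.state = True                # latched, never cleared
            if self.state and self.health and self.enabled:
                self.thrusters_on = False        # DisableThrusters()
                self.monitor = False
                self.enabled = False


mon = TouchdownMonitor()
mon.start()

# Leg deployment: a transient that spans two samples, then silence.
for sensor in [False, True, True, False, False]:
    mon.execute(sensor)
print("after deployment: state =", mon.state, " thrusters on =", mon.thrusters_on)

mon.enable()          # EDL code reaches 40 meters
mon.execute(False)    # next 10 ms tick, probes still in thin Martian air
print("after enable:     state =", mon.state, " thrusters on =", mon.thrusters_on)
```

The second line of output shows the engines off on the first tick after enable, with every probe still reading FALSE.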

The Board found that "[t]he touchdown sensing software was not tested with the lander in the flight configuration. Because of this, the software error was not discovered during the verification and validation programs." The initial requirement to ignore sensor transients somehow didn't make it into the software specification used to design the code. The programmers, who weren't aware that switches often produce glitches, simply didn't take that failure mode into account. The system testers, lacking that key spec, did not feel an all-up test would be worth the not-inconsiderable effort and risk.
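As for how you might fix it, here is one possible repair along with the sort of test that would have caught the original problem. This is my own sketch, not the flight code or the Board's prescription: it assumes that clearing IndicatorState when the EDL code enables the monitor is sufficient, so that only samples taken below 40 meters count. PatchedMonitor, shutdown_tick(), and the sample sequences are hypothetical.

```python
class PatchedMonitor:
    """Single-sensor monitor with one assumed fix: forget pre-enable history."""

    def __init__(self):
        self.monitor = False
        self.health = True
        self.state = False
        self.enabled = False
        self.last = False
        self.current = False
        self.thrusters_on = True

    def start(self):
        self.health = True
        self.state = False
        self.enabled = False
        self.monitor = True

    def enable(self):
        if self.monitor:
            if self.last and self.current:
                self.health = False      # stuck-on sensor, as before
            self.state = False           # assumed fix: discard earlier glitches
            self.enabled = True

    def execute(self, sensor):
        if self.monitor:
            self.last = self.current
            self.current = sensor
            if self.last and self.current:
                self.state = True
            if self.state and self.health and self.enabled:
                self.thrusters_on = False
                self.monitor = False
                self.enabled = False


def shutdown_tick(monitor, samples, enable_at):
    """Feed one sample per 10 ms tick; return the tick at which thrust stops."""
    monitor.start()
    for tick, sensor in enumerate(samples):
        if tick == enable_at:
            monitor.enable()
        monitor.execute(sensor)
        if not monitor.thrusters_on:
            return tick
    return None


# Ticks 1-2: leg deployment transient. Tick 5: enable. Ticks 8-9: real contact.
descent = [False, True, True, False, False, False, False, False, True, True]
print(shutdown_tick(PatchedMonitor(), descent, enable_at=5))     # prints 9

# A sensor stuck on from the start must never stop the engines by itself.
print(shutdown_tick(PatchedMonitor(), [True] * 10, enable_at=5)) # prints None
```

With the latch cleared at enable, the deployment transient is forgotten and shutdown waits for two successive TRUE samples from the real contact, while the stuck-on check still does its job.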

The lack of telemetry during EDL made it impossible to determine if the landing leg deployment transients set the touchdown state to True during the leg deployment.

The Board recommended some additional checks-and-balances to prevent a replay of the MPL error. Another incident shows how such procedures play out in real life.

For Lack of a Bolt

NASA orbited the first Television Infra-Red Observation Satellite (TIROS) in 1960, followed by increasingly complex satellites through the NOAA-KLM series jointly controlled by the National Oceanic and Atmospheric Administration. The current satellites have a minimum two-year lifetime, which requires a fairly steady stream of replacement birds, a pace that seems difficult to maintain.

Two satellites in the same polar orbit observe each point on the earth every 12 hours, so they're built in pairs. The project manager for the new NOAA-N and N' (also N-Prime) satellite pair convened a risk-management meeting by asking attendees to brainstorm possible failure modes and create procedures to prevent those mishaps. He explicitly ruled out things that simply can't happen, such as dropping a satellite on the floor. The 6 September 2003 phone call describing the situation in Figure 1 came as a nasty surprise.


Figure 1: The NOAA N-Prime satellite slipped off its turnover cart because 24 bolts that should have connected the adapter ring to the cart were missing. Photo from the NASA N-Prime Report. Courtesy of NASA.

During much of its assembly, a satellite stands vertically on its base atop a handling cart, but some operations require other alignments. The satellite's structure has enough rigidity to support it horizontally from its base, which terminates in the flight adapter that joins the satellite to its booster and serves as a robust mechanical connection during construction and handling.

The white turnover cart (TOC) in Figure 1 includes a hinged plate that supports the satellite, with a bearing that rotates the satellite around its long axis. Actuators pivot the plate between vertical and horizontal positions. Because Lockheed-Martin Space Systems Company (LMSSC) also builds the Defense Meteorological Satellite Program (DMSP) satellites, which are similar to the NOAA N-series birds, in the same building, the two projects share a common set of TOCs. Even though the satellites both ride Titan-II boosters into orbit, they have different flight adapters and thus require different adapters between their bases and the TOC.

The DMSP Program decided to use this TOC for its activities ... . The reconfiguration was interrupted part way through the process of the TIROS adapter ring removal, in order to install the DMSP adapter ... . This change in plan left the TIROS adapter ring sitting on the TIROS TOC with its 24 attachment bolts removed.

The TOC already had a "red tag" due to a damaged floor jack. The Board discovered that [n]o red tag nor any other indication was added to the TIROS TOC to indicate the incomplete configuration. None of this was communicated to the TIROS folks ... , because the over-riding philosophy was that each user was required to verify or ensure the [Ground Support Equipment] configuration was appropriate for its own specific use each time it was used.

The TIROS crew repaired the jack, hoisted the satellite with an overhead crane, lowered it on the TOC adapter, then installed and torqued 44 bolts between the satellite's flight adapter and the TOC adapter. During the operation, the Technician Supervisor commented that there were empty bolt holes, a conversation that was overheard by several of the technicians. The team and the RTE [Responsible Test Engineer] in particular dismissed the comment and did not pursue the issue further.

A total of 88 bolts secure the satellite's flight adapter to the booster, but because ground handling imposes far less stress than launching, standard practice calls for installing only half the bolts. That left 44 "normally empty" holes in the same area as the 24 missing bolts, obscuring the problem while simplifying the procedure: ...members of the crew recognized that, "This was the smoothest this operation has ever gone."

With N-Prime standing atop the TOC, the crew activated the pivot motors and, when the TOC plate reached an angle of 13 degrees, the satellite simply slid off the plate and punched a neat crescent dent in the TOC on its way to the floor. Nobody got hurt, but it could have been much, much worse, as the Nickel-Cadmium batteries were fully charged, the propulsion system was pressurized, and the separation band was tensioned.

As you might expect, the Board found "[t]here were missed opportunities that could have averted this mishap."

Checks and Balances

The MPL Report includes several project-level recommendations that seem applicable to nearly any software project.

R1) For highly cost- and schedule-constrained projects, it is mandatory that sufficient systems engineering and technical expertise and the use of the institution's processes and infrastructure be applied early in the formulation phase to ensure sound decision making in baseline design selection and risk identification.

In other words, when you're expected to work smarter, you must actually think about the project, preferably before you begin. Ignoring potential problems is not a strategy.

R2) Do not permit important activities to be implemented by a single individual without appropriate peer interaction; peers working together are the first and best line of defense against errors. Require adequate engineering staffing to ensure that no one individual is single string; that is, make sure that projects are staffed in such a way as to provide appropriate checks and balances.

We've found that bouncing your coding ideas off somebody else really is the first and best line of defense against errors. That person must, of course, be competent to recognize errors and omissions, which was part of the problem in both of these mishaps.

While testing cannot ensure the absence of errors, it can demonstrate that the project meets specifications and, with any luck, also show that the specifications don't have any glaring omissions. The tests must match up with reality, rather than with an idealized model: end-to-end validation...through simulation and other analyses was potentially compromised in some areas when the tests employed to develop or validate the constituent models were not of an adequate fidelity level to ensure system robustness.

Reading these NASA mishap reports makes perfectly clear the suffocating number of reviews, cross-checks, verifications, and sign-offs required to get anything done. Many of the recommendations boil down to "be more careful" and "check more often," but it seems that, beyond a certain point, humans simply cannot and will not do more checking.

The N-Prime mishap clearly reveals that limiting point. The Report does not include the checklist itself, but the missing steps give some indication of how an intricate procedure can go wrong in real life.

L-1) The RTE decided to "assure" the cart configuration through an examination of paperwork from a prior operation rather than through physical and visual verification. The RTE made a second decision error in dismissing a comment by the Technician Supervisor concerning empty bolt holes.

L-2) The technicians, with the exception [of the] Technician Supervisor noted above, failed to notice the missing bolts, even though they were working within inches of where the bolts were supposed to be.

L-3a) The PQC [Product Quality Control] and the PA [Product Assurance] signed-off on "assure the configuration" of the TOC procedure step without personally validating the TOC configuration or, in the case of the PA, even being present at the time this step of the procedures was completed during the operation.

L-3b) The safety representative was not present as called for in the procedure. Again, this investigation determines that such a violation is routine.

These elements described above led the MIB to conclude that decision and skill-based errors and routine violations by the NOAA N-PRIME I&T [Integration and Test] team were manifested as a failure to adhere to procedures.

Last Tab

Space missions, unlike software projects, can't use iterative design. You simply don't release an incomplete Version 0.1, get user feedback, tweak the design, release Version 0.2, and continue iterating until the code works well enough. Instead, you spend a few hundred million dollars over a decade or so, and get exactly one attempt. Should you overlook something, you must start all over.

Complacency at any point can kill such a project and, I think, accounts for much of ordinary software's error rate. If we all did what we know we should, most of the errors simply wouldn't happen. Be honest: Have you actually analyzed all the possible failure modes for that fancy SOA project?

A simple fault tree for your current project will prove an enlightening activity, if only when it shows how much code you cannot control. Even better, it might reveal ways to improve your own error handling and prevent hostile intrusions.
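A fault tree need not be elaborate to be useful. The sketch below builds one from AND/OR gates and evaluates it against a set of basic events. The event names are my own loose summary of the MPL story, chosen for illustration only, and the helper functions are hypothetical rather than part of any standard fault-tree tool.

```python
def all_of(*children):
    """AND gate: the fault occurs only if every child occurs."""
    return ("AND", children)

def any_of(*children):
    """OR gate: the fault occurs if at least one child occurs."""
    return ("OR", children)

def occurs(node, facts):
    """Evaluate a tree whose leaves are event names looked up in facts."""
    if isinstance(node, str):
        return facts[node]
    gate, children = node
    results = [occurs(child, facts) for child in children]
    return all(results) if gate == "AND" else any(results)

# Illustrative top event, loosely modeled on the MPL touchdown monitor.
premature_shutdown = all_of(
    "monitor samples sensors before enable",
    "touchdown state latched and never cleared",
    any_of(
        "leg deployment transient longer than 10 ms",
        "other glitch spanning two samples",
    ),
)

facts = {
    "monitor samples sensors before enable": True,
    "touchdown state latched and never cleared": True,
    "leg deployment transient longer than 10 ms": True,
    "other glitch spanning two samples": False,
}
print("premature shutdown possible:", occurs(premature_shutdown, facts))

# Remove any one AND input and the top event goes away.
fixed = dict(facts, **{"touchdown state latched and never cleared": False})
print("after clearing the latch:   ", occurs(premature_shutdown, fixed))
```

Even a toy tree like this one makes the single points of failure stand out: every input to the top AND gate is a place where one check would have broken the chain.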

The more procedures you put in place to prevent errors, however, the more opportunity you'll have to simply ignore those rules in order to get the job done. Has anyone ever factored that into a management decision?

No combination of keywords turns up the NASA newsletter containing the N-Prime director's summary of the risk-assessment meeting and that fateful phone call, but it's out there somewhere. Memo to self: Always save the links.

Error Checking

by Ed Nisley
