Embedded/Real Time Systems


Software for Fail-Safe Applications

Dan Wisehart

Fail-safe software isn't an oxymoron, despite what some cynics insist. But it isn't easy to write, either.


When designers must protect human lives and valuable assets with automatic equipment, they turn to fail-safe design. Fail-safe software systems are in use around the world, despite the reservations of many regulatory agencies about using software (especially C and C++). What is considered "fail-safe" varies (see the sidebar, "Fail-Safe System Design"), but all fail-safe designs incorporate certain common components. The two basic methodologies in fail-safe design are:

1) use expensive, non-redundant components, that cannot fail until far above normal system load is applied for several times the system life, or

2) use relatively inexpensive, redundant components, working in parallel. Redundant components open the door for relatively inexpensive controllers whose operation can be continually checked and cross-checked.

In this article I explore the latter option. I review the hardware such systems must interface to, and then develop a complementary software base that makes fail-safe operation possible.

The Hardware

Redundant systems use some combination of the following: multiple inputs that monitor the same parameter, multiple outputs that control the same device, self-testing equipment, fault-tolerant equipment, and redundant controllers. The designer must ensure that this equipment detects and safely responds to any single-point failure (or string of related failures) in a vital path. In such systems the output from each hardware or software stage feeds at least two inputs to the next stage. Because the parallel chains are interconnected, the monitoring equipment need not grow exponentially with the length of the chain. With two or more controllers, external equipment will continuously compare the internal and external state of each controller. Even if the external states agree — for the current moment — the comparison equipment detects errors and failures within the controllers and their software before they can propagate.

Digital I/O can be especially confounding in a fail-safe system, because some I/O lines will remain at a constant state for hours or days at a time. A system normally looks for changes of state to determine if an I/O point is responding, but hours or days is enough time between active events for redundant chains to develop multiple failures. Fail-safe designers use several methods to continuously and actively verify the status of the controller's digital I/O channels. To automatically detect an open or shorted circuit, an active, current-on signal indicates safe status; absence of signal indicates the unsafe condition. Digital inputs come in redundant, state-inverted pairs: if one is high, the other must be low. Digital counters require pulses from two tachometers in the correct order to register a single count and allow application of test signals to their inputs between counting events. Digital outputs come in redundant pairs and undergo a special test which I describe later. Before each start, those relays, mechanical brakes, and other operating equipment which cannot be de-energized while the system is running are de-energized in a "drop-out test." During this test the system checks the configurable parameters of the controller.

Analog I/O consists of actively changing, complex signals whose correct level varies from moment to moment. These I/O points can experience many kinds of failure, including scaling or gain errors, offset errors, single bit failures, slow response, and simple signal loss. Analog inputs come in scaled (using non-binary factors) and offset redundant pairs. The changing nature of analog outputs often prevents external hardware from doing much level verification; thus, the software becomes more fully responsible for analog I/O.

Multiple analog outputs cannot usually control a single device at the same time. Instead, the designer will provide a separate and redundant software monitoring facility as a part of the analog control system.

The Software

In the condensed code examples that follow, I have included just those features that highlight their fail-safe design. I have not included constructors, destructors, or member variables, but I have shown the methods that return the internal state of the objects, because the software uses these to verify the configuration of the object during the drop-out test. Most real-time operating systems are multi- threaded, but I have left out the many needed synchronization locks. I have not shown how to filter or debounce the inputs; that would take another complete article — the system conditions them prior to my using them in these examples. Because the solutions are largely controller determined, these classes do not address the problem of simultaneity: during the time between reading inputs, the state or levels can change. Some controllers provide a snapshot command that allows you to store the state of several inputs at one instant in time. Other controllers buffer their I/O in such a way that the user software sees the inputs as if it read them at the same moment. In some cases, you may have to use several readings and your own ingenuity to compare separate inputs.

Listing 1 defines the constants used in other parts of the source code. Listing 2 shows the AnalogInput class, which functions as a single software analog input for a fail-safe application. I designed it to read two, scaled but not offset, hardware inputs that indicate the same parameter. I intend that two instances of this class will watch each pair of hardware inputs as the first link in a redundant software chain. The level method reads the hardware, re-scales one input to put it at the same level as the other (the scaling factor is negative if the input is sign inverted), refreshes the auditor, and verifies that the signals are the correct sign and within their allotted window. These are the only checks requied within this level method, but in some applications the method might also check that the average signal is within an allowable window.

I generally prefer to keep the software inputs and outputs as simple as possible, so I put error checking for other than hardware failures in separate modules. This level method accepts signals within a window, which includes both multiplication (+/- 5%) and addition (+/- 0.5) terms. The reason for this rather elaborate window is that analog-to-digital converters are accurate within a few percentage of the full-scale input: a +/- 2% error for a 10.0 volt input gives a whopping +/- 20% error for a 1.0 volt signal. The multiplier takes care of the large signals and the adder keeps small signals from falsely indicating a failure. If the class detects a failure, it triggers a fault through an Auditor class (described below), which latches the fault and returns an error. A Configure object sets the scaling factor and a MasterRun object calls the reset method as needed.

I keep the AnalogOutput class (Listing 3) simple by providing the output but leaving most of the regulating and monitoring for other classes. The only regulating AnalogOutput does is divide large changes into a number of smaller changes which, combined with some hardware filtering, provide a linear ramped output from step inputs. This feature is intended to provide a smooth waveform when the output is quickly returned to its neutral position. The regulating classes that use an AnalogOutput will likely generate an "S" shaped output that always changes more slowly than the allowable step size. AnalogOutput can be configured with an operational step size, which you may use repeatedly; you program the maximum allowable step size into the monitoring classes.

Any class can check the idleLevel, but only the Configure class can set the idleLevel and stepSize. The controlling part of an analog output chain is largely a single chain of non-redundant modules that control single hardware analog outputs. The monitoring classes, which will redundantly read the same inputs and outputs, verify the software output of the control chain and deactivate the digital outputs in the case of a detected failure.

The reset method behaves a little differently in this class than it does in the other classes. Instead of demanding a certain amount of time between resets, this method requires the output to return to an idle level before it will accept new levels for the output. Returning to idle, of course, takes time — more time the further the output signal is from the idle level — and it ensures that each run starts from an idle position. Consider what happens if you throw your car into drive while the engine is racing and you will understand why I take this precaution.

The DigitalOutput class of objects (Listing 4) control two hardware outputs as a single software device. I designed it such that several instances may simultaneously control the same hardware outputs. A MasterRun object activates the hardware by calling the reset method; subsequent calls of the safe method only reset the auditor. If some object triggers a fault, the class releases the hardware outputs (allows them to go inactive) and then latches the internal fault state until the next reset. Subsequent calls of safe or fault will again release the outputs, in order to keep any other instance from driving the outputs active while this object holds a latched fault. Often, but not in this example, no reset is performed until the system has successfully arrived at the next known safe state. This class's reset method cannot be called within five seconds of the previous call. This enforced wait prevents a string of resets from disabling the safe operation of the output. It also keeps the system from trying to go back to a higher level of service before it has settled at the new state.

The DigitalInput class (Listing 5) accepts two, state-inverted hardware inputs. As long as the inputs agree, the state method returns the uninverted state. If the class detects a fault it will no longer refresh the auditor, which latches the fault until the next reset. If a class were to reset itself when a transient fault cleared, it could cause the system to surge or behave wildly.

The software inputs differ from the software outputs in that they have no direct control over a hardware output by which to signal a fault. Instead, the software input classes use an auditor to signal an internal fault to the external system. By now it is clear why each I/O software point needs its own auditor. The auditor verifies that the software and its associated hardware points are scanning.

The auditors fulfill a typical need in fail-safe design: wherever potential failures and errors arise, the code must trap for them. As long as methods of DigitalInput do not return an error, the objects that call them will see a steady stream of states that indicate some fact about the system on which they can safely rely. If there is a hardware input failure, you know it. If the controller does not regularly scan the module, you know it. If the reset is malfunctioning, you know it. The system is fail-safe because it redundantly catches every predefined single-point failure.

The Auditor and WatchDog classes in Listing 6 watch over most classes and provide the base class for the specialized Auditor that handles the DigitalOutput class. (The DigOutAuditor class needs extended behavior to test its hardware outputs.) The Auditor class looks for a regular refresh from the client (20 times per second currently) and latches a fault if it is less frequent. It is the watchDogCheck method that binds the Auditor instances together and drives the WatchDog timer output. The WatchDog class contains an array of pointers to all of its Auditors, which it steps through, calling each Auditor's watchDogCheck method. If an Auditor returns an error, the WatchDog latches the fault, does not reset its timer, but continues scanning anyway. It is very important that the Auditor and the WatchDog be as simple as possible. Because they verify the operation of much of the software; errors here affect the safety of the entire controller system.

Listing 7 shows the DigOutAuditor class. The DigOutAuditor class inherits all the behavior of the Auditor class — it even uses the Auditor refresh method — and it extends the refresh method. First, DigOutAuditor's refresh method looks up (in a static structure available to all objects of this class) the last time an object ran a hardware test. If more than doTestInterval milliseconds have elapsed, refresh initiates a new test of the hardware outputs.

Depending on the hardware, the refresh method could test it in one of several ways. This version of the test does the following: deactivate an output, read its state through a digital input, and then reactivate it, in fewer microseconds than the output relay can respond. The test checks both outputs while deactivating only one, so as to verify that the outputs are still functionally separate. If the two hardware outputs were under the control of a single controller output, they would lose the benefits of a redundant chain.

The static structure allows several DigOutAuditors to watch over the same hardware outputs while maintaining the same testing interval. If each object had its own indication of when it had last tested the outputs, each new instance would decrease the actual testing interval. Besides being unnecessary, having two Auditors test the same output at the same time would potentially create a conflict and cause erroneous results.

My code has a rather relaxed testing schedule, at one test every fifteen seconds, because the hardware outputs have a low failure rate that makes the active-on failure of both outputs within this interval well beyond the design guarantees. Note that even in this outlandish case the WatchDog would discover the failure during its next scan, allow the watchdog timer to expire and initiate its own shutdown.

A fail-safe system that uses two or more controllers for vital controls must provide the internal states of the controller to external equipment for comparison. This comparison looks for an exact match between controllers — any difference indicates a failure. If the system uses three or more controllers, it can locate the faulted unit by an odd-man-out voting process. Because these state checks include a time stamp, the software must include a method of synchronizing the clocks of the various controllers. Whatever the process for gathering, communicating, and comparing the controllers' states, the synchronization method must not hold up the normal flow of processing. Otherwise, a failure in one controller might hang all the other controllers, if they were waiting for a response to continue processing.

Conclusion

You now have a solid start from which you can develop a fail-safe application. There is still a great deal of work to do, but I have prepared the basic foundation. By starting at the bottom with simple fail-safe I/O, you can build a higher-level application that is still fail-safe. As long as you continue to test for and trap single-point failures in redundant fashion, you will create a fail-safe application. o

Daniel Wisehart is a contract programmer currently working for a hedge fund in Bermuda. Besides his computer interests, he is an analog and digital hardware designer whose past projects include people moving transportation systems. He can be reached at dwisehart@yankee.us.com.