Fail-Safe System Design


Fail-safe systems are only safe within their design parameters. The design parameters indicate the types and modes of failure that the designer must consider and the conditions under which the system must operate safely. Within the parameters that the customer and regulators establish, a fail-safe system can have no failures that lead to unacceptable consequences; that is what makes it a fail-safe system. A fail-safe system does not necessarily run at full capacity after a failure, but it must safely handle all specified failures. What constitutes an acceptable consequence of a failure, and which failures and events fall inside or outside the design, generate a great deal of discussion and disagreement.

Industries and regulating authorities have their own interpretations of "fail-safe." The aerospace industry uses multiply redundant controllers to detect and work around failures. Some American railway codes require the designer to provide for failures that might occur once in 10 million years. Swiss authorities are reluctant to recognize any controller as part of a fail-safe system. Discovering the relevant codes and requirements is the first step in designing a fail-safe application.

Because a fail-safe system must never fail to respond to certain critical events, its designers go to great lengths to verify the system's proper operation. This requires self-checking equipment, which makes for a truly fascinating software project.

Early in the design process, the designers identify the "known safe states," the requirements of each, and their place in a hierarchy of states. A known safe state is a level of service at which the system can safely run, so long as the prerequisites for attaining that state remain valid. The hierarchy indicates the system response to a failure at each level. If a failure makes the current state unsafe, the system will repeatedly revert to the next lower state until it finds a valid safe state. For the system discussed in the software section, slow is safer than fast, drive power off is safer than power on, brakes applied is safer than brakes released, unregulated mechanical brake is safer than regulated electronic brake, and so on. While these might be a good starting point for a ground-based system, the particulars of an airborne system will be quite different.
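The step-down rule can be illustrated with a short Python sketch. It is only a model under stated assumptions: the state names, the input keys (drive_ok, speed_sensor_ok, electronic_brake_ok), and the idea that the lowest state needs no prerequisites are all hypothetical and are not taken from any particular system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical snapshot of health inputs; the keys are illustrative only.
Inputs = Dict[str, bool]


@dataclass(frozen=True)
class SafeState:
    name: str
    # Returns True only while every prerequisite for this state holds.
    prerequisites_ok: Callable[[Inputs], bool]


# Ordered from the highest level of service down to the most conservative state.
HIERARCHY: List[SafeState] = [
    SafeState(
        "full_speed",
        lambda i: i["drive_ok"] and i["speed_sensor_ok"] and i["electronic_brake_ok"],
    ),
    SafeState("reduced_speed", lambda i: i["drive_ok"] and i["electronic_brake_ok"]),
    SafeState("power_off_electronic_brake", lambda i: i["electronic_brake_ok"]),
    # Final fallback: assumed here to have no prerequisites at all.
    SafeState("power_off_mechanical_brake", lambda i: True),
]


def next_valid_state(current_index: int, inputs: Inputs) -> int:
    """Step down from the current state until a state's prerequisites hold.

    The index only ever increases (moves toward safer states); this function
    never promotes the system back to a higher level of service.
    """
    index = current_index
    while index < len(HIERARCHY) - 1 and not HIERARCHY[index].prerequisites_ok(inputs):
        index += 1
    return index
```

In this sketch, a failed speed sensor while running at full_speed yields reduced_speed, and a failed electronic brake yields power_off_mechanical_brake regardless of the starting state. Returning to a higher state is deliberately left out, since that decision would normally need its own checks.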

To verify that a controller and its software are in a safe state, designers put them through a continuous series of run-time tests. To this end, watchdog timers, when properly intertwined with the hardware and the software, verify that the controller is scanning its inputs, its outputs, and all of its software on a regular basis. These timers — sometimes internal to the controller, usually external to it — verify every few milliseconds that the required processing is taking place. If the controller fails to refresh a watchdog timer before it expires, the watchdog equipment will disconnect the controller in order to return the system to the next known safe state.
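The refresh-or-expire behavior of a watchdog can be modeled in a few lines of Python. This is a software stand-in for what is usually external hardware, written as a sketch: the Watchdog class, its on_expire callback, and the injectable clock are assumptions made for illustration, not part of any real watchdog interface.

```python
import time
from typing import Callable


class Watchdog:
    """Software model of a watchdog timer that latches once it expires."""

    def __init__(
        self,
        timeout_s: float,
        on_expire: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout_s = timeout_s
        self._on_expire = on_expire  # hypothetical hook, e.g. disconnect the controller
        self._clock = clock
        self._deadline = clock() + timeout_s
        self.tripped = False

    def refresh(self) -> None:
        """Called by the controller after a complete scan; ignored once tripped."""
        if not self.tripped:
            self._deadline = self._clock() + self._timeout_s

    def poll(self) -> bool:
        """Called by the monitoring side; fires on_expire once if the deadline passed."""
        if not self.tripped and self._clock() >= self._deadline:
            self.tripped = True
            self._on_expire()
        return self.tripped
```

Two choices in the sketch are worth noting. The trip is latched, so a late refresh cannot quietly bring a disconnected controller back. The clock is passed in, so the timing logic can be exercised with a simulated clock rather than real delays.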

As an overall comment on fail-safe software design, it is important that every module return a non-error status only when everything it monitors is correct. A word processor can use certain defaults and be tolerant of certain types of errors, but in fail-safe design any deviation must automatically trigger a fault and return an error. A system is not fault tolerant because the individual classes do their best to continue running when faced with a fault; a system is fault tolerant when it detects any fault immediately and is able to switch to other, known good units.
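A minimal Python sketch of this rule follows. The Unit class, its self_checks list, and select_good_unit are hypothetical names introduced for illustration; the point is only that a status is reported as good when every check positively passes, and that the caller switches units rather than tolerating a fault.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Unit:
    name: str
    self_checks: List[Callable[[], bool]]

    def status_ok(self) -> bool:
        """Return True only if every check positively passes."""
        if not self.self_checks:
            return False  # nothing was verified, so nothing is reported as good
        for check in self.self_checks:
            try:
                # Strict on purpose: anything other than the value True is a fault.
                if check() is not True:
                    return False
            except Exception:
                return False  # a check that cannot run counts as a fault
        return True


def select_good_unit(units: List[Unit]) -> Optional[Unit]:
    """Return the first unit whose checks all pass, or None if no unit is good."""
    for unit in units:
        if unit.status_ok():
            return unit
    return None
```

A result of None means that no known good unit remains, at which point the caller would be expected to drop to a lower known safe state rather than keep running on a suspect unit.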