Dr. Dobb's Sourcebook September/October 1997

The Cost of Fault Tolerance


Can you afford to make your system fault tolerant? Can you afford not to? The need is obvious when lives are at stake, as in environmental management of surgical suites or airplane maintenance, as well as for applications with services that must be continuously available, as in telecommunications. What about more traditional business users?

Let MTTF=mean time to failure. Let MTTR=mean time to repair. Then the expected fraction of time that service is unavailable is Downtime=MTTR/(MTTF+MTTR), which is approximately MTTR/MTTF when repairs are short compared with the time between failures.

For example, if MTTF=100 hours and MTTR=10 hours, then services will be unavailable roughly 10 percent of the time, and a business whose revenue depends on those services will lose approximately 10 percent of it. In addition, consider related costs -- cost of repair, labor to manage and execute repair, and idle facilities and labor during downtime.
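
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names and the revenue figure are hypothetical, and the model assumes revenue is lost in direct proportion to downtime. It computes both the simple ratio MTTR/MTTF and the fraction of each failure-and-repair cycle spent in repair, MTTR/(MTTF+MTTR).

```python
def downtime_ratio(mttf_hours, mttr_hours):
    """Simple ratio MTTR/MTTF, a good estimate when MTTR is much smaller than MTTF."""
    return mttr_hours / mttf_hours


def downtime_fraction(mttf_hours, mttr_hours):
    """Fraction of each failure-and-repair cycle spent in repair."""
    return mttr_hours / (mttf_hours + mttr_hours)


def lost_revenue(revenue, mttf_hours, mttr_hours):
    """Assumed model: revenue is lost in direct proportion to downtime."""
    return revenue * downtime_fraction(mttf_hours, mttr_hours)


if __name__ == "__main__":
    mttf, mttr = 100.0, 10.0      # figures from the example in the text
    revenue = 1_000_000.0         # hypothetical annual revenue
    print(f"simple ratio:   {downtime_ratio(mttf, mttr):.1%}")
    print(f"cycle fraction: {downtime_fraction(mttf, mttr):.1%}")
    print(f"lost revenue:   {lost_revenue(revenue, mttf, mttr):,.0f}")
```

The two estimates differ by about a percentage point here because a 10-hour repair is not small next to a 100-hour MTTF. They converge as MTTR shrinks, so shortening repairs pays off in the same way as preventing failures.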

How can you minimize downtime? In software, build on a DBMS, middleware, and fail-over systems. For hardware, rely on mirrored disks, RAID, redundant CPUs, and network management.

These systems are becoming widely available at affordable prices. Some can be integrated with little or no impact on your development efforts. The newer ODBMSs can integrate directly with applications, allowing most programmers to benefit from them easily.

-- A.E.W.
