Dr. Dobb's Sourcebook September/October 1997
From a computer user's point of view, the difference between toy programs and useful production tools is robustness and reliability. Reliability means that the software does not fail; robustness means that when it does fail, it behaves reasonably. When selecting software, inexperienced users compare product features. Experienced users, however, know it's the quality of the software that makes the difference between success and failure.
From the programmer's point of view, the difference between a prototype project and a production system is error handling. Prototypes quickly implement an algorithm and user interface, but don't take into account all the things that can go wrong. Adding error-handling code is the hard part: It typically takes far longer than the original prototype and often clutters the short, simple, clean original -- creating a maintenance nightmare.
Newer programming techniques for exception handling help clean up the programmer's task by separating error handling from the major flow of the program: exceptions are signaled, execution jumps to another piece of code that handles them, possibly returning, possibly not. Still, that error-handling flow must be designed and programmed.
Furthermore, this approach typically handles only the errors the application or designer expects. What happens when the power goes out, a node fails, a disk fails, a network link fails in a distributed system, or even when the application or the system has a bug? These things happen in real life, but most programs -- even when an effort has been made to handle exceptions -- do not behave reasonably in such situations. The result, in some applications, could be loss of life; in others, it could be loss of substantial business.
Another approach augments exception handling by adding separate systems (hardware and/or software) specifically for managing such failure situations. In software, the most common supplements are built into Database Management Systems (DBMSs), which are traditionally used when the information is critical and its integrity must be guaranteed. These systems can increase reliability, reduce loss if there is a failure, minimize the effort required to restore from a failure, and increase availability of system information.
Historically, most programmers did not use DBMSs -- only certain types of systems were developed using DBMSs, and only specially trained programmers worked directly with the DBMS. With today's increasing demand for data stability (see the accompanying text box entitled "The Cost of Fault Tolerance"), more and more software systems require the reliability benefits of DBMSs. Luckily, the new breed of DBMSs -- especially Object Database Management Systems (ODBMSs) -- has significantly increased ease of use and accessibility to nonspecialist programmers. Instead of a separate DBMS language and separate DBMS data structures (which you must translate to and from your program data structures), these ODBMSs allow programmers to simply use their native structures and operations (combined into objects) directly from the programming language. The DBMS automatically manages them. There is no need to learn a new approach or to translate data structures. There is no need to "fetch" data from storage or "store" data back. Programmers simply use the objects as desired.
As systems become more distributed, involving more computers, disks, and networks, the possible modes of failure multiply, and the possibilities of failure increase. At the same time, these distributed systems can dramatically reduce the frequency of failures and the damage such failures cause. In this article, I'll discuss current DBMS facilities, future additions, and how these affect you.
Fault tolerance, as a quality, means that a system continues to run even when something fails. Certain businesses have always demanded fault tolerance -- hospital power systems (Landis & Gyr, for instance), telecommunication systems (such as Nortel), process-control systems for manufacturing (Fisher-Rosemount), and the like. The nature of these businesses demands that the system provide continual service. Also, global operation means that remote, unattended systems will be accessed around the clock. Increasing competitive pressures also drive the need to be constantly available to the potential customer.
Today's distributed environments present a bigger challenge for fault tolerance, and at the same time, offer an opportunity to better manage fault tolerance. With old, mainframe-like systems (including the central-server architectures typical of RDBMSs), there is really only one thing that can fail -- the mainframe (or central server). With a distributed environment, the information and processing are spread over multiple servers, workstations, and PCs, any of which might fail -- not to mention the network connections.
However, the distributed environment offers better handling of such faults. The availability of multiple nodes and multiple network connections makes it possible for the system to intelligently manage a fault by substituting working nodes or network links for those that have failed. Also, software technology advances now offer a means to do this and still maintain integrity. Since DBMSs are usually used for managing critical resources with integrity and recoverability, it isn't surprising that this new level of fault tolerance is coming from DBMSs, particularly distributed ODBMSs.
In traditional distributed systems, the failure of a critical component means the system no longer provides any services. In fact, this is often the definition of "critical." The first level of fault tolerance is to continue to provide services despite such failure. This can be accomplished through autonomous partitions.
When a node or network link fails, the distributed system is partitioned into two or more separate regions. For these partitions to continue to operate on their own (autonomously), they must be able to replicate all critical system resources (catalog, schema, locking, and logging information), then automatically fail over to local service providers (hardware and software) that can use the replicated system information and continue to provide services. From the user's point of view, though some user resources (objects in an inaccessible partition) might be unavailable during the fault, the system continues to function and provide services. By supporting multiple autonomous partitions, such systems can tolerate multiple, simultaneous faults, and do so transparently to users.
The second level of fault tolerance is providing continued access to all user resources -- those aforementioned remote objects. Since the remote partition is, by definition, inaccessible during the fault, this requires replicating those objects locally. Though replication has been available for some time, most implementations have suffered from limitations in performance and integrity. The performance problem comes from the need to execute operations on all replicas, which is typically done sequentially by sending operation requests from server to server. By integrating knowledge of the replicas at the kernel level, the distributed DBMS can initiate operations on all replicas in parallel. The result is that write operations are typically no slower, and reads are typically much faster.
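The latency argument can be illustrated with a minimal Python sketch. This is not any vendor's replication code: the "replicas" are plain dictionaries standing in for replica servers, and each assignment stands in for a network round trip. The point is only that issuing the operation to all replicas at once bounds the write latency by the slowest replica, rather than by the sum of all of them.

```python
import threading

def write_sequential(replicas, key, value):
    # Send the write to each replica one after another; total time
    # would be the sum of every round trip in a real system.
    for replica in replicas:
        replica[key] = value

def write_parallel(replicas, key, value):
    # Initiate the write on all replicas at once; total time is
    # roughly that of the slowest replica, not the sum of all.
    threads = [threading.Thread(target=replica.__setitem__, args=(key, value))
               for replica in replicas]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

replicas = [dict(), dict(), dict()]
write_parallel(replicas, "account-42", 100)
```

With in-memory dictionaries the difference is invisible, of course; the parallel structure matters only when each write is a remote request.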
The more difficult replication issue is integrity -- keeping changes in sync. Most older systems either give up on this, requiring administrators to manually track changes and copy master databases to slaves, or automate it only for a single, hot backup. Even in the latter case, if the network link fails between the primary and backup server, each will allow updates; the same (replicated) object might then hold two different (inconsistent) values, a problem that cannot be resolved even after the original fault is repaired.
The dynamic quorum calculation approach maintains integrity even during faults. Servers vote among themselves to determine whether they are communicating with the majority or the minority. Servers in the majority allow updates, while those in the minority allow only read access. This heads off inconsistencies while allowing read access to all users, and write access to more than half of the users, during the failure. Of course, minority users can always write by branching a new version of the objects, in which case the DBMS maintains all such versions and the user must decide which version(s) to retain. Once the fault is repaired, the minority databases are automatically resynchronized.
With multiple replicas, each can take over if any of the others becomes inaccessible, providing continued access even with multiple faults. Users can choose which objects to replicate and where to replicate them, and can weight the voting process to customize as desired; see Figure 2.
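The quorum rule itself is simple enough to sketch in a few lines of Python. This is an illustration of the idea, not any product's actual algorithm: each replica server carries a vote weight, and a partition may accept writes only if the servers it can reach hold a strict majority of the total weight.

```python
def partition_mode(reachable_weights, total_weight):
    # Servers holding a strict majority of the total vote weight
    # may write; everyone else is limited to read access.
    if sum(reachable_weights) * 2 > total_weight:
        return "read-write"
    return "read-only"

# Three replicas with equal weight; total weight is 3.
assert partition_mode([1, 1], 3) == "read-write"   # majority partition
assert partition_mode([1], 3) == "read-only"       # minority partition

# Giving one server overriding weight reproduces the "owner" scenario:
# whichever partition contains that server can always write.
assert partition_mode([10], 1 + 1 + 10) == "read-write"
```

Because the weights are per-server, an administrator can tune availability per object without touching any application code.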
All of these capabilities are available without any changes to applications. In fact, as applications are running (with databases online), administrators can create new autonomous partitions, add replicas of objects, and so on. Users will immediately realize benefits. Applications can implement sophisticated support when they're in the minority partition; for example, automatically branching versions when users ask for updates, prompting users, and the like.
Administrators have complete control. They can dynamically manage partitions and replications, independent of applications. Statistics gathered in real time can show who is using which objects, what the local hits versus remote network accesses are. Hence, administrators can make informed decisions about what to replicate and where. Administrators might, for example, create some local replicas of frequently accessed remote objects, thereby significantly reducing the WAN bandwidth requirement.
End users, on the other hand, see continued availability despite failures, as well as increased performance. The failures, in fact, can be transparent to the user if the desired user objects have been replicated. The performance increase arises partly from the ability to perform read operations on local replicas rather than going over slower remote links. Another improvement comes from parallelism. In a distributed DBMS, multiple servers, each maintaining multiple replicas, can serve users in parallel, providing near-linear scalability. As the load increases, the administrator can simply add more servers (commodity-priced NT boxes rather than expensive multiprocessor mainframe-like machines, for instance) and replicate the most-requested (hot) objects -- all dynamically, all online. Since each new server has its own replica of the hot objects, it can serve requests for those independently, so three servers can serve roughly three times as many users.
The traditional mainstay of the DBMS is recovery. Usually, this is based on transactions. The programmer decides where in the program there are critical regions, then sandwiches them with transaction start and transaction commit. The DBMS is supposed to ensure that everything up to the last commit is guaranteed safe, no matter what might happen. Of course, it's important to choose transaction commit points for which all the information is logically consistent. Also, it's important to choose frequent-enough commit points, lest too much be lost in a failure. You have the ability to intentionally cause a transaction to abort, for example, if an error condition is detected but insufficient information is available to recover from the error. Finally, when a transaction abort occurs, users should have the ability to insert a piece of code to clean up whatever else is necessary (such as the screen graphics, user notifications, and so on).
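This transaction bracketing can be sketched with a toy in-memory store (the class and method names here are hypothetical, not a real DBMS API): changes are staged until commit, an explicit abort discards them, and an abort handler gives the application its hook for cleaning up screens, notifications, and the like.

```python
class Transaction:
    # Toy model: changes are staged in memory and reach the store
    # only at commit, which marks the guaranteed-safe point.
    def __init__(self, store, on_abort=None):
        self.store = store
        self.on_abort = on_abort  # application cleanup hook
        self.staged = {}

    def write(self, key, value):
        self.staged[key] = value

    def commit(self):
        self.store.update(self.staged)

    def abort(self):
        self.staged.clear()       # nothing reaches the database
        if self.on_abort:
            self.on_abort()       # e.g., redraw screen, notify user

db = {"balance": 100}
txn = Transaction(db, on_abort=lambda: print("cleaning up UI"))
txn.write("balance", 50)
txn.abort()                       # error detected; intentional abort
assert db["balance"] == 100       # last committed state survives

txn2 = Transaction(db)
txn2.write("balance", 50)
txn2.commit()
assert db["balance"] == 50
```

The sketch omits durability, locking, and logging entirely; it shows only the programmer-visible contract of begin, commit, and abort with cleanup.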
Typical DBMS systems today guarantee such recovery for almost any system error, application error, or computer error. Of course, no such mechanism can be 100 percent perfect. For example, suppose the recovery implementation saves information for recovery by writing it to a disk, but the operating-system failure mode erases the disk. Admittedly, such a situation may be rare. Less rare, though, might be the failure of the disk itself. Such media failure usually requires a different recovery mechanism.
Backup can provide at least some measure of recovery from media failure, because it copies the information to a second medium (disk, tape, and so on). Systems typically offer both full backup and incremental backup, the latter being important for large database users (where it's impractical to do full backups frequently). These backup procedures are usually set up independently by a database administrator (DBA). In better systems, they can run with minimal impact on active users.
Of course, restoring a backup restores the state of information as of the time of the backup, losing any later changes. To restore to the most recent committed transaction, the DBA requests a roll forward. If the DBMS writes a log of all changes for each transaction since the last backup, and writes it to a different disk from the one that failed (as is typical), then it can replay those transactions, applying them to the backup copy and thereby restoring the database to the state of the last committed transaction.
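Roll forward reduces to replaying a commit-ordered log on top of the restored backup. The following is a toy model (dictionaries standing in for the database and for log records), not a real recovery manager; note that work still in flight at the time of the failure is not replayed.

```python
def roll_forward(backup, log):
    # Start from the restored backup copy, then replay every change
    # committed since that backup, in commit order.
    db = dict(backup)
    for txn in log:
        if txn["committed"]:       # uncommitted work is discarded
            db.update(txn["changes"])
    return db

backup = {"a": 1}
log = [
    {"committed": True,  "changes": {"a": 2, "b": 7}},
    {"committed": False, "changes": {"a": 99}},  # in flight at failure
]
assert roll_forward(backup, log) == {"a": 2, "b": 7}
```

Keeping the log on a different disk from the data is what makes this recoverable after a media failure: losing one disk never loses both the backup and the changes since it.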
Transactions also have the effect of controlling concurrent use by multiple users. The result of one user's changes is not seen by another user until the first executes commit, and "simultaneous" changes are serialized so they don't overwrite each other. However, you may wish to establish a recovery point partway through a transaction without yet exposing your changes to other users. This functionality is provided by a transaction checkpoint, available in many systems.
Some systems also provide nested transactions, allowing you to start a transaction, then start a "child" transaction within it. When the child transaction commits, the parent continues. Such transactions do not aid concurrency, because the results cannot be seen by other users until the top-level transaction commits. However, they can, in some implementations, establish a recovery point similar to the checkpoints mentioned previously. Also, they can be convenient to use within a subroutine or function, where the writer does not know whether the caller might have already started a transaction.
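The nesting can be sketched the same way (again, a hypothetical API): a child's commit folds its changes into the parent's staging area, and nothing becomes visible outside the transaction until the top-level commit.

```python
class Nested:
    # Toy model of nested transactions: a child commit merges into
    # its parent's staged changes; only the top-level commit makes
    # anything durable and visible to other users.
    def __init__(self, store, parent=None):
        self.store = store
        self.parent = parent
        self.staged = {}

    def child(self):
        return Nested(self.store, parent=self)

    def write(self, key, value):
        self.staged[key] = value

    def commit(self):
        if self.parent:                     # child: fold into parent
            self.parent.staged.update(self.staged)
        else:                               # top level: make durable
            self.store.update(self.staged)

db = {}
top = Nested(db)
sub = top.child()      # e.g., inside a subroutine that may or may not
sub.write("x", 1)      # have been called within an open transaction
sub.commit()
assert db == {}        # still invisible to other users
top.commit()
assert db == {"x": 1}
```

This is why nested transactions suit library code: the subroutine brackets its own work identically whether or not the caller already has a transaction open.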
A final point on recovery: After a failure occurs, some systems require manual intervention to recover. For unattended systems, remote servers, and the like, it is convenient (sometimes even critical) for the system to automatically recover.
In the old mainframe-only world, there was typically one computer, with central control over all resources. Today, however, there are typically multiple computers and multiple tiers of servers and clients. Distributed computing provides better functionality and possibly better performance and easier access, but there are many more possible faults. Not only are there more computers and disks that can fail, but the communication links between them can fail, too. This presents new challenges to the DBMS for recovery and backup. If well handled, the result can be increased fault tolerance and increased availability.
Consider an application that modifies objects on multiple nodes. If a failure occurs after updating some nodes, but before completing all of them, the collective result is an inconsistent state. A distributed DBMS will automatically extend its recovery scheme to allow for such situations, ensuring that either all changes are made or none are. Some DBMSs provide such consistent, atomic updates, but only over predefined sets of servers or objects. This requires you to explicitly specify the desired resources, such as databases. Also, this approach breaks down if the end user or DBA later wants to redistribute objects among databases and servers. A distributed DBMS, however, provides users with a single logical view of all objects (regardless of location, database, and server), and supports all operations (including atomic commit) within that logical view. By automatically mapping the logical view onto the physical one, it supports recovery even in distributed environments, and it gives end users and DBAs the flexibility to redistribute objects online, without affecting programs.
Another solution, especially if multiple DBMSs are used, is a third-party transaction monitor that executes the transaction commit in two phases: first asking each DBMS whether it is prepared to commit, then, if all respond affirmatively, proceeding to the second phase and instructing each DBMS to commit. Between the two phases, each DBMS or resource manager must be prepared either to guarantee completion of the commit, or to guarantee completion of an abort (restoration to the previously committed state). This is usually accomplished via the standard XA protocol from X/Open, or newer versions of the same, such as the OMG Object Transaction Service. You must write additional code to tie all of your transactions in each DBMS to the transaction monitor.
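The coordinator's half of the protocol can be sketched as follows. This is a simplified illustration (a real transaction monitor must also log its own decisions so it can survive its own failure mid-protocol): phase one polls every resource manager with a prepare request, and phase two commits everywhere only if every manager guaranteed it could.

```python
def two_phase_commit(managers):
    # Phase 1: ask every resource manager to prepare. A "yes" is a
    # promise that commit cannot subsequently fail on that manager.
    if all(m.prepare() for m in managers):
        # Phase 2: all promised, so instruct each to commit.
        for m in managers:
            m.commit()
        return "committed"
    # Any refusal (or timeout, in a real system) aborts everywhere.
    for m in managers:
        m.abort()
    return "aborted"

class Manager:
    # Stand-in for a DBMS/resource manager speaking the protocol.
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "active"
    def prepare(self):
        return self.can_commit
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

assert two_phase_commit([Manager(True), Manager(True)]) == "committed"
assert two_phase_commit([Manager(True), Manager(False)]) == "aborted"
```

The essential property is that no manager commits until all have promised they can; the price, as the next paragraph notes, is that a partition during the protocol can block everyone.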
So far this sounds like extra overhead to achieve a comparable result, albeit in a more complicated environment. Further, what happens if one of the nodes fails or if a network connection is severed producing a "partitioned" network? The aforementioned scheme means that all users could well be prevented from committing transactions. This leads to the next level, an even newer evolution in DBMSs.
Network failure results in a partitioned network. Users and servers in different partitions cannot communicate or share any changes, but it is still possible for users in each partition to at least continue to receive the services of the information system. To achieve this, the DBMS must replicate its own services within each partition. These typically include the schema and data dictionary, catalogs for locating servers and other resources, and locking and logging information. With such a system, users can at least continue to access basic services such as creation, modification of information, and recovery. A partition like that shown in Figure 1 can then continue to execute autonomously.
This approach can dramatically reduce the risk of catastrophic failure or loss of service, and can do so with no impact at all to users (if the DBMS supports mapping of a logical view transparently over all servers, and if the system allows the DBA to dynamically create and modify such partitions). By creating arbitrary numbers of such autonomous partitions, each with its own replicas and servers, the DBA can achieve any desired level of fault tolerance. If a node fails in another partition, it may prevent other users from accessing information stored on that node, but users will be able to continue accessing all other information and all services. The same is true for network failures that isolate partitions. This not only allows service to continue, but does so in a way that is reasonable; that is, users can still do everything except access the inaccessible information.
What if that inaccessible information, however, is critical? This brings us to replication.
By making copies, or replicas, of objects, the availability of access to those objects can be maintained, even under some failures. The simplest way to do so is to create a single, online, hot backup by having a primary server copy every action to a secondary server. Then, if one of the servers fails, the other can take over instantly, maintaining uninterrupted service. Usually, however, this means increased overhead due to the need to pass each operation to the secondary server. Integrity is at risk, too. If the network fails between the two servers, each must assume it's the only remaining server, so each will allow updates. There will be no way to resolve these updates, even after the network is restored.
A more general scheme is to allow any number of desired replicas in any number of partitions. This increases the availability to any desired level, rather than just one backup. Also, this approach can be implemented with no extra overhead for reading. In fact, wise choice of replicas can improve read performance, by allowing the system to access a closer (in access time) replica server, or a more lightly loaded server. There is, of course, extra overhead for writes, due to the need to propagate the write locks and updates to all replicas, though this can be minimized by caching and batching such updates.
A tricky situation arises, though, when users in disconnected partitions wish to update the same objects. If each is allowed to update independently, the same data-integrity problem arises. The simplest solution is to restrict updates to only one user, perhaps the "owner" of the object. Greater update availability under network failures can be achieved by having the replica servers vote among themselves. Then, all users who can reach a majority of the servers can safely be allowed to update, without danger of integrity loss. Those in the minority must be limited to read access. For situations in which some users require special consideration, they can be given greater weight in the voting (up to infinite weight) to selectively reproduce the "owner" scenario.
All these DBMS fault-tolerance capabilities are available, and more and more programmers will be using them as they increasingly integrate naturally with programming languages. Even more possibilities can be envisioned: For example, the DBMS could incorporate an expert system that measures access frequency and access attempts, and automatically creates or modifies partitions and replicas (perhaps with programmer control to avoid massive undesired restructuring).
The DBMS is not the only solution. Hardware solutions include mirrored disks that automatically duplicate each write to both disks. This solution is much like the hot backup example mentioned previously; but, done in hardware, it avoids the performance problem. RAID is a generalization of this concept. Similarly, redundant processors can provide protection from processor failure with some fail-over software system to maintain replicated operating-system structures, detect failures, and switch over. Finally, redundant network connections, with intelligent network routers, can automatically work around failed communication links.
With all solutions, however, the programmer still must design and implement exception handling and decide where the commit and check points should be. The schema designer or object modeler still must design classes or types, relationships, and composites. Additionally, all the other resources managed by the programmer must be coordinated with recovery, often by registering them in the DBMS.
Fault tolerance, then, has come within the reach of most programmers, via ODBMSs integrated with languages, distribution and replication, as well as a variety of redundant hardware and fail-over software. Today's distributed environment of networked clients and servers requires this new level of support, and with it, offers greater availability, reliability, and robustness. Soon, mainstream users will routinely demand this level of service.
DDJ