Distributed Real-Time Operating Systems

The next generation

E. Douglas Jensen

Doug is the technical director for real-time computer systems at Digital Equipment Corp. His responsibilities include establishing a technology vision and strategy for Digital's efforts in real-time systems. Doug has 27 years of experience in real-time computing, including eight years on the computer-science faculty at Carnegie Mellon University.


To date, virtually all real-time computing has been at the lowest level of the application-control hierarchy--embedded computers, controllers, appliances, and other intelligent devices. However, more and more real-time systems are being specified at the higher levels, including decentralized production operations and business-management systems.

Unfortunately, the requirements of this expanding real-time domain violate the assumptions implicitly underlying conventional real-time computing. Moreover, advanced, complex, and distributed real-time applications typically need more operating system (OS) technology than their smaller, simpler, executive-based brethren. Thus, the requirements are becoming too broad for any single, general-purpose, real-time OS to satisfy. The obvious alternative--multiple, additional real-time OSs--is logistically and economically infeasible for vendors and users alike.

Consequently, a new generation of real-time control, computing, and operating systems needs to be born. These OSs should be modular, adaptable, and scalable in functionality. They should support global, distributed, and cooperative computing across and between levels--and nodes--in the application-control hierarchy. The new OSs should perform dynamic resource management in the face of application and system uncertainties, enforce end-to-end timeliness of the total control system, and support variable degrees of "hardness" and "softness" of the real-time application.

One scalable, real-time OS architecture across many levels of hardware would benefit both vendors and users. It would improve software reusability while accommodating evolving needs and technologies. It would reduce engineering costs and time-to-market. Plus, it would use a single, common, software-development environment across the entire application regime.

The Changing Real-Time Application Environment

Traditional, relatively small, simple, centralized, real-time applications exist everywhere. For example, it's now common to see 8-bit embedded-chip valve controllers running on top of a home-grown or vendor-proprietary OS, 32-bit industrial microcomputers providing supervisory cell control while running VxWorks or a proprietary OS, and UNIX workstations providing the operator interface to a remote console. In commercial, real-time applications--online transaction-processing applications such as lottery systems, bond trading, currency markets, and the like--predictable and timely response is mandatory, but not on microsecond time scales. In the consumer world, the antilock-brake system (ABS) ensures real-time safety for a car's occupants.

As Figure 1 illustrates, real-time applications usually exist in a stylized, restricted hierarchy consisting of three levels: control, supervisory, and management. Generally, the control and computing in this hierarchy have been local, centralized, and autonomous; they use elementary client/server relationships between levels, and their real-time nature is different at each level.

Control, the lowest level of the hierarchy, is typically reactive--small, stand-alone, and generally oblivious to other systems. Control applications are relatively simple real-time subsystems for low-level, sampled-data monitoring and control (such as regulatory loops in process-control applications). They use static, priority-based resource management and have highly predictable behavior. Almost all real-time computing has been at this lowest level of the application-control hierarchy.

The second level is usually a supervisory control and computing system with loosely defined real-time operations. Typical applications in industrial plants include production scheduling and control, quality management, and process optimization. In commercial aircraft, this level would be the mission-management computer system, which supervises the flight control, communications, and propulsion subsystems, among others.

At the highest level, the management level, computing is non-real-time. These systems generally handle business operations, such as manufacturing-resource planning (MRP II), maintenance management, and order processing.

The three-level hierarchy in Figure 1 implies the need for only two kinds of real-time computer systems. The control level typically has small, proprietary, "hard" real-time executives that provide limited functionality; Wind River's VxWorks is typical of these executives. The supervisory- and management-level systems, on the other hand, usually have full-function, "soft" real-time OSs such as DEC OSF/1 and VMS.

Real-Time Rules Have Changed

The evolution of industrial automation systems is pulling real-time computing up from its familiar niche at the lowest level of the system hierarchy to the higher (supervisory and management) levels. This movement is expected to improve product quality and yield predictability, asset utilization, and plant flexibility, among other benefits.

To achieve these benefits, the concepts and techniques of real-time computing must be clarified, improved, and generalized. In technology, high-level real-time control and computing will have to be adaptive, self-directing, global, and distributed (translevel and transnode). In particular:

While (fortunately) the operational time frames are normally slower at the higher levels, ranging from seconds on up, the computing must still be hard real time in the sense that task timeliness has to be predictable.

One Size Can't Fit All

Many real-time systems use special-purpose OSs for a particular application or to best balance performance; functionality; hardware and software costs; and hardware size, weight, and power. Increasingly, the economies of developing and supporting these OSs, together with the nonportability of their applications, are forcing users to move to commercial, general-purpose, real-time OS products. For a user with only one application in the real-time domain, these OS products are often adequate. Users having a variety of different real-time applications--especially higher-level, distributed applications--are finding that these existing OS products do not encompass all their requirements. Theory and practice show that general-purpose OSs cannot accommodate the broad set of control and computing needs across the entire real-time control hierarchy. Moreover, OS vendors wishing to serve real-time user needs up and down the application hierarchy cannot afford to offer a multiplicity of different real-time OSs, nor can users afford to use a multiplicity of OSs--the support costs are too high in staff retraining, time-to-market, budgets, and software quality.

Consequently, the application hierarchy needs a scalable OS because "one size doesn't fit all." Scalable, in this case, means an operating system that is highly modular, accommodating application specificity, and highly adaptable, accommodating execution-time situational specificity.

Current OS Limitations: Timeliness

Timeliness is one limitation of the current generation of operating systems. It is affected by factors such as the hard/soft dichotomy, performance metrics, priorities, exception cases, the division of responsibilities between OS and application programmers, and determinism.

Hard versus soft real time. Hard and soft real time are much more complex technical issues than popular usage of the terms would suggest. Hard real time conventionally means that all important tasks have deadlines that must always be met; otherwise, the system has failed. Soft real time is anything less stringent: tasks might not have deadlines at all, and even when they do, missing a deadline is not necessarily a system failure. Most real-time systems fall into this soft category; their tasks may have deadlines or may simply need to be completed as soon as possible, and a task's result may still be more or less acceptable if the task completes within some specific, suboptimal time.

For example, missing a sensor sample will create a discontinuity in the sensor reading or a click in the audio signal; neither is a catastrophic event, and the next sample will override the missed data point. Alternatively, an application may require at least 85 percent of the tasks to be no more than 20 percent late, as long as no two tasks in a row miss their deadlines.
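A constraint like this can be checked mechanically. The sketch below is my own illustration (the function name and data layout are assumptions); the 85-percent/20-percent figures and the no-two-consecutive-misses rule come from the example above:

```python
from typing import List

def acceptable(lateness_pct: List[float]) -> bool:
    """Evaluate a soft real-time acceptability criterion: at least 85% of
    tasks are no more than 20% late, and no two tasks in a row miss
    their deadlines.

    lateness_pct[i] is task i's lateness as a percentage of its deadline;
    values <= 0 mean the task finished on time.
    """
    within_bound = sum(1 for late in lateness_pct if late <= 20.0)
    if within_bound < 0.85 * len(lateness_pct):
        return False
    # A "miss" here means any positive lateness; reject consecutive misses.
    for prev, cur in zip(lateness_pct, lateness_pct[1:]):
        if prev > 0 and cur > 0:
            return False
    return True

print(acceptable([0, 5, -3, 0, 10]))   # True: all within bound, no run of misses
print(acceptable([0, 25, 30, 0, 0]))   # False: two consecutive misses, too many late
```

Note that the criterion is a property of the task set as a whole, not of any single task--exactly the kind of requirement a single per-task deadline cannot express.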

So the bad news is that traditional soft real time is undefined, and the worse news is that most real-time systems are soft.

The information in both hard and soft real-time systems is highly perishable, and the system (OS and application) has to act on that information while it is current. At the core of real-time computing are issues of predictability, as opposed to how long the system takes to complete a set of tasks.

Performance metrics. A computing system or operating system is real time to the degree that it explicitly manages resources so that tasks are completed at acceptable times. Acceptable completion times can also be achieved implicitly (by luck) or by hardware brute force. Such systems may successfully operate in real time; they may even be rational, cost-effective solutions for certain applications. However, they are not genuine real-time systems because they do not use real-time resource management.

Historically, the real-time OS development and user communities have subliminally conspired in the belief that some form of interrupt response time is the real-time performance metric. However, starting the most eligible--traditionally, the highest-priority--task as fast as possible is necessary, but not sufficient. What really matters is that tasks complete at acceptable times.

The response-time artifact arises from the implication that if you start a task fast enough, it will complete on time. This implication often holds in small, simple systems, but not in larger, more complex systems, such as at higher levels of the application hierarchy. In these systems, where there are dynamic resource conflicts, interrupt response time is insufficient to characterize the system's real-time performance.

A richer, more powerful approach to expressing application-specific timeliness is necessary. This is especially true for the more sophisticated real-time computing necessary at higher levels of the control hierarchy.

Priorities. Programmers normally characterize individual task-timeliness requirements with priorities. Priorities arose in the context of simple systems, where they are adequate, but they are not adequate for large, complex, real-time systems because they:

Exception cases. System performance should be optimized for the most important cases. At the higher levels of the control hierarchy, these cases are often the high-stress exceptions rather than the most frequent (normal, uneventful) cases. These exceptions are inherent in some applications or in emergencies, such as plant upsets. It is in these exception cases that system and OS performance are most critical.

The traditional real-time approach to dealing with exception cases is through determinism. Either only the most frequent cases are accommodated, on the presumption that exception cases will not arise; or the most demanding case, however infrequent, is identified and provided for in advance, regardless of the consequences to overall system performance and cost.

Division of programmer responsibilities. The responsibility for completing real-time applications on time is generally divided between the application software and the OS.

Historically, most of this responsibility has fallen on the application programmers. They normally construct a static mapping of task-completion times to priorities, usually in an ad hoc, experimental way. The OS contributes only fast, tightly bounded interrupt latency, plus fixed-priority scheduling. This imbalance has higher costs because:

Determinism. Real-time people are obsessed with the idea that a system has to be deterministic to behave predictably. This mistake confuses ends and means. Acting on it, real-time designers attempt to anticipate every contingency and reserve the corresponding compute and data-network resources in advance. The result is a rigid system overendowed with resources that might never be used, yet one that might still be unprepared for an entirely different set of contingencies. For complex and distributed systems, this approach can be fatal.

Communications people tend not to make this mistake. For example, when you pick up the phone and call your mom or log onto CompuServe and send packets, it's highly probable that mom or another CompuServe node will be at the other end. However, unbeknownst to you, all kinds of uncertainties are present: links break, buffers get congested, and so on. While all of that goes on, the telephone or data network dynamically reconfigures itself transparently. This reconfiguring is considered a feature; the dynamic routing in the network provides robustness. However, if you are a classic real-time bigot, the reconfiguring is considered a bug because you don't know exactly how that routing was accomplished.

Current OS Limitations: Distribution

Another factor limiting the real-time effectiveness of current operating systems is distribution. System designers typically do not build distributed systems so much as they build networks--collections of processors connected together. The result is a non-real-time network of real-time systems, such as the collection of machining centers and materials-handling equipment on the factory floor.

Many systems designers are happy with networking centralized real-time subsystems; they've been doing that for years. But many more designers today want to build entirely real-time process-control applications physically dispersed across multiple computers. Such applications must ensure end-to-end timeliness across the entire network.

In addition, application programmers must know the identities and physical locations of the computers and the software functions on them. Application programmers are also responsible for coordinating concurrent execution and data accesses on each of these computers. Conventional real-time OSs provide no support for any of these activities.

The solution to many of these issues is to provide some decentralization and better resource management through middleware at intermediate levels, including such conventional and object-oriented distributed execution environments as the Open Software Foundation's Distributed Computing Environment (DCE) and the Object Management Group's Object Management Architecture (OMA). However, no off-the-shelf real-time distribution middleware products exist at present. Moreover, middleware typically does not have direct access to kernel- and OS-level resources, which limits the real-time capabilities of the system.

Technologies for Next-Generation Real-Time OSs

Overcoming the limitations in OS timeliness and distribution requires not only a shift in mindset, but also new technologies. Today, a variety of computer and software vendors are performing advanced development in real-time OS architectures, particularly in the areas of timeliness as expressed in the benefit-accrual model, distributed threads, and passive objects. What is likely to happen is that portions of these developments--whether as ideas or implementations--will be added to existing and new OSs.

For the next-generation real-time computer systems, the operating-system architecture must have a framework for expressing highly scalable timeliness specifications. That is, it must encompass a wide continuum of real-time hardness and softness in a unified manner.

A time constraint, such as the archetypal deadline, is conventionally thought of as a point on a timeline; see Figure 2. Classic scheduling theory often measures a task's timeliness in terms of lateness, where lateness = deadline - completion time. For a soft deadline, timeliness is the lateness value itself. For a hard deadline, timeliness is just the sign of that value: if it is negative, the task is late.

A better approach would be to think of a task's timeliness in two dimensions: benefit or contribution to the system over the time required to complete the task. Graphically, a hard deadline is a binary, downward step function with a lower range of either zero (missing the deadline is nonproductive) or a negative number (missing the deadline is counterproductive); see Figure 3.
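In code, the distinction is a one-liner. The sketch below (my own illustration, with the benefit of a timely completion normalized to 1.0) expresses lateness as defined above and a hard deadline as the binary, downward step function of Figure 3:

```python
def lateness(deadline: float, completion: float) -> float:
    # Lateness as defined in the text: deadline minus completion time.
    # A negative value means the task finished after its deadline.
    return deadline - completion

def hard_deadline_benefit(deadline: float, completion: float,
                          miss_benefit: float = 0.0) -> float:
    """A hard deadline as a binary, downward step function: full benefit
    (normalized to 1.0) at or before the deadline, and either zero
    (nonproductive) or a negative value (counterproductive) after it."""
    return 1.0 if completion <= deadline else miss_benefit

print(lateness(10.0, 8.0))                     # 2.0: finished early
print(hard_deadline_benefit(10.0, 12.0))       # 0.0: missed, nonproductive
print(hard_deadline_benefit(10.0, 12.0, -5.0)) # -5.0: missed, counterproductive
```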

In the real world of computing, applications are rarely black and white; real time is a continuum. At one end, all tasks and processes have deadlines that must be met or the system has failed. At the other end, some systems have no time constraints at all. In the middle are tasks and processes that can occasionally be late, or always be somewhat late, requirements that are relatively difficult to express.

Many of these applications, especially higher-level ones, require individual task-completion times that are softer in the sense of not being deadlines. Nonetheless, these completion times must be specified and enforced. Figure 4 illustrates the two-dimensional view of a real-time application, where:

For example, consider a satellite communications system which has an optimal window of opportunity for sending and receiving data between the satellite and the ground station. On each side of that window is a period of time during which communications can take place, but at a lower rate because of poorer signal-to-noise ratios. Abstracting this natural analog continuum of timeliness into an artificial, binary deadline can be highly disadvantageous.

Expressing time constraints in two dimensions lets you represent a wide range of hardness and softness coherently and methodically, thus letting the OS satisfy those specifications. Moreover, application programmers can derive actual timeliness specifications directly from the requirements and behavior of the system.

One framework for expressing timeliness is called the "benefit-accrual model" and is based on three orthogonal functions for specifying timeliness:

The benefit-accrual model expresses an individual task's time constraint in terms of a timeliness metric called "benefit."

Graphically, the origin of the benefit-function axes is the current time, tc (the value of the system clock). The earliest time for a benefit function is called the initial time, ti, and the latest is the terminal time, tT; see Figure 5. The benefit function is evaluated only between the current time and the terminal time. Using these terms, the hard benefit function in Figure 6 has:

Conversely, a soft benefit function can have arbitrary values before and after the optimal values at tS, but it need not have constant values on each side of tL and tD, nor expiration times; see Figure 7.
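To make the satellite-window example concrete, here is one plausible soft benefit function--a piecewise-linear shape of my own choosing (the article does not specify the curve): benefit ramps up from the initial time ti, peaks at the optimal time ts, and decays to zero at the terminal time tT, mirroring the improving and then worsening signal-to-noise ratio:

```python
def soft_benefit(t: float, t_i: float, t_s: float, t_t: float) -> float:
    """A piecewise-linear 'soft' benefit function: zero outside
    [t_i, t_t], rising linearly to a peak of 1.0 at the optimal time
    t_s, then falling linearly back to zero at the terminal time t_t."""
    if t < t_i or t > t_t:
        return 0.0          # outside the window: no benefit accrues
    if t <= t_s:
        return (t - t_i) / (t_s - t_i)   # ramp up toward the optimum
    return (t_t - t) / (t_t - t_s)       # decay toward the terminal time

# A 10-second window with the optimum at its midpoint:
print(soft_benefit(5.0, 0.0, 5.0, 10.0))   # 1.0: completion at the optimum
print(soft_benefit(7.5, 0.0, 5.0, 10.0))   # 0.5: late, but still worthwhile
```

Collapsing this continuum into a single binary deadline would discard exactly the information a scheduler needs to make a good trade-off.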

Individual tasks, in general, also have two other attributes: dynamic dependencies (for example, precedence and resource conflicts) and relative importance (functional criticality) that an advanced scheduling policy must consider. This importance is orthogonal to timeliness, and may be a function of time and other parameters that reflect the application and computing system state.

Usually real-time applications include multiple tasks that may each have time constraints. Collective timeliness, another function in the benefit-accrual model, indicates how timeliness, in terms of system benefits, is "accrued" by the collection of tasks.

One of the challenges of an advanced scheduler is to optimize collective timeliness specified by these task time constraints. The scheduler considers all time constraints it knows about, and creates one or more schedules by assigning estimated (or expected) execution-completion times. This results in estimated initiation times and an order for executing the tasks.
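The following toy sketch (mine, not a production policy) shows the idea: each task carries an execution-time estimate and a benefit function, and the scheduler picks the execution order that maximizes total accrued benefit. Exhaustive search over permutations is only workable for a handful of tasks; real benefit-accrual schedulers use heuristics.

```python
import itertools

def accrued_benefit(order, tasks):
    """Sum each task's benefit function evaluated at its estimated
    completion time, executing tasks back-to-back in the given order.
    Each task is an (execution_time, benefit_fn) pair."""
    clock, total = 0.0, 0.0
    for i in order:
        exec_time, benefit_fn = tasks[i]
        clock += exec_time               # estimated completion time
        total += benefit_fn(clock)
    return total

def best_schedule(tasks):
    # Exhaustive search: fine for a sketch, infeasible at scale.
    return max(itertools.permutations(range(len(tasks))),
               key=lambda order: accrued_benefit(order, tasks))

# Task 0 has a tight hard deadline at t=2; task 1 a loose one at t=10.
tasks = [(2.0, lambda t: 1.0 if t <= 2.0 else 0.0),
         (3.0, lambda t: 1.0 if t <= 10.0 else 0.0)]
print(best_schedule(tasks))   # (0, 1): run the tight-deadline task first
```

Because the benefit functions subsume both hard and soft constraints, the same optimization handles both uniformly.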

Yet another function in the benefit-accrual model, collective-timeliness acceptability, specifies the acceptability of the completion times for a set of tasks. Acceptability of certain tasks or combinations of tasks may be conditional on the present state of the system, such as other tasks' timeliness, resource availability, and application mode. Realize that the semantics and metrics of timeliness acceptability are application specific. For example, "unacceptable" may mean either nonproductive or counterproductive in some way.

Larger, more complex, more distributed, mission-critical real-time systems usually call for softer collective-timeliness acceptability criteria. These systems must dynamically adapt to situational uncertainties to remain robust. For example, a particular group of tasks may be acceptable if they complete at times yielding at least 75 percent of their maximum possible collective benefit--if, and only if, no more than two of them, which complete within 100 msecs of each other, yield a timeliness benefit of less than 90 percent of their maximums.
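The criterion in that example can be made concrete. The sketch below is my own rendering (the data layout and function name are assumptions): a task group is acceptable if it accrues at least 75 percent of its maximum possible collective benefit, and no more than two "low-yield" tasks--those below 90 percent of their individual maximum benefit--complete within 100 msecs of each other.

```python
def collective_acceptable(results, max_benefit, window=0.100):
    """Check a group of tasks against a soft collective-acceptability
    criterion.  `results` is a list of
    (completion_time_secs, benefit, individual_max_benefit) tuples."""
    total = sum(benefit for _, benefit, _ in results)
    if total < 0.75 * max_benefit:
        return False
    # Completion times of tasks yielding < 90% of their own maximum:
    low_yield = sorted(t for t, b, m in results if b < 0.9 * m)
    # Any *three* low-yield completions inside one window violate the
    # "no more than two" condition.
    for first, third in zip(low_yield, low_yield[2:]):
        if third - first <= window:
            return False
    return True
```

The point is not this particular formula but that acceptability is application specific, state dependent, and expressible over the group as a whole.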

Real-time applications at higher levels include tasks that span levels of the control hierarchy and nodes. These tasks are typically "faked" by breaking them into centralized tasks on each node, which communicate by messages.

Alternatively, next-generation real-time operating systems can provide actual distributed (transnode) tasks. The technology for accomplishing this is called "distributed threads." These threads can transparently and reliably extend themselves across address spaces, and thus among computing nodes. This transparency minimizes the effect of physical dispersal on software costs and lets programmers use familiar centralized programming techniques and tools.

These threads also maintain their identities and attributes. In particular, a distributed thread includes all the real-time scheduling information needed to enforce its end-to-end timeliness.

Distributed threads provide the opportunity for managing resources coherently, according to a common performance-optimization criterion, such as meeting timeliness constraints. For example, the scheduling policy employed for processor cycles can also be used for managing synchronizers, such as locks, semaphores, and transactions. Coherent resource management is the only complete solution to the common problem called "priority inversion."
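As one illustration of coherent resource management, here is a toy sketch (class and method names are mine) of deadline inheritance across a synchronizer--the deadline-based analog of priority inheritance. A blocked, more urgent thread propagates its earlier deadline to the lock holder, so the synchronizer is governed by the same policy as the processor:

```python
class DThread:
    """Stand-in for a distributed thread: it carries its end-to-end
    scheduling attributes (here just a deadline) with it as it extends
    across address spaces and nodes."""
    def __init__(self, name, deadline):
        self.name = name
        self.deadline = deadline            # the thread's own constraint
        self.effective_deadline = deadline  # may tighten via inheritance

class DeadlineInheritanceLock:
    """A synchronizer managed by the scheduler's own policy: an urgent
    waiter propagates its (earlier) deadline to the current holder."""
    def __init__(self):
        self.holder = None

    def request(self, thread):
        """Grant the lock if free; otherwise record inheritance and
        return False (a real OS would enqueue and block the caller)."""
        if self.holder is None:
            self.holder = thread
            return True
        if thread.deadline < self.holder.effective_deadline:
            self.holder.effective_deadline = thread.deadline
        return False

    def release(self):
        holder, self.holder = self.holder, None
        holder.effective_deadline = holder.deadline  # drop inherited urgency
```

With priorities replaced by time constraints throughout, the scheduler never finds itself running a low-urgency holder ahead of the urgent thread it blocks.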

Separating the task entity--the distributed thread--from the code it executes and the data it accesses requires a programming model that includes entities consisting of only code and data. In object-oriented programming, these entities are abstract data types called "passive objects" (see Figure 8). The number of distributed threads that can be concurrently active in a passive object and their synchronization constraints are determined by the object programmer.

In contrast, active objects often have only one captive thread, and typically communicate among themselves with asynchronous messages. Because active objects are a special case of passive objects, they are easily provided by OSs, if desired. (Active objects are common because of their automatic compatibility with existing OS process models.) The OS should accept responsibility for the basic integrity of distributed threads by, for example, providing orphan detection and elimination, or allowing situation-specific invocation of failure semantics and recovery policies.
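A passive object's programmer-determined concurrency bound can be sketched with an ordinary counting semaphore (a simplification of my own; the class and its names are illustrative):

```python
import threading

class PassiveObject:
    """A passive object is code plus data with no thread of its own;
    visiting (distributed) threads execute its operations.  The object
    programmer fixes how many threads may be active inside at once."""
    def __init__(self, max_concurrent=1):
        self._gate = threading.Semaphore(max_concurrent)
        self._lock = threading.Lock()   # protects the object's data
        self._value = 0

    def increment(self):
        with self._gate:                # enforce the concurrency bound
            with self._lock:            # serialize the data update itself
                self._value += 1
                return self._value
```

A thread entering `increment` is, conceptually, the same thread that may have originated on another node; the object neither knows nor cares.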

Another important technology that differentiates true distributed computing from networking is distributed concurrency control. Multiple, mutually asynchronous, distributed threads must coordinate their concurrent execution and data accesses. This way, the system remains correct and consistent. The OS must accomplish this by providing the equivalent of semaphores and locks, but without shared primary memory among nodes. Technologies for accomplishing this include distributed agreement, atomic broadcasts, and transaction-like constructs.
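As a sketch of the last of these, transaction-like constructs, here is a toy two-phase commit (my own illustration, not drawn from any product): participants vote in a prepare phase, and the coordinator commits only if every vote is yes.

```python
class Participant:
    """One node's resource manager in the toy protocol."""
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote on whether this node can commit.
        self.state = "prepared" if self.can_commit else "abort-voted"
        return self.can_commit

    def finish(self, decision):
        # Phase 2: apply the coordinator's unanimous decision.
        self.state = decision

def two_phase_commit(participants):
    """Coordinate agreement without shared primary memory: commit only
    if all participants vote yes; otherwise everyone aborts."""
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    for p in participants:
        p.finish(decision)
    return decision
```

Real protocols must also survive lost messages and crashed nodes mid-protocol, which is precisely where distributed agreement and atomic broadcast come in.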

The Challenge of the Next-Generation OS

After remaining relatively unchanged for over 30 years, the real-time computing domain is expanding into higher-level applications. Traditional real-time computing concepts and techniques need to scale up to satisfy these new requirements. This calls for a new real-time paradigm--one that more carefully defines and generalizes the traditional real-time approach.

This should improve the economies and productivity of the entire application-control hierarchy and enable new services that are not available today in real-time applications. At the same time, the continual improvement in performance and cost-effectiveness of microprocessors increases the need for these new real-time technologies.

The new paradigm will include a hierarchy of real-time distributed objects and threads. Recognizing these as strategic assets represents a business challenge to traditionally hardware-oriented computer vendors, system suppliers, and users.

Figure 1 Three-level control hierarchy.
Figure 2 Time constraint as a point on a timeline.
Figure 3 Traditional real-time interpretation of a hard deadline.
Figure 4 Examples of soft individual time-constraint functions.
Figure 5 Benefit function defined over a range of time.
Figure 6 A "hard" benefit function.
Figure 7 A "soft" benefit function.
Figure 8 Passive objects are abstract data types.


Copyright © 1995, Dr. Dobb's Journal