MESSAGE-PASSING OPERATING SYSTEMS

High performance and flexibility in a real-time, multitasking, multiuser, or networked environment

Dan Hildebrand

Dan Hildebrand is a programmer for Quantum Software Systems Ltd. (Ottawa, Ontario) and the developer of Qterm, a high-level terminal emulation program.

Operating systems in real-time environments must be capable of handling the demanding intertask communications problems that are inherent in communications, process control, and other real-time applications. At the same time, the operating system (OS) must be capable of providing the functional capabilities of a traditional multiuser OS while delivering fully network-distributed processing and the deterministic performance of a real-time executive. One way to achieve both performance and flexibility is a message-passing architecture.

The performance and flexibility of a message-passing OS enable the data flow on the network to consist of intertask messages. Tasks can communicate with any other task, anywhere on the network. The network then functions as a homogeneous, tightly connected array of computers, rather than a collection of computing islands connected on a functionally limited network.

Conventional wisdom would have us believe that the 8088 and 80286 possess a "flawed" architecture, leaving them unsuitable for multitasking. Contrary to popular opinion, the design of these processors is admirably suited to multitasking. It is only the "flawed" architecture of conventional operating systems that negatively impacts their performance. Operating system design is the primary limiting factor in the multitasking performance of these processors.

The Dilemma of the Layered Approach

In a conventional OS, various unrelated pieces of the OS often share common code and data space for convenience of implementation. Software layers over existing facilities (the "yet-another-layer" design philosophy) provide additional OS functionality. With each new release, this ever-increasing depth of layering results in progressively worsening performance in the following three crucial areas:

The synchronization overhead incurred when a task communicates a request to the operating system.

The transparency of intertask communications.

The efficiency of intertask communications.

For example, to provide network services, a layer is often added to catch OS requests and reroute them through the network software/hardware to a file server. The file server is then running a network control task that interfaces the network requests to the local OS. This "network-services" layering imposes a performance penalty on all network transactions.

To avoid performance losses, extensions can be coded into the kernel itself, thereby having access to data structures and code fragments not necessarily needed for the extension. However, these pathological connections result in side effects that can be difficult to debug and maintain. You face the dilemma of choosing between:

1. extending the complexity of the OS kernel at the expense of reliability and maintainability; or

2. extending the OS services through the addition of multiple, performance-robbing layers around the existing OS.

Sidestepping the Dilemma: The Message-Passing Solution

One technique that solves the performance difficulties of intertask communication is message passing. This technique involves the copying of a block of data (the message) by the OS kernel from the data space of one task to that of another. Whether the tasks are executing on the same processor or on physically remote processors does not matter. Obviously, this approach is particularly effective in integrated-network, distributed, and parallel processing environments.

An important characteristic of this approach is that message data must be physically copied from the source task to the destination task. This physical copying of the message accomplishes a "disconnection" between the two tasks, thus allowing the tasks to run on different processors (if necessary). If one of the two tasks provides OS-related services, this disconnection easily results in a networked operating system.

Figure 1: Send/Receive Message Passing

While performance optimization techniques may encourage the passing of pointers to messages (rather than passing message contents), this optimization has negative effects. In actual practice, the time for data transfer (passing the message) does not represent a significant portion of the task-switching process. The task switch itself represents the bulk of the operation. Also, the vast majority of messages are only a few bytes long.

In specialized applications where large buffers must be passed between tasks and where networking is not an issue, a pointer to the necessary buffer can be passed within the message. If the message is not physically copied, many additional details must be managed. The primary problem is that a sending task cannot modify or release a message buffer until the receiving task has indicated that it is finished with the message. The synchronization issues that must then be addressed only complicate and impede the operation of the system.
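The difference between copying a message and passing a pointer to it can be illustrated with a short Python sketch. This is not QNX code: send_by_copy and send_by_pointer are hypothetical stand-ins for the two strategies, and a bytearray stands in for the sending task's message buffer.

```python
def send_by_copy(message):
    """Model a kernel copy: the receiver gets its own snapshot."""
    return bytes(message)


def send_by_pointer(message):
    """Model pointer passing: the receiver shares the sender's buffer."""
    return message


sender_buffer = bytearray(b"open /user/peggy/new")
copied = send_by_copy(sender_buffer)
shared = send_by_pointer(sender_buffer)

# The sender reuses its buffer immediately after sending.
sender_buffer[0:4] = b"kill"

print(copied)          # the copy still holds the original request
print(bytes(shared))   # the shared buffer changed under the receiver
```

With the copy, the sender is free to modify or release its buffer at once. With the pointer, the receiver sees the sender's later change, which is why the two tasks would need extra synchronization.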

Conventional, layered operating systems protect the OS from user tasks by rigidly separating memory into "system" and "user" areas. An OS built from a group of cooperating tasks that pass messages can be set up without distinct system and user memory spaces. The only necessary system memory management is that already provided to support user tasks. A system task is then treated the same as a user task except that the system task is providing a resource intrinsic to the OS.

The dilemma confronted when expanding a conventionally structured OS is neatly sidestepped with this multiple-task arrangement. Extensions to the OS are painlessly added as additional tasks that efficiently pass messages to the existing OS tasks. Maintenance of the OS also can be easily managed because each task is responsible for only a well-defined set of services, requested through an explicitly defined set of messages.

If memory-management hardware is available, an additional benefit of this structure is that all user and OS tasks are protected from one another. The 80286 microprocessor running in protected mode is an example of an environment within which this protection is available. The modular nature of this type of OS design makes it highly reliable and easy to modify and maintain.

Message-Passing Operating System Interface

The application interface to a message-passing OS is quite different from the OS interface provided by operating systems such as OS/2, PC-DOS, or Unix. Such operating systems require the application program to execute a software interrupt or subroutine call, passing the data for the request either in the processor registers or through a pointer to a predefined table or buffer. The OS then expects to be able to read and write directly into the data space of the application making the OS request. This requires that the OS be resident on the same CPU as the task making the request and results in severe problems for networked versions of these operating systems.

Additional layers of software (which decrease overall system performance) are required to solve this problem. In contrast, a message-passing interface produces an OS with a single, unified interface that works for communication between either local tasks or remote tasks. This unified interface results in a smaller, leaner OS that need not support two sets of interfaces. While other OS interfaces require applications to be written in both "networked" and "non-networked" flavors, and require the operating system to support two sets of interfaces, an application for a message-passing OS need only be written for a single interface.

One such message-passing operating system is QNX, designed and developed by Gordon Bell and Dan Dodge as an outgrowth of research done at the University of Waterloo in Canada. This operating system was introduced in 1982 by Quantum Software Systems Ltd. It is currently being used at over 55,000 sites in applications ranging from integrated office automation systems to robotics and real-time process control systems.

Although its underlying architecture is much different from Unix, the QNX interface itself is Unix-like. The OS consists of a group of cooperating tasks that pass messages among themselves in order to accomplish various OS requests. These tasks are referred to as administrator tasks because they are essential to the operation of the OS. When an application task requires OS services (such as device I/O, task creation, and so forth), messages are sent to the administrator task that provides the required service. If those services are required of another workstation or node in the network, those same request messages need only be sent to the administrator tasks on the remote node. This message redirection is handled transparently by the system.

The kernel holds together all of the administrator tasks. The QNX kernel, which represents 10K of highly optimized code, has the primary function of performing the message-passing and task-synchronization functions within the OS. A priority-based task scheduler within the kernel provides QNX with the deterministic response time necessary for real-time applications. On an 8-MHz 80286, the kernel performs 3,200 task switches per second; on a 16-MHz 80386, 7,200 task switches per second. Assuming another interrupt is not being serviced, the worst-case interrupt latency on an 8-MHz 80286 is 30 microseconds.
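To put these figures in perspective, the task-switch rates can be converted into a time and clock-cycle budget per switch. The following Python arithmetic uses only the numbers quoted above; the helper names are illustrative, and the results are averages implied by those rates rather than measurements.

```python
def microseconds_per_switch(switches_per_second):
    """Average time taken by one task switch, in microseconds."""
    return 1_000_000 / switches_per_second


def cycles_per_switch(clock_hz, switches_per_second):
    """Average number of clock cycles consumed by one task switch."""
    return clock_hz / switches_per_second


us_286 = microseconds_per_switch(3200)            # 8-MHz 80286
us_386 = microseconds_per_switch(7200)            # 16-MHz 80386
cycles_286 = cycles_per_switch(8_000_000, 3200)
cycles_386 = cycles_per_switch(16_000_000, 7200)

# 30 microseconds at 8 MHz (8 cycles per microsecond), in clock cycles.
latency_cycles_286 = 30 * 8

print(us_286, cycles_286)
print(round(us_386, 1), round(cycles_386))
print(latency_cycles_286)
```

On the 8-MHz 80286 this works out to 312.5 microseconds, or 2,500 clock cycles, per task switch, and the 30-microsecond worst-case interrupt latency corresponds to 240 clock cycles.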

System Administrator Tasks

Various administrator tasks are placed around the kernel. "Task" is the task that provides facilities relating to task creation, task death, memory allocation, and task name registration. These are given the highest priority in the system. The protected-mode 80286 version of QNX supports 150 tasks, while the real-mode 80286 or 8088 versions support 64 tasks.

Fsys is the task that implements the QNX file system. It manages the on-disk data structures that represent files and directories. Messages can be sent to this task to request operations related to the file system (such as file opening, closing, reading, and writing, as well as absolute disk block manipulation). Fsys implements a tree-structured file system that supports disks up to 1 terabyte (a million Mbytes) in size with a space-efficient 512-byte unit of allocation. This file system supports random seeks within files from any point to any other point with a single, direct disk seek. Unlike typical file systems, intervening disk blocks need not be read to perform large seeks. This means QNX can be used for large, multiuser database applications. Because the file system is also power-fail safe, QNX is also suitable for harsh environments. The file ownership, attribute, and permission checking usually found within a multiuser file system is also handled by Fsys. Block-oriented device drivers can be installed using a "mount" command and become an extension of the Fsys task. See accompanying text about "Mounting Device Drivers." Special tasks can also be written to adopt a drive for special purposes.
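The scale implied by a 1-terabyte limit with 512-byte allocation units can be checked with a little arithmetic. The Python sketch below assumes that one Mbyte means 2^20 bytes; the bit count it derives is simple arithmetic on those figures, not a statement about the actual on-disk format used by Fsys.

```python
BLOCK_SIZE = 512            # bytes per unit of allocation
MBYTE = 2 ** 20             # assumption: 1 Mbyte = 1,048,576 bytes

disk_bytes = 1_000_000 * MBYTE          # "a million Mbytes"
blocks = disk_bytes // BLOCK_SIZE       # allocation units on the largest disk
bits_needed = (blocks - 1).bit_length() # bits needed to number every block

print(blocks)
print(bits_needed)
```

Under these assumptions the largest disk holds 2,048,000,000 allocation units, so every block can be numbered with 31 bits.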

Dev is the task that performs character-oriented I/O. Drivers for the console, serial, and parallel devices are present within this task. Additional drivers can be mounted as background tasks, which can then adopt device names from the Dev task for special applications. The drivers within this task perform all the handling for options (such as flow control, line editing, baud rate changes, and so forth). Changes to option settings are performed by utilities that send the appropriate messages to Dev, thus commanding Dev to modify the requested options. Also present are library routines that allow user programs to communicate these requests. A set of routines that implement high-speed video output is included in Dev. Since these routines are integrated into the terminal-independent screen and keyboard library, programs can be written that perform instantaneous screen updates on the console, while retaining terminal independence for terminal or modem applications. On a PC AT, 19 physical devices are supported, in addition to the 40 virtual device names available for adoption by device-driver tasks. The fast task switching and low interrupt latency of QNX allow many more serial devices to be supported than under conventional, non-real-time, Unix-derived operating systems.

Idle is a null task executed whenever all the other tasks in the system are in a blocked state and waiting for an external event either to occur or to complete. Idle runs at the lowest priority in the system.

Net is the task within QNX that performs message passing between machines on a network. This task exists only in networked versions of QNX and occupies approximately 20K of memory. (By comparison, the standard networking extension for PC-DOS is nearly 190K in size.)

The user-extendable Timer is an optional task that can be started by the user to add complex timing capabilities. Other tasks can request Timer to provide timeouts ranging from one millisecond to many years. Because of the real-time scheduling within QNX, tasks can be accurately scheduled with very precise timing resolution.

The Queue manager is a task that can perform queued message passing similar to that provided in mailbox-oriented operating systems. The standard send/receive intertask message calls within QNX are blocking. Unless a conditional receive has been explicitly requested, these calls do not allow the sending task to continue until the message has been received. This blocking design is deliberate within the operating system, although it may not be convenient for some system designs. To support those designs that require it, the Queue manager task can be started to provide network-wide, queued message passing. Unlike the OS/2 Queue manager, this Queue manager buffers entire messages, rather than just pointers to messages. This allows queues to be used across the network (if necessary). If performance optimization is necessary and network transparency is not important, the message stored in the queues can be a pointer to the message.
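The idea of a queue that buffers entire messages can be sketched in a few lines of Python. The QueueManager class and its enqueue and dequeue methods are hypothetical names chosen for illustration; they do not represent the QNX Queue manager's actual interface.

```python
import queue


class QueueManager:
    """Hypothetical mailbox that buffers whole messages, not references."""

    def __init__(self):
        self._queues = {}

    def enqueue(self, queue_name, message):
        # Copy the message so the sender may reuse its buffer immediately.
        self._queues.setdefault(queue_name, queue.Queue()).put(bytes(message))

    def dequeue(self, queue_name):
        # Blocks while the named queue is empty, like a waiting receiver.
        return self._queues.setdefault(queue_name, queue.Queue()).get()


manager = QueueManager()
buffer = bytearray(b"print report.txt")
manager.enqueue("spool", buffer)     # returns without waiting for a receiver
buffer[:] = b"print ledger.txt"      # the sender reuses its buffer right away
manager.enqueue("spool", buffer)

first = manager.dequeue("spool")
second = manager.dequeue("spool")
print(first, second)
```

Because each queued entry is a complete copy, the sender continues without waiting, and the queued data remains valid even if the receiver runs much later or on another node.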

Through a request to Task, user-written tasks are able to become admin tasks. As admin tasks, they share the same privileges as the original set of tasks that make up the OS. Being able to start admin tasks allows the initial functional capabilities of the OS to be extended at run-time.

An essential characteristic of an admin task is that it cannot be arbitrarily killed by other tasks. Typically, admin tasks are commanded to shut down and to release any system resources the tasks may have allocated. An admin task is also able to detect the death of other system tasks so that resources allocated by the admin task for those dead tasks may be released.


OS/2: A Real-Time Alternative

By G. Michael Vose

G. Michael Vose is co-editor of the newsletter "OS Report: News and Views on OS/2." He can be reached at Box 3160, Peterborough, NH 03458.

Neither Microsoft nor IBM touts OS/2 as a real-time operating system. Nevertheless, programmers might write OS/2 applications that must track real time. This is particularly true when programmers are developing communications applications to monitor events and take action when responses fail to occur as expected. Real-time tracking can provide the user with a specific time period in which to perform some action, or perform an action such as saving an editor's buffer at a regular, timed interval. Real-time control can also allow an application to run at preset time intervals. OS/2 has several timer service functions to facilitate writing real-time control routines.

The only fly in the OS/2 real-time control ointment is OS/2's main feature: multitasking. Because multiple threads and processes can be running simultaneously, real-time tracking that uses the CPU clock can never be totally accurate because a higher-priority process may be eating up CPU cycles. But multitasking has its advantages, too. Using multiple threads, a program can synchronize several different hardware devices so that they can perform simultaneous tasks. Timers can likewise synchronize the activities of several asynchronous programs.

OS/2 provides both synchronous and asynchronous timer services. DosSleep is the synchronous function that puts your application on hold so that you do not need delay loops. DosTimerAsync, DosTimerStart, and DosTimerStop are asynchronous functions that allow you to start, stop, and read software timers, using system semaphores to alert an application when timing functions have finished executing. The timer starts when it is called and then control passes back to the calling thread, which resumes execution. The thread and the timer execute concurrently. Upon completion of the timer's interval, the timer clears a semaphore. The calling thread can check the semaphore to see if timing is complete. For example, if your program issues a DosTimerAsync(5000, mysem, semidentifier) call, a timer with a five-second interval begins execution and at the end of the interval clears the semaphore mysem, which you can read on the file handle semidentifier. Your program must create and set the semaphore by using the DosCreateSem and DosSemSet functions before you call the asynchronous timer.

DosTimerStart operates much the same as DosTimerAsync but continues to run, clearing its associated semaphore each time the timer interval elapses. You must reset the semaphore after it is cleared. DosTimerStop is used to halt DosTimerStart. The DosSleep function acts as a synchronous timer. The thread that calls DosSleep suspends its execution for the interval of the timer. A DosSleep(5000) call puts its calling thread on hold for five seconds.
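The one-shot asynchronous timer pattern can be sketched in Python as an analogy. This does not call the OS/2 API: threading.Timer stands in for the software timer, a threading.Event stands in for the semaphore (setting the event plays the role of clearing the semaphore), timer_async is a hypothetical helper, and the interval is shortened to 50 milliseconds so the example finishes quickly.

```python
import threading
import time


def timer_async(interval_ms, done):
    """Start a one-shot timer that signals `done` when the interval ends."""
    timer = threading.Timer(interval_ms / 1000.0, done.set)
    timer.daemon = True
    timer.start()
    return timer


done = threading.Event()       # created before the timer is started
timer = timer_async(50, done)  # control returns to the caller at once

work_units = 0
while not done.is_set():       # the caller polls while doing other work
    work_units += 1
    time.sleep(0.005)          # a synchronous delay, in the spirit of DosSleep

print("timer expired after", work_units, "units of work")
```

The calling thread and the timer run concurrently; the caller learns that the interval has elapsed only by checking the shared flag.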

If your programs merely need to synchronize the flow of data among threads or processes, semaphores and shared memory enable you to exploit one of several OS/2 communication paths between processes. Particularly within a single monolithic application, the private semaphore/shared memory interprocess communication technique has speed advantages over the nonprivate (but slower) pipes mechanism for sharing data. Pipes pass data only between parent processes and their children. Two-way communication between such processes requires two separate pipes.

The periodic clock interrupt (or timer tick) of OS/2 occurs 32 times each second. This means that timing functions carry a 1/32-second quantization error. Therefore, you should think in terms of seconds (not milliseconds) when using OS/2 timers. High-precision timing of events happening in the millisecond range should use other methods to measure time intervals. For example, you can make repeated DosGetDateTime calls to read the Date/Time data structure's contents into a buffer or into variables for computing the passage of small time intervals. Another solution is to check the time bytes in the read-only global information segment by using the DosGetInfoSeg call.
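The size of the quantization error is easy to work out. The Python sketch below assumes, purely for illustration, that a requested interval is rounded up to a whole number of timer ticks; the helper ticks_needed is hypothetical, and the rounding behavior of the actual OS/2 timer services is not specified here.

```python
import math

TICKS_PER_SECOND = 32
tick_ms = 1000 / TICKS_PER_SECOND      # length of one timer tick


def ticks_needed(interval_ms):
    """Whole ticks required to cover an interval (round-up is an assumption)."""
    return math.ceil(interval_ms / tick_ms)


for requested_ms in (5000, 100, 10):
    delivered_ms = ticks_needed(requested_ms) * tick_ms
    print(requested_ms, "ms requested ->", delivered_ms, "ms at tick resolution")
```

One tick lasts 31.25 milliseconds. A five-second interval is an exact multiple of the tick, while a request in the 10-to-100-millisecond range can be off by a large fraction of its own length.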

You can alleviate the problem of timer service threads losing clock cycles to higher-priority threads by elevating the priority of the thread making the timer calls. If you make these threads the highest-priority threads, you will ensure that events needing critical timer servicing won't lose clocks to higher-priority threads.

The major difficulty developers face in writing real-time control software is interrupt processing. OS/2 does not allow an application to process hardware or software interrupts. You can process interrupts only with an OS/2 device driver. Since these drivers are very difficult to write in a high-level language, creating routines to handle interrupts, which can then notify and pass data to an application, will be a complex process. In short, OS/2 should be considered a workable real-time operating system alternative only for those custom applications where you are in total control of the environment and can therefore control worst-case interrupt latency and task scheduling times. G.M.V.

Send/Receive Message Passing Primitives

QNX implements two message-passing primitives: send and receive. These primitives are unbuffered, blocking operations that cause the task issuing a send request to be blocked if the target task is not correspondingly receive-blocked. When two tasks are in complementary send/receive states, the message is transferred and the receiving task becomes unblocked (see Figure 1). The highest-priority task will then run. QNX always executes the highest-priority unblocked task. If two tasks are compute bound at the same priority level, round-robin task scheduling will occur. Should an event occur that causes a higher-priority task to become unblocked, QNX will pre-empt the currently executing task and switch to the higher-priority task. Fast, pre-emptive task scheduling is essential to real-time applications.
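The scheduling rule described here (always run the highest-priority unblocked task, and rotate among tasks at the same level) can be sketched as a small Python model. The Scheduler class, the task names, and the convention that a lower number means a higher priority are all assumptions made for this illustration; this is not the QNX scheduler.

```python
from collections import deque


class Scheduler:
    """Toy model: strict priorities with round-robin inside each level."""

    def __init__(self):
        # Assumption for this sketch: a lower number is a higher priority.
        self._ready = {}

    def make_ready(self, name, priority):
        self._ready.setdefault(priority, deque()).append(name)

    def pick(self):
        """Choose the next task and rotate it to the back of its level."""
        for priority in sorted(self._ready):
            level = self._ready[priority]
            if level:
                name = level.popleft()
                level.append(name)
                return name
        return "idle"   # nothing is unblocked


sched = Scheduler()
sched.make_ready("logger", 5)
sched.make_ready("report", 5)
first_three = [sched.pick() for _ in range(3)]   # two equal-priority tasks

sched.make_ready("modem", 1)                     # a higher-priority task unblocks
after_event = sched.pick()

print(first_three, after_event)
```

The two compute-bound tasks at the same level alternate, and as soon as the higher-priority task becomes ready it is chosen ahead of both.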

An important aspect of the send/receive operations is that time-ordered queuing is performed whenever more than one task attempts to communicate with the target task. Multiple send requests to a single task performing a receive are queued in the order they were received and are processed in sequence. The target task has the option of completely servicing the first request before serving the next request (or additional requests).
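A blocking send paired with a receive that serves senders in arrival order can be modeled with Python threads. The Task class and its send and receive methods are hypothetical; threads stand in for tasks, and the sketch captures only the blocking and time-ordered queuing behavior described above, not the QNX kernel's implementation.

```python
import queue
import threading
import time


class Task:
    """Hypothetical task with a FIFO queue of pending senders."""

    def __init__(self, name):
        self.name = name
        self._inbox = queue.Queue()   # senders wait here in arrival order

    def send(self, target, message):
        """Stay send-blocked until the target has received the message."""
        received = threading.Event()
        target._inbox.put((self.name, bytes(message), received))
        received.wait()

    def receive(self):
        """Stay receive-blocked until a sender is queued, then unblock it."""
        sender, message, received = self._inbox.get()
        received.set()
        return sender, message


server = Task("server")
senders = []
for i in range(3):
    client = Task("client%d" % i)
    thread = threading.Thread(target=client.send,
                              args=(server, b"request %d" % i))
    thread.start()
    # Wait until this sender is queued so that the arrival order is known.
    while server._inbox.qsize() <= i:
        time.sleep(0.001)
    senders.append(thread)

# Every sender is still blocked, because the server has not yet received.
blocked_before = [t.is_alive() for t in senders]
served = [server.receive() for _ in range(3)]
for thread in senders:
    thread.join()

print(blocked_before)
print(served)
```

The three senders remain blocked until the server issues its receives, and the requests are served in the order in which they were sent.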

In addition to the send/receive operators, mechanisms called exceptions, ports, and registered names are available for intertask communication. These additional mechanisms are useful for special cases where send/receive communication is not appropriate.

An exception is similar to the signal found in Unix. An exception is an asynchronous event that can cause an exception handler within the task to be executed in response to the exception. Exceptions are valuable because they can be used to break out of the send/receive blocked states. The most common exception is the break exception that is generated from the keyboard.

The primary use of a port is for interrupt handlers to communicate with a task. This facility makes it possible to write a task that contains the interrupt handler within the body of the task itself. Using standard system calls, the task is able to attach to a port and then connect the handler to the appropriate interrupt vector. After any other internal initialization, the task can receive-block itself upon the port and optionally open itself for message reception from other tasks. An "attach" or "detach" operator is used by a task to obtain or release a port.

During interrupt service time, the interrupt handler is able to make use of the code and data of the task. If an event requiring handling by the task or the OS results, the handler can signal the assigned port, causing the task to become unblocked in order to perform whatever service the interrupt handler required. Note that the interrupt handler is connected directly to the interrupt vector and that no operating system overhead is added to the interrupt service time.

For two tasks to communicate, the sending task must know the node number and task identifier (ID) of the destination task. If the sending task was responsible for starting the remote task, the sending task will know this information. If the sending task is expecting to send to a previously present task, the sending task must be able to discover the node and task ID (TID) of that task. To facilitate this, the receiving task that wishes to provide a network-accessible service can register a textual name. Tasks needing to locate that task can obtain the node and TID by using the textual name that the task would have registered.

For example, one task that typically needs to register itself is a print spooler. Once registered, any task on the network wanting to print need only inquire about the spooler task and use the standard intertask messaging to send the data to be printed to the spooler task. Multiple spoolers could be started for any printer, anywhere in the network. Any spooler and its corresponding printer could be relocated without concern for whether tasks needing that spooler would be able to locate it.
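Name registration amounts to a lookup table from a textual name to a node and task ID. The Python sketch below is a minimal model of that idea; NameRegistry, its methods, and the node and TID values are hypothetical and chosen only for illustration.

```python
class NameRegistry:
    """Toy model of network-wide name registration."""

    def __init__(self):
        self._names = {}

    def register(self, name, node, tid):
        # A later registration under the same name replaces the earlier one.
        self._names[name] = (node, tid)

    def locate(self, name):
        """Return (node, tid) for a registered name, or None if unknown."""
        return self._names.get(name)


registry = NameRegistry()
registry.register("spooler", node=4, tid=17)
before_move = registry.locate("spooler")

# The spooler and its printer are relocated to another node.
registry.register("spooler", node=2, tid=9)
after_move = registry.locate("spooler")

print(before_move, after_move)
```

A client that looks the spooler up by name before each use finds it wherever it is currently running, without being changed itself.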

Since the primary tasks within the operating system are assigned predefined TID numbers at boot time, remote tasks are always able to communicate directly with the primary OS tasks anywhere on the network without having to first discover their TID numbers.

$ [2] [3]4:/cmds/copy [1]3:/user/peggy/new [4]$lpt

Assuming this command line were typed from the console on node 6, the command copy would be loaded from node 3, drive 4 into memory on node 2. While executing on node 2, the copy command would read data from the file new on node 1, drive 3, and send it out to the $lpt device on node 4. If a `&` character were placed on the end of the command line, the command would execute as a background task on node 2, freeing the user's console for other work. None of the general utilities contain special code to handle network-oriented file or device names.

In this case, the "copy" command itself is 2,700 bytes in size and operates with simple calls to standard file-processing library routines.

Device access across the network is able to work in this manner because whenever a task performs an fopen to open a file or device, it sends a message to the Fsys task (for files) or the Dev task (for devices) on the node that owns the file or device. When the requested task replies to the fopen request, any following read/write messages will be routed directly to the remote task that controls the device.
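The routing decision behind a network-wide open can be sketched in Python. The name syntax handled here (an optional node in brackets, an optional drive number followed by a colon, and a leading $ marking a device) is inferred from the example command line only; parse_name and route_open are hypothetical helpers, not part of any QNX library.

```python
import re

# Optional "[node]", optional "drive:", then the rest of the name.
_NAME = re.compile(r"^(?:\[(\d+)\])?(?:(\d+):)?(.*)$")


def parse_name(name):
    """Split a name into (node, drive, path); missing parts become None."""
    node, drive, path = _NAME.match(name).groups()
    return (int(node) if node else None,
            int(drive) if drive else None,
            path)


def route_open(name, local_node):
    """Return (node, administrator task) that should receive the open request."""
    node, _drive, path = parse_name(name)
    admin = "Dev" if path.startswith("$") else "Fsys"
    return (node if node is not None else local_node, admin)


print(parse_name("[3]4:/cmds/copy"))
print(route_open("[1]3:/user/peggy/new", local_node=2))
print(route_open("[4]$lpt", local_node=2))
```

A program that simply passes a name to the open call needs no network-specific code; the node prefix alone determines which administrator task receives the request.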

In a Unix environment, the network typically supports only terminal sessions and file transfers. To access a remote database, the task performing the I/O must log in to the remote node and execute the query task there. The result is that the central machine (already burdened with the file and device I/O) must also execute the tasks that are generating the I/O requests for all the users on the network. A QNX network allows the tasks to execute on each user's workstation with only the remote file and device I/O requests flowing to the node (or nodes) that contain the devices being accessed.

With this naming flexibility, resources present on the entire network are part of the same "name space" and may be operated upon just as the resources present on a single node. A program written to access files or devices by name can name any file or device on the network. The program can then have transparent access to the file or device without resorting to a special "network services" interface.

Conclusion

Through the use of a message-passing architecture, QNX is able to provide real-time performance with multiuser support and full network transparency in an operating system that occupies less than 150 Kbytes. QNX can run on as limited a machine as a PC with a single floppy drive and 256K of RAM, or in protected mode on an 80286 or 80386 with 16 Mbytes of RAM. The networking transparency results in a QNX "mainframe" that can be built piece by piece to provide as much performance as necessary.