New real-time operating-system (RTOS) enhancements make 99.999% availability and real-time application requirements achievable. Transaction processing, process control, communications switching, and air-traffic control are just a few examples of applications in which downtime cannot be tolerated. Companies such as Monta Vista, OSE Systems, QNX Systems, Red Hat, Lynuxworks, and Wind River Systems have added high-availability services to the list of modules that can be incorporated into an RTOS.
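For a sense of scale, the 99.999% target can be converted into a yearly downtime budget. The short Python sketch below does the arithmetic; the function name and the choice of an average 365.25-day year are assumptions made here for illustration.

```python
# Assumption: an "average" year of 365.25 days is used for the conversion.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per year")
```

Under that assumption, 99.999% availability leaves a little more than five minutes of total downtime per year, which is why failures must be detected and handled automatically rather than by an operator.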
The technology of high-availability systems isn't new. IBM, Sun, Microsoft, and others have done it for years. Custom embedded systems have often utilized high-availability techniques through customized software instead of standardized OS support.
High-availability hardware isn't new either, but this type of hardware, such as RAID disk and tape support, is showing up in more embedded and real-time systems. Standard CompactPCI systems, like those from Force Computers, provide hot-swap board support. Likewise, network interconnects, including Ethernet and InfiniBand, give developers a choice of implementation methods. Today, off-the-shelf hardware can provide high-availability support with an off-the-shelf RTOS.
Available high-availability hardware systems generally feature:
- Hot-swapping capability. This is available in computer boards, such as CompactPCI boards, and in disk and tape drives.
- Multiprocessor links. Popular interconnects like InfiniBand and CompactPCI, as well as networks like Ethernet, include this feature.
- A RAID (redundant array of independent disks) architecture, as found in disk and tape drives.
It's important to recognize the roles redundant hardware and hot-swapping play in a high-availability system (see "Hot-Swapping Is Only Part Of The Hardware Story," p. 44). A number of hardware technologies are available to implement high-availability systems.
Software support for high-availability systems is cropping up in a number of places (Fig. 1). Now, even an application programming interface (API) exists for CompactPCI.
Checkpointing, transaction support, and application heartbeat support are just some of the features being used with real-time systems. But the APIs aren't always standardized across vendors because each OS implements heartbeat support in a different fashion.
Checkpointing is the ability to save enough information from a process to restart it if it fails. Heartbeat support is the mechanism for detecting when a process has failed, typically by expecting a periodic signal from it and treating a missed signal as a failure.
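The two ideas can be sketched in a few lines of Python. This is a minimal illustration only, not any vendor's API: the `HeartbeatMonitor` class, the two checkpoint helpers, and the task names are hypothetical, and a real RTOS would keep checkpoints in reliable storage rather than in memory.

```python
import pickle
import time


class HeartbeatMonitor:
    """Flags any process whose last heartbeat is older than a timeout."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the demo is deterministic
        self.last_beat = {}         # process name -> time of last heartbeat

    def beat(self, name):
        """Called periodically by a healthy process."""
        self.last_beat[name] = self.clock()

    def failed(self):
        """Return the names of processes that have gone silent."""
        now = self.clock()
        return sorted(name for name, seen in self.last_beat.items()
                      if now - seen > self.timeout_s)


def save_checkpoint(state):
    """Serialize enough state to restart a process later."""
    return pickle.dumps(state)


def restore_checkpoint(blob):
    """Rebuild the saved state for the restarted process."""
    return pickle.loads(blob)


# Demo with a fake clock so the result does not depend on real time.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=2.0, clock=lambda: fake_now[0])
checkpoint = save_checkpoint({"calls_routed": 1042})

monitor.beat("switch_task")
monitor.beat("logger_task")
fake_now[0] = 1.5
monitor.beat("logger_task")         # only the logger keeps reporting in
fake_now[0] = 3.0

for name in monitor.failed():       # switch_task has been silent for 3.0 s
    state = restore_checkpoint(checkpoint)
    print(f"{name} missed its heartbeat; restarting with state {state}")
```

The timeout is the main design choice: a short one detects failures quickly but risks false alarms when a healthy process is merely delayed.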
Modularity is still the key aspect of high availability in an RTOS. One example is a partitioning of high-availability services that closely matches the underlying OS: Wind River's new VxWorks Foundation HA, which builds on the company's VxWorks AE RTOS (Fig. 2).
Other examples include Lynuxworks' Lynx/HA and Monta Vista's High Availability Framework, which add high-availability support to Linux-compatible and Linux operating systems, respectively. These additions have a modular construction similar to that of VxWorks Foundation HA.
Hardware may steal the limelight in numerous circuit designs, but high-availability hardware won't work without the correct software. More importantly, high-availability applications need to operate regardless of the kind of hardware available in the system. In particular, applications must continue working with other applications in the system, even if one application fails due to errant coding, a lack of resources, or other software-related problems.
In some cases, software failover support can be provided transparently. That's how many message-based systems operate.
In general, a high-availability system should have the following software services:
- Heartbeat support for each server and each application.
- Event management capability for change notification.
- Alarm management for error handling.
- Transaction capability for checkpointing and rollback/restart.
- Clustering for server management and application links.
- Reliable storage support for RAIDs and for journaling file systems.
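As one illustration of the transaction service in the list above, the following Python sketch takes a checkpoint before a group of changes and rolls back to it if any step fails. The `transaction` helper and the `account` example are hypothetical and stand in for whatever transaction API a given RTOS provides.

```python
import copy
from contextlib import contextmanager


@contextmanager
def transaction(state: dict):
    """Apply changes to `state` atomically: keep all of them or none."""
    snapshot = copy.deepcopy(state)     # checkpoint taken before any change
    try:
        yield state
    except Exception:
        state.clear()
        state.update(snapshot)          # roll back to the checkpoint
        raise                           # let the caller see the failure


account = {"balance": 100}
try:
    with transaction(account) as acct:
        acct["balance"] -= 150
        if acct["balance"] < 0:
            raise ValueError("insufficient funds")
except ValueError:
    pass

print(account)  # {'balance': 100}
```

Because the failed update is undone, other applications never observe the half-finished state, which is the property a rollback/restart service is meant to guarantee.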
With QNX, applications communicate with each other using a messaging system that is part of the RTOS' core services. The QNX message system supports transparent message-based services independent of its new high-availability support. The QNX link manager can detect a failed application and redirect messages to an alternate application (Fig. 3).
The link manager can utilize alternate paths between applications and start up a new application if necessary. Changes are handled based on an application's description of a link. QNX uses messaging for all major services, and messages move transparently across node boundaries (Fig. 3, again). Of course, this redirection works equally well between applications on the same node.
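The redirection idea can be sketched in Python as follows. This is not the QNX API; the `LinkManager` class, its methods, and the `billing` service are hypothetical names used only to show how a sender can stay unaware that its message was rerouted to an alternate application.

```python
class LinkManager:
    """Routes messages to a service and fails over to alternates."""

    def __init__(self):
        # service name -> ordered list of handlers, primary first
        self.routes = {}

    def register(self, service, handler):
        """Add an application (handler) that can serve the named link."""
        self.routes.setdefault(service, []).append(handler)

    def send(self, service, message):
        """Deliver to the first live handler; drop handlers that fail."""
        handlers = self.routes.get(service, [])
        while handlers:
            try:
                return handlers[0](message)
            except ConnectionError:
                handlers.pop(0)         # failed endpoint: redirect to next
        raise RuntimeError(f"no live endpoint for {service!r}")


def primary(message):
    raise ConnectionError("primary application has failed")


def backup(message):
    return f"backup handled: {message}"


links = LinkManager()
links.register("billing", primary)
links.register("billing", backup)

# The sender uses the same call whether or not a failover happens.
print(links.send("billing", "ping"))    # backup handled: ping
```

A fuller version would also start a replacement application when the list of alternates runs low, as the link manager described above can do.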
Some RTOSs add messaging capabilities as part of their high-availability services. For example, Lynuxworks' Lynx/HA includes message-oriented middleware that uses unicast, broadcast, and multicast transmissions to notify applications of system events. Lynuxworks also includes CORBA-compatible quality-of-service options.