Article updated 9/26/22
New real-time operating-system (RTOS) enhancements make 99.999% availability and real-time application requirements achievable. Applications like transaction processing, process control, communications switching, and air-traffic control are just a few examples where any downtime cannot be tolerated. Such companies as Monta Vista, Enea, BlackBerry QNX Systems, Red Hat, Ubuntu, Lynx Software Technologies, and Aptiv/Wind River have added high-availability services to the list of modules that can be incorporated into an RTOS.
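To put the 99.999% figure in perspective, the short sketch below converts an availability target into a yearly downtime budget. It is plain arithmetic rather than anything vendor-specific, and the function name and the 365.25-day average year are assumptions made for illustration. Under those assumptions, five nines works out to a little over five minutes of downtime per year.

```python
# Downtime budget implied by an availability target (illustrative arithmetic only).
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumed average year, including leap days


def allowed_downtime_minutes(availability: float) -> float:
    """Return the yearly downtime, in minutes, permitted by an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    targets = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
    for label, target in targets:
        print(f"{label}: {allowed_downtime_minutes(target):.2f} minutes per year")
```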
The technology of high-availability systems isn't new. IBM, Sun, Microsoft, and others have done it for years. Custom embedded systems have often utilized high-availability techniques through customized software instead of standardized OS support.
High-availability hardware isn't new either, but such hardware, including RAID disk and tape support, is showing up in more embedded and real-time systems. Standard CompactPCI Serial systems, like those from ADLINK, provide hot-swap board support. Likewise, network interconnects, including Ethernet and InfiniBand, give developers a choice of implementation methods. Today, off-the-shelf hardware can provide high-availability support with an off-the-shelf RTOS.
High-availability hardware systems generally feature:
- Hot-swapping capability. This is available in computer boards, such as CompactPCI Serial boards, and in disk and tape drives.
- Multiprocessor links. Popular buses like InfiniBand and PCI Express, as well as networks like Ethernet, include this feature.
- A RAID (redundant array of independent disks) architecture, as found in disk and tape drives.
It's important to recognize the roles redundant hardware and hot-swapping play in a high-availability system. A number of hardware technologies are available to implement high-availability systems.
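A rough way to see why redundant hardware matters is the textbook formula for units operating in parallel. The sketch below assumes that units fail independently and that failover between them is perfect. Neither assumption comes from the article, and real systems fall short of both, so treat the result as an upper bound. The function name is hypothetical.

```python
# Availability of N redundant units, assuming independent failures and perfect failover.
def redundant_availability(unit_availability: float, units: int) -> float:
    """The system is down only when every redundant unit is down at the same time."""
    if units < 1:
        raise ValueError("need at least one unit")
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be a fraction between 0 and 1")
    return 1.0 - (1.0 - unit_availability) ** units


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} unit(s) at 99.9% each: {redundant_availability(0.999, n):.9f}")
```

Under these assumptions, pairing two 99.9% boards yields roughly six nines on paper. Hot-swapping helps keep that figure realistic, because a failed unit can be replaced without taking down the surviving one.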
Software support for high-availability systems is cropping up in a number of places (Fig. 1). Now, even an application programming interface (API) exists for CompactPCI Serial.
Checkpointing, transaction support, and application heartbeat support are just some of the features being used with real-time systems. But the APIs aren't always standardized across vendors because each OS implements heartbeat support in a different fashion.
Checkpointing is the ability to save enough information from a process to restart it if it fails. Heartbeat support is the mechanism for detecting when a process has failed.
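As a minimal sketch of heartbeat detection, the class below records the last time each process reported in and flags any process that has been silent longer than a timeout. It is not any vendor's API. The class name, method names, and the injectable clock are assumptions chosen to keep the example self-contained and testable.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per process and reports those that went silent."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the example can run with a fake clock
        self.last_seen: Dict[str, float] = {}

    def beat(self, name: str) -> None:
        """Called by, or on behalf of, a process to signal that it is still alive."""
        self.last_seen[name] = self.clock()

    def failed(self) -> List[str]:
        """Return the names whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)


if __name__ == "__main__":
    fake_time = [0.0]
    monitor = HeartbeatMonitor(timeout_s=1.0, clock=lambda: fake_time[0])
    monitor.beat("logger")
    monitor.beat("controller")
    fake_time[0] = 0.8
    monitor.beat("logger")       # logger keeps reporting
    fake_time[0] = 1.5           # controller has been silent for 1.5 s
    print(monitor.failed())
```

A real monitor would run `failed()` periodically and hand the result to whatever restarts or replaces the silent process.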
Modularity is still the key aspect of high availability in an RTOS. One example can be seen in a partitioning of high-availability services that closely matches an OS, in this case, Wind River's VxWorks (Fig. 2).
Another example is Lynx MOSA.ic, which adds high-availability support to Linux-compatible and Linux operating systems. This addition has a modular construction similar to that of VxWorks.
Hardware may steal the limelight in numerous circuit designs, but high-availability hardware won't work without the correct software. More importantly, high-availability applications need to operate regardless of the kind of hardware available in the system. In particular, applications must continue working with other applications in the system, even if one application fails due to errant coding, a lack of resources, or other software-related problems.
In some cases, software failover support can be provided transparently. That's how many message-based systems operate.
In general, a high-availability system should have the following software services:
- Heartbeat support for each server and each application.
- Event management capability for change notification.
- Alarm management for error handling.
- Transaction capability for checkpointing and rollback/restart.
- Clustering for server management and application links.
- Reliable storage support for RAIDs and for journaling file systems.
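Of the services listed above, event and alarm management are the easiest to picture in code. The following sketch shows one possible shape: a tiny publish/subscribe manager in which an alarm is simply a handler subscribed to a fault event. All names here, including the `board_removed` event type, are hypothetical and not drawn from any RTOS.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str, dict], None]


class EventManager:
    """Delivers change notifications to every handler subscribed to an event type."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, details: dict) -> int:
        """Notify subscribers in registration order; return how many were notified."""
        handlers = self._handlers.get(event_type, [])
        for handler in handlers:
            handler(event_type, details)
        return len(handlers)


if __name__ == "__main__":
    alarms = []
    events = EventManager()
    # An "alarm" here is just a handler that records the fault for error handling.
    events.subscribe("board_removed", lambda kind, d: alarms.append((kind, d["slot"])))
    events.publish("board_removed", {"slot": 3})
    print(alarms)
```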
With QNX, applications communicate with each other using a messaging system that is part of the RTOS' core services. The QNX message system supports transparent message-based services independent of its new high-availability support. The QNX link manager can detect a failed application and redirect messages to an alternate application (Fig. 3).
The link manager can utilize alternate paths between applications and start up a new application if necessary. Changes are handled based on an application's description of a link. QNX uses messaging for all major services, and messages move transparently across node boundaries (Fig. 3). Of course, this redirection works equally well between applications on the same node.
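The redirection idea can be illustrated with a small, generic sketch. This is not the QNX API. The `LinkManager` class, its methods, and the use of `ConnectionError` to signal a failed endpoint are all assumptions made for the example. The point is only that the sender addresses a named link, and the manager decides which application actually receives the message.

```python
from typing import Callable, Dict, List

Endpoint = Callable[[str], str]


class LinkManager:
    """Routes messages on a named link to the first endpoint that still works."""

    def __init__(self) -> None:
        self._links: Dict[str, List[Endpoint]] = {}

    def register(self, link: str, primary: Endpoint, *alternates: Endpoint) -> None:
        """Describe a link as a primary endpoint followed by its alternates."""
        self._links[link] = [primary, *alternates]

    def send(self, link: str, message: str) -> str:
        endpoints = self._links[link]
        while endpoints:
            try:
                return endpoints[0](message)
            except ConnectionError:
                # Treat the endpoint as failed and redirect to the next alternate.
                endpoints.pop(0)
        raise ConnectionError(f"no live endpoint for link {link!r}")


if __name__ == "__main__":
    def failed_app(message: str) -> str:
        raise ConnectionError("application is down")

    def standby_app(message: str) -> str:
        return f"standby handled {message}"

    manager = LinkManager()
    manager.register("database", failed_app, standby_app)
    print(manager.send("database", "query-1"))  # sender never names the standby
```

Because the failed endpoint is dropped from the link, later messages go straight to the alternate without retrying the dead application.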
IBM, Microsoft, and Sun Microsystems have extensive clustering solutions. Although these tend to be used in high-end installations, the same techniques are applicable to embedded environments.
APIs for this type of clustering support are OS-specific. Applications must take advantage of these APIs, and applications that work together are tightly integrated.
Exceptions, such as a failed service or application, must be handled explicitly. High-availability support typically provides services like checkpointing and transaction rollback.
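Transaction rollback can be sketched as "snapshot, try, restore on failure." The context manager below is a minimal illustration under that assumption. It works on an in-memory dictionary, and the `transaction` name is hypothetical. Production systems typically rely on journals or reliable storage instead of a deep copy.

```python
import copy
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def transaction(state: dict) -> Iterator[dict]:
    """Apply changes to `state`; restore the prior contents if the block raises."""
    snapshot = copy.deepcopy(state)
    try:
        yield state
    except Exception:
        # Roll back in place so every holder of `state` sees the old values.
        state.clear()
        state.update(snapshot)
        raise


if __name__ == "__main__":
    account = {"balance": 100}
    try:
        with transaction(account) as acct:
            acct["balance"] -= 150
            if acct["balance"] < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # the failure is handled explicitly by the caller
    print(account)
```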
RTOS high-availability modularity allows developers to choose the kinds of services needed to support their particular requirements. This may include hardware support such as hot-swap recognition, detection of device failures and environmental problems like overheating, or the use of reliable storage.
It might further be limited to event and alarm support. Even basic heartbeat monitoring can help bring a system into high-availability land if applications are written to handle faults.
Certainly, additional high-availability modules should make the programmer's job easier. For this reason, high-availability technologies from high-end systems, such as clustering, are finding their way into embedded systems.
Some high-availability technologies already exist in many RTOSs. QNX is one example. This message-based RTOS provides transparent message redirection as part of the regular RTOS implementation. Additional support addresses features typically not found in a basic RTOS, such as transaction-oriented checkpoint support.
In this case, a checkpointed task provides data and restart information as part of a checkpoint that's managed by the QNX high-availability monitor. If the task terminates or fails to respond in a set time, the monitor will start a new task.
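The checkpoint-and-restart flow described here can be sketched generically. The code below is not the QNX high-availability monitor. `CheckpointStore`, `count_to`, and `run_with_restart` are hypothetical names, the store lives in memory, and the failure is simulated. It shows a task saving its progress after each step and a replacement task resuming from the last checkpoint instead of starting over.

```python
import json
from typing import Dict, Optional


class CheckpointStore:
    """Keeps the latest checkpoint per task as a serialized blob (in memory here)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, str] = {}

    def save(self, task: str, state: dict) -> None:
        self._blobs[task] = json.dumps(state)

    def load(self, task: str) -> Optional[dict]:
        blob = self._blobs.get(task)
        return json.loads(blob) if blob is not None else None


def count_to(limit: int, store: CheckpointStore, task: str = "counter",
             fail_at: Optional[int] = None) -> int:
    """Sum 0..limit-1, checkpointing after every step; optionally fail part-way."""
    state = store.load(task) or {"next": 0, "total": 0}
    while state["next"] < limit:
        if fail_at is not None and state["next"] == fail_at:
            raise RuntimeError("simulated task failure")
        state["total"] += state["next"]
        state["next"] += 1
        store.save(task, state)
    return state["total"]


def run_with_restart(limit: int, store: CheckpointStore, fail_at: int) -> int:
    """Stand-in for a monitor: if the task dies, start a new one from the checkpoint."""
    try:
        return count_to(limit, store, fail_at=fail_at)
    except RuntimeError:
        return count_to(limit, store)


if __name__ == "__main__":
    print(run_with_restart(5, CheckpointStore(), fail_at=3))
```

The restarted task repeats no completed work, because each checkpoint is written only after a step has finished.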
Using features like checkpointing becomes significantly easier with off-the-shelf components if the RTOS vendor provides support for the boards used in the system. The latest crop of high-availability add-ons, such as those available from Wind River and QNX, has the necessary support.
Meeting the five-nines requirement isn't the only reason to consider high-availability support. Simply providing a more reliable product is justification enough to consider a high-availability-enabled RTOS. The alternative is to build that support from scratch.
Yes, high-availability RTOS integration is just beginning.
This article appeared in Electronic Design, Oct 29, 2001.
Need More Information?
- Green Hills Software Inc.: (805) 965-6044, www.ghs.com
- IBM Corp.: (800) IBM-4YOU, www.ibm.com
- Lane 15 Software Inc.: (512) 502-9898, www.lane15.com
- Lynuxworks Inc.: (408) 979-3900, www.lynuxworks.com
- Monta Vista: (408) 328-9200, www.mvista.com
- Microsoft Corp.: (425) 882-8080, www.microsoft.com
- Enea: (408) 392-9300, www.enea.com
- PCI Industrial Computer Manufacturers Group: (781) 246-9318, www.pcimg.org
- QNX Systems: (800) 676-0566, www.qnx.com
- Red Hat Inc.: (919) 547-0012, www.redhat.com
- Sun Microsystems Inc.: (800) 786-7638, www.sun.com
- Wind River Systems: (800) 545-WIND, www.windriver.com