High performance and predictability are prerequisites for any large-scale networked system that depends on real-time data processing and analysis. Data representing actual events or system status must be evaluated while it's still relevant to tactical conditions, making it imperative to know when specific data is available, and to aggregate and evaluate that data in real time. Unreliable receipt times make effective analysis difficult or impossible.
Fast and predictable performance is always an issue in the design of such a system. This is especially the case when designing distributed systems with thousands of nodes that must move lots of data around quickly in a dynamically changing environment.
Switched-fabric networks, which can provide fast and highly scalable hardware solutions, are increasingly finding their way into such applications. What's needed beyond that is a software solution that brings predictability, flexibility, and reliability to distributed data communications. In this article, we describe how the Data Distribution Service (DDS) data-centric publish-subscribe middleware layer can realize the full potential of a hardware switched-fabric network to deliver a complete solution for application developers.
Data-Critical Systems Share Characteristics
Many large-scale data-critical applications can be characterized by three attributes: the need to gather and distribute data in real time, the large amount of data being transferred, and the variety of entities involved in this data exchange (a set that may even change over time). For instance, data-critical systems like air-traffic control, financial transaction processing, battlefield and naval command and control, and industrial automation all feature these three attributes.
Such systems aren't necessarily hard real time, but their predictability requirements represent an integral part of the functions they perform. They gather data from various sources (e.g., sensors) and distribute the data to a number of users like databases, display devices, or control algorithms. Furthermore, by their very nature, they are distributed.
Today's bus-based architectures, typically multi-CPU, VME backplane solutions with hard-wired I/O interfaces to sensors and effectors, fall short in several areas in addressing the needs of data-critical systems. For example, these hardware transport mechanisms don't scale, are difficult to make fault-tolerant, and are tough to modify and upgrade once they've been deployed.
For these reasons, designers of complex, data-critical distributed systems are turning to switched fabrics to replace bus backplane and serial interconnect technologies. StarFabric, PCI Express Advanced Switching, Serial RapidIO, and InfiniBand are some of the commercial technologies that implement different switched-fabric designs.
A switched-fabric bus is unique in that it allows all nodes on the bus to "logically" interconnect with all other nodes on the bus (Fig. 1). Each node is physically connected to one or more switches, and switches may be connected to each other. This topology results in a redundant network or "fabric," in which there may be more than one physical path between any two nodes. A node may be logically connected to any other node via the switch(es). A logical path is temporary and can be reconfigured, or switched, among the available physical connections. Among other features, switched-fabric networks can provide fault tolerance and scalability without unpredictable degradation of performance.
Figure 1: Switched fabric architecture. Multiple switches can be used to expand the fabric and provide hardware redundancy.
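The idea of a logical path being switched among redundant physical connections can be sketched in a few lines of Python. This is only an illustration: the four-element fabric, the names A, B, S1, and S2, and the find_path helper are hypothetical and do not correspond to any particular fabric standard, and a real fabric performs this rerouting in hardware or in its management software.

```python
from collections import deque

def find_path(links, src, dst, failed=frozenset()):
    """Breadth-first search for a path from src to dst that avoids failed elements."""
    if src in failed or dst in failed:
        return None
    previous = {src: None}          # visited set and back-pointers in one dict
    queue = deque([src])
    while queue:
        current = queue.popleft()
        if current == dst:
            path = []
            while current is not None:   # walk the back-pointers to rebuild the path
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbor in sorted(links.get(current, ())):
            if neighbor not in previous and neighbor not in failed:
                previous[neighbor] = current
                queue.append(neighbor)
    return None                     # no surviving physical path

# Hypothetical fabric: nodes A and B are each wired to both switches,
# and the two switches are wired to each other.
FABRIC = {
    "A": {"S1", "S2"},
    "B": {"S1", "S2"},
    "S1": {"A", "B", "S2"},
    "S2": {"A", "B", "S1"},
}

primary = find_path(FABRIC, "A", "B")                 # goes through S1
backup = find_path(FABRIC, "A", "B", failed={"S1"})   # rerouted through S2
print(primary, backup)
```

When switch S1 is marked as failed, the same logical connection between A and B is re-established over S2, which is the redundancy the figure depicts.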
Switched Fabrics and Data Distribution Service
A key characteristic of switched fabrics is that they allow peer-to-peer communication between nodes without having to physically connect every node to every other node. In a fully connected network, each new node needs a physical link to every existing node, so the total number of links grows quadratically with the number of nodes and quickly becomes impractical. Because a switched-fabric network employs switching to achieve logical connectivity and reconfigurability, these systems can be designed to be highly scalable.
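The scaling argument can be made concrete with a little arithmetic. A full mesh of n nodes needs n(n-1)/2 physical links, while a switched design needs roughly one link per node. The sketch below assumes, for simplicity, a single switch with enough ports for every node; real fabrics use several switches and extra links for redundancy, so the switched figure is a lower bound rather than a design rule.

```python
def full_mesh_links(n):
    """Physical links needed to wire every one of n nodes to every other node."""
    return n * (n - 1) // 2

def single_switch_links(n):
    """Physical links needed when each of n nodes has one link to a shared switch."""
    return n

for n in (4, 16, 64, 256):
    print(f"{n:4d} nodes: full mesh {full_mesh_links(n):6d} links, "
          f"single switch {single_switch_links(n):4d} links")
```

Adding one node to a full mesh of n nodes costs n new links; adding it to the switched design costs one.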
On the software side, publish-subscribe communication systems map very naturally onto switched fabrics. Publish-subscribe systems work by using endpoint nodes that communicate with each other by sending (publishing) data and receiving (subscribing) data anonymously via topics. A topic is identified by a name and a data type. A data producer declares the intent to publish data on a topic; a data consumer registers its interest in receiving data published on a topic. The middleware acts as the glue between the producers and the consumers. It delivers the data published on a topic by a producer to the consumers subscribing to that topic.
There can be as many topics as needed, a producer can publish on multiple topics, and a consumer can subscribe to multiple topics. The middleware layer isolates the data producers from the consumers; they have no knowledge of each other (Fig. 2).
Figure 2: The Data Distribution Service (DDS) data-centric publish-subscribe architecture anonymously sets up direct data flows between DataWriters and DataReaders, resulting in scalable and fault-tolerant data distribution.
A publish-subscribe software architecture allows producers and consumers to be loosely coupled. Therefore, it's naturally scalable, and can easily adapt to the changing needs of distributed data-critical systems. The producers and consumers are peers that communicate directly with each other, so the topology of a publish-subscribe system can be closely matched to that of a switched-fabric system. Thus, a publish-subscribe middleware layer can fully exploit the potential of switched-fabric network hardware.
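The topic mechanism can be illustrated with a toy, in-process sketch. The TopicBus class and its subscribe and publish methods are hypothetical names invented for this example; they are not the DDS API, and the sketch ignores networking, discovery, and quality of service. It shows only the core idea: a topic is a name plus a data type, and producers and consumers meet at the topic without knowing about each other.

```python
from collections import defaultdict

class TopicBus:
    """Toy stand-in for the middleware layer (hypothetical, not a DDS API)."""

    def __init__(self):
        self._types = {}                       # topic name -> data type
        self._subscribers = defaultdict(list)  # topic name -> consumer callbacks

    def _check_topic(self, name, data_type):
        # The first use of a topic name fixes its data type.
        registered = self._types.setdefault(name, data_type)
        if registered is not data_type:
            raise TypeError(
                f"topic {name!r} carries {registered.__name__}, "
                f"not {data_type.__name__}"
            )

    def subscribe(self, name, data_type, callback):
        """Register a consumer's interest in a topic."""
        self._check_topic(name, data_type)
        self._subscribers[name].append(callback)

    def publish(self, name, sample):
        """Deliver a sample to every consumer subscribed to the topic."""
        self._check_topic(name, type(sample))
        for callback in self._subscribers[name]:
            callback(sample)

bus = TopicBus()
display, log = [], []                           # two independent consumers
bus.subscribe("altitude", float, display.append)
bus.subscribe("altitude", float, log.append)
bus.publish("altitude", 10500.0)                # producer never names a consumer
print(display, log)
```

The producer calls publish with only a topic name and a sample, so consumers can be added or removed without any change on the producer side.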
The Data Distribution Service (DDS) standard (see "The DDS Standard," ED Online 13516) specifies a data-centric publish-subscribe middleware layer, developed with the needs of distributed data-critical applications in mind. A well-designed DDS middleware implementation can be good at real-time data distribution, be easily field-upgradable, and be transport agnostic. It can be better at real-time data distribution because, for periodic data exchange, publish-subscribe is more efficient than traditional request/reply-based architectures in both latency and bandwidth. Furthermore, it can be easier to upgrade in the field because publishers and subscribers don't care how many counterparts they have or what kind they are. And, finally, since the middleware is layered on top of the physical means of getting the data from one place to another, it needn't depend on the network transport or topology used.
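A back-of-the-envelope message count shows where the bandwidth advantage for periodic data comes from. The model below is a deliberate simplification and rests on stated assumptions: every consumer wants every sample, request/reply costs one request plus one reply per sample per consumer, and publish-subscribe costs one subscription message per consumer up front plus one unicast data message per sample per consumer. Real middleware adds discovery and reliability traffic, and may use multicast, so the numbers are illustrative only.

```python
def request_reply_messages(samples, consumers):
    """Each consumer polls for every sample: one request and one reply."""
    return 2 * samples * consumers

def publish_subscribe_messages(samples, consumers):
    """Each consumer subscribes once, then receives one message per sample."""
    return consumers + samples * consumers

samples = 100 * 60   # assumed 100 Hz source observed for 60 seconds
consumers = 10       # assumed number of consumers

print("request/reply:    ", request_reply_messages(samples, consumers))
print("publish-subscribe:", publish_subscribe_messages(samples, consumers))
```

Under these assumptions, publish-subscribe sends about half as many messages. Each sample also makes a single one-way trip rather than waiting for a request to arrive first, which is the source of the latency advantage.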