[Design View / Design Solution]
Break Through The TCP/IP Bottleneck With iWARP
Using a high-bandwidth, low-latency network solution in the same network infrastructure provides good insurance for next-generation networks.
The online economy, particularly e-business, entertainment, and collaboration, continues to dramatically and rapidly increase the amount of Internet traffic to and from enterprise servers. Most of this data is going through the transmission control protocol/Internet protocol (TCP/IP) stack and Ethernet controllers.
As a result, Ethernet controllers are experiencing heavy network traffic, which requires more system resources to process network packets. The CPU load increases linearly as a function of network packets processed, diminishing the CPU’s availability for other applications.
Because TCP/IP processing consumes a significant share of the host CPU's cycles, a heavy TCP/IP load may leave few system resources available for other applications. Techniques for reducing the demand on the CPU and easing this system bottleneck, though, are available.
iWARP SOLUTIONS
Although researchers have proposed many mechanisms and theories for parallel systems, only a few have resulted in working computing platforms. One of the latest to enter the scene is the Internet Wide Area RDMA Protocol, or iWARP, a joint project by Carnegie Mellon University and Intel Corp.
This experimental parallel system integrates a very long instruction word (VLIW) processor and a sophisticated fine-grained communication system on a single chip. It's basically a suite of wire protocols comprising the Remote Direct Memory Access Protocol (RDMAP) and the Direct Data Placement (DDP) protocol. The iWARP protocol suite may be layered above Marker PDU Aligned (MPA) framing and TCP, or over Stream Control Transmission Protocol (SCTP) or other transport protocols (Fig. 1).
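The layering just described can be pictured with a short Python sketch. It is an illustration only: the layer names come from the text, while the bracketed "headers" are placeholders rather than real wire formats, and the `encapsulate` helper is a hypothetical name introduced here.

```python
# Illustrative only: layer order follows the text; the bracketed "headers"
# are placeholders, not the real RDMAP/DDP/MPA/TCP wire formats.
IWARP_OVER_TCP = ["RDMAP", "DDP", "MPA", "TCP"]   # top of stack first
IWARP_OVER_SCTP = ["RDMAP", "DDP", "SCTP"]        # MPA framing is not used here


def encapsulate(payload: str, stack: list[str]) -> str:
    """Wrap the payload in one placeholder header per layer.

    The stack is listed top layer first, so the last layer in the list
    ends up as the outermost header, as it would on the wire.
    """
    framed = payload
    for layer in stack:
        framed = f"[{layer}]{framed}"
    return framed


if __name__ == "__main__":
    print(encapsulate("data", IWARP_OVER_TCP))
    print(encapsulate("data", IWARP_OVER_SCTP))
```

Reading the result from left to right gives the order in which a receiver would peel the layers off, with the transport outermost and RDMAP closest to the application data.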
The RDMA Consortium released the iWARP extensions to TCP/IP in October 2002, implementing the standard for zero-copy transmission over legacy TCP/IP. Together, these extensions eliminate the three major sources of networking overhead (transport (TCP/IP) processing, intermediate buffer copies, and application context switches) that collectively account for nearly 100% of the CPU overhead related to networking (see the table).
A kernel implementation of the TCP stack has several bottlenecks. Therefore, a few vendors now implement TCP in hardware. Because simple data losses are rare in tightly coupled network environments, software may perform the error-correction mechanisms of TCP. Meanwhile, logic embedded on the network interface card (NIC) strictly handles the more frequently performed communications. This additional hardware is known as the TCP offload engine (TOE).
The iWARP extensions use advanced techniques to reduce CPU overhead, memory bandwidth utilization, and latency. This is accomplished through a combination of offloading TCP/IP processing from the CPU, eliminating unnecessary buffering, and dramatically reducing expensive operating-system (OS) calls and context switches. Thus, data management and network protocol processing are offloaded to an accelerated Ethernet adapter instead of being handled by the kernel's TCP/IP stack.
iWARP COMPONENTS
Offloading TCP/IP (transport) processing: In conventional Ethernet, the TCP/IP stack is a software implementation, putting a tremendous load on the host server’s CPU. Transport processing includes tasks such as updating TCP context, implementing required TCP timers, segmenting and reassembling the payload, buffer management, resource-intensive buffer copies, and interrupt processing.
The CPU load increases linearly as a function of the network packets processed. With the tenfold increase in line rate from 1-Gigabit Ethernet to 10-Gigabit Ethernet, packet processing and the CPU overhead related to transport processing increase up to tenfold as well. In the end, network processing will cripple the CPU well before the link reaches Ethernet's maximum throughput.
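A quick back-of-the-envelope calculation shows the scale of this linear relationship. The Python sketch below assumes full-size 1,500-byte Ethernet payloads and a purely hypothetical fixed CPU cost per packet; neither figure comes from the article, and real per-packet costs vary widely with hardware and workload.

```python
# Assumptions (not from the article): every frame carries a full 1500-byte
# payload, and each packet costs a fixed, hypothetical number of CPU cycles.
PAYLOAD_BYTES = 1500
# Ethernet header (14) + frame check sequence (4) + preamble (8) + interframe gap (12)
PER_FRAME_OVERHEAD_BYTES = 14 + 4 + 8 + 12
WIRE_BITS_PER_FRAME = (PAYLOAD_BYTES + PER_FRAME_OVERHEAD_BYTES) * 8
CYCLES_PER_PACKET = 10_000  # hypothetical placeholder value


def packets_per_second(link_bps: float) -> float:
    """Frames per second a link can carry when saturated with full-size frames."""
    return link_bps / WIRE_BITS_PER_FRAME


def cpu_cycles_per_second(link_bps: float) -> float:
    """CPU cycles per second under the linear per-packet cost model."""
    return packets_per_second(link_bps) * CYCLES_PER_PACKET


if __name__ == "__main__":
    for name, rate in (("1 GbE", 1e9), ("10 GbE", 10e9)):
        pps = packets_per_second(rate)
        ghz = cpu_cycles_per_second(rate) / 1e9
        print(f"{name}: {pps:,.0f} packets/s, about {ghz:.2f} GHz of CPU at the assumed cost")
```

Whatever per-packet cost is plugged in, the ratio between the two links stays at ten, which is the point of the linear model: the CPU budget scales with the line rate unless the work moves elsewhere.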
The iWARP extensions enable Ethernet to offload transport processing from the CPU to specialized hardware, eliminating 40% of the CPU overhead attributed to networking (Fig. 2). The transport offload can be implemented by a standalone TOE or be embedded in an accelerated Ethernet adapter that supports other iWARP accelerations.
Moving transport processing to an adapter also eliminates a second source of overhead: intermediate TCP/IP protocol stack buffer copies. Offloading these copies from system memory to the adapter memory saves system memory bandwidth and lowers latency.
RDMA techniques eliminate buffer copy: Repurposed for Internet protocols by the RDMA Consortium, Remote DMA (RDMA) and Direct Data Placement (DDP) techniques were formalized as part of the iWARP extensions. RDMA embeds information into each packet that describes the application memory buffer with which the packet is associated. This enables the payload to be placed directly in the destination application’s buffer, even when packets arrive out of order.
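The placement idea can be sketched in a few lines of Python. This is a simplified model, not the DDP wire format: the `Segment` fields, the `Receiver` class, and its method names are hypothetical, chosen only to show how a buffer identifier and an offset carried in each segment let the receiver write the payload straight into the application's buffer in whatever order segments arrive.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Simplified stand-in for a tagged segment (field names are hypothetical)."""
    buffer_tag: int   # identifies a registered application buffer
    offset: int       # where in that buffer the payload belongs
    payload: bytes


class Receiver:
    """Places each payload directly into the registered destination buffer."""

    def __init__(self) -> None:
        self.buffers: dict[int, bytearray] = {}

    def register(self, tag: int, size: int) -> bytearray:
        """Register an application buffer so incoming segments can target it."""
        self.buffers[tag] = bytearray(size)
        return self.buffers[tag]

    def place(self, seg: Segment) -> None:
        """Write the payload at its final location; no reassembly queue is needed."""
        buf = self.buffers[seg.buffer_tag]
        end = seg.offset + len(seg.payload)
        if seg.offset < 0 or end > len(buf):
            raise ValueError("segment falls outside the registered buffer")
        buf[seg.offset:end] = seg.payload


if __name__ == "__main__":
    rx = Receiver()
    app_buffer = rx.register(tag=7, size=11)
    # Segments arrive out of order, yet each lands in its final position.
    rx.place(Segment(buffer_tag=7, offset=6, payload=b"world"))
    rx.place(Segment(buffer_tag=7, offset=0, payload=b"hello "))
    print(bytes(app_buffer))
```

Because every segment is self-describing, the receiver never has to hold data in an intermediate buffer while waiting for earlier segments, which is the property the text attributes to RDMA and DDP.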
Data can now move from one server to another without the unnecessary buffer copies traditionally required to “gather” a complete buffer (Fig. 3b). This is sometimes called the “zero copy” model. Together, RDMA and DDP enable an accelerated Ethernet adapter to read from and write to application memory directly, eliminating buffer copies to intermediate layers. RDMA and DDP eliminate 20% of CPU overhead related to networking and free the memory bandwidth attributed to intermediate application buffer copies.
Avoiding application context switching/OS bypass: The third and somewhat less familiar source of overhead, context switching, contributes significantly to overhead and latency in applications. Traditionally, when an application issues commands to the I/O adapter, the commands are transmitted through most layers of the application/OS stack.