System vendors face a number of architecture options as they research next-generation packet-processing technologies to meet future scalability and integration challenges. Two architectures are common: the generic multicore architecture and the special-purpose dataflow architecture.
Each architecture has its strengths. And, as is so often the case, each system vendor’s design decision boils down to the platform’s intended tasks. Essentially, it’s all about mapping architecture to application.
PACKET PROCESSING BACKGROUND
Packet processing is data-intensive and calls for optimized hardware. In the early days, prior to broadband Internet, general-purpose processors handled both control-session processing and packet processing of the user traffic.
Sharing central processing unit (CPU) resources between the data and control planes proved difficult to scale, however, as bandwidth requirements grew. For switches and routers, data-plane packet processing was offloaded to custom fixed-function ASICs or programmable network processor units (NPUs). The general-purpose CPU was then freed up and dedicated to control-plane tasks.
Several NPU players tried to optimize general-purpose processors for layer 2-4 packet processing, offering a multicore architecture with integrated network hardware (e.g., physical layer, media access controller, and table memory) as well as hardware engines for specific tasks (e.g., hashing). At the turn of the 21st century, companies like MMC, C-Port, and the Intel IXP division developed these types of devices.
While there were differences among them, they all shared the same basic architecture. By stripping away complexity, the processor cores could be simplified enough that tens of them could be integrated into one device, meeting a higher demand for parallelism.
With very few exceptions, these NPU ventures failed commercially. Ultimately, they couldn’t efficiently meet the processing and memory-access requirements of networking applications above 10 Gbits/s.
Now, as we approach 2010, a new generation of multicore players is addressing the network-processing market. While CMOS technology, memory bandwidth, and clock speeds have evolved, these devices still rely on the same basic architecture. Can the new players expect greater success?
That depends on the type of application they address. Today’s networking nodes not only process packets at layers 2-4; they also must process traffic at higher layers to support services and add security. Let’s explore the differences and why certain architectures are better suited than others to a given application.
WIRESPEED PACKET PROCESSING
Layer 2-4 packet processing differs from other network applications (Table 1). First, wirespeed processing for all packet sizes is a key objective. Modern routers and switches are designed with a broad set of network features that service providers expect to be available in parallel without performance degradation.
Second, the data plane views packets as independent entities, allowing for a high degree of parallel processing. For a 100-Gbit/s application, the NPU needs to process 150 million packets every second to guarantee wirespeed performance; the arithmetic is spelled out in the sketch below. A 10-µs delay through the processor corresponds to the concurrent processing of 1,500 packets.
Third, data-plane programs require large I/O memory access bandwidth for forwarding table lookups, updates of statistics, and other processes. In high-speed platforms, packet inter-arrival times are very short, putting hard requirements on memory latencies. For small packets, the memory bandwidth to perform these tasks is several times the link bandwidth.
Finally, today’s networks consume a significant amount of power. For both operational-cost and environmental reasons, service providers are carefully seeking the highest performance per watt. Given the special characteristics of packet processing, the most efficient architecture is the one that delivers the highest performance per watt at wirespeed.
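For readers who want to verify these figures, the arithmetic is straightforward. The C sketch below (illustrative only; it assumes the standard 20 bytes of Ethernet preamble and inter-frame gap per frame) reproduces both the 150-Mpps rate and the 1,500 in-flight packets.

```c
/* Back-of-the-envelope check of the wirespeed figures above.
 * Assumes minimum 64-byte Ethernet frames plus the 20 bytes of
 * preamble and inter-frame gap each frame occupies on the wire. */
#include <stdio.h>

int main(void)
{
    const double link_bps   = 100e9;         /* 100 Gbit/s          */
    const double frame_bits = (64 + 20) * 8; /* 672 bits per frame  */

    double pps = link_bps / frame_bits;      /* ~148.8 Mpackets/s   */
    printf("wirespeed rate: %.1f Mpackets/s\n", pps / 1e6);

    /* With ~150 Mpackets/s arriving, a 10-us transit delay implies
     * this many packets inside the processor at once (~1500). */
    double in_flight = 150e6 * 10e-6;
    printf("packets in flight: %.0f\n", in_flight);
    return 0;
}
```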
SERVICE AND SECURITY PROCESSING ATTRIBUTES
Adjacent to packet processing are the service- and security-processing markets. These applications have different characteristics from layer 2-4 packet processing, so other hardware design optimizations can be made.
These applications terminate and process host-to-host protocols in a client-server manner or operate on reassembled payload data in intermediate network nodes (e.g., firewalls, load balancers, and intrusion detection and prevention systems). These products must be able to operate across packet boundaries, as they typically carry out a larger number of operations on a broader set of data, resulting in a lower degree of data parallelism. On the other hand, these applications are less classification-intensive, requiring less I/O memory bandwidth relative to the data processed.
COMPARING ARCHITECTURES
An NPU promises the performance of a custom ASIC with the programmability of a general-purpose processor. Comparing processor performance is difficult, however, as vendors often quote theoretical maximums with little real-world relevance. Moreover, performance depends on the ability to use the available processing capability efficiently, as well as on how well I/O memory can be utilized relative to processing capacity.
The comparison must therefore start at the design level. Let’s begin with a generic multicore NPU architecture. Stemming from general-purpose processors, a multicore NPU seeks to exploit a higher degree of parallelism by increasing the number of processor cores. This is achieved by decreasing complexity and removing features that packet processing doesn’t need (e.g., floating-point instructions) from today’s general-purpose processor architectures.
A multicore NPU organizes its processor cores in a characteristic way: cores are grouped into parallel pools or pipelined together in a serial manner (Fig. 1). The organization can be tightly controlled by the architecture, as designed by the NPU vendor, to optimize performance.
If loosely defined, the organization allows programmers to divide tasks among the cores more freely, providing greater flexibility at the cost of performance control. In many cases, multicore NPUs end up as hybrids of pipelines and pools.
The organization of processor cores has a fundamental impact on the programming model. Parallel pools come with an associated multi-threaded programming model, where every processor core may run one or more threads. Essentially, the program takes a packet and executes a series of operations on it.
Once it completes a packet, the program is ready to take on the next in line. The programmer utilizes the processing resources by scheduling packets to the different pools. Synchronizing across threads is another key systemization task for the programmer.
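A minimal sketch of this run-to-completion loop appears below. The helpers (get_next_packet, process, transmit) are hypothetical placeholders rather than any vendor’s API; the point is simply that every core in a pool runs the same loop.

```c
/* Minimal sketch of the run-to-completion pool model. The helper
 * functions are hypothetical placeholders, not a vendor API. */
typedef struct packet packet_t;

extern packet_t *get_next_packet(void); /* hypothetical: dequeue     */
extern void      process(packet_t *p);  /* hypothetical: L2-4 work   */
extern void      transmit(packet_t *p); /* hypothetical: egress path */

/* Every core in a pool runs the same loop: take a packet, run it to
 * completion, take the next. Parallelism comes from running this
 * thread body once per core. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        packet_t *p = get_next_packet();
        process(p);
        transmit(p);
    }
}
```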
The pipelined model takes the data-plane application and divides it into separate processing tasks (e.g., classification, modification, tunneling, and statistics updates). Each task is then mapped onto separate processor cores, and execution is either enforced by the architecture or left to the programmer. One traditional challenge is dividing the tasks efficiently among the cores, as throughput is limited by the slowest stage.
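To illustrate why the slowest stage caps pipeline throughput, here is a small sketch with purely hypothetical per-stage cycle budgets:

```c
/* Sketch: pipeline throughput is capped by the slowest stage.
 * The per-stage cycle budgets are purely illustrative. */
#include <stdio.h>

int main(void)
{
    /* cycles per packet in each stage: classify, modify, tunnel,
     * statistics (hypothetical numbers) */
    const int stage_cycles[4] = { 40, 110, 60, 30 };

    int worst = 0;
    for (int i = 0; i < 4; i++)
        if (stage_cycles[i] > worst)
            worst = stage_cycles[i];

    /* One packet completes every 'worst' cycles, so the 110-cycle
     * stage sets the rate no matter how fast the others run. */
    const double core_hz = 700e6; /* illustrative clock */
    printf("pipeline throughput: %.1f Mpackets/s\n",
           core_hz / worst / 1e6);
    return 0;
}
```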
Packets in a generic multicore architecture are typically stored in a shared memory area (Fig. 2). In this case, the programmer has to divide the classification and packet modification tasks between the pools and pipelines of processing resources.
SHARED DATA COMPLEXITY
In parallel packet processing, multiple threads may need to access and update shared data such as statistics and ARP entries. The different threads need to synchronize to enforce mutual exclusion and implement common sharing patterns. It is well known, however, that synchronization is difficult and carries a performance cost.
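As an illustration of these two sharing patterns, the sketch below uses standard C11 atomics and a pthreads mutex: a statistics counter that a lock-free atomic add can handle, and an ARP entry whose multi-field update needs a lock. The structures are simplified examples, not production code.

```c
/* Sketch of the two sharing patterns named above, using standard
 * C11 atomics and a pthreads mutex. Structures are simplified. */
#include <stdatomic.h>
#include <pthread.h>

/* Statistics counter: a lock-free atomic add suffices, though each
 * update still migrates the cache line between cores. */
static _Atomic unsigned long rx_bytes;

void count_packet(unsigned long len)
{
    atomic_fetch_add_explicit(&rx_bytes, len, memory_order_relaxed);
}

/* ARP entry: several fields must change together, so readers and
 * writers need mutual exclusion for the update to appear atomic. */
struct arp_entry {
    pthread_mutex_t lock;
    unsigned char   mac[6];
    long            expires;
};

void arp_update(struct arp_entry *e, const unsigned char mac[6],
                long expires)
{
    pthread_mutex_lock(&e->lock);
    for (int i = 0; i < 6; i++)
        e->mac[i] = mac[i];
    e->expires = expires;
    pthread_mutex_unlock(&e->lock);
}
```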
To increase performance, many multicore processors implement hardware caches. While this greatly shortens average memory access delays, the architecture becomes less predictable.
Cache-coherence protocols ensure the integrity of data in multicore systems with a cache hierarchy. While coherence is transparent to programmers, they still need to understand how caches and coherence protocols operate in order to tune performance. The memory consistency model, on the other hand, is exposed to programmers, who must understand it to write correct programs.
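The following sketch, using C11 atomics, illustrates the kind of consistency-model reasoning involved: publishing a pointer to a newly written table entry requires release/acquire ordering on weakly ordered multicore hardware.

```c
/* Sketch: the consistency model matters for correctness. One thread
 * fills in a route and publishes a pointer to it; readers on other
 * cores must not see the pointer before the data it points to. */
#include <stdatomic.h>

struct route { unsigned dest, next_hop; };

static _Atomic(struct route *) current_route;

void publish(struct route *r)         /* r's fields already written */
{
    /* release: all stores to *r become visible before the pointer */
    atomic_store_explicit(&current_route, r, memory_order_release);
}

struct route *lookup(void)
{
    /* acquire: pairs with the release store above; on a weakly
     * ordered machine a plain load could observe stale data */
    return atomic_load_explicit(&current_route, memory_order_acquire);
}
```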
MAINTAINING PACKET ORDER
Another challenge in parallel packet processing is maintaining packet order. All nodes are expected to keep packet order for related packets, as upper-layer transport protocols depend on it to function correctly. It is typically the programmer’s responsibility to understand what packet types require maintained packet order and how this is most efficiently achieved.
To ease this complexity, the NPU vendor often provides hardware support and software libraries. Adding more packet buffers can help ensure packet order, though always at the expense of increased delay.
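One common ordering technique (a sketch of the general idea, not any particular vendor’s implementation) is flow-affinity scheduling: hash the 5-tuple so that all packets of a flow land in the same queue, which preserves their relative order without per-packet reordering logic. The hash function here is purely illustrative.

```c
/* Sketch of flow-affinity scheduling: hash the 5-tuple so every
 * packet of a flow lands in the same queue, preserving per-flow
 * order without reordering logic. The hash itself is illustrative. */
#include <stdint.h>

struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

static uint32_t flow_hash(const struct five_tuple *t)
{
    uint32_t h = t->src_ip ^ t->dst_ip ^ t->proto;
    h ^= ((uint32_t)t->src_port << 16) | t->dst_port;
    return h * 2654435761u;          /* multiplicative mixing step */
}

/* Packets of the same flow always map to the same worker queue. */
unsigned pick_queue(const struct five_tuple *t, unsigned n_queues)
{
    return flow_hash(t) % n_queues;
}
```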
THE NEED TO REDUCE COMPLEXITY
Taming a multicore-based NPU has proven challenging. Larry Huston at Intel concluded a paper for the 10th International Symposium on High-Performance Computer Architecture with the following statement:
“The ideal scenario would have a programmer write an application as a single piece of software and the tools would automatically partition and map the application to the set of parallel resources. This may be a difficult goal, but any steps in that direction will improve the life of a developer.”
While this quotation is from 2004, it is as valid and meaningful today as it was six years ago. The dataflow architecture delivers exactly this.
THE DETERMINISTIC DATAFLOW ARCHITECTURE
The dataflow architecture (Fig. 3) takes a unique approach and features a single pipeline of processor cores. The architecture has been designed to be fully deterministic and ultra-efficient. It includes a packet instruction set computer (PISC) and an engine access point (EAP), in addition to the execution context.
The PISC is a processor core specifically designed for packet processing. The pipeline can include several hundred (400+) PISCs. The EAP is a specialized I/O unit for classification tasks. EAPs unify access to tables stored in embedded or external memory (TCAM, SRAM, DRAM) and include resource engines for metering, counting, hashing, formatting, traffic management, and table search.
The execution context is packet-specific data available to the programmer. It includes the first 256 bytes of the packet, general-purpose registers, device registers, and condition flags. An execution context is uniquely associated with every packet and follows the packet through the pipeline.
Packets travel through the pipeline as if advancing through a fixed-length first-in-first-out (FIFO) device. In each clock cycle, all packets in the pipeline shift one stage ahead to execute in the next processor core or EAP.
Each instruction executes to completion within a single clock cycle and can perform up to five operations in parallel in a very long instruction word (VLIW) fashion. The packet then continues to the next PISC or to an EAP.
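A toy model of this lock-step timing behavior is sketched below. It illustrates only the FIFO-like shifting, not the real PISC instruction set; execute_stage is a hypothetical placeholder, and the real device admits a new packet every second cycle rather than every cycle.

```c
/* Toy model of the lock-step pipeline timing, not of the real PISC
 * instruction set. execute_stage is a hypothetical placeholder. */
#include <string.h>

#define STAGES 448

struct context {                 /* simplified execution context */
    unsigned char hdr[256];      /* leading bytes of the packet  */
    unsigned      regs[8];       /* general-purpose registers    */
    int           valid;
};

static struct context stage[STAGES];

extern void execute_stage(int s, struct context *c); /* hypothetical */

/* One clock tick: every context advances exactly one stage, then
 * each occupied stage runs its single fixed instruction slot. */
void tick(const struct context *incoming)
{
    memmove(&stage[1], &stage[0], (STAGES - 1) * sizeof(stage[0]));
    stage[0] = *incoming;

    for (int s = 0; s < STAGES; s++)
        if (stage[s].valid)
            execute_stage(s, &stage[s]);
}
```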
The data-plane program is compiled into instruction memory located in the processor cores, eliminating the need to fetch instructions from a shared memory during program execution. This yields significant gains in both performance and power dissipation.
The programming model mirrors the well-known sequential uniprocessor model: programmers write sequential modules and avoid the hassles of parallel programming (e.g., memory consistency, coherence, and synchronization). When the software is compiled, the code is automatically mapped to the single pipeline of processor cores. One VLIW instruction occupies one processor core in the pipeline.
A significant benefit of the architecture and programming model is that it enforces wirespeed operation. Every type of packet has a guaranteed number of operations and classification resources.
REDUCED COMPLEXITY, GREATER PERFORMANCE
The multicore architecture cannot guarantee a certain level of performance, whereas the dataflow architecture is fully deterministic (Table 2). By reducing complexity and fully optimizing the architecture for layer 2-4 packet processing, the dataflow design scales to several hundred processor cores, supporting 100 Gbits/s and 150 million packets per second with a strong wirespeed guarantee.
While the raw processor performance is impressive, the programmer’s ability to make the most of the processor is another key component of the dataflow architecture. A common set of memory operations, independent of memory type (on- or off-chip) and spanning everything from atomic operations to table memories, combined with identical processor cores across the complete pipeline, allows for efficient coding and code reuse.
In multicore architectures, processing capacity needs to be over-allocated at each stage. In practice this is a significant challenge, as programmers always tend to be short of processing resources. The data-plane programmer therefore enters an endless cycle of testing and performance optimization to recover lost clock cycles.
Comparing the multicore and dataflow architectures for packet processing, the differences in efficiency are obvious. Let’s look at two of today’s state-of-the-art processors and compare the metrics for layer 2-4 packet processing.
The first processor, the HX 330 NPU from Xelerated, is built upon the dataflow architecture. It runs at 300 MHz and features 448 processor cores, each capable of five simultaneous operations. Every second clock cycle, a new packet can enter the pipeline.
This translates to 150 Mpackets/s of processing, which is required to guarantee 100-Gbit/s wirespeed operation supporting even the smallest Ethernet packet sizes of 64 bytes. Each packet is guaranteed 5 × 448 = 2240 operations.
The latter number is theoretical, of course. No real data-plane application will leverage the full potential. Well-optimized data-plane code utilizes approximately 50% of the resources. This allows for great service density.
The second processor is one of the highest-performing multicore processors on the market. It features 64 processor cores and runs at 700 MHz. Using this chip for a 100-Gbit/s packet processing application would require a new packet to be scheduled every fourth clock cycle. On average, every packet would in theory get 256 clock cycles of processing capacity.
Synchronization challenges and performance hits for managing shared data will drive down performance to 50% utilization at best. This translates to 128 operations per packet, or 13% of the processing resources of the HX NPU. In addition, these operations are without performance guarantees.
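The per-packet budgets quoted above can be reproduced with a few lines of arithmetic (a sketch; it assumes, as the text implies, one operation per clock cycle on the multicore’s scalar cores):

```c
/* Reproducing the per-packet budgets quoted in the text. Assumes
 * one operation per cycle on the multicore's scalar cores. */
#include <stdio.h>

int main(void)
{
    /* Dataflow NPU: 448 cores x 5 VLIW ops; a packet enters every
     * 2nd cycle at 300 MHz -> 150 Mpackets/s, 2240 ops/packet. */
    double df_pps = 300e6 / 2;
    int    df_ops = 448 * 5;

    /* Generic multicore: 64 cores at 700 MHz, a packet scheduled
     * every 4th cycle -> 175 Mpackets/s; aggregate core cycles
     * divided by packet rate gives the per-packet budget. */
    double mc_pps    = 700e6 / 4;
    double mc_cycles = 64 * 700e6 / mc_pps;        /* 256 cycles */

    printf("dataflow:  %.0f Mpps, %d ops/packet\n",
           df_pps / 1e6, df_ops);
    printf("multicore: %.0f cycles/packet, ~%.0f ops at 50%% use\n",
           mc_cycles, mc_cycles * 0.5);
    return 0;
}
```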
Adding power dissipation to the equation makes an even greater difference. The NPU built upon a dataflow architecture brings 15 to 20 times the performance per watt at wirespeed relative to the multicore processor.
DIFFERENT NEEDS, DIFFERENT ARCHITECTURES
For layer 2-4 packet processing, the dataflow architecture offers significant advantages. Other comparisons, however, yield different results. Thus, approaches that at first may appear competitive are, in fact, complementary.
For service-oriented applications, multicore architectures scale efficiently, and the two architectures complement each other well. In a split design, system vendors can use a dataflow-based processor for layer 2-4 processing and multicore processors for content recognition, encryption, and service execution.
CONCLUSION
Architecture debates tend to be cyclical. Ten years ago, the industry had some 30 NPU players in the >10-Gbit/s segment. Most of these companies based their development on the multicore architecture.
Today, we know that this architecture cannot compete with the special-purpose dataflow architecture for packet processing at layer 2-4. The dataflow architecture delivers 15 to 20 times the performance per watt with strict wirespeed guarantees.
When comparing architectures for network processing, don’t be misled by the supported interface bandwidth, which is meaningless if service density isn’t considered. The number of services supported simultaneously at wirespeed is ultimately what counts when service providers evaluate network platforms. System vendors need to look closely at service density early in the R&D process.
Although it comes up short on service density for layer 2-4 processing, the new generation of multicore processors can still address a large and growing network market segment. There is a strong push for network-based service and security processing, opening up new opportunities to combine modern multicore and dataflow architectures (Table 3).