The data-plane program is compiled to instruction memory located in the processor cores, eliminating the need to fetch instructions from a shared memory during program execution. This significantly improves performance and reduces power dissipation.
The programming model mirrors the well-known sequential uniprocessor model: programmers write sequential modules and avoid the hassles of parallel programming (memory consistency, coherence, and synchronization). When the software is compiled, the code is automatically mapped to the single pipeline of processor cores. One VLIW instruction occupies one processor core in the pipeline.
A significant benefit of the architecture and programming model is that it enforces wirespeed operation. Every type of packet has a guaranteed number of operations and classification resources.
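The mapping described above can be pictured with a minimal Python sketch. It is an illustration only, not the vendor's toolchain or instruction set: the `Packet` fields, the two operations, and the helper names are hypothetical, and the width of five operations per core is borrowed from the figure the article quotes for the HX 330. The point it shows is that the per-packet operation budget depends only on the pipeline length and the instruction width, never on traffic.

```python
from dataclasses import dataclass

OPS_PER_CORE = 5  # VLIW width; illustrative, matches the HX 330 figure quoted in the article


@dataclass
class Packet:
    ttl: int
    dst_port: int = -1


def decrement_ttl(pkt: Packet) -> None:
    pkt.ttl -= 1


def set_egress_port(pkt: Packet) -> None:
    # Hypothetical stand-in for a table lookup.
    pkt.dst_port = 7 if pkt.ttl > 0 else 0


# The "program": a list of VLIW instructions, one per processor core.
# Each instruction is a list of at most OPS_PER_CORE operations.
PROGRAM = [
    [decrement_ttl],
    [set_egress_port],
]


def check_program(program) -> int:
    """Reject instructions wider than one core; return the per-packet op budget."""
    for instruction in program:
        if len(instruction) > OPS_PER_CORE:
            raise ValueError("instruction exceeds the VLIW width of one core")
    return len(program) * OPS_PER_CORE


def run_pipeline(program, pkt: Packet) -> Packet:
    """Pass a packet through every core exactly once, in program order.

    Real VLIW hardware issues the operations of one instruction in parallel;
    this model runs them one after another for simplicity.
    """
    for instruction in program:
        for op in instruction:
            op(pkt)
    return pkt
```

Calling `check_program(PROGRAM)` before `run_pipeline(PROGRAM, Packet(ttl=64))` mirrors the compile-then-run split: a program that does not fit the pipeline is rejected up front instead of slowing down at run time.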
REDUCED COMPLEXITY, GREATER PERFORMANCE
The multicore architecture cannot guarantee a certain level of performance, whereas the dataflow architecture is fully deterministic (Table 2). By reducing the complexity and fully optimizing the architecture for layer 2-4 packet processing, the dataflow architecture’s design scales to several hundred processor cores, supporting 100 Gbits/s and 150 million packets per second with a strong wirespeed guarantee.
While the raw processor performance is impressive, the programmer’s ability to make the most of the processor is another key component of the dataflow architecture. A common set of memory operations, from atomic operations to table memories and independent of memory type (on- or off-chip), together with common processor cores across the complete pipeline, allows for efficient coding and code reuse.
In multicore architectures, processing capacity needs to be over-allocated at each stage, which in practice is a significant challenge as programmers always tend to be short of processing resources. The data-plane programmer therefore goes into an endless iteration mode of testing and performance optimization to recover lost clock cycles.
A comparison of the multicore and dataflow architectures for packet processing makes the differences in efficiency obvious. Let’s look at two of today’s state-of-the-art processors and compare the metrics for layer 2-4 packet processing.
The first processor, the HX 330 NPU from Xelerated, is built upon the dataflow architecture. It runs at 300 MHz and features 448 processor cores, each capable of five simultaneous operations. Every second clock cycle, a new packet can enter the pipeline.
This translates to 150 Mpackets/s of processing, which is what it takes to guarantee 100-Gbit/s wirespeed operation even for the smallest Ethernet packets of 64 bytes. Each packet is guaranteed 5 × 448 = 2240 operations.
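The arithmetic behind these figures can be checked with a short Python sketch. The clock rate, core count, and operations per core come from the article. The 20 bytes of per-frame wire overhead (8 bytes of preamble and start-of-frame delimiter plus a 12-byte inter-frame gap) is standard Ethernet framing that the article does not spell out, so treat it as an assumption of this sketch, as are the function names.

```python
CLOCK_HZ = 300_000_000        # HX 330 clock, from the article
CYCLES_PER_PACKET = 2         # a new packet enters every second clock cycle
CORES = 448
OPS_PER_CORE = 5

LINE_RATE_BPS = 100_000_000_000
MIN_FRAME_BYTES = 64
WIRE_OVERHEAD_BYTES = 20      # assumption: preamble + SFD (8) + inter-frame gap (12)


def pipeline_packet_rate(clock_hz: int, cycles_per_packet: int) -> float:
    """Packets per second the pipeline can accept."""
    return clock_hz / cycles_per_packet


def worst_case_packet_rate(line_rate_bps: int, frame_bytes: int, overhead_bytes: int) -> float:
    """Packets per second arriving on the wire when every frame has minimum size."""
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return line_rate_bps / bits_on_wire


def ops_per_packet(cores: int, ops_per_core: int) -> int:
    """Guaranteed operation budget for every packet traversing the pipeline."""
    return cores * ops_per_core


accepted = pipeline_packet_rate(CLOCK_HZ, CYCLES_PER_PACKET)
offered = worst_case_packet_rate(LINE_RATE_BPS, MIN_FRAME_BYTES, WIRE_OVERHEAD_BYTES)
budget = ops_per_packet(CORES, OPS_PER_CORE)
```

Under these assumptions the pipeline accepts 150 Mpackets/s while a 100-Gbit/s link full of 64-byte frames offers about 148.8 Mpackets/s, so the pipeline keeps up with a small margin, and each packet gets a budget of 2240 operations.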
The latter number is theoretical, of course. No real data-plane application will leverage the full potential. Well-optimized data-plane code utilizes approximately 50% of the resources. This allows for great service density.
The second processor is one of the highest-performing multicore processors on the market. It features 64 processor cores and runs at 700 MHz. Using this chip for a 100-Gbit/s packet processing application would require a new packet to be scheduled every fourth clock cycle. On average, every packet would in theory get 256 clock cycles of processing capacity.
Synchronization challenges and performance hits for managing shared data will drive down performance to 50% utilization at best. This translates to 128 operations per packet, or 13% of the processing resources of the HX NPU. In addition, these operations are without performance guarantees.
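The same back-of-the-envelope calculation for the multicore case can be sketched in Python. All inputs are the article's figures; the function names and the two choices of baseline are assumptions of this sketch. With the rounded "every fourth clock cycle" input, 128 operations is about 6% of the HX NPU's theoretical 2240 operations and about 11% of the roughly 1120 that well-optimized code uses. One possible reading of the 13% figure is that it uses the unrounded ratio of 700 MHz to 150 Mpackets/s (about 4.67 cycles per packet) against that 1120-operation baseline.

```python
CORES = 64
CLOCK_HZ = 700_000_000
PACKET_RATE = 150_000_000     # packets per second needed for 100 Gbit/s
CYCLES_BETWEEN_PACKETS = 4    # the article's rounded figure
UTILIZATION = 0.5             # best case after synchronization overhead
HX_THEORETICAL_OPS = 5 * 448  # 2240, from the dataflow example
HX_TYPICAL_OPS = HX_THEORETICAL_OPS * 0.5  # well-optimized code, per the article


def cycles_per_packet(cores: int, cycles_between_packets: float) -> float:
    """Aggregate core cycles available to each packet across the whole chip."""
    return cores * cycles_between_packets


def usable_ops(cycles: float, utilization: float) -> float:
    """Cycles left for packet work, counting one operation per cycle."""
    return cycles * utilization


rounded_cycles = cycles_per_packet(CORES, CYCLES_BETWEEN_PACKETS)
rounded_ops = usable_ops(rounded_cycles, UTILIZATION)

# Unrounded variant: derive cycles between packets from the clock and packet rate.
exact_cycles = cycles_per_packet(CORES, CLOCK_HZ / PACKET_RATE)
exact_ops = usable_ops(exact_cycles, UTILIZATION)

share_of_theoretical = rounded_ops / HX_THEORETICAL_OPS
share_of_typical = rounded_ops / HX_TYPICAL_OPS
exact_share_of_typical = exact_ops / HX_TYPICAL_OPS
```

Whichever baseline is chosen, the multicore budget is an average without a per-packet guarantee, whereas the dataflow figure is a fixed allocation for every packet.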
Adding power dissipation to the equation makes an even greater difference. The NPU built upon a dataflow architecture brings 15 to 20 times the performance per watt at wirespeed relative to the multicore processor.
DIFFERENT NEEDS, DIFFERENT ARCHITECTURES
For layer 2-4 packet processing, the dataflow architecture offers significant advantages. Other comparisons, however, yield different results. Thus, approaches that at first may appear competitive are, in fact, complementary.
For service-oriented applications, multicore architectures scale efficiently, and the two architectures work well together. In a split architecture, system vendors can leverage a dataflow-based processor for layer 2-4 processing and run multicore processors for content recognition, encryption, and service execution.
CONCLUSION
Architecture debates tend to be cyclical. Ten years ago, the industry had some 30 NPU players in the >10-Gbit/s segment. Most of these companies based their development on the multicore architecture.
Today, we know that this architecture cannot compete with the special-purpose dataflow architecture for packet processing at layer 2-4. The dataflow architecture delivers 15 to 20 times the performance per watt with strict wirespeed guarantees.
When comparing architectures for network processing, don’t be misled by the supported interface bandwidth, which is meaningless if service density isn’t considered. The number of services simultaneously supported at wirespeed operation is ultimately what counts when service providers evaluate network platforms. System vendors need to look closely at service density early in the R&D process.
Although it comes up short on service density for layer 2-4 processing, the new generation of multicore processors can still address a large and growing network market segment. There is a strong push for network-based service and security processing, opening up new opportunities to combine modern multicore processor and dataflow architectures (Table 3).