3D FPGA Takes On 100G Networking

Tabula's original ABAX introduced its 3D SpaceTime architecture to the FPGA world (see FPGAs Enter The Third Dimension). The ABAX² series is implemented using Intel's 22nm Tri-Gate technology (see Moore's Law Continues With 22nm 3D Transistors), allowing Tabula to implement a 12-level architecture, up from the eight levels of the original version.

The architecture also makes it possible to tackle 100G networking chores. Until now, this has been done only with ASSPs and network processor chips designed specifically for the task, in part because conventional FPGAs struggle with routing wide buses and supporting multiport memory. ABAX² addresses both.

A reference design suite is provided with the latest chip that includes:

12x10G-to-100G bridge
4x100G switch that utilizes only 14% of the chip resources
A second generation Ternary Search Engine (TSE)
A shared memory switch with queueing
100G-to-Interlaken bridge

Additional examples include a 600 Gbit/s packet classifier, a 100 Gbit/s 64-bit CRC generator and a 1.3 Tbit/s L2 packet parser. All this is done without custom hard logic. The L2 packet parser uses less than 0.3% of the system resources.

Tabula is initially targeting its latest chips at the networking market, but the ABAX² is a general-purpose FPGA. It suits a wide range of applications, and features such as transparent pipelining are valuable outside networking as well.

How 3D SpaceTime Works

The third dimension in the ABAX² is time. In a conventional FPGA, the routing of the connection fabric and the programming of the lookup tables (LUTs) are static. In the ABAX², virtual layers, or folds, allow the routing and LUT configurations to change on a per-clock basis (Fig. 1).

Figure 1. Each SpaceTime fold has its own fabric and LUT configuration. Data is moved between folds using transparent latches. The top fold feeds back to Fold 0.
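
As a rough illustration of the idea, the following Python sketch models a single physical two-input LUT whose configuration changes on every fast clock tick. The class name, the truth tables, and the three-fold setup are hypothetical choices made for this example; it is a toy model, not a description of Tabula's hardware.

```python
from typing import List, Sequence


class TimeMultiplexedLut:
    """Toy model: one physical 2-input LUT reused with a different
    truth table on every fold (hypothetical, for illustration only)."""

    def __init__(self, truth_tables: Sequence[Sequence[int]]):
        # One 4-entry truth table per fold, indexed by (a << 1) | b.
        self.truth_tables: List[List[int]] = [list(t) for t in truth_tables]
        self.fold = 0

    def tick(self, a: int, b: int) -> int:
        """Evaluate the current fold's configuration, then advance.
        The top fold wraps back to fold 0."""
        out = self.truth_tables[self.fold][(a << 1) | b]
        self.fold = (self.fold + 1) % len(self.truth_tables)
        return out


AND = [0, 0, 0, 1]
OR = [0, 1, 1, 1]
XOR = [0, 1, 1, 0]


def run_user_cycle(lut: TimeMultiplexedLut, a: int, b: int, c: int, d: int) -> int:
    """Compute (a AND b) XOR (c OR d) with one physical LUT over three folds.
    The local variables stand in for the transparent latches that carry
    values from one fold to the next."""
    latch0 = lut.tick(a, b)            # fold 0: LUT configured as AND
    latch1 = lut.tick(c, d)            # fold 1: LUT configured as OR
    return lut.tick(latch0, latch1)    # fold 2: LUT configured as XOR
```

Calling run_user_cycle(TimeMultiplexedLut([AND, OR, XOR]), a, b, c, d) yields the same result as three separate gates would, using one LUT and three fast clock ticks.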

The number of folds within a region is fixed and may be as high as 12 in the ABAX². There may be different regions within the chip, with each region being independent in terms of configuration. This includes the number of folds as well as the timing per fold. Some regions could operate at very fast clock rates while others run at slower speeds, thereby reducing power requirements.

The 3D approach has a number of benefits especially when it comes to memory utilization. Each fold essentially places a new set of logic next to the memory which effectively creates a multiport memory with one port per fold for single port memory and two ports per fold for dual port memory (Fig. 2). This translates to a maximum of 24 ports if all 12 folds are used.

Figure 2. Memories span folds providing multiport memory by default and providing a port per fold for single port memory.
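
The port arithmetic can be sketched in a few lines of Python. The function names are hypothetical, and the schedule simply enumerates which fold and which physical port would serve each logical port under the one-access-per-port-per-fold model described above.

```python
from typing import List, Tuple

MAX_FOLDS = 12  # maximum folds per region in the ABAX², per the article


def effective_ports(folds: int, physical_ports: int) -> int:
    """Logical ports seen by the design: one access per physical port per fold."""
    if not 1 <= folds <= MAX_FOLDS:
        raise ValueError("folds must be between 1 and 12")
    if physical_ports not in (1, 2):
        raise ValueError("memory is single port (1) or dual port (2)")
    return folds * physical_ports


def port_schedule(folds: int, physical_ports: int) -> List[Tuple[int, int]]:
    """Map logical port n to (fold, physical port). Logical ports on
    different folds are served at different times, not simultaneously."""
    effective_ports(folds, physical_ports)  # reuse the argument checks
    return [(fold, port) for fold in range(folds) for port in range(physical_ports)]
```

With 12 folds, effective_ports gives 12 ports for single port memory and 24 for dual port memory, matching the figures above.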

There are tradeoffs with this approach. For example, the data from all the ports is not available at the same time. This is often less of an issue than one might suspect since there is often some computation that can occur when part of the data is available. In this case that computation would occur in a fold and then the next item from the memory would be available along with this result when the next fold is executed. This type of pipeline effect is normally employed in ASSPs and processors but it often requires significant effort to implement and debug.

Developers do not have to worry about these types of pipeline and timing details, where regions are, or the number of folds being used. This is all taken care of by the Tabula Stylus compiler. The compiler has to generate a system that remains within the limits of the chip, but the flexibility of the architecture means that most designs can be implemented without forcing a redesign.

The folds also make a difference for designs that employ a wide bus. Essentially a bus can be split so each fold moves a subset of the data using a pipeline effect. The approach is especially effective if computation can be performed across the pipeline but this is obviously application specific.
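
A minimal Python sketch of the bus-splitting idea follows. The helper names and the parity computation are hypothetical examples; the point is only that a wide word can be moved one slice per fold, with a running computation carried from fold to fold.

```python
from typing import List


def split_bus(word: int, width: int, folds: int) -> List[int]:
    """Split a width-bit word into `folds` equal slices, least significant first.
    Each slice is what one fold would move across the narrower physical bus."""
    if width % folds:
        raise ValueError("width must divide evenly across the folds")
    slice_bits = width // folds
    mask = (1 << slice_bits) - 1
    return [(word >> (i * slice_bits)) & mask for i in range(folds)]


def join_bus(slices: List[int], slice_bits: int) -> int:
    """Reassemble the word on the receiving side."""
    word = 0
    for i, part in enumerate(slices):
        word |= part << (i * slice_bits)
    return word


def pipelined_parity(slices: List[int]) -> int:
    """Example of computation performed across the pipeline: the parity of
    the whole word is accumulated one slice (one fold) at a time."""
    parity = 0
    for part in slices:
        parity ^= bin(part).count("1") & 1
    return parity
```

For instance, a 256-bit word split across four folds travels as four 64-bit slices, and the parity result is ready when the last slice arrives.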

Data is not only maintained between folds within memory blocks. Transparent latches also maintain state between fold activations, so partial computations are pipelined as well. This has a significant impact on timing closure, a major issue with any high-speed electronic design.

The Stylus compiler can essentially eliminate many timing closure problems by using the folds and transparent latches (Fig. 3). With a conventional FPGA, the flip-flops within the system are changed on a per-clock basis, so the logic between any pair of flip-flops is the determining factor in terms of speed. A designer must change a design if the timing constraints cannot be met. This might just mean the addition of a flip-flop in the logic chain, but it could have a more significant impact on the design.

Figure 3. A conventional FPGA requires timing to be based on the flip-flops (top) while Tabula's approach can put each state on a different fold so only the end-to-end time requirements need to be met (bottom).
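
The difference between the two timing checks can be sketched as follows. This is a deliberately simplified model with hypothetical function names: it ignores latch overhead, clock skew, and the granularity of folds, and only contrasts a per-stage constraint with an end-to-end one.

```python
from typing import Sequence


def conventional_meets_timing(stage_delays_ns: Sequence[float],
                              clock_period_ns: float) -> bool:
    """Flip-flop based design: every stage between a pair of flip-flops
    must settle within one clock period."""
    return all(delay <= clock_period_ns for delay in stage_delays_ns)


def end_to_end_meets_timing(stage_delays_ns: Sequence[float],
                            clock_period_ns: float) -> bool:
    """Simplified latch-based view: a slow stage may borrow time from a fast
    neighbor, so only the total delay over all stages is checked."""
    return sum(stage_delays_ns) <= clock_period_ns * len(stage_delays_ns)
```

With stage delays of 1.0, 3.0, and 2.0 ns and a 2.5 ns period, the first check fails because of the 3.0 ns stage, while the second passes because the 6.0 ns total fits within the 7.5 ns budget.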

Tabula's addition of the folds and transparent latches allows a design to be implemented in a more flexible fashion where the compiler can handle the details. There may still be timing closure issues, but they occur less frequently and are easier to solve. Designers have more flexibility because they can use longer logic chains that would otherwise cause timing closure issues.

The Stylus compiler does quite a bit under the hood. Folds are resources just like LUTs and memory. Fully utilizing a system reduces system requirements and usually improves performance and power utilization. This means that a region may contain folds that have unrelated logic. For example, let's say that we have three pieces of logic that require four folds each. These could all reside within the same region. The compiler handles the relationship between folds since the designer only cares that the proper result is available after a cycle is completed.
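
To make the packing example concrete, here is a small first-fit sketch in Python. It is not how Stylus allocates folds, which the article does not describe; the function name, the block names, and the greedy strategy are all assumptions chosen for illustration.

```python
from typing import Dict, List


def pack_into_regions(fold_needs: Dict[str, int],
                      folds_per_region: int = 12) -> List[List[str]]:
    """Greedy first-fit-decreasing packing of logic blocks into regions.
    fold_needs maps a (hypothetical) block name to the folds it requires."""
    regions: List[dict] = []  # each entry: {"free": int, "blocks": [names]}
    # Place the largest blocks first; ties keep their original order.
    for name, need in sorted(fold_needs.items(), key=lambda kv: -kv[1]):
        if need > folds_per_region:
            raise ValueError(f"{name} needs more folds than a region offers")
        for region in regions:
            if region["free"] >= need:
                region["blocks"].append(name)
                region["free"] -= need
                break
        else:
            # No existing region has room, so open a new one.
            regions.append({"free": folds_per_region - need, "blocks": [name]})
    return [region["blocks"] for region in regions]
```

Three blocks of four folds each, such as {"a": 4, "b": 4, "c": 4}, land in a single 12-fold region; a fourth four-fold block would open a second region.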

Unlike a conventional FPGA, the Tabula FPGA has a logical maximum number of resources, but using them also has a performance component. For example, if a section of logic needs to run at the maximum clock rate, then a single fold is used. If a designer needs a 24-port memory, then the design has to take advantage of all 12 folds; otherwise a different design approach must be taken. Luckily, a design rarely pushes the performance envelope in all areas, providing the compiler with a great deal of flexibility.
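
The tradeoff can be put into rough numbers. The sketch below assumes that the clock rate seen by user logic is the fabric clock divided by the number of folds in use; the article implies this relationship but does not state it, so treat the formula, the function names, and the results as assumptions. The 2 GHz figure is the fabric clock rate of the ABAX².

```python
FABRIC_CLOCK_MHZ = 2000.0  # ABAX² fabric clock rate from the article
MAX_FOLDS = 12             # maximum folds per region from the article


def user_clock_mhz(folds: int, fabric_mhz: float = FABRIC_CLOCK_MHZ) -> float:
    """ASSUMPTION: each fold takes one fabric tick, so the user-visible
    clock is the fabric clock divided by the number of folds."""
    if not 1 <= folds <= MAX_FOLDS:
        raise ValueError("folds must be between 1 and 12")
    return fabric_mhz / folds


def max_folds_for(required_user_mhz: float,
                  fabric_mhz: float = FABRIC_CLOCK_MHZ) -> int:
    """Largest fold count that still delivers the required user clock."""
    if required_user_mhz <= 0 or required_user_mhz > fabric_mhz:
        raise ValueError("required clock must be in (0, fabric clock]")
    return min(MAX_FOLDS, int(fabric_mhz // required_user_mhz))


def max_memory_ports(required_user_mhz: float, physical_ports: int = 2) -> int:
    """Ports available at that clock: physical ports times usable folds."""
    return physical_ports * max_folds_for(required_user_mhz)
```

Under this assumption, logic that must run at 500 MHz could use at most four folds and so see at most eight ports on a dual port memory, while the full 24 ports would only be available to logic running at roughly 167 MHz or below.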

Like most FPGA compilers, Stylus provides timing reports (Fig. 4). The difference is that the hardware can provide a range of solutions that may or may not be suitable for an application. This may mean there will be performance tradeoffs but it may be an alternative to making design changes.

Figure 4. Stylus provides timing feedback when timing closure problems are detected. The reports highlight what will and will not work allowing a designer to determine what tradeoffs or corrections should be made.

Making FPGAs Work With 100G Networking

ABAX² implements only the basic I/O in hard logic. This includes the 10G/40G/100G EMAC line termination and the four DDR3 memory controllers supporting 2.133 Gtransfers/s. The rest is on-chip logic and memory. The top-end chip includes a whopping 23.3 Mbytes of memory. This includes support for dual port access per fold, which translates into a 24-port memory if all 12 folds are used. The Stylus compiler automatically blends user RAM and SpaceTime latches.

The chip has a throughput of up to 13.8 Tbytes/s. The fabric and logic blocks operate at 2 GHz. The logic blocks include the conventional FPGA LUTs plus logic carry blocks (LCB). LCBs are adder/comparators. They can be used to implement a range of functions including CRC support. This allows a 64-bit CRC to be used at 100G to 400G line rates. The LCB implementation is more efficient than a LUT implementation.
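
For readers unfamiliar with what a 64-bit CRC computes, here is a bit-serial software reference in Python. The article does not say which polynomial the reference design uses; the ECMA-182 polynomial below is an assumption picked because it is a widely used 64-bit CRC. Hardware running at 100G would process many bits per clock rather than one at a time, so this sketch shows the function being computed, not how the LCBs implement it.

```python
CRC64_ECMA_POLY = 0x42F0E1EBA9EA3693  # assumed polynomial (ECMA-182)
MASK64 = (1 << 64) - 1


def crc64(data: bytes, poly: int = CRC64_ECMA_POLY) -> int:
    """MSB-first CRC-64 with zero initial value and no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 56  # bring the next byte into the top of the register
        for _ in range(8):
            if crc & (1 << 63):
                # Top bit set: shift it out and subtract (XOR) the polynomial.
                crc = ((crc << 1) ^ poly) & MASK64
            else:
                crc = (crc << 1) & MASK64
    return crc


def append_crc(frame: bytes) -> bytes:
    """Return the frame followed by its 8-byte big-endian CRC."""
    return frame + crc64(frame).to_bytes(8, "big")


def check_crc(frame_with_crc: bytes) -> bool:
    """With these parameters, a frame with a valid CRC appended has CRC zero."""
    return crc64(frame_with_crc) == 0
```

A receiver can therefore run the same circuit over the payload and the trailing CRC and simply test for zero.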

The 4 x 100G crossbar switch design (Fig. 5) utilizes only a fraction of the system resources and highlights the advantages already presented. It uses only 14 K LUTs and has a maximum frequency (F_MAX) of 472 MHz. The timing closure turned out to be very simple even though the design would be impractical on a conventional FPGA.

Figure 5. The 4 x 100G crossbar switch can be implemented using an ABAX2P1 but would not be feasible using a conventional FPGA.

The design employs three-port RAM, and the ports are 256 bits wide. The system has a 12 ns port-to-port latency. It also implements 288 Kbits of port buffering.
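
A back-of-the-envelope check relates these numbers to the 100G line rate. The sketch below assumes one 256-bit word moves per port per clock and ignores framing overhead and inter-packet gaps; the function names are hypothetical.

```python
def required_clock_mhz(line_rate_gbps: float, bus_width_bits: int) -> float:
    """Clock needed to sustain a line rate when one bus word moves per cycle."""
    return line_rate_gbps * 1e9 / bus_width_bits / 1e6


def bus_throughput_gbps(clock_mhz: float, bus_width_bits: int) -> float:
    """Raw throughput of a bus moving one word per cycle."""
    return clock_mhz * 1e6 * bus_width_bits / 1e9


needed = required_clock_mhz(100, 256)       # about 390.6 MHz
available = bus_throughput_gbps(472, 256)   # about 120.8 Gbit/s
```

Under these assumptions, a 256-bit port needs roughly 391 MHz to carry 100 Gbit/s, so the reported F_MAX of 472 MHz leaves about 20% of headroom per port.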

Tabula's IDE includes tools such as a schematic viewer. The tabbed browser interface provides access to this as well as project, timing and package management.

Tabula is providing the RTL source code, test benches and third-party evaluation licenses for the reference designs. The Stylus compiler takes conventional FPGA designs, so the features of the chip are available transparently. The only place where the ABAX²'s differences from a conventional FPGA become visible is the Stylus report.

As I noted at the start, the ABAX² is going to have a major impact on the networking world but it is a general FPGA. The features are applicable to a range of applications allowing it to challenge markets where other platforms like GPUs have been dominant. FPGAs still require a different type of design expertise but the ABAX² eliminates many of the limitations normally associated with FPGAs. The transparent pipelining done by Stylus has yet to be fully tapped. It is likely to move even more designers away from ASSPs.