Carefully Weigh The Tradeoffs Of Cell-Based Vs. Structured ASICs
When all is said and done, the structured ASIC’s shorter development cycle may well be the deciding factor for your design team.
With the emergence of 90-nm process technology, ASIC designers can explore uncharted levels of performance and density. However, the move to 90 nm has also unleashed a slew of challenging design-integrity issues, from crosstalk and noise to IR drop and timing closure. Complicating the development process is a growing array of silicon integration options. Today’s designers can implement designs using either a cell-based or structured ASIC methodology.
General Design Goals
Our accelerator IC design project began when engineers on the team discovered that a number of patterns in Internet-based communications are repeated over and over in high-traffic applications. General-purpose, Pentium-class servers (Xeon, Opteron, and others)—usually stacked in blade configurations in a rack—often handle these computations.
But tackling such highly specialized algorithms with a general-purpose CPU was proving highly inefficient. Clearly, an opportunity existed for implementing these key algorithms in hardware, in the form of a specialized accelerator IC.
Accelerators have a long history in personal computing. Math coprocessors were widely used for many years until processor designers, moving to higher-density process technologies, integrated them on-chip. Designers of early analog modems used dedicated digital hardware in the form of gate arrays or cell-based ASICs to accelerate performance. This hardware digitally processed the signals until other technologies came along years later.
Since then, those functions have been absorbed largely into the single-instruction, multiple-data (SIMD) instructions that were added to the Pentium processor. Finally, designers still use accelerators for video graphics. By implementing these functions in a special-purpose accelerator, product developers can offer performance comparable to a bank of general-purpose processors for a fraction of the price.
Similar opportunities lie in Internet communications today. Many security functions using public key encryption add so much data to the primary processing task that they render the system highly inefficient for all but the transfer of relatively small volumes of data. A special-purpose security device that implemented proprietary encryption schemes in hardware could significantly improve system performance.
Similarly, the rapidly growing use of XML, the markup language designed to simplify the use of richly structured documents over the Web, has presented new challenges for server designers. XML is widely used to translate databases among dissimilar systems, match up fields in dissimilar e-commerce systems, or simply exchange data between Web sites. But repeatedly processing format conversions at high volumes can quickly eat up computing resources. A specialized accelerator IC could relieve the system processor of this overhead and, in the process, dramatically improve system throughput.
Such a design would require a highly complex and dense ASIC. To maximize performance, the accelerator chip must combine multiple parallel implementations of the logical, architectural, and data-movement operations of the algorithm. Processing in the application would need to be both deep and wide.
In the deep direction, pipelined stages perform various comparison and math operations on each data packet. These pipelines feature FIFOs at both the input and output ports. Output is then transferred to another pipelined stage for additional data processing. Each pipeline can deliver one output per clock at clock rates up to 250 MHz.
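The behavior of such a pipeline can be illustrated with a minimal software sketch. This is not the team's design. The function name run_pipeline, the three example stage operations, and the one-clock-per-stage timing are all assumptions chosen for illustration. The model shows the general property described above: once the pipeline fills, one result reaches the output FIFO on every clock.

```python
from collections import deque


def run_pipeline(packets, stages):
    """Cycle-by-cycle model of a linear pipeline fed by an input FIFO.

    Assumptions (illustrative only): each stage takes exactly one clock,
    holds at most one item, and never stalls. Data values must not be
    None, because None marks an empty pipeline register.
    Returns (outputs, clock cycle on which each output was produced).
    """
    in_fifo = deque(packets)
    out_fifo = []
    out_cycles = []
    regs = [None] * len(stages)  # one pipeline register per stage
    cycle = 0
    while in_fifo or any(r is not None for r in regs):
        cycle += 1
        # The last stage's register drains into the output FIFO.
        if regs[-1] is not None:
            out_fifo.append(regs[-1])
            out_cycles.append(cycle)
        # Advance from the back so each stage reads the previous
        # stage's value from the prior clock.
        for i in range(len(stages) - 1, 0, -1):
            prev = regs[i - 1]
            regs[i] = stages[i](prev) if prev is not None else None
        # The first stage pulls the next packet from the input FIFO.
        regs[0] = stages[0](in_fifo.popleft()) if in_fifo else None
    return out_fifo, out_cycles


# Hypothetical stage operations standing in for the comparison and
# math operations mentioned in the text.
example_stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
outputs, cycles = run_pipeline([1, 2, 3, 4], example_stages)
print(outputs)  # [1, 3, 5, 7]
print(cycles)   # [4, 5, 6, 7]
```

In this model the first result appears after a fill latency of a few clocks, and every later result follows exactly one clock behind the previous one. That steady one-per-clock rate is what a deep pipeline buys.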
In the wide direction, the core processor clones parallel copies of the pipelined FIFO sequences to achieve further multiples of performance. With interfaces to external double-data-rate (DDR) DRAM at the inputs and outputs, the chip can process, at very high speed, volumes of data well beyond what it could otherwise handle. Figure 1 contains a basic block diagram of the accelerator IC and an illustration of the processing task flow.
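A rough back-of-the-envelope calculation shows how the deep and wide directions combine. The sketch below assumes ideal conditions: every parallel lane sustains one output per clock, packets divide evenly among lanes, and nothing stalls on memory. The lane count of 4 and the pipeline depth of 3 are hypothetical values, not figures from the design. Only the 250-MHz clock rate comes from the text.

```python
import math

CLOCK_HZ = 250_000_000  # upper clock rate quoted in the text


def peak_outputs_per_second(lanes, clock_hz=CLOCK_HZ):
    """Peak aggregate rate if every lane delivers one output per clock."""
    return lanes * clock_hz


def cycles_to_finish(n_packets, lanes, depth):
    """Clocks needed to drain n_packets spread evenly over parallel lanes.

    Assumes one packet enters each lane per clock, `depth` one-clock
    stages per lane, and no stalls (an idealization).
    """
    if n_packets == 0:
        return 0
    per_lane = math.ceil(n_packets / lanes)
    return depth + per_lane


# Hypothetical configuration: 4 lanes, 3 stages per lane.
print(peak_outputs_per_second(lanes=4))                   # 1000000000
print(cycles_to_finish(n_packets=1000, lanes=1, depth=3))  # 1003
print(cycles_to_finish(n_packets=1000, lanes=4, depth=3))  # 253
```

Under these assumptions, widening the design scales throughput almost linearly, because the fixed pipeline fill latency is small compared with the number of packets processed. In practice, the external memory interfaces would have to keep every lane fed for the ideal figure to hold.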
To meet performance requirements, the team estimated that the design would require approximately 5 million gates of logic and over 5 Mbits of high-speed SRAM in a state-of-the-art process technology. The chip would feature high-performance I/O and memory controllers for interconnect to high-speed DDR2 memory located off-chip. An embedded PCI-X interface core would supply a high-speed link to the server. Support for diagnostic and test functions comes from additional on-chip buses.
System Partitioning
One early decision faced by designers was how much flexibility they needed in the device’s SRAM configuration. By implementing the device in a cell-based ASIC, the designers could choose the size and number of SRAM blocks they wanted to use in the design.