[Product Innovation]
Scalable, Reconfigurable Processor Adjusts Logic For Top Performance
An array of processing tiles, coupled with a reprogrammable fabric, can adjust its on-chip resources in a single cycle to optimize itself for the task at hand.
In the first silicon implementation of the RCP architecture, the CS2112, designers opted to combine four slices, or 12 tiles, onto one chip. That gives the system designer 84 datapath units, 24 multipliers, and 48 local-store memories, totalling 196 kbits. With all blocks active and the clock running at 125 MHz, the chip provides a maximum compute throughput, with 16-bit data, of 24 BOPS, and 3 billion 16-bit multiply-accumulates/s. In terms of communications applications, that translates into the ability to implement 50 channels of cdma2000 processing.
In addition to the four-slice CS2112, Chameleon plans to release two scaled-down versions, the two-slice CS2106 and the single-slice CS2103. Both chips can function in less-compute-intensive applications. Their performance and I/O buses equal half or a quarter of those in the CS2112.
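The resource totals quoted for the three chips follow from simple multiplication. The sketch below uses the per-tile counts given in the article (seven datapath units, two multipliers, and four local-store memories per tile, three tiles per slice). The function and constant names are illustrative. It also assumes that the smaller parts run at the same 125-MHz clock and that each multiplier completes one multiply-accumulate per cycle, neither of which the article states outright.

```python
# Figures from the article; names are illustrative, not vendor terminology.
TILES_PER_SLICE = 3            # four slices = 12 tiles
DATAPATH_UNITS_PER_TILE = 7
MULTIPLIERS_PER_TILE = 2
LOCAL_STORES_PER_TILE = 4
CLOCK_HZ = 125_000_000         # assumed identical for all three chips


def chip_resources(slices):
    """Return the resource totals for a chip built from `slices` slices."""
    tiles = slices * TILES_PER_SLICE
    multipliers = tiles * MULTIPLIERS_PER_TILE
    return {
        "tiles": tiles,
        "datapath_units": tiles * DATAPATH_UNITS_PER_TILE,
        "multipliers": multipliers,
        "local_stores": tiles * LOCAL_STORES_PER_TILE,
        # Assumption: one multiply-accumulate per multiplier per clock.
        "peak_macs_per_s": multipliers * CLOCK_HZ,
    }


if __name__ == "__main__":
    for name, slices in (("CS2112", 4), ("CS2106", 2), ("CS2103", 1)):
        print(name, chip_resources(slices))
```

Under these assumptions the four-slice part comes out at 84 datapath units, 24 multipliers, 48 local-store memories, and 3 billion multiply-accumulates/s. The two smaller parts scale to a half and a quarter of those figures.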
The initial market for RCPs includes basestations, fixed-point wireless local loops, smart antennas, voice-over-IP, secure communications, and very high bit-rate digital-subscriber-line (VDSL) systems. It also encompasses various other communications applications that traditionally use DSPs and FPGAs.
Configurable compute tiles form the heart of the RCP. Each holds seven 32-bit datapath units, four local-store memory blocks, two single-cycle multipliers, and a control unit. Routing multiplexers, a barrel shifter, registers, mask logic, a 32-bit operation block, and several output registers lie inside each datapath unit. The local-store memories are multiported, so they can perform simultaneous reads and writes. They also can be concatenated to form wider or deeper memory blocks.
The datapath can handle 16- and 32-bit word operations, as well as dual, independent 16-bit data streams for single-instruction/multiple-data (SIMD) operations. The 32-bit barrel shifter performs word and byte swapping and word duplication. It also can generate any 5-bit constant for use by the register and the two 32-bit AND/OR mask operators.
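To make the dual 16-bit mode concrete, here is a minimal software sketch of two independent 16-bit lanes packed into one 32-bit word, plus a swap done as a 16-bit rotate. It is a behavioral illustration only, not a description of the actual datapath logic. All function names are hypothetical, and treating a "word" as 16 bits for the swap is an assumption.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def pack16x2(hi, lo):
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack16x2(word):
    """Split a 32-bit word into its (high, low) 16-bit lanes."""
    return (word >> 16) & MASK16, word & MASK16


def simd_add16x2(a, b):
    """Add lane by lane; a carry out of one lane never reaches the other."""
    a_hi, a_lo = unpack16x2(a)
    b_hi, b_lo = unpack16x2(b)
    return pack16x2(a_hi + b_hi, a_lo + b_lo)


def swap_halfwords(word):
    """Exchange the two 16-bit lanes, i.e. rotate a 32-bit value by 16."""
    word &= MASK32
    return ((word << 16) | (word >> 16)) & MASK32


def duplicate_low(word):
    """Copy the low 16-bit lane into both lanes."""
    lo = word & MASK16
    return pack16x2(lo, lo)
```

For example, adding the packed pairs (1, 0xFFFF) and (2, 1) gives (3, 0): the low lane wraps around on its own without disturbing the high lane.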
The 16- by 24-bit multipliers provide a result in a single cycle. In 16-bit mode, they can produce a signed 32-bit product. In full-resolution mode, they create a 40-bit product that's rounded to 32 bits.
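The two multiplier modes can be modeled in a few lines. This sketch assumes signed two's-complement operands and assumes that "rounded to 32 bits" means dropping the low 8 bits of the 40-bit product with round-half-up. The article does not specify the rounding scheme, so treat that detail, and the function names, as illustrative.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        return value - (1 << bits)
    return value


def mul16(a, b):
    """16-bit mode: signed 16 x 16 multiply, giving a signed 32-bit product."""
    return to_signed(a, 16) * to_signed(b, 16)


def mul_full(a16, b24):
    """Full-resolution mode: signed 16 x 24 multiply.

    The exact product needs up to 40 bits. Assumption: the hardware keeps
    the upper 32 bits and rounds half up on the 8 bits it discards.
    """
    product = to_signed(a16, 16) * to_signed(b24, 24)
    rounded = (product + (1 << 7)) >> 8
    return to_signed(rounded, 32)
```

With this rounding choice, `mul_full(1, 128)` yields 1 while `mul_full(1, 127)` yields 0, which shows where the half-way point falls.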
Equivalent to an ALU, the 32-bit operation block directly implements all C and Verilog operators. It performs numeric calculations, signed and unsigned shifting, and bit-field masking operations on data. All registers for the datapath have conditional enables to improve pipelining efficiency. At reconfiguration, registers can either initialize or preserve their state. Furthermore, there's an optional shifted-feedback mode for shift-register and LFSR implementations.
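As an illustration of the kind of function the shifted-feedback mode targets, the following sketch steps a 16-bit Fibonacci LFSR in software. The taps (16, 14, 13, 11) are a common textbook maximal-length choice and are an assumption here. They are not the cdma2000 pseudonoise polynomials, and nothing in the article says how the RCP maps an LFSR onto its registers.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR by one clock.

    Feedback polynomial x^16 + x^14 + x^13 + x^11 + 1 (illustrative choice).
    The feedback bit is the XOR of the tapped bits and is shifted in at the top.
    """
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def lfsr16_period(seed=0xACE1):
    """Count the clocks needed for the register to return to `seed`."""
    state = lfsr16_step(seed)
    steps = 1
    while state != seed:
        state = lfsr16_step(state)
        steps += 1
    return steps


if __name__ == "__main__":
    print(hex(lfsr16_step(0xACE1)))
    print(lfsr16_period())
```

Any nonzero seed cycles through every nonzero 16-bit state before repeating. An all-zero seed stays at zero, which is why LFSR registers are normally initialized to a nonzero value.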
Tying all of the logic together, the interconnect fabric guarantees 100% routability through a fully enumerated interconnect hierarchy. The routing employs a rule-based timing model that's simple and deterministic: one clock cycle within a slice, and two clock cycles between slices. Timing is independent of fanout.
There are three levels of hierarchy for routing in the dynamic interconnect. At the first level, local routes connect nearby datapath units with just a one-clock-cycle delay. With the same delay, intraslice routes connect all datapath units within a slice. Finally, interslice routes connect datapath units in different slices with a delay of two clock cycles. In each datapath unit, routing multiplexers route signals through or around the datapath units. On a clock-by-clock basis, the multiplexers can be told to alter the data flow.
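The timing rule is simple enough to write down directly. The sketch below is a hypothetical model of it: the `DatapathUnit` type and both function names are invented for illustration, and the only facts taken from the article are the one-cycle and two-cycle delays and the independence from fanout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatapathUnit:
    """Illustrative address of a datapath unit on the chip."""
    slice_id: int
    tile_id: int
    unit_id: int


def route_delay_cycles(src, dst):
    """One cycle for local and intraslice routes, two for interslice routes."""
    return 1 if src.slice_id == dst.slice_id else 2


def fanout_delays(src, destinations):
    """Delay to each destination; the number of destinations does not matter."""
    return [route_delay_cycles(src, dst) for dst in destinations]
```

Because the delay depends only on whether source and destination share a slice, a signal driving one unit or twenty units sees the same per-route latency in this model.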
To conduct the quick personality change, the RCPs contain two configuration memory planes. The active plane holds the configuration for the function being executed. The shadow, or background, plane holds the most likely alternative configuration that the current algorithm may have to call upon. If it's called, the control logic needs only a single cycle to switch from the current to the background plane. The formerly active plane then becomes the background plane, and new configuration data can be loaded into it from external system memory at a speed of about 3 µs per slice.
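The two-plane scheme behaves much like double buffering. Here is a minimal sketch of that behavior, with configurations represented as plain strings. The class and method names are hypothetical and do not correspond to any Chameleon API.

```python
class ConfigPlanes:
    """Toy model of an active plane plus a background (shadow) plane."""

    def __init__(self, active, background=None):
        self.active = active            # configuration currently executing
        self.background = background    # configuration staged for a switch

    def load_background(self, config):
        """Stage a new configuration without disturbing the active one."""
        self.background = config

    def swap(self):
        """Exchange the planes; the old active plane becomes the background."""
        if self.background is None:
            raise RuntimeError("no configuration staged in the background plane")
        self.active, self.background = self.background, self.active
        return self.active


if __name__ == "__main__":
    planes = ConfigPlanes(active="pn_generation")
    planes.load_background("demodulation")   # loads while the active plane runs
    planes.swap()
    print(planes.active, planes.background)
```

After the swap, the previous configuration still sits in the background plane until something else is loaded over it, so switching straight back is possible in this model.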
Using this updating approach makes multipart algorithms possible. An example of such an algorithm uses the four key parts of the power-control group employed by the cdma2000 chip-rate processing algorithm. Those parts include pseudonoise sequence generation, demodulation, finger searches, and access searches.
In a traditional ASIC, each of these algorithms is implemented as a separate logic block, and the four blocks are integrated together on the chip. That gives designers little flexibility for updates or function changes.
In contrast, Chameleon's approach requires allocating only enough resources for the most complex function. While one algorithm executes from the active plane, the pattern for the next function transfers to the shadow plane.
For the cdma2000 function, the four algorithms mentioned require 77, 615, 224, and 334 µs to execute. The four functions are referred to as one power control group. Depending on the algorithm, all or part of the reconfigurable processing fabric can be devoted to the computations. More reconfigurable resources can be used if that will improve the result's speed or quality.
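A quick budget check shows why background loading fits this workload. The sketch below sums the four execution times and compares each one with the time needed to reload a four-slice chip. It assumes the times are listed in the same order as the algorithms named earlier, and that the slices load one after another at about 3 µs each. Both points are assumptions, as are the names used.

```python
# Execution times from the article, in microseconds. Pairing them with the
# algorithm names in listed order is an assumption.
EXEC_US = {
    "pn_sequence_generation": 77,
    "demodulation": 615,
    "finger_search": 224,
    "access_search": 334,
}
LOAD_US_PER_SLICE = 3   # "about 3 µs per slice"
SLICES = 4              # CS2112


def power_control_group_us(exec_us=EXEC_US):
    """Total time for one pass through all four algorithms."""
    return sum(exec_us.values())


def full_reload_us(slices=SLICES):
    """Time to reload every slice, assuming slices load sequentially."""
    return LOAD_US_PER_SLICE * slices


def reload_is_hidden(exec_us=EXEC_US, slices=SLICES):
    """True if the next configuration can load while any algorithm runs."""
    return all(t >= full_reload_us(slices) for t in exec_us.values())


if __name__ == "__main__":
    print(power_control_group_us(), full_reload_us(), reload_is_hidden())
```

The four times add up to 1,250 µs, which matches the 1.25-ms power control group interval used in cdma2000. Under the sequential-load assumption, a full reload takes about 12 µs, comfortably shorter than even the 77-µs algorithm, so each new configuration can be staged while the previous function is still running.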