Designers are packing more functionality into less space while increasing the data bandwidth in portable devices. Unfortunately, they're quickly running out of signal-processing performance. Today's crop of DSPs can deliver 30 to 100 MIPS, but portable communications devices must now also handle algorithms for MP3 audio, other multimedia, and Web and Internet support tasks.
To help developers of next-generation cell phones, PDAs, and other products, designers at Infineon have crafted a second-generation implementation of their Carmel DSP core. Used in conjunction with the PowerPlug accelerator options, it can deliver from two to ten times the performance of the first-generation Carmel DSP 10XX family.
The core contains an enhanced instruction set that includes all the commands from the DSP 10XX. It also features additional commands that target the complex signal-processing tasks required by advanced voice processing, third-generation cell-phone algorithms, multimedia applications, and data communications. Like its predecessor, the Carmel DSP 20XX core retains the dual multiplier-accumulator (MAC), the dual arithmetic and logic units (ALUs), and the unique configurable long-instruction-word (CLIW) capability. The CLIW feature lets programmers create their own "superinstructions" by combining several of the DSP core's instructions into one large operation that does more in parallel, greatly improving algorithm performance.
Also, special add-on hardware accelerators known as PowerPlugs can be cointegrated with the core whenever the software execution can't deliver the performance required by the algorithms. These coprocessor blocks simplify the system-on-a-chip (SoC) implementation, since designers don't have to worry about importing a function from another design library or designing it themselves.
The PowerPlug functions that initially will be available include a MAC, a quad 8-bit ALU, and an MPEG-4 video decoder. The MAC PowerPlug can supplement the core's dual MAC. Cointegrating two MAC PowerPlugs would enable the DSP to deliver four MAC operations per clock cycle, twice what the core's dual MAC manages on its own. In turn, this would let the DSP core work through multiplication-intensive computations faster.
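The arithmetic behind the four-MAC figure can be sketched in a few lines. This is an idealized model, not an Infineon figure: it assumes each MAC PowerPlug contributes one MAC per cycle (consistent with two PowerPlugs plus the dual MAC giving four) and that every MAC unit is busy on every cycle. The function names and the FIR-filter example are illustrative assumptions.

```python
import math

CORE_MACS = 2       # the Carmel core's dual MAC, per the article
MAX_POWERPLUGS = 4  # the article cites one to four PowerPlug slots


def macs_per_cycle(mac_powerplugs: int) -> int:
    """Peak MAC operations per clock, assuming one MAC per MAC PowerPlug."""
    if not 0 <= mac_powerplugs <= MAX_POWERPLUGS:
        raise ValueError("expected between zero and four MAC PowerPlugs")
    return CORE_MACS + mac_powerplugs


def fir_cycles(taps: int, samples: int, mac_powerplugs: int) -> int:
    """Idealized cycle count for an FIR filter: one MAC per tap per sample."""
    total_macs = taps * samples
    return math.ceil(total_macs / macs_per_cycle(mac_powerplugs))


print(macs_per_cycle(2))       # 4
print(fir_cycles(64, 160, 0))  # 5120 cycles on the dual MAC alone
print(fir_cycles(64, 160, 2))  # 2560 cycles with two MAC PowerPlugs
```

Real code would fall short of these numbers wherever memory bandwidth or loop overhead keeps a MAC unit idle, so the model gives an upper bound on the speedup rather than a prediction.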
Similarly, the quad 8-bit ALU PowerPlug can increase the pixel-processing rate when used in conjunction with the Carmel core's ALUs. If the quad ALU is employed for all four PowerPlug coprocessors, the DSP block could process up to 16 pixels per cycle. Alternatively, the MPEG-4 PowerPlug could occupy one of the coprocessor slots and process MPEG streams in real time.
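The idea behind a quad 8-bit ALU, four independent pixel operations packed into one 32-bit word, can be emulated in software. The following is a minimal sketch that assumes wraparound (modulo-256) addition in each lane; the article does not say whether the PowerPlug wraps or saturates, and all function names here are hypothetical.

```python
LANE_MASK_LOW7 = 0x7F7F7F7F  # low 7 bits of each 8-bit lane
LANE_MASK_MSB = 0x80808080   # top bit of each 8-bit lane


def pack4(pixels):
    """Pack four 8-bit values into one 32-bit word, pixel 0 in the low byte."""
    word = 0
    for i, p in enumerate(pixels):
        word |= (p & 0xFF) << (8 * i)
    return word


def unpack4(word):
    """Split a 32-bit word back into four 8-bit values."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def quad_add8(a, b):
    """Add four 8-bit lanes at once with no carry crossing a lane boundary."""
    # Summing only the low 7 bits keeps every carry inside its own lane.
    low = (a & LANE_MASK_LOW7) + (b & LANE_MASK_LOW7)
    # The top bit of each lane is then folded in with XOR (carry-less add).
    return (low ^ ((a ^ b) & LANE_MASK_MSB)) & 0xFFFFFFFF


def pixels_per_cycle(quad_alu_powerplugs):
    """Four 8-bit lanes per quad ALU PowerPlug, as the article describes."""
    return 4 * quad_alu_powerplugs


a = pack4([250, 10, 128, 255])
b = pack4([10, 20, 128, 1])
print(unpack4(quad_add8(a, b)))  # [4, 30, 0, 0]
print(pixels_per_cycle(4))       # 16
```

Image code often wants saturating rather than wrapping arithmetic, so that a bright pixel clamps at 255 instead of rolling over to a dark value; that variant needs an extra per-lane overflow check and is left out here.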
These PowerPlug blocks provide the DSP core extra horsepower for functions that can't readily use traditional "MIPS" (typically, multiply-accumulate operations). Such nontraditional MIPS are accounting for an ever-larger portion of the DSP's execution time as the applications go beyond the traditional signal-processing realm.
Some DSP solutions resolve these challenges by adding dedicated instructions that allow efficient implementation of the algorithms, such as commands to execute key portions of a Viterbi decoder. These instructions are supported by dedicated hardware that provides the computational acceleration.
Not everyone writes algorithms in the same way, though. Dedicated instructions and hardware could end up as excess baggage. This was mainly why Infineon developed the CLIW approach, which lets programmers define their own CLIW instructions. Each is actually a composite of up to four Carmel instructions that execute in parallel.
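A toy model may help make the composite-instruction idea concrete. It bundles up to four operations that issue together, all reading the register state as it stood before the instruction and all writing their results at once. The register names, the use of Python callables as "operations," and the read-before-write semantics are assumptions made purely for illustration; none of this reflects the actual Carmel instruction encoding or toolchain.

```python
MAX_SLOTS = 4  # per the article: up to four Carmel instructions per CLIW


def define_cliw(*ops):
    """Bundle one to four (dest_register, function) operations into a composite."""
    if not 1 <= len(ops) <= MAX_SLOTS:
        raise ValueError("a CLIW holds one to four operations")
    dests = [dest for dest, _ in ops]
    if len(set(dests)) != len(dests):
        raise ValueError("parallel operations must write distinct registers")

    def execute(regs):
        snapshot = dict(regs)  # every slot reads the same pre-issue state
        results = {dest: fn(snapshot) for dest, fn in ops}
        regs.update(results)   # all writes land together
        return regs

    return execute


# Hypothetical composite: two MACs plus two pointer increments in one issue.
mac_step = define_cliw(
    ("acc0", lambda r: r["acc0"] + r["x0"] * r["c0"]),
    ("acc1", lambda r: r["acc1"] + r["x1"] * r["c1"]),
    ("pdata", lambda r: r["pdata"] + 2),
    ("pcoef", lambda r: r["pcoef"] + 2),
)

regs = {"acc0": 0, "acc1": 0, "x0": 3, "c0": 4,
        "x1": 5, "c1": 6, "pdata": 0, "pcoef": 0}
mac_step(regs)
print(regs["acc0"], regs["acc1"], regs["pdata"], regs["pcoef"])  # 12 30 2 2
```

The point of the model is that the programmer, not the chip vendor, chooses which operations to fuse, so the composite matches the inner loop actually being written.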
This CLIW concept has been extended in the Carmel DSP 20XX series. Designers not only can configure the instruction set in the new core, they also can modify the core's datapath to meet their application requirements. These modifications are implemented through the addition of one to four PowerPlug accelerators (see the figure).
The Carmel 20XX's basic architecture consists of two processor blocks. One of these contains an ALU, a MAC, a barrel shifter, and an exponent unit. The second processor block has only a second ALU and a second MAC. The two blocks can operate in parallel to make fast work of time-critical computations. Both MACs can perform 16- by 16-bit single-cycle multiplications and accumulations of up to 40 bits.
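The MAC datapath widths can be illustrated with a short emulation. A signed 16- by 16-bit product fits in 32 bits, so a 40-bit accumulator leaves 8 guard bits of headroom for long sums. The sketch below assumes signed two's-complement operands and saturation on overflow; the article specifies neither, and the function names are hypothetical.

```python
ACC_BITS = 40
ACC_MAX = (1 << (ACC_BITS - 1)) - 1
ACC_MIN = -(1 << (ACC_BITS - 1))


def to_int16(value):
    """Wrap an integer to a signed 16-bit operand."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def mac(acc, x, y):
    """Return acc + x*y for 16-bit signed x, y, saturated to 40 bits (assumed)."""
    acc += to_int16(x) * to_int16(y)
    return max(ACC_MIN, min(ACC_MAX, acc))


def dot(xs, ys):
    """Dot product accumulated one MAC at a time."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc


print(dot([1, 2, 3], [4, 5, 6]))  # 32

# Worst case: (-32768) * (-32768) = 2**30 per product.
worst = [-32768] * 511
print(dot(worst, worst) == 511 * 2**30)  # True: still within 40 bits
```

With the largest possible product on every step, the accumulator holds 511 products exactly and saturates on the 512th, which shows how much headroom the guard bits buy in a long filter or correlation loop.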