ELECTRONIC DESIGN - August 17, 1998 - RISC And CISC Processors Compete For Embedded Applications

Digital Design
* Exploring the world of digital, logic, memory, and microprocessors

RISC And CISC Processors Compete For Embedded Applications

With Effective Throughputs Of 200 MIPS And More, Superscalar CPUs Tackle Throughput-Intensive Embedded Needs.

Dave Bursky


Art Courtesy: Sun Microelectronics

No matter how fast a CPU is, new applications always seem to surface--and demand even more horsepower. Driven by image processing, communications, multimedia, and many other applications, CPU throughputs have climbed from the few million instructions per second (MIPS) of last decade's CPUs to over 2000 MIPS delivered by today's leading-edge chips. Such high-throughput CPUs are not being demanded just for desktop computing applications to power advanced workstations and servers. They are also wanted for many embedded applications, such as in high-speed printers, for network bridges and routers, and computer games.

Advanced CMOS processes are being called upon to achieve speed improvements. But, this is only part of the landscape. Other performance gains come from many architectural enhancements. In part, these gains are possible because of improved processing, which permits millions of transistors to be integrated on a single chip. Some enhancements include the use of superscalar architectures with multiple-integer and floating-point execution units, out-of-order instruction execution, and high levels of pipelining. In addition, designers have increased CPU word widths from 16 to 32 to 64 bits, and expanded the cache interface to bus widths as large as 128 bits.

New, more complex instructions have been added to many superscalar processors in order to leverage the wide data paths and new features. For example, a subdividable arithmetic unit can take 64-bit data words and divide them into two, four, or eight subwords. It can then perform up to eight parallel computations on the data words. Such instructions are critical to improving the CPU performance for applications such as digital-signal processing, image processing, and many other applications that deal with large arrays of data.
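The subword idea can be illustrated with a short software model. The following is only a sketch in Python, not any vendor's instruction definition: the function names are hypothetical, and it assumes simple wraparound (modular) addition in each subword.

```python
def split_subwords(word, lanes):
    """Split a 64-bit word into `lanes` equal subwords, least significant first."""
    if lanes not in (2, 4, 8):
        raise ValueError("lanes must be 2, 4, or 8")
    width = 64 // lanes
    lane_mask = (1 << width) - 1
    return [(word >> (i * width)) & lane_mask for i in range(lanes)]


def join_subwords(subwords):
    """Pack a list of equal-width subwords back into one 64-bit word."""
    width = 64 // len(subwords)
    lane_mask = (1 << width) - 1
    word = 0
    for i, value in enumerate(subwords):
        word |= (value & lane_mask) << (i * width)
    return word


def parallel_add(a, b, lanes):
    """Add two 64-bit words lane by lane; carries never cross a lane boundary."""
    width = 64 // lanes
    lane_mask = (1 << width) - 1
    sums = [(x + y) & lane_mask
            for x, y in zip(split_subwords(a, lanes), split_subwords(b, lanes))]
    return join_subwords(sums)


# Eight independent 8-bit additions in one operation. The lowest lane
# (0xFF + 0x01) wraps to 0x00 without disturbing its neighbor.
result = parallel_add(0x01020304050607FF, 0x0101010101010101, lanes=8)
```

In hardware, the same effect comes from breaking the adder's carry chain at the subword boundaries, so all of the lanes complete in a single pass through the arithmetic unit.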

One common characteristic of all these superscalar RISC processors is high-throughput floating-point computation. Single- or two-cycle FP multiplication or multiply accumulate operations give users several hundreds of MFLOPS to almost 1 GFLOPS of math throughput--performance previously relegated to supercomputers or to dedicated signal-processing circuits. These and even more MFLOPS can be readily consumed by signal processing, ray tracing in 3D imaging, and many other applications.

The first company to implement such subword-parallel instructions in its PA-RISC processors was Hewlett-Packard Co., Palo Alto, Calif. However, the company has not developed a commercial OEM market for these processors, and uses the CPUs only in its own systems. Other RISC-processor families--Alpha, MIPS, SPARC, and PowerPC, as well as the suppliers of x86 high-end CPUs--have each, in turn, crafted their own instruction-set extensions to add visual computing capabilities. These instructions can speed up applications such as image processing, printing, and audio-signal processing. The performance improvement ranges from three or four times, up to peak improvements of eight to 12 times the chips' throughput without the new instructions.

The superscalar, highly pipelined CPUs typically run at clock speeds of 200 MHz and faster. At the high end of the embedded controller market, they deliver supercomputer-like integer and floating-point computational performance--peak instruction throughputs of more than 2 billion instructions per second (BIPS) are already possible from a single CPU. In many cases, these CPUs were initially developed for desktop computers and servers. Not only do they perform well in single-CPU systems, but most include the necessary control signals and buses to be seamlessly connected in large multiprocessor arrays. They are now finding their way into performance-demanding applications, like raster engines in high-end laser printers, 3D graphic displays, bridges and routers, and many other systems.

To achieve the high throughput, designers of these CPUs spend all of their transistor budget on throughput-enhancing features--large caches, dynamic execution control, register renaming logic, advanced branch prediction, and, of course, multiple execution units. Interfacing these powerful CPUs to the rest of the system requires one or more support logic chips, which provide the control and I/O functions. Luckily, advances in ASIC technology have made it relatively easy to craft the desired support circuits, which typically run at bus frequencies of 66 to 120 MHz. In contrast, many "high-integration" embedded controllers use their transistor budgets to pack much of the system-support logic on the same chip, allowing designers to incorporate only a more moderate-performance CPU.

Larger instruction and data caches, from just a few kbytes at the beginning of this decade to as much as 64 kbytes each, as well as second-level caches on the chip, are improving memory bandwidth. These caches tie into 64- or 128-bit-wide buses on the chip, and those wide buses permit more data or multiple instructions to be transferred every cycle. That translates into higher instruction throughputs.

One of the major steps forward in CPU design has probably been the move to superscalar and superpipelined architectures. Superscalar architectures with multiple execution units can execute two or more instructions simultaneously--almost doubling or quadrupling the instruction throughput. Superpipelining adds more pipeline stages, which permits multiple instructions to be staged in the execution queue. A new instruction in each pipeline can then start every clock cycle.

For example, the RM7000 64-bit, dual-issue, superscalar processor from Quantum Effect Design packs dual 16-kbyte, Level-1 (L1) caches for instructions and data. It also includes a 256-kbyte unified, write-back Level-2 (L2) cache (Fig. 1). An on-chip tertiary cache controller supports up to an 8-Mbyte Level-3 (L3) off-chip cache. With its L2 cache and dual instruction-issue pipeline, the chip achieves a raw throughput of 500 Dhrystone 2.1 MIPS when clocked at 300 MHz.

1. The first MIPS-family processor to incorporate both Level-1 and Level-2 caches on the chip, the RM7000 from Quantum Effect Design employs a superscalar architecture that initiates two instructions every clock cycle. It thereby achieves a peak throughput of over 500 Dhrystone 2.1 MIPS when clocked at 300 MHz.

Two instructions, issued by the processor every clock cycle, are routed either to the two 64-bit integer units, or to one integer unit and the fully pipelined 64-bit floating-point unit. For maximum efficiency, the L1 caches are four-way set associative, and implement write-back and nonblocking policies. Both integer pipelines have five stages, while the physical length of the floating-point pipeline is seven stages. A single integer multiply/divide unit in one of the integer execution units helps accelerate multiplication and multiply accumulate operations--key operations in signal and image-processing algorithms.
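To make the term "four-way set associative" concrete, the sketch below shows how an address might be split into tag, set index, and byte offset for a 16-kbyte cache. The 32-byte line size is an assumption made purely for illustration (no line size is given here), and the function names are hypothetical.

```python
def number_of_sets(size_bytes, ways, line_bytes):
    """Sets in a cache: total lines divided by the lines (ways) per set."""
    return size_bytes // (ways * line_bytes)


def decompose(address, size_bytes, ways, line_bytes):
    """Return (tag, set_index, byte_offset) for an address."""
    sets = number_of_sets(size_bytes, ways, line_bytes)
    offset = address % line_bytes
    index = (address // line_bytes) % sets
    tag = address // (line_bytes * sets)
    return tag, index, offset


LINE_BYTES = 32  # assumed line size, not taken from the article

# 16-kbyte, four-way cache: 128 sets, each holding four lines.
four_way_sets = number_of_sets(16 * 1024, 4, LINE_BYTES)

# The same capacity organized as direct mapped (one way): 512 sets of one line.
direct_mapped_sets = number_of_sets(16 * 1024, 1, LINE_BYTES)

tag, index, offset = decompose(0x12345678, 16 * 1024, 4, LINE_BYTES)
```

With four ways, up to four lines whose addresses share a set index can reside in the cache at once. In a direct-mapped cache of the same size, two lines with the same index evict each other.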

A few weeks ago, the company unveiled the next upgrade to its 5000 series processors, the RM5231, 5261, and 5271. The processors feature less expensive implementations than the RM7000. They also are dual-instruction issue, but pack dual 32-kbyte caches for instructions and data. (The caches are two-way set-associative, rather than four-way.) These CPUs, however, only include an L2 cache controller, which supports off-chip cache sizes of 512 kbytes to 2 Mbytes.

The L1 cache sizes are double the amount integrated on the company's previous R5000 family members, the RM5230, 5260, and 5270. The new versions also operate at higher clock rates than the previous versions. The older ones run at clock rates of up to 200 MHz, and deliver a top throughput of 260 MIPS (Dhrystone 2.1). The larger caches and faster clock rates (266 MHz) on the new processors bring their effective peak throughput up to between 300 and 350 MIPS.

By initiating four instructions every clock cycle (four-way superscalar), the Alpha 21164 processor, developed by Digital Equipment Corp., also leverages a high clock rate, up to 667 MHz, to achieve a peak throughput of more than 2.4 BIPS. The 667-MHz clock is the highest clock rate for a commercially available CPU. To keep the data moving fast on the chip, designers integrated a 96-kbyte, L2 write-back unified cache (three-way set associative), and dual 8-kbyte, direct-mapped L1 caches for data and instructions. A wide, 128-bit memory-data path allows fast cache refills. In addition, a 40-bit address bus permits the processor to access a very large memory space.
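Peak figures like these follow from simple arithmetic: issue width multiplied by clock rate gives an upper bound on native instruction throughput. The short calculation below is a sketch of that bound using the issue widths and clock rates quoted above; the function name is hypothetical.

```python
def peak_mips(issue_width, clock_mhz):
    """Upper bound on native instructions per second, in millions."""
    return issue_width * clock_mhz


alpha_21164 = peak_mips(4, 667)  # four-way issue at 667 MHz
rm7000 = peak_mips(2, 300)       # dual issue at 300 MHz

print(alpha_21164 / 1000, "BIPS peak for the Alpha 21164")
print(rm7000, "MIPS peak for the RM7000")
```

The bound assumes that every issue slot is filled on every cycle. Cache misses, branch mispredictions, and data dependencies keep sustained throughput below it. Dhrystone MIPS ratings, such as the RM7000's, are benchmark-derived figures and are not directly comparable to this native peak.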

Not to be outdone, designers at what is now Alpha Processor (formerly Digital Equipment), a division of Samsung Semiconductor, will soon release the Alpha 21264. This version of the Alpha adds out-of-order execution and dynamic scheduling with register renaming to the key features list. In addition to being able to start up to four instructions every cycle, the out-of-order capability lets the CPU execute the instructions more efficiently, thus improving throughput (Fig. 2).

2. Incorporating huge L1 caches of 64 kbytes each, the Alpha 21264 from Alpha Processor can perform four-way out-of-order execution on the instruction stream. The pipeline consists of four integer execution units, two of which can perform memory address calculations, as well as two floating-point execution units.

Packing over 15 million transistors, the 21264 processor includes two huge L1 caches--64 kbytes each--that are two-way set associative, as well as four integer execution units and two floating-point execution units. Two of the integer execution units can perform memory-address calculations for load and store operations. Clock rates for the 21264 will start at about 500 MHz. The company expects to be able to offer faster versions in the future, with clock speeds of 1 GHz and up.

Wide Buses Move Data Fast

Very wide buses characterize chips like the Alpha--a 128-bit-wide cache interface and a 128-bit internal data bus allow very fast cache fills. However, with internal clocks running at 500 MHz and higher, and hundreds of I/O lines switching at 100 MHz or more, power consumption can exceed 40 W. (The Alpha 21264 has been estimated to dissipate a healthy 60 W.)

Leveraging off the UltraSPARC II architecture, designers at Sun are readying the UltraSPARC III CPU, a chip that employs a low-latency, non-stalling, 14-stage superpipeline that can run at clock speeds of over 600 MHz. About 16 million transistors will be used to implement a 64-kbyte data cache and a 32-kbyte instruction cache (both four-way set associative). Also implemented with some of those transistors is a pair of 2-kbyte, four-way set-associative caches associated with the data cache, one for prefetches and the other for writes.

Designed for operation at clock rates of 600 MHz and higher, the processor will also consume a considerable amount of power--about 70 W--even when powered by a 1.8-V supply. The deep 14-stage pipeline allows many instructions to be in various stages of execution, providing a high degree of parallelism along with the four-way instruction issue.

Until the UltraSPARC III becomes available later this year, the UltraSPARC II and IIi are the highest-performing members of the UltraSPARC family. The 300-MHz UltraSPARC II is a four-way superscalar processor that includes nine execution units--four integer, three floating-point, and two for graphics. It also offers dual 16-kbyte caches for instructions and data. To ensure that the pipeline is kept full, the processor includes a prefetch and dispatch unit that fetches instructions before they are needed in the pipeline. The instructions can be prefetched from all levels of the memory hierarchy. To allow prefetches across conditional branches, a branch-prediction scheme is implemented in hardware.
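The particular prediction scheme used in the UltraSPARC II is not described here. As a generic illustration only, the sketch below models a two-bit saturating-counter predictor, a common textbook approach. The table size, the indexing by program counter, and all of the names are assumptions.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, entries=512):
        self.entries = entries
        self.counters = [1] * entries  # start in the "weakly not taken" state

    def _index(self, pc):
        # Assumes 4-byte instructions, so the low two address bits are dropped.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Run a sequence of actual branch outcomes through the predictor."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch that is taken nine times and then falls through, run twice.
loop_branch = ([True] * 9 + [False]) * 2
misses = count_mispredictions(TwoBitPredictor(), 0x1000, loop_branch)
```

Because the counter needs two consecutive wrong guesses to change direction, the single fall-through at the end of a loop does not flip the prediction for the next pass. That keeps the prefetch stream on the likely path.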

All the execution units, the four-way instruction issue, and the 300-MHz clock speed permit the processor to deliver a peak throughput of over 1 BIPS. A slight variation on the UltraSPARC II is the recently released UltraSPARC IIi. In addition to all the basic features of the II, the IIi replaces the generic bus interface with a 66-MHz, 32-bit PCI host interface. Like the II, the IIi packs 16-kbyte instruction and data caches. Both the UltraSPARC II and IIi, as well as the III, include Sun's enhanced visual-instruction-set (VIS) extensions. These greatly accelerate many image-processing applications.

A five-way superscalar architecture is used in the TC86R10000 processor. Developed by the MIPS Division of Silicon Graphics (now MIPS Technology Inc.), it is made by both Toshiba and NEC. Its peak throughput exceeds 1 BIPS, thanks to the availability of five separate execution pipelines. Each execution unit contains seven stages and all tie into large, 32-kbyte, instruction and data caches to ensure that the pipelines don't run out of data. For the L2 cache, a controller is used to address an off-chip cache of 512 kbytes to 16 Mbytes.

Several MIPS-compatible processors provide two-way superscalar operation and can deliver peak throughputs of 300 to 500 Dhrystone MIPS when clocked at speeds of up to 250 MHz. They're available from IDT, NEC, QED, and Toshiba. For instance, NEC's VR5400 series offers three performance grades: the 250- and 200-MHz VR5464, with respective ratings of 519 and 415 MIPS, and the 347-MIPS 167-MHz VR5432. This latter device employs a 32-bit, rather than a 64-bit, memory interface to reduce system cost.

The chips employ a dual-issue superscalar architecture with dual 32-kbyte L1 caches. This permits an average of 1.7 instructions to be issued every cycle. They contain six execution units (including two unified integer/floating-point units), a non-blocking load/store unit, and a high-performance 32-bit-by-32-bit multiply accumulate unit. They also feature a 64-bit barrel shifter, a vector unit that supports multiple 8-bit-by-8-bit multiplications in a single cycle, and a branch unit. The processor instruction set also includes multimedia extensions (part of the vector unit), which let the processor easily handle graphics and image manipulation.

Also available from NEC is the VR5000, a 64-bit processor that was a precursor to the company's VR5400. The VR5000, though, has a maximum internal clock frequency of 200 MHz. That limits the throughput to a maximum of 282 MIPS.

Integrated Device Technology also has several RC5000-series devices in its family. Like the NEC VR5000, the chips include dual 32-kbyte instruction and data caches, a dual-issue floating-point ALU, and a five-stage pipeline. The peak clock rate and MIPS throughput, though, are higher than the NEC CPU's: 250 MHz and 330 MIPS (Dhrystone 2.1).

With an eight-stage superpipeline, which supports dual instruction issue and internal top clock frequencies of 250 MHz, the Toshiba TC86R4400 delivers a raw throughput of between 300 and 350 MIPS. Direct-mapped instruction and data caches of 16 kbytes each tie into the external L2 cache over a 128-bit-wide bus, which supports fast line refills. An on-chip memory-management unit employs a fully associative translation-look-aside buffer to handle variable page sizes, ranging from 4 kbytes to 16 Mbytes.
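A fully associative translation-look-aside buffer with variable page sizes can be pictured as a small list of entries, each carrying its own page size, all compared against the virtual address. The following is a simplified software model, not the TC86R4400's actual entry format: the tuple layout, the addresses, and the function name are assumptions.

```python
KB = 1024
MB = 1024 * KB


def translate(tlb, vaddr):
    """Look up vaddr in a list of (virtual_base, page_size, physical_base) entries.

    Page sizes are assumed to be powers of two, with bases aligned to the
    page size. Returns the physical address, or None on a TLB miss.
    """
    for virtual_base, page_size, physical_base in tlb:
        offset_mask = page_size - 1
        if (vaddr & ~offset_mask) == virtual_base:
            return physical_base | (vaddr & offset_mask)
    return None


# Two hypothetical mappings: one 4-kbyte page and one 16-Mbyte page.
tlb = [
    (0x00400000, 4 * KB, 0x00010000),
    (0x01000000, 16 * MB, 0x08000000),
]

small_page_hit = translate(tlb, 0x00400ABC)  # 12 offset bits pass through
large_page_hit = translate(tlb, 0x01234567)  # 24 offset bits pass through
miss = translate(tlb, 0x00401000)            # no entry covers this address
```

In hardware, every entry is compared in parallel rather than in a loop. A single 16-Mbyte entry covers the same address range as 4096 entries of 4 kbytes, which is why large pages help keep a small TLB from thrashing.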

Moving the PowerPC architecture up to the next level by implementing a superscalar execution unit, designers at IBM and Motorola have developed the MPC750/740 processors. These CPUs can issue two or more instructions every cycle. They pack large L1 instruction and data caches--32 kbytes each--that are eight-way set associative. When clocked at 266 MHz, the design can deliver a throughput of well over 300 MIPS. To get the high throughput, four instructions are simultaneously fetched from the instruction cache. Meanwhile, dual instructions are fetched from the branch-target instruction cache. The processor performs speculative execution with dynamic prediction. To improve the response to branches, they are processed upstream of the dispatch unit.

Prior to the 750/740, IBM and Motorola did enhance the 600-family architecture with a superscalar implementation in the form of the 603e CPU. This processor can deliver throughputs of over 300 Dhrystone MIPS.

Also moving its processor architecture into the superscalar realm to increase the MIPS rating, Hitachi has released the first details of its SH-4 RISC processor (SH7750) with on-chip, graphics-support circuitry. Starting with a five-stage pipeline and a two-way, superscalar instruction-issue scheme, the SH-4 can deliver a throughput of 360 MIPS and 1.17 GFLOPS (peak) when clocked internally at 167 MHz. To achieve the high throughput, designers integrated a 16-kbyte, direct-mapped data cache and an 8-kbyte, direct-mapped instruction cache. Additionally, to target the chip at applications that require a high degree of system integration, they also included various peripheral support functions--timers, DMA controller, real-time clock, serial communications interface, interrupt controller, and a flexible bus interface that allows bus transfers of 8-, 16-, 32-, and 64-bit data.

Like previous members in the SH family, the processor uses a 16-bit instruction format to achieve a high code density, thereby reducing the amount of off-chip memory needed to support the application program. The processor also includes a specialized support block that accelerates the computations required for 3D graphics and other multimedia applications.

Attached processors, as used by Hitachi in the SH-4, help supplement the CPU to handle application-specific tasks such as graphics or multimedia algorithms. Other processor manufacturers, such as DEC and NEC, have followed suit with superscalar CPUs that also have attached media processors. This month, for example, Intel, which recently acquired the StrongArm division from DEC, will unveil the details of a yet unreleased version of the StrongArm--the SA-1500. Detailed at the IEEE Hot Chips Conference at Stanford University, Stanford, Calif., the chip combines the SA110 StrongArm CPU core and a media processor, as well as other system support functions.

Able to run at 300 MHz, the processor can deliver about 300 integer MIPS. It can decode MPEG-2 MP@ML (main profile at main level) data streams at 200 MHz, with processing power to spare. Originally developed by DEC and described at the 1997 IEEE Hot Chips conference, the chip was targeted at applications such as set-top boxes, video games, high-speed modem banks, and video conferencing.

In addition to the enhanced version of the SA-110 CPU core, the chip includes a dual-issue, long-instruction-word attached media processor that assists with functions such as the MPEG decoding. Also on the chip are 16 kbytes of instruction cache, 16 kbytes of data cache, and a 4-kbyte writable control store to hold the attached media processor's long-instruction words. The chip also holds a 128-byte read buffer for the SA-110 core, and a 256-byte prefetch buffer for the media processor.

Expected later this year is a superscalar embedded processor developed by NEC as part of its proprietary V800 family--the V830R/AV. This highly integrated controller offers a two-way superscalar CPU that can execute 250 MIPS when clocked at 200 MHz. It also will be the first 32-bit embedded processor to incorporate the Rambus DRAM interface, providing a high-bandwidth, low-pin-count memory interface. Additionally, the V830R/AV will include a media coprocessor to permit the processor to handle image- and audio-processing algorithms.

NEC plans to integrate dual 16-kbyte caches on the chip, and various multimedia support functions to handle applications like 3D rendered graphics, digital-video disks, speech recognition, and sound and music synthesis. As part of the media support, the company has developed a special 64-bit media extension coprocessor that contains a single-instruction/multiple-data engine that can perform up to nine multiplication operations in parallel. The SIMD engine includes 56 media-optimized instructions such as saturated addition and subtraction, as well as multiply accumulate operations (Fig. 3).

3. Targeted at controlling and processing multimedia data, the V830R/AV developed by NEC Electronics contains not only a dual-issue superscalar CPU, but also a 64-bit single-instruction/multiple-data media engine. This engine can make quick work of complex computations needed to process images and audio data.
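Saturated addition, one of the media-optimized operations mentioned above, clamps a result at the largest representable value instead of letting it wrap around. The sketch below models it for eight unsigned 8-bit lanes in a 64-bit word. The lane width and the function name are assumptions chosen for illustration, not details of NEC's instruction set.

```python
def simd_saturating_add_u8(word_a, word_b):
    """Add eight unsigned 8-bit lanes of two 64-bit words, clamping each lane at 255."""
    result = 0
    for lane in range(8):
        shift = lane * 8
        x = (word_a >> shift) & 0xFF
        y = (word_b >> shift) & 0xFF
        result |= min(x + y, 0xFF) << shift
    return result


# Lane 0: 0xF0 + 0x20 would overflow, so it saturates at 0xFF.
# Lane 1: 0x10 + 0x20 fits, giving 0x30.
brightened = simd_saturating_add_u8(0x10F0, 0x2020)
```

For pixel or audio data this is usually the desired behavior: a bright pixel that is brightened further stays white rather than wrapping around to a dark value.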

Intel's 80960HA/HD/HT products provide a 32-bit dual-instruction-issue processor architecture, a 16-kbyte instruction cache, an 8-kbyte data cache, and 2 kbytes of internal general-purpose RAM. The core can deliver a maximum of 150 MIPS through the use of a sophisticated instruction scheduler. The scheduler lets the processor maintain a throughput of two instructions every core clock, and deliver a peak performance of three instructions per clock. Large, 128-bit-wide buses connect the instruction and data caches to the processor to maximize system throughput.

The processor employs the traditional load/store RISC architecture. But it also includes several optimized features for embedded control. For example, a high-speed interrupt controller can handle up to 240 external interrupts with 31 fully programmable priorities. There are also dual on-chip 32-bit timers, and a 32-bit demultiplexed burst bus. The bus incorporates per-byte parity generation and checking and address pipelining capability, as well as the ability to handle 8-, 16-, or 32-bit bus widths.

CISC Still Up For The Job

Despite the barrage of technical propaganda regarding high-performance RISC engines, don't rule out the venerable complex-instruction-set x86 products for applications with performance demands in the 100 to 300-MIPS range. Internally, many of these CPUs have moved to superscalar architecture. And, many architectural enhancements used by RISC processors have been applied to streamline instruction execution. The latest-generation chips also include multimedia and signal processing instruction-set extensions for 3D graphics and audio applications, putting them on par with many RISC processors. These products are available from Advanced Micro Devices, Cyrix/National Semiconductor/IBM, Centaur/IDT, and, of course, Intel.

Running at clock speeds of 266 to 400 MHz, the Pentium II, Xeon, and other x86 CISC CPUs (K6, 6x86, and WinChip) deliver throughputs competitive with, or even better than, many of the RISC solutions. Even the older 486, when clocked at 133 MHz (486/DX5 from AMD), will deliver performance comparable to MIPS-compatible RISC processors, such as the NEC4300i or the IDT 4640, running at 133 MHz (Fig. 4).

4. According to this benchmark data compiled by Advanced Micro Devices, the integer performance of CISC processors often exceeds that of RISC CPUs. Even the 486DX5-133 processor achieves a level of performance that's comparable to some 133-MHz RISC processors.

From a software developer's viewpoint, leveraging the x86 instruction set can be beneficial. It has a wide selection of low-cost development software. And, an abundance of programmers are familiar with the processor programming model. Software development costs can thus be kept low, while applications can be developed very quickly. However, there are potential limitations with the CISC approach when it comes to handling performance-critical, real-time applications. CISC processors use variable-length instructions. This may make dealing with real-time events and handling interrupts a little harder than with the simple single-cycle execution model used by most RISC processors. That's due to the potentially slower interrupt-response time, which is caused by the delay imposed while waiting for the current multicycle instruction to complete before switching contexts.

In addition to the off-the-shelf x86 processors that come from the desktop world, AMD and several other companies have developed highly integrated low-power versions of the x86. They contain many of the system I/O and support functions, yet consume just a few hundred milliwatts. When clocked at up to 100 MHz, chips like the Elan SC400 and 410 from AMD provide computational throughputs comparable to moderate-performance RISC processors.


High-Performance Embedded Controller Manufacturers

Advanced Micro Devices Inc.

5204 E. Ben White Blvd.
Austin, TX 78741
(512) 602-4135
http://www.amd.com

Alpha Processor Inc.

1900 West Park Dr.
Westborough, MA 01581
(508) 366-5050
http://www.samsungsemi.com

Centaur/Winchip (See Integrated Device Technology Inc.)

Cyrix Corp. (See National Semiconductor Corp.)

Digital Equipment Corp. (See Intel Corp. for the StrongArm CPU and Alpha Processor Inc. for the Alpha CPU)

Hitachi Semiconductor Corp.

2000 Sierra Point Parkway
Brisbane, CA 94005
(800) 285-1601
http://www.hitachi.com

IBM Microelectronics Inc.

1580 Route 52
Hopewell Junction, NY 12533
(800) 426-3333
http://www.ibm.com

Integrated Device Technology Inc.

2975 Stender Way
Santa Clara, CA 95052
(800) 345-7015
http://www.idt.com

Intel Corp.

5000 W. Chandler Blvd.
Chandler, AZ 85226
(602) 554-8080
http://developer.intel.com

MIPS Technology Inc.

1225 Charleston Road
Mountain View, CA 94043
(650) 567-5000
http://www.sgi.com

Motorola Inc.

6501 William Cannon Drive West
Austin, TX 78735
(512) 895-3260
http://www.motorola.com

National Semiconductor Corp.

2900 Semiconductor Drive
Santa Clara, CA 95052
(800) 272-9959
http://www.national.com

NEC Electronics Corp.

2880 Scott Blvd.
Santa Clara, CA 95052-8062
(408) 588-6340
http://www.nec.com

Quantum Effect Design Inc.

3255-3 Scott Blvd., Ste. 200
Santa Clara, CA 95054
(408) 565-0357
http://www.qedinc.com

Rise Technology Company

2451 Mission College Blvd.
Santa Clara, CA 95054
(408) 330-8800
http://www.rise.com

Siemens Components Inc.

10950 N. Tantau Ave.
Cupertino, CA 95014
(408) 777-4500
http://www.siemens.com

STMicroelectronics Inc.

10 Maguire Road
Lexington, MA 02421
(781) 861-2650
http://www.st.com

Sun Microelectronics Inc.

901 San Antonio Road
Palo Alto, CA 94303
(408) 544-0410
http://www.sun.com

Toshiba America Electronic Components Corp.

9775 Toledo Way
Irvine, CA 92618
(714) 455-2000
http://www.toshiba.com/taec

 
