High performance is relative. A 64-bit processor is typically faster than a 32-bit processor, but not always. A 3-GHz, 32-bit processor will run rings around a 200-MHz, 64-bit processor, but speed is only part of the puzzle. Power consumption, compact code, small physical size, and reliability all come into play when considering a system design. That 200-MHz, 64-bit processor may be just the thing for a high-performance MP3 player.
Processor designers deal with a variety of tradeoffs when developing high-performance systems. While most companies claim that all their features deliver high performance, only some tend to be unique. For example, the use of caching is almost universal for 32- and 64-bit processors. Likewise, single-instruction, multiple-data (SIMD) instructions are the norm for processors in multimedia environments.
Target-specific features like SIMD multimedia support are often included in a processor design because they give software developers a flexible programming environment. This is taken to the extreme with Java hardware acceleration where an entire environment gains from this support (see "Hardware Speeds Up Java," right). Java acceleration highlights the way that a feature can be implemented in varying degrees across a range of solutions to provide different levels of performance.
The table shows a range of features that we will examine in more detail. They span the processor spectrum from 8-bit microcontrollers through 64-bit multiprocessors. Some features address system interconnects using new high-speed interconnect standards like HyperTransport and RapidIO that are necessary if fast processors are to quickly exchange data with the outside world (see "Getting Data On- And Off-Chip Faster," p. 50).
A 64-bit device usually implies a high-performance system that often uses high-speed interconnect technology. This is a good place to begin examining high-performance processor features.
64 Bits—Big Integers, Address Space: All computing could be done with a 1-bit processor if it were infinitely fast. Because that's not the case, designers and programmers have pushed register sizes upward ever since 4-bit processors were born. The 64-bit powerhouses have proven their worth in areas like high-end servers and workstations that require a large address space, and in embedded environments where integer computations benefit from a large number of bits.
A wide, 64-bit word means that a single instruction can handle an address calculation and a single register can hold a full address. This reduces the number of instructions that must be executed to finish a job, increasing overall performance.
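The instruction savings can be illustrated with a minimal sketch. It assumes a hypothetical machine whose registers are 32 bits wide, so a 64-bit sum has to be built from two halves plus a carry. The function names are invented for illustration and do not correspond to any processor's instruction set.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def add64_on_32bit(a, b):
    """Add two 64-bit values using only 32-bit-wide pieces (illustrative)."""
    # Split each operand into low and high 32-bit halves.
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    # Step 1: add the low halves and capture the carry out of bit 31.
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 32
    lo = lo_sum & MASK32
    # Step 2: add the high halves plus the carry, wrapping at 32 bits.
    hi = (a_hi + b_hi + carry) & MASK32
    # Recombine the two halves into one 64-bit result.
    return (hi << 32) | lo

def add64_on_64bit(a, b):
    """The same sum when the register is 64 bits wide: one wrapped add."""
    return (a + b) & MASK64

# Both paths agree, but the narrow path needs two dependent adds.
x = 0x00000001FFFFFFFF
y = 0x0000000000000001
print(hex(add64_on_32bit(x, y)), hex(add64_on_64bit(x, y)))
```

On a narrow machine, the two steps typically map to an add followed by an add-with-carry. The wide machine needs only one add, which is the kind of reduction described above.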
The MIPS64 from MIPS cuts the number of instructions required with its SIMD floating-point support. Most processors with SIMD support handle only byte or integer data. The MIPS64 floating-point support eliminates the need for specialized floating-point hardware for applications such as radar data processing.
Of course, integer SIMD is still very important in a variety of environments, particularly for handling multimedia data. The SH5 from SuperH implements a four-way SIMD processing unit that greatly speeds up computations on packed data.
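To make the idea of integer SIMD concrete, here is a minimal software sketch of a four-lane add. It assumes four 16-bit lanes packed into one 64-bit word, which is an illustrative layout and not a description of the SH5 instruction set. The helper names are hypothetical. SIMD hardware performs the equivalent of `simd_add` in a single instruction.

```python
LANE_BITS = 16
LANES = 4
LANE_MASK = (1 << LANE_BITS) - 1
HIGH_BITS = 0x8000800080008000  # top bit of every 16-bit lane
LOW_BITS = 0x7FFF7FFF7FFF7FFF   # remaining 15 bits of every lane

def pack(values):
    """Pack four 16-bit integers into one 64-bit word, lane 0 lowest."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word

def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]

def simd_add(a, b):
    """Add all four lanes at once; each lane wraps modulo 2**16."""
    # Adding only the low 15 bits keeps any carry inside its own lane.
    low = (a & LOW_BITS) + (b & LOW_BITS)
    # The top bit of each lane is restored with XOR, so nothing
    # spills over into the neighboring lane.
    return low ^ ((a ^ b) & HIGH_BITS)

left = pack([1, 2, 3, 4])
right = pack([10, 20, 30, 40])
print(unpack(simd_add(left, right)))  # [11, 22, 33, 44]
```

Four sample values, such as four audio samples, are processed in the time a scalar design would spend on one.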
Intel chose a more radical approach to speeding up the system with its Itanium and the very-long-instruction-word (VLIW) Explicitly Parallel Instruction Computing (EPIC) architecture. EPIC gives the compiler the job of scheduling instructions onto the processor's multiple computational units, instead of having the processor do this during program execution. The theory is that a compiler performing extensive static analysis of a program can do a better job than a processor could while the program is running. Unfortunately, Itanium sacrifices performance when it runs 32-bit x86 programs in compatibility mode. But most EPIC systems are expected to run new 64-bit code.
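Compile-time scheduling can be sketched with a toy example. The code below is not Intel's algorithm. It is a greedy scheduler written for illustration, and it assumes that every register is written only once and that each result is available to the next bundle. The `schedule` function and its `width` parameter (the number of computational units) are hypothetical.

```python
def schedule(instructions, width):
    """Group instructions into bundles that could issue in parallel.

    instructions: list of (dest_register, [source_registers]) tuples
    width: number of instructions one bundle may hold
    Returns a list of bundles, each a list of instruction indices.
    """
    bundles = []
    ready_at = {}  # register -> first bundle in which its value is usable
    for idx, (dest, sources) in enumerate(instructions):
        # An instruction cannot issue before all of its inputs exist.
        slot = max((ready_at.get(r, 0) for r in sources), default=0)
        # Skip bundles that are already full.
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(idx)
        ready_at[dest] = slot + 1
    return bundles

program = [
    ("a", []),          # 0: load a
    ("b", []),          # 1: load b
    ("c", ["a", "b"]),  # 2: c = a + b
    ("d", []),          # 3: load d
    ("e", ["c", "d"]),  # 4: e = c * d
]
print(schedule(program, width=2))  # [[0, 1], [2, 3], [4]]
```

Five instructions fit into three bundles because the scheduler can see ahead of time that instruction 3 does not depend on instruction 2. A processor that schedules at run time has to discover the same fact in hardware.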
The EPIC architecture has been used primarily in high-end servers and workstations because of its high power requirement and cost. It probably won't find its way into embedded applications, but its architectural features may. Transmeta's Crusoe uses a VLIW architecture but hides its existence from programmers by presenting an x86 execution engine. The result is a low-power system that's ideal for embedded and portable applications.
IBM's PowerPC also tries to keep its processor humming with fast memory accesses, and it uses new process technology to keep things moving. Its copper connections and silicon-on-insulator technology reduce component size and increase connection speed, both key features for embedded applications.
AMD's Opteron employs a range of high-performance features. One of its more novel components is its use of HyperTransport for shared memory access and peripheral support. The processor contains not one, but three HyperTransport links. The chip has a built-in dynamic memory controller, so each processor shares local memory with other processors via the HyperTransport links.
The memory architecture is called ccNUMA, for cache-coherent nonuniform memory access. Caches on all processors maintain up-to-date information. Access is nonuniform because accesses to nonlocal memory take more time than accesses to local memory.
Nonlocal data is forwarded through a mesh of processors with HyperTransport links until it reaches its destination. The impact of this overhead is minimized by caching, moving blocks of data in a packet, and prefetching. As a result, effective nonlocal access times are only a fraction longer than local access times.
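A back-of-the-envelope model shows how caching shrinks the nonlocal penalty. Every number below is an assumption chosen for easy arithmetic, not a measured Opteron figure, and the function and parameter names are hypothetical. The model charges a fixed cost per HyperTransport hop and assumes cache hits avoid memory entirely.

```python
def effective_latency_ns(local_ns, hop_ns, hops,
                         remote_fraction, cache_hit_rate, cache_ns):
    """Average access time for a simple ccNUMA model (illustrative only)."""
    # A nonlocal access pays the local memory time plus each hop.
    nonlocal_ns = local_ns + hop_ns * hops
    # Cache misses split between local and nonlocal memory.
    miss_ns = ((1.0 - remote_fraction) * local_ns
               + remote_fraction * nonlocal_ns)
    # Cache hits never reach memory at all.
    return cache_hit_rate * cache_ns + (1.0 - cache_hit_rate) * miss_ns

# Assumed figures: 100-ns local memory, 50 ns per hop, 2 hops,
# half of all misses going to nonlocal memory, 10-ns cache.
no_cache = effective_latency_ns(100, 50, 2, 0.5, 0.0, 10)
with_cache = effective_latency_ns(100, 50, 2, 0.5, 0.5, 10)
print(no_cache, with_cache)  # 150.0 80.0
```

In this model the average cost falls as the hit rate rises, because fewer accesses ever cross a link. Block transfers and prefetching are not modeled here.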