Multiple cores deliver performance with lower power requirements, but processors can’t contribute much if they’re idle.
Across the embedded landscape, the design credo has become “more cores.” However, challenges remain when it comes to the software side. Some hardware architectures can deliver dozens of cores, while others hit thousands of cores. Unfortunately, applications don’t always port easily across different architectures.
For the low end of the embedded space, single-core solutions will remain. It’s still possible to move up the power and performance curve by moving to faster or wider processors. At the high end, though, multiple cores are the way to go.
This is why double-precision floating point crops up often and the reason these solutions often wind up in supercomputers. In fact, desktop and rack-mount systems like ones from Nvidia bring this level of processing power to the masses (see “Standard GPU Cluster Provides High Performance In The Mid-Range”).
Another matter that crops up when discussing software and multicore architectures is virtualization. Not all multicore platforms support virtualization, but it opens up opportunities. And while it does make hardware design more challenging, it typically simplifies software and application management.
The Xeon Nehalem-EX is Intel’s top-of-the-line octal-core symmetric multiprocessing (SMP) platform. Multichip solutions like an eight-chip, 64-core system utilize the high-speed QuickPath point-to-point interconnect to link processor and peripheral controllers together (Fig. 1).
This architecture will be familiar to those using AMD’s Opteron processors with HyperTransport links. In both cases, the simplest configuration is a single processor linked to a single peripheral controller by a single high-speed link.
Both vendors implement a form of cache-coherent, non-uniform memory access (ccNUMA) in addition to a distributed memory subsystem. Each processor chip has its own memory controllers plus L1, L2, and L3 caches. Any chip can access memory in any other chip by using the high-speed links. Of course, data that’s further from the requester will take more time to access.
These high-speed links are also used on consumer devices, but a single link to an I/O hub is usually all that is necessary. On the other hand, servers generate significant traffic between processor chips for shared memory access. Chip-to-chip traffic and cache management is critical to efficient operation.
One of the key features in AMD’s new Istanbul Opteron, HT Assist, optimizes the memory request and response process to minimize the number of transactions involved, freeing up bandwidth for other traffic (Fig. 2). HT Assist actually tracks data movement among cores and caches, allowing a request to be serviced by the nearest core with the required data.
The worst-case scenario is when data must be accessed from off-chip memory by the chip that owns that memory space. The best-case scenario is finding the data in the cache of the chip running the thread needing the data. Intermediate scenarios will have cores getting the data from an adjacent chip’s cache.
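The ordering of these scenarios can be captured in a toy cost model. This Python sketch is purely illustrative; the cycle counts are hypothetical placeholders chosen to show the relative ordering, not measurements of any real part:

```python
# Toy ccNUMA access-cost model. The cycle counts are hypothetical
# placeholders that show the ordering, not measured values.
CACHE_HIT = 10    # data found in some chip's cache
DRAM_READ = 100   # data read from a chip's attached DRAM
LINK_HOP = 40     # extra latency per chip-to-chip link traversed

def access_cost(hops, cached):
    """Estimated cycles to reach data `hops` links away."""
    return (CACHE_HIT if cached else DRAM_READ) + LINK_HOP * hops

best = access_cost(0, True)     # hit in the local cache
worst = access_cost(1, False)   # DRAM behind a remote chip
```

The model makes the intermediate cases explicit: a hit in an adjacent chip’s cache costs more than a local hit but less than any DRAM access.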
The system becomes more complex when virtualization and caches are brought into play, leading to variable latency for data that’s hard to determine. This can be an issue in deterministic embedded applications, but is less so in most server applications where speed is desired over fine-grain determinism.
Programmers now seek these platforms because they greatly simplify the programming task. Likewise, applications can exploit the growing number of cores, assuming the application can utilize enough threads efficiently.
Efficient use of multicore systems isn’t as easy as it might look. Cache size and locality of reference within an application’s working data set can affect how well a particular algorithm will run.
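The locality effect can be demonstrated with a toy direct-mapped cache model. This sketch (parameters are illustrative, not tied to any chip discussed here) counts misses for two access patterns over the same number of accesses:

```python
def misses(addresses, lines=64, line_bytes=64):
    """Count misses in a toy direct-mapped cache (64 lines of 64 bytes)."""
    cache = [None] * lines
    count = 0
    for addr in addresses:
        line = addr // line_bytes    # which memory line holds this address
        slot = line % lines          # the one cache slot it may occupy
        if cache[slot] != line:      # not resident: fetch it, count a miss
            cache[slot] = line
            count += 1
    return count

# 1,024 accesses each; only the access pattern differs.
sequential = misses(range(0, 4096, 4))        # stride 4: dense, reuses lines
strided = misses(range(0, 256 * 1024, 256))   # stride 256: a new line each time
```

The sequential walk misses once per 64-byte line (64 misses), while the strided walk misses on every one of its 1,024 accesses, even though both touch the same number of addresses.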
AMP APPLICATION PROCESSORS
Symmetric multiprocessing (SMP) architectures come in handy for many embedded applications, but asymmetric multiprocessing (AMP) can also be useful. AMP configurations show up in a number of guises, from Texas Instruments’ OMAP (Open Multimedia Application Platform) to Freescale’s P4080 QorIQ (Fig. 3).
TI’s OMAP 44xx combines an ARM Cortex-A9, a PowerVR SGX 540 GPU, a C64x DSP, and an Image Signal Processor. Each core has a dedicated function. Communication between processors isn’t symmetrical.
The OMAP operates only in an AMP mode. The P4080, on the other hand, is an SMP system at heart with the ability to partition its cores into an AMP mode as well. The eight-core chip can run as eight independent cores, or it can be combined in a number of configurations (e.g., a pair of two-core SMP subsystems or four single-core subsystems).
The main high-level architectural difference between the OMAP and P4080 is that the OMAP functions are fixed and the cores are optimized for their respective chores. This makes programming significantly easier since partitioning the application to particular cores will be based on matching functionality.
Each subsystem’s level of performance is limited by the architecture, whereas the P4080 can adjust its partitioning, though this is normally done when the system starts up. A system designer can adjust the allocation of cores in the P4080, assuming sufficient cores are available. QorIQ platforms with fewer cores are on the market, too, allowing a more economical chip to be used.
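One way to think about that allocation step is as a simple validity check over core groupings. The rule set in this Python sketch is ours, for illustration only, not Freescale’s configuration interface:

```python
def valid_partition(groups, total_cores=8):
    """True when a proposed allocation assigns each of the part's cores
    to exactly one subsystem. Illustrative rule set, not a vendor API."""
    return all(g >= 1 for g in groups) and sum(groups) == total_cores

# Configurations like those mentioned in the text:
eight_independent = valid_partition([1] * 8)          # eight single cores
mixed = valid_partition([2, 2, 1, 1, 1, 1])           # two SMP pairs plus
                                                      # four single-core subsystems
```

A request that oversubscribes the part, such as three four-core subsystems on an eight-core chip, fails the check.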
IBM’s Cell processor fits somewhere in between (see “Cell Processor Gets Ready To Entertain The Masses”). It incorporates a 64-bit Power core plus eight Synergistic Processing Elements (SPEs). The SPEs are identical (each with 256 kbytes of memory), and they operate in isolation unlike the shared-memory SMP systems already discussed. There is no virtual memory support or caching within the SPEs.
This has advantages and disadvantages for hardware and software design. The approach simplifies the hardware implementation but complicates the software from a number of perspectives. For example, memory management is under application control, as is communication between cores. Data must be moved into the local memory of an SPE before it can be manipulated.
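The resulting programming style, streaming data through a small local store chunk by chunk, can be sketched in a few lines. This Python model is an illustration only; the slice copies stand in for the DMA transfers a real SPE program issues, and the function names are ours:

```python
LOCAL_STORE = 256 * 1024   # bytes of local memory per SPE (from the article)

def run_on_spe(data, kernel, chunk=16 * 1024):
    """Stream a large buffer through a fixed-size local store.
    The copies stand in for DMA transfers; the kernel only ever
    touches data that has already been brought into local memory."""
    assert chunk <= LOCAL_STORE
    out = bytearray()
    for i in range(0, len(data), chunk):
        local = bytes(data[i:i + chunk])   # "DMA in" to the local store
        out += kernel(local)               # compute strictly on the copy
    return bytes(out)
```

Real SPE code goes further, double-buffering so the next chunk’s transfer overlaps the current chunk’s computation, but the explicit move-then-compute structure is the same.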
Fully exploiting architectures like the Cell takes time because they differ from the more conventional SMP or AMP platforms. The improvement in software on Cell-based platforms like Sony’s PlayStation 3 over the years highlights the changes in programming techniques and experience.
DIVIDE AND CONQUER
Changing programming techniques is key to success in using graphical processing units (GPUs). GPUs from the likes of ATI and Nvidia expose hundreds of cores in a single chip. These GPUs can be combined into multichip solutions, providing developers with thousands of cores. For instance, four Nvidia Tesla T10s packed into a 1U chassis deliver 960 cores (Fig. 4).
Programming the Tesla or any of the other compatible Nvidia GPU chips can be a challenge, but frameworks like Nvidia’s CUDA, or runtimes built on CUDA, make the job much easier. Part of the challenge is the single-instruction, multiple-thread (SIMT) architecture of Nvidia’s GPU (see “SIMT Architecture Delivers Double-Precision TeraFLOPS”). Like many high-performance systems, it likes to work with arrays of data. This works well for many applications but not all, which is one reason why GPUs are often matched with a multicore CPU.
CUDA and OpenCL (Open Computing Language), another parallel programming framework, match the GPU approach of using memory separate from the host processor’s. This means data must be moved from one place to another before it can be manipulated. Both extend the C programming language, but they also impose restrictions: kernel code cannot use recursion or function pointers, for example. Some of these restrictions stem from the SIMT approach.
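The copy-in/compute/copy-out flow these frameworks impose can be sketched without a GPU. In this Python illustration, a small class stands in for device memory; real code would use calls such as cudaMalloc and cudaMemcpy (CUDA) or OpenCL buffer APIs, and the names here are ours:

```python
class DeviceBuffer:
    """Toy stand-in for GPU memory. Host data must be explicitly
    copied in before a kernel runs and copied back out afterward."""
    def __init__(self, n):
        self._data = [0.0] * n
    def copy_in(self, host):
        self._data[:] = [float(v) for v in host]
    def copy_out(self):
        return list(self._data)

def saxpy(a, x_host, y_host):
    """Host-side flow: allocate device buffers, copy in, 'launch', copy out."""
    x, y = DeviceBuffer(len(x_host)), DeviceBuffer(len(y_host))
    x.copy_in(x_host)
    y.copy_in(y_host)
    # "Kernel" body: one logical thread per element. Note there is no
    # recursion and no function pointer, matching the restrictions above.
    y._data = [a * xi + yi for xi, yi in zip(x._data, y._data)]
    return y.copy_out()
```

The transfers are pure overhead, which is why the data-movement cost has to be amortized over enough computation for a GPU to pay off.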
Many applications utilize CUDA. Performance gains compared to conventional SMP platforms vary quite a bit, though, from a factor of two to 100. One reason for this variance is that threads work best if they’re running in groups of 32. Branches don’t hurt performance as long as all 32 threads in a group follow the same branch.
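The branch behavior can be modeled simply: when a lockstep group of 32 threads splits at a branch, the hardware runs the taken and not-taken paths one after the other. This Python sketch is a conceptual model, not vendor documentation:

```python
WARP = 32   # threads that execute in lockstep as one group

def branch_passes(taken):
    """Passes needed to retire one branch for a 32-thread group.
    If every thread agrees, one pass suffices; if they split,
    both sides of the branch execute serially."""
    assert len(taken) == WARP
    return len(set(taken))   # 1 when uniform, 2 when divergent

uniform = branch_passes([True] * WARP)                 # one pass
divergent = branch_passes([True] * 16 + [False] * 16)  # two passes
```

A kernel full of divergent branches can therefore run at a fraction of peak throughput even though every core is nominally busy.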
Specialized processors like GPUs are one approach to both graphics and multicore processing. Another approach is to use many conventional cores, such as Intel’s Larrabee (Fig. 5). Larrabee uses x86-compatible cores that are optimized for vector processing (see “Intel Makes Some Multicore Lemonade”).
In one sense, Larrabee is similar to IBM’s Cell processor. The Larrabee cores only have access to the 32-kbyte L1 and 256-kbyte L2 caches. If data isn’t in the cache, it must be requested from the memory controller or another cache within the system. The data is then placed into the core’s cache, and the application proceeds on its merry way.
A ring bus is used to communicate between cores and controllers. IBM’s Cell Element Interconnect Bus (EIB) is also a ring bus connecting the SPEs to the memory controller and peripheral interface. From a programming perspective, Larrabee’s cache and the Cell’s SRAM differ significantly.
Still, to programmers, Larrabee appears as an array of cache-coherent x86 processors, and its GPU orientation means developers can take advantage of DirectX and OpenGL support.
Multicore chips are also common pieces of the networking infrastructure puzzle. Handling 10-Gbit/s networks is a challenge all by itself. Analyzing and massaging data from a network connection at line speed requires lots of processing power.
Netronome’s NFP-3200 Network Flow Processor contains 40 1.4-GHz cores that can run eight threads each, providing 320 hardware-based threads in one chip. This is on the same order as GPUs, but the processors are slanted toward packet processing.
Like IBM’s Cell, the NFP-3200 has a master CPU-style controller. In this case, it’s an ARM11 core. Its 40 cores, also called microengines, are compatible with Intel’s IXP28xx architecture, which was developed for network processing. This compatibility is important because a good deal of code targets this architecture. Older chips had fewer cores, so in a sense the NFP-3200 offers more of the same.
Of course, simply tossing more cores at the problem is just one course of action. Netronome incorporates a host of improvements, such as enhanced microblocks with TCP offload support. The interconnect speeds are higher as well, running at 44 Gbits/s between cores.
Netronome’s chip has a number of specialized processors, including its cryptography system, which handles the ubiquitous security protocols. Its PCI Express interface supports the I/O virtualization often used with x86 processors, so an x86 host can sit right next to the NFP-3200 instead of being separated by another network link.
Programming the NFP-3200 is often less of an issue compared to other multicore chips because of the large amounts of existing code for the IXP28xx family. In addition, Netronome provides libraries that make the creation of network-processing applications more a matter of tying together modules.
Cavium’s Octeon II is a more conventional SMP multicore design with two to six 64-bit MIPS64 cores connected by a crossbar switch (see “Multicore Chip Handles Broadband Packet Processing”). Like Netronome’s chip, the Octeon II is designed for network and storage devices.
Also, the Octeon II has a RAID 5/6 accelerator as well as Hyper Finite Automata (HFA) support of regular expressions for packet inspection. Programming the Octeon II is comparable to most SMP systems. It can run operating systems such as Linux.
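The HFA block accelerates the same kind of pattern matching that software normally does with a regular-expression library. A minimal software equivalent looks like this Python sketch; the signatures here are made up for illustration, and real deep-packet-inspection rule sets are far larger:

```python
import re

# Hypothetical inspection signatures: an HTTP probe and a NOP sled.
SIGNATURES = [re.compile(p) for p in (rb"GET /admin", rb"\x90{8,}")]

def inspect(payload):
    """Flag a packet whose payload matches any compiled signature."""
    return any(sig.search(payload) for sig in SIGNATURES)
```

Running many such patterns against every packet at 10 Gbits/s is exactly the workload the hardware automata are built to take off the cores.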
OTHER MULTICORE ARCHITECTURES
Moving to more radical multicore architectures adds to programming chores. However, it opens opportunities for developers who can take advantage of the new architectures.
The IntellaSys SEAforth 40C18 fits in this category (Fig. 6). Its native programming language is VentureForth (see “Parallel Programming Is Here To Stay”). Instructions are five bits long, with four packed into a single 18-bit word (the fourth slot holds only 3 bits). The 40C18’s 40 cores have identical processing units with 64 words of RAM and 64 words of ROM. That’s not a lot of space, though at four instructions per word it does translate to 256 instructions.
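The packing arithmetic works out as three 5-bit slots plus one 3-bit slot (5 + 5 + 5 + 3 = 18). This Python sketch shows one way to pack and unpack such a word; the bit positions are an assumption for illustration, not the chip’s documented encoding:

```python
# Pack four opcodes into one 18-bit word: three 5-bit slots plus a
# 3-bit slot. Bit layout is illustrative, not the actual encoding.
def pack(a, b, c, d):
    assert all(0 <= op < 32 for op in (a, b, c)) and 0 <= d < 8
    return (a << 13) | (b << 8) | (c << 3) | d

def unpack(word):
    return (word >> 13) & 31, (word >> 8) & 31, (word >> 3) & 31, word & 7
```

Only opcodes small enough for 3 bits can occupy the final slot, which is why one instruction in the set is just 3 bits long.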
Obviously, programming the 40C18 will be dramatically different from programming a chip with more storage, like Intel’s Larrabee or IBM’s Cell. The 40C18 cores consume less than 9 mW, whereas the other two chips don’t work well without massive heatsinks and a fan or two. The 40C18 is designed for embedded and even mobile applications.
Programming the 40C18 will be different for most developers, and not just because Forth is the programming language. Each core’s small memory space and the matrix interconnect changes program design methods. Cores typically run small functions that pass data onto one or more neighbors, so cooperative programming is the way to go.
Even external memory accesses require three cores working together. This makes sense when there are many cores to work with. The 40C18 also has the unique ability to send a small program of four instructions in a single word to be executed by a neighboring core. That is actually enough space to perform a block transfer.
The XMOS XS1-G4 is an interesting mix based on 32-bit integer Xcores (see “Multicore And Soft Peripherals Target Multimedia Applications”). Each Xcore can handle a number of different threads with a hardware-based event system that facilitates XMOS’s soft peripherals. Like the 40C18, the XS1-G4 can wait on an I/O port. The difference is that the XS1-G4 handles multiple threads whereas the IntellaSys chip works with one.
Developers can use XC, an extended version of C, to get the most out of XMOS hardware. Extensions provide shortcuts to the hardware support, which also includes XLinks. The XLinks connect the four cores in the chip and provide four off-chip links. As a result, multiple chips can be connected. Internally, the chip uses a switch for the XLink connection, but the hardware and software provide a uniform interface for interprocessor communication.
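The channel-style communication XC provides can be approximated in ordinary code. In this Python sketch, a bounded thread-safe queue stands in for an XLink, and put/get stand in for channel send and receive; all names are ours, not XMOS APIs:

```python
import queue
import threading

def run_linked(items):
    """Two 'cores' joined by a channel. A shallow queue stands in
    for an XLink so the sender can't run far ahead of the receiver."""
    ch = queue.Queue(maxsize=1)
    results = []

    def core_a():                 # producer core
        for item in items:
            ch.put(item)          # send over the link
        ch.put(None)              # end-of-stream marker

    def core_b():                 # consumer core
        while (item := ch.get()) is not None:
            results.append(item * 2)

    threads = [threading.Thread(target=core_a), threading.Thread(target=core_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The uniform interface the article describes means the same send/receive pattern works whether the two threads share a chip or sit on opposite ends of an off-chip XLink.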
Each core has 64 kbytes of memory. This is more than the 40C18 but less than some of the higher-performance chips covered here. Still, it is sufficient for a significant amount of application code, allowing a more conventional thread approach to programming. The bulk of programming for the XMOS chip is likely to be in conventional C or C++ rather than XC, which tends to be used for communication and peripheral handling.
The chip won’t present a challenge to double-precision floating-point GPUs or other high-end systems, but its integer and fixed-point DSP support lends itself to many other audio- and video-processing functions. Linked XMOS chips are already used to drive multiple large-screen LCDs.
Multicore architectures continue to proliferate. Programming these cores efficiently and choosing the right one isn’t necessarily easy, but it will become more common even for embedded developers. Legacy applications will tend to migrate to architectures that match their existing hosts. More radical departures are possible when the applications are being redesigned or created from scratch.