[Technology Report]
Match Multicore With Multiprogramming
Multiple cores deliver performance with lower power requirements, but processors can’t contribute much if they’re idle.
Netronome’s chip has a number of specialized processors, including a cryptography system that handles the ubiquitous security protocols. Its PCI Express interface supports the I/O virtualization often used by x86 processors. The x86 host can therefore sit next to the NFP-3200 instead of being separated from it by another network link.
Programming the NFP-3200 is often less of an issue than with other multicore chips because of the large amount of existing code for the IXP28xx family. In addition, Netronome provides libraries that make creating network-processing applications more a matter of tying modules together.
Cavium’s Octeon II is a more conventional SMP multicore design with two to six 64-bit MIPS64 cores connected by a crossbar switch (see “Multicore Chip Handles Broadband Packet Processing”). Like Netronome’s chip, the Octeon II is designed for network and storage devices.
The Octeon II also has a RAID 5/6 accelerator as well as Hyper Finite Automata (HFA) support for the regular expressions used in packet inspection. Programming the Octeon II is comparable to programming most SMP systems, and it can run operating systems such as Linux.
OTHER MULTICORE ARCHITECTURES
Moving to more radical multicore architectures adds to programming chores. However, it opens opportunities for developers who can take advantage of the new architectures.
The IntellaSys SeaFORTH 40C18 fits in this category (Fig. 6). Its native programming language is VentureForth (see “Parallel Programming Is Here To Stay”). Instructions are nominally five bits wide, with four instructions packed into a single 18-bit word, so the fourth instruction slot is only 3 bits long. The 40C18’s 40 cores have identical processing units, each with 64 words of RAM and 64 words of ROM. That’s not a lot of space, though 64 words do translate to 256 instructions.
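The bit budget described above can be checked with a short sketch. It is written in Python purely for illustration and assumes a simple layout of three 5-bit slots followed by one 3-bit slot, with the first slot in the high bits. The function names and the slot ordering are hypothetical and do not represent the 40C18’s actual instruction encoding.

```python
# Illustrative layout only (an assumption, not the real 40C18 encoding):
# three 5-bit slots followed by one 3-bit slot.
SLOT_WIDTHS = (5, 5, 5, 3)
WORD_BITS = sum(SLOT_WIDTHS)  # 5 + 5 + 5 + 3 = 18

def pack_word(slots):
    """Pack four slot values into one 18-bit word, first slot in the high bits."""
    if len(slots) != len(SLOT_WIDTHS):
        raise ValueError("expected exactly four slot values")
    word = 0
    for value, width in zip(slots, SLOT_WIDTHS):
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word

def unpack_word(word):
    """Split an 18-bit word back into its four slot values."""
    slots = []
    for width in reversed(SLOT_WIDTHS):
        slots.append(word & ((1 << width) - 1))
        word >>= width
    return tuple(reversed(slots))

# 64 words of memory, four instruction slots per word.
WORDS_PER_MEMORY = 64
INSTRUCTIONS_PER_MEMORY = WORDS_PER_MEMORY * len(SLOT_WIDTHS)  # 256
```

The last slot can hold only eight distinct values in this sketch, which is why the fourth instruction in a word is more restricted than the other three.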
Programming the 40C18 will be dramatically different from programming a chip with more storage, such as Intel’s Larrabee or IBM’s Cell. The 40C18 cores consume less than 9 mW, whereas the other two chips don’t work well without massive heatsinks and a fan or two. The 40C18 is designed for embedded and even mobile applications.
Programming the 40C18 will be different for most developers, and not just because Forth is the programming language. Each core’s small memory space and the matrix interconnect change program design methods. Cores typically run small functions that pass data on to one or more neighbors, so cooperative programming is the way to go.
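The style of cooperative programming described above, where each core runs one small function and hands its result to a neighbor, can be sketched as a pipeline. The following Python sketch is only an analogy: threads stand in for cores and queues stand in for the links between neighbors. The names `stage` and `run_pipeline` are hypothetical, and nothing here reflects VentureForth or the 40C18’s actual port behavior.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down

def stage(func, inbox, outbox):
    """Run one small function per 'core', forwarding results to the neighbor."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # pass the shutdown signal down the line
            return
        outbox.put(func(item))

def run_pipeline(funcs, data):
    """Chain the functions so each stage feeds only its next neighbor."""
    links = [queue.Queue() for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(f, links[i], links[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for w in workers:
        w.start()
    for item in data:
        links[0].put(item)
    links[0].put(STOP)
    results = []
    while True:
        item = links[-1].get()
        if item is STOP:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results

if __name__ == "__main__":
    # Two tiny stages: add one, then double.
    print(run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))
```

No stage needs to hold more than its own small function and the item in flight, which is the property that makes this approach a fit for cores with very little memory.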
Even external memory accesses require three cores working together. This makes sense when there are many cores to work with. The 40C18 also has the unique ability to send a small program of four instructions in a single word to be executed by a neighboring core. That is actually enough space to perform a block transfer.
The XMOS XS1-G4 is an interesting mix based on 32-bit integer Xcores (see “Multicore And Soft Peripherals Target Multimedia Applications”). Each Xcore can handle a number of different threads with a hardware-based event system that facilitates XMOS’s soft peripherals. Like the 40C18, the XS1-G4 can wait on an I/O port. The difference is that the XS1-G4 handles multiple threads whereas the IntellaSys chip works with one.
Developers can use XC, an extended version of C, to get the most out of XMOS hardware. Extensions provide shortcuts to the hardware support, which also includes XLinks. The XLinks connect the four cores in the chip and provide four off-chip links. As a result, multiple chips can be connected. Internally, the chip uses a switch for the XLink connection, but the hardware and software provide a uniform interface for interprocessor communication.
Each core has 64 kbytes of memory. This is more than the 40C18 but less than some of the higher-performance chips covered here. Still, it is sufficient for a significant amount of application code, allowing a more conventional thread approach to programming. The bulk of programming for the XMOS chip is likely to be in conventional C or C++ rather than XC, which tends to be used for communication and peripheral handling.
The chip won’t present a challenge to double-precision floating-point GPUs or other high-end systems, but its integer and fixed-point DSP support lends itself to many audio- and video-processing functions. Linked XMOS chips are already used to drive multiple large-screen LCDs.
Multicore architectures continue to proliferate. Programming these cores efficiently and choosing the right one isn’t necessarily easy, but it will become more common even for embedded developers. Legacy applications will tend to migrate to architectures that match their existing hosts. More radical departures are possible when the applications are being redesigned or created from scratch.
Great overview of many of the various multi-core approaches and solutions available today. One key point that was overlooked: While homogeneous multi-core solutions (e.g. Intel Xeon processor 5500 series) are typically thought of as being implemented in SMP environments, creative architects, especially in the embedded and networking marketplace, are doing heterogeneous architectural implementations. These can be done in an AMP configuration using a boot loader or in a virtualized configuration using embedded hypervisors (typically not enterprise/server VMMs). Flexibility is one of the strengths of Intel architecture. It can be used to do many things...but the architect/developer needs to dream them up and then implement them.
Jim St. Leger - June 25, 2009