Electronic Design

  


[Technology Report]
Match Multicore With Multiprogramming
Multiple cores deliver performance with lower power requirements, but processors can’t contribute much if they’re idle.

William Wong  |   ED Online ID #21341  |   June 25, 2009


Each subsystem’s level of performance is limited by the architecture, whereas the P4080 can adjust its partitioning, though this is normally done when the system starts up. A system designer can adjust the allocation of cores in the P4080, assuming sufficient cores are available. QorIQ platforms with fewer cores are on the market, too, allowing a more economical chip to be used.

IBM’s Cell processor fits somewhere in between (see “Cell Processor Gets Ready To Entertain The Masses”). It incorporates a 64-bit Power core plus eight Synergistic Processing Elements (SPEs). The SPEs are identical (each with 256 kbytes of local memory), and they operate in isolation, unlike the shared-memory SMP systems already discussed. There is no virtual memory support or caching within the SPEs.

This has advantages and disadvantages for hardware and software design. The approach simplifies the hardware implementation but complicates the software from a number of perspectives. For example, memory management is under application control, as is communication between cores. Data must be moved into the local memory of an SPE before it can be manipulated.
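
The staging pattern described above can be sketched as a small Python simulation. It is not Cell SDK code. The 256-kbyte local-store size comes from the text; the helper name `process_in_local_store`, the 4-byte element size, the half-of-local-store buffer budget, and the use of list slices to stand in for explicit transfers are all assumptions made for illustration.

```python
LOCAL_STORE_BYTES = 256 * 1024   # per-SPE local memory, from the text
ELEMENT_BYTES = 4                # assumption: 32-bit values


def process_in_local_store(main_memory, kernel, buffer_bytes=LOCAL_STORE_BYTES // 2):
    """Apply `kernel` to every element, staging data through a bounded buffer.

    Hypothetical helper: the slice copies stand in for the explicit transfers
    between main memory and an SPE's local memory that the application
    must manage itself.
    """
    chunk_len = buffer_bytes // ELEMENT_BYTES
    if chunk_len <= 0:
        raise ValueError("buffer too small to hold one element")
    result = [None] * len(main_memory)
    transfers = 0
    for start in range(0, len(main_memory), chunk_len):
        local = main_memory[start:start + chunk_len]   # copy in
        local = [kernel(x) for x in local]             # compute on local data only
        result[start:start + len(local)] = local       # copy out
        transfers += 1
    return result, transfers


data = list(range(100_000))
doubled, transfers = process_in_local_store(data, lambda x: 2 * x)
print(transfers)
```

The point of the sketch is that the chunking loop lives in application code. On a cached SMP system, the hardware would do the equivalent data movement without the programmer writing it.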

Fully exploiting architectures like the Cell takes time because they differ from the more conventional SMP or AMP platforms. The improvement in software on Cell-based platforms like Sony’s PlayStation 3 over the years highlights the changes in programming techniques and experience.

DIVIDE AND CONQUER
Changing programming techniques is key to success in using graphics processing units (GPUs). GPUs from the likes of ATI and Nvidia expose hundreds of cores in a single chip. These GPUs can be combined into multichip solutions, providing developers with thousands of cores. For instance, four Nvidia Tesla T10s packed into a 1U chassis deliver 960 cores (Fig. 4).

Programming the Tesla or any of the other compatible Nvidia GPU chips can be a challenge, but frameworks like Nvidia’s CUDA, or runtimes based on CUDA, make the job much easier. Part of the challenge is the single-instruction, multiple-thread (SIMT) architecture of Nvidia’s GPU (see “SIMT Architecture Delivers Double-Precision TeraFLOPS”). Like many high-performance systems, it likes to work with arrays of data. This works well for many applications but not all, which is one reason why GPUs are often matched with a multicore CPU.

CUDA and OpenCL (Open Computing Language), another parallel programming framework, match the GPU approach of using memory that is separate from the host processor’s. This means data must be moved from one place to another before it can be manipulated. Both add some extensions to the C programming language, but there are also restrictions. For example, GPU code is recursion-free, and it doesn’t support function pointers. Some of these restrictions come from the SIMT approach.

Many applications utilize CUDA. Performance gains compared to conventional SMP platforms vary quite a bit, though, from a factor of two to 100. One reason for this variance is that threads work best if they’re running in groups of 32. Branches don’t hurt performance as long as all 32 threads in a group follow the same path.
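
The effect of branching on a 32-thread group can be illustrated with a deliberately simplified cost model in Python. The group size of 32 comes from the text. The function name `warp_passes` is hypothetical, as is the model itself: it assumes a group whose threads all agree executes one path, while a group with both outcomes present executes both paths one after the other.

```python
WARP_SIZE = 32  # threads per group, from the text


def warp_passes(branch_taken):
    """Return the number of serialized passes for each group of 32 threads.

    Simplified model (an assumption): the cost of a group equals the number
    of distinct branch outcomes among its threads, so 1 if they agree and
    2 if they diverge.
    """
    passes = []
    for start in range(0, len(branch_taken), WARP_SIZE):
        warp = branch_taken[start:start + WARP_SIZE]
        passes.append(len(set(warp)))
    return passes


# Branch on group index: every thread in a group takes the same path.
uniform = [(tid // WARP_SIZE) % 2 == 0 for tid in range(128)]
# Branch on thread parity: every group contains both outcomes.
divergent = [tid % 2 == 0 for tid in range(128)]

print(sum(warp_passes(uniform)), sum(warp_passes(divergent)))
```

Both conditions split the 128 threads evenly, yet under this model the parity-based branch costs twice as many passes. Arranging data so that neighboring threads take the same path is one of the programming changes that can separate a small speedup from a large one.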

Specialized processors like GPUs are one approach to both graphics and multicore processing. Another approach is to use many conventional cores, such as Intel’s Larrabee (Fig. 5). Larrabee uses x86-compatible cores that are optimized for vector processing (see “Intel Makes Some Multicore Lemonade”).

In one sense, Larrabee is similar to IBM’s Cell processor. The Larrabee cores only have access to the 32-kbyte L1 and 256-kbyte L2 caches. If data isn’t in the cache, it must be requested from the memory controller or another cache within the system. The data is then placed into the core’s cache, and the application proceeds on its merry way.

A ring bus is used to communicate between cores and controllers. IBM’s Cell Element Interconnect Bus (EIB) is also a ring bus connecting the SPEs to the memory controller and peripheral interface. From a programming perspective, Larrabee’s cache and the Cell’s SRAM differ significantly.

Still, to programmers, Larrabee appears as an array of cache-coherent x86 processors. Because of its GPU orientation, programmers can take advantage of DirectX and OpenGL support.

MULTICORE NETWORKING
Multicore chips also are common pieces in a networking infrastructure puzzle. Handling 10-Gbit/s networks is a challenge all by itself for multicore chips. Analyzing and massaging data from a network connection at line speeds requires lots of processing power.

Netronome’s NFP-3200 Network Flow Processor contains 40 1.4-GHz cores that can run eight threads each, providing 320 hardware-based threads in one chip. This is on the same order as GPUs, but the processors are slanted toward packet processing.
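
A back-of-the-envelope budget shows why line-rate processing is demanding. The Python sketch below uses the core count, thread count, clock rate, and 10-Gbit/s line rate from the text. The worst-case traffic model is an assumption: back-to-back minimum-size 64-byte Ethernet frames, each with 8 bytes of preamble and 12 bytes of interframe gap on the wire. The function names are hypothetical.

```python
LINE_RATE_BPS = 10e9        # 10-Gbit/s link, from the text
CORES = 40                  # NFP-3200 cores, from the text
THREADS_PER_CORE = 8        # from the text
CLOCK_HZ = 1.4e9            # from the text

# Assumption: worst case of back-to-back minimum-size Ethernet frames.
FRAME_BYTES = 64            # minimum Ethernet frame
OVERHEAD_BYTES = 8 + 12     # preamble/start delimiter + interframe gap


def packets_per_second(line_rate_bps=LINE_RATE_BPS):
    """Packets arriving per second when the link is saturated."""
    bits_on_wire = (FRAME_BYTES + OVERHEAD_BYTES) * 8
    return line_rate_bps / bits_on_wire


def cycles_per_packet(cores=CORES, clock_hz=CLOCK_HZ):
    """Aggregate core cycles available for each arriving packet."""
    return cores * clock_hz / packets_per_second()


hardware_threads = CORES * THREADS_PER_CORE
print(hardware_threads, round(packets_per_second()), round(cycles_per_packet()))
```

Under these assumptions the link delivers roughly 14.9 million packets per second, leaving about 3,760 core cycles per packet across the whole chip. The estimate ignores memory stalls and any fixed-function hardware assistance, so it is only a rough ceiling on the available per-packet work.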

Like IBM’s Cell, the NFP-3200 has a master CPU-style controller. In this case, it’s an ARM11 core. Its 40 cores, also called microengines, are compatible with Intel’s IXP28xx architecture, which was developed for network processing. This compatibility is important because a good deal of code targets this architecture. Older chips had fewer cores, so in a sense the NFP-3200 offers more of the same.

Of course, simply tossing more cores at the problem is just one course of action. Netronome incorporates a host of improvements, such as enhanced microblocks with TCP offload support. The interconnect speeds are higher as well, running at 44 Gbits/s between cores.




    Reader Comments

    Why can't I see the fig 1, fig 2 etc....?

    John Branthoover -November 10, 2009

    Great overview of many of the various multi-core approaches and solutions available today. One key point that was overlooked: While homogeneous multi-core solutions (e.g. Intel Xeon processor 5500 series) are typically thought of as being implemented in SMP environments, creative architects, especially in the embedded and networking marketplace, are doing heterogeneous architectural implementations. These can be done in an AMP configuration using a boot loader or in a virtualized configuration using embedded hypervisors (typically not enterprise/server VMMs). Flexibility is one of the strengths of Intel architecture. It can be used to do many things, but the architect/developer needs to dream them up and then implement them.

    Jim St. Leger -June 25, 2009
