[Technology Report]
Match Multicore With Multiprogramming
Multiple cores deliver performance with lower power requirements, but processors can’t contribute much if they’re idle.
Across the embedded landscape, the design credo has become “more cores.” However, challenges remain when it comes to the software side. Some hardware architectures can deliver dozens of cores, while others hit thousands of cores. Unfortunately, applications don’t always port easily across different architectures.
For the low end of the embedded space, single-core solutions will remain. It’s still possible to move up the power and performance curve by moving to faster or wider processors. At the high end, though, multiple cores are the way to go.
This is why double-precision floating point crops up often and the reason these solutions often wind up in supercomputers. In fact, desktop and rack-mount systems like ones from Nvidia bring this level of processing power to the masses (see “Standard GPU Cluster Provides High Performance In The Mid-Range”).
Another matter that crops up when discussing software and multicore architectures is virtualization. Not all multicore platforms support virtualization, but it opens up opportunities. And while it does make hardware design more challenging, it typically simplifies software and application management.
SMP SERVERS The Xeon Nehalem-EX is the top-of-the-line eight-core symmetrical multiprocessing (SMP) platform from Intel. Multichip solutions such as an eight-chip, 64-core system use the high-speed QuickPath point-to-point interconnect to link processors and peripheral controllers together (Fig. 1).
This architecture will be familiar to those using AMD’s Opteron processors with HyperTransport links. In both cases, the simplest configuration is a single processor linked to a single peripheral controller by a single high-speed link.
Both vendors implement a form of cache-coherent, non-uniform memory access (ccNUMA) in addition to a distributed memory subsystem. Each processor chip has its own memory controllers plus L1, L2, and L3 caches. Any chip can access memory in any other chip by using the high-speed links. Of course, data that’s further from the requester will take more time to access.
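The distance effect can be sketched as a simple hop-count model. The four-chip ring topology, the chip numbering, and the nanosecond costs below are all illustrative assumptions, not figures for any Intel or AMD part; the point is only that latency grows with the number of links a request must cross.

```python
from collections import deque

# Hypothetical four-chip ring: each chip has links to two neighbors.
LINKS = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

# Illustrative costs in nanoseconds; assumptions, not measured values.
LOCAL_NS = 60
PER_HOP_NS = 40


def hops(src, dst, links=LINKS):
    """Return the minimum number of chip-to-chip links between src and dst."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        chip, dist = queue.popleft()
        if chip == dst:
            return dist
        for nxt in links[chip]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("chips are not connected")


def access_ns(requester, owner):
    """Estimate latency when `requester` reads memory attached to `owner`."""
    return LOCAL_NS + PER_HOP_NS * hops(requester, owner)
```

In this toy model, a chip reading its own memory pays only the local cost, while a read from the chip on the far side of the ring pays for two extra hops.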
These high-speed links are also used on consumer devices, but a single link to an I/O hub is usually all that is necessary. On the other hand, servers generate significant traffic between processor chips for shared memory access. Chip-to-chip traffic and cache management are critical to efficient operation.
One of the key features in AMD’s new Istanbul Opteron, HT Assist, optimizes the memory request and response process to minimize the number of transactions involved, freeing up bandwidth for other traffic (Fig. 2). HT Assist actually tracks data movement among cores and caches, allowing a request to be serviced by the nearest core with the required data.
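To see why tracking where data lives saves bandwidth, consider a toy comparison between probing every chip and probing only the chip known to hold a cache line. This is a simplified directory model written for illustration; it is not AMD's HT Assist implementation, and the function names and message counts are assumptions.

```python
def broadcast_messages(num_chips):
    """Probe every other chip and collect a response from each one."""
    others = num_chips - 1
    return 2 * others  # one probe plus one response per remote chip


def directory_messages(directory, line):
    """Probe only the chip the directory lists as holding the line, if any.

    `directory` maps a cache-line address to the chip caching it (hypothetical).
    """
    holder = directory.get(line)
    return 2 if holder is not None else 0  # one probe plus one response


# Example: line 0x1000 is cached on chip 2; line 0x2000 is cached nowhere.
directory = {0x1000: 2}
```

With four chips, the broadcast approach always exchanges six messages per request in this model, while the directory approach exchanges two when a remote cache holds the line and none when the data can come straight from memory.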
The worst-case scenario is when data must be accessed from off-chip memory by the chip that owns that memory space. The best-case scenario is finding the data in the cache of the chip running the thread needing the data. Intermediate scenarios will have cores getting the data from an adjacent chip’s cache.
The system becomes more complex when virtualization and caches are brought into play, because data-access latency then varies in ways that are hard to predict. This can be an issue in deterministic embedded applications, but is less so in most server applications where speed is desired over fine-grain determinism.
Programmers now seek these platforms because they greatly simplify the programming task. Likewise, applications can exploit the growing number of cores, assuming the application can utilize enough threads efficiently.
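How much an application gains from extra cores depends on how much of its work can actually run in parallel. Amdahl's law gives a quick upper bound; the sketch below applies it, with the parallel fraction supplied by the caller as an assumption about the workload.

```python
def speedup(parallel_fraction, cores):
    """Upper bound on speedup from Amdahl's law.

    parallel_fraction: share of run time that can be spread across cores (0..1).
    cores: number of cores available to the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)
```

For example, a program that is 90% parallel tops out at roughly 4.7 times faster on eight cores, well short of the eightfold gain the core count might suggest.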
Efficient use of multicore systems isn’t as easy as it might look. Cache size and locality of reference within an application’s working data set can affect how well a particular algorithm will run.
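The effect of locality can be shown with a small cache simulation. The sketch below assumes a tiny direct-mapped cache (64 lines of 64 bytes) and a 64-by-64 matrix of 8-byte values stored row by row; these sizes are chosen for illustration and do not describe any processor mentioned here. It counts misses when the same data is walked in row order versus column order.

```python
LINE_BYTES = 64   # assumed cache-line size
NUM_LINES = 64    # assumed direct-mapped cache of 64 lines (4 KB)
ELEM_BYTES = 8    # one double-precision value
N = 64            # the matrix is N x N, stored row by row


def count_misses(addresses):
    """Count misses for a sequence of byte addresses in a direct-mapped cache."""
    slots = [None] * NUM_LINES
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        slot = line % NUM_LINES
        if slots[slot] != line:
            slots[slot] = line  # fetch the line, evicting whatever was there
            misses += 1
    return misses


def address(row, col):
    """Byte address of element (row, col) in row-major storage."""
    return (row * N + col) * ELEM_BYTES


row_order = [address(r, c) for r in range(N) for c in range(N)]
col_order = [address(r, c) for c in range(N) for r in range(N)]

row_misses = count_misses(row_order)
col_misses = count_misses(col_order)
```

Both traversals touch exactly the same elements, but the column-order walk strides across cache lines and keeps evicting data before it can be reused, so it misses far more often in this model.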
AMP APPLICATION PROCESSORS Symmetric multiprocessing (SMP) architectures come in handy for many embedded applications, but asymmetrical multiprocessing (AMP) can also be useful for other applications. AMP configurations show up in a number of guises, from Texas Instruments’ OMAP (Open Multimedia Application Platform) to Freescale’s P4080 QorIQ (Fig. 3).
TI’s OMAP 44xx combines an ARM Cortex-A9, a PowerVR SGX 540 GPU, a C64x DSP, and an Image Signal Processor. Each core has a dedicated function. Communication between processors isn’t symmetrical.
The OMAP operates only in AMP mode. The P4080, on the other hand, is an SMP system at heart with the ability to partition its cores into an AMP mode as well. The eight-core chip can run as eight independent cores, or its cores can be combined in a number of configurations (e.g., a pair of two-core SMP subsystems or four single-core subsystems).
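The bookkeeping behind such partitioning can be sketched in a few lines. The helper below is hypothetical and only checks that a requested set of partitions fits on an eight-core chip; on real hardware the split would be set up by a boot loader or hypervisor, which this sketch does not model.

```python
TOTAL_CORES = 8  # the P4080 has eight cores


def assign_cores(partition_sizes, total=TOTAL_CORES):
    """Assign consecutive core IDs to partitions of the requested sizes.

    A partition of size 1 is a standalone core; a larger one is an SMP group.
    """
    if any(size < 1 for size in partition_sizes):
        raise ValueError("each partition needs at least one core")
    if sum(partition_sizes) > total:
        raise ValueError("partitions need more cores than the chip has")
    layout, next_core = [], 0
    for size in partition_sizes:
        layout.append(list(range(next_core, next_core + size)))
        next_core += size
    return layout
```

Calling `assign_cores([2, 2, 1, 1, 1, 1])` would describe one possible mixed layout, with two two-core SMP groups alongside four standalone cores, while `assign_cores([8])` describes a single eight-way SMP system.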
The main high-level architectural difference between the OMAP and P4080 is that the OMAP functions are fixed and the cores are optimized for their respective chores. This makes programming significantly easier since partitioning the application to particular cores will be based on matching functionality.
Great overview of many of the various multi-core approaches and solutions available today. One key point that was overlooked: While homogeneous multi-core solutions (e.g., Intel Xeon processor 5500 series) are typically thought of as being implemented in SMP environments, creative architects, especially in the embedded and networking marketplace, are doing heterogeneous architectural implementations. These can be done in an AMP configuration using a boot loader or in a virtualized configuration using embedded hypervisors (typically not enterprise/server VMMs). Flexibility is one of the strengths of Intel architecture. It can be used to do many things, but the architect/developer needs to dream them up and then implement them.
Jim St. Leger - June 25, 2009