In the best of all worlds, programmers would write applications without thinking about the target systems on which they'll run. Processor dependencies such as cache size, memory bandwidth, and others would be ignored. Multiprocessor (MP) architectural dependencies like memory sharing, number of processors, and network bandwidth also would be forgotten. And of course, tools would be supplied to automatically provide an efficient executable with minimal effort.
Today, available compiler technology can do a good job of optimizing for processor architectures. Compilers even deal well with shared-memory multiprocessor systems. But in the next century, we'll have to develop the tools to enable this level of simplicity on distributed-memory multiprocessors. Efforts are under way to ease the use of data-flow libraries. There's even a group working on standardizing this capability, which will result in application portability. The real challenge is in automatically decomposing a single application to run on multiple processors. Research has started, and usable solutions shouldn't be too far away.
One thing is sure: the development of multiprocessor architectures continues to accelerate. At the high end, SGI/Cray, Sun, IBM, HP, Compaq, and other vendors are all producing MP systems based on their workstation technologies. The availability of high-performance networking lets users build their own systems. In the embedded area, a number of firms supply multiple processors on a single board. A few vendors, including SKY Computers, are able to efficiently connect many multiprocessor boards into high-performance-computer (HPC) systems.
Multiple Solutions, Challenges
Reading about multiprocessor architectures, however, is like wading through alphabet soup. There are symmetric multiprocessors (SMPs), nonuniform memory access (NUMA), network of workstations (NoW), distributed shared memory (DSM), and others. Each of these architectures involves tradeoffs, and no single one is the best solution for all MP requirements. Yet all multiprocessor systems, regardless of architectural proclivity, present significant software challenges.
Software Issues: The first issue to be addressed is maximizing the performance from each processor. The most common reason for purchasing an MP system is that a single processor doesn't have the CPU or I/O performance to solve problems fast enough. If the application can be tuned to run faster on every processor, the number of processors can be reduced, thereby decreasing the MP system's size and complexity.
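The arithmetic behind that last point can be sketched in a few lines of C. This is a rough illustration only: the function name and the throughput figures are hypothetical, and the calculation assumes perfect scaling, with no communication or synchronization overhead between processors.

```c
/* Smallest processor count whose combined throughput meets the target.
   Assumption: perfect scaling, i.e. no communication or synchronization cost.
   Rates are in any consistent unit (for example, frames per second). */
unsigned processors_needed(unsigned required_rate, unsigned per_cpu_rate)
{
    if (per_cpu_rate == 0)
        return 0;   /* no valid answer; the caller must treat 0 as an error */

    /* Ceiling division, written so that it cannot overflow. */
    return required_rate / per_cpu_rate + (required_rate % per_cpu_rate != 0);
}
```

Under these assumptions, a job that needs 1000 units per second takes 13 processors at 80 units each, but only 8 once each processor has been tuned to deliver 125.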
Developing a multiprocessor application is seldom simple or transparent. Vendor-supplied libraries are available to support specific hardware features. They also may include shared-memory functions, a multithreading library, or a message-passing communications library. Sometimes, users can "flip a switch" and convert a uniprocessor application to an MP application. The state of the art, however, is such that this simple conversion isn't feasible in all configurations. Even when it is, the resulting performance isn't necessarily as good as expected.
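To give a sense of what the conversion involves when no switch is available, here is a minimal sketch of one common manual step: dividing a set of work items into contiguous blocks, one per processor. The names work_range and block_range are hypothetical and don't come from any vendor library. Real applications also have to move data and synchronize, which this sketch leaves out.

```c
#include <stddef.h>

/* Half-open range [begin, end) of work items owned by one worker. */
struct work_range {
    size_t begin;
    size_t end;
};

/* Split n_items as evenly as possible across n_workers.
   The first (n_items % n_workers) workers each take one extra item. */
struct work_range block_range(size_t n_items, size_t n_workers, size_t worker)
{
    struct work_range r = {0, 0};
    if (n_workers == 0 || worker >= n_workers)
        return r;                      /* empty range for invalid input */

    size_t base  = n_items / n_workers;
    size_t extra = n_items % n_workers;

    r.begin = worker * base + (worker < extra ? worker : extra);
    r.end   = r.begin + base + (worker < extra ? 1 : 0);
    return r;
}
```

Each processor would call block_range with its own index and then loop over only its own items. Even this simple scheme shows why performance can disappoint: if the items don't all cost the same to process, an even split by count isn't an even split by time.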
Managing communications can become a significant challenge when the size of the system increases. If only a few processors are involved, it's possible to keep track of the communications manually. When there are hundreds or even thousands of processors, automation is required to supervise the configuration. In addition to the size of the problem, the configuration may change between application runs because of the availability of resources, or because hardware has failed.
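One way to picture that automation is a table that maps logical processor numbers onto whatever hardware is actually available for a given run. The following is a sketch under assumptions: build_rank_map and ring_next are hypothetical names, and the node_up array stands in for whatever health information a real system would supply.

```c
#include <stddef.h>

/* Fill rank_to_node[] with the physical IDs of the nodes that are up,
   in ascending order, and return how many logical ranks were assigned.
   node_up[i] != 0 means physical node i is available for this run.
   rank_to_node must have room for n_nodes entries. */
size_t build_rank_map(const int *node_up, size_t n_nodes, size_t *rank_to_node)
{
    size_t n_ranks = 0;
    for (size_t node = 0; node < n_nodes; ++node) {
        if (node_up[node])
            rank_to_node[n_ranks++] = node;
    }
    return n_ranks;
}

/* Logical ring neighbor: the rank that follows `rank`, wrapping at the end.
   The caller must ensure n_ranks > 0. */
size_t ring_next(size_t rank, size_t n_ranks)
{
    return (rank + 1) % n_ranks;
}
```

The application addresses its neighbors by logical rank only. If a node fails between runs, the table is rebuilt and the communication pattern follows the surviving hardware without any change to the application source.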
Finally, there's the "ease of use" issue. This may be an overused marketing term, but the application programmer really doesn't want to deal with the complexities of an MP system. The goal is to get the application running with minimum effort and maximum performance.
CPU Performance: Maximizing the processor's performance is a relatively well-understood problem. Compilers were the first programmer productivity enhancers developed. The current technology is very advanced. Each new processor needs to have a compiler in order to be successful in the marketplace.
Achieving additional performance is possible for some applications, such as DSP and image processing, if a vector-processing library that has been tuned for the processor is provided. Getting the best performance out of such a library usually requires that parts of it be coded in assembly language. Fortunately, vector libraries are usually available from the vendor.
To date, these libraries haven't been standardized. An application written for one processor family doesn't port easily to another processor. The Vector Signal and Image Processing Library (VSIPL) proposed standard, funded by the Defense Advanced Research Projects Agency (DARPA), is attempting to address this issue. The effort brings together major vendors and users to define a common application programming interface (API) for signal and image processing. The expectation is that vendors will implement the standard and that users will gain enhanced portability in their applications.
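The value of a common API is easiest to see in a small example. The sketch below isn't VSIPL: vl_dot_f is a hypothetical library entry point, shown here with a plain C reference body, and fir_filter is illustrative application code written against it.

```c
#include <stddef.h>

/* Hypothetical library entry point: dot product of two float vectors.
   This is a plain C reference version; a vendor would substitute a tuned
   implementation behind the same signature. */
float vl_dot_f(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Application code: y[i] = sum over k of taps[k] * x[i + k]
   (a correlation-form FIR filter).
   x must hold at least n_out + n_taps - 1 samples. */
void fir_filter(const float *x, const float *taps, size_t n_taps,
                float *y, size_t n_out)
{
    for (size_t i = 0; i < n_out; ++i)
        y[i] = vl_dot_f(&x[i], taps, n_taps);
}
```

Because the filter touches the hardware only through the library call, moving it to another processor family would mean relinking against that vendor's library, not rewriting the application. That holds only if both vendors agree on the interface, which is the point of a standard.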
A number of vendors have developed compilers that vectorize automatically. These compilers understand the processor/cache/memory architecture, and they can generate vector-optimized executables from highly portable source code. This capability first appeared on the Cray supercomputers. It has been implemented by Digital for the Alpha, as well. SKY offers this capability for embedded real-time applications.
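What such a compiler can and can't do depends heavily on how the source loop is written. The two functions below are a generic illustration in standard C (C99 for the restrict qualifier), not taken from any particular vendor's documentation. Whether a given compiler actually vectorizes the first loop depends on the compiler and its options.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]. Unit stride, independent iterations, and
   `restrict` promising that x and y do not overlap: the kind of loop a
   vectorizing compiler can usually handle. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Running sum: iteration i reads the result of iteration i - 1.
   This loop-carried dependence blocks straightforward vectorization. */
void running_sum(size_t n, float *v)
{
    for (size_t i = 1; i < n; ++i)
        v[i] += v[i - 1];
}
```

Both functions are portable C with no processor-specific code, so the source stays the same from one machine to the next. The difference lies in the data dependences, which is what the compiler has to analyze before it can safely issue vector operations.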