ARM's Scalable Vector Extensions (SVE) build on the 64-bit ARMv8-A architecture.
ARM's Scalable Vector Extensions (SVE) for the ARMv8-A architecture (see figure) expands ARM's scope to supercomputing, but it will also have a significant impact on high-performance embedded computing (HPEC) systems. Intel released its SSE (Streaming SIMD Extensions) well over a decade ago, and in the meantime vendors with chips based on the Power architecture let support for AltiVec SIMD (single instruction, multiple data) instructions flounder. This, along with other changes, allowed Intel's Xeon to gain the high ground in HPEC.
AltiVec has made a resurgence, but Power is no longer the platform of choice for high-end embedded systems where it was once dominant; the Xeon has taken the lead. ARM's rise in general has included adoption in HPEC as well as in military and avionics embedded systems. ARMv8-A SVE will help improve ARM's market share, but it will take many years for this change to make a dent in Intel's dominance. Simply getting SVE into production chips will take years, and those environments require significant investments in time, money, and certifications. Fujitsu plans on using ARMv8-A SVE in the silicon for the RIKEN Post-K supercomputer project, which is scheduled for deployment in 2020.
ARM has included its NEON SIMD support in the ARMv8-A architecture, but NEON, with its fixed 128-bit registers, is akin to Intel's SSE. While useful, it is not in the same category as SVE, which supports vector lengths from 128 to 2,048 bits in 128-bit increments. Intel's AVX initially handled 256-bit data, with an encoding designed to scale to 512- and 1,024-bit registers; AVX-512 is supported by the latest Intel Skylake Xeon and Xeon Phi processors. The scope of SVE implementations will depend upon the vendors, since ARM does not build its own chips: each licensee chooses the vector length its silicon provides.
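Because the vector width is a property of the chip rather than the instruction set, portable software cannot assume it. As a minimal sketch, assuming the ACLE SVE intrinsics in arm_sve.h as the programming interface, a program can simply ask the hardware at run time:

/* Sketch: query the implemented SVE vector length at run time.
 * Assumes a compiler that provides the ACLE SVE intrinsics (arm_sve.h). */
#include <arm_sve.h>
#include <stdio.h>

int main(void)
{
    /* svcntb() returns the number of 8-bit elements per SVE vector,
     * so multiplying by 8 gives the implemented width in bits
     * (anywhere from 128 to 2,048, depending on the vendor's design). */
    printf("SVE vector length: %u bits\n", (unsigned)(svcntb() * 8));
    return 0;
}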
SVE supports a vector-length agnostic (VLA) programming model. Instructions operate on whatever vector length the hardware implements rather than on fixed-length vectors, avoiding the need to rewrite code if the vector size changes on future parts.
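The loop below is a sketch of what VLA code looks like, again assuming the ACLE SVE intrinsics; the function name daxpy_vla is purely illustrative. The same binary would run unchanged on a 128-bit or a 2,048-bit implementation, because the element count and the loop predicate come from the hardware at run time:

/* Vector-length-agnostic y[i] += a * x[i] over n doubles.
 * No vector width appears anywhere in the code. */
#include <arm_sve.h>
#include <stdint.h>

void daxpy_vla(double a, const double *x, double *y, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntd()) {     /* svcntd(): doubles per vector */
        svbool_t pg = svwhilelt_b64(i, n);          /* per-lane predicate, covers the tail */
        svfloat64_t vx = svld1_f64(pg, &x[i]);
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        vy = svmla_n_f64_m(pg, vy, vx, a);          /* vy += vx * a on active lanes */
        svst1_f64(pg, &y[i], vy);
    }
}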
SVE supports a range of optimizations, such as gather-loads and scatter-stores that help with vectorization of non-linear data structures, a common occurrence in high-performance computing (HPC). Per-lane predication allows vectorization of nested control code containing side effects, and it removes the need for separate scalar loop heads and tails. Predicate-driven loop control and management helps reduce vectorization overhead compared to scalar code.
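To illustrate how a gather-load and per-lane predication combine, here is a sketch (ACLE SVE intrinsics assumed again; gather_copy is an illustrative name) that vectorizes an indexed copy, y[i] = x[idx[i]], with the predicate handling the partial final iteration instead of a scalar tail loop:

/* Gather-load sketch: the indexed access x[idx[i]] would defeat a
 * unit-stride SIMD load, but the gather form vectorizes it anyway. */
#include <arm_sve.h>
#include <stdint.h>

void gather_copy(double *y, const double *x, const uint64_t *idx, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, n);                       /* active lanes only */
        svuint64_t vidx = svld1_u64(pg, &idx[i]);                /* load the indices */
        svfloat64_t vx = svld1_gather_u64index_f64(pg, x, vidx); /* x[idx[i]] per lane */
        svst1_f64(pg, &y[i], vx);
    }
}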
The new technology also supports vector partitioning and software-managed speculation, which allow vectorization of uncounted loops that have data-dependent exits, where the entire vector may not be processed. Extended integer and floating-point horizontal reductions let vectorization be applied to more types of reducible loop-carried dependencies. Finally, scalarized intra-vector sub-loops allow vectorization of loops containing complex loop-carried dependencies.
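A horizontal reduction is the easiest of these to show in code. The sketch below (same assumed intrinsics; dot_vla is an illustrative name) keeps a vector accumulator through the loop and folds it into a scalar with a single horizontal add at the end:

/* Predicated dot product of x and y over n doubles. */
#include <arm_sve.h>
#include <stdint.h>

double dot_vla(const double *x, const double *y, int64_t n)
{
    svfloat64_t acc = svdup_n_f64(0.0);             /* vector accumulator, all lanes 0.0 */
    for (int64_t i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, n);
        svfloat64_t vx = svld1_f64(pg, &x[i]);
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        acc = svmla_f64_m(pg, acc, vx, vy);         /* acc += vx * vy on active lanes only */
    }
    return svaddv_f64(svptrue_b64(), acc);          /* horizontal add across all lanes */
}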
ARM's SVE targets only the 64-bit instruction set (A64). A64 uses fixed-length 32-bit instructions. SVE uses 25% of the previously unallocated A64 encoding space, which works out to one-sixteenth of the total; three-sixteenths of the encoding space remain for future A64 enhancements.
SVE will have a major impact on exascale computing in application areas such as pharmaceutical research, quantum physics, and fluid dynamics, as well as in areas like weather and geological simulation and analysis. It will also be used for applications such as machine vision and machine learning.