Innovative Platforms Enlist To Serve High-Performance Military Computing
Fig 1. Curtiss-Wright Controls Embedded Computing’s CHAMP-AV8 ties a pair of 2.1-GHz quad-core Intel Core i7 processors together with PCI Express, Ethernet, and SRIO.
Fig 2. Curtiss-Wright’s CHAMP-AV8 utilizes IDT’s PCI Express-to-SRIO endpoint to provide the Core i7 processors with direct access to the SRIO fabric.
Fig 3. IDT highlights the available bandwidth for SRIO, PCI Express, and 10G Ethernet. SRIO lends itself to smaller packets and provides a peer-to-peer network.
Fig 4. Mercury Computer’s LDS6521 employs Mercury’s POET (Protocol Offload Engine Technology) FPGA technology to link the quad-core Core i7 processors to the SRIO or Ethernet backplane fabrics.
Fig 5. GE Intelligent Platforms’ 3U SBC312 single-board computer supports the eight-core QorIQ P4080 processor.
Fig 6. Elma’s 3U VPX TIC-FEP-VPX3a has a Xilinx Virtex-5 FPGA that uses the onboard FMC connection to support custom interfaces.
Fig 7. Curtiss-Wright’s FMC-516 card holds a quad, 250-Msample/s, 16-bit ADC.
Fig 8. GE Intelligent Platforms’ NPN240 packs in dual nVidia GT240 GPUs. Each has 96 cores, providing a total of 750 GFLOPS per board.
Fig 9. Emerson Network Power’s iVPX7220 single-board computer handles 2.2-GHz quad-core Intel Core i7 processors along with 256 kbytes of nonvolatile FRAM.
Military and avionics applications must be rugged and offer high performance while meeting tight size, weight, and power (SWaP) requirements. High-performance computing (HPC) solutions in particular are in demand because achieving high performance while meeting those other requirements isn’t always easy (see “Military HPC Needs Software And Delivery Platforms,” p. xx).
The range of applications seems almost endless, with ever-larger computing projects on the drawing board. Phased-array radars can deliver finer images and track more targets with more computing power behind them. Unmanned aerial vehicles (UAVs) can perform more computing on board, reducing download bandwidth requirements.
Designers deliver this performance with a range of technologies, from high-speed serial interfaces to multicore CPUs, DSPs, and graphics processing units (GPUs) with a healthy sprinkling of FPGAs. Board form factors like VME, CompactPCI, VXS (VITA 41), and VPX (VITA 46) can support this range of processing chips. Conduction-cooled VPX boards running at 48 V can handle up to 768 W. This translates into lots of computing power. All but VME support high-speed serial interfaces.
The OpenVPX standard simplifies the fabric interconnect possibilities (see “OpenVPX Simplifies Rugged Design Tasks” at electronicdesign.com). This is key because high-speed serial fabrics are critical for HPC in these environments.
Big Jobs Need Fast Fabrics
The high-speed serial fabrics supported by OpenVPX include PCI Express (PCIe), Ethernet, Serial RapidIO (SRIO), and InfiniBand. InfiniBand is popular for many HPC applications, though even more so in the scientific and enterprise arenas. PCIe tends to lack the peer-to-peer communications support that the other protocols have, and it is used extensively for its designed purpose: interfacing processors to devices.
This leaves Ethernet and SRIO. Both are used extensively in military, avionics, and communications applications. Platforms like Curtiss-Wright Controls Embedded Computing’s CHAMP-AV8 include both, as well as PCIe to access peripherals (Fig. 1). One thing that sets this board apart is that it supports SRIO with Intel’s Core i7 processor.
The CHAMP-AV8 has two Core i7s, and its SRIO support is significant for two reasons (Fig. 2). First, SRIO isn’t a native interface for x86 processors. Ethernet and PCIe tend to be the interfaces found on the support chips, while native SRIO interfaces are found in many Power architecture processors. Second, Curtiss-Wright is using IDT’s PCIe Gen 2-to-SRIO Gen 2 bridge chip, which simplifies the board designer’s job and provides a standard interface to the programmer.
To the host, the IDT chip looks like an SRIO adapter, much as an Ethernet chip looks like an Ethernet adapter. The primary difference between the two is the underlying protocol. SRIO tends to do better with smaller packet sizes because of its lower overhead (Fig. 3), and it provides guaranteed delivery. Ethernet targets larger packets and supports protocols like TCP/IP, which are needed to provide acknowledged delivery of data. Ethernet provides flexibility, but SRIO delivers better performance.
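To put rough numbers on that tradeoff, consider per-packet overhead. The figures in this minimal C sketch are illustrative assumptions, not datasheet values: roughly 16 bytes of overhead per SRIO packet (SRIO payloads top out at 256 bytes) versus roughly 78 bytes per TCP/IP-over-Ethernet frame once the Ethernet header, frame check sequence, preamble, and inter-frame gap are counted.

```c
/* Back-of-the-envelope wire efficiency: payload / (payload + overhead).
   Overhead figures are illustrative assumptions, not vendor specs:
   ~16 bytes per SRIO packet, ~78 bytes per TCP/IP-over-Ethernet frame.
   SRIO payloads max out at 256 bytes, so larger transfers span packets. */
#include <stdio.h>

static double efficiency(double payload, double overhead)
{
    return 100.0 * payload / (payload + overhead);
}

int main(void)
{
    const double srio_oh = 16.0; /* assumed SRIO per-packet overhead      */
    const double eth_oh  = 78.0; /* assumed Ethernet+IP+TCP per-frame sum */
    const int sizes[] = { 32, 64, 128, 256 };

    for (int i = 0; i < 4; i++)
        printf("%3d-byte payload: SRIO %4.1f%%, Ethernet/TCP %4.1f%%\n",
               sizes[i], efficiency(sizes[i], srio_oh),
               efficiency(sizes[i], eth_oh));
    return 0;
}
```

Under those assumptions, a 64-byte payload travels at roughly 80% wire efficiency over SRIO versus about 45% over TCP/IP on Ethernet, which is why SRIO fares better on small packets.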
IDT’s chip isn’t the only way to bridge PCIe and SRIO. FPGAs have been handling this chore for a while, and they were used initially when SRIO was being designed. The serializer-deserializer (SERDES) blocks found on most high-performance FPGAs can handle any of the high-speed serial protocols, including SRIO and Gigabit Ethernet.
Mercury Computer’s LDS6521 (Fig. 4) also supports Intel processors but uses an FPGA and employs Mercury’s POET (Protocol Offload Engine Technology) FPGA technology (see “Interconnect Technology A First For Intel Embedded Computing” at electronicdesign.com). SRIO is just one of the protocols supported by POET, which also supports Ethernet and InfiniBand plus OpenVPX data-plane PCIe. Like IDT’s chip, the POET FPGA interfaces with the host processor using PCIe.
POET’s protocol flexibility would be useful on its own, but FPGAs can do much more, and Mercury makes exploiting them significantly easier. POET also provides access to off-chip memory as well as to digital I/O for standard platforms like FMC (FPGA mezzanine card), XMC (switched mezzanine card), and RTM (rear transition module).
This allows POET to be used on a range of Mercury boards and in a range of configurations. It also enables Mercury’s boards to be used in mixed networking environments where a more dedicated approach, like Curtiss-Wright’s IDT-based design, would usually be found. Curtiss-Wright’s board supports Ethernet, SRIO, and PCIe, but the connections are dedicated.
The common key to both Mercury’s and Curtiss-Wright’s boards is their use of Intel processor chips. In particular, they utilize the latest Sandy Bridge technology, which incorporates Intel’s AVX vector processing technology (see “Intel’s AVX Scales To 1024-Bit Vector Math” at electronicdesign.com).
The Advanced Vector Extensions (AVX) are just one of many significant improvements from Intel that make these latest Core i7 processors desirable in high-performance military and avionics applications. AVX allows Intel’s processors to handle heavy-duty DSP chores, image processing, and even encryption/decryption processing.
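As a concrete illustration (not drawn from any of the boards above), here is a minimal C sketch of an AVX dot product, a staple DSP kernel. It assumes a compiler with AVX support (e.g., -mavx) and an array length that is a multiple of eight; Sandy Bridge AVX processes eight single-precision floats per 256-bit register.

```c
/* Minimal AVX sketch: single-precision dot product, eight floats per
   256-bit ymm register. Illustrative only; compile with -mavx. */
#include <immintrin.h>
#include <stdio.h>

/* n must be a multiple of 8 in this simplified sketch. */
static float dot_avx(const float *a, const float *b, int n)
{
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        /* Sandy Bridge AVX has no fused multiply-add, so multiply
           and accumulate are separate instructions. */
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc); /* reduce the eight lanes on the scalar side */
    float sum = 0.0f;
    for (int i = 0; i < 8; i++)
        sum += lanes[i];
    return sum;
}

int main(void)
{
    float a[16], b[16];
    for (int i = 0; i < 16; i++) {
        a[i] = 1.0f;
        b[i] = (float)i;
    }
    printf("dot = %f\n", dot_avx(a, b, 16)); /* 0+1+...+15 = 120 */
    return 0;
}
```

The same loop on AltiVec would move four floats per 128-bit register, half the width per operation.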
Intel’s multicore designs have reduced power requirements in addition to providing more flexible power management. Power-related features like Intel’s Turbo Boost have moved the chips into applications that were typically reserved for Power-based microprocessors.
Power architecture processors from the likes of Freescale, AppliedMicro, and LSI are still at the core of many military and avionics solutions. One advantage they retain is built-in SRIO interfaces, which eliminate the extra bridge chips used on the Curtiss-Wright and Mercury boards.
Power architecture processors are available in multicore configurations, and they are very power efficient. Their AltiVec vector processing support was an advantage in the past, but Intel’s AVX is providing significant competition: AVX registers are 256 bits wide, with architectural support for up to 1024 bits, while AltiVec’s registers are 128 bits wide.
Intel’s Core processors are designed for single-processor configurations. There are plenty of places where they will work in military and avionics applications, especially with SRIO support available via IDT’s chip or an FPGA.
Still, many developers are waiting for Intel’s Xeon server processors, which can really kick up the number of cores in a symmetrical multiprocessing (SMP) node. The current Xeon processors tend to be a bit hot for 3U and 6U boards. They also lack the advantages of the newer Core i7 processors such as AVX support. These new features will be in the next crop of Xeon processors.
Multicore DSP
DSPs are key to the success of real-time signal processing for applications like software-defined radio. They typically challenge FPGAs for processing sensor information, especially in large sensor arrays like those found in phased-array sonar and radar applications. Freescale’s Power-based QorIQ is just one of many Power DSP platforms available (see “Multicore And More” at electronicdesign.com).
GE Intelligent Platforms’ 3U SBC312 takes advantage of the Freescale QorIQ family (Fig. 5). Available with the quad-core P4040 and eight-core P4080 chips, both built on e500mc Power cores, the board comes in a range of conduction-cooled and air-cooled variations. Network options include 10-Gbit/s Ethernet and Gigabit Ethernet. The SBC312 is OpenVPX-compliant, and it has a PMC/XMC site with PCIe support in addition to SATA ports and serial ports.
DSP platforms are becoming more important in application areas such as UAVs where streaming multimedia processing can reduce the required radio bandwidth. DSP chips typically have the advantage over FPGAs when it comes to power consumption and are often the best alternative if the DSP can handle the processing requirements. Still, DSP blocks are a major feature of high-performance FPGAs.
Military FPGAs
FPGAs are used for a host of chores in military and avionics applications. Mercury’s POET is just one example of their power and flexibility. Tools like POET significantly reduce the amount of design work and programming necessary to take advantage of an FPGA’s performance.
The latest 28-nm FPGAs from the likes of Altera and Xilinx push the boundaries and incorporate very high-speed SERDES (see “Climb On Board Next-Generation FPGAs” at electronicdesign.com). Altera’s Stratix V GT and Xilinx’s Virtex-7 deliver 28-Gbit/s SERDES. They can easily handle 10-Gbit/s Ethernet. Both have hard-core PCI Express Gen 3 interfaces, allowing the high-speed SERDES to be used for other chores.
SERDES connections greatly simplify interconnects. They can link devices using standard interfaces like SRIO and Ethernet, but they also support a range of simpler protocols for linking FPGAs directly and for handling sensor data.
As noted, DSP blocks are part of high-performance FPGA logic. This allows parallel computation to be performed, which is the key to an FPGA’s performance advantage over conventional processors and DSPs. Altera’s DSP functionality provides flexible bit widths, enabling designers to tailor their design to the application rather than forcing the design into a fixed-width block.
Two other relatively new vendors target high-performance FPGA applications. First, Achronix’s Speedster FPGA employs picoPipe technology to mitigate the delay normally associated with FPGA designs, where signals must be routed between lookup tables (LUTs) and registers (see “1.5-GHz FPGA Takes Clock Gating To The Max” at electronicdesign.com).
The picoPipes are asynchronous single-bit FIFOs that permit data to flow through the system in a fine-grained fashion. Designers have always been able to implement computations in a queued manner, but Speedster does it transparently with respect to the underlying design. Achronix has partnered with Intel to deliver Speedster on 28-nm technology.
Second, Tabula’s ABAX FPGA takes a different approach (see “FPGAs Enter The Third Dimension” at electronicdesign.com). Its 3D SpaceTime architecture changes the underlying logic profile on a per-clock basis. Normally, a RAM-based FPGA runs a fixed logic configuration. In Tabula’s case, there are multiple logic configurations, and only one is active in a particular area of the FPGA during a clock cycle. Another takes its place on the next clock cycle.
These configurations are called folds. A fold defines the logic within a region whose size remains fixed, but an ABAX FPGA can be configured with multiple regions. Each region has its own clock that can run at different rates. Likewise, each region can have up to eight folds, although it is possible to use fewer folds. Each fold increases the amount of logic available to the designer while reducing the maximum clock rate for the region.
The architecture has some interesting implications. For example, Tabula provides only single-port memory, but it can perform a read or write for each fold. This essentially implements a multiport memory system that can exceed the flexibility of the typical dual-port memory found in an FPGA. Likewise, each fold provides a new set of logic.
Fixed dual-port schemes must place some logic farther from the memory, whereas Tabula’s approach can put the logic next to the memory because it can change each clock cycle. This feature has significant implications for soft-core processor design, especially when it comes to pipeline architectures that are often less efficient on conventional FPGAs.
Like Achronix, Tabula provides a transparent route to using the ABAX FPGA. Its development flow uses the same design tools as fixed FPGAs but accounts for the advantages and limitations of the architecture. Tabula’s and Achronix’s compilers use timing constraints to determine how the underlying logic will be configured. In Tabula’s case, the number of folds needed is also chosen automatically.
Tabula has a potential edge on soft-core processor implementations, and soft-core processors appear in most new FPGA designs. These processors often handle configuration, logging, or communications chores while the rest of the FPGA runs in parallel, providing performance that software alone cannot match. Military and avionics applications have taken advantage of FPGAs for this reason, but these kinds of applications can also benefit from conventional programmable cores.
Not surprisingly, hard-core processors provide significant performance advantages while reducing power requirements. Power architecture hard cores have been popular in military and avionics environments, but ARM-based hard-core processors are effectively replacing them.
Xilinx’s Zynq-7000 EPP FPGA family is based on dual hard-core ARM Cortex-A9 MPCore processors (see “FPGA Packs In Dual Cortex-A9 Micro” at electronicdesign.com). Altera has announced support for Cortex hard cores as well.
Intel’s Atom E600C system-on-a-chip (SoC) packs a 40-nm Altera Arria II FPGA into a 37.5- by 37.5-mm ball-grid array (BGA) package (see “Configurable Platform Blends FPGA With Atom” at electronicdesign.com). This combination does not address the same arena as the 28-nm, high-performance FPGAs, but even arrays of lower-power processors like these can be advantageous in densely packed board solutions.
SeaMicro’s SM10000-64 packs 256 dual-core, 64-bit Intel Atom N570 chips and 1 Tbyte of DRAM into a 10U rack (see “512 64-Bit Atom Cores In 10U Rack” at electronicdesign.com). The system is linked by a very efficient 1.28-Tbit/s fabric that, while not based on SRIO or Gigabit Ethernet, interfaces to the Atom processors via PCIe. Still, the system uses very little power, which is often a top criterion for military and avionics applications.
The latest SeaMicro platform uses conventional processors, but it is not a stretch to consider what might happen if they were replaced with FPGA-augmented chips like the E600C. Its 10U rack is rather large compared to the 3U and 6U systems common in military and avionics applications. One way FPGAs are utilized in those form factors is in conjunction with FMC sites.
Elma’s 3U VPX TIC-FEP-VPX3a (Fig. 6) has a Xilinx Virtex-5 FPGA that can interface directly with mezzanine cards like Curtiss-Wright’s FMC-516 (Fig. 7). The FMC-516 has a quad, 250-Msample/s, 16-bit analog-to-digital converter (ADC) along with front-panel and rear-panel inputs, letting designers develop new FMC modules or utilize existing ones with standard FPGA cards. Larger cards can support multiple FMC modules.
Military GPUs
FPGAs provide flexibility, but they aren’t the only parallel processing tools available to military and avionics application designers. GPUs have proven themselves in supercomputing environments, so it’s not surprising to find them moving into other high-performance areas like these.
The challenge with GPUs is similar to that with FPGAs: addressing power and programming issues. A lower-power module like AMD’s Radeon E6760 embedded GPU is one alternative (see “AMD Radeon E6760 Embedded GPU Offers Support For OpenCL And Six Independent Displays” at engineeringtv.com). Likewise, running GPUs below their limit to meet board power envelopes is an approach already taken with CPUs. GPUs often deliver performance improvements on the order of 10 to 100 times, so slowing down the GPU would still be beneficial and practical.
Programming GPUs is usually significantly easier than dealing with FPGAs. Programming frameworks like CUDA for nVidia GPUs (see “Software Frameworks Tackle Load Distribution” at electronicdesign.com) and OpenCL (see “Parallel Programming Is Here To Stay” at electronicdesign.com) provide developers with an easy-to-use programming environment that takes advantage of a multicore GPU.
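To show the flavor of the model without tying the sketch to any board above, here is a minimal, hypothetical OpenCL host program in C that scales an array on whatever GPU the runtime finds. The kernel itself is OpenCL C embedded as a string; error checking is omitted for brevity, and a real design would test every return code.

```c
/* Minimal OpenCL sketch: scale 1024 floats on a GPU. Hypothetical
   example; link with -lOpenCL. Error checking omitted for brevity. */
#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void scale(__global float *x, float k) {\n"
    "    size_t i = get_global_id(0);\n"
    "    x[i] *= k;\n"
    "}\n";

int main(void)
{
    float data[1024];
    for (int i = 0; i < 1024; i++)
        data[i] = (float)i;

    /* Grab the first platform and its first GPU device. */
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Copy the host array to the device, typically over PCIe. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(data), data, NULL);

    /* Build the kernel from source at run time. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel kern = clCreateKernel(prog, "scale", NULL);

    float factor = 2.0f;
    clSetKernelArg(kern, 0, sizeof(buf), &buf);
    clSetKernelArg(kern, 1, sizeof(factor), &factor);

    /* One work item per element; the runtime spreads them across cores. */
    size_t global = 1024;
    clEnqueueNDRangeKernel(q, kern, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

    printf("data[3] = %f\n", data[3]); /* expect 6.0 */
    return 0;
}
```

The explicit buffer copies in this sketch illustrate the data-movement bottleneck discussed below: every byte the GPU touches must cross the host interface.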
At this point, GPUs need to be paired with a CPU to handle communication chores. Most GPUs use PCIe these days, so they will work with most suitably equipped processors. In many cases, a CPU will be paired with a single GPU.
Alternatively, boards like GE Intelligent Platforms’ 6U NPN240 bring a pair of nVidia GT240 GPUs into the mix (Fig. 8). The two 96-core chips deliver up to 750 GFLOPS of computational performance. The GPU board uses an x16 PCIe interface, and a single processor can handle multiple boards.
GPUs often can perform double duty, providing graphics output as well as doing computational work. They aren’t amenable to all algorithms, though, and moving data in and out can be a bottleneck. Still, GPUs clearly have a place in high-performance military and avionics applications.
Rugged Storage
HPC often is viewed only from the processor side and occasionally from the communication point of view, but storage is an indispensable part of the puzzle. Here, designers have also seen changes.
Nonvolatile storage has been moving from rotating magnetic storage to solid-state storage, in particular flash. Magnetic storage still has the edge when it comes to capacity, but flash is faster for reading and usually for writing. The problem is flash’s limited write lifetime, which is critical in many military and avionics environments where product lifetimes can be very long. It’s also important for replacement and repair.
Single-level cell (SLC) flash is faster and has a longer write life and higher reliability compared to multi-level cell (MLC) flash. MLC has the advantage when it comes to cost and capacity. Cost tends to be less important for military and avionics environments, but capacity is often an issue. Compression can reduce the requirements but not eliminate them. In general, SLC flash is used most often in military and avionics applications.
Flash drives using SATA and SAS interfaces are available in standard form factors, but those form factors aren’t a requirement as they would be for a magnetic drive. Flash can easily be placed on a board next to the processor, and most single-board computers have at least a flash-based boot memory. Megabytes or even gigabytes of flash can be found on off-the-shelf boards.
Still, other technologies like MRAM, FRAM, and phase-change memory (PCM) are challenging flash’s performance and reliability. They offer smaller capacities than flash, but their other characteristics are better even though their costs are higher, and cost is less important in military and avionics applications. Likewise, these technologies can be more economical than alternatives like battery-backed RAM.
Emerson Network Power’s iVPX7220 single-board computer has a 2.2-GHz quad-core Intel Core i7 processor that has access to 256 kbytes of nonvolatile FRAM (Fig. 9). It also has up to 16 Gbytes of DDR3 error correction coding (ECC) DRAM. Embedded USB flash and a SATA drive round out the optional storage features.
Finally, encrypted storage is worth mentioning for military applications. This feature is available in a range of flash drives. It is likely to be a part of many military designs in systems such as UAVs and other mobile robotic systems.
High-performance military and avionics solutions require a range of tools from storage to fabrics to computational platforms. No one configuration is best, and many applications will require a complex mix of processing elements.