How To Put OpenCL Into An FPGA


OpenCL kernel in an FPGA

Altera is looking to put OpenCL (Open Computing Language) into FPGA hardware. This could give GPUs a run for their money when it comes to accelerating parallel processing. It could also increase adoption of FPGAs in new areas.

OpenCL is a parallel processing framework from the Khronos Group. It was developed shortly after NVidia's CUDA (see Software Frameworks Tackle Load Distribution) opened GPUs as compute engines. CUDA and OpenCL have made major inroads in high-performance computing (see Match Multicore With Multiprogramming). Support for OpenCL can now be found in SoC (system-on-chip) solutions like Arm's Mali-T658 architecture (see Multicore Mobile GPU Handles Computation Chores).

GPUs and OpenCL can improve performance by as much as a factor of 10 or even 100. Of course, the improvement is very application specific, but it has made a major difference in the computational capability of many applications. OpenCL has the advantage of running on a variety of platforms, including a single-core CPU. It allows algorithms to be ported to different platforms and to take advantage of more cores when they are available.

OpenCL is a strict framework that divides data into arrays and algorithms into kernel code that can manipulate this data. The idea is that a control program, usually running on a conventional CPU, specifies the source and destination data and the kernels to be applied to the source data. How this happens is up to the OpenCL framework. For a GPU, the data and kernel code are copied to the GPU's memory, which is normally independent of the host. The kernel code runs and the results are returned to the host. This is a very general, high-level view of OpenCL but it is sufficient for this discussion.
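The flow described above can be sketched in plain Python. This is only a toy model of the execution pattern, not the OpenCL API: `Device`, `scale_kernel`, and `host_program` are hypothetical names invented for illustration, and a real framework would run the work-items in parallel rather than in a loop.

```python
# Toy model of the host/device flow: copy in, run a kernel per element, copy out.
# All names here (Device, scale_kernel, host_program) are hypothetical,
# not OpenCL API calls.


class Device:
    """Stands in for an accelerator with its own memory, separate from the host."""

    def __init__(self):
        self.memory = {}

    def write_buffer(self, name, host_data):
        # Copy host data into device memory (a real device would use a bus transfer).
        self.memory[name] = list(host_data)

    def read_buffer(self, name):
        # Copy results back to the host.
        return list(self.memory[name])

    def run_kernel(self, kernel, src, dst, *args):
        # Apply the kernel once per work-item; hardware would do these in parallel.
        n = len(self.memory[src])
        self.memory[dst] = [0] * n
        for gid in range(n):
            kernel(gid, self.memory[src], self.memory[dst], *args)


def scale_kernel(gid, src, dst, factor):
    # Kernel body: each work-item handles the element at its own index.
    dst[gid] = src[gid] * factor


def host_program(data, factor):
    # The control program names the source data, the kernel, and the destination.
    device = Device()
    device.write_buffer("in", data)
    device.run_kernel(scale_kernel, "in", "out", factor)
    return device.read_buffer("out")


result = host_program([1, 2, 3, 4], 10)
```

The host code never says how the kernel is scheduled, only which data it applies to. That separation is what lets the same kernel target a CPU, a GPU, or, in Altera's case, FPGA logic.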

Altera plans on doing the same thing, but the FPGA essentially contains the kernel code and off-chip memory is used to store the data (Fig. 1). The host communicates with the FPGA via a PCI Express interface. Common management logic handles this data exchange. It also invokes kernels, provides each kernel with its data as necessary, and reports completion to the host. The host can then utilize the results, again via the PCI Express link.

OpenCL kernel code is essentially C code. Altera's yet-to-be-named tool will convert the OpenCL code into logic that can be programmed into the FPGA. Initially this programming will be a static process and multiple OpenCL kernels could be handled by a single FPGA. This approach works very well for an embedded application where algorithms tend to remain static. It allows system updates since the FPGA is programmable.

This is just a first step, and the tool and process are not a commercial product at this point. Still, the approach is quite viable and can provide even more impressive performance gains compared to GPUs. Again, all these gains will be application specific.

Likewise, the approach has its limitations at this point. For example, like a GPU, the data is moved from host memory to the FPGA's memory before any processing is done and then back when done. This matches OpenCL's protocol but FPGAs have long been used to handle streaming data very efficiently. Work is in progress to enhance the OpenCL specification to take streaming into account. This should further improve the suitability of FPGAs for implementing OpenCL-based applications.
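The difference between the copy-in, process, copy-out pattern and a streaming pattern can be illustrated with a small Python sketch. It is an analogy only: the function names are hypothetical, and Python generators stand in for data flowing through FPGA logic sample by sample.

```python
# Analogy only: contrasts buffering a whole data set with processing a stream.
# batch_process, stream_process, and sensor are hypothetical names, not OpenCL.


def kernel(x):
    # Stand-in for the per-element work done by a kernel.
    return x * x


def batch_process(source):
    # Buffered model: all input is copied before any processing starts,
    # and all output is produced before anything is returned.
    device_buffer = list(source)
    results = [kernel(x) for x in device_buffer]
    return results


def stream_process(source):
    # Streaming model: each item is processed as it arrives, so the
    # first result is available before the last input has been seen.
    for x in source:
        yield kernel(x)


def sensor(n, log):
    # Simulated data source that records when each sample is produced.
    for i in range(n):
        log.append(("in", i))
        yield i


batch_result = batch_process(sensor(3, []))

log = []
stream_result = []
for y in stream_process(sensor(3, log)):
    log.append(("out", y))
    stream_result.append(y)
```

Both paths compute the same values, but in the streaming case `log` interleaves inputs and outputs, which is the behavior FPGAs are traditionally good at and the buffered model does not express.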

Another FPGA feature that could be utilized is dynamic programming. Most FPGAs are programmed before use, or when they are started in the case of RAM-based FPGAs, and are not changed once the application is running. Dynamic programming of FPGAs allows some or all of the FPGA logic to be reconfigured on the fly. This feature could allow OpenCL kernels to be loaded on demand. RAM-based FPGAs would be especially effective for this approach.
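As a rough illustration of what loading kernels on demand might look like from the host's point of view, the sketch below models an FPGA with a fixed number of reconfigurable regions. Everything in it is hypothetical: the `ReconfigurableFabric` class, the region count, and the least-recently-used eviction policy are assumptions made for the example, not anything Altera has described.

```python
from collections import OrderedDict

# Hypothetical model: an FPGA with a fixed number of reconfigurable regions.
# The eviction policy (least recently used) is an assumption made for this sketch.


class ReconfigurableFabric:
    def __init__(self, regions):
        self.regions = regions
        self.loaded = OrderedDict()  # kernel name -> callable, oldest use first
        self.reconfigurations = 0

    def invoke(self, name, kernel, data):
        if name in self.loaded:
            # Kernel logic is already in the fabric; mark it as recently used.
            self.loaded.move_to_end(name)
        else:
            if len(self.loaded) >= self.regions:
                # No free region: drop the least recently used kernel.
                self.loaded.popitem(last=False)
            # "Program" a region with the requested kernel.
            self.loaded[name] = kernel
            self.reconfigurations += 1
        return [self.loaded[name](x) for x in data]


fabric = ReconfigurableFabric(regions=2)
a = fabric.invoke("double", lambda x: 2 * x, [1, 2])
b = fabric.invoke("negate", lambda x: -x, [1, 2])
c = fabric.invoke("double", lambda x: 2 * x, [3])  # already loaded
d = fabric.invoke("square", lambda x: x * x, [3])  # evicts "negate"
```

The counter makes the trade-off visible: a kernel that is already resident runs immediately, while a new one costs a reconfiguration first.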

The use of off-chip memory and PCI Express links makes the approach amenable to existing hardware. FPGAs are regularly mated to CPUs. There are even single chip CPU/FPGA solutions available now. Intel's E600C system-on-chip packs a 40nm Altera Arria II FPGA into a 37.5mm by 37.5mm BGA package (see Configurable Platform Blends FPGA With Atom). This platform could support Altera's work right now since the Atom core is linked to the FPGA via a pair of PCI Express links.

The inclusion of hard core processors in FPGAs is also on the rise. Altera's future chips will include dual Arm Cortex-A9 cores (see Dual Core Cortex-A9 With ECC Finds FPGA Home). In this case, as well as with soft core processors, the Cortex-A9s have a direct link into the FPGA fabric and to the memory controllers. A simple improvement would be for the hard cores to copy data into the FPGA's off-chip memory and allow the process to work in the normal fashion, except that the data would not be copied via a PCI Express link. A more advanced version might have the CPU cores feed the kernels directly.

OpenCL on FPGAs is just in its infancy but it has a lot of promise. There are areas where significant improvements can be made and fine tuning the OpenCL specification can make FPGAs even more useful.

The approach is a win for FPGA vendors as well as designers. It will obviously generate more FPGA sales while providing developers with faster computational platforms. It also simplifies the methodology of getting logic into FPGAs, since OpenCL kernels are written in C code rather than RTL. This means software developers will be able to take advantage of FPGAs without having to use the current crop of tools.

OpenCL on FPGAs will likely be a work in progress for at least a couple of years, but it is progressing rapidly. Interest is high and it is essentially a software problem, consisting primarily of the OpenCL kernel-to-FPGA compiler and the coordination logic. Altera has a head start on this approach, but the methodology is not unique and OpenCL is an open standard. There are many tools that will generate FPGA logic based on software code. A good example is National Instruments' LabView. LabView can generate code for FPGAs from LabView applications. National Instruments' CompactRIO and Single Board RIO actually have an FPGA at their core along with a CPU that is programmed using LabView applications (see LabView 2010 and Single Board RIO). This is done without even using an FPGA IDE.