Supercomputing on the Cheap

C/C++ Users Journal August, 2004

FPGAs combine with streams-oriented programming methods to deliver extreme processing power for C applications

By David Pellerin, Maya Gokhale, and Steven Hurt

David Pellerin is CTO of Impulse Accelerated Technologies and can be contacted at david.pellerin@impulsec.com. Maya Gokhale is a project leader at Los Alamos National Labs. Steven Hurt is a senior staff engineer in the Advanced Software Technology Group at Lockheed Martin.
An Impulse C Primer

Recent advances in programmable hardware technologies are putting supercomputing-level processing power in the hands of an increasing number of engineering teams. At the heart of this trend is the continued reduction in the cost-per-gate of field programmable gate arrays (FPGAs), the availability of sophisticated embedded processor cores, and advances in development tools that enable the direct compilation of applications to FPGA-based, mixed hardware/software processing platforms. These platforms allow computationally intensive problems to be solved using a high (and often extreme) degree of parallelism.

Low-cost, high-performance computing using highly parallel platforms is not a new concept. It has long been recognized that many—perhaps most—of the problems requiring supercomputer solutions are well suited to parallel processing. By partitioning a problem into smaller parts that can be solved simultaneously, you can gain massive amounts of raw processing power, at the expense of increased hardware size, system cost, and power consumption. It is now common, for example, for supercomputing researchers to create clusters of traditional desktop PCs (often running Linux) to build what can be thought of as a "coarse-grained" parallel-processing system. To create applications for such a system, programmers can use various available programming tools, including PVM (Parallel Virtual Machine) and MPI (Message Passing Interface), both of which provide the ability to express parallelism through message passing and/or shared memory resources using C.

Using such an approach, traditional supercomputers can be replaced for many applications by clusters of readily available, low-cost computers. The computers used in such clusters can be of standard design and programmed to act as one large, parallel computing platform. Higher performance (and correspondingly higher cost) clusters often employ high-bandwidth, low-latency interconnects to reduce the overhead of communication between nodes. Another path to reducing communication overhead is to increase the functionality of each node through the introduction of low-level custom hardware to create high-speed data paths. More recently, supercomputer cluster designers have begun exploiting node-level parallelism by introducing programmable hardware into the mix.

FPGAs as Computing Machines

When FPGAs are introduced into a supercomputing strategy, opportunities exist for improving both coarse-grained (application-level), and fine-grained (instruction-level) parallelism. Because FPGAs provide a large amount of highly parallel, configurable hardware resources, you can create structures (such as parallel multiply/add instructions) that greatly accelerate individual operations or higher level statements such as loops. Through instruction scheduling, instruction pipelining, and other techniques, inner code loops can be further accelerated. And, at a higher level, these parallel structures can themselves be replicated to create further levels of parallelism, up to the limit of the target device capacity.

FPGA devices were introduced in the mid-1980s to serve as glue logic or ASIC prototyping/replacement. They were almost immediately discovered by researchers seeking low-cost, hardware-based computing solutions, and the field of Re-Configurable Computing (RCC) was born. FPGAs thereby serve a dual role: They provide a relatively quick, low-risk way to implement complex hardware functions, while at the same time providing opportunities for algorithm acceleration. Indeed, the applications for which FPGA-based computing platforms are appropriate include computationally intensive interface and control functions, real-time image processing and pattern recognition, data encryption and cryptography, genomics, biomedical, signal processing, scientific computing (including algorithms in physics, astronomy, and geophysics), data communications, and the like.

Most uses of FPGAs as computational resources are relatively static, meaning that the algorithm(s) being implemented on the FPGA are created as system startup and rarely, if ever, changed during operation of the system. While there has been extensive research into dynamic reconfiguration of FPGAs (changing the hardware-based algorithm on the fly to allow the FPGA to perform multiple tasks), there has not been widespread use of FPGAs for dynamically reconfigurable computing, due in large part to the system costs (measured in performance and power) of performing dynamic FPGA reprogramming. In most cases, then, the goal of FPGA-based computing is to increase the raw power and computation throughput for a specific algorithm.

Through a combination of design techniques that exploit spatial parallelism, key algorithms and algorithmic hot spots can be accelerated in FPGAs by multiple orders of magnitude. Until recently, however, it has been difficult or impossible for software developers to take advantage of these potential speedups because the development tools (and in particular compiler technologies) required to generate low-level structures from higher level software descriptions have not been widely available. Instead, programmers have had to learn low-level hardware design methods, or turn to more experienced FPGA designers to take on the task of manually optimizing an algorithm and expressing it as low-level hardware. However, this process is becoming easier.

Single-Board FPGA-Based Platforms

When combined with other peripherals on a board, FPGAs can become excellent platforms for algorithm experimentation, given appropriate design expertise and/or design tools.

The smallest FPGA-based platforms combine FPGA devices with standard I/O devices (such as PCI or network interfaces) on a prototyping board. There are many examples of such boards (like Figure 1), which typically make use of FPGA devices produced by Altera (http://www.altera.com/) or Xilinx (http://www.xilinx.com/). These board-level solutions may or may not include adjacent microprocessors, but in most cases it is possible to make use of embedded processors (which are implemented directly within the FPGA) to create mixed hardware/software applications. Using such boards—or a collection of them—in combination with existing tools, you can use C to create hardware implementations of computationally intensive software.

By making appropriate use of one or more embedded ("soft") processor cores, you can construct a computing environment roughly analogous to the multiprocessor cluster systems, and to combine this cluster—on a single programmable device or a collection of such devices—with custom FPGA hardware implementing the most performance-critical portions of the application. Soft processors are appropriate for this because they are generally configurable (you can select only those peripheral interfaces you need) and support high-performance connectivity with the adjacent FPGA logic.

Examples of such embedded processors include the royalty-free processor cores offered by the FPGA manufacturers themselves (including Altera's 32-bit Nios processor or the 32-bit MicroBlaze processor available from Xilinx), as well as processor cores available from numerous third-party suppliers. "Hard" embedded processors, such as the PowerPC in the Xilinx Virtex II Pro device, or external processors that share the board with one or more FPGAs, are also appropriate for creating a mixed hardware/software platform target.

By using a mixed processor/FPGA platform and evaluating the application's processing and bandwidth requirements, you can create mixed hardware/software solutions in which the FPGA logic serves as a hardware coprocessor for one or more on-chip embedded processors. In terms of cost, a board-level prototyping system consisting of a mainstream FPGA device (capable of hosting one or more embedded processors along with other FPGA-based computations) and various input/output peripherals will range in price from a few hundred to a few thousand dollars, depending on the capacity of the on-board FPGA device(s).

The next step is to the larger, multiple-FPGA systems intended for larger-scale configurable or reconfigurable computing applications. These are useful for ASIC prototyping as well as for creating high-performance, low-production (or one-off) computing platforms for specific applications. Many of these commercial systems trace their roots to DARPA-funded research and include dozens of high-capacity FPGA devices. The advantages include having a system predesigned for you (quicker time to market, less hardware engineering cost), while the disadvantages include a higher per-unit system cost and a system that may be less than a perfect fit for your particular application. Nonetheless, these systems are excellent alternatives to creating low-volume custom hardware platforms for high-performance, FPGA-based computing. Examples of such boards include those offered by SBS Technologies, Annapolis Microsystems, Gidel, Mango, the Dini Group, and others.

A Broad Range of Applications

What kinds of application are being developed using FPGA computing platforms? A program at Montana State University is being funded by the National Science Foundation to develop a computational platform that helps forward the understanding of how the nervous system encodes information (http://www .wernicke.montana.edu/ and http://www .hylitech.com/). The system contains an array of 27 Virtex-II FPGAs that will be used to analyze and build a model of an insect brain. Project leaders are collaborating with scientists at the Center for Computational Biology at MSU, where the aim is to connect this hardware directly to an insect (a cricket), analyze neural data in real time, and insert new models that interact with the recorded data. To do so, the data need to be processed in real time and the models run as fast as possible.

The project team evaluated a conventional DSP approach, but the benchmarks indicated that DSPs would not provide the needed performance. An array of FPGAs, however, provides the computational horsepower required at a relatively low cost. Benchmarks conducted using a single Virtex-II 3000 device indicate speed enhancements of up to 200× are practical for certain neural simulations (see Figure 2) when compared to conventional processors.

The flexibility of the Virtex-II devices lets the team embed software directly in hardware. These performance-critical algorithms describe how the sodium or potassium channels work inside a neuron. The next step is to scale the system up to the array of 27 FPGAs, and as a part of this process the team is investigating the use of a software-based design approach, using Impulse C tools to describe and implement an embedded "bio processor."

The team's goal is to create a platform supported with high-level tools that assist in the discovery of how to decode neural information in real time. This research will eventually let individuals control prosthetic limbs just by thinking, in the same way healthy limbs are controlled by the body. Xilinx has also provided help with this project by donating the necessary FPGA devices, and the National Institute of Health (NIH) has expressed interest in the project and is acting as a funding source.

Other Reconfigurable Computing Applications

At Lockheed Martin, engineers are working with multiple FPGA-based boards as part of a research project utilizing reconfigurable computing. In this application, a limited number of microprocessors handle resource management while the FPGAs handle the bulk of the computations. Analysis of data movement has shown that the hand-off to the decentralized processing is the system bottleneck, which emphasizes the need for careful planning and design of process interconnects. Minimizing the overhead of data communication thus provides an advantage in overall system speed that is often more important than the sheer cumulative processing power.

Mechanically, this system is a large PCI backplane providing power and ground, device management, and no data processing other than for device configuration. There is a separate bus for data movement, and depending on which FPGA-based boards are specified, different buses can be used, some of which are simple, fast buses for data movement in support of streams-oriented applications.

The Lockheed team is sensitive to developing technology, having identified optimal applications that operate on a streaming data source. Examples the Lockheed team has identified include: signal processing with a constant flow, video processors, high-end imaging applications, genetic-related algorithms, pattern-recognition algorithms, and high-level modeling.

Taking a Software Approach to FPGA Programming

Much of FPGA-related research has focused on programming for massively parallel targets. Again, parallelism in an application can be considered at two distinct levels: at the application level and the level of specific instruction within a computational process or loop. While there are ongoing attempts to create compiler technologies that can exploit both levels of parallelism with some level of automation, the best approach seems to focus on the efforts of automation (represented by compiler and optimizer technologies) on the lower level aspects of the problem, while at the same time providing programmers with an appropriate programming model and related tools allowing higher level, coarse-grained parallelism to be easily expressed. In this way, programmers—who have knowledge of the application's external requirements that are not necessarily reflected in a given set of source files—can make decisions and experiment with alternative algorithmic approaches while leaving the task of low-level optimization to compiler tools.

There are a number of programming models that can be applied to FPGA-based programmable platforms, but the most successful of these models share a common attribute—they support modularity and parallelism through a dataflow (or dataflow-like) method of design partitioning and abstraction. Research conducted at Los Alamos National Labs (LANL) and elsewhere has demonstrated that the communicating process model, using streams-based communications between independently synchronized processes, is an effective way of expressing application-level parallelism. A modified form of this programming model is supported in the Streams-C programming environment developed under the direction of Maya Gokhale at LANL (http://rcc.lanl.gov/Tools/Streams-C/), and is also the basis for the Impulse C (CoDeveloper) tool suite from Impulse Accelerated Technologies (http://www.impulsec.com/).

At the heart of this programming model are processes and streams. Processes are independently synchronized, concurrently operating segments of an application that are written in a standard language, C in this case. Processes perform the work of the application by accepting data, performing computations, and generating relevant outputs. Unlike subroutines written in C, processes are considered persistent; they are normally called once (whether in hardware or software) and persist as long as there are streaming data to be processed.

The data that are processed by a streams-oriented application flow from process to process by means of streams, or in some cases by means of messages and/or shared memories that are also supported in the programming model. Streams represent one-way channels of communication between concurrent processes, and are self-synchronizing with respect to the processes by virtue of buffering. The primary method of synchronization between processes is therefore the data being passed on the streams.

The key to allocating processing power within such a system, and using such a programming model, is to use FPGA to implement one or more processes that handle the heavy computation, and provide other processes running on embedded or external microprocessors to handle file I/O, memory management, system setup, and other nonperformance-critical tasks. Using tools such as those included with Impulse CoDeveloper, C applications consisting of multiple parallel processes can be modeled entirely in software, verified using a standard desktop C debugging environment applications, and then—after the application is functionally complete—be incrementally moved, in whole or in part, into the FPGA for further optimization and acceleration.

C and FPGA-Based Platforms

Impulse C is a C-language implementation of the programming model just described. Impulse C lets you describe highly parallel applications and algorithms, using C-compatible library functions in support of a communicating process programming model. Impulse C simplifies the expression of highly parallel algorithms through the use of well-defined data communication, message passing, and synchronization mechanisms.

Impulse C is designed for dataflow-oriented applications, but is also flexible enough to support alternate programming models including the use of shared memory as a communication mechanism. This is important because different FPGA-based platforms may have different performance trade-offs: In some platforms, it makes more sense to move data between the embedded processor and the FPGA gates via block memory reads/writes, while in other platforms a streaming communication channel might have higher performance. Therefore, the programming model that you choose depends not only on the requirements of the application, but also on the architectural constraints of the selected programmable platform target. The ability to quickly model, compile, and evaluate alternate algorithm approaches is therefore critical to getting the best possible results.

The Impulse C library consists of minimal extensions to C in the form of new data types and predefined function calls. Using the intrinsic functions, you can define multiple parallel program segments and describe their interconnection and synchronization. The Impulse C compiler translates and optimizes Impulse C programs into appropriate lower level representations, including Register-Transfer-Logic (RTL) hardware descriptions that can be synthesized to FPGAs, and Standard C (with associated library calls) that can be compiled onto supported microprocessors through the use of widely available C cross compilers.

The complete CoDeveloper development environment includes a set of emulation libraries, letting Impulse C applications be executed in standard desktop compilers, such as Microsoft Visual Studio or GCC/GDB for simulation and debugging purposes. An Application Monitor (see Figure 3) lets multiple Impulse C processes be examined and lets data blockages be quickly identified and fixed.

The output of an Impulse C application, when compiled, is a set of hardware and software source files ready for importing into Xilinx, Altera, or third-party synthesis tools for FPGA bitmap generation. These files include:

The result of this compilation process (see Figure 4) is a complete application, including the required hardware/software interfaces, ready for implementation on an FPGA-based programmable platform.

Conclusion

Applications requiring very high-performance computing are making increased use of FPGA-based computing platforms. These platforms have proven to be highly scalable and capable of handling large, supercomputing-class applications with dramatically lower cost-per-MIPS than other alternatives. There are trade-offs, of course: FPGA-based computing platforms do require more work on the part of programmers, but there are advances being made in compiler tools for FPGA platforms.

While they cannot yet be considered general-purpose computing platforms, for a wide range of computationally intensive applications FPGA-based platforms can truly represent "supercomputing on the cheap."