Dr. Dobb's Journal, April 2005
Computational demands are growing in virtually every area of scientific research and advanced engineering, from space, climate, and genetics research to automobile crash simulations and materials modeling. This affects not only high-profile areas of business and educational research, where funding may be widely available, but also many other disciplines where individuals or small teams work under limited budgets, yet still need to solve complex problems that exceed the capacity of standard workstations or servers.
High Performance Computing (HPC) clusters offer a solution that cuts across virtually the entire range of computational requirements. HPC clusters entered the arena about 15 years ago, and have come a long way in terms of cost, performance, and ease of use. They typically consist of networked computing platforms (nodes), with each node running its own image of the operating system. Complex workloads are then broken into chunks that can be processed simultaneously on multiple nodes, which communicate with each other using message-passing libraries to coordinate the computations.
Enormous progress has been made over the years in optimizing complex applications for parallel processing on clustered systems. With these advances, large, expensive, purpose-built systems are now needed only for highly serialized applications, in which the workload is not easily distributed across multiple nodes. In fact, HPC clusters now account for about 60 percent of the top 500 supercomputers in the world. This includes Columbia, the second most powerful system in operation today: a 10,240-processor cluster that NASA uses to extend its research capabilities. HPC clusters are now common in automotive, aerospace engineering, bioinformatics, seismic, and other day-to-day production work.
The major advantages of HPC clusters over large, monolithic systems are price and performance. Clusters can be built using commercial off-the-shelf (COTS) components, including standard platforms, processors, I/O, storage, and the like. Because these components are manufactured in enormous quantities, they are far less expensive than custom-designed components developed specifically for high-performance computing. Clusters can also be scaled very cost effectively, simply by adding nodes (or by upgrading some or all of them). Although most interconnects offer high bandwidth, the operating system, network driver, and/or communications protocol can add latency and, thus, impact performance. The traditional way to minimize latency is to tune for the specific interconnect, although this can be tedious and time consuming. Intel (where we work) has developed the Intel MPI Library, which lets a single build of an application executable run over a variety of interconnects. The library supports GigE and Infiniband, with support for other fabrics in progress. The Intel MPI Library was only recently released and was, therefore, unavailable during the development of the example cluster built in this article.
The only hardware component of a basic HPC cluster that is outside the mainstream is the network used to connect the nodes. In some very demanding cluster implementations, proprietary interconnect technologies enable the very high-bandwidth, low-latency, node-to-node communications needed to avoid bottlenecks. Yet even in this respect, the tide has turned, and affordable, standards-based solutions, such as Infiniband (http://www.infinibandta.org/) and Gigabit Ethernet (http://www.10gea.org/), deliver sufficient performance and scalability for a broad range of HPC implementations.
A high-performance, 32-bit or 64-bit cluster can be built with a combination of:
In this article, we describe how you can build a 16-node cluster that's powerful enough for a variety of requirements. Once you understand the process, you can build this cluster in about six to eight hours, with about half of that time devoted to hardware setup. The process is based on a reference cluster (http://www.intel.com/ids/) built and optimized by Intel, which is currently available for scalability and benchmark testing over the Internet (http://www.intel.com/ids/ra/).
On the hardware side, the cluster we describe here is a 16-node cluster that uses 17 two-way servers: 16 for the cluster itself, and another to act as the head node, which schedules and balances the workload across the cluster. Each system is configured with two Intel Xeon processors and 4 GB of RAM. The latest Intel processors include features especially useful in HPC environments:
On the software side, the cluster includes:
Both Rocks and MPICH include tools for creating, analyzing, optimizing, and debugging applications for the cluster. Rocks includes both the Intel C++ and Fortran compilers. MPICH also includes a programming environment for working with MPI programs, as well as sample MPI programs and tools for linking, running, and debugging your programs. (Intel provides additional tools for software development and optimization, including Intel Integrated Performance Primitives and the Intel Cluster Toolkit, the Intel MPI Library, the Intel Thread Checker, and Intel Thread Profiler; see http://developer.intel.com/software/products/.)
Typical HPC applications are extremely compute intensive, so software optimization can deliver substantial benefits. It not only accelerates performance for existing workloads, but can enable the cluster to handle larger and more complex problems without additional hardware resources. Recompilation and thread optimization may not be necessary simply to run your applications on the cluster, but if you plan to use an application intensively over time, the effort can be well worth it.
Hardware setup is straightforward. You install an Ethernet adapter in the PCI slot for each node, including the head node, and connect all the nodes to an Ethernet switch (use Gigabit Ethernet components if you plan on using this network for cluster operation as well as initial setup). If you plan to use Infiniband for the cluster interconnect, install Infiniband host-channel adapters and connect all the nodes to an Infiniband switch. The Infiniband software can be installed after you set up the cluster.
To install the software, begin with the head node. (If you plan on building a lot of clusters, you might want to configure a dedicated build server, instead.) First, load Rocks onto the head node. This open-source utility includes a Linux OS kernel, so no separate OS is required. It also includes software management tools to distribute the OS to the nodes, although other open-source tools are available for this function.
With Rocks loaded onto the head node, you can distribute the OS to the cluster. Boot each node over the network using PXE boot. The nodes look for a DHCP server on the TCP/IP Ethernet network and, following IP address assignment, retrieve the appropriate files from the head node. Scripts on the head node then kick-start the compute nodes as they are discovered. Each node takes about five minutes to install from the time it's discovered by the head node.
You are now ready to test, profile, and benchmark the baseline performance of the cluster. You can use your own programs or the sample programs included with MPICH. If you run into problems, consult the documentation on the MPICH web site, which includes troubleshooting tools and information.
Though the difficulty and cost of building your own cluster have been greatly reduced in recent years, you may want to test your applications on a similar cluster before investing time and money in planning and building your own. The cluster described here is available for application development, optimization, testing, and benchmarking. You can remotely schedule time on the system and manage the process completely from your own office. To get started, all you need is a direct, high-speed connection to the Internet. Connectivity to the clusters is through an IPsec VPN or the SSL/SSH security protocols (see http://www.intel.com/ids/ra/).
Two additional HPC clusters are also available for remote testing:
DDJ