Dr. Dobb's Journal, April 2005
Computational demands are growing in virtually every area of scientific research and advanced engineering, from space, climate, and genetics research to automobile crash simulations and materials modeling. This affects not only high-profile areas of business and educational research, where funding may be widely available, but also many other disciplines where individuals or small teams work under limited budgets, yet still need to solve complex problems that exceed the capacity of standard workstations or servers.
High Performance Computing (HPC) clusters offer a solution that cuts across virtually the entire range of computational requirements. HPC clusters entered the arena about 15 years ago, and have come a long way in terms of cost, performance, and ease of use. They typically consist of networked computing platforms (nodes), with each node running its own image of the operating system. Complex workloads are then broken into chunks that can be processed simultaneously on multiple nodes, which communicate with each other using message-passing libraries to coordinate the computations.
Enormous progress has been made over the years in optimizing complex applications for parallel processing on clustered systems. With these advances, large, expensive, purpose-built systems are now needed only for highly serialized applications, in which the workload is not easily distributed across multiple nodes. In fact, HPC clusters now account for about 60 percent of the top 500 supercomputers in the world. This includes Columbia, the second most powerful system in operation today: a 10,240-processor cluster that NASA uses to extend its research capabilities. HPC clusters are now common in automotive, aerospace engineering, bioinformatics, seismic, and other day-to-day production work.
The major advantages of HPC clusters over large, monolithic systems are price and performance. Clusters can be built using commercial off-the-shelf (COTS) components, including standard platforms, processors, I/O, storage, and the like. Because these components are manufactured in enormous quantities, they are far less expensive than custom-designed components developed specifically for high-performance computing. Clusters can also be scaled very cost effectively, simply by adding nodes (or by upgrading some or all of them). Although most interconnects offer high bandwidth, the operating system, network driver, and/or communications protocol can add latency and, thus, impact performance. The traditional way to minimize latency is to tune for the specific interconnect, although this can be tedious and time consuming. Intel (where we work) has developed the Intel MPI Library, which lets a single build of an application executable run over a variety of interconnects. The library supports GigE and Infiniband, with support for other fabrics in progress. The Intel MPI Library was only recently released and was, therefore, unavailable during the development of the example cluster built in this article.
The only hardware component of a basic HPC cluster that is outside the mainstream is the network used to connect the nodes. In some very demanding cluster implementations, proprietary interconnect technologies enable the very high-bandwidth, low-latency, node-to-node communications needed to avoid bottlenecks. Yet even in this respect, the tide has turned, and affordable, standards-based solutions, such as Infiniband (http://www.infinibandta.org/) and Gigabit Ethernet (http://www.10gea.org/), deliver sufficient performance and scalability for a broad range of HPC implementations.
A high-performance, 32-bit or 64-bit cluster can be built with a combination of:
In this article, we describe how you can build a 16-node cluster that's powerful enough for a variety of requirements. Once you understand the process, you can build this cluster in about six to eight hours, with about half of that time devoted to hardware setup. The process is based on a reference cluster (http://www.intel.com/ids/) built and optimized by Intel, which is currently available for scalability and benchmark testing over the Internet (http://www.intel.com/ids/ra/).
On the hardware side, the cluster we describe here is a 16-node cluster that uses 17 two-way servers: 16 for the cluster itself, and another to act as the head node, which schedules and balances the workload across the cluster. Each system is configured with two Intel Xeon processors and 4 GB of RAM. The latest Intel processors include features especially useful in HPC environments:
On the software side, the cluster includes:
Both Rocks and MPICH include tools for creating, analyzing, optimizing, and debugging applications for the cluster. Rocks includes both the Intel C++ and Fortran compilers. MPICH also includes a programming environment for working with MPI programs, as well as sample MPI programs and tools for linking, running, and debugging your programs. (Intel provides additional tools for software development and optimization, including Intel Integrated Performance Primitives and the Intel Cluster Toolkit, the Intel MPI Library, the Intel Thread Checker, and Intel Thread Profiler; see http://developer.intel.com/software/products/.)
Typical HPC applications are extremely compute intensive, so software optimization can deliver substantial benefits. It not only accelerates performance for existing workloads, but can enable the cluster to handle larger and more complex problems without additional hardware resources. Recompilation and thread optimization may not be necessary simply to run your applications on the cluster, but if you plan to use an application intensively over time, the effort can be well worth it.
Hardware setup is straightforward. You install an Ethernet adapter in the PCI slot for each node, including the head node, and connect all the nodes to an Ethernet switch (use Gigabit Ethernet components if you plan on using this network for cluster operation as well as initial setup). If you plan to use Infiniband for the cluster interconnect, install Infiniband host-channel adapters and connect all the nodes to an Infiniband switch. The Infiniband software can be installed after you set up the cluster.
To install the software, begin with the head node. (If you plan on building a lot of clusters, you might want to configure a dedicated build server, instead.) First, load Rocks onto the head node. This open-source utility includes a Linux OS kernel, so no separate OS is required. It also includes software management tools to distribute the OS to the nodes, although other open-source tools are available for this function.
With Rocks loaded onto the head node, you can distribute the OS to the cluster. Boot each node over the network using PXE boot. The nodes look for a DHCP server on the TCP/IP Ethernet network and, following IP address assignment, retrieve the appropriate files from the head node. Scripts on the head node then kick-start the compute nodes as they are discovered. Each node takes about five minutes to install from the time it's discovered by the head node.
You are now ready to test, profile, and benchmark the baseline performance of the cluster. You can use your own programs or the sample programs included with MPICH. If you run into problems, consult the documentation on the MPICH web site, which includes troubleshooting tools and information.
Though the difficulty and cost of building your own cluster have been greatly reduced in recent years, you may want to test your applications on a similar cluster before investing time and money in planning and building your own. The cluster described here is available for application development, optimization, testing, and benchmarking. You can remotely schedule time on the system and manage the process completely from your own office. To get started, all you need is a direct, high-speed connection to the Internet. Connectivity to the clusters is through an IPsec VPN or the SSL/SSH security protocols (see http://www.intel.com/ids/ra/).
Two additional HPC clusters are also available for remote testing:
DDJ