InfiniBand Technology

Dr. Dobb's Journal November, 2005

Making Apple networks hot-pluggable

By Corky Seeber

Corky is president of Small Tree Communications, designer and manufacturer of high-performance network solutions for Apple's Mac OS X. He can be contacted at rcs@small-tree.com.

The Virginia Tech Terascale Computing Facility

Over the past two decades, the size of computer systems (including CPUs, disk drives, memory, interconnects, and even power components) has grown to meet the demands of larger computer tasks. However, as systems have grown more complex, the number of points of failure has also increased, to the point that these systems have become expensive to maintain. And as job sizes have grown and become increasingly mission critical, the consequences of failure have become more difficult to overcome.

The industry has fought back with software solutions such as "checkpointing," where a process's state is periodically saved so that it can be restarted after an outage. However, none of these solutions has been foolproof. Many do not support all aspects of the process (such as open sockets). In some instances, files or elements of the process (such as files stored in /tmp) may no longer exist after a crash. And you can never be sure that some aspect of the outage, such as a disk or memory problem, has not affected the results or the checkpoint operation itself.
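
As a minimal sketch of the checkpointing idea (not any particular vendor's implementation), the Python below periodically saves a job's state to a file and resumes from the last saved state after a restart. The function names, the state layout, and the simulated outage are all assumptions made for illustration. Note that it captures only in-memory data, which mirrors the limitation described above: open sockets and temporary files are not part of the checkpoint.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a crash mid-write leaves the previous
    checkpoint intact instead of a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)

def run_job(path, steps, checkpoint_every=10, fail_at=None):
    """Sum the integers 0..steps-1, checkpointing every few steps.

    fail_at (hypothetical) simulates an outage at the given step.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated outage")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Calling run_job a second time with the same checkpoint path picks up at the last saved step rather than at step 0; any work done after the last checkpoint and before the outage is simply repeated.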

In response, vendors began designing fault-tolerant systems. The first components to be redesigned were those that were the easiest to duplicate or improve:

Error-correcting code (ECC) memory replaced parity memory.
Redundant power supplies were added.
Redundant fans were added.
Hot-spare and redundant array of independent disks (RAID) backup solutions were implemented.

The drawback to all of these improvements was that they added significant cost to the system without completely solving the underlying problem. As the CPU, memory, disk, and power-supply requirements increased, so did the opportunity for catastrophic single-component failure.

Subsequent efforts were geared towards designing redundancy into other components so that entire CPU boards, memory and all, could be nonfunctional, yet the system would still operate properly. This type of redundancy greatly increased the complexity of the operating system as well as the costs of designing those systems.

As a result, costs soon outstripped what users were willing to pay: Adding redundancy to a design causes a significant increase in system cost, even if users do not purchase the redundant components. This is because accounting for redundancy in the design (backplane, power distribution, and disk subsystem) adds cost.

Enter the cluster design concept. Many large jobs required a plethora of CPUs but not redundancy. A cluster of 16 or 32 CPUs could be used effectively and the mean time between failures (MTBF) was adequate for most jobs. Large numbers of inexpensive machines could be connected and harnessed over a commodity network like Ethernet.
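
The MTBF trade-off can be made concrete with a little arithmetic. The sketch below assumes that nodes fail independently at a constant rate (a simple exponential model, which is an assumption and not something measured on any real cluster) and that a job fails if any one of its nodes fails. The 50,000-hour node MTBF used in the usage note is a hypothetical figure chosen only for illustration.

```python
import math

def cluster_mtbf_hours(node_mtbf_hours, node_count):
    """MTBF of the whole cluster when any single node failure counts.

    With independent, constant failure rates, the rates add, so the
    cluster MTBF is the node MTBF divided by the number of nodes.
    """
    return node_mtbf_hours / node_count

def job_survival_probability(node_mtbf_hours, node_count, job_hours):
    """Probability that no node fails during a job of the given length."""
    return math.exp(-job_hours * node_count / node_mtbf_hours)
```

Under these assumptions, cluster_mtbf_hours(50000, 32) gives 1562.5 hours, or roughly two months between failures, which is comfortable for most jobs. The same node at a hypothetical 1000-node scale gives only 50 hours, which is why larger clusters cannot rely on MTBF alone.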

Clusters were far cheaper than monolithic single systems because cluster systems and networks were usually based on commodity elements and did not contain redundant components. Clusters, much like single-system image (SSI) systems, continued to grow to match computational and storage size requirements. Eventually, certain redundant features such as ECC memory, RAID, and dual power supplies became available in less expensive designs.

As the clusters grew, they migrated to specialized interconnect fabrics. Myrinet was a pioneer of interconnect fabrics, but competitors such as Dolphin's SCI, Quadrics, and most recently InfiniBand have entered the market. All of the specialized interconnect fabrics offer much lower latency and higher bandwidth than their Ethernet counterparts.

Using this newfound bandwidth and the much lower latencies seen across the fabric, cluster system sizes began growing rapidly. One such cluster was the Mac OS X-based cluster at the Virginia Tech Terascale Computing Facility (http://www.tcf.vt.edu/), which debuted on the top500.org list at number 3 for a staggeringly low price of $5.2 million, the lowest cost per flop on the top500.org list. (See the accompanying text box entitled "The Virginia Tech Terascale Computing Facility.")

Harnessing so many systems together to perform as one presents unique challenges. How do you deal with the logistics of 1100 systems, 1100 disk drives, 1100 Ethernet connections, 1100 InfiniBand connections, and 1100 power supplies?

Backed by more than 60 companies (including Dell, Hewlett-Packard, IBM, Intel, Microsoft, and Sun), InfiniBand is an industry-standard, channel-based, switched-fabric interconnect architecture for servers (http://www.infinibandta.org/). InfiniBand matters because, as processor speeds and demand increase, the biggest barrier to improved performance is the I/O subsystem. InfiniBand provides a blueprint for improved system performance.

InfiniBand is a switch-based serial I/O interconnect architecture that operates at a base speed of 2.5 Gbps per link in each direction; today's common 4X ports run at 10 Gbps in each direction, or 20 Gbps aggregate. InfiniBand differs from shared-bus architectures in that it is a low pin-count serial architecture that both connects devices on the PCB and enables "bandwidth out of the box." In the latter role, a link can span distances of up to 17 meters (about 56 feet) over ordinary twisted-pair copper wires.

Most cluster computing networks aren't fault tolerant, even though users increasingly expect the computer systems they use to be available 24/7. Software for online banking, for instance, creates an expectation that users will be able to log in and check account balances or transfer funds at any time of the day or night. Thus, redundancy must be designed into the system.

For instance, eBay (http://www.ebay.com/) achieves its fault-tolerant goal by using a number of inexpensive servers clustered through a network. Each system serves some portion of eBay's users. If that system crashes, users think the Internet connection has locked up, and they refresh the browser. When the browser relaunches and reconnects with eBay, users are assigned to a different system.
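
The reassignment behavior described above can be sketched as a small dispatcher that routes each user to one of the healthy servers and picks a different one once the original is marked down. This is an illustrative toy, not eBay's actual design; the class name, the server names, and the hash-based selection are all assumptions.

```python
import hashlib

class Dispatcher:
    """Route users to servers, skipping any server marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def route(self, user_id):
        """Pick a healthy server for this user, stable while health is unchanged."""
        candidates = [s for s in self.servers if s in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy servers")
        # Hash the user ID so the same user lands on the same server
        # for as long as the set of healthy servers stays the same.
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(candidates)
        return candidates[index]
```

A user routed to a server that is later passed to mark_down gets a different server on the next call to route, which corresponds to the browser refresh in the scenario above. Simple modulo hashing also moves some unaffected users when the healthy set changes; a production system would likely use something more careful.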

In the case of eBay, necessity was the mother of invention when the "mother of all crashes" occurred in 1999. Up until then, all eBay transactions were carried in one massive database. When power outages occurred, the combination of one monolithic application addressing all transactions with one extremely large database caused a meltdown due to the "single point of failure" vulnerability of the symmetric multiprocessing (SMP) systems. Thus, eBay decided to redesign its infrastructure and software and rebuild its data centers as a grid structure. Instead of one huge back-end server and four or five large search databases, eBay went to 60 front-end NT servers and 200 back-end databases, some of which are used for disaster recovery and data replication.

InfiniBand's fabric-based architecture helps solve the single point of failure problem, particularly for inexpensive systems such as Apple's G5 Xserve platforms, which are not in and of themselves hot-pluggable.

To start with, InfiniBand switches are designed to support dynamic subnet management (to identify new paths in case of failure), multiple paths between any two points on the network, and hot-plug capabilities for all their switch line cards. Companies designing and manufacturing InfiniBand switches realized that scaling a commodity cluster up to several hundred nodes would require these types of features. The InfiniBand architecture was itself designed to support these ideas, and innovative companies simply took it a step further by creating a dynamic fabric manager that runs natively within the switch and by providing hot-plug capabilities for the components. This lets customers not only swap out a 12-port InfiniBand line card, but also make use of redundant bridged I/O line cards such as Fibre Channel and Gigabit Ethernet. Using host-based 802.3ad trunking across InfiniNIC interfaces, users can completely remove a network line card without losing connectivity to the outside network.
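
To illustrate why multiple paths matter, the sketch below models a small fabric as a graph and recomputes a route after a switch is removed. It is a toy breadth-first search over a hypothetical two-leaf, two-spine topology; a real subnet manager programs forwarding tables in the switches and is far more involved. The topology and all names here are assumptions.

```python
from collections import deque

def find_path(links, src, dst, failed=frozenset()):
    """Return a shortest path from src to dst that avoids failed nodes.

    links maps each node to its directly connected neighbors.
    Returns None when no path survives the failures.
    """
    if src in failed or dst in failed:
        return None
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in links.get(node, ()):
            if neighbor not in failed and neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical fabric: each leaf switch connects to both spine switches.
FABRIC = {
    "hostA": ["leaf1"],
    "hostB": ["leaf2"],
    "leaf1": ["hostA", "spine1", "spine2"],
    "leaf2": ["hostB", "spine1", "spine2"],
    "spine1": ["leaf1", "leaf2"],
    "spine2": ["leaf1", "leaf2"],
}
```

With both spines up, find_path(FABRIC, "hostA", "hostB") goes through spine1. Passing failed={"spine1"}, as if that line card had been pulled, yields a route through spine2 instead, and only the loss of both spines disconnects the hosts.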

InfiniBand's SCSI RDMA Protocol (SRP) layer can be used to present "virtual" Fibre Channel interfaces to each server in the cluster. Thus, without the need for an additional Fibre Channel adapter or cable, the cluster nodes are able to see the entire available disk as if it were attached locally. Combined with dual-path technology within the host operating system, this increases cluster reliability. This is perhaps one of the biggest advantages of the InfiniBand interconnect fabric: the ability to make a commodity-based cluster (such as Virginia Tech's Terascale facility) hot-pluggable.

For Internet-based applications in which 24/7 system availability is expected, reliability, availability, and serviceability (RAS) are a key measure of whether an architecture can move the industry forward. The InfiniBand 1.1 specification describes a powerful architecture designed to support I/O connectivity for the Internet infrastructure. In this regard, InfiniBand exceeds all previous RAS metrics by supporting a comprehensive silicon, software, and system solution.

InfiniBand's fabric-based architecture offers another huge advantage over traditional clustering technologies—the ability to combine multiple protocols into one wire across the cluster.

Even with the benefits offered by InfiniBand, there are still numerous issues to be resolved to have robust clusters on the order of 1000 processors. Jobs running across thousands of processors still need to be designed to handle the eventual loss of some portion of their processors or memory while still returning useful results.

A systemic software solution came in the form of Deja Vu, software licensed by Virginia Tech Intellectual Properties (VTIP) that provides a fault-tolerant software environment: if any one component in the new supercomputer were to fail, the queuing system would be alerted (http://www.californiadigital.com/dejavu.shtml). Within milliseconds, a free node would take over, averting the need to restart from scratch a calculation that can potentially represent months of work.

By combining the "one wire" strategy of InfiniBand's bridged I/O, hot-plug capabilities, and commodity compute architectures, you can assemble a serious cluster capable of cracking the top500.org list. However, no vendor and no technology can rest on its laurels, and both SilverStorm Technologies (http://www.silverstorm.com/) and Virginia Tech are working on the new technologies that will put InfiniBand-based clusters at the top of the top500.org list.

InfiniBand technology started out as a 2.5 Gbps link. Today it's running at four times this speed (10 Gbps), although that is generally limited by today's bus architectures. PCI-X is only capable of about 8.5 Gbps theoretical (a 64-bit bus at 133 MHz), and in practice, people are seeing 80-90 percent of that. PCI Express offers some improvement in that its initial offering will be full duplex, but it is still a bit limited. When 8X PCI Express becomes available, we will see DDR and QDR technologies come into play, and InfiniBand line cards should be capable of 20 and even 40 Gbps. This is necessary to help cluster architects take full advantage of tomorrow's dual- and quad-core processor architectures.
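
The link-speed arithmetic can be sketched as follows. The sketch assumes per-lane signaling rates of 2.5, 5, and 10 Gbps for SDR, DDR, and QDR, and 8b/10b encoding (8 data bits carried in every 10 bits on the wire); the PCI-X helper assumes a 64-bit bus clocked at 133 MHz. The function and constant names are illustrative only.

```python
# Per-lane signaling rates in Gbps (assumed values for SDR, DDR, QDR).
LANE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}

# 8b/10b encoding: 8 bits of data for every 10 bits signaled.
ENCODING_EFFICIENCY = 0.8

def signaling_rate_gbps(width, speed):
    """Raw signaling rate in one direction for a link of `width` lanes."""
    return width * LANE_GBPS[speed]

def data_rate_gbps(width, speed):
    """Usable data rate in one direction after encoding overhead."""
    return signaling_rate_gbps(width, speed) * ENCODING_EFFICIENCY

def pci_x_peak_gbps(bus_bits=64, clock_mhz=133):
    """Theoretical peak of a shared parallel bus: width times clock."""
    return bus_bits * clock_mhz / 1000.0
```

Under these assumptions, a 4X SDR link signals at 10 Gbps and carries 8 Gbps of data in each direction, while pci_x_peak_gbps() comes out at roughly 8.5 Gbps shared across both directions, so the host bus rather than the fabric tends to be the limit.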

Conclusion

Clusters currently account for fully one half of the high-performance computing (HPC) market and are its fastest-growing segment. Given this growth, you can't ignore the advantages of high-bandwidth, low-latency interconnects for expanding compute power using inexpensive commodity systems. While interconnects such as Myrinet, Quadrics, and Dolphin's SCI can provide some of the "speeds and feeds" necessary for MPI-based HPC work, they lack the overall bus-fabric underpinnings that were designed into InfiniBand from the beginning. InfiniBand has a strong roadmap and excellent RAS features, and it provides the brightest future for near-term and long-term cluster projects. Ultimately, InfiniBand offers a cost-effective solution for making Apple networks hot-pluggable.

DDJ