Multicluster Environments

The desire to run MPI programs across heterogeneous sets of clusters has been around since the introduction of MPI, and other projects have provided this capability in different ways. For example, two portable and freely available implementations of MPI, MPICH (see "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard," by William Gropp, Ewing Lusk, N. Doss, and Anthony Skjellum, Parallel Computing, September 1996, http://www-unix.mcs.anl.gov/mpi/mpich) and LAM (http://www.lam-mpi.org/), can run programs across heterogeneous clusters of machines, provided that you use the same library (MPICH or LAM) on every cluster. Another MPI library, MagPIe (see "MagPIe: MPI's Collective Communication Operations for Clustered Wide Area Systems," by Thilo Kielmann, Rutger F.H. Hofman, Henri E. Bal, Aske Plaat, and Raoul A.F. Bhoedjang, Seventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming [PPoPP'99], May 1999), is based on the MPICH source code and contains updated implementations of the MPI collective communication routines that are optimized for operation over WANs. All of these solutions, however, prevent you from using the vendor-tuned MPI libraries, thus sacrificing performance within each parallel machine or cluster.

The MPI-Connect project (see "MPI Inter-connection and Control," by G.E. Fagg and K.S. London, Technical Report 98-42, Corps of Engineers Waterways Experiment Station Major Shared Resource Center, 1998) used another message-passing library, PVM, short for "Parallel Virtual Machine" (see PVM: Parallel Virtual Machine, A User's Guide and Tutorial for Networked Parallel Computing, by Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidy Sunderam, MIT Press, 1994), to connect processes running under different MPI libraries. While this lets you use the vendor MPI library on each machine, MPI-Connect does not allow the use of any of the MPI collective operations, such as broadcast or reduce. On a larger scale, the Global Grid Forum (http://www.gridforum.org/) is coordinating a set of projects that will make computers, along with other resources such as large databases, telescopes, wind tunnels, and particle accelerators, available for remote and coordinated use. The forum is investigating many issues, including user authentication, resource scheduling, and security.

—W.L.G., J.G.H., J.E.D.
