Parallel Programming with Interoperable MPI

Dr. Dobb's Journal February 2004

By William L. George, John G. Hagedorn, and Judith E. Devaney

William and John are computer scientists and Judith is group leader in the Scientific Applications and Visualization Group at the National Institute of Standards and Technology. They can be contacted at william.george@nist.gov, john.hagedorn@nist.gov, and judith.devaney@nist.gov, respectively.

Modern computing centers typically provide users with a variety of computing resources, ranging from single-processor workstations to high-performance parallel computers. Increasingly, this mix also includes Beowulf class machines—clusters of commodity PCs configured to operate as parallel computers. The Message Passing Interface (MPI) library, which provides C and Fortran interfaces to routines for sending data (messages) between processors, was designed to implement portable applications for diverse systems such as these.

Although parallel computations are normally run on single parallel computers, there is often a need to harness the resources of multiple clusters and parallel computers, forming what we call "multiclusters," to perform a single computation. (For information on related technologies, see the accompanying text box entitled "Multicluster Environments.") This might be required, for example, for simulations that are too large to be performed on any available individual parallel machine. Interoperable MPI (IMPI) provides a means of accomplishing this with minimal effort on the part of application programmers. IMPI is a set of protocols—implemented within an MPI library—that let multiple MPI libraries cooperate, acting like a single MPI library for programs running on a multicluster. The IMPI protocol specification is available at http://impi.nist.gov/impi-report/index.html. In this article, we'll examine IMPI and provide examples of how it can be used. For background on parallel architectures and programming, see the accompanying text box entitled "Basic Parallel Architectures and Programming."

A Crash Course in MPI

For readers unfamiliar with message passing, we'll briefly describe some basics of this programming style using C and MPI. Assuming you are running a program using P processes, each process will be identified in calls to MPI by an integer rank from 0 to (P-1). Listing One, for instance, sends an integer from the lowest rank process to the highest rank process.

Once this program is compiled and linked to the MPI library (-lmpi), it can be executed by a command-line utility program provided with the MPI library. Often this utility is named "mpirun." Assuming our executable is named "program1" and -np is the command-line switch for specifying the number of MPI processes (this syntax varies between MPI implementations), the command line to run our program with eight MPI processes could look like: mpirun -np 8 program1.

In Listing One, the MPI_Init and MPI_Finalize calls are required in all MPI programs. No calls to MPI routines can be made before the call to MPI_Init or after the call to MPI_Finalize. To get the rank of the local process, you call MPI_Comm_rank. To get the total number of processes, call MPI_Comm_size.

In most MPI routines, an MPI communicator is a required parameter. A communicator describes a set of processes (including the assignment of ranks to those processes) and defines a separate communications context. A message sent using one communicator can only be received by a call using the same communicator. The predefined communicator MPI_COMM_WORLD simply includes all of the processes; however, subsets of MPI_COMM_WORLD are possible.

In Listing One, the communication is performed with the most basic MPI communications routines MPI_Send and MPI_Recv. The parameters to these routines describe the message to be sent/received (message, count, and MPI_INT), the rank of the destination process (dst for MPI_Send) or source process (src for MPI_Recv), an arbitrary tag value, and an MPI communicator for message matching. The status parameter to the MPI_Recv routine holds details of the message once it has been received.

So where does IMPI fit into all of this? At the source-code level, an IMPI program is simply an MPI program. Adding IMPI support to an MPI library does not add, remove, or change any user-level MPI routines. However, there can be some additional considerations to take into account when writing an MPI program that is specifically designed to run on a heterogeneous collection of parallel machines.

Starting an IMPI Program

When running an MPI program on a multicluster with IMPI, each cluster or parallel machine in the multicluster is referred to as an IMPI client. Before running the program, users must decide on an order for these clients. This ordering determines the ranking of the processes in MPI_COMM_WORLD: the processes in client 0 receive the lowest ranks, followed by the processes in client 1, and so on. Each client is assigned a rank from 0 to one less than the number of clients.
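This rank layout can be sketched in plain C. The helper below is purely illustrative (it is not part of MPI or IMPI): given the number of processes each client contributes, listed in client order, it recovers the client that owns a given MPI_COMM_WORLD rank and that process's local rank within its client.

```c
#include <assert.h>

/* Illustrative helper (not an MPI/IMPI routine): find the IMPI client
 * that owns a given MPI_COMM_WORLD rank, assuming clients contribute
 * procs_per_client[0], procs_per_client[1], ... processes in client
 * order. Returns the client index and stores the rank within that
 * client in *local_rank, or returns -1 if the rank is out of range. */
static int client_of_rank(const int *procs_per_client, int nclients,
                          int world_rank, int *local_rank)
{
    int base = 0;  /* lowest world rank belonging to client c */
    for (int c = 0; c < nclients; c++) {
        if (world_rank < base + procs_per_client[c]) {
            *local_rank = world_rank - base;
            return c;
        }
        base += procs_per_client[c];
    }
    return -1;  /* world_rank not owned by any client */
}
```

For example, with three clients contributing 4, 2, and 3 processes, world rank 5 is local rank 1 of client 1.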

Normally, an MPI program is started with a command such as: mpirun -np <N> program-name args, where <N> is the number of processes to use. To run an MPI program using IMPI on a multicluster, an IMPI server process must first be started using the command mpirun -server <count>, where <count> is the number of IMPI clients that will be started. The IMPI server is the rendezvous point for the IMPI clients and acts as a relay between the clients during the startup of the IMPI program. The IMPI server will print to the terminal a string such as 192.168.0.1:12345, which gives the IP address and the port number of the IMPI server. This information, in this exact form, is needed to start the clients. Once the IMPI server is running, each of the clients can be started with a command of the form: mpirun -client <C> <host:port> <rest>, where <C> is the client number, <host:port> is the rendezvous information from the IMPI server, and <rest> is the rest of the standard mpirun command line.

Once an MPI program has started, all of the processes from all of the IMPI clients are included in the MPI communicator MPI_COMM_WORLD, and they are ranked according to the ranks given to the IMPI clients.

Some IMPI Usage Patterns

Now that we have described the basics of parallel message-passing programming with MPI and how to start an IMPI program, we turn to how IMPI can be used to expand the power of MPI programs. There are several types of applications that we anticipate will use IMPI to great advantage, though there are most likely many others we have not yet considered.

We are not describing new classes of parallel programs, but rather types of parallel programs that are easily supported by IMPI and likely to run successfully in a multicluster environment.

Case 1: Legacy data-parallel programs. One immediate use we anticipate for IMPI is to simply allow legacy MPI programs to run in a multicluster environment. The motivation for doing so would be to either decrease the total execution time of the program or, more likely, to enable the running of larger problems than would be possible on any one of the clusters alone.

Using IMPI in this way may require some modifications to your application to obtain reasonable performance. Unless the processing nodes in the clusters of this environment are closely matched in speed, available memory, and I/O capabilities, you may need to perform some load balancing that was not needed when you ran only on a homogeneous set of processing nodes.
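A minimal static load-balancing sketch, assuming you can estimate a relative speed for each node (for example, from a benchmark run): divide the work items in proportion to those weights. The function below is illustrative only; neither MPI nor IMPI provides such a routine.

```c
#include <assert.h>

/* Illustrative static load balancing: divide n work items among nprocs
 * processes in proportion to each process's relative speed (hypothetical
 * weights, e.g., from a benchmark). Items left over after the
 * proportional split are handed out one each, starting at process 0. */
static void balance(int n, int nprocs, const double *speed, int *share)
{
    double total = 0.0;
    for (int i = 0; i < nprocs; i++)
        total += speed[i];

    int assigned = 0;
    for (int i = 0; i < nprocs; i++) {
        share[i] = (int)(n * speed[i] / total);  /* round down */
        assigned += share[i];
    }
    for (int i = 0; assigned < n; i++, assigned++)
        share[i % nprocs]++;  /* distribute the rounding leftovers */
}
```

With two slow nodes and one node twice as fast, 8 items split as 2, 2, and 4.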

One other consideration that needs to be addressed in running your legacy data-parallel application in a multicluster environment is the handling of file I/O. Depending on the configuration of the networks connecting the clusters, and whether disk volumes are cross-mounted with some form of networked filesystem, you may need to add some preprocessing and postprocessing to move input and output files where they are needed.

Case 2: Pipelined programs. Another anticipated use of IMPI is in support of applications designed as large-grain data-flow algorithms. A simple case of this is a pipelined computation composed of several large-grain stages. Each stage of the computation can be executed on a separate parallel machine. One example of this type of application is a global climate simulator. This simulator could include separate models for the ocean, the lower atmosphere, and the upper atmosphere with defined physical boundaries between each of these modeled environments. Each of these models could be run on separate parallel machines with the coupling between the models enabled by communication over the IMPI channels; that is, the network connections between the IMPI clients (see Figure 1).

In this type of application, each MPI process needs to know not only its rank within the MPI_COMM_WORLD communicator, but also to which stage of the computation it belongs and possibly which stage each of the processes in MPI_COMM_WORLD belongs to. IMPI provides this information to the application at runtime through the use of an existing MPI facility called "attribute caching." This allows for arbitrary information to be associated with an MPI communicator for each process. For IMPI support, each process can determine which IMPI client it belongs to by retrieving a cached attribute called the IMPI_CLIENT_COLOR, which is simply an integer. For the communicator MPI_COMM_WORLD, this integer will be identical to the client rank given to the mpirun command.

The term COLOR is used to match the terminology used in the MPI routine MPI_Comm_split(MPI_Comm comm, int color, ...), a routine that creates a set of new communicators, each of which consists of all of the MPI processes that share the same color. For our pipelined application, each MPI process would pass in its IMPI client color. This is likely to be one of the first operations completed in a pipelined IMPI application so that the processes in each stage of the pipeline obtain their own private communicator to use within that stage of the computation. Listing Two presents the MPI calls needed to create the communicators for each stage of the computation.

Once these calls are completed, each MPI process knows to which stage it belongs (stage), has an MPI communicator for communications within the set of processes that comprise that stage (stage_comm), and knows its rank within that set of processes (stage_rank).

The third parameter in the call to MPI_Comm_split can be used to allow the reordering (reranking) of the processes in stage_comm if the MPI implementation would like to do so (presumably for performance reasons); 0 here means do not reorder the processes.
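For intuition, the rank assignment that MPI_Comm_split performs in this case can be modeled in plain C. With equal keys, as in Listing Two where every process passes 0, a process's new rank is simply the number of same-color processes that have a lower rank in the old communicator. The function below is an illustrative model, not an MPI routine.

```c
#include <assert.h>

/* Illustrative model of MPI_Comm_split's rank assignment when all
 * processes pass the same key: a process's rank in its new (same-color)
 * communicator equals the count of processes sharing its color that
 * have a lower rank in the old communicator. colors[r] is the color
 * passed by the process with old rank r. */
static int split_rank(const int *colors, int my_rank)
{
    int new_rank = 0;
    for (int r = 0; r < my_rank; r++)
        if (colors[r] == colors[my_rank])
            new_rank++;  /* same color, lower old rank */
    return new_rank;
}
```

For instance, if five processes pass colors 0, 1, 0, 1, 0, the color-0 communicator ranks old processes 0, 2, and 4 as 0, 1, and 2.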

Thus, using the IMPI-supplied attribute IMPI_CLIENT_COLOR in addition to the standard MPI routines for creating new communicators, you can implement a pipelined application that adapts to whatever size clusters (IMPI clients) you have available. More work would be needed if you wanted to either assign more than one client to one of the computational stages or more than one stage to a single client.
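One lightweight way to assign more than one client to a stage (a sketch of one possible approach, not an IMPI feature) is a lookup table from client color to stage number; the stage number, rather than the raw client color, would then be passed as the color argument to MPI_Comm_split.

```c
#include <assert.h>

/* Hypothetical mapping for a run with four IMPI clients feeding three
 * pipeline stages: clients 0 and 1 share stage 0, while clients 2 and 3
 * take stages 1 and 2. */
static const int stage_of_client[] = { 0, 0, 1, 2 };

static int stage_for(int client_color)
{
    return stage_of_client[client_color];
}
```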

Example code for enabling communication between the three pipeline stages, using the MPI routine MPI_Intercomm_create to create special MPI communicators, is provided in the MPI 1.1 document, Section 5.6.3 Intercommunication Examples (http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html). This is a standard MPI programming technique not affected by the use of IMPI.

Case 3: Computational steering and interactive applications. The ability to monitor the progress of large simulations, especially during initial development, can be of great help in debugging the code and in experimentally determining a set of reasonable simulation parameters. In this case, we can use IMPI to run two or three subprograms, all aware of each other and connected via MPI. These extra programs are used for monitoring and controlling the main simulation. This is akin to the model-view-controller (MVC) style of program, except that the coupling between the model/view/controller is much looser. Given the size and computational complexity of the models (simulations), the time between view updates may range from minutes to hours or even longer. Figure 2 shows a configuration of IMPI clients for this type of parallel application. In Figure 2, MPI processes are colored to indicate the various values of the IMPI_CLIENT_COLOR attribute. Code similar to that shown for pipelined programs could be used to create MPI communicators for each of the distinct parts of the program (simulator, monitor, and controller), and then, assuming the simulator is a pipelined program, create the communicators for each of the stages of the pipeline. The outline for this type of IMPI application is as follows:

One client is assigned to the monitor, one to the controller, and three to the simulator. Each MPI process, represented in Figure 2 by a circle, has a value for the IMPI_CLIENT_COLOR attribute that is cached onto MPI_COMM_WORLD. These attribute values, which match the associated client numbers in Figure 2, are emphasized here by mapping each value to a separate color; see Listing Three.

As with the pipelined program case, communication between the viewer, the controller, and the simulator is enabled by creating special MPI communicators using the MPI routine MPI_Intercomm_create.

So, the model part of Listing Three contains the simulation that is to be run on one or more clusters. This part of the program can be a data-parallel or large-grain pipelined program, as previously described, or any other type of MPI program. It is also possible for this simulator to be a multithreaded program that runs on a large shared-memory machine that uses MPI only to communicate with the view and controller parts of the IMPI program.

The second program referenced in Listing Three is a monitor program (the view portion of MVC) that performs the following steps in a loop: accept image data from the simulation, possibly once every iteration of its main loop; render this data into a form suitable for the target display; and display the image, either on your workstation or other suitable device. If the simulation is not working as expected, you will know this as early as possible. To minimize the effect of this monitoring on the performance of the simulator, the communication between the simulation and the monitor can be reduced by decimating the image data or reducing the frequency of image updates.
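The decimation step might look like the following sketch, which keeps every k-th pixel in each dimension of a row-major image before the data is sent to the monitor. This is illustrative only; the function, its name, and its parameters are not part of MPI or IMPI.

```c
#include <assert.h>

/* Illustrative decimation: keep every k-th pixel in each dimension of a
 * w x h image stored row-major in img, shrinking the data sent from the
 * simulator to the monitor by roughly a factor of k*k. The reduced
 * dimensions are returned in *ow and *oh; out must hold (*ow)*(*oh)
 * elements. */
static void decimate(const int *img, int w, int h, int k,
                     int *out, int *ow, int *oh)
{
    *ow = (w + k - 1) / k;  /* ceiling division */
    *oh = (h + k - 1) / k;
    for (int y = 0; y < *oh; y++)
        for (int x = 0; x < *ow; x++)
            out[y * (*ow) + x] = img[(y * k) * w + (x * k)];
}
```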

The third program, if needed, allows for some amount of interactivity with the simulation, perhaps letting you modify the controlling parameters of the simulation or, more drastically, kill or restart the main simulation. This control could also let you turn on/off the monitoring of the simulation as needed.

Conclusion

IMPI lets legacy MPI programs run unaltered on multiclusters consisting of two or more computing resources such as parallel machines, clusters, workstations, and PCs. Also, applications can be written specifically to run in such a multicluster, allowing greater control over various aspects of the application such as large-grain pipelining, load balancing, and file I/O. One major design advantage of IMPI over other available approaches to the problem of running on a multicluster is that IMPI uses the vendor-tuned MPI libraries for optimum communication within each parallel machine, while still allowing the unrestrained use of all of MPI.

The freely available MPI library LAM/MPI (http://www.lam-mpi.org/) supports IMPI. Full implementations of IMPI are available from Hewlett-Packard, MPI Software Technology, and Pallas GmbH (for Fujitsu). Other implementations of IMPI are anticipated in the future. Furthermore, the National Institute of Standards and Technology (NIST) IMPI test tool (http://impi.nist.gov/ImpiTT.html) lets you test IMPI implementations for conformance to the IMPI protocol standard. For a detailed background on MPI, see Using MPI: Portable Parallel Programming with the Message Passing Interface, by William Gropp, Ewing Lusk, and Anthony Skjellum (MIT Press, 1999).

Note: Certain commercial equipment, instruments, or materials are identified in this paper to foster understanding. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the materials or equipment identified are necessarily the best available for the purpose.

DDJ

Listing One

#include <mpi.h>

int main(int argc, char *argv[])
{
  int my_rank, src, dst, tag, message, nprocs, count;
  MPI_Status status;

  count = 1;   /* number of integers in the message */
  tag = 100;   /* arbitrary tag for message matching */

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* total number of processes */
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);  /* rank of this process */

  src = 0;           /* lowest rank process sends ... */
  dst = nprocs - 1;  /* ... to the highest rank process */

  if (my_rank == src) {
    message = 42;
    MPI_Send(&message, count, MPI_INT, dst, tag, MPI_COMM_WORLD);
  } else if (my_rank == dst) {
    MPI_Recv(&message, count, MPI_INT, src, tag, MPI_COMM_WORLD, &status);
  }
  MPI_Finalize();
  return 0;
}


Listing Two

int *stage, stat, stage_rank;
MPI_Comm stage_comm;

/* Retrieve the cached IMPI client color; stage points at the value */
MPI_Attr_get(MPI_COMM_WORLD, IMPI_CLIENT_COLOR, &stage, &stat);
/* One new communicator per client: group processes sharing a color */
MPI_Comm_split(MPI_COMM_WORLD, *stage, 0, &stage_comm);
/* Rank of this process within its stage */
MPI_Comm_rank(stage_comm, &stage_rank);


Listing Three

int *color_attr, stat, color;
MPI_Comm comm;

/* Retrieve the cached IMPI client color; copy it into a local variable
   rather than modifying the cached attribute value in place */
MPI_Attr_get(MPI_COMM_WORLD, IMPI_CLIENT_COLOR, &color_attr, &stat);
color = *color_attr;
if (color > 1) color = 2; /* Simulator gets all clients > 1 */
MPI_Comm_split(MPI_COMM_WORLD, color, 0, &comm);
switch (color) {
case 0: /* Call the Controller */ break;
case 1: /* Call the Monitor    */ break;
case 2: /* Call the Simulator  */ break;
}
