Feb04: Basic Parallel Architectures and Programming

Basic Parallel Architectures and Programming

At the highest level, most parallel scientific programs can be characterized as either task parallel or data parallel. These terms describe how the program obtains parallelism.

A task-parallel program consists of multiple independent tasks that can be completed with little or no communication between the tasks. Typically, one process is in charge of assigning the tasks to the available processors and collecting the results. The SETI@home project is one example of this model of parallel processing (http://setiathome.ssl .berkeley.edu/). The amount of parallelism available in a task-parallel program increases as the number of independent tasks increases.

A data-parallel program typically operates on large multidimensional arrays with a main loop that updates these arrays once per iteration while converging toward a solution. Each iteration of the main loop completes identical, or nearly identical, calculations to update each of the elements in one or more of the large arrays. The amount of parallelism available in a data-parallel program increases with the size of the arrays. Data-parallel programs often require communication between the processing nodes during the update calculations. Therefore, the distribution of the data among the processing nodes is an important consideration when designing these programs.

A classification scheme also exists for describing hardware that supports parallel programs. This classification scheme focuses on one of the most important considerations in parallel programming—access to main memory by the processors. At the top level, a parallel machine can be described as either a distributed-memory machine or a shared-memory machine.

In a distributed-memory machine each processor has its own local main memory that is not directly accessible by other processors. Sharing data between processors is accomplished via message passing; that is, explicitly sending data from one processing node to another. These messages are sent over a network that connects the processors.

In a shared-memory machine, all processors have equal access to all available memory. Like distributed-memory machines, shared-memory machines also need an interconnection network; however, in this case the network connects the processors to the main memory. For best performance, applications must avoid contention on this network by avoiding multiple simultaneous requests for data stored in the same area of memory. Even on a shared-memory machine, message-passing style programs, using MPI, perform well. In this case, each process only directly accesses portions of the memory that hold its data. Message passing is implemented (within MPI) using standard memory-to-memory moves. Unlike most multithreaded shared-memory applications, this results in very scalable parallel applications due to the low contention on the processor-to-memory interconnection network.

Of course in the real world, these classification schemes are blurred. Currently, many high-performance parallel machines available from manufacturers such as IBM, Sun, Hewlett-Packard, and SGI are distributed-memory machines with high-speed interconnection networks, in which each processing node in the network is a 2- to 16-processor shared-memory multiprocessor. Additional hardware and software is sometimes available for these machines that provide you with a shared-memory API on top of the basic distributed-memory architecture. Luckily, with vendor-tuned MPI libraries, all of these machines can still run your C/Fortran MPI programs without change and with good communications performance.

For an introduction to parallel programming with MPI, including books and online tutorials, see http://www.lam-mpi .org/tutorials/ and http://www.ERC .MsState.Edu/labs/hpcl/projects/mpi/.

—W.L.G., J.G.H., J.E.D.

Back to Article