1. Field of the Invention
The present invention relates generally to reducing the communication cost between tasks running on different processing elements, each of which may be a processor, core, hardware thread, node, or computing machine in an execution environment.
2. Description of the Prior Art
Message Passing Interface (MPI) is the prevalent programming model for high performance computing (HPC), mainly due to its portability and support across HPC platforms. Because most HPC centers have a large variety of machines, portability is a major concern of MPI programmers. Therefore, MPI programs are typically optimized for algorithmic and generic communication issues.
There is a significant body of work on modeling communication between tasks in parallel programs; see, for example, the reference to A. Aggarwal, et al. entitled “On communication latency in PRAM computation”, Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, pages 11-21, June 1989; and the reference to D. Culler, et al. entitled “Log P: Towards a realistic model of parallel computation”, Proceedings of the ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, May 1993. The entire contents and disclosure of these two references are incorporated by reference as if fully set forth herein.
Most of these models are designed to analyze parallel algorithms, and typically contain a small number of parameters that abstract the communication on the machine such that machine-specific features are suppressed. Parallel computation models such as Log P and Log GP allow users to analyze parallel algorithms by providing a small set of parameters that characterize an abstract machine. Often, the execution environment characteristics (actual machine-specific characteristics) are ignored by design. For example, the Log P model intentionally leaves out the intercommunication network characteristics and the network routing algorithm in order to keep the model tractable.
In the reference to Traeff entitled “Implementing the MPI process topology mechanism”, Supercomputing '02: Proceedings of the 2002 ACM/IEEE conference on Supercomputing, pages 1-14, Los Alamitos, Calif., USA, 2002. IEEE Computer Society Press, there is presented a graph embedding algorithm that optimizes the MPI communication by matching the application communication patterns to the topology using the MPI virtual topology mechanism. His study focuses on the performance of the embedding algorithm and requires the user to specify the application communication patterns and to code the virtual topology in the application.
A. Pant and H. Jafri, in their reference entitled “Communicating efficiently on cluster based grids with MPICH-VMI”, Cluster 2004, September 2004, present two complementary approaches, which extend the MPICH implementation of MPI, to reduce the communication cost of an MPI application that runs on a cluster of machines. Their topology consists of slow wide-area links that interconnect clusters and faster links that interconnect processors within a cluster. They use a profile-guided optimization approach to map MPI tasks to processors to reduce the cost of point-to-point communication. They also replace sets of communications with collective operations (e.g., allreduce or broadcast) to minimize the traffic on the slow inter-cluster links, using topology information.
It is understood that repeatability may be used to infer message sizes and change the MPI library to take advantage of the extra knowledge to reduce the amount of time spent on the rendezvous protocol.
In many cases, for a specific application running on a specific machine, the mapping of MPI tasks to processing elements has a significant impact on performance. This effect is due in part to the fact that many scientific applications exhibit a regular point-to-point communication pattern between a subset of the neighbors. Again this is partly a consequence of good MPI programming education—if global communication is needed, MPI programmers use collective operations over defined MPI communicators, which are typically tuned to the underlying architecture. However, MPI tasks are often mapped by default to the processing elements in a linear order, which may not be the mapping that achieves the best performance.
To address this problem, there is a need to be able to understand and model the hardware communication topology of the execution environment and the application communication pattern.
Thus, it would be highly desirable to provide an algorithm to map MPI tasks to processing elements, and a cost estimator, to evaluate a mapping algorithm's effectiveness in improving computing system performance.
The present invention addresses the notion that the mapping of MPI tasks to processors has a significant impact on performance.
Accordingly, there is provided a system and method for mapping application tasks to processing elements in a computing environment that takes into account the communication topology of a machine and communication pattern of an application. The Hardware Communication Topology (HCT) is defined according to hardware parameters affecting communication between two tasks, such as connectivity, bandwidth, and latency. Bandwidth is derived for different message sizes. The HCT models processing elements and how the processing elements communicate as switch elements. The Application Communication Pattern (ACP) models point-to-point communication by capturing, from information collected on the messages exchanged by tasks that communicate, the number and size of messages that are communicated between the different pairs of communicating tasks of the application. Both the hardware communication topology and the application communication pattern can be advantageously used by an algorithm to determine a mapping of tasks to processing elements thereby benefiting overall performance.
Thus, given a hardware communication topology, and an application communication pattern, the invention provides a means for combining them to produce a mapping of tasks to processing elements. Moreover, a means is provided that, given the hardware communication topology and an application communication pattern, the cost of a particular mapping is calculated. Different mappings are explored and the most desirable one, as predicted by the cost algorithm, is chosen. The mechanism can be employed automatically and thus with it, it is feasible to optimize the execution environment on a machine-by-machine and application-by-application basis.
Thus, in accordance with one aspect of the invention, there is provided, in a computer system including multiple processors and a mechanism for the processing elements to communicate with each other, a hardware communication topology that models how processing elements communicate, and a program containing tasks that run on the processing elements and communicate, a method of providing a mapping of tasks to processing elements, the method comprising:
a) determining the hardware communication topology defined by the cost of communication between processing elements,
b) measuring the application communication pattern between said program tasks,
c) producing a mapping of said tasks to said processing elements by combining said determination with said measurement.
Further, to this aspect of the invention, the hardware communication topology comprises: one or more processing elements, and one or more switch elements.
Moreover, the hardware communication topology may comprise a tree-structured topology having: one or more processing elements at a lowest level in the topology. Thus, the task mapping determination comprises: determining an amount of data communicated between tasks; and clustering said tasks for the switch elements at each level in said topology based on said data amount communicated. Clusters may be assigned at each level in said topology to the collections of processing elements at that level, in a manner that includes balancing of processing resources to improve communication balance.
The task mapping determination further implements concurrency to reduce communication imbalance by separating task clusters that exchange less data thereby allowing them to more evenly proceed in parallel.
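The clustering step summarized above can be illustrated with a minimal sketch. It is a plain greedy agglomerative merge, offered as a stand-in under stated assumptions rather than as the heuristic of the invention: the function name `cluster_tasks`, the `volume` dictionary of bytes exchanged per task pair, and the fixed `capacity` (processing elements below one switch element) are all hypothetical.

```python
def cluster_tasks(volume, num_tasks, capacity):
    """Greedily group tasks so that heavily communicating tasks share a cluster.

    volume    -- dict mapping (task_a, task_b) to bytes exchanged (assumed input)
    capacity  -- assumed number of processing elements under one switch element
    """
    clusters = {t: [t] for t in range(num_tasks)}  # cluster id -> member tasks

    def between(a, b):
        # Total bytes exchanged, in either direction, between two clusters.
        return sum(volume.get((x, y), 0) + volume.get((y, x), 0)
                   for x in clusters[a] for y in clusters[b])

    while True:
        best = None
        ids = sorted(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                if len(clusters[a]) + len(clusters[b]) > capacity:
                    continue  # merged cluster would not fit under one switch
                v = between(a, b)
                if v > 0 and (best is None or v > best[0]):
                    best = (v, a, b)
        if best is None:
            break  # no remaining pair both communicates and fits
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    return sorted(sorted(members) for members in clusters.values())


# Tasks 0<->2 and 1<->3 exchange the most data, so they are paired.
example_volume = {(0, 1): 10, (2, 3): 10, (0, 2): 100, (1, 3): 100}
print(cluster_tasks(example_volume, num_tasks=4, capacity=2))
```

Applying the same merge again to the resulting clusters, with the capacity of the next switch level, would give one possible bottom-up pass over a tree-structured topology.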
Advantageously, the system and method of the invention for mapping application tasks to processing elements in a computing environment that takes into account the hardware communication topology of a machine and application communication pattern of an application results in significant performance advantages.
The objects, features and advantages of the present invention will become apparent to one skilled in the art, in view of the following detailed description taken in combination with the attached drawings, in which:
As referred to herein, the term “processing element” refers to an element adapted to provide a unit of computation.
A large class of scientific applications are written in a stylized manner. Typically these applications consist of a series of steps, where in each step, the application executes two phases: A computation phase in general is followed by a communication phase with synchronization between them. This class of applications has “concurrent communication” where the communication between MPI tasks occurs at the same time, i.e., there is a period of time during which all of the tasks are performing communication. In this class of applications, there are four places where performance can be improved: 1) Computation: the amount of time required to perform the computation; 2) Computation Imbalance: the difference between the longest and shortest compute times per phase; 3) Communication Overhead: the amount of time required to perform communication; and, 4) Communication Imbalance: the difference in time between when the last task finishes communication in a given phase from when the first task finishes.
The system and method of the present invention addresses reducing the communication overhead and imbalance in applications using MPI for communication. Improving computation performance is typically achieved with traditional performance analysis and tuning on a single processing element. This work reduces communication overhead and communication imbalance, and thereby improves performance, by adapting an application to its execution environment, without modifying source code. The system and method of the present invention focuses on how to exploit a hardware communication topology with respect to bandwidth and concurrency by explicitly mapping MPI tasks to processing elements to reduce the communication overhead of point-to-point communication. Bandwidth can be exploited to reduce communication overhead by mapping tasks that communicate the most data to areas of the topology that are closest together with the highest bandwidth. Concurrency can be exploited to reduce communication imbalance by separating tasks or groups of tasks that exchange less data thereby allowing them to more evenly proceed in parallel.
The problem of finding the best mapping for a given hardware communication topology and application communication pattern cannot practically be solved by exhaustive search. Any MPI task may be mapped to any processing element. Thus, for “T” tasks there are T factorial mappings. Because the number of mappings grows at least exponentially with T, any reasonable-sized problem requires a heuristic algorithm to determine a mapping.
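To give a sense of scale, the following short sketch counts the one-to-one mappings of T tasks onto T processing elements. The helper name `count_mappings` is hypothetical and serves only this illustration.

```python
import math


def count_mappings(num_tasks: int) -> int:
    """Number of one-to-one assignments of T tasks to T processing elements (T!)."""
    return math.factorial(num_tasks)


for t in (4, 8, 16, 32):
    print(t, count_mappings(t))
```

Even at 16 tasks the count exceeds twenty trillion, which is why enumerating every mapping is not an option.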
The graph in
Further in the graph of
Further it is evident from the graph of
Finally, switch elements have a certain maximum bandwidth between any given pair of communicating tasks. That is, as shown from the graph of
The data from the graph depicted in
While
The modeling of the hardware communication topology that MPI tasks use to communicate, and the modeling of application communication patterns from the data that these tasks communicate, are now described.
Today's large-scale machines are often constructed out of smaller SMP nodes connected in a hierarchical manner by a high bandwidth interconnect. The tree-structured hardware communication topology is modeled as shown in
A hardware communication topology may define multiple processing elements to reflect multiple types of computation units and define multiple switch elements to reflect multiple types of switch elements. An exemplary, non-limiting hardware communication topology that will be referred to herein for illustrative purposes has two switch elements defined: one for a dual-processor machine, and another for an 8-port 100 Mb network switch.
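One way such a two-level topology might be represented is sketched below. The class names, the `bandwidth_by_size` table, and every bandwidth figure are assumptions made for illustration; none of them is taken from measurements or from the invention itself.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingElement:
    name: str


@dataclass
class SwitchElement:
    kind: str
    # Assumed table: message-size upper bound (bytes) -> bandwidth (bytes/s).
    bandwidth_by_size: dict
    children: list = field(default_factory=list)

    def bandwidth(self, message_size):
        """Look up the bandwidth bucket that covers a given message size."""
        for limit in sorted(self.bandwidth_by_size):
            if message_size <= limit:
                return self.bandwidth_by_size[limit]
        # Larger than every bucket: fall back to the largest bucket.
        return self.bandwidth_by_size[max(self.bandwidth_by_size)]

    def processing_elements(self):
        """Collect every processing element below this switch element."""
        found = []
        for child in self.children:
            if isinstance(child, ProcessingElement):
                found.append(child)
            else:
                found.extend(child.processing_elements())
        return found


def build_example_topology():
    # Made-up numbers: a slow network switch above fast dual-processor nodes.
    network = SwitchElement("network-8port-100Mb", {1024: 5e6, 1 << 20: 11e6})
    for n in range(8):
        node = SwitchElement("dual-processor", {1024: 200e6, 1 << 20: 800e6})
        node.children = [ProcessingElement(f"node{n}.cpu{c}") for c in range(2)]
        network.children.append(node)
    return network


topology = build_example_topology()
print(len(topology.processing_elements()))
```

Keying bandwidth by message size reflects the statement that bandwidth is derived for different message sizes; a real model would populate the table from measurements.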
An application communication pattern characterizes the way in which one MPI task exchanges data with another MPI task. An application communication pattern is characterized by the number of messages of each size that is communicated between each pair of MPI tasks. An application communication pattern does not contain the order that messages are sent between two MPI tasks.
Application communication patterns are derived from a trace of the point-to-point communication in an MPI application. A histogram of the communication is computed and is available to the analysis tools performing the MPI mapping task of the present invention. In the particular classes of applications under consideration, only communication from a single phase needed to be modeled because each communication phase was representative of the other communication phases. This is true for many MPI applications. For those that have bimodal or other patterns between phases, each unique class of phases would need to be characterized, with the weighted combination of those phases being used to determine the overall effect on performance. Thus, it is understood that application communication patterns may be specified in varying detail. On one hand, specifying only that task A communicates a total of N bytes with task B provides minimal detail. On the other hand, specifying the individual messages between two tasks with their size, latency and ordering provides additional detail that may be used to further characterize the application communication pattern if needed.
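A minimal sketch of building such a histogram from a trace follows. The trace format, a list of (sender, receiver, size in bytes) tuples, and the function names `build_acp` and `total_bytes` are assumptions for illustration, not the format used by any particular tracing tool.

```python
from collections import Counter, defaultdict


def build_acp(trace):
    """Histogram of message sizes per (sender, receiver) pair.

    Message ordering is deliberately discarded, matching the description
    of an application communication pattern given above.
    """
    acp = defaultdict(Counter)
    for sender, receiver, size in trace:
        acp[(sender, receiver)][size] += 1
    return acp


def total_bytes(acp):
    """Coarsest view of the pattern: total bytes per communicating pair."""
    return {pair: sum(size * count for size, count in hist.items())
            for pair, hist in acp.items()}


# Hypothetical trace of one communication phase.
example_trace = [(0, 1, 1024), (0, 1, 1024), (0, 1, 64), (1, 0, 64), (2, 3, 4096)]
acp = build_acp(example_trace)
print(dict(acp[(0, 1)]))
print(total_bytes(acp))
```

The per-size counts correspond to the histogram level of detail, while `total_bytes` corresponds to the minimal "N bytes between task A and task B" level.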
According to one embodiment of the invention, as depicted in the methodology depicted in
Continuing in
Thus, in accordance with the bottom-up pass, as depicted in
Then, the process continues as shown at step 113, where the second pass of the heuristic starts from the top to assign the clusters (MPI task groups) to switch elements to exploit concurrency as indicated at step 115. That is, in accordance with the top-down pass, as depicted in
The complexity of the algorithm described herein with respect to
In the example embodiment described herein with respect to
Given the hardware communication topology “T”, the mapping of MPI tasks to processing elements “M”, and the application communication pattern between these tasks “P”, the cost estimator estimates the communication time (or cost “C”) for a communication phase using the algorithm as now described herein with respect to
For illustrative purposes, the model depicted in
Referring now to
After all communication of all tasks is processed, the total time “C” for the communication phase is taken as the maximum time over all ports in the hardware communication topology.
The complexity of this algorithm is O(M×N) where M is the number of messages sent between MPI ranks and N is the number of ports in the hardware communication topology.
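The final step of the estimator, taking the maximum accumulated time over all ports, can be sketched as follows. Everything about how time is charged to ports here is an assumption for illustration: a two-level topology, one port per node and one network port per node, a single bandwidth per level regardless of message size, and no latency term. The names `estimate_cost`, `node_of`, `intra_bw`, and `inter_bw` are hypothetical.

```python
from collections import defaultdict


def estimate_cost(messages, mapping, node_of, intra_bw, inter_bw):
    """Estimate the communication-phase time C for one mapping.

    messages -- iterable of (sender_task, receiver_task, size_in_bytes)
    mapping  -- mapping[task] gives the processing element the task runs on
    node_of  -- node_of[element] gives the node containing that element
    """
    port_time = defaultdict(float)
    for sender, receiver, size in messages:
        src_node = node_of[mapping[sender]]
        dst_node = node_of[mapping[receiver]]
        if src_node == dst_node:
            # Assumed: traffic inside a node is charged to that node's port.
            port_time[("node", src_node)] += size / intra_bw
        else:
            # Assumed: traffic between nodes is charged to both network ports.
            port_time[("net", src_node)] += size / inter_bw
            port_time[("net", dst_node)] += size / inter_bw
    # Communication is assumed fully concurrent, so the busiest port sets C.
    return max(port_time.values(), default=0.0)


node_of = {0: 0, 1: 0, 2: 1, 3: 1}
messages = [(0, 1, 1000), (2, 3, 1000)]
print(estimate_cost(messages, {0: 0, 1: 1, 2: 2, 3: 3}, node_of, 100.0, 10.0))
print(estimate_cost(messages, {0: 0, 1: 2, 2: 1, 3: 3}, node_of, 100.0, 10.0))
```

In this toy case the first mapping keeps each communicating pair within a node, while the second forces both pairs across the slower network ports, and the estimate rises accordingly.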
As the cost estimator assumes fully concurrent communication, the cost estimator may be less accurate for those applications where only a percentage of their communication is concurrent. It is understood that the approach of the model described herein, while sufficiently accurate to model the applications as presented herein, may be extended by taking into account the percentage overlap, i.e., addressing the tradeoff between improved accuracy and increased model complexity.
As is readily seen from
The mapping of MPI tasks to processing elements in an MPI program is a critical decision that significantly impacts performance. By taking into account both the characteristics of the hardware communication topology (memory and network) and of the application communication pattern, the methodology described herein estimates a communication cost. Using the HCT, ACP, and cost estimator, a heuristic algorithm may be used to generate a mapping of MPI tasks to processing elements that improves overall performance for the given execution environment. The invention is not limited to implementing a heuristic to map MPI tasks to processing elements and then computing the cost estimate based on that mapping. Other MPI mappings may be evaluated using the cost estimator to determine a spread of performance between a best-case and a worst-case mapping. For example, it would be possible at each level to probabilistically cluster processing elements, with the likelihood assigned in a weighted manner based on the amount they communicate, and then run the cost estimator after each clustering level to determine whether a particular cluster is more or less beneficial. Tying the cost estimator to the heuristic algorithm allows the information the cost estimator provides to guide the heuristic, although at a cost, because the cost estimator itself must be run more frequently. Other simple heuristic algorithms, for example generating random mappings and using the cost estimator to evaluate each mapping, are also possibilities provided by the current invention.
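The random-mapping alternative mentioned above can be sketched as a simple search that keeps the cheapest mapping seen. The function `random_search`, the toy cost function `toy_cost`, the message list, and the bandwidth figures are all hypothetical; any cost estimator with the same call shape could be substituted.

```python
import random


def random_search(num_tasks, cost_of, trials=200, seed=0):
    """Try random task-to-element mappings and keep the cheapest one.

    The default linear mapping (task i on element i) is the starting
    candidate, so the result is never worse than the default.
    """
    rng = random.Random(seed)
    best = list(range(num_tasks))
    best_cost = cost_of(best)
    for _ in range(trials):
        candidate = list(range(num_tasks))
        rng.shuffle(candidate)
        cost = cost_of(candidate)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost


# Hypothetical phase: task 0 talks to task 2, task 1 talks to task 3.
MESSAGES = [(0, 2, 1000), (1, 3, 1000)]


def toy_cost(mapping, elements_per_node=2, intra_bw=100.0, inter_bw=10.0):
    """Toy estimator: busiest port time, with assumed per-level bandwidths."""
    port_time = {}
    for sender, receiver, size in MESSAGES:
        a = mapping[sender] // elements_per_node
        b = mapping[receiver] // elements_per_node
        if a == b:
            port_time[("node", a)] = port_time.get(("node", a), 0.0) + size / intra_bw
        else:
            for node in (a, b):
                key = ("net", node)
                port_time[key] = port_time.get(key, 0.0) + size / inter_bw
    return max(port_time.values(), default=0.0)


best_mapping, best_cost = random_search(4, toy_cost)
print(best_mapping, best_cost, "default:", toy_cost([0, 1, 2, 3]))
```

Seeding the search with the default mapping is a design choice that makes the comparison against the default explicit; it does nothing to make random search scale to large task counts.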
The invention has been described herein with reference to particular exemplary embodiments. Certain alterations and modifications may be apparent to those skilled in the art, without departing from the scope of the invention. For instance, the task mapping algorithm may employ the communication cost metric as it is running to determine if the result is expected to outperform the default mapping. The exemplary embodiments are meant to be illustrative, not limiting of the scope of the invention.
The U.S. Government has a paid-up license in this invention and the right in limited circumstances to require the patent owner to license others on reasonable terms as provided for by the terms of Contract No. NBCH3039004 awarded by the Defense Advanced Research Projects Agency.