1. Field of the Invention
This invention generally relates to processing or computer systems, and more specifically, the invention relates to multiple party communications, such as collective communications or third party transfers, in parallel processing or computer systems.
2. Background Art
Multiple party communication operations, such as collective communications or third party transfers, in processing or computer systems (through MPI and other similar programming models) can significantly slow the running of parallel applications and reduce the sustained performance realized by the application. Most collective communication operations, for example, are generally implemented in software through the construction of a tree of the tasks in the parallel application.
Typical collective operations are: a) barrier; b) broadcast; c) reduce; and d) all reduce. For the barrier and all reduce operations, the communication first goes up the tree and then comes down the tree. For the broadcast operation, the communication starts at the root and goes down the tree, and for the reduce operation, the communication starts at the leaves and goes up until it reaches the root task, which has the result of the reduction. The barrier and all reduce operations have the same communication pattern; the difference between them is that in a barrier operation the message is just a single flag, whereas in an all reduce operation the message can be as large as the largest size that can be specified by 64 bits.
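The communication patterns described above can be illustrated with a minimal sequential sketch. It is not the claimed implementation: the implicit tree layout (task 0 as root, children of task i at positions fanout*i+1 onward) and the function names children, reduce_up, broadcast_down and all_reduce are assumptions chosen for illustration only.

```python
from typing import Callable, List, Optional


def children(i: int, n: int, fanout: int = 2) -> List[int]:
    """Children of task i in an implicit fanout-ary tree over tasks 0..n-1 (assumed layout)."""
    first = fanout * i + 1
    return [c for c in range(first, first + fanout) if c < n]


def reduce_up(values: List[int], op: Callable[[int, int], int],
              i: int = 0, fanout: int = 2) -> int:
    """Reduce: contributions travel from the leaves up to task i (the root by default)."""
    acc = values[i]
    for c in children(i, len(values), fanout):
        # A parent combines the partial results of its children one after the other.
        acc = op(acc, reduce_up(values, op, c, fanout))
    return acc


def broadcast_down(value: int, n: int, fanout: int = 2) -> List[Optional[int]]:
    """Broadcast: the root's value travels down the tree until every task holds it."""
    out: List[Optional[int]] = [None] * n
    out[0] = value
    for i in range(n):
        # Children always have a larger index than their parent, so out[i] is already set.
        for c in children(i, n, fanout):
            out[c] = out[i]
    return out


def all_reduce(values: List[int], op: Callable[[int, int], int],
               fanout: int = 2) -> List[Optional[int]]:
    """All reduce: up the tree (reduce), then down the tree (broadcast of the result)."""
    return broadcast_down(reduce_up(values, op, 0, fanout), len(values), fanout)
```

For example, all_reduce([1, 2, 3, 4, 5, 6, 7], lambda a, b: a + b) leaves the sum 28 at each of the seven tasks. A barrier follows the same up-then-down pattern with a flag in place of the data.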
Each of the above-identified operations suffers from a number of performance problems. For instance, a communication going up the tree incurs an overhead for the receivers, as they have to receive from all their children and they receive them in order since the CPU has to typically single thread the receives (since each parallel task has only 1 CPU assigned to it). A communication going down the tree incurs a serialization overhead in that the parent has to send data to all its immediate children.
Also, at any given stage of the operation, only tasks/processes in two levels of the software tree are active (one level that is sending and the other level that is receiving). So, only a small fraction of the overall processors are busy (especially for a large number of tasks).
An object of this invention is to improve multiple party communication operations, such as collective communication or third party transfers, in computer systems.
Another object of the present invention is to use intelligent pipelining in conjunction with remote direct memory access exploitation to improve multiple party communication operations in parallel processing or distributed computer systems.
A further object of the present invention is to use cut-through or wormhole style routing through the software tree to allow more efficient pipelining of communications in a multiple party communication operation.
These and other objectives are attained with a method of and system for multiple party communications in a processing system including multiple processing subsystems. Each of the processing subsystems includes a central processing unit and one or more network adapters for connecting said each processing subsystem to the other processing subsystems. A multitude of nodes are established or created, and each of these nodes is associated with one of the processing subsystems.
A first aspect of the invention involves pipelined communication using RDMA among three nodes, where the first node breaks up a large communication into multiple parts and sends these parts one after the other to the second node using RDMA, and the second node in turn absorbs and forwards each of these parts (perhaps after operating on it, as in the case of reduce or all reduce) to a third node before all parts of the communication arrive from the first node.
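The three-node pipeline of this first aspect can be sketched as follows. This is a sequential model only, with no actual RDMA or network transfer: Python generators stand in for the data path, and the names split, relay and run_pipeline, as well as the event log, are hypothetical helpers introduced for illustration.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Event = Tuple[str, int]


def split(message: bytes, part_size: int) -> Iterator[bytes]:
    """First node: break a large message into parts and emit them one after the other."""
    for offset in range(0, len(message), part_size):
        yield message[offset:offset + part_size]


def relay(parts: Iterable[bytes], log: List[Event],
          op: Callable[[bytes], bytes] = lambda part: part) -> Iterator[bytes]:
    """Second node: absorb each part, optionally operate on it, and forward it at once."""
    for k, part in enumerate(parts):
        log.append(("node2 recv", k))
        out = op(part)  # e.g. a reduction step; the default is the identity
        log.append(("node2 fwd", k))
        yield out


def run_pipeline(message: bytes, part_size: int,
                 op: Callable[[bytes], bytes] = lambda part: part
                 ) -> Tuple[bytes, List[Event]]:
    """Third node: collect the forwarded parts; also return the second node's event log."""
    log: List[Event] = []
    received = b"".join(relay(split(message, part_size), log, op))
    return received, log
```

With an 8-byte message and 2-byte parts, the log shows the second node forwarding part 0 before it receives part 1, which is the property that distinguishes this scheme from waiting for the whole message.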
In accordance with a second aspect of the invention, a tree is constructed having a multitude of nodes, each of the nodes representing a task in the processing system and being associated with one of the processing subsystems. These nodes include parent nodes and children nodes and each parent node has a plurality of children nodes, and at least one of the parent nodes has a respective one network interface adapter with each of the children nodes of said at least one of the parent nodes.
In accordance with this aspect of the invention, at least one parent node receives a first message from a first child node of said first parent node via the network interface adapter with said first child node and using remote direct memory access (RDMA). This parent node also receives a second message from a second child node of said first parent node via the network interface adapter with said second child node and using remote direct memory access.
Further benefits and advantages of the invention will become apparent from a consideration of the following detailed description, given with reference to the accompanying drawings, which specify and show preferred embodiments of the invention.
The present invention generally relates to multiple party communication operations in parallel processing or distributed computer systems, and the preferred embodiment of the invention uses remote direct memory access (RDMA) in parallel or distributed systems to improve multiple party communication operations in those systems.
More specifically, system 10 of
As indicated above, multiple party communications, such as collective communications or third party transfers, are used to transmit data among subsystems of system 100; and most collective communication operations, for example, are generally implemented in software through the construction of a software tree of the tasks in the parallel applications. Typical collective operations are: a) barrier; b) broadcast; c) reduce; and d) all reduce, and
The arrow labels show the typical order of operations. Arrows with the same label are intended to occur concurrently. One of the basic assumptions in these operations is that there is one CPU assigned per task of the parallel application (which is a typical mode of operation for MPI applications on parallel systems). This limits the ability to use multiple threads to achieve more concurrency in these operations; if each process does use multiple threads, the resulting scheduling disruption impairs the efficient running of these parallel applications.
Also, as mentioned above, there are a number of problems associated with each of the above-identified operations. One problem is that a communication going up the tree incurs an overhead for the receivers, as they have to receive from all their children and they receive them in order since the CPU has to typically single thread the receives (since each parallel task has only 1 CPU assigned to it). A communication going down the tree incurs a serialization overhead in that the parent has to send data to all its immediate children.
Another problem is that, at any given stage of the operation, only tasks/processes in two levels of the software tree are active (one level that is sending and the other level that is receiving). So, only a small fraction of the overall processors are busy (especially for a large number of tasks). The total time for these operations is typically Θ(log(n)), where n is the number of tasks in the collective operation. The example above shows a binary tree (α=2), but implementations can have a different fan out (α-ary trees, where the fan out α can be greater than 2). This only changes the base of the log but does not help reduce the order of the overhead,
In addition, with α>2, the height of the tree can be made smaller, but this increases the pipelining overhead at each intermediate stage. So this approach just trades one bottleneck for another. Moreover, the repeatability requirements enforce ordering constraints, which make out of order handling more complex with increasing α. So increasing α involves tradeoffs that need to be balanced and tuned based on the various system parameters.
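The tradeoff in choosing α can be made concrete with a small cost model. The model is an assumption made here for illustration, not a measurement: each parent is taken to serialize α sends per level, so the cost is counted as height times α, and the function names tree_height and serialized_steps are hypothetical.

```python
def tree_height(n: int, alpha: int) -> int:
    """Number of levels below the root needed to hold n tasks in a complete alpha-ary tree."""
    if n < 1 or alpha < 2:
        raise ValueError("need n >= 1 and alpha >= 2")
    height, capacity, level_width = 0, 1, 1
    while capacity < n:
        level_width *= alpha      # tasks that fit on the next level
        capacity += level_width   # tasks that fit in the tree so far
        height += 1
    return height


def serialized_steps(n: int, alpha: int) -> int:
    """Assumed cost model: every level costs alpha serialized sends by each parent."""
    return tree_height(n, alpha) * alpha


if __name__ == "__main__":
    for alpha in (2, 4, 8, 32):
        print(alpha, tree_height(1024, alpha), serialized_steps(1024, alpha))
```

Under this model, for n=1024 the height falls from 10 (α=2) to 5 (α=4), 4 (α=8) and 2 (α=32), while the serialized step count goes 20, 20, 32, 64: a shorter tree is paid for by more work at each intermediate stage.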
The present invention solves the above-discussed problems. The solution to the first problem is achieved through an intelligent application of RDMA technology (see, for example, U.S. patent application Ser. No. 10/929,943, for “Remote Direct Memory Access System And Method,” filed Aug. 30, 2004; U.S. patent application Ser. No. 11/017,406, for “Half RDMA and Half FIFO Operations,” filed Dec. 20, 2004; and U.S. patent application Ser. No. 11/017,574, for “Failover Mechanisms In RDMA Operations,” filed Dec. 20, 2004). The disclosures of the above-identified patent application Ser. Nos. 10/929,943, 11/017,406 and 11/017,574 are hereby incorporated herein by reference in their entireties.
RDMA does not involve the CPU in communication. Whereas normal approaches using the CPU in communication would result in all the serialization and other bottlenecks described above, RDMA can be done over multiple adapters concurrently because the adapters are directly transferring data from memory into the network and into the remote memory locations.
With prior art multiple party communications, most communication involves the CPU. With RDMA, though, the CPU is not involved in the communication path, and message transfer occurs directly between local and remote memories (see the U.S. patent application Ser. Nos. 10/929,943, 11/017,406 and 11/017,574 cited above).
Multiple communication adapters are prevalent in present day “parallel subsystems” since these typically have multiple CPUs and therefore need multiple adapters to handle the overall communication load for the subsystem (e.g., large SMPs). In contrast to the above-discussed use of the CPUs to communicate multiple messages, RDMA coupled with multiple adapters can help alleviate the serialization bottleneck since multiple messages may be received separately on each of the adapters, and multiple communications can be initiated over each of the adapters as well.
The present invention effectively utilizes this RDMA technology to provide a solution to the above-discussed first problem.
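The effect of spreading transfers over multiple adapters can be sketched with a simple timing model. The model is an assumption for illustration: transfers are assigned to adapters round-robin, each adapter handles its own transfers back to back, and different adapters proceed concurrently without CPU involvement. The name completion_time is hypothetical.

```python
from typing import List, Sequence


def completion_time(transfer_times: Sequence[float], num_adapters: int) -> float:
    """Time until all transfers finish when they are spread round-robin over the adapters."""
    if num_adapters < 1:
        raise ValueError("need at least one adapter")
    busy: List[float] = [0.0] * num_adapters
    for i, t in enumerate(transfer_times):
        # Transfers on the same adapter are serialized; adapters run concurrently.
        busy[i % num_adapters] += t
    return max(busy)
```

With a single adapter, two transfers of one time unit each complete after two units, which corresponds to the serialized case. With one adapter per child they complete after one unit.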
The second of the above-discussed problems is solved through intelligent pipelining in conjunction with RDMA exploitation, which enables cut-through or wormhole style routing and thus a much more efficient overlap of communication throughout the tree. This effect of cut-through routing is simulated by intelligent software controls. An example of this feature, used for a broadcast of a 4M message, is shown in
Since the RDMA control requires a simple tap to the adapter, the two transfers of 1M from task 1 to each of its children occur concurrently, with only a pipeline overhead of tapping the adapter. As soon as task 2 and task 3 receive the first 1M from task 1, they forward that 1M through RDMA to their respective children and simultaneously receive the next 1M from their parent, task 1. This, in effect, creates a cut-through routing effect for the entire message.
An important advantage of this feature of the present invention is that, without this pipelining, task 2 and task 3 would not send any data to their children until they had received the entire message. It should be noted that some pipelining efficiency can be achieved even without RDMA, but that pipelining would provide limited benefits because the CPU has to be involved in receiving all the data as well as in forwarding (sending) it to other tasks in the tree.
For large messages, the cut-through effects of the pipelining utilized in the preferred embodiment of this invention can result in a substantial increase in the efficiency of the collective operations. Another advantage of this embodiment is that, with such cut-through pipelining, messaging is in effect performed in most or all levels of the tree, except during the initialization and the final drain stages of the collective operation. It may be noted that this does not apply to the barrier case, where the message is just a single bit and does not lend itself to being broken down for pipelining efficiency. However, this technique, although demonstrated for broadcast, applies to reduce and all reduce as well, as will be apparent to those skilled in the art.
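The benefit of the cut-through pipelining for a broadcast can be estimated with a step-counting simulation. It is a sketch under stated assumptions: one step is the time to move one chunk across one tree level, all children of a level are served concurrently (as with RDMA over the adapters), and the tree height of 2 used in the example is assumed, since it is not stated in the text. The function names are hypothetical.

```python
from typing import List


def pipelined_steps(height: int, num_chunks: int) -> int:
    """Steps until the lowest level holds every chunk when levels forward chunk by chunk."""
    # received[l] counts the chunks held by level l; level 0 (the root) starts with all.
    received: List[int] = [num_chunks] + [0] * height
    steps = 0
    while received[height] < num_chunks:
        before = list(received)  # a chunk received in this step is forwarded in the next
        for level in range(1, height + 1):
            if before[level - 1] > before[level]:
                received[level] += 1
        steps += 1
    return steps


def store_and_forward_steps(height: int, num_chunks: int) -> int:
    """Same model, but a level forwards only after it has received the entire message."""
    received: List[int] = [num_chunks] + [0] * height
    steps = 0
    while received[height] < num_chunks:
        before = list(received)
        for level in range(1, height + 1):
            if before[level - 1] == num_chunks and before[level] < num_chunks:
                received[level] += 1
        steps += 1
    return steps
```

For the 4M message sent as four 1M chunks through an assumed tree of height 2, pipelined_steps(2, 4) gives 5 steps while store_and_forward_steps(2, 4) gives 8. In this model the pipelined count is height + chunks - 1, and the difference to height * chunks widens as the tree gets deeper or the message is split into more chunks.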
The choice of the granularity at which the messages should be broken up would depend on the various system parameters and can be tuned to maximize performance. In the example of
It should be understood that the present invention can be realized in hardware, software, or a combination of hardware and software. Any kind of computer/server system(s)—or other apparatus adapted for carrying out the methods described herein—is suited. A typical combination of hardware and software could be a general-purpose computer system with a computer program that, when loaded and executed, carries out the respective methods described herein. Alternatively, a specific use computer, containing specialized hardware for carrying out one or more of the functional tasks of the invention, could be utilized.
Furthermore, the invention may be practiced with other computer system configurations including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a network.
The present invention can also be embodied in a computer program product, which comprises all the respective features enabling the implementation of the methods described herein, and which—when loaded in a computer system—is able to carry out these methods. Computer program, software program, program, or software, in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.
While it is apparent that the invention herein disclosed is well calculated to fulfill the objects stated above, it will be appreciated that numerous modifications and embodiments may be devised by those skilled in the art and it is intended that the appended claims cover all such modifications and embodiments as fall within the true spirit and scope of the present invention.
This application claims the benefit, under 35 U.S.C. 120, of provisional application No. 60/704,404, filed Aug. 1, 2005. This application is related to copending application No. ______, (Attorney Docket No. POU920050108US3) for “Efficient Pipelining and Exploitation of RDMA for Multiple Party Communications,” filed ______, the entire disclosure of which is hereby incorporated by reference in its entirety.
This invention was made with Government support under Agreement No. NBCH3039004 awarded by DARPA. The Government has certain rights in the invention.
Number | Date | Country
---|---|---
60704404 | Aug 2005 | US