The invention relates to message passing infrastructure implementations. More specifically, the invention relates to techniques for improving the performance of Message Passing Interface (“MPI”) and similar message passing implementations in multifabric systems.
Many computational problems can be subdivided into independent or loosely-dependent tasks, which can be distributed among a group of processors or systems and executed in parallel. This often permits the main problem to be solved faster than would be possible if all the tasks were performed by a single processor or system. Sometimes, the processing time can be reduced proportionally to the number of processors or systems working on the sub-tasks.
Cooperating processors and systems (“workers”) can be coordinated as necessary by transmitting messages between them. Messages can also be used to distribute work and to collect results. Some partitionings or decompositions of problems can place significant demands on a message passing infrastructure, either by sending and receiving a large number of messages, or by transferring large amounts of data within the messages.
Messages may be transferred from worker to worker over a number of different communication channels, or “fabrics.” For example, workers executing on the same physical machine may be able to communicate efficiently using shared memory. Workers on different machines may communicate through a high-speed network such as InfiniBand® (a registered trademark of the InfiniBand Trade Association), Myrinet® (a registered trademark of Myricom, Inc. of Arcadia, Calif.), Scalable Coherent Interface (“SCI”), or QSNet by Quadrics, Ltd. of Bristol, United Kingdom. These networks may provide a native operational mode that exposes all of the features available from the fabric, as well as an emulation mode that permits the network to be used with legacy software. A commonly-provided emulation mode may be a Transmission Control Protocol/Internet Protocol (“TCP/IP”) mode, in which the high-speed network is largely indistinguishable from a traditional network such as Ethernet. Emulation modes may not be able to transmit data as quickly as a native mode.
To prevent the varying operational requirements of different communication fabrics from causing extra complexity in message-passing applications, a standard set of message passing functions may be defined, and “shim” libraries provided to perform the standard functions over each type of fabric. One standard library definition is the Message Passing Interface (“MPI”) from the members of the MPI Forum. An MPI (or similar) library may provide the standard functions over one or more fabrics. However, as the number of fabrics supported by a library increases, the message passing performance tends to decrease. Conversely, a library that supports only one or two fabrics may have better performance, but its applicability is limited. Techniques to improve the performance of a message passing infrastructure that supports many different communication fabrics may be of value in the field.
Embodiments of the invention are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” embodiment in this disclosure are not necessarily to the same embodiment, and such references mean “at least one.”
Embodiments of the invention can improve data throughput and message latency in a multi-fabric message-passing system by tracking the use of each fabric and avoiding operations on fabrics that are not expected to be active.
The examples discussed herein share certain non-critical features that are intended to simplify the explanations and avoid obscuring elements of the invention. First, each worker process is assumed to be identified by a unique, consecutive integer (called the process's rank). Second, a cooperating process is assumed to establish a connection or message channel to every other worker over one of the available communication fabrics when the process starts. An out-of-band method to provide a certain amount of initialization data to a worker process may also be useful. These features are not required: alternate methods of identifying worker processes may be used, and dynamic connection establishment and termination paradigms are also supported.
Next, the process loops to initialize an infrastructure for a connection to every worker of higher rank (120, 130, 140). If every worker follows this strategy, each worker will be able to establish a connection to every other worker. Initializing an infrastructure may entail opening a network socket, creating a shared memory segment, or configuring parameters of a high-speed communication fabric to support a connection. Details of this process are discussed with reference to
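The pairing rule described above can be illustrated with a minimal sketch. It assumes ranks are numbered from 0 to one less than the number of workers; the function names `higher_ranks` and `all_initialized_pairs` are hypothetical and do not come from any MPI library.

```python
def higher_ranks(my_rank, world_size):
    """Ranks this worker initializes a connection to (all higher ranks).

    Assumes ranks run from 0 to world_size - 1 (an assumption of this sketch).
    """
    return list(range(my_rank + 1, world_size))


def all_initialized_pairs(world_size):
    """Collect every (initiator, peer) pair produced when all workers follow the rule."""
    return {(rank, peer)
            for rank in range(world_size)
            for peer in higher_ranks(rank, world_size)}


# With four workers, each unordered pair of workers appears exactly once.
pairs = all_initialized_pairs(4)
```

Because each worker initializes only toward higher ranks, no pair is initialized twice, yet every pair of workers is covered.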
Once a connection has been initialized for each cooperating worker, the worker process enters a second loop to establish all the connections (150, 160). This second loop employs a subroutine known as a progress engine, which is described with reference to
The fabrics may be tried in a preferred order, for example, from fastest to slowest. Alternatively, information received through an out-of-band channel may guide the worker in choosing a fabric to initialize for a connection to another worker. For example, workers executing on the same machine may prefer to initialize and use a shared-memory channel, while workers on separate machines that each have InfiniBand® interfaces may prefer an InfiniBand® connection to another, slower fabric. A TCP/IP fabric may be used as a fallback, since it is commonly available on worker systems.
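One way the fabric choice described above might look is sketched below. The fabric names, the `PREFERRED_ORDER` list, the `choose_fabric` function, and the dictionary layout describing each worker are all assumptions made for illustration; in practice the per-worker information might arrive through the out-of-band channel.

```python
# Hypothetical fabric names, listed from fastest to slowest (an assumption).
PREFERRED_ORDER = ["shm", "infiniband", "tcp"]


def choose_fabric(local, peer):
    """Pick the first fabric in the preferred order that both workers can use.

    `local` and `peer` are dicts with a "host" name and a set of "fabrics".
    Shared memory is only usable when both workers run on the same host.
    """
    for fabric in PREFERRED_ORDER:
        if fabric not in local["fabrics"] or fabric not in peer["fabrics"]:
            continue
        if fabric == "shm" and local["host"] != peer["host"]:
            continue
        return fabric
    raise RuntimeError("no common fabric available")


a = {"host": "node1", "fabrics": {"shm", "infiniband", "tcp"}}
b = {"host": "node1", "fabrics": {"shm", "tcp"}}
c = {"host": "node2", "fabrics": {"shm", "infiniband", "tcp"}}
d = {"host": "node3", "fabrics": {"shm", "tcp"}}

same_host = choose_fabric(a, b)      # both on node1
fast_network = choose_fabric(a, c)   # different hosts, both have InfiniBand
fallback = choose_fabric(a, d)       # different hosts, only TCP/IP in common
```

In this sketch, workers on the same host select shared memory, workers on separate hosts with InfiniBand interfaces select InfiniBand, and the remaining pair falls back to TCP/IP.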
Upon entry, the progress engine begins a loop over each of the communication fabrics it is to manage (300). If any connections are in progress (e.g., at least one connection was initialized but has not been established, so a connection is expected) (310), then appropriate actions are taken to check for and process a connection over the current fabric (320). These actions may differ between fabrics, and might include calling a select( ) or poll( ) subroutine for a TCP/IP connection, or inspecting a shared memory location or interprocess communication object for shared memory. If a new connection is established (330), the counter of in-progress connections is decremented, a count of established connections over the particular fabric is incremented (340), and the progress engine returns (390).
If no connections are expected (315), or if no new connection was established over the current fabric (335), then the progress engine inspects an indicator such as a count of connections over the current fabric. If the indicator shows that connections have been established over the fabric (for example, if the count is non-zero (350)), appropriate actions are taken to check for received data or state changes on the fabric (360). These actions may differ between fabrics, and might include calling select( ) or poll( ), or read( ) or write( ) for a TCP/IP connection, or inspecting or changing a shared memory location or interprocess communication object for shared memory.
If any data is received or sent in response to a state change over the current fabric (370), the progress engine returns (390). Otherwise, the loop continues to manage the next communication fabric (380). Note that the loop divides the work of exchanging data between cooperating processes according to the communication fabric used to send or receive data. Even if two fabrics use identical semantics to establish connections and/or exchange data, so that their operations could theoretically be combined, an embodiment may nevertheless process the fabrics separately. An embodiment may terminate the progress engine after a single connection is serviced, or after servicing a single fabric (over which several connections might have been established).
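The control flow of the progress engine described above can be sketched as follows. This is a simplified illustration, not the implementation of any particular embodiment: the `Fabric` and `ProgressEngine` classes are hypothetical, and the fabric operations are simulated with in-memory lists rather than real sockets or shared memory. The numbers in the comments refer to the reference numerals used in the text.

```python
class Fabric:
    """Hypothetical per-fabric state; a real fabric would wrap sockets, shared memory, etc."""

    def __init__(self, name, pending_connects=(), pending_data=()):
        self.name = name
        self.established = 0                     # count of established connections
        self._connects = list(pending_connects)  # simulated incoming connections
        self._data = list(pending_data)          # simulated incoming messages
        self.polls = 0                           # number of operations on this fabric

    def check_connection(self):
        """Stand-in for select()/poll() or a shared-memory inspection (320)."""
        self.polls += 1
        return self._connects.pop(0) if self._connects else None

    def check_data(self):
        """Stand-in for read()/write() or a shared-memory inspection (360)."""
        self.polls += 1
        return self._data.pop(0) if self._data else None


class ProgressEngine:
    def __init__(self, fabrics, in_progress):
        self.fabrics = fabrics
        self.in_progress = in_progress  # connections initialized but not yet established

    def run_once(self):
        for fabric in self.fabrics:                        # (300)
            if self.in_progress > 0:                       # (310)
                if fabric.check_connection() is not None:  # (320, 330)
                    self.in_progress -= 1                  # (340)
                    fabric.established += 1
                    return ("connected", fabric.name)      # (390)
            if fabric.established > 0:                     # (350)
                message = fabric.check_data()              # (360)
                if message is not None:                    # (370)
                    return ("data", fabric.name, message)  # (390)
            # Nothing happened on this fabric; continue with the next one (380).
        return None


shm = Fabric("shm", pending_connects=["peer1"], pending_data=["hello"])
tcp = Fabric("tcp")
engine = ProgressEngine([shm, tcp], in_progress=1)

first = engine.run_once()   # establishes the pending shared-memory connection
second = engine.run_once()  # receives data over shared memory
third = engine.run_once()   # nothing to do on either fabric
```

In this run the simulated TCP/IP fabric is never touched: once no connections are in progress and its count of established connections is zero, the engine skips it entirely, which is the saving the per-fabric counters are meant to provide.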
In some embodiments, the different communication fabrics may be sorted according to one or more characteristic properties and processed in the sorted order by the progress engine. Sorting may be done, for example, based on the fabrics' bandwidth, typical or measured latency, or round-trip transmission time.
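As a brief illustration of sorting, the sketch below orders a list of fabrics by latency or by bandwidth before the list is handed to a progress engine. The `FabricInfo` record and all numeric values are hypothetical and chosen only to make the ordering visible; they are not measurements.

```python
from dataclasses import dataclass


@dataclass
class FabricInfo:
    name: str
    latency_us: float      # typical or measured latency, in microseconds
    bandwidth_gbps: float  # nominal bandwidth, in gigabits per second


# Illustrative numbers only; they are not taken from the source.
fabrics = [
    FabricInfo("tcp", latency_us=50.0, bandwidth_gbps=1.0),
    FabricInfo("shm", latency_us=0.5, bandwidth_gbps=40.0),
    FabricInfo("infiniband", latency_us=2.0, bandwidth_gbps=20.0),
]

# Lowest latency first, so the progress engine services the fastest fabric first.
by_latency = sorted(fabrics, key=lambda f: f.latency_us)

# Alternatively, highest bandwidth first.
by_bandwidth = sorted(fabrics, key=lambda f: f.bandwidth_gbps, reverse=True)

latency_order = [f.name for f in by_latency]
bandwidth_order = [f.name for f in by_bandwidth]
```

Because the progress engine may return as soon as one fabric has activity, the position of a fabric in the sorted list affects how often it is serviced relative to the others.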
Although the progress engine described with reference to
Embodiments of the invention can be used with one or more systems similar to that shown in
Any of these systems can be provided with a subroutine 480 or other executable instruction sequence to perform the methods described above with reference to
An embodiment of the invention may be a machine-readable medium having stored thereon instructions which cause a processor to perform operations as described above. In other embodiments, the operations might be performed by specific hardware components that contain hardwired logic. Those operations might alternatively be performed by any combination of programmed computer components and custom hardware components.
In one embodiment, instructions to direct a processor may be stored on a machine-readable medium in a human-readable form known as source code. This is a preferred form for reading and modifying the instructions, and is often accompanied by scripts or instructions that can be used to direct a compilation process, by which the source code can be placed in a form that may be executed by a processor in a system. Source code distributions may be particularly useful when the type of processor or operating system under which the embodiment of the invention will be used is not known beforehand.
A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including but not limited to Compact Disc Read-Only Memory (CD-ROM), Read-Only Memory (ROM), Random Access Memory (RAM), Erasable Programmable Read-Only Memory (EPROM), and a transmission over the Internet.
The applications of the present invention have been described largely by reference to specific examples and in terms of particular allocations of functionality to certain hardware and/or software components. However, those of skill in the art will recognize that a multi-fabric, message passing infrastructure can also be implemented by software and hardware that distribute the functions of embodiments of this invention differently than herein described. Such variations and implementations are understood to be apprehended according to the following claims.