The present invention in general relates to a method and apparatus for establishing connections in distributed computing systems. It more particularly relates to such a method and apparatus to facilitate expansion of such computing systems.
There is no admission that the background art disclosed in this section legally constitutes prior art.
The size of distributed high performance computing (HPC) systems used for running large parallel jobs is continuously growing. Scalability bottlenecks in software and the communication infrastructure (network hardware and transport protocols) are often impediments to running efficient parallel jobs on large computer clusters. Connection-oriented protocols require allocation of resources for each connection a particular node in the cluster establishes to any other node. These resources include memory and software objects maintained by the operating system, such as file descriptors and ports.
As the number of nodes in the computer cluster grows, the number of connections to be established for each node also grows, along with the resources allocated for these connections. When the number of nodes in a computer cluster reaches a sufficiently large number, such as several thousands, the memory allocated for the connections can occupy a significant portion of the overall system memory and thus reduce the memory available for the application algorithm and the remaining operating system services. Thus, for some applications, the overall performance may become degraded. Also, the operating system allocates internal software objects for each connection. When many connections are established, the operating system in certain circumstances may run out of such resources and subsequently refuse or be unable to efficiently establish new connections, thus limiting the scalability of the parallel jobs and the effectiveness of the parallel system as a whole.
Frequently, message passing systems for parallel processing are based on the peer-to-peer communication model, as opposed to the client-server model on which many Internet services and common database management systems are based. In the client-server model, clients usually communicate only with the server and not among themselves. In the peer-to-peer model, every process of the parallel job can typically communicate with any other process in the job. Whether communication operations between any two nodes actually take place depends on the user algorithm that uses the message passing system, but the message passing system generally has no way of knowing this in advance. Thus, many connections and associated resources may be dedicated to a job even though, depending on the actual requirements of the application, not all of them may be required.
In recent years, new high-speed networks with specialized software interfaces and transport protocols have been used to alleviate the scalability limitations of general-purpose networks, such as Ethernet with connection-oriented transports such as TCP/IP. These high-speed networks, such as Myrinet and others, provide a number of special features that deliver higher communication performance and also increased scalability. Although the high-speed networks solve many of the performance and scalability problems of large computer clusters, because they are very expensive, they have not been commonly accepted in the area of HPC cluster computing for some applications. The cost of the high-speed network in a large computer cluster can exceed a significant percentage, such as 30% or more, of the total system cost. Consequently, HPC clusters are presently largely being built using Ethernet (100 Mbps or Gigabit) with the TCP/IP transport protocol for many applications.
At least some conventional message-passing systems that work with connection-oriented transports establish connections between each pair of processes in order to satisfy the requirement for global connectivity among the processes of a single job. These connections are established during the initialization phase of the message passing system. When such a message passing system is used on a large-scale computer cluster, it may result in the creation of an excessive number of connections on each node. As job sizes increase, this may, under certain circumstances, lead to resource exhaustion, ultimately limiting the scalability of the whole computation system.
The features of this invention and the manner of attaining them will become apparent, and the invention itself will be best understood by reference to the following description of certain embodiments of the invention taken in conjunction with the accompanying drawings, wherein:
It will be readily understood that the components of the embodiments as generally described and illustrated in the drawings herein, could be arranged and designed in a wide variety of different configurations. Thus, the following more detailed description of the embodiments of the system, components and method of the present invention, as represented in the drawings, is not intended to limit the scope of the invention, as claimed, but is merely representative of the embodiment of the invention.
A method and apparatus are disclosed for establishing connections in a distributed computing system to execute a job having a group of processes. Connection acceptors associated individually with each process wait for on demand connection requests. A determination is made whether a connection is already established between a sender process and a receiver process. If none exists, the connection acceptor receives the new connection on demand request associated with the receiver process. The requested new connection is established to facilitate communication between the processes. Other connections between other processes may also be established for completing the job.
One aspect of the disclosed embodiments of the invention is to postpone the creation of communication connections in certain circumstances between the processes that belong to a single job until the time when these connections are actually needed for communication transactions as requested by the application algorithm. Thus, unnecessary connections and their associated resources are not dedicated to a particular job being run. This mechanism may be referred to as “connections on demand”. Connections are created only between processes that exchange messages. If the application algorithm does not exchange messages between some pair of processes, the “connections on demand” mechanism may not establish a connection between these two processes.
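The deferred-creation bookkeeping described above can be illustrated with a minimal sketch. The class and callback names below are hypothetical, chosen only for illustration; the sketch shows a per-process table that opens a connection to a peer only on first use, so peers that never exchange messages never consume connection resources.

```python
class OnDemandConnections:
    """Minimal sketch of "connections on demand" bookkeeping (illustrative
    names, not the actual embodiment): a connection to a peer is created
    only when the first message to that peer must be sent."""

    def __init__(self, connect_fn):
        self._connect = connect_fn   # callback that actually opens a connection
        self._open = {}              # peer -> established connection

    def connection_for(self, peer):
        # Create the connection lazily, only on first use.
        if peer not in self._open:
            self._open[peer] = self._connect(peer)
        return self._open[peer]

    @property
    def open_count(self):
        # Number of connections actually created so far.
        return len(self._open)
```

In a job of thousands of processes, a process that communicates with only two peers would hold only two entries in this table, instead of one connection per peer established at initialization.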
Many parallel applications may use algorithms that cause processes to communicate with a small subset of all of the remaining processes in the parallel job under some circumstances. An example of such an algorithm is a Computational Fluid Dynamics algorithm that solves large problems by dividing the problem among multiple processes. Each of these processes communicates only with processes that work on adjacent pieces of the large problem.
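A nearest-neighbor communication pattern of this kind can be sketched as follows. The function below is a hypothetical illustration (not taken from the embodiment) of a two-dimensional domain decomposition in which each process talks only to the owners of adjacent pieces, so it needs connections to at most four peers regardless of job size.

```python
def grid_neighbors(rank, nx, ny):
    """Illustrative sketch: ranks adjacent to `rank` in a hypothetical
    nx-by-ny 2-D domain decomposition. Each process communicates only
    with processes owning adjacent pieces of the problem."""
    x, y = rank % nx, rank // nx
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # left, right, down, up
    return [(y + dy) * nx + (x + dx)
            for dx, dy in steps
            if 0 <= x + dx < nx and 0 <= y + dy < ny]
```

With connections on demand, a 4x4 job would create at most four connections per process under this pattern, rather than the fifteen that all-pairs initialization would require.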
The “connections on demand” mechanism greatly reduces the number of effectively created and maintained connections for many parallel algorithms in use today. As a result, these applications, when executed with a message passing system implementing an embodiment of the invention, may be able to scale to much larger sizes, which may not otherwise be achieved by conventional message passing systems without the “connections on demand” mechanism in at least some circumstances.
During the initialization phase of the message passing system with support for “connections on demand”, information about the end points of the connections needed for communication between each pair of processes may be exchanged. This information may be distributed to each process and stored in that process' memory space. Connections between the end points of the processes are not created during the initialization phase. Connection creation is postponed until the moment when a particular connection is necessary for transmitting a message requested by the application algorithm. Depending on the particular application and environment, once a connection is created on demand, it may either be kept until the end of the job or destroyed. Applications with static patterns of connections and repeatable communication requests over the same connection may benefit from keeping the connection open. Since establishing new connections may be a relatively high-overhead operation, keeping the connections open for subsequent communications may avoid this overhead and improve performance under certain circumstances. The disclosed embodiments of the invention relate to the process of adaptive or dynamic connection creation; the decision whether the connections are kept open or destroyed immediately after the communication transaction finishes is beyond the scope of the disclosed embodiments of the invention.
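The endpoint exchange at initialization can be sketched as below. The function and the `allgather` callback are hypothetical stand-ins for whatever bootstrap mechanism the middleware provides; the point is that only addresses are stored, and no connections are opened at this stage.

```python
def exchange_endpoints(my_rank, my_host, my_port, allgather):
    """Illustrative sketch (hypothetical names): during initialization each
    process contributes its own endpoint and receives everyone else's via a
    collective exchange. Only addresses are stored; no connection is opened
    here -- connection creation is postponed until first use."""
    entries = allgather((my_rank, my_host, my_port))
    return {rank: (host, port) for rank, host, port in entries}
```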
The message passing software systems of the disclosed embodiments of the invention may provide communication infrastructure for exchanging messages among processes that execute a distributed application. The generic software architecture may include an application thread, which performs the application specific processing and a communication or progress thread that performs the communication operations. Depending on the architecture of the message passing system, the communication thread may be the same as the application thread or a separate system thread maintained by the message passing system. The application thread may interact with the communication thread through synchronization primitives, which may indicate when a message is sent or received. This synchronization may be necessary to ensure integrity of the message transfers under certain circumstances.
The communication thread may receive messages from other processes through connections established to these processes. Each process may have a connection descriptor associated with each connection. The connection descriptors may be maintained in an array, which may be used for checking whether new messages arrive. Different message passing systems may choose to poll these connections continuously for new message arrivals or use specific software mechanisms for aggregating the connections. The latter approach allows the message passing system to reduce the number of polls, which in turn reduces processor overhead. Also, if the communication thread is distinct from the application thread, the aggregation mechanism allows the communication thread to sleep until a new message arrives. This may further reduce the waste of processor time on communication activities under certain circumstances.
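One common way to realize such descriptor aggregation, shown here only as an illustrative sketch, is an OS readiness-notification facility such as Python's `selectors` module (which wraps select/poll/epoll/kqueue). All open connection descriptors are registered once, and the thread then blocks on a single call until any of them becomes readable, instead of polling each descriptor in a loop.

```python
import selectors
import socket

def wait_for_messages(connections, timeout=None):
    """Illustrative sketch of connection aggregation: register every open
    connection descriptor with one selector, then sleep until any of them
    has data to read (or the timeout expires). Returns the peers whose
    connections became readable."""
    sel = selectors.DefaultSelector()
    for peer, conn in connections.items():
        sel.register(conn, selectors.EVENT_READ, data=peer)
    ready = sel.select(timeout)   # blocks; no busy polling
    sel.close()
    return [key.data for key, _ in ready]
```

A progress thread built this way consumes no processor time while no messages are in flight, which is the reduction in polling overhead described above.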
The disclosed embodiments of the invention use a connection acceptor in the form of a special purpose system thread, which accepts requests for creation of new connections between processes that need to communicate according to the application algorithm. This thread is referred to herein as an “accept thread”. This thread may be distinct from the communication thread, or it may be the same thread.
The message passing system with support for “connections on demand” may complete its initialization phase without creation of any connections. As a result, the array of connection descriptors used by the communication thread may be empty.
When a process requests a message to be transferred to another destination process, the message passing system checks if a connection to the destination node is already established. If it is not, the sender process sends or initiates a connection creation request with the destination process. The accept thread on the destination process may accept the request and may complete the connection creation. Then, the accept thread of the receiver process and the sending process enter a handshake procedure that is intended to avoid race conditions which might arise in a situation when both processes attempt to initiate connections simultaneously. The procedure ensures that only one of the possible concurrent requests succeeds. The other request may be rejected. Once the procedure for race condition avoidance completes, both processes add the new connection descriptor to their array of active connections. Thus, the communication thread may be able to send and/or receive messages on this connection.
Referring now to the drawings and more particularly to
The computing nodes may be similar to one another. For example, the node 502 may include a processor 510, a memory 512 and a transport 514 for communicating with other nodes via the network 501. Similarly, for example, node 504 includes a processor 520, a memory 522 and a transport 524 for communication purposes.
It should be understood that the system 500 is a distributed computing system having a group of nodes. The nodes can be distributed geographically, or can be disposed in close proximity, such as on the same circuit board, or any combination thereof.
Referring now to
Referring now to
The user thread 2512 of the sender process 2502 initiates a communication operation to the user thread 2522 of the receiver process 2504. Before the communication operation can be performed, a connection on demand between the two processes may need to be established. The sender's user thread 2512 sends a request for a new connection to the receiver's accept thread 2524. The receiver's accept thread accepts the request, which leads to the creation of the requested connection between the two communicating processes. Once the connection is accepted by the receiver's accept thread, the accept thread informs the receiver's communication or progress thread 2526 about the availability of a new connection. The receiver's progress thread in turn adds the new connection to the array of open connections 2528. Similarly to the receiver, after the sender's request for a new connection succeeds, the sender's user thread 2512 informs the sender's progress thread 2516 about the new connection, which is added by the sender's thread 2516 to the sender's array of open connections 2518. Once the new connection is added to the arrays of open connections in both the sender and receiver processes, the communication operation requested by the sender's user thread can be executed as specified.
A sender or a sender process is a process of a peer-to-peer distributed application that executes a send operation, which may result in a connection on demand request. A receiver or a receiver process is a process of a peer-to-peer distributed application that may or may not execute a receive operation, and which may accept the connection on demand request from the sender. A user thread UT is the thread that executes the application code. A progress thread PT is a system thread which may be used by the communication middleware to implement the communication protocols. An accept thread AT is a thread which may be used by the disclosed embodiments of the invention for implementing the mechanism for establishing connections on demand.
All processes in the distributed application may take the roles of senders, receivers, or both, depending on the application algorithm. Each process may have one or more UT (depending on the application's design), one PT and one AT. PT and AT may be used by the communication middleware, PT and AT may be transparent to the user code and the interactions between UT, PT and AT may be handled internally by the middleware. Part of the code used by the embodiments of the invention may be executed by UT and PT.
As shown in
When the UT requests a communication operation to a receiver process at box 3000, a check for the existence of an already created and opened connection to the receiver is made at box 3010. If such a connection exists, the communication protocol for transfer to the peer receiver is invoked at box 3080. If the connection does not exist, a request for establishing a connection on demand to the receiver is issued and, after the request succeeds, a connection to the receiver is created at box 3020. Then, the UT waits for a reply from the receiver's AT at box 3030. The reply can be either “KEEP” or “DROP”, meaning that the UT should either keep this newly established connection or disconnect it. In normal circumstances, the reply is “KEEP”. The “DROP” reply is used in rare situations when both communicating peers make simultaneous requests for establishing a connection between them. A “DROP” reply may be received during the send side of the protocol for connections on demand when the local AT of this process has received a request from the same peer after the protocol has been initiated. This situation leads to a race condition when the communicating peers are both sender and receiver processes. This race condition is resolved by the disclosed embodiments of the invention through an internal mechanism for serializing the admission of new connections. One of the requests for connections may be admitted first, thus making the second connection unnecessary.
If the check for the contents of the reply at box 3040 yields “KEEP”, the UT records the new connection as opened at box 3070 and notifies the progress thread PT about the newly established connection; the PT in turn inserts the connection descriptor into the array of active connections at box 3080. From this moment forward, the connection is used for communication to the receiver, for both sending and receiving operations. The connections on demand protocol thereby finishes, and the requested send operation can be executed at box 3090.
If the check for the contents of the reply at box 3040 yields “DROP”, then the connection to the receiver is disconnected at box 3050 and the UT goes into a wait state, expecting a notification from the local AT at box 3060. Since the newly opened connection was dropped in this branch of the protocol algorithm, an earlier request from the same peer receiver must have arrived and succeeded. After the admission of this earlier connection by the local AT, the AT notifies the PT about the new connection so that this new connection can be used for communication to the peer receiver (see box 4060 in
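The send-side flow through boxes 3000-3090 can be summarized in a sketch. The function and callback names below are hypothetical stand-ins (the embodiment does not prescribe them); the box numbers in the comments refer to the flow just described.

```python
def send_on_demand(peer, open_conns, connect, wait_reply, wait_local_accept,
                   disconnect, notify_progress, transmit):
    """Illustrative sketch of the send side of the connections-on-demand
    protocol; callbacks are hypothetical stand-ins for the middleware."""
    if peer in open_conns:                     # box 3010: connection exists?
        transmit(open_conns[peer])             # box 3080: use existing path
        return "existing"
    conn = connect(peer)                       # box 3020: connect on demand
    if wait_reply(conn) == "KEEP":             # boxes 3030/3040: AT's reply
        open_conns[peer] = conn                # box 3070: record as opened
        notify_progress(conn)                  # box 3080: PT adds descriptor
        transmit(conn)                         # box 3090: execute the send
        return "kept"
    disconnect(conn)                           # box 3050: peer's request won
    open_conns[peer] = wait_local_accept()     # box 3060: wait for local AT
    transmit(open_conns[peer])                 # send over the admitted one
    return "dropped"
```

In the “DROP” branch the process discards its own half of the simultaneous pair and reuses the connection admitted by its local accept thread, so exactly one connection survives the race.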
As shown in
The receive side of the connections on demand protocol is executed by the AT of the receiving process. The AT may wait for requests for new connections on demand at box 4000. When a request arrives from a sender process at box 4010, a new connection to the sender is established at box 4020. Then, the AT checks if the connection has already been established at box 4030. This connection may have been established only if the race condition described above had occurred, with the send protocol of the UT being first to create the connection and record it as open. If the connection has already been established, the AT sends a “DROP” reply to the sender at box 4080, disconnects the connection to the sender at box 4090, and goes into a wait state for a subsequent connection request at box 4000.
If the connection has not been established yet, the AT records the connection as opened at box 4040 and sends a “KEEP” reply to the requesting sender at box 4050. Then, the AT notifies the PT about the newly created connection and the PT in turn adds the new connection descriptor to the array of active connections at box 4060. The AT sends a notification to the UT that a new connection to the sender has been created, in case the UT had simultaneously requested a connection to the sender and is waiting on a signal from the AT in order to continue (see box 3060 in
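The receive-side flow executed by the accept thread can be sketched in the same style as the send side. Again the function and callback names are hypothetical illustrations, with box numbers from the flow just described in the comments.

```python
def accept_on_demand(sender, open_conns, accept, send_reply, disconnect,
                     notify_progress, notify_user):
    """Illustrative sketch of the receive side of the connections-on-demand
    protocol, as run by the accept thread (AT); callbacks are hypothetical."""
    conn = accept(sender)             # boxes 4010/4020: admit new connection
    if sender in open_conns:          # box 4030: UT's own request won first
        send_reply(conn, "DROP")      # box 4080: reject the duplicate
        disconnect(conn)              # box 4090: tear the duplicate down
        return "dropped"
    open_conns[sender] = conn         # box 4040: record connection as opened
    send_reply(conn, "KEEP")          # box 4050: tell the sender to keep it
    notify_progress(conn)             # box 4060: PT adds the descriptor
    notify_user(sender)               # wake the UT if it was waiting
    return "kept"
```

Because the check at box 4030 and the recording at box 4040 are performed by the single accept thread, admission of new connections is serialized and only one of two simultaneous requests can ever be recorded as open.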
While particular embodiments of the present invention have been disclosed, it is to be understood that various different modifications are possible and are contemplated within the true spirit and scope of the appended claims. For example, the apparatus and method of the present invention may be implemented in a variety of different ways, including techniques not employing threads. The method and apparatus may be used with suitable networks such as Fast Ethernet and Gigabit Ethernet. Also, they may be implemented in MPI communication middleware, but may also be implemented in any other peer-to-peer middleware. There is no intention, therefore, of limitations to the exact abstract or disclosure herein presented.