This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2014-166741, filed on Aug. 19, 2014, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to a system and method for transmitting data embedded into control information.
A so-called supercomputer is, in many cases, intended mainly for large-scale scientific computing (high performance computing (HPC)). Therefore, in a supercomputer, the processing performance of the entire system is one of the most important considerations.
The supercomputer includes, in the system, a plurality of computing nodes serving as information processing apparatuses, each including a processor element (PE) and a communication module, a network that interconnects the plurality of computing nodes, and the like. Each of the PEs includes a central processing unit (CPU) as an operation processing device and a memory as a main storage device.
A user executes a single program by activating respective processes on the plurality of PEs of the system. When such a program is executed, there is a case in which inter-process data communication occurs. In the inter-process data communication, for example, an application programming interface (API) called a message passing interface (MPI) is used.
In recent supercomputers, the number of included PEs tends to increase. In addition, multi-core CPUs are increasingly used in the PEs, so that it is becoming common for a plurality of processes to be set up on a single PE to perform processing. Therefore, the number of processes used in executing a program on the supercomputer, and the number of inter-process communications that occur during the execution, are increasing.
In a program in which inter-process data communications occur between a plurality of processes, the communication process between the computing nodes is more important than ever to the processing of the entire program. The MPI includes, for example, an API in which data is transferred from one process (group) to a specific process (group), and an API in which data is transposed and exchanged among all the processes. When such an API is executed, the amount of communication grows as the number of processes increases, and thus the inter-process communication has a growing influence on the processing performance of the entire program.
As a technique related to data communication, there is provided a technique in which, when a data transfer request is received, a transmission unit prepares a remote direct memory access (RDMA) packet from transmission target data and speculatively transmits the RDMA packet without inquiring of the transfer destination whether reception of data is permitted. When a reception area is not available for data reception, the transmission unit retransmits the RDMA packet upon receiving a retransmission request from the transfer destination. A reception unit discards a received packet when it determines, with reference to transfer area management information, that transfer is not permitted; thereafter, when it determines that transfer is permitted, the reception unit transmits the retransmission request so that the packet is transferred again.
Japanese Laid-open Patent Publications Nos. 2007-257479 and 2011-234145 have been known as examples of the related art.
Yuichiro Ajima, Yuzo Takagi, Tomohiro Inoue, Shinya Hiramoto, Toshiyuki Shimizu: “The Tofu Interconnect”, The 19th Annual Symposium on High-Performance Interconnects, pp. 87-94 (2011) and Yuichiro Ajima, Tomohiro Inoue, Shinya Hiramoto, Toshiyuki Shimizu, Yuzo Takagi: “The Tofu Interconnect”, IEEE Micro, Vol. 32, No. 1, pp. 21-31 (2012) have been known as examples of the related art.
According to an aspect of the invention, a system includes a transmission-side apparatus and a reception-side apparatus. The transmission-side apparatus includes a first processor configured to execute a transmission-side process on target data to be transmitted to the reception-side apparatus through a communication path, where the transmission-side process generates transmission data including payload information and control information, and the control information includes the target data and address information indicating a destination address of the target data. The transmission-side apparatus further includes a first memory including a transmission-side storage area for holding the target data, and a first communication module configured to transmit the transmission data through the communication path. The reception-side apparatus is coupled to the transmission-side apparatus through the communication path, and includes a second memory including a queue area configured to store pieces of information as queueing data so as to prevent a piece of information from being overwritten by another piece of information, a second communication module configured to receive the transmission data transmitted from the transmission-side apparatus through the communication path, and a second processor configured to execute a reception-side process. The transmission-side process controls transmission of the transmission data to the reception-side apparatus through the communication path by embedding the target data into the control information included in the transmission data, and the reception-side apparatus stores the control information included in the received transmission data into the queue area as queuing data, and extracts the embedded target data from the control information stored in the queue area.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
In the above-described technique, when a plurality of inter-process communications occur simultaneously, reception data stored in a reception-side buffer may be overwritten by reception data of another inter-process communication before the reception process uses it. In this case, the overwritten data disappears from the reception-side buffer, and thus a control communication occurs in order to acquire the lost data again. When there are a large number of communicating processes or when there is a large amount of processing in each process, the delay caused by such control communications has a greater influence.
For example, the MPI is used for inter-process data communication in a supercomputer serving as an information processing system. In an implementation of the MPI library, communication is performed by selecting, for example, one of three protocols. The three protocols are an Eager protocol, a Rendezvous-RDMA read protocol, and a Rendezvous-RDMA write protocol.
In the description below, target data which a transmission process transfers to a reception process in the inter-process communication may be referred to as transmission target data.
First, the difference between an MPI communication function and an RDMA communication function will be described.
As illustrated in
In contrast, in a case of the communication using the RDMA communication function, the process A transmits data by using information on communication control with regard to the process B in addition to the process ID of the process B, as illustrated in
Subsequently, the three protocols used in the embodiment will be described with reference to
In the communication using the Eager protocol, the control communication is not performed in advance of communication of transmission target data. Instead, a reception buffer is secured using the MPI library in the reception node. When the transmission target data is received by the reception node, the transmission target data is stored in the reception buffer. Thereafter, the transmission target data, which is stored in the reception buffer, is copied into a memory area (user area) which is used by an application. In the description below, there are cases in which a memory area used by the MPI library is described as an MPI area, and a memory area used by the application is described as an application area. The reception buffer is secured in the MPI area. Also, the application area may be a specific storage area.
The communication using the Eager protocol is suitable for transferring a relatively small amount of data. More specifically, data, which is suitable for the communication using the Eager protocol, includes data having a size which fits in the reception buffer corresponding to the transmission process, and includes, for example, data whose size is up to 1 megabyte. Also, it is assumed that the transmission process in the communication using the Eager protocol knows the address of the reception node in advance.
Subsequently, the Rendezvous-RDMA read protocol will be described. In the communication using the Rendezvous-RDMA read protocol, the control communication is performed in order to exchange information on the communication control of the communication counter party in advance of communication of the transmission target data. More specifically, the reception process acquires the information on the communication control of the transmission process through the control communication. Further, the reception process acquires the transmission target data from a transmission-side application area by using the acquired information, and directly stores the transmission target data in the application area of the reception node.
The communication using the Rendezvous-RDMA read protocol is suitable for transferring large capacity data compared to the communication using the Eager protocol. More specifically, the data, which is suitable for the communication using the Rendezvous-RDMA read protocol, includes data whose size is larger than the size of the reception buffer corresponding to the transmission process, for example, data whose size is larger than 1 megabyte.
Subsequently, the Rendezvous-RDMA write protocol will be described. In the communication using the Rendezvous-RDMA write protocol, the control communication is performed in order to exchange information on the communication control of the communication counter party in advance of the transmission target data communication, in a manner similar to the communication using the Rendezvous-RDMA read protocol. More specifically, the transmission process transmits information on the communication control of the transmission process to the reception process via the control communication, and acquires the information on the communication control of the reception process as a response. Further, the transmission process transmits the transmission target data from the transmission-side application area, and directly stores the transmission target data into the application area of the reception node (RDMA Write).
As illustrated in
Similarly to the communication using the Rendezvous-RDMA read protocol, the communication using the Rendezvous-RDMA write protocol is suitable for transferring a large amount of data compared to the communication using the Eager protocol.
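The size-based choice among the three protocols described above can be sketched as follows. This is a minimal illustration and not the logic of any actual MPI library: the names `Protocol` and `choose_protocol` are hypothetical, the 1-megabyte limit is the example figure from the description (taken here as 1,000,000 bytes, which is an assumption), and the `prefer_write` flag is an assumed stand-in for whatever criterion selects between the two Rendezvous protocols, which the description does not specify.

```python
from enum import Enum


class Protocol(Enum):
    EAGER = "Eager"
    RENDEZVOUS_RDMA_READ = "Rendezvous-RDMA read"
    RENDEZVOUS_RDMA_WRITE = "Rendezvous-RDMA write"


# "Up to 1 megabyte" is the example size given in the description; taking it
# as 1,000,000 bytes is an assumption made for this sketch.
RECEPTION_BUFFER_BYTES = 1_000_000


def choose_protocol(size_bytes, buffer_bytes=RECEPTION_BUFFER_BYTES, prefer_write=False):
    """Pick a protocol from the size of the transmission target data."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= buffer_bytes:
        # Fits in the reception buffer: no control communication in advance.
        return Protocol.EAGER
    # Larger than the reception buffer: control information is exchanged first.
    # How read and write are chosen is not stated; prefer_write is hypothetical.
    if prefer_write:
        return Protocol.RENDEZVOUS_RDMA_WRITE
    return Protocol.RENDEZVOUS_RDMA_READ
```

For instance, `choose_protocol(4096)` selects the Eager protocol, whereas a 5,000,000-byte transfer falls back to one of the Rendezvous protocols.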
In the communications using the three protocols, there is a case in which data is simultaneously transmitted from different transmission processes. Therefore, there is a case in which reception buffers are secured for the respective transmission processes. In this case, when a mass communication occurs, reception-side memory usage increases.
Also, a transmission-side MPI area is an area which may be used when another transmission process transmits other data. Therefore, even when a plurality of transmission processes communicate simultaneously, the transmission-side MPI area becomes insufficient less frequently than the reception-side MPI area.
Subsequently, a procedure in which the data of the MPI area of a transmission node is transmitted to the MPI area of the reception node will be described. In the description below, there is a case in which data that is transferred from the MPI area of the transmission node to the MPI area of the reception node is referred to as communication data. In the case of communication using the Eager protocol, communication data is transmission target data. In the cases of communications using the Rendezvous-RDMA read protocol and the Rendezvous-RDMA write protocol, communication data is information on the communication control.
In
Upon receiving the communication data and the descriptor from the transmission node, the reception node stores the communication data in a buffer secured in the MPI area. When the communication data is completely stored in the buffer, the communication module of the reception node generates a reception completion notification including information of a part of the descriptor, and stores the generated reception completion notification in a queue. The queue is implemented by the communication module by using the memory of the reception node. Here, the communication module may be controlled by the hardware of the reception node or may be implemented by the function of the MPI library.
Subsequently, the MPI library of the reception node (hereinafter, referred to as a reception MPI library) refers to the reception completion notification which is stored in the queue and accesses the communication data in the buffer, based on the referred reception completion notification. Here, the reception MPI library recognizes whether or not the reception completion notification is stored in the queue by performing polling on the queue.
Further, the MPI library copies the communication data from the buffer onto the application area or the MPI area, according to the communication protocol. That is, when the communication protocol is the Eager protocol, the MPI library copies the communication data onto the application area so that the application or the like accesses the copied data in the application area. When the communication protocol is the Rendezvous-RDMA read or the Rendezvous-RDMA write, the MPI library copies the communication data onto another area in the MPI area. When the communication protocol is the Rendezvous-RDMA read or the Rendezvous-RDMA write, the communication data, which is copied onto the MPI area, is information on the communication control of the transmission node, and RDMA-Read or RDMA-Write is performed based on the information. In the description below, there is a case in which it is described that the communication data is being used for the communication during a time period from when the communication data is stored in the buffer to when the communication data is copied onto another area in the reception node.
The descriptor is communication control information which is used to control the inter-process communication of the relevant communication data. The descriptor is used by the communication module which performs the communication process. More specifically, as described later, the descriptor includes, for example, information on the communication method, the location in memory of the data of the communication counter party, the size of the communication data, and the location of the communication counter party. For example, a descriptor which is generated when a process X transmits the communication data to a process Y may include the following information: information indicating that RDMA-Write is performed, information indicating that data having a size of “L” bytes is written into the memory of the process Y starting at an address “AAAA”, and information indicating that the communication data of the process X exists at an address “BBBB”.
The transmission MPI library may be executed by the transmission process, and the reception MPI library may be executed by the reception process. In addition, a process of the communication module which has received a notification that data is transmitted from the transmission process, or specific software (program), may generate the descriptor.
In the communication using the MPI as described above, an area for temporarily storing the communication data is used in the reception node. For example, a process performing a communication acquires a buffer in the MPI area of each node, and uses the buffer for the communication. In contrast, for example, in the communication using transmission control protocol/internet protocol (TCP/IP) or the like, the communication is possible without using a buffer on the process.
There are two methods of securing a buffer for the inter-process communication: a method of securing a buffer for each inter-process communication, and a method of securing a fixed-length reception buffer which is shared and used by all the processes.
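The example descriptor above might be modeled as a small record. The following is only a sketch: the field names, the `CommMethod` enumeration, and the concrete values (`0xAAAA`, `0xBBBB`, and L = 256) are hypothetical placeholders chosen for illustration and are not part of the description.

```python
from dataclasses import dataclass
from enum import Enum


class CommMethod(Enum):
    RDMA_READ = "RDMA-Read"
    RDMA_WRITE = "RDMA-Write"


@dataclass(frozen=True)
class Descriptor:
    method: CommMethod        # communication method
    destination_process: str  # location of the communication counter party
    remote_address: int       # where the data is written in the counter party's memory
    local_address: int        # where the communication data exists on the sender
    size_bytes: int           # size of the communication data


# Process X writes L bytes into the memory of process Y.
# 0xAAAA, 0xBBBB and L = 256 are placeholder values, not taken from the text.
L = 256
descriptor = Descriptor(
    method=CommMethod.RDMA_WRITE,
    destination_process="Y",
    remote_address=0xAAAA,
    local_address=0xBBBB,
    size_bytes=L,
)
```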
In the method (first method) of securing a buffer for each inter-process communication, the reception process secures buffers in different areas of the MPI areas for each inter-process communication. That is, the reception process secures the buffers such that the areas of the buffers, which are secured for the respective inter-process communications, do not overlap with each other.
In addition, the process B transmits communication data M to the process X. A descriptor M′ corresponding to the communication data M, which is transmitted by the process B, includes information indicating that the communication data M is stored in an area having a size of m bytes from a head address “BBB” in the memory of the transmission node B. In addition, the descriptor M′ includes information indicating that a process corresponding to the transmission destination of the communication data M is the process X. In addition, the descriptor M′ includes information indicating that the area of the transfer destination of the communication data M is an area, in which an address “YYY” is the head address, of the reception node X.
In the reception node X, the buffers are allocated for the respective process A and process B which are the processes of the communication counter parties of the process X. Pieces of communication data, which are received from the process A and the process B, are stored in the buffers which are allocated for the respective processes. When the pieces of communication data are stored in the buffers, the reception completion notification is written in a queue for reception. The process X recognizes the reception completion notification in the queue, and accesses the communication data at the address of the memory, which is written in the reception completion notification. Further, the process X copies the communication data onto another area according to the communication protocol.
For example, the process X identifies the head address “XXX” of the area in which the communication data N is stored, and the data size n, with reference to the reception completion notification of the communication data N. Further, the process X accesses the data stored in the area of size n starting from the address “XXX”, and copies the data onto another area.
The reception buffer may be allocated for each inter-process communication.
As described later, when a buffer is shared and used by all the processes, data being used in a communication may be overwritten by another piece of data when a plurality of inter-process communications occur simultaneously. In contrast, in the first method, it is possible to prevent the data being used in the communication from being overwritten even when a plurality of inter-process communications occur simultaneously within a single node. This is because the buffer areas secured for the respective inter-process communications are different from each other. Therefore, no separate inter-process communication is needed to implement control for avoiding the overwriting of data. Accordingly, even when a plurality of inter-process communications occur simultaneously, it is possible to suppress the deterioration of the processing performance due to the control for preventing data from being overwritten.
However, in the first method, buffers are secured for the respective inter-process communications. Therefore, when a large number of inter-process communications occur, the memory capacity used by the MPI library becomes enormous. When an MPI area increases, the application area which may be used in the memory of each node decreases by the same amount. The physical memory mounted on each node is limited. In particular, when a large number of inter-process communications occur, as in the field of HPC, there is a problem of deterioration in the processing performance of the application due to the depletion or lack of memory.
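The first method can be sketched as follows, under the assumption that the memory of the reception node is a flat byte array and that a reception completion notification is simply an (address, size) pair. The class `ReceptionNode` and its method names are hypothetical.

```python
from collections import deque


class ReceptionNode:
    """First method: one non-overlapping buffer per communication counter party."""

    def __init__(self, senders, buffer_bytes):
        self.buffer_bytes = buffer_bytes
        # The MPI area grows with the number of senders.
        self.memory = bytearray(len(senders) * buffer_bytes)
        self.base = {s: i * buffer_bytes for i, s in enumerate(senders)}
        self.queue = deque()  # reception completion notifications

    def receive(self, sender, data):
        if len(data) > self.buffer_bytes:
            raise ValueError("data does not fit in the per-sender buffer")
        addr = self.base[sender]
        self.memory[addr:addr + len(data)] = data
        self.queue.append((addr, len(data)))

    def consume(self):
        # Read the notification, then access the data it points to.
        addr, size = self.queue.popleft()
        return bytes(self.memory[addr:addr + size])


node = ReceptionNode(["A", "B"], buffer_bytes=16)
node.receive("A", b"data-N")
node.receive("B", b"data-M")  # arrives before data-N is consumed
first = node.consume()
second = node.consume()
```

Because the two buffers do not overlap, `first` is still `b"data-N"` even though the data from process B arrived before it was consumed. The cost is visible in `len(node.memory)`, which scales with the number of senders.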
When the transmission node is connected to the reception node through an interconnect in which a temporary area capable of being used in a communication does not exist, influence of the problem arising from the memory usage of the reception node increases. Such an interconnect includes, for example, Torus fusion (Tofu) interconnect or the like.
The method (second method) of securing a reception buffer which is common to all the processes is a method in which the buffer used in the reception node is a common buffer for all the inter-process communications. The common buffer is secured so as not to exceed the capacity of the memory that can be secured as the MPI area in each node. Therefore, even when a large number of inter-process communications occur, it is possible to keep the memory capacity used by the MPI library within a prescribed range. However, in the second method, in order to prevent in-use data from being overwritten, an inter-process communication separate from the communication performed to transmit the communication data occurs. The second method is further divided into three methods (2-1 method to 2-3 method) based on the method of controlling the buffer.
The 2-1 method is a method in which no control is performed on the use of the buffer by each process. Therefore, when a plurality of inter-process communications occur in a single node, data which is being used in a communication may be overwritten. For example, there is a case in which two different processes simultaneously transmit data to the same process so that the data is stored at the same address of the reception buffer. When data which is being used in a communication is overwritten, the received data disappears from the reception node, and thus it is difficult for the application or the like to access the received data. Therefore, when data is overwritten, the reception process controls the communication so that the overwritten data is retransferred.
As described above, in the 2-1 method, a communication for retransfer occurs when data is overwritten. In addition, in order to control the retransfer, a process of recording the order of communications during execution, similar to checkpoint/restart, is required. Therefore, time for backup or recovery is required.
The 2-2 method is a method of performing control for dividing a reception buffer, which is common to all the processes, by the number of processes. In the 2-2 method, the transmittable data size is restricted for each process and is inversely proportional to the number of processes. Therefore, when data larger than the transmittable data size is to be transmitted, the transmission process divides the communication data into plural pieces and transmits them. Suppose that the plural pieces are transmitted in succession. In this case, it is highly probable that, after one divided piece is received, the subsequent piece arrives at the reception node before the previously received piece has been completely processed. In the 2-2 method, when data is newly received before the previously received data has been completely processed, the reception process overwrites the unprocessed data stored in the buffer with the newly received data.
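The overwriting that can occur in the 2-1 method can be reproduced with a few lines. This sketch assumes a shared buffer modeled as a byte array and a notification modeled as an (address, size) pair; the helper `receive` and the message contents are hypothetical.

```python
from collections import deque

shared_buffer = bytearray(16)  # one reception buffer common to all processes
queue = deque()                # reception completion notifications


def receive(data, addr=0):
    # No control on buffer use: every sender may target the same address.
    shared_buffer[addr:addr + len(data)] = data
    queue.append((addr, len(data)))


receive(b"from-A")  # data from process A
receive(b"from-B")  # data from process B lands on the same address

# The reception process now handles the notification for A's data.
addr, size = queue.popleft()
seen = bytes(shared_buffer[addr:addr + size])
lost = seen != b"from-A"
```

Here `seen` holds the data from process B, so the data from process A has been lost and would have to be retransferred.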
In order to prevent the data from being overwritten, control is performed such that the reception process transmits a notification (transmission permission notification) that transmission of data is permissible, to the transmission process, and the transmission process transmits data after receiving the transmission permission notification.
Here, after receiving the communication data N1, the process X confirms that the communication data N1 has been copied onto another area, and then transmits the transmission permission notification to the process A. The process A transmits the communication data N2 to the process X after confirming that the transmission permission notification has been received. Therefore, it is possible to prevent the data which is being used in the communication from being overwritten.
However, in the 2-2 method, a communication in which the transmission permission notification is transmitted, occurs in order to avoid the overwriting, and the transmission process transmits subsequent communication data after waiting for the reception of the transmission permission notification, thereby causing deterioration of latency.
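The cost of the 2-2 method can be illustrated with a small simulation. It assumes that the shared buffer is divided evenly, that each process owns a single slot, and that one transmission permission notification is needed before every piece after the first; the function names and the sizes used are hypothetical.

```python
def per_process_limit(buffer_bytes, num_processes):
    """Transmittable size per process when the shared buffer is divided evenly."""
    if num_processes <= 0:
        raise ValueError("num_processes must be positive")
    return buffer_bytes // num_processes


def split(data, limit):
    """Divide communication data into pieces no larger than the limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return [data[i:i + limit] for i in range(0, len(data), limit)]


def send_with_permission(pieces):
    """Return (data seen by the receiver, number of permission notifications)."""
    slot = None       # the single slot this sender may use
    received = []
    permissions = 0
    for i, piece in enumerate(pieces):
        if i > 0:
            # The receiver copies the previous piece out of the slot and only
            # then permits the next transmission.
            received.append(slot)
            permissions += 1
        slot = piece
    if slot is not None:
        received.append(slot)
    return b"".join(received), permissions


limit = per_process_limit(4096, 8)       # 512 bytes per process
pieces = split(bytes(range(256)) * 5, limit)  # 1280 bytes -> 3 pieces
data, permissions = send_with_permission(pieces)
```

With these assumed sizes, a 1280-byte message needs three pieces and two additional control communications, each of which the transmission process has to wait for.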
The 2-3 method is a method of performing control such that the reception process dynamically secures a buffer in the reception buffer which is common to all the processes.
In the 2-3 method, first, the transmission process A requests that the reception process X secure an area of the buffer. The process X secures the area of the buffer according to the request. Here, the process X performs exclusive control on the secured area of the buffer. When it is difficult to secure an area of the buffer due to a lack of memory, waiting occurs. Further, the process X notifies the process A of the address of the secured buffer. The process A waits until it is notified of the address of the buffer. When the notification is received, the process A transmits the data by designating the address of the buffer indicated by the notification.
As described above, in the 2-3 method, a communication occurs in order to secure the area of the buffer. Increase in the number of communications causes the deterioration of latency, and increases a time taken for the exclusive control performed on the reception buffer.
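The message overhead of the 2-3 method can be sketched as below. The allocator is a deliberately simple one that never frees areas, the lock stands in for the exclusive control, and the message count (request, address notification, data) is an assumption drawn from the steps described above; `SharedReceptionBuffer` and `transfer` are hypothetical names.

```python
import threading


class SharedReceptionBuffer:
    """Common reception buffer from which areas are secured on demand."""

    def __init__(self, size):
        self.size = size
        self.next_free = 0
        self.lock = threading.Lock()  # exclusive control on the buffer

    def secure(self, nbytes):
        """Return the start address, or None when memory is lacking."""
        with self.lock:
            if self.next_free + nbytes > self.size:
                return None
            addr = self.next_free
            self.next_free += nbytes
            return addr


def transfer(buf, memory, data):
    """Return (address used or None, number of messages exchanged)."""
    messages = 1  # process A asks process X to secure an area
    addr = buf.secure(len(data))
    if addr is None:
        return None, messages  # process A would have to wait
    messages += 1  # process X notifies process A of the address
    memory[addr:addr + len(data)] = data
    messages += 1  # the data transfer itself
    return addr, messages


buf = SharedReceptionBuffer(16)
memory = bytearray(16)
first = transfer(buf, memory, b"12345678")
second = transfer(buf, memory, b"abcdefgh")
```

Each successful transfer in this sketch costs three messages instead of one, which is the source of the latency deterioration noted above.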
Subsequently, a state in which the transmission target data is transferred from the transmission process to the reception process in an embodiment will be described.
In the embodiment, the transmission process sets the communication data serving as payload information to dummy data, for example, null data, and embeds the transmission target data into descriptors corresponding to respective pieces of dummy data. Further, the transmission process transmits the dummy data, as the communication data, and the descriptors to the reception node. The reception process acquires the transmission target data by extracting it from the descriptors into which the transmission target data is embedded. Here, the dummy data may be null data or may be actual data which includes some data.
When the reception node receives dummy data and the descriptor corresponding to the dummy data, the reception node first stores the dummy data in a reception buffer. When the dummy data is completely stored in the reception buffer, the reception process generates a reception completion notification based on the descriptor and stores the reception completion notification in a queue. Subsequently, the reception process extracts the data segments from the reception completion notifications which are stored in the queue. Further, the reception process reconstructs communication data by concatenating the extracted data segments. Thereafter, the reception process copies the reconstructed communication data onto the application area or the MPI area according to the communication protocol.
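The flow of the embodiment described above can be sketched as follows. This is an illustration under several assumptions: the capacity of the embedded area (`SEGMENT_BYTES`), the `sender`, `index`, and `total` fields, and the class and function names are all hypothetical, and the reception completion notification is modeled as the descriptor itself.

```python
from collections import deque
from dataclasses import dataclass
from itertools import zip_longest

SEGMENT_BYTES = 8  # assumed capacity of the area embedded in one descriptor


@dataclass(frozen=True)
class Descriptor:
    sender: str     # hypothetical field identifying the transmission process
    index: int      # position of this segment within the target data
    total: int      # number of segments the target data was divided into
    segment: bytes  # part of the transmission target data


def make_transmissions(sender, target):
    """Pair each data segment, embedded in a descriptor, with null dummy data."""
    chunks = [target[i:i + SEGMENT_BYTES]
              for i in range(0, len(target), SEGMENT_BYTES)] or [b""]
    return [(b"", Descriptor(sender, i, len(chunks), c))
            for i, c in enumerate(chunks)]


class ReceptionNode:
    def __init__(self):
        self.shared_buffer = b""  # holds only dummy data
        self.queue = deque()      # entries are appended, never overwritten

    def receive(self, payload, descriptor):
        self.shared_buffer = payload   # overwriting dummy data is harmless
        self.queue.append(descriptor)  # notification carries the segment

    def reconstruct(self):
        parts = {}
        while self.queue:
            d = self.queue.popleft()
            parts.setdefault(d.sender, []).append(d)
        return {s: b"".join(d.segment for d in sorted(ds, key=lambda d: d.index))
                for s, ds in parts.items()}


node = ReceptionNode()
a = make_transmissions("A", b"target data from A")
b = make_transmissions("B", b"data from B")
for pair in zip_longest(a, b):  # the two senders transmit simultaneously
    for item in pair:
        if item is not None:
            node.receive(*item)
result = node.reconstruct()
```

Although the transmissions from the two processes are interleaved and share one buffer, `result` contains both pieces of target data intact, and no permission or buffer-securing communication was needed.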
In the description below, the communication data, which is described with reference to
The transmission-side information processing apparatus 2 includes a transmission-side operation processing device 4, a transmission-side storage device 5, and a transmission-side communication device 6. The transmission-side operation processing device 4 performs a transmission-side process (transmission process). The transmission-side storage device 5 includes a transmission-side storage area which holds target data (transmission target data) as a transmission target. The transmission-side communication device 6 transmits transmission data generated by the transmission-side process through the communication path, where the transmission data includes the control information (descriptor) including the target data and address information indicative of the address of the target data.
The reception-side information processing apparatus 3 includes a reception-side communication device 7, a reception-side operation processing device 8, and a reception-side storage device 9. The reception-side communication device 7 receives transmission data through the communication path. The reception-side operation processing device 8 performs a reception-side process (reception process). The reception-side storage device 9 includes a reception-side storage area which stores the target data that is extracted from the transmission data through the reception-side process.
Further, the transmission data includes payload information (communication data) in addition to the control information. The payload information may be null data or actual data.
Further, the control information includes communication mode information (communication type information), which indicates the communication mode between the transmission-side information processing apparatus 2 and the reception-side information processing apparatus 3, in addition to the address information. In addition, the transmission-side communication device 6 and the reception-side communication device 7 perform communication, based on a communication mode according to the communication mode information.
The transmission-side process generates the transmission data by dividing the target data into data segments and embedding the data segments into the control information.
Hereinafter, the process between the transmission-side information processing apparatus 2 and the reception-side information processing apparatus 3 will be described in detail. The transmission-side process (transmission process) and the reception-side process (reception process) perform communication using an inter-process data communication procedure. The inter-process data communication procedure is a procedure of the inter-process communication in which communication of the payload information (communication data) and the control information (descriptor) for controlling the communication of the payload information is performed between the transmission-side process and the reception-side process. In the inter-process data communication procedure, the payload information, which is received from the transmission-side process, is stored in a storage area (buffer) and the received respective pieces of control information are managed in order of reception in the reception-side information processing apparatus 3.
The transmission-side process performs communication with the reception-side process by using the inter-process data communication procedure. The transmission-side process transmits payload information and control information to the reception-side process by using the inter-process data communication procedure. In the embodiment, the payload information is dummy information. The control information is associated with the dummy information. In addition, the control information includes a first area (embedded area) in which data is rewritable, and a second area in which the information of the reception-side process is stored. The first area is different from the second area. The control information includes the first area which includes at least a part of the target data (transmission target data).
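One possible byte layout for control information with two distinct areas is sketched below. The layout is entirely an assumption made for illustration: a 4-byte identifier of the reception-side process stands in for the second area, and a 1-byte used-length followed by a 16-byte rewritable field stands in for the first (embedded) area. The actual descriptor format is not given in the description.

```python
import struct

EMBEDDED_AREA_BYTES = 16  # assumed size of the first (embedded) area

# Little-endian: destination process ID (second area), used length and
# embedded data (first area). This layout is hypothetical.
_LAYOUT = struct.Struct(f"<IB{EMBEDDED_AREA_BYTES}s")


def pack_control(dest_process_id, segment):
    """Embed part of the target data into the control information."""
    if len(segment) > EMBEDDED_AREA_BYTES:
        raise ValueError("segment does not fit in the embedded area")
    # The 's' format pads short byte strings with null bytes.
    return _LAYOUT.pack(dest_process_id, len(segment), segment)


def unpack_control(raw):
    """Return (destination process ID, embedded part of the target data)."""
    dest, used, area = _LAYOUT.unpack(raw)
    return dest, area[:used]


raw = pack_control(7, b"hello")
dest, segment = unpack_control(raw)
```

Keeping the used length next to the embedded area lets the reception-side process distinguish padding from target data that happens to end in null bytes.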
The reception-side process receives the dummy information and the control information from the transmission-side process, and takes at least a part of the target data out from the control information which is managed in order of reception.
Therefore, when a plurality of inter-process communications occur simultaneously, it is possible to prevent the target data from being overwritten by communication data in another inter-process communication in the reception-side information processing apparatus 3. Accordingly, it is possible to suppress the inter-process communications that would otherwise be performed to prevent the target data from being overwritten, as described in, for example, the 2-2 method and the 2-3 method. In addition, it is possible to suppress communication for retransfer performed when data is overwritten, as described in, for example, the 2-1 method. In addition, it is possible to suppress a process of controlling retransfer, thereby preventing occurrence of backup or recovery for controlling retransfer.
In addition, in the embodiment, the target data is written in the first area which is used for storing information indicative of the size and stored location of data when the data is transmitted in the inter-process data communication procedure.
Therefore, it is possible to effectively cause the target data to be included in the control information and to transmit the target data from the transmission-side process to the reception-side process.
In addition, in the embodiment, the target data is written in the first area in which built-in data is stored in the inter-process data communication procedure.
Therefore, it is possible to effectively cause the target data to be included in the control information and to transmit the target data from the transmission-side process to the reception-side process.
In addition, the control information includes, in the first area, at least a part of the target data and the communication mode information (communication type information) indicative of the communication mode. The reception-side process takes at least a part of the target data out from the control information according to the communication mode indicated by the communication mode information included in the received control information.
Therefore, the reception-side process may identify whether or not the target data is included in the received control information.
In addition, the control information includes, in the second area, identification information identifying a transmission-side process. The transmission-side process divides the target data into plural data segments, and stores one or more data segments and the communication type information in the first area of each of plural pieces of control information. Further, the transmission-side process transmits the plural pieces of control information to the reception-side process. When the control information is received, the reception-side process extracts the data segments from the first area according to the communication type information included in the first area, and reconstructs the target data by connecting the data segments which are extracted from the plural pieces of control information, according to the identification information included in the second area.
This allows the transmission-side process to transmit the target data by dividing the target data into plural pieces of control information. In addition, when the inter-process communication with a plurality of transmission-side processes simultaneously occurs in the reception-side information processing apparatus 3, the reception-side process is able to reconstruct the target data for a communication with each of the transmission-side processes, based on the plural pieces of control information.
In addition, in the reception-side information processing apparatus 3, an area, in which the dummy information is stored, is a fixed length buffer which is shared by the plurality of inter-process communications.
Therefore, when the plurality of inter-process communications occurs, it is possible to reduce the amount of buffer usage of the reception-side information processing apparatus 3.
The communication node 18 is an example of the transmission-side information processing apparatus 2 and the reception-side information processing apparatus 3. Some of the functions of the transmission control unit 11, the setting unit 12, and the transmission unit 13 are examples of the transmission-side process which is performed by the transmission-side operation processing device 4. In addition, a part of the function of the transmission unit 13 is an example of the transmission-side communication device 6. The reception control unit 14 and the extraction unit 16 are examples of the reception-side process which is performed by the reception-side operation processing device 8. The storage unit 10 is an example of the transmission-side storage device 5 in the transmission-side information processing apparatus 2. In addition, the storage unit 10 is an example of the reception-side storage device 9 in the reception-side information processing apparatus 3. The reception unit 15 is an example of the reception-side communication device 7.
In the storage unit 10, the MPI area, the application area, and a queue are secured. The queue is secured in, for example, an area such as kernel which is secured by the operating system (OS) of the communication node 18. In the MPI area, the area of the buffer, in which the received communication data is stored, is secured.
First, a data transmission process will be described.
The transmission control unit 11 selects a communication protocol to be used in the inter-process communication, according to the size of the transmission target data. More specifically, for example, the transmission control unit 11 first determines whether or not the size of the transmission target data is equal to or greater than a prescribed threshold. When the size of the transmission target data is less than the prescribed threshold, the transmission control unit 11 selects the Eager protocol as a protocol to be used when the transmission target data communication is performed. In contrast, when the size of the transmission target data is equal to or greater than the prescribed threshold, the transmission control unit 11 selects the Rendezvous-RDMA read protocol or the Rendezvous-RDMA write protocol as a protocol to be used when the transmission target data communication is performed. The prescribed threshold is stored in the storage unit 10 in advance. Also, the communication protocol, which is used in the transmission target data communication, may be designated by a user terminal. In addition, the transmission control unit 11 performs various processes according to the selected communication protocol.
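The size-based selection described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the threshold value, the names `Protocol`, `EAGER_THRESHOLD`, and `select_protocol`, the rule that a user-designated protocol takes precedence, and the default choice between the two Rendezvous protocols are all assumptions made for the example.

```python
from enum import Enum


class Protocol(Enum):
    EAGER = "eager"
    RENDEZVOUS_RDMA_READ = "rendezvous-rdma-read"
    RENDEZVOUS_RDMA_WRITE = "rendezvous-rdma-write"


# Hypothetical threshold in bytes; the text only says that a prescribed
# threshold is stored in the storage unit 10 in advance.
EAGER_THRESHOLD = 4096


def select_protocol(size, threshold=EAGER_THRESHOLD,
                    large_protocol=Protocol.RENDEZVOUS_RDMA_WRITE,
                    user_choice=None):
    """Pick a protocol from the size of the transmission target data."""
    if user_choice is not None:
        # Assumption: a protocol designated by the user takes precedence.
        return user_choice
    if size < threshold:
        return Protocol.EAGER
    # Assumption: which Rendezvous variant is used is left to the caller.
    return large_protocol
```

For example, `select_protocol(100)` yields the Eager protocol, while a size equal to the threshold falls on the Rendezvous side of the comparison.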
In addition, the transmission control unit 11 generates communication data and a descriptor corresponding to the communication data. However, in the embodiment, the communication data is set to dummy data. The transmission control unit 11 sets, in the descriptor, communication control information for controlling the communication of the communication data (dummy data). The dummy data is null data. The size of the dummy data may be a prescribed size, and it is possible to reduce the quantity of the communication or the quantity of buffer which is used for the communication by reducing the size of the dummy data.
The setting unit 12 embeds the transmission target data, which is the data originally intended to be communicated between the processes, into an area (hereinafter, referred to as an “embedded area”) of the descriptor, which becomes rewritable by setting the communication data as the dummy data. The embedded area is an area for storing rewritable data and is different from an area in which the information of the reception process is stored. Here, the setting unit 12 divides the transmission target data into a plurality of data segments according to the size of the transmission target data, and then embeds the resulting data segments into the embedded areas. The setting unit 12 performs division of the transmission target data so that each data segment has a size that allows the data segment to be stored in the embedded area. Also, there is a case in which a single descriptor includes a plurality of embedded areas, and the respective sizes of the embedded areas are different from each other. In this case, the setting unit 12 performs division by adjusting the sizes of the respective data segments so that the data segments are stored in the respective embedded areas. In addition, the setting unit 12 may store the communication type information, which is information indicative of whether or not the transmission target data is included in the descriptor, in the embedded areas. A process of storing the communication data in the descriptor will be described in detail later.
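The division into data segments that fit embedded areas of different sizes can be sketched as follows. The function names and the area capacities passed in are hypothetical; actual capacities are determined by the specifications of the descriptor.

```python
def split_for_descriptors(data, area_sizes):
    """Split data into descriptors, each a list of per-area segments.

    area_sizes lists the capacity in bytes of each embedded area of one
    descriptor (hypothetical values chosen by the caller).
    """
    if not area_sizes or any(size <= 0 for size in area_sizes):
        raise ValueError("every embedded area needs a positive capacity")
    descriptors = []
    offset = 0
    while offset < len(data):
        segments = []
        for size in area_sizes:
            # A slice past the end of the data is simply empty.
            segments.append(data[offset:offset + size])
            offset += size
        descriptors.append(segments)
    return descriptors


def join_descriptors(descriptors):
    """Concatenate the segments in descriptor order, then in area order."""
    return b"".join(seg for segments in descriptors for seg in segments)
```

In this sketch the receiver recovers the data by joining the segments in the same order in which they were cut, which stands in for the combining rule shared by both nodes.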
The transmission unit 13 transmits the communication data (dummy data) and the descriptor, into which the transmission target data (the data segments) is embedded, to the reception node.
Next, a data reception process will be described.
The reception control unit 14 secures a fixed length buffer in an MPI area in advance before data is received.
The reception unit 15 receives communication data (dummy data) and a descriptor from the transmission node. Thereafter, the reception unit 15 stores the communication data (dummy data) in the fixed length buffer. Here, the fixed length buffer, which stores the communication data, is a buffer which is commonly used for all dummy communications. When the communication data (dummy data) is completely stored in the buffer, the reception unit 15 generates a reception completion notification based on the descriptor corresponding to the communication data (dummy data), and stores the reception completion notification in the queue.
The extraction unit 16 periodically performs polling, and determines whether or not the reception completion notification is stored in the queue. In monitoring through the polling, when it is recognized that the reception completion notification is stored in the queue, the extraction unit 16 determines whether or not the reception completion notification is generated in the dummy communication. More specifically, for example, the extraction unit 16 determines whether or not an area, into which the reception completion notification is embedded, includes the communication type information. When it is recognized that the reception completion notification includes the communication type information, the extraction unit 16 is able to recognize that the reception completion notification is relevant to the dummy communication. When it is determined that the reception completion notification is generated in the dummy communication, the extraction unit 16 extracts the data segments, embedded into the embedded areas of the descriptor corresponding to the reception completion notification, from the reception completion notification. Further, the extraction unit 16 combines the extracted plurality of data segments.
The extraction unit 16 may determine whether communication corresponding to the reception completion notification is the basic communication or the dummy communication, based on information, which is included in the descriptor and which indicates the location of the buffer in which the communication data is stored. When the information, which indicates the location of the buffer in which the communication data is stored, indicates the fixed length buffer for the dummy communication, the extraction unit 16 may determine that communication corresponding to the reception completion notification is the dummy communication.
The combining of the data segments is performed based on a combining rule which is determined in advance. The combining rule is stored in the prescribed storage area of the storage unit 10. Also, it is assumed that the combining rule corresponds to a rule under which data segments, acquired through division performed on the transmission target data, are embedded into the descriptor in the transmission node. When a plurality of data segments corresponding to a single transmission target data are transmitted across a plurality of descriptors, the extraction unit 16 may combine the plurality of data segments, based on the order of reception of the reception completion notification. Also, for example, the transmission node may store prescribed information indicative of the end of the transmission target data at the end of the transmission target data, and the extraction unit 16 may recognize the end of the transmission target data by determining whether the information indicative of the end is included in the data segments.
Further, the extraction unit 16 copies the reconstructed communication data onto the MPI area or the application area, according to the communication protocol. That is, when the communication protocol is the Eager protocol, the extraction unit 16 copies the transmission target data onto the application area. When the communication protocol is Rendezvous-RDMA read or Rendezvous-RDMA write, the extraction unit 16 copies the transmission target data (the information relevant to the communication control) onto the MPI area.
In addition, the reception control unit 14 and the reception unit 15 select the communication protocol according to the size of the transmission target data, and perform various processes according to the selected communication protocol.
Next, embedment of the transmission target data into the descriptor by the setting unit 12 will be described.
As illustrated in
In the dummy communication, the transmission target data is stored in some of the embedded areas, which satisfy the three conditions below, in the areas of the descriptor, as shown in
(1) An area in which it is possible to rewrite data.
(2) An area in which information included in the reception completion notification is stored.
(3) An area which is different from the areas in which the communication counter party-relevant information is stored.
The embedded areas are areas which are determined in advance according to the specifications of the descriptor. Although the transmission target data is stored in some of the embedded areas, the transmission target data may be stored in the entire embedded area when the communication is not affected.
In the dummy communication, a reason that it is possible to embed the transmission target data into the embedded areas of the descriptor is that information originally stored in the embedded areas is information which is not used since the communication data is the dummy data. The information includes, for example, information indicative of the size of the communication data, information indicative of the location of a buffer for the communication data, or the like. When the communication data is the dummy data, actual communication is not affected even when the information is changed.
As illustrated in
The area, into which the transmission target data is embedded, is any one of areas in which information included in the reception completion notification is stored. Therefore, the transmission target data, which is embedded into the descriptor, is also included in the reception completion notification. The transmission target data, which is included in the reception completion notification, is extracted and combined by the extraction unit 16.
In
There is a case in which the size of the transmission target data is greater than the sum of sizes of the embedded areas of a single descriptor. In this case, the transmission process divides the transmission target data, stores the resulting divided data segments in a plurality of descriptors, and then transmits the transmission target data to the reception node.
In addition, in
Here, the extraction unit 16 is able to identify the communication counter party corresponding to the reception completion notification, with reference to information on the communication counter party, which is included in the reception completion notification. Accordingly, even when a plurality of inter-process communications occur simultaneously, it is possible to extract, for each transmission process, transmission target data from the reception completion notifications and combine the data.
Next, the configuration of the descriptor will be described in detail. The descriptor is information which is used to control the communication of the inter-process of relevant communication data. The descriptor includes information on a communication method, a location on the memory of the communication counter party, in which the communication data is stored, the size of the communication data, the locational information of the communication counter party, and the like.
The reception completion notification is information which is generated by the reception unit 15 (communication module) based on the descriptor in the reception node. When the relevant communication data is completely stored in the buffer, the reception completion notification is stored in a queue by the reception unit 15. The extraction unit 16 periodically checks (performs polling) whether or not the reception completion notification is stored in the queue. As a result, the extraction unit 16 is able to recognize that the communication data is stored in the buffer, and to determine the location in which the communication data is stored in the buffer.
More specifically, the reception completion notification includes the coordinate information of the node of the communication counter party. The coordinate information is information corresponding to information which is stored in the areas “ABC1”, “Remote node address”, and “RI” of
The embedded area includes the areas “Embedded data”, “Message length”, and “Remote Offset”. The embedded area satisfies the above-described three conditions as the embedded area. Also, it is possible to change the size of the transmission target data, which is capable of being embedded into each of the embedded areas, by adjusting the size of the fixed length reception buffer which is used in the dummy communication. Also, an area, in which the information of the communication counter party is stored, includes areas “ABC1”, “Remote node address”, and “RI”, and includes the coordinate information of the node of the communication counter party. Also, when it is possible to use a plurality of queues, in which the reception completion notification is stored, it is possible to include the information on the transmission target data in the descriptor by associating a queue from which the reception completion notification is received with the bits of the transmission target data.
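The relationship between the descriptor areas and the reception completion notification can be sketched as a pair of record types. The area names follow the text, but the field types, the use of byte strings, the order in which the embedded areas are concatenated, and the names `Descriptor`, `CompletionNotification`, and `make_notification` are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    # Areas holding the information of the communication counter party.
    abc1: int
    remote_node_address: int
    ri: int
    # Embedded areas: rewritable when the communication data is dummy data.
    embedded_data: bytes
    message_length: bytes
    remote_offset: bytes


@dataclass(frozen=True)
class CompletionNotification:
    source: tuple    # coordinate information of the counter party
    embedded: bytes  # contents of the embedded areas, concatenated


def make_notification(desc):
    """Model the reception unit deriving a notification from a descriptor."""
    return CompletionNotification(
        source=(desc.abc1, desc.remote_node_address, desc.ri),
        # Assumed combining order: Embedded data, Message length, Remote Offset.
        embedded=desc.embedded_data + desc.message_length + desc.remote_offset,
    )
```

Because the embedded areas are carried over into the notification while the counter party areas are left untouched, the receiver obtains both the data and the identity of its sender from the notification alone.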
The communication type information, which is used for identifying whether the reception completion notification is generated in the basic communication or the dummy communication when the extraction unit 16 recognizes the reception completion notification which is stored in the queue, may be stored in the embedded area. More specifically, the communication type information may be stored in, for example, the area “Embedded data”. The communication type information may be information of 1 bit which indicates the basic communication or the dummy communication. In addition, the communication type information may further include protocol type information which indicates the type of a protocol used in a communication corresponding to the reception completion notification. The protocol type information may be information of 2 bits which indicates one of the communications according to the Eager protocol, the Rendezvous-RDMA read protocol, and the Rendezvous-RDMA write protocol.
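A 1-bit communication type combined with a 2-bit protocol type can be packed into three bits, as sketched below. The bit positions, the numeric protocol codes, and the function names are hypothetical; the text specifies only the bit widths.

```python
DUMMY_FLAG = 0b100     # assumed position: bit 2, 1 = dummy, 0 = basic
PROTOCOL_MASK = 0b011  # assumed position: bits 0-1 hold the protocol type

# Assumed numeric codes for the three protocols named in the text.
PROTOCOLS = {
    0: "eager",
    1: "rendezvous-rdma-read",
    2: "rendezvous-rdma-write",
}


def encode_type(is_dummy, protocol_code):
    """Pack the communication type and protocol type into 3 bits."""
    if protocol_code not in PROTOCOLS:
        raise ValueError("unknown protocol code")
    return (DUMMY_FLAG if is_dummy else 0) | protocol_code


def decode_type(value):
    """Return (is_dummy, protocol_name) from a packed 3-bit value."""
    code = value & PROTOCOL_MASK
    if code not in PROTOCOLS:
        raise ValueError("unknown protocol code")
    return bool(value & DUMMY_FLAG), PROTOCOLS[code]
```

The three bits consumed here are bits that are no longer available for the transmission target data in the same embedded area.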
The extraction unit 16 may identify whether the communication corresponding to the reception completion notification is the basic communication or the dummy communication, based on the value of the “steering tag”. That is, when a location indicated by the “steering tag” indicates the fixed length buffer for the dummy communication, the extraction unit 16 may determine that the communication corresponding to the reception completion notification is the dummy communication. This method enables the size of the transmission target data, which is capable of being stored in the embedded area, to increase, compared to a method in which the communication type information is embedded into the embedded area. This method is realizable because it is possible to fix a buffer, which is to be used in a communication, to a prescribed address in advance in the dummy communication.
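The steering tag check and its effect on capacity can be sketched as follows. The buffer address is a hypothetical value, and the 3-bit cost assumes that both the 1-bit communication type and the 2-bit protocol type would otherwise be embedded.

```python
# Hypothetical fixed address of the shared buffer used for dummy communication.
DUMMY_BUFFER_ADDRESS = 0x1000

# Assumed cost of embedding the type information: 1 bit + 2 bits.
TYPE_INFO_BITS = 3


def is_dummy_communication(steering_tag, dummy_buffer=DUMMY_BUFFER_ADDRESS):
    """Classify a notification by the buffer location it points to."""
    return steering_tag == dummy_buffer


def payload_bits(embedded_bits, use_steering_tag):
    """Bits left for transmission target data in the embedded areas."""
    if use_steering_tag:
        return embedded_bits
    return embedded_bits - TYPE_INFO_BITS
```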
In the dummy communication, there is no possibility that the transmission target data disappears due to another data being overwritten on the transmission target data, even when a plurality of inter-process communications occur in a single node. As described with reference to
The process A (setting unit 12) divides the transmission target data N and embeds the divided transmission target data N into a plurality of descriptors (DN1 and DN2). Further, the process A (transmission unit 13) executes the dummy communication. Dummy data A and the descriptor DN1 are transmitted from the process A (transmission unit 13), and the dummy data A is stored in the reception buffer of the reception node X. Further, a reception completion notification DN′1 which is generated based on the descriptor DN1 is stored in a queue.
In the same manner, the process B (setting unit 12) divides the transmission target data M, and embeds the divided transmission target data M into a plurality of descriptors (DM1 and DM2). Further, the process B (transmission unit 13) executes the dummy communication. Dummy data B and the descriptor DM1 are transmitted from the process B (transmission unit 13), and the dummy data B is stored in the reception buffer of the reception node X. A reception completion notification DM′1 which is generated based on the descriptor DM1 is stored in the queue. Here, the reception buffer in which the dummy data B transmitted from the process B is stored, is the same as the reception buffer in which the dummy data A is stored. The dummy data B may be overwritten on the dummy data A. However, the communication is not affected because the dummy data is not referred to.
The process X (extraction unit 16) extracts the divided data of the transmission target data N and M from the respective reception completion notifications DN′1 and DM′1 which are stored in the queue. Further, the process X (extraction unit 16) extracts the divided data of the transmission target data N and M from the reception completion notifications DN′2 and DM′2 which are stored in the queue subsequent to the reception completion notifications DN′1 and DM′1. Further, the process X (extraction unit 16) acquires the transmission target data N and M by combining the extracted data for the respective processes (process A and process B). Here, the process X (extraction unit 16) is able to identify counter parties corresponding to the respective reception completion notifications with reference to information on the communication counter party included in the reception completion notifications.
The reception completion notifications are stored in the queue in the order in which the dummy data corresponding to the reception completion notifications are stored in the buffer. This allows the extraction unit 16 to control the combining of plural pieces of divided data included in the plurality of reception completion notifications, based on the order in which the reception completion notifications are stored. More specifically, for example, first, the setting unit 12 performs control so that the locations of the divided data in the transmission target data correspond to the transmission orders of the descriptors into which the divided data are embedded. Further, the extraction unit 16 specifies the locations of the pieces of divided data, which are included in the reception completion notifications, within the transmission target data, from the order in which the reception completion notifications are stored in the queue. Further, the extraction unit 16 is able to correctly combine plural pieces of divided data by combining the plural pieces of divided data so that original transmission target data is obtained.
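Order-based combining with notifications from two processes interleaved in one queue can be sketched as follows. Each queue entry is a simplified `(source, segment)` pair standing in for a reception completion notification; the function name and the byte values are illustrative assumptions.

```python
from collections import defaultdict, deque


def reassemble_by_source(queue):
    """Drain notifications in arrival order and rebuild the data per source."""
    parts = defaultdict(list)
    while queue:
        source, segment = queue.popleft()
        # Per-source arrival order gives the position within the target data.
        parts[source].append(segment)
    return {source: b"".join(segments) for source, segments in parts.items()}


# Notifications from process A (data N) and process B (data M), interleaved.
queue = deque([("A", b"N1"), ("B", b"M1"), ("A", b"N2"), ("B", b"M2")])
result = reassemble_by_source(queue)
```

The sketch relies on the property stated above: notifications from one transmission process reach the queue in the order in which that process issued the descriptors, so no explicit sequence number is carried.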
For example, in
In the dummy communication, the buffer in which the dummy data is stored, is a fixed length buffer which is shared and used by a plurality of inter-process communications.
Buffers corresponding to the processes of the respective communication counter parties (processes A to N in
In the dummy communication, data to be stored in the buffer is dummy data. Therefore, there is no problem even when the dummy data, which are stored by the plurality of processes, are overwritten. Accordingly, even when the number of processes of the communication counter parties increases, the communication is possible without increasing the amount of buffer usage. In the embodiment, the transmission target data is not overwritten, and thus it is possible to suppress the control communication which occurs to prevent the overwriting of the transmission target data. In the embodiment, it is possible to share and use a single fixed length buffer regardless of the number of processes of the communication counter parties, and thus it is possible to reduce the amount of buffer usage compared to the first method. The size of the fixed length buffer is 544 bytes in
Next, an operational flowchart for the process of the transmission node according to the embodiment will be described.
In
Subsequently, the transmission control unit 11 recognizes the selected communication protocol (S102). When it is determined that the selected communication protocol is the Eager protocol (Eager protocol in S102), the process proceeds to the transmission process A (S103). The transmission process A will be described in detail later with reference to
In S102, when it is determined that the selected communication protocol is the Rendezvous-RDMA write protocol (Rendezvous-RDMA write protocol in S102), the process proceeds to the transmission process B (S104). The transmission process B will be described in detail later with reference to
In S102, when it is determined that the selected communication protocol is the Rendezvous-RDMA read protocol (Rendezvous-RDMA read protocol in S102), the process proceeds to the transmission process A (S105). The transmission process A will be described in detail later with reference to
Subsequently, the transmission process A which is executed in S103, S105, and S301 (which will be described later), will be described.
In
Subsequently, the transmission control unit 11 generates dummy data as communication data, and a descriptor corresponding to the communication data (S202). In S202, the transmission control unit 11 sets each of the areas of the generated descriptor at a setting value used when the communication data is transmitted to the reception node.
Subsequently, the setting unit 12 embeds the divided data into the embedded areas of the descriptor, which is generated in S202, (S203). In addition, the setting unit 12 may store communication type information in the embedded areas such that it is possible to determine whether a descriptor is used in the basic communication or the dummy communication.
Further, the transmission unit 13 issues the dummy communication (S204). That is, the transmission unit 13 transmits the communication data (dummy data), which is generated in S202, and the descriptor, into which the divided data (transmission target data) is embedded in S203, to the reception node.
Subsequently, the transmission control unit 11 determines whether or not the transfer of the transmission target data is completed (S205). That is, the transmission control unit 11 determines whether or not all the pieces of divided data, which are generated by dividing the transmission target data in S201, are embedded into the descriptors and transmitted. When it is determined that the transfer of the transmission target data is not completed (No in S205), the transmission control unit 11 causes the process to proceed to S202, and newly generates dummy data and a descriptor. In contrast, when it is determined that the transfer of the transmission target data is completed (Yes in S205), the process ends.
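The loop formed by S201 to S205 can be sketched as follows. The `send` callable, the dictionary used as a descriptor, and the single-byte dummy data are hypothetical stand-ins for the transmission unit 13 and the actual descriptor layout.

```python
def transmission_process_a(target, capacity, send):
    """Divide the target data, embed each piece, and issue dummy communications.

    send is a hypothetical callable taking (dummy_data, descriptor).
    Returns the number of dummy communications issued.
    """
    if capacity <= 0:
        raise ValueError("embedded area capacity must be positive")
    # S201: divide the transmission target data into pieces that fit.
    pieces = [target[i:i + capacity] for i in range(0, len(target), capacity)]
    issued = 0
    for piece in pieces:
        dummy = b"\x00"                                  # S202: null dummy data
        descriptor = {"dummy": True, "embedded": piece}  # S202 and S203
        send(dummy, descriptor)                          # S204
        issued += 1
    # S205: the loop ends once every piece has been embedded and sent.
    return issued
```

A caller could pass a function that appends to a list in place of a real transmission unit in order to inspect the descriptors that would be sent.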
Subsequently, the transmission process B, which is executed in S104, will be described.
In
When the transfer of the information on the communication control is completed in S301, the transmission control unit 11 determines whether or not the information on the communication control of the reception node for RDMA transfer is received from the reception node (S302). When it is determined that the information on the communication control of the reception node is not received (No in S302), the transmission control unit 11 repeats the process in S302.
In S302, when it is determined that the information on the communication control of the reception node is received (Yes in S302), the transmission control unit 11 transfers the transmission target data according to RDMA-Write (S303). That is, the transmission control unit 11 designates the address of the application area of the reception node, which is the storage destination of the transmission target data, based on the information on the communication control of the reception node which is received in S302, and transmits the transmission target data. As a result, the transmission control unit 11 writes the transmission target data in the designated application area of the reception node.
Further, the transmission control unit 11 transmits the notification that the communication of RDMA-Write is completed to the reception control unit 14 (S304). Thereafter, the process ends.
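The sequence S301 to S304 can be sketched with an in-memory stand-in for the reception node. The `FakeReceiver` class, its attributes, and the use of a byte offset as the designated address are assumptions for illustration; a real RDMA-Write is performed by the communication hardware rather than by a slice assignment.

```python
class FakeReceiver:
    """In-memory stand-in for the reception node (hypothetical)."""

    def __init__(self, size):
        self.application_area = bytearray(size)
        self.control_inbox = []
        self.completed = False

    def reply_control_info(self):
        # Offset in the application area where the data may be written.
        return {"address": 0}


def transmission_process_b(target, receiver):
    """Sketch of the Rendezvous-RDMA write sequence on the transmission side."""
    # S301: transfer the sender's communication control information
    # (done through transmission process A in the text; a plain append here).
    receiver.control_inbox.append({"size": len(target)})
    # S302: repeat until the receiver's control information is obtained.
    info = None
    while info is None:
        info = receiver.reply_control_info()
    # S303: RDMA-Write, modeled as a write at the designated address.
    start = info["address"]
    receiver.application_area[start:start + len(target)] = target
    # S304: notify the receiver that the RDMA-Write is completed.
    receiver.completed = True
```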
Next, an operational flowchart for the process of the reception node according to the embodiment will be described.
In
Subsequently, the reception control unit 14 recognizes the selected communication protocol (S112). When it is determined that the selected communication protocol is the Eager protocol (Eager protocol in S112), the process proceeds to the reception process A (S113). The details of the reception process A will be described later with reference to
When it is determined that the selected communication protocol is the Rendezvous-RDMA write protocol in S112 (Rendezvous-RDMA write protocol in S112), the process proceeds to the reception process B (S114). The details of the reception process B will be described later with reference to
When it is determined that the selected communication protocol is the Rendezvous-RDMA read protocol in S112 (Rendezvous-RDMA read protocol in S112), the process proceeds to the reception process C (S115). The details of the reception process C will be described later with reference to
Subsequently, the reception process A which is executed in S113 will be described.
In
Subsequently, the extraction unit 16 determines whether or not a reception completion notification is stored in a queue (S402). When it is determined that the reception completion notification is not stored in the queue (No in S402), the extraction unit 16 executes the process in S402 again.
In contrast, when it is determined that the reception completion notification is stored in the queue (Yes in S402), the extraction unit 16 takes the reception completion notification out from the top of the queue, and determines whether or not the reception completion notification is generated in the dummy communication (S403). More specifically, the extraction unit 16 determines whether or not the communication type information is included in the reception completion notification. In addition, the extraction unit 16 determines whether or not information, which is included in the reception completion notification and which indicates the top address of a buffer in which the communication data is stored, indicates the fixed length buffer for the dummy communication.
When it is determined that the reception completion notification is generated in the dummy communication (Yes in S403), the extraction unit 16 extracts pieces of divided data (transmission target data) from the reception completion notification (S404).
Further, the extraction unit 16 combines the pieces of divided data which are extracted in S404 (S405). That is, the extraction unit 16 combines the plural pieces of transmission target data (the plural data segments), which are extracted in S404, based on a combining rule, which is determined in advance, and the order in which the reception completion notifications are received. Also, the extraction unit 16 identifies a transmission source process corresponding to the reception completion notification with reference to information on the communication counterparty included in the reception completion notification. Then, for each of the identified transmission source processes, the extraction unit 16 extracts pieces of data from the reception completion notifications and combines the extracted pieces of data.
Subsequently, the reception control unit 14 determines whether or not the transfer of the communication data is completed (S406). That is, the reception control unit 14 determines whether or not information indicative of the end of the transmission target data is included in the divided data which are extracted in S404. When it is determined that the transfer of the communication data is not completed (No in S406), the reception control unit 14 causes the process to proceed to S401. When it is determined that the transfer of the communication data is completed (Yes in S406), the reception control unit 14 causes the process to proceed to S408.
In S403, when it is determined that the reception completion notification is not generated in the dummy communication (No in S403), the reception control unit 14 specifies the location (address) of a buffer, in which the communication data is stored, from the reception completion notification, and accesses the specified location (S407). Thereafter, the process proceeds to S408.
In S408, the reception control unit 14 copies the reconstructed transmission target data into the MPI area or the application area according to the communication protocol. Thereafter, the process ends.
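The loop of S402 through S407 can be illustrated with a minimal Python sketch. The notification layout, the field names, the "dummy" type value, the end-of-data flag, and the function name receive_one_message are all assumptions made for illustration; the combining rule is assumed to be plain concatenation in arrival order. The sketch checks only the communication type in S403, whereas the description above also checks whether the buffer address indicates the fixed length buffer for the dummy communication.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Notification:
    """Hypothetical layout of a reception completion notification."""
    source: int                        # communication counterparty (sender process)
    comm_type: Optional[str] = None    # "dummy" marks a dummy communication (assumed)
    embedded: bytes = b""              # divided data carried inside the notification
    last: bool = False                 # assumed end-of-data marker
    buffer_addr: Optional[int] = None  # top address of the buffer for ordinary data


def receive_one_message(queue, buffers, pending):
    """Return (source, data) for the first message that completes."""
    while True:
        if not queue:
            # S402 would poll again here; the sketch raises instead of spinning.
            raise LookupError("no reception completion notification queued")
        note = queue.popleft()                 # take from the top of the queue
        if note.comm_type != "dummy":          # S403: not a dummy communication
            return note.source, buffers[note.buffer_addr]  # S407
        # S404/S405: extract and combine per source, in arrival order.
        pending.setdefault(note.source, []).append(note.embedded)
        if note.last:                          # S406: transfer completed
            return note.source, b"".join(pending.pop(note.source))
```

The pending dictionary is passed in by the caller so that pieces already received from other transmission source processes survive across calls.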
Next, the reception process B, which is executed in S114, will be described.
In
Subsequently, the reception control unit 14 transmits information on the communication control of the reception node to the transmission node (S502). Thereafter, the transmission target data is transferred by the transmission node according to the RDMA-Write, based on the information on the communication control. Thereafter, the process ends.
Next, the reception process C, which is executed in S115, will be described.
In
Subsequently, the reception control unit 14 acquires the transmission target data according to the RDMA-Read, based on the information on the communication control of the transmission node, which is copied into the MPI area in S408 (S602).
Then, the reception control unit 14 transmits, to the transmission control unit 11, a notification that the communication according to the RDMA-Read is completed (S603). Thereafter, the process ends.
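The difference between the two Rendezvous variants described above can be sketched as a small in-memory simulation in Python. Nothing here touches real RDMA hardware or any real library: the Node class, the rdma_write and rdma_read helpers, and the layout of the control information are hypothetical stand-ins chosen only to show which side supplies the control information and which side moves the data.

```python
class Node:
    """Stand-in for a computing node; memory maps an address to bytes."""

    def __init__(self, name):
        self.name = name
        self.memory = {}


def rdma_write(data, dst_node, dst_addr):
    # Simulated RDMA-Write: place data directly into the remote node's memory.
    dst_node.memory[dst_addr] = bytes(data)


def rdma_read(src_node, src_addr, length):
    # Simulated RDMA-Read: fetch data directly from the remote node's memory.
    return src_node.memory[src_addr][:length]


def rendezvous_write(sender, receiver, data, recv_addr):
    """Reception process B: the receiver announces where the data should go."""
    control = {"addr": recv_addr}                 # S502: receiver's control info
    rdma_write(data, receiver, control["addr"])   # sender transfers by RDMA-Write
    return receiver.memory[recv_addr]


def rendezvous_read(sender, receiver, send_addr, length):
    """Reception process C: the receiver pulls the data itself."""
    control = {"addr": send_addr, "length": length}  # sender's control info
    data = rdma_read(sender, control["addr"], control["length"])  # S602
    completion = {"event": "rdma-read-complete", "to": sender.name}  # S603
    return data, completion
```

In the write variant the control information flows from the reception node to the transmission node, while in the read variant it flows the other way and the reception node reports completion afterwards.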
Subsequently, an operational flowchart for a data reception process performed by the reception unit 15 according to the embodiment will be described.
In
Subsequently, the configuration of an information processing system according to the embodiment will be described.
In
The CPU 23 provides some or the entirety of the functions of the transmission control unit 11, the setting unit 12, the transmission unit 13, the reception control unit 14, and the extraction unit 16, by executing, using the memory 24, a program in which the procedure of the above-described flowchart is written. That is, the CPU 23 provides some or the entirety of the functions of the transmission-side operation processing device 4 in the transmission-side information processing apparatus 2, by executing, using the memory 24, the program in which the procedure of the above-described flowchart is written. The CPU 23 provides some or the entirety of the functions of the reception-side operation processing device 8 in the reception-side information processing apparatus 3, by executing, using the memory 24, the program in which the procedure of the above-described flowchart is written.
The memory 24 is, for example, a semiconductor memory, and includes a random access memory (RAM) area and a read only memory (ROM) area. The memory 24 provides some or the entirety of the functions of the storage unit 10. That is, the memory 24 provides some or the entirety of the functions of the transmission-side storage device 5 in the transmission-side information processing apparatus 2. In addition, the memory 24 provides some or the entirety of the functions of the reception-side storage device 9 in the reception-side information processing apparatus 3.
The communication module 22 is a module which performs communication through a network or an interconnect, according to the instruction of the CPU 23. The communication module 22 provides some or the entirety of the functions of the transmission unit 13 and the reception unit 15. That is, the communication module 22 provides some or the entirety of the functions of the transmission-side communication device 6 in the transmission-side information processing apparatus 2. In addition, the communication module 22 provides some or the entirety of the functions of the reception-side communication device 7 in the reception-side information processing apparatus 3.
The information processing apparatus 20 may include a reading device. The reading device accesses a detachable storage medium, according to the instruction of the CPU 23. The detachable storage medium is realized by, for example, a semiconductor disk (USB memory or the like), a medium (magnetic disc or the like) to which information is input by a magnetic action, a medium (a CD-ROM, a DVD, or the like) to which information is input by an optical action, or the like.
A program according to the embodiment is provided to the CPU 23, for example, in one of the following forms:
(1) installed in the memory 24 in advance.
(2) provided using a detachable storage medium (not shown in the drawing).
(3) provided from a program server (not shown in the drawing) through the communication module 22.
A part of the information processing apparatus 20 may be realized by hardware. Further, the information processing apparatus 20 may be realized by the combination of software and hardware.
The user terminal 25 transmits a program execution instruction to the information processing system 30, and receives a program execution result as a response. The program execution instruction is transmitted to one or more PEs 21 which are included in the information processing system 30. In the execution of a single program, a plurality of processes are executed, and inter-process data communication is performed. Each of the processes is executed in one or more PEs 21. An MPI is used in the inter-process communication. Each of the processes which perform the inter-process communication may switch between the protocols (Eager and Rendezvous-RDMA) used in the inter-process communication according to, for example, the amount of communication data to be transferred.
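Switching between protocols according to the amount of transfer data can be sketched in Python as follows. The threshold value of 4096 bytes and the names EAGER_LIMIT_BYTES and select_protocol are assumptions for illustration only; the embodiment does not state a specific threshold.

```python
EAGER_LIMIT_BYTES = 4096  # hypothetical threshold, not taken from the embodiment


def select_protocol(nbytes, eager_limit=EAGER_LIMIT_BYTES):
    """Pick a protocol from the amount of data to be transferred."""
    if nbytes < 0:
        raise ValueError("nbytes must be non-negative")
    # Small transfers go out immediately; large ones negotiate first.
    return "Eager" if nbytes <= eager_limit else "Rendezvous-RDMA"
```

For example, select_protocol(128) selects the Eager protocol under the assumed threshold, while select_protocol(1_000_000) selects Rendezvous-RDMA.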
Also, in the embodiment, the area in which the reception completion notification is stored is a queue. However, the area is not limited to a queue, as long as the area is controlled such that the reception completion notifications corresponding to the communication data are stored sequentially in the order in which the reception node receives the communication data, so that the reception sequence is preserved. In addition, the dummy communication may be performed between two processes which are executed in the same communication node 18.
Also, the embodiment is not limited to the above-described embodiment, and various configurations or embodiments can be provided without departing from the gist of the embodiment.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiment of the present invention has been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2014-166741 | Aug 2014 | JP | national |