The present invention relates to a method and a system for parallel computation.
A system for performing parallel computation by use of plural computation nodes has been developed. An example of parallel computation is matrix product computation. The matrix product computation is one of the most basic computational elements that are widely used in scientific computation in general, analysis of big data, artificial intelligence, and so on.
Non-patent Literature 1, for example, has been known as literature showing a prior-art method for obtaining a matrix product by performing parallel computation.
Speeding up of parallel computation is important for reducing electric power consumption in a system such as a data center or the like.
The present invention has been achieved in view of the above matters; and an object of the present invention is to speed up parallel computation.
For solving the above problem, a mode of the present invention provides a method for performing parallel computation in a parallel computation system comprising plural computation nodes, wherein the method comprises: a first step for distributing respective first-level small pieces of data, that are formed by dividing data, to the respective computation nodes in the plural computation nodes; a second step for further dividing, in a first group of computation nodes which includes at least one computation node in the plural computation nodes, the first-level small pieces of data into second-level small pieces of data; a third step for transferring, in parallel, the respective second-level small pieces of data from the first group of computation nodes to a group of relay nodes which is a subset of the plural computation nodes; a fourth step for transferring, in parallel, the transferred second-level small pieces of data from the group of relay nodes to a second group of computation nodes which includes at least one computation node in the plural computation nodes; and a fifth step for reconstructing, in the second group of computation nodes, the first-level small pieces of data by using the second-level small pieces of data transferred from the group of relay nodes.
Another mode of the present invention comprises the above mode, and provides the parallel computation method that further comprises a sixth step for performing a part of the parallel computation by using the reconstructed first-level small pieces of data.
Another mode of the present invention comprises the above mode, and provides the parallel computation method, wherein, in the parallel transfer from the first group of computation nodes in the third step, the first group of computation nodes transfer, in parallel, the respective second-level small pieces of data, in such a manner that all usable communication links between the first group of computation nodes and the group of relay nodes are used.
Another mode of the present invention comprises the above mode, and provides the parallel computation method, wherein, in the parallel transfer to the second group of computation nodes in the fourth step, the group of relay nodes transfer, in parallel, the respective second-level small pieces of data, in such a manner that all usable communication links between the group of relay nodes and the second group of computation nodes are used.
Another mode of the present invention comprises the above mode, and provides the parallel computation method, wherein each of the computation nodes comprises plural communication ports; and data communication from the first group of computation nodes to the group of relay nodes in the third step or data communication from the group of relay nodes to the second group of computation nodes in the fourth step is performed via the plural communication ports.
Another mode of the present invention comprises the above mode, and provides the parallel computation method, wherein the plural computation nodes are logically full-mesh connected.
Another mode of the present invention comprises the above mode, and provides the parallel computation method, wherein the parallel computation is matrix operation; the data is data representing a matrix; and the first-level small pieces of data are data representing submatrices formed by dividing the matrix along a row direction and a column direction.
Another mode of the present invention comprises the above mode, and provides the parallel computation method, wherein the submatrices are submatrices formed by dividing the matrix into N pieces (provided that N is the number of computation nodes); and the second-level small pieces of data are data formed by further dividing the submatrix into N pieces.
Another mode of the present invention comprises the above mode, and provides the parallel computation method, wherein the matrix operation is computation of a product of matrices.
Another mode of the present invention provides a method for performing parallel computation in a parallel computation system comprising plural computation nodes, wherein the method comprises: a step for further dividing each of first-level small pieces of data, that are formed by dividing data, into second-level small pieces of data; a step for distributing the respective second-level small pieces of data to the respective computation nodes in the plural computation nodes; a step for transferring, in parallel, the second-level small pieces of data from the respective computation nodes in the plural computation nodes to at least one computation node in the plural computation nodes; and a step for reconstructing, in the at least one computation node, the first-level small piece of data by using the second-level small pieces of data transferred from the plural computation nodes.
Another mode of the present invention provides a parallel computation system comprising plural computation nodes, wherein respective first-level small pieces of data, that are formed by dividing data, are distributed to the respective computation nodes in the plural computation nodes; at least one first computation node in the plural computation nodes is constructed to further divide the first-level small piece of data into second-level small pieces of data, and transfer, in parallel, the respective second-level small pieces of data to a group of relay nodes which is a subset of the plural computation nodes; and at least one second computation node in the plural computation nodes is constructed to obtain, by parallel transfer, the second-level small pieces of data from the group of relay nodes, and reconstruct the first-level small piece of data by using the second-level small pieces of data transferred from the group of relay nodes.
Another mode of the present invention provides a parallel computation system comprising plural computation nodes, wherein each of first-level small pieces of data, that are formed by dividing data, is further divided into second-level small pieces of data; the respective second-level small pieces of data are distributed to the respective computation nodes in the plural computation nodes; and at least one computation node in the plural computation nodes is constructed to obtain, by parallel transfer, the second-level small pieces of data from the respective computation nodes in the plural computation nodes, and reconstruct the first-level small piece of data by using the second-level small pieces of data transferred from the plural computation nodes.
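The five-step relay transfer summarized in the above modes can be illustrated by the following minimal Python sketch. The node model, names, and sizes used here are assumptions introduced solely for illustration; the sketch merely shows how second-level small pieces of data fan out to relay nodes and are reassembled at a destination node.

```python
# Minimal sketch of the five-step relay transfer; "nodes" are modeled as
# plain Python lists/dicts and a "transfer" is an assignment (illustrative only).
N = 9
data = list(range(N * N))

# Step 1: first-level small pieces of data, one per computation node.
first_level = [data[n * N:(n + 1) * N] for n in range(N)]

# Step 2: a node in the first group further divides its first-level piece
# into N second-level small pieces of data.
src = 0
second_level = [[x] for x in first_level[src]]

# Step 3: parallel transfer of one second-level piece to each relay node.
at_relays = {r: second_level[r] for r in range(N)}

# Step 4: parallel transfer from the relay nodes to a node in the second group.
dst = 1
received = [at_relays[r] for r in range(N)]

# Step 5: the receiving node reconstructs the first-level piece of data.
reconstructed = [x for piece in received for x in piece]
assert reconstructed == first_level[src]
```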
According to the present invention, parallel computation can be performed at high speed.
In the following description, embodiments of the present invention will be explained in detail, with reference to the figures.
The respective computation nodes 100 are connected by communication links 20. The communication link 20 is a transmission path which allows transmission/reception of data between computation nodes 100 which are connected to ends of the communication link 20. The communication link 20 transmits data in the form of an electric signal or an optical signal. The communication link 20 may be wired or wireless. In the example in
As explained above, the parallel computation system 10 relating to the embodiment of the present invention has the construction wherein the respective computation nodes 100 are logically full-mesh connected. In a prior-art parallel computation system 10 wherein the respective computation nodes 100 are connected via a packet switch, the links between the computation nodes 100 and the packet switch are used in a time division manner, so that the system is highly flexible; however, it is required to perform a complicated procedure for avoiding collision between packets, and such a procedure is a cause of delay in communication and increase in consumed electric power. On the other hand, in the parallel computation system 10 of the present embodiment wherein the respective computation nodes 100 are logically full-mesh connected, all computation nodes 100 are always directly connected with each other, so that it is not required to consider collision between packets, and it becomes possible to adopt a simpler process; thus, communication delay and electric power consumption can be reduced.
In the case that specific computation is performed, the parallel computation system 10 divides a process for the computation into plural processes, and assigns the divided sub-processes to the respective computation nodes 100. That is, each computation node 100 is in charge of a part of the computation that is performed as a whole by the parallel computation system 10. Further, the parallel computation system 10 divides data, that is used in computation or is an object of computation, into plural pieces of data, and distributes the divided small pieces of data to respective computation nodes 100. Although each computation node 100 performs computation that the computation node is in charge of, the computation node may not hold data required for the computation. The computation node 100 obtains, via the communication link 20, such data from the other computation node 100 which holds the data. As explained above, each of the computation nodes 100 performs the sub-process assigned thereto, so that computation in the parallel computation system 10 is processed, in a parallel manner, by cooperation of plural computation nodes 100.
In
A small piece of data, that is one of small pieces of data formed by dividing entire data used for parallel computation and is designated to be distributed to the computation node 100, is stored in the data storage area 124 in advance. Further, a small piece of data, that is required by the computation node 100 for computation and is obtained from the other computation node 100, is stored in the data storage area 124 temporarily. Further, data generated as a result of execution of computation by the computation node 100 is also stored in the data storage area 124.
The transmission/reception unit 130 transmits/receives, between the computation node 100 and other computation nodes 100, small pieces of data that are required by the respective computation nodes 100 for computation. Specifically, the transmission/reception unit 130 transmits a small piece of data, that is distributed to the computation node 100 and stored in the data storage area 124 of the memory 120 in advance, to the other computation node 100 to be used for computation in the other computation node 100. Further, the transmission/reception unit 130 receives a small piece of data, that is not held in the computation node 100 and is necessary for computation, from the other computation node 100.
The transmission/reception unit 130 comprises plural communication ports 132 for transmitting/receiving, in parallel, data to/from the respective ones of plural computation nodes 100. The respective communication ports 132 are connected to the respective corresponding computation nodes 100 via communication links 20. In the example in
In the following description, embodiments of the present invention will be explained in relation to computation of a product of matrices. In the case that the number of computation nodes 100 is N (=p*q, wherein p and q are natural numbers), each of the matrices A and B is divided into p parts in the row direction and q parts in the column direction. Although it is not necessary to set p=q, the case wherein p=q, i.e., N=p^2, will be explained in the following description; this is because the number of times of communication of the matrix A and that of the matrix B coincide with each other if p=q, so that computation can be performed most efficiently. To be able to compute matrix multiplication A*B with respect to the matrix A and the matrix B, it is required that the number of columns of the matrix A and the number of rows of the matrix B be the same. Thus, it is supposed that the matrix A has I rows and K columns and the matrix B has K rows and J columns. In such a case, the number of rows and the number of columns of each of submatrices, that are formed by dividing the matrix A into N (=p^2) parts, are I/p and K/p, respectively; and the number of rows and the number of columns of each of submatrices, that are formed by dividing the matrix B into N (=p^2) parts, are K/p and J/p, respectively. Thus, the number of columns of the submatrix of the matrix A and the number of rows of the submatrix of the matrix B coincide with each other, so that a product of matrices with respect to the submatrix of the matrix A and the submatrix of the matrix B can be calculated. For example, in the case of the parallel computation system 10 in
cij = Σk (aik*bkj) (provided that i=1, 2, and 3; j=1, 2, and 3; and k=1, 2, and 3)
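The block formula above can be checked numerically with the following short sketch; the matrix sizes (I = K = J = 6, p = 3) are chosen only for illustration.

```python
# Numerical check of c_ij = Σk (aik*bkj) for a p*p block decomposition.
import numpy as np

p, I, K, J = 3, 6, 6, 6
A = np.arange(I * K, dtype=float).reshape(I, K)
B = np.arange(K * J, dtype=float).reshape(K, J)

def block(M, i, j, rows, cols):
    """Submatrix m_ij (1-based indices) when M is divided into p*p blocks."""
    return M[(i - 1) * rows:i * rows, (j - 1) * cols:j * cols]

C = A @ B
for i in range(1, p + 1):
    for j in range(1, p + 1):
        c_ij = sum(block(A, i, k, I // p, K // p) @ block(B, k, j, K // p, J // p)
                   for k in range(1, p + 1))
        assert np.allclose(c_ij, block(C, i, j, I // p, J // p))
```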
First, an algorithm for parallel computation, that has been known, will be explained.
First, in step 402, respective submatrices aij of the matrix A are distributed to respective corresponding computation nodes Nn (provided that n=3(i−1)+j−1). Specifically, as shown in
Next, in step 404, respective submatrices bij of the matrix B are distributed to respective corresponding computation nodes Nn, similarly.
Next, in step 405, each computation node Nn secures a part of the data storage area 124 in its memory 120 as an area for storing the submatrix cij, and initializes all elements of the submatrix cij to 0. In this case, indexes i and j of the submatrix cij are represented by i=n/3+1 and j=n%3+1, respectively. In this regard, n/3 means the integer part of the quotient obtained by dividing n by 3, and n%3 means the remainder obtained when n is divided by 3.
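As an aid to reading the index arithmetic above, the mapping between the node number n and the submatrix indices (i, j) can be written as follows (p = 3 assumed; names are illustrative only).

```python
# Mapping used in steps 402-405: n = 3(i-1)+(j-1), i = n/3+1, j = n%3+1.
p = 3

def node_of(i, j):
    return p * (i - 1) + (j - 1)

def indices_of(n):
    return n // p + 1, n % p + 1

assert node_of(1, 1) == 0 and node_of(3, 3) == 8
assert indices_of(4) == (2, 2)   # node N4 is in charge of c22
```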
At this point, the computation node N0, for example, merely holds the submatrix a11 of the matrix A and the submatrix b11 of the matrix B. Thus, the computation node N0 cannot yet perform computation of the submatrix c11, that the computation node N0 is in charge of, in computation of the matrix product C. The matters similar to the above matters also apply to other computation nodes 100. The above is the preparation stage before performing the following repeated processes. In the following process, three times of repeated processes, i.e., steps 406-410, steps 412-416, and steps 418-422, are performed.
In the repeated processes of the first time, in step 406, each of the computation nodes N0, N3, and N6 transmits, by “Broadcast communication,” a submatrix, that is held thereby, of the matrix A to other two computation nodes 100 “in the same row.” The expression “in the same row” means that they belong to the same row in the box shown in
Next, in step 408, each of the computation nodes N0, N1, and N2 transmits, by Broadcast communication, a submatrix, that is held thereby, of the matrix B to other two computation nodes 100 “in the same column.” The expression “in the same column” means that they belong to the same column in the box shown in
Next, in step 410, each computation node Nn computes a product of matrices ai1*b1j of two submatrices (provided that i=n/3+1 and j=n%3+1), that is a part of calculation that the computation node is in charge of. For example, the computation node N0 uses the submatrix a11 and the submatrix b11 that have been stored in the data storage area 124 in the memory 120 in step 402 and step 404, respectively, for calculating a product of matrices a11*b11. Also, the computation node N1 uses the submatrix b12 that has been stored in the data storage area 124 in the memory 120 in step 404 and the submatrix a11 that has been obtained from the computation node N0 in step 406, for calculating a product of matrices a11*b12. Also, for example, the computation node N4 uses the submatrix a21 that has been obtained from the computation node N3 in step 406 and the submatrix b12 that has been obtained from the computation node N1 in step 408, for calculating a product of matrices a21*b12. Regarding other computation nodes 100, refer to
In the repeated processes of the second time, in step 412, each of the computation nodes N1, N4, and N7 transmits, by Broadcast communication, a submatrix, that is held thereby, of the matrix A to other two computation nodes 100 in the same row. Specifically, the computation node N1 transmits the submatrix a12 to the computation node N0 and the computation node N2, the computation node N4 transmits the submatrix a22 to the computation node N3 and the computation node N5, and the computation node N7 transmits the submatrix a32 to the computation node N6 and the computation node N8.
Next, in step 414, each of the computation nodes N3, N4, and N5 transmits, by Broadcast communication, a submatrix, that is held thereby, of the matrix B to other two computation nodes 100 in the same column. Specifically, the computation node N3 transmits the submatrix b21 to the computation node N0 and the computation node N6, the computation node N4 transmits the submatrix b22 to the computation node N1 and the computation node N7, and the computation node N5 transmits the submatrix b23 to the computation node N2 and the computation node N8.
Next, in step 416, in a manner similar to that in step 410, each computation node Nn calculates a product of matrices ai2*b2j of two submatrices, that is a part of calculation that the computation node is in charge of; and adds, element by element, each element in the obtained product of matrices ai2*b2j to each element of the submatrix cij in the data storage area 124 in the memory 120 in the computation node. Although details are omitted for avoiding complication in the above explanations, a person skilled in the art can easily understand, based on the above explanation relating to step 410 and the description in
In the repeated processes of the third time, in step 418, each of the computation nodes N2, N5, and N8 transmits, by Broadcast communication, a submatrix, that is held thereby, of the matrix A to other two computation nodes 100 in the same row. Specifically, the computation node N2 transmits the submatrix a13 to the computation node N0 and the computation node N1, the computation node N5 transmits the submatrix a23 to the computation node N3 and the computation node N4, and the computation node N8 transmits the submatrix a33 to the computation node N6 and the computation node N7.
Next, in step 420, each of the computation nodes N6, N7, and N8 transmits, by Broadcast communication, a submatrix, that is held thereby, of the matrix B to other two computation nodes 100 in the same column. Specifically, the computation node N6 transmits the submatrix b31 to the computation node N0 and the computation node N3, the computation node N7 transmits the submatrix b32 to the computation node N1 and the computation node N4, and the computation node N8 transmits the submatrix b33 to the computation node N2 and the computation node N5.
Next, in step 422, in a manner similar to that in each of step 410 and step 416, each computation node Nn calculates a product of matrices ai3*b3j of two submatrices, that is a part of calculation that the computation node is in charge of; and adds, element by element, each element in the obtained product of matrices ai3*b3j to each element of the submatrix cij in the data storage area 124 in the memory 120 in the computation node. Regarding tangible contents of computation, refer to the above explanation relating to step 410 and the description in
As a result, each computation node 100 obtains a result of computation with respect to the submatrix cij that is a part of the matrix C, which represents the product of matrices A*B, and that the computation node is in charge of computing.
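For reference, the prior-art broadcast-based procedure (steps 402-422) can be simulated in a single process as follows; the dictionary-based node model, the random matrices, and the 0-based indexing are assumptions made only for illustration, and a real system would of course run the nodes in parallel.

```python
# Single-process simulation of the prior-art broadcast algorithm (p = 3).
import numpy as np

p = 3
I = K = J = 6
A = np.random.rand(I, K)
B = np.random.rand(K, J)
Ablk = [[A[i*I//p:(i+1)*I//p, k*K//p:(k+1)*K//p] for k in range(p)] for i in range(p)]
Bblk = [[B[k*K//p:(k+1)*K//p, j*J//p:(j+1)*J//p] for j in range(p)] for k in range(p)]

# Steps 402-405: node n = p*i + j (0-based here) holds a(i,j), b(i,j), and c = 0.
a_local = {p*i + j: Ablk[i][j] for i in range(p) for j in range(p)}
b_local = {p*i + j: Bblk[i][j] for i in range(p) for j in range(p)}
c_local = {n: np.zeros((I//p, J//p)) for n in range(p*p)}

for m in range(p):
    # Steps 406/412/418: nodes in grid column m broadcast their A-submatrix
    # to the other nodes in the same row.
    a_recv = {p*i + j: a_local[p*i + m] for i in range(p) for j in range(p)}
    # Steps 408/414/420: nodes in grid row m broadcast their B-submatrix
    # to the other nodes in the same column.
    b_recv = {p*i + j: b_local[p*m + j] for i in range(p) for j in range(p)}
    # Steps 410/416/422: local partial product accumulated into c(i,j).
    for n in range(p*p):
        c_local[n] += a_recv[n] @ b_recv[n]

C = np.block([[c_local[p*i + j] for j in range(p)] for i in range(p)])
assert np.allclose(C, A @ B)
```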
For example, in the table shown as step 406 in
As shown in
In this regard, in the above explanation, an example wherein a matrix is divided into 3*3=9 submatrices and they are distributed to nine computation nodes 100 has been explained. However, the number by which a matrix is divided and the number of computation nodes 100 are not limited to those in the above example. When it is generalized, it is possible to divide a matrix into p*p=p^2 (p is an integer equal to or greater than 2) submatrices, and distribute them to p^2 computation nodes Nn (provided that n=0, 1, . . . , p^2−1). In the case of the operation shown by the flow chart in
A repeated process at the m-th time (m=1, 2, . . . , p) in p times of repeated processes is performed in the following manner. That is, first, as operation corresponding to step 406 in the flow chart in
First, in step 702, respective submatrices aij of a matrix A and respective submatrices bij of a matrix B are arranged to be positioned in respective corresponding computation nodes 100. This step is the same as steps 402 and 404 in the above-explained prior-art example.
Next, in step 703, similar to the case of step 405 in the prior-art example, each computation node 100 secures a part of the data storage area 124 in its memory 120 as an area for storing the submatrix cij, and initializes all elements of the submatrix cij to 0.
Next, in step 704, each of the computation nodes N0, N3, and N6 transmits a submatrix, that is held thereby, of the matrix A to all computation nodes 100 by “Scatter communication.” The “Scatter communication” is communication wherein data held by a computation node 100 is divided into small pieces of data, and respective small pieces of data are transmitted to respective corresponding computation nodes 100, so that parts, that are different from each other, of the original data are distributed to different computation nodes 100.
Specifically, for example, the computation node N0 divides a submatrix a11 into nine small pieces of data a110, a111, a112, a113, a114, a115, a116, a117, and a118; and transmits the small piece of data a111 to the computation node N1, transmits the small piece of data a112 to the computation node N2, transmits the small piece of data a113 to the computation node N3, transmits the small piece of data a114 to the computation node N4, transmits the small piece of data a115 to the computation node N5, transmits the small piece of data a116 to the computation node N6, transmits the small piece of data a117 to the computation node N7, and transmits the small piece of data a118 to the computation node N8. Also, the computation node N3 divides a submatrix a21 into nine small pieces of data and transmits the respective small pieces of data to other computation nodes 100, in a similar manner. The computation node N6 also performs operation similar to the above operation. In the table shown as step 704 in
Next, in step 706, the respective computation nodes N1, N2, N4, N5, N7, and N8 collect the small pieces of data, that have been distributed to the respective computation nodes 100 in above step 704, by performing “Allgather communication,” and reconstruct the submatrices of the matrix A by use of the collected small pieces of data, respectively. The “Allgather communication” is communication by which plural processes are performed in parallel, wherein each process is a process for collecting pieces of data distributed to plural computation nodes 100 and combining the pieces of data in a single computation node 100.
Specifically, for example, the computation node N1 obtains small pieces of data a110, a111, a112, a113, a114, a115, a116, a117 and a118 from computation nodes N0, N1, N2, N3, N4, N5, N6, N7, and N8, respectively, and reconstructs the submatrix a11 of the matrix A by use of the above small pieces of data. The above transferring of the respective small pieces of data from the respective computation nodes 100 to the computation node N1 is shown in the second column, when viewed from the left side, of the table shown as step 706 in
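The two-step relay of one submatrix through Scatter communication and Allgather communication can be modeled with a few lines of Python; the dictionary standing in for the nine relay nodes and the flattened submatrix are assumptions for illustration only.

```python
# Relay of submatrix a11 from node N0 to node N1 via steps 704 and 706.
import numpy as np

N = 9
a11 = np.arange(36.0)                        # submatrix a11, flattened (illustrative)

# Step 704 (Scatter): N0 splits a11 into pieces a11_0 .. a11_8 and sends
# piece k to relay node Nk over nine parallel communication links.
pieces_at_relays = dict(enumerate(np.array_split(a11, N)))

# Step 706 (Allgather): destination node N1 collects piece k from every
# relay node Nk, again over nine parallel links, and reconstructs a11.
reconstructed = np.concatenate([pieces_at_relays[k] for k in range(N)])
assert np.array_equal(reconstructed, a11)
```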
As a result of performing steps 704 and 706 as explained above, each of the computation nodes N0, N1, and N2 is made to be in the state that it holds the submatrix a11 of the matrix A, each of the computation nodes N3, N4, and N5 is made to be in the state that it holds the submatrix a21 of the matrix A, and each of the computation nodes N6, N7, and N8 is made to be in the state that it holds the submatrix a31 of the matrix A, similar to the state when step 406 in the above-explained prior-art algorithm has been completed. When step 406 in the prior-art algorithm and steps 704 and 706 in the present embodiment are compared, it is worth noting that, although the number of communication steps is doubled in the present embodiment, the number of used communication links 20 is increased nine-fold, and the size of data transmitted through each communication link 20 is decreased to 1/9 of that of the prior art, so that communication time required for transferring a submatrix is shortened to 2/9 of that of the prior art.
Next, in step 708, each of the computation nodes N0, N1, and N2 transmits a submatrix, that is held thereby, of the matrix B to all computation nodes 100 by performing Scatter communication. Specifically, as shown in the table of step 708 in
Next, in step 710, the respective computation nodes N3, N4, N5, N6, N7, and N8 collect the small pieces of data, that have been distributed to the respective computation nodes 100 in above step 708, by performing “Allgather communication,” and reconstruct the submatrices of the matrix B by use of the collected small pieces of data, respectively. Specifically, for example, the computation node N3 obtains small pieces of data b110, b111, b112, b113, b114, b115, b116, b117, and b118 from computation nodes N0, N1, N2, N3, N4, N5, N6, N7, and N8, respectively, and reconstructs the submatrix b11 of the matrix B by use of the above small pieces of data. The above transferring of the respective small pieces of data from the respective computation nodes 100 to the computation node N3 is shown in the fourth column, when viewed from the left side, of the table shown as step 710 in
As a result of performing steps 708 and 710 as explained above, each of the computation nodes N0, N3, and N6 is made to be in the state that it holds the submatrix b11 of the matrix B, each of the computation nodes N1, N4, and N7 is made to be in the state that it holds the submatrix b12 of the matrix B, and each of the computation nodes N2, N5, and N8 is made to be in the state that it holds the submatrix b13 of the matrix B, similar to the state when step 408 in the above-explained prior-art algorithm has been completed. When step 408 in the prior-art algorithm and steps 708 and 710 in the present embodiment are compared, it is worth noting, similar to the case of above steps 704 and 706, that, although the number of communication steps is doubled in the present embodiment, the number of used communication links 20 is increased nine-fold, and the size of data transmitted through each communication link 20 is decreased to 1/9 of that of the prior art, so that communication time required for transferring a submatrix is shortened to 2/9 of that of the prior art.
Next, in step 712, each computation node 100 computes a product of matrices ai1*b1j of two submatrices, that is a part of calculation that the computation node is in charge of; and adds, element by element, each element in the obtained product of matrices to each element of the submatrix cij in the data storage area 124 in the memory 120 in the computation node. The above step corresponds to step 410 in the above-explained prior-art algorithm. In this regard, for example, the computation node N1 has obtained the submatrix a11, that is required for the calculation of the product of matrices, by the Scatter communication in step 704 and the Allgather communication in step 706. Also, for example, the computation node N4 has obtained the submatrix a21, that is required for the calculation of the product of matrices, by the Scatter communication in step 704 and the Allgather communication in step 706, and has obtained the submatrix b12 by the Scatter communication in step 708 and the Allgather communication in step 710. Similarly, other computation nodes 100 obtain necessary submatrices by performing Scatter communication and Allgather communication in series. In this manner, respective submatrices aij and bij are transferred from origination computation nodes 100 to destination computation nodes 100 in such a manner that small pieces of data, that are formed by dividing the respective submatrices aij and bij, are relayed by other computation nodes 100 by performing two-step communication comprising Scatter communication and Allgather communication, rather than direct transfer from the origination computation nodes 100 to the destination computation nodes 100.
Next, steps 714-722 are performed in a manner similar to that in above-explained steps 704-712. Steps 714 and 716 are processes for sending the submatrix a12 of the matrix A to the computation nodes N0 and N2, sending the submatrix a22 to the computation nodes N3 and N5, and sending the submatrix a32 to the computation nodes N6 and N8, by performing Scatter communication and Allgather communication similar to those performed in steps 704 and 706. Also, steps 718 and 720 are processes for sending the submatrix b21 of the matrix B to the computation nodes N0 and N6, sending the submatrix b22 to the computation nodes N1 and N7, and sending the submatrix b23 to the computation nodes N2 and N8, by performing Scatter communication and Allgather communication similar to those performed in steps 708 and 710. The above processes are shown respectively in the tables corresponding to the respective steps in
Next, similar to above-explained steps 704-712 and steps 714-722, steps 724-732 are performed. Steps 724 and 726 are processes for sending the submatrix a13 of the matrix A to the computation nodes N0 and N1, sending the submatrix a23 to the computation nodes N3 and N4, and sending the submatrix a33 to the computation nodes N6 and N7, by performing Scatter communication and Allgather communication similar to those explained above. Also, steps 728 and 730 are processes for sending the submatrix b31 of the matrix B to the computation nodes N0 and N3, sending the submatrix b32 to the computation nodes N1 and N4, and sending the submatrix b33 to the computation nodes N2 and N5, by performing Scatter communication and Allgather communication similar to those explained above. The above processes are shown respectively in the tables corresponding to the respective steps in
As a result, each computation node 100 finally obtains a result of computation with respect to the submatrix cij, wherein the submatrix cij is the part of the matrix C, which represents the product of matrices A*B, that the computation node is in charge of computing.
Now, an evaluation will be made with respect to how much the above parallel computation method relating to the first embodiment of the present invention is speeded up, compared with the parallel computation method using the above-explained prior-art algorithm. In the above two methods, it is supposed that there is no difference in the computation ability of respective computation nodes 100 and the communication band of respective communication links 20. Further, it is supposed that the number of computation nodes 100, which constitute the parallel computation system 10, is N (as explained above, N=9 in
In the case of the prior-art algorithm, broadcast communication is performed in steps 406, 408, 412, 414, 418, and 420, and the total number of times of communication (that is to be represented by M) is 2N^(1/2). Also, at each time of communication (i.e., in each step), since the submatrix aij or bij is transferred, the data length (that is to be represented by S) transferred at a single time of communication is 1. Thus, the whole relative communication time T (=MS) is 2N^(1/2). For example, T=16 in the case that N=64.
In the case of the parallel computation method relating to the first embodiment of the present invention, Scatter communication is performed in steps 704, 708, 714, 718, 724, and 728, and Allgather communication is performed in steps 706, 710, 716, 720, 726, and 730, and the total number of times of communication is 4N^(1/2). Also, since small pieces of data, that are formed by dividing the submatrix aij or bij by the number of computation nodes 100, are transferred at each time of communication, the data length transferred at a single time of communication is 1/N. Thus, the whole relative communication time is T=4/N^(1/2). For example, T=0.5 in the case that N=64.
Thus, the relative communication time in the case that the parallel computation method relating to the first embodiment of the present invention is used is 2/N of that in the case that the prior-art algorithm is used, so that the relative theoretical performance (i.e., 1/T) is speeded up to N/2-fold of that of the prior-art algorithm. In the case that N=64, the parallel computation method relating to the first embodiment of the present invention can realize speeded-up processing that is 32-fold faster than that of the prior-art algorithm.
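The relative communication times quoted above can be reproduced with the following short calculation, assuming M = 2N^(1/2) broadcast steps for the prior art and M = 4N^(1/2) Scatter/Allgather steps for the first embodiment, with the normalized data sizes described in the text.

```python
# Relative communication time T = M * S (submatrix size normalized to 1).
def T_prior(N):          # 2*sqrt(N) broadcast steps, full submatrix per step
    return 2 * N ** 0.5 * 1.0

def T_first(N):          # 4*sqrt(N) Scatter/Allgather steps, 1/N of a submatrix per step
    return 4 * N ** 0.5 * (1.0 / N)

N = 64
print(T_prior(N))               # 16.0
print(T_first(N))               # 0.5
print(T_prior(N) / T_first(N))  # 32.0, i.e. the N/2-fold speed-up noted above
```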
A difference between the second embodiment and the first embodiment of the present invention is that, in the second embodiment, the three times of Scatter communication in the first embodiment are aggregated into a single time of “Alltoall communication.” That is, the Scatter communication in steps 704, 714, and 724 is aggregated into “Alltoall communication” in step 904 in the second embodiment, and the Scatter communication in steps 708, 718, and 728 is aggregated into “Alltoall communication” in step 906 in the second embodiment. In this regard, in the flow chart shown in
In step 904, all computation nodes 100 transmit submatrices aij, that are held by them respectively, of a matrix A to all the computation nodes 100, by performing “Alltoall communication.” The “Alltoall communication” is communication wherein all computation nodes 100 perform, in parallel, processes, wherein each of the processes is a process performed by a computation node 100 for dividing data held by the computation node 100 into small pieces of data and transmitting the respective small pieces of data to corresponding computation nodes 100. By the above process, divided parts, that are different from each other, of all the submatrices aij are distributed to different computation nodes 100 simultaneously.
Specifically, for example, the computation node N0 divides a submatrix a11 into nine small pieces of data and transmits the small pieces of data a111, a112, a113, a114, a115, a116, a117, and a118 to the computation nodes N1, N2, N3, N4, N5, N6, N7, and N8, respectively. The above transferring of the respective small pieces of data from the computation node N0 to the respective computation nodes 100 is shown in the top row in the table represented by step 904 in
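A minimal model of the Alltoall exchange of step 904 is shown below; the per-node data and the dictionary-based "links" are illustrative assumptions.

```python
# Alltoall: every node splits its data into N pieces, and piece k of
# every node's data is delivered to node Nk in one parallel step.
import numpy as np

N = 9
held = {n: np.arange(9.0) + 100 * n for n in range(N)}        # data held by node n

send = {n: np.array_split(held[n], N) for n in range(N)}      # send[n][k]: n -> Nk
recv = {k: {n: send[n][k] for n in range(N)} for k in range(N)}

# After the exchange, each node holds exactly one piece from every origin node.
assert all(len(recv[k]) == N for k in range(N))
```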
As is apparent from the routing table in
Similarly, in step 906, all computation nodes 100 transmit submatrices bij, that are held by them respectively, of a matrix B to all the computation nodes 100, by performing Alltoall communication. Tangible contents thereof are shown in the table of step 906 in
In this manner, the respective submatrices aij and bij are transferred from origination computation nodes 100 to destination computation nodes 100 in such a manner that small pieces of data of the respective submatrices aij and bij are relayed by other computation nodes 100 by performing two-step communication comprising Alltoall communication and Allgather communication.
In the case of the parallel computation method relating to the second embodiment of the present invention, Alltoall communication is performed in steps 904 and 906, and Allgather communication is performed in steps 908, 910, 914, 916, 920, and 922, and the total number of times of communication is 2+2N^(1/2). Also, similar to the case of the first embodiment, since small pieces of data, that are formed by dividing the submatrix aij or bij by the number of computation nodes 100, are transferred at each time of communication, the data length transferred at a single time of communication is 1/N. Thus, the whole relative communication time is T=(2+2N^(1/2))/N. For example, T=0.28 in the case that N=64.
Thus, the relative communication time in the case that the parallel computation method relating to the second embodiment of the present invention is used is (1+N^(1/2))/(N*N^(1/2)) of that in the case that the prior-art algorithm is used, so that the relative theoretical performance is speeded up to (N*N^(1/2))/(1+N^(1/2))-fold of that of the prior-art algorithm. In the case that N=64, the parallel computation method relating to the second embodiment of the present invention can realize speeded-up processing that is 57-fold faster than that of the prior-art algorithm.
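The figures quoted for the second embodiment follow directly from M = 2+2N^(1/2) and S = 1/N, as the following sketch shows.

```python
# Relative communication time and speed-up for the second embodiment.
def T_second(N):
    return (2 + 2 * N ** 0.5) / N

N = 64
print(round(T_second(N), 2))                   # 0.28
print(round((N * N ** 0.5) / (1 + N ** 0.5)))  # 57, the speed-up factor noted above
```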
The third embodiment of the present invention is that wherein transferring of small pieces of data in the above-explained second embodiment is modified to further improve the efficiency thereof. In the Allgather communication in steps 908, 910, 914, 916, 920, and 922 in the second embodiment, the whole communication band of the parallel computation system 10 is not completely used. For example, this would be understood from the state of the left-most column, the fourth column (when viewed from the left side), and the seventh column (when viewed from the left side) of the table of step 908 in
Thus, a difference between the third embodiment and the second embodiment of the present invention is that, in the third embodiment, the three times of Allgather communication in the second embodiment are aggregated into two times of “Alltoallv communication” by using the above “blank cells.”
Specifically, Alltoallv communication in step 1108 in the third embodiment is that wherein the processes for obtaining, by the computation nodes N0, N3, and N6, respective small pieces of data a13k of a submatrix a13, respective small pieces of data a23k of a submatrix a23, and respective small pieces of data a33k of a submatrix a33 from other respective computation nodes 100, respectively, in the Allgather communication in step 920 in the second embodiment are incorporated in the blank cells in the Allgather communication in step 908 in the second embodiment. The above matter is shown by the frames formed by broken-lines in the table of step 1108 in
By performing the above Alltoallv communication in step 1108, the computation node N0 obtains the submatrix a13, each of the computation nodes N1 and N2 obtains the submatrix a11, the computation node N3 obtains the submatrix a23, each of the computation nodes N4 and N5 obtains the submatrix a21, the computation node N6 obtains the submatrix a33, and each of the computation nodes N7 and N8 obtains the submatrix a31. Further, by performing the Alltoallv communication in step 1114, the computation node N1 obtains the submatrix a13, each of the computation nodes N0 and N2 obtains the submatrix a12, the computation node N4 obtains the submatrix a23, each of the computation nodes N3 and N5 obtains the submatrix a22, the computation node N7 obtains the submatrix a33, and each of the computation nodes N6 and N8 obtains the submatrix a32.
Further, Alltoallv communication in step 1110 in the third embodiment is that wherein the processes for obtaining, by the computation nodes N0, N1, and N2, respective small pieces of data b31k of a submatrix b31, respective small pieces of data b32k of a submatrix b32, and respective small pieces of data b33k of a submatrix b33 from other respective computation nodes 100, respectively, in the Allgather communication in step 922 in the second embodiment are incorporated in the blank cells in the Allgather communication in step 910 in the second embodiment. The above matter is shown by the frames formed by broken-lines in the table of step 1110 in
By performing the above Alltoallv communication in step 1110, the computation node N0 obtains the submatrix b31, the computation node N1 obtains the submatrix b32, the computation node N2 obtains the submatrix b33, each of the computation nodes N3 and N6 obtains the submatrix b11, each of the computation nodes N4 and N7 obtains the submatrix b12, and each of the computation nodes N5 and N8 obtains the submatrix b13. Further, by performing the Alltoallv communication in step 1116, the computation node N3 obtains the submatrix b31, the computation node N4 obtains the submatrix b32, the computation node N5 obtains the submatrix b33, each of the computation nodes N0 and N6 obtains the submatrix b21, each of the computation nodes N1 and N7 obtains the submatrix b22, and each of the computation nodes N2 and N8 obtains the submatrix b23.
In this regard, steps 1102, 1103, 1104, 1106, 1112, 1118, and 1120 in the flow chart shown in
As is apparent from the routing table in
In this manner, the respective submatrices aij and bij are transferred from origination computation nodes 100 to destination computation nodes 100 in such a manner that small pieces of data of the respective submatrices aij and bij are relayed by other computation nodes 100 by performing two-step communication comprising Alltoall communication and Alltoallv communication.
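The Alltoallv semantics used in steps 1108-1116 differ from Alltoall only in that each origin-destination pair may carry a different amount of data, which is what allows the otherwise idle links ("blank cells") to be filled. The following generic sketch models this; the three-node example and the count table are purely illustrative and do not reproduce the actual routing tables of the embodiment.

```python
# Generic model of Alltoallv: per-pair element counts may differ.
N = 3
data = {0: list(range(6)), 1: list(range(10, 14)), 2: list(range(20, 28))}
counts = {0: [1, 2, 3], 1: [2, 1, 1], 2: [3, 3, 2]}   # elements sent from n to k

send = {}
for n in range(N):
    offset, send[n] = 0, []
    for k in range(N):
        send[n].append(data[n][offset:offset + counts[n][k]])
        offset += counts[n][k]

# After the exchange, recv[k][n] is the (possibly empty) block node k got from node n.
recv = {k: [send[n][k] for n in range(N)] for k in range(N)}
```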
As explained above, in the case of the parallel computation method relating to the third embodiment of the present invention, Alltoall communication is performed in steps 1104 and 1106, and Alltoallv communication is performed in steps 1108, 1110, 1114, and 1116, and the total number of times of communication is 2N^(1/2). Also, similar to the cases of the first and second embodiments, since small pieces of data, that are formed by dividing the submatrix aij or bij by the number of computation nodes 100, are transferred at each time of communication, the data length transferred at a single time of communication is 1/N. Thus, the whole relative communication time is T=2/N^(1/2). For example, T=0.25 in the case that N=64.
Thus, the relative communication time in the case that the parallel computation method relating to the third embodiment of the present invention is used is 1/N of that in the case that the prior-art algorithm is used, so that the relative theoretical performance is speeded up to N-fold of that of the prior-art algorithm.
A difference between the fourth embodiment and the above-explained respective embodiments is that, in the fourth embodiment, small pieces of data of submatrices aij and bij are distributed in advance to respective computation nodes 100, in such a manner that the state thereof becomes the same as the state after the submatrices aij of the matrix A are distributed to respective computation nodes 100 by the Alltoall communication in step 904 in the above-explained second embodiment (or step 1104 in the third embodiment) and the submatrices bij of the matrix B are distributed to respective computation nodes 100 by the Alltoall communication in step 906 in the second embodiment (or step 1106 in the third embodiment).
First, in step 1302, respective submatrices aij of the matrix A are divided, respectively, into plural small pieces of data, and the divided small pieces of data are distributed to corresponding computation nodes 100, respectively. Specifically, as shown in the table of step 1302 in
As a result of the above initial distribution, for example, the computation node N0 holds the small piece of data a110 of the submatrix a11, the small piece of data a120 of the submatrix a12, the small piece of data a130 of the submatrix a13, the small piece of data a210 of the submatrix a21, the small piece of data a220 of the submatrix a22, the small piece of data a230 of the submatrix a23, the small piece of data a310 of the submatrix a31, the small piece of data a320 of the submatrix a32, and the small piece of data a330 of the submatrix a33. Similarly, the computation node N1 holds the small piece of data a111 of the submatrix a11, the small piece of data a121 of the submatrix a12, the small piece of data a131 of the submatrix a13, the small piece of data a211 of the submatrix a21, the small piece of data a221 of the submatrix a22, the small piece of data a231 of the submatrix a23, the small piece of data a311 of the submatrix a31, the small piece of data a321 of the submatrix a32, and the small piece of data a331 of the submatrix a33. Similar results will be obtained with respect to other computation nodes.
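The initial distribution of steps 1302-1304 can be modeled as follows; the submatrix contents are illustrative, and the point is only that node Nk ends up holding piece k of every submatrix, so that no Scatter or Alltoall step is needed afterwards.

```python
# Initial distribution of the fourth embodiment: node Nk holds a_ij_k for all (i, j).
import numpy as np

p, N = 3, 9
sub = {(i, j): np.arange(9.0) + 10 * i + j          # illustrative submatrix a_ij
       for i in range(1, p + 1) for j in range(1, p + 1)}

pieces = {(i, j): np.array_split(sub[(i, j)], N) for (i, j) in sub}
node_holds = {k: {(i, j): pieces[(i, j)][k] for (i, j) in sub} for k in range(N)}
# e.g. node N0 holds a11_0, a12_0, ..., a33_0, as described above.
```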
Next, similar to step 1302, in step 1304, respective submatrices bij of the matrix B are divided, respectively, into plural small pieces of data, and the divided small pieces of data are distributed to corresponding computation nodes 100, respectively.
Thereafter, in steps 1306, 1312, and 1318, small pieces of data, that are held by respective computation nodes 100, of the submatrices aij are exchanged between the computation nodes 100 by performing Alltoallv communication, serially.
Specifically, Alltoallv communication in step 1306 is that wherein the processes for obtaining, by the computation nodes N0, N3, and N6, respective small pieces of data a11k of the submatrix a11, respective small pieces of data a21k of the submatrix a21, and respective small pieces of data a31k of the submatrix a31 from respective computation nodes 100, respectively, are incorporated in the blank cells in the Allgather communication in step 908 in the second embodiment. Also, Alltoallv communication in step 1312 is that wherein the processes for obtaining, by the computation nodes N1, N4, and N7, respective small pieces of data a12k of the submatrix a12, respective small pieces of data a22k of the submatrix a22, and respective small pieces of data a32k of the submatrix a32 from respective computation nodes 100, respectively, are incorporated in the blank cells in the Allgather communication in step 914 in the second embodiment. Further, Alltoallv communication in step 1318 is that wherein the processes for obtaining, by the computation nodes N2, N5, and N8, respective small pieces of data a13k of the submatrix a13, respective small pieces of data a23k of the submatrix a23, and respective small pieces of data a33k of the submatrix a33 from respective computation nodes 100, respectively, are incorporated in the blank cells in the Allgather communication in step 920 in the second embodiment.
By performing the Alltoallv communication in step 1306, each of the computation nodes N0, N1 and N2 obtains the submatrix a11, each of the computation nodes N3, N4 and N5 obtains the submatrix a21, and each of the computation nodes N6, N7 and N8 obtains the submatrix a31. Also, by performing the Alltoallv communication in step 1312, each of the computation nodes N0, N1 and N2 obtains the submatrix a12, each of the computation nodes N3, N4 and N5 obtains the submatrix a22, and each of the computation nodes N6, N7 and N8 obtains the submatrix a32. Further, by performing the Alltoallv communication in step 1318, each of the computation nodes N0, N1 and N2 obtains the submatrix a13, each of the computation nodes N3, N4 and N5 obtains the submatrix a23, and each of the computation nodes N6, N7 and N8 obtains the submatrix a33.
Also, in steps 1308, 1314, and 1320, small pieces of data, that are held by respective computation nodes 100, of the submatrices bij are exchanged between the computation nodes 100 by performing Alltoallv communication, serially.
Specifically, Alltoallv communication in step 1308 is that wherein the processes for obtaining, by the computation nodes N0, N1, and N2, respective small pieces of data b11k of the submatrix b11, respective small pieces of data b12k of the submatrix b12, and respective small pieces of data b13k of the submatrix b13 from respective computation nodes 100, respectively, are incorporated in the blank cells in the Allgather communication in step 910 in the second embodiment. Also, Alltoallv communication in step 1314 is that wherein the processes for obtaining, by the computation nodes N3, N4, and N5, respective small pieces of data b21k of the submatrix b21, respective small pieces of data b22k of the submatrix b22, and respective small pieces of data b23k of the submatrix b23 from respective computation nodes 100, respectively, are incorporated in the blank cells in the Allgather communication in step 916 in the second embodiment. Further, Alltoallv communication in step 1320 is that wherein the processes for obtaining, by the computation nodes N6, N7, and N8, respective small pieces of data b31k of the submatrix b31, respective small pieces of data b32k of the submatrix b32, and respective small pieces of data b33k of the submatrix b33 from respective computation nodes 100, respectively, are incorporated in the blank cells in the Allgather communication in step 922 in the second embodiment.
By performing the Alltoallv communication in step 1308, each of the computation nodes N0, N3 and N6 obtains the submatrix b11, each of the computation nodes N1, N4 and N7 obtains the submatrix b12, and each of the computation nodes N2, N5 and N8 obtains the submatrix b13. Also, by performing the Alltoallv communication in step 1314, each of the computation nodes N0, N3 and N6 obtains the submatrix b21, each of the computation nodes N1, N4 and N7 obtains the submatrix b22, and each of the computation nodes N2, N5 and N8 obtains the submatrix b23. Further, by performing the Alltoallv communication in step 1320, each of the computation nodes N0, N3 and N6 obtains the submatrix b31, each of the computation nodes N1, N4 and N7 obtains the submatrix b32, and each of the computation nodes N2, N5 and N8 obtains the submatrix b33.
As explained above, in the case of the parallel computation method relating to the fourth embodiment of the present invention, Alltoallv communication is performed in steps 1306, 1308, 1312, 1314, 1318, and 1320, and the total number of times of communication is 2N^(1/2). Also, similar to the cases of the above-explained embodiments, since small pieces of data, that are formed by dividing the submatrix aij or bij by the number of computation nodes 100, are transferred at each time of communication, the data length transferred at a single time of communication is 1/N. Thus, the whole relative communication time is T=2/N^(1/2), that is the same as that in the third embodiment. For example, T=0.25 in the case that N=64.
Thus, the relative communication time in the case that the parallel computation method relating to the fourth embodiment of the present invention is used is 1/N of that in the case that the prior-art algorithm is used, so that the relative theoretical performance is speeded up to N-fold of that of the prior-art algorithm.
The parallel computation in each of the above-explained embodiments uses, as the basis thereof, SUMMA that is one of prior-art algorithms for matrix product computation. However, the essence of the present invention disclosed in this specification is not limited to application to SUMMA only. Cannon's algorithm, Fox's algorithm, and so on have been known as other examples of matrix product computation algorithms; and, based on these algorithms, additional embodiments, that are similar to the above-explained embodiments, can be provided.
When
When
In each of the above-explained embodiments, the parallel computation system 10 is constructed to have the form wherein each computation node 100 is full-mesh connected to all of the other computation nodes 100, as shown in
As shown in the figure, the nine computation nodes N0-N8 in the parallel computation system 210 are divided into three groups G1, G2, and G3, wherein each group comprises three computation nodes 100. The first group G1 comprises computation nodes N0, N1, and N2, the second group G2 comprises computation nodes N3, N4, and N5, and the third group G3 comprises computation nodes N6, N7, and N8. Full-mesh connection between the computation nodes 100 is made in the respective groups. For example, in the first group G1, full-mesh connection between the computation nodes N0, N1, and N2 is made (i.e., each one is connected to the computation nodes 100 other than the one). Similar connection is made in each of the second group G2 and the third group G3. As a result, three full-mesh connection networks G1, G2, and G3, that do not overlap with each other, are formed.
Further, the nine computation nodes N0-N8 in the parallel computation system 210 are divided into three different groups G4, G5, and G6, wherein each group is different from the above three different groups G1, G2, and G3, and comprises three computation nodes 100. The fourth group G4 comprises computation nodes N0, N3, and N6, the fifth group G5 comprises computation nodes N1, N4, and N7, and the sixth group G6 comprises computation nodes N2, N5, and N8. Similar to the cases of the above groups G1, G2, and G3, full-mesh connection is made in each of the groups G4, G5, and G6. For example, in the fourth group G4, full-mesh connection between the computation nodes N0, N3, and N6 is made. Similar connection is made in each of the fifth group G5 and the sixth group G6. As a result, three full-mesh connection networks G4, G5, and G6, that are independent of the above full-mesh connection networks G1, G2, and G3, are formed.
In this regard, for example, the computation node N0 is a component of the full-mesh connection network G1 which comprises computation nodes arranged in a horizontal direction in
In this manner, the parallel computation system 210 comprises three full-mesh connection networks G1, G2, and G3, each comprising computation nodes arranged in a horizontal direction in
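The grouping used by the parallel computation system 210 can be written down compactly; the 0-based node numbering below is an assumption for illustration.

```python
# Row groups G1-G3 and column groups G4-G6 of the 3*3 node grid.
p = 3
grid = [[p * r + c for c in range(p)] for r in range(p)]

row_groups = grid                                                # G1, G2, G3
col_groups = [[grid[r][c] for r in range(p)] for c in range(p)]  # G4, G5, G6

assert row_groups == [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
assert col_groups == [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
# Each node belongs to one row group and one column group, so it needs only
# 2*(p-1) = 4 direct links instead of the N-1 = 8 links of a one-dimensional full mesh.
```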
For example, when attention is directed to the full-mesh connection network G1, operation therein is that each of the computation nodes N0, N1, and N2 divides a submatrix a1j, that is held thereby, into three small pieces of data, and transmits the divided small pieces of data to the respective computation nodes 100 in the full-mesh connection network G1 by performing Scatter communication or Alltoall communication. Next, each of the computation nodes N0, N1, and N2 collects the above small pieces of data, that have been distributed in the full-mesh connection network G1, by performing Allgather communication or Alltoallv communication, and reconstructs the original submatrix a1j. Similarly, regarding the full-mesh connection networks G2 and G3, each of submatrices a2j and a3j is divided into small pieces of data, and they are transferred between the computation nodes 100 in the corresponding full-mesh connection network.
On the other hand, in the full-mesh connection network G4, three small pieces of data, that are formed by dividing a submatrix bi1 into three parts, are transferred in a similar manner between the computation nodes N0, N3, and N6. Further, in the full-mesh connection networks G5 and G6, small pieces of data of each of submatrices bi2 and bi3 are transferred in a similar manner between the computation nodes 100.
In this manner, in the parallel computation system 210 in which two-dimensional full-mesh connection of the computation nodes 100 has been made, each of the computation nodes 100 can obtain data, that is required for calculation with respect to a submatrix cij, from other computation nodes 100.
The communication time in the case that Alltoall communication and Alltoallv communication are used for transferring submatrices in the present embodiment is compared with that in the above-explained third embodiment (it should be recalled that Alltoall communication and Alltoallv communication are used similarly therein). As explained above, in the case of the third embodiment, the number of times of communication is M=2N^(1/2), and the data length transferred per single time of communication is S=1/N. On the other hand, in the case of the present embodiment, a submatrix is divided by the number of computation nodes in a single group in the parallel computation system 210 (rather than the number of all computation nodes in the parallel computation system 210), so that the data length transferred per single time of communication is S=1/N^(1/2). Further, in the case of the present embodiment, since transferring of a submatrix aij and transferring of a submatrix bij can be performed at the same time by a single time of Alltoall communication or Alltoallv communication, the number of times of communication is M=N^(1/2). Further, if it is supposed that the communication band per single computation node is a constant value “1,” communication band B per communication link in the third embodiment is B=1/(N−1)≈1/N, since each computation node 100 communicates with other (N−1) computation nodes 100; and, on the other hand, B=1/(2(N^(1/2)−1))≈1/(2N^(1/2)) in the present embodiment, since each computation node 100 communicates with other 2(N^(1/2)−1) computation nodes 100. Thus, the total relative communication time T (=MS/B), that is required for transferring all pieces of data, with respect to the present embodiment is equal to that with respect to the third embodiment.
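Under the approximations stated above, the equality of the two relative communication times can be confirmed as follows.

```python
# T = M*S/B for the third embodiment and for the present (seventh) embodiment.
def T_third(N):
    M, S, B = 2 * N ** 0.5, 1.0 / N, 1.0 / N                   # B ~ 1/(N-1)
    return M * S / B

def T_seventh(N):
    M, S, B = N ** 0.5, 1.0 / N ** 0.5, 1.0 / (2 * N ** 0.5)   # B ~ 1/(2(sqrt(N)-1))
    return M * S / B

N = 64
print(T_third(N), T_seventh(N))   # both 16.0 under these approximations
```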
As explained above, the parallel computation system 210 according to the seventh embodiment of the present invention can realize speeded-up processing similar to that in the parallel computation system 10 relating to each of the above-explained embodiments. Further, when it is supposed that wavelength multiplex communication is performed between computation nodes that are connected by adopting full-mesh connection (one-dimensional or two-dimensional), it is required to prepare N different wavelengths in the case of the parallel computation system 10 in
Respective computation nodes 300 are physically connected to a wavelength router 225 by optical fibers 227. The parallel computation system 220 has star-connection-type physical topology wherein all computation nodes 300 are physically connected to the wavelength router 225. Each computation node 300 can communicate with any other computation nodes 300 via the wavelength router 225. Thus, logically, the parallel computation system 220 is constructed to have logical topology of one-dimensional full-mesh connection such as that shown in
The wavelength router 225 comprises plural input/output ports P1-P8, and the respective computation nodes N0-N8 are connected to corresponding input/output ports. An optical signal transmitted from each computation node 300 is inputted to one of the ports P1-P8 of the wavelength router 225. The wavelength router 225 has a function to allocate an optical signal inputted to each port to an output port, that corresponds to the wavelength of the optical signal, in the ports P1-P8. By the above wavelength routing, an optical signal from an origination computation node 300 is routed to a destination computation node 300. For example, as shown in
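As one possible illustration of such static wavelength routing (an assumption introduced here only for explanation; the actual routing table of the wavelength router 225 is not limited to this), a cyclic router can be modeled as mapping each input port and wavelength index to a fixed output port.

```python
# Hypothetical cyclic wavelength-routing model (illustrative assumption only).
N = 9   # one router port per computation node assumed here

def output_port(input_port: int, wavelength_index: int) -> int:
    return (input_port + wavelength_index) % N

# An origination node selects the wavelength that routes to the destination node.
assert output_port(2, 3) == 5
# For a fixed wavelength, distinct inputs map to distinct outputs, so all node
# pairs can communicate simultaneously without any packet switching.
```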
In this regard, although data transfer between the memory 120 and the crossbar switch 330 is performed via the processor 110 in
In the parallel computation system 220 which is constructed to perform wavelength routing as explained above, it is also possible to perform data communication for parallel computation in a manner similar to those in the above-explained first to seventh embodiments, and speeding up of parallel computation can be achieved thereby.
As explained above, the parallel computation system 220 according to the present embodiment has the construction wherein physical connection between the respective computation nodes 300 is made via the optical fibers 227 and the wavelength router 225, and logical full-mesh connection is made between the respective computation nodes 300 by wavelength routing by the wavelength router 225. In the following description, advantageous points of the above parallel computation system 220, when it is compared with a prior-art parallel computation system in which connection between respective computation nodes is made via a packet switch, will be explained. First, regarding consumption of electric power required for communication between computation nodes, electric power consumption of a prior-art electric-driven packet switch is proportional to throughput ((Line rate)*(Number of ports)); on the other hand, electric power consumption of the wavelength router 225 is independent of throughput, so that, in a state that the throughput is high, electric power consumption of the parallel computation system 220 in the present embodiment is lower than that in the prior art. Next, regarding the number of ports, the upper limit of the number of ports in a prior-art electric-driven packet switch is determined mainly based on the number of electric connectors installable on a front panel, and it is approximately 36 per 1U. On the other hand, the number of ports in a wavelength router is determined mainly based on the number of wavelengths; so that, when it is supposed that a symbol rate of a signal is 25 GBaud and a channel spacing is 50 GHz, approximately 80 ports can be provided with respect to the whole C-band used in long-distance optical fiber communication. In the case that an MT connector or the like is used as an optical fiber connector, it is possible to form arrays with a pitch equal to or shorter than 250 μm; and a connector for 160 optical fiber cores, that are required for connecting 80 computation nodes, can be installed within a 1U front panel. Thus, the parallel computation system 220 according to the present embodiment can be downsized further, compared with that of a prior art. Further, regarding ease of adaptation to speeding up of communication between computation nodes, a prior-art electric-driven packet switch is dependent on a bit rate and a modulation method, so that it is necessary to replace the electric-driven packet switch when the communication speed between the computation nodes is increased; on the other hand, the wavelength router 225 can be used continuously as it stands, since it does not include electrical signal processing and is independent of a bit rate and a modulation method. Thus, compared with a prior-art parallel computation system, the parallel computation system 220 according to the present embodiment has advantageous points in that it is economical and friendly to the global environment.
The embodiments of the present invention have been explained in the above description; and, in this regard, the present invention is not limited by the above embodiments, and the embodiments can be modified in various ways without departing from the scope of the gist of the present invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---
PCT/JP2019/028252 | 7/18/2019 | WO | 00 |