The embodiments discussed herein are related to an information processing apparatus and an information processing method.
In order to increase the throughput of a parallel distributed process executed by a parallel computer, it is important to efficiently perform communication in the parallel computer by optimizing the connection form of nodes and switches (for example, a network topology). In addition, in a case where the network topology is optimized to connect more nodes with fewer switches, the construction cost of the parallel computer may be reduced.
A certain document discloses a system (hereinafter, referred to as a multilayer full mesh system) employing a multilayer full mesh topology that is a topology capable of connecting more nodes than a fat tree topology even with the same number of switches.
However, in the multilayer full mesh system, there are fewer routes for efficient communication. Thus, route contention is likely to occur. While the above document discloses a method for avoiding the route contention at the time of executing all-to-all communication, a method for avoiding the route contention at the time of executing all-reduce communication is not considered.
The related technology is disclosed in Japanese Laid-open Patent Publication No. 2015-232874.
According to an aspect of the embodiments, an information processing apparatus includes a memory and a processor coupled to the memory. The processor performs first all-reduce communication with another information processing apparatus coupled to a first leaf switch coupled to the information processing apparatus, performs second all-reduce communication with one information processing apparatus coupled to a second leaf switch included in the same layer as the first leaf switch and third all-reduce communication with one information processing apparatus coupled to a third leaf switch which is coupled to a spine switch coupled to the first leaf switch and is included in a layer different from the layer including the first leaf switch, using a result of the first all-reduce communication, and transmits a result of a process of the second all-reduce communication and the third all-reduce communication to the other information processing apparatus coupled to the first leaf switch.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
First, all-reduce communication will be described.
For example, the all-reduce communication for implementing a state illustrated on the right side of
Then, as illustrated in
Then, as illustrated in
Last, as illustrated in
Not all of the nodes n0 to n5 need to be targets. A subset of the nodes n0 to n5 may be targeted. The all-reduce communication in a case where the nodes n0, n1, n3, and n4 are set as targets will be described as one example. First, as illustrated in
Then, as illustrated in
Below, a way of preventing route contention in the case of executing such all-reduce communication in a multilayer full mesh system is considered. Route contention means that a plurality of packets are transmitted at the same time in the same direction on one route. The route contention in a case where the all-reduce communication is executed in a topology of a normal tree structure is illustrated in
A Spine switch A is coupled to a Leaf switch a1, a Leaf switch b1, a Leaf switch a2, a Leaf switch b2, a Leaf switch a3, and a Leaf switch b3.
A Spine switch B is coupled to the Leaf switch a1, a Leaf switch c1, the Leaf switch a2, a Leaf switch c2, the Leaf switch a3, and a Leaf switch c3.
A Spine switch C is coupled to the Leaf switch a1, a Leaf switch d1, the Leaf switch a2, a Leaf switch d2, the Leaf switch a3, and a Leaf switch d3.
A Spine switch D is coupled to the Leaf switch b1, the Leaf switch c1, the Leaf switch b2, the Leaf switch c2, the Leaf switch b3, and the Leaf switch c3.
A Spine switch E is coupled to the Leaf switch b1, the Leaf switch d1, the Leaf switch b2, the Leaf switch d2, the Leaf switch b3, and the Leaf switch d3.
A Spine switch F is coupled to the Leaf switch c1, the Leaf switch d1, the Leaf switch c2, the Leaf switch d2, the Leaf switch c3, and the Leaf switch d3.
Three nodes are coupled to each Leaf switch.
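The couplings listed above can be written down and checked mechanically. The following is a minimal sketch and is not part of the embodiment. The names `POSITIONS`, `LAYERS`, `SPINES`, `spine_links`, `leaf_spines`, and `shared_spines` are hypothetical, and the sketch assumes that the Spine switches A to F correspond, in order, to the six pairs of Leaf switch positions a to d, which agrees with the couplings listed above.

```python
from itertools import combinations

POSITIONS = "abcd"      # Leaf switch positions within a layer (hypothetical name)
LAYERS = (1, 2, 3)      # layer numbers, as in a1, a2, a3
NODES_PER_LEAF = 3      # three nodes are coupled to each Leaf switch

# Each Spine switch serves one pair of positions:
# A=(a,b), B=(a,c), C=(a,d), D=(b,c), E=(b,d), F=(c,d).
# combinations() yields the pairs in exactly this order.
SPINES = dict(zip("ABCDEF", combinations(POSITIONS, 2)))


def spine_links(spine):
    """Return the set of Leaf switches coupled to the given Spine switch."""
    return {f"{pos}{layer}" for layer in LAYERS for pos in SPINES[spine]}


def leaf_spines(leaf):
    """Return the set of Spine switches coupled to the given Leaf switch."""
    return {spine for spine in SPINES if leaf in spine_links(spine)}


def shared_spines(leaf_x, leaf_y):
    """Return the Spine switches coupled to both Leaf switches."""
    return leaf_spines(leaf_x) & leaf_spines(leaf_y)


leaves = [f"{pos}{layer}" for layer in LAYERS for pos in POSITIONS]
total_nodes = len(leaves) * NODES_PER_LEAF
```

Under these couplings, any two Leaf switches in the same layer share exactly one Spine switch, and `total_nodes` evaluates to 36 for this example.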
Each node is an information processing apparatus that performs communication using a communication library such as the message passing interface (MPI).
In the present embodiment, an InfiniBand network in which regular and fixed routing is performed is used for avoiding the route contention. Routing in the InfiniBand network will be described using
In
For example, the network of the present embodiment is not a network in which the route is automatically decided, like Ethernet (registered trademark), but is a network in which regular and fixed routing is performed.
For example, the generation circuit 301 is implemented by executing a program stored in a memory 2501 in
The generation circuit 301 generates a communication table based on information related to the topology of the multilayer full mesh system 1000. The generation circuit 301 stores the generated communication table in the communication table memory 303. The generation circuit 301 transmits the communication table stored in the communication table memory 303 to each node at a predetermined timing or in response to a request.
For example, the processing circuit 101 is implemented by executing a program stored in a memory 2501 in
The communication table memory 103 stores the communication table received from the management apparatus 3. The first communication circuit 1011, the second communication circuit 1013, and the third communication circuit 1015 in the processing circuit 101 transmit and receive the packet in accordance with the communication table stored in the communication table memory 103.
Next, a process executed in the multilayer full mesh system 1000 of the present embodiment will be described using
The multilayer full mesh system 1000 executes the all-reduce under control of each Leaf switch (
First, a case where the number of nodes under control of the Leaf switch is an even number (four, which is a power of two) will be described using
For example, as illustrated in
Then, as illustrated in
Accordingly, finally, each node has a value “23” as illustrated in
Next, a case where the number of nodes under control of the Leaf switch is an odd number (five) will be described using
For example, as illustrated in
Then, as illustrated in
Then, as illustrated in
Then, as illustrated in
Then, as illustrated in
While the above describes one example of the all-reduce performed in step S1, the all-reduce may basically be performed using the same method even in a case where the number of nodes differs from that in the example.
Returning to the description of
In the present embodiment, an ID is assigned to each node under control of each Leaf switch. For example, the nodes under control of each of the Leaf switches a1, a2, a3, b1, b2, b3, c1, c2, c3, d1, d2, and d3 are assigned 0, 1, and 2 as IDs.
As illustrated in
Returning to the description of
In the present embodiment, a column includes a Leaf switch at the same position in the layer and a node under control of the Leaf switch. For example, in
For example, Leaf switches coupled to the same Spine switches, and the nodes under control of those Leaf switches, belong to the same column. Since the Leaf switch a1, the Leaf switch a2, and the Leaf switch a3 are each coupled to the Spine switch A, the Spine switch B, and the Spine switch C, the Leaf switch a1, the Leaf switch a2, and the Leaf switch a3 belong to the same column, as do the nodes under their control. Similarly, the Leaf switch b1, the Leaf switch b2, and the Leaf switch b3 are each coupled to the Spine switch A, the Spine switch D, and the Spine switch E and belong to the same column together with the nodes under their control. The Leaf switch c1, the Leaf switch c2, and the Leaf switch c3 are each coupled to the Spine switch B, the Spine switch D, and the Spine switch F and belong to the same column together with the nodes under their control. The Leaf switch d1, the Leaf switch d2, and the Leaf switch d3 are each coupled to the Spine switch C, the Spine switch E, and the Spine switch F and belong to the same column together with the nodes under their control.
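The column membership described above can be derived directly from the couplings. The following is a minimal sketch, not part of the embodiment, that groups Leaf switches by the set of Spine switches to which they are coupled. The names `SPINE_TO_LEAVES` and `columns_by_spine_set` are hypothetical; the coupling data is copied from the description.

```python
from collections import defaultdict

# Spine-to-Leaf couplings as listed in the description.
SPINE_TO_LEAVES = {
    "A": ["a1", "b1", "a2", "b2", "a3", "b3"],
    "B": ["a1", "c1", "a2", "c2", "a3", "c3"],
    "C": ["a1", "d1", "a2", "d2", "a3", "d3"],
    "D": ["b1", "c1", "b2", "c2", "b3", "c3"],
    "E": ["b1", "d1", "b2", "d2", "b3", "d3"],
    "F": ["c1", "d1", "c2", "d2", "c3", "d3"],
}


def columns_by_spine_set(spine_to_leaves):
    """Group Leaf switches that are coupled to the same set of Spine switches."""
    leaf_to_spines = defaultdict(set)
    for spine, leaves in spine_to_leaves.items():
        for leaf in leaves:
            leaf_to_spines[leaf].add(spine)

    columns = defaultdict(list)
    for leaf in sorted(leaf_to_spines):
        # Leaf switches with an identical Spine set fall into the same column.
        columns[frozenset(leaf_to_spines[leaf])].append(leaf)
    return dict(columns)


columns = columns_by_spine_set(SPINE_TO_LEAVES)
```

For the couplings above, this grouping yields four columns of three Leaf switches each, keyed by the Spine sets {A, B, C}, {A, D, E}, {B, D, F}, and {C, E, F}.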
The all-reduce in step S5 is performed in the same manner as the all-reduce described in step S1. In the example in
Returning to
Next, a process in which the management apparatus 3 generates the communication table will be described using
The generation circuit 301 of the management apparatus 3 reads information related to the topology of the multilayer full mesh system 1000 (for example, information related to the number of nodes) from the HDD 2505. The generation circuit 301 generates all-reduce communication data for the node under control of each Leaf switch (
A generation procedure (Allreduce(n)) for the communication table in the case of performing the all-reduce among n (n is a natural number) nodes will be described. In the present embodiment, the communication table is generated by a recursive process.
The process is finished in a case where the number n of nodes under control of the Leaf switch is one.
In a case where the number n of nodes under control of the Leaf switch is two, the communication data for communication between two nodes is written into the communication table.
In a case where the number n of nodes under control of the Leaf switch is an odd number 2m+1 (m is a natural number), two nodes (a node P and a node Q) are selected from the n nodes, and the communication data for the all-reduce communication between the node P and the node Q is written into the communication table. Then, Allreduce(2m) is called for one of the node P and the node Q together with the remaining (2m−1) nodes (that is, 2m nodes in total). Finally, the communication data for delivering the result of Allreduce(2m) from the node P to the node Q is written into the communication table.
In a case where the number n of nodes under control of the Leaf switch is 2m (m is a natural number greater than or equal to two), the nodes under control of the Leaf switch are divided into two groups of m nodes each, and Allreduce(m) is called for each group in parallel at the same time.
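The recursive procedure above can be sketched as follows. This is an illustration under stated assumptions, not the generation circuit itself. The function names `allreduce_phases` and `simulate` and the tuple format `(kind, a, b)` are hypothetical. The sketch further assumes that, in the 2m case, a pairwise exchange between the two groups follows the two parallel Allreduce(m) calls, and that the node P and the node Q are the last two nodes of the list. Both assumptions are inferred from the five-node communication table given later in the description rather than stated in the text.

```python
def allreduce_phases(nodes):
    """Return a list of phases; each phase is a list of (kind, a, b) tuples.

    kind == "exchange": a and b swap values and both keep the sum.
    kind == "send":     a delivers its result to b.
    """
    n = len(nodes)
    if n <= 1:
        return []                                   # nothing to communicate
    if n == 2:
        return [[("exchange", nodes[0], nodes[1])]]
    if n % 2 == 1:
        p, q = nodes[-2], nodes[-1]                 # assumed choice of P and Q
        phases = [[("exchange", p, q)]]             # all-reduce between P and Q
        phases += allreduce_phases(nodes[:-1])      # Allreduce(2m) including P
        phases.append([("send", p, q)])             # deliver the result to Q
        return phases
    m = n // 2
    left, right = nodes[:m], nodes[m:]
    # Both groups have the same size, so their phase lists have equal length
    # and can be merged phase by phase (executed in parallel).
    merged = [a + b for a, b in zip(allreduce_phases(left), allreduce_phases(right))]
    # Assumed final step: the i-th nodes of the two groups exchange results.
    merged.append([("exchange", l, r) for l, r in zip(left, right)])
    return merged


def simulate(values, phases):
    """Apply the phases to a dict node -> value, using addition as the reduction."""
    v = dict(values)
    for phase in phases:
        nxt = dict(v)
        for kind, a, b in phase:
            if kind == "exchange":
                nxt[a] = nxt[b] = v[a] + v[b]
            else:
                nxt[b] = v[a]
        v = nxt
    return v
```

For five nodes with IDs 1 to 5, `allreduce_phases([1, 2, 3, 4, 5])` produces four phases: an exchange between nodes 4 and 5, exchanges 1-2 and 3-4, exchanges 1-3 and 2-4, and a final delivery from node 4 to node 5.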
Returning to the description of
For example, the radial structure in step S13 is a radial structure that has its center at the Leaf switch a1 and is represented by two bold solid lines and two bold broken lines in
The generation circuit 301 specifies a radial structure that has its center at each Spine switch. The generation circuit 301 generates the all-reduce communication data for the specified radial structure (step S15). The generation circuit 301 writes the generated communication data into the communication table stored in the communication table memory 303. The communication data generated in step S15 is communication data for performing the all-reduce in each column in parallel at the same time.
For example, the radial structure in step S15 is a radial structure represented by links connecting three nodes with the same hatching in
The generation circuit 301 generates broadcast communication data for the node under control of each Leaf switch (step S17). The generation circuit 301 writes the generated communication data into the communication table stored in the communication table memory 303 and transmits the communication table to each node. Then, the process is finished.
Node 1: -, 2, 3, -
Node 2: -, 1, 4, -
Node 3: -, 4, 1, -
Node 4: 5, 3, 2, 5 (transmission)
Node 5: 4, -, -, 4 (reception)
Here, “-” represents that communication is not performed. A number represents the ID of a communication partner. Transmission is represented by “(transmission)”, and reception is represented by “(reception)”.
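The per-node rows above can be converted into per-phase communication data and checked for consistency. The following is a minimal sketch under the assumption that the four entries of each row correspond to four consecutive phases. The names `ROWS`, `is_consistent`, and `rows_to_phases` and the tuple format are hypothetical.

```python
# Rows copied from the communication table above (node ID -> entry per phase).
ROWS = {
    1: ["-", "2", "3", "-"],
    2: ["-", "1", "4", "-"],
    3: ["-", "4", "1", "-"],
    4: ["5", "3", "2", "5 (transmission)"],
    5: ["4", "-", "-", "4 (reception)"],
}


def is_consistent(rows):
    """Check that whenever node X names partner Y in a phase, Y names X too."""
    for node, cells in rows.items():
        for i, cell in enumerate(cells):
            if cell == "-":
                continue
            partner = int(cell.split()[0])
            back = rows[partner][i]
            if back == "-" or int(back.split()[0]) != node:
                return False
    return True


def rows_to_phases(rows):
    """Convert per-node rows into a list of phases of (kind, a, b) tuples."""
    n_phases = len(next(iter(rows.values())))
    phases = []
    for i in range(n_phases):
        ops = set()
        for node, cells in rows.items():
            cell = cells[i]
            if cell == "-":
                continue
            partner = int(cell.split()[0])
            if "(transmission)" in cell:
                ops.add(("send", node, partner))        # node -> partner
            elif "(reception)" in cell:
                ops.add(("send", partner, node))        # partner -> node
            else:
                ops.add(("exchange", min(node, partner), max(node, partner)))
        phases.append(sorted(ops))
    return phases


phases = rows_to_phases(ROWS)
```

Using a set per phase removes the duplicate that arises because each exchange is listed once by each of its two participants.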
Next, a process executed in a case where each node transmits the packet will be described using
The processing circuit 101 in the node acquires the ID of the transmission source node of the packet (for example, the node of the processing circuit 101) from the HDD 2505 (
The processing circuit 101 sets i that is a variable representing the phase number to 1 (step S23).
The processing circuit 101 reads the communication data of phase i from the communication table stored in the communication table memory 103 (step S25).
The processing circuit 101 determines whether or not the ID of the node acquired in step S21 is included in the communication data read in step S25 (step S27).
In a case where the ID of the node acquired in step S21 is not included in the communication data read in step S25 (step S27: No route), the node does not perform communication in phase i. Thus, the process transitions to step S31.
In a case where the ID of the node acquired in step S21 is included in the communication data read in step S25 (step S27: Yes route), the processing circuit 101 executes communication in accordance with the communication data read in step S25 (step S29). In step S29, the first communication circuit 1011, the second communication circuit 1013, or the third communication circuit 1015 operates depending on the content of communication.
The processing circuit 101 determines whether or not i=imax is satisfied (step S31). The maximum value of the phase number is imax.
In a case where i=imax is not satisfied (step S31: No route), the processing circuit 101 increments i by one (step S33) and returns to the process of step S25. The finish of the phase is confirmed by barrier synchronization. In a case where i=imax is satisfied (step S31: Yes route), the process is finished.
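The flow of steps S23 to S33 can be sketched as a simple loop. This is a minimal illustration, not the processing circuit 101 itself. The function `run_phases`, the callbacks `communicate` and `barrier`, and the table format (a list of phases, each a list of `(kind, a, b)` tuples) are hypothetical stand-ins for the communication circuits, the barrier synchronization, and the communication table.

```python
def run_phases(node_id, table, communicate, barrier):
    """Execute the communication table phase by phase for one node."""
    imax = len(table)                 # maximum value of the phase number
    if imax == 0:
        return
    i = 1                                             # step S23
    while True:
        phase_data = table[i - 1]                     # step S25
        # Step S27: is this node's ID included in the data of phase i?
        my_ops = [op for op in phase_data if node_id in op[1:]]
        for op in my_ops:                             # step S29 (Yes route)
            communicate(i, op)
        if i == imax:                                 # step S31
            return
        barrier()        # completion of the phase is confirmed by a barrier
        i += 1                                        # step S33


# Example table for five nodes (hypothetical encoding of the table above).
TABLE = [
    [("exchange", 4, 5)],
    [("exchange", 1, 2), ("exchange", 3, 4)],
    [("exchange", 1, 3), ("exchange", 2, 4)],
    [("send", 4, 5)],
]
```

In a real system, `communicate` would dispatch to the first, second, or third communication circuit depending on the content of the communication, and `barrier` would be a collective operation across all participating nodes.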
As described above, when a representative node under control of each Leaf switch executes the all-reduce, the occurrence of the route contention may be avoided in the case of performing the all-reduce communication in the multilayer full mesh system 1000. In addition, in the case of using the method of the present embodiment, the all-reduce may be executed with a calculation amount of approximately O(log n) (n is the number of nodes).
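The overall flow of the first embodiment can be sketched as a value-level simulation. The sketch below models only which values are combined at each stage (under each Leaf switch, in each layer, in each column, and the final broadcast). It does not model routes, phases, or contention. The names `COLUMNS`, `LAYERS`, `IDS`, `REP`, and `hierarchical_allreduce` are hypothetical, addition is assumed as the reduction, and the representative node is assumed to be the node with ID 0.

```python
COLUMNS = ["a", "b", "c", "d"]   # Leaf switch positions (columns)
LAYERS = [1, 2, 3]               # layers
IDS = [0, 1, 2]                  # node IDs under each Leaf switch
REP = 0                          # assumed ID of the representative node


def hierarchical_allreduce(values):
    """values maps (column, layer, node_id) to a number; returns the final state."""
    v = dict(values)
    leaves = [(c, l) for c in COLUMNS for l in LAYERS]

    # All-reduce among the nodes under control of each Leaf switch.
    for c, l in leaves:
        s = sum(v[(c, l, i)] for i in IDS)
        for i in IDS:
            v[(c, l, i)] = s

    # All-reduce among the representative nodes in each layer.
    for l in LAYERS:
        s = sum(v[(c, l, REP)] for c in COLUMNS)
        for c in COLUMNS:
            v[(c, l, REP)] = s

    # All-reduce among the representative nodes in each column.
    for c in COLUMNS:
        s = sum(v[(c, l, REP)] for l in LAYERS)
        for l in LAYERS:
            v[(c, l, REP)] = s

    # Broadcast from the representative node to the other nodes of its Leaf switch.
    for c, l in leaves:
        for i in IDS:
            v[(c, l, i)] = v[(c, l, REP)]
    return v


keys = [(c, l, i) for c in COLUMNS for l in LAYERS for i in IDS]
initial = {key: n for n, key in enumerate(keys, start=1)}   # values 1..36
result = hierarchical_allreduce(initial)
```

With the initial values 1 to 36, every node ends with the same value, the sum of all 36 inputs.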
While the all-reduce in which all nodes in the multilayer full mesh system 1000 participate is performed in the first embodiment, the all-reduce in which only some of the nodes in the multilayer full mesh system 1000 participate is performed in a second embodiment. For example, in an example in
By doing so, appropriate sizes of resources may be used depending on the calculation size of a job.
The all-reduce in which only some of the nodes of the multilayer full mesh system 1000 participate is basically performed using the same method as the all-reduce of the first embodiment. Thus, a detailed description of the all-reduce will not be repeated.
In the first embodiment, the all-reduce based on a node having an ID equal to a predetermined number is executed in each layer, and then, the all-reduce based on a node having an ID equal to a predetermined number is executed in each column. Meanwhile, in a third embodiment, the all-reduce based on a node having an ID equal to a predetermined number is executed in each column, and then, the all-reduce based on a node having an ID equal to a predetermined number is executed in each layer.
The multilayer full mesh system 1000 executes the all-reduce under control of each Leaf switch (
The multilayer full mesh system 1000 executes the all-reduce based on a node having an ID equal to a predetermined number (for example, zero) in each column (step S43). In step S43, the all-reduce communication is performed. The all-reduce communication in step S43 is performed by the second communication circuit 1013.
The multilayer full mesh system 1000 executes the all-reduce based on a node having an ID equal to a predetermined number in each layer (step S45). In step S45, the all-reduce communication is performed. The all-reduce communication in step S45 is performed by the second communication circuit 1013.
The multilayer full mesh system 1000 transmits the result of step S45 by broadcasting from a node having an ID equal to a predetermined number (for example, a node on which the all-reduce is executed in step S45) to another node under control of each Leaf switch (step S47). The transmission in step S47 is performed by the third communication circuit 1015. Then, the process is finished.
Even in the case of executing the above process, the occurrence of the route contention may be avoided in the case of performing the all-reduce communication in the multilayer full mesh system 1000.
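That the order of the layer stage and the column stage does not change the final result can be checked with a small sketch. It assumes an associative and commutative reduction such as addition, and it represents the values held by the representative nodes as a grid indexed by layer and column. The function names and the sample grid are hypothetical, and the sketch says nothing about routing.

```python
def allreduce_in_layers(grid):
    """Replace every value by the sum of its row (one row per layer)."""
    return [[sum(row)] * len(row) for row in grid]


def allreduce_in_columns(grid):
    """Replace every value by the sum of its column."""
    col_sums = [sum(col) for col in zip(*grid)]
    return [list(col_sums) for _ in grid]


def layer_then_column(grid):
    """Order of the first embodiment: each layer first, then each column."""
    return allreduce_in_columns(allreduce_in_layers(grid))


def column_then_layer(grid):
    """Order of the third embodiment: each column first, then each layer."""
    return allreduce_in_layers(allreduce_in_columns(grid))


# Three layers (rows) by four columns of representative-node values.
grid = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
first = layer_then_column(grid)
third = column_then_layer(grid)
```

Both orders leave every representative node holding the total of the grid, which is 78 for the sample values.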
The management apparatus 3 generates the communication table in the first embodiment. However, in a fourth embodiment, a node in the multilayer full mesh system 1000 generates the communication table and distributes the communication table to the other nodes.
For example, the processing circuit 101 and the generation circuit 105 are implemented by executing a program stored in the memory 2501 in
The generation circuit 105 generates the communication table based on information related to the topology of the multilayer full mesh system 1000 and stores the generated communication table in the communication table memory 103. In addition, the generation circuit 105 transmits the generated communication table to another node in the multilayer full mesh system 1000 at a predetermined timing or in response to a request. The first communication circuit 1011, the second communication circuit 1013, and the third communication circuit 1015 in the processing circuit 101 transmit and receive the packet in accordance with the communication table stored in the communication table memory 103.
In a case where the above configuration is employed, the management apparatus 3 for generating the communication table need not be separately provided.
While embodiments have been described thus far, the embodiments are not limited thereto. For example, the function block configurations of the node and the management apparatus 3 described above may not match the actual program module configurations.
In addition, the configuration of each table described above is one example, and the above configuration need not be used. Furthermore, in the process flow, the order of processes may be changed as long as the process result is not affected. Furthermore, processes may be executed in parallel.
In addition, the management apparatus 3 need not send the whole communication table to each node and may instead send only the communication data related to each node.
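Sending only the communication data related to each node can be sketched as a per-node filter over the table. The function `slice_for_node` and the table encoding (a list of phases, each a list of `(kind, a, b)` tuples) are hypothetical. The sketch keeps one entry per phase, possibly empty, so that the phase numbering seen by the node is unchanged.

```python
def slice_for_node(table, node_id):
    """Return, for each phase, only the operations in which node_id takes part."""
    return [[op for op in phase if node_id in op[1:]] for phase in table]


# Example table for five nodes (hypothetical encoding).
TABLE = [
    [("exchange", 4, 5)],
    [("exchange", 1, 2), ("exchange", 3, 4)],
    [("exchange", 1, 3), ("exchange", 2, 4)],
    [("send", 4, 5)],
]

node5_data = slice_for_node(TABLE, 5)
```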
In addition, the multilayer full mesh system 1000 is not limited to the example illustrated above.
The node and the management apparatus 3 described above are computer apparatuses. As illustrated in
In addition, as illustrated in
The embodiments described above may be summarized as follows.
An information processing apparatus according to a first aspect of the embodiments is an information processing apparatus among a plurality of information processing apparatuses in a multilayer full mesh system including a plurality of spine switches, a plurality of leaf switches, and the plurality of information processing apparatuses. The information processing apparatus includes (A) a first communication circuit (the first communication circuit 1011 in the embodiments is one example of the first communication circuit) that performs first all-reduce communication with another information processing apparatus coupled to a first leaf switch coupled to the information processing apparatus, (B) a second communication circuit (the second communication circuit 1013 in the embodiments is one example of the second communication circuit) that performs second all-reduce communication with one information processing apparatus coupled to a second leaf switch included in the same layer as the first leaf switch and third all-reduce communication with one information processing apparatus coupled to a third leaf switch which is coupled to a spine switch coupled to the first leaf switch and is included in a layer different from the layer including the first leaf switch, using a result of the first all-reduce communication, and (C) a third communication circuit (the third communication circuit 1015 in the embodiments is one example of the third communication circuit) that transmits a result of a process of the second communication circuit to the other information processing apparatus coupled to the first leaf switch.
The occurrence of the route contention may be avoided in the case of performing the all-reduce communication in the multilayer full mesh system.
In addition, the second communication circuit may perform (b1) the second all-reduce communication with the one information processing apparatus coupled to the second leaf switch using the result of the first all-reduce communication and (b2) the third all-reduce communication with the one information processing apparatus coupled to the third leaf switch using a result of the second all-reduce communication.
The all-reduce is appropriately executed.
In addition, the second communication circuit may perform (b3) the third all-reduce communication with the one information processing apparatus coupled to the third leaf switch using the result of the first all-reduce communication and (b4) the second all-reduce communication with the one information processing apparatus coupled to the second leaf switch using a result of the third all-reduce communication.
The all-reduce is appropriately executed.
In addition, the information processing apparatus may further include (D) a generation circuit (the generation circuit 105 in the embodiments is one example of the generation circuit) that generates a communication table used by each of the plurality of information processing apparatuses for communication and transmits the generated communication table to an information processing apparatus other than the information processing apparatus among the plurality of information processing apparatuses.
Each of the plurality of information processing apparatuses may perform the all-reduce communication such that the route contention does not occur as a whole.
An information processing method according to a second aspect of the embodiments is an information processing method executed by an information processing apparatus among a plurality of information processing apparatuses in a multilayer full mesh system including a plurality of spine switches, a plurality of leaf switches, and the plurality of information processing apparatuses. The information processing method includes a process of (E) performing first all-reduce communication with another information processing apparatus coupled to a first leaf switch coupled to the information processing apparatus, (F) performing second all-reduce communication with one information processing apparatus coupled to a second leaf switch included in the same layer as the first leaf switch and third all-reduce communication with one information processing apparatus coupled to a third leaf switch which is coupled to a spine switch coupled to the first leaf switch and is included in a layer different from the layer including the first leaf switch, using a result of the first all-reduce communication, and (G) transmitting results of the second all-reduce communication and the third all-reduce communication to the other information processing apparatus coupled to the first leaf switch.
A program for causing a computer or a processor to perform the process of the method may be created. The program is stored in a computer-readable storage medium or a storage device such as a flexible disk, a CD-ROM, a magneto-optical disk, a semiconductor memory, or a hard disk. An intermediate process result is temporarily held in a storage device such as a main memory.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
JP2017-086962 | Apr 2017 | JP | national |
This application is a continuation application of International Application PCT/JP2018/004367 filed on Feb. 8, 2018 and designated the U.S., the entire contents of which are incorporated herein by reference. The International Application PCT/JP2018/004367 is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2017-086962, filed on Apr. 26, 2017, the entire contents of which are incorporated herein by reference.
Number | Name | Date | Kind |
---|---|---|---|
20090307467 | Faraj | Dec 2009 | A1 |
20100185718 | Archer | Jul 2010 | A1 |
20130151713 | Faraj | Jun 2013 | A1 |
20130290223 | Chapelle | Oct 2013 | A1 |
20150215379 | Tamano | Jul 2015 | A1 |
20150334035 | Miwa | Nov 2015 | A1 |
20160301565 | Zahid | Oct 2016 | A1 |
20160352824 | Miwa et al. | Dec 2016 | A1 |
Number | Date | Country |
---|---|---|
11-110362 | Apr 1999 | JP |
2015-232874 | Dec 2015 | JP |
2016-224756 | Dec 2016 | JP |
2014020959 | Feb 2014 | WO |
Entry |
---|
International Search Report dated Apr. 3, 2018 for PCT/JP2018/004367 filed on Feb. 8, 2018, 6 pages including English Translation. |
Number | Date | Country | |
---|---|---|---|
20190229949 A1 | Jul 2019 | US |
Number | Date | Country | |
---|---|---|---|
Parent | PCT/JP2018/004367 | Feb 2018 | US |
Child | 16368864 | US |