This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2016-052100, filed on Mar. 16, 2016, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to a collective communication technique.
There has been known a Latin square fat-tree network as a network that connects multiple nodes (physical servers, for example) configured to perform collective communication. The Latin square fat-tree network is capable of connecting a larger quantity of nodes than an ordinary fat-tree network.
There has been known a technique for an ordinary fat-tree network to perform all-to-all communication (all-to-all communication is a kind of collective communication) in which all nodes participate. However, there has not been known a technique for a Latin square fat-tree network to perform all-to-all communication in which all nodes participate.
The related techniques are described in, for example, Japanese Laid-open Patent Publication Nos. 2014-164756 and 2013-25505, as well as M. Valerio, L. E. Moser and P. M. Melliar-Smith, “Recursively Scalable Fat-Trees as Interconnection Networks”, IEEE 13th Annual International Phoenix Conference on Computers and Communications, 1994.
According to an aspect of the invention, an information processing apparatus includes a memory; and a processor coupled to the memory and configured to exclude a combination satisfying a condition from multiple combinations each including numbers of shifts of multiple switch layers in a fat-tree network using a Latin square, create relay settings for multiple switches for performing communication through multiple communication paths corresponding to the remaining combinations other than the excluded combination, and transmit the created relay settings to the multiple switches.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
All-to-all communication is communication in which each node transmits data to all nodes, and is performed in phases the number of which is the same as the number of nodes. Consider the case where all-to-all communication is performed in the fat-tree network illustrated in
In the fat-tree network illustrated in
As illustrated in
However, the shift communication patterns illustrated in
Note that it is conceivable to perform communication by setting the number of shifts in the shift communication patterns separately for every layer. Descriptions are provided for the number of shifts using
As an example, consider the case of changing the number of shifts in each switch layer (second to fourth layers in
Then, for example as illustrated in
There is only one communication path from the node 14 to the node 32. On the other hand, there are three communication paths from the node 2 to the node 23, as illustrated in
As described above, in the case where communication is performed by setting the number of shifts in the shift communication patterns for every layer, there may be multiple communication paths from a source node to a destination node depending on the positional relationship between the source node and the destination node, which may cause redundant communication. Hence in this embodiment, descriptions are provided for a method of performing all-to-all communication in a Latin square fat-tree network, in which all nodes participate without causing redundant communication.
Assume that a system in this embodiment is a Latin square fat-tree network system as illustrated in
(1) Port numbers 0, 1, 2, and so on are attached to the lower links and the upper links of each switch. (2) In the third layer, the port numbers of the upper links agree with the port numbers of the lower links. (3) In the second and fourth layers, the port numbers of the upper links do not have to agree with the port numbers of the lower links.
The processing section 110 communicates with processing sections 110 of other nodes via MPI 100 while advancing processing. The network control section 120 executes processing to control the NIC 153 or other processing. The collecting section 103 collects data on the topology of the Latin square fat-tree network (for example, data on connection relation of switches, data on connection relation between switches and nodes, and the like), and stores the thus-collected data in the topology data storing section 101. The creating section 104 creates a communication table, which is a relay setting that the switches use to relay packets, based on the data stored in the topology data storing section 101, and stores the communication table in the communication table storing section 102. The excluding section 105, in response to a call from the creating section 104, executes processing to exclude combinations that satisfy a condition.
Next, descriptions are provided for processing executed in the Latin square fat-tree network of this embodiment using
The collecting section 103 in the node ND collects data on the topology of the Latin square fat-tree network (step S1 in
The creating section 104 in the node ND executes creation processing using the data stored in the topology data storing section 101 (step S3). The creation processing is described using
First, the creating section 104 sets a, which is the number of shifts of the second layer, to 0 (step S11 in
The creating section 104 registers the combination (a, b, c) in the memory 152 (step S17), and increments c by 1 (step S19).
The creating section 104 determines whether c>n holds (step S21). If c>n does not hold (No route at step S21), the process returns to the processing at step S17. On the other hand, if c>n holds (Yes route at step S21), the creating section 104 increments b by 1 (step S23).
The creating section 104 determines whether b>n holds (step S25). If b>n does not hold (No route at step S25), the process returns to the processing at step S15. On the other hand, if b>n holds (Yes route at step S25), the creating section 104 increments a by 1 (step S27).
The creating section 104 determines whether a>n holds (step S29). If a>n does not hold (No route at step S29), the process returns to the processing at step S13. If a>n holds (Yes route at step S29), the creating section 104 advances the process to processing at step S31 in
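The loop structure of steps S11 to S29 can be sketched as follows. This is only an illustrative sketch: the function name `enumerate_shift_combinations` and the list `registered` (standing in for the memory 152) are hypothetical, and the initial value 0 for b and c at steps S13 and S15 is an assumption inferred from the ranges of the registered combinations.

```python
def enumerate_shift_combinations(n):
    """Collect every (a, b, c) with 0 <= a, b, c <= n, mirroring steps S11-S29."""
    registered = []                              # stands in for the memory 152
    a = 0                                        # step S11
    while True:
        b = 0                                    # step S13 (initial value assumed)
        while True:
            c = 0                                # step S15 (initial value assumed)
            while True:
                registered.append((a, b, c))     # step S17
                c += 1                           # step S19
                if c > n:                        # step S21, Yes route
                    break
            b += 1                               # step S23
            if b > n:                            # step S25, Yes route
                break
        a += 1                                   # step S27
        if a > n:                                # step S29, Yes route
            break
    return registered


if __name__ == "__main__":
    combos = enumerate_shift_combinations(2)
    print(len(combos))  # (n+1)^3 combinations before any exclusion
```

Under these assumptions the loop registers (n+1)^3 combinations before any combination is excluded.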
Moving to descriptions for
After completing the processing up to step S31, the following combinations (a, b, c) are registered in the memory 152.
(a, b, c) 1≦b≦n, 0≦a, c≦n
(a, 0, 0) 0≦a≦n
In the case where n=2, for example, the combinations registered in the memory 152 are (0,0,0), (0,1,0), (0,1,1), (0,1,2), (0,2,0), (0,2,1), (0,2,2), (1,0,0), (1,1,0), (1,1,1), (1,1,2), (1,2,0), (1,2,1), (1,2,2), (2,0,0), (2,1,0), (2,1,1), (2,1,2), (2,2,0), (2,2,1), and (2,2,2). The number of the combinations is $n(n+1)^2+(n+1)=(n+1)(n(n+1)+1)=(n+1)(n^2+n+1)=21$. This number is the same as a half of the number of nodes in
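The registered set and its size can be checked with a short sketch. It assumes that the excluded combinations are exactly those with b = 0 and c > 0, which is what the ranges listed above imply; the function name `remaining_combinations` is hypothetical.

```python
from itertools import product


def remaining_combinations(n):
    """Return the (a, b, c) with 0 <= a, b, c <= n, minus those with b == 0 and c > 0.

    The exclusion test is an assumption read off the registered ranges:
    (a, b, c) with 1 <= b <= n, and (a, 0, 0).
    """
    return [
        (a, b, c)
        for a, b, c in product(range(n + 1), repeat=3)
        if not (b == 0 and c > 0)
    ]


if __name__ == "__main__":
    combos = remaining_combinations(2)
    print(len(combos))  # 21 for n = 2
    # The count agrees with the closed form (n+1)(n^2+n+1) for small n.
    for n in range(1, 6):
        assert len(remaining_combinations(n)) == (n + 1) * (n * n + n + 1)
```

For n=2 the sketch yields the same 21 combinations, in the same order, as the list given above.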
The creating section 104 specifies one unprocessed combination (a, b, c) out of the combinations (a, b, c) registered in the memory 152 (step S33).
The creating section 104 identifies communication paths for the combination (a, b, c) specified at step S33 (step S35). At step S35, the source node, the leaf switch connected to the source node, the spine switch, the leaf switch connected to the destination node, and the destination node, which are on each of the communication paths, are identified based on the data stored in the topology data storing section 101.
The creating section 104 registers, in the communication tables of the leaf switches and the spine switch on the communication path identified at step S35, the port numbers of the ports that form the communication path (step S37). For example, in the case where the spine switch on the communication path transmits a packet from a port number “X” of the spine switch to a leaf switch on the communication path, the port number “X” is registered in association with the combination of the source node number and the destination node number.
The creating section 104 determines whether an unprocessed combination (a, b, c) is left, out of the combinations (a, b, c) registered in the memory 152 (step S39). If an unprocessed combination is left (Yes route at step S39), the process returns to the processing at step S33. On the other hand, if there is no unprocessed combination left (No route at step S39), the process returns to the processing of the caller.
Returning to the descriptions for
The processing above enables each switch to obtain the communication table to be applied to the switch.
Next, descriptions are provided for the processing executed by each switch in the Latin square fat-tree network using
First, the relay processing section 301 in the switch SW receives a communication table to be applied to the switch SW from the node that created the communication table (step S41 in
The relay processing section 301 stores the communication table received at step S41 in the communication table storing section 302 (step S43). Then, the process ends.
When the switch SW receives a packet, the following processing is executed.
The relay processing section 301 in the switch SW receives a packet from a node or another switch (step S51 in
The relay processing section 301 identifies, from the packet, information indicating the destination of the packet (a node number in this embodiment) and information indicating the source of the packet (a node number in this embodiment). Then, the relay processing section 301 identifies the port number corresponding to the combination of the destination and the source of the packet from the communication table stored in the communication table storing section 302 (step S53). For example, the communication table illustrated in
The relay processing section 301 transmits the packet received at step S51 from the port with the port number identified at step S53 (step S55). Then, the process ends.
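The table storage and relay steps (S41 to S43 and S51 to S55) can be sketched as follows. The class `Switch`, the dictionary-based packet, and the table layout keyed by (source node number, destination node number) are hypothetical illustrations rather than the actual switch implementation, and the sketch returns the selected port number instead of transmitting a packet.

```python
class Switch:
    """Minimal model of a switch holding a communication table (hypothetical)."""

    def __init__(self):
        # Stands in for the communication table storing section 302.
        self.communication_table = {}

    def store_table(self, table):
        # Steps S41-S43: receive the table and store it.
        self.communication_table = dict(table)

    def select_output_port(self, packet):
        # Step S53: identify the port number from the source and destination.
        key = (packet["src"], packet["dst"])
        return self.communication_table[key]

    def relay(self, packet):
        # Step S55: a real switch would transmit from this port;
        # here the port number is returned together with the packet.
        return self.select_output_port(packet), packet


if __name__ == "__main__":
    sw = Switch()
    # Example entry (made-up values): packets from node 14 to node 32 leave via port 3.
    sw.store_table({(14, 32): 3})
    port, _ = sw.relay({"src": 14, "dst": 32, "payload": b"data"})
    print(port)
```

A lookup for a pair that is not in the table raises `KeyError` in this sketch; how a real switch handles such a packet is not described here.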
The processing described above enables each switch to perform relays according to the relay setting that causes no redundant communication. This makes it possible to perform all-to-all communication in which all nodes participate without causing redundant communication in a Latin square fat-tree network.
Next, using
First, assume a communication path for sending a packet from a node NS to a node NE, which are positioned as illustrated in
Meanwhile, assume a communication path for sending a packet from a node NS to a node NE positioned as illustrated in
Note that in the case where a half of the ports of the spine switches are used, all-to-all communication is possible with $(n+1)(n^2+n+1)$ nodes. In the case where all the ports of the spine switches are used, all-to-all communication is divided into a first phase group and a second phase group. For the first phase group, a packet, turning back at the spine switch, is transmitted to a node in the layer the originating node belongs to, and for the second phase group, a packet is transmitted to a node in a layer the originating node does not belong to, not turning back at the spine switch. This enables $2(n+1)(n^2+n+1)$ nodes to perform all-to-all communication. For example, in the case where the number of ports is 36 (in other words, n=17), all-to-all communication is possible with $2\times(17+1)\times(17^2+17+1)=11052$ nodes.
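The node counts above can be reproduced with a short calculation. The function names are hypothetical, and the relation n = (number of ports)/2 - 1 is taken from the 36-port example, in which n=17.

```python
def shift_parameter(num_ports):
    """Return n for a switch with num_ports ports, assuming n = num_ports / 2 - 1."""
    return num_ports // 2 - 1


def node_count(n, use_all_spine_ports=True):
    """Number of nodes that can take part in all-to-all communication."""
    half = (n + 1) * (n * n + n + 1)   # half of the spine switch ports used
    return 2 * half if use_all_spine_ports else half


if __name__ == "__main__":
    n = shift_parameter(36)
    print(n, node_count(n))  # prints: 17 11052
```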
Although an embodiment of the disclosure has been described, the disclosure is not limited to the embodiment. For example, the functional block configurations for the nodes and the switches described above may not be the same as actual program module configurations in some cases.
In addition, the data structure described above is a mere example and does not mean that data structures have to be the same as that above. Moreover, also in the procedures, it is possible to change the order of processes unless the processing result changes. Furthermore, the processing may be executed in parallel.
In addition, the processing described using
In addition, specified combinations (a, b, c) may be received from an administrator of a Latin square fat-tree network, instead of determining a combination (a, b, c) at the creation processing.
Note that the switches described above are a computer apparatus, which may include, as illustrated in
The embodiment of the disclosure described above is summarized as follows.
An information processing apparatus according to a first aspect of the embodiment includes: (A) an excluding section configured to exclude a combination satisfying a condition from multiple combinations each including the numbers of shifts of multiple switch layers in a Latin square fat-tree network; (B) a creating section configured to create relay settings for multiple switches for performing communication through multiple communication paths corresponding to the remaining combinations except the combination excluded from the multiple combinations; and (C) a transmission section configured to transmit the relay settings created by the creating section to the corresponding ones of the multiple switches.
The processing above allows for communication between nodes on an appropriate communication path, making it possible to perform all-to-all communication in which all nodes participate in a Latin square fat-tree network.
The condition described above may be that the number of shifts of a switch layer to which a spine switch belongs is 0, and the number of shifts of a switch layer to which a leaf switch connected to a destination node belongs is an integer larger than 0. This excludes redundant communication paths.
In addition, the creating section described above may create, for each of the remaining combinations except the combination excluded from the multiple combinations, the relay settings for the multiple switches by determining an output port number of each switch on each of the communication paths corresponding to the combination based on the number of shifts of the switch layer to which the switch belongs. This enables each switch to relay a packet appropriately.
In addition, the number of the multiple switch layers may be 3, and the number of nodes connected to the multiple switches and the number of the multiple communication paths may be $2(n+1)(n^2+n+1)$, where n is a value obtained by subtracting 1 from a value obtained by dividing the number of ports of each of the multiple switches by 2. This makes it possible to perform the all-to-all communication in which a larger number of nodes participate.
In addition, the number of the multiple switch layers may be 2, and the number of nodes connected to the multiple switches and the number of the multiple communication paths may be $(n+1)(n^2+n+1)$, where n is a value obtained by subtracting 1 from the number of ports of each of the multiple switches. This makes it possible to perform all-to-all communication using a half of the ports of spine switches.
In addition, each of the multiple switches may output a received packet to a port with a port number corresponding to the remainder of a division in which a sum of the number of shifts and the port number of a port from which the packet has been received is divided by the number of ports of the switch.
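The output port rule described above amounts to a modular addition. The following is a minimal sketch with a hypothetical function name and made-up example values.

```python
def output_port(input_port, shift, num_ports):
    """Port number = (number of shifts + receiving port number) mod number of ports."""
    return (input_port + shift) % num_ports


if __name__ == "__main__":
    # With 3 ports and a shift of 2, a packet received on port 2 leaves via port 1.
    print(output_port(input_port=2, shift=2, num_ports=3))  # prints: 1
```

With a shift of 0 the packet leaves from the port whose number equals that of the receiving port, and for a fixed shift different receiving ports always map to different output ports.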
A communication management method according to a second aspect of the embodiment includes: (D) excluding a combination satisfying a condition from multiple combinations each including numbers of shifts of multiple switch layers in a Latin square fat-tree network, (E) creating relay settings for multiple switches for performing communication through multiple communication paths corresponding to the remaining combinations except the combination excluded from the multiple combinations, and (F) transmitting the created relay settings to the corresponding ones of the multiple switches.
Note that it is possible to make a program for causing a computer to execute the processing according to the above method, and the program is stored in a computer readable storage medium or storage apparatus such as a flexible disk, a CD-ROM, a magneto-optical disk, a semiconductor memory, a hard disk, or the like. Note that intermediate processing results are temporarily stored in a storage apparatus such as a main memory.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiment of the present invention has been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2016-052100 | Mar 2016 | JP | national |