Generally, a broadcast is implemented with a tree-based algorithm, where the branching factor of the tree determines how many nodes (or processes) a given node sends data to. A tree-based algorithm is generally best for small messages, as it has a time complexity of log_k(N)*(latency + message_size/BW), where N is the number of nodes, k is the branching factor of the tree, latency is the network latency and other overheads needed to send a message, message_size is the size of the message, and BW is the bandwidth of the fabric used to send the message.
For larger messages, an algorithm that uses a scatter followed by an allgather operation is more efficient, because its bandwidth utilization is better than that of a tree-based implementation. For small and medium messages, most runtimes use either a k-ary or a k-nomial tree. These are topology-unaware trees that do not take the network topology into account. The main difference between the two is that in a k-ary tree each parent node has exactly k child nodes, whereas in a k-nomial tree, at each step of the algorithm, every node that holds the message sends it to k different nodes, and this continues until all the nodes in the system have received the message.
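The distinction can be made concrete with a short sketch. The following Python snippet is an illustration only (the rank numbering, helper names, and heap-style child layout are assumptions, not part of the disclosure); it lists a rank's children in a k-ary tree and simulates the k-nomial broadcast round by round per the definition above:

```python
def kary_children(rank: int, k: int, n: int) -> list[int]:
    """Each parent in a k-ary tree has exactly k children (heap layout).
    Example: kary_children(0, 2, 7) -> [1, 2]."""
    return [c for c in range(rank * k + 1, rank * k + k + 1) if c < n]

def knomial_rounds(n: int, k: int) -> list[list[tuple[int, int]]]:
    """Simulate the k-nomial broadcast: on every round, each node that
    already holds the message forwards it to k nodes that do not yet
    have it. Returns the (sender, receiver) pairs sent on each round."""
    holders = [0]          # ranks that already have the message
    next_rank = 1          # next rank still waiting for it
    rounds = []
    while next_rank < n:
        sends = []
        for sender in list(holders):   # snapshot: new receivers wait a round
            for _ in range(k):
                if next_rank >= n:
                    break
                sends.append((sender, next_rank))
                holders.append(next_rank)
                next_rank += 1
        rounds.append(sends)
    return rounds

# With n=16 and k=2 the message reaches all ranks in 3 rounds, since the
# number of holders grows by a factor of k+1 each round.
assert len(knomial_rounds(16, 2)) == 3
```

Because the set of senders grows geometrically per round, the round count is logarithmic in N, consistent with the logarithmic latency term in the formula above.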
The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same becomes better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein like reference numerals refer to like parts throughout the various views unless otherwise specified.
Embodiments of methods and apparatus implementing an efficient topology-aware tree search algorithm for a broadcast operation are described herein. In the following description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
For clarity, individual components in the Figures herein may also be referred to by their labels in the Figures, rather than by a particular reference number. Additionally, reference numbers referring to a particular type of component (as opposed to a particular component) may be shown with a reference number followed by “(typ)” meaning “typical.” It will be understood that the configuration of these components will be typical of similar components that may exist but are not shown in the drawing Figures for simplicity and clarity or otherwise similar components that are not labeled with separate reference numbers. Conversely, “(typ)” is not to be construed as meaning the component, element, etc. is typically used for its disclosed function, implementation, purpose, etc.
For illustrative purposes, example broadcast operations are discussed using a dragonfly network topology. First, a description of a dragonfly network topology is presented, followed by a discussion of common solutions and their disadvantages.
A dragonfly topology is a hierarchical network topology with the following characteristics: 1) Several groups are connected using all-to-all links, that is, each group has at least one direct link to every other group; 2) The topology inside each group can be any topology, with the butterfly network topology being common; and 3) The focus of the dragonfly network is the reduction of the diameter of the network.
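For the discussion that follows, it helps to see how distances in such a topology fall into a small number of tiers. The Python sketch below is a deliberately simplified model (the Node fields, hop counts, and default latency values are assumptions for illustration; real dragonfly routes can take more hops than this idealization): nodes on the same switch are one hop apart, nodes in the same group two hops, and nodes in different groups three hops via a direct inter-group link.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    group: int
    switch: int
    index: int

def hops(a: Node, b: Node) -> int:
    """Switch hops in an idealized dragonfly: 1 via a shared switch,
    2 within a group (all-to-all switch links assumed inside the group),
    3 across groups (ingress switch, inter-group link, egress switch)."""
    if a.group != b.group:
        return 3
    return 1 if a.switch == b.switch else 2

def distance(a: Node, b: Node, per_hop: float = 1.0,
             msg_size: float = 0.0, bw: float = 1.0) -> float:
    """Time for a message to flow from a to b: per-switch delays plus
    the bandwidth term (message_size / BW)."""
    return hops(a, b) * per_hop + msg_size / bw
```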
An example of a three-tier dragonfly topology 100 is shown in FIG. 1.
It is noted that three-tier dragonfly topology 100 is a simplified representation in which the groups of nodes at the switch and group levels are the same, the sizes of the groups are the same, and the lengths of the links are the same or similar. In practice, multi-tier dragonfly topologies will generally be somewhat asymmetric (and could be very asymmetric), and the lengths of the links will differ. Moreover, in a large-scale implementation with thousands of nodes, the differences in link latencies might be an order of magnitude or more between the shortest and longest links. Additionally, a network topology may employ a hierarchical structure comprising N tiers, where N is three or more.
Under conventional practice, a spanning tree is built to perform the broadcast operation. Conventional spanning tree algorithms build a hierarchical tree structure comprising an undirected graph with no cycles. Based on the spanning tree that is generated, each node knows the parent node from which it will receive messages and its children nodes to which it needs to send the messages. Algorithms using trees that do not take into account the network topology are generally easier to implement but usually take more time to broadcast a message to all nodes. The reason is that messages can go back and forth several times across groups and/or across switches in the same group. This results in significant performance loss.
As an example, consider a broadcast performed over the foregoing dragonfly network topology using a topology-unaware tree.
As shown in the left part of tree 300 in FIG. 3, the message crosses group and switch boundaries multiple times before all of the nodes are reached, illustrating the performance loss described above.
A simpler and more efficient heuristic uses a hierarchical topology-aware tree that sends the message to the furthest-away nodes first, so that nodes on the critical path (those furthest from the root) can receive the message earliest. In this hierarchical design, each switch has a designated node leader (switch leader) and each group has a designated node leader (group leader). In practice, each node also has a leader rank, but for the discussion here we assume a single rank per node and refer to it as the node leader. The broadcast is performed in three steps, as shown in FIG. 4.
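A minimal Python sketch of this design follows, reusing the Node model from the earlier sketch. It assumes a natural reading of the three steps (root to group leaders, group leaders to switch leaders, switch leaders to the remaining nodes on their switches) and picks the first node encountered as a stand-in leader; both are illustrative assumptions, not the disclosure's actual rules:

```python
def hierarchical_sends(nodes, root):
    """Yield (sender, receiver) pairs for the three assumed steps."""
    by_group = {}
    for n in nodes:
        by_group.setdefault(n.group, {}).setdefault(n.switch, []).append(n)

    # Step 1: root sends to a designated leader in every other group.
    group_leaders = {root.group: root}
    for g, switches in by_group.items():
        if g != root.group:
            leader = next(n for members in switches.values() for n in members)
            group_leaders[g] = leader
            yield (root, leader)

    # Step 2: each group leader sends to a leader on each other switch
    # in its own group.
    switch_leaders = {}
    for g, gl in group_leaders.items():
        for s, members in by_group[g].items():
            sl = gl if s == gl.switch else members[0]
            switch_leaders[(g, s)] = sl
            if sl != gl:
                yield (gl, sl)

    # Step 3: each switch leader sends to the rest of its switch.
    for (g, s), sl in switch_leaders.items():
        for n in by_group[g][s]:
            if n != sl:
                yield (sl, n)

# Example: 2 groups x 2 switches x 2 nodes, rooted at the first node.
nodes = [Node(g, s, i) for g in range(2) for s in range(2) for i in range(2)]
for send in hierarchical_sends(nodes, nodes[0]):
    print(send)
```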
A corresponding tree 500, annotated with the time at which the message is available on each node, is shown in FIG. 5.
The benefits of this hierarchical approach are apparent when tree 500 is compared with the topology-unaware tree 300 of FIG. 3: messages no longer go back and forth across groups or across switches in the same group.
While this hierarchical approach is better than a topology-unaware tree, sending the data first to the nodes that are furthest away delays the time at which the first nodes receive the message. Empirical and analytical results show that the heuristic used for the algorithm disclosed below performs better than this hierarchical approach.
Under one aspect, embodiments of the solution build a tree in which each node sends the message first to the nodes that can be reached earliest. The rationale for this approach is that the earlier a node receives the message, the earlier it can broadcast the message to other nodes, increasing the number of nodes that are broadcasting the message and therefore decreasing the overall time to perform the broadcast operation.
A pictorial view of the dragonfly network topology 600 and a tree 700 using this heuristic are shown in FIGS. 6 and 7, respectively.
Moving to the next level in tree 700, node 604 sends copies of the message to nodes 616, 626, 620, 622, and 632. Node 606 sends copies of the message to nodes 634, 636, 630, and 640. Node 608 sends copies of the message to nodes 628, 638, and 648, while node 610 sends a copy of the message to node 646.
As tree 700 in FIG. 7 illustrates, the nodes that can be reached earliest receive the message first and immediately begin forwarding it, so the broadcast completes sooner than under the hierarchical approach.
A drawback of the heuristic that sends to the nearest neighbors first is the time it takes to generate the tree. It is noted that for all the trees illustrated herein, consideration of both the tree structure and the branch order is important. Generally, identifying which nodes should be at which levels of a tree is moderately complex. However, considering a combination of the tree structure and the branch order (or other message transmission order) adds another level of complexity.
A goal of the embodiments is to minimize the broadcast time, that is, the time it takes for a root node to send the data to all the nodes in a supercomputer system. To this end, an algorithm is disclosed to efficiently compute the tree used to perform the broadcast, based on the heuristic that the broadcast time can be minimized by sending the message first to the nearest neighbor(s), that is, the node(s) that can receive the message the earliest. The rationale behind this heuristic is that when a node receives a message it becomes a broadcaster itself; by sending the data first to the nodes that can receive it earliest, the number of broadcasters increases, and since more nodes are sending the data, the time to complete the broadcast is reduced.
In the embodiments described and illustrated herein, the solution is applied to a network with a dragonfly network topology. However, this is merely exemplary and non-limiting, as the teachings and principles described and illustrated herein may be applied to any network in which it is possible to identify the latency needed for a message to go from a node A to a node B, including the processing time of each of the switches in the path from A to B plus the time to process the message at the sender and receiver nodes. Generally, the approach assumes that there are sets or clusters of nodes that are at the same latency (or distance). Notice that while multiple paths usually exist between two given nodes in a supercomputer system, small messages usually follow the same path (especially since standards such as MPI (Message Passing Interface) impose ordering requirements).
As previously explained, the algorithm to execute a broadcast needs to compute a tree so that each node knows its parent node (the node from which it will receive the message) and its child or children nodes (the nodes to which it will send the message). One challenge is that tree generation for the heuristic that sends first to the nearest neighbor has a time complexity on the order of N^3, where N is the number of nodes in the system. As the number of nodes available for distributed processing on today's supercomputers can be quite large, e.g., >20,000, the time to generate the tree itself could make conventional heuristics nonviable in practice.
To better understand and appreciate the advantages provided by the novel tree generation algorithm discussed herein, a discussion of a naive algorithm employing a nearest neighbor heuristic, as illustrated in FIG. 8, is first provided.
The outer while loop (line 4) of the algorithm iterates until the visited_nodes list contains all the nodes in the system, a total of N iterations, where N is the number of nodes. On each iteration of this outer while loop, the algorithm finds the unvisited node u (unode) that can be reached the earliest in time from any of the already visited nodes v. The algorithm determines the node in the visited_nodes list (vnode) that is used to reach the unode, updates the availableTime of both nodes, removes the unode from the unvisited_nodes list, and adds it to the visited_nodes list.
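The following Python sketch mirrors that description. The list and variable names follow the pseudocode (visited_nodes, unvisited_nodes, availableTime), but it is an illustration under assumptions rather than the figure's exact listing; in particular, charging the sender one overhead o per send is one plausible reading of "updates the availableTime of both nodes":

```python
def naive_tree(distance, o, root=0):
    """distance: NxN matrix of node-to-node message-flow times;
    o: per-message send/receive overhead. Returns the parent map of
    the broadcast tree and each node's message-availability time."""
    n = len(distance)
    visited_nodes = [root]
    unvisited_nodes = [u for u in range(n) if u != root]
    available_time = [0.0] * n   # when each node can next send/receive
    parent = {root: None}

    while unvisited_nodes:                        # N iterations (line 4)
        best = None
        for v in visited_nodes:                   # up to N visited nodes
            for u in unvisited_nodes:             # up to N unvisited nodes
                t = available_time[v] + o + distance[v][u] + o
                if best is None or t < best[0]:
                    best = (t, v, u)
        t, vnode, unode = best
        available_time[unode] = t                 # message arrival time
        available_time[vnode] += o                # sender busy while sending
        parent[unode] = vnode
        unvisited_nodes.remove(unode)
        visited_nodes.append(unode)
    return parent, available_time

# Example with 3 nodes; entries are times, not hops.
dist = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
print(naive_tree(dist, o=0.1))
```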
The algorithm illustrated (via pseudocode) in FIG. 8 evaluates every pair of visited and unvisited nodes on each of its N outer iterations, which yields the time complexity on the order of N^3 noted above.
Given a three-tier dragonfly network topology (such as shown in FIG. 1), nodes coupled to the same switch, or to switches within the same group, are at the same distance from a given sender; the improved algorithm exploits these equidistant sets so that only a single representative from each set needs to be considered during the search.
The improved tree building algorithm applies the following three optimizations to the naive algorithm: 1) only a single marked representative node from each set of equidistant nodes participates in the search; 2) unvisited nodes are added to the search incrementally, as the switches and groups to which they are attached are reached; and 3) min-heaps are used to locate the earliest reachable node. Operations performed by one embodiment of the improved algorithm are illustrated in flowchart 1000 of FIG. 10.
As shown in a block 1004, the process begins at the root node, which is also the first vnode. In a block 1006 the visited_nodes list (vnode list) and the unvisited_nodes list (unode list) are initialized. The vnode list will contain the root node, and the unode list will initially include the nodes attached to the same switch as the root (also referred to as the root switch) other than the root node.
The operations shown in blocks 1008, 1010, and 1012 are performed iteratively in a loop until all nodes have been moved to the visited list. In block 1008 a search is performed to find the unode that can be reached earliest from a vnode, taking into account the distance between the unode and the vnode. The search calculates an overall latency (the overall time it takes to send a message) for the paths traversed by a message sent from the vnode to each unode being considered. As discussed above, the time it takes for node X to send a message to node Y is computed as o + distance[X][Y] + o, where distance[X][Y] is the time it takes for the message to flow from node X to node Y (accounting for the latency, the time due to message size and network bandwidth, and the delay incurred in each of the switches in the path between nodes X and Y), and o is a predetermined overhead to send out and receive a message at the sender and the receiver.
For vnodes other than the root node, the overall latency that is calculated is added to the time at which the message is received by the vnode (referred to as the availableTime) per the following formula from line 9 in FIG. 9:

time = availableTime[v] + o + distance[v][u] + o

where v is the vnode and u is the unode.
Once the unode is found in block 1008, it is moved from the unvisited_nodes list to the visited_nodes list. The times at which the unodes can next be reached from the new vnode are also updated, and the min-heaps are rebuilt accordingly (lines 22 and 23 of FIG. 9).
In block 1012, new unodes are added to the unvisited_nodes list based on the location of the unode that has been found (the new vnode). In addition, for each set of newly added nodes at the same distance (e.g., coupled to the same switch or within the same group), a single node is marked to participate in the search. The logic then loops back to block 1008 to perform the next search iteration.
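A hedged sketch of the search in block 1008 follows. Each visited node keeps a min-heap of (distance, unode) entries for its marked candidates only, so a single representative stands in for every set of equidistant unvisited nodes; the helper name, the heap layout, and the example values are assumptions, not the figure's exact listing:

```python
import heapq

def find_earliest(vnode_heaps, available_time, o):
    """vnode_heaps maps each vnode to a min-heap of (distance, unode)
    pairs for its marked candidates. Returns (time, vnode, unode) for
    the earliest reachable candidate overall."""
    best = None
    for v, heap in vnode_heaps.items():
        if not heap:
            continue
        d, u = heap[0]                      # cheapest marked candidate
        t = available_time[v] + o + d + o   # formula from line 9
        if best is None or t < best[0]:
            best = (t, v, u)
    return best

# Example: node 1 is reachable from vnode 0 at t = 0.0 + 0.1 + 1.0 + 0.1.
heaps = {0: [(1.0, 1), (3.0, 2)], 3: [(2.0, 4)]}
for h in heaps.values():
    heapq.heapify(h)
print(find_earliest(heaps, {0: 0.0, 3: 1.5}, o=0.1))   # (1.2, 0, 1)
```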
Further details of the operations performed when adding unodes to the unvisited_nodes list in block 1012 are shown in flowcharts 1100 and 1200 of FIGS. 11 and 12, respectively.
In a decision block 1106, a determination is made as to whether the unode is not on the root switch. If the answer is YES, the logic proceeds to a block 1108 in which another leader node from a different switch in the same group is marked.
Next, in a decision block 1110, a determination is made as to whether the unvisited_nodes list contains nodes from the same switch as the unode. If the answer is YES, the logic proceeds to a block 1112 in which one of the nodes from the same switch is marked.
The logic then proceeds to a decision block 1114 in which a determination is made as to whether the unode is a leader node of a group other than the root group. If the answer is YES, the logic proceeds to a block 1116 in which a leader node from a group different from the unode's group is marked. The flow then returns in a return block 1118. If the answer to decision block 1102 is NO, the logic flows to decision block 1110. As shown by the other NO branches, whenever the determination of decision blocks 1106, 1110, and 1114 is NO, the immediately following blocks are skipped.
Flowchart 1200 in FIG. 12 illustrates further operations for adding unodes from other groups.
Next, in a decision block 1208, a determination is made as to whether there are any unvisited leader nodes from other groups. If the answer is YES, the logic proceeds to a block 1210 in which the leader nodes from the switches in the other groups (with unvisited leader nodes) are added. One of the switch leader nodes that is added is marked to participate in the search. The flow then returns, as depicted by a return block 1212.
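The marking logic of flowcharts 1100 and 1200 can be summarized in code. In the sketch below, the topology tables (switch_of, group_of, leaders) and the root_switch/root_group parameters are hypothetical structures introduced only for this illustration; the block references in the comments map the code back to the flowcharts:

```python
def add_unodes(new_vnode, switch_of, group_of, leaders,
               root_switch, root_group, unvisited, marked):
    """switch_of/group_of: node -> switch/group id. leaders: set of
    leader node ids. unvisited and marked are mutable sets of node ids."""
    # Mark one unvisited node on the new vnode's own switch
    # (cf. decision block 1110 / block 1112).
    for u in unvisited:
        if switch_of[u] == switch_of[new_vnode]:
            marked.add(u)
            break

    # Off the root switch: mark a leader from a different switch in the
    # same group (cf. decision block 1106 / block 1108).
    if switch_of[new_vnode] != root_switch:
        for u in unvisited:
            if (u in leaders and group_of[u] == group_of[new_vnode]
                    and switch_of[u] != switch_of[new_vnode]):
                marked.add(u)
                break

    # Leader of a non-root group: mark a leader from a different group
    # (cf. decision block 1114 / block 1116).
    if new_vnode in leaders and group_of[new_vnode] != root_group:
        for u in unvisited:
            if u in leaders and group_of[u] != group_of[new_vnode]:
                marked.add(u)
                break
```

Only one node per equidistant set is marked, which is what keeps the search in block 1008 bounded by the number of switches rather than the number of nodes.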
Generally, the algorithms disclosed herein may be implemented on a single compute node, such as a server, or on multiple compute nodes in a distributed manner. Such compute nodes may be implemented via platforms having various types of form factors, such as server blades, server modules, 1U, 2U, and 4U servers, servers installed in “sleds” and “trays,” etc. In addition to servers, the algorithms may be implemented on an Infrastructure Processing Unit (IPU), a Data Processing Unit (DPU), or a SmartNIC.
CPU/SOC 1306 employs a System on a Chip including multiple processor cores. Various CPU/processor architectures may be used, including but not limited to x86, ARM®, and RISC architectures. In one non-limiting example, CPU/SOC 1306 comprises an Intel® Xeon®-D processor. Software executed on the processor cores may be loaded into memory 1314 from a storage device (not shown) or from a host, or received over a network coupled to QSFP module 1308 or QSFP module 1310.
Generally, an IPU and a DPU are similar; the term IPU is used by some vendors and DPU is used by others. A SmartNIC is similar to an IPU/DPU except that it will generally be less powerful (in terms of its CPU/SoC and the size of its FPGA). As with IPU/DPU cards, the various functions and logic in the embodiments of the algorithms described and illustrated herein may be implemented by programmed logic in an FPGA on the SmartNIC and/or execution of software on a CPU or processor on the SmartNIC.
As discussed above, the naive algorithm has a time complexity of O(N^3) (on the order of N cubed), where N is the number of nodes in the system. In comparison, the improved algorithm disclosed herein reduces the complexity significantly. The outer loop is still bounded by the number of nodes, N. However, the worst case for both the middle and the inner loops is bounded by the number of switches in the system (instead of the number of nodes); that is, the complexity of the disclosed algorithm is O(N*S^2), where S is the number of switches in the system. Given that switches generally have between 64 and 128 ports, S is significantly smaller than N.
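As a back-of-envelope illustration (assuming, purely for this example, 64 nodes per switch so that S is roughly N/64):

```python
for n in (10_000, 100_000, 1_000_000):
    s = max(n // 64, 1)
    # Ratio of the naive bound N^3 to the improved bound N*S^2.
    print(f"N={n:>9,}  S={s:>6,}  ratio ~ {(n / s) ** 2:,.0f}x")
```

Under that assumption the improved bound is smaller by a factor of roughly (N/S)^2, about 4,000x, independent of N.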
We have implemented the naive and the improved algorithms and have measured the time to generate the tree, as shown in the tables below. TABLE 1 shows the running times in seconds for the tree search of the naive algorithm, while TABLE 2 shows the running times in seconds for the improved algorithm disclosed herein. With the naive algorithm, we had to abort the tree generation for a system with 10,000 nodes, because after 1793 seconds the search had not completed. However, with the improved algorithm, we were able to generate the tree for 10,000 nodes in 0.1357144 seconds, and we were even able to run the search for 1,000,000 nodes in a little over 10 seconds. Thus, with our disclosed algorithm, the broadcast with a nearest neighbor heuristic becomes practical.
We have also assessed the performance of a broadcast when using a topology-unaware tree, a topology-aware hierarchical tree similar to the one in FIG. 5, and the topology-aware tree generated by the disclosed algorithm. Consistent with the empirical results noted above, the tree built with the disclosed nearest-neighbor heuristic performed best.
Although some embodiments have been described in reference to particular implementations, other implementations are possible according to some embodiments. Additionally, the arrangement and/or order of elements or other features illustrated in the drawings and/or described herein need not be arranged in the particular way illustrated and described. Many other arrangements are possible according to some embodiments.
In each system shown in a figure, the elements in some cases may each have a same reference number or a different reference number to suggest that the elements represented could be different and/or similar. However, an element may be flexible enough to have different implementations and work with some or all of the systems shown or described herein. The various elements shown in the figures may be the same or different. Which one is referred to as a first element and which is called a second element is arbitrary.
In the description and claims, the terms “coupled” and “connected,” along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Rather, in particular embodiments, “connected” may be used to indicate that two or more elements are in direct physical or electrical contact with each other. “Coupled” may mean that two or more elements are in direct physical or electrical contact. However, “coupled” may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. Additionally, “communicatively coupled” means that two or more elements that may or may not be in direct contact with each other, are enabled to communicate with each other. For example, if component A is connected to component B, which in turn is connected to component C, component A may be communicatively coupled to component C using component B as an intermediary component.
An embodiment is an implementation or example of the inventions. Reference in the specification to “an embodiment,” “one embodiment,” “some embodiments,” or “other embodiments” means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments, of the inventions. The various appearances “an embodiment,” “one embodiment,” or “some embodiments” are not necessarily all referring to the same embodiments.
Not all components, features, structures, characteristics, etc. described and illustrated herein need be included in a particular embodiment or embodiments. If the specification states a component, feature, structure, or characteristic “may”, “might”, “can” or “could” be included, for example, that particular component, feature, structure, or characteristic is not required to be included. If the specification or claim refers to “a” or “an” element, that does not mean there is only one of the element. If the specification or claims refer to “an additional” element, that does not preclude there being more than one of the additional element.
An algorithm is here, and generally, considered to be a self-consistent sequence of acts or operations leading to a desired result. These include physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers or the like. It should be understood, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities.
As discussed above, various aspects of the embodiments herein may be facilitated by corresponding software running on a compute node, server, etc., or running on multiple compute nodes in a distributed manner, or on an IPU, DPU, or SmartNIC. Thus, embodiments of this invention may be used as or to support a software program, software modules, and/or distributed software executed upon some form of processor, processing core or embedded logic, a virtual machine running on a processor or core, or otherwise implemented or realized upon or within a non-transitory computer-readable or machine-readable storage medium. A non-transitory computer-readable or machine-readable storage medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a non-transitory computer-readable or machine-readable storage medium includes any mechanism that provides (e.g., stores and/or transmits) information in a form accessible by a computer or computing machine (e.g., computing device, electronic system, etc.), such as recordable/non-recordable media (e.g., read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.). The content may be directly executable (“object” or “executable” form), source code, or difference code (“delta” or “patch” code). A non-transitory computer-readable or machine-readable storage medium may also include a storage or database from which content can be downloaded. The non-transitory computer-readable or machine-readable storage medium may also include a device or product having content stored thereon at a time of sale or delivery. Thus, delivering a device with stored content, or offering content for download over a communication medium may be understood as providing an article of manufacture comprising a non-transitory computer-readable or machine-readable storage medium with such content described herein.
The operations and functions performed by various components described herein may be implemented by software running on one or more processing elements, via embedded hardware or the like, or any combination of hardware and software. Such components may be implemented as software modules, hardware modules, special-purpose hardware (e.g., application specific hardware, ASICs, DSPs, etc.), embedded controllers, hardwired circuitry, hardware logic, etc. Software content (e.g., data, instructions, configuration information, etc.) may be provided via an article of manufacture including a non-transitory computer-readable or machine-readable storage medium, which provides content that represents instructions that can be executed. The content may result in a computer performing various functions/operations described herein.
As used herein, a list of items joined by the term “at least one of” can mean any combination of the listed terms. For example, the phrase “at least one of A, B or C” can mean A; B; C; A and B; A and C; B and C; or A, B and C.
The above description of illustrated embodiments of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications can be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification and the drawings. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.
This invention was made with Government support under Agreement No. 8F-30005, awarded by DOE. The Government has certain rights in this invention.