Embodiments described herein relate generally to in-network computing, and particularly to methods and systems for network elements supporting flexible data reduction operations.
Some computing systems support performing computation tasks by network elements of a communication system. Methods for distributing a computation among multiple network elements are known in the art. For example, U.S. Pat. No. 10,284,383 describes a switch in a data network, configured to mediate data exchanges among network elements. The apparatus further includes a processor, which organizes the network elements into a hierarchical tree having a root node network element, vertex node network elements, and child node network elements that include leaf node network elements. The leaf node network elements originate aggregation data and transmit the aggregation data to respective parent vertex node network elements. The vertex node network elements combine the aggregation data from at least a portion of the child node network elements, and transmit the combined aggregation data from the vertex node network elements to parent vertex node network elements. The root node network element is operative for initiating a reduction operation on the aggregation data.
An embodiment that is described herein provides a network element that includes a plurality of ports, multiple computational modules, configurable forwarding circuitry and a central block. The plurality of ports includes multiple child ports coupled to respective child network elements or network nodes and one or more parent ports coupled to respective parent network elements. The plurality of ports is configured to connect to a communication network. The computational modules are configured to collectively perform a data reduction operation in accordance with a data reduction protocol. The configurable forwarding circuitry is configured to interconnect among the ports and the computational modules. The central block is configured to receive a request indicative of selected child ports, a selected parent port, and computational modules required for performing data reduction operations on data received from the child network elements or network nodes via the selected child ports, for producing reduced data destined to a parent network element via the selected parent port, to derive, from the request, a topology that interconnects among the selected child ports, the selected parent port and the computational modules so as to perform the data reduction operations and to forward the respective reduced data for transmission to the selected parent port, and to configure the forwarding circuitry to apply the topology.
In some embodiments, the selected child ports are configured to receive data messages including a reduction operation and respective data portions, and to send the reduction operation to the central block, and the central block is configured to set the computational modules to apply the reduction operation to the data portions. In other embodiments, the central block is configured to derive the topology to interconnect computational modules that receive data for reduction via the selected child ports, in a chain configuration. In yet other embodiments, the central block is configured to derive the topology to interconnect outputs of two computational modules that receive data for reduction via the selected child ports as inputs to an aggregator computational module.
In an embodiment, the selected parent port and each of the selected child ports include a Queue Pair (QP) responder and a QP requester, configured to respectively handle reliable transport layer reception and transmission of packets. In another embodiment, the central block is configured to receive a first request indicative of first child ports, a first parent port and first computational modules required to perform first data reduction operations on data received via the first child ports and destined to the first parent port, and further receive a second request indicative of second child ports, a second parent port, and second computational modules required to perform second data reduction operations on data received via the second child ports and destined to the second parent port, to derive from the first request a first topology for performing the first data reduction operations and derive from the second request a second topology for performing the second data reduction operations, and to configure the forwarding circuitry to apply both the first topology and the second topology so as to support performing the first data reduction operations and the second data reduction operations in parallel. In yet another embodiment, the request is indicative of the network element serving as a root network element, and the central block is configured to derive from the request a topology that interconnects among the selected child ports and the computational modules so as to perform the data reduction operations for producing aggregated data and to route the aggregated data to one or more child ports.
In some embodiments, the request or a separately received request is indicative of a given parent port and one or more given child ports, and the central block is configured to derive from the request, a topology that interconnects the given parent port to the one or more given child ports for receiving aggregated data from a respective parent network element via the given parent port and distributing the aggregated data via the given child ports to respective network elements or network nodes. In other embodiments, the forwarding circuitry includes upstream forwarding circuitry and downstream forwarding circuitry, and the central block is configured to apply, in parallel, an upstream topology to the upstream forwarding circuitry for applying the data reduction operations, and to apply a downstream topology to the downstream forwarding circuitry for distributing aggregated data produced by a root network element toward one or more network nodes.
There is additionally provided, in accordance with an embodiment that is described herein, a method including, in a network element including (i) a plurality of ports that connect to a communication network, including multiple child ports coupled to respective child network elements or network nodes and one or more parent ports coupled to respective parent network elements, (ii) multiple computational modules that collectively perform a data reduction operation, in accordance with a data reduction protocol, and (iii) configurable forwarding circuitry that interconnects among the ports and the computational modules, receiving by a central block of the network element a request indicative of selected child ports, a selected parent port, and computational modules required for performing data reduction operations on data received via the selected child ports, for producing reduced data destined to a parent network element via the selected parent port. A topology is derived, from the request, that interconnects among the selected child ports, the selected parent port and the computational modules so as to perform data reduction operations, and to forward the reduced data for transmission to the selected parent port. The topology is applied by the forwarding circuitry.
These and other embodiments will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:
Embodiments that are described herein provide systems and methods for in-network computing using network elements that support flexible data reduction operations.
In-network computing involves dividing a calculation over a stream of data into multiple sub-calculations executed by network elements of a communication network. A network element may comprise any suitable network device such as, for example, a switch or a router.
In some embodiments, an in-network calculation is carried out hierarchically by multiple network elements arranged in a multi-level configuration. Network elements of the lowest level receive portions of the data stream from multiple network nodes and based on the data portions produce partial results. Elements of higher levels further aggregate the partial results up to a root network element that produces a final calculation result. The root network element typically distributes the final calculation result to some or all of the network nodes that provided the data, and/or to other network elements.
A partial result produced by a network element as part of an in-network calculation is also referred to herein as a “reduced data” and the final result produced by the root network element is also referred to herein as an “aggregated data.” A logical structure that models the hierarchical in-network calculation is referred to as a “data reduction tree.”
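By way of illustration only, the following Python sketch models how partial results propagate level by level up a data reduction tree to the root. The tree encoding and the choice of addition as the reduction operator are assumptions made here for clarity; the sketch is not part of the data reduction protocol itself.

```python
import operator
from functools import reduce

def in_network_calc(tree, op=operator.add):
    """tree is either a leaf value (a data portion originating at a network node)
    or a list of subtrees (a network element and its children). Each network
    element reduces the partial results of its children; the root returns the
    aggregated data."""
    if not isinstance(tree, list):
        return tree                                            # data from a network node
    partial_results = [in_network_calc(t, op) for t in tree]   # reduced data of children
    return reduce(op, partial_results)                         # this element's partial result

# Three-level example: the root aggregates the partial results of the lower levels.
assert in_network_calc([[1, 2], [3, 4], [[5, 6], 7]]) == 28
```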
In-network calculations are often implemented in accordance with a data reduction protocol. An example data reduction protocol of this sort is the Scalable Hierarchical Aggregation and Reduction Protocol (SHArP™) described in U.S. Pat. No. 10,284,383 cited above. The data reduction protocol typically specifies messages that the network nodes and network elements exchange with one another for delivering data and control. Messages related to the data reduction protocol typically comprise multiple packets, wherein each of the packets comprises a transport layer header and a payload. In some embodiments, the first packet of the message comprises a header of the underlying data reduction protocol, e.g., a SHArP header.
An important requirement in implementing in-network computing is to efficiently carry out multiple complex calculations over multiple respective high-bandwidth data streams in parallel and with low latency. Some aspects of accelerating data reduction operations in hardware are described, for example, in U.S. patent application Ser. No. 16/357,356, of Elias et al., filed Mar. 19, 2019.
In principle, multiple data reduction trees may be used for modeling multiple respective in-network calculations in parallel. When such data reduction trees use separate sets of ports and computational resources across the respective network elements, they can maintain full port bandwidth. Reduction trees that do not share ports are also referred to as “disjoint reduction trees.”
In the disclosed embodiments, each network element comprises multiple computational modules for performing data reduction operations in hardware. In some embodiments, each port that receives data for reduction has a respective computational module. The computational modules and ports may be interconnected using configurable forwarding circuitry in various topologies. This allows flexible usage of the computational modules in separate reduction trees without sharing port bandwidth.
Consider a network element, comprising a plurality of ports coupled to network elements and/or network nodes. Ports coupled to respective child network elements or network nodes are referred to as “child ports” and ports coupled to respective parent network elements are referred to as “parent ports.” The network element further comprises multiple computational modules, configurable forwarding circuitry and a central block. The ports are configured to connect to a communication network. The multiple computational modules are configured to collectively perform a data reduction operation, in accordance with a data reduction protocol. The forwarding circuitry is configured to interconnect among the ports and the computational modules. The central block is configured to receive a request indicative of selected child ports, a selected parent port, and computational modules required for performing data reduction operations on data received via the selected child ports, for producing reduced data destined to a parent network element via the selected parent port. The central block derives from the request a topology that interconnects among the selected child ports, the selected parent port and the computational modules so as to perform the requested data reduction operations and to forward the reduced data for transmission to the selected parent port, and configures the forwarding circuitry to apply the topology.
In some embodiments, the selected child ports are configured to receive data messages comprising a reduction operation and respective data portions, and to send the reduction operation to the central block. The central block is configured to set the computational modules to apply the reduction operation to the data portions.
The central block may derive the topology in any suitable way. For example, the central block derives a topology that interconnects multiple computational modules that receive data from child ports in a chain configuration, or in an aggregated configuration that aggregates two or more chains. In some embodiments, the network element stores multiple predefined topologies, e.g., in a table in memory. In such embodiments, the central block derives a requested topology by retrieving it from the table.
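As a hedged illustration of the table-based alternative, the sketch below retrieves a predefined chain topology keyed by the selected ports. The request fields and the edge-list encoding of a topology are assumptions made for illustration; the embodiments do not mandate any particular table format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReductionRequest:
    child_ports: tuple   # selected child ports
    parent_port: str     # selected parent port

# Predefined topologies, keyed by request: each topology is an ordered list of
# directed links that the forwarding circuitry is configured to apply.
TOPOLOGY_TABLE = {
    ReductionRequest(("PORT1", "PORT2", "PORT3", "PORT4"), "PORT5"): [
        ("ALU1", "ALU2"), ("ALU2", "ALU3"), ("ALU3", "ALU4"), ("ALU4", "PORT5"),
    ],
}

def derive_topology(request):
    """Derive a requested topology by retrieving it from the table."""
    return TOPOLOGY_TABLE[request]

print(derive_topology(ReductionRequest(("PORT1", "PORT2", "PORT3", "PORT4"), "PORT5")))
```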
In some embodiments, each of the parent port and the child ports comprises a QP responder and a QP requester that handle reliable transport layer communication of packets related to the data reduction protocol. Handling transport layer communication at the port level (and not by a central element such as the central block) allows fast and reliable packet delivery to and from other network elements and network nodes, at full port bandwidth.
In some embodiments, the central block receives a first data reduction request indicative of first child ports, a first parent port and first computational modules required to perform first data reduction operations on data received via the first child ports and destined to the first parent port, and further receives a second data reduction request indicative of second child ports, a second parent port, and second computational modules required to perform second data reduction operations on data received via the second child ports and destined to the second parent port. The central block derives from the first request a first topology for performing the first data reduction operations and derives from the second request a second topology for performing the second data reduction operations. The central block configures the forwarding circuitry to apply both the first topology and the second topology so as to support performing the first data reduction operations and the second data reduction operations in parallel.
The first and second topologies may use disjoint subsets of ports and computational modules. The central block may configure the forwarding circuitry to apply the derived first and second topologies so that the respective first and second data reduction operations are executed at full port bandwidth, and may overlap in time.
In some embodiments, the request is indicative of the network element serving as a root network element, and the central block derives from the request a topology that interconnects among the selected child ports and the computational modules so as to perform data reduction operations for producing aggregated data and to route the aggregated data to one or more child ports.
In an embodiment, the request or a separately received request is indicative of a given parent port and one or more given child ports, and the central block is configured to derive from the request a topology that interconnects the given parent port to the one or more given child ports, for receiving aggregated data from a respective parent network element via the given parent port and distributing the aggregated data via the given child ports to respective network elements or network nodes.
In an embodiment, the forwarding circuitry comprises upstream forwarding circuitry and downstream forwarding circuitry. In this embodiment, the central block applies in parallel an upstream topology to the upstream forwarding circuitry for applying the data reduction operations, and applies a downstream topology to the downstream forwarding circuitry for distributing aggregated data produced by a root network element toward one or more network nodes.
In the disclosed techniques a network element supports flexible interconnections among ports and computational modules, without unnecessarily using computational modules for just passing data, thus refraining from bandwidth sharing. Ports that receive data for reduction have local computational modules that may be interconnected, e.g., in a serial chain having a suitable length, or in an aggregated configuration that aggregates multiple chains. This flexibility in connecting computational modules via the forwarding circuitry allows efficient usage of limited resources in performing different data reduction operations at different times, and/or in performing multiple data reduction operations in parallel without sharing port bandwidth.
Computing system 20 may be used in various applications such as High Performance Computing (HPC) clusters, data center applications and Artificial Intelligence (AI), to name a few.
In computing system 20, multiple end nodes 28 communicate with one another over a communication network 32. “End node” 28 is also referred to herein as a “network node.” Communication network 32 may comprise any suitable type of communication network operating using any suitable protocols such as, for example, an Infiniband™ network or an Ethernet network. End node 28 is coupled to the communication network using a Network Interface Controller (NIC) 36. In Infiniband terminology, the network interface is referred to as a Host Channel Adapter (HCA). End node 28 may comprise any suitable processing module such as, for example, a server or a multi-core processing module comprising, for example, one or more Graphics Processing Units (GPUs) or other types of accelerators. End node 28 typically comprises (not shown) multiple processing units such as Central Processing Units (CPUs) and Graphics Processing Units (GPUs), coupled via a suitable link (e.g., a PCIe bus) to a memory and peripheral devices, e.g., NIC 36.
Communication network 32 comprises multiple network elements 24 interconnected in a multi-level configuration that enables performing complex in-network calculations using data reduction techniques. In the present example, network elements 24 are arranged in a tree configuration having a lower level, a middle level and a top level, comprising network elements 24A, 24B and 24C, respectively. Typically, a network element 24A connects to multiple end nodes 28 using NICs 36.
A practical computing system 20 may comprise several thousands or even tens of thousands of end nodes 28 interconnected using several hundreds or thousands of network elements 24. For example, communication network 32 of computing system 20 may be configured in a four-level Fat-Tree topology comprising on the order of 3,500 switches.
In the multi-level tree structure, a network element may connect to child network elements in a lower level or to network nodes, and to a parent network element in a higher level. A network element at the top level is also referred to as a root network element. A subset (or all) of the network elements of a physical tree structure may form a data reduction tree, which is a logical structure typically used for modeling in-network calculations, as will be described below.
In some embodiments, multiple network elements 24 perform a calculation for some or all of network nodes 28. The network elements collectively perform the calculation as modeled using a suitable data reduction tree. In the hierarchical calculation, network elements in lower levels produce partial results that are aggregated by network elements in higher levels of the data reduction tree. A network element serving as the root of the data reduction tree produces the final calculation result (aggregated data), which is typically distributed to one or more network nodes 28. The calculation carried out by a network element 24 for producing a partial result is also referred to as a “data reduction operation.”
The data flow from the network nodes toward the root is also referred to as “upstream,” and the data reduction tree used in the upstream direction is also referred to as an “upstream data reduction tree.” The data flow from the root toward the network nodes is also referred to as “downstream,” and the data reduction tree used in the downstream direction is also referred to as a “downstream data reduction tree.”
Breaking a calculation over a data stream into a hierarchical in-network calculation by network elements 24 is typically carried out using a suitable data reduction protocol. An example data reduction protocol is SHArP, described in U.S. Pat. No. 10,284,383 cited above.
As will be described below, network elements 24 support flexible usage of ports and computational resources for performing multiple data reduction operations in parallel. This enables flexible and efficient in-network computations in computing system 20.
Network element 24 may be used, for example, in implementing network elements 24A, 24B and 24C in communication network 32.
Network element 24 comprises a central block 40 that manages the operation of the network element in accordance with the underlying data reduction protocol, e.g., the SHArP mentioned above. The functionality of central block 40 will be described in more detail below.
Network element 24 further comprises configurable forwarding circuitry 42, which is connected using fixed connections 44 to various elements within network element 24. Forwarding circuitry 42 is flexibly configurable to interconnect among the various elements to which it connects. This allows creating various topologies of ports and computational resources for performing data reduction operations. In an embodiment, forwarding circuitry 42 comprises a configurable crossbar switch. The flexibility in interconnections contributes to the ability to support full port bandwidth.
Network element 24 comprises multiple ports 46 for connecting the network element to communication network 32. Each of ports 46 functions both as an input port for receiving packets from the communication network and as an output port for transmitting packets to the communication network. A practical network element 24 may comprise, for example, between 64 and 128 ports 46. Alternatively, a network element having any other suitable number of ports can also be used.
In some embodiments, each port 46 is respectively coupled to a transport-layer reception module 48, denoted “TRM-RX,” and to a transport-layer transmission module 52, denoted “TRM-TX.” The input part of port 46 is coupled to TRM-RX 48 via a parser 56. TRM-RX 48 comprises a QP responder 60 and a computational module 64, which is also referred to herein as an Arithmetic Logic Unit (ALU). TRM-TX 52 comprises a QP requester 68. TRM-RX 48 further comprises a reception buffer 70 denoted RX-BUFFER for storing incoming packets. TRM-TX 52 further comprises a transmission buffer 71 denoted TX-BUFFER for storing outgoing packets.
In some embodiments, central block 40 controls the internal connectivity of forwarding circuitry 42 and the configurations of ALUs 64 so that the ports and the ALUs are interconnected in a topology suitable for performing a requested data reduction operation.
Parser 56 is configured to parse incoming packets, and to identify and send relevant packets to TRM-RX 48.
In some embodiments, parser 56 identifies that a request for applying a data reduction operation has been received, and notifies central block 40 of the request. The request may be indicative of a topology required in the upstream direction, a topology required in the downstream direction or both. Same or different ports may be used in the upstream topology and in the downstream topology, respectively. The data reduction operation itself (e.g., indicative of the function to which ALUs 64 should be configured) may be specified in the request that is indicative of the topology (or topologies) or alternatively, carried in a header of a data message.
The upstream topology supports data reduction operations on data received from certain child network elements via multiple child ports, for producing reduced data destined to a given parent network element via a parent port. The downstream topology specifies a given parent port via which aggregated data is received, and the child ports to which that aggregated data is distributed.
In the upstream direction, the request is indicative of selected child ports, a selected parent port, and computational modules required for performing data reduction operations. The central block derives from the request a topology that interconnects among the selected child ports, the selected parent port and the ALUs, so as to perform the data reduction operations and to forward the resulting reduced data to the selected parent port. As noted above, the actual ALU function may be specified in the request or in a separate data message.
In some embodiments, the selected child ports receive data messages comprising a reduction operation and respective data portions and send the reduction operation to the central block, which sets the computational modules to apply the reduction operation to the data portions.
In the downstream direction, the request is indicative of a given parent port and one or more given child ports, and the central block derives from the request, a topology that interconnects the given parent port to the one or more given child ports for receiving aggregated data from a respective parent network element via the given parent port and distributing the aggregated data via the given child ports to respective network elements or network nodes.
Transport-layer modules TRM-RX 48 and TRM-TX 52 handle reliable connections with other entities via ports 46, such as ports of another network element or a port of a NIC of some network node 28. QP responder 60 in TRM-RX 48 handles reliable data reception via port 46. QP requester 68 in TRM-TX handles reliable data transmission via port 46.
In some embodiments, QP responder 60 receives packets transmitted by a corresponding QP requester, and signals back ACK/NACK notifications. QP requester 68 transmits packets to a corresponding QP responder on the other side of the link and handles re-transmissions as necessary.
Note that since each port 46 has a local QP responder 60 and a local QP requester 68, communication among the network elements (and network nodes) can be carried out at wire speed and with minimal latency. This, together with the flexible connectivity via the forwarding circuitry, allows executing multiple data reduction operations using respective disjoint data reduction trees, in parallel, at full port bandwidth.
Network element 24 comprises one or more aggregators 72, each of which comprises an ALU 74, which is identical or similar to ALU 64 of TRM-RX module 48. Aggregator 72 does not receive data directly from any port 46. Instead, aggregator 72 aggregates data output by ALUs 64 of TRM-RXs 48. Aggregator 72 may also aggregate data output by an ALU 74 of another aggregator to create a hierarchical computational topology, in an embodiment.
The functionality of ALU 64, as will be described below, also applies similarly to ALU 74. In the present example, ALU 64 (and ALU 74) comprises two inputs and a single output. Let A1 and A2 denote input arguments and let A3 denote a result calculated by the ALU. The ALU typically supports multiple predefined functions to which the ALU may be configured by the central block. When configured to a given function “F( )”, the ALU calculates A3 as A3=F(A1, A2). ALUs 64 and 74 support any suitable operation such as, for example, mathematical functions such as integer and floating-point addition, multiplication and division, and logical functions such as logical AND, OR and XOR, and bitwise AND, OR and XOR. Other operations supported by ALUs 64 and 74 comprise, for example, min, max, min loc, and max loc. In some embodiments ALUs 64 and 74 support configurable operators.
In some embodiments, data received via port 46 (from a child network element or from a network node) is provided to one input of ALU 64. ALU 64 may be configured to a Null function, in which case the other input of the ALU is ignored and the data received from the port is output by ALU 64 with no modification. Alternatively, ALU 64 receives on its other input (via the forwarding circuitry) data calculated by another ALU 64, and applies the function F( ) to the data received on both inputs. ALU 74 typically receives, via the forwarding circuitry, data output by two ALUs 64. In performing a data reduction operation, the participating ALUs 64 and ALU 74 are configured by the central block to a common function F( ). Alternatively, at least some of the ALUs (64, 74 or both) assigned to a given data reduction operation may be configured to apply different functions.
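A minimal Python sketch of the ALU behavior described above follows. The function names and the representation of the Null configuration are illustrative assumptions rather than the device interface.

```python
# Illustrative function table; the configurable ALU may support many more operations.
OPERATIONS = {
    "sum": lambda a1, a2: a1 + a2,
    "max": lambda a1, a2: max(a1, a2),
    "and": lambda a1, a2: a1 & a2,
}

def alu(a1, a2, func=None):
    """Two-input ALU: when configured to the Null function (func is None), the data
    received from the port (a1) is output unmodified and the other input is ignored;
    otherwise the ALU calculates A3 = F(A1, A2) for the configured function F."""
    if func is None:
        return a1
    return OPERATIONS[func](a1, a2)

assert alu(7, None) == 7        # Null: pass the port data through
assert alu(7, 5, "sum") == 12   # configured to integer addition
```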
The output of ALU 64 may be routed via the forwarding circuitry as input to another ALU 64 (or ALU 74) as described above. Alternatively, the output of ALU 64 may be routed via the forwarding circuitry to a QP requester of the parent port for transmission to a parent network element. In a root network element, the output of the last ALU 64 that concludes the calculation specified by the underlying reduction tree may be routed to the QP requesters of the child ports participating in the downstream tree.
The configurations of computing system 20 and network element 24 in
Some elements of network element 24, such as central block 40, forwarding circuitry 42 (possibly implemented as separate upstream crossbar 82 and downstream crossbar 84, in
Elements that are not necessary for understanding the principles of the present application, such as various interfaces, addressing circuits, timing and sequencing circuits and debugging circuits, have been omitted from
In some embodiments, some of the functions of central block 40 may be carried out by a general-purpose processor, which is programmed in software to carry out the functions described herein. The software may be downloaded to the processor in electronic form, over a network, for example, or it may, alternatively or additionally, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.
In
ALU1 . . . ALU4 receive data denoted D1 . . . D4 from child network elements (or from network nodes 28) via respective ports denoted PORT1 . . . PORT4 and are collectively configured to perform a data reduction operation. In the present example, the data reduction operation calculates the sum (D1+D2+D3+D4). To this end, ALU1 is configured to transfer D1 to the output of ALU1 and each of ALU2 . . . ALU4 calculates a sum function between its inputs. The calculation is carried out accumulatively as follows: ALU2 outputs the sum (D1+D2), ALU3 outputs the sum [(D1+D2)+D3], and ALU4 outputs the final sum {[(D1+D2)+D3]+D4}. The data reduction result (D1+D2+D3+D4) output by ALU4 is routed via forwarding circuitry 42 to a parent network element via PORT5.
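The accumulative chain calculation of this example may be sketched as follows; the sketch is self-contained and uses integer addition to stand in for the configured function F( ).

```python
def chain_reduce(port_data, func=lambda a, b: a + b):
    """ALU1 passes D1 through (Null function); each subsequent ALU in the chain
    combines the previous ALU's output with the data received via its own port."""
    out = port_data[0]           # ALU1 output: D1
    for d in port_data[1:]:      # ALU2 .. ALU4
        out = func(out, d)
    return out                   # routed via the forwarding circuitry to PORT5

D = [3, 1, 4, 1]                 # D1 .. D4 received via PORT1 .. PORT4
assert chain_reduce(D) == 3 + 1 + 4 + 1
```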
In
ALU1 . . . ALU4 receive data denoted D1 . . . D4 from child network elements via respective ports denoted PORT1 . . . PORT4 and together with ALU5 are collectively configured to perform a data reduction operation, in the present example calculating the sum (D1+D2+D3+D4). The topology in
The chain comprising ALU1 and ALU2 calculates a partial sum (D1+D2) and the chain comprising ALU3 and ALU4 calculates a partial sum (D3+D4). ALU5 calculates the aggregated result [(D1+D2)+(D3+D4)], which the forwarding circuitry routes to PORT5 for transmission to a parent network element.
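In the same spirit, a sketch of the aggregated configuration, in which two chains feed aggregator ALU5, is given below; again, integer addition stands in for the configured function, as an illustrative assumption.

```python
def chain_reduce(port_data, func):
    out = port_data[0]
    for d in port_data[1:]:
        out = func(out, d)
    return out

def aggregated_reduce(d1, d2, d3, d4, func=lambda a, b: a + b):
    """Two chains of length two operate in parallel; aggregator ALU5 combines their
    partial results into the overall reduced data forwarded to PORT5."""
    partial_12 = chain_reduce([d1, d2], func)   # ALU1 -> ALU2: (D1+D2)
    partial_34 = chain_reduce([d3, d4], func)   # ALU3 -> ALU4: (D3+D4)
    return func(partial_12, partial_34)         # ALU5: [(D1+D2)+(D3+D4)]

assert aggregated_reduce(3, 1, 4, 1) == 9
```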
In the example of
The data reduction topologies in
The method will be described for the upstream and downstream directions.
The method of
The request is indicative of selected child ports, a selected parent port and computational modules required for applying data reduction operations. The same data reduction request supports multiple different reduction operations on data that will be received from certain child network elements via the selected child ports, for producing reduced data destined to a parent network element via the selected parent port. Performing the data reduction operation typically requires data manipulation using ALUs 64, possibly with one or more ALUs 74 of aggregators 72. In the present example, the same selected child ports and selected parent port are used in both the upstream and downstream directions.
At a topology derivation step 104, central block 40 derives, from the data reduction request, a topology that interconnects among the selected child ports, the selected parent port, and computational modules (ALUs 64, 74 or both) so as to perform data reduction operations and to forward the reduced data for transmission to the selected parent port. Further at step 104, the central block configures forwarding circuitry 42 to apply the derived topology.
When the network element comprises a root network element, the topology routes the aggregated data calculated by the last ALU to the QP requesters of the relevant child ports that distribute the aggregated data in accordance with a corresponding downstream tree.
At a data message reception step 106, the central block receives header parts of data messages received from child network elements or network nodes, via the selected child ports. Each data message comprises multiple packets. The data message specifies, e.g., in the header part (e.g., in the first packet), the data reduction operation to be performed using the already configured topology. In some embodiments, parser 56 sends the header part of the data message to the central block and forwards the payload data of the data message to the relevant computational module.
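For illustration, a simplified sketch of how a parser might separate the header part of a data message from its payloads is given below. The packet and message layout used here is an assumption that merely mirrors the description above, not an actual wire format.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Packet:
    transport_header: dict
    payload: bytes
    sharp_header: Optional[dict] = None   # present only in the first packet of a message

def split_message(packets: List[Packet]):
    """Return the reduction-operation header (sent to the central block) and the
    payloads (forwarded to the relevant computational module)."""
    sharp_header = packets[0].sharp_header          # e.g., {"op": "sum"}
    payloads = [p.payload for p in packets]
    return sharp_header, payloads

msg = [Packet({"psn": 0}, b"\x01\x02", {"op": "sum"}), Packet({"psn": 1}, b"\x03\x04")]
header, payloads = split_message(msg)
assert header["op"] == "sum" and len(payloads) == 2
```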
At a computational module configuration step 108, central block 40 configures the computational modules that participate in the data reduction operations to a function specified in the header of the data message(s). Step 108 is relevant to the upstream direction and may be skipped in the downstream direction.
At an upstream data flow step 116, the computational modules assigned based on the data message apply the specified data reduction operation to the data payloads received in the data messages, and the resulting reduced data is sent to the parent network element via the selected parent port.
When the network element comprises a root network element, the resulting reduced data comprises the aggregated data, which is sent via the forwarding circuitry to all the selected child ports. At a downstream data flow step 120, the network element receives aggregated data from the selected parent port, and distributes the aggregated data, via the forwarding circuitry, to the selected child ports. Following step 120 the method terminates.
At steps 116 and 120 above, each QP requester of the parent port and the child ports is responsible for sending the messages over a reliable transport-layer connection.
In some embodiments, the method of
In
In describing
The flow steps in
In the upstream direction, depicted in
At step (2), QP responders 60A and 60B of respective ports 46A and 46B receive packets of data messages from the child network elements. In the data messages, each packet comprises a transport layer header and a payload, wherein the first packet of the data message additionally comprises a SHArP header. The QP responder of each port handles the transport layer, and after sending the SHArP header to the central block forwards the payloads of the packets to ALU 64 of that port. At step (3) TRM-RX modules 48 of the child ports forward the SHArP header of the first packet to the central block, which at step (4) prepares a SHArP header for transmitting the reduced data. Further at step (4) the central block sets ALUs 64 and 74 to apply a function specified in the first packet.
At steps (5) and (6), ALUs 64A and 64B perform data reduction on the payloads received in the data messages via child ports 46A and 46B. At steps (7) and (8), ALU 74 of aggregator 72 receives partially reduced data from ALU 64B and from the other chain, and at step (9) ALU 74 calculates the overall reduced data. At step (10), the reduced data is forwarded to port 46C for transmission to the parent network element.
At step (11), QP requester 68C packetizes a reduced data message that contains the reduced data and the SHArP header of step (4), and rebuilds the transport layer by attaching a transport layer header to each packet of the reduced data message. QP requester 68C handles reliable transport layer packet transmission, including retransmissions. In some embodiments, the QP requester uses some storage space of a local buffer (e.g., transmission buffer 71) of the port as a retry buffer for retransmission. In some embodiments, at step (12), the network element applies a suitable scheduling scheme (not shown) for packet transmission via port 46C including, for example, bandwidth allocation and prioritization using Virtual Lane (VL) management.
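A simplified sketch of the packetization of step (11) is given below. The header fields, the packet size and the handling of the retry buffer are assumptions made for illustration only.

```python
def packetize(reduced_data: bytes, sharp_header: dict, mtu: int = 4):
    """Split the reduced data into packets, attach the SHArP header to the first
    packet only, and attach a transport layer header (represented here by a packet
    sequence number) to every packet. In practice, the packets would also be kept
    in a retry buffer until acknowledged, to allow retransmission."""
    packets = []
    for psn, offset in enumerate(range(0, len(reduced_data), mtu)):
        packet = {"transport": {"psn": psn},
                  "payload": reduced_data[offset:offset + mtu]}
        if psn == 0:
            packet["sharp"] = sharp_header
        packets.append(packet)
    return packets

out = packetize(b"\x00" * 10, {"op": "sum"})
assert "sharp" in out[0] and len(out) == 3
```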
In the downstream direction, depicted in
At step (2), QP responder 60C of port 46C receives packets carrying aggregated data, in an aggregated data message, from the parent network element. In the aggregated data message, each packet comprises a transport layer header and a payload, and the first packet additionally comprises a SHArP header. QP responder 60C handles the transport layer, and after sending the SHArP header to the central block forwards the payloads of the packets to the downstream crossbar. In some embodiments, the payloads of the packets are forwarded via ALU 64C that is configured by the central block to a Null function, so that the packet payload is transferred by the ALU with no modification. In alternative embodiments, ALU 64C is bypassed, and the packet payload is forwarded directly to the downstream crossbar, as will be described at step (5) below.
At step (3), TRM-RX 48C of port 46C forwards the SHArP header of the received packet to central block 40, and at step (4) the central block prepares a SHArP header to be transmitted with the aggregated data to the child network elements.
At steps (5) and (6) the downstream crossbar receives the payload of the aggregated data message and forwards the payload to both child ports 46A and 46B in parallel.
At step (7), each of QP requesters 68A and 68B packetizes an aggregated data message that contains the aggregated data and the SHArP header of step (4), and rebuilds the transport layer by attaching a transport layer header to each packet of the aggregated data message. QP requesters 68A and 68B handle transport layer packet transmission, including retransmissions. As noted above, in some embodiments, the QP requester may use storage space of a local buffer (e.g., transmission buffer 71) of the port as a retry buffer for retransmission. In some embodiments, at step (8), the network element applies a suitable scheduling scheme (not shown) for packet transmission to ports 46A and 46B including, for example, bandwidth allocation and prioritization using Virtual Lane (VL) management.
The embodiments described above are given by way of example, and other suitable embodiments can also be used. For example, in the embodiments of
Although the embodiments described herein mainly address data reduction operations such as “all reduce” and “reduce” operations, the methods and systems described herein can also be used in other applications, such as in performing, for example, “reliable multicast,” “reliable broadcast,” “all gather” and “gather” operations.
It will be appreciated that the embodiments described above are cited by way of example, and that the following claims are not limited to what has been particularly shown and described hereinabove. Rather, the scope includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art. Documents incorporated by reference in the present patent application are to be considered an integral part of the application except that to the extent any terms are defined in these incorporated documents in a manner that conflicts with the definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.
Number | Name | Date | Kind |
---|---|---|---|
4933969 | Marshall et al. | Jun 1990 | A |
5068877 | Near et al. | Nov 1991 | A |
5325500 | Bell et al. | Jun 1994 | A |
5353412 | Douglas et al. | Oct 1994 | A |
5404565 | Gould et al. | Apr 1995 | A |
5606703 | Brady et al. | Feb 1997 | A |
5944779 | Blum | Aug 1999 | A |
6041049 | Brady | Mar 2000 | A |
6370502 | Wu et al. | Apr 2002 | B1 |
6483804 | Muller et al. | Nov 2002 | B1 |
6507562 | Kadansky et al. | Jan 2003 | B1 |
6728862 | Wilson | Apr 2004 | B1 |
6857004 | Howard et al. | Feb 2005 | B1 |
6937576 | Di Benedetto et al. | Aug 2005 | B1 |
7102998 | Golestani | Sep 2006 | B1 |
7124180 | Ranous | Oct 2006 | B1 |
7164422 | Wholey, III et al. | Jan 2007 | B1 |
7171484 | Krause et al. | Jan 2007 | B1 |
7313582 | Bhanot et al. | Dec 2007 | B2 |
7327693 | Rivers et al. | Feb 2008 | B1 |
7336646 | Muller | Feb 2008 | B2 |
7346698 | Hannaway | Mar 2008 | B2 |
7555549 | Campbell et al. | Jun 2009 | B1 |
7613774 | Caronni et al. | Nov 2009 | B1 |
7636424 | Halikhedkar | Dec 2009 | B1 |
7636699 | Stanfill | Dec 2009 | B2 |
7738443 | Kumar | Jun 2010 | B2 |
8213315 | Crupnicoff et al. | Jul 2012 | B2 |
8380880 | Gulley et al. | Feb 2013 | B2 |
8510366 | Anderson et al. | Aug 2013 | B1 |
8738891 | Karandikar et al. | May 2014 | B1 |
8761189 | Shachar et al. | Jun 2014 | B2 |
8768898 | Trimmer et al. | Jul 2014 | B1 |
8775698 | Archer et al. | Jul 2014 | B2 |
8811417 | Bloch et al. | Aug 2014 | B2 |
9110860 | Shahar | Aug 2015 | B2 |
9189447 | Faraj | Nov 2015 | B2 |
9294551 | Froese et al. | Mar 2016 | B1 |
9344490 | Bloch et al. | May 2016 | B2 |
9563426 | Bent et al. | Feb 2017 | B1 |
9626329 | Howard | Apr 2017 | B2 |
9756154 | Jiang | Sep 2017 | B1 |
10015106 | Florissi et al. | Jul 2018 | B1 |
10158702 | Bloch et al. | Dec 2018 | B2 |
10284383 | Bloch et al. | May 2019 | B2 |
10296351 | Kohn et al. | May 2019 | B1 |
10305980 | Gonzalez et al. | May 2019 | B1 |
10318306 | Kohn et al. | Jun 2019 | B1 |
10425350 | Florissi | Sep 2019 | B1 |
10521283 | Shuler et al. | Dec 2019 | B2 |
10541938 | Timmerman et al. | Jan 2020 | B1 |
10621489 | Appuswamy et al. | Apr 2020 | B2 |
20020010844 | Noel et al. | Jan 2002 | A1 |
20020035625 | Tanaka | Mar 2002 | A1 |
20020150094 | Cheng et al. | Oct 2002 | A1 |
20020150106 | Kagan et al. | Oct 2002 | A1 |
20020152315 | Kagan et al. | Oct 2002 | A1 |
20020152327 | Kagan et al. | Oct 2002 | A1 |
20020152328 | Kagan et al. | Oct 2002 | A1 |
20030018828 | Craddock et al. | Jan 2003 | A1 |
20030061417 | Craddock et al. | Mar 2003 | A1 |
20030065856 | Kagan et al. | Apr 2003 | A1 |
20040062258 | Grow et al. | Apr 2004 | A1 |
20040078493 | Blumrich | Apr 2004 | A1 |
20040120331 | Rhine et al. | Jun 2004 | A1 |
20040123071 | Stefan et al. | Jun 2004 | A1 |
20040252685 | Kagan et al. | Dec 2004 | A1 |
20040260683 | Chan et al. | Dec 2004 | A1 |
20050097300 | Gildea et al. | May 2005 | A1 |
20050122329 | Janus | Jun 2005 | A1 |
20050129039 | Biran et al. | Jun 2005 | A1 |
20050131865 | Jones et al. | Jun 2005 | A1 |
20050281287 | Ninomi et al. | Dec 2005 | A1 |
20060282838 | Gupta et al. | Dec 2006 | A1 |
20070127396 | Jain | Jun 2007 | A1 |
20070162236 | Lamblin et al. | Jul 2007 | A1 |
20080104218 | Liang | May 2008 | A1 |
20080126564 | Wilkinson | May 2008 | A1 |
20080168471 | Benner et al. | Jul 2008 | A1 |
20080181260 | Vonog et al. | Jul 2008 | A1 |
20080192750 | Ko | Aug 2008 | A1 |
20080244220 | Lin et al. | Oct 2008 | A1 |
20080263329 | Archer et al. | Oct 2008 | A1 |
20080288949 | Bohra et al. | Nov 2008 | A1 |
20080298380 | Rittmeyer et al. | Dec 2008 | A1 |
20080307082 | Cai | Dec 2008 | A1 |
20090037377 | Archer et al. | Feb 2009 | A1 |
20090063816 | Arimilli et al. | Mar 2009 | A1 |
20090063817 | Arimilli et al. | Mar 2009 | A1 |
20090063891 | Arimilli et al. | Mar 2009 | A1 |
20090182814 | Tapolcai et al. | Jul 2009 | A1 |
20090247241 | Gollnick et al. | Oct 2009 | A1 |
20090292905 | Faraj | Nov 2009 | A1 |
20100017420 | Archer et al. | Jan 2010 | A1 |
20100049836 | Kramer | Feb 2010 | A1 |
20100074098 | Zeng et al. | Mar 2010 | A1 |
20100095086 | Eichenberger et al. | Apr 2010 | A1 |
20100185719 | Howard | Jul 2010 | A1 |
20100241828 | Yu et al. | Sep 2010 | A1 |
20110060891 | Jia | Mar 2011 | A1 |
20110066649 | Berlyant et al. | Mar 2011 | A1 |
20110119673 | Bloch et al. | May 2011 | A1 |
20110173413 | Chen et al. | Jul 2011 | A1 |
20110219208 | Asaad | Sep 2011 | A1 |
20110238956 | Arimilli et al. | Sep 2011 | A1 |
20110258245 | Blocksome et al. | Oct 2011 | A1 |
20110276789 | Chambers et al. | Nov 2011 | A1 |
20120063436 | Thubert et al. | Mar 2012 | A1 |
20120117331 | Krause et al. | May 2012 | A1 |
20120131309 | Johnson | May 2012 | A1 |
20120216021 | Archer et al. | Aug 2012 | A1 |
20120254110 | Takemoto | Oct 2012 | A1 |
20130117548 | Grover et al. | May 2013 | A1 |
20130159410 | Lee et al. | Jun 2013 | A1 |
20130318525 | Palanisamy et al. | Nov 2013 | A1 |
20130336292 | Kore et al. | Dec 2013 | A1 |
20140033217 | Vajda et al. | Jan 2014 | A1 |
20140047341 | Breternitz et al. | Feb 2014 | A1 |
20140095779 | Forsyth et al. | Apr 2014 | A1 |
20140122831 | Uliel et al. | May 2014 | A1 |
20140189308 | Hughes et al. | Jul 2014 | A1 |
20140211804 | Makikeni et al. | Jul 2014 | A1 |
20140280420 | Khan | Sep 2014 | A1 |
20140281370 | Khan | Sep 2014 | A1 |
20140362692 | Wu et al. | Dec 2014 | A1 |
20140365548 | Mortensen | Dec 2014 | A1 |
20150106578 | Warfield et al. | Apr 2015 | A1 |
20150143076 | Khan | May 2015 | A1 |
20150143077 | Khan | May 2015 | A1 |
20150143078 | Khan et al. | May 2015 | A1 |
20150143079 | Khan | May 2015 | A1 |
20150143085 | Khan | May 2015 | A1 |
20150143086 | Khan | May 2015 | A1 |
20150154058 | Miwa et al. | Jun 2015 | A1 |
20150180785 | Annamraju | Jun 2015 | A1 |
20150188987 | Reed et al. | Jul 2015 | A1 |
20150193271 | Archer et al. | Jul 2015 | A1 |
20150212972 | Boettcher et al. | Jul 2015 | A1 |
20150269116 | Raikin et al. | Sep 2015 | A1 |
20150379022 | Puig et al. | Dec 2015 | A1 |
20160055225 | Xu et al. | Feb 2016 | A1 |
20160105494 | Reed et al. | Apr 2016 | A1 |
20160112531 | Milton et al. | Apr 2016 | A1 |
20160117277 | Raindel et al. | Apr 2016 | A1 |
20160179537 | Kunzman et al. | Jun 2016 | A1 |
20160219009 | French | Jul 2016 | A1 |
20160248656 | Anand et al. | Aug 2016 | A1 |
20160299872 | Vaidyanathan et al. | Oct 2016 | A1 |
20160342568 | Burchard et al. | Nov 2016 | A1 |
20160364350 | Sanghi et al. | Dec 2016 | A1 |
20170063613 | Bloch et al. | Mar 2017 | A1 |
20170093715 | McGhee et al. | Mar 2017 | A1 |
20170116154 | Palmer et al. | Apr 2017 | A1 |
20170187496 | Shalev et al. | Jun 2017 | A1 |
20170187589 | Pope et al. | Jun 2017 | A1 |
20170187629 | Shalev et al. | Jun 2017 | A1 |
20170187846 | Shalev et al. | Jun 2017 | A1 |
20170199844 | Burchard et al. | Jul 2017 | A1 |
20180004530 | Vorbach | Jan 2018 | A1 |
20180046901 | Xie et al. | Feb 2018 | A1 |
20180047099 | Bonig et al. | Feb 2018 | A1 |
20180089278 | Bhattacharjee et al. | Mar 2018 | A1 |
20180091442 | Chen et al. | Mar 2018 | A1 |
20180097721 | Matsui | Apr 2018 | A1 |
20180173673 | Daglis et al. | Jun 2018 | A1 |
20180262551 | Demeyer et al. | Sep 2018 | A1 |
20180285316 | Thorson et al. | Oct 2018 | A1 |
20180287928 | Levi et al. | Oct 2018 | A1 |
20180302324 | Kasuya | Oct 2018 | A1 |
20180321912 | Li | Nov 2018 | A1 |
20180321938 | Boswell et al. | Nov 2018 | A1 |
20180367465 | Levi | Dec 2018 | A1 |
20180375781 | Chen et al. | Dec 2018 | A1 |
20190018805 | Benisty | Jan 2019 | A1 |
20190026250 | Das Sarma et al. | Jan 2019 | A1 |
20190065208 | Liu et al. | Feb 2019 | A1 |
20190068501 | Schneider et al. | Feb 2019 | A1 |
20190102179 | Fleming et al. | Apr 2019 | A1 |
20190102338 | Tang et al. | Apr 2019 | A1 |
20190102640 | Balasubramanian | Apr 2019 | A1 |
20190114533 | Ng et al. | Apr 2019 | A1 |
20190121388 | Knowles et al. | Apr 2019 | A1 |
20190138638 | Pal et al. | May 2019 | A1 |
20190147092 | Pal et al. | May 2019 | A1 |
20190235866 | Das Sarma et al. | Aug 2019 | A1 |
20190303168 | Fleming, Jr. et al. | Oct 2019 | A1 |
20190303263 | Fleming, Jr. et al. | Oct 2019 | A1 |
20190324431 | Celia et al. | Oct 2019 | A1 |
20190339688 | Cella et al. | Nov 2019 | A1 |
20190347099 | Eapen et al. | Nov 2019 | A1 |
20190369994 | Parandeh Afshar et al. | Dec 2019 | A1 |
20190377580 | Vorbach | Dec 2019 | A1 |
20190379714 | Levi et al. | Dec 2019 | A1 |
20200005859 | Chen et al. | Jan 2020 | A1 |
20200034145 | Bainville et al. | Jan 2020 | A1 |
20200057748 | Danilak | Feb 2020 | A1 |
20200103894 | Celia et al. | Apr 2020 | A1 |
20200106828 | Elias et al. | Apr 2020 | A1 |
20200137013 | Jin et al. | Apr 2020 | A1 |
Entry |
---|
Chapman et al., “Introducing OpenSHMEM: SHMEM for the PGAS Community,” Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, pp. 1-4, Oct. 2010. |
Priest et al., “You've Got Mail (YGM): Building Missing Asynchronous Communication Primitives”, IEEE International Parallel and Distributed Processing Symposium Workshops, pp. 221-230, year 2019. |
Wikipedia, “Nagle's algorithm”, pp. 1-4, Dec. 12, 2019. |
Mellanox Technologies, “InfiniScale IV: 36-port 40GB/s Infiniband Switch Device”, pp. 1-2, year 2009. |
Mellanox Technologies Inc., “Scaling 10Gb/s Clustering at Wire-Speed”, pp. 1-8, year 2006. |
IEEE 802.1D Standard “IEEE Standard for Local and Metropolitan Area Networks—Media Access Control (MAC) Bridges”, IEEE Computer Society, pp. 1-281, Jun. 9, 2004. |
IEEE 802.1AX Standard “IEEE Standard for Local and Metropolitan Area Networks—Link Aggregation”, IEEE Computer Society, pp. 1-163, Nov. 3, 2008. |
Turner et al., “Multirate Clos Networks”, IEEE Communications Magazine, pp. 1-11, Oct. 2003. |
Thayer School of Engineering, “A Slightly Edited Local Copy of Elements of Lectures 4 and 5”, Dartmouth College, pp. 1-5, Jan. 15, 1998 http://people.seas.harvard.edu/˜jones/cscie129/nu_lectures/lecture11/switching/clos_network/clos_network.html. |
“MPI: A Message-Passing Interface Standard,” Message Passing Interface Forum, version 3.1, pp. 1-868, Jun. 4, 2015. |
Coti et al., “MPI Applications on Grids: a Topology Aware Approach,” Proceedings of the 15th International European Conference on Parallel and Distributed Computing (EuroPar'09), pp. 1-12, Aug. 2009. |
Petrini et al., “The Quadrics Network (QsNet): High-Performance Clustering Technology,” Proceedings of the 9th IEEE Symposium on Hot Interconnects (Hotl'01), pp. 1-6, Aug. 2001. |
Sancho et al., “Efficient Offloading of Collective Communications in Large-Scale Systems,” Proceedings of the 2007 IEEE International Conference on Cluster Computing, pp. 1-10, Sep. 17-20, 2007. |
Infiniband Trade Association, “InfiniBand™ Architecture Specification”, release 1.2.1, pp. 1-1727, Jan. 2008. |
InfiniBand Architecture Specification, vol. 1, Release 1.2.1, pp. 1-1727, Nov. 2007. |
Deming, “Infiniband Architectural Overview”, Storage Developer Conference, pp. 1-70, year 2013. |
Fugger et al., “Reconciling fault-tolerant distributed computing and systems-on-chip”, Distributed Computing, vol. 24, Issue 6, pp. 323-355, Jan. 2012. |
Wikipedia, “System on a chip”, pp. 1-4, Jul. 6, 2018. |
Villavieja et al., “On-chip Distributed Shared Memory”, Computer Architecture Department, pp. 1-10, Feb. 3, 2011. |
U.S. Appl. No. 16/357,356 office action dated May 14, 2020. |
European Application # 20156490.3 search report dated Jun. 25, 2020. |
Bruck et al., “Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems”, IEEE Transactions on Parallel and Distributed Systems, vol. 8, No. 11, pp. 1143-1156, Nov. 1997. |
Bruck et al., “Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems”, Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures, pp. 298-309, Aug. 1, 1994. |
Chiang et al., “Toward supporting data parallel programming on clusters of symmetric multiprocessors”, Proceedings International Conference on Parallel and Distributed Systems, pp. 607-614, Dec. 14, 1998. |
Gainaru et al., “Using InfiniBand Hardware Gather-Scatter Capabilities to Optimize MPI All-to-All”, Proceedings of the 23rd European MPI Users' Group Meeting, pp. 167-179, Sep. 2016. |
Pjesivac-Grbovic et al., “Performance Analysis of MPI Collective Operations”, 19th IEEE International Parallel and Distributed Processing Symposium, pp. 1-19, 2015. |
Danalis et al., “PTG: an abstraction for unhindered parallelism”, 2014 Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, pp. 1-10, Nov. 17, 2014. |
Cosnard et al., “Symbolic Scheduling of Parameterized Task Graphs on Parallel Machines,” Combinatorial Optimization book series (COOP, vol. 7), pp. 217-243, year 2000. |
Jeannot et al., “Automatic Multithreaded Parallel Program Generation for Message Passing Multiprocessors using paramerized Task Graphs”, World Scientific, pp. 1-8, Jul. 23, 2001. |
Stone, “An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations,” Journal of the Association for Computing Machinery, vol. 10, No. 1, pp. 27-38, Jan. 1973. |
Kogge et al., “A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations,” IEEE Transactions on Computers, vol. C-22, No. 8, pp. 786-793, Aug. 1973. |
Hoefler et al., “Message Progression in Parallel Computing—To Thread or not to Thread?”, 2008 IEEE International Conference on Cluster Computing, pp. 1-10, Tsukuba, Japan, Sep. 29-Oct. 1, 2008. |
U.S. Appl. No. 16/430,457 Office Action dated Jul. 9, 2021. |
Yang et al., “SwitchAgg: A Further Step Toward In-Network Computing,” 2019 IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking, pp. 36-45, Dec. 2019. |
EP Application # 20216972 Search Report dated Jun. 11, 2021. |
U.S. Appl. No. 16/782,118 Office Action dated Jun. 3, 2021. |
U.S. Appl. No. 16/789,458 Office Action dated Jun. 10, 2021. |
“Message Passing Interface (MPI): History and Evolution,” Virtual Workshop, Cornell University Center for Advanced Computing, NY, USA, pp. 1-2, year 2021, as downloaded from https://cvw.cac.cornell.edu/mpi/history. |
Pacheco, “A User's Guide to MPI,” Department of Mathematics, University of San Francisco, CA, USA, pp. 1-51, Mar. 30, 1998. |
Wikipedia, “Message Passing Interface,” pp. 1-16, last edited Nov. 7, 2021, as downloaded from https://en.wikipedia.org/wiki/Message_Passing_Interface. |
U.S. Appl. No. 16/782,118 Office Action dated Nov. 8, 2021. |