Network element supporting flexible data reduction operations

Information

  • Patent Grant
  • Patent Number
    11,252,027
  • Date Filed
    Thursday, January 23, 2020
  • Date Issued
    Tuesday, February 15, 2022
Abstract
A network element includes a plurality of ports, multiple computational modules, configurable forwarding circuitry and a central block. The ports include child ports coupled to child network elements or network nodes and parent ports coupled to parent network elements. The computational modules collectively perform a data reduction operation of a data reduction protocol. The forwarding circuitry interconnects among ports and computational modules. The central block is configured to receive a request indicative of child ports, a parent port, and computational modules required for performing reduction operations on data received via the child ports, for producing reduced data destined to the parent port, to derive from the request a topology that interconnects among the child ports, parent port and computational modules for performing the data reduction operations and to forward the reduced data for transmission to the selected parent port, and to configure the forwarding circuitry to apply the topology.
Description
TECHNICAL FIELD

Embodiments described herein relate generally to in-network computing, and particularly to methods and systems for network elements supporting flexible data reduction operations.


BACKGROUND

Some computing systems support performing computation tasks by network elements of a communication system. Methods for distributing a computation among multiple network elements are known in the art. For example, U.S. Pat. No. 10,284,383 describes a switch in a data network, configured to mediate data exchanges among network elements. The apparatus further includes a processor, which organizes the network elements into a hierarchical tree having a root node network element, vertex node network elements, and child node network elements that include leaf node network elements. The leaf node network elements originate aggregation data and transmit the aggregation data to respective parent vertex node network elements. The vertex node network elements combine the aggregation data from at least a portion of the child node network elements, and transmit the combined aggregation data from the vertex node network elements to parent vertex node network elements. The root node network element is operative for initiating a reduction operation on the aggregation data.


SUMMARY

An embodiment that is described herein provides a network element that includes a plurality of ports, multiple computational modules, configurable forwarding circuitry and a central block. The plurality of ports includes multiple child ports coupled to respective child network elements or network nodes and one or more parent ports coupled to respective parent network elements. The plurality of ports is configured to connect to a communication network. The computational modules are configured to collectively perform a data reduction operation in accordance with a data reduction protocol. The configurable forwarding circuitry is configured to interconnect among the ports and the computational modules. The central block is configured to receive a request indicative of selected child ports, a selected parent port, and computational modules required for performing data reduction operations on data received from the child network elements or network nodes via the selected child ports, for producing reduced data destined to a parent network element via the selected parent port, to derive, from the request, a topology that interconnects among the selected child ports, the selected parent port and the computational modules so as to perform the data reduction operations and to forward the respective reduced data for transmission to the selected parent port, and to configure the forwarding circuitry to apply the topology.


In some embodiments, the selected child ports are configured to receive data messages including a reduction operation and respective data portions, and to send the reduction operation to the central block, and the central block is configured to set the computational modules to apply the reduction operation to the data portions. In other embodiments, the central block is configured to derive the topology to interconnect computational modules that receive data for reduction via the selected child ports, in a chain configuration. In yet other embodiments, the central block is configured to derive the topology to interconnect outputs of two computational modules that receive data for reduction via the selected child ports as inputs to an aggregator computational module.


In an embodiment, the selected parent port and each of the selected child ports include a QP responder and a QP requester, configured to respectively handle reliable transport layer reception and transmission of packets. In another embodiment, the central block is configured to receive a first request indicative of first child ports, a first parent port and first computational modules required to perform first data reduction operations on data received via the first child ports and destined to the first parent port, and further receive a second request indicative of second child ports, a second parent port, and second computational modules required to perform second data reduction operations on data received via the second child ports and destined to the second parent port, to derive from the first request a first topology for performing the first data reduction operations and derive from the second request a second topology for performing the second data reduction operations, and to configure the forwarding circuitry to apply both the first topology and the second topology so as to support performing the first data reduction operations and the second data reduction operations in parallel. In yet another embodiment, the request is indicative of the network element serving as a root network element, and the central block is configured to derive from the request a topology that interconnects among the selected child ports and the computational modules so as to perform the data reduction operations for producing aggregated data and to route the aggregated data to one or more child ports.


In some embodiments, the request or a separately received request is indicative of a given parent port and one or more given child ports, and the central block is configured to derive from the request, a topology that interconnects the given parent port to the one or more given child ports for receiving aggregated data from a respective parent network element via the given parent port and distributing the aggregated data via the given child ports to respective network elements or network nodes. In other embodiments, the forwarding circuitry includes upstream forwarding circuitry and downstream forwarding circuitry, and the central block is configured to apply, in parallel, an upstream topology to the upstream forwarding circuitry for applying the data reduction operations, and to apply a downstream topology to the downstream forwarding circuitry for distributing aggregated data produced by a root network element toward one or more network nodes.


There is additionally provided, in accordance with an embodiment that is described herein, a method including, in a network element including (i) a plurality of ports that connect to a communication network, including multiple child ports coupled to respective child network elements or network nodes and one or more parent ports coupled to respective parent network elements, (ii) multiple computational modules that collectively perform a data reduction operation, in accordance with a data reduction protocol, and (iii) configurable forwarding circuitry that interconnects among the ports and the computational modules, receiving by a central block of the network element a request indicative of selected child ports, a selected parent port, and computational modules required for performing data reduction operations on data received via the selected child ports, for producing reduced data destined to a parent network element via the selected parent port. A topology is derived, from the request, that interconnects among the selected child ports, the selected parent port and the computational modules so as to perform data reduction operations, and to forward the reduced data for transmission to the selected parent port. The topology is applied by the forwarding circuitry.


These and other embodiments will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a block diagram that schematically illustrates a computing system supporting flexible in-network computing, in accordance with an embodiment that is described herein;



FIG. 2 is a block diagram that schematically illustrates a network element supporting flexible data reduction operations in the computing system of FIG. 1, in accordance with an embodiment that is described herein;



FIGS. 3A and 3B are diagrams that schematically illustrate example data reduction schemes within a network element, in accordance with embodiments that are described herein;



FIG. 4 is a flow chart that schematically illustrates a method for performing a data reduction operation in a network element, in accordance with an embodiment that is described herein; and



FIGS. 5 and 6 are diagrams that schematically illustrate upstream and downstream data reduction flows within a network element, in accordance with embodiments that are described herein.





DETAILED DESCRIPTION OF EMBODIMENTS
Overview

Embodiments that are described herein provide systems and methods for in-network computing using network elements that support flexible data reduction operations.


In-network computing involves dividing a calculation over a stream of data into multiple sub-calculations executed by network elements of a communication network. A network element may comprise any suitable network device such as, for example, a switch or a router.


In some embodiments, an in-network calculation is carried out hierarchically by multiple network elements arranged in a multi-level configuration. Network elements of the lowest level receive portions of the data stream from multiple network nodes and based on the data portions produce partial results. Elements of higher levels further aggregate the partial results up to a root network element that produces a final calculation result. The root network element typically distributes the final calculation result to some or all of the network nodes that provided the data, and/or to other network elements.


A partial result produced by a network element as part of an in-network calculation is also referred to herein as a “reduced data” and the final result produced by the root network element is also referred to herein as an “aggregated data.” A logical structure that models the hierarchical in-network calculation is referred to as a “data reduction tree.”
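
To make the hierarchy concrete, the following short sketch (not part of the original disclosure; the node names, tree shape and use of summation are assumptions chosen purely for illustration) models a data reduction tree in Python: each inner node produces reduced data from its children, and the root produces the aggregated data.

    # Hypothetical model of a data reduction tree: a leaf is a network node name,
    # an inner node is a list of children handled by one network element.
    from functools import reduce

    def reduce_tree(node, leaf_data, op):
        if isinstance(node, str):                      # leaf: data arriving from a network node
            return leaf_data[node]
        partials = [reduce_tree(child, leaf_data, op) for child in node]
        return reduce(op, partials)                    # partial result ("reduced data")

    leaf_data = {"n0": 1, "n1": 2, "n2": 3, "n3": 4}
    tree = [["n0", "n1"], ["n2", "n3"]]                # two lower-level elements under a root
    print(reduce_tree(tree, leaf_data, lambda a, b: a + b))   # 10: the aggregated data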


In-network calculations are often implemented in accordance with a data reduction protocol. An example data reduction protocol of this sort is the Scalable Hierarchical Aggregation and Reduction Protocol (SHArP™) described in U.S. Pat. No. 10,284,383 cited above. The data reduction protocol typically specifies messages that the network nodes and network elements exchange with one another for delivering data and control. Messages related to the data reduction protocol typically comprise multiple packets, wherein each of the packets comprises a transport layer header and a payload. In some embodiments, the first packet of the message comprises a header of the underlying data reduction protocol, e.g., a SHArP header.


An important requirement in implementing in-network computing is to efficiently carry out multiple complex calculations over multiple respective high-bandwidth data streams in parallel and with low latency. Some aspects of accelerating data reduction operations in hardware are described, for example, in U.S. patent application Ser. No. 16/357,356, of Elias et al., filed Mar. 19, 2019.


In principle, multiple data reduction trees may be used for modeling multiple respective in-network calculations in parallel. Such data reduction trees, however, may use separate sets of ports and computational resources across respective network elements, in which case they can maintain full port bandwidth. Reduction trees that do not share ports are also referred to as “disjoint reduction trees.”


In the disclosed embodiments, each network element comprises multiple computational modules for performing data reduction operations in hardware. In some embodiments, each port that receives data for reduction has a respective computational module. The computational modules and ports may be interconnected using configurable forwarding circuitry in various topologies. This allows flexible usage of the computational modules in separate reduction trees without sharing port bandwidth.


Consider a network element, comprising a plurality of ports coupled to network elements and/or network nodes. Ports coupled to respective child network elements or network nodes are referred to as “child ports” and ports coupled to respective parent network elements are referred to as “parent ports.” The network element further comprises multiple computational modules, configurable forwarding circuitry and a central block. The ports are configured to connect to a communication network. The multiple computational modules are configured to collectively perform a data reduction operation, in accordance with a data reduction protocol. The forwarding circuitry is configured to interconnect among the ports and the computational modules. The central block is configured to receive a request indicative of selected child ports, a selected parent port, and computational modules required for performing data reduction operations on data received via the selected child ports, for producing reduced data destined to a parent network element via the selected parent port. The central block derives from the request a topology that interconnects among the selected child ports, the selected parent port and the computational modules so as to perform the requested data reduction operations and to forward the reduced data for transmission to the selected parent port, and configures the forwarding circuitry to apply the topology.


In some embodiments, the selected child ports are configured to receive data messages comprising a reduction operation and respective data portions, and to send the reduction operation to the central block. The central block is configured to set the computational modules to apply the reduction operation to the data portions.


The central block may derive the topology in any suitable way. For example, the central block derives a topology that interconnects multiple computational modules that receive data from child ports in a chain configuration, or in an aggregated configuration that aggregates two or more chains. In some embodiments, the network element stores multiple predefined topologies, e.g., in a table in memory. In such embodiments, the central block derives a requested topology by retrieving it from the table.
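
The following sketch is a hypothetical illustration only (the request fields, port/ALU naming and helper functions are assumptions, not the network element's actual interfaces); it shows how a chain or an aggregated topology could be expressed as a list of crossbar connections derived from a request.

    # Hypothetical derivation of a forwarding-circuitry topology from a request.
    def chain_links(child_ports):
        """Serial chain: the ALU of each child port feeds the ALU of the next one."""
        return [(f"ALU{a}", f"ALU{b}") for a, b in zip(child_ports, child_ports[1:])]

    def chain_topology(child_ports, parent_port):
        return chain_links(child_ports) + [(f"ALU{child_ports[-1]}", f"PORT{parent_port}")]

    def aggregated_topology(child_ports, parent_port, aggregator="ALU_AGG"):
        half = len(child_ports) // 2
        left, right = child_ports[:half], child_ports[half:]
        return (chain_links(left) + chain_links(right) +
                [(f"ALU{left[-1]}", aggregator), (f"ALU{right[-1]}", aggregator),
                 (aggregator, f"PORT{parent_port}")])

    request = {"child_ports": [1, 2, 3, 4], "parent_port": 5, "shape": "aggregated"}
    derive = {"chain": chain_topology, "aggregated": aggregated_topology}
    topology = derive[request["shape"]](request["child_ports"], request["parent_port"])
    # The central block would then program these connections into the crossbar,
    # or fetch an equivalent predefined topology from a table in memory.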


In some embodiments, each of the parent port and the child ports comprises a Queue Pair (QP) responder and a QP requester that handle reliable transport layer communication of packets related to the data reduction protocol. Handling transport layer communication at the port level (and not by a central element such as the central block) allows fast and reliable packet delivery to and from other network elements and network nodes, at full port bandwidth.


In some embodiments, the central block receives a first data reduction request indicative of first child ports, a first parent port and first computational modules required to perform first data reduction operations on data received via the first child ports and destined to the first parent port, and further receives a second data reduction request indicative of second child ports, a second parent port, and second computational modules required to perform second data reduction operations on data received via the second child ports and destined to the second parent port. The central block derives from the first request a first topology for performing the first data reduction operations and derives from the second request a second topology for performing the second data reduction operations. The central block configures the forwarding circuitry to apply both the first topology and the second topology so as to support performing the first data reduction operations and the second data reduction operations in parallel.


The first and second topologies may use disjoint subsets of ports and computational modules. The central block may configure the forwarding circuitry to apply the derived first and second topologies so that the respective first and second data reduction operations are executed at full port bandwidth and may overlap in time.
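
As a small illustrative check (hypothetical; the link lists reuse the naming of the sketch above), two topologies can run at full port bandwidth in parallel precisely because they touch disjoint sets of ports and ALUs:

    # Hypothetical disjointness check over two derived topologies.
    def resources(topology):
        return {endpoint for link in topology for endpoint in link}

    def can_run_in_parallel(first, second):
        return resources(first).isdisjoint(resources(second))

    first  = [("ALU1", "ALU2"), ("ALU2", "PORT5")]     # first reduction tree
    second = [("ALU7", "ALU8"), ("ALU8", "PORT9")]     # second reduction tree
    print(can_run_in_parallel(first, second))          # True: no shared port or ALU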


In some embodiments, the request is indicative of the network element serving as a root network element, and the central block derives from the request a topology that interconnects among the selected child ports and the computational modules so as to perform data reduction operations for producing aggregated data and to route the aggregated data to one or more child ports.


In an embodiment, the request or a separately received request is indicative of a given parent port and one or more given child ports, and the central block is configured to derive from the request a topology that interconnects the given parent port to the one or more given child ports, for receiving aggregated data from a respective parent network element via the given parent port and distributing the aggregated data via the given child ports to respective network elements or network nodes.


In an embodiment, the forwarding circuitry comprises upstream forwarding circuitry and downstream forwarding circuitry. In this embodiment, the central block applies in parallel an upstream topology to the upstream forwarding circuitry for applying the data reduction operations, and applies a downstream topology to the downstream forwarding circuitry for distributing aggregated data produced by a root network element toward one or more network nodes.


In the disclosed techniques a network element supports flexible interconnections among ports and computational modules, without unnecessarily using computational modules for just passing data, thus refraining from bandwidth sharing. Ports that receive data for reduction have local computational modules that may be interconnected, e.g., in a serial chain having a suitable length, or in an aggregated configuration that aggregates multiple chains. This flexibility in connecting computational modules via the forwarding circuitry allows efficient usage of limited resources in performing different data reduction operations at different times, and/or in performing multiple data reduction operations in parallel without sharing port bandwidth.


System Description


FIG. 1 is a block diagram that schematically illustrates a computing system 20 supporting flexible in-network computing, in accordance with an embodiment that is described herein.


Computing system 20 may be used in various applications such as High Performance Computing (HPC) clusters, data center applications and Artificial Intelligence (AI), to name a few.


In computing system 20, multiple end nodes 28 communicate with one another over a communication network 32. “End node” 28 is also referred to herein as a “network node.” Communication network 32 may comprise any suitable type of a communication network operating using any suitable protocols such as, for example, an Infiniband™ network or an Ethernet network. End node 28 is coupled to the communication network using a Network Interface Controller (NIC) 36. In Infiniband terminology, the network interface is referred to as a Host Channel Adapter (HCA). End node 28 may comprise any suitable processing module such as, for example, a server or a multi-core processing module comprising, for example, one or more Graphics Processing Units (GPUs) or other types of accelerators. End node 28 typically comprises (not shown) multiple processing units such as Central Processing Units (CPUs) and Graphics Processing Units (GPUs), coupled via a suitable link (e.g., a PCIe) to a memory and peripheral devices, e.g., NIC 36.


Communication network 32 comprises multiple network elements 24 interconnected in a multi-level configuration that enables performing complex in-network calculations using data reduction techniques. In the present example, network elements 24 are arranged in a tree configuration having a lower level, a middle level and a top level, comprising network elements 24A, 24B and 24C, respectively. Typically, a network element 24A connects to multiple end nodes 28 using NICs 36.


A practical computing system 20 may comprise several thousands or even tens of thousands of end nodes 28 interconnected using several hundreds or thousands of network elements 24. For example, communication network 32 of computing system 20 may be configured in four-level Fat-Tree topology comprising on the order of 3,500 switches.


In the multi-level tree structure, a network element may connect to child network elements in a lower level or to network nodes, and to a parent network element in a higher level. A network element at the top level is also referred to as a root network element. A subset (or all) of the network elements of a physical tree structure may form a data reduction tree, which is a logical structure typically used for modeling in-network calculations, as will be described below.


In some embodiments, multiple network elements 24 perform a calculation for some or all of network nodes 28. The network elements collectively perform the calculation as modeled using a suitable data reduction tree. In the hierarchical calculation, network elements in lower levels produce partial results that are aggregated by network elements in higher levels of the data reduction tree. A network element serving as the root of the data reduction tree produces the final calculation result (aggregated data), which is typically distributed to one or more network nodes 28. The calculation carried out by a network element 24 for producing a partial result is also referred to as a “data reduction operation.”


The data flow from the network nodes toward the root is also referred to as “upstream,” and the data reduction tree used in the upstream direction is also referred to as an “upstream data reduction tree.” The data flow from the root toward the network nodes is also referred to as “downstream,” and the data reduction tree used in the downstream direction is also referred to as a “downstream data reduction tree.”


Breaking a calculation over a data stream to a hierarchical in-network calculation by network elements 24 is typically carried out using a suitable data reduction protocol. An example data reduction protocol is the SHArP described in U.S. Pat. No. 10,284,383 cited above.


As will be described below, network elements 24 support flexible usage of ports and computational resources for performing multiple data reduction operations in parallel. This enables flexible and efficient in-network computations in computing system 20.


Network Element Supporting Flexible Data Reduction Operations


FIG. 2 is a block diagram that schematically illustrates a network element 24 supporting flexible data reduction operations in computing system 20 of FIG. 1, in accordance with an embodiment that is described herein.


Network element 24 may be used, for example, in implementing network elements 24A, 24B and 24C in communication network 32.


Network element 24 comprises a central block 40 that manages the operation of the network element in accordance with the underlying data reduction protocol, e.g., the SHArP mentioned above. The functionality of central block 40 will be described in more detail below.


Network element 24 further comprises configurable forwarding circuitry 42, which is connected using fixed connections 44 to various elements within network element 24. Forwarding circuitry 42 is flexibly configurable to interconnect among the various elements to which it connects. This allows creating various topologies of ports and computational resources for performing data reduction operations. In an embodiment, forwarding circuitry 42 comprises a configurable crossbar switch. The flexibility in interconnections contributes to the ability to support full port bandwidth.


Network element 24 comprises multiple ports 46 for connecting the network element to communication network 32. Each of ports 46 functions both as an input port for receiving packets from the communication network and as an output port for transmitting packets to the communication network. A practical network element 24 may comprise, for example, between 64 and 128 ports 46. Alternatively, a network element having any other suitable number of ports can also be used.


In some embodiments, each port 46 is respectively coupled to a transport-layer reception module 48, denoted “TRM-RX,” and to a transport-layer transmission module 52, denoted “TRM-TX.” The input part of port 46 is coupled to TRM-RX 48 via a parser 56. TRM-RX 48 comprises a QP responder 60 and a computational module 64, which is also referred to herein as an Arithmetic Logic Unit (ALU). TRM-TX comprises QP requester 68. TRM-RX 48 further comprises a reception buffer 70 denoted RX-BUFFER for storing incoming packets. TRM-TX 52 further comprises a transmission buffer 71 denoted TX-BUFFER for storing outgoing packets.


In some embodiments, central block 40 controls the internal connectivity of forwarding circuitry 42 and the configurations of ALUs 64 so that the ports and the ALUs are interconnected in a topology suitable for performing a requested data reduction operation.


Parser 56 is configured to parse incoming packets, and to identify and send relevant packets to TRM-RX 48.


In some embodiments, parser 56 identifies that a request for applying a data reduction operation has been received and notifies central block 40 of the request. The request may be indicative of a topology required in the upstream direction, a topology required in the downstream direction or both. The same or different ports may be used in the upstream topology and in the downstream topology, respectively. The data reduction operation itself (e.g., indicative of the function to which ALUs 64 should be configured) may be specified in the request that is indicative of the topology (or topologies) or, alternatively, carried in a header of a data message.


The upstream topology supports data reduction operations on data received from certain child network elements via multiple child ports, for producing reduced data destined to a given parent network element via a parent port. The downstream topology specifies a given parent port via which aggregated data is received, and the child ports to which that aggregated data is distributed.


In the upstream direction, the request is indicative of selected child ports, a selected parent port, and computational modules required for performing data reduction operations. The central block derives from the request a topology that interconnects among the selected child ports, the selected parent port and the ALUs, so as to perform data reduction operations and to forward the resulting reduced data to the selected parent port. As noted above, the actual ALU function may be specified in the request or in a separate data message.


In some embodiments, the selected child ports receive data messages comprising a reduction operation and respective data portions and send the reduction operation to the central block, which sets the computational modules to apply the reduction operation to the data portions.


In the downstream direction, the request is indicative of a given parent port and one or more given child ports, and the central block derives from the request, a topology that interconnects the given parent port to the one or more given child ports for receiving aggregated data from a respective parent network element via the given parent port and distributing the aggregated data via the given child ports to respective network elements or network nodes.


Transport-layer modules TRM-RX 48 and TRM-TX 52 handle reliable connections with other entities via ports 46, such as ports of another network element or a port of a NIC of some network node 28. QP responder 60 in TRM-RX 48 handles reliable data reception via port 46. QP requester 68 in TRM-TX handles reliable data transmission via port 46.


In some embodiments, QP responder 60 receives packets transmitted by a corresponding QP requester, and signals back ACK/NACK notifications. QP requester 68 transmits packets to a corresponding QP responder on the other side of the link and handles re-transmissions as necessary.
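
The division of labor can be pictured with a brief sketch (hypothetical; the packet sequence numbers, retry limit and the LoopbackLink stand-in are illustrative, not the actual transport implementation): the requester resends a packet until the responder at the far end acknowledges it.

    # Hypothetical sketch of reliable delivery between a QP requester and the
    # QP responder at the far end of the link.
    class LoopbackLink:
        def send(self, psn, packet):                   # stand-in responder: always ACKs
            return "ACK"

    def send_reliably(packets, link, max_retries=3):
        for psn, packet in enumerate(packets):
            for _ in range(max_retries + 1):
                if link.send(psn, packet) == "ACK":
                    break                              # delivered, move to the next packet
            else:
                raise RuntimeError(f"packet {psn} not delivered")

    send_reliably([b"pkt0", b"pkt1"], LoopbackLink())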


Note that since each port 46 has a local QP responder 60 and a local QP requester 68, communication among the network elements (and network nodes) can be carried out at wire speed and with minimal latency. This, together with the flexible connectivity via the forwarding circuitry, allows executing multiple data reduction operations using respective disjoint data reduction trees, in parallel, at full port bandwidth.


Network element 24 comprises one or more aggregators 72, each of which comprises an ALU 74, which is identical or similar to ALU 64 of TRM-RX module 48. Aggregator 72 does not receive data directly from any port 46. Instead, aggregator 72 aggregates data output by ALUs 64 of TRM-RXs 48. Aggregator 72 may also aggregate data output by an ALU 74 of another aggregator to create a hierarchical computational topology, in an embodiment.


The functionality of ALU 64, as will be described below, also applies similarly to ALU 74. In the present example, ALU 64 (and ALU 74) comprises two inputs and a single output. Let A1 and A2 denote input arguments and let A3 denote a result calculated by the ALU. The ALU typically supports multiple predefined functions to which the ALU may be configured by the central block. When configured to a given function "F( )", the ALU calculates A3 as A3=F(A1, A2). ALUs 64 and 74 support any suitable operation such as, for example, mathematical functions such as integer and floating-point addition, multiplication and division, and logical functions such as logical AND, OR and XOR, bitwise AND, OR and XOR. Other operations supported by ALUs 64 and 74 comprise, for example, min, max, min loc, and max loc. In some embodiments, ALUs 64 and 74 support configurable operators.


In some embodiments, data received via port 46 (from a child network element or from a network node) is provided to one input of ALU 64. ALU 64 may be configured to a Null function, in which case the other input of the ALU is ignored and the data received from the port is output by ALU 64 with no modification. Alternatively, ALU 64 receives on its other input (via the forwarding circuitry) data calculated by another ALU 64, and applies the function F( ) to the data received on both inputs. ALU 74 typically receives, via the forwarding circuitry, data output by two ALUs 64. In performing a data reduction operation, the participating ALUs 64 and ALU 74 are configured by the central block to a common function F( ). Alternatively, at least some of the ALUs (64, 74 or both) assigned to a given data reduction operation may be configured to apply different functions.
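
A conceptual sketch of such an ALU follows (hypothetical; the function table and class below are illustrative only, not a hardware interface): it reduces to a configurable two-input function computing A3=F(A1, A2), with a Null setting that passes the first input through unchanged.

    import operator

    # Hypothetical two-input ALU model; the central block selects the function.
    ALU_FUNCTIONS = {
        "sum": operator.add, "prod": operator.mul,
        "min": min, "max": max,
        "and": operator.and_, "or": operator.or_, "xor": operator.xor,
        "null": lambda a1, a2: a1,                     # pass A1 through unchanged
    }

    class ALU:
        def __init__(self):
            self.f = ALU_FUNCTIONS["null"]
        def configure(self, name):                     # performed by the central block
            self.f = ALU_FUNCTIONS[name]
        def compute(self, a1, a2=None):                # A3 = F(A1, A2)
            return self.f(a1, a2)

    alu = ALU()
    alu.configure("sum")
    print(alu.compute(3, 4))                           # 7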


The output of ALU 64 may be routed via the forwarding circuitry as input to another ALU 64 (or ALU 74) as described above. Alternatively, the output of ALU 64 may be routed via the forwarding circuitry to a QP requester of the parent port for transmission to a parent network element. In a root network element, the output of the last ALU 64 that concludes the calculation specified by the underlying reduction tree may be routed to the QP requesters of the child ports participating in the downstream tree.


The configurations of computing system 20 and network element 24 in FIGS. 1 and 2, as well as network element 24 in FIGS. 5 and 6 below, are given by way of example, and other suitable computing system and network element configurations can also be used.


Some elements of network element 24, such as central block 40, forwarding circuitry 42 (possibly implemented as separate upstream crossbar 82 and downstream crossbar 84, in FIGS. 5 and 6 below), ALU 64, ALU 74, parser 56, QP responder 60, QP requester 68, reception buffer 70 and transmission buffer 71, may be implemented in hardware, e.g., in one or more Application-Specific Integrated Circuits (ASICs) or Field-Programmable Gate Arrays (FPGAs). Additionally or alternatively, some elements of the network element can be implemented using software, or using a combination of hardware and software elements.


Elements that are not necessary for understanding the principles of the present application, such as various interfaces, addressing circuits, timing and sequencing circuits and debugging circuits, have been omitted from FIGS. 1, 2, 5 and 6 for clarity.


In some embodiments, some of the functions of central block 40 may be carried out by a general-purpose processor, which is programmed in software to carry out the functions described herein. The software may be downloaded to the processor in electronic form, over a network, for example, or it may, alternatively or additionally, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.


Example Computational Configurations within Network Element


FIGS. 3A and 3B are diagrams that schematically illustrate example data reduction schemes within network element 24, in accordance with embodiments that are described herein.


In FIG. 3A, ALUs 64 denoted ALU1 . . . ALU4 are connected serially in a daisy-chain topology. The connections (in dotted lines) between successive ALUs 64 in the chain are implemented within forwarding circuitry 42 under the control of central block 40.


ALU1 . . . ALU4 receive data denoted D1 . . . D4 from child network elements (or from network nodes 28) via respective ports denoted PORT1 . . . PORT4 and are collectively configured to perform a data reduction operation. In the present example, the data reduction operation calculates the sum (D1+D2+D3+D4). To this end, ALU1 is configured to transfer D1 to the output of ALU1 and each of ALU2 . . . ALU4 calculates a sum function between its inputs. The calculation is carried out accumulatively as follows: ALU2 outputs the sum (D1+D2), ALU3 outputs the sum [(D1+D2)+D3], and ALU4 outputs the final sum {[(D1+D2)+D3]+D4}. The data reduction result (D1+D2+D3+D4) output by ALU4 is routed via forwarding circuitry 42 to a parent network element via PORT5.
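
Traced as a hypothetical sketch (example values only), the chain of FIG. 3A accumulates the sum one ALU at a time:

    # Hypothetical trace of the FIG. 3A daisy chain: ALU1 is set to Null and
    # passes D1 through; ALU2..ALU4 each add their port's data to the running sum.
    D = {"PORT1": 1, "PORT2": 2, "PORT3": 3, "PORT4": 4}

    partial = D["PORT1"]                               # output of ALU1 (Null)
    for port in ("PORT2", "PORT3", "PORT4"):           # ALU2, ALU3, ALU4 (sum)
        partial = partial + D[port]

    print(partial)                                     # 10, routed to PORT5 via the crossbar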


In FIG. 3B, ALUs 64 denoted ALU1 . . . ALU4 and ALU 74 denoted ALU5 are connected in an aggregated topology. The connections (in dotted lines) from each ALU output to the input of the next ALU are implemented within forwarding circuitry 42 under the control of central block 40.


ALU1 . . . ALU4 receive data denoted D1 . . . D4 from child network elements via respective ports denoted PORT1 . . . PORT4 and together with ALU5 are collectively configured to perform a data reduction operation, in the present example calculating the sum (D1+D2+D3+D4). The topology in FIG. 3B comprises a first chain comprising ALU1 and ALU2 and a second chain comprising ALU3 and ALU4. The forwarding circuitry connects the outputs of ALU2 and of ALU4 to the inputs of ALU5. ALU1 and ALU3 are configured to transfer their input data (D1 and D3) to their respective outputs, and each of ALU2, ALU4 and ALU5 calculates the sum of its inputs.


The chain comprising ALU1 and ALU2 calculates a partial sum (D1+D2) and the chain comprising ALU3 and ALU4 calculates a partial sum (D3+D4). ALU5 calculates the aggregated result [(D1+D2)+(D3+D4)], which the forwarding circuitry routes to PORT5 for transmission to a parent network element.
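
The same values traced through the aggregated topology of FIG. 3B (again a hypothetical sketch):

    # Hypothetical trace of FIG. 3B: two short chains reduced independently,
    # then combined by aggregator ALU5 and routed to PORT5.
    D = {"PORT1": 1, "PORT2": 2, "PORT3": 3, "PORT4": 4}

    left = D["PORT1"] + D["PORT2"]                     # chain ALU1 (Null) -> ALU2 (sum)
    right = D["PORT3"] + D["PORT4"]                    # chain ALU3 (Null) -> ALU4 (sum)
    print(left + right)                                # aggregator ALU5 output: 10, to PORT5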


In the examples of FIGS. 3A and 3B, the same sum (D1+D2+D3+D4) is calculated using two different topologies interconnecting the ports and ALUs. In FIG. 3A the calculation is accumulated over a chain of four ALUs. In FIG. 3B the calculation aggregates two short chains, and therefore the calculation latency in FIG. 3B is shorter than in FIG. 3A.


The data reduction topologies in FIGS. 3A and 3B are given by way of example, and other suitable topologies can also be used. For example, since forwarding circuitry 42 is flexibly configurable, complex topologies with multiple aggregation levels using multiple aggregators 72 can be used. Moreover, different groups of ports and ALUs may be allocated by central block 40 to perform multiple respective data reduction operations in parallel. This allows computing system 20 to perform multiple high-bandwidth in-network computations in parallel, using disjoint data reduction trees having separate respective groups of ports and ALUs within each network element.


A Method for Data Reduction


FIG. 4 is a flow chart that schematically illustrates a method for performing a data reduction operation in network element 24, in accordance with an embodiment that is described herein.


The method will be described for the upstream and downstream directions.


The method of FIG. 4 begins with central block 40 receiving a data reduction request, in accordance with a data reduction protocol, at a request reception step 100. The central block may receive the data reduction request from one or more child network elements or using some out-of-band link. The data reduction request comprises information regarding a data reduction tree to be implemented by the network element, typically as part of executing a calculation by computing system 20 using a suitable data reduction tree.


The request is indicative of selected child ports, a selected parent port and computational modules required for applying data reduction operations. The same data reduction request supports multiple different reduction operations on data that will be received from certain child network elements via the selected child ports, for producing reduced data destined to a parent network element via the selected parent port. Performing the data reduction operation typically requires data manipulation using ALUs 64, possibly with one or more ALUs 74 of aggregators 72. In the present example, the same selected child ports and selected parent port are used in both the upstream and downstream directions.


At a topology derivation step 104, central block 40 derives, from the data reduction request, a topology that interconnects among the selected child ports, the selected parent port, and computational modules (ALUs 64, 74 or both) so as to perform data reduction operations and to forward the reduced data for transmission to the selected parent port. Further at step 104, the central block configures forwarding circuitry 42 to apply the derived topology.


When the network element comprises a root network element, the topology routes the aggregated data calculated by the last ALU to the QP requesters of the relevant child ports that distribute the aggregated data in accordance with a corresponding downstream tree.


At a data message reception step 106, the central block receives header parts of data messages received from child network elements or network nodes, via the selected child ports. Each data message comprises multiple packets. The data message specifies, e.g., in the header part (e.g., in the first packet), the data reduction operation to be performed using the already configured topology. In some embodiments, parser 56 sends the header part of the data message to the central block and forwards the payload data of the data message to the relevant computational module.


At a computational module configuration step 108, central block 40 configures the computational modules that participate in the data reduction operations to a function specified in the header of the data message(s). Step 108 is relevant to the upstream direction and may be skipped in the downstream direction.


At an upstream data flow step 116, the computational modules assigned based on the data message apply the specified data reduction operation to the data payloads received in the data messages, and the resulting reduced data is sent to the parent network element via the selected parent port.


When the network element comprises a root network element, the resulting reduced data comprises the aggregated data, which is sent via the forwarding circuitry to all the selected child ports. At a downstream data flow step 120, the network element receives aggregated data from the selected parent port, and distributes the aggregated data, via the forwarding circuitry, to the selected child ports. Following step 120 the method terminates.
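
Collapsing the upstream steps of FIG. 4 into a single hypothetical sketch (the dictionary-based request, message fields and operation names are illustrative only, not the protocol's actual data structures):

    # Hypothetical end-to-end sketch of the upstream flow of FIG. 4.
    from functools import reduce

    def handle_upstream(request, data_messages):
        # Steps 100-104: derive the topology from the request (kept abstract here).
        parent_port = request["parent_port"]
        # Steps 106-108: the reduction function arrives in the data message header.
        op = {"sum": lambda a, b: a + b, "max": max}[data_messages[0]["header"]["op"]]
        # Step 116: reduce the payloads and forward the result toward the parent.
        reduced = reduce(op, (m["payload"] for m in data_messages))
        return parent_port, reduced

    messages = [{"header": {"op": "sum"}, "payload": p} for p in (1, 2, 3, 4)]
    print(handle_upstream({"child_ports": [1, 2, 3, 4], "parent_port": 5}, messages))
    # (5, 10): reduced data 10 is transmitted to the parent network element via port 5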


At steps 116 and 120 above, each QP requester of the parent port and child ports is responsible for sending the messages over a reliable transport layer connection.


In some embodiments, the method of FIG. 4 may be executed similarly assuming different upstream and downstream data reduction trees. In these embodiments, different sets of selected child ports and a parent port may be used for the respective upstream and downstream directions.


Upstream and Downstream Example Flows


FIGS. 5 and 6 are diagrams that schematically illustrate upstream and downstream data reduction flows within network element 24, in accordance with embodiments that are described herein.


In FIGS. 5 and 6, forwarding circuitry 42 comprises upstream forwarding circuitry 82 and downstream forwarding circuitry 84, which may be implemented as separate crossbar elements.


In describing FIGS. 5 and 6, it is assumed that the network element resides at a level lower than the root level. It is further assumed that forwarding circuitry 42 comprises separate upstream crossbar 82 and downstream crossbar 84, and that the same data reduction tree is used for both the upstream and downstream directions.


The flow steps in FIGS. 5 and 6 are numbered in the diagrams and will be described below.


In the upstream direction, depicted in FIG. 5, at step (1), central block 40 configures upstream crossbar 82 to connect among ports 46 and computational modules (ALUs 64 and ALU 74 of aggregator 72) in accordance with an upstream data reduction tree. In the present example, the central block configures ports 46A and 46B for receiving data for reduction from child network elements and configures port 46C for transmitting the calculated reduced data to a parent network element. Central block 40 additionally configures the upstream crossbar to connect ALUs 64A and 64B serially in a chain configuration whose output connects to an input of ALU 74. The other input of ALU 74 connects via the upstream crossbar to another chain of ALUs 64 (not shown).


At step (2), QP responders 60A and 60B of respective ports 46A and 46B receive packets of data messages from the child network elements. In the data messages, each packet comprises a transport layer header and a payload, wherein the first packet of the data message additionally comprises a SHArP header. The QP responder of each port handles the transport layer, and after sending the SHArP header to the central block forwards the payloads of the packets to ALU 64 of that port. At step (3) TRM-RX modules 48 of the child ports forward the SHArP header of the first packet to the central block, which at step (4) prepares a SHArP header for transmitting the reduced data. Further at step (4) the central block sets ALUs 64 and 74 to apply a function specified in the first packet.


At steps (5) and (6), ALUs 64A and 64B perform data reduction on the payload received in each data message via child ports 46A and 46B. At steps (7) and (8), ALU 74 of aggregator 72 receives partially reduced data from ALU 64B and from the other chain, and at step (9) ALU 74 calculates the overall reduced data. At step (10), the reduced data is forwarded to port 46C for transmission to the parent network element.


At step (11), QP requester 68C packetizes a reduced data message that contains the reduced data and the SHArP header of step (4) and rebuilds the transport layer by attaching to each packet of the reduced data message a transport layer header. QP requester 68C handles reliable transport layer packet transmission, including retransmissions. In some embodiments, the QP requester uses some storage space of a local buffer (e.g., transmission buffer 71) of the port as a retry buffer for retransmission. In some embodiments, at step (12), the network element applies a suitable scheduling scheme (not shown) for packet transmission via port 46C including, for example, bandwidth allocation and prioritization using Virtual Lane (VL) management.
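
Step (11) can be sketched as follows (hypothetical; the field names, payload slicing and fixed MTU are assumptions made for illustration, not the SHArP or transport wire formats):

    # Hypothetical packetization of a reduced data message: every packet carries a
    # transport-layer header, and only the first packet also carries the SHArP header.
    def packetize(reduced_data, sharp_header, mtu=8):
        packets = []
        for offset in range(0, len(reduced_data), mtu):
            packet = {"transport": {"psn": offset // mtu},
                      "payload": reduced_data[offset:offset + mtu]}
            if offset == 0:
                packet["sharp_header"] = sharp_header  # prepared by the central block
            packets.append(packet)                     # kept in the retry buffer until ACKed
        return packets

    msg = packetize(b"reduced-data-bytes", {"op": "sum", "tree_id": 7})
    print(len(msg), "sharp_header" in msg[0], "sharp_header" in msg[1])   # 3 True False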


In the downstream direction, depicted in FIG. 6, at step (1), central block 40 configures downstream crossbar 84 to connect among ports 46 for distributing aggregated data received from a parent network element to multiple child network elements, in accordance with a downstream data reduction tree. In the present example, the central block configures parent port 46C for receiving aggregated data from the parent network element, and configures child ports 46A and 46B for transmitting the aggregated data to respective child network elements or end nodes. Central block 40 additionally configures the downstream crossbar to forward the aggregated data to both child ports 46A and 46B in parallel.


At step (2), QP responder 60C of port 46C receives packets carrying aggregated data, in an aggregated data message, from the parent network element. In the aggregated data message, each packet comprises a transport layer header and a payload, and the first packet additionally comprises a SHArP header. QP responder 60C handles the transport layer, and after sending the SHArP header to the central block forwards the payloads of the packets to the downstream crossbar. In some embodiments, the payloads of the packets are forwarded via ALU 64C that is configured by the central block to a Null function, so that the packet payload is transferred by the ALU with no modification. In alternative embodiments, ALU 64C is bypassed, and the packet payload is forwarded directly to the downstream crossbar, as will be described at step (5) below.


At step (3) TRM-RX 48C of port 46C forwards the SHArP header of the received packet to central block 40, and at step (4) the central block prepares a SHArP header for transmitting with the aggregated data to the child network elements.


At steps (5) and (6) the downstream crossbar receives the payload of the aggregated data message and forwards the payload to both child ports 46A and 46B in parallel.


At step (7), each of QP requesters 68A and 68B packetizes an aggregated data message that contains the aggregated data and the SHArP header of step (4) and rebuilds the transport layer by attaching to each packet of the aggregated message a transport layer header. QP requesters 68A and 68B handle transport layer packet transmission, including retransmissions. As noted above, in some embodiments, the QP requester may use storage space of a local buffer (e.g., transmission buffer 71) of the port as a retry buffer for retransmission. In some embodiments, at step (8), the network element applies a suitable scheduling scheme (not shown) for packet transmission to ports 46A and 46B including, for example, bandwidth allocation and prioritization using Virtual Lane (VL) management.
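
Steps (5)-(7) of the downstream flow amount to replicating one payload toward several child ports, roughly as in this hypothetical sketch (the port identifiers are illustrative):

    # Hypothetical downstream fan-out: the downstream crossbar copies the aggregated
    # payload to every selected child port, where the local QP requester re-packetizes
    # it together with the SHArP header prepared by the central block.
    def distribute(aggregated_payload, sharp_header, child_ports):
        return {port: {"sharp_header": sharp_header, "payload": aggregated_payload}
                for port in child_ports}

    out = distribute(b"aggregated-result", {"tree_id": 7}, ["46A", "46B"])
    print(sorted(out))                                 # ['46A', '46B']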


The embodiments described above are given by way of example, and other suitable embodiments can also be used. For example, in the embodiments of FIGS. 5 and 6, upstream and downstream directions are described separately. In some embodiments, same or different respective topologies for the upstream and downstream directions may be applied in parallel using a dedicated crossbar for each direction, e.g., for calculating a reduced data and distributing the resulting aggregated data.


Although the embodiments described herein mainly address data reduction operations such as “all reduce” and “reduce” operations, the methods and systems described herein can also be used in other applications, such as in performing, for example, “reliable multicast,” “reliable broadcast,” “all gather” and “gather” operations.


It will be appreciated that the embodiments described above are cited by way of example, and that the following claims are not limited to what has been particularly shown and described hereinabove. Rather, the scope includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art. Documents incorporated by reference in the present patent application are to be considered an integral part of the application except that to the extent any terms are defined in these incorporated documents in a manner that conflicts with the definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.

Claims
  • 1. A network element, comprising: a plurality of ports, including multiple child ports coupled to respective child network elements or network nodes and one or more parent ports coupled to respective parent network elements, the plurality of ports being configured to connect to a communication network; multiple computational modules, configured to collectively perform a data reduction operation in accordance with a data reduction protocol, wherein the data reduction operation calculates, based on data received via the multiple child ports, a partial result to be sent via the parent port for aggregation with other partial results produced by one or more other network elements; configurable forwarding circuitry, configured to interconnect within the network element among the multiple child ports, the parent port, and the multiple computational modules; and a central block, configured to: receive a request indicative of selected child ports, a selected parent port, and computational modules required for performing data reduction operations on data received from the child network elements or network nodes via the selected child ports, for producing reduced data destined to a parent network element via the selected parent port; derive, from the request, a topology that interconnects within the network element among the selected child ports, the selected parent port and the computational modules so as to perform the data reduction operations and to forward the respective reduced data for transmission to the selected parent port; and configure the forwarding circuitry to apply the topology.
  • 2. The network element according to claim 1, wherein the selected child ports are configured to receive data messages comprising a reduction operation and respective data portions, and to send the reduction operation to the central block, and wherein the central block is configured to set the computational modules to apply the reduction operation to the data portions.
  • 3. The network element according to claim 1, wherein the central block is configured to derive the topology to interconnect computational modules that receive data for reduction via the selected child ports, in a chain configuration.
  • 4. The network element according to claim 1, wherein the central block is configured to derive the topology to interconnect outputs of two computational modules that receive data for reduction via the selected child ports as inputs to an aggregator computational module.
  • 5. The network element according to claim 1, wherein the selected parent port and each of the selected child ports comprise a Queue Pair (QP) responder and a QP requester, configured to respectively handle reliable transport layer reception and transmission of packets.
  • 6. The network element according to claim 1, wherein the central block is configured to: receive a first request indicative of first child ports, a first parent port and first computational modules required to perform first data reduction operations on data received via the first child ports and destined to the first parent port, and further receive a second request indicative of second child ports, a second parent port, and second computational modules required to perform second data reduction operations on data received via the second child ports and destined to the second parent port; derive, from the first request a first topology for performing the first data reduction operations and derive from the second request a second topology for performing the second data reduction operations; and configure the forwarding circuitry to apply both the first topology and the second topology so as to support performing the first data reduction operations and the second data reduction operations in parallel.
  • 7. The network element according to claim 1, wherein the request is indicative of the network element serving as a root network element, and wherein the central block is configured to derive from the request a topology that interconnects among the selected child ports and the computational modules so as to perform the data reduction operations for producing aggregated data, and to route the aggregated data to one or more child ports.
  • 8. The network element according to claim 1, wherein the request or a separately received request is indicative of a given parent port and one or more given child ports, and wherein the central block is configured to derive from the request, a topology that interconnects the given parent port to the one or more given child ports for receiving aggregated data from a respective parent network element via the given parent port, and to distribute the aggregated data via the given child ports to respective network elements or network nodes.
  • 9. The network element according to claim 1, wherein the forwarding circuitry comprises upstream forwarding circuitry and downstream forwarding circuitry, and wherein the central block is configured to apply, in parallel, an upstream topology to the upstream forwarding circuitry for applying the data reduction operations, and to apply a downstream topology to the downstream forwarding circuitry for distributing aggregated data produced by a root network element toward one or more network nodes.
  • 10. A method, comprising: in a network element comprising (i) a plurality of ports that connect to a communication network, including multiple child ports coupled to respective child network elements or network nodes and one or more parent ports coupled to respective parent network elements, (ii) multiple computational modules that collectively perform a data reduction operation in accordance with a data reduction protocol, wherein the data reduction operation calculates, based on data received via the multiple child ports, a partial result to be sent via the parent port for aggregation with other partial results produced by one or more other network elements, and (iii) configurable forwarding circuitry that interconnects within the network element among the multiple child ports, the parent port, and the multiple computational modules, receiving by a central block of the network element a request indicative of selected child ports, a selected parent port, and computational modules required for performing data reduction operations on data received via the selected child ports, for producing reduced data destined to a parent network element via the selected parent port; deriving, from the request, a topology that interconnects within the network element among the selected child ports, the selected parent port and the computational modules so as to perform data reduction operations, and to forward the reduced data for transmission to the selected parent port; and configuring the forwarding circuitry to apply the topology.
  • 11. The method according to claim 10, and comprising receiving via the selected child ports data messages comprising a reduction operation and respective data portions, sending the reduction operation to the central block, and setting the computational modules, by the central block, to apply the reduction operation to the data portions.
  • 12. The method according to claim 10, wherein deriving the topology comprises deriving the topology to interconnect computational modules that receive data for reduction via the selected child ports, in a chain configuration.
  • 13. The method according to claim 10, wherein deriving the topology comprises deriving the topology to interconnect outputs of two computational modules that receive data for reduction via the selected child ports as inputs to an aggregator computational module.
  • 14. The method according to claim 10, wherein the selected parent port and each of the selected child ports comprise a Queue Pair (QP) responder and a QP requester, and comprising respectively handling, using the QP requester and the QP responder, reliable transport layer reception and transmission of packets.
  • 15. The method according to claim 10, and comprising: receiving a first request indicative of first child ports, a first parent port and first computational modules required to perform first data reduction operations on data received via the first child ports and destined to the first parent port, and further receiving a second request indicative of second child ports, a second parent port, and second computational modules required to perform second data reduction operations on data received via the second child ports and destined to the second parent port; deriving, from the first request, a first topology for performing the first data reduction operations and deriving, from the second request, a second topology for performing the second data reduction operations; and configuring the forwarding circuitry to apply both the first topology and the second topology so as to support performing the first data reduction operations and the second data reduction operations in parallel.
  • 16. The method according to claim 10, wherein the request is indicative of the network element serving as a root network element, and comprising deriving from the request a topology that interconnects among the selected child ports and the computational modules so as to perform the data reduction operations for producing aggregated data, and routing the aggregated data to one or more child ports.
  • 17. The method according to claim 10, wherein the request or a separately received request is indicative of a given parent port and one or more given child ports, and comprising deriving, from the request, a topology that interconnects the given parent port to the one or more given child ports for receiving aggregated data from a respective parent network element via the given parent port, and distributing the aggregated data via the given child ports to respective network elements or network nodes.
  • 18. The method according to claim 10, wherein the forwarding circuitry comprises upstream forwarding circuitry and downstream forwarding circuitry, and comprising applying, in parallel, an upstream topology to the upstream forwarding circuitry for applying the data reduction operations, and applying a downstream topology to the downstream forwarding circuitry for distributing aggregated data produced by a root network element toward one or more network nodes.
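The claims above describe behavior rather than an implementation. As a rough illustration of the topology derivation recited in claim 10, the following Python sketch models a request (selected child ports, a selected parent port, and the required computational modules) and derives a simple chain of interconnections from it. All names here (ReductionRequest, derive_topology, and the port/module labels) are hypothetical and not taken from the patent.

```python
# Illustrative sketch only: hypothetical names, not the patented implementation.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ReductionRequest:
    """A request as described in claim 10: selected child ports, a selected
    parent port, and the computational modules needed for the reduction."""
    child_ports: List[int]
    parent_port: int
    modules: List[int]

@dataclass
class Topology:
    """Interconnections the forwarding circuitry would be configured to apply."""
    links: List[Tuple[str, str]] = field(default_factory=list)

def derive_topology(req: ReductionRequest) -> Topology:
    """Feed each selected child port into a computational module and forward
    the reduced result to the selected parent port."""
    topo = Topology()
    # One module per child port in this simplified sketch.
    for port, module in zip(req.child_ports, req.modules):
        topo.links.append((f"child_port[{port}]", f"module[{module}]"))
    # Chain the modules so the last one holds the fully reduced data.
    for left, right in zip(req.modules, req.modules[1:]):
        topo.links.append((f"module[{left}]", f"module[{right}]"))
    # The reduced (partial) result leaves through the selected parent port.
    topo.links.append((f"module[{req.modules[-1]}]", f"parent_port[{req.parent_port}]"))
    return topo

if __name__ == "__main__":
    request = ReductionRequest(child_ports=[0, 1, 2], parent_port=7, modules=[10, 11, 12])
    for src, dst in derive_topology(request).links:
        print(src, "->", dst)
```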
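Claim 11 has the data messages themselves carry the reduction operation, which is sent to the central block so it can set the computational modules accordingly. A minimal sketch of that idea, assuming a message layout of (opcode, data portion) tuples and an assumed opcode set; neither is specified by the patent.

```python
# Illustrative sketch of claim 11; message layout and opcode names are assumptions.
import operator
from typing import Callable, Dict, List, Tuple

OPCODES: Dict[str, Callable[[int, int], int]] = {
    "SUM": operator.add,
    "MAX": max,
    "MIN": min,
}

def central_block_configure(messages: List[Tuple[str, int]]) -> int:
    """Each data message carries a reduction opcode and a data portion.
    The opcode is handed to the central block, which sets the computational
    modules to apply that operation to the data portions."""
    opcode = messages[0][0]
    if any(op != opcode for op, _ in messages):
        raise ValueError("all child ports must request the same reduction operation")
    reduce_op = OPCODES[opcode]          # central block selects the operation
    result = messages[0][1]
    for _, portion in messages[1:]:      # computational modules apply it
        result = reduce_op(result, portion)
    return result

print(central_block_configure([("SUM", 4), ("SUM", 6), ("SUM", 10)]))  # -> 20
```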
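Claims 12 and 13 describe two ways the derived topology can interconnect the computational modules: a chain, or two modules whose outputs feed an aggregator module. The sketch below contrasts the two data flows on plain integers; the function names and payloads are illustrative only.

```python
# Illustrative sketch only; module arrangement modeled on claims 12 and 13.
from typing import Callable, Sequence

def reduce_chain(inputs: Sequence[int], op: Callable[[int, int], int]) -> int:
    """Chain configuration (claim 12): each module combines its child-port
    input with the running partial result received from the previous module."""
    partial = inputs[0]
    for value in inputs[1:]:
        partial = op(partial, value)   # module i forwards its result to module i+1
    return partial

def reduce_with_aggregator(inputs: Sequence[int], op: Callable[[int, int], int]) -> int:
    """Aggregator configuration (claim 13): the outputs of two computational
    modules are fed as inputs to an aggregator computational module."""
    mid = len(inputs) // 2
    left = reduce_chain(inputs[:mid], op)    # first computational module
    right = reduce_chain(inputs[mid:], op)   # second computational module
    return op(left, right)                   # aggregator module

if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9]
    add = lambda a, b: a + b
    assert reduce_chain(data, add) == reduce_with_aggregator(data, add) == sum(data)
```

For an associative reduction operation both arrangements yield the same value; the aggregator form shortens the dependency chain, while the chain form uses a module per child port in sequence.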
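Claim 14 places a Queue Pair (QP) responder and a QP requester on the selected parent port and on each selected child port to handle reliable transport. The toy model below does not use a real RDMA/verbs API; the class, its methods, and the assumption that the responder side terminates reception while the requester side originates transmission are illustrative assumptions only.

```python
# Toy model of claim 14; no real RDMA/verbs API is used here.
from dataclasses import dataclass

@dataclass
class PortQueuePair:
    """Each selected port holds a QP responder and a QP requester so that
    reliable transport layer reception and transmission are handled per port."""
    port: int

    def responder_receive(self, packet: bytes) -> bytes:
        # Assumed mapping: the responder side terminates reliable reception
        # before the payload is handed to the computational modules.
        return packet

    def requester_send(self, payload: bytes) -> bytes:
        # Assumed mapping: the requester side originates reliable transmission
        # toward the peer network element.
        return payload

child_qp = PortQueuePair(port=0)
parent_qp = PortQueuePair(port=7)
payload = child_qp.responder_receive(b"\x01\x02\x03")  # data arriving from a child
parent_qp.requester_send(payload)                       # reduced data sent to the parent
```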
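Claims 6 and 15 let the central block derive two topologies from two requests and apply both to the forwarding circuitry at once, so the two reductions can run in parallel. A minimal sketch, assuming the forwarding circuitry behaves like a crossbar that maps each source endpoint to a destination; the port and module labels are hypothetical.

```python
# Illustrative sketch only; the crossbar model and endpoint names are assumptions.
from typing import Dict, List, Tuple

class ForwardingCircuitry:
    """Toy crossbar: maps each source endpoint to its destination endpoint."""
    def __init__(self) -> None:
        self.routes: Dict[str, str] = {}

    def apply(self, topology: List[Tuple[str, str]]) -> None:
        for src, dst in topology:
            if src in self.routes:
                raise ValueError(f"{src} is already used by another topology")
            self.routes[src] = dst

first_topology = [("child_port[0]", "module[0]"),
                  ("child_port[1]", "module[0]"),
                  ("module[0]", "parent_port[4]")]
second_topology = [("child_port[2]", "module[1]"),
                   ("child_port[3]", "module[1]"),
                   ("module[1]", "parent_port[5]")]

circuitry = ForwardingCircuitry()
circuitry.apply(first_topology)    # first reduction tree
circuitry.apply(second_topology)   # second reduction tree, disjoint resources
print(circuitry.routes)            # both topologies coexist, so the two
                                   # reductions can proceed in parallel
```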
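Claims 7 through 9 and 16 through 18 distinguish an upstream direction (reducing data received via the child ports, possibly at a root network element) from a downstream direction (distributing the aggregated result back toward the child ports). A compact sketch of the two directions, with assumed function names and toy integer data:

```python
# Illustrative sketch only; upstream/downstream split modeled on claims 7-9 and 16-18.
from typing import Callable, Dict, List

def upstream_reduce(child_data: Dict[int, int], op: Callable[[int, int], int]) -> int:
    """Upstream direction: reduce the data arriving on the child ports."""
    values = list(child_data.values())
    result = values[0]
    for value in values[1:]:
        result = op(result, value)
    return result

def downstream_distribute(aggregated: int, child_ports: List[int]) -> Dict[int, int]:
    """Downstream direction: distribute the aggregated result produced by the
    root back toward the child ports."""
    return {port: aggregated for port in child_ports}

# A root network element aggregates and turns the result around to its child
# ports instead of forwarding it to a parent.
child_inputs = {0: 5, 1: 7, 2: 1}
aggregated = upstream_reduce(child_inputs, lambda a, b: a + b)
print(downstream_distribute(aggregated, list(child_inputs)))
```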
US Referenced Citations (199)
Number Name Date Kind
4933969 Marshall et al. Jun 1990 A
5068877 Near et al. Nov 1991 A
5325500 Bell et al. Jun 1994 A
5353412 Douglas et al. Oct 1994 A
5404565 Gould et al. Apr 1995 A
5606703 Brady et al. Feb 1997 A
5944779 Blum Aug 1999 A
6041049 Brady Mar 2000 A
6370502 Wu et al. Apr 2002 B1
6483804 Muller et al. Nov 2002 B1
6507562 Kadansky et al. Jan 2003 B1
6728862 Wilson Apr 2004 B1
6857004 Howard et al. Feb 2005 B1
6937576 Di Benedetto et al. Aug 2005 B1
7102998 Golestani Sep 2006 B1
7124180 Ranous Oct 2006 B1
7164422 Wholey, III et al. Jan 2007 B1
7171484 Krause et al. Jan 2007 B1
7313582 Bhanot et al. Dec 2007 B2
7327693 Rivers et al. Feb 2008 B1
7336646 Muller Feb 2008 B2
7346698 Hannaway Mar 2008 B2
7555549 Campbell et al. Jun 2009 B1
7613774 Caronni et al. Nov 2009 B1
7636424 Halikhedkar Dec 2009 B1
7636699 Stanfill Dec 2009 B2
7738443 Kumar Jun 2010 B2
8213315 Crupnicoff et al. Jul 2012 B2
8380880 Gulley et al. Feb 2013 B2
8510366 Anderson et al. Aug 2013 B1
8738891 Karandikar et al. May 2014 B1
8761189 Shachar et al. Jun 2014 B2
8768898 Trimmer et al. Jul 2014 B1
8775698 Archer et al. Jul 2014 B2
8811417 Bloch et al. Aug 2014 B2
9110860 Shahar Aug 2015 B2
9189447 Faraj Nov 2015 B2
9294551 Froese et al. Mar 2016 B1
9344490 Bloch et al. May 2016 B2
9563426 Bent et al. Feb 2017 B1
9626329 Howard Apr 2017 B2
9756154 Jiang Sep 2017 B1
10015106 Florissi et al. Jul 2018 B1
10158702 Bloch et al. Dec 2018 B2
10284383 Bloch et al. May 2019 B2
10296351 Kohn et al. May 2019 B1
10305980 Gonzalez et al. May 2019 B1
10318306 Kohn et al. Jun 2019 B1
10425350 Florissi Sep 2019 B1
10521283 Shuler et al. Dec 2019 B2
10541938 Timmerman et al. Jan 2020 B1
10621489 Appuswamy et al. Apr 2020 B2
20020010844 Noel et al. Jan 2002 A1
20020035625 Tanaka Mar 2002 A1
20020150094 Cheng et al. Oct 2002 A1
20020150106 Kagan et al. Oct 2002 A1
20020152315 Kagan et al. Oct 2002 A1
20020152327 Kagan et al. Oct 2002 A1
20020152328 Kagan et al. Oct 2002 A1
20030018828 Craddock et al. Jan 2003 A1
20030061417 Craddock et al. Mar 2003 A1
20030065856 Kagan et al. Apr 2003 A1
20040062258 Grow et al. Apr 2004 A1
20040078493 Blumrich Apr 2004 A1
20040120331 Rhine et al. Jun 2004 A1
20040123071 Stefan et al. Jun 2004 A1
20040252685 Kagan et al. Dec 2004 A1
20040260683 Chan et al. Dec 2004 A1
20050097300 Gildea et al. May 2005 A1
20050122329 Janus Jun 2005 A1
20050129039 Biran et al. Jun 2005 A1
20050131865 Jones et al. Jun 2005 A1
20050281287 Ninomi et al. Dec 2005 A1
20060282838 Gupta et al. Dec 2006 A1
20070127396 Jain Jun 2007 A1
20070162236 Lamblin et al. Jul 2007 A1
20080104218 Liang May 2008 A1
20080126564 Wilkinson May 2008 A1
20080168471 Benner et al. Jul 2008 A1
20080181260 Vonog et al. Jul 2008 A1
20080192750 Ko Aug 2008 A1
20080244220 Lin et al. Oct 2008 A1
20080263329 Archer et al. Oct 2008 A1
20080288949 Bohra et al. Nov 2008 A1
20080298380 Rittmeyer et al. Dec 2008 A1
20080307082 Cai Dec 2008 A1
20090037377 Archer et al. Feb 2009 A1
20090063816 Arimilli et al. Mar 2009 A1
20090063817 Arimilli et al. Mar 2009 A1
20090063891 Arimilli et al. Mar 2009 A1
20090182814 Tapolcai et al. Jul 2009 A1
20090247241 Gollnick et al. Oct 2009 A1
20090292905 Faraj Nov 2009 A1
20100017420 Archer et al. Jan 2010 A1
20100049836 Kramer Feb 2010 A1
20100074098 Zeng et al. Mar 2010 A1
20100095086 Eichenberger et al. Apr 2010 A1
20100185719 Howard Jul 2010 A1
20100241828 Yu et al. Sep 2010 A1
20110060891 Jia Mar 2011 A1
20110066649 Berlyant et al. Mar 2011 A1
20110119673 Bloch et al. May 2011 A1
20110173413 Chen et al. Jul 2011 A1
20110219208 Asaad Sep 2011 A1
20110238956 Arimilli et al. Sep 2011 A1
20110258245 Blocksome et al. Oct 2011 A1
20110276789 Chambers et al. Nov 2011 A1
20120063436 Thubert et al. Mar 2012 A1
20120117331 Krause et al. May 2012 A1
20120131309 Johnson May 2012 A1
20120216021 Archer et al. Aug 2012 A1
20120254110 Takemoto Oct 2012 A1
20130117548 Grover et al. May 2013 A1
20130159410 Lee et al. Jun 2013 A1
20130318525 Palanisamy et al. Nov 2013 A1
20130336292 Kore et al. Dec 2013 A1
20140033217 Vajda et al. Jan 2014 A1
20140047341 Breternitz et al. Feb 2014 A1
20140095779 Forsyth et al. Apr 2014 A1
20140122831 Uliel et al. May 2014 A1
20140189308 Hughes et al. Jul 2014 A1
20140211804 Makikeni et al. Jul 2014 A1
20140280420 Khan Sep 2014 A1
20140281370 Khan Sep 2014 A1
20140362692 Wu et al. Dec 2014 A1
20140365548 Mortensen Dec 2014 A1
20150106578 Warfield et al. Apr 2015 A1
20150143076 Khan May 2015 A1
20150143077 Khan May 2015 A1
20150143078 Khan et al. May 2015 A1
20150143079 Khan May 2015 A1
20150143085 Khan May 2015 A1
20150143086 Khan May 2015 A1
20150154058 Miwa et al. Jun 2015 A1
20150180785 Annamraju Jun 2015 A1
20150188987 Reed et al. Jul 2015 A1
20150193271 Archer et al. Jul 2015 A1
20150212972 Boettcher et al. Jul 2015 A1
20150269116 Raikin et al. Sep 2015 A1
20150379022 Puig et al. Dec 2015 A1
20160055225 Xu et al. Feb 2016 A1
20160105494 Reed et al. Apr 2016 A1
20160112531 Milton et al. Apr 2016 A1
20160117277 Raindel et al. Apr 2016 A1
20160179537 Kunzman et al. Jun 2016 A1
20160219009 French Jul 2016 A1
20160248656 Anand et al. Aug 2016 A1
20160299872 Vaidyanathan et al. Oct 2016 A1
20160342568 Burchard et al. Nov 2016 A1
20160364350 Sanghi et al. Dec 2016 A1
20170063613 Bloch et al. Mar 2017 A1
20170093715 McGhee et al. Mar 2017 A1
20170116154 Palmer et al. Apr 2017 A1
20170187496 Shalev et al. Jun 2017 A1
20170187589 Pope et al. Jun 2017 A1
20170187629 Shalev et al. Jun 2017 A1
20170187846 Shalev et al. Jun 2017 A1
20170199844 Burchard et al. Jul 2017 A1
20180004530 Vorbach Jan 2018 A1
20180046901 Xie et al. Feb 2018 A1
20180047099 Bonig et al. Feb 2018 A1
20180089278 Bhattacharjee et al. Mar 2018 A1
20180091442 Chen et al. Mar 2018 A1
20180097721 Matsui Apr 2018 A1
20180173673 Daglis et al. Jun 2018 A1
20180262551 Demeyer et al. Sep 2018 A1
20180285316 Thorson et al. Oct 2018 A1
20180287928 Levi et al. Oct 2018 A1
20180302324 Kasuya Oct 2018 A1
20180321912 Li Nov 2018 A1
20180321938 Boswell et al. Nov 2018 A1
20180367465 Levi Dec 2018 A1
20180375781 Chen et al. Dec 2018 A1
20190018805 Benisty Jan 2019 A1
20190026250 Das Sarma et al. Jan 2019 A1
20190065208 Liu et al. Feb 2019 A1
20190068501 Schneider et al. Feb 2019 A1
20190102179 Fleming et al. Apr 2019 A1
20190102338 Tang et al. Apr 2019 A1
20190102640 Balasubramanian Apr 2019 A1
20190114533 Ng et al. Apr 2019 A1
20190121388 Knowles et al. Apr 2019 A1
20190138638 Pal et al. May 2019 A1
20190147092 Pal et al. May 2019 A1
20190235866 Das Sarma et al. Aug 2019 A1
20190303168 Fleming, Jr. et al. Oct 2019 A1
20190303263 Fleming, Jr. et al. Oct 2019 A1
20190324431 Celia et al. Oct 2019 A1
20190339688 Cella et al. Nov 2019 A1
20190347099 Eapen et al. Nov 2019 A1
20190369994 Parandeh Afshar et al. Dec 2019 A1
20190377580 Vorbach Dec 2019 A1
20190379714 Levi et al. Dec 2019 A1
20200005859 Chen et al. Jan 2020 A1
20200034145 Bainville et al. Jan 2020 A1
20200057748 Danilak Feb 2020 A1
20200103894 Celia et al. Apr 2020 A1
20200106828 Elias et al. Apr 2020 A1
20200137013 Jin et al. Apr 2020 A1
Non-Patent Literature Citations (41)
Entry
Chapman et al., “Introducing OpenSHMEM: SHMEM for the PGAS Community,” Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, pp. 1-4, Oct. 2010.
Priest et al., “You've Got Mail (YGM): Building Missing Asynchronous Communication Primitives”, IEEE International Parallel and Distributed Processing Symposium Workshops, pp. 221-230, year 2019.
Wikipedia, “Nagle's algorithm”, pp. 1-4, Dec. 12, 2019.
Mellanox Technologies, “InfiniScale IV: 36-port 40GB/s Infiniband Switch Device”, pp. 1-2, year 2009.
Mellanox Technologies Inc., “Scaling 10Gb/s Clustering at Wire-Speed”, pp. 1-8, year 2006.
IEEE 802.1D Standard “IEEE Standard for Local and Metropolitan Area Networks—Media Access Control (MAC) Bridges”, IEEE Computer Society, pp. 1-281, Jun. 9, 2004.
IEEE 802.1AX Standard “IEEE Standard for Local and Metropolitan Area Networks—Link Aggregation”, IEEE Computer Society, pp. 1-163, Nov. 3, 2008.
Turner et al., “Multirate Clos Networks”, IEEE Communications Magazine, pp. 1-11, Oct. 2003.
Thayer School of Engineering, “A Slightly Edited Local Copy of Elements of Lectures 4 and 5”, Dartmouth College, pp. 1-5, Jan. 15, 1998 http://people.seas.harvard.edu/˜jones/cscie129/nu_lectures/lecture11/switching/clos_network/clos_network.html.
“MPI: A Message-Passing Interface Standard,” Message Passing Interface Forum, version 3.1, pp. 1-868, Jun. 4, 2015.
Coti et al., “MPI Applications on Grids: a Topology Aware Approach,” Proceedings of the 15th International European Conference on Parallel and Distributed Computing (EuroPar'09), pp. 1-12, Aug. 2009.
Petrini et al., “The Quadrics Network (QsNet): High-Performance Clustering Technology,” Proceedings of the 9th IEEE Symposium on Hot Interconnects (Hotl'01), pp. 1-6, Aug. 2001.
Sancho et al., “Efficient Offloading of Collective Communications in Large-Scale Systems,” Proceedings of the 2007 IEEE International Conference on Cluster Computing, pp. 1-10, Sep. 17-20, 2007.
Infiniband Trade Association, “InfiniBand™ Architecture Specification”, release 1.2.1, pp. 1-1727, Jan. 2008.
InfiniBand Architecture Specification, vol. 1, Release 1.2.1, pp. 1-1727, Nov. 2007.
Deming, “Infiniband Architectural Overview”, Storage Developer Conference, pp. 1-70, year 2013.
Fugger et al., “Reconciling fault-tolerant distributed computing and systems-on-chip”, Distributed Computing, vol. 24, Issue 6, pp. 323-355, Jan. 2012.
Wikipedia, “System on a chip”, pp. 1-4, Jul. 6, 2018.
Villavieja et al., “On-chip Distributed Shared Memory”, Computer Architecture Department, pp. 1-10, Feb. 3, 2011.
U.S. Appl. No. 16/357,356 office action dated May 14, 2020.
European Application # 20156490.3 search report dated Jun. 25, 2020.
Bruck et al., “Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems”, IEEE Transactions on Parallel and Distributed Systems, vol. 8, No. 11, pp. 1143-1156, Nov. 1997.
Bruck et al., “Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems”, Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures, pp. 298-309, Aug. 1, 1994.
Chiang et al., “Toward supporting data parallel programming on clusters of symmetric multiprocessors”, Proceedings International Conference on Parallel and Distributed Systems, pp. 607-614, Dec. 14, 1998.
Gainaru et al., “Using InfiniBand Hardware Gather-Scatter Capabilities to Optimize MPI All-to-All”, Proceedings of the 23rd European MPI Users' Group Meeting, pp. 167-179, Sep. 2016.
Pjesivac-Grbovic et al., “Performance Analysis of MPI Collective Operations”, 19th IEEE International Parallel and Distributed Processing Symposium, pp. 1-19, 2015.
Danalis et al., “PTG: an abstraction for unhindered parallelism”, 2014 Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, pp. 1-10, Nov. 17, 2014.
Cosnard et al., “Symbolic Scheduling of Parameterized Task Graphs on Parallel Machines,” Combinatorial Optimization book series (COOP, vol. 7), pp. 217-243, year 2000.
Jeannot et al., “Automatic Multithreaded Parallel Program Generation for Message Passing Multiprocessors using paramerized Task Graphs”, World Scientific, pp. 1-8, Jul. 23, 2001.
Stone, “An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations,” Journal of the Association for Computing Machinery, vol. 10, No. 1, pp. 27-38, Jan. 1973.
Kogge et al., “A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations,” IEEE Transactions on Computers, vol. C-22, No. 8, pp. 786-793, Aug. 1973.
Hoefler et al., “Message Progression in Parallel Computing—To Thread or not to Thread?”, 2008 IEEE International Conference on Cluster Computing, pp. 1-10, Tsukuba, Japan, Sep. 29-Oct. 1, 2008.
U.S. Appl. No. 16/430,457 Office Action dated Jul. 9, 2021.
Yang et al., “SwitchAgg: A Further Step Toward In-Network Computing,” 2019 IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking, pp. 36-45, Dec. 2019.
EP Application # 20216972 Search Report dated Jun. 11, 2021.
U.S. Appl. No. 16/782,118 Office Action dated Jun. 3, 2021.
U.S. Appl. No. 16/789,458 Office Action dated Jun. 10, 2021.
“Message Passing Interface (MPI): History and Evolution,” Virtual Workshop, Cornell University Center for Advanced Computing, NY, USA, pp. 1-2, year 2021, as downloaded from https://cvw.cac.cornell.edu/mpi/history.
Pacheco, “A User's Guide to MPI,” Department of Mathematics, University of San Francisco, CA, USA, pp. 1-51, Mar. 30, 1998.
Wikipedia, “Message Passing Interface,” pp. 1-16, last edited Nov. 7, 2021, as downloaded from https://en.wikipedia.org/wiki/Message_Passing_Interface.
U.S. Appl. No. 16/782,118 Office Action dated Nov. 8, 2021.
Related Publications (1)
Number Date Country
20210234753 A1 Jul 2021 US