The present description relates in general to high-performance system interconnects, and more particularly to, for example, without limitation, load balanced fine-grained adaptive routing in high-performance system interconnect.
High-performance computing systems include thousands of compute nodes, storage, memory and I/O components, coupled through a high-speed interconnection network. The interconnection network faces increased demands for low latency and high throughput from traditional scientific applications and emerging deep learning workloads. Conventional interconnection networks use various congestion control techniques for achieving low latency and efficient data transmission. Some systems use adaptive routing at endpoints, rather than at the switches in an interconnection network. Although this approach can be easier to implement and can suffice for smaller interconnects, the endpoint cannot react swiftly to congestion. This is because of latencies for the endpoint to become aware of the congestion so that the endpoint can modify its traffic in response.
High-performance system interconnects that support non-minimal paths need a routing algorithm that can keep a flow in-order. A flow will remain in-order if all packets of the flow follow the same path of switches and cables between the flow's source and the flow's destination. The routing algorithm should also enable non-minimal paths and minimal paths, to fully utilize the bandwidth of the fabric. The non-minimal paths are present in topologies with all-to-all connections, including HyperX, Dragonfly and Megafly. To minimize congestion and to maximize available bandwidth, all options should be utilized, including non-minimal paths, and the load should be dispersed over the interconnect.
The description provided in the background section should not be assumed to be prior art merely because it is mentioned in or associated with the background section. The background section may include information that describes one or more aspects of the subject technology.
In one or more implementations, not all of the depicted components in each figure may be required, and one or more implementations may include additional components not shown in a figure. Variations in the arrangement and type of the components may be made without departing from the scope of the subject disclosure. Additional components, different components, or fewer components may be utilized within the scope of the subject disclosure.
The detailed description set forth below is intended as a description of various implementations and is not intended to represent the only implementations in which the subject technology may be practiced. As those skilled in the art would realize, the described implementations may be modified in various different ways, all without departing from the scope of the present disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature and not restrictive.
Load-Balanced Fine-Grained Adaptive Routing in High-Performance System Interconnect
There is a need for methods and systems that address at least some of the deficiencies identified above in the Background section. Some implementations use fine-grained adaptive routing (FGAR) at a switch. The routing technique selects an output port from among candidates, for each packet arriving at the switch. Each packet is steered by up-to-date information known at the switch. This technique provides a nimble way to avoid congestion in an interconnection network, since the switch can immediately divert traffic around the congestion. This technique not only helps new traffic avoid latency from being stuck in the congested region, but also helps the congestion event clear by letting it drain without adding new load to overloaded network resources. While sending each packet on its best path at a given time works well for a given packet, a stream of packets toward a similar destination will form a burst on a port until the port selection logic is updated with the effect of the burst. This may lead to instability and sites of potential short-term congestion. To address such problems, some implementations load balance a stream of packets over all candidate output ports with a distribution pattern according to available bandwidth of the output ports. This makes link utilization more uniform and avoids bursting to any of the ports.
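The distribution described above can be sketched as a weighted pseudo-random choice. This is only an illustrative sketch, not the disclosed circuit: the function name `pick_port`, the dictionary layout, and the use of a software random number generator are assumptions made for the example.

```python
import random


def pick_port(available_bw, rng):
    """Pick one egress port with probability proportional to its available bandwidth.

    available_bw: hypothetical mapping of port number -> available bandwidth.
    rng: a random.Random instance (assumed stand-in for hardware pseudo-randomness).
    Returns None when no candidate port has any bandwidth left.
    """
    ports = [port for port, bw in available_bw.items() if bw > 0]
    if not ports:
        return None
    weights = [available_bw[port] for port in ports]
    # random.Random.choices draws with replacement according to the weights.
    return rng.choices(ports, weights=weights, k=1)[0]
```

Because every packet is drawn from the same distribution, a stream toward one destination is spread over all candidate ports instead of bursting onto whichever port looked best at one instant.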
A link is a single physical cable to an input port that has a single flow control domain: either the port can receive a packet, or its buffer is nearly full and the port cannot accept the packet. Ethernet PAUSE frames are an example of such flow control, but that is limiting. A link can instead carry a number of flow control domains within it, with independent buffering and signaling for each flow control domain. In this case, while a port might lack buffer space on a virtual lane VL0, and not accept another packet into it, another virtual lane VL1 might have space and accept a new packet. These domains are ‘within’ the physical link because they share the physical cable and its bandwidth. In a high-performance computing context, these are called virtual lanes; in Ethernet, they are called ‘priorities’ in connection with the priority flow control (PFC) standard, which has obsoleted PAUSE frames in many cases. As the Ethernet name implies, virtual lanes (VLs) are treated as if they may have different priority.
Typically, there are multiple output ports on a switch (sometimes called a switch Application Specific Integrated Circuit (ASIC)) that can serve as the next hop for a given packet. A link between two switch ASICs may have multiple parallel ports. For example, for a HyperX topology, a link is driven by K ports, where K is an integer. The value of K can depend on many factors, including switch radix (port count) and size of the fabric. An example range is 1-9 for HyperX. There may also be multiple switch ASICs that can serve as the next hop. Every combination of a next-hop switch and one of its parallel ports is a viable option for a packet. In some implementations, a stream of packets to a given destination is load balanced across all of these options. In this way, the techniques described herein differ from sending packets over only a ‘best’ path, which can cause instability in a network. For example, conventional techniques can drive a burst on one port and then, after a feedback latency, switch all packets to another port.
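As a rough illustration of independent flow control domains on one physical link, the following sketch tracks free buffer slots per virtual lane. The class name `VirtualLanePort` and the slot-counting model are assumptions for the example only; real credit accounting differs between fabrics.

```python
class VirtualLanePort:
    """Toy model of an ingress port with one buffer pool per virtual lane."""

    def __init__(self, buffer_slots):
        # buffer_slots: hypothetical mapping of virtual lane -> free buffer slots.
        self.free = dict(buffer_slots)

    def can_accept(self, vl, size=1):
        """True when the given virtual lane still has room for the packet."""
        return self.free[vl] >= size

    def accept(self, vl, size=1):
        """Consume buffer space on one lane; other lanes are unaffected."""
        if not self.can_accept(vl, size):
            return False
        self.free[vl] -= size
        return True

    def release(self, vl, size=1):
        """Return buffer space once the packet has been forwarded."""
        self.free[vl] += size
```

In this model a full VL0 refuses a packet while VL1 on the same cable still accepts one, which is the behavior a single PAUSE-style domain cannot express.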
In some implementations, there is a static table 504 per dimension. This is shown in
In some implementations, the mapping in 504 is determined by a fabric manager when the interconnect is initialized. The fabric manager determines the topology, the number of dimensions, the scale of each dimension, and the coordinates of each switch within that space. The fabric manager also determines how to carve the LID into subfields (that is, it imposes the hierarchical structure onto the LID). The fabric manager can then allocate rows of the DPT 506 per coordinate in every dimension, skipping over the coordinate of the switch in question. This allocation comprises a path to each other coordinate in each dimension of the fabric. Subsequently, the fabric manager can simply map each dimension's LID subfield value, which specifies the coordinates of the endpoint (HFI), to the proper row in the DPT 506. This is the mapping written into the tables 504 per dimension.
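The row allocation performed by the fabric manager can be sketched as follows. This is a minimal sketch under assumptions: the function name `build_static_table`, the use of `None` for the switch's own coordinate, and consecutive row numbering are illustrative choices, not details taken from the description.

```python
def build_static_table(dim_size, own_coord, first_row=0):
    """Map each coordinate of one dimension to a DPT row, skipping the local switch.

    dim_size: number of coordinates in this dimension (assumed known to the fabric manager).
    own_coord: coordinate of the switch being configured.
    first_row: first DPT row reserved for this dimension (hypothetical parameter).
    """
    table = {}
    row = first_row
    for coord in range(dim_size):
        if coord == own_coord:
            # Already aligned in this dimension, so no row is allocated.
            table[coord] = None
            continue
        table[coord] = row
        row += 1
    return table
```

For a dimension of four coordinates with the switch at coordinate 1, the table allocates three rows, one per other coordinate.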
The DPT 506 can be a complex structure, not because it's large but because a substantial volume of computation drives the values written into it. This is why it is named “dynamic”.
As shown to the right of the table 506, a comparator 604 compares each dimension's result (indicated by the three arrows emerging from the table 506) against the threshold 602, and the dimensions that have adequate capacity are passed to a secondary (or outer) load balance operation 606. In this context, the secondary load balance operation 606 can include 1 input per dimension (e.g., a maximum of 4 or 5 for exa-scale). This is much smaller scale than in the DPTs, where Minimal DPT has a scale up to K=9, and Non-Minimal DPT scales up to K=20 or more. The outer load balance operation 606 pseudo-randomly selects a dimension which meets the threshold test. In the alternative discussed above, where the stack height is carried through the DPT, the outer load balance could be weighted by stack height, similar to the DPT.
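The threshold test followed by the outer selection can be sketched in a few lines. The sketch assumes each dimension reports a (port, stack height) pair and that "adequate capacity" means a stack height at or above the threshold; the function name, the tuple layout, and the `weighted` flag are hypothetical.

```python
import random


def outer_load_balance(dim_results, threshold, rng, weighted=False):
    """Select one dimension among those that pass the threshold test.

    dim_results: hypothetical mapping of dimension -> (port, stack_height).
    Returns (dimension, port), or None when no dimension has adequate capacity.
    """
    eligible = [d for d, (_, height) in dim_results.items() if height >= threshold]
    if not eligible:
        return None
    if weighted:
        # Alternative where the stack height is carried through the DPT.
        weights = [dim_results[d][1] for d in eligible]
        dim = rng.choices(eligible, weights=weights, k=1)[0]
    else:
        dim = rng.choice(eligible)
    return dim, dim_results[dim][0]
```

With at most one input per dimension, this outer step stays small even when the per-dimension tables are large.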
The comparator 604 and the load balancer 606 perform outer load balancing 608, and spread traffic over all dimensions, while logic shown as inner load balance 610 per dimension spreads the traffic over all paths within each dimension. The load balancer 606 outputs a port number 612 for transmitting packets. In some implementations, the output 612 also includes an indication if there is enough bandwidth. This indication is used for selecting between minimal routes and non-minimal routes that use intermediate switches, as described further below in reference to
In some implementations, each cell also includes a bit indicating if the buffer size for that cell is greater than a threshold 710. The threshold 710 may be different from the threshold 602 so as to impose a bias and because the number of alternative paths is different. Bias is a way to tune the use of minimal versus non-minimal paths. For example, a high threshold in the minimal DPT would push a lot of traffic to the non-minimal cases, where the majority of the bisection bandwidth resides. The non-minimal threshold should be low because traffic should avoid escape paths. All things being equal, the non-minimal DPT will have much taller stack heights than the minimal DPT because there are so many more options. For example, the maximum stack height with 100% capacity could be 900% in the minimal DPT and 2100% in the non-minimal case. Therefore, a threshold equation like ‘10% of best case’ would have a value of 90% in the minimal DPT and 210% for the non-minimal one. The stack height of each cell entry is compared against the second threshold 710, using a comparator 712, to select one or more dimensions. A load balancer 714 selects a port corresponding to a dimension from the one or more dimensions to route the packet. The non-minimal routing is more complex because many more options are presented and must be weighed.
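The ‘10% of best case’ arithmetic above can be checked with a short sketch. Integer percentages are used to avoid rounding issues; the function names and the at-or-above comparison are assumptions for illustration.

```python
def threshold_from_best_case(best_case_pct, fraction_pct):
    """Threshold expressed as a fraction of the maximum stack height (both in percent)."""
    return best_case_pct * fraction_pct // 100


def cells_meeting_threshold(stack_heights_pct, threshold_pct):
    """Return the cells whose stack height reaches the threshold (assumed >= test)."""
    return [cell for cell, height in stack_heights_pct.items() if height >= threshold_pct]
```

For example, `threshold_from_best_case(900, 10)` gives 90 for the minimal DPT and `threshold_from_best_case(2100, 10)` gives 210 for the non-minimal DPT, matching the values in the text.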
Example Switch Architecture for Load-Balanced Fine-Grained Adaptive Routing
Referring back to
The switch 200 also includes a port sequence generation circuit (e.g., the port sequence generation circuit 208) configured to generate a port sequence that defines a pseudo-randomly interleaved sequence of a plurality of path options via the plurality of egress ports 204, based on the network capacity. Examples of port sequence generation are described above in reference to
The switch 200 also includes a routing circuit (e.g., the routing circuit 210) for routing one or more packets, received from the one or more ingress ports, towards a destination, based on the port sequence. In some implementations, the routing circuit 210 is configured to route a plurality of packets, received from one or more ingress ports 202, to the plurality of next switches, based on the dynamic port table.
In some implementations, the port sequence generation circuit 208 is configured to update the dynamic port table, based on the plurality of port sequences, after the routing circuit 210 routes a packet of the plurality of packets. Examples of updates of dynamic port table are described above in reference to
In some implementations, the interconnection network 104 includes a plurality of dimensions (e.g., as described above in reference to
In some implementations, the plurality of path options includes non-minimal routes via a corresponding intermediate switch, in addition to minimal routes without any intermediate switches. Examples of non-minimal routing are described above in reference to
Example Computing Device for Load-Balanced Fine-Grained Adaptive Routing
In some implementations, the memory 902 stores one or more programs (e.g., sets of instructions), and/or data structures, collectively referred to as “modules” herein. In some implementations, the memory 902, or the non-transitory computer readable storage medium of the memory 902, stores the following programs, modules, and data structures, or a subset or superset thereof:
Example operations of the network capacity module 906, the port sequence module 908, the dynamic port table module 910, and the routing module 912, are described below in reference to
The I/O subsystem 918 communicatively couples the system 900 to one or more devices, such as other switches 102-2, . . . , 102-M, via the interconnection network 104. In some implementations, some of the operations described herein are performed by the system 900 without any initiation by any of the switches 102-2, . . . , 102-M. For example, the system 900 automatically computes network capacity or sets up port sequences for routing packets. The communication bus 920 optionally includes circuitry (sometimes called a chipset) that interconnects and controls communications between system components.
The method also includes generating (1006) a port sequence (e.g., using the port sequence module 908) that defines a pseudo-randomly interleaved sequence of a plurality of path options via the plurality of egress ports, based on the network capacity. In some implementations, generating the port sequence includes using each path option in a fraction of time slots of the port sequence such that probability of a corresponding egress port appearing in the port sequence is proportional to the network capacity through the corresponding egress port. In some implementations, the method further includes generating a plurality of port sequences. Each port sequence defines a pseudo-randomly interleaved sequence of the plurality of path options, via the plurality of egress ports, according to the network capacity, and each port sequence corresponds to a respective next switch of a plurality of next switches. In some implementations, the method further includes generating a plurality of port sequences. Each port sequence defines a pseudo-randomly interleaved sequence of the plurality of path options, via the plurality of egress ports, according to the network capacity, and each port sequence corresponds to a respective virtual lane of a plurality of virtual lanes. In some implementations, the method further includes generating (e.g., using the port sequence module 908) a plurality of port sequences. Each port sequence pseudo-randomly interleaves the plurality of path options, via the plurality of egress ports, according to the network capacity. Each port sequence corresponds to (i) a respective virtual lane of a plurality of virtual lanes and (ii) a respective next switch of a plurality of next switches. The method also includes generating (e.g., using the dynamic port table module 910) a dynamic port table of egress port identifiers. Each row of the dynamic port table corresponds to a respective next switch of a plurality of next switches. 
Each column of the dynamic port table corresponds to a respective virtual lane of a plurality of virtual lanes, and each egress port identifier corresponds to a respective port sequence of the plurality of port sequences.
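A port sequence and a dynamic port table of this kind might be modeled as below. This is a sketch under stated assumptions: slots are allocated by rounding each port's share of capacity, the interleaving is a software shuffle, and the names `build_port_sequence` and `DynamicPortTable` are hypothetical. Sequences in the table are assumed to be non-empty.

```python
import random


def build_port_sequence(capacity, slots, rng):
    """Build a pseudo-randomly interleaved sequence of egress ports.

    Each port fills a fraction of the time slots proportional to its capacity.
    Rounding may make the result slightly shorter or longer than `slots`.
    """
    total = sum(capacity.values())
    if total == 0:
        return []
    sequence = []
    for port, cap in capacity.items():
        sequence.extend([port] * round(slots * cap / total))
    rng.shuffle(sequence)
    return sequence


class DynamicPortTable:
    """Rows are next switches, columns are virtual lanes; each cell walks a port sequence."""

    def __init__(self, sequences):
        # sequences: mapping of (next_switch, virtual_lane) -> non-empty port sequence.
        self.sequences = sequences
        self.cursor = {key: 0 for key in sequences}

    def lookup(self, next_switch, vl):
        """Egress port identifier currently held in the cell."""
        key = (next_switch, vl)
        return self.sequences[key][self.cursor[key]]

    def advance(self, next_switch, vl):
        """Update the cell after a packet has been routed through it."""
        key = (next_switch, vl)
        self.cursor[key] = (self.cursor[key] + 1) % len(self.sequences[key])
```

Calling `lookup` then `advance` for each routed packet steps the cell through its sequence, so the port identifier seen by successive packets follows the capacity-weighted interleaving.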
The method also includes receiving (1008) one or more packets via one or more ingress ports of the switch, and routing (1010) the one or more packets (e.g., using the routing module 912) towards a destination, based on the port sequence. In some implementations, the plurality of path options includes non-minimal routes via a corresponding intermediate switch, in addition to minimal routes without any intermediate switches. In some implementations, the method further includes prioritizing path options that include minimal routes over path options that include non-minimal routes, when routing the one or more packets.
In some implementations, the method further includes, in accordance with a determination that path options that include minimal routes do not meet a threshold network capacity, selecting other path options that include non-minimal routes, when routing the one or more packets.
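The preference for minimal routes with a threshold-driven fallback can be sketched as follows. The function name, the per-class thresholds as separate arguments, and the at-or-above comparison are assumptions; the returned candidate list would then feed a load balancer rather than a single "best port" choice.

```python
def select_candidates(minimal, non_minimal, minimal_threshold, non_minimal_threshold):
    """Choose the class of path options to load balance over.

    minimal / non_minimal: hypothetical mappings of egress port -> network capacity.
    Returns (route_class, candidate_ports).
    """
    good_minimal = [p for p, cap in minimal.items() if cap >= minimal_threshold]
    if good_minimal:
        # Minimal routes are prioritized whenever they meet the threshold.
        return "minimal", good_minimal
    good_non_minimal = [p for p, cap in non_minimal.items() if cap >= non_minimal_threshold]
    if good_non_minimal:
        return "non-minimal", good_non_minimal
    return "none", []
```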
In some implementations, the method also includes routing (e.g., using the routing module 912) a plurality of packets, received from one or more ingress ports, to the plurality of next switches, based on the dynamic port table. In some implementations, the method further includes updating the dynamic port table, based on the plurality of port sequences, after routing a packet of the plurality of packets. In some implementations, the interconnection network includes a plurality of dimensions, the network capacity includes information regarding capacity of the interconnection network to transmit packets towards the destination via the switch and using the plurality of dimensions, each port sequence further corresponds to a respective dimension of the plurality of dimensions, the dynamic port table includes a plurality of sub-tables of egress port identifiers, each sub-table corresponding to a respective dimension, and routing the plurality of packets further includes selecting a dimension from the plurality of dimensions, based on comparing network capacities for the interconnection network to transmit packets towards the destination using each dimension.
In some implementations, the method further includes: in accordance with a determination that network capacity for the interconnection network to transmit packets towards the destination via a first dimension of the plurality of dimensions, does not meet a predetermined threshold, forgoing selecting the first dimension for routing the plurality of packets.
In some implementations, the method further includes: in accordance with a determination that network capacity for the interconnection network to transmit packets towards the destination, via a first dimension or via a second dimension of the plurality of dimensions, meets a predetermined threshold, spreading the plurality of packets over the first dimension and the second dimension.
In some implementations, the method further includes, prior to routing the plurality of packets, for each packet: (i) extracting subfields in a header of the packet, and (ii) indexing a static lookup table for each dimension using the subfields to select a row in a respective sub-table for the dimension.
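The header parsing and static lookup can be illustrated with simple bit slicing. This sketch assumes the destination identifier is an integer whose subfields are packed from the least significant bits upward, one subfield per dimension; the field widths, function names, and table contents are hypothetical.

```python
def extract_subfields(lid, widths):
    """Split an integer identifier into per-dimension coordinates.

    widths: assumed bit width of each dimension's subfield, lowest bits first.
    """
    coords = []
    for width in widths:
        coords.append(lid & ((1 << width) - 1))
        lid >>= width
    return coords


def select_rows(lid, widths, static_tables):
    """Index each dimension's static table with its subfield to pick a sub-table row."""
    coords = extract_subfields(lid, widths)
    return [table[coord] for table, coord in zip(static_tables, coords)]
```

A row of `None` in the result marks a dimension in which the packet is already aligned with the local switch, under the convention assumed here.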
In some implementations, host interfaces may include network interface cards (NICs) or host fabric interfaces (HFIs). In some implementations, the interconnection network is called a computing fabric.
Filter with Engineered Damping for Load-Balanced Fine-Grained Adaptive Routing
Fine-Grained Adaptive Routing (FGAR) selects the best output port among candidates for each packet arriving at a switch. FGAR can be implemented using raw traffic information, but this is vulnerable to overreaction if a measurement changes abruptly. The utility of FGAR can be enhanced significantly by adding digital filtering of the measurements to stabilize the reactions. High-Precision Congestion Control (HPCC) is a datacenter Ethernet congestion control algorithm that uses an Exponentially Weighted Moving Average (EWMA) filter, but that filter is severely over-damped (i.e., no separate damping is used).
Some implementations use filtering for expanding the resolution of measurements by combining information in a time series and enable an engineered damping factor. Some implementations use damping for tuning the reaction to abrupt changes to stabilize the network. Some implementations use hop-by-hop telemetry as opposed to end-to-end telemetry. Some implementations perform filtering and damping at the switch (as opposed to an NIC). Modern fabrication techniques (e.g., a 7 nm process) enable complex or compute intensive filter pipelines.
In some implementations, the port capacity includes available buffer capacity for ingress ports of respective receiver switches coupled to the plurality of egress ports.
In some implementations, the port capacity is zero through any egress port that has a fault (e.g., a link is down).
In some implementations, the bandwidth capacity includes idle buffer in the next switch. For example, the idle buffer can include total available buffer capacity for all of the virtual lanes for all of the ingress ports of a particular receiver switch.
In some implementations, the bandwidth capacity includes configured buffer minus current buffer in the next switch. For example, the bandwidth capacity includes configured buffer minus current buffer for each virtual lane of a respective port of a respective switch.
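The two buffer-based measures above reduce to simple arithmetic, sketched here under assumed data layouts (dictionaries keyed by virtual lane and by port); the function names are illustrative.

```python
def available_per_vl(configured, current):
    """Configured buffer minus current buffer, per virtual lane, clamped at zero."""
    return {vl: max(configured[vl] - current.get(vl, 0), 0) for vl in configured}


def idle_buffer(ports):
    """Total available buffer over all virtual lanes of all ingress ports of a switch.

    ports: hypothetical mapping of port -> (configured_per_vl, current_per_vl).
    """
    return sum(
        sum(available_per_vl(configured, current).values())
        for configured, current in ports.values()
    )
```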
In some implementations, the bandwidth capacity is calculated based on one or more telemetry packets received from another switch that is coupled to the switch in an interconnection network (e.g., the interconnection network 104).
In some implementations, the function includes Exponential Weighted Moving Average.
In some implementations, the function includes a plurality of low-pass filters. In some implementations, each low-pass filter is configured to combine the port capacity for a respective egress port with the bandwidth capacity, to obtain a respective bandwidth capacity for transmitting packets to the destination via the respective egress port.
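A per-port Exponentially Weighted Moving Average can be sketched as below. The text does not specify how the port capacity and bandwidth capacity are combined before filtering; taking the minimum of the two is an assumption made only for this example, as are the class and function names.

```python
class Ewma:
    """First-order low-pass filter: value += alpha * (sample - value)."""

    def __init__(self, alpha, initial=0.0):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = initial

    def update(self, sample):
        self.value += self.alpha * (sample - self.value)
        return self.value


def filtered_capacity(filters, port_capacity, bandwidth_capacity):
    """Run one filter per egress port over the combined measurement.

    The min() combination is an assumption; a faulted port reports capacity 0,
    so its filtered value decays toward zero.
    """
    return {
        port: filters[port].update(min(cap, bandwidth_capacity))
        for port, cap in port_capacity.items()
    }
```

A smaller `alpha` damps the reaction to abrupt changes more strongly, at the cost of a slower response.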
In some implementations, the switch is connected to the next switch using a plurality of virtual lanes. The bandwidth capacity includes a respective buffer capacity for each virtual lane. The network capacity circuit 1104 is configured to compute, for each virtual lane, a respective virtual lane capacity, using a respective one or more low-pass filters, based on the port capacity and the respective buffer capacity. And the routing circuit 1106 is configured to route the one or more packets to the destination by selecting a virtual lane from the plurality of virtual lanes based on the respective virtual lane capacity.
In some implementations, the bandwidth capacity includes idle buffers in a path to the destination that includes an intermediate switch.
In some implementations, the bandwidth capacity includes (i) a first buffer capacity corresponding to idle buffers in a first path to the destination via a first intermediate switch, and (ii) a second buffer capacity corresponding to idle buffers in a second path to the destination via a second intermediate switch. The network capacity circuit 1104 is configured to: compute a first network capacity for transmitting packets to the destination, via a first port, using a first low-pass filter, based on the port capacity and the first buffer capacity; and compute a second network capacity for transmitting packets to the destination, via the plurality of egress ports, using a second low-pass filter, based on the port capacity and the second buffer capacity. The routing circuit 1106 is configured to route the one or more packets by selecting between the first path and the second path, based on the first network capacity and the second network capacity.
In some implementations, host interfaces may include network interface cards (NICs) or host fabric interfaces (HFIs). In some implementations, the interconnection network is called a computing fabric.
Examples of Filters
Suppose a cable fails in a system. An FGAR response is to steer packets to alternate routes. This scenario can lead to rapid changes in buffer capacities (e.g., heights of the thermometers described above). As a result, the whole fabric could distort quickly and oscillate. What is desired instead is a smooth transition and bounce back, which is generally described as damping. This is accomplished using filtering. Filtering may include averaging and can combine many low-precision measurements to produce a higher-precision measurement. Filtering can also use telemetry when available. Telemetry may include buffer telemetry and fault telemetry.
In the example shown in
In
In some implementations, filters do not use multiply or divide operations or circuits, but instead use shifts. Because the filter constants are known at configuration time, the filters can be configured to use efficient arithmetic operations.
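One common way to build such a filter without multipliers is to fix the filter constant at a power of two, so the update needs only adds, subtracts, and shifts. The sketch below is an assumed illustration of that idea, not the disclosed pipeline; the accumulator keeps the filtered value scaled by 2**shift so that the output converges exactly to a constant input.

```python
class ShiftEwma:
    """Integer EWMA with constant 1 / 2**shift, using shifts instead of multiply or divide."""

    def __init__(self, shift):
        self.shift = shift  # filter constant is 1 / 2**shift, fixed at configuration time
        self.acc = 0        # filtered value scaled by 2**shift

    def update(self, sample):
        # acc <- acc - acc / 2**shift + sample, computed with a right shift.
        self.acc += sample - (self.acc >> self.shift)
        return self.acc >> self.shift
```

With `shift=2` and a constant input of 16, the first outputs are 4, 7, 9, 11, and the output settles at 16, showing the damped approach to a step change.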
Example Computing Device for Bandwidth Capacity Filters
In some implementations, the memory 1502 stores one or more programs (e.g., sets of instructions), and/or data structures, collectively referred to as “modules” herein. In some implementations, the memory 1502, or the non-transitory computer readable storage medium of the memory 1502, stores the following programs, modules, and data structures, or a subset or superset thereof:
Example operations of the port and bandwidth capacity module 1506, the network capacity module 1508, the low-pass filter module 1510, and the routing module 1512, are described below in reference to
The I/O subsystem 1518 communicatively couples the system 1500 to one or more devices, such as other switches 102-2, . . . , 102-M, via the interconnection network 104. In some implementations, some of the operations described herein are performed by the system 1500 without any initiation by any of the switches 102-2, . . . , 102-M. For example, the system 1500 automatically computes network capacity and/or sets up port sequences for routing packets. The communication bus 1520 optionally includes circuitry (sometimes called a chipset) that interconnects and controls communications between system components.
The method includes obtaining (1604) port capacity (e.g., using the port and bandwidth capacity module 1506) for a plurality of egress ports configured to couple the switch to a next switch. In some implementations, the port capacity includes available buffer capacity for ingress ports of respective receiver switches coupled to the plurality of egress ports. In some implementations, the port capacity is zero through any egress port that has a fault.
The method also includes obtaining (1604) bandwidth capacity (e.g., using the port and bandwidth capacity module 1506) for transmitting packets to a destination. In some implementations, the bandwidth capacity includes idle buffer in the next switch. In some implementations, the bandwidth capacity includes configured buffer minus current buffer in the next switch. In some implementations, the bandwidth capacity is calculated based on one or more telemetry packets received from another switch of the interconnection network.
The method also includes computing (1606) network capacity (e.g., using the network capacity module 1508), for transmitting packets to the destination, via the plurality of egress ports, based on a function of the port capacity and the bandwidth capacity. In some implementations, the function includes Exponential Weighted Moving Average. In some implementations, the function includes a plurality of low-pass filters. In some implementations, each low-pass filter combines the port capacity for a respective egress port with the bandwidth capacity, to obtain a respective bandwidth capacity for transmitting packets to the destination via the respective egress port.
The method also includes receiving one or more packets via one or more ingress ports of the switch. The method also includes routing (e.g., using the routing module 1512) the one or more packets to the destination, via the plurality of egress ports, with bandwidth proportional to the network capacity.
In some implementations, the switch is connected to the next switch using a plurality of virtual lanes, the bandwidth capacity includes a respective buffer capacity for each virtual lane, computing the network capacity includes computing, for each virtual lane, a respective virtual lane capacity, using a respective one or more low-pass filters, based on the port capacity and the respective buffer capacity, and routing the one or more packets to the destination includes selecting a virtual lane from the plurality of virtual lanes based on the respective virtual lane capacity.
In some implementations, the bandwidth capacity includes idle buffers in a path to the destination that includes an intermediate switch.
In some implementations, the bandwidth capacity includes (i) a first buffer capacity corresponding to idle buffers in a first path to the destination via a first intermediate switch, and (ii) a second buffer capacity corresponding to idle buffers in a second path to the destination via a second intermediate switch. In such instances, computing the network capacity includes: computing a first network capacity for transmitting packets to the destination, via a first port, using a first low-pass filter, based on the port capacity and the first buffer capacity; and computing a second network capacity for transmitting packets to the destination, via the plurality of egress ports, using a second low-pass filter, based on the port capacity and the second buffer capacity. In such instances, routing the one or more packets includes selecting between the first path and the second path, based on the first network capacity and the second network capacity.
Telemetry-Based Load-Balanced Fine-Grained Adaptive Routing
Fine-grained adaptive routing (FGAR) picks the best output port among candidates for each packet arriving at a switch. As described above, information local to the switch, such as credit counts and link fault information, is sufficient for useful FGAR. But the utility of FGAR can be enhanced significantly by adding telemetry, so that data from other switches can be used to guide port selection. Telemetry reduces the hazard of sending FGAR traffic down a ‘blind alley’ where the traffic will face a blockage after the present switch hop. Conventional interconnects use a form of telemetry called Explicit Congestion Notification (ECN). This technique supplies only a 1-bit measurement and suffers from latency and congestion effects, as the forward ECN (FECN) signal must propagate to the forward endpoint and then be reflected as a backward ECN (BECN) across the fabric. HPCC improved upon ECN with many-bit measurements but exacerbated the scaling inefficiencies of per-connection, even per-packet, signaling. Both of these techniques rely upon the endpoints to make use of the telemetry, adding latency and costs for large-scale deployment.
Some implementations send messages among the switch ASICs themselves, allowing all traffic to make use of each measurement. This technique addresses the scaling problem and reduces the latency of distributing the telemetry to the site where the telemetry is used. Some implementations use multi-bit measurements for higher-resolution information than ECN can provide. Some implementations perform the techniques described herein at a high rate of messaging selected to manage the latency of the control loop protecting the switch buffers from blocking, while minimizing telemetry bandwidth consumption.
The buffer capacity circuit 1702 is configured to obtain local buffer capacity for a plurality of buffers configured to buffer transmitted across the interconnect via the switch. In some implementations, the local buffer capacity includes credit counts for the plurality of egress ports configured to couple the switch to a next switch.
The telemetry circuit 1704 is configured to receive a plurality of telemetry flow control units from a plurality of next switches coupled to the switch. Each telemetry flow control unit corresponds to buffer capacity at a respective next switch. In some implementations, the telemetry circuit 1704 is also configured to receive, from the plurality of next switches, link fault information for a plurality of links configured to couple the plurality of switches to one or more switches of the interconnection network. For example, link fault information may be advertised (or encoded in a telemetry flow control unit) as a capacity of 0 for a broken link. In some implementations, the telemetry circuit 1704 is configured to generate a plurality of new telemetry flow control units based on the buffer capacity and the plurality of telemetry flow control units (e.g., by summarizing or sub-setting the old telemetry flow control units). Telemetry flow control units are similar to link-control information, are sent one-hop, so no need of destination information. In some implementations, the telemetry circuit 1704 is configured to transmit the plurality of new telemetry flow control units to a plurality of preceding switches coupled to the switch. The plurality of preceding switches is configured to route packets based on the plurality of new telemetry flow control units (i.e., remote switches are configured to use telemetry for routing decisions). In some implementations, the telemetry circuit 1704 is configured to obtain a telemetry tree that includes the switch as a root and the plurality of next switches coupled to the switch, as nodes of the tree, according to a topology of an interconnection network (e.g., the interconnection network 104). In some implementations, the telemetry circuit 1704 is configured to generate (e.g., summarize or subset) the plurality of new telemetry flow control units further based on the telemetry tree. 
In some implementations, the telemetry tree includes a first set of next switches of the plurality of next switches, in a first level of the telemetry tree. The telemetry tree includes a second set of next switches of the plurality of next switches, in a second level of the telemetry tree. Each switch in the first set of next switches is directly connected to the switch, in the topology of the interconnection network. Each switch in the second set of next switches is indirectly connected to the switch, via the first set of next switches, in the topology of the interconnection network. In some implementations, the telemetry circuit 1704 is configured to generate the plurality of new telemetry flow control units by generating a telemetry block of flow control units that includes (i) per virtual lane buffer capacity information for each of the first set of next switches and (ii) consolidated buffer capacities for all virtual lanes for the second set of next switches. In some implementations, bit-widths for measurements of buffer capacities in the plurality of new telemetry flow control units are defined based on a telemetry update period (e.g., frequency of telemetry updates, such as a 1 microsecond telemetry period with 4-bit measurements). In some implementations, the telemetry circuit 1704 is configured to determine the telemetry update period based on buffer capacities of switches in the interconnect. In some implementations, the telemetry circuit 1704 is configured to define the size of each of the plurality of new telemetry flow control units based on the number of switches the switch is directly connected to. In some implementations, the telemetry circuit 1704 is configured to define the size of each of the plurality of new telemetry flow control units based on a predetermined congestion control bandwidth of the interconnection network. Congestion control bandwidth is the product of the size of a single set of telemetry and its transmission rate.
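The two-level summarization described above can be illustrated with a minimal sketch. It assumes that "consolidated" means a simple sum over virtual lanes, which the description does not specify; the function name `build_telemetry_block` and the dictionary layout are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def build_telemetry_block(
    first_level: Dict[str, List[int]],
    second_level: Dict[str, List[int]],
) -> Dict[str, Dict[str, object]]:
    """Summarize received telemetry into one block for preceding switches.

    first_level:  per-VL buffer capacities of directly connected next switches.
    second_level: per-VL buffer capacities of indirectly connected next switches.
    (Hypothetical layout; summing is an assumed consolidation rule.)
    """
    return {
        # (i) keep per-virtual-lane detail for the first level of the tree
        "per_vl": {sw: list(caps) for sw, caps in first_level.items()},
        # (ii) collapse all virtual lanes into one value for the second level
        "consolidated": {sw: sum(caps) for sw, caps in second_level.items()},
    }


block = build_telemetry_block({"A": [3, 1]}, {"B": [2, 5]})
```

In this sketch the second-level entries shrink from one value per virtual lane to one value per switch, which is where the reduction in telemetry size comes from.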
A fabric manager can therefore trade off this bandwidth against the sizes of the measurement fields and the period of telemetry transmission. The count of telemetry fields per block is dependent on fabric scale and topology. So, for example, the fabric manager could maintain a constant telemetry bandwidth across different deployments by slowing the transmission rate for large deployments and/or shrinking their measurements. Telemetry bandwidth for small deployments is naturally less than the bandwidth for larger deployments.
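The trade-off can be made concrete with a short calculation. The following is a minimal sketch that ignores header overhead; the function names and the example field counts are hypothetical and are not taken from any particular deployment.

```python
def telemetry_bandwidth_bps(num_fields: int, bits_per_field: int, period_s: float) -> float:
    """Congestion control bandwidth: size of one telemetry set times its rate.

    Header overhead is deliberately left out (assumption of this sketch).
    """
    set_size_bits = num_fields * bits_per_field
    return set_size_bits / period_s


def period_for_budget(num_fields: int, bits_per_field: int, budget_bps: float) -> float:
    """Telemetry period needed to keep a given field count within a bandwidth budget."""
    return num_fields * bits_per_field / budget_bps


# A block of 480 four-bit measurements sent every microsecond.
small = telemetry_bandwidth_bps(480, 4, 1e-6)
# A deployment with twice as many fields keeps the same budget by doubling the period.
slower_period = period_for_budget(960, 4, small)
```

Shrinking `bits_per_field` instead of lengthening the period would hold the bandwidth constant in the same way, at the cost of coarser measurements.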
The network capacity circuit 1706 is configured to compute network capacity, for transmitting packets to a destination, via a plurality of egress ports, based on the plurality of telemetry flow control units and the local buffer capacity. For example, for minimal routing (e.g., using the first-hop or minimal route DPT described above), the telemetry flow control units provide two measurements used for a specific capacity (sometimes called Capacity) through a link on a given VL. These measurements are (i) the difference between a configured buffer depth and the actual buffer depth for the VL (sometimes called diff_cfg_actual_VLX) at the next-hop switch, and (ii) the idle buffer space (sometimes called idle buffer) in the same next-hop switch. In some implementations, configured buffer depth per VL is computed by the fabric manager and written into each switch. A weighted sum of these values represents capacity to accept new traffic on the VL over a first timescale. Weights for each VL (sometimes called idle_weight_VLX) may be written by the fabric manager, or may be determined dynamically at the switch (e.g., based on monitoring traffic on a VL, and/or traffic via the switch). Credits available on the egress port in question, a measurement local to the switch ASIC, indicate capacity to move a new packet to that switch (e.g., capacity to move a packet over a second timescale that is shorter than the first timescale). Sufficient credits are needed to send the current packet to that port, including a margin of buffer space (sometimes called credit threshold, provided by the fabric manager). Faults on the port, also local to the ASIC, are impediments to reaching that switch. An example computation is shown below as pseudo-code:
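As an illustration only, the quantities named above can be combined as in the following sketch. It assumes that the weighted sum is the VL difference plus the idle buffer scaled by the per-VL weight, that a fault or insufficient credits gates the capacity to zero, and that negative results are clamped to zero; none of these choices is confirmed by the description, and the function name `link_capacity` is hypothetical.

```python
def link_capacity(
    diff_cfg_actual_vl: float,  # configured minus actual buffer depth for the VL (telemetry)
    idle_buffer: float,         # idle buffer space at the next-hop switch (telemetry)
    idle_weight_vl: float,      # per-VL weight for the idle buffer term
    credits: int,               # credits available on the egress port (local)
    packet_size: int,           # credits the current packet needs
    credit_threshold: int,      # margin of buffer space from the fabric manager
    port_fault: bool,           # fault on the egress port (local)
) -> float:
    """Capacity through one link on one VL (assumed form, not a definitive formula)."""
    if port_fault:
        return 0.0  # a faulted port cannot reach the next switch
    if credits < packet_size + credit_threshold:
        return 0.0  # not enough credits to send the packet with margin
    # Assumed weighted sum of the two telemetry measurements, clamped at zero.
    return max(0.0, diff_cfg_actual_vl + idle_weight_vl * idle_buffer)


example = link_capacity(10, 4, 0.5, credits=8, packet_size=2, credit_threshold=1, port_fault=False)
```

The credit and fault checks use information local to the switch and react on the shorter timescale, while the weighted sum reflects telemetry that changes on the longer timescale.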
In some implementations, the network capacity circuit 1706 is configured to compute the network capacity further based on the link fault information. In some implementations, the link fault information is received as part of the plurality of telemetry flow control units. For example, switch and cable faults are inferred or directly signaled, cutting filter bandwidth to 0 for such links, and/or the ports affected are not used for routing. In some implementations, the telemetry values include lane degrade. In some implementations, the plurality of telemetry flow control units includes cyclic redundancy check (CRC) information, and the network capacity circuit 1706 is configured to discard one or more telemetry flow control units, from the plurality of telemetry flow control units, according to the CRC information, while computing the network capacity.
The switch 1700 is configured to receive one or more packets via one or more ingress ports (e.g., the ingress ports 202) of the switch.
The routing circuit 1708 is configured to route the one or more packets to the destination, via the plurality of egress ports, with bandwidth proportional to the network capacity. Examples of these operations are described above in reference to
In some implementations, host interfaces may include network interface cards (NICs) or host fabric interfaces (HFIs). In some implementations, the interconnection network is called a computing fabric.
Example Telemetry Tree for Routing
To illustrate telemetry encoding, suppose there are 8 virtual lanes plus a special virtual lane connecting a switch to another switch in a dimension, and 48 ports for each virtual lane. A telemetry block may then include 480 measurements (48 ports times the sum of 8 virtual lanes, 1 special virtual lane, and 1 value for an idle buffer). Some implementations store this information such that it can be indexed based on the coordinate for the other switch and the virtual lane connecting the switch to the other switch. Note that parallel ports within a link or virtual lane (i.e., K>1) are not shown in
In some instances, the system has visibility only to per-virtual lane buffers on an aligned switch ASIC. In some instances, the system has access to information for the second hop of non-minimal routes. In some implementations, buffer space is shared between virtual lanes, so total buffer space per virtual lane is easy to measure. Some implementations calculate available buffer per VL as configured buffer per VL minus measured buffer per VL, similar to Ethernet's committed bandwidth per TC. Some implementations receive data, per dimension, from directly connected switches, the data including sets of available buffer space per VL and idle buffer. Suppose there are 9 VLs; the data then includes 10 values (one value per VL and another value for the idle buffer). This corresponds to one-hop data for minimal or non-minimal routing. The data is indexed by the aligned switch's coordinate. For non-minimal routing, some implementations receive, per dimension, second hop idle buffer to the other switches in the dimension. This data is indexed by aligned and intermediate switch coordinates. In some implementations, switch and cable faults are inferred or directly signaled. In some implementations, telemetry values include lane degrade. In some implementations, faults cut filter bandwidth to 0 for links (e.g., the ports that are affected are not used for routing). In some implementations, because dynamic route tables have flexible mapping to dimension, the read-side interface from telemetry RAM to the filters is also indexed by dimension. Some implementations retrieve data from a RAM table that holds telemetry values received, based on which dimension the telemetry belongs to.
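The indexing of one-hop telemetry by switch coordinate and virtual lane can be sketched as a flat table. The layout below (10 consecutive values per aligned switch, with the idle buffer in the last slot) and the names `telemetry_index`, `SPECIAL_VL`, and `IDLE_SLOT` are assumptions made for illustration, using the example figures of 48 ports, 8 virtual lanes, 1 special virtual lane, and 1 idle-buffer value.

```python
NUM_PORTS = 48                    # aligned switches reachable in the example
NUM_VLS = 8                       # ordinary virtual lanes
SPECIAL_VL = NUM_VLS              # slot 8: the special virtual lane
IDLE_SLOT = NUM_VLS + 1           # slot 9: idle buffer value
VALUES_PER_SWITCH = NUM_VLS + 2   # 10 values per aligned switch


def telemetry_index(coord: int, slot: int) -> int:
    """Flat index of one measurement, keyed by aligned-switch coordinate and slot."""
    if not (0 <= coord < NUM_PORTS and 0 <= slot < VALUES_PER_SWITCH):
        raise IndexError("coordinate or slot out of range")
    return coord * VALUES_PER_SWITCH + slot


# One entry per measurement in the telemetry block.
table = [0] * (NUM_PORTS * VALUES_PER_SWITCH)
table[telemetry_index(5, 3)] = 7           # capacity on VL 3 towards switch 5
table[telemetry_index(5, IDLE_SLOT)] = 12  # idle buffer at switch 5
```

Second-hop data for non-minimal routing would need an additional key for the intermediate switch coordinate, which this sketch leaves out.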
In some implementations, the switch ASIC performs measurements of its buffer capacity and generates a telemetry value to send through the main data fabric (in-band) for consumption by other switches. Some implementations terminate the telemetry in switch ASICs for their use in routing decisions. Some implementations propagate telemetry through switches through a process of summarizing or sub-setting the telemetry and local information at a given switch. This process helps scale the telemetry system effectively by consuming less link bandwidth to carry the telemetry, and also helps reduce the cost of the routing hardware. The efficiency of this system permits a high rate of telemetry measurements, providing low latency information for routing decisions. This is key to making routing responsive to bursts of congestion in high-bandwidth interconnects (e.g., interconnects with over 400 Gbps).
Example Telemetry Format
A standard packet has unnecessary overhead for telemetry information that is passed only across a single link. For example, there is no need for a destination address for data that is not directly passed through any switch. Telemetry may use an unreliable transport. In other words, telemetry can be lossy. Telemetry can also be small enough for storing cyclic redundancy checks (CRC). A telemetry block is relatively small when compared to a standard data packet, even when the telemetry block includes information for non-minimal routing. Telemetry data is more like link-control information, such as auto-negotiation messages. For these reasons, some implementations use specific flow control unit (flit) definitions for telemetry. For telemetry, quantization size is large (e.g., one flit may be 62 to 64 bits of payload plus some overhead). In some implementations, telemetry block format and/or size is determined based on congestion control bandwidth. For example, congestion control bandwidth can be estimated to make sure that the bandwidth does not step up beyond a predetermined threshold (e.g., a threshold based on the interconnect topology, number of endpoints, and/or application). As an example, for a telemetry implementation with a 1 microsecond telemetry period and 4-bit measurements, the telemetry consumes 1.8% of link bandwidth plus headers. Some implementations use a timescale of updates or frequency of telemetry depending on switch buffer size. Some implementations optimize the telemetry block format for a number of flits (e.g., 4 flits per telemetry block). Some implementations use new control flit types. Some implementations include a signal for a fault. Some implementations use a unified format for telemetry block for enabling different forms of telemetry, including fault information.
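To show how small measurements fit into telemetry flits, the following sketch packs 4-bit values into 64-bit flit payloads and unpacks them again. The little-endian nibble order, the 64-bit payload, and the names `pack_measurements` and `unpack_measurements` are assumptions for illustration; flit headers, CRC fields, and fault signals are omitted.

```python
from typing import List


def pack_measurements(values: List[int], bits: int = 4, payload_bits: int = 64) -> List[int]:
    """Pack fixed-width measurements into flit payload words (assumed layout)."""
    per_flit = payload_bits // bits
    flits = []
    for start in range(0, len(values), per_flit):
        word = 0
        for i, value in enumerate(values[start:start + per_flit]):
            if not 0 <= value < (1 << bits):
                raise ValueError("measurement does not fit in the field width")
            word |= value << (i * bits)  # lowest field first
        flits.append(word)
    return flits


def unpack_measurements(flits: List[int], count: int, bits: int = 4,
                        payload_bits: int = 64) -> List[int]:
    """Recover the first `count` measurements from flit payload words."""
    per_flit = payload_bits // bits
    mask = (1 << bits) - 1
    out: List[int] = []
    for word in flits:
        for i in range(per_flit):
            if len(out) == count:
                return out
            out.append((word >> (i * bits)) & mask)
    return out


measurements = [i % 16 for i in range(40)]
flits = pack_measurements(measurements)
```

With 4-bit fields, each 64-bit payload carries 16 measurements, so the 40 values in the example occupy 3 flits, the last of which is only partly filled.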
Some implementations use telemetry to drive load balancing filters. For example, some implementations use the telemetry obtained using techniques described herein and combine that information with local information available to a switch, to make routing decisions. In some implementations, non-minimal routing assumes the second hop is available for a packet sent in that direction and uses telemetry to provide that information.
Example Computing Device for Telemetry for Load-Balanced Fine-Grained Adaptive Routing
In some implementations, the memory 2102 stores one or more programs (e.g., sets of instructions), and/or data structures, collectively referred to as “modules” herein. In some implementations, the memory 2102, or the non-transitory computer readable storage medium of the memory 2102, stores the following programs, modules, and data structures, or a subset or superset thereof:
Example operations of the buffer capacity module 2106, the telemetry module 2108, the network capacity module 2110, and the routing module 2112, are described below in reference to
The I/O subsystem 2118 communicatively couples the system 2100 to one or more devices, such as other switches 102-2, . . . , 102-M, via the interconnection network 104. In some implementations, some of the operations described herein are performed by the system 2100 without any initiation by any of the switches 102-2, . . . , 102-M. For example, the system 2100 automatically computes network capacity and/or sets up port sequences for routing packets. The communication bus 2120 optionally includes circuitry (sometimes called a chipset) that interconnects and controls communications between system components.
The method includes obtaining (2204) local buffer capacity (e.g., using the buffer capacity module 2106) for a plurality of buffers configured to buffer packets transmitted across the interconnection network via the switch. In some implementations, the buffer capacity includes credit counts for the plurality of egress ports configured to couple the switch to a next switch.
The method also includes receiving (2206) a plurality of telemetry flow control units (e.g., using the telemetry module 2108) from a plurality of next switches coupled to the switch, wherein each telemetry flow control unit corresponds to buffer capacity at a respective next switch. In some implementations, the method includes receiving, from the plurality of next switches, link fault information for a plurality of links configured to couple the plurality of next switches to one or more switches of the interconnection network. For example, link fault information may be advertised (or encoded in a telemetry flow control unit) as a capacity of 0 for a broken link. In some implementations, the link fault information is received as part of the plurality of telemetry flow control units (e.g., switch and cable faults are inferred or directly signaled, cutting filter bandwidth to 0 for such links, so that the ports affected are not used for routing). In some implementations, minimal or non-minimal values include lane degrade. In some implementations, the method includes generating a plurality of new telemetry flow control units based on the buffer capacity and the plurality of telemetry flow control units (e.g., by summarizing or sub-setting the old telemetry flow control units), and transmitting the plurality of new telemetry flow control units to a plurality of preceding switches coupled to the switch. The plurality of preceding switches is configured to route packets based on the plurality of new telemetry flow control units (i.e., remote switches are configured to use telemetry for routing decisions).
In some implementations, the method further includes obtaining a telemetry tree that includes the switch as a root and the plurality of next switches coupled to the switch, as nodes of the tree, according to a topology of the interconnection network; and generating (e.g., summarizing or sub-setting) the plurality of new telemetry flow control units further based on the telemetry tree. In some implementations, the telemetry tree includes a first set of next switches of the plurality of next switches, in a first level of the telemetry tree. In some implementations, the telemetry tree includes a second set of next switches of the plurality of next switches, in a second level of the telemetry tree; each switch in the first set of next switches is directly connected to the switch, in the topology of the interconnection network; each switch in the second set of next switches is indirectly connected to the switch, via the first set of next switches, in the topology of the interconnection network; and generating the plurality of new telemetry flow control units includes generating a telemetry block of flow control units that includes (i) per virtual lane buffer capacity information for each of the first set of next switches and (ii) consolidated buffer capacities for all virtual lanes for the second set of next switches. In some implementations, the size of measurements is tuned to the telemetry period. For example, the method further includes defining bit-widths for measurements of buffer capacities in the plurality of new telemetry flow control units based on a telemetry update period (as described above in reference to
The method also includes computing (2208) network capacity (e.g., using the network capacity module 2110), for transmitting packets to a destination, via a plurality of egress ports, based on the plurality of telemetry flow control units and the local buffer capacity. In some implementations, the method includes computing the network capacity further based on the link fault information. Examples of computing network capacity are described above in reference to the network capacity circuit 1706, according to some implementations. The operations of the network capacity circuit 1706 may be implemented as a software module (e.g., in the network capacity module 2110).
The method also includes receiving one or more packets via one or more ingress ports of the switch, and routing (2210) the one or more packets (e.g., using the routing module 2112) to the destination, via the plurality of egress ports, with bandwidth proportional to the network capacity.
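One way to route with bandwidth proportional to network capacity is to precompute a sequence of egress ports in which each port occupies a share of time slots matching its share of capacity, and then to interleave the slots pseudo-randomly. The sketch below is an assumption-laden illustration: the largest-remainder rounding, the seeded shuffle, and the name `build_port_sequence` are hypothetical and are not taken from the description.

```python
import random
from typing import Dict, List


def build_port_sequence(capacity: Dict[int, float], slots: int, seed: int = 0) -> List[int]:
    """Return `slots` egress port ids, each appearing in proportion to its capacity."""
    usable = {port: cap for port, cap in capacity.items() if cap > 0}
    total = sum(usable.values())
    if total == 0:
        return []  # no port has capacity, so there is nothing to schedule

    # Ideal (fractional) slot share per port, then round down.
    exact = {port: slots * cap / total for port, cap in usable.items()}
    counts = {port: int(share) for port, share in exact.items()}

    # Hand leftover slots to the ports with the largest fractional remainder.
    leftover = slots - sum(counts.values())
    by_remainder = sorted(usable, key=lambda p: exact[p] - counts[p], reverse=True)
    for port in by_remainder[:leftover]:
        counts[port] += 1

    sequence = [port for port, n in counts.items() for _ in range(n)]
    random.Random(seed).shuffle(sequence)  # pseudo-random interleaving
    return sequence


# Port 1 has three times the capacity of port 2; port 3 has none.
sequence = build_port_sequence({1: 3.0, 2: 1.0, 3: 0.0}, slots=8)
```

A switch could then consume the sequence one slot per packet, so that ports with zero capacity (for example, because of a fault) never receive traffic.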
Various examples of aspects of the disclosure are described below as clauses for convenience. These are provided as examples, and do not limit the subject technology.
In accordance with some implementations, an example clause includes a method of routing packets in a switch in an interconnection network, the method comprising: at the switch: obtaining network capacity for transmitting packets via a plurality of egress ports of the switch; generating a port sequence that defines a pseudo-randomly interleaved sequence of a plurality of path options via the plurality of egress ports, based on the network capacity; receiving one or more packets via one or more ingress ports of the switch; and routing the one or more packets towards a destination, based on the port sequence.
The method of any of the clauses, wherein generating the port sequence comprises: using each path option in a fraction of time slots of the port sequence such that probability of a corresponding egress port appearing in the port sequence is proportional to the network capacity through the corresponding egress port.
The method of any of the clauses, wherein the network capacity corresponds to capacity of the interconnection network to transmit packets to a plurality of destinations via the switch.
The method of any of the clauses, further comprising: generating a plurality of port sequences, wherein each port sequence defines a pseudo-randomly interleaved sequence of the plurality of path options, via the plurality of egress ports, according to the network capacity, and wherein each port sequence corresponds to a respective next switch of a plurality of next switches.
The method of any of the clauses, further comprising: generating a plurality of port sequences, wherein each port sequence defines a pseudo-randomly interleaved sequence of the plurality of path options, via the plurality of egress ports, according to the network capacity, and wherein each port sequence corresponds to a respective virtual lane of a plurality of virtual lanes.
The method of any of the clauses, further comprising: generating a plurality of port sequences, wherein each port sequence pseudo-randomly interleaves the plurality of path options, via the plurality of egress ports, according to the network capacity, and wherein each port sequence corresponds to (i) a respective virtual lane of a plurality of virtual lanes and (ii) a respective next switch of a plurality of next switches; generating a dynamic port table of egress port identifiers, wherein each row of the dynamic port table corresponds to a respective next switch of a plurality of next switches, wherein each column of the dynamic port table corresponds to a respective virtual lane of a plurality of virtual lanes, and wherein each egress port identifier corresponds to a respective port sequence of the plurality of port sequences; and routing a plurality of packets, received from one or more ingress ports, to the plurality of next switches, based on the dynamic port table.
The method of any of the clauses, further comprising: updating the dynamic port table, based on the plurality of port sequences, after routing a packet of the plurality of packets.
The method of any of the clauses, wherein: the interconnection network includes a plurality of dimensions; the network capacity includes information regarding capacity of the interconnection network to transmit packets towards the destination via the switch and using the plurality of dimensions; each port sequence further corresponds to a respective dimension of the plurality of dimensions; the dynamic port table includes a plurality of sub-tables of egress port identifiers, each sub-table corresponding to a respective dimension; and routing the plurality of packets further comprises selecting a dimension from the plurality of dimensions, based on comparing network capacities for the interconnection network to transmit packets towards the destination using each dimension.
The method of any of the clauses, further comprising: in accordance with a determination that network capacity for the interconnection network to transmit packets towards the destination via a first dimension of the plurality of dimensions, does not meet a predetermined threshold, forgoing selecting the first dimension for routing the plurality of packets.
The method of any of the clauses, further comprising: in accordance with a determination that network capacity for the interconnection network to transmit packets towards the destination, via a first dimension or via a second dimension of the plurality of dimensions, meets a predetermined threshold, spreading the plurality of packets over the first dimension and the second dimension.
The method of any of the clauses, further comprising: prior to routing the plurality of packets, for each packet: (i) extracting subfields in a header of the packet, and (ii) indexing a static lookup table for each dimension using the subfields to select a row in a respective sub-table for the dimension.
The method of any of the clauses, wherein the plurality of path options includes non-minimal routes via a corresponding intermediate switch, in addition to minimal routes without any intermediate switches.
The method of any of the clauses, further comprising: prioritizing path options that include minimal routes over path options that include non-minimal routes, when routing the one or more packets.
The method of any of the clauses, further comprising: in accordance with a determination that path options that include minimal routes do not meet a threshold network capacity, selecting other path options that include non-minimal routes, when routing the one or more packets.
The method of any of the clauses, wherein the network capacity includes buffer capacity at the plurality of egress ports.
The method of any of the clauses, wherein the network capacity includes bandwidth of the plurality of egress ports.
In another aspect, in accordance with some implementations, an example clause includes a switch for routing packets in an interconnection network, the switch comprising: a plurality of egress ports to transmit packets; one or more ingress ports to receive packets; a network capacity circuit configured to obtain network capacity for transmitting packets via the plurality of egress ports; a port sequence generation circuit configured to generate a port sequence that defines a pseudo-randomly interleaved sequence of a plurality of path options via the plurality of egress ports, based on the network capacity; and a routing circuit configured to route one or more packets, received from the one or more ingress ports, towards a destination, based on the port sequence.
The switch of any of the clauses, wherein the port sequence generation circuit is configured to: use each path option in a fraction of time slots of the port sequence such that probability of a corresponding egress port appearing in the port sequence is proportional to the network capacity through the corresponding egress port.
The switch of any of the clauses, wherein the network capacity corresponds to capacity of the interconnection network to transmit packets to a plurality of destinations via the switch.
The switch of any of the clauses, wherein the port sequence generation circuit is configured to: generate a plurality of port sequences, wherein each port sequence defines a pseudo-randomly interleaved sequence of the plurality of path options, via the plurality of egress ports, according to the network capacity, and wherein each port sequence corresponds to a respective next switch of a plurality of next switches.
The switch of any of the clauses, wherein the port sequence generation circuit is configured to: generate a plurality of port sequences, wherein each port sequence defines a pseudo-randomly interleaved sequence of the plurality of path options, via the plurality of egress ports, according to the network capacity, and wherein each port sequence corresponds to a respective virtual lane of a plurality of virtual lanes.
The switch of any of the clauses, wherein: the port sequence generation circuit is configured to: generate a plurality of port sequences, wherein each port sequence pseudo-randomly interleaves the plurality of path options, via the plurality of egress ports, according to the network capacity, and wherein each port sequence corresponds to (i) a respective virtual lane of a plurality of virtual lanes and (ii) a respective next switch of a plurality of next switches; and generate a dynamic port table of egress port identifiers, wherein each row of the dynamic port table corresponds to a respective next switch of a plurality of next switches, wherein each column of the dynamic port table corresponds to a respective virtual lane of a plurality of virtual lanes, and wherein each egress port identifier corresponds to a respective port sequence of the plurality of port sequences; and the routing circuit is configured to: route a plurality of packets, received from one or more ingress ports, to the plurality of next switches, based on the dynamic port table.
The switch of any of the clauses, wherein the port sequence generation circuit is configured to update the dynamic port table, based on the plurality of port sequences, after the routing circuit routes a packet of the plurality of packets.
The switch of any of the clauses, wherein: the interconnection network includes a plurality of dimensions; the network capacity includes information regarding capacity of the interconnection network to transmit packets towards the destination via the switch and using the plurality of dimensions; each port sequence further corresponds to a respective dimension of the plurality of dimensions; the dynamic port table includes a plurality of sub-tables of egress port identifiers, each sub-table corresponding to a respective dimension; and the routing circuit is configured to route the plurality of packets by selecting a dimension from the plurality of dimensions, based on comparing network capacities for the interconnection network to transmit packets towards the destination using each dimension.
The switch of any of the clauses, wherein the routing circuit is configured to: in accordance with a determination that network capacity for the interconnection network to transmit packets towards the destination via a first dimension of the plurality of dimensions, does not meet a predetermined threshold, forgo selecting the first dimension for routing the plurality of packets.
The switch of any of the clauses, wherein the routing circuit is configured to: in accordance with a determination that network capacity for the interconnection network to transmit packets towards the destination, via a first dimension or via a second dimension of the plurality of dimensions, meets a predetermined threshold, spread the plurality of packets over the first dimension and the second dimension.
The switch of any of the clauses, wherein the routing circuit is configured to: prior to routing the plurality of packets, for each packet: (i) extract subfields in a header of the packet, and (ii) index a static lookup table for each dimension using the subfields to select a row in a respective sub-table for the dimension.
The switch of any of the clauses, wherein the plurality of path options includes non-minimal routes via a corresponding intermediate switch, in addition to minimal routes without any intermediate switches.
The switch of any of the clauses, wherein the routing circuit is configured to: prioritize path options that include minimal routes over path options that include non-minimal routes, when routing the one or more packets.
The switch of any of the clauses, wherein the routing circuit is configured to: in accordance with a determination that path options that include minimal routes do not meet a threshold network capacity, select other path options that include non-minimal routes, when routing the one or more packets.
The switch of any of the clauses, wherein the network capacity includes buffer capacity at the plurality of egress ports.
The switch of any of the clauses, wherein the network capacity includes bandwidth of the plurality of egress ports.
In another aspect, in accordance with some implementations, an example clause includes a method for routing packets in a switch in an interconnection network, the method comprising: at the switch: obtaining port capacity for a plurality of egress ports configured to couple the switch to a next switch; obtaining bandwidth capacity for transmitting packets to a destination; computing network capacity, for transmitting packets to the destination, via the plurality of egress ports, based on a function of the port capacity and the bandwidth capacity; receiving one or more packets via one or more ingress ports of the switch; and routing the one or more packets to the destination, via the plurality of egress ports, with bandwidth proportional to the network capacity.
The method of any of the clauses, wherein the port capacity comprises available buffer capacity for ingress ports of respective receiver switches coupled to the plurality of egress ports.
The method of any of the clauses, wherein the port capacity is zero through any egress port that has a fault.
The method of any of the clauses, wherein the bandwidth capacity comprises idle buffer in the next switch.
The method of any of the clauses, wherein the bandwidth capacity comprises configured buffer minus current buffer in the next switch.
The method of any of the clauses, wherein the bandwidth capacity is calculated based on one or more telemetry packets received from another switch of the interconnection network.
The method of any of the clauses, wherein the function comprises Exponential Weighted Moving Average.
The method of any of the clauses, wherein the function comprises a plurality of low-pass filters.
The method of any of the clauses, wherein each low-pass filter combines the port capacity for a respective egress port with the bandwidth capacity, to obtain a respective bandwidth capacity for transmitting packets to the destination via the respective egress port.
The method of any of the clauses, wherein: the switch is connected to the next switch using a plurality of virtual lanes; the bandwidth capacity includes a respective buffer capacity for each virtual lane; computing the network capacity includes computing, for each virtual lane, a respective virtual lane capacity, using a respective one or more low-pass filters, based on the port capacity and the respective buffer capacity; and routing the one or more packets to the destination includes selecting a virtual lane from the plurality of virtual lanes based on the respective virtual lane capacity.
The method of any of the clauses, wherein the bandwidth capacity includes idle buffers in a path to the destination that includes an intermediate switch.
The method of any of the clauses, wherein: the bandwidth capacity includes (i) a first buffer capacity corresponding to idle buffers in a first path to the destination via a first intermediate switch, and (ii) a second buffer capacity corresponding to idle buffers in a second path to the destination via a second intermediate switch; computing the network capacity includes: computing a first network capacity for transmitting packets to the destination, via a first port, using a low-pass filter, based on the port capacity and the first buffer capacity; and computing a second network capacity for transmitting packets to the destination, via the plurality of egress ports, using a second low-pass filter, based on the port capacity and the second buffer capacity; and routing the one or more packets includes selecting between the first path and the second path, based on the first network capacity and the second network capacity.
In another aspect, in accordance with some implementations, an example clause includes a switch for routing packets in an interconnection network, the switch comprising: a plurality of egress ports to transmit packets; one or more ingress ports to receive packets; a port and bandwidth capacity circuit configured to obtain (i) port capacity for a plurality of egress ports of the switch, and (ii) bandwidth capacity for transmitting packets to a destination; a network capacity circuit configured to compute network capacity, for transmitting packets to the destination, via the plurality of egress ports, based on a function of the port capacity and the bandwidth capacity; and a routing circuit configured to route one or more packets received via one or more ingress ports of the switch, to the destination, via the plurality of egress ports, based on the network capacity.
The switch of any of the clauses, wherein the port capacity comprises available buffer capacity for ingress ports of respective receiver switches coupled to the plurality of egress ports.
The switch of any of the clauses, wherein the port capacity is zero through any egress port that has a fault.
The switch of any of the clauses, wherein the bandwidth capacity comprises idle buffer in the next switch.
The switch of any of the clauses, wherein the bandwidth capacity comprises configured buffer minus current buffer in the next switch.
The switch of any of the clauses, wherein the bandwidth capacity is calculated based on one or more telemetry packets received from another switch of the interconnection network.
The switch of any of the clauses, wherein the function comprises Exponential Weighted Moving Average.
The switch of any of the clauses, wherein the function comprises a plurality of low-pass filters.
The switch of any of the clauses, wherein each low-pass filter is configured to combine the port capacity for a respective egress port with the bandwidth capacity, to obtain a respective bandwidth capacity for transmitting packets to the destination via the respective egress port.
The switch of any of the clauses, wherein: the switch is connected to the next switch using a plurality of virtual lanes; the bandwidth capacity includes a respective buffer capacity for each virtual lane; the network capacity circuit is configured to compute, for each virtual lane, a respective virtual lane capacity, using a respective one or more low-pass filters, based on the port capacity and the respective buffer capacity; and the routing circuit is configured to route the one or more packets to the destination by selecting a virtual lane from the plurality of virtual lanes based on the respective virtual lane capacity.
The switch of any of the clauses, wherein the bandwidth capacity includes idle buffers in a path to the destination that includes an intermediate switch.
The switch of any of the clauses, wherein: the bandwidth capacity includes (i) a first buffer capacity corresponding to idle buffers in a first path to the destination via a first intermediate switch, and (ii) a second buffer capacity corresponding to idle buffers in a second path to the destination via a second intermediate switch; the network capacity circuit is configured to: compute a first network capacity for transmitting packets to the destination, via a first port, using a low-pass filter, based on the port capacity and the first buffer capacity; and compute a second network capacity for transmitting packets to the destination, via the plurality of egress ports, using a second low-pass filter, based on the port capacity and the first buffer capacity; and the routing circuit is configured to route the one or more packets by selecting between the first path and the second path, based on the first network capacity and the second network capacity.
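The switch clauses above on faulted ports, idle buffers, and virtual lanes can be sketched as follows. This is an illustration under stated assumptions, not the claimed circuit: the smoothing weight ALPHA, the use of min() to combine port capacity with per-lane buffer capacity, and all function names are hypothetical.

```python
from typing import Dict

ALPHA = 0.25  # hypothetical smoothing weight for the per-lane filters


def idle_buffer(configured: int, current: int) -> int:
    # Bandwidth capacity as configured buffer minus current buffer occupancy.
    return max(configured - current, 0)


def effective_port_capacity(available: int, faulted: bool) -> int:
    # Port capacity is zero through any egress port that has a fault.
    return 0 if faulted else available


def ewma_step(previous: float, sample: float, alpha: float = ALPHA) -> float:
    # One update of an exponentially weighted moving average.
    return alpha * sample + (1.0 - alpha) * previous


def update_vl_capacities(filtered: Dict[int, float], port_cap: int,
                         vl_buffers: Dict[int, int]) -> Dict[int, float]:
    # One filter state per virtual lane; the raw sample is assumed to be
    # the smaller of the port capacity and that lane's buffer capacity.
    for vl, buf in vl_buffers.items():
        filtered[vl] = ewma_step(filtered.get(vl, 0.0), min(port_cap, buf))
    return filtered


def select_virtual_lane(filtered: Dict[int, float]) -> int:
    # Route on the lane with the largest filtered capacity.
    return max(filtered, key=filtered.get)


lanes = update_vl_capacities(
    {}, effective_port_capacity(16, faulted=False),
    {0: idle_buffer(32, 28), 1: idle_buffer(32, 8)})
best_lane = select_virtual_lane(lanes)
```

With a faulted port, every sample is zero, so the filtered capacity of each lane on that port decays toward zero and the port gradually stops attracting traffic.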
In another aspect, in accordance with some implementations, an example clause includes a method of routing packets in a switch in an interconnection network, the method comprising: at the switch: obtaining local buffer capacity for a plurality of buffers configured to buffer packets transmitted across the interconnection network via the switch; receiving a plurality of telemetry flow control units from a plurality of next switches coupled to the switch, wherein each telemetry flow control unit corresponds to buffer capacity at a respective next switch; computing network capacity, for transmitting packets to a destination, via a plurality of egress ports, based on the plurality of telemetry flow control units and the local buffer capacity; receiving one or more packets via one or more ingress ports of the switch; and routing the one or more packets to the destination, via the plurality of egress ports, with bandwidth proportional to the network capacity.
The method of any of the clauses, wherein the buffer capacity includes credit counts for the plurality of egress ports configured to couple the switch to a next switch.
The method of any of the clauses, further comprising: receiving, from the plurality of next switches, link fault information for a plurality of links configured to couple the plurality of switches to one or more switches of the interconnection network; and computing the network capacity further based on the link fault information.
The method of any of the clauses, wherein the link fault information is received as part of the plurality of telemetry flow control units.
The method of any of the clauses, further comprising: generating a plurality of new telemetry flow control units based on the buffer capacity and the plurality of telemetry flow control units; and transmitting the plurality of new telemetry flow control units to a plurality of preceding switches coupled to the switch, wherein the plurality of preceding switches is configured to route packets based on the plurality of new telemetry flow control units.
The method of any of the clauses, further comprising: obtaining a telemetry tree that includes the switch as a root and the plurality of next switches coupled to the switch, as nodes of the tree, according to a topology of the interconnection network; and generating the plurality of new telemetry flow control units further based on the telemetry tree.
The method of any of the clauses, wherein: the telemetry tree includes a first set of next switches of the plurality of next switches, in a first level of the telemetry tree; the telemetry tree includes a second set of next switches of the plurality of next switches, in a second level of the telemetry tree; each switch in the first set of next switches is directly connected to the switch, in the topology of the interconnection network; each switch in the second set of next switches is indirectly connected to the switch, via the first set of next switches, in the topology of the interconnection network; and generating the plurality of new telemetry flow control units comprises generating a telemetry block of flow control units that includes (i) per virtual lane buffer capacity information for each of the first set of next switches and (ii) consolidated buffer capacities for all virtual lanes for the second set of next switches.
The method of any of the clauses, further comprising: defining bit-widths for measurements of buffer capacities in the plurality of new telemetry flow control units based on a telemetry update period.
The method of any of the clauses, further comprising: determining the telemetry update period based on buffer capacities of switches in the interconnection network.
The method of any of the clauses, further comprising: defining size of each of the plurality of new telemetry flow control units based on number of switches the switch is directly connected to.
The method of any of the clauses, further comprising: defining size of each of the plurality of new telemetry flow control units based on a predetermined congestion control bandwidth of the interconnection network.
The method of any of the clauses, wherein the plurality of telemetry flow control units includes cyclic redundancy check (CRC) information; the method further comprising: discarding one or more telemetry flow control units, from the plurality of telemetry flow control units, according to the CRC information, while computing the network capacity.
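The telemetry method clauses above can be sketched in a few lines. The wire layout here (a 16-bit switch identifier, a 16-bit buffer capacity, and a trailing CRC-32) is purely hypothetical, as are the function names and the use of min() to combine local and remote capacity; the clauses do not specify a CRC polynomial or a field layout.

```python
import random
import struct
import zlib
from typing import Dict, Iterable, Optional, Tuple


def encode_unit(switch_id: int, buffer_capacity: int) -> bytes:
    # Hypothetical layout: 2-byte switch id, 2-byte capacity, 4-byte CRC-32.
    payload = struct.pack(">HH", switch_id, buffer_capacity)
    return payload + struct.pack(">I", zlib.crc32(payload))


def decode_unit(unit: bytes) -> Optional[Tuple[int, int]]:
    # Return (switch_id, buffer_capacity), or None if the CRC check fails.
    if len(unit) != 8:
        return None
    payload, (crc,) = unit[:4], struct.unpack(">I", unit[4:])
    if zlib.crc32(payload) != crc:
        return None
    return struct.unpack(">HH", payload)


def network_capacity(units: Iterable[bytes],
                     local_capacity: Dict[int, int]) -> Dict[int, int]:
    # Discard corrupted units while computing capacity per next switch.
    # Assumption: local_capacity maps a next-switch id to the local credit
    # count of the egress port that reaches it.
    capacity: Dict[int, int] = {}
    for unit in units:
        decoded = decode_unit(unit)
        if decoded is None:
            continue
        switch_id, remote = decoded
        if switch_id in local_capacity:
            capacity[switch_id] = min(local_capacity[switch_id], remote)
    return capacity


def pick_next_switch(capacity: Dict[int, int],
                     rng: random.Random) -> Optional[int]:
    # Choose a next switch with probability proportional to its capacity.
    ids = [s for s, c in capacity.items() if c > 0]
    if not ids:
        return None
    return rng.choices(ids, weights=[capacity[s] for s in ids], k=1)[0]
```

Over many packets, weighted random selection sends each next switch a share of traffic roughly proportional to its computed capacity, which is one simple way to approximate "bandwidth proportional to the network capacity".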
In another aspect, in accordance with some implementations, an example clause includes a switch for routing packets in an interconnection network, the switch comprising: a plurality of egress ports to transmit packets; one or more ingress ports to receive packets; a buffer capacity circuit configured to obtain local buffer capacity for a plurality of buffers configured to buffer packets transmitted across the interconnect via the switch; a telemetry circuit configured to receive a plurality of telemetry flow control units from a plurality of next switches coupled to the switch, wherein each telemetry flow control unit corresponds to buffer capacity at a respective next switch; a network capacity circuit configured to compute network capacity for transmitting packets to a destination, via the plurality of egress ports, based on the plurality of telemetry flow control units and the local buffer capacity; and a routing circuit configured to receive one or more packets via the one or more ingress ports, and route the one or more packets to the destination, via the plurality of egress ports, with bandwidth proportional to the network capacity.
The switch of any of the clauses, wherein the local buffer capacity includes credit counts for the plurality of egress ports configured to couple the switch to a next switch.
The switch of any of the clauses, wherein the telemetry circuit is further configured to receive, from the plurality of next switches, link fault information for a plurality of links configured to couple the plurality of switches to one or more switches of the interconnection network.
The switch of any of the clauses, wherein the telemetry circuit is further configured to generate a plurality of new telemetry flow control units based on the buffer capacity and the plurality of telemetry flow control units.
The switch of any of the clauses, wherein the telemetry circuit is configured to transmit the plurality of new telemetry flow control units to a plurality of preceding switches coupled to the switch, and the plurality of preceding switches is configured to route packets based on the plurality of new telemetry flow control units.
The switch of any of the clauses, wherein the telemetry circuit is configured to obtain a telemetry tree that includes the switch as a root and the plurality of next switches coupled to the switch, as nodes of the tree, according to a topology of the interconnection network.
The switch of any of the clauses, wherein the telemetry circuit is configured to generate the plurality of new telemetry flow control units further based on the telemetry tree.
The switch of any of the clauses, wherein: the telemetry tree includes a first set of next switches of the plurality of next switches, in a first level of the telemetry tree; the telemetry tree includes a second set of next switches of the plurality of next switches, in a second level of the telemetry tree; each switch in the first set of next switches is directly connected to the switch, in the topology of the interconnection network; and each switch in the second set of next switches is indirectly connected to the switch, via the first set of next switches, in the topology of the interconnection network.
The switch of any of the clauses, wherein: the telemetry circuit is further configured to generate the plurality of new telemetry flow control units by generating a telemetry block of flow control units that includes (i) per virtual lane buffer capacity information for each of the first set of next switches and (ii) consolidated buffer capacities for all virtual lanes for the second set of next switches.
The switch of any of the clauses, wherein: bit-widths for measurements of buffer capacities in the plurality of new telemetry flow control units are defined based on a telemetry update period.
The switch of any of the clauses, wherein: the telemetry circuit is configured to determine the telemetry update period based on buffer capacities of switches in the interconnect.
The switch of any of the clauses, wherein: the telemetry circuit is further configured to define size of each of the plurality of new telemetry flow control units based on number of switches the switch is directly connected to.
The switch of any of the clauses, wherein: the telemetry circuit is further configured to define size of each of the plurality of new telemetry flow control units based on predetermined congestion control bandwidth of the interconnection network.
The switch of any of the clauses, wherein: the network capacity circuit is further configured to compute the network capacity further based on the link fault information.
The switch of any of the clauses, wherein: the link fault information is received as part of the plurality of telemetry flow control units.
The switch of any of the clauses, wherein: minimal or non-minimal values include lane degrade.
The switch of any of the clauses, wherein: the plurality of telemetry flow control units includes cyclic redundancy check (CRC) information, and the network capacity circuit is further configured to discard one or more telemetry flow control units, from the plurality of telemetry flow control units, according to the CRC information, while computing the network capacity.
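The two-level telemetry block described in the clauses above can be illustrated with a short sketch. It assumes that "consolidated" means the sum of the per-virtual-lane capacities and that each measurement saturates at the largest value its bit-width can hold; both are assumptions, and the names build_telemetry_block and saturate and the default bit-width are hypothetical.

```python
from typing import Dict, List, Union


def saturate(value: int, bits: int) -> int:
    # Clamp a measurement to the range a field of the given width can hold.
    return max(0, min(value, (1 << bits) - 1))


def build_telemetry_block(
    first_level: Dict[int, List[int]],
    second_level: Dict[int, List[int]],
    bits: int = 8,  # hypothetical field width per measurement
) -> Dict[str, Dict[int, Union[List[int], int]]]:
    # first_level: directly connected next switches, reported per virtual lane.
    # second_level: indirectly connected switches, reported as one value each.
    return {
        "first_level": {
            sw: [saturate(c, bits) for c in vls]
            for sw, vls in first_level.items()
        },
        "second_level": {
            sw: saturate(sum(vls), bits)
            for sw, vls in second_level.items()
        },
    }


block = build_telemetry_block(
    first_level={1: [4, 2], 2: [300, 0]},
    second_level={5: [3, 3, 1]},
)
```

Reporting only a consolidated value for second-level switches keeps the block small as the telemetry tree widens, at the cost of per-lane detail for switches that are further away.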
In some implementations, a computer system has one or more processors, memory, and a display. The memory stores one or more programs configured for execution by the one or more processors. The one or more programs include instructions for performing any of the methods described herein.
In some implementations, a non-transitory computer readable storage medium stores one or more programs configured for execution by a computer system having one or more processors, memory, and a display. The one or more programs include instructions for performing any of the methods described herein.
In one aspect, a method may be an operation, an instruction, or a function and vice versa. In one aspect, a clause or a claim may be amended to include some or all of the words (e.g., instructions, operations, functions, or components) recited in other one or more clauses, one or more words, one or more sentences, one or more phrases, one or more paragraphs, and/or one or more claims. In one aspect, a clause may depend from any other clauses, sentences or phrases.
To illustrate the interchangeability of hardware and software, items such as the various illustrative blocks, modules, components, methods, operations, instructions, and algorithms have been described generally in terms of their functionality. Whether such functionality is implemented as hardware, software or a combination of hardware and software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application.
A reference to an element in the singular is not intended to mean one and only one unless specifically so stated, but rather one or more. For example, “a” module may refer to one or more modules. An element preceded by “a,” “an,” “the,” or “said” does not, without further constraints, preclude the existence of additional same elements.
Headings and subheadings, if any, are used for convenience only and do not limit the invention. The word exemplary is used to mean serving as an example or illustration. To the extent that the term include, have, or the like is used, such term is intended to be inclusive in a manner similar to the term comprise as comprise is interpreted when employed as a transitional word in a claim. Relational terms such as first and second and the like may be used to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between such entities or actions.
Phrases such as an aspect, the aspect, another aspect, some aspects, one or more aspects, an implementation, the implementation, another implementation, some implementations, one or more implementations, an embodiment, the embodiment, another embodiment, some embodiments, one or more embodiments, a configuration, the configuration, another configuration, some configurations, one or more configurations, the subject technology, the disclosure, the present disclosure, other variations thereof and the like are for convenience and do not imply that a disclosure relating to such phrase(s) is essential to the subject technology or that such disclosure applies to all configurations of the subject technology. A disclosure relating to such phrase(s) may apply to all configurations, or one or more configurations. A disclosure relating to such phrase(s) may provide one or more examples. A phrase such as an aspect or some aspects may refer to one or more aspects and vice versa, and this applies similarly to other foregoing phrases.
A phrase “at least one of” preceding a series of items, with the terms “and” or “or” to separate any of the items, modifies the list as a whole, rather than each member of the list. The phrase “at least one of” does not require selection of at least one item; rather, the phrase allows a meaning that includes at least one of any one of the items, and/or at least one of any combination of the items, and/or at least one of each of the items. By way of example, each of the phrases “at least one of A, B, and C” or “at least one of A, B, or C” refers to only A, only B, or only C; any combination of A, B, and C; and/or at least one of each of A, B, and C.
It is understood that the specific order or hierarchy of steps, operations, or processes disclosed is an illustration of exemplary approaches. Unless explicitly stated otherwise, it is understood that the specific order or hierarchy of steps, operations, or processes may be performed in different order. Some of the steps, operations, or processes may be performed simultaneously. The accompanying method claims, if any, present elements of the various steps, operations or processes in a sample order, and are not meant to be limited to the specific order or hierarchy presented. These may be performed in serial, linearly, in parallel or in different order. It should be understood that the described instructions, operations, and systems can generally be integrated together in a single software/hardware product or packaged into multiple software/hardware products.
The disclosure is provided to enable any person skilled in the art to practice the various aspects described herein. In some instances, well-known structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology. The disclosure provides various examples of the subject technology, and the subject technology is not limited to these examples. Various modifications to these aspects will be readily apparent to those skilled in the art, and the principles described herein may be applied to other aspects.
All structural and functional equivalents to the elements of the various aspects described throughout the disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed under the provisions of 35 U.S.C. § 112, sixth paragraph, unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for”.
The title, background, brief description of the drawings, abstract, and drawings are hereby incorporated into the disclosure and are provided as illustrative examples of the disclosure, not as restrictive descriptions. It is submitted with the understanding that they will not be used to limit the scope or meaning of the claims. In addition, in the detailed description, it can be seen that the description provides illustrative examples and the various features are grouped together in various implementations for the purpose of streamlining the disclosure. The method of disclosure is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Rather, as the claims reflect, inventive subject matter lies in less than all features of a single disclosed configuration or operation. The claims are hereby incorporated into the detailed description, with each claim standing on its own as a separately claimed subject matter.
The claims are not intended to be limited to the aspects described herein but are to be accorded the full scope consistent with the language of the claims and to encompass all legal equivalents. Notwithstanding, none of the claims are intended to embrace subject matter that fails to satisfy the requirements of the applicable patent law, nor should they be interpreted in such a way.
This application is a continuation of U.S. application Ser. No. 17/359,358, filed Jun. 25, 2021, now U.S. Pat. No. 11,637,778, which is incorporated herein by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
20230130276 A1 | Apr 2023 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 17359358 | Jun 2021 | US |
Child | 18087765 | US |