Controlling network congestion is important to network performance, which in turn is important to satisfying demand and maintaining customer satisfaction. Congestion control (CC) provides a mechanism that, in general, controls the entry of data, e.g., data packets, into a network. The importance of CC is increasing as applications increasingly demand low-latency operations at datacenter scale. Examples of such applications include memory and storage disaggregation and machine learning (ML), which can generate large-scale incast traffic. Congestion control is also becoming more challenging, in part because link bandwidths are growing faster than buffers at switches, and because high-throughput packet servers benefit from simple CC algorithms offloaded to network interface cards (NICs) to save CPU for applications. Effective datacenter CC should provide one or more of high throughput, low latency, fairness, and relatively fast convergence across varied workloads.
A challenge faced by congestion control protocols is gaining more granular visibility into the fine-timescale hop-level congestion state of a network. Datacenter CC algorithms typically rely on either end-to-end signals (e.g., delay) or quantized in-network feedback (e.g., explicit congestion notification (ECN)). These signals are aggregated end-to-end across all hops on a flow's path and therefore provide little visibility into which hop is congested. As a result, these CC algorithms often suffer from under-utilization, slow ramp-up, and/or unfairness.
An aspect of the disclosed technology is a computing system that implements a congestion control protocol that exploits and extends in-network telemetry (INT) to address, for example, blind spots typically found in end-to-end algorithms; determines CC for an actual bottleneck hop; realizes low queuing delay; and/or realizes convergence to network-wide max-min fair bandwidth allocation.
For example, an aspect of the disclosed technology is a method for network congestion control, comprising: detecting maximum hop delays at each hop along a path between a network source node and a network destination node; determining, at a host machine associated with a hop along the path between the network source node and the network destination node, a maximum hop delay value from among the maximum hop delays detected at each hop, the maximum hop delay value being associated with a bottleneck hop along the path between the network source node and the network destination node; and effecting congestion control, at the host machine, based on the maximum hop delay value associated with the bottleneck hop.
In accordance with this aspect of the disclosed technology, the method may further comprise inserting, by respective in-network telemetry (INT) devices located at each hop along the path between the network source node and the network destination node, the maximum hop delay for each hop in a packet header of a respective message sent from each hop.
In accordance with the foregoing aspects of the disclosed technology, inserting comprises inserting by an INT-enabled switch or an INT-enabled network interface card (NIC). Further, the host machine may be located at a source hop associated with the network source node.
In accordance with the foregoing aspects of the disclosed technology, the bottleneck hop comprises a congested hop that limits data packet flows that transmit more than their max-min fair-share rate. Further, effecting congestion control may comprise decreasing a transmission rate of only those data packet flows that transmit more than their max-min fair-share rate. In addition, effecting congestion control may comprise comparing the maximum hop delay value detected at each hop to a rate-adjusted target hop delay associated with each respective hop. Further, the method may comprise updating a congestion window at the host machine based on the comparison. Further still, updating may comprise decreasing the congestion window only if the data packet flow got the max-min fair-share rate on congested hops along the path between the network source node and the network destination node.
In another example, an aspect of the disclosed invention may comprise a system. The system comprises a source node; a destination node; one or more hops along a data path between the source node and the destination node; and a host machine coupled to the source node, the host machine comprising one or more memories storing instructions that cause one or more processing devices to: detect maximum hop delays at each hop along the data path between the source node and the destination node; determine a maximum hop delay value from among the maximum hop delays detected at each hop, the maximum hop delay value being associated with a bottleneck hop along the data path; and effect congestion control based on the maximum hop delay value associated with the bottleneck hop.
In accordance with this aspect of the disclosure, the instructions may cause the one or more processing devices to insert, by respective in-network telemetry (INT) devices located at each hop along the data path, the maximum hop delay for each hop in a packet header of a respective message sent from each hop. Further, the INT devices may comprise one of an INT-enabled switch or an INT-enabled network interface card (NIC). In addition, the host machine is located at the source node.
In accordance with the foregoing aspects of the disclosure, the bottleneck hop may comprise a congested hop that limits data packet flows that transmit more than their max-min fair-share rate. Further, to effect congestion control may comprise decreasing a transmission rate of only those data packet flows that transmit more than their max-min fair-share rate. Further still, to effect congestion control may comprise comparing the maximum hop delay value detected at each hop to a rate-adjusted target hop delay associated with each respective hop.
In accordance with the foregoing aspects of the disclosure, the instructions may cause the one or more processing devices to update a congestion window at the host machine based on the comparison. Further, to update may comprise decreasing the congestion window only if the data packet flow got the max-min fair-share rate on congested hops along the path between the network source node and the network destination node.
The disclosed technology includes a technique for implementing congestion control in packet-based networks. The technique may be embodied in a process or method, as well as a system. The technique leverages INT technology to enable packet-based systems to segment congestion control such that the congestion control mechanism may react only to congestion at the bottleneck hop rather than at every congested hop. In effect, the disclosed technology may address blind spots in end-to-end algorithms.
In addition, under steady state conditions, the technique realizes network-wide maximum-minimum (max-min) fair-share bandwidth allocation (e.g., a state in which no flow can increase its rate without decreasing the rate of a flow whose rate is already smaller or equal). Furthermore, the technique decouples the bandwidth fairness requirement from the Additive Increase/Multiplicative Decrease (AIMD) algorithm, making it possible for the technique to converge quickly and smooth out bandwidth oscillations. For instance, the technique can achieve fairness on a single hop with selection of the appropriate increase/decrease functions. Further, the technique allows for incremental deployment in brownfield environments that mix INT and non-INT devices.
The technique makes use of INT technology by including parameter or metadata fields in INT packets that collect and report the largest per-hop delay, or latency, along the data path between the source and the destination nodes. This information is returned to the source node, where it is used to control congestion. In this regard, congestion may be controlled in a manner such that flows that have yet to reach their fair share of bandwidth are not penalized as a result of the detected congestion.
The technique may be applied within a data center using switches (or routers) within the data center. It may also be applied across data centers. In general, the technique may be applied where an entity has control over both ends of the connection as well as the switches or routers along the path.
At step 120, a host machine determines the maximum hop delay along the data path between the source and destination nodes. In this regard, note that the source and destination nodes may themselves be considered hops in the data path. The maximum hop delay may be determined, for example, using a maximum hop delay value that is provided via a return data path to the source node. For instance, a switch at each hop may insert the maximum hop delay, or latency, experienced by a packet at that hop. This may include locally storing a maximum hop delay value received from an upstream node, comparing that value with the maximum hop delay value for the current hop at which the switch is located, and inserting the larger of the two values, along with information identifying the hop associated with the larger delay. In this way, the maximum hop delay value that is returned to the source will be the largest maximum hop delay value experienced by a data packet along the data path.
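By way of illustration, the per-hop update just described might look like the following sketch. The field names (max_hop_latency_ns, max_hop_id) and timestamp inputs are assumptions used only for illustration and do not specify any particular INT header format.

```python
from dataclasses import dataclass

@dataclass
class IntMetadata:
    max_hop_latency_ns: int = 0   # running maximum hop delay; the source initializes this to 0
    max_hop_id: int = -1          # identity of the hop that produced the maximum

def update_int_fields(md: IntMetadata, hop_id: int,
                      ingress_ts_ns: int, egress_ts_ns: int) -> IntMetadata:
    # Hop delay is the time from when the packet entered this device until it
    # egresses; under congestion it is dominated by queueing delay.
    local_hop_delay_ns = egress_ts_ns - ingress_ts_ns
    # Keep only the larger of the carried value and this hop's delay, along
    # with information identifying the hop associated with the larger delay.
    if local_hop_delay_ns > md.max_hop_latency_ns:
        md.max_hop_latency_ns = local_hop_delay_ns
        md.max_hop_id = hop_id
    return md
```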
At step 130, the host machine effects congestion control based on the maximum hop delay value received at the source. The maximum hop delay value is used to identify a bottleneck hop so that congestion control is effected based on the congestion state of the bottleneck hop. As is further explained below, the process or method 100 may be implemented without learning the congestion state of every hop of a flow (e.g., packets or data sent between a source and destination). As such, the bottleneck hop may be considered the hop that limits the rate of the flow as per max-min fair allocation parameters. In this regard, a congested hop is not considered the bottleneck hop for all flows passing through the congested hop, but only for flows that send more than their max-min fair-share rate, and thus, CC should ideally decrease the rate of only those flows that are above their fair share. In general, fair-share rate refers to a fair share of bandwidth; in this regard, rate generally refers to bandwidth. For example, if there is one link in the network with a 100 gigabit-per-second (100 Gbps) capacity and 5 flows going through that link, fair-share allocation should give each flow 20 Gbps. The concept of max-min fair share generally refers to maximizing the rate of the flow with the minimum rate. As an example, this may result in redistributing bandwidth from flows with higher bandwidth demands to flows with lower rates (poorer flows) until each lower-rate flow is limited by its own demand or by some other bottleneck.
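The max-min fair-share concept from the example above can be illustrated with a simple progressive-filling computation. The sketch below is illustrative only and assumes each flow is limited solely by its own demand and the capacity of the single shared link:

```python
def max_min_share(capacity_gbps, demands_gbps):
    """Progressive filling: repeatedly split the remaining capacity equally
    among flows whose demand is not yet satisfied."""
    alloc = {flow: 0.0 for flow in demands_gbps}
    remaining = capacity_gbps
    unsatisfied = set(demands_gbps)
    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)
        for flow in list(unsatisfied):
            give = min(share, demands_gbps[flow] - alloc[flow])
            alloc[flow] += give
            remaining -= give
            if alloc[flow] >= demands_gbps[flow] - 1e-9:
                unsatisfied.remove(flow)
    return alloc

# Five flows on a 100 Gbps link with effectively unlimited demand each get 20 Gbps;
# if flow "A" only needs 5 Gbps, the remainder is redistributed to the other flows.
print(max_min_share(100, {"A": 1000, "B": 1000, "C": 1000, "D": 1000, "E": 1000}))
print(max_min_share(100, {"A": 5, "B": 1000, "C": 1000, "D": 1000, "E": 1000}))
```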
As shown, host 240 generates a packet 252 in accordance with the in-network telemetry (INT) mechanism, which provides a framework for collecting and reporting network state via the data plane. For instance, in an aspect of the technology disclosed herein, INT may be leveraged to include a max_hop_latency metadata field or parameter that is collected and reported at one or more or all hops in a network. The max_hop_latency metadata or parameter may comprise the latency associated with a hop. Generally, hop latency is the time from when a packet reaches the input of a device (e.g., switch or router) to the time when it egresses from the device. It may be thought of as the time a packet, e.g., an INT packet, takes to be switched or processed within a hop. Switches, for example, need to process the packet to know where it should go. Usually switches have many ports and, based on a tag (e.g., destination IP) and a routing table, they know where the packet should go. In this regard, the bulk of the delay in switches is queueing. For example, suppose that traffic from 2 ports on a switch wants to go to 1 port for a short period of time. Those packets therefore have to form a queue and gradually go into the target port. This delay can be up to hundreds of microseconds. If this congestion continues, the queue build-up may exceed the buffer capacity in the switch, resulting in dropped packets. The host has to detect such drops by noticing that an acknowledgement message did not come from the destination. It is the role of congestion control to avoid such scenarios.
As host 240 resides at the source node 210, the source node transmits the INT packet 256 to Hop 2 (or node 2) 220. Hop 2 then updates the max_hop_latency metadata or parameter to include the maximum hop latency value associated with Hop 2. For instance, the source node may set the max_hop_latency metadata or parameter to 0 initially, such that Hop 2 would then replace the value of 0 with the maximum hop latency value determined at Hop 2. Hop 2 then transmits the max_hop_latency metadata or parameter value in INT packet 258, along with a queue.latency metadata or parameter, to the next hop, e.g., Hop 3 (not shown), in the network. The queue.latency metadata or parameter provides a measure of the amount of time a hop holds a data packet in queue between ingress and egress. If the hop latency at the next hop, e.g., Hop 3, is larger than the max_hop_latency value carried in the packet, the max_hop_latency metadata or parameter and the queue.latency metadata or parameter are replaced with the values for Hop 3. The process then continues such that when INT packet 262 arrives at Hop N 230 (the destination node), it contains the value of the hop with the maximum max_hop_latency metadata or parameter, as well as the queue.latency metadata or parameter value for that hop.
Upon receiving the maximum max_hop_latency metadata or parameter and the queue.latency metadata or parameter, Hop N 230 (or destination node) generates an INT packet 270 and transmits the maximum max_hop_latency metadata or parameter and the queue.latency metadata or parameter value for the hop with the maximum value back to the source node 210 (or Hop 1) through intervening Hops N-1 through Hop 2. This is illustrated via INT packets 274 and 276. When the INT packet arrives at source 210, the INT metadata/parameters that were collected are reported to host 240 via a packet, for example, packet 278.
Turning now to
The CC function 324 (referred to herein as Poseidon) effects congestion control in response to data flow (or flow) at a bottleneck hop. As mentioned above, the bottleneck hop may be considered the hop that limits the rate of the flow as per max-min fair allocation parameters. In this regard, a congested hop is not considered the bottleneck hop for all flows passing through the congested hop, but only for flows that send more than their max-min fair-share rate, and thus, CC should ideally decrease the rate of only those flows that are above their fair share. In other words, the CC function 324 reacts to the bottleneck hop by decreasing the congestion window only if the flow got its fair share on congested hops over its path. Generally, the CC function 324 adjusts the rate of each flow using a congestion window, and if the window goes below a value of 1, the CC function uses pacing. The CC function 324 compares a delay signal with a target to increase or decrease the congestion window. More specifically, CC function 324 (1) applies the target to the maximum per-hop delay (MPD) to allow flows to react to the most congested hop, and (2) adjusts the target based on the throughput of the flow to make sure only the flows that get the highest rate on the hop reduce their congestion window.
The disclosed technology achieves high link utilization, low queuing delay, and network-wide max-min fairness, with fast convergence and stable per-flow throughput. The disclosed technology, e.g., Poseidon, may be configured so that it reacts only to the bottleneck hop, decreasing the congestion window only if the flow got its fair share on the congested hops over its path. This can be accomplished, as discussed herein, without knowing the fair share. Poseidon compares a delay signal with the target to increase or decrease the window. The delay signal and target can be defined as follows:
1. It applies the target to the maximum per-hop delay to allow flows to react to the most congested hop.
2. It adjusts the target based on the throughput of the flow to make sure only the flows that get the highest rate on the hop reduce their window.
More specifically, every flow tries to maintain the maximum per-hop delay (MPD) close to a maximum per-hop delay target (MPT), namely, increasing the congestion window when MPD ≤ MPT to keep the link busy and decreasing the window when MPD > MPT to limit the congestion. MPD adds a small, fixed overhead to packets and is one of the key design elements for finding the bottleneck hop. In the max-min fair state, the hop with maximum latency is the bottleneck hop of the flow for Poseidon; otherwise, the flow has not reached its fair share along its path. In the former case the flow must decrease its congestion window, and in the latter case it must ignore the congestion and increase the window. This may be achieved by adjusting the target.
CC function 324 calculates the maximum per-hop delay target (MPT) for each flow based on its rate: the larger the rate is, the smaller MPT will be. This means that flows with higher rates have lower targets, and thus they decrease their windows earlier or more aggressively. This becomes possible with INT, since all flows competing in the same queue tend to observe the same congestion signal (per-hop delay).
In accordance with the disclosed technology, fairness may be achieved on a single hop. Because the disclosed technology makes use of rate-adjusted target delay and delay-based increase/decrease functionality, in some cases faster flows will decrease their rate while slower flows will increase their rate. This may occur if the queuing delay is higher than the faster flow’s target, but lower than the slower flow’s target. In accordance with the disclosed technology, fairness improves in all possible cases:
1. MPD is low, and all flows increase their rate.
2. MPD is high, and all flows decrease their rate.
3. MPD is high, some faster flows decrease their rate, while other slower flows increase their rate.
Assume a queue with two flows, A and B, where b, the rate of B, is larger than a, the rate of A (b > a). As a result, the target of A is larger than the target of B (T(a) > T(b)). We define an update function U(T(rate), delay) as the multiplicative factor (where new_cwnd = cwnd · U(·)) for a specific flow rate and network delay. In order to converge to the line rate, U() is set ≥ 1 if the delay is less than the target and ≤ 1 if the delay is more than the target, assuming, on average, that if the arrival rate is less than the line rate, the delay is low, and if the arrival rate exceeds the line rate, the delay increases.
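Restating the above in equation form (with r denoting the flow rate and d the measured maximum per-hop delay; this is only a formalization of the text, not an additional constraint):

```latex
\mathrm{cwnd}_{\mathrm{new}} = \mathrm{cwnd}\cdot U\!\bigl(T(r),\,d\bigr),\qquad
U\!\bigl(T(r),\,d\bigr)\;\begin{cases}\ge 1 & \text{if } d < T(r)\\[2pt] \le 1 & \text{if } d > T(r)\end{cases}
\qquad\text{and}\qquad a < b \;\Rightarrow\; T(a) > T(b).
```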
In all three cases, if we want to guarantee that the fairness improves or at least stays the same, the updated rates should stay in the red triangle shown in
The disclosed technology achieves high link utilization and fairness if the functions T() and U() satisfy Eq. (1) and Eq. (2).
In accordance with the disclosed technology, any function that satisfies Eqs. (1) and (2) may be used. In accordance with an aspect of the disclosed technology, the following functions are designed to leverage the distance between the target and max-hop delay to not only decide whether to increase or decrease, but to also adjust the update ratio adaptively to reach a better trade-off between stability and fast convergence.
Here, rate is cwnd ∗ MTU / RTT. k defines the minimum target for any flow; p tunes the maximum target when the rate is equal to min_rate and decides how far apart the targets of two flows with close rates can be. In practice, the target cannot be lower than a certain limit without decreasing the throughput, because synchronized arrivals can cause premature window decreases. Nor can the target be very large, because (a) it can cause packet drops in switches when the target delay exceeds the queue capacity, and (b) as long as high utilization is achieved, it is preferable to back-pressure the hosts so that they can leverage other mechanisms such as load balancing and admission control for isolation. We use min_range and max_range to avoid wasting the target range on differentiating rates that occur only rarely. m defines the "step" when updating the rate: the larger m is, the slower the rate of update will be.
When |T(rate) − delay| → 0, then U(rate, delay) → 1. This means that when the delay is far from the target, flows increase/decrease more drastically for faster convergence, and when the delay approaches the target delay, the steps become more gentle to achieve stable flow rates.
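By way of illustration only, the following sketch shows one family of target and update functions consistent with the properties described above: a log-shaped target with minimum k and slope 1/p, and an exponential update whose step shrinks as the delay approaches the target. It is not a reproduction of Eq. (3) or Eq. (4); the parameter values, the clamping bounds, and the function names are assumptions made for this sketch.

```python
import math

# Illustrative parameters; these names and values are assumptions for this
# sketch and are not taken from the disclosure.
K_MIN_TARGET_US = 10.0     # k: minimum target delay for any flow (microseconds)
P = 0.5                    # p: sets the maximum target at the minimum rate and the spacing of targets
M_STEP = 100.0             # m: update "step" scale; larger m means gentler updates
MIN_RATE_GBPS = 0.1        # rates outside this band are clamped so that target
MAX_RATE_GBPS = 100.0      # resolution is not wasted on rates that rarely occur
MAX_STEP_RATIO = 2.0       # bound on the multiplicative change per update (assumed)

def target_delay_us(rate_gbps: float) -> float:
    """Log-shaped, rate-adjusted target: the larger the rate, the smaller the
    target; for clamped rates, T(a) - T(c*a) = ln(c)/p."""
    r = min(max(rate_gbps, MIN_RATE_GBPS), MAX_RATE_GBPS)
    return K_MIN_TARGET_US + math.log(MAX_RATE_GBPS / r) / P

def update_ratio(rate_gbps: float, max_hop_delay_us: float) -> float:
    """Multiplicative cwnd update: above 1 when the measured maximum per-hop
    delay is below the rate-adjusted target, below 1 when it is above, and
    approaching 1 as the delay nears the target (gentler steps near the target)."""
    gap = target_delay_us(rate_gbps) - max_hop_delay_us
    ratio = math.exp(gap / M_STEP)
    return min(max(ratio, 1.0 / MAX_STEP_RATIO), MAX_STEP_RATIO)
```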
Another option for T(rate) that satisfies the requirements is
which is an extension of Swift flow-scaling. In the disclosed technology, Eq. (3) was designed so as to provide a meaningful difference between the targets of flows over all rates. That is, the targets of flows with rates a and c·a have a fixed difference T(a) − T(c·a) = ln(c)/p, providing uniform resolution across all ranges of rates (plot 1006).
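To see why a logarithmically shaped target yields this fixed-difference property, consider any target of the assumed form T(r) = C − ln(r)/p (an illustrative form, not a reproduction of Eq. (3)):

```latex
T(a) - T(c\cdot a)
  \;=\; \Bigl(C - \tfrac{\ln a}{p}\Bigr) - \Bigl(C - \tfrac{\ln (c\,a)}{p}\Bigr)
  \;=\; \frac{\ln(c\,a) - \ln a}{p}
  \;=\; \frac{\ln c}{p},
```

which is independent of a, so flows whose rates differ by the same factor always see the same gap between their targets.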
Swift uses
for flow scaling, which addresses synchronized packet arrivals because it provides higher resolution for small windows, where many flows send packets. Similarly, one option for the update function is to use the ratio of the target over the delay, as is done in Swift.
As mentioned above, if the congestion window falls below 1, pacing is used by CC 324. In this regard, the congestion window (cwnd) is the number of packets that can be outstanding on the wire (sent but not yet acknowledged by the remote end). Suppose that cwnd is 1. This would mean sending one packet and, when that packet is acknowledged as received, sending the next packet. When cwnd < 1, e.g., 0.5, a source cannot send half of a packet each time. Instead, pacing is based on the time it took to send the last packet. For example, suppose it took 100 microseconds to send a packet and receive the acknowledgement. In order to simulate sending half a packet every 100 microseconds, a packet is sent every 200 microseconds. This is implemented by holding a packet for 100 microseconds, then sending it; it takes another 100 microseconds to get the acknowledgement, then the next packet is held for 100 microseconds, and so on. Pacing as used herein may be implemented in the same manner as in the Swift algorithm.
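A minimal sketch of this fractional-window pacing calculation, assuming the last measured round-trip time is used as the base interval (the function name and units are illustrative only):

```python
def pacing_delay_us(cwnd: float, last_rtt_us: float) -> float:
    """When cwnd < 1, hold each packet long enough that, on average, cwnd
    packets are in flight per round trip. For example, cwnd = 0.5 and an
    RTT of 100 us means sending one packet every 200 us, i.e., holding each
    packet an extra 100 us after the previous acknowledgement arrives."""
    if cwnd >= 1.0:
        return 0.0                                    # window-based sending; no extra hold
    inter_send_interval = last_rtt_us / cwnd          # e.g., 100 us / 0.5 = 200 us
    return inter_send_interval - last_rtt_us          # extra hold time, e.g., 100 us
```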
As indicated in
If, at decision diamond 520, a determination is made that the packet is acknowledged as successfully received, processing proceeds to block 550. At block 550, the target rate is updated in accordance with a rate function (function(rate)), and the update_ratio is updated in accordance with an update function based on delay and rate (function(delay, rate)). Eq. (3) above specifies how to compute the target based on the rate, and Eq. (4) shows how to calculate the update ratio. The new cwnd will be the old cwnd multiplied by the update ratio. From block 550, processing proceeds to decision diamond 536, where a determination is made as to whether the update_ratio is less than 1. If the update_ratio is not less than 1, the scaled update ratio and congestion window are updated at block 560, and then processing ends. On the other hand, if the ratio is less than 1, processing proceeds to decision diamond 530, which is discussed above.
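Putting the pieces together, the acknowledgement path described above might be sketched roughly as follows. This is an illustrative composition of the earlier sketches: target_delay_us, update_ratio, and pacing_delay_us refer to the assumed helpers defined above, MTU_BITS and the flow attributes are assumptions, and the handling of decision diamonds 530 and 536 is simplified.

```python
MTU_BITS = 1500 * 8   # assumed packet size for converting cwnd to a rate

def on_ack(flow, max_hop_delay_us: float, rtt_us: float) -> None:
    """Illustrative ACK handler: derive the flow's rate from its window,
    compute the rate-adjusted target and update ratio, and scale cwnd."""
    # rate = cwnd * MTU / RTT, expressed here in Gbps (bits per microsecond / 1000)
    rate_gbps = flow.cwnd * MTU_BITS / (rtt_us * 1000.0)
    ratio = update_ratio(rate_gbps, max_hop_delay_us)   # > 1 grows, < 1 shrinks the window
    flow.cwnd = max(flow.min_cwnd, flow.cwnd * ratio)   # new cwnd = old cwnd * update ratio
    # If the window drops below one packet, fall back to pacing (see sketch above).
    flow.pacing_hold_us = pacing_delay_us(flow.cwnd, rtt_us)
```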
As shown in
Each computing device 610A-K may include a standalone computer (e.g., desktop or laptop) or a server. The network 640 may include data buses, etc., internal to a computing device, and/or may include one or more of a local area network, virtual private network, wide area network, or other types of networks described below. Memory 616A-K stores information accessible by the one or more processors 612A-K, including instructions 632A-K and data 634A-K that may be executed or otherwise used by the processor(s) 612A-K. The memory 616A-K may be of any type capable of storing information accessible by a respective processor, including a computing device-readable medium or other medium that stores data that may be read with the aid of an electronic device, such as a hard drive, memory card, ROM, RAM, DVD or other optical disks, as well as other write-capable and read-only memories. Systems and methods may include different combinations of the foregoing, whereby different portions of the instructions and data are stored on different types of media.
The instructions 632A-K may be any set of instructions to be executed directly (such as machine code) or indirectly (such as scripts) by the processor. One or more instructions executed by the processors can represent an operation performed by the processor. For example, the instructions may be stored as computing device code on the computing device-readable medium. In that regard, the terms "instructions," "routines," and "programs" may be used interchangeably herein to refer to code that is executed by the processor to perform corresponding operations. The instructions may be stored in object code format for direct processing by the processor, or in any other computing device language, including scripts or collections of independent source code modules that are interpreted on demand or compiled in advance.
The data 634A-K may be retrieved, stored, or modified by processor(s) 612A-K in accordance with the instructions 632A-K. As an example, data 634A-K associated with memory 616A-K may include data used in supporting services for one or more client devices, an application, etc. Such data may include data to support hosting web-based applications, file share services, communication services, gaming, sharing video or audio files, or any other network-based services.
Each processor 612A-K may be any of, or any combination of, general-purpose and/or specialized processors. The processors 612A-K are configured to implement a machine-check architecture or other mechanism for identifying memory errors and reporting the memory errors to a host kernel. An example of a general-purpose processor includes a CPU. Alternatively, the one or more processors may be a dedicated device such as an FPGA or ASIC, including a tensor processing unit (TPU). Although
Computing devices 610A-K may include displays 620A-K, e.g., monitors having a screen, a touch-screen, a projector, a television, or other device that is operable to display information. The displays 620A-K can provide a user interface that allows for controlling the computing device 610A-K and accessing user space applications and/or data associated with VMs supported in one or more cloud systems 650A-M, e.g., on a host in a cloud system. Such control may include, for example, using a computing device to cause data to be uploaded through input system 628A-K to cloud systems 650A-M for processing, to cause accumulation of data on storage 636A-K, or, more generally, to manage different aspects of a customer's computing system. In some examples, computing devices 610A-K may also access an API that allows them to specify workloads or jobs that run on virtual machines (VMs) in the cloud as part of IaaS (Infrastructure-as-a-Service) or SaaS (Software-as-a-Service). While input system 628A-K may be used to upload data, e.g., via a USB port, computing devices 610A-K may also include a mouse, keyboard, touchscreen, or microphone that can be used to receive commands and/or data.
The network 640 may include various configurations and protocols including short-range communication protocols such as Bluetooth®, Bluetooth® LE, the Internet, World Wide Web, intranets, virtual private networks, wide area networks, local networks, private networks using communication protocols proprietary to one or more companies, Ethernet, Wi-Fi, HTTP, etc., and various combinations of the foregoing. Such communication may be facilitated by any device capable of transmitting data to and from other computing devices, such as modems and wireless interfaces. Computing devices 610A-K can interface with the network 640 through communication interfaces 624A-K, which may include the hardware, drivers, and software necessary to support a given communications protocol.
Network 640 may also implement network slicing. Network slicing supports customizing the capacity and capabilities of a network for different services, such as connected home, video/audio streaming (buffered or real-time), geolocation and route planning, sensor monitoring, computer vision, vehicular communication, etc. Edge data center processing and local data center processing augment central data center processing to allocate 5G, 6G, and future network resources to enable smartphones, AR/VR/XR units, home entertainment systems, industrial sensors, cars and other vehicles, and other wirelessly-connected devices. Not only can terrestrial network equipment support connected home, video/audio streaming (buffered or real-time), geolocation and route planning, sensor monitoring, computer vision, vehicular communication, etc., but non-terrestrial network equipment can also enable 5G, 6G, and future wireless communications in additional environments such as marine, rural, and other locations that experience inadequate base station coverage. In support of computer vision, object counting, intrusion detection, motion detection, traffic monitoring, health monitoring, device or target localization, pedestrian avoidance, AR/VR/XR experiences, enhanced autonomous/terrestrial object navigation, ultra high-definition environment imaging, etc., 5G, 6G, and future wireless networks enable fine-range sensing and sub-meter precision localization. Leveraging massive bandwidths and wireless resource (time, frequency, space) sharing, these wireless networks enable simultaneous communications and sensing capabilities to support radar applications in smart displays, smartphones, AR/VR/XR units, smart speakers, industrial sensors, cars and other vehicles, and other wirelessly-connected devices.
Cloud computing systems 650A-M may include one or more data centers that may be linked via high speed communications or computing networks. A data center may include dedicated space within a building that houses computing systems and their associated components, e.g., storage systems and communication systems. Typically, a data center will include racks of communication equipment, servers/hosts, and disks. The servers/hosts and disks comprise physical computing resources that are used to provide virtual computing resources such as VMs. To the extent a given cloud computing system includes more than one data center, those data centers may be at different geographic locations within relatively close proximity to each other, chosen to deliver services in a timely and economically efficient manner, as well as to provide redundancy and maintain high availability. Similarly, different cloud computing systems are typically provided at different geographic locations.
As shown in
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. For instance, max_hop_latency is one signal to which this disclosure is applicable. Another signal may be minimum available bandwidth. In that instance, we set a target for the minimum available bandwidth that can be adjusted at runtime and make sure the minimum available bandwidth measured from the network is maintained around that target. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims.
The present application claims the benefit of the filing date of U.S. Provisional Pat. Application No. 63/332,421, filed on Apr. 19, 2022, the disclosure of which is hereby incorporated herein by reference.