The present invention relates to communication systems, and in particular, but not exclusively to, congestion control.
When multiple nodes, also referred to as sending nodes, want to send packets to the same destination (or receiving) node over a network via a switch, there may be congestion in the switch possibly leading to dropped packets. One congestion control solution includes the switch adding an indication to packets when the switch buffer becomes too full. Upon receiving the packets in a network interface controller (NIC) of the destination node, the NIC sends a notification to NICs of the sending nodes to reduce sending rate, thereby reducing the congestion.
Some systems measure roundtrip time, or delay, in the network from a sending node to a receiving node to provide an indication of congestion, and adjust the sending rate according to the delay. For example, if there are N NICs sending to a single NIC, then each NIC will send at 1/N of the line rate to avoid congestion. In other words, each of the N NICs may send one packet, wait for the period of time needed to send N−1 packets, then send the next packet, and so on. In this scenario, the switch buffer is statistically never empty because the NICs do not send in a synchronized manner, in which NIC 1 sends, then NIC 2, and so on.
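The 1/N pacing described above can be illustrated with a minimal sketch. The function name, parameter names, and the example line rate and packet size are assumptions chosen for illustration and do not come from any particular system.

```python
def inter_packet_gap(line_rate_bps: float, packet_bits: int, n_senders: int) -> float:
    """Seconds between packet starts so that one sender uses 1/N of the line rate."""
    if n_senders < 1:
        raise ValueError("n_senders must be at least 1")
    # Time needed to put one packet on the wire at the full line rate.
    serialization_time = packet_bits / line_rate_bps
    # Send one packet, then stay idle for the time of N-1 packets.
    return n_senders * serialization_time


# Hypothetical numbers: 100 Gb/s link, 4096-byte packets, 4 senders.
gap_seconds = inter_packet_gap(line_rate_bps=100e9, packet_bits=4096 * 8, n_senders=4)
```

With a single sender the gap equals the serialization time of one packet, which corresponds to sending at the full line rate.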
A more recent example of congestion control using the roundtrip time or measured delay is described in a paper entitled “Swift: Delay is Simple and Effective for Congestion Control in the Datacenter”, by Kumar, et al. The paper describes a congestion control system that assumes that the switch buffer fullness is of the order of the square root of N. Therefore, the expected delay of sending a packet from a sending NIC to a receiving NIC via the switch is of the order of the reciprocal of the square root of the sending rate. Therefore, based on the measured delay, the sending rate may be adjusted.
There is provided in accordance with an embodiment of the present disclosure, a data communication device, including a network interface to receive first packets over a network from another network device via a switch, which includes a buffer associated with a variable buffer delay, and packet processing circuitry to compute respective measures of delay over the network to the other network device over time responsively to the received first packets, find a minimum measure of delay over the network to the other network device responsively to at least some of the computed respective measures of delay, estimate a current measure of buffer delay of the buffer responsively to the found minimum measure of delay and a current one of the computed respective measures of delay, set a packet processing parameter responsively to the estimated current measure of buffer delay, and process second packets responsively to the set packet processing parameter.
Further in accordance with an embodiment of the present disclosure, the first packets are indicative of the respective measures of delay over the network to the other network device over time.
Still further in accordance with an embodiment of the present disclosure the first packets include data indicative of the respective measures of delay over the network to the other network device over time.
Additionally in accordance with an embodiment of the present disclosure respective roundtrip times of the first packets are indicative of the respective measures of delay over the network to the other network device over time.
Moreover, in accordance with an embodiment of the present disclosure the current measure of buffer delay is a relative buffer delay between a current buffer delay and a minimum buffer delay of the buffer.
Further in accordance with an embodiment of the present disclosure the packet processing parameter is a transmission parameter, and the packet processing circuitry is configured to transmit the second packets responsively to the transmission parameter.
Still further in accordance with an embodiment of the present disclosure the transmission parameter is a current transmission rate, and the packet processing circuitry is configured to transmit the second packets to the other network device responsively to the current transmission rate.
Additionally in accordance with an embodiment of the present disclosure the packet processing circuitry is configured to adjust a previous transmission rate to the current transmission rate responsively to the estimated current measure of buffer delay.
Moreover, in accordance with an embodiment of the present disclosure the packet processing circuitry is configured to perform congestion control responsively to the transmission parameter.
Further in accordance with an embodiment of the present disclosure the packet processing circuitry is configured to find the minimum measure of delay over the network to the other network device as a local minimum measure of delay responsively to a function describing the respective measures of delay over the network to the other network device over time.
Still further in accordance with an embodiment of the present disclosure the packet processing circuitry is configured to estimate the current measure of the buffer delay of the buffer as a relative delay responsively to the current one of the computed respective measures of delay less the local minimum measure of delay.
Additionally in accordance with an embodiment of the present disclosure the packet processing circuitry is configured to estimate the current measure of the buffer delay of the buffer as a relative delay responsively to the current one of the computed respective measures of delay less the found minimum measure of delay.
Moreover, in accordance with an embodiment of the present disclosure, the device includes a network interface controller including the network interface and the packet processing circuitry.
Further in accordance with an embodiment of the present disclosure the packet processing circuitry is configured to compute the respective measures of delay over the network to the other network device over time responsively to respective roundtrip times via the other network device.
Still further in accordance with an embodiment of the present disclosure the packet processing circuitry is configured to compute the respective measures of delay over the network to the other network device over time responsively to one-way delay to the other network device.
There is also provided in accordance with another embodiment of the present disclosure a networking method, including receiving first packets over a network from another network device via a switch, which includes a buffer associated with a variable buffer delay, computing respective measures of delay over the network to the other network device over time responsively to the received first packets, finding a minimum measure of delay over the network to the other network device responsively to at least some of the computed respective measures of delay, estimating a current measure of buffer delay of the buffer responsively to the found minimum measure of delay and a current one of the computed respective measures of delay, setting a packet processing parameter responsively to the estimated current measure of buffer delay, and processing second packets responsively to the set packet processing parameter.
Additionally in accordance with an embodiment of the present disclosure the current measure of buffer delay is a relative buffer delay between a current buffer delay and a minimum buffer delay of the buffer.
Moreover, in accordance with an embodiment of the present disclosure the packet processing parameter is a transmission parameter, the processing including transmitting the second packets responsively to the transmission parameter.
Further in accordance with an embodiment of the present disclosure the transmission parameter is a current transmission rate, the transmitting including transmitting the second packets to the other network device responsively to the current transmission rate.
Still further in accordance with an embodiment of the present disclosure, the method includes adjusting a previous transmission rate to the current transmission rate responsively to the estimated current measure of buffer delay.
Additionally in accordance with an embodiment of the present disclosure, the method includes performing congestion control responsively to the transmission parameter.
Moreover, in accordance with an embodiment of the present disclosure the finding includes finding the minimum measure of delay over the network to the other network device as a local minimum measure of delay responsively to a function describing the respective measures of delay over the network to the other network device over time.
Further in accordance with an embodiment of the present disclosure the estimating includes estimating the current measure of the buffer delay of the buffer as a relative delay responsively to the current one of the computed respective measures of delay less the local minimum measure of delay.
There is also provided in accordance with still another embodiment of the present disclosure a software product, including a non-transient computer-readable medium in which program instructions are stored, which instructions, when read by a central processing unit (CPU), cause the CPU to compute respective measures of delay over a network to another network device over time responsively to received first packets, find a minimum measure of delay over the network to the other network device responsively to at least some of the computed respective measures of delay, estimate a current measure of buffer delay of a buffer of a switch in the network responsively to the found minimum measure of delay and a current one of the computed respective measures of delay, set a packet processing parameter responsively to the estimated current measure of buffer delay, and process second packets responsively to the set packet processing parameter.
The present invention will be understood from the following detailed description, taken in conjunction with the drawings in which:
Overview
As previously mentioned, the expected delay of sending a packet from a sending NIC to a receiving NIC via a switch may be in the order of the reciprocal of the square root of the sending rate. Therefore, based on the measured delay, the sending rate may be adjusted.
The above solution computes the sending rate based on the total delay (e.g., round trip delay) between the sending NIC and the receiving NIC. However, total delay also includes propagation delay in the network related to the position of the sending NIC in the cluster or network, and the topology of the network, for example, due to other switches in the network. Therefore, the total delay is not an accurate indicator of the delay due to the buffer of the switch. Without knowing the propagation delay, sending NICs closer to the receiving NIC will generally measure lower delay than sending NICs further away from the receiving NIC. This causes network unfairness. For example, if there are two NICs sending packets, a first NIC measuring more delay and a second NIC measuring less, the first NIC will send packets at a lower rate than the second NIC, even though both sending NICs are sending to the same receiving NIC.
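The unfairness may be illustrated numerically with the following sketch. It assumes a hypothetical rate rule derived from the relationship stated above (delay of the order of the reciprocal of the square root of the sending rate, so rate of the order of the reciprocal of the delay squared). The constant k, the delay values, and all names are assumptions for illustration only.

```python
def rate_from_delay(delay_us: float, k: float = 100.0) -> float:
    """Hypothetical rule: delay ~ k / sqrt(rate), hence rate ~ (k / delay) ** 2."""
    return (k / delay_us) ** 2


# Both senders share the same switch buffer, so they see the same buffer delay.
buffer_delay_us = 20.0
# Assumed propagation delays for a sender near to, and far from, the receiver.
propagation_us = {"near": 5.0, "far": 30.0}

# Rates computed from total delay (propagation + buffer) differ between senders.
rates = {
    name: rate_from_delay(prop + buffer_delay_us)
    for name, prop in propagation_us.items()
}
```

Under these assumed numbers the far sender computes a quarter of the rate of the near sender, although the buffer delay is identical for both.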
Each sending NIC knows what it is sending and the roundtrip time to the receiving NIC. It is also assumed that each sending NIC does not know what is happening in the network with respect to other sending NICs in the network sending to the same receiving NIC. Therefore, the delay over the network is generally easy to measure, whereas the buffer level is difficult to measure.
The above problems may occur with any congestion control scheme which is based on network delay to determine the rate at which a NIC should be sending packets. In general, any system which sets a packet processing parameter based on measured delay over the network may also suffer from similar problems.
One solution is to estimate propagation delay and subtract it from the roundtrip time (RTT). For example, the number of switches over the path in the network could be counted to estimate propagation delay. However, counting the number of switches per path is a hard task, since all the switches in the path need to perform this count, or a central entity in the network needs to know the number of switches in the path in advance.
Therefore, in some embodiments, at least some of the problems are solved by the sending NIC estimating a current measure of buffer delay based on a current measure of delay (from the sending NIC to the receiving NIC) and a minimum measure of delay (found by taking a minimum of many measures of delay from the sending NIC to the receiving NIC).
In some embodiments, the sending NIC receives packets which are indicative of measures of delay over the network to the receiving NIC using roundtrip time delay or one-way delay, for example. The sending NIC computes respective measures of delay over the network to the receiving NIC over time responsively to respective ones of the packets. The sending NIC may then find a minimum measure of delay from the computed measures of delay. The minimum measure of delay may be a local minimum.
The current measure of buffer delay may then be estimated by the sending NIC based on the current measure of delay less the (local) minimum measure of delay. The estimated current measure of buffer delay may then be used to set a packet processing parameter such as a transmission parameter (e.g., transmission rate) with which to process packets and thereby provide congestion control in the network and in the switch.
The estimate of the current measure of buffer delay may be expressed as a relative delay between the current measure of buffer delay and the local minimum buffer delay, as detailed now below.
The local minimum measure of delay = propagation delay from the sending NIC to the receiving NIC + the local minimum buffer delay (equation 1).
The current measure of delay = propagation delay from the sending NIC to the receiving NIC + the current measure of buffer delay (equation 2).
Therefore, the relative delay =
current measure of delay − local minimum measure of delay =
equation 2 − equation 1 =
current measure of buffer delay − local minimum buffer delay.
The relative delay provides a good estimate of current actual buffer delay.
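The cancellation of the propagation delay in the relative delay can be checked with a short worked example. The delay values and names below are assumptions for illustration. Two senders at different distances from the receiver obtain the same relative delay when they share the same switch buffer.

```python
def relative_delay(current_delay_us: float, local_min_delay_us: float) -> float:
    """Current measure of delay less the local minimum measure of delay."""
    return current_delay_us - local_min_delay_us


# Assumed buffer delays at the shared switch (microseconds).
min_buffer_us, current_buffer_us = 2.0, 20.0

results = {}
for name, propagation_us in {"near": 5.0, "far": 30.0}.items():
    local_min = propagation_us + min_buffer_us      # equation 1
    current = propagation_us + current_buffer_us    # equation 2
    # Propagation delay appears in both terms and cancels in the difference.
    results[name] = relative_delay(current, local_min)
```

Both entries of results equal the current buffer delay less the minimum buffer delay (18 microseconds under these assumed values), independent of the propagation delay of each sender.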
System Description
Reference is now made to
Therefore, the delay from one of the data communication devices 12-1, 12-2 to the data communication device 12-3 is comprised of propagation delay across the network, plus buffer delay in the buffer 18 of the switch 16.
In the example of
Each data communication device 12 includes a network interface 20 and packet processing circuitry 22. Each data communication device 12 may include a network interface controller 24 comprising the network interface 20 and the packet processing circuitry 22.
Reference is now made to
Data communication device 12-1 may be configured to measure delay or roundtrip time from the data communication device 12-1 to another network device (e.g., the data communication device 12-3) via the switch 16. There are different methods to perform the delay or roundtrip time measurements. One method includes the data communication device 12-1 sending a data packet to the other network device and receiving an acknowledgement (ACK) packet from the other network device. Another method includes the data communication device 12-1 sending a dedicated packet to the other network device, which sends that packet back to the data communication device 12-1. Therefore, network interface 20 is configured to receive packets over the network 14 from another network device (e.g., the data communication device 12-3) via the switch 16 (block 202) related to delay or roundtrip time measurement. The packet processing circuitry 22 is configured to compute respective measures of delay over the network 14 to the other network device over time responsively to the received packets (block 204).
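The ACK-based measurement method may be sketched in software as follows. This is a minimal illustration only: the class name RttMeter, the use of per-packet sequence numbers to match ACK packets to sent packets, and the injectable clock are assumptions not taken from the description above.

```python
import time
from typing import Callable, Dict, Optional


class RttMeter:
    """Records send timestamps and turns ACK arrivals into roundtrip-time samples."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._sent_at: Dict[int, float] = {}  # sequence number -> send timestamp

    def on_send(self, seq: int) -> None:
        """Call when a data packet with sequence number seq is transmitted."""
        self._sent_at[seq] = self._clock()

    def on_ack(self, seq: int) -> Optional[float]:
        """Call when an ACK for seq arrives; returns the roundtrip time, if known."""
        sent = self._sent_at.pop(seq, None)
        if sent is None:
            return None  # unknown or duplicate ACK yields no sample
        return self._clock() - sent
```

Each value returned by on_ack is one measure of delay. Collecting the values over time gives the series of respective measures of delay used in the subsequent steps.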
Reference is now made to
Reference is again made to
In some embodiments, respective roundtrip times of the received packets are indicative of the respective measures of delay over the network 14 to the other network device over time. Therefore, the packet processing circuitry 22 is configured to compute the respective measures of delay over the network 14 to the other network device over time responsively to respective roundtrip times via the other network device (i.e., from the data communication device 12-1 to the other network device and back to the data communication device 12-1).
The packet processing circuitry 22 is configured to find a minimum measure of delay over the network 14 to the other network device responsively to at least some of the computed respective measures of delay (block 206). In some embodiments, the packet processing circuitry 22 is configured to find the minimum measure of delay over the network 14 to the other network device as a local minimum measure of delay (arrow 302 in
In some embodiments, the packet processing circuitry 22 uses a local minimum instead of global minimum. The local minimum is the minimum measure of delay in the most recent cycle (e.g., buffer cycle) of the graph 300 or function. One reason to use the local minimum instead of a global minimum is that the global minimum may never occur for one or more of the data communication devices 12. Additionally, the global minimum may be very high for flows that commence after the congestion started. However, if the local minimum is used then all the data communication devices 12 should be aligned to the same minimum after a short period of time (e.g., within one cycle of the graph 300). Every cycle, the most recent local minimum is new and is used by the packet processing circuitry 22 in the steps described below.
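One possible way to track the most recent local minimum is sketched below. The sketch assumes that a cycle ends when the measured delay turns upward after having fallen, so that the preceding sample is taken as the local minimum. This turning-point rule, the class name, and the sample values are assumptions for illustration, and other definitions of a cycle are possible.

```python
from typing import List, Optional


class LocalMinTracker:
    """Keeps the most recent local minimum of a stream of delay samples."""

    def __init__(self) -> None:
        self._prev: Optional[float] = None
        self._falling = False
        self.local_min: Optional[float] = None  # None until a first trough is seen

    def update(self, delay: float) -> Optional[float]:
        if self._prev is not None:
            if delay < self._prev:
                self._falling = True
            elif delay > self._prev and self._falling:
                # Delay rose after falling: the previous sample was a trough.
                self.local_min = self._prev
                self._falling = False
        self._prev = delay
        return self.local_min


tracker = LocalMinTracker()
samples = [30.0, 25.0, 22.0, 27.0, 35.0, 28.0, 24.0, 26.0]
history: List[Optional[float]] = [tracker.update(s) for s in samples]
```

In this run the tracker reports no minimum until the first trough (22.0) has been passed, and replaces it with 24.0 once the second trough has been passed. A practical implementation would likely smooth the samples first, since this simple rule treats every small fluctuation as a new cycle.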
The packet processing circuitry 22 is configured to estimate a current measure of buffer delay of the buffer 18 responsively to the found minimum measure of delay (found in the step of block 206) and a current computed respective measure of delay (block 208).
The estimate of the current measure of buffer delay may be expressed as a relative delay between the current measure of buffer delay and the (local) minimum buffer delay of the buffer 18, as detailed now below.
The local minimum measure of delay = propagation delay from the sending data communication device 12 to the receiving data communication device 12 + the local minimum buffer delay (equation 1).
The current measure of delay = propagation delay from the sending data communication device 12 to the receiving data communication device 12 + the current measure of buffer delay (equation 2).
Therefore, the relative delay =
current measure of delay − local minimum measure of delay =
equation 2 − equation 1 =
current measure of buffer delay − local minimum buffer delay.
The relative delay provides a good estimate of current actual buffer delay.
Therefore, in some embodiments, the packet processing circuitry 22 is configured to estimate the current measure of the buffer delay of the buffer 18 as a relative delay responsively to the current computed measure of delay less the found minimum measure of delay, which may equal the local minimum measure of delay.
The packet processing circuitry 22 is configured to set a packet processing parameter responsively to the estimated current measure of buffer delay (block 210). In some embodiments, the packet processing parameter is a transmission parameter. In some embodiments, the transmission parameter is a current transmission rate with which to send packets. For example, the transmission rate may be set as a function of the relative delay.
In some embodiments, the packet processing circuitry 22 may be configured to adjust a previous transmission rate to the current transmission rate responsively to the estimated current measure of buffer delay. For example, if the relative delay increases, the transmission rate may be reduced, and if the relative delay decreases, the transmission rate may be increased.
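The adjustment described above may be sketched as follows. The description does not specify the adjustment function. The additive increase, the multiplicative decrease, the rate limits, and all parameter names and default values below are assumptions chosen for illustration.

```python
def adjust_rate(
    prev_rate: float,
    prev_relative_delay: float,
    relative_delay: float,
    step: float = 1.0,       # assumed additive increase (e.g., in Gb/s)
    backoff: float = 0.5,    # assumed multiplicative decrease factor
    min_rate: float = 1.0,   # assumed lower rate limit
    max_rate: float = 100.0, # assumed upper rate limit
) -> float:
    """Return the current transmission rate given the change in relative delay."""
    if relative_delay > prev_relative_delay:
        new_rate = prev_rate * backoff   # relative delay increased: reduce the rate
    elif relative_delay < prev_relative_delay:
        new_rate = prev_rate + step      # relative delay decreased: increase the rate
    else:
        new_rate = prev_rate             # no change in relative delay
    # Keep the result within the assumed limits.
    return max(min_rate, min(max_rate, new_rate))
```

For example, under these assumed defaults, adjust_rate(50.0, 10.0, 15.0) yields a rate lower than 50.0, while adjust_rate(50.0, 15.0, 10.0) yields a higher rate.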
The packet processing circuitry 22 is configured to process packets responsively to the set packet processing parameter (block 212).
In some embodiments, the packet processing circuitry 22 is configured to perform congestion control responsively to the transmission parameter (block 214). The step of block 214 may include the packet processing circuitry 22 being configured to transmit packets responsively to the transmission parameter (block 216). In some embodiments, the packet processing circuitry 22 is configured to transmit packets to the other network device responsively to the current transmission rate.
In practice, some or all of the functions of the packet processing circuitry 22 may be combined in a single physical component or, alternatively, implemented using multiple physical components. These physical components may comprise hard-wired or programmable devices, or a combination of the two. In some embodiments, at least some of the functions of the packet processing circuitry 22 may be carried out by a programmable processor under the control of suitable software. This software may be downloaded to a device in electronic form, over a network, for example. Alternatively, or additionally, the software may be stored in tangible, non-transitory computer-readable storage media, such as optical, magnetic, or electronic memory.
Various features of the invention which are, for clarity, described in the contexts of separate embodiments may also be provided in combination in a single embodiment. Conversely, various features of the invention which are, for brevity, described in the context of a single embodiment may also be provided separately or in any suitable sub-combination.
The embodiments described above are cited by way of example, and the present invention is not limited by what has been particularly shown and described hereinabove. Rather the scope of the invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.