The present invention relates generally to packet communication networks, and particularly to methods and systems for control of congestion in such networks.
Network congestion occurs when a link or node in the network is required to carry more data traffic than it is capable of transmitting or forwarding, with the result that its quality of service deteriorates. Typical effects of congestion include queueing delay, packet loss, and blocking of new connections. Modern packet networks use congestion control (including congestion avoidance) techniques to try to mitigate congestion before catastrophic results set in.
A number of congestion avoidance techniques are known in the art. In random early detection (RED, also known as random early discard or random early drop), for example, network nodes, such as switches, monitor their average queue size and drop packets based on statistical probabilities: If a given queue (or set of queues) is almost empty, all incoming packets are accepted. As the queue grows, the probability of dropping an incoming packet grows accordingly, reaching 100% when the buffer is full. Weighted RED (WRED) works in a similar fashion, except that different traffic classes are assigned different thresholds.
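By way of illustration, the probability ramp at the heart of RED can be sketched as follows. The minimum/maximum thresholds, the use of an instantaneous queue length in place of the exponentially weighted average, and the function names are simplifying assumptions, not a complete implementation of the algorithm.

```python
import random

def red_drop_probability(queue_len, min_th, max_th):
    """Simplified RED ramp: accept everything below min_th, drop everything
    above max_th, and increase the drop probability linearly in between.
    Full RED also averages the queue length and applies a count correction."""
    if queue_len <= min_th:
        return 0.0
    if queue_len >= max_th:
        return 1.0
    return (queue_len - min_th) / (max_th - min_th)

def should_drop(queue_len, min_th, max_th):
    return random.random() < red_drop_probability(queue_len, min_th, max_th)

# WRED applies the same ramp with different thresholds per traffic class, e.g.:
wred_profiles = {"high-priority": (60, 100), "low-priority": (20, 60)}
```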
Another congestion avoidance technique is Explicit Congestion Notification (ECN), which is an extension to the Internet Protocol (IP) and the Transmission Control Protocol (TCP). ECN was initially defined by Ramakrishnan, et al., in “The Addition of Explicit Congestion Notification (ECN) to IP,” which was published as Request for Comments (RFC) 3168 of the Internet Engineering Task Force (2001) and is incorporated herein by reference. ECN provides end-to-end notification of network congestion without dropping packets, by signaling impending congestion in the IP header of transmitted packets. The receiver of an ECN-marked packet of this sort echoes the congestion indication to the sender, which reduces its transmission rate as though it had detected a dropped packet.
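The essential behavior described in RFC 3168, namely marking ECN-capable packets at a congested node rather than dropping them, can be sketched as follows; the constants mirror the codepoints of the two-bit ECN field, while the function itself is an illustrative fragment rather than part of any actual IP stack.

```python
# Codepoints of the two-bit ECN field in the IP header (RFC 3168)
NOT_ECT = 0b00  # transport is not ECN-capable
ECT_1 = 0b01    # ECN-Capable Transport
ECT_0 = 0b10    # ECN-Capable Transport
CE = 0b11       # Congestion Experienced

def handle_packet_under_congestion(ecn_field):
    """At a congested node, mark ECN-capable packets instead of dropping them.
    Returns (new_ecn_field, drop). The receiver later echoes the CE mark to the
    sender, which reduces its rate as though a packet had been dropped."""
    if ecn_field in (ECT_0, ECT_1):
        return CE, False    # signal impending congestion, keep the packet
    return ecn_field, True  # non-ECN traffic falls back to being dropped
```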
Other congestion avoidance techniques use adaptive routing, in which routing paths for packets are selected based on the network state, such as traffic load or congestion. Adaptive routing techniques are described, for example, in U.S. Pat. Nos. 8,576,715 and 9,014,006, whose disclosures are incorporated herein by reference.
Embodiments of the present invention that are described hereinbelow provide improved methods and systems for congestion control in a network.
There is therefore provided, in accordance with an embodiment of the invention, communication apparatus, including multiple interfaces configured for connection to a packet data network as ingress and egress interfaces of the apparatus. A memory is coupled to the interfaces and configured to contain packets awaiting transmission to the network in multiple queues, which are associated with respective egress interfaces and are assigned respective transmission priorities. Control logic is configured to assign to the queues respective weighting factors, such that the weighting factors of the queues vary inversely with the respective transmission priorities of the queues. The control logic calculates for each egress interface a respective interface congestion level based on respective queue lengths of the queues associated with the egress interface, and calculates effective congestion levels for the queues as a weighted function of the respective queue lengths and the respective interface congestion level, weighted by the respective weighting factors, of the egress interfaces with which the queues are respectively associated. Congestion control is applied to the queues responsively to the effective congestion levels.
In the disclosed embodiments, the weighted function includes a weighted sum, in which the respective interface congestion level is weighted by the respective weighting factors of the queues. In one embodiment, the weighting factors are assigned values between zero, for the queues having a highest priority level, and one, for the queues having a lowest priority level.
In some embodiments, the control logic is configured to send congestion notifications over the network when an effective congestion level for a given queue exceeds a predefined threshold. Additionally or alternatively, the control logic is configured to drop packets from a queue when an effective congestion level of the queue exceeds a predefined threshold. Further additionally or alternatively, the control logic is configured to apply adaptive routing to reroute packets awaiting transmission in a queue when an effective congestion level of the queue exceeds a predefined threshold.
There is also provided, in accordance with an embodiment of the invention, a method for communication, which includes holding packets awaiting transmission from a network element to a network in multiple queues, which are associated with respective egress interfaces of the network element and are assigned respective transmission priorities. Respective weighting factors are assigned to the queues, such that the weighting factors of the queues vary inversely with the respective transmission priorities of the queues. For each egress interface, a respective interface congestion level is calculated based on respective queue lengths of the queues associated with the egress interface. Effective congestion levels are calculated for the queues as a weighted function of the respective queue lengths and the respective interface congestion level, weighted by the respective weighting factors, of the egress interfaces with which the queues are respectively associated. Congestion control is applied to the queues responsively to the effective congestion levels.
The present invention will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:
In statistical congestion control techniques that are known in the art, such as ECN and WRED, congestion control measures are applied to a certain fraction of the packets that are to be transmitted from each queue in a network element to the network, depending on the respective length of each queue. In other words, a certain threshold is defined for each queue, and the probability that a given packet will be marked with a congestion notification (such as in ECN) or dropped (such as in WRED) depends on the relation between the current length of that queue and the threshold.
In many cases, however, the egress interfaces of a network element, such as the ports of a switch, are configured to transmit packets in different traffic classes, with different, respective transmission priorities. In such cases, the packets for transmission through a given interface are held in multiple different queues, depending on their priorities, and the actual congestion that a packet faces depends not only on the length of its own queue, but also on the other queues that are competing for transmission bandwidth through the same interface. This effect becomes more marked for packets in low-priority queues, which may encounter long transmission delays even when their own queue is short, as high-priority packets in other queues take up most of the available bandwidth. Under these circumstances, it can be difficult or impossible to find an optimal working point for congestion control measures based on the queue length alone.
Embodiments of the present invention that are described herein address this problem by applying congestion control measures to each queue based on effective congestion levels. The effective congestion for each queue is calculated as (or estimated by) a weighted function of its own, respective queue length together with an interface congestion level of the egress interface with which the queue is associated. The interface congestion level is typically calculated based on the respective queue lengths of all the queues associated with the particular egress interface, for example, as a sum or average of the queue lengths. In calculating the effective congestion level for each queue, the interface congestion level is weighted, relative to the queue length itself, by a weighting factor that varies inversely with the respective transmission priorities of the queues, i.e., the interface congestion level receives a high weight for low-priority queues, and vice versa. Thus, generally speaking, the effective congestion level for low-priority queues will be strongly influenced by the overall interface congestion level, while high-priority queues will be influenced less or not at all.
The term “varies inversely,” as used in the context of the present description and in the claims in reference to the weighting factor, means simply that as the priority increases, the weighting factor decreases, and this term applies to any sort of function that behaves in this way. For example, in some embodiments, the network element computes the effective congestion level for each queue as a weighted sum of the queue length and the interface congestion level, with the interface congestion level weighted in the sum by the respective weighting factor that is assigned to the queue. The weighting factors in this case are typically assigned values between zero, for the queues having the highest priority level, and one, for the queues having the lowest priority level. In embodiments of the present invention, the function that is used in setting the weighting factors can be configured as desired, depending on factors such as scheduling policies and network connectivity, and may even vary among different interfaces of the same network element.
The techniques disclosed herein thus enable a network element to optimize application of congestion control measures under different conditions, while imposing only a minimal additional computational burden relative to queue length-based methods that are known in the art. The effective congestion levels that are calculated in this manner may be used in place of the ordinary queue length in a variety of congestion control techniques, such as sending congestion notifications, dropping packets from a queue when necessary, and deciding when to apply adaptive routing.
Incoming packets received by switch 20, such as packet 26, are transferred to a memory 36, which buffers packets awaiting transmission to the network in multiple queues. Although only the single memory 36 is shown in
Control logic 38 in switch 20 executes forwarding and queuing decisions, and thus assigns outgoing packets to appropriate queues for transmission to network 24. Each such queue is associated with a respective egress interface and is assigned a respective transmission priority. Typically, as noted earlier, multiple queues, with different, respective priority levels, are associated with each egress interface. Control logic 38 monitors the length of each queue and calculates an effective congestion level for the queue as a weighted function of its own queue length and an interface congestion level, which is based on the queue lengths of all the queues that are associated with the same egress interface.
Control logic 38 decides whether and how to apply congestion control measures to outgoing data packets in each queue, by comparing the effective congestion level of the queue to a respective threshold. The congestion measures typically include statistical congestion control, such as ECN or WRED, which is applied to a respective fraction of the packets that are queued for transmission to network 24 when the effective congestion level passes the threshold. (Multiple different thresholds may be applied to a given queue, whereby logic 38 sends congestion notifications when the effective congestion level passes a certain lower threshold and drops packets when the effective congestion level passes another, higher threshold, for example.) Additionally or alternatively, control logic 38 can apply adaptive routing to choose a new routing path for the packets in a given queue when the effective congestion level for the queue passes a certain threshold. The term “adaptive routing” as used herein is not limited to classical Layer 3 network routing, but rather includes, as well, other sorts of adaptive forwarding of packets when multiple alternative paths through the network are available.
Although control logic 38 is shown in
Furthermore, although the present description relates, for the sake of concreteness and clarity, to the specific switch 20 that is shown in
For each incoming packet 26, packet classification and forwarding logic 42 (within control logic 38) parses and processes header 28 in order to classify the packet priority level and identify the egress port through which the packet is to be forwarded. To the extent that switch 20 is configured for adaptive routing, logic 42 may identify multiple candidate egress ports for the packet. Based on the chosen egress port and priority level, logic 42 specifies the egress queue for the packet, which is meanwhile held in memory 36.
A congestion level estimator 44 calculates the effective congestion level for the chosen queue. For this purpose, estimator 44 calculates a respective interface congestion level I for each port 22, based on the queue lengths of the queues associated with the port, i.e., the queues for which the port is to serve as the egress interface. For example, the interface congestion level may be based on a sum of the queue lengths: I = k*ΣL, where the sum is taken over the queues associated with the port, and k is a normalization constant (which may be set equal to one). Estimator 44 then computes the effective congestion level C for each queue as a weighted sum of the respective queue length L and the corresponding interface congestion level I, weighted by a certain weighting factor β: C = L + β*I.
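The calculation performed by congestion level estimator 44 can be illustrated by the following minimal sketch, which assumes that per-queue lengths are available as plain numbers; the data structures, names, and example values are hypothetical.

```python
def interface_congestion(queue_lengths, k=1.0):
    """I = k * (sum of the lengths of all queues sharing the egress port)."""
    return k * sum(queue_lengths)

def effective_congestion(queue_length, interface_level, beta):
    """C = L + beta * I, with beta varying inversely with the queue priority."""
    return queue_length + beta * interface_level

# Example: three priority queues sharing one egress port
lengths = {"high": 10, "medium": 40, "low": 5}
betas = {"high": 0.0, "medium": 0.5, "low": 1.0}
I = interface_congestion(lengths.values())  # 55.0
C = {q: effective_congestion(L, I, betas[q]) for q, L in lengths.items()}
# The low-priority queue is short, yet its effective congestion level (60.0)
# exceeds that of the longer high-priority queue (10.0), reflecting the
# bandwidth it must contend for on the shared interface.
```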
In general, β is configurable for each priority level (traffic class) and varies inversely with the queue priority, as explained above. For example, β may be set for each queue to a value between 0 and 1, with β=0 for the queues with highest priority and β=1 for the queues with lowest priority. In a weighted round-robin scheduler, for instance, the congestion weighting factor β for each queue will typically be inversely proportional to the weight allocated to the queue by the scheduler. (When all queues have the same weight in the scheduler, β can be set to 1 for all queues.) The function that is applied by congestion estimator 44 in setting β may vary from port to port. Furthermore, when logic 38 applies a hierarchical scheduling policy, estimator 44 can calculate the effective congestion levels over multiple scheduling levels of the hierarchy.
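One plausible way to derive β from the weights of a weighted round-robin scheduler, consistent with the inverse proportionality described above, is to normalize by the smallest weight, so that β equals one for the least-favored queue (and for every queue when all weights are equal). The normalization choice and the example weights below are assumptions for illustration only.

```python
def betas_from_wrr_weights(weights):
    """Congestion weighting factor per queue, inversely proportional to the
    queue's scheduler weight and normalized by the smallest weight so that
    beta falls in (0, 1] (an illustrative normalization, not a requirement)."""
    w_min = min(weights.values())
    return {queue: w_min / w for queue, w in weights.items()}

# Example: the high-priority queue gets four times the bandwidth share
print(betas_from_wrr_weights({"high": 4, "medium": 2, "low": 1}))
# {'high': 0.25, 'medium': 0.5, 'low': 1.0}
```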
Congestion control logic 46 compares the effective congestion level C for each queue to a threshold, or a set of thresholds, in order to choose a congestion treatment to apply to the packets in the queue. For example, logic 46 may mark packets in the queue with congestion notifications, as in ECN, when C exceeds a lower threshold; drop packets from the queue statistically, as in WRED, when C exceeds a higher threshold; and/or invoke adaptive routing to reroute packets awaiting transmission in the queue when C exceeds yet another threshold.
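A minimal sketch of such a comparison is given below; the threshold names, their ordering, and the treatment labels are illustrative assumptions rather than a prescribed configuration.

```python
def choose_treatments(c, notify_th, drop_th, reroute_th):
    """Map a queue's effective congestion level C to congestion treatments.
    Assumes notify_th < drop_th and a separate rerouting threshold."""
    treatments = []
    if c > reroute_th:
        treatments.append("adaptive-routing")  # steer the queue to another path
    if c > drop_th:
        treatments.append("drop")              # WRED-style statistical drops
    elif c > notify_th:
        treatments.append("notify")            # ECN-style congestion marking
    return treatments

# Example: with thresholds (50, 80, 90), C = 85 yields ["drop"]
```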
Once the final target queue for a given packet has been determined, control logic 38 passes the packet to a queuing system 48, which arbitrates among the queues and delivers each packet from memory 36 to its assigned egress port 50 for transmission. Queuing system 48 meanwhile monitors the queue lengths and delivers the queue length values L to congestion estimator 44, which updates the interface congestion levels and effective congestion levels accordingly, as described above.
It will be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and subcombinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.
Number | Name | Date | Kind |
---|---|---|---|
6108713 | Sambamurthy et al. | Aug 2000 | A |
6178448 | Gray et al. | Jan 2001 | B1 |
6594263 | Martinsson et al. | Jul 2003 | B1 |
7321553 | Prasad et al. | Jan 2008 | B2 |
7821939 | Decusatis et al. | Oct 2010 | B2 |
8078743 | Sharp et al. | Dec 2011 | B2 |
8345548 | Gusat et al. | Jan 2013 | B2 |
8473693 | Muppalaneni et al. | Jun 2013 | B1 |
8576715 | Bloch et al. | Nov 2013 | B2 |
8767561 | Gnanasekaran et al. | Jul 2014 | B2 |
8811183 | Anand et al. | Aug 2014 | B1 |
8879396 | Guay et al. | Nov 2014 | B2 |
8989017 | Naouri | Mar 2015 | B2 |
8995265 | Basso et al. | Mar 2015 | B2 |
9014006 | Haramaty et al. | Apr 2015 | B2 |
9325619 | Guay et al. | Apr 2016 | B2 |
9356868 | Tabatabaee et al. | May 2016 | B2 |
9426085 | Anand et al. | Aug 2016 | B1 |
20020191559 | Chen et al. | Dec 2002 | A1 |
20030108010 | Kim et al. | Jun 2003 | A1 |
20030223368 | Allen et al. | Dec 2003 | A1 |
20040008714 | Jones | Jan 2004 | A1 |
20050053077 | Blanc et al. | Mar 2005 | A1 |
20050169172 | Wang et al. | Aug 2005 | A1 |
20050216822 | Kyusojin et al. | Sep 2005 | A1 |
20050226156 | Keating et al. | Oct 2005 | A1 |
20050228900 | Stuart et al. | Oct 2005 | A1 |
20060088036 | De Prezzo | Apr 2006 | A1 |
20060092837 | Kwan et al. | May 2006 | A1 |
20060092845 | Kwan et al. | May 2006 | A1 |
20070097257 | El-Maleh et al. | May 2007 | A1 |
20070104102 | Opsasnick | May 2007 | A1 |
20070104211 | Opsasnick | May 2007 | A1 |
20070201499 | Kapoor et al. | Aug 2007 | A1 |
20070291644 | Roberts et al. | Dec 2007 | A1 |
20080037420 | Tang et al. | Feb 2008 | A1 |
20080175146 | Van Leekwuck | Jul 2008 | A1 |
20080192764 | Arefi et al. | Aug 2008 | A1 |
20090207848 | Kwan et al. | Aug 2009 | A1 |
20100220742 | Brewer et al. | Sep 2010 | A1 |
20130014118 | Jones | Jan 2013 | A1 |
20130039178 | Chen et al. | Feb 2013 | A1 |
20130250757 | Tabatabaee et al. | Sep 2013 | A1 |
20130250762 | Assarpour | Sep 2013 | A1 |
20130275631 | Magro et al. | Oct 2013 | A1 |
20130286834 | Lee | Oct 2013 | A1 |
20130305250 | Durant | Nov 2013 | A1 |
20140133314 | Mathews et al. | May 2014 | A1 |
20140269324 | Tietz | Sep 2014 | A1 |
20150026361 | Matthews et al. | Jan 2015 | A1 |
20150055478 | Tabatabaee | Feb 2015 | A1 |
20150124611 | Attar et al. | May 2015 | A1 |
20150127797 | Attar et al. | May 2015 | A1 |
20150180782 | Rimmer et al. | Jun 2015 | A1 |
20150200866 | Pope et al. | Jul 2015 | A1 |
20150381505 | Sundararaman et al. | Dec 2015 | A1 |
20160135076 | Grinshpun | May 2016 | A1 |
20170118108 | Avci et al. | Apr 2017 | A1 |
20170142020 | Sundararaman et al. | May 2017 | A1 |
20170180261 | Ma et al. | Jun 2017 | A1 |
20170187641 | Lundqvist et al. | Jun 2017 | A1 |
Number | Date | Country |
---|---|---|
1720295 | Nov 2006 | EP |
2466476 | Jun 2012 | EP |
2009107089 | Sep 2009 | WO |
2013136355 | Sep 2013 | WO |
2013180691 | Dec 2013 | WO |
Entry |
---|
Gran et al., “Congestion Management in Lossless Interconnection Networks”, Submitted to the Faculty of Mathematics and Natural Sciences at the University of Oslo in partial fulfillment of the requirements for the degree Philosophiae Doctor, 156 pages, Sep. 2013. |
Pfister et al., “Hot Spot Contention and Combining in Multistage Interconnect Networks”, IEEE Transactions on Computers, vol. C-34, pp. 943-948, Oct. 1985. |
Zhu et al., “Congestion control for large-scale RDMA deployments”, SIGCOMM'15, pp. 523-536, Aug. 17-21, 2015. |
U.S. Appl. No. 14/994,164 office action dated Jul. 5, 2017. |
U.S. Appl. No. 15/075,158 office action dated Aug. 24, 2017. |
CISCO Systems, Inc., “Priority Flow Control: Build Reliable Layer 2 Infrastructure”, 8 pages, 2015. |
CISCO Systems, Inc., “Advantage Series White Paper Smart Buffering”, 10 pages, 2016. |
Hoeiland-Joergensen et al., “The FlowQueue-CoDel Packet Scheduler and Active Queue Management Algorithm”, Internet Engineering Task Force (IETF) as draft-ietf-aqm-fq-codel-06, 23 pages, Mar. 18, 2016. |
U.S. Appl. No. 14/718,114 Office Action dated Sep. 16, 2016. |
U.S. Appl. No. 14/672,357 Office Action dated Sep. 28, 2016. |
Hahne et al., “Dynamic Queue Length Thresholds for Multiple Loss Priorities”, IEEE/ACM Transactions on Networking, vol. 10, No. 3, pp. 368-380, Jun. 2002. |
Choudhury et al., “Dynamic Queue Length Thresholds for Shared-Memory Packet Switches”, IEEE/ACM Transactions Networking, vol. 6, Issue 2 , pp. 130-140, Apr. 1998. |
Gafni et al., U.S. Appl. No. 14/672,357, filed Mar. 30, 2015. |
Ramakrishnan et al., “The Addition of Explicit Congestion Notification (ECN) to IP”, Request for Comments 3168, Network Working Group, 63 pages, Sep. 2001. |
IEEE Standard 802.1Q™-2005, “IEEE Standard for Local and metropolitan area networks Virtual Bridged Local Area Networks”, 303 pages, May 19, 2006. |
InfiniBand™ Architecture, Specification vol. 1, Release 1.2.1, Chapter 12, pp. 657-716, Nov. 2007. |
IEEE Std 802.3, Standard for Information Technology—Telecommunications and information exchange between systems—Local and metropolitan area networks—Specific requirements; Part 3: Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications Corrigendum 1: Timing Considerations for PAUSE Operation, Annex 31B (MAC Control PAUSE operation), pp. 763-772, year 2005. |
IEEE Std 802.1Qbb., IEEE Standard for Local and metropolitan area networks—“Media Access Control (MAC) Bridges and Virtual Bridged Local Area Networks—Amendment 17: Priority-based Flow Control”, 40 pages, Sep. 30, 2011. |
Elias et al., U.S. Appl. No. 14/718,114, filed May 21, 2015. |
Gafni et al., U.S. Appl. No. 15/075,158, filed Mar. 20, 2016. |
Elias et al., U.S. Appl. No. 14/994,164, filed Jan. 13, 2016. |
U.S. Appl. No. 15/063,527 office action dated Feb. 8, 2018. |
U.S. Appl. No. 15/161,316 office action dated Feb. 7, 2018. |
European Application # 17172494.1 search report dated Oct. 13, 2017. |
European Application # 17178355 search report dated Nov. 13, 2017. |
U.S. Appl. No. 15/081,969 office action dated Oct. 5, 2017. |
U.S. Appl. No. 15/081,969 office action dated May 17, 2018. |
U.S. Appl. No. 15/432,962 office action dated Apr. 26, 2018. |
U.S. Appl. No. 15/161,316 Office Action dated Jul. 20, 2018. |
Number | Date | Country | |
---|---|---|---|
20170171099 A1 | Jun 2017 | US |