The present invention relates to congestion control in a packet switched network. Congestion is a common problem in packet switched networks, which typically lack admission control mechanisms that would prevent packets from entering an oversubscribed network.
In the context of Internet Protocol (IP) networks, congestion control is handled at the transport layer, e.g. by the Transmission Control Protocol (TCP). TCP congestion control mostly relies on an implicit congestion indication in the form of lost packets. Mechanisms for explicit congestion notification (ECN) have been proposed and standardized, requiring intermediate packet switches (routers) to mark bits in the IP header of a packet to indicate the incidence of congestion on the path of the packet. The notifications are reflected back by the receiver in the headers of acknowledgment or other packets flowing in the opposite direction. ECN has seen limited deployment, though in some networks it has shown benefits compared to implicit congestion detection via packet loss. However, ECN provides no mechanism for indicating congestion at the receiving and/or sending node itself. TCP does provide a window flow control mechanism, but it is limited to controlling buffer use at the receiving end. Moreover, it is often the case that flow control windows are sized without relation to the available resources at the receiver, such as memory, in part to improve performance when multiple connections are established but the different connections are active at different times.
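As context for the notification mechanism discussed above, the standardized ECN encoding (RFC 3168) uses the two low-order bits of the IP TOS/Traffic Class byte. The following sketch illustrates the marking rule a congested forwarding element applies; the helper name is illustrative and not taken from any standard.

```python
# ECN codepoints occupy the two low-order bits of the IP TOS/Traffic Class
# byte (RFC 3168). The codepoint values are standard; the helper name is
# illustrative only.
NOT_ECT = 0b00  # transport is not ECN-capable
ECT_1   = 0b01  # ECN-Capable Transport
ECT_0   = 0b10  # ECN-Capable Transport
CE      = 0b11  # Congestion Experienced, set by a congested forwarding element

def mark_congestion_experienced(tos: int) -> int:
    """Set the CE codepoint, as a congested router would, but only if the
    packet advertises an ECN-capable transport."""
    if tos & 0b11 in (ECT_0, ECT_1):
        return (tos & ~0b11) | CE
    # Not ECN-capable: leave untouched (an AQM would drop instead of mark).
    return tos
```

For example, an ECT(0) packet (TOS low bits 0b10) would be rewritten to carry CE (0b11), while a Not-ECT packet passes through unchanged; the receiver then reflects the mark back to the sender in transport-layer headers.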
A Network Interface Controller (NIC), which may be, for example, network interface circuitry such as on a PCI card connected to a host computer via a PCI host bus, typically includes receive functionality used to couple the host computer to a packet network through at least one interface, usually called a port. NIC circuitry has been an area of rapid development as advanced packet processing functionality and protocol offload have become requirements for so-called “smart NICs”.
It has become common for receive functionality of NIC cards to parse packet headers and perform operations such as checksum verification, filter lookups, and modifications such as VLAN tag removal and insertion. Additionally, the receive functionality in some NICs implements protocol offload for TCP/IP, iSCSI, RDMA and other protocols.
Advances in network connectivity speed and advances in system traffic generation and processing capacity are not coupled; each occurs as the relevant technologies develop. For example, Ethernet speeds typically increase by a factor of 10× every few years. In contrast, system components such as the host bus, CPU speeds, and memory subsystem bandwidth improve in different steps, varying from 2× every few years for the host bus, to roughly 10% per year for CPUs, and somewhat less for memory.
A mismatch may exist between the ingress rate of packets from the network at the receive functionality and the rate at which the host computer is accepting and/or can accept packets on the host bus from the NIC. This mismatch may be temporary, caused by load on the host system processors, or in general can be caused by the offered load exceeding the capacity of the host system. Receive functionality of a NIC can be equipped to detect and indicate such conditions, which remain otherwise mostly unbeknownst to network routers and switches.
In an Ethernet packet switched network, a “PAUSE” mechanism is available to request an immediate neighbor of a node to pause transmission when the node detects congestion. However, this action is limited in scope to the link level, and does not affect the sending rate of the source node, if that node is not the immediate neighbor but is instead connected to the destination through one or more switches. Instead, end-to-end congestion avoidance and control mechanisms are preferred, such as the ones implemented by the Transmission Control Protocol (TCP).
In accordance with an aspect of the invention, congestion detection and explicit notification mechanisms are extended to the receive functionality of the NIC, which is equipped to mark congestion notification bits similarly to intermediate routers. Furthermore, smart NICs which implement receive protocol offload can themselves reflect the indication back to the sender in packets the NICs generate according to the protocol requirements.
In accordance with another aspect of the invention, congestion detection and notification are extended to transmit functionality of the NIC, whereby the NIC can detect a temporary and/or long-term mismatch between the traffic sending rate of the node and the traffic sinking rate of its network port.
Independently, server virtualization has been an area of active development, leading to significant virtualization related features being implemented in a NIC. Some NICs have thus incorporated virtual bridges for switching traffic between two virtual machines residing on the host system. In accordance with an aspect of the invention, receive and transmit functionality in a virtual bridge may provide congestion detection and notification capabilities even for connections that are facilitated by the NIC between functions executing on the same host.
ECN has been standardized in the context of TCP/IP networks for more than a decade, but has seen limited deployment and use, perhaps due to the heterogeneity of the network and the large number of entities managing it. However, there is potential for ECN gaining acceptance in the context of datacenters, which combine a large network infrastructure, where congestion may be a significant concern, with a single management authority.
Packet switched networks typically lack admission control mechanisms that operate based on congestion in the network.
In accordance with an aspect of the current invention, receive and/or transmit functionality of network interface circuitry (“NIC”) is equipped with ingress and/or egress congestion detection functionality, respectively. In the presence of ingress congestion, the NIC may pass an indication of the congestion (notification) to the receiving protocol stack. In the presence of egress congestion, the NIC may pass an indication of the congestion (notification) to the sending protocol stack. At least one of the receiving protocol stack and the sending protocol stack may be operating on a host coupled to the network by the NIC or operating in the NIC itself (protocol offload).
For example, the notification may be carried in the appropriate fields of packets of a protocol stack when relevant, or as sideband information. For example, in a TCP/IP-compliant network, the notification can be carried in the ECN bits as specified by the relevant standards.
A NIC that implements packet switching functionality can therefore provide an indication of congestion even for traffic that is confined to the NIC itself, such as in virtualized configurations, where the NIC switches packets between virtual machines executing on the host system.
A NIC which implements a network stack (protocol offload NIC) can furthermore act on congestion detection and perform the required congestion avoidance and control actions. In a TCP/IP network, the receiving action may include reflecting the detected congestion appropriately to the sending side. The sending action may include reducing sending speed.
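As an illustration of the receiving and sending actions described above, the following sketch models the RFC 3168 TCP behavior an offloaded stack might implement: the receiver echoes detected congestion via the ECE flag until the sender acknowledges with CWR, and the sender reduces its congestion window in response. The class and field names are illustrative assumptions, not details from the specification.

```python
# Hedged sketch of RFC 3168 TCP endpoint actions that a protocol offload NIC
# might implement. Names are illustrative only.
class EcnTcpEndpoint:
    def __init__(self, cwnd=10):
        self.cwnd = cwnd                  # congestion window, in segments
        self.echo_congestion = False      # receiver side: send ECE until CWR seen

    def on_data_segment(self, ce_marked, cwr_flag):
        """Receiving side: a CE-marked data segment arms the ECE echo; a CWR
        flag from the sender disarms it. Returns flags for the generated ACK."""
        if ce_marked:
            self.echo_congestion = True
        if cwr_flag:                      # sender acknowledged the echo
            self.echo_congestion = False
        return {"ece": self.echo_congestion}

    def on_ack(self, ece_flag):
        """Sending side: an ACK carrying ECE triggers a window reduction,
        acknowledged by setting CWR on the next outgoing data segment."""
        if ece_flag:
            self.cwnd = max(1, self.cwnd // 2)
            return {"cwr": True}
        return {"cwr": False}
```

In this sketch the "reflecting" action of the receive side and the "reducing sending speed" action of the send side are both local to the offloaded stack, so the NIC can perform them without host involvement.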
A NIC which implements packet switching functionality and protocol offload can thus be extended to implement both sides of an explicit congestion notification capable network node, such as in virtualized configurations, where the NIC switches packets between virtual machines executing on the host system.
A NIC which, in addition to providing an interface between a network and a host, implements packet switching functionality where the NIC switches packets between network ports may be configured to mark the packets flowing through according to congestion detection, or other criteria.
We now describe a novel application of the packet processing capability of a NIC equipped with a congestion detection facility. Referring to
The NIC 101 can detect congestion on the ingress portion of the interface 106 such as by monitoring the occupancy of relevant state elements, like header or payload FIFOs, or general usage of NIC resources for packet ingress. For example, a payload or header FIFO occupancy threshold can be programmed, such that when the FIFO occupancy exceeds said threshold, the FIFO is considered congested. In another example, a “freelist” of host based receive packet buffers is maintained, and the number of available buffers is monitored. When the number of available buffers drops below a certain threshold, the receiving host is considered to be congested. As another example, the NIC 101 may also detect congestion through monitoring the internal state of busses and connections. Based thereon, the NIC 101 may maintain an ingress congestion state that is indicative of the detected ingress congestion, such as indicative of detected congestion episodes. (It is noted that, by “connection,” this is not meant to imply that a connection state is necessarily present and/or being maintained at both ends of the “connection.”)
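The threshold-based detection described above can be sketched as follows; the class, its field names, and the specific thresholds are illustrative assumptions, not details from the specification.

```python
# Illustrative sketch of an ingress congestion monitor with programmable
# thresholds, combining FIFO occupancy and receive-buffer freelist depth.
# All names and threshold values are assumptions for illustration.
class IngressCongestionMonitor:
    def __init__(self, fifo_threshold, freelist_low_watermark):
        self.fifo_threshold = fifo_threshold            # occupancy above this => congested
        self.freelist_low_watermark = freelist_low_watermark  # buffers below this => congested
        self.congested = False

    def update(self, fifo_occupancy, free_buffers):
        """Recompute the maintained ingress congestion state from current
        resource usage; either trigger alone indicates congestion."""
        self.congested = (fifo_occupancy > self.fifo_threshold or
                          free_buffers < self.freelist_low_watermark)
        return self.congested
```

For example, with a FIFO threshold of 768 entries and a freelist low watermark of 32 buffers, either a FIFO occupancy of 800 or a freelist depth of 16 would place the monitor in the congested state.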
According to the maintained congestion state, the NIC 101 may mark packets received during detected congestion episodes. The marking can be passed as sideband (out-of-band) information to the receiving host 102 to be associated with a received packet (e.g., as part of a receive descriptor associated with the packet), or with a flow to which the packet belongs, or through setting or clearing appropriate bits in the packet headers. Modifying bits in the packet headers may require the NIC to adjust integrity check values that cover said bits.
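Adjusting an integrity check value after modifying header bits can be done incrementally rather than by full recomputation. The sketch below shows both a full IPv4 header checksum computation and the RFC 1624 incremental update a NIC might apply after setting ECN bits in the TOS byte; the function names are illustrative.

```python
def ipv4_checksum(header_words):
    """Full (re)computation of the 16-bit one's-complement IPv4 header
    checksum over 16-bit words, with the checksum field taken as zero."""
    s = sum(header_words)
    while s >> 16:                       # fold carries back in
        s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF

def incremental_checksum_update(checksum, old_word, new_word):
    """RFC 1624 incremental update, HC' = ~(~HC + ~m + m'), applied after a
    single 16-bit header word changes (e.g. the version/IHL/TOS word when
    the CE codepoint is set in the TOS byte)."""
    s = (~checksum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF
```

The incremental form touches only the old and new values of the changed word, which suits a NIC datapath that rewrites a header field in flight without re-reading the whole header.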
The congestion notification markings are then processed by the host 102 according to a congestion protocol 107. Typically, based on the processing of the congestion notification markings, the congestion protocol 107 causes one or more congestion indications to be communicated back to a peer 105, using congestion experienced indications. The double arrow 108 indicates this bidirectional passing of congestion information: from the NIC 101 to the host 102, and from the host 102 to the peer 105.
The NIC 101 may also (or instead) detect congestion on the egress interface when processing packets for transmission. Congestion detection may be based on monitoring the state of NIC resources in the transmit direction, similarly to the receive congestion detection, and may include monitoring the network (PAUSE) state of the egress interface. The NIC can, for example, mark the packets in appropriate fields in the headers or communicate the notification back to the host system to be passed to the congestion indication processing 107. The notification can be associated with a packet by including identification information from the packet, such as appropriate network headers, or flow information if a flow has been associated with the packet or, in general, an identifier associated with the packet. Alternatively, the NIC may mark the packets by modifying relevant bits in the packet headers to indicate congestion experienced. Modifying bits in the headers may require the NIC to recompute or adjust check values that cover the bits.
In accordance with another aspect, as shown in
Like the
In addition, the host 202 may be configured to operate multiple virtual machines or guest functions. The NIC 201 may operate as a “virtual network” 214 between the virtual machines such as guest VM1 208a and guest VM2 208b. Furthermore, similar to the description above with respect to
A NIC may also implement a function to switch packets between two entities on the host side, or between an ingress port and an egress port. Packets flowing through the switching functionality may be marked by the NIC according to the presence of congestion. It may be beneficial to implement marking as part of general traffic management (e.g., the congestion marking may not be based on congestion detection exclusively, but may also be based on desired traffic management).
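A marking decision that combines congestion state with a traffic-management policy might, for example, pair the port's congestion indication with a token-bucket profile that marks, rather than drops, out-of-profile packets. The sketch below is illustrative only; the names and the token-bucket choice are assumptions, not details from the specification.

```python
# Illustrative sketch of a marking decision in a NIC-resident switching
# function: mark when the destination port is congested OR when the flow is
# out of its configured token-bucket profile. Names are assumptions.
class BridgeMarker:
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes        # bucket starts full
        self.last = 0.0                  # time of last refill

    def should_mark(self, now, pkt_len, port_congested):
        # Refill tokens for elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        out_of_profile = pkt_len > self.tokens
        if not out_of_profile:
            self.tokens -= pkt_len       # in-profile packets consume tokens
        return port_congested or out_of_profile
```

This reflects the point above that marking need not be driven by congestion detection exclusively: the same datapath decision can fold in a desired traffic-management policy.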
As indicated in
Number | Name | Date | Kind |
---|---|---|---|
6016319 | Kshirsagar et al. | Jan 2000 | A |
6035333 | Jeffries et al. | Mar 2000 | A |
6118771 | Tajika et al. | Sep 2000 | A |
6167054 | Simmons et al. | Dec 2000 | A |
6181699 | Crinion et al. | Jan 2001 | B1 |
6477143 | Ginossar | Nov 2002 | B1 |
6724725 | Dreyer et al. | Apr 2004 | B1 |
7573815 | Brzezinski et al. | Aug 2009 | B2 |
7660264 | Eiriksson et al. | Feb 2010 | B1 |
7660306 | Eiriksson et al. | Feb 2010 | B1 |
7675857 | Chesson | Mar 2010 | B1 |
7706255 | Kondrat et al. | Apr 2010 | B1 |
7742412 | Medina | Jun 2010 | B1 |
7760733 | Eiriksson et al. | Jul 2010 | B1 |
7761589 | Jain | Jul 2010 | B1 |
8346919 | Eiriksson et al. | Jan 2013 | B1 |
20010055313 | Yin | Dec 2001 | A1 |
20030099197 | Yokota et al. | May 2003 | A1 |
20030219022 | Dillon et al. | Nov 2003 | A1 |
20040179476 | Kim et al. | Sep 2004 | A1 |
20050182833 | Duffie et al. | Aug 2005 | A1 |
20060088036 | De Prezzo | Apr 2006 | A1 |
20060092840 | Kwan et al. | May 2006 | A1 |
20060114912 | Kwan et al. | Jun 2006 | A1 |
20060251120 | Arimilli et al. | Nov 2006 | A1 |
20070022212 | Fan | Jan 2007 | A1 |
20070071014 | Perera et al. | Mar 2007 | A1 |
20070201499 | Kapoor et al. | Aug 2007 | A1 |
20070268830 | Li et al. | Nov 2007 | A1 |
20080025226 | Mogul et al. | Jan 2008 | A1 |
20080025309 | Swallow | Jan 2008 | A1 |
20080056263 | Jain et al. | Mar 2008 | A1 |
20080062879 | Sivakumar et al. | Mar 2008 | A1 |
20080144503 | Persson et al. | Jun 2008 | A1 |
20080232251 | Hirayama et al. | Sep 2008 | A1 |
20090052326 | Bergamasco et al. | Feb 2009 | A1 |
20090073882 | McAlpine et al. | Mar 2009 | A1 |
20090116493 | Zhu et al. | May 2009 | A1 |
20090219818 | Tsuchiya | Sep 2009 | A1 |
20090310610 | Sandstrom | Dec 2009 | A1 |
20100057929 | Merat et al. | Mar 2010 | A1 |
20100091650 | Brewer et al. | Apr 2010 | A1 |
20100091774 | Ronciak et al. | Apr 2010 | A1 |
20100157803 | Rivers et al. | Jun 2010 | A1 |
20100182907 | Pinter et al. | Jul 2010 | A1 |
20100238804 | Jain | Sep 2010 | A1 |
20110002224 | Tamura | Jan 2011 | A1 |
20110182194 | Jacquet et al. | Jul 2011 | A1 |
20120079065 | Miyamoto | Mar 2012 | A1 |
20120120801 | Ramakrishnan et al. | May 2012 | A1 |
20120155256 | Pope et al. | Jun 2012 | A1 |
20120173748 | Bouazizi | Jul 2012 | A1 |
20120207026 | Sato | Aug 2012 | A1 |
20120250512 | Jagadeeswaran | Oct 2012 | A1 |
20130135999 | Bloch et al. | May 2013 | A1 |
20140064072 | Ludwig | Mar 2014 | A1 |
20140126357 | Kulkarni et al. | May 2014 | A1 |
20140185452 | Kakadia et al. | Jul 2014 | A1 |
20140185616 | Bloch et al. | Jul 2014 | A1 |
20140254357 | Agarwal et al. | Sep 2014 | A1 |
20140269321 | Kamble et al. | Sep 2014 | A1 |
20140304425 | Taneja et al. | Oct 2014 | A1 |
20140310405 | Pope et al. | Oct 2014 | A1 |
20150029863 | Lai | Jan 2015 | A1 |
20150103659 | Iles et al. | Apr 2015 | A1 |
Entry |
---|
Lu et al., “Congestion control in networks with no congestion drops,” in Proc. 44th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, Sep. 2006. |
Barrass et al., “Proposal for Priority Based Flow Control,” May 2008, http://www.ieee802.org/1/files/public/docs2008/bb-pelissierpfc-proposal-0508.pdf. |
Hugh Barrass, “Definition for new PAUSE function,” May 30, 2007, Revision 1.0, http://www.ieee802.org/1/files/public/docs2007/new-cm-barrass-pause-proposal.pdf. |
Henderson et al., “On improving the fairness of TCP congestion avoidance,” Global Telecommunications Conference, 1998. GLOBECOM 98. The Bridge to Global Integration. IEEE Issue Date: 1998, pp. 539-544 vol. 1, Nov. 8, 1998-Nov. 12, 1998, Sydney, NSW, Australia. |
“Priority Flow Control: Build Reliable Layer 2 Infrastructure,” © 2009 Cisco Systems, Inc., http://cisco.biz/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-542809.pdf. |