The present invention relates generally to packet communication networks, and particularly to buffer management in switches that are deployed in such networks.
Switches used in high-speed packet networks, such as Ethernet and InfiniBand networks, typically contain buffer memories. Packets received by the switch through one of its interfaces are stored temporarily in a buffer memory while awaiting transfer to the appropriate egress interface or possibly, in the case of multicast packets, to multiple egress interfaces. Although buffer memory may be allocated statically to each interface, many modern packet switches use a shared memory, in which buffer space is allocated flexibly to different interfaces and queues depending on traffic load and memory availability, as well as packet ingress priority and packet priorities that are set after processing in the switch.
As one example, U.S. Patent Application Publication 2013/0250762 describes a method for achieving lossless behavior for multiple ports sharing a buffer pool. Packets are “colored” and stored in a shared packet buffer without assigning fixed page allocations per port.
Embodiments of the present invention that are described hereinbelow provide improved methods and apparatus for buffer management in a network element.
There is therefore provided, in accordance with an embodiment of the invention, communication apparatus, which includes multiple interfaces configured to be connected to a packet data network so as to serve as both ingress and egress interfaces in receiving and forwarding of data packets of multiple types, including at least first and second types, from and to the network by the apparatus. A memory is coupled to the interfaces and configured as a buffer to contain packets received through the ingress interfaces while awaiting transmission to the network via the egress interfaces. Packet processing logic is configured to maintain multiple transmit queues, which are associated with respective ones of the egress interfaces, and to place both first and second queue entries, corresponding to first and second data packets of the first and second types, respectively, in a common transmit queue for transmission through a given egress interface, while allocating respective spaces in the buffer to store the first and second data packets against separate, first and second buffer allocations, which are respectively assigned to the first and second types of the data packets.
In one embodiment, the first type of the data packets consists of unicast packets, while the second type of the data packets consists of multicast packets. Additionally or alternatively, the first and second types of the data packets are transmitted using different, respective, first and second transport protocols.
Typically, the packet processing logic is configured, when a given queue entry reaches a head of the common transmit queue, to transmit a corresponding data packet through the given egress interface and to release a corresponding space in a respective one of the first and second buffer allocations.
In some embodiments, the first buffer allocation is shared over multiple transmit queues associated with multiple, different egress interfaces through which the data packets of the first type are transmitted.
Additionally or alternatively, the multiple transmit queues include at least two transmit queues that are both associated with the same, given egress interface and have different, respective levels of quality of service, and the first and second data packets of the different, first and second types have a common level of quality of service.
In a disclosed embodiment, the packet processing logic is configured to apply a congestion avoidance mechanism separately to the first and second types of the data packets responsively to respective fill levels of the first and second buffer allocations.
There is also provided, in accordance with an embodiment of the invention, a method for communication, which includes receiving through ingress interfaces of a network element data packets of multiple types, including at least first and second types. Respective spaces in a buffer in the network element are allocated to store the first and second data packets against separate, first and second buffer allocations, which are respectively assigned to the first and second types of the data packets, while the data packets await transmission to the network via egress interfaces of the network element. In the network element, multiple transmit queues are maintained, which are associated with respective ones of the egress interfaces. Both first and second queue entries, corresponding to first and second data packets of the first and second types, respectively, are placed in a common transmit queue for transmission through a given egress interface. Each of the first and second data packets is transmitted through the given egress interface when the corresponding queue entries reach a head of the common transmit queue.
The present invention will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:
In network elements, such as switches, that are known in the art, queuing and buffering are generally tightly coupled together. In other words, when a packet enters a given transmit queue, to be transmitted through a certain egress interface of the switch, the packet occupies a slot in the buffer space that is associated with the queue until it is transmitted. Thus, for example, when packets of different types (such as broadcast, multicast and unicast packets, or packets transmitted using different protocols, for example TCP and UDP packets) share the same transmit queue, they also necessarily share the same buffer allocation. Consequently, when heavy traffic of one type causes congestion on a given transmit queue, the resulting congestion avoidance measures (such as dropping or marking packets or applying back pressure on ingress ports) will also be applied to the other types of packets that share the transmit queue.
Embodiments of the present invention that are described herein loosen—and may decouple completely—the connection between queue assignment and buffer occupancy, and thus afford greater flexibility in allocation and management of communication resources. In the disclosed embodiments, different packet types can be assigned separate, respective allocations of buffer space in a network element even when these different packet types share a common transmit queue. Packet processing logic in the network element places queue entries corresponding to the data packets in the common transmit queue for transmission through the appropriate egress interface, while allocating respective spaces in the shared buffer to store the different types of data packets against their separate, respective buffer allocations. A given packet type in a given queue may receive its own buffer allocation, or a common buffer space may be allocated for packets of the given type across multiple transmit queues, meaning that packets of this type in the different queues share the same, common buffer allocation. When a given queue entry reaches the head of the common transmit queue, the corresponding data packet is transmitted through the egress interface and the space in the respective buffer allocation is released.
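By way of illustration only, the decoupling described above may be modeled in software as in the following minimal sketch. The class names `BufferAllocation`, `Descriptor` and `EgressPort`, the byte-based accounting, and the numerical limits are assumptions made for this example and are not taken from the disclosed embodiments, which would typically be implemented in hardware logic.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class BufferAllocation:
    """Share of the buffer assigned to one packet type (hypothetical model)."""
    limit: int      # bytes that this packet type may occupy
    used: int = 0   # bytes currently occupied

    def try_reserve(self, size: int) -> bool:
        # Accept the packet only if it fits within this type's own allocation.
        if self.used + size > self.limit:
            return False
        self.used += size
        return True

    def release(self, size: int) -> None:
        self.used -= size


@dataclass
class Descriptor:
    """Queue entry referring to a packet held in the buffer."""
    packet_type: str
    size: int


class EgressPort:
    """One common transmit queue, with buffer space charged per packet type."""

    def __init__(self, allocations):
        self.allocations = allocations  # packet type -> BufferAllocation
        self.tx_queue = deque()         # shared by all packet types

    def enqueue(self, packet_type: str, size: int) -> bool:
        # Charge the packet against the allocation for its own type.
        if not self.allocations[packet_type].try_reserve(size):
            return False                # this type is out of space; others are unaffected
        self.tx_queue.append(Descriptor(packet_type, size))
        return True

    def transmit_head(self) -> Descriptor:
        # Transmit the entry at the head of the queue and release its buffer space.
        desc = self.tx_queue.popleft()
        self.allocations[desc.packet_type].release(desc.size)
        return desc


if __name__ == "__main__":
    allocs = {"unicast": BufferAllocation(3000), "multicast": BufferAllocation(1500)}
    port = EgressPort(allocs)
    for ptype in ("unicast", "multicast", "multicast", "unicast"):
        print(ptype, "accepted" if port.enqueue(ptype, 1500) else "rejected")
```

In this sketch, the second multicast packet is rejected because the multicast allocation is exhausted, while the unicast packet that follows is still accepted into the same transmit queue.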
Thus, by appropriate allocation of the respective buffer spaces, it is possible to assign different, independent shares of the network resources to different packet types. The buffer allocation for any given packet type may be assigned per transmit queue, or per egress interface, or may be shared over multiple transmit queues associated with multiple, different egress interfaces through which the data packets of the given type are to be transmitted. As a consequence of this decoupling of the buffer allocation and queuing mechanisms, the packet processing logic can apply congestion avoidance mechanisms separately to the different types of the data packets, in response to the fill levels of the respective buffer allocations.
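The separate application of congestion avoidance may be sketched as follows, purely as an example. The function names `congestion_action` and `per_type_actions`, the 70% marking threshold, and the three possible actions are assumptions chosen for illustration; the embodiments do not prescribe any particular thresholds or policy.

```python
def congestion_action(used: int, limit: int, mark_at: float = 0.7) -> str:
    """Choose an action for one packet type from the fill level of its allocation.

    The thresholds are illustrative assumptions: drop when the allocation is
    full, mark when it is at least `mark_at` full, and accept otherwise.
    """
    if used >= limit:
        return "drop"
    if used >= mark_at * limit:
        return "mark"
    return "accept"


def per_type_actions(fill_levels: dict, limits: dict) -> dict:
    """Evaluate each packet type independently against its own allocation."""
    return {ptype: congestion_action(fill_levels[ptype], limits[ptype])
            for ptype in limits}


if __name__ == "__main__":
    limits = {"unicast": 1000, "multicast": 1000}
    fill_levels = {"unicast": 200, "multicast": 950}
    print(per_type_actions(fill_levels, limits))
```

Because each packet type is evaluated only against its own fill level, heavy multicast traffic in this example leads to marking of multicast packets, while unicast packets that share the same transmit queue continue to be accepted.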
Reference is now made to
Furthermore, although the present embodiment refers, for the sake of concreteness and clarity, to a network switch, the principles of the present invention may likewise be applied, mutatis mutandis, in other sorts of network elements that buffer and forward data packets, including (but not limited to) routers, bridges and tunneling elements, as well as in advanced network interface controllers that connect a host computer to a network.
As shown in
In the pictured embodiment, switch 20 receives multicast packet 26 through an ingress port 22. Packet 26 comprises a header 28 bearing a multicast address and a data payload 30. Header 28 may comprise, for example, a Layer 2 header with a multicast MAC destination address or a Layer 3 header with a multicast IP destination address. Switch 20 receives unicast packet 32, with a unicast header 34 containing a unicast MAC destination address, through another ingress port 22. Ports 22 direct packets 26 and 32 to memory 36, where copies of the packets are stored while awaiting retransmission through the appropriate egress ports 22. Packet processing logic (referred to in this embodiment as decision and queuing logic 38) reads headers 28 and 34 and looks up the destination addresses in order to identify the egress ports 22 through which the respective packets are to be transmitted.
Meanwhile, buffer control logic 40 allocates space in the shared buffer in memory 36 for storage of copies of the packets awaiting transmission. (Buffer control logic 40 is considered to be a part of the packet processing logic for purposes of the present description and the claims, although in practice it may be implemented separately from decision and queuing logic 38.) Buffer control logic 40 assigns separate, respective allocations 42 and 44 in memory 36 for multicast and unicast packet types, and stores packets 26 and 32 against these allocations while awaiting transmission. Although multiple copies of multicast packet 26 may be transmitted through different egress ports 22, as illustrated in
For each packet accepted into a corresponding allocation 42, 44, . . . , in memory 36, decision and queuing logic 38 places a queue entry, referred to hereinbelow as a descriptor, in the appropriate transmit queue 46 (or possibly in multiple transmit queues, in the case of multicast packets). Although for the sake of simplicity,
When a given queue entry reaches the head of transmit queue 46 in which the entry has been placed, decision and queuing logic 38 reads (and replicates as necessary) the corresponding data packet from memory 36, and transmits the packet through the appropriate egress interface. Buffer control logic 40 will then release the corresponding space in buffer allocation 42 or 44.
Upon receiving an incoming packet, regardless of packet type, an ingress port 22A (such as one of ports 22 in
In response to the notification received by decision control logic 52 that a new packet has arrived, a parser 54 parses the packet header and generates one or more descriptors, which it passes to a descriptor processor 56 for further handling and generation of forwarding instructions. Based on the descriptors, for example, processor 56 typically chooses an egress port or ports 22B through which the packet is to be transmitted. The descriptor may also indicate the quality of service (QoS) to be applied to the packet, i.e., the level of priority for transmission, and any applicable instructions for modification of the packet header. For multicast packets, processor 56 typically generates multiple descriptors, one for each egress port 22B that is to transmit a copy of the packet. All of these descriptors may have the same QoS (indicated, for example, by a QoS index value), or they may be assigned to two or more different QoS levels for different egress ports.
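The generation of descriptors for a multicast packet may be illustrated by the following sketch. The `Descriptor` fields, the function `make_descriptors` and the per-port QoS table are hypothetical names introduced for this example only; the actual descriptor format is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    """Forwarding instruction for one copy of a packet (illustrative fields)."""
    packet_id: int
    egress_port: int
    qos: int  # QoS index value, i.e., the level of priority for transmission


def make_descriptors(packet_id, egress_ports, qos_by_port=None, default_qos=0):
    """Create one descriptor for each egress port that is to transmit a copy.

    All descriptors receive `default_qos` unless `qos_by_port` assigns a
    different QoS level to a particular egress port.
    """
    qos_by_port = qos_by_port or {}
    return [Descriptor(packet_id, port, qos_by_port.get(port, default_qos))
            for port in egress_ports]


if __name__ == "__main__":
    # A multicast packet destined for three egress ports, one with a different QoS.
    for desc in make_descriptors(7, [1, 2, 3], qos_by_port={3: 5}):
        print(desc)
```

A unicast packet corresponds to the case in which `egress_ports` contains a single port, so that a single descriptor is generated.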
Descriptor processor 56 places the descriptors in the appropriate transmit queues (shown as queues 46 in the preceding figures) in a queuing system 60, to await transmission via the designated egress ports 22B. Typically, queuing system 60 contains a dedicated transmit queue for each egress port 22B or multiple transmit queues per egress port, one for each QoS level. Upon queuing a descriptor in queuing system 60, processor 56 notifies buffer control logic 40 that the corresponding packet is consuming buffer space in memory 36, and logic 40 notes the buffer consumption against the appropriate allocation 42, 44, . . . , for the packet type in question. Alternatively, the buffer consumption update to buffer control logic 40 may come from queuing system 60.
When a descriptor reaches the head of its transmit queue, queuing system 60 passes the descriptor to a packet modifier 62 for execution. Packet modifiers 62 are respectively coupled to egress ports 22B and serve as packet transmission units. In response to the descriptor, packet modifier 62 reads a copy of the appropriate packet data from memory 36, and makes whatever changes are called for in the packet header for transmission to network 24 through egress port 22B. In the case of multicast packets, packet modifier 62 may replicate the packet data, while the original data remain in memory 36 until all of the packet copies have been transmitted.
Upon the transmission of the packet (or the last packet copy, in the case of multicast transmission) through the corresponding egress port 22B, packet modifier 62 signals buffer control logic 40, and may also signal decision control logic 52, as indicated in the figure. Alternatively, this packet transmission notification may come from queuing system 60. In response to this notification, buffer control logic 40 releases the buffer space in the corresponding allocation 42, 44, . . . , so that the location in memory 36 can be overwritten, and the allocation is free to accept further packets of the corresponding type. This memory accounting and management process typically takes place for multiple different packets in parallel at any given time.
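The release of buffer space upon transmission of the last packet copy may be modeled, for example, by counting the copies that remain to be transmitted. In the sketch below, the class `BufferControl`, its method names, and the choice to charge a multicast packet once against its allocation regardless of the number of copies are assumptions made for the sake of illustration.

```python
class BufferControl:
    """Per-type buffer accounting with release on the last transmitted copy."""

    def __init__(self, limits):
        self.limits = dict(limits)               # packet type -> bytes allowed
        self.used = {ptype: 0 for ptype in limits}
        self.pending = {}                        # packet id -> [type, size, copies left]

    def admit(self, packet_id, packet_type, size, copies=1):
        # Assumption: the packet data are stored, and charged, only once.
        if self.used[packet_type] + size > self.limits[packet_type]:
            return False
        self.used[packet_type] += size
        self.pending[packet_id] = [packet_type, size, copies]
        return True

    def on_transmitted(self, packet_id):
        """Handle a transmission notification; return True if space was released."""
        entry = self.pending[packet_id]
        entry[2] -= 1
        if entry[2] > 0:
            return False                         # further copies still await transmission
        packet_type, size, _ = self.pending.pop(packet_id)
        self.used[packet_type] -= size           # the allocation can accept new packets
        return True


if __name__ == "__main__":
    control = BufferControl({"unicast": 2000, "multicast": 2000})
    control.admit(1, "multicast", 1500, copies=3)
    for _ in range(3):
        released = control.on_transmitted(1)
        print("released:", released, "multicast bytes in use:", control.used["multicast"])
```

In this example, the space occupied by the multicast packet remains charged against the multicast allocation until the third and last copy has been transmitted.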
It will be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and subcombinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.
Number | Name | Date | Kind |
---|---|---|---|
6108713 | Sambamurthy et al. | Aug 2000 | A |
6178448 | Gray et al. | Jan 2001 | B1 |
6594263 | Martinsson et al. | Jul 2003 | B1 |
7321553 | Prasad et al. | Jan 2008 | B2 |
7346059 | Gamer et al. | Mar 2008 | B1 |
7738454 | Panwar et al. | Jun 2010 | B1 |
7821939 | Decusatis et al. | Oct 2010 | B2 |
8078743 | Sharp et al. | Dec 2011 | B2 |
8345548 | Gusat et al. | Jan 2013 | B2 |
8473693 | Muppalaneni et al. | Jun 2013 | B1 |
8576715 | Bloch et al. | Nov 2013 | B2 |
8630294 | Keen et al. | Jan 2014 | B1 |
8767561 | Gnanasekaran et al. | Jul 2014 | B2 |
8811183 | Anand et al. | Aug 2014 | B1 |
8879396 | Guay et al. | Nov 2014 | B2 |
8989017 | Naouri | Mar 2015 | B2 |
8995265 | Basso et al. | Mar 2015 | B2 |
9014006 | Haramaty et al. | Apr 2015 | B2 |
9325619 | Guay et al. | Apr 2016 | B2 |
9356868 | Tabatabaee et al. | May 2016 | B2 |
9426085 | Anand et al. | Aug 2016 | B1 |
20020055993 | Shah et al. | May 2002 | A1 |
20020191559 | Chen et al. | Dec 2002 | A1 |
20030108010 | Kim et al. | Jun 2003 | A1 |
20030223368 | Allen et al. | Dec 2003 | A1 |
20040008714 | Jones | Jan 2004 | A1 |
20050053077 | Blanc et al. | Mar 2005 | A1 |
20050169172 | Wang | Aug 2005 | A1 |
20050216822 | Kyusojin et al. | Sep 2005 | A1 |
20050226156 | Keating et al. | Oct 2005 | A1 |
20050228900 | Stuart et al. | Oct 2005 | A1 |
20060087989 | Gai et al. | Apr 2006 | A1 |
20060088036 | De Prezzo | Apr 2006 | A1 |
20060092837 | Kwan et al. | May 2006 | A1 |
20060092845 | Kwan et al. | May 2006 | A1 |
20070097257 | El-Maleh et al. | May 2007 | A1 |
20070104102 | Opsasnick | May 2007 | A1 |
20070104211 | Opsasnick | May 2007 | A1 |
20070201499 | Kapoor et al. | Aug 2007 | A1 |
20070291644 | Roberts et al. | Dec 2007 | A1 |
20080037420 | Tang et al. | Feb 2008 | A1 |
20080175146 | Van Leekwuck et al. | Jul 2008 | A1 |
20080192764 | Arefi et al. | Aug 2008 | A1 |
20090207848 | Kwan et al. | Aug 2009 | A1 |
20100220742 | Brewer et al. | Sep 2010 | A1 |
20130014118 | Jones | Jan 2013 | A1 |
20130039178 | Chen et al. | Feb 2013 | A1 |
20130250757 | Tabatabaee et al. | Sep 2013 | A1 |
20130250762 | Assarpour | Sep 2013 | A1 |
20130275631 | Magro et al. | Oct 2013 | A1 |
20130286834 | Lee | Oct 2013 | A1 |
20130305250 | Durant | Nov 2013 | A1 |
20140133314 | Mathews et al. | May 2014 | A1 |
20140269274 | Banavalikar et al. | Sep 2014 | A1 |
20140269324 | Tietz et al. | Sep 2014 | A1 |
20150026361 | Matthews et al. | Jan 2015 | A1 |
20150124611 | Attar et al. | May 2015 | A1 |
20150127797 | Attar | May 2015 | A1 |
20150180782 | Rimmer | Jun 2015 | A1 |
20150200866 | Pope | Jul 2015 | A1 |
20150381505 | Sundararaman et al. | Dec 2015 | A1 |
20160135076 | Grinshpun et al. | May 2016 | A1 |
20170118108 | Avci et al. | Apr 2017 | A1 |
20170142020 | Sundararaman et al. | May 2017 | A1 |
20170180261 | Ma et al. | Jun 2017 | A1 |
20170187641 | Lundqvist et al. | Jun 2017 | A1 |
20170295112 | Cheng et al. | Oct 2017 | A1 |
20180205653 | Wang et al. | Jul 2018 | A1 |
Number | Date | Country |
---|---|---|
1720295 | Nov 2006 | EP |
2466476 | Jun 2012 | EP |
2009107089 | Sep 2009 | WO |
2013136355 | Sep 2013 | WO |
2013180691 | Dec 2013 | WO |
Entry |
---|
U.S. Appl. No. 14/994,164 office action dated Jul. 5, 2017. |
U.S. Appl. No. 15/075,158 office action dated Aug. 24, 2017. |
European Application # 17172494.1 search report dated Oct. 13, 2017. |
European Application # 17178355 search report dated Nov. 13, 2017. |
CISCO Systems, Inc., “Priority Flow Control: Build Reliable Layer 2 Infrastructure”, 8 pages, 2015. |
Gran et al., “Congestion Management in Lossless Interconnection Networks”, Submitted to the Faculty of Mathematics and Natural Sciences at the University of Oslo in partial fulfillment of the requirements for the degree Philosophiae Doctor, 156 pages, Sep. 2013. |
Pfister et al., “Hot Spot Contention and Combining in Multistage Interconnect Networks”, IEEE Transactions on Computers, vol. C-34, pp. 943-948, Oct. 1985. |
Zhu et al., “Congestion control for large-scale RDMA deployments”, SIGCOMM'15, pp. 523-536, Aug. 17-21, 2015. |
Hahne et al., “Dynamic Queue Length Thresholds for Multiple Loss Priorities”, IEEE/ACM Transactions on Networking, vol. 10, No. 3, pp. 368-380, Jun. 2002. |
Choudhury et al., “Dynamic Queue Length Thresholds for Shared-Memory Packet Switches”, IEEE/ACM Transactions on Networking, vol. 6, Issue 2, pp. 130-140, Apr. 1998. |
Gafni et al., U.S. Appl. No. 14/672,357, filed Mar. 30, 2015. |
Ramakrishnan et al., “The Addition of Explicit Congestion Notification (ECN) to IP”, Request for Comments 3168, Network Working Group, 63 pages, Sep. 2001. |
IEEE Standard 802.1Q™-2005, “IEEE Standard for Local and metropolitan area networks Virtual Bridged Local Area Networks”, 303 pages, May 19, 2006. |
InfiniBand™ Architecture, Specification vol. 1, Release 1.2.1, Chapter 12, pp. 657-716, Nov. 2007. |
IEEE Std 802.3, Standard for Information Technology—Telecommunications and information exchange between systems—Local and metropolitan area networks—Specific requirements; Part 3: Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications Corrigendum 1: Timing Considerations for PAUSE Operation, Annex 31B (MAC Control PAUSE operation), pp. 763-772, year 2005. |
IEEE Std 802.1Qbb., IEEE Standard for Local and metropolitan area networks—“Media Access Control (MAC) Bridges and Virtual Bridged Local Area Networks—Amendment 17: Priority-based Flow Control”, 40 pages, Sep. 30, 2011. |
Elias et al., U.S. Appl. No. 14/718,114, filed May 21, 2015. |
Gafni et al., U.S. Appl. No. 15/075,158, filed Mar. 20, 2016. |
Shpiner et al., U.S. Appl. No. 14/967,403, filed Dec. 14, 2015. |
Elias et al., U.S. Appl. No. 14/994,164, filed Jan. 13, 2016. |
Elias et al., U.S. Appl. No. 15/081,969, filed Mar. 28, 2016. |
Kriss et al., U.S. Appl. No. 15/161,316, filed May 23, 2016. |
Roitshtein et al., U.S. Appl. No. 14/961,923, filed Dec. 8, 2015. |
CISCO Systems, Inc., “Advantage Series White Paper Smart Buffering”, 10 pages, 2016. |
Hoeiland-Joergensen et al., “The FlowQueue-CoDel Packet Scheduler and Active Queue Management Algorithm”, Internet Engineering Task Force (IETF) as draft-ietf-aqm-fq-codel-06, 23 pages, Mar. 18, 2016. |
U.S. Appl. No. 14/718,114 Office Action dated Sep. 16, 2016. |
U.S. Appl. No. 14/672,357 Office Action dated Sep. 28, 2016. |
U.S. Appl. No. 14/967,403 office action dated Nov. 9, 2017. |
U.S. Appl. No. 15/081,969 office action dated Oct. 5, 2017. |
U.S. Appl. No. 15/161,316 office action dated Feb. 7, 2018. |
U.S. Appl. No. 15/081,969 office action dated May 17, 2018. |
U.S. Appl. No. 15/432,962 office action dated Apr. 26, 2018. |
U.S. Appl. No. 15/161,316 Office Action dated Jul. 20, 2018. |
U.S. Appl. No. 15/432,962 office action dated Nov. 2, 2018. |
U.S. Appl. No. 15/469,652 office action dated Nov. 2, 2018. |
U.S. Appl. No. 15/161,316 Office Action dated Dec. 11, 2018. |
Number | Date | Country | |
---|---|---|---|
20170264571 A1 | Sep 2017 | US |