Flow control is a technique for preventing packet loss in a switched environment due to network bottlenecks or congestion. Ideally, a high-performance switch should be capable of forwarding frames at full wire speed to and from each of its ports simultaneously with almost no latency and no packet loss. However, in practice, situations may arise when packet data is transmitted faster than it can be handled by the switch.
Switches may include buffers at each port (e.g., first-in-first-out (FIFO) registers) to absorb mismatches between the rates at which ports transmit and receive data with their link partners. However, if the buffer at the destination port of an incoming packet is full, a conventional switch may be forced to drop the packet.
The IEEE 802.3x standard specifies a flow control mechanism for Ethernet LAN (Local Area Network) switches. The IEEE 802.3x flow control mechanism is implemented within the MAC (Media Access Control) sublayer. A port's input buffer begins to fill as packets are received. Once the buffer has reached a pre-programmed threshold, the MAC control sublayer signals an internal state machine to transmit a "PAUSE" frame. The PAUSE frame informs the link partner to halt transmission for a specified length of time, e.g., a programmed idle time. The MAC control module continues to transmit PAUSE frames with the programmed idle time as long as the threshold is exceeded. If the buffer level falls below the threshold before this time expires, another PAUSE frame is sent with a time of zero to re-enable transmission.
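As a concrete illustration of the PAUSE frame described above, the following Python sketch assembles a standard IEEE 802.3x PAUSE frame: the reserved multicast destination address 01-80-C2-00-00-01, the MAC Control EtherType 0x8808, the PAUSE opcode 0x0001, a 16-bit pause time, and 42 reserved bytes that pad the frame to the 60-byte minimum before the frame check sequence (FCS). The pause time is expressed in pause quanta of 512 bit times each, so the real duration depends on the link rate. The function names are illustrative only, and the FCS is assumed to be appended by the MAC.

```python
import struct

PAUSE_DA = bytes.fromhex("0180c2000001")  # reserved multicast address for MAC Control frames
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001
RESERVED_LEN = 42  # zero padding up to the 60-byte minimum (FCS not included)


def build_pause_frame(src_mac: bytes, pause_quanta: int) -> bytes:
    """Return a PAUSE frame without FCS (the MAC normally appends the FCS)."""
    if len(src_mac) != 6:
        raise ValueError("source MAC must be 6 bytes")
    if not 0 <= pause_quanta <= 0xFFFF:
        raise ValueError("pause time is a 16-bit field")
    header = PAUSE_DA + src_mac + struct.pack(
        "!HHH", MAC_CONTROL_ETHERTYPE, PAUSE_OPCODE, pause_quanta
    )
    return header + bytes(RESERVED_LEN)


def pause_duration_seconds(pause_quanta: int, link_bps: float) -> float:
    """One pause quantum lasts 512 bit times at the link's data rate."""
    return pause_quanta * 512 / link_bps
```

A halt request would typically carry a large pause time such as 0xFFFF (about 33.6 ms at 1 Gbps), while the re-enable request carries a pause time of 0.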
A network system may include a number of switches connected to a network processor that handles the bulk of the switching and/or routing in the system. The switches may provide per-port flow control status information, i.e., the flow control status of a number of their ports, over a link to the network processor. The network processor may use this information to make traffic management decisions.
A switch may include a flow control monitor to monitor the flow control status of a number of ports in the switch, and a flow control module to generate an indicator identifying the flow control status of the ports. The indicator may be incorporated into a frame having a format similar to an IEEE 802.3x PAUSE frame. The indicator may be a bitmap incorporated into, e.g., the reserved field or preamble field of the frame. The flow control module may generate the indicator in response to the flow control monitor detecting a flow control event at one of the ports, e.g., a port's output queue reaching a flow control threshold. The switch may send the PAUSE frame to a link partner, e.g., a network processor, via one of its ports.
Each switch 102 may have multiple ports to be connected to the network processor 104. However, the network processor 104 may not have enough ports to connect directly to the ports of all switches 102. To accommodate this pin-count (and possibly throughput) limitation, the ports in a switch may be connected to a single port in the network processor 104 through a high-speed serial interface 106, e.g., a 1 Gbps (1.25 Gbaud) SERDES (Serializer/Deserializer) interface. The network ports of the switch may have a lower maximum speed (e.g., 10/100 Mbps), in which case the single 1 Gbps port can accommodate the throughput of ten 100 Mbps network ports.
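The figures above can be checked with a short calculation. It assumes the SERDES uses 8b/10b line coding (as 1000BASE-X does), which is why a 1.25 Gbaud line carries 1 Gbps of data, and it ignores framing overhead:

$$
R_{\text{data}} = 1.25\ \text{Gbaud} \times \frac{8}{10} = 1\ \text{Gbps},
\qquad
N_{\text{ports}} = \frac{R_{\text{data}}}{R_{\text{port}}} = \frac{1\ \text{Gbps}}{100\ \text{Mbps}} = 10
$$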
The network processor 104 may receive and transmit Ethernet frames to and from the switches 102 through the high-speed serial link 108. In an embodiment, the network processor 104 may perform the bulk of switching and/or routing for the system 100, which itself may function as a switch. The network processor 104 may also perform other higher level network management functions, e.g., traffic management, policy lookup, IP routing, MPLS (Multiprotocol Label Switching), service level negotiation, etc. The switches connected to the network processor 104 may be limited to physical (PHY) and media access control (MAC) layer functionality and be limited functionally to buffering frames in transit to and from the network processor 104.
Queuing for local ports may be handled by input and output buffers (e.g., a first-in-first-out (FIFO) buffer) connected to the switch's input and output ports, respectively. The buffers can store a limited number of packets at a time. When the port receives more packets than it can handle, an overflow situation may occur, and the port may be forced to drop received packets until the buffer empties to some degree. For example, a network port on the switch may experience congestion on its transmit side when the rate of packets arriving at its buffer (to be transmitted) is higher than the rate at which it can transmit the packets on the port. This may occur when the link partner is requesting the switch port to stop sending frames using the 802.3x PAUSE frame (described below) and/or when the port speed is relatively low compared to the overall rate of frames arriving from other switch ports that are destined for this transmit port.
To reduce packet loss due to overflow at the port buffers, the serial interface 106 may employ a flow control mechanism. The local switch may monitor the fill level of its buffer at a port against a threshold and, before having to drop packets, notify the network processor through the serial interface 106 that this port is experiencing congestion. In such an architecture, congestion management, namely the decision as to which frames need to be dropped, may be implemented by the network processor, which has concurrent visibility into the congestion status of the downstream transmit ports on the switches.
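The idea of notifying the network processor before frames must be dropped can be sketched as a bounded FIFO whose congestion threshold sits below its capacity. In this minimal Python model, `PortQueue`, `capacity`, and `threshold` are hypothetical names, and capacity is counted in frames rather than bytes for simplicity.

```python
from collections import deque


class PortQueue:
    """Bounded FIFO for one switch port (hypothetical model).

    Capacity is counted in frames. The threshold sits below capacity so that
    congestion can be reported before any frame has to be dropped.
    """

    def __init__(self, capacity: int, threshold: int):
        if not 0 < threshold <= capacity:
            raise ValueError("need 0 < threshold <= capacity")
        self.capacity = capacity
        self.threshold = threshold
        self.frames = deque()
        self.dropped = 0

    def enqueue(self, frame) -> bool:
        """Append a frame; return False (and count a drop) if the FIFO is full."""
        if len(self.frames) >= self.capacity:
            self.dropped += 1
            return False
        self.frames.append(frame)
        return True

    def dequeue(self):
        """Remove and return the oldest frame, or None if the FIFO is empty."""
        return self.frames.popleft() if self.frames else None

    @property
    def congested(self) -> bool:
        """True once the fill level reaches the flow control threshold."""
        return len(self.frames) >= self.threshold
```

Because `congested` becomes true while there is still room in the FIFO, the switch has a window in which to report congestion and leave the drop decision to the network processor.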
In an embodiment, the switch or serial interface 106 may include a flow control monitor 206 to monitor the status of the port buffers connected to other network devices in the Ethernet LAN 107. The serial interface may also include a flow control module 210 to implement the flow control mechanism. The flow control module 210 may implement the flow control mechanism for Ethernet LAN switches specified by the IEEE 802.3x standard.
The IEEE 802.3x flow control mechanism is implemented within the MAC (Media Access Control) sublayer. A port's input buffer begins to fill as packets are received. Once the buffer has reached a pre-programmed threshold, the MAC sublayer signals an internal state machine to transmit a “PAUSE” frame. The PAUSE frame informs the link partner to halt transmission for a specified length of time, referred to as “Xoff”. The MAC continues to transmit PAUSE frames with the programmed idle time as long as the threshold has been exceeded. If the buffer level falls below the threshold prior to the expiration of this time, another PAUSE frame is sent with a zero time specified to re-enable transmission, referred to as “Xon”.
In an embodiment, flow control may be activated based on the output queue fill level of the port 105. When a packet is sent from the network processor 104 through the serial link 108 to the switch 102, congestion may occur on the output queues of the destination port on the switch device 102. Once the port's output queue reaches a pre-programmed threshold, the MAC sublayer of the serial interface 106 may signal an internal state machine to transmit a “PAUSE” frame. The PAUSE frame informs the network processor 104 to halt transmission for the congested port for a specified length of time, referred to as “Xoff”. The MAC continues to transmit PAUSE frames with the programmed idle time as long as the threshold has been exceeded. If the output queue buffer level falls below the threshold prior to the expiration of this time, another PAUSE frame is sent with a zero time specified to re-enable transmission, referred to as “Xon”.
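The Xoff/Xon behavior described above can be modeled as a small state machine evaluated once per tick with the current output queue fill level. This is a sketch under stated assumptions: time is measured in abstract ticks, and the hypothetical `refresh_ticks` parameter stands in for the interval at which the MAC re-sends Xoff, which should be shorter than the programmed pause time so the link partner stays paused.

```python
from typing import Optional


class PauseStateMachine:
    """Xoff/Xon generator for one port's output queue (illustrative sketch).

    Time advances in abstract ticks; refresh_ticks is a hypothetical parameter
    that should be shorter than the Xoff pause time.
    """

    def __init__(self, threshold: int, xoff_quanta: int = 0xFFFF, refresh_ticks: int = 10):
        self.threshold = threshold
        self.xoff_quanta = xoff_quanta
        self.refresh_ticks = refresh_ticks
        self.paused = False
        self._ticks_since_pause = 0

    def tick(self, fill_level: int) -> Optional[int]:
        """Return the pause time to send on this tick (0 means Xon), or None."""
        if fill_level >= self.threshold:
            self._ticks_since_pause += 1
            # Send Xoff on entering congestion, then refresh it periodically.
            if not self.paused or self._ticks_since_pause >= self.refresh_ticks:
                self.paused = True
                self._ticks_since_pause = 0
                return self.xoff_quanta
            return None
        if self.paused:
            # Queue drained below threshold before the pause expired: send Xon.
            self.paused = False
            self._ticks_since_pause = 0
            return 0
        return None
```

For example, with `threshold=5` and `refresh_ticks=3`, the fill levels 2, 5, 6, 6, 6, 1 produce `None`, `0xFFFF`, `None`, `None`, `0xFFFF`, `0`: an initial Xoff, one refresh while the queue stays congested, and an Xon once it drains.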
In many network systems, the network devices (e.g., switches, routers, endpoints, and network processors) may only have information regarding the status of their own ports, and not that of other network devices in the system, e.g., their link partners. Flow control mechanisms may only provide the status of the particular link connecting two link partners and no more information about the capacity of the link partner. For example, an IEEE 802.3x PAUSE frame typically only communicates point-to-point flow control information, i.e., the status of the individual link on which the PAUSE frame is transmitted. Also, the information provided by the flow control status may be limited to simple binary information, e.g., on/off or “send”/“don't send”.
In an embodiment, the switches may provide the status of multiple ports to another network device. For example, in the system 100, the switches may provide the flow control status of their output ports to the network processor, which may use that information to make decisions regarding traffic management. For example, the network processor may include virtual buffers corresponding to the output ports of the switches 102. The network processor may use the per-port status information received from a switch to determine whether to send a frame to a particular port in that switch or to queue that frame in the virtual buffer corresponding to that port and instead send a frame to another, more available port, thereby avoiding head-of-line blocking. In addition, decisions to drop packets may be delegated to the network processor, which may have more sophisticated algorithms for handling packet overflow conditions.
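The network processor's per-port virtual buffers can be sketched as a set of virtual output queues, one per switch output port, served round-robin while skipping ports whose latest reported status is congested. A frame for a congested port waits in its own queue instead of blocking frames for other ports, which is how this arrangement avoids head-of-line blocking. The class and method names, and the round-robin policy, are assumptions for illustration; the description does not prescribe a scheduling algorithm.

```python
from collections import deque


class VirtualOutputQueues:
    """Per-port virtual buffers kept by the network processor (sketch).

    `congested` mirrors the latest per-port flow control status reported by a switch.
    """

    def __init__(self, num_ports: int):
        self.queues = [deque() for _ in range(num_ports)]
        self.congested = [False] * num_ports
        self._next = 0  # round-robin starting point

    def enqueue(self, port: int, frame) -> None:
        self.queues[port].append(frame)

    def update_status(self, congested_ports) -> None:
        """Replace the status with the set of ports currently asserting flow control."""
        congested_ports = set(congested_ports)
        self.congested = [p in congested_ports for p in range(len(self.queues))]

    def next_frame(self):
        """Return (port, frame) from the next non-empty, non-congested queue, or None."""
        n = len(self.queues)
        for i in range(n):
            port = (self._next + i) % n
            if self.queues[port] and not self.congested[port]:
                self._next = (port + 1) % n
                return port, self.queues[port].popleft()
        return None
```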
In an embodiment, the system 100 may use the IEEE 802.3x flow control mechanism to communicate per-port flow control information from the switches to the network processor. The flow control module 210 in the serial interface 106 may include a MAC module 212 to handle MAC sublayer functions. The flow control module 210 may insert flow control status information for each port into the PAUSE frame, e.g., by generating a bitmap identifying the flow control status of each port. The flow control module may insert the bitmap 402 into the PAUSE frame, e.g., in the 42-byte reserved field, as shown in
In an embodiment, the bitmap may be an 8-bit field, with each bit representing the flow control status of a port, as shown in
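A minimal sketch of the bitmap encoding follows. It assumes bit i of the 8-bit field corresponds to port i (a set bit meaning the port is asserting flow control) and that the bitmap occupies the first octet of the 42-byte reserved field, which begins 18 bytes into the frame. Neither the bit order nor the exact octet is specified above, so both are placeholders. Because standard PAUSE frames carry zeros in the reserved field, both ends of the link must implement this extension for the bitmap to be interpreted.

```python
import struct

PAUSE_DA = bytes.fromhex("0180c2000001")
RESERVED_OFFSET = 18  # DA(6) + SA(6) + EtherType(2) + opcode(2) + pause_time(2)
RESERVED_LEN = 42


def encode_port_bitmap(congested_ports, num_ports: int = 8) -> int:
    """Bit i set means port i is asserting flow control (assumed bit order)."""
    bitmap = 0
    for port in congested_ports:
        if not 0 <= port < num_ports:
            raise ValueError(f"port {port} out of range")
        bitmap |= 1 << port
    return bitmap


def pause_frame_with_bitmap(src_mac: bytes, pause_quanta: int, bitmap: int) -> bytes:
    """PAUSE frame (no FCS) carrying the bitmap in the first reserved octet (assumed)."""
    header = PAUSE_DA + src_mac + struct.pack("!HHH", 0x8808, 0x0001, pause_quanta)
    reserved = bytearray(RESERVED_LEN)
    reserved[0] = bitmap & 0xFF
    return header + bytes(reserved)
```

The pause time is left as a parameter because the description does not state what value accompanies the bitmap.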
The flow control module 210 may update its internal registers 214 with the flow control status information received from the flow control monitor 206 (block 606). In an embodiment, the flow control monitor 206 and flow control module 210 may be integrated into a single module. When a flow control event occurs (block 608), the flow control module 210 may generate a PAUSE frame including a bitmap indicating the per-port flow control status of the ports (block 610). The serial interface may then transmit the PAUSE frame to the network processor 104 over the high-speed serial link (block 612).
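The sequence of blocks 606 through 612 can be sketched as a module that latches the latest per-port status and emits a frame only when that status changes. Here a flow control event is assumed to be any change in the bitmap (a port crossing its threshold in either direction), and the `send` callback is a hypothetical stand-in for building the PAUSE frame and transmitting it over the serial link.

```python
from typing import Callable, Optional, Sequence


class FlowControlModule:
    """Sketch of blocks 606-612: latch per-port status and emit a frame on change.

    `send` is a hypothetical callback standing in for the serial link transmitter.
    """

    def __init__(self, send: Callable[[int], None]):
        self.status_register = 0  # last reported bitmap (internal registers, block 606)
        self.send = send

    def update(self, port_congested: Sequence[bool]) -> Optional[int]:
        """Latch new status from the monitor; return the bitmap sent, or None."""
        bitmap = 0
        for port, flag in enumerate(port_congested):
            if flag:
                bitmap |= 1 << port
        if bitmap == self.status_register:
            return None  # no flow control event (block 608)
        self.status_register = bitmap
        self.send(bitmap)  # generate and transmit the PAUSE frame (blocks 610-612)
        return bitmap
```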
In an alternative embodiment, the serial interface may transmit the per-port flow control status information on a continuous basis. For example, the flow control module may insert a bitmap including per-port flow control status into the preamble of each frame transmitted by the serial interface to the network processor.
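The preamble variant can be sketched in the same way. An Ethernet preamble consists of seven 0x55 octets followed by the start frame delimiter (SFD) 0xD5; this sketch assumes, purely hypothetically, that the last preamble octet is replaced by the status bitmap. A standard MAC typically generates and strips the preamble itself, and a bitmap value such as 0xD5 could be mistaken for the SFD, so both ends of the serial link would need to agree on how these octets are handled.

```python
PREAMBLE_OCTET = 0x55
SFD = 0xD5
BITMAP_OFFSET = 6  # hypothetical: last of the seven preamble octets


def preamble_with_bitmap(bitmap: int) -> bytes:
    """8-byte preamble + SFD with one preamble octet replaced by the status bitmap."""
    octets = bytearray([PREAMBLE_OCTET] * 7 + [SFD])
    octets[BITMAP_OFFSET] = bitmap & 0xFF
    return bytes(octets)


def bitmap_from_preamble(preamble: bytes) -> int:
    """Recover the bitmap from an 8-byte preamble + SFD."""
    if len(preamble) != 8 or preamble[7] != SFD:
        raise ValueError("expected 7 preamble octets followed by the SFD")
    return preamble[BITMAP_OFFSET]
```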
The network processor may extract the bitmap from the designated location in a frame (e.g., the preamble or the reserved field) and use this information to determine whether to throttle the transmission of frames to particular port(s) (block 614).
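On the receiving side, the network processor could parse a PAUSE frame and recover the set of congested ports as shown below. This sketch assumes the same hypothetical layout as the encoding side (bit i for port i, bitmap in the first reserved octet at offset 18); the caller would then hold frames queued for those ports and continue serving the others.

```python
import struct


def congested_ports_from_pause(frame: bytes, num_ports: int = 8) -> set:
    """Return the ports whose flow control bit is set in a PAUSE frame (assumed layout)."""
    if len(frame) < 60:
        raise ValueError("frame shorter than a minimum-size PAUSE frame")
    ethertype, opcode = struct.unpack_from("!HH", frame, 12)
    if ethertype != 0x8808 or opcode != 0x0001:
        raise ValueError("not an IEEE 802.3x PAUSE frame")
    bitmap = frame[18]  # assumed location: first octet of the reserved field
    return {port for port in range(num_ports) if (bitmap >> port) & 1}
```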
A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. For example, blocks in the flowchart may be skipped or performed out of order and still produce desirable results. Accordingly, other embodiments are within the scope of the following claims.