 
                 Patent Application
                     20070058532
                    Embodiments of the invention relate to the field of networking, in particular, to a system and method for managing congestion over an Open Systems Interconnection (OSI) Layer 2 (L2) network.
Over the last year or so, Ethernet has come to be considered a viable solution for blade server backplanes and datacenter networks (generally referred to as “localized data networks”). Typical datacenter networks include multiple network connections, e.g., for storage traffic, inter-processor communication (IPC) traffic and local area network (LAN) traffic. Each of these traffic types currently needs its own infrastructure. For example, storage traffic needs servers and storage disks to have Fibre Channel adaptors and Fibre Channel switches to connect them. IPC traffic needs high-performance networking infrastructure. LAN traffic is carried over Ethernet infrastructure. It would be greatly beneficial, from a cost and management perspective, if all of these traffic types were carried over a single networking infrastructure: Ethernet.
However, one major hurdle in adopting this solution is that many Ethernet network implementations have rudimentary traffic controls, and thus, high latencies may be experienced for data communications within Ethernet networks. In order to achieve an acceptable level of data throughput and reduce latencies experienced over localized data networks, traffic congestion, such as increased packet queuing or dropped packets, needs to be quickly detected.
Currently, router-based Ethernet networks have adopted a mechanism to detect and handle OSI Layer 3 (L3) traffic congestion. This mechanism is referred to as Explicit Congestion Notification or “ECN”. More specifically, for ECN, traffic congestion is detected by accessing a specific bit or group of bits within an Internet Protocol (IP) header of an incoming IP message received by the router as described below.
 As shown in 
In summary, this TCP/IP flow control typically uses Congestion Window adaptation to estimate available bandwidth (BW) in the data network and adjusts the transmission rate accordingly. In other words, the transmission rate may be decreased to ease TCP/IP traffic. The Congestion Window is adjusted in response to (1) packet drops inferred from timeouts, (2) duplicate acknowledgement (ACK) messages, and (3) ECN as described above. While ECN provides a good mechanism for detecting L3 congestion of data flow, it does not consider L2 congestion, since ECN is configured so that only IP applications are congestion aware. Non-IP mechanisms have no visibility into congestion experienced by L2 networks.
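The Congestion Window adaptation summarized above can be illustrated with a minimal sketch. The class name, the initial values and the halving rule are assumptions chosen for illustration (a simplified, Reno-style behavior), not a definitive description of any particular TCP implementation.

```python
from dataclasses import dataclass


@dataclass
class CongestionWindow:
    """Simplified congestion window, counted in segments (illustrative only)."""

    cwnd: float = 10.0      # current congestion window (assumed initial value)
    ssthresh: float = 64.0  # slow-start threshold (assumed initial value)

    def on_ack(self) -> None:
        # Probe for bandwidth: fast growth below ssthresh, slow growth above it.
        if self.cwnd < self.ssthresh:
            self.cwnd += 1.0
        else:
            self.cwnd += 1.0 / self.cwnd

    def _halve(self) -> None:
        # Remember half of the current window as the new threshold.
        self.ssthresh = max(self.cwnd / 2.0, 2.0)

    def on_timeout(self) -> None:
        # (1) A timeout is taken as a packet drop: restart from one segment.
        self._halve()
        self.cwnd = 1.0

    def on_duplicate_acks(self) -> None:
        # (2) Duplicate ACKs suggest a loss: continue at half the window.
        self._halve()
        self.cwnd = self.ssthresh

    def on_ecn_echo(self) -> None:
        # (3) ECN reports congestion without a loss: also halve the window.
        self._halve()
        self.cwnd = self.ssthresh
```

In this sketch, all three congestion signals reduce the window, and only the timeout case collapses it to a single segment.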
As a result, since the typical topology for localized data networks such as blade server and datacenter networks involves an interconnection of servers by L2 switches, ECN would not be able to report and handle traffic congestion.
The invention may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the invention.
  
  
  
  
  
  
Herein, certain embodiments of the invention relate to a system and method for managing congestion caused by Internet Protocol (IP) messages or non-IP messages over a network. This congestion management mechanism is adapted to detect and handle traffic congestion associated with Open Systems Interconnection (OSI) Layer 2 (L2) networks. According to one embodiment of the invention, a Congestion Indication (CI) parameter is set within L2 frames transmitted over the network. The CI parameter is set by L2 switches/devices that experience congestion, such as congestion due to oversubscription. The CI parameter may be implemented as one or more bits within an L2 header (e.g., MAC header) of a message received by the L2 switch.
In the event that the OSI Network Layer internetworking protocol at the destination (networking) device is “IP” and the CI parameter is set, the IP layer should pass this information to a corresponding OSI Transport Layer protocol such as “Transmission Control Protocol” (TCP) or “User Datagram Protocol” (UDP). For instance, with respect to the TCP configuration, TCP will behave as if it has received an indication that the CE bit has been set and will send an acknowledgement (ACK) message with an ECN-Echo bit set to the source (networking) device. The remaining operations will follow the ECN specification.
In the event that the OSI Network Layer internetworking protocol at the destination (networking) device is “Non-IP” and the CI parameter is set, this “Non-IP” layer can define an extension to its protocol to carry this congestion information back to the source (networking) device. The source device should then reduce its rate of information transmission towards the destination (networking) device. This will help in reducing the congestion in the intermediate device(s).
In the following description, certain terminology is used to describe features of the invention. For example, the term “networking device” is any device supporting access to a network via a link, including, but not limited or restricted to, a computer such as any type of server (e.g., blade server), a network interface card or the like. A “switching device” includes a device adapted to transfer information, such as an L2 switch. A “link” is generally defined as an information-carrying medium that establishes a communication pathway. The link may be a wired interconnect, where the medium is a physical medium (e.g., electrical wire, optical fiber, cable, bus traces, etc.) or a wireless interconnect (e.g., air in combination with wireless signaling technology).
A “message” is broadly defined as information placed in a predetermined format for transmission over a network from a source device. The message may be in a variety of formats such as an Ethernet frame configured in accordance with current or future Ethernet standards such as the IEEE 802.3 standard entitled “Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications” (2002), a packet encapsulated as an IP packet and including an Ethernet frame, or the like. The “source device” is broadly defined as a sender of a message while a “destination device” is the intended recipient of the message. Both source and destination devices may be networking devices.
The term “logic” is generally defined as hardware and/or software that performs one or more operations such as measuring data traffic and setting data within a transmitted frame to denote traffic congestion. When deployed in software, such software may be executable code such as an application, a routine or even one or more instructions. Software may be stored in any type of memory, namely any suitable storage medium such as a programmable electronic circuit, any type of semiconductor memory device such as a volatile memory (e.g., random access memory, etc.) or non-volatile memory (e.g., read-only memory, flash memory, etc.), a hard disk drive, or any portable storage such as a floppy diskette, an optical disk (e.g., compact disk or digital versatile disc “DVD”), a digital tape or the like.
As an example, a storage medium may be provided to store software that, if executed by a switching device such as an L2 switch, will cause the switching device to (i) measure traffic at incoming and outgoing ports of the switching device, and (ii) alter information within the L2 header of an incoming message prior to outputting the message in order to indicate traffic congestion where the measured traffic congestion exceeds a threshold limit. The information is used to initiate a mechanism, such as an established ECN notification scheme, for notifying a source of the message as to the traffic congestion experienced by the message. The alteration may involve setting a bit, such as a Canonical Format Identifier (CFI) bit, within an Ethernet message or creating a new header in the Ethernet frame to carry this CI bit or setting a value within a Type of Service (ToS) field of the Ethernet message.
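As a minimal sketch of operation (ii), the following Python fragment sets the CFI bit in the VLAN tag of an Ethernet frame when a measured load exceeds a threshold. The function name, the load metric and the threshold are hypothetical; the tag layout (a 0x8100 tag identifier at byte offset 12, followed by a 16-bit control field in which the CFI bit is 0x1000) follows the IEEE 802.1Q tag format. A real switch would also have to update the frame check sequence, which this sketch omits.

```python
import struct

TPID_8021Q = 0x8100  # value identifying an 802.1Q VLAN tag
CFI_MASK = 0x1000    # CFI bit within the 16-bit Tag Control Information


def mark_congestion(frame: bytes, measured_load: float, threshold: float) -> bytes:
    """Return the frame with the CFI bit set if measured_load exceeds threshold.

    measured_load and threshold are hypothetical quantities (e.g., port
    utilization between 0 and 1); the source does not define them.
    """
    if measured_load <= threshold:
        return frame  # no congestion: forward unchanged
    if len(frame) < 16:
        return frame  # too short to hold a VLAN tag
    tpid, tci = struct.unpack_from("!HH", frame, 12)
    if tpid != TPID_8021Q:
        return frame  # untagged frame: this sketch leaves it unchanged
    marked = bytearray(frame)
    struct.pack_into("!H", marked, 14, tci | CFI_MASK)
    return bytes(marked)
```

Only the CFI embodiment is sketched here; defining a new header or altering another field, as also described above, would follow the same pattern with a different offset and mask.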
In the following description, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order not to obscure the understanding of this description.
 Referring to 
 As shown, blade server 2101 transmits a message 250 to blade server 2102. A frame 300 (e.g., Ethernet frame) is encapsulated within message 250 and includes an L2 header 310 and a payload 350 as shown in 
 Upon detecting congestion on a port 230 (e.g., TX port 2), switch 220 may be adapted to set TYPE field 340 of 
Regardless of whether the CI parameter is set by the switch altering TYPE field 340 or any unused bit in L2 header 310 (e.g., CFI bit 346 of VLAN field 345), message 250 including the altered Ethernet frame 300 is routed to blade server 2102 through congested port 230. Blade server 2102 is adapted to monitor incoming Ethernet frames to detect the setting of the CI parameter to denote unacceptable traffic congestion.
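On the receiving side, the monitoring described above might look like the following sketch for the CFI-bit embodiment. The function name is hypothetical, and the frame is assumed to carry an IEEE 802.1Q VLAN tag at byte offset 12; detection based on the TYPE field would need a different check.

```python
import struct

TPID_8021Q = 0x8100  # value identifying an 802.1Q VLAN tag
CFI_MASK = 0x1000    # CFI bit within the 16-bit Tag Control Information


def ci_is_set(frame: bytes) -> bool:
    """Return True if the frame is VLAN-tagged and its CFI bit is set."""
    if len(frame) < 16:
        return False  # too short to hold a VLAN tag
    tpid, tci = struct.unpack_from("!HH", frame, 12)
    return tpid == TPID_8021Q and bool(tci & CFI_MASK)
```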
Upon detecting the CI parameter being set, the OSI Link layer of blade server 2102 notifies its OSI Network layer that the CI parameter is set. For instance, the IP layer would be notified and would pass this information to a corresponding OSI Transport Layer protocol such as “Transmission Control Protocol” (TCP) or “User Datagram Protocol” (UDP). For instance, with respect to a TCP implementation, TCP would send an acknowledgement (ACK) message 400 back to blade server 2101 with an ECN-Echo bit 420 set within a TCP header 410 of ACK message 400.
 As shown in 
In summary, blade server 2102 notes that it has received a message that experienced traffic congestion and sends ACK message 400 to blade server 2101, with ECN-Echo bit 420 set in TCP header 410. The setting of ECN-Echo bit 420 informs blade server 2101 that message 250 experienced traffic congestion, and thus, blade server 2101 can adjust the TCP transmit rate or path to reduce such traffic congestion. Optionally, blade server 2101 may acknowledge receipt of the ECN indication to blade server 2102 by setting the CWR flag 422 in the next TCP packet of the flow sent to blade server 2102.
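The exchange summarized above can be sketched as follows. The flag values are the TCP header flag bits defined for ECN (ECE 0x40, CWR 0x80, alongside ACK 0x10); the function names and the halving of the transmit rate are assumptions made for illustration, since the description only states that the source can adjust its rate or path.

```python
from typing import Tuple

TCP_ACK = 0x10  # acknowledgement flag
TCP_ECE = 0x40  # ECN-Echo flag
TCP_CWR = 0x80  # Congestion Window Reduced flag


def ack_flags(ci_seen: bool) -> int:
    """Flags for the destination's ACK; ECN-Echo is added when CI was seen."""
    return TCP_ACK | (TCP_ECE if ci_seen else 0)


def source_react(flags: int, rate: float) -> Tuple[float, int]:
    """Return (new transmit rate, flags for the source's next segment).

    Halving the rate is an illustrative assumption, not the patent's rule.
    """
    if flags & TCP_ECE:
        return rate / 2.0, TCP_ACK | TCP_CWR
    return rate, TCP_ACK
```

For example, `source_react(ack_flags(True), 100.0)` yields a reduced rate together with flags that include CWR, mirroring the optional acknowledgement described above.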
The above-described invention is advantageous because it extends the current ECN mechanism so that it is applicable to a backplane, datacenter or cluster network configuration. Further, it allows TCP to adjust to congestion within L2 clusters so that Head of Line (HoL) blocking can be avoided, while improving throughput and enabling traffic congestion monitoring of non-IP messages. It also makes “Non-IP” protocols aware of congestion in the intermediate devices, enabling them to implement better and newer congestion management protocols and techniques.
 Referring now to 
In general, Active Queue Management (AQM) is a mechanism that can use one of several alternatives for congestion indication, but in the absence of ECN, AQM is restricted to using packet drops for congestion indication. AQM drops packets when the average queue length exceeds a threshold, rather than only when the queue actually overflows.
For ECN, AQM can set a Congestion Experienced (CE) codepoint in the IP header instead of dropping the packet. Similarly, AQM may be adapted to identify congestion such as at port 530 of switch 5203.
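A minimal sketch of this threshold behavior is given below. The class name, the exponentially weighted moving average used for the average queue length, and the parameter values are assumptions for illustration; the description does not specify how the average is computed.

```python
from enum import Enum


class Action(Enum):
    FORWARD = "forward"  # no congestion indicated
    MARK = "mark"        # indicate congestion by marking
    DROP = "drop"        # indicate congestion by dropping


class SimpleAqm:
    """Threshold-based AQM on an averaged queue length (illustrative only)."""

    def __init__(self, threshold: float, weight: float = 0.2,
                 marking_supported: bool = True) -> None:
        self.threshold = threshold            # average queue length limit
        self.weight = weight                  # weight of the newest sample
        self.marking_supported = marking_supported
        self.avg_queue = 0.0

    def on_enqueue(self, queue_len: int) -> Action:
        # Update the running average with the instantaneous queue length.
        self.avg_queue = ((1.0 - self.weight) * self.avg_queue
                          + self.weight * queue_len)
        if self.avg_queue <= self.threshold:
            return Action.FORWARD
        # Above the threshold: mark if possible, otherwise fall back to a drop.
        return Action.MARK if self.marking_supported else Action.DROP
```

Averaging lets short bursts pass unmarked while sustained queue growth triggers the congestion indication before the queue overflows.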
For this illustrative example, networking device 5102 is transferring an Ethernet message to networking device 5104. The message is routed through port 512 of networking device 5102, ports 521-522 of switch 5202, ports 523-524 of switch 5203, ports 525-526 of switch 5204 and port 514 of networking device 5104. AQM of switch 5203 detects congestion at port 524 and sets the CI parameter. This may be accomplished by setting the CFI bit within the VLAN field of the Ethernet frame according to one embodiment of the invention. Of course, it is possible that a new field can be defined in the L2 header of the Ethernet frame to carry this congestion information. The Ethernet frame may be the Ethernet message itself or may be encapsulated within the Ethernet message.
Networking device 5104 detects the congestion indication and responds by setting the ECN-Echo bit within the TCP header of an Acknowledgement returned to networking device 5102. Hence, L2 congestion, including congestion experienced by non-IP messages, can be detected, rather than congestion detection being restricted to L3 traffic.
Upon AQM detecting unacceptable traffic conditions, the outgoing frames are marked. A Random Early Detection (RED) algorithm may be used to select the frames to mark. Such marking involves setting the CI parameter and forwarding the message to the destination device. The procedure by which the CI parameter is translated into the setting of the ECN-Echo bit of the TCP header in a returned ACK message is described above.
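The frame selection performed by RED can be sketched as below. This is a simplified form of the algorithm in which the marking probability rises linearly between two thresholds on the average queue length; the count-based correction of the full algorithm is omitted, and the function and parameter names are hypothetical.

```python
import random


def red_mark_probability(avg_queue: float, min_th: float,
                         max_th: float, max_p: float) -> float:
    """Marking probability as a function of the average queue length."""
    if avg_queue < min_th:
        return 0.0  # queue is short: never mark
    if avg_queue >= max_th:
        return 1.0  # queue is long: mark every frame
    # Between the thresholds the probability grows linearly up to max_p.
    return max_p * (avg_queue - min_th) / (max_th - min_th)


def should_mark(avg_queue: float, min_th: float, max_th: float,
                max_p: float, rng: random.Random) -> bool:
    """Randomly decide whether to set the CI parameter on this frame."""
    return rng.random() < red_mark_probability(avg_queue, min_th, max_th, max_p)
```

Selecting frames at random spreads the congestion indications across flows instead of marking every frame of whichever flow happens to arrive during a burst.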
 Referring now to 
Thereafter, the message is routed to the destination device, which determines that the frame experienced unacceptable traffic congestion (blocks 630 and 640). This is determined through analysis of the CFI bit for example, or the value placed in the Type field of the frame. Information regarding the presence of unacceptable traffic congestion is provided to the source device through an Acknowledgement (ACK) message from the destination device (block 650). Such presence may be identified to the source device by setting the ECN-ECHO bit within the ECN field of the TCP header.
The information is returned to the source device to adjust transmit rates, transmission paths and the like (block 660).
While the invention has been described in terms of several embodiments, those of ordinary skill in the art will recognize that the invention is not limited to the embodiments described, but can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative instead of limiting. For instance, the ACK message may be from a protocol other than TCP as described above.