The present invention relates to a method for flow control in a switch and a switch controlled thereby. In order to ensure that no or few packets are dropped in a switch because of a congested internal memory, pause frames or stop command messages are sent to upstream senders. When to send pause frames is determined by monitoring the buffer contents of the switch and estimating the total expected contents of the links between the senders and the switch. The pause frames are sent to the most offending senders, i.e. the senders causing the largest queues in the switch.
Various problems with a bearing on the present technology have been recognized in the prior art.
The documents WO 96/08899 and U.S. Pat. No. 5,125,096 disclose systems for estimating the expected number of cells arriving at a buffer or a node.
U.S. Pat. Nos. 5,493,566 and 5,905,870 disclose flow control systems using stop messages to selected gateways and buffers.
The present invention is an improvement over the prior art in that the switch does not contain any input port buffers; instead, data cells are switched immediately to queues associated with the output ports. The internal memory or buffer is shared between the output ports. The remaining available space of the buffer is monitored, as well as the estimated contents of the incoming links connected to the switch. Pause commands are sent to the most offending sender on selected links, and un-pause commands are sent to the least offending of the paused senders.
The invention can guarantee zero-drop of data cells as well as fairness between the participating senders.
It is an object of the invention to provide a method for flow control in a switch, and a switch controlled thereby, in which zero drop can be guaranteed by sending pause commands to the selected most offending sender and un-pause commands to the least offending of the paused senders, while non-offending senders are involved as little as possible.
The present invention provides a method for controlling a switch comprising:
a number of input ports, each receiving data cells on a respective link;
a number of output ports sharing a buffer space in which each output port can reserve space for an output queue, wherein incoming data cells are switched to an appropriate output queue;
a flow control means for pausing and un-pausing senders on selected links; the method including the steps of:
monitoring the remaining available buffer space AS of the shared buffer;
estimating the expected total content LE of the links;
calculating a free margin (FM) as the remaining available buffer space minus the expected total content of the links, FM = AS − LE;
if the free margin falls below a threshold A, i.e. AS − LE < A, then a selected link is paused;
if the free margin thereafter rises above a threshold B, i.e. AS − LE > B, then a selected paused link is un-paused.
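Purely as an illustration of these steps, a minimal sketch in C is given below; the threshold values and the names used (THRESHOLD_A, THRESHOLD_B, evaluate) are assumptions introduced for the example and are not part of the claimed method.

```c
#include <stdint.h>
#include <stdbool.h>

/* Threshold values A and B in bytes; A = 0 gives the zero-drop guarantee
 * and B >= A is preferred for smoother operation (values are assumptions). */
#define THRESHOLD_A 0
#define THRESHOLD_B 4096

typedef enum { DO_NOTHING, SEND_PAUSE, SEND_UNPAUSE } fc_action_t;

/* One evaluation of the method steps: given the monitored available space
 * AS and the estimated total link content LE, decide whether a selected
 * link should be paused or a selected paused link un-paused. */
fc_action_t evaluate(int64_t as_bytes, int64_t le_bytes, bool any_paused)
{
    int64_t fm = as_bytes - le_bytes;       /* FM = AS - LE */

    if (fm < THRESHOLD_A)
        return SEND_PAUSE;                  /* margin too small: pause    */
    if (fm > THRESHOLD_B && any_paused)
        return SEND_UNPAUSE;                /* margin recovered: un-pause */
    return DO_NOTHING;
}
```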
The present invention also provides a switch operating in accordance with the method.
The invention will be described in greater detail below with reference to the accompanying drawings.
Generally, the present invention addresses the problem of ensuring that no data cells are dropped (zero-drop) in a switch having a limited internal buffer space. When the remaining available buffer space of the internal memory approaches the amount of data that may be expected to arrive on the links, stop messages or pause frames should be sent to the upstream senders in order to prevent overflow of the internal memory and loss of data.
The basic idea is to calculate an upper bound on the contents of all links based on their round-trip time and their history of pausing and un-pausing. As soon as the contents exceed the free internal buffer memory, we start pausing the input ports, one at a time, and we send pause frames only to those lines that actually use up the internal memory, i.e. those that send packets to congested output ports or to ports paused by downstream nodes.
An overflow sum OFS is defined, for each input port 4, as the number of bytes that are sent to congested output ports 6. A port is considered congested when a packet arrives at a queue that is longer than a certain threshold (currently, 24 cells = one full-size packet). Since this variable is of limited size, the counter has to be decreased at some point. There are at least three ways to do this while still preserving the order of the counters.
To increase output port utilization, we set the overflow sum OFS to zero when a port is un-paused.
A variable OFSmax keeps track of the input port having the largest counter OFS. The purpose of the variable OFSmax is to indicate which of the un-paused input ports is to receive the next pause frame.
Similarly, a variable OFSmin keeps track of the input port having the smallest counter OFS of all paused input ports. The variable OFSmin indicates which of the paused input ports is to receive the next un-pause frame.
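As a non-authoritative sketch, assuming a fixed number of input ports and per-port byte counters (the names NUM_PORTS, account_packet, on_pause and on_unpause are hypothetical), the overflow sums and the OFSmax/OFSmin selections described above might be maintained as follows:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_PORTS        16      /* assumed port count                    */
#define CONGESTED_CELLS  24      /* threshold: 24 cells = one full packet */

typedef struct {
    uint64_t ofs;     /* overflow sum: bytes sent to congested output ports */
    bool     paused;  /* has this input port been sent a pause frame?       */
} input_port_t;

static input_port_t ports[NUM_PORTS];

/* Called for every arriving packet: charge the input port if the target
 * output queue is longer than the congestion threshold. */
void account_packet(int in_port, int out_queue_cells, uint32_t packet_bytes)
{
    if (out_queue_cells > CONGESTED_CELLS)
        ports[in_port].ofs += packet_bytes;
}

/* OFSmax: the un-paused input port with the largest overflow sum. */
int ofs_max(void)
{
    int best = -1;
    for (int i = 0; i < NUM_PORTS; i++)
        if (!ports[i].paused && (best < 0 || ports[i].ofs > ports[best].ofs))
            best = i;
    return best;   /* -1 if every port is already paused */
}

/* OFSmin: the paused input port with the smallest overflow sum. */
int ofs_min(void)
{
    int best = -1;
    for (int i = 0; i < NUM_PORTS; i++)
        if (ports[i].paused && (best < 0 || ports[i].ofs < ports[best].ofs))
            best = i;
    return best;   /* -1 if no port is paused */
}

void on_pause(int in_port)
{
    ports[in_port].paused = true;
}

/* When a port is un-paused its overflow sum is reset to zero
 * (to increase output-port utilization, as noted above). */
void on_unpause(int in_port)
{
    ports[in_port].paused = false;
    ports[in_port].ofs = 0;
}
```

In this sketch, on_unpause() also clears the paused flag, so that OFSmax and OFSmin are always computed over disjoint sets of ports.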
The purpose of the pause frame generator is to generate a pause frame at the right time and to send it to the right input port. One problem is to estimate the amount of data in the links. To always be on the safe side, the estimate must be higher than the actual value and, to be efficient, as close to the actual value as possible. The intention is that the estimate, for each port, should always be higher, but never by more than one full-sized packet.
We use a model in which each link contains twice as much as the round-trip content plus two full-sized packets (to account for the situation where we have to wait for the pause frame (= one packet) to be sent, and where the pause frame arrives just after the switch has started sending a packet (+ another packet)). The line is initially considered full at a maximum value, and we maintain an upper bound on the estimate of the contents of this line. The maximum value depends on the length and bit rate of the link.
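A hedged example of how the maximum value of a link's content could be derived under this model is sketched below; the packet size, the parameter names and the reading of "twice as much as the round-trip content" as twice the product of bit rate and round-trip time are assumptions, not values taken from the specification.

```c
#include <stdint.h>

#define MAX_PACKET_BYTES 1518u   /* assumed full-sized Ethernet frame */

/* Upper bound on the content of one link, following the model above:
 * twice as much as the round-trip content plus two full-sized packets.
 * bit_rate_bps is in bits per second and rtt_s in seconds; both depend
 * on the length and speed of the link.  Reading "twice the round-trip
 * content" as 2 * (bit rate * round-trip time) is an assumption. */
uint64_t link_max_bytes(double bit_rate_bps, double rtt_s)
{
    double round_trip_bytes = bit_rate_bps * rtt_s / 8.0;
    return (uint64_t)(2.0 * round_trip_bytes) + 2u * MAX_PACKET_BYTES;
}
```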
This is accomplished by having every port be in one of six states, as shown in the accompanying drawings.
The reason that we consider the line empty when there is one full-size packet left, and that we increase the contents linearly when the actual content increases in steps of the current packet size, is that we want to increase the stability of the algorithm by smoothing the input.
As is shown in the drawings, a link estimate is computed for each of the input ports, and the estimates are added together to form the total amount of data LE (Link Estimate) expected to arrive at the switch. Once we have the estimate of the total content of all links, we compare it to the free buffer memory. The free buffer memory AS (Available Space) equals the total memory space minus the currently occupied space (and minus any reserved space). This is the space available for storing arriving cells, i.e. the space available to the output ports.
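As an illustrative sketch only, the summation of LE and the computation of AS might look as follows; the array of per-port estimates and the reserved-space parameter are hypothetical names introduced for the example:

```c
#include <stdint.h>

#define NUM_PORTS 16   /* assumed port count */

/* Per-port link estimates, assumed to be maintained elsewhere by the
 * per-port state machines described above. */
extern uint64_t port_link_estimate[NUM_PORTS];

/* LE: total amount of data expected to arrive from all links. */
uint64_t link_estimate_total(void)
{
    uint64_t le = 0;
    for (int i = 0; i < NUM_PORTS; i++)
        le += port_link_estimate[i];
    return le;
}

/* AS: free shared-buffer space available to the output ports, i.e. the
 * total memory minus the occupied space and minus any reserved space. */
uint64_t available_space(uint64_t total_bytes, uint64_t occupied_bytes,
                         uint64_t reserved_bytes)
{
    return total_bytes - occupied_bytes - reserved_bytes;
}
```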
By subtracting the link content from the available space, we obtain a free margin AS − LE, which is what remains if all the data in transit on the links is stored and no cells are emptied from the memory. In order to guarantee zero-drop, the margin should be equal to (or greater than) zero. The value A = 0 is currently preferred. Thus, if AS − LE < A, a pause frame should be sent. In other words, if there is not enough free memory to house the contents of the links, we pause the input ports, one by one, until the contents fit. To spread the burden evenly, we always choose to pause the worst offender among the un-paused input ports, the one with the highest overflow sum, i.e. the one that has sent the most bytes to a congested port. This has proven to be perfectly fair in the case where two or more ports each send 100% load to one port.
When the free margin rises above a level B, an un-pause frame should be sent, since the buffer can now accommodate more data. The value of B may be equal to A, but a somewhat larger value is preferred, so that a smoother operation is obtained. Thus, A ≤ B. If more than one sender is paused, the un-pause frame is sent to the least offending sender, as indicated by the variable OFSmin mentioned above.
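Tying the threshold tests to the offender ranking, one possible decision routine is sketched below; it issues at most one pause or un-pause frame per invocation and relies on the hypothetical helper functions from the earlier sketches. It is a sketch under these assumptions, not a definitive implementation of the claimed method.

```c
#include <stdint.h>

/* Assumed helpers; see the earlier sketches for possible definitions. */
extern int  ofs_max(void);              /* worst un-paused offender, or -1    */
extern int  ofs_min(void);              /* least offending paused port, or -1 */
extern void on_pause(int in_port);      /* marks the port as paused           */
extern void on_unpause(int in_port);    /* clears the flag, resets its OFS    */
extern void send_pause_frame(int in_port);
extern void send_unpause_frame(int in_port);

#define THRESHOLD_A 0        /* A = 0 preserves the zero-drop guarantee      */
#define THRESHOLD_B 4096     /* B >= A, assumed value for smoother operation */

/* Issue at most one pause or un-pause frame per call; the routine is assumed
 * to be re-evaluated as the buffer occupancy and link estimates change, so
 * that input ports are paused (or released) one at a time. */
void flow_control_decision(int64_t as_bytes, int64_t le_bytes)
{
    int64_t fm = as_bytes - le_bytes;           /* free margin FM = AS - LE */

    if (fm < THRESHOLD_A) {
        int p = ofs_max();                      /* worst un-paused offender  */
        if (p >= 0) {
            send_pause_frame(p);
            on_pause(p);
        }
    } else if (fm > THRESHOLD_B) {
        int p = ofs_min();                      /* least offending paused    */
        if (p >= 0) {
            send_unpause_frame(p);
            on_unpause(p);
        }
    }
}
```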
The strict zero-drop guarantee of the invention is based on the worst-case scenario where all links are filled with maximum-length packets directed to the same port, which for some reason (pause or collisions) is not being emptied. Since this case is probably not very common, the algorithm will be unnecessarily inefficient. It might be possible to increase the efficiency by letting the administrator relax some of the rules. This may be done by setting the threshold A to a negative value. This is done at the cost of losing the zero-drop guarantee, but in most realistic traffic cases this will not be a problem, since only a few packets will be dropped. The algorithm will still guarantee perfect fairness between the input ports, in the sense that all links participating in the over-subscription will experience the same pause factor, and none of the links going straight through will receive any pause frames at all.
A person skilled in the art will appreciate that the flow control of the present invention may be obtained in other ways than by using the pause frames described in the exemplifying embodiments. A pause frame is considered equivalent to any type of pause or stop command known in this field. The embodiments discussed above are only intended to be illustrative of the invention. The physical implementation in hardware and software, and other embodiments, may be devised by those skilled in the art without departing from the spirit and scope of the following claims.
Number | Name | Date | Kind |
---|---|---|---|
5125096 | Brantley, Jr. et al. | Jun 1992 | A |
5189668 | Takatori et al. | Feb 1993 | A |
5493566 | Ljungberg et al. | Feb 1996 | A |
5524211 | Woods et al. | Jun 1996 | A |
5528591 | Lauer | Jun 1996 | A |
5584033 | Barrett et al. | Dec 1996 | A |
5742606 | Iliadis et al. | Apr 1998 | A |
5774453 | Fukano et al. | Jun 1998 | A |
5905870 | Mangin et al. | May 1999 | A |
5995486 | Iliadis | Nov 1999 | A |
6167054 | Simmons et al. | Dec 2000 | A |
6172963 | Larsson et al. | Jan 2001 | B1 |
6252849 | Rom et al. | Jun 2001 | B1 |
6456590 | Ren et al. | Sep 2002 | B1 |
6628613 | Joung et al. | Sep 2003 | B1 |
20010043566 | Chow | Nov 2001 | A1 |
20010050913 | Chen et al. | Dec 2001 | A1 |
20020118689 | Luijten et al. | Aug 2002 | A1 |
Number | Date | Country |
---|---|---|
0 838 972 | Oct 1997 | EP |
0 853 441 | Nov 1997 | EP |
3283940 | Dec 1991 | JP |
04197972 | Jul 1992 | JP |
07270060 | Oct 1995 | JP |
09212046 | Aug 1997 | JP |
10107849 | Apr 1998 | JP |
9608899 | Mar 1996 | WO |
9900949 | Jan 1999 | WO |