1. Field of the Invention
The invention relates generally to the field of packet-switched networks, and in particular to techniques for managing packet traffic inside a router or switch.
2. Background Art
Today, sending information using packets over communications networks, such as the Internet, is widespread. All forms of data, including email, documents, photos, video, audio, and software updates, are sent via communications lines connected together by routers or switches. The increased data flow has created the need for higher-capacity network lines, such as 10 gigabits-per-second (Gbps) lines. This in turn means that the routers or switches must route these data packets from source to destination in an efficient manner and not become bottlenecks hindering data flow.
An issue of bandwidth allocation occurs when a plurality of input queues send data to a shared output queue. For example, assume that input port 1 110 has input queue 120-1, that input port 2 112 has input queue 120-2, and that input port 3 114 has input queue 120-3. Further assume that they share, via switching fabric 140, a common output port queue 174. The send rates, μi, associated with the data lines 126, 128, and 130, i.e., μ1, μ2, and μ3, respectively, cannot exceed the maximum arrival rate R 172 of output queue 174. Thus data line 150, whose rate is the sum of μ1, μ2, and μ3, cannot exceed R 172. The issue in this example is the allocation of the bandwidth of data line 150 to each of the lines 126, 128, and 130. For example, a poor allocation would give all the available bandwidth to the first two data lines 126 and 128 (input queues 120-1 and 120-2). This would mean that the queue 120-3 in input port 3 114 would starve. Thus some form of “fair” allocation of the available bandwidth of the shared output queue between the input queues is needed.
Currently, there are many algorithms for routing data traffic, but each has its own set of limitations and problems. One example is a rate controlled service discipline, which includes a rate controller and a scheduler. The rate controller has a set of regulators corresponding to each of the connections traversing the switch. The regulators shape the input traffic by assigning an eligibility time to each packet. The scheduler orders the transmission of eligible packets from all the connections. There are several problems with this algorithm. First, the system may be idle even when there are packets waiting to be sent. Next, the single scheduler is a single point of failure. Another example is a conventional shared memory approach that depends upon a central switch to provide high-speed interconnections to all ports. The problem is that each packet must be examined to determine its routing. Thus this approach requires a very high memory bandwidth and a fairly high overhead even for small systems.
Therefore with the increasing demand for routers that can switch high rates of data, for example, about 10 Gbps, at each input port, there is a need for techniques which efficiently and fairly control the data flow in a router.
The invention provides techniques for determining the data transmission or sending rates in a router or switch of two or more input queues in one or more input ports sharing an output port, which may optionally include an output queue. The output port receives desired or request data from each input queue sharing the output port. The output port analyzes this data and sends feedback to each input queue so that, if needed, the input queue can adjust its transmission or sending rate. In one embodiment of the invention, a plurality of input queues transfer data to an output queue. Each of these input queues sends the output queue a request sending rate. The request sending rates are summed together. The actual sending rate for one of these input queues is based on this sum. In another embodiment, each input queue sends its queue length or fullness to a shared output port. A derating factor is determined which modifies, if needed, the input queue's actual sending rate. In an alternative embodiment the queue length may be capped at a predetermined value.
In one embodiment of the invention, a method is provided for managing data traffic between a plurality of data providers, e.g., input ports, in a router or switch sharing a common data receiver, e.g., an output port. The common data receiver has a predetermined maximum receive rate. First, a desired transmission rate between a data provider and the common data receiver is determined. Next, a sum is calculated of a plurality of desired transmission rates, where the sum includes the desired transmission rate. Lastly, an actual transmission rate between the data provider and the common data receiver is determined, where the predetermined maximum receive rate and the sum are used to determine the actual transmission rate.
Another embodiment of the invention provides a method for managing data traffic between a plurality of input queues in a router sharing a common output port. The common output port has a predetermined maximum receive rate. The method includes determining a desired sending rate between an input queue of the plurality of input queues and the common output port; calculating a sum of a plurality of desired sending rates, where the sum includes the desired sending rate; and determining a proportional fair rate between the input queue and the output queue, where the sum is used to determine the proportional fair rate.
Yet another embodiment of the invention provides a system for controlling data flow in a router. The system includes a first determiner configured to determine a desired sending rate between an input queue and an output port; a calculator configured to determine a derating factor based on a sum of a plurality of desired sending rates, where the sum includes the desired sending rate; and a second determiner configured to determine an actual sending rate between the input queue and the output port, where the derating factor is used to determine the actual sending rate.
An embodiment of the invention provides a method for managing packet traffic between a plurality of input queues in a router sharing a common output queue. A length of an input queue of the plurality of input queues is determined. Next, a sum of a plurality of lengths of the plurality of input queues is calculated, where the plurality of lengths includes the length. A derating factor based on the sum is determined. Finally, an actual transmission rate of the input queue is determined using the derating factor. In an alternate embodiment the length is capped at a predetermined value.
One embodiment of the invention provides a method for controlling data flows in a router between a plurality of input queues at a selected priority level of a plurality of priority levels. The plurality of input queues share a common output queue at the selected priority level. The method includes determining a length of an input queue of the plurality of input queues at the selected priority level. Next a sum of a plurality of lengths of the plurality of input queues is calculated at the selected priority level. A derating factor for the selected priority level based on the sum is determined. Lastly, an actual transmission rate of the input queue at the selected priority level using the derating factor is determined. The above procedure is repeated for the remaining priority levels. In one embodiment the procedures for the priority levels are performed concurrently.
These and other embodiments, features, aspects and advantages of the invention will become better understood with regard to the following description, appended claims and accompanying drawings.
The invention relates to a method and apparatus for managing packet traffic flow in a router or switch. In the following description, numerous specific details are set forth to provide a more thorough description of the specific embodiments of the invention. It is apparent, however, to one skilled in the art, that the invention may be practiced without all the specific details given below. In other instances, well known features have not been described in detail so as not to obscure the invention.
In an embodiment of the invention, the variable and/or fixed sized packets are sent to one or more input queues, where each input queue is partitioned into segments of fixed length, for example, a segment size of 75 bytes. In another embodiment a segment size such as 53 bytes, for an ATM cell, may be used. In other embodiments other fixed segment sizes may be used, and one or more may correspond to a fixed packet size. These fixed sized segments are sent from the input queues to the output queues. The limitation on how much data can be transferred from the input queues to a shared output queue in a given time period is the maximum receive rate of the shared output queue. Thus the sum of the input queues' data transfer or send rates cannot exceed the maximum receive rate of the shared output queue. In one embodiment of the invention the send rates of the input queues to the shared output queue are proportionally allocated. In another embodiment the allocation for each input queue is based on its queue length or fullness and on the maximum number of segments the shared output queue can receive in a selected time period.
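The partitioning of packets into fixed-length segments can be illustrated with a minimal Python sketch. The function name `segment_packet` and the zero-padding of the final segment are assumptions made for illustration only; the description above specifies only that segments have a fixed length, such as 75 bytes.

```python
from typing import List

SEGMENT_SIZE = 75  # bytes; the example segment size given in the text


def segment_packet(packet: bytes, segment_size: int = SEGMENT_SIZE) -> List[bytes]:
    """Split a variable-size packet into fixed-length segments.

    Assumption: the last segment is zero-padded to the full segment size.
    """
    segments = []
    for offset in range(0, len(packet), segment_size):
        chunk = packet[offset:offset + segment_size]
        # Pad a short final chunk so that every segment has the same length.
        segments.append(chunk.ljust(segment_size, b"\x00"))
    return segments
```

Under these assumptions a 200-byte packet occupies three 75-byte segments, and queue lengths can then be counted in whole segments.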
dj=Rj/Σrn [Eqn. 1]
When the derating factor, dj, is equal to or greater than one (1), then each of the input queues sending data to output queue j 220 can use its requested or desired rate as the actual sending rate, μn. For example, at step 252 input queue i 210 will have μi=ri. Thus Eqn. 1 is modified to:
dj=min[Rj/Σrn, 1] [Eqn. 2]
Next the output queue j 220 calculates a proportional rate, pn(t), at time t for each of the sending input queues by derating the requested or desired sending rate:
pn(t)=dj*rn(t) [Eqn. 3]
The output queue j 220 then sends to each input queue its proportional fair rate (step 244), i.e., pn(t). For example, output queue j at step 244 sends input queue i 210 its proportional fair rate, pi(t). Input queue i 210 receives its proportional fair rate at step 250. If it is different from the present actual sending rate, μi(t−1) (step 252), then input queue i 210 modifies its actual sending rate, μi(t), to the received proportional fair rate at step 252, i.e., μi(t)←pi(t). If there is no difference, i.e., pi(t)=μi(t−1), then the sending rate stays the same, i.e., μi(t)←μi(t−1).
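The calculation of Eqns. 2 and 3 at the output queue can be sketched in Python as follows. This is an illustrative sketch only; the function name `proportional_fair_rates`, the dictionary keyed by queue name, and the handling of the case in which nothing is requested are assumptions not found in the description.

```python
from typing import Dict


def proportional_fair_rates(requested: Dict[str, float],
                            max_receive_rate: float) -> Dict[str, float]:
    """Return the proportional fair rate p_n for each requesting input queue."""
    total = sum(requested.values())
    if total <= 0:
        # Assumption: with no demand, every queue is granted a zero rate.
        return {name: 0.0 for name in requested}
    # Eqn. 2: derating factor, never greater than one.
    d = min(max_receive_rate / total, 1.0)
    # Eqn. 3: derate each requested rate by the common factor.
    return {name: d * rate for name, rate in requested.items()}
```

For example, with requests of 6, 3, and 3 units against a maximum receive rate of 10 units, every request is scaled by 10/12, so the granted rates sum to the maximum receive rate. When the requests total less than the maximum receive rate, each queue is granted exactly what it requested.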
dj=min[Rj/Σrn, 1] [Eqn. 4]
The input queue then calculates its new sending rate, pn(t) (step 352 for input queue i 310):
pn(t)=dj*rn(t) [Eqn. 5]
Each input queue (step 354 for input queue i) then modifies, if necessary, its sending rate, i.e., μn(t)←pn(t).
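The division of work in this variant, in which the output side computes only the derating factor and each input queue applies it locally, can be sketched as below. The class names `OutputPort` and `InputQueue` and the method names are hypothetical and chosen for illustration; only the arithmetic of Eqns. 4 and 5 comes from the description.

```python
from typing import Iterable


class OutputPort:
    """Computes the single derating factor shared by all requesting queues."""

    def __init__(self, max_receive_rate: float) -> None:
        self.max_receive_rate = max_receive_rate

    def derating_factor(self, requested_rates: Iterable[float]) -> float:
        total = sum(requested_rates)
        if total <= 0:
            return 1.0  # assumption: no demand means no derating
        return min(self.max_receive_rate / total, 1.0)  # Eqn. 4


class InputQueue:
    """Applies the received derating factor to its own requested rate."""

    def __init__(self, requested_rate: float) -> None:
        self.requested_rate = requested_rate
        self.sending_rate = 0.0

    def apply_feedback(self, d: float) -> bool:
        new_rate = d * self.requested_rate  # Eqn. 5
        changed = new_rate != self.sending_rate
        self.sending_rate = new_rate  # modify the sending rate if necessary
        return changed
```

In this sketch the feedback to every input queue is the same scalar, so the output side does not need to compute or send a separate rate for each queue.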
In another embodiment of the invention, the derating factor, dj, is determined by the lengths of a plurality of input queues sending data to a shared output queue. The send or transmission rate of a first input queue of the plurality is then determined from the length of the first input queue and the derating factor. Since the derating factor is the same across the plurality of input queues sending data to the shared output queue, when the length of the first queue increases, its sending rate increases relative to the other input queues, assuming, for the purposes of this example, that the lengths of the other queues in the plurality stay the same. Thus the bandwidth is reallocated to give relatively more to the first input queue.
In one embodiment each queue is divided into discrete fullness levels. The length of the queue, L, is then measured in integer values. For example, L could represent the number of fixed size segments in a queue. In another example, L could be a number representing the fullness levels inclusively from zero to the capped value. In an alternative embodiment, the fullness corresponds to variable sized regions of the queue. In one embodiment, the transmission or send rates associated with fullness levels are proportional to the length or the fullness level. In another embodiment, the transmission or send rates associated with the length or the fullness levels are a function of the length or the fullness level.
For the following embodiments of the invention, the following terms are defined: the subscript “i” refers to the ith input port and the subscript “j” refers to the jth output port. Lij is the length of the data waiting at the input queue at input port “i,” where the input queue sends data to the output queue at output port “j” (also referred to herein as output queue j). Qimax is the maximum length or size of the input queues at input port i. Cij is a capped value for the purposes of determining the send rate, μij, of an input queue, where Cij≤Qimax. In one embodiment Cij is a predetermined constant C, and Qimax is a predetermined constant Qmax, which is set large enough so that discarded packets are zero or minimized. An example value for C is 255 segments (8 bits). μij is the transmission or sending rate of the input queue at input port i to the output queue at output port j. T is a selected time period. Loutj is the maximum number of packets that output port j can receive in a time period T. In one embodiment Loutj=Lout, a predetermined constant for all output queues.
While there are many technical definitions of fairness, one meaning of max-min fairness used in the following embodiments includes: 1) no input queue receives more than it requests (μi≤ri); and 2) μi=min(μfair, ri), where μfair is set so that R=Σμi, the sum is over all the input queues sharing the output queue, and R is the maximum receive rate of the shared output queue. For example, in the three input queues, 120-1, 120-2, and 120-3, sharing the one output queue 174 in
In one embodiment of the invention the router has N input or ingress ports and M output or egress ports. In order to avoid head-of-line (HOL) blocking there are M input queues at an input port, i.e., one input queue for each output port. Thus there are N×M input queues in the input ports. In one implementation M=N and more specifically M=N=64. However, in other embodiments M and/or N may be different numbers. These M input queues are called virtual output queues (VOQs).
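The arrangement of virtual output queues can be sketched as a simple data structure. The names `InputPort` and `build_router` are hypothetical, and the use of Python deques as queues is an assumption made for illustration.

```python
from collections import deque
from typing import List, Optional


class InputPort:
    """An input port holding one virtual output queue (VOQ) per output port."""

    def __init__(self, num_output_ports: int) -> None:
        self.voqs = [deque() for _ in range(num_output_ports)]

    def enqueue(self, segment: bytes, output_port: int) -> None:
        self.voqs[output_port].append(segment)

    def dequeue(self, output_port: int) -> Optional[bytes]:
        queue = self.voqs[output_port]
        return queue.popleft() if queue else None


def build_router(num_input_ports: int, num_output_ports: int) -> List[InputPort]:
    """Create N input ports, each with M VOQs, for N x M input queues in total."""
    return [InputPort(num_output_ports) for _ in range(num_input_ports)]
```

Because segments bound for different output ports wait in separate queues, a segment destined for a congested output port does not sit in front of a segment destined for an idle one, which is the head-of-line blocking the arrangement is meant to avoid.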
Let ρij(t) be the length of a VOQ, i.e., an input queue of input port i sending data to an output port j, capped at C.
ρij(t)=min[Lij, C] [Eqn. 6]
In one embodiment of the invention the calculations are done in discrete time. At each time step each input port samples and transmits the new queue length ρij to the output port and each output port calculates and transmits the transmission rate μij to the input port. An intermediate value called the derating factor dj(t) is calculated based on ρij and the total number of segments, Lout, an output port can receive in time T.
dj(t)=min[Lout/Σρij(t), 1] [Eqn. 7]
From Eqn. 7, when the derating factor is one, then Lout≥Σρij(t). This means that given a time period T in which all the segments in the input queues having lengths ρij(t) are sent to the output queue at output port j, the output queue can accept all segments sent. Thus the input queues may drain all segments present.
When from Eqn. 7 the derating factor is less than one, i.e., Lout<Σρij(t), then there are more segments in the VOQs than can be received by the shared output queue in a time T. Thus the actual transmission rate sent back to each input queue sharing output port j is:
μij(t)=dj(t−1)*(ρij(t−1)/T) [Eqn. 8]
When the derating factor is less than one, Eqn. 7 can be substituted into Eqn. 8 and manipulated to get:
μij(t)=(ρij(t−1)/Σρij(t−1))*(Lout/T) [Eqn. 9]
where the maximum receive rate at the output queue is Rmax=Lout/T.
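The substitution leading to Eqn. 9 can be written out step by step. The LaTeX below is a restatement of Eqns. 7 through 9 in typeset form; the summation index k is introduced here only to distinguish the sum over input ports from the particular input port i.

```latex
\begin{align*}
\mu_{ij}(t) &= d_j(t-1)\,\frac{\rho_{ij}(t-1)}{T}
  && \text{(Eqn. 8)} \\
&= \frac{L_{out}}{\sum_{k}\rho_{kj}(t-1)}\cdot\frac{\rho_{ij}(t-1)}{T}
  && \text{(Eqn. 7 with } d_j(t-1) < 1\text{)} \\
&= \frac{\rho_{ij}(t-1)}{\sum_{k}\rho_{kj}(t-1)}\cdot\frac{L_{out}}{T}
  && \text{(Eqn. 9)}
\end{align*}
```

Summing the last line over all input ports gives a total of Lout/T, so in the congested case the granted rates together equal the maximum receive rate of the output queue.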
Because ρij(t) is capped at C, as seen from Eqn. 9, VOQs with large numbers of packets above C cannot dominate the bandwidth allocation as they could in a pure proportional allocation. In the congested case, when all queues are above their cap, each VOQ gets an equal allocation. Thus by feeding back the capped queue fullness, this technique achieves max-min fairness.
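The effect of the cap can be illustrated with a short Python sketch of Eqns. 6 through 8 for a single output port. The function name `capped_rates` and the treatment of the all-empty case are assumptions made for illustration.

```python
from typing import List


def capped_rates(queue_lengths: List[float], cap: float,
                 l_out: float, period: float) -> List[float]:
    """Return the sending rate granted to each VOQ feeding one output port."""
    # Eqn. 6: each reported length is capped at C.
    rho = [min(length, cap) for length in queue_lengths]
    total = sum(rho)
    # Eqn. 7: derating factor (assumed to be one when all queues are empty).
    d = 1.0 if total == 0 else min(l_out / total, 1.0)
    # Eqn. 8: rate granted to each VOQ.
    return [d * r / period for r in rho]
```

As an example under these assumptions, take queue lengths of 1000, 300, and 50 segments with C=255, Lout=400, and T=1. The two long queues both report 255 and receive the same rate of about 182.1 segments per period, while the short queue receives about 35.7. If the cap were instead set above the longest queue, the queue holding 1000 segments would receive about 296.3 and crowd out the others.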
An analysis of an embodiment of the invention showing max-min fairness for the congested case may be done under certain conditions. The analysis starts by examining the dynamic case. First, the change of the fullness of the VOQ is determined. The rate of data entering a VOQ is quantized in time, as the amount of data entering the VOQ during the current segment of time. This rate of data entering a VOQ is called Λij(t). The change in the amount of data residing in a VOQ is then the difference of the rate of data entering and the rate of data leaving the VOQ:
Lij(t+1)−Lij(t)=Λij(t)−μij(t) [Eqn. 11]
Substituting [7] into [8] and then [8] into [11] we get:
Lij(t+1)−Lij(t)=Λij(t)−min[Lout/Σρij(t−1), 1]×ρij(t−1)/T [Eqn. 12]
Then, applying [6] and solving for ρij(t+1) we get:
ρij(t+1)=min[ρij(t)+Λij(t)−min[Lout/Σρij(t−1), 1]×ρij(t−1)/T, C] [Eqn. 13]
In the congested case there will be more data in the VOQs than can be supported on the egress link, so the derating factor is less than one and [13] reduces to:
ρij(t+1)=min[ρij(t)+Λij(t)−(Lout/Σρij(t−1))×ρij(t−1)/T, C] [Eqn. 14]
In the steady state the incoming rates Λij(t) do not change with time. This then gives constant values of ρij(t) and μij(t). These steady state constant values are denoted as follows:
Λij(t)=>Λcij [Eqn. 15]
ρij(t)=>ρcij [Eqn. 16]
μij(t)=>μcij [Eqn. 17]
Steady state congestion is assumed for all the flows associated with the same egress port. When the sum of incoming rates for all the flows to the same egress port exceeds the sum of allocated rates, congestion exists for those flows. For each egress port j, the congestion relation summed over the ingress ports i then is:
ΣiΛcij>Σiμcij [Eqn. 18]
This scheme allocates bandwidth out of the VOQs based on each queue's capped length ρcij. In the congested steady state, this set of bandwidth allocations converges on a set of constant bandwidth allocations. It can be shown that this set of constant bandwidth allocations is max-min fair across ingress rates.
First, express [14] for the congested steady state:
ρcij=min[ρcij+Λcij−(Lout/Σρcij)×ρcij/T, C] [Eqn. 19]
Substituting [7] and [8] to express [19] in terms of μ:
ρcij=min[ρcij+Λcij−μcij, C] [Eqn. 20]
There are two cases satisfying Eqn. 20, depending on whether the first or the second expression in the minimum applies. Either ρcij is capped at C, or ρcij is less than C and is as shown below:
ρcij=ρcij+Λcij−μcij [Eqn. 21]
This requires the following to hold:
Λcij=μcij [Eqn. 22]
Simply put, in the congested steady state, by capping the measured fullness of the VOQ this technique also caps the bandwidth allocated to those VOQs with larger incoming data rates. This capped allocated bandwidth is achieved when ρ=C. From [8]:
μcappedj=dj×C/T [Eqn. 23]
Those ingress VOQs with data rates less than μcappedj are allocated their full data rate. Also, no VOQ is allocated more bandwidth than it can consume. This satisfies all 3 conditions of the max-min fairness criteria.
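The steady state behavior described above can be explored with a small discrete-time simulation of Eqns. 6, 7, 8, and 11 for one egress port. This is a sketch under stated assumptions: the function name `simulate`, the choice to express arrivals and rates per time step with T measured in time steps, the initially empty queues, and the floor of zero on queue length are all illustrative choices rather than part of the description.

```python
from typing import List


def simulate(arrivals: List[float], cap: float, l_out: float,
             period: float, steps: int) -> List[float]:
    """Iterate the feedback loop and return the last sending rates."""
    n = len(arrivals)
    lengths = [0.0] * n    # L_ij(t), assumed to start empty
    prev_rho = [0.0] * n   # capped lengths from the previous step, rho_ij(t-1)
    rates = [0.0] * n      # mu_ij(t)
    for _ in range(steps):
        total = sum(prev_rho)
        # Eqn. 7 evaluated on the previous step's capped lengths.
        d = 1.0 if total == 0 else min(l_out / total, 1.0)
        # Eqn. 8: rate fed back to each VOQ.
        rates = [d * r / period for r in prev_rho]
        # Eqn. 6: sample the current capped lengths for use in the next step.
        prev_rho = [min(length, cap) for length in lengths]
        # Eqn. 11: queue length update (floored at zero as a safeguard).
        lengths = [max(length + a - mu, 0.0)
                   for length, a, mu in zip(lengths, arrivals, rates)]
    return rates
```

For instance, with arrival rates of 10, 40, and 40 segments per step, C=255, Lout=240, and T=4 steps (so that Lout/T is 60 segments per step), the analysis above predicts that the two heavy flows sit at the cap and share the remaining bandwidth equally while the light flow is allocated its full incoming rate, which is the allocation the simulated rates settle toward under these assumptions.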
In another embodiment, input ports are associated with different priority levels. In one embodiment, the output queue calculates derating factors and transmission or send rates for each VOQ at each priority level. In this embodiment, the send rate for a VOQ depends on the priority associated with the VOQ. An example is a router where sixty-four input ports transmit data to sixty-four output ports, i.e., M=N=64. In this embodiment, data is categorized by a plurality of priorities, for example, five priority levels. Each input port has a set of input queues for each output port, and each set has an input queue for each priority level. Thus, each input port has 64×5=320 queues. Each output port has one output queue. In other embodiments there may be different numbers and variations of input and output queues and/or ports. For example, there may be two or more output queues per output port, or there may be two or more input queues per output port, or an output port may have no queue.
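A per-priority version of the calculation can be sketched as follows. The description does not state how the receive capacity of an output port is divided among priority levels, so the separate per-level budget `l_out_by_priority` is purely an assumption for illustration, as are the function names.

```python
from typing import Dict, List

NUM_OUTPUT_PORTS = 64  # example value from the text
NUM_PRIORITIES = 5     # example value from the text


def queues_per_input_port(num_output_ports: int = NUM_OUTPUT_PORTS,
                          num_priorities: int = NUM_PRIORITIES) -> int:
    """One input queue per (output port, priority level) pair."""
    return num_output_ports * num_priorities


def rates_by_priority(capped_lengths: Dict[int, List[float]],
                      l_out_by_priority: Dict[int, float],
                      period: float) -> Dict[int, List[float]]:
    """Apply Eqns. 7 and 8 independently at each priority level of one output port.

    capped_lengths[p] lists the capped VOQ lengths at priority level p.
    l_out_by_priority[p] is an assumed per-level receive budget.
    """
    rates = {}
    for priority, lengths in capped_lengths.items():
        total = sum(lengths)
        budget = l_out_by_priority[priority]
        d = 1.0 if total == 0 else min(budget / total, 1.0)
        rates[priority] = [d * rho / period for rho in lengths]
    return rates
```

Because each level is handled independently in this sketch, the loop body could equally be run for all priority levels concurrently.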
Although specific embodiments of the invention have been described, various modifications, alterations, alternative constructions, and equivalents are also encompassed within the scope of the invention. The described invention is not restricted to operation within certain specific data processing environments, but is free to operate within a plurality of data processing environments. Additionally, although the invention has been described using a particular series of transactions and steps, it should be apparent to those skilled in the art that the scope of the invention is not limited to the described series of transactions and steps.
Further, while the invention has been described using a particular combination of hardware and software, it should be recognized that other combinations of hardware and software are also within the scope of the invention. The invention may be implemented only in hardware or only in software or using combinations thereof.
The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that additions, subtractions, deletions, and other modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims.
Number | Date | Country | |
---|---|---|---|
20030058802 A1 | Mar 2003 | US |