The field of invention relates generally to networking equipment and, more specifically but not exclusively relates to techniques for detecting packet flow congestion using an efficient random early detection algorithm that may be implemented in the forwarding path of a network device and/or network processor.
Network devices, such as switches and routers, are designed to forward network traffic, in the form of packets, at high line rates. One of the most important considerations for handling network traffic is packet throughput. To maximize throughput, special-purpose processors known as network processors have been developed to efficiently process very large numbers of packets per second. In order to process a packet, the network processor (and/or network equipment employing the network processor) extracts data from the packet header indicating the destination of the packet, class of service, etc., stores the payload data in memory, performs packet classification and queuing operations, determines the next hop for the packet, selects an appropriate network port via which to forward the packet, etc. These operations are generally referred to as “packet processing” or “packet forwarding” operations.
Many modern network devices support various levels of service for subscribing customers. For example, certain types of packet “flows” are time-sensitive (e.g., video and voice over IP), while other types are data-sensitive (e.g., typical TCP data transmissions). Under such network devices, received packets are classified into flows based on various packet attributes (e.g., source and destination addresses and ports, protocols, and/or packet content), and enqueued into corresponding queues for subsequent transmission to a next hop along the transfer path to the destined end device (e.g., client or server). Depending on the policies applicable to a given queue and/or associated Quality of Service (QoS) level, various traffic policing schemes are employed to account for network congestion.
One aspect of the policing schemes relates to how to handle queue overflow. Typically, fixed-size queues are allocated for new or existing service flows, although variable-size queues may also be employed. As new packets are received, they are classified to a flow and added to an associated queue. Meanwhile, under a substantially parallel operation, packets in the flow queues are dispatched for outbound transmission (dequeued) on an ongoing basis, with the transmission dispatch rate depending on network availability. Further consider that both the packet receive and dispatch rates are dynamic in nature. As a result, the number of packets in a given flow queue fluctuates over time, depending on network traffic conditions.
In further detail, buffer managers or the like are typically employed for managing the length of the flow queues by selectively dropping packets to prevent queue overflow. Under connection-oriented transmissions, dropping packets indicates to the end devices (i.e., the source and destination devices) that the network is congested. In response to detecting such dropped packets, protocols such as TCP typically back off and reduce the rate at which they transmit packets on a corresponding connection. At the same time, packet-oriented traffic is typically bursty, which means that a device may often see periods of transient congestion followed by periods of little or no traffic. Therefore, the dual goals of the buffer manager are to allow temporary bursts and fluctuations in the packet arrival rate, while actively avoiding sustained congestion by providing an early indication to the end devices that such congestion is present.
The simplest scheme for buffer management is called “tail drop,” under which each queue is assigned a maximum threshold. If a packet arrives on a queue that has reached the maximum threshold, the buffer manager drops the packet rather than appending it to the end (i.e., tail) of the queue. Even though this scheme is very easy to implement, it is a reactive measure since it waits until a queue is full prior to dropping any packets. Therefore, the end devices do not get an early indication of network congestion. This, coupled with the bursty nature of the traffic, means that the network device may drop a large chunk of packets when a queue reaches its maximum threshold.
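Expressed in code, the tail-drop decision is a single comparison against the queue's maximum threshold, as the following illustrative sketch shows (the structure and field names are hypothetical, not taken from any particular implementation):

```c
/* Illustrative tail-drop check; queue_t and its fields are hypothetical. */
typedef struct {
    unsigned int len;        /* current number of packets in the queue */
    unsigned int max_thresh; /* maximum queue length threshold         */
} queue_t;

/* Returns 1 if the arriving packet should be dropped, 0 if it may be enqueued. */
static int tail_drop(const queue_t *q)
{
    return q->len >= q->max_thresh;
}
```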
Other more complex detection algorithms have been developed to address queue management. These include the Random Early Detection (RED) algorithm, and Weighted Random Early Detection (WRED) algorithm. Although these algorithms are substantial improvements over the simplistic tail drop scheme, they require significant computation overhead, and may be impractical to implement in the forwarding path while maintaining today's and future high line-rate speeds.
The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same becomes better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein like reference numerals refer to like parts throughout the various views unless otherwise specified:
a illustrates an exemplary set of WRED drop profiles having a common maximum probability;
b illustrates an exemplary set of WRED drop profiles having different maximum probabilities;
Embodiments of methods and apparatus for implementing very efficient random early detection algorithms in the forwarding (fast) path of network processors are described herein. In the following description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In accordance with aspects of the embodiments described herein, enhancements to the RED and WRED algorithms are disclosed that provide substantial improvements in terms of efficiency and processing latency, thus enabling these algorithms to be implemented in the forwarding path of a network device. In order to better understand the operation of these embodiments, a discussion of the conventional RED and WRED schemes is first presented. Following this, details of implementations of the enhanced algorithms are discussed.
RED, as described in Floyd, S., and Jacobson, V., “Random Early Detection Gateways for Congestion Avoidance,” IEEE/ACM Transactions on Networking, V.1 N.4, August 1993, pp. 397-413 (hereinafter [RED93]), is an algorithm that marks packets (e.g., to be dropped) based on a probability that increases with the average length of the queue. (It is noted that under [RED93], packets are termed “marked,” wherein the marking may be employed either to return information back to the sender identifying congestion or to mark the packets to be dropped. However, under most implementations, the packets are simply dropped rather than marked.) The algorithm calculates the average queue size using a low-pass filter with an exponential weighted moving average. Since the measurement of the average queue size is time-averaged rather than an instantaneous length, the algorithm is able to smooth out temporary bursts, while still responding to sustained congestion.
In further detail, the average queue size avg_len is determined by implementing a low-pass EWMA (Exponential Weighted Moving Average) filter using the following equation:
avg_len = avg_len + weight * (current_len − avg_len)   (1)
where weight is the filter gain (a constant between 0 and 1) and current_len is the instantaneous queue length at the time the average is updated.
Once the average queue size is determined, it is compared with two thresholds, a minimum threshold minth and a maximum threshold maxth. When the average queue size is less than the minimum threshold, no packets are dropped. When the average queue size exceeds the maximum threshold, all arriving packets are dropped. When the average queue size is between the minimum and maximum thresholds, each arriving packet is marked with a probability pa, where pa is a function of the average queue size avg_len. This is schematically illustrated in
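A compact way to see the structure of this per-packet decision is as an EWMA update followed by a three-way threshold comparison. The sketch below is illustrative C (floating point for clarity, with hypothetical names such as min_th and max_th); actual implementations typically use fixed-point arithmetic:

```c
/* Sketch of the RED average-queue-size update (Equation (1)) followed by
 * the threshold comparison.  All names are illustrative. */
typedef enum { RED_ENQUEUE, RED_MARK_PROBABILISTIC, RED_DROP } red_action_t;

static double avg_len;   /* EWMA of the queue length, persisted per queue */

static red_action_t red_check(unsigned int current_len, double weight,
                              double min_th, double max_th)
{
    /* Equation (1): low-pass EWMA filter of the instantaneous length. */
    avg_len = avg_len + weight * ((double)current_len - avg_len);

    if (avg_len < min_th)
        return RED_ENQUEUE;            /* no congestion: always enqueue   */
    if (avg_len >= max_th)
        return RED_DROP;               /* sustained congestion: drop all  */
    return RED_MARK_PROBABILISTIC;     /* mark/drop with probability pa   */
}
```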
As seen from above, the RED algorithm actually employs two separate algorithms. The first algorithm for computing the average queue size determines the degree of burstiness that will be allowed in a given connection (i.e., flow) queue, which is a function of the weight parameter (and thus the filter gain). Thus, the choice of the filter gain weight determines how quickly the average queue size changes with respect to the instantaneous queue size (in view of an even packet arrival rate for the connection). If the weight is too large, then the filter will not be able to absorb transient bursts, while a very small value could mean that the algorithm does not detect incipient congestion early enough. [RED93] recommends a value between 0.002 and 0.042 for a throughput of 1.5 Mbps.
The second algorithm used for calculating the packet-marking probability determines how frequently the network device (implementing RED) marks packets, given the current level of congestion. Each time that a packet is marked, the probability that a packet is marked from a particular connection is roughly proportional to that connection's share of the bandwidth at the network device. The goal for the network device is to mark packets at fairly evenly-spaced intervals, in order to avoid biases and to avoid global synchronization, and to mark packets sufficiently frequently to control the average queue size.
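A minimal sketch of this second algorithm, following the count-based probability calculation described in [RED93], is shown below. The C library rand() stands in for whatever random source an implementation uses, and all names are illustrative:

```c
#include <stdlib.h>

/* Sketch of the count-based marking probability from [RED93], applied only
 * when min_th <= avg_len < max_th.  Floating point is used for clarity. */
static int count = -1;   /* packets since the last marked packet */

static int red_mark(double avg_len, double min_th, double max_th, double max_p)
{
    double pb, pa;

    count++;
    /* pb rises linearly from 0 to max_p as avg_len goes from min_th to max_th. */
    pb = max_p * (avg_len - min_th) / (max_th - min_th);
    /* pa increases with count so that marks end up roughly evenly spaced. */
    pa = (count * pb >= 1.0) ? 1.0 : pb / (1.0 - count * pb);

    if ((double)rand() / RAND_MAX < pa) {
        count = 0;       /* mark (drop) this packet and restart the count */
        return 1;
    }
    return 0;            /* enqueue this packet */
}
```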
As shown in
When a queue goes idle, [RED93] specifies an equation that attempts to estimate the number of packets that could have arrived during the idle period:
m = (current_timestamp − last_idle_timestamp)/average_service_time
avg_len = avg_len * (1 − weight)^m   (2)
where m is an estimate of the number of small packets that could have been serviced during the idle period, current_timestamp is the time at which the queue becomes active again, last_idle_timestamp is the time at which the queue went idle, and average_service_time is the typical time required to service a small packet.
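A sketch of this idle-period adjustment, under the assumption that the timestamps and the average service time are expressed in the same units, follows (names are illustrative):

```c
#include <math.h>

/* Sketch of the [RED93] idle-period adjustment (Equation (2)). */
static double idle_adjust(double avg_len, double weight,
                          double current_timestamp, double last_idle_timestamp,
                          double average_service_time)
{
    /* m estimates how many small packets could have been serviced while idle. */
    double m = (current_timestamp - last_idle_timestamp) / average_service_time;

    /* Decay the average as if m zero-length samples had been filtered in. */
    return avg_len * pow(1.0 - weight, m);
}
```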
WRED (Weighted RED) is an extension of RED under which different packets can have different drop probabilities based on corresponding QoS parameters. For example, under a typical WRED implementation, each packet is assigned a corresponding color; namely, Green, Yellow, or Red. Packets that are committed for transmission are assigned to Green. Packets that conform but are yet to be committed are assigned to Yellow. Packets that exceed the conformed rate are assigned to Red. When the queue fills above the exceeded threshold, all packets are dropped.
Drop profiles based on exemplary sets of Green, Yellow, and Red WRED thresholds and weight parameters are illustrated in
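For illustration only, a per-color drop profile of the kind these figures depict might be represented by a structure such as the following (a sketch with hypothetical field names; it is not the WRED data structure 900 described later):

```c
/* Illustrative per-color WRED drop profile; field names are hypothetical. */
typedef enum { WRED_GREEN, WRED_YELLOW, WRED_RED, WRED_NUM_COLORS } wred_color_t;

typedef struct {
    unsigned int min_th;   /* below this average length, never drop       */
    unsigned int max_th;   /* above this average length, always drop      */
    unsigned int max_p;    /* drop probability at max_th (scaled value)   */
    unsigned int weight;   /* EWMA filter gain (e.g., expressed as 1/2^n) */
} wred_profile_t;

/* One profile per color; Red typically gets the most aggressive profile. */
typedef struct {
    wred_profile_t profile[WRED_NUM_COLORS];
} wred_class_cfg_t;
```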
A “snapshot” illustrating the current condition of an exemplary queue is shown in
The exemplary parameters shown in
It is also possible to employ different drop behavior for different classes of traffic (i.e., different service classes). This enables one to assign less aggressive drop profiles to higher-priority queues (e.g., queues associated with higher QoS) and more aggressive drop profiles to lower-priority queues (lower QoS queues).
The implementation depicted in
One of the key problems with the original algorithm defined in [RED93] is that it was targeted toward the low-speed T1/E1 links common at the time, and it does not scale very well to higher data rates. In “Notes on using RED for Queue Management and Congestion Avoidance,” viewgraphs, talk at NANOG 13, June 1998 (hereinafter [RED99]), Jacobson et al. describe a design that significantly optimizes the implementation of WRED in the forwarding path. A key difference is that, unlike [RED93], the design does not compute the average queue size at packet arrival time. Instead, the algorithm samples the size of the queue and approximates the persistent queue size only at periodic intervals. The authors of [RED99] recommend sampling up to 100 times a second irrespective of the link speed, which allows the implementation to scale to very high data rates. For the packet drop calculation, [RED99] recommends including the following code in the forwarding path.
The [RED99] algorithm calculates estimated_drop_count during the averaging of the queue size.
While the [RED99] algorithm variation is a lot more efficient than the one proposed in [RED93], it still implies a critical section for the code that updates the drop_count variable. That is, this portion of code is a mutually exclusive section that must be performed for every packet. This critical section requires that the current drop count be retrieved (read from memory), that an arithmetic comparison operation be performed, that the entire estimated_drop_count algorithm be performed to calculate the new drop_count variable, and that the updated drop_count variable then be stored. Under one state of the art implementation, the critical section requires 55 processor cycles. This represents a significant portion of the forwarding path latency budget.
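To make the cost of this critical section concrete, the following rough sketch shows its read/compare/recompute/store structure. It is not the actual [RED99] code (which is not reproduced here), and all names are hypothetical:

```c
/* Rough sketch of the kind of per-packet critical section described above.
 * It illustrates only the read-modify-write structure implied by the text;
 * the marked region must be executed by at most one thread at a time for a
 * given queue. */
typedef struct {
    int drop_count;           /* per-packet read-modify-write state      */
    int estimated_drop_count; /* recomputed at each queue-size sample    */
} red99_queue_t;

/* Returns 1 if the arriving packet should be dropped. */
static int red99_fast_path(red99_queue_t *q)
{
    int drop = 0;

    /* ---- begin critical section --------------------------------------- */
    if (q->drop_count > 0) {  /* read and compare the current drop count   */
        q->drop_count--;      /* recompute and store the new drop count    */
        drop = 1;
    }
    /* drop_count is refilled from estimated_drop_count at sampling time.  */
    /* ---- end critical section ------------------------------------------ */

    return drop;
}
```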
To better understand the problem with the increased latency resulting from the critical section, one needs to consider the parallelism employed by some modern network processors and/or network device forwarding path implementations. Under the foregoing scheme, it is still necessary for the drop_count calculation to be performed on each packet. This increases the overall packet processing latency, thus reducing packet throughput. Under a parallel pipelined packet processing scheme, some packet-processing may not commence until other packet-processing operations have been completed. Accordingly, upstream latencies cause delays to the entire forwarding path.
Modern network processors, such as Intel® Corporation's (Santa Clara, Calif.) IXP2XXX family of network processor units (NPUs), employ multiple multi-threaded processing elements (e.g., compute engines referred to as microengines (MEs) under Intel's terminology) to facilitate line-rate packet processing operations in the forwarding path (also commonly referred to as the forwarding plane, data plane or fast path). In order to process a packet, the network processor (and/or network equipment employing the network processor) needs to extract data from the packet header indicating the destination of the packet, class of service, etc., store the payload data in memory, perform packet classification and queuing operations, determine the next hop for the packet, select an appropriate network port via which to forward the packet, dequeue the packet for transmission, etc.
Some of the operations on packets are well-defined, with minimal interfaces to other functions or strict-order implementation requirements. Examples include updating packet state information (such as the current address of packet data in a DRAM buffer for sequential segments of a packet), updating linked-list pointers while enqueuing/dequeuing for transmit, and policing or marking packets of a connection flow. In these cases, the operations can be performed within the predefined cycle-stage budget. In contrast, difficulties may arise in keeping operations on successive packets in strict order while at the same time achieving the cycle budget across many stages. A block of code performing this type of functionality is called a context pipe stage.
In a context pipeline, different functions are performed on different microengines (MEs) as time progresses, and the packet context is passed between the functions or MEs, as shown in
Under a context pipeline, each thread in an ME is assigned a packet, and each thread performs the same function but on different packets. As packets arrive, they are assigned to the ME threads in strict order. For example, eight threads are typically assigned to an Intel IXP2800® ME context pipe stage. Each of the eight packets assigned to the eight threads must complete its first pipe stage within the arrival time of all eight packets. Under the nomenclature illustrated in
A more advanced context pipelining technique employs interleaved phased piping. This technique interleaves multiple packets on the same thread, spaced eight packets apart. An example would be ME0.1 completing pipe-stage 0 work on packet 1 while starting pipe-stage 0 work on packet 9. Similarly, ME0.2 would be working on packets 2 and 10. In effect, 16 packets would be processed in a pipe stage at one time. Pipe-stage 0 must still advance every eight packet arrivals. The advantage of interleaving is that memory latency is covered by a complete eight-packet arrival time.
According to aspects of the embodiments now described, enhancements to WRED algorithms and associated queue management mechanisms are implemented using NPUs that employ multiple multi-threaded processing elements. The embodiments facilitate fast-path packet forwarding using the general principles employed by conventional WRED implementations, but greatly reduce the amount of processing operations that need to be performed in the forwarding path related to updating flow queue state and determining an associated drop probability for each packet. This allows implementations of WRED techniques to be employed in the forwarding path while supporting very high line rates, such as OC-192 and higher.
It was recognized by the inventors that RED and WRED schemes could be modified using the following algorithm on an NPU that employs multiple compute engines and/or other processing elements to determine whether or not to drop a packet in the context of parallel packet processing techniques:
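Although the algorithm itself is presented in the referenced listing, its per-packet portion can be sketched as follows. The sketch assumes that estimated_drop_probability has already been computed outside the fast path and is merely read here, with the C library rand() standing in for the microengines' hardware pseudo-random number generator; names and scaling are illustrative:

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch of the per-packet drop decision under the modified scheme.  The
 * estimated_drop_probability is precomputed at the sampling interval and
 * only *read* here, so no critical section is required. */
static int wred_fast_path_drop(uint32_t estimated_drop_probability)
{
    uint32_t random_number = (uint32_t)rand();   /* hardware PRNG on an ME */

    /* Drop the packet if the random draw falls below the drop probability
     * (both expressed on the same fixed-point scale). */
    return random_number < estimated_drop_probability;
}
```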
It was further recognized that since the microengine architecture of the Intel® IXP2XXX NPUs includes a built-in pseudo-random number generator, the number of processing cycles required to perform the foregoing algorithm would be greatly reduced. This modification eliminates the critical section completely, since the packet-forwarding path only reads the estimated_drop_probability value and does not modify it. The variation also saves SRAM bandwidth associated with reading and writing the drop_count in [RED99]. Using the pseudo-random number generator on the microengines, the above calculation only requires four instructions per packet in the microengine fast path. Thus, this scheme is very suitable for parallel processing architectures, as it removes restrictions on parallelization of WRED implementations by completely eliminating the aforementioned critical section.
An exemplary execution environment 600 for implementing embodiments of the enhanced WRED algorithm is illustrated in
As illustrated in
Typically, information that is frequently accessed for packet processing (e.g., flow table entries, queue descriptors, packet metadata, etc.) will be stored in SRAM, while bulk packet data (either entire packets or packet payloads) will be stored in DRAM, with the latter having higher access latencies but costing significantly less. Accordingly, under a typical implementation, the memory space available in the DRAM store is much larger than that provided by the SRAM store.
As shown in the lower left-hand corner of
As described below, each WRED data structure will provide information for effectuating a corresponding drop profile in a manner analogous to that described above for the various WRED implementations in
In addition to storing the WRED data structures, associated lookup data is likewise stored in SRAM 604. In the embodiment illustrated in
An overview of operations performed during run-time packet forwarding is illustrated in
With reference to execution environment 600 and a block 700 in
By way of example, a typical 5-tuple flow classification is performed in the following manner. First, the 5-tuple data for the packet (source and destination IP address, source and destination ports, and protocol—also referred to as the 5-tuple signature) are extracted from the packet header. A set of classification rules is stored in an Access Control List (ACL), which will typically be stored in either SRAM or DRAM or both (more frequent ACL entries may be “cached” in SRAM, for example). Each ACL entry contains a set of values associated with each of the 5-tuple fields, with each value being a single value, a range, or a wildcard. Based on an associated ACL lookup scheme, one or more ACL entries containing values matching the 5-tuple signature will be identified. Typically, this will be reduced to a highest-priority matching rule set in the case of multiple matches. Meanwhile, each rule set is associated with a corresponding flow or connection (via a Flow Identifier (ID) or connection ID). Thus, the ACL lookup matches the packet to a corresponding flow based on the packet's 5-tuple signature, which also defines the connection parameters for the flow.
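By way of illustration only, a 5-tuple signature and ACL entry might be represented as follows. This is a simplified sketch using mask-based wildcard matching (real ACL schemes also support ranges and prefix matches), and the field names are hypothetical:

```c
#include <stdint.h>

/* Illustrative 5-tuple signature and ACL entry; names are hypothetical. */
typedef struct {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  protocol;
} five_tuple_t;

typedef struct {
    five_tuple_t value;    /* pre-masked values to match      */
    five_tuple_t mask;     /* zero bits act as wildcards      */
    uint32_t     flow_id;  /* flow associated with this rule  */
    int          priority; /* highest priority wins on ties   */
} acl_entry_t;

/* Returns 1 if the packet's signature matches the ACL entry. */
static int acl_match(const acl_entry_t *e, const five_tuple_t *sig)
{
    return ((sig->src_ip   & e->mask.src_ip)   == e->value.src_ip)   &&
           ((sig->dst_ip   & e->mask.dst_ip)   == e->value.dst_ip)   &&
           ((sig->src_port & e->mask.src_port) == e->value.src_port) &&
           ((sig->dst_port & e->mask.dst_port) == e->value.dst_port) &&
           ((sig->protocol & e->mask.protocol) == e->value.protocol);
}
```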
Each flow has a corresponding entry in flow table 626. Management and creation of the flow entries is facilitated by flow manager 608 via execution of one or more threads on MEs 622. In turn, each flow has an associated flow queue (buffer) that is stored in DRAM 606. To support queue management operations, queue manager 610 and/or flow manager 608 maintains queue descriptor array 632, which contains multiple FIFO (first-in, first-out) queue descriptors 648. (In some implementations, the queue descriptors are stored in the on-chip SRAM interface 605 for faster access and loaded from and unloaded to queue descriptors stored in external SRAM 604.)
Each flow is associated with one or more (if chained) queue descriptors, with each queue descriptor including a Head pointer (Ptr), a Tail pointer, a Queue count (Qcnt) of the number of entries currently in the FIFO, and a Cell count (Cnt), as well as optional additional fields such as mode and queue status (both not shown for simplicity). Each queue descriptor is associated with a corresponding buffer segment to be transferred, wherein the Head pointer points to the memory location (i.e., address) in DRAM 606 of the first (head) cell in the segment and the Tail pointer points to the memory location of the last (tail) cell in the segment, with the cells in between being stored at sequential memory addresses, as depicted in a flow queue 650. Depending on the implementation, queue descriptors may also be chained via appropriate linked-list techniques or the like, such that a given flow queue may be stored in DRAM 606 as a set of disjoint segments.
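An illustrative layout for such a queue descriptor is sketched below (field widths and names are assumptions reflecting the fields described above, not the actual descriptor format):

```c
#include <stdint.h>

/* Illustrative FIFO queue descriptor; widths and names are hypothetical. */
typedef struct {
    uint32_t head_ptr;   /* DRAM address of the first (head) cell          */
    uint32_t tail_ptr;   /* DRAM address of the last (tail) cell           */
    uint32_t q_cnt;      /* number of entries currently in the FIFO        */
    uint32_t cell_cnt;   /* cell count for the segment                     */
    /* optional fields such as mode and queue status omitted for brevity   */
} queue_descriptor_t;
```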
Packet streams are received from various network nodes in an asynchronous manner, based on flow policies and other criteria, as well as less predictable network operations. As a result, on a sequential basis packets from different flows may be received in an intermixed manner, as illustrated by a stream of input packets 644 depicted toward the right-hand side of
During on-going packet-processing operations, parallel operations are performed on a periodic basis in a substantially asynchronous manner. These operations include periodically (i.e., repeatedly) recalculating the queue state information for each flow queue in the manner discussed below with reference to
Continuing at a block 710, in association with the ongoing packet-processing operation context, the current estimated_drop_probability value for the flow queue is retrieved (i.e., read from SRAM 604) by the microengine running the current thread in the pipeline and stored in that ME's local memory 634, as schematically depicted in
random_number < estimated_drop_probability.
The result of the evaluation of the foregoing inequality is depicted by a decision block 714. If the inequality is True, the packet is dropped. Accordingly, this is simply accomplished in a block 716 by releasing the Rx buffer in which the packet is temporarily being stored. If the packet is to be forwarded, it is added to the tail of the flow queue for the flow to which it is classified in a block 718: the packet is copied from the Rx buffer into an appropriate storage location in DRAM 606 (as identified by the Tail pointer for the associated queue descriptor), the Tail pointer is incremented by 1, and the Rx buffer is then released.
With reference to
An exemplary WRED data structure 900 is shown in
The exemplary WRED data illustrated in
In general, a WRED data structure will be generated for each service class. However, this is not a strict requirement, as different service classes may share the same WRED data structure. In addition, more than three colors may be implemented in a fashion similar to that illustrated by the Green, Yellow, and Red implementations discussed herein. Furthermore, as discussed above with reference to
Returning to
In view of the foregoing, sets of policy data (wherein each set defines associated policies) are stored in SRAM 604 as policy data 628. At the same time, the various WRED data structures defined in block 800 are stored as WRED data structures 630 in SRAM 604. The policy data and WRED data structures are associated using a pointer included in each policy data entry. These associations are defined during the setup operations of blocks 800 and 802.
Following the setup operations, the run-time operations illustrated in
In a block 806, various information associated with the flow is retrieved from SRAM 604 using a data read operation. This information includes the applicable WRED data structure, the flow queue state, and the current queue length. In the embodiment illustrated in
The flow ID identifies the flow (optionally a connection ID may be employed), and enables an existing flow entry to be readily located in the flow table. The buffer pointer points to the address of the (first) corresponding queue descriptor 648 in queue descriptor array 632. The policy pointer points to the applicable policy data in policy data 628. As discussed above, each policy data entry includes a pointer to a corresponding WRED data structure. (It is noted that the policy data may include other parameters that are employed for purposes outside the scope of the present specification.) Accordingly, when a new flow table entry is created, the applicable WRED data structure is identified via the policy pointer indirection, and a corresponding WRED pointer is stored in the entry.
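An illustrative flow table entry reflecting these fields might look as follows (field widths and names are assumptions):

```c
#include <stdint.h>

/* Illustrative flow table entry; widths and names are hypothetical. */
typedef struct {
    uint32_t flow_id;     /* or, optionally, a connection ID                  */
    uint32_t buffer_ptr;  /* -> first queue descriptor in the descriptor array */
    uint32_t policy_ptr;  /* -> policy data (which points to WRED data)        */
    uint32_t wred_ptr;    /* -> WRED data structure (resolved at flow setup)   */
    uint32_t state;       /* inline queue state, or a pointer to it            */
} flow_table_entry_t;
```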
In general, the flow queue state information may be stored inline with the flow table entry, or the state field may contain a pointer to where the actual state information is stored. In the embodiment illustrated in
In one embodiment, the current queue length may be retrieved from the queue descriptor entry associated with the flow (e.g., the Qcnt value). As discussed above, the queue descriptor entry for the flow may be located via the buffer pointer.
Next, in a block 808, a new queue state is calculated. In a block 810, a new avg_len value is calculated for each color (as applicable) using Equation 1 above. In general, the appropriate weight value may be retrieved from the WRED data structure, or may be located elsewhere. For example, in some implementations, a single or set of weight values may be employed for respective colors across all service classes.
In conjunction with this calculation, a new timestamp value is also determined. In one embodiment, the respective timestamp values are retrieved during an ongoing cycle to determine if the associated flow queue state is to be updated, thus effecting a sampling period. Based on the difference between the current time and the timestamp, the process can determine whether a given flow queue needs to be processed. Under other embodiments, various types of timing schemes may be employed, such as using clock circuits, timers, counters, etc. As an option to storing the timestamp information in the dynamic portion of a WRED data structure, the timestamp information may be stored as part of the state field or another field in a flow table entry, or otherwise located via a pointer in the entry.
In a block 812, a recalculation of the estimated_drop_probability for each color (as applicable) is performed based on the corresponding WRED drop profile data and updated avg_len value using algorithm 2 shown above. The updated queue state data is then stored in a block 814 to complete the processing for a given flow.
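The following sketch illustrates one way the periodic recalculation of blocks 808-814 might be expressed for a single flow queue. It reuses the illustrative wred_profile_t defined earlier; the fixed-point probability scale and all names are assumptions, not taken from the referenced listing:

```c
#include <stdint.h>

/* Sketch of the sampling-interval queue state recalculation for one queue. */
#define PROB_SCALE 0x7FFFFFFFu   /* probability of 1.0 in fixed point */

typedef struct {
    double   avg_len[WRED_NUM_COLORS];
    uint32_t estimated_drop_probability[WRED_NUM_COLORS];
} wred_queue_state_t;

static void wred_recalc(wred_queue_state_t *st, const wred_class_cfg_t *cfg,
                        unsigned int current_len, double weight)
{
    for (int c = 0; c < WRED_NUM_COLORS; c++) {
        const wred_profile_t *p = &cfg->profile[c];

        /* Equation (1): EWMA of the sampled instantaneous queue length. */
        st->avg_len[c] += weight * ((double)current_len - st->avg_len[c]);

        /* Map the new average onto the color's drop profile. */
        if (st->avg_len[c] < p->min_th) {
            st->estimated_drop_probability[c] = 0;
        } else if (st->avg_len[c] >= p->max_th) {
            st->estimated_drop_probability[c] = PROB_SCALE;
        } else {
            double frac = (st->avg_len[c] - p->min_th) /
                          (double)(p->max_th - p->min_th);
            st->estimated_drop_probability[c] =
                (uint32_t)(frac * p->max_p);   /* max_p is already scaled */
        }
    }
}
```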
In some implementations, the sampling period for the entire set of active flows will be relatively large when compared with the processing latency for a given packet. Since the sampling interval is relatively large, the recalculation of the queue state may be performed using a processing element that is not in the fast path. For example, the Intel IXP2XXX NPUs include a general purpose “XScale” processor (depicted as GP Proc 652 in
However, for a system with a large number of flows, this approach may require too many computations on the XScale. In addition, the XScale and the microengines need to share the estimated_drop_probability value for a queue via SRAM (since the value is also being read by the microengines). As a result, the slow path operations performed by the XScale and the fast path operations performed by the microengines are not entirely decoupled.
Since the foregoing scheme only requires four instructions per packet, another implementation possibility is to add the WRED functionality to either scheduler 614 or queue manager 610. Typically, in any application, either the scheduler or the queue manager tracks the instantaneous size of a queue. Since the WRED averaging function requires the instantaneous size, it is appropriate to add this functionality to one of these blocks. The estimated_drop_probability value can be stored in the queue state information used at enqueue time of the packet. The rest of the WRED context can be stored separately in SRAM and accessed only in the sampling path in the manner described above.
In one embodiment, the queue state update is performed by a single thread once every N packets, where N is calculated as the packet arrival rate divided by the product of the number of queues and the per-queue sampling frequency.
For example, for an OC-192 POS interface with 128 queues and a minimum-size-packet arrival rate of approximately 24.5 million packets per second, assuming the per-queue sampling rate is 100 times a second, the average queue length calculation needs to be invoked only once every 24,500,000/(128×100)≈1914 packets. Note that this design only makes sense if N is substantially greater than one. If the number of queues times the sampling frequency starts to approach the packet arrival rate, then the application may as well compute the queue size on every packet.
To implement the periodic sampling, the future_count signal in the microengine can be set. The microengine hardware sends a signal to the calling thread after a configurable number of cycles. In the packet processing fast path, a single br_signal[ ] instruction is sufficient to check if the sampling timer has expired. The pseudo-code shown in
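In C-like pseudo-code, the per-packet sampling check can be sketched as follows. On the microengines this reduces to the single br_signal[ ] test described above; the packet counter below is merely an illustrative stand-in for the hardware future_count timer, and all names are assumptions:

```c
/* C-style sketch of the periodic sampling check performed in the fast path. */
static unsigned int pkts_since_sample;

static void per_packet_sampling_check(unsigned int n /* sample every N packets */)
{
    if (++pkts_since_sample >= n) {    /* ~ br_signal[ ] timer-expired test */
        pkts_since_sample = 0;         /* ~ re-arm the future_count signal  */
        /* Slow-path work: recompute avg_len and estimated_drop_probability
         * for the queues owned by this thread (e.g., via wred_recalc()).   */
    }
}
```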
As discussed above, various operations illustrated by functional blocks and modules in the figures herein may be implemented via execution of corresponding instruction threads on one or more processing elements, such as compute engines (e.g., microengines) and general-purpose processors. Thus, embodiments of this invention may be implemented via execution of instructions upon some form of processing core, wherein the instructions are provided via a machine-readable medium. A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), and may comprise, for example, a read only memory (ROM); a random access memory (RAM); magnetic disk storage media; optical storage media; a flash memory device; etc. In addition, a machine-readable medium can include propagated signals such as electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.).
The above description of illustrated embodiments of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications can be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification and the drawings. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.