This invention relates to managing on-chip queues in switched fabric networks. Advanced Switching Interconnect (ASI) is a technology based on the Peripheral Component Interconnect Express (PCIe) architecture that enables standardization of various backplanes. The Advanced Switching Interconnect Special Interest Group (ASI-SIG) is a collaborative trade organization chartered with providing a switching fabric interconnect standard; its specifications, including the Advanced Switching Core Architecture Specification, Revision 1.1, November 2004 (available from the ASI-SIG at www.asi-sig.com), are provided to its members.
ASI utilizes a packet-based transaction layer protocol that operates over the PCIe physical and data link layers. The ASI architecture provides a number of features common to multi-host, peer-to-peer communication devices such as blade servers, clusters, storage arrays, telecom routers, and switches. These features include support for flexible topologies, packet routing, congestion management, fabric redundancy, and fail-over mechanisms.
The ASI architecture requires ASI devices to support fine-grained quality of service (QoS) using a combination of status based flow control (SBFC), credit based flow control, and injection rate limits. ASI endpoint devices are also required to adhere to stringent guidelines when responding to SBFC flow control messages. In general, each ASI endpoint device has a fixed window in which to suspend or resume the transmission of packets from a given connection queue after an SBFC flow control message is received for that particular connection queue.
The connection queues are typically implemented in external memory. A scheduler of the ASI endpoint device schedules packets from the connection queues for transmission over the ASI fabric using an algorithm such as weighted round robin (WRR), weighted fair queuing (WFQ), or round robin (RR). The scheduler uses the SBFC status information as one of the inputs for determining which queues are eligible. The latency to fetch the scheduled packets and inject them into a transmit pipeline of the ASI endpoint device is high due to the delay introduced by processing pipeline stages and the latency of accessing external memory. This large latency can lead to undesirable conditions if a connection queue becomes flow controlled in the interim. As a result, the packets need to be scheduled again to ensure that the selected packets conform to the current SBFC status.
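For illustration only, the following minimal sketch (written in C, with hypothetical structure and function names that are not part of the ASI Specification) shows one way a simple round-robin scheduling pass might use SBFC status to skip flow controlled connection queues; it is a simplified example, not a definitive implementation of the scheduler described above.

    /* Hypothetical sketch: a round-robin pass that skips flow controlled queues. */
    #include <stdbool.h>

    #define NUM_QUEUES 64

    struct conn_queue {
        bool flow_controlled;   /* set while an SBFC Xoff is in effect for this queue */
        bool has_packets;       /* true if the queue holds packets awaiting transmission */
    };

    /* Returns the index of the next eligible queue after 'last', or -1 if none. */
    int rr_select(const struct conn_queue q[NUM_QUEUES], int last)
    {
        for (int step = 1; step <= NUM_QUEUES; step++) {
            int i = (last + step) % NUM_QUEUES;
            if (q[i].has_packets && !q[i].flow_controlled)
                return i;
        }
        return -1;   /* no eligible queue at this time */
    }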
Referring to
Each ASI device 102, 104 has an ASI interface that is part of the ASI architecture defined by the Advanced Switching Core Architecture Specification (“ASI Specification”). Each ASI switch element 102 can be implemented to support a localized congestion control mechanism referred to in the ASI Specification as “Status Based Flow Control” or “SBFC”. The SBFC mechanism provides for the optimization of traffic flow across a link between two adjacent ASI devices 102, 104, e.g., an ASI switch element 102 and its adjacent ASI endpoint 104, or between two adjacent ASI switch elements 102. By adjacent, it is meant that the two ASI devices 102, 104 are directly linked without any intervening ASI devices 102, 104.
Generally the SBFC mechanism works as follows: a downstream ASI switch element 102 transmits a SBFC flow control message to an upstream ASI endpoint 104. The SBFC flow control message provides some or all of the following status information: a Traffic Class designation, an Ordered-Only flag state, an egress output port identifier, and a requested scheduling behavior. The upstream ASI endpoint 104 uses the status information to modify its scheduling such that packets targeting a congested buffer in the downstream ASI switch element 102 are given lower priority. In particular, the upstream ASI endpoint 104 either suspends (e.g., the SBFC message is an ASI Xoff message) or resumes (e.g., the SBFC message is an ASI Xon message) transmission of packets from a connection queue, where all of the packets have the requested Ordered-Only flag state, Traffic Class field designation, and egress output port identifier. When the transmission of packets is suspended from a connection queue, that connection queue is said to be “flow controlled”.
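A minimal sketch of this matching step is shown below, assuming illustrative field and function names (the actual SBFC message encoding is defined by the ASI Specification and is not reproduced here):

    /* Hypothetical sketch: applying an SBFC Xon/Xoff message to connection queues. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct sbfc_msg {
        uint8_t  traffic_class;   /* Traffic Class designation */
        bool     ordered_only;    /* Ordered-Only flag state */
        uint16_t egress_port;     /* egress output port identifier */
        bool     xoff;            /* true = Xoff (suspend), false = Xon (resume) */
    };

    struct conn_queue {
        uint8_t  traffic_class;
        bool     ordered_only;
        uint16_t egress_port;
        bool     flow_controlled;
    };

    /* Suspend or resume every connection queue that matches the message. */
    void apply_sbfc(struct conn_queue *q, size_t nq, const struct sbfc_msg *m)
    {
        for (size_t i = 0; i < nq; i++) {
            if (q[i].traffic_class == m->traffic_class &&
                q[i].ordered_only  == m->ordered_only &&
                q[i].egress_port   == m->egress_port)
                q[i].flow_controlled = m->xoff;
        }
    }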
In the example scenario described below, the packets to be transmitted from the upstream ASI endpoint 104 to the downstream ASI switch element 102 include ASI Protocol Interface 2 (PI-2) packets. Referring to
Referring to
A primary scheduler 308 of the NPU 302 determines the order in which PDUs are retrieved from the PDU memory 306. The retrieved PDUs are forwarded by the NPU 302 to a PI-2 segmentation and reassembly (SAR) engine 310 of the upstream ASI endpoint.
The ASI devices 102, 104 are typically implemented to limit the maximum ASI packet size to a size that is less than the maximum ASI packet size of 2176 bytes supported by the ASI architecture. In instances in which a PDU retrieved from the PDU memory 306 has a packet size larger than the maximum payload size that may be transferred across the ASI fabric, the PDU is segmented into a number of segments. In some implementations, the segmentation is performed by microengine software in the NPU 302 prior to the individual segments being forwarded to the PI-2 SAR engine 310. In other implementations, the PDUs are forwarded to the PI-2 SAR engine 310 where the segmentation is performed.
For each received PDU (or segment of a PDU), the PI-2 SAR engine 310 forms one or more PI-2 packets by segmenting the PDU into segments whose size is smaller than the maximum supported in the network, appending an ASI route header to each segment, and optionally computing a PI-2 CRC. A buffer manager 312 stores each PI-2 packet formed by the PI-2 SAR engine 310 into a data buffer memory 314 that is referred to in this description as a “transmit buffer” or “TBUF”. In an ideal scenario, the TBUF 314 is sized large enough to buffer all of the PI-2 packets that are in-flight across the ASI fabric. In such a scenario, the NPU 302 would ideally be implemented with a TBUF 314 of a size that is greater than 512 KB for low data rates and greater than 2 MB for high data rates.
Although the ASI architecture does not place any size constraints on the TBUF 314, it is generally preferable to implement a TBUF 314 that is much smaller in size (e.g., 64 KB to 256 KB) due to die size and cost constraints. In one implementation, the TBUF 314 is a random access memory that can contain up to 128 KB of data. The TBUF 314 is organized as elements 314a-314n of fixed size (elem_size), typically 32 bytes or 64 bytes per element. A given PI-2 packet of length L would be allocated ceil(L/elem_size) elements 314n of the TBUF 314, i.e., L/elem_size rounded up to the next whole element. An element 314n containing a PI-2 packet is designated as being “occupied”; otherwise the element 314n is designated as being “available”.
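The allocation arithmetic can be expressed by the following small sketch (names are illustrative only):

    /* Sketch: number of fixed-size TBUF elements needed for a packet of length len. */
    #include <stdint.h>

    static inline uint32_t tbuf_elems_needed(uint32_t len, uint32_t elem_size)
    {
        return (len + elem_size - 1) / elem_size;   /* ceiling of len / elem_size */
    }

    /* Example: a 100-byte PI-2 packet stored in 64-byte elements occupies
     * tbuf_elems_needed(100, 64) == 2 elements. */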
For each PI-2 packet that is stored in the TBUF 314, the buffer manager 312 also creates a corresponding queue descriptor, selects a target connection queue 316a from a number of connection queues 316a-316n residing on an on-chip memory 318 to which the queue descriptor is to be enqueued, and appends the queue descriptor to the last queue descriptor in the target connection queue 316a. The buffer manager 312 records an enqueue time for each queue descriptor as it is appended to a target connection queue 316a. The selection of the target connection queue 316a is generally based on the Traffic Class designation of the PI-2 packet corresponding to the queue descriptor to be enqueued, and its destination and path through the ASI fabric.
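The following sketch illustrates, with hypothetical structures, how a queue descriptor might be appended to the tail of the selected connection queue while the enqueue time and a per connection queue element count are recorded:

    /* Hypothetical sketch: appending a queue descriptor to a connection queue. */
    #include <stdint.h>
    #include <stddef.h>

    struct queue_desc {
        uint32_t first_elem;       /* index of the packet's first TBUF element */
        uint32_t num_elems;        /* number of TBUF elements the packet occupies */
        uint64_t enqueue_time;     /* recorded when appended to the queue */
        struct queue_desc *next;
    };

    struct conn_queue {
        struct queue_desc *head;
        struct queue_desc *tail;
        uint32_t occupied_elems;   /* per connection queue counter */
    };

    void enqueue_desc(struct conn_queue *cq, struct queue_desc *qd, uint64_t now)
    {
        qd->enqueue_time = now;    /* 'now' stands in for a hardware time-stamp */
        qd->next = NULL;
        if (cq->tail != NULL)
            cq->tail->next = qd;
        else
            cq->head = qd;
        cq->tail = qd;
        cq->occupied_elems += qd->num_elems;
    }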
In order to ensure that the TBUF 314 is not over-run, the buffer manager 312 implements a buffer management scheme that dynamically determines the TBUF 314 space allocation policy. In general, the buffer management scheme is governed by the following rules: (1) if a connection queue 316a-316n is not flow controlled, PI-2 packets (corresponding to queue descriptors to be appended to that connection queue 316a-316n) are allocated space in the TBUF 314 to ensure a smooth traffic flow on that connection queue 316a-316n; (2) if a connection queue 316a-316n is flow controlled, PI-2 packets corresponding to queue descriptors to be appended to that connection queue 316a-316n are allocated space in the TBUF 314 until a certain programmable per connection queue threshold is exceeded, at which point the buffer manager 312 selects one of several options to handle the condition; and (3) packet drops and roll-back operations are triggered only when the TBUF occupancy exceeds certain thresholds to ensure that expensive roll-back operations are kept to a minimum.
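As a sketch of rules (1) and (2) above, assuming an illustrative per-queue state (rule (3) is a separate occupancy check on the TBUF 314 as a whole):

    /* Hypothetical sketch: should TBUF space be granted for a packet bound for cq? */
    #include <stdbool.h>
    #include <stdint.h>

    struct conn_queue {
        bool     flow_controlled;
        uint32_t occupied_elems;      /* TBUF elements currently held by this queue */
        uint32_t fc_elem_threshold;   /* programmable per connection queue threshold */
    };

    bool tbuf_alloc_allowed(const struct conn_queue *cq, uint32_t elems_needed)
    {
        if (!cq->flow_controlled)
            return true;                                                   /* rule (1) */
        return cq->occupied_elems + elems_needed <= cq->fc_elem_threshold; /* rule (2) */
    }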
Referring to
The NPU 302 has a secondary scheduler 320 that schedules PI-2 packets in the TBUF 314 for transmission over the ASI fabric via an ASI transaction layer 322, an ASI data link layer 324, and an ASI physical link layer 326. In some implementations, the ASI device 104 includes a fabric interface chip that connects the NPU 302 to the ASI fabric. In a normal mode of operation, the occupancy of the TBUF 314 (i.e., the number of occupied elements 314a-314n in the TBUF) is low enough that the rate at which elements 314a-314n are added to the TBUF 314 is at or below the rate at which elements 314a-314n are made available in the TBUF 314. That is, the secondary scheduler 320 is able to keep up with the rate at which the primary scheduler 308 fills the TBUF elements 314a-314n.
As the secondary scheduler 320 schedules each PI-2 packet for transfer over the ASI fabric, the secondary scheduler 320 sends a commit message to a queue management engine 330 of the NPU 302. Once the queue management engine 330 has received the commit message for all of the PI-2 packets into which the segments of a PDU have been encapsulated, the queue management engine 330 removes the PDU data from the PDU memory 306.
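A minimal sketch of this bookkeeping, assuming an illustrative per-PDU state:

    /* Hypothetical sketch: per-PDU commit tracking by the queue management engine. */
    #include <stdint.h>

    struct pdu_state {
        uint32_t pi2_packets_total;       /* PI-2 packets formed from this PDU */
        uint32_t pi2_packets_committed;   /* commit messages received so far */
    };

    /* Called for each commit message; returns 1 when every PI-2 packet of the
     * PDU has been committed and the PDU data may be removed from PDU memory. */
    int pdu_commit(struct pdu_state *p)
    {
        p->pi2_packets_committed++;
        return p->pi2_packets_committed == p->pi2_packets_total;
    }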
Upon detection (404) of a trigger condition, the buffer manager 312 initiates (406) a process (referred to in this description as a “data buffer element recovery process”) to reclaim space in the TBUF 314 in order to alleviate the TBUF 314 occupancy concerns. Examples of such trigger conditions include: (1) the number of available TBUF elements 314a-314n falling below a certain minimum threshold; (2) the number of flow controlled queues 316a-316n exceeding a programmable threshold; and (3) the number of TBUF elements 314a-314n associated with any one flow controlled connection queue 316a-316n exceeding a programmable threshold.
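These conditions can be summarized by a sketch such as the following (threshold and field names are illustrative, not normative); any one condition being true starts the recovery process:

    /* Hypothetical sketch: evaluating the data buffer element recovery triggers. */
    #include <stdbool.h>
    #include <stdint.h>

    struct tbuf_state {
        uint32_t available_elems;        /* TBUF elements currently free */
        uint32_t flow_controlled_queues; /* number of flow controlled queues */
        uint32_t max_fc_queue_elems;     /* largest element count held by any one
                                            flow controlled connection queue */
    };

    struct recovery_thresholds {
        uint32_t min_available_elems;    /* condition (1) */
        uint32_t max_fc_queues;          /* condition (2) */
        uint32_t max_elems_per_fc_queue; /* condition (3) */
    };

    bool recovery_triggered(const struct tbuf_state *s,
                            const struct recovery_thresholds *t)
    {
        return s->available_elems < t->min_available_elems ||
               s->flow_controlled_queues > t->max_fc_queues ||
               s->max_fc_queue_elems > t->max_elems_per_fc_queue;
    }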
Once the data buffer element recovery process is initiated, the buffer manager 312 selects (408) one or more connection queues 316a-316n for discard, and performs (410) a roll-back operation on each selected connection queue 316a-316n such that the occupied elements 314a-314n of the TBUF 314 that correspond to each selected connection queue 316a-316n are designated as being available. One implementation of the roll-back operation involves sending a rollback message (instead of a commit message) to the queue management engine 330 of the NPU 302. When the queue management engine 330 receives the rollback message for a PDU, it re-enqueues the PDU to the head of the connection queue 316a-316n and does not remove the PDU data from the PDU memory 306. In this manner, the buffer manager 312 is able to reclaim space in the TBUF 314 in which other PI-2 packets can be stored. In general, the data buffer element recovery process is governed by two rules: (1) select one or more connection queues 316a-316n to ensure that the aggregate reclaimed TBUF 314 space is sufficient so that the TBUF 314 occupancy falls below the predetermined threshold conditions; and (2) minimize the total number of roll-back operations to be performed.
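A sketch of the roll-back on one selected connection queue is given below, reusing the hypothetical structures from the earlier sketches and a placeholder for the rollback message sent to the queue management engine 330:

    /* Hypothetical sketch: rolling back one connection queue to reclaim TBUF space. */
    #include <stdint.h>
    #include <stddef.h>

    struct queue_desc {
        uint32_t num_elems;       /* TBUF elements held by this PI-2 packet */
        uint32_t pdu_id;          /* PDU the PI-2 packet was formed from */
        struct queue_desc *next;
    };

    struct conn_queue {
        struct queue_desc *head;
        struct queue_desc *tail;
        uint32_t occupied_elems;
    };

    /* Placeholder: tells the queue management engine to re-enqueue the PDU at the
     * head of its connection queue and to keep the PDU data in PDU memory. */
    void send_rollback_msg(uint32_t pdu_id);

    /* Returns the number of TBUF elements reclaimed (designated available again). */
    uint32_t rollback_queue(struct conn_queue *cq)
    {
        uint32_t reclaimed = 0;
        for (struct queue_desc *qd = cq->head; qd != NULL; qd = qd->next) {
            reclaimed += qd->num_elems;
            send_rollback_msg(qd->pdu_id);
        }
        cq->head = cq->tail = NULL;
        cq->occupied_elems = 0;
        return reclaimed;
    }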
Four example techniques may be implemented by the buffer manager 312 to perform the data buffer element recovery process. The specific technique used in a given scenario may depend on the source 304a-304n of the PDUs. That is, the technique applied may be line card specific to best fit the operating conditions of a particular line card configuration.
In one example, the buffer manager 312 examines each connection queue's counter and bit vector that indicates whether the connection queue is flow controlled, and identifies the flow controlled connection queue 316a-316n that has the largest number of occupied elements 314a-314n in the TBUF 314 that are allocated to that connection queue 316a-316n. The buffer manager 312 marks the identified flow controlled connection queue 316a-316n for discard, and initiates a roll-back operation for that connection queue. Occupied elements 314a-314n of the TBUF 314 allocated to that connection queue 316a-316n are designated as being available, and the buffer manager 312 re-evaluates (412) the trigger condition. If the trigger condition is not resolved (i.e., the reclaimed TBUF 314 space is insufficient), the buffer manager 312 identifies the flow controlled connection queue 316a-316n having the next largest number of occupied elements 314a-314n allocated in the TBUF 314, and repeats the process (at 408) until the trigger condition is resolved (i.e., becomes false), at which point the buffer manager returns to monitoring (402) the state of the NPU 302. By selecting flow controlled queues 316a-316n having relatively larger numbers of allocated occupied elements 314a-314n, the buffer manager 312 is able to resolve the trigger condition while minimizing the number of connection queues 316a-316n upon which roll-back operations are performed.
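A sketch of this first selection policy follows; recovery_triggered() and rollback_queue() stand for the trigger re-evaluation and the roll-back described above, and all names are illustrative:

    /* Hypothetical sketch: repeatedly roll back the flow controlled queue holding
     * the most TBUF elements until the trigger condition clears. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct conn_queue {
        bool     flow_controlled;
        uint32_t occupied_elems;
    };

    bool recovery_triggered(void);                   /* re-evaluates the trigger condition */
    uint32_t rollback_queue(struct conn_queue *cq);  /* reclaims the queue's TBUF elements */

    void recover_by_occupancy(struct conn_queue *q, size_t nq)
    {
        while (recovery_triggered()) {
            struct conn_queue *victim = NULL;
            for (size_t i = 0; i < nq; i++) {
                if (q[i].flow_controlled && q[i].occupied_elems > 0 &&
                    (victim == NULL || q[i].occupied_elems > victim->occupied_elems))
                    victim = &q[i];
            }
            if (victim == NULL)
                break;                  /* nothing left to reclaim */
            rollback_queue(victim);     /* its occupied elements become available */
            victim->occupied_elems = 0;
        }
    }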
In another example, the buffer manager 312 examines each connection queue's head of connection queue time-stamp and bit vector that indicates whether the connection queue 316a-316n is flow controlled, and identifies the flow controlled connection queue 316a-316n having the earliest head of connection queue time-stamp. The buffer manager 312 marks the identified flow controlled connection queue 316a-316n for discard, and initiates a roll-back operation for that connection queue 316a-316n. Occupied elements 314a-314n of the TBUF 314 allocated to that connection queue 316a-316n are designated as being available, and the buffer manager 312 re-evaluates (412) the trigger condition. If the trigger condition is not resolved, the buffer manager 312 identifies the flow controlled connection queue 316a-316n having the next earliest head of connection queue time-stamp, and repeats the process (at 408) until the trigger condition is resolved. By selecting the oldest flow controlled queue 316a-316n (as reflected by the earliest head of connection queue time-stamp), the buffer manager 312 is able to resolve the trigger condition while re-designating the elements 314a-314n of the TBUF 314 that have the oldest SBFC status.
In a third example, the buffer manager 312 examines each connection queue's head of connection queue time-stamp and bit vector that indicates whether the connection queue 316a-316n is flow controlled, and identifies the flow controlled connection queue 316a-316n having the latest head of connection queue time-stamp. The buffer manager 312 marks the identified flow controlled connection queue 316a-316n for discard, and initiates a roll-back operation for that connection queue 316a-316n. Occupied elements 314a-314n of the TBUF 314 allocated to that connection queue 316a-316n are designated as being available, and the buffer manager 312 re-evaluates the trigger condition. If the trigger condition is not resolved (i.e., the reclaimed TBUF 314 space is insufficient), the buffer manager 312 identifies the flow controlled connection queue 316a-316n having the next latest head of connection queue time-stamp, and repeats the process (at 408) until the trigger condition is resolved. By selecting the newest flow controlled queue 316a-316n (as reflected by the latest head of connection queue time-stamp), the buffer manager 312 operates under the assumption that the newest flow controlled connection queue 316a-316n is unlikely to be subject to an ASI Xon message (signaling the resumption of packet transmission from that connection queue 316a-316n) in the immediate future. Accordingly, performing a roll-back operation on the newest flow controlled connection queue 316a-316n allows the buffer manager 312 to reclaim elements 314a-314n of the TBUF 314, while allowing older flow controlled queues 316a-316n to be maintained as these are more likely to be subject to ASI Xon messages. The techniques of
In a fourth example, the data buffer element recovery process is triggered when the number of flow controlled connection queues 316a-316n exceeds a certain threshold. When this occurs, the buffer manager 312 selects connection queues 316a-316n for discard based on occupancy (i.e., using each connection queue's per connection queue counter), oldest element (i.e., identifying the earliest head of connection queue time-stamp), newest element (i.e., identifying the latest head of connection queue time-stamp), or by applying a round-robin scheme. The buffer manager 312 repeatedly selects connection queues 316a-316n for discard until the number of flow controlled connection queues 316a-316n drops below the triggering threshold.
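The choice among these selection keys can be sketched as follows (the enum and comparison are illustrative only):

    /* Hypothetical sketch: the key used to pick the next victim connection queue. */
    #include <stdbool.h>
    #include <stdint.h>

    enum discard_policy {
        BY_OCCUPANCY,   /* largest per connection queue element count */
        BY_OLDEST,      /* earliest head of connection queue time-stamp */
        BY_NEWEST,      /* latest head of connection queue time-stamp */
        ROUND_ROBIN     /* next flow controlled queue in index order */
    };

    struct cq_info {
        bool     flow_controlled;
        uint32_t occupied_elems;
        uint64_t head_timestamp;   /* enqueue time of the descriptor at the head */
    };

    /* Returns true if candidate 'a' should be preferred over the current best 'b'. */
    bool prefer(const struct cq_info *a, const struct cq_info *b, enum discard_policy p)
    {
        switch (p) {
        case BY_OCCUPANCY: return a->occupied_elems > b->occupied_elems;
        case BY_OLDEST:    return a->head_timestamp < b->head_timestamp;
        case BY_NEWEST:    return a->head_timestamp > b->head_timestamp;
        default:           return false;   /* round robin is handled by index order */
        }
    }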
In the examples described above, the NPU 302 is implemented with on-chip connection queues 316a-316n that have shorter response times as compared to off-chip connection queues. These shorter response times enable the NPU 302 to meet the stringent response-time requirements for suspending or resuming the transmission of packets from a given connection queue 316a-316n after an SBFC flow control message is received for that particular connection queue 316a-316n. The upstream ASI endpoint is further implemented with a buffer manager 312 that dynamically manages the buffer utilization to prevent buffer over-run even if the TBUF 314 size is relatively small given die size and cost constraints.
The techniques of one embodiment of the invention can be performed by one or more programmable processors executing a computer program to perform functions of the embodiment by operating on input data and generating output. The techniques can also be performed by, and apparatus of one embodiment of the invention can be implemented as, special purpose logic circuitry, e.g., one or more FPGAs (field programmable gate arrays) and/or one or more ASICs (application-specific integrated circuits).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a memory (e.g., memory 330). The memory may include a wide variety of memory media including, but not limited to, volatile memory, non-volatile memory, flash memory, programmable variables or states, random access memory (RAM), read-only memory (ROM), or other static or dynamic storage media. In one example, machine-readable instructions or content can be provided to the memory from a form of machine-accessible medium. A machine-accessible medium may represent any mechanism that provides (i.e., stores or transmits) information in a form readable by a machine (e.g., an ASIC, special function controller or processor, FPGA, or other hardware device). For example, a machine-accessible medium may include: ROM; RAM; magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals); and the like. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
The invention has been described in terms of particular embodiments. Other embodiments are within the scope of the following claims. For example, the steps of an implementation of the invention can be performed in a different order and still achieve desirable results.