Store-and-forward devices, such as switches and routers, are used in packet networks, such as the Internet, to direct traffic at interconnection points. The store-and-forward devices include line cards to receive packets from external sources (ingress ports) and to transmit packets to external sources (egress ports). The line cards are connected to a switching fabric via a backplane. The switching fabric provides configurable connections between the line cards. The packets received at the ingress ports are stored in queues prior to being transmitted to the appropriate egress ports. The queues are organized by egress port and may also be organized by priority.
The store-and-forward devices also include a scheduler to schedule transmission of packets from the ingress ports to the egress ports via the switch fabric. The ingress ports send requests to the scheduler for the queues having packets stored therein. The scheduler considers the source, the destination, and possibly the priority when issuing grants. The scheduler issues grants for queues from multiple ingress ports each cycle. The ingress ports transfer packets from the selected queues to the corresponding egress ports in parallel across the crossbar switching matrix.
Transmitting packets of variable size through the switch fabric during the same cycle results in wasted bandwidth. For example, when a 50-byte packet and a 1500-byte packet are transmitted in the same cycle, the switch fabric must be maintained in the same configuration for the duration of the 1500-byte packet. Only 1/30th of the bandwidth of the path is used by the 50-byte packet.
Dividing the packets into fixed-size units (typically the size of the smallest packet) for transmission and then reassembling the packets as necessary after transmission reduces or avoids the wasted bandwidth of the switch fabric. However, the smaller fixed-size units increase the scheduling and fabric reconfiguration rates. For example, a unit size of 64 bytes and a port rate of 10 Gigabits/second requires a scheduling decision and a fabric reconfiguration every 51.2 nanoseconds.
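By way of illustration only, the arithmetic behind these figures can be reproduced with the following sketch; the function name and structure are ours and are not part of any embodiment described herein.

```python
def transfer_time_ns(num_bytes: int, rate_gbps: float) -> float:
    """Time to transmit num_bytes at the given port rate, in nanoseconds
    (bits divided by gigabits-per-second yields nanoseconds)."""
    return num_bytes * 8 / rate_gbps

# 64-byte units at 10 Gb/s: a scheduling decision and a fabric
# reconfiguration are needed every 51.2 ns.
print(transfer_time_ns(64, 10.0))   # 51.2

# A 50-byte packet holding a path configured for a 1500-byte packet
# uses only 1/30th of the path's bandwidth.
print(50 / 1500)                    # 0.0333...
```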
The features and advantages of the various embodiments will become apparent from the following detailed description in which:
The switch fabric 160 provides re-configurable data paths between the line cards 110 (or fabric interfaces). The switch fabric 160 includes a plurality of fabric ports 170 (addressable interfaces) for connecting to the line cards 110 (port interfaces). Each fabric port 170 is associated with a fabric interface (a pair of ingress and egress fabric interface modules). The switch fabric 160 can range from a simple bus-based fabric to a fabric based on crossbar (or crosspoint) switching devices. The choice of fabric depends on the design parameters and requirements of the store-and-forward device (e.g., port rate, maximum number of ports, performance requirements, reliability/availability requirements, packaging constraints). Crossbar-based fabrics may be used for high-performance routers and switches because of their ability to provide high switching throughputs.
It should be noted that a fabric port 170 may aggregate traffic from more than one external port (link) associated with a line card. A pair of ingress and egress fabric interface modules is associated with each fabric port 170. As used herein, the term fabric port may refer to an ingress fabric interface module and/or an egress fabric interface module. An ingress fabric interface module may be referred to as a source fabric port, a source port, an ingress fabric port, an ingress port, a fabric port, or an input port. Likewise, an egress fabric interface module may be referred to as a destination fabric port, a destination port, an egress fabric port, an egress port, a fabric port, or an output port.
The ingress fabric interface module 230 receives packets from the packet processor/traffic manager device (e.g., 140 of FIG. 1) and divides the packets into segments.
The ingress fabric interface module 230 stores the segments in queues. The queues may be based on flow (e.g., destination, priority). The queues may be referred to as virtual output queues. The ingress fabric interface module 230 sends requests to the scheduler 220 for permission to transmit data from those virtual output queues containing data.
Once a request is granted for a particular virtual output queue, the ingress fabric interface module 230 dequeues segments from the queue and aggregates the segments into a frame having a maximum size. The frame consists of a whole number of segments, so if the segments are not all the same size, the constructed frames may not all be the same size. The frames may be padded to the maximum size so that the frames are all the same size. The maximum size of the frame is a design parameter. A frame may contain segments associated with different packets.
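A minimal sketch of this aggregation, assuming the virtual output queue is a simple list of segment buffers (the names and the padding policy are illustrative only):

```python
def build_frame(voq: list, max_frame_bytes: int, pad: bytes = b"\x00") -> bytes:
    """Dequeue whole segments from a virtual output queue into one frame,
    then pad to the maximum size so that all frames are the same length."""
    frame = bytearray()
    while voq and len(frame) + len(voq[0]) <= max_frame_bytes:
        frame += voq.pop(0)                        # whole segments only
    frame += pad * (max_frame_bytes - len(frame))  # pad to maximum size
    return bytes(frame)
```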
The frame is transmitted to the switching matrix 210. The switching matrix 210 routes the frame to the appropriate egress fabric interface modules 260. The time taken to transmit the maximum-size frame is referred to as the “frame period.” This interval is the same as a scheduling interval (discussed in further detail later). The frame period can be chosen independent of the maximum packet size in the system. The frame period may be chosen such that a frame can carry several maximum-size segments. The frame period may be determined by the reconfiguration time of the crossbar data path.
The egress fabric interface modules 260 receive the frames from the switching matrix 210 and split each frame into its constituent segments. The egress fabric interface modules 260 recreate the packets by combining the appropriate segments. The egress fabric interface modules 260 transmit the packets to the packet processor/traffic manager device for further processing.
Stage III is the crossbar configuration stage. During this stage, the scheduler configures the crossbar planes based on the matches computed during stage II. While the crossbar is being configured, the ingress modules de-queue segments from the appropriate queues in order to form frames. The scheduler may also send grants to the egress modules for error detection during this stage. Stage IV is the data transmission stage. During this stage, the ingress modules transmit the frames across the crossbar. The time for each stage is equivalent to the time necessary to transmit a frame (the frame period). For example, if the frame size, including its header, is 3000 bytes and the port speed is 10 Gb/s, the frame period is (3000 bytes × 8 bits/byte)/10 Gb/s = 2.4 microseconds.
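The frame-period arithmetic and the way successive frames overlap the four pipeline stages can be illustrated with the following toy model; the stage names come from the description above, while everything else is assumed for illustration.

```python
FRAME_BYTES = 3000      # frame size, including header, from the example
PORT_RATE_GBPS = 10.0
frame_period_us = FRAME_BYTES * 8 / (PORT_RATE_GBPS * 1000)  # 2.4 us

STAGES = ("I request", "II schedule", "III configure", "IV transmit")

# One frame occupies each stage per frame period, so four frames are
# in flight at any time once the pipeline is full.
for t in range(6):
    in_flight = [(f"frame {t - s}", STAGES[s])
                 for s in range(len(STAGES)) if t - s >= 0]
    print(f"t = {t * frame_period_us:.1f} us:", in_flight)
```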
The amount of data in a queue may be described in terms of the number of bytes, packets, segments, or frames. If the data is transmitted in frames, the request fields 430 may quantize the amount of data as the number of data frames it would take to transport the data within the associated queue over the crossbar planes. The length of the request fields 430 (e.g., number of bits) associated with the amount of data defines the granularity to which the amount of data can be described. For example, if the request fields 430 included 4 bits to define the amount of data, that would provide 16 different ranges by which to classify the amount of data.
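A minimal sketch of this quantization, assuming the amount is reported as a saturating count of frames (the field width and rounding policy are design parameters, assumed here):

```python
def quantize_amount(queue_bytes: int, frame_bytes: int, field_bits: int = 4) -> int:
    """Express queue occupancy as the number of frames needed to carry
    it, saturating at the largest value the request field can hold."""
    frames = -(-queue_bytes // frame_bytes)      # ceiling division
    return min(frames, (1 << field_bits) - 1)    # 4 bits -> values 0..15
```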
The age of data may be defined as the amount of time that data has been in the queue. This time can be determined as the number of frame periods since the queue last had a request granted. The ingress ports may maintain an age counter for each queue. The age counter for a queue may be incremented each frame period in which a grant is not issued for the queue. The age counter may be reset when a request is granted for the queue. The length of the request fields 530 (e.g., number of bits) associated with the data age defines the granularity to which the age can be described.
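A sketch of such an age counter, with a saturating count sized to the request field (the field width is an assumed design parameter):

```python
class AgeCounter:
    """Per-queue age: cleared when the queue's request is granted and
    incremented each frame period otherwise, saturating at the largest
    value the request field 530 can carry."""
    def __init__(self, field_bits: int = 4):
        self.max_age = (1 << field_bits) - 1
        self.age = 0

    def tick(self, granted: bool) -> None:
        self.age = 0 if granted else min(self.age + 1, self.max_age)
```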
As each request may convey external criteria (e.g., aging, fullness), the request pre-processing block 610 may map the requests to an internal scheduler priority level (SPL) based on the external criteria. The length of the SPL (e.g., number of bits) defines the granularity of the SPL.
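One way such a mapping might look is sketched below; the description leaves the combining policy as a design choice, so the promotion rule here (and the assumption that lower SPL values denote higher priority) is purely illustrative.

```python
def to_spl(priority: int, aged: bool, nearly_full: bool, spl_bits: int = 3) -> int:
    """Fold the external criteria of a request into one internal
    scheduler priority level (SPL); aged or nearly full queues are
    promoted one level in this illustrative policy."""
    spl = priority
    if (aged or nearly_full) and spl > 0:
        spl -= 1                          # promote one level
    return min(spl, (1 << spl_bits) - 1)  # clamp to the SPL field width
```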
Referring back to FIG. 6, the grant arbiters 640 are associated with specific egress modules. The grant arbiters 640 are coupled to the arbitration request blocks 630 and are capable of receiving requests from any arbitration request block 630. If a grant arbiter 640 receives multiple requests, the grant arbiter 640 will grant one of the requests (e.g., activate the associated bit) based on some type of arbitration (e.g., round robin (RR)).
The accept arbiters 650 are associated with specific ingress modules. The accept arbiters 650 are coupled to the grant arbiters 640 and are capable of receiving grants from any grant arbiter 640. If an accept arbiter 650 receives multiple grants, the accept arbiter 650 will accept one of the grants (e.g., activate the associated bit) based on some type of arbitration (e.g., RR). When an accept arbiter 650 accepts a grant, the arbitration request block 630 associated with that ingress port and the grant arbiter 640 associated with that egress port are disabled for the remainder of the scheduling cycle.
Each iteration of the scheduling process consists of three phases: requests are generated, requests are granted, and grants are accepted. At the end of an iteration, the process continues for the ingress and egress ports that were not previously associated with an accepted grant.
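The three phases resemble parallel iterative matching schemes; the sketch below, which assumes simple round-robin pointers at the grant arbiters 640 and accept arbiters 650, shows how a scheduling cycle converges. It is an illustration of the idea, not the claimed scheduler.

```python
def schedule_cycle(requests, grant_ptr, accept_ptr, n):
    """One scheduling cycle of request/grant/accept iterations over n
    ingress and n egress modules. requests[i][j] is True when ingress i
    has data queued for egress j; grant_ptr and accept_ptr are lists of
    round-robin pointers (the pointer-update policy is illustrative)."""
    matched_in, matched_eg, match = set(), set(), {}
    progress = True
    while progress:
        progress = False
        # Phase 1: every unmatched ingress requests every unmatched
        # egress for which it has queued data.
        reqs = {(i, j)
                for i in range(n) if i not in matched_in
                for j in range(n) if j not in matched_eg and requests[i][j]}
        # Phase 2: each unmatched egress grants one request, scanning
        # round robin from its pointer.
        grants = {}                 # ingress -> list of granting egresses
        for j in range(n):
            if j in matched_eg:
                continue
            for k in range(n):
                i = (grant_ptr[j] + k) % n
                if (i, j) in reqs:
                    grants.setdefault(i, []).append(j)
                    break
        # Phase 3: each ingress accepts one grant, round robin from its
        # pointer; both sides then drop out for the rest of the cycle.
        for i, offered in grants.items():
            j = min(offered, key=lambda e: (e - accept_ptr[i]) % n)
            match[i] = j
            matched_in.add(i)
            matched_eg.add(j)
            grant_ptr[j] = (i + 1) % n
            accept_ptr[i] = (j + 1) % n
            progress = True
    return match
```

Each iteration can only add matches, so the loop terminates after at most n iterations.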
After an accept arbiter 650 accepts a grant, the scheduler can generate a grant for transmission to the associated ingress port. A grant also may be sent to the associated egress port. The grants to the ingress port and the egress port may be combined in a single grant frame.
The egress module grant 840 may include an ingress module (input port) number 842 representing the ingress module it should be receiving data from, and a valid bit 844 to indicate that the field is valid. The ingress module grant 850 may include an egress module (output port) number 852 representing the egress module to which data should be sent, a starting priority level 854 representing the priority level of the queue that should be used at least as a starting point for de-queuing data to form the frame, and a valid bit 856 to indicate that the information is a valid grant. The presence of the starting priority field enables the scheduler to force the ingress module to start de-queuing data from a lower-priority queue even when a higher-priority queue has data. This allows the system to prevent starvation of lower-priority data.
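The grant fields can be summarized with simple structures; this is a sketch of the fields only, not a wire format, and both grants may be carried in a single grant frame as noted above.

```python
from dataclasses import dataclass

@dataclass
class EgressGrant:          # fields 842/844 described above
    ingress_module: int     # input port the frame will arrive from
    valid: bool

@dataclass
class IngressGrant:         # fields 852/854/856 described above
    egress_module: int      # output port the frame should be sent to
    starting_priority: int  # queue level to begin de-queuing from
    valid: bool
```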
The flows may be weighted in order to provide bandwidth guarantees (quality of service). The weighting may be defined as a certain amount of data (e.g., bytes, segments, frames) over a certain period (e.g., time, cycles, frame periods). The period may be referred to as a “scheduling round” or simply “round”. When the weighting for a particular flow is satisfied for a particular scheduling round, the flow is disabled for the remainder of the period in order to provide the other flows with the opportunity to meet their weights. The grants issued by the scheduler should be proportional to the programmed weights.
According to one embodiment, the weights associated with the flows may be stored in the scheduler so that the scheduler can determine when a flow has met its weight. The scheduler may track the amount of data sent based on the grants issued. Alternatively, the ingress port may track the amount of data dequeued for the flows associated therewith and provide that data to the scheduler. The scheduler may compare the data transmitted to the weighting to determine when the weighting has been satisfied.
According to one embodiment, the weights for the flows may be stored in the respective ingress ports. The ingress ports may keep a running total of the amount of data transmitted per flow during a period. The ingress port may compare the running total to the weight and determine the weighting is satisfied when the running total equals or exceeds the weight. The ingress port may maintain a satisfied bit for each flow and may activate the bit when the weight is satisfied. The ingress port informs the scheduler when a particular flow has been satisfied. The ingress port may include the satisfaction notification in a request (e.g., the next request sent). The request frame may include weight satisfied flags (e.g., bits) for each of the flows, and the flags associated with satisfied flows may be activated.
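A sketch of the ingress-side bookkeeping just described, with illustrative names (the unit of the weight, bytes here, is one of the options listed above):

```python
class FlowWeight:
    """Per-flow running total of data dequeued during the round, plus
    the satisfied bit reported to the scheduler in the next request."""
    def __init__(self, weight_bytes: int):
        self.weight = weight_bytes
        self.running_total = 0
        self.satisfied = False

    def on_dequeue(self, nbytes: int) -> None:
        self.running_total += nbytes
        if self.running_total >= self.weight:
            self.satisfied = True
```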
The scheduler receives the satisfied information from the ingress port and removes the associated flow from consideration in the arbitration of requests for the remainder of the current scheduling round. The scheduler may maintain a satisfied bit for each flow and may activate the bit when informed by the ingress port that the flow is satisfied. When the satisfied bit is active, the flow is deactivated. The flow may be deactivated by preventing the associated arbitration block from sending a request to the associated grant arbiter within the scheduler.
The scheduler maintains data related to the duration of the scheduling round with which the weights are associated. The scheduler tracks the duration of the current scheduling round and, when the duration is up, instructs the ingress ports to restart the running counts. The scheduler may also reset the count for particular flows during the scheduling round, for example, if there are no other requests from the ingress port, for the egress port, or for the priority (or SPL) associated with the satisfied flow. The flow may also be reset during the round if there are requests from the ingress port, for the egress port, and/or for the priority (or SPL), but a grant has not been accepted for more than a programmable number of consecutive frame times, implying that the ingress port is giving priority to other flows. The scheduler may send the reset instructions in grants.
The scheduler may maintain a reset bit for each flow and the bit may be set when the running totals for the flow should be reset. The grant frames may include reset flags (e.g., bits) for each of the flows associated with an ingress port and the flags associated with the flows that should be reset may be activated.
The scheduler may reset a set reset bit and a corresponding set satisfied bit in the frame period after the grant frame with the reset flag activated is forwarded to the ingress port. Due to the pipelined nature of the switching device, the scheduler may receive request frames with satisfied flags set for particular flows after the scheduler has sent a grant frame with a reset flag set for those flows. Since the scheduler operates on the most recent data, if the scheduler receives a request frame with a satisfied flag set for a particular flow in the same frame period in which it is resetting the reset bit and the satisfied bit maintained for that flow, the satisfied flag in the request will be ignored.
When the ingress ports receive the reset information, they may reset the running totals for the associated flows. The ingress port may maintain a reset bit for each flow and may activate the bit when the reset information is received from the scheduler. When the reset bit is activated for a flow, the running count may be cleared in the next frame period, and after the running count is cleared, the reset bit may be deactivated in the following frame period.
The reset bit map may be sent by the scheduler to the ingress port every frame period. The ingress port may update its reset bit map based thereon. However, since the reset bits may be deactivated in the scheduler before the ingress port has reset its running counts for the associated flows, the reset bit map received from the scheduler may be logically ORed with the current reset bit map to ensure the resets are not deactivated before the counts have been cleared.
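As a sketch, with each bit map held as an integer whose bit positions correspond to flows (an assumed encoding):

```python
def merge_reset_maps(ingress_map: int, scheduler_map: int) -> int:
    """Logically OR the scheduler's per-flow reset bit map into the
    ingress port's map so that a reset the scheduler has already cleared
    is not lost before the running counts have been cleared."""
    return ingress_map | scheduler_map
```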
The length of the scheduling round is stored in the scheduler (905). The scheduler will also maintain a running count of the frame periods to track the progress of the scheduling round, a reset bit for each flow to indicate when the flow should be reset, and a satisfied bit for each flow to indicate when the weight for the flow is satisfied and should be excluded from scheduling. Initially, the running count for the frame periods will be 0 and the satisfied and reset bits will be deactivated.
The flow chart of FIG. 9 illustrates example operations of the weight-based scheduling process described above. Each frame period, the ingress ports send request frames, including the weight satisfied flags, to the scheduler (915).
The scheduler receives the requests and updates the satisfied bits maintained therein based on the satisfied flags in the request frame (920). The scheduler deactivates any flow having a satisfied bit set for the remainder of the current scheduling round, and arbitrates amongst the remaining requests received from each of the ingress ports (925). The scheduler updates the running frame period total and determines if any or all of the flows should be reset (930). The reset determination includes determining if the running total of the frame periods equals the duration of the scheduling round stored therein. The determination also includes determining if no other requests are being received from the ingress port, for the egress port, or for the priority associated with a satisfied flow, or if such requests are being received but not granted. The reset bits for the appropriate flows are set. The scheduler generates a grant frame every frame period for each of the ingress ports that includes grants and reset flags for the associated flows (935). A reset flag is set if the corresponding reset bit in the scheduler is set, indicating the flow should be reset.
After the grant frame is sent, the scheduler updates the counters and flags (940). If no reset flags were set in the grant frame that was sent the previous frame period, then no updates are required. If the reset flag was set for all the flows, indicating that the round ended, the count is reset, as are the reset and satisfied flags for all of the flows. If the reset bit was only set for a subset of the flows, the reset and satisfied bits are reset for that subset of flows.
The ingress port receives the grant, dequeues data from the associated queues, and transmits the data to the appropriate egress port via the switch fabric (945). As the data is being dequeued, the ingress port updates the counts and flags for the associated flows (950). The running total is increased by the amount of data that is dequeued. The reset bits for the flows are updated based on the grant frame received. As previously mentioned, the reset bit map in the ingress port may be logically ORed with the reset bit map received in the grant frame. If the reset bit is set in the ingress port for a flow, the satisfied bit and the running count for the flow are reset.
Resetting the count may not mean setting the count to zero. If the running count was greater than the weight, the overage may be counted against the weight in the next round. The difference between the running count and the weight is determined. If the difference is less than or equal to 0, the weight was not exceeded and the running count is simply set to 0. If the difference is greater than 0, there was an overage and the running count is set to the overage. If the overage is greater than the weight, indicating that more than twice the weight was dequeued last round, the count may be set to the weight. After the counts and flags are updated, a determination is made as to whether the weights are satisfied and the appropriate satisfied bits are set (910).
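The overage handling just described reduces to a small amount of saturating arithmetic; a minimal sketch:

```python
def reset_running_count(running_total: int, weight: int) -> int:
    """End-of-round reset: carry any overage into the next round, capped
    at one full weight when more than twice the weight was dequeued."""
    overage = running_total - weight
    if overage <= 0:
        return 0                     # weight not exceeded; start from 0
    return min(overage, weight)      # charge the overage next round
```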
The elements of the flowchart may be mapped to the different stages of the store-and-forward pipeline schedule. For example, the request 915 may be the request stage (stage I). The reset 917, update 920, arbitrate 925, determine 930, and generate 935 may be the schedule stage (stage II). The reset 940 and dequeue 945 may be the crossbar configuration stage (stage III). The update 950 and determine 910 may be the data transmission stage (stage IV).
It should be noted that the steps identified in the flowchart may be rearranged, combined, and/or separated without departing from the scope. Moreover, the pipeline stage within which the specific steps are accomplished may be modified without departing from the scope.
It should also be noted that the disclosure focused on frame-based store-and-forward devices but is in no way intended to be limited thereby.
Although the disclosure has been illustrated by reference to specific embodiments, it will be apparent that the disclosure is not limited thereto as various changes and modifications may be made thereto without departing from the scope. Reference to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described therein is included in at least one embodiment. Thus, the appearances of the phrase “in one embodiment” or “in an embodiment” appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
Different implementations may feature different combinations of hardware, firmware, and/or software. For example, some implementations feature computer program products disposed on computer-readable media, where the programs include instructions to cause processors to perform the techniques described above.
The various embodiments are intended to be protected broadly within the spirit and scope of the appended claims.