Store-and-forward devices, such as switches and routers, are used in packet networks, such as the Internet, for directing traffic at interconnection points. The store-and-forward devices include a plurality of line cards for receiving data from and transmitting data to external sources. The line cards are connected to one another via a backplane and a switching fabric. The backplane provides data paths between each line card and the switching fabric, and the switching fabric provides configurable data paths between line cards. The line cards receiving data from external sources (ingress ports) receive data (packets) of various sizes. The data received are stored in queues prior to being transmitted to the appropriate line cards for transmission to external sources (egress ports). The packets include a header that identifies the destination of the packet. The packet is stored in the queue associated with that destination. The packet may also identify a priority for the data, and the ingress port may also include queues for the various priorities.
The ingress ports send requests for transmitting data to a scheduler within the switching fabric. The scheduler generates grants identifying the queues from which packets should be transmitted. The packets are switched through a crossbar switching matrix in batches. A batch consists of at most one packet selected from each input port, such that no more than one of the packets is destined for each output port. The packets in a batch are transferred in parallel across the crossbar switching matrix. While the packets from a scheduled batch are being transferred through the crossbar, the scheduler can select the packets to form the next batch, so that the transmission of the new batch of packets can start as soon as transmission of the current batch ends. At the end of each batch of packets, the fabric scheduler re-configures the crossbar switching matrix so as to connect each input port to the output port for which its next packet is destined.
Because the packets are transferred in batches, the switching paths in the crossbar switching matrix are kept unchanged for the duration of the longest packet being transferred across the crossbar in that batch. When the packets are of variable size (as is the case for packets generated by most network protocols in the industry), this results in wasted bandwidth. For example, when a 50-byte packet and a 1500-byte packet are part of the same batch, the crossbar is maintained in the same configuration for the duration of the 1500-byte packet, and only 1/30th of the bandwidth of the path is used by the 50-byte packet.
One solution for avoiding the inefficiency caused by variable-size packets is to divide the packets into fixed-size units before switching through the crossbar switching fabric, and to reassemble the fragments into the original packet at the output of the fabric. The packet fragments switched through the crossbar are called “segments” or “cells”. The fabric scheduler selects at most one cell from each input port to form a batch, such that the destination port numbers associated with the cells in the same batch are distinct. The cells in the same batch are then transmitted in parallel. Because the cells are of the same size, no bandwidth is wasted in the crossbar. The cells switched through the fabric have a fixed size. This fixed size is typically chosen to correspond to the size of the smallest packet switched by the fabric, plus the size of any internal headers added by the router or switch before passing the packet through the fabric.
The fabric scheduler computes a new schedule for each batch of cells during the transmission time of a cell. In a high-speed switch, this time interval can be extremely short. For example, with a cell size of 64 bytes and a port rate of 10 Gigabits/second, the fabric scheduler schedules a new batch of cells every 51.2 nanoseconds. The crossbar switching matrix is also configured at intervals of 51.2 nanoseconds. As the port speed is increased, both the fabric scheduler and the crossbar reconfiguration must be made correspondingly faster. This is especially a problem when an optical switching device is used as the crossbar switching matrix. While many optical switching devices support very high data rates, they have long reconfiguration times, making them unsuitable for use in a cell-based fabric.
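By way of illustration, the scheduling interval follows directly from the cell size and the port rate. The following minimal sketch (in Python; the function name is chosen for the example) reproduces the 51.2-nanosecond figure quoted above:

```python
# Minimal sketch: the per-cell scheduling interval implied by a given
# cell size and port rate. Values are illustrative.

def cell_interval_ns(cell_bytes: int, port_rate_bps: float) -> float:
    """Time to transmit one cell at the given port rate, in nanoseconds."""
    return cell_bytes * 8 / port_rate_bps * 1e9

# 64-byte cells at 10 Gigabits/second:
print(cell_interval_ns(64, 10e9))  # 51.2 -- a new batch must be scheduled,
                                   # and the crossbar configured, every 51.2 ns
```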
Another difficulty with the cell-based fabric is that separating the crossbar switching matrix (the data path) from the fabric scheduler is problematic, because the delays in communication between them can become a bottleneck. During every scheduling cycle, the header information (in particular, the destination port number) from cells stored in the input buffers of the crossbar matrix is passed to the fabric scheduler, and the crossbar configuration setting is communicated back from the scheduler to the crossbar matrix. If the scheduler is physically separated from the crossbar matrix (on separate chips or circuit boards), the delays in communication between the two may make it difficult to achieve the scheduling rate needed in a high-speed router or switch.
The features and advantages of the various embodiments will become apparent from the following detailed description.
Store-and-forward devices, such as switches and routers, are used in packet networks, such as the Internet, for directing traffic at interconnection points. Store-and-forward devices include a plurality of interface modules, a switch fabric for selectively connecting different interface modules, and a backplane for connecting the interface modules and the switching fabric. The interface modules include receivers (ingress ports) to receive data from and transmitters (egress ports) to transmit data to multiple sources (e.g., computers, other store-and-forward devices) over multiple communication links (e.g., twisted wire pair, fiber optic, wireless). Each of the sources may be capable of transmitting/receiving data at different speeds, different quality of service, etc. over the different communication links. The interface modules can transmit/receive data using any number of protocols including Asynchronous Transfer Mode (ATM), Internet Protocol (IP), and Time Division Multiplexing (TDM). The data may be variable-length or fixed-length packets, such as cells or frames.
The data received from external sources is stored in a plurality of queues. The queues may be stored in any type of storage device and preferably are hardware storage devices such as semiconductor memory, on-chip memory, off-chip memory, field-programmable gate arrays (FPGAs), random access memory (RAM), or a set of registers. The interface modules may be line cards or chips contained on line cards. A single line card may include a single interface module (receiver or transmitter) or multiple interface modules (receivers, transmitters, or a combination). The interface modules may be Ethernet (e.g., Gigabit, 10 Base T), ATM, Fibre Channel, Synchronous Optical Network (SONET), Synchronous Digital Hierarchy (SDH) or various other types. A line card having multiple interface modules may have the same type of interface modules (e.g., ATM) or may contain some combination of different interface module types. The backplane may be electrical or optical.
The switch fabric 160 provides re-configurable data paths between the line cards 110 (or fabric interfaces). The switch fabric 160 includes a plurality of fabric ports 170 (addressable interfaces) for connecting to the line cards 110 (port interfaces). Each fabric port 170 is associated with a fabric interface (pair of ingress interfaces and egress interfaces). The switch fabric 160 can range from a simple bus-based fabric to a fabric based on crossbar (or crosspoint) switching devices. The choice of fabric depends on the design parameters and requirements of the store-and-forward device (e.g., port rate, maximum number of ports, performance requirements, reliability/availability requirements, packaging constraints). Crossbar-based fabrics are the preferred choice for high-performance routers and switches because of their ability to provide high switching throughputs.
A backplane consists of a plurality of channels (input 240 and output 250) that provide connectivity between the fabric ports 205 and the crossbar matrix 210 so as to provide switching connectivity between line cards. With advances in serial communication technologies, the channels 240, 250 are preferably high-speed serial links. High-speed serial data can be carried over either electrical backplanes or optical backplanes. If an optical backplane is used, the transmitting line card converts electrical signals to optical signals and sends the optical signals over fiber, and the destination line card receives the optical signals from the fiber and reconverts them to electrical signals.
The crossbar matrix 210 is logically organized as an array of N×N switching points, thus enabling any of the packets arriving at any of N input ports to be switched to any of N output ports, where N represents the number of fabric ports. These switching points are configured by the fabric scheduler 220 at packet boundaries. Typically, the packets are switched through the crossbar switching matrix 210 in batches, where a batch consists of at most one packet selected from each input port, in such a way that no more than one of the packets is destined for each output port.
Each of the packets, arriving at one of the input buffers 230, has a header containing the destination port number where it needs to be switched. The fabric scheduler 220 periodically reads the destination port information from the headers of the packets stored in the input buffers 230 and schedules a new batch of packets to be transferred through the crossbar switching matrix 210. The packets in a batch (a maximum of N packets) are transferred in parallel across the crossbar switching matrix 210. While the packets from a scheduled batch are being transferred through the crossbar 210, the scheduler 220 can select the packets to form the next batch, so that the transmission of the new batch of packets can start as soon as transmission of the current batch ends. At the end of each batch of packets, the fabric scheduler 220 re-configures the crossbar switching matrix 210 so as to connect each input port to the output port for which its next packet is destined.
Because the packets in the exemplary switching fabric 200 are transferred in batches, the switching paths in the crossbar switching matrix 210 are kept unchanged for the duration of the longest packet being transferred across the crossbar 210 in that batch. When the packets are of variable size (as is the case for packets generated by most network protocols in the industry), this results in wasted bandwidth. For example, when a 50-byte packet and a 1500-byte packet are part of the same batch, the crossbar 210 is maintained in the same configuration for the duration of the 1500-byte packet, and only 1/30th of the bandwidth of the path is used by the 50-byte packet.
The fixed size of the cells may be chosen to correspond to the size of the smallest packet switched by the switch fabric 300, plus the size of any internal headers added. For example, if the smallest packet is of size 64 bytes, and the size of the internal headers is 16 bytes, a cell size of 64+16=80 bytes can be chosen. A packet larger than 64 bytes, arriving in the switch fabric 300, will be segmented by the segmentation unit 370 into multiple cells, each carrying at most 64 bytes of the packet, before switching through the crossbar matrix 310. For example, if a 180-byte packet is received, it will be broken into two cells carrying the maximum 64 bytes each and one cell carrying 52 bytes. The last cell is padded to 64 bytes so that all the cells are the same size. A header (e.g., 16 bytes) is appended to each of these cells. After the cells (data and header) are switched through the crossbar matrix 310, they are combined into the original packet by the reassembly unit 380.
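A minimal sketch of this segmentation step follows, assuming a 64-byte cell payload and a 16-byte internal header as in the example above; the header contents here are placeholders, not the actual header format:

```python
# Hedged sketch of segmentation: a packet is cut into cells no larger than
# the minimum packet size (64 bytes here), the last cell is padded to full
# size, and an internal header (16 bytes here) is prepended to each cell.

CELL_PAYLOAD = 64   # size of the smallest packet switched by the fabric
HEADER_SIZE = 16    # internal header added by the fabric

def segment(packet: bytes) -> list[bytes]:
    cells = []
    for i in range(0, len(packet), CELL_PAYLOAD):
        chunk = packet[i:i + CELL_PAYLOAD]
        chunk = chunk.ljust(CELL_PAYLOAD, b"\x00")   # pad the last cell to 64 bytes
        header = b"\x00" * HEADER_SIZE               # placeholder 16-byte header
        cells.append(header + chunk)                 # each cell is 80 bytes total
    return cells

cells = segment(b"x" * 180)  # the 180-byte packet from the example
print(len(cells), [len(c) for c in cells])  # 3 [80, 80, 80]
```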
The fabric scheduler 320 works in the same way as the fabric scheduler 220 described above.
The ingress fabric interface module 430 receives packets from the packet processor/traffic manager device on a line card. The packet processor/traffic manager is responsible for processing the packets arriving from the external links, determining the fabric port number associated with the incoming packet (from a header lookup), and attaching this information to the packet for use by the switching fabric 400. The ingress fabric interface module 430 receives the packets, stores the packets in associated queues, and sends the packets to the switching matrix 410 for transfer to a different line card. The egress fabric interface modules 460 are responsible for receiving packets arriving from the switching matrix 410 (typically from a different line card), and passing them on for any egress processing needed in a line card and subsequently for transmission out on the external links. It should be noted that a fabric port may aggregate traffic from more than one external port (link) associated with a line card. A pair of ingress and egress fabric interface modules 430, 460 is associated with each fabric port. As used herein, the term fabric port may refer to an ingress fabric interface module and/or an egress fabric interface module. An ingress fabric interface module may be referred to as a source fabric port, a source port, an ingress fabric port, an ingress port, a fabric port, or an input port. Likewise, an egress fabric interface module may be referred to as a destination fabric port, a destination port, an egress fabric port, an egress port, a fabric port, or an output port.
The ingress fabric interface modules 430 store the packets arriving from the packet processor/traffic manager in a set of queues. The packets destined to different egress fabric interface modules 460 are maintained in separate queues (isolated from each other). In addition, the packets destined to a specific egress fabric interface module 460 can further be distributed into multiple queues based on their class of service or relative priority level. These queues may be referred to as virtual output queues. The packets may be broken down into segments and the segments stored in the queues. The segments can be variable size but are limited to a maximum size.
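A minimal sketch of such a virtual-output-queue organization follows, assuming one queue per (egress port, priority) pair; the class and method names are illustrative, not part of the described embodiment:

```python
# Hedged sketch of virtual output queues: segments for different
# destinations never share a queue, and each destination may have one
# queue per priority level.

from collections import defaultdict, deque

class VirtualOutputQueues:
    def __init__(self):
        self.queues = defaultdict(deque)   # (egress_port, priority) -> segments

    def enqueue(self, egress_port: int, priority: int, segment: bytes) -> None:
        self.queues[(egress_port, priority)].append(segment)

    def non_empty_destinations(self) -> set[int]:
        """Destinations to report to the fabric scheduler as requests."""
        return {dest for (dest, _prio), q in self.queues.items() if q}

voq = VirtualOutputQueues()
voq.enqueue(3, 0, b"seg-a")
voq.enqueue(5, 2, b"seg-b")
print(voq.non_empty_destinations())  # {3, 5}
```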
The segment header identifies the queue in which the segment is to be placed upon its arrival in the egress fabric interface module. The number of queues is dependent on the number of priority levels (or classes of service) associated with the packet. Furthermore, the number of queues may also be dependent on the number of ingress fabric interface modules that can send data to the egress fabric interface module. For example, if the egress fabric interface module receives data from 8 line card ports (ingress fabric interface modules) and each line card port supports 4 levels of priority for packets to that egress fabric interface module, then the segments arriving at the egress fabric interface module may be placed in one of 32 queues (8 ingress fabric interface modules×4 priorities per ingress module). Therefore, a minimum of 5 bits is needed in the segment header to identify one of the 32 queues. The segment header also includes an “End of Packet” (EOP) bit to indicate the position of the segment within the packet from which it came. The EOP bit is set to 1 for the last segment of a packet, and 0 for the other segments. This enables the egress modules to detect the end of a packet.
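By way of illustration, one possible packing of these header fields is sketched below, using the 8-port, 4-priority example above; the exact bit layout is an assumption for the example:

```python
# Illustrative encoding of the segment header fields discussed above: a
# 5-bit queue identifier (one of 32 queues = 8 ingress ports x 4
# priorities) plus an End-of-Packet (EOP) bit.

def make_queue_id(ingress_port: int, priority: int) -> int:
    """Map (ingress port 0-7, priority 0-3) to a 5-bit queue number."""
    assert 0 <= ingress_port < 8 and 0 <= priority < 4
    return ingress_port * 4 + priority        # 0..31, fits in 5 bits

def pack_header_bits(queue_id: int, eop: bool) -> int:
    """Pack the queue id and EOP flag into the low 6 bits of an integer."""
    return (queue_id << 1) | int(eop)

bits = pack_header_bits(make_queue_id(ingress_port=6, priority=2), eop=True)
print(f"{bits:06b}")  # 110101 -> queue 26, last segment of its packet
```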
The ingress fabric interface module 430 aggregates the segments stored in its queues into frames before transmission to the crossbar matrix 410.
The frame period can be chosen independent of the maximum packet size in the system. Typically, the frame period is chosen such that a frame can carry several maximum-size segments. The frame period is often determined by the reconfiguration time of the crossbar data path. For example, the switching time of certain optical devices is currently on the order of microseconds. If such devices are used for the data path, the frame period is on the order of microseconds. Electronic switching technologies, on the other hand, are significantly faster, allowing frame periods in the range of tens to hundreds of nanoseconds.
Another factor that needs to be taken into account while choosing the frame period is the overhead in synchronizing the egress fabric interface modules with the data streams at the start of a frame. Data streams are broken at the end of a frame and the new arriving frame may be from a different ingress fabric interface module (resulting in a change in frequency and/or phase of the clock associated with the data stream). Accordingly, the egress fabric interface modules re-establish synchronization at the boundary of every frame. This requires a preamble at the beginning of each frame that does not carry any data, but only serves to establish synchronization.
The ingress fabric interface module constructs a frame by de-queuing one or more segments from its queues when instructed to do so by a grant from the fabric scheduler (discussed in further detail later). A grant may be received by an ingress fabric interface module during each frame period. The grant identifies the subset of queues from which data needs to be de-queued based on the destination fabric port (egress fabric port module). This de-queuing of segments proceeds until the frame is full. Because the segments cannot be broken up further, and a frame consists of a whole number of segments, each frame constructed may not have the same size, but will always be within the maximum size specified. Alternatively, the frames that do not equal the maximum frame size can be padded to the maximum size so that the frames are the same size.
By way of example, assume that the maximum frame size is 1000 bytes and that ingress port 1 just received a grant to transmit data to egress port 2 (queue 2). Assume that queue 2 has the following segments stored therein: segment 1—256 bytes, segment 2—256 bytes, segment 3—200 bytes, segment 4—256 bytes, segment 5—64 bytes, and segment 6—128 bytes. The frame would be constructed to include as many full segments as possible. In this case, the first four segments would be selected, utilizing 968 bytes. As the fifth segment cannot fit within the frame without exceeding the maximum frame size, it is not included. The frame is transmitted as a 968-byte frame. Alternatively, the frame can be padded to the maximum 1000-byte frame size.
If there are multiple queues (based on priority, class of service) associated with a specific destination, the ingress module chooses one or more queues from this subset based on a scheduling discipline. The scheduling discipline may be based on priority (e.g., highest priority first). That is, the queues may be serviced in order of priorities, starting from the highest priority queue and proceeding to the next priority level when the current priority level queue is empty. This de-queuing of segments proceeds through queues (priorities) until the frame is full.
By way of example, assume the same maximum frame size of 1000 bytes and that ingress port 1 just received a grant to transmit data to egress port 2 (queues 4-6 corresponding to priorities 1-3). Assume that queue 4 includes segment 1—256 bytes and segment 2—256 bytes; queue 5 includes segment 3—200 bytes; and queue 6 includes segment 4—256 bytes, segment 5—64 bytes, and segment 6—128 bytes. The frame would include segments 1 and 2 from queue 4, segment 3 from queue 5, and segment 4 from queue 6. These four segments, selected from three different queues (priorities), produce a 968-byte frame. The frame may be transmitted as a 968-byte frame or alternatively may be padded to the maximum 1000-byte frame size.
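A minimal sketch of this de-queuing discipline follows, using the segment sizes from the example above; it assumes segments cannot be split or reordered within a queue:

```python
# Hedged sketch of frame construction: on a grant, de-queue whole segments
# from the granted queues in priority order until no further segment fits
# within the maximum frame size.

MAX_FRAME = 1000  # bytes

def build_frame(queues: list[list[int]]) -> list[int]:
    """queues: segment sizes per queue, highest priority first."""
    frame, used = [], 0
    for queue in queues:
        while queue and used + queue[0] <= MAX_FRAME:
            seg = queue.pop(0)       # segments leave a queue in order
            frame.append(seg)
            used += seg
        if queue:
            break                    # next segment cannot be split; stop filling
    return frame

frame = build_frame([[256, 256], [200], [256, 64, 128]])
print(frame, sum(frame))  # [256, 256, 200, 256] 968 -- optionally pad to 1000
```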
While constructing the frame, the segments from multiple packets may be interleaved within a frame. Because the segment header provides identifying information for re-assembling them into the original packets, such interleaving does not violate data integrity. The only constraint to be satisfied is that two segments from the same packet should not be sent out of order. By way of example, assume that packet 1 includes segments 1-5, that packet 2 includes segments 6-8, and that the segments associated with both packets can fit within a maximum-size frame. The order of the segments in the frame may be 1, 2, 3, 6, 4, 7, 8, and 5. That is, the packets are interleaved with one another, but the segments associated with each packet remain in order.
When there is only a single crossbar switching plane present, the frame is transmitted in bit-serial fashion through that plane. When multiple crossbar planes are used, the contents of the frame are striped over the available crossbar planes. Striping may be performed at the bit, byte, or word level. Additional channels may be used for protection (error detection and correction).
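A minimal sketch of byte-level striping follows; the round-robin byte granularity is one of the options mentioned above, and the function names are illustrative:

```python
# Hedged sketch of striping a frame over multiple crossbar planes:
# consecutive bytes are distributed round-robin across the planes, and
# the inverse operation restores the original byte order.

def stripe(frame: bytes, num_planes: int) -> list[bytes]:
    """Split a frame byte-by-byte, round-robin, over num_planes lanes."""
    return [frame[i::num_planes] for i in range(num_planes)]

def unstripe(lanes: list[bytes]) -> bytes:
    """Reassemble the original byte order from the striped lanes."""
    out = bytearray()
    for i in range(max(len(lane) for lane in lanes)):
        for lane in lanes:
            if i < len(lane):
                out.append(lane[i])
    return bytes(out)

lanes = stripe(b"ABCDEFGH", 4)
print(lanes)                           # [b'AE', b'BF', b'CG', b'DH']
print(unstripe(lanes) == b"ABCDEFGH")  # True
```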
Referring back to the switch fabric 400, the fabric scheduler 420 receives from each ingress fabric interface module 430, during each frame period, information identifying the non-empty queues.
For example, assume that the ingress fabric interface module associated with fabric port 1 has five packets in its queue, two of which are destined to fabric port 3, and one each to fabric ports 5, 6 and 7. Then, the information transmitted from the ingress fabric interface module 1 in that cycle will carry at least one bit corresponding to each of the fabric ports 3, 5, 6 and 7 to identify the non-empty queues. The information can optionally include many other attributes, such as the amount of data in each queue and the “age” (time interval since a packet was last transmitted) of each queue. In addition, if there are multiple queues associated with each destination port, based on priority or class, then the information may include the amount of data queued at each priority level for each destination port.
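By way of illustration, the basic request information can be carried as a bitmap with one bit per destination fabric port, as sketched below; the word width and port numbering are assumptions for the example:

```python
# Illustrative encoding of the per-cycle request: one bit per destination
# fabric port, set when a queue for that destination is non-empty.

def encode_request(non_empty_destinations: set[int], num_ports: int) -> int:
    """Bitmap with bit d set when a queue for destination port d is non-empty."""
    bits = 0
    for d in non_empty_destinations:
        assert 0 <= d < num_ports
        bits |= 1 << d
    return bits

# Fabric port 1 holds packets for destinations 3, 5, 6 and 7:
req = encode_request({3, 5, 6, 7}, num_ports=8)
print(f"{req:08b}")  # 11101000 -- bits 3, 5, 6 and 7 are set
```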
A basic fabric scheduler implementation may need only the basic information (ID of non-empty queues) to be passed from the ingress fabric interface modules. More powerful scheduler implementations, supporting additional features, require more information to be passed from the ingress fabric interface modules and higher-bandwidth links (or striping of the requests over multiple links) connecting them to the scheduler.
Based on the information received from the ingress fabric interface modules 430, the fabric scheduler 420 computes a schedule for the crossbar planes 410. The schedule is computed by performing a matching of the requests received from the ingress fabric interface modules 430 and resolving any conflicts therebetween. For example, assume ingress fabric interface module 1 has packets queued for destinations 5 and 7, while ingress fabric interface module 2 has packets queued for destinations 5 and 9. During the matching phase, the scheduler 420 could match both of the ingress modules to destination 5. However, the scheduler 420 would realize the conflict and modify the schedule accordingly. The scheduler 420 may schedule ingress module 1 to send packets to destination 7 and ingress module 2 to send to destination 5, enabling both transmissions to occur in parallel during the same frame cycle. In practice, the fabric scheduler 420 may use criteria such as the amount of data queued at various priorities, the need to maintain bandwidth guarantees, and the waiting times of packets, in the computation of the schedule.
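A minimal sketch of the matching idea follows, using a single greedy pass; practical fabric schedulers use iterative algorithms and the additional criteria mentioned above, so this shows only the conflict-resolution principle:

```python
# Hedged sketch of request matching: greedily match each ingress module
# to one requested destination, skipping destinations already granted in
# this frame cycle so that each output port is used at most once.

def compute_schedule(requests: dict[int, list[int]]) -> dict[int, int]:
    """requests: ingress port -> list of requested destination ports."""
    granted_destinations = set()
    schedule = {}
    for ingress, destinations in sorted(requests.items()):
        for dest in destinations:
            if dest not in granted_destinations:
                schedule[ingress] = dest          # grant: ingress -> dest
                granted_destinations.add(dest)
                break
    return schedule

# Ingress 1 requests {5, 7}; ingress 2 requests {5, 9}.
print(compute_schedule({1: [5, 7], 2: [5, 9]}))
# {1: 5, 2: 9} -- one valid conflict-free matching; the resolution in the
# text (1 -> 7, 2 -> 5) is equally valid, and both transfer in parallel.
```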
The scheduler 420 then sets the crossbar matrix (planes) 410 to correspond to this schedule. For example, if the fabric scheduler 420 generates a schedule in which the ingress fabric interface module 1 is to transmit a packet in its queue to destination port 4 in the current cycle, the scheduler 420 configures the crossbar matrix 410 to connect ingress port 1 to egress port 4 during the current frame cycle. If there are multiple crossbar planes used to stripe the data, then the planes are set in parallel to the same configuration.
After the fabric scheduler 420 computes its schedule, the scheduler 420 communicates back to each ingress fabric interface module 430 the schedule information (grants) computed. The information sent to a particular ingress module includes, at a minimum, the destination fabric port number to which it was matched. Upon receiving this information, the ingress fabric interface modules 430 de-queue data (segments) from the associated queue(s) and transmit the data (frames) to the crossbar data planes (previously discussed). This is done in parallel by the interface modules 430. Because the fabric scheduler 420 sets the crossbar planes 410 to correspond to the schedule information (grants) communicated to the ingress fabric interface modules 430, the data transmitted by the ingress modules 430 will reach the intended destination egress interface modules 460.
While communicating the schedule information (grants) to the ingress fabric interface modules 430, the fabric scheduler 420 may optionally send information about the computed schedule to the egress fabric interface modules 460. Specifically, the scheduler 420 may send to each egress module 460 the port number associated with the ingress module 430 that will be transmitting data to it in that cycle. Although this information can be provided within the data stream itself (as part of header), sending it directly from the fabric scheduler 420 enables the egress modules 460 to detect errors by comparing the source of the arriving data (obtained from the headers) with the scheduler-supplied port number. A mismatch indicates an error or failure in the switch fabric system. The arriving data can be discarded in such an event, thus avoiding delivery of data to an unintended port.
The crossbar switching planes 720 and the fabric scheduler 730 reside on one or more switch cards. The backplane 740 (serial channels) forms the data path over which packets are transported through the crossbar switching planes 720. When the bandwidth of a single serial channel (link) is inadequate to support the data rate of each fabric port, data is striped over multiple channels. Such striping can be at different granularities (e.g., bit, byte, word). If the data is striped over several channels, there will be a corresponding number of crossbar planes. The crossbar planes may be separate crossbar matrixes or may be a single crossbar matrix containing multiple planes. Additionally, more links and switching planes may be used to provide speedup, redundancy, error detection and/or error recovery.
This enables the use of a fast crossbar switching plane 800 (e.g., optical switching device), as it needs to incorporate only the data path, and does not need to provide buffering and scheduling functions. Additionally, the serial channels 815, 835 between the fabric ports and the crossbar switching plane 800 may be optical (such as optical fiber links). Thus, according to one embodiment the path from the ingress fabric interface module to the egress fabric interface module is completely optical. Optoelectronic modules within the ingress and egress fabric modules perform conversion of the electrical data into optical and back.
The scheduler generates a schedule and transmits grants to the associated ingress fabric interface modules. The grant identifies, by fabric port number, the subset of queues to de-queue from. The grant is received from the scheduler by the local scheduler 970. The local scheduler 970 determines the associated queues. The framer 930 de-queues segments from these queues using an appropriate service discipline and constructs a frame therefrom. The data (frame) is then sent to the crossbar. If the data needs to be striped across multiple channels the striper 940 does so.
According to one embodiment, the scheduler sends the grants to all egress modules (multicasts the schedule) regardless of whether the grant pertains to that egress module. The schedule interface 1080 determines if the grant is applicable based on the destination port number specified in the header of the frame (matched against the address of the egress module). Alternatively, the grants are sent to only the applicable egress modules. The schedule interface 1080 also extracts the source fabric port number from the header of the frame and provides it to the error checker 1020. The error checker 1020 compares the source fabric port number contained in the grant (received from the scheduler) with the source fabric port number from the frame to check for errors. A mismatch indicates an error. In the case of an error, the frame is discarded and the appropriate error handling procedure is invoked.
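A minimal sketch of this comparison follows; the field and function names are hypothetical:

```python
# Hedged sketch of the source-port error check: the source fabric port
# carried in the frame header is compared against the source the
# scheduler said to expect this cycle.

def check_frame_source(frame_source_port: int, granted_source_port: int) -> bool:
    """Return True when the frame may be accepted, False to discard it."""
    if frame_source_port != granted_source_port:
        # Mismatch: an error or failure in the switch fabric system.
        # Discard the frame and invoke the error handling procedure.
        return False
    return True

print(check_frame_source(3, 3))  # True  -- data came from the expected port
print(check_frame_source(4, 3))  # False -- discard, avoiding wrong delivery
```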
The deframer 1030 receives the error-free frames and extracts the segments from within them. These segments are queued in the reassembly buffer 1040. The queue number associated with each segment is inferred from information in its header such as the priority level and address of the line card port where it is to be delivered. The queue and reassembly state manager 1050 maintains the status of each packet (e.g., complete, partial). The status is updated when a segment is queued or de-queued. In addition, the queue and reassembly state manager 1050 monitors the EOP bit in each segment in order to determine when complete packets are available. The local egress scheduler 1060 is responsible for making decisions to de-queue packets from the reassembly buffer 1040. A queue is eligible for de-queuing if it has at least one full packet. The scheduler 1060 selects the queue for de-queuing based on a service discipline such as round robin or strict priority. The de-queued segments are then reassembled into the original packet by the reassembly unit 1070 and forwarded to the line card.
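A minimal sketch of the reassembly-buffer bookkeeping described above follows, using the EOP bit to mark packet completion; the structure and names are assumptions for the example:

```python
# Hedged sketch of reassembly: segments are queued per reassembly queue,
# and a queue becomes eligible for de-queuing once an EOP-marked segment
# completes a packet.

from collections import defaultdict

class ReassemblyBuffer:
    def __init__(self):
        self.queues = defaultdict(list)           # queue id -> pending segments
        self.complete_packets = defaultdict(list)  # queue id -> full packets

    def enqueue(self, queue_id: int, payload: bytes, eop: bool) -> None:
        self.queues[queue_id].append(payload)
        if eop:                                   # EOP marks the last segment
            packet = b"".join(self.queues.pop(queue_id))
            self.complete_packets[queue_id].append(packet)

    def dequeue(self, queue_id: int) -> bytes | None:
        """Only queues holding at least one full packet are eligible."""
        pkts = self.complete_packets[queue_id]
        return pkts.pop(0) if pkts else None

buf = ReassemblyBuffer()
buf.enqueue(5, b"hel", eop=False)
buf.enqueue(5, b"lo", eop=True)
print(buf.dequeue(5))  # b'hello'
```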
Configuring a switch fabric includes communicating scheduler requests from the ingress modules to the fabric scheduler, the scheduler's computation of a schedule (crossbar setting), communicating the results in the form of grants to the ingress and egress interface modules, and configuring the crossbar planes to correspond to the computed schedule. In a large switch fabric with several fabric ports, the ingress and egress fabric interface modules may be distributed over several line cards, and the crossbar data paths may consist of several switching planes located over multiple cards. Configuring a large switch fabric (large number of inputs and outputs) may take several clock cycles. Thus, the overheads associated with communicating requests, schedule computation, communicating grants, and crossbar configuration can be significant. No data can be transmitted until these operations are completed, so a large amount of switch bandwidth can potentially be lost.
Each stage occurs during a frame period (the basic time unit for system operation). Therefore, each pipeline stage is one frame period. As illustrated, during a first frame period, t0, a request is sent from the ingress modules to the scheduler. During a second frame period, t1, the scheduler generates a schedule based on the request from the first frame period. In addition, new requests are sent to the scheduler from the ingress modules. That is, two tasks are being performed during the second frame period. During a third frame period, t2, the crossbar is being configured in response to the schedule generated in the second frame period, the scheduler is generating a schedule for the requests from the second frame period, and additional requests are being sent. That is, three tasks are being performed during this frame period. During a fourth frame period, t3, the data is being transmitted across the crossbar using the configuration established in the third frame period, the crossbar is being configured in response to the schedule generated in the third frame period, the scheduler is generating a schedule for the requests from the third frame period, and additional requests are being sent. That is, four tasks are being performed during the same frame period.
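The pipeline can be visualized with a short trace, as sketched below; the stage names follow the description above:

```python
# Illustrative trace of the four-stage pipeline: request, schedule,
# configure, and transfer each occupy one frame period, with a new
# request entering the pipeline every period.

STAGES = ["request", "schedule", "configure", "transfer"]

def pipeline_trace(num_periods: int) -> None:
    for t in range(num_periods):
        # The request issued at period (t - s) is in stage s at period t.
        active = [f"{STAGES[s]}(req t{t - s})"
                  for s in range(len(STAGES)) if t - s >= 0]
        print(f"t{t}: " + ", ".join(active))

pipeline_trace(4)
# t0: request(req t0)
# t1: request(req t1), schedule(req t0)
# t2: request(req t2), schedule(req t1), configure(req t0)
# t3: request(req t3), schedule(req t2), configure(req t1), transfer(req t0)
```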
Although the various embodiments have been illustrated by reference to specific embodiments, it will be apparent that various changes and modifications may be made. Reference to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of the phrase “in one embodiment” appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
Different implementations may feature different combinations of hardware, firmware, and/or software. For example, some implementations feature computer program products disposed on computer-readable media. The programs include instructions for causing processors to perform techniques described above.
The various embodiments are intended to be protected broadly within the spirit and scope of the appended claims.