Routers are key components in computer networks, such as the Internet. As network applications have demanded increased router performance, router architecture has evolved from using a shared backplane to a switched backplane. At present, many routers are based on a single stage crossbar architecture. However, single stage crossbar routers typically require complex schedulers and have a complexity of O(N²) that grows quadratically as a function of network size N. Accordingly, single stage crossbar routers practically support only a limited number of interconnections and may not be scalable for next generation networks, which may incorporate the Internet, telecommunication services (such as Skype), and TV services (such as IPTV).
One alternative to the single stage crossbar architecture is the family of delta class networks, such as multistage interconnected networks (“MINs”) including Banyan, Omega, Baseline, reverse Baseline, and indirect binary n-cube networks. Delta class networks may be attractive for use in high speed switching fabrics due to their simple self-routing and their complexity of O(N×log₂N). However, delta class networks can suffer from internal blocking, resulting in poor throughput under some traffic patterns. Blocking occurs because delta class networks are not permutation networks; that is, a delta class network cannot implement all input permutations with a single copy (log₂N stages) of such a network.
Another alternative to the single stage crossbar architecture is output-queued multistage interconnected networks with b×2b switching elements. The number of cells that can be concurrently switched from the inlets to each output queue equals the number of stages in the interconnection network. However, more stages are required to achieve higher throughput, which can increase both latency and hardware cost.
There is a need for scalable high performance routers and switches to provide a larger number of ports, higher throughput, and good reliability.
In an embodiment, an interleaved multistage switching fabric includes Y multistage switching fabric panels, where Y is an integer greater than one. Each panel has primary inputs for receiving cells to be routed, local outputs for outputting routed cells, primary outputs for outputting non-routed cells, and reentry points for introducing non-routed cells into the panel. The switching fabric additionally includes at least one demultiplexer subsystem communicatively coupled to primary inputs of each panel, for interfacing the switching fabric with input lines. The switching fabric further includes at least one multiplexer subsystem communicatively coupled to local outputs of each panel, for interfacing the switching fabric with destination queues. The switching fabric additionally includes Y recirculation connections, where each recirculation connection communicatively couples primary outputs of one panel to reentry points of another panel.
In an embodiment, an interleaved multistage switching fabric includes Y multistage switching fabric panels, where Y is an integer greater than one. Each panel has an I-Cubeout architecture and a plurality of stages of switching elements. Adjacent stages are interconnected according to an indirect n-cube connecting pattern. Each panel has primary outputs for outputting non-routed cells and reentry points for introducing non-routed cells into the panel. The switching fabric additionally includes Y recirculation connections, where each recirculation connection communicatively couples primary outputs of one panel to reentry points of another panel.
In an embodiment, a method for recirculating cells that fail to route in a switching fabric includes receiving a first cell that failed to route from a primary output of a first multistage switching fabric panel. The first cell is introduced into a reentry point of a second multistage switching fabric panel.
Specific instances of an item may be referred to by use of a numeral in parentheses (e.g., switching element 102(1)) while numerals without parentheses refer to any such item (e.g., switching elements 102).
Panel 100 includes at least n stages of switching elements 102, where n = log₂N. Each stage has N/2 switching elements 102. Panel 100 has at least one full copy of the n stages. Additional full or partial copies of the n stages can be added to improve performance. For example, panel 100 is illustrated in
Each switching element 102 has dimensions of b×2b, meaning that the switching element has b inputs 104 and 2b outputs (including b local outputs 108 and b remote outputs 112). Only some switching elements 102, inputs 104, and outputs 108, 112 are labeled in
Let L be the index of a line, expressed in binary notation as follows (where n = log₂N):
$L = 2^{n-1}l_1 + 2^{n-2}l_2 + \cdots + 2l_{n-1} + l_n$ (1)
Accordingly, the indices of the two lines incident on either side of a switching element 102 at stage k differ only in l_k. Specifically, the two indices of the lines incident on any switching element 102 in a given stage differ by a constant: those in stage 1 differ by N/2, those in stage 2 differ by N/4, and so on.
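For a concrete sense of equation (1), consider N = 8, so n = 3:

```latex
% For N = 8 (n = 3), a line with binary index l_1 l_2 l_3 = 101 has
L = 2^{2}\,l_1 + 2^{1}\,l_2 + 2^{0}\,l_3 = 4\cdot 1 + 2\cdot 0 + 1\cdot 1 = 5.
% The two lines incident on a stage-1 switching element differ only in l_1,
% e.g., lines 001 (index 1) and 101 (index 5), whose indices differ by N/2 = 4.
```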
An ICO network, such as panel 100, self-routes slightly differently than a normal delta class network. In particular, a routing tag is generated for each cell entering panel 100 at primary inputs 106 by taking the bit-wise XOR of the cell's local primary input address and destination address. Subsequently, as the cell travels through panel 100, its path through a given switching element 102 is a function of the tag bit corresponding to the switching element's stage. If the tag bit corresponding to a stage is 1, the cell takes the “cross” state of that stage's switching element 102, and the tag bit, once corrected, is set to 0. If the tag bit corresponding to a stage is 0, the cell simply passes straight through the switching element 102 of that stage. When all tag bits become 0, the cell has reached its destined row and may take the local output 108 at the switching element to its appropriate destination queue 110, through which an output line card (“LC”) may be connected.
A cell's routing tag is cyclically rotated leftward by log₂b bits after the cell advances to the next stage. Only the leftmost tag bits are examined at each switching element 102. This characteristic unifies switching element design and eliminates the need to correlate each switching element's design with its stage number.
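The tag generation, per-stage examination, and leftward rotation may be sketched as follows; the function names are hypothetical, and 2×4 switching elements (b = 2, so one bit examined per stage) are assumed:

```python
def make_tag(input_addr: int, dest_addr: int) -> int:
    """Routing tag: bit-wise XOR of the cell's primary input address and
    its destination address; nonzero bits mark the dimensions to correct."""
    return input_addr ^ dest_addr

def step(tag: int, n: int, b: int = 2):
    """One stage of self-routing for an n-bit tag with b x 2b elements.

    Returns (action, new_tag): 'local' once every bit is corrected,
    'cross' when the examined leftmost bit is 1 (the bit is cleared),
    or 'straight' when it is 0.  The tag is then cyclically rotated
    leftward by log2(b) bits, so each element examines only the
    leftmost tag bit(s) regardless of its stage number."""
    if tag == 0:
        return "local", tag                  # destined row reached
    k = b.bit_length() - 1                   # log2(b) bits examined per stage
    msb = (tag >> (n - k)) & ((1 << k) - 1)
    action = "cross" if msb else "straight"
    tag &= ~(msb << (n - k))                 # correct (zero) the examined bit(s)
    mask = (1 << n) - 1
    tag = ((tag << k) | (tag >> (n - k))) & mask   # rotate left by log2(b)
    return action, tag

# Cell entering at row 2 (010) destined for row 7 (111) in an N = 8 panel:
tag = make_tag(0b010, 0b111)                 # tag = 101
while True:
    action, tag = step(tag, n=3)
    print(action)                            # cross, straight, cross, local
    if action == "local":
        break
```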
Each switching element 102 includes a local scheduler that follows the shortest path algorithm. A cell's distance to its appropriate destination queue is defined as the rightmost nonzero bit position q of its tag, where the cell needs to travel at least q stages before reaching its destination queue. If two conflicting cells arrive at a given switching element 102 (e.g., one cell is to pass crosswise through the switching element and the other cell is to pass straight through the switching element), the local scheduler gives priority to the cell with a smaller distance and deflects the cell with a larger distance. Such operation helps minimize the number of cells in the switching fabric and thereby helps improve system performance. If the two conflicting cells have an identical distance, one of the cells is, for example, randomly chosen to receive priority.
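The local shortest-path arbitration may be sketched along the same lines (hypothetical names; the distance is computed from the cell's current tag as defined above):

```python
import random
from dataclasses import dataclass

@dataclass
class Cell:
    tag: int   # current (rotated) routing tag; nonzero while contending

def distance(cell: Cell, n: int) -> int:
    """Distance = position (1..n, counted from the left) of the rightmost
    nonzero tag bit: the cell must travel at least that many more stages
    before it can exit to its destination queue."""
    low = (cell.tag & -cell.tag).bit_length() - 1   # rightmost set bit, from the right
    return n - low                                   # converted to a left-based position

def arbitrate(a: Cell, b: Cell, n: int):
    """Resolve two cells contending for the same output of a switching
    element: the closer cell wins the contested output and the other is
    deflected; ties are broken randomly, per the text."""
    da, db = distance(a, n), distance(b, n)
    if da != db:
        return (a, b) if da < db else (b, a)         # (winner, deflected)
    return (a, b) if random.random() < 0.5 else (b, a)

winner, deflected = arbitrate(Cell(0b100), Cell(0b011), n=3)  # 0b100 wins (distance 1 vs. 3)
```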
One routing example is shown in
In multistage switching fabrics, such as panels 100 and 200 of
However, an ICO network's efficiency and performance can be increased by reintroducing cells that failed to route back into the network. Such reintroduction of cells may be referred to as recirculation and gives cells that failed to route at least one more opportunity to reach their destination queues before being dropped.
In an ICO network supporting recirculation, a cell that failed to route is reintroduced to the switching fabric at a reentry point. Various recirculation methods have been proposed, including the static connection recirculation method, the first available point (“FA”) recirculation method, and the first 1 bit in routing tag (“FO”) recirculation method. Such recirculation methods specify how a cell is reintroduced into the switching fabric. In the static connection recirculation method, cells are always reintroduced at the same reentry points. In contrast, in the FA and FO recirculation methods, cells may be reintroduced at one of a number of reentry points (e.g., the first available reentry point in the case of the FA method). Examples of recirculation methods may be found in Tzeng.
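For illustration, the three methods might select reentry points as in the following sketch; the helper names are hypothetical, and the FO rule's interpretation (keying reentry to the leftmost 1 bit of the tag) is an assumption rather than a detail taken from Tzeng:

```python
def static_reentry(points):
    """Static connection: always the same, fixed reentry point."""
    return points[0]

def fa_reentry(points, busy):
    """First available (FA): the first candidate point not currently
    occupied, or None if every point is busy this cycle."""
    for p in points:
        if not busy[p]:
            return p
    return None

def fo_reentry(points, tag, n):
    """First 1 bit in routing tag (FO): reenter at the point aligned with
    the leftmost nonzero tag bit, so no stages are spent passing straight
    through before the first correction (interpretation assumed here)."""
    first_one = n - tag.bit_length() + 1   # 1-based position from the left
    for p in points:
        if p >= first_one:
            return p
    return None
```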
In panel 300, a respective recirculating subsystem 316 is communicatively coupled to each primary output 314 to recirculate cells that reached the primary output back into switching fabric 300. However, only a single recirculating subsystem 316(1) is shown in
In some embodiments of panel 300, each recirculating subsystem 316 recirculates cells back into switching fabric 300 at reentry points 319 in the same logical row as the primary output 314 that the recirculating subsystem is connected to. For example, recirculation subsystem 316(1) recirculates a cell that reaches primary output 314(2) back into switching fabric 300 via reentry point 319(1) through multiplexer 318(1), via reentry point 319(2) through multiplexer 318(2), or via reentry point 319(3) through multiplexer 318(3).
Some embodiments of panel 300 support one of the FA or FO recirculation methods. In such embodiments, each recirculating subsystem 316 includes a tag shifter 322 as well as gates 320. Tag shifter 322 shifts a cell's tag depending on which reentry point 319 the cell is to utilize, and gates 320 control which reentry point 319 the cell is to utilize. Other embodiments of panel 300 support a static connection recirculation method having only a single reentry point 319 for each recirculating subsystem 316. Such embodiments do not include tag shifters 322 or gates 320. It may be preferable to dispose reentry points 319 primarily or completely in the last copy of stages, such as illustrated in
Each panel 401 includes primary inputs 430 for receiving cells to be routed to a destination queue, local outputs 432 for outputting routed cells to destination queues, primary outputs 436 for outputting cells that failed to route, and reentry points 434 for introducing non-routed cells into the panel. A respective recirculation connection 426 communicatively couples each panel's primary outputs 436 to another panel's reentry points 434. Recirculation connections 426, for example, collectively couple each of Y panels 401 in a loop, as shown in
In multistage switching fabric panel 300 of
Panels 401 are, for example, based on an I-Cubeout (“ICO”) network that supports recirculation, where adjacent stages are interconnected according to an indirect n-cube connecting pattern, and a recirculation method such as the static connection recirculation method, the FA recirculation method, or the FO recirculation method is supported. For example, each panel 401 may be an embodiment of panel 300 of
Switching fabric 400 further includes at least one demultiplexer subsystem 422, which is communicatively coupled to each panel 401's primary inputs 430, for interfacing switching fabric 400 with input lines 438. In particular, demultiplexer subsystem 422 receives cells from input lines 438 and distributes the cells among the Y panels 401. Demultiplexer subsystem 422, for example, distributes the cells among panels 401 sequentially; that is, a cell is distributed to one panel 401, a subsequent cell is distributed to a next panel 401, and so on. As another example, demultiplexer subsystem 422 may distribute cells to panels based on each panel's load (e.g., to a panel 401 having the smallest load), or based on cell destination address (e.g., to a panel 401 best able to route the cell to its proper destination queue). In some embodiments, demultiplexer subsystem 422 includes N individual demultiplexers, where N is switching fabric 400's network size.
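The three distribution policies may be sketched as follows (hypothetical helpers; the destination-based choice is reduced to a simple hash as a stand-in for choosing the panel best able to route the cell):

```python
from itertools import count

def round_robin(Y: int):
    """Sequential distribution: one cell to each panel in turn
    (panels numbered 1..Y)."""
    for t in count():
        yield (t % Y) + 1

def least_loaded(loads: list) -> int:
    """Load-based distribution: the panel with the smallest current load."""
    return min(range(len(loads)), key=loads.__getitem__) + 1

def by_destination(dest_addr: int, Y: int) -> int:
    """Destination-based distribution (illustrative only): map the
    destination address onto a panel number."""
    return (dest_addr % Y) + 1

rr = round_robin(2)
assert [next(rr) for _ in range(4)] == [1, 2, 1, 2]
```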
Switching fabric 400 further includes at least one multiplexer subsystem 424, which is communicatively coupled to each panel 401's local outputs 432, for interfacing switching fabric 400 with destination queues, such as destination queues similar to those of
Some embodiments of switching fabric 400 are synchronized with a common clock signal. For example, all cells may move to a next stage or a local output under the control of one clock signal. As another example, in some embodiments, all cells from demultiplexer subsystem 422 enter panel ((t mod Y)+1) at clock cycle t, where the Y panels are numbered 1 to Y.
In some embodiments of switching fabric 400, a concentrator is located before each destination queue for terminating the cells from local outputs 432. Each concentrator has a speedup ξ, where speedup is the ratio of the concentrator's speed to the speed of the links interfacing with the concentrator. Accordingly, each concentrator is operable to receive up to ξ cells in one clock cycle, typically accepting cells from the rightmost active stage to the leftmost, regardless of which panel 401 a given stage is part of.
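A minimal sketch of such a concentrator, assuming its arrivals are already ordered from the rightmost active stage to the leftmost:

```python
def concentrate(arrivals: list, xi: int, queue: list) -> list:
    """Terminate up to xi cells into the destination queue in one clock
    cycle (speedup xi); cells beyond the speedup are returned unserved."""
    queue.extend(arrivals[:xi])
    return arrivals[xi:]

q = []
leftover = concentrate(["c1", "c2", "c3"], xi=2, queue=q)  # q == ["c1", "c2"], leftover == ["c3"]
```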
It may be advantageous to design switching fabric 400's constituent switching elements to process only fixed-length cells in order to simplify hardware design and increase switching element processing efficiency. In such case, variable length packets can be handled by segmenting them into fixed-size cells before inputting them into switching fabric 400 and by subsequently reassembling the cells after routing by switching fabric 400. One example of a mechanism to keep track of cells in transmission for resequencing at their destinations is disclosed in Tzeng.
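Segmentation and reassembly might proceed along these lines (a sketch with an assumed 64-byte cell payload; the resequencing bookkeeping of Tzeng is not reproduced):

```python
CELL_PAYLOAD = 64   # assumed fixed cell payload size, in bytes

def segment(packet: bytes, pkt_id: int):
    """Split a variable-length packet into fixed-size cells, padding the
    last one; each cell carries (pkt_id, seq, total) so the destination
    can reassemble and resequence the packet."""
    chunks = [packet[i:i + CELL_PAYLOAD]
              for i in range(0, len(packet), CELL_PAYLOAD)] or [b""]
    return [(pkt_id, seq, len(chunks), c.ljust(CELL_PAYLOAD, b"\0"))
            for seq, c in enumerate(chunks)]

def reassemble(cells, original_length: int) -> bytes:
    """Reorder the cells of one packet by sequence number and strip
    the padding from the final cell."""
    cells = sorted(cells, key=lambda cell: cell[1])
    return b"".join(cell[3] for cell in cells)[:original_length]

pkt = b"x" * 150
assert reassemble(segment(pkt, pkt_id=7), len(pkt)) == pkt
```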
If traffic is assumed to be evenly loaded between panels 401, each panel runs at load (p/Y), where p is the load offered from outside. Accordingly, as the number of panels is increased from Y to Y+1, the difference in each panel's load is given by p/(Y²+Y). Thus, the difference in each panel's load decreases quickly as Y increases, and Y=2 may represent a good compromise between throughput and hardware cost. However, increasing Y typically increases the fault tolerance of switching fabric 400. Thus, if high fault tolerance is desired, it may be desirable for Y to be at least 3.
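The stated load difference follows directly:

```latex
\frac{p}{Y} - \frac{p}{Y+1}
  = \frac{p(Y+1) - pY}{Y(Y+1)}
  = \frac{p}{Y^{2}+Y}
% e.g., at full load p = 1, going from Y = 2 to Y = 3 panels lowers the
% per-panel load by only 1/6 (from 0.50 to about 0.33).
```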
Switching fabric 500 includes a respective recirculation connection 526 for each panel 501. In particular, each recirculation connection 526 couples primary outputs 536 of one panel 501 to reentry points 534 of another panel 501. However, it should be noted that only parts of each recirculation connection 526(1), 526(2) are shown in
In this document, the notation SX/PY may be used, where X specifies the number of stages per panel and Y specifies the number of panels. For example, switching fabric 500, which is shown having two panels 501 and four stages per panel, could be described as an S4/P2 switching fabric.
Method 600 optionally includes additional steps 606 and 608 that are executed if the first cell failed to route in the second panel. In step 606, the first cell is received from a primary output of the second panel, and in step 608, the first cell is introduced into a reentry point of a third panel. An example of step 606 is receiving the first cell from primary outputs 436(2) of panel 401(2), and an example of step 608 is introducing the first cell into panel 401(3) via recirculation connection 426(2) and reentry points 434(3).
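The resulting recirculation chain may be sketched as follows (the route/reenter interface on each panel object is hypothetical):

```python
def recirculate(cell, panels, start: int, max_hops: int) -> bool:
    """Route a cell, handing it from each panel's primary outputs to the
    next panel's reentry points until it routes or exhausts its hops."""
    Y = len(panels)
    i = start
    for _ in range(max_hops + 1):
        if panels[i].route(cell):     # hypothetical: True if a local output is reached
            return True
        i = (i + 1) % Y               # follow the recirculation connection
        panels[i].reenter(cell)       # introduce the cell at a reentry point
    return False                      # cell is dropped after its last chance
```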
Switching fabric 400 may offer advantages over other switching fabrics. First, switching fabric 400 has a relatively low complexity, thereby promoting scalability. For example, the number of switching elements increases linearly with network size in switching fabric 400. In contrast, the number of elements in the single stage crossbar architecture increases quadratically with network size.
Additionally, although switching fabric 400 typically has a relatively low complexity, it nevertheless may exhibit better performance than some prior art switching fabrics. For example, an S6/P2 embodiment of switching fabric 400 is expected to exhibit around a third of the latency of an S12/P1 switching fabric with only a negligible increase in drop rate. As another example, an S8/P2 embodiment of switching fabric 400 is expected to outperform an S12/P1 single panel switching fabric under any traffic pattern. It should be noted that the S8/P2 switching fabric requires only a small increase in hardware resources over the S12/P1 switching fabric.
One reason that switching fabric 400 may exhibit better performance than some prior art switching fabrics is that switching fabric 400's multiple panels 401 and recirculation connections 426 help balance cell traffic and ease hot flows after collisions. Additionally, some embodiments of switching fabric 400 also advantageously avoid the non-linear performance degradation that commonly occurs in single multistage switching fabric panels during heavy load conditions (0.5 ≤ p ≤ 1) by distributing the load among panels 401. For example, as previously noted, if traffic is assumed to be evenly loaded between panels 401, each panel runs at load (p/Y), where p is the load from outside. The total cell drop rate (and likewise the throughput) is the sum of the contributions from each panel. For example, when Y=2 and p=1.0, each panel runs at load 0.5, and the total cell drop rate is 2× the cell drop rate at load 0.5 for a single panel. When Y=3 and p=0.9, each panel runs at load 0.3, and the total cell drop rate is 3× the cell drop rate at load 0.3 for a single panel. Accordingly, in some embodiments of switching fabric 400, each panel typically operates with a load of less than or equal to 0.5, thereby avoiding the non-linear performance degradation associated with heavy loads.
Simulations have shown that some embodiments of switching fabric 400 outperform single multistage switching fabric panels having the same length, or number of stages, under different traffic settings. For example, simulations have shown that embodiments of switching fabric 400 achieve speedups (over a single multistage switching fabric panel) of 3.4 and 2.25 under uniform and hot-spot traffic patterns, respectively, at maximum load (p=1). As another example, simulations under different traffic patterns have shown that some embodiments of switching fabric 400 are far less prone than single panel switching fabrics to suffer a mean latency increase due to load congestion; in single multistage switching fabric panels, mean latency deteriorates considerably as the load increases. Additional simulation results are discussed below.
Furthermore, switching fabric 400 may show significant tolerance against internal hardware failures. In some embodiments of switching fabric 400, a faulty switching element is treated in a similar fashion to a hot congestion area; that is, traffic is deflected away from the faulty switching element. The parallel architecture of switching fabric 400 may also advantageously provide inherent redundancy and associated fault tolerance. For example, under a worst case scenario where one panel 401 in a two panel embodiment of switching fabric 400 is non-functional, the remaining functional panel 401 will permit the switching fabric to continue operating with some performance degradation. In contrast, in the case of a failure of the panel of an S12/P1 system, the entire system may malfunction. Simulations under different fault models have revealed that some embodiments of switching fabric 400 are highly fault tolerant against internal hardware failures compared to single multistage switching fabric panels.
Moreover, some embodiments of switching fabric 400 can be utilized as a Redundant Array of Independent Fabrics (“RAIF”), analogous to a Redundant Array of Independent (or Inexpensive) Disks (“RAID”). Such RAIF systems, for example, are scalable and exhibit high performance and reliability. In particular, some RAIF systems will continue to operate even after one panel 401 fails. For example, an S6/P1 system can be upgraded to form an embodiment of switching fabric 400 by adding additional multistage switching fabric panels. The system could then function as a RAIF 0 system (similar to RAID level 0), where panels 401 operate in parallel. As another example, an embodiment of switching fabric 400 could function as a RAIF 1 system (similar to RAID level 1), where one panel of the RAIF 1 system operates in a standby mode when there is no fault and replaces a malfunctioning panel when there is a fault. By having a panel operate in a standby mode when there is no fault, power consumption during normal operation may be reduced. As yet another example, the RAIF 0 and 1 approaches could be combined to form a RAIF 2 system with Y panels (Y>2), where Y−1 panels operate in parallel and one panel operates in a standby mode for fault tolerance.
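The three RAIF modes could be captured as a simple role assignment (a sketch under assumed semantics, not an implementation from the text):

```python
from enum import Enum

class Role(Enum):
    ACTIVE = "active"     # panel carries traffic in parallel
    STANDBY = "standby"   # panel idles (saving power) until a fault occurs
    FAULTY = "faulty"     # panel taken out of service

def raif_roles(level: int, Y: int):
    """RAIF 0: all Y panels active; RAIF 1: one hot-spare standby panel;
    RAIF 2 (Y > 2): Y - 1 active panels plus one standby spare."""
    if level == 0:
        return [Role.ACTIVE] * Y
    if level == 1 or (level == 2 and Y > 2):
        return [Role.ACTIVE] * (Y - 1) + [Role.STANDBY]
    raise ValueError("unsupported RAIF level / panel count")

def on_panel_fault(roles, faulty_index: int):
    """Replace a faulty panel with the standby spare, if one exists."""
    roles[faulty_index] = Role.FAULTY
    if Role.STANDBY in roles:
        roles[roles.index(Role.STANDBY)] = Role.ACTIVE
```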
Some embodiments of switching fabric 400 as well as single multistage switching fabric panels were simulated. System metrics that were observed during the simulations include cell drop rate and mean latency. Cell drop rate is equal to one minus throughput, and mean latency is the average time a cell takes to cross a switching fabric. The simulations were generally conducted under the following conditions:
The S6/P2 system had a slight increase in drop rate of 0.013% while the S12/P1 system (curve 808) had essentially no increase in drop rate. However, as shown in
Traffic through switching fabrics is usually nonuniform. In particular, there are generally some hot spots on the network, due, for example, to connections with file servers, connections to popular web sites, and/or an uplink to a backbone network. In the simulations of
Although the S6/P2 system and the S12/P1 system (curve 1008) each have the same number of switching elements, the S6/P2 system had a drop rate of 10.3% while the S12/P1 system had a slightly lower drop rate of 9.8% at p=1.0. However, the S12/P1 system had a latency of about 22.4 cycles while the S6/P2 system had a latency of about 7.8 cycles at load p=1.0, as shown in
Possible points of failure in multistage switching fabrics include link failures and switching element failures. Switching element hardware is typically more complex than link connections and therefore more prone to fail than link connections. An internal link fault could also be modeled as a switching element fault because a faulty link renders the following switching element inaccessible. Accordingly, a switching element fault model was used to evaluate fault tolerance in the following simulations.
In the simulated single fault model, a single switching element is faulty and will not accept any cells. Thus, cells that need to pass through the faulty switching element are deflected in prior stages. If the faulty switching element is located in a last copy of stages, recirculated cells need to skip over this faulty point if the FA recirculation method is used. In the following simulations, the faulty switching element is assumed to be located in row 31 of various stages.
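The deflection behavior of the fault model may be sketched as follows (hypothetical; switching-element positions are simplified to (stage, row) pairs at line-row granularity):

```python
def cube_step(stage: int, row: int, cross: bool, n: int) -> int:
    """Indirect n-cube step: the 'cross' state at stage k (1-based)
    flips bit l_k of the row index; 'straight' leaves the row unchanged."""
    return row ^ (1 << (n - stage)) if cross else row

def choose_output(stage: int, row: int, want_cross: bool, faulty, n: int) -> int:
    """Deflect a cell in the stage before the fault: if the preferred
    output feeds the faulty element, take the other output instead."""
    preferred = cube_step(stage, row, want_cross, n)
    if (stage + 1, preferred) == faulty:
        return cube_step(stage, row, not want_cross, n)   # deflection
    return preferred

# A cell at (stage 2, row 31) wanting 'straight' is deflected if the
# straight output leads into a fault at (stage 3, row 31):
assert choose_output(2, 31, False, faulty=(3, 31), n=6) == 31 ^ (1 << 4)
```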
A fault's effect will depend on the faulty switching element's location due to hot-spot traffic saturating parts of switching fabrics along the traffic's path. Accordingly, the following fault tolerance simulations assume uniform traffic.
1. Single Multistage Switching Fabric Panel v. Interleaved Switching Fabrics with Y=2:
Fault location was shown to affect degradation of the S6/P1 system, particularly with respect to drop rate. As shown in
A fault in the S6/P2 system had a negligible impact on performance regardless of the stage in which the fault occurred. Accordingly, some embodiments of switching fabric 400 not only perform better than single multistage switching fabric panels, but also tolerate a single hardware failure. For example, although a fault in a first stage of the S6/P1 system is effectively fatal, cells have a chance to divert around a failed first stage in the S6/P2 system.
The multiple fault model is considerably more complicated than the single fault model because of the numerous possible combinations of fault counts and fault locations. However, faults in the same stage of a plurality of copies of stages will generate a switching bottleneck and cause performance to deteriorate significantly. The simulations of
In the S6/P2 systems, stages 2 and 6 (a total of 4 affected stages) correct the second position of the tag bits. Accordingly, it is reasonable to expect that one or more redundant stages in S6/P2[(S2+S6)/P1] and S6/P2[(S2+S6)/P1+S2/P2] systems will compensate for the faults with slight degradations in latency and drop rate. Stage 3 of the S6/P2 systems (a total of 2 affected stages) corrects the third position of the tag bits. If both instances of stage 3 fail, inferior performance is expected. However, it should be noted that the S6/P2[S3/(P1+P2)] and S6/P2[(S2+S6)/(P1+P2)] systems exhibited a negligible increase in drop rate, that is, 0.086% as opposed to 0.013% for the corresponding fault-free systems.
The traffic in each panel of the S6/P2 systems is roughly half that of the single multistage switching fabric panels. Additionally, the S6/P2 systems may have double the FA recirculation points of the single multistage switching fabric panels. Such characteristics of S6/P2 systems broaden the switching path and mitigate collisions. Though correcting deflected cells in S6/P2[S3/(P1+P2)] systems (curve 1704) increases latency to 6.6 cycles from the 4.7 cycles of a corresponding fault-free system (curve 1706), as shown in
Furthermore, simulations of extreme cases were performed where the number of faults was doubled to 8 faults in specific stages (rows 28 to 35).
Changes may be made in the above methods and systems without departing from the scope hereof. It should thus be noted that the matter contained in the above description and shown in the accompanying drawings should be interpreted as illustrative and not in a limiting sense. The following claims are intended to cover generic and specific features described herein, as well as all statements of the scope of the present method and system, which, as a matter of language, might be said to fall therebetween.
This application claims benefit of priority to U.S. Provisional Patent Application Ser. No. 60/990,144, filed 26 Nov. 2007, which is incorporated herein by reference.
Number | Date | Country
---|---|---
60990144 | Nov 2007 | US