SCHEDULING METHOD FOR INPUTBUFFER SWITCH ARCHITECTURE

Information

  • Patent Application Publication Number: 20030081548
  • Date Filed: April 09, 1998
  • Date Published: May 01, 2003
Abstract
A method of scheduling in a switch for transferring data by information units is provided where the scheduling decisions are performed from the destination node point of view, considering the demand of all the source nodes to reach this destination node. This algorithm also allows an improvement in performance, from a traffic point of view, of a rotator switch, since the algorithm is much fairer than the known source-based scheduling algorithm in sharing the bandwidth amongst the contending source nodes for a given destination node. Embodiments of the invention are extended to support class of service, including minimum bandwidth guarantee. Further embodiments are provided that support age-groups to further increase the performance of a rotator switch fabric with respect to traffic. In still further embodiments the algorithm is extended in a load-shared architecture to make it fault tolerant.
Description


FIELD OF THE INVENTION

[0001] The present invention relates to scheduling algorithms, and their implementations, for routing data in information units through an input-buffer switch architecture having an internally non-blocking switch fabric. The present invention is particularly concerned with scheduling algorithms for rotator switch architectures, yet can be used as well for demand-driven space switch architectures.



RELATED APPLICATIONS

[0002] The present invention is related to copending application entitled “ROTATOR SWITCH DATA PATH STRUCTURES” filed on the same day with the same inventors and assignee as the present invention, and the entire specification thereof is incorporated by reference herein.



BACKGROUND OF THE INVENTION

[0003] The present invention concerns the scheduling of ATM cells, or more generally, the scheduling of any fixed-size Information Unit (IU), to be routed through a switch fabric of an input-buffer switch (in particular, an ATM input-buffer switch).


[0004] An input-buffer switch is composed of a set of N Ingress nodes, a switch fabric, and a set of N Egress nodes. In the following, the Ingress nodes and Egress nodes are named source nodes and destination nodes, respectively. The basic characteristic of this architecture is that IUs are queued in the source nodes before being routed via the switch fabric to the destination nodes.


[0005] The present application considers a switch fabric architecture that is internally non-blocking; that is, a switch fabric architecture supporting all the possible one-to-one connection mappings between the source nodes and the destination nodes. Each one-to-one connection mapping supports a connection between each source node and a distinct destination node, or equally between each destination node and a distinct source node. There are N! possible one-to-one connection mappings for the case of a switch fabric with N source nodes and N destination nodes.


[0006] The capacity of all connections of each one-to-one connection mapping is the same. That capacity is either the same as the capacity of the source node (or equally the same as the capacity of the destination node), or slightly higher than the capacity of the destination node. We suppose, however, that the capacity of the connection is less than N times the capacity of the destination nodes, otherwise no input buffer would be needed at the source nodes, and the architecture would be logically equivalent to an output-buffer switch architecture.


[0007] Since the aggregate capacity at which IUs can arrive at the source nodes for the same destination node can be much higher than the supported connection capacity of the switch fabric, input buffers are required at the source nodes in order to queue IUs when there is output contention at a destination node.


[0008] An algorithm is thus needed to decide the sequence of one-to-one connection mappings of the switch fabric, or equally, to inform each source node about the destination node it is currently connected with and thus for which it can send IUs through the switch fabric. That algorithm is named the scheduling algorithm, since it schedules the flow of IUs from the source nodes to the destination nodes.


[0009] A particular implementation of the switch fabric is a demand-driven space switch architecture. For each one-to-one connection mapping, a demand-driven space switch supports at the same time all the connections of the mapping.


[0010] Another particular implementation of the switch fabric is a rotator space switch architecture in which all connections of a one-to-one mapping are established one after the other, following a rotation principle. The rotator architecture is logically composed of many small demand-driven space switches, named tandem nodes, each permitting at a given time a one-to-one connection mapping between a set of source nodes and a set of destination nodes. A tandem node is connected with all the source nodes following a rotation scheme and, similarly, with all the destination nodes following a rotation scheme as well. Each tandem node contains a fixed number of IU buffers in order to “transport” the IUs from the source nodes to the destination nodes. The rotator switch architecture was patented Dec. 1, 1992, in U.S. Pat. No. 5,168,492, by M. E. Beshai and E. A. Munter, and an improvement of the data paths thereof has been applied for in a copending patent application filed on the same day as the present application by the same inventors and having the same assignee.


[0011] A scheduling method, namely source-based scheduling (SBS), was included in the patent for the original rotator architecture by Beshai et al. In that method, the scheduling decisions are performed logically by each source node, without considering the queue status of the other source nodes. For each tandem node, each source node, one after the other, selects the destination node to which it will send an IU using that tandem node, and it thus seizes on that tandem node the IU buffer associated with the selected destination node. Hence, the destination node must be selected from those not already selected during the current rotation of the tandem node.


[0012] However, there is a problem of fairness related with that method. In the original proposal of the rotator architecture, the tandem IU buffers are emptied one after the other. That is, the tandem node frees its IU buffers in a fixed order, corresponding to the order in which it is connected with the destination nodes. The tandem node is connected with the source nodes following a fixed order as well. Hence, when a source node is considering a tandem node for transferring an IU to a given destination node, the probability of finding a free IU buffer associated with that destination node is not the same as for the other destination nodes; the more recently the IU buffer has been emptied, the more likely the source node will see a free IU buffer associated with the destination node. This means that under output contention for a destination node, the source node furthest from this destination node has the freedom to use as much as it wants of the bandwidth available to reach this destination node, while the source node closest to this destination node sees only the bandwidth not used by the preceding source nodes. Under severe output contention, the closest source node may never see available bandwidth to reach this destination node, while the furthest source node can reach the destination node as if there were no contention at all. This is unfair.



SUMMARY OF THE INVENTION

[0013] According to an aspect of the present invention there is provided a method of scheduling wherein the scheduling decisions are performed from the destination node point of view, considering the demand of all the source nodes to reach this destination node. This algorithm allows an improvement in performance, from a traffic point of view, of the rotator switch, since the algorithm is much fairer than the original SBS algorithm in sharing the bandwidth amongst the contending source nodes for a given destination node.


[0014] According to another aspect of the present invention there is provided in a switch for transferring information units and having a plurality of source nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node via a shared link to a desired destination node, said method comprising the steps of determining availability of a destination node, determining demand for connection from each source node to the destination node, determining availability of each source node, and selecting an available source node in dependence upon the availability of and demand for the destination node.


[0015] According to another aspect of the present invention there is provided in a switch for transferring information units and having a plurality of source nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node via a shared link to a desired destination node, said method comprising the steps of determining availability of a destination node, determining a class of traffic being scheduled, determining demand for connection from each source node to the destination node, determining availability of each source node, and selecting an available source node in dependence upon the availability of and demand for the destination node and the class of traffic.


[0016] According to another aspect of the present invention there is provided in a switch for transferring information units and having a plurality of source nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node via a shared link to a desired destination node, said method comprising the steps of determining availability of a destination node, determining age of traffic being scheduled, determining demand for connection from each source node to the destination node, determining availability of each source node, and selecting an available source node in dependence upon the availability of and demand for the destination node and age of traffic.


[0017] According to another aspect of the present invention there is provided in a rotator switch for transferring information units and having a plurality of source nodes, double-bank tandem nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node to a tandem node associated with a desired destination node, said method comprising the steps of determining availability of a tandem node associated with a destination node, determining demand for connection from each source node via the tandem node to the destination node, determining availability of each source node, and selecting an available source node in dependence upon the availability of the tandem node and demand for the destination node.


[0018] According to another aspect of the present invention there is provided in a switch for transferring information units and having a plurality of source nodes, double-bank tandem nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node to a tandem node associated with a desired destination node, said method comprising the steps of determining availability of a tandem node associated with a destination node, determining a class of traffic being scheduled, determining demand for connection from each source node via the tandem node to the destination node, determining availability of each source node, and selecting an available source node in dependence upon the availability of the tandem node, demand for the destination node and the class of traffic.


[0019] According to another aspect of the present invention there is provided in a switch for transferring information units and having a plurality of source nodes, double-bank tandem nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node to a tandem node associated with a desired destination node, said method comprising the steps of determining an age group of traffic being scheduled, determining demand for connection from each source node via the tandem node to the destination node, determining availability of each source node, determining availability of a tandem node associated with a destination node, and selecting a source node in dependence upon availability of the tandem node, demand for the destination node and the age group.


[0020] According to another aspect of the present invention there is provided in a rotator switch for transferring information units and having a plurality of source nodes, tandem nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node to a tandem node associated with a desired destination node, said method comprising the steps of determining availability of a tandem node associated with a destination node, determining demand for connection from each source node via the tandem node to the destination node, determining availability of each source node, and selecting an available source node in dependence upon the availability of the tandem node and demand for the destination node.


[0021] According to another aspect of the present invention there is provided in a rotator switch for transferring information units and having a plurality of source nodes, tandem nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node to a tandem node associated with a desired destination node, said method comprising the steps of determining availability of a tandem node associated with a destination node, determining a class of traffic being scheduled, determining demand for connection from each source node via the tandem node to the destination node, determining availability of each source node, and selecting an available source node in dependence upon the availability of the tandem node, demand for the destination node and the class of traffic.


[0022] According to another aspect of the present invention there is provided in a rotator switch for transferring information units and having a plurality of source nodes, tandem nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node to a tandem node associated with a desired destination node, said method comprising the steps of determining an age group of traffic being scheduled, determining demand for connection from each source node via the tandem node to the destination node, determining availability of each source node, determining availability of a tandem node associated with a destination node, and selecting a source node in dependence upon availability of the tandem node, demand for the destination node and the age group.


[0023] In embodiments of the invention, the algorithm is extended to support class of service, including minimum bandwidth guarantee. Further embodiments are provided that support age-groups to further increase the performance of a rotator switch fabric with respect to traffic. In still further embodiments the algorithm is extended in a load-shared architecture to make it fault tolerant. Further embodiments extend the algorithm to support the improvements of the rotator data-path architecture proposed in the copending application referenced hereinabove. A further embodiment applies the algorithm to a pure demand-driven space switch architecture. A further embodiment extends the algorithm to provide fault tolerance in the switch fabric.







BRIEF DESCRIPTION OF THE DRAWINGS

[0024] The present invention will be further understood from the following detailed description, with reference to the drawings in which:


[0025]
FIG. 1 illustrates a known rotator switch for transferring data in information units;


[0026]
FIG. 2 illustrates the data flow inside the known rotator switch of FIG. 1;


[0027]
FIG. 3 illustrates a circular representation of the known rotator switch of FIG. 1;


[0028]
FIG. 4 illustrates the functional structure of the destination-based scheduling algorithm for an input-buffer switch.


[0029]
FIG. 5 illustrates a distributed implementation of the destination-based scheduling algorithm in accordance with a second embodiment of the present invention for the known rotator switch of FIG. 1;


[0030]
FIG. 6 illustrates a centralised implementation of the destination-based scheduling algorithm in accordance with a third embodiment of the present invention for the known rotator switch of FIG. 1;


[0031]
FIG. 7 illustrates a partitioning for the centralised implementation of the destination-based scheduling algorithm of FIG. 6 in accordance with a fourth embodiment of the present invention for the known rotator switch of FIG. 1;


[0032]
FIG. 8 illustrates a load-sharing implementation of the destination-based scheduling algorithm in accordance with a fifth embodiment of the present invention for the known rotator switch of FIG. 1;


[0033]
FIG. 9 illustrates an extension of the known rotator switch of FIG. 1 using compound-tandem nodes;


[0034]
FIG. 10 illustrates an extension of the known rotator switch of FIG. 1 using parallel rotator slices;


[0035]
FIG. 11 illustrates an extension of the known rotator switch of FIG. 1 using both compound-tandem nodes and parallel rotator slices;


[0036]
FIG. 12 illustrates an extension of the known rotator switch of FIG. 1 using double-bank tandem nodes.







[0037] Abbreviations


[0038] DBS: Destination-Based Scheduler (or Scheduling)


[0039] GM: Grant Manager


[0040] IU: Information Unit (fixed size, e.g., 64 Byte)


[0041] RM: Request Manager


[0042] TDB: Tandem-Destination Buffer (IU size)


DETAILED DESCRIPTION

[0043] The principles of the Destination-Based Scheduling (DBS) algorithm are first described in the context of the known rotator switch architecture. Then, the algorithm is extended for various architectures, up to a pure demand-driven space switch architecture.


[0044] A. DBS Algorithm for the Known Rotator Switch Architecture


[0045] A.1 Basic DBS Algorithm Principles


[0046] Referring to FIG. 1 there is illustrated a 4-node configuration of the known rotator switch for transferring data in Information Units (IUs). The rotator switch includes four (input) source nodes 10-16, a first commutator 18, four (intermediate) tandem nodes 20-26, a second commutator 28, and four (output) destination nodes 30-36. Each commutator 18 and 28 is a specific 4-by-4 space-switch in which the connection matrix status is restricted to follow a predefined pattern that mimics a rotation scheme.


[0047] In operation, the Ingress data enters the switch via the source nodes using a fixed size Information Unit (IU) format. An IU is similar to an ATM cell, but it contains two mandatory fields in the header: the destination node address of the IU, and the class of service related with the IU (classes are discussed later). The IUs are queued per destination address (and class) in the source nodes, waiting for places on the tandem nodes to be routed to the target destination nodes. Queuing by destination in the source node avoids the problem known as head-of-line blocking. The deterministic sequence of space-switch connections guarantees the correct ordering of IUs arriving at the destination nodes. Finally, the destination nodes forward as Egress data the IUs received from the source nodes via the tandem nodes.
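
For concreteness, the two mandatory header fields can be pictured with the following C sketch. The field names and widths (one byte each) are illustrative assumptions only; the original text fixes only that the header carries the destination node address and the class of service, and that the IU has a fixed size such as 64 bytes.

/* Illustrative sketch; field names and widths are assumptions. */
#include <stdint.h>

#define IU_SIZE 64                  /* fixed IU size in bytes, e.g., 64 */

struct iu_header {
    uint8_t dest_node;              /* destination node address (0 .. N-1) */
    uint8_t class_of_service;       /* class 1 (highest) .. C (lowest) */
};

struct iu {
    struct iu_header hdr;
    uint8_t payload[IU_SIZE - sizeof(struct iu_header)];
};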


[0048] Referring to FIG. 2 there is illustrated the sequence of four phases composing the rotation scheme of the known rotator switch illustrated in FIG. 1; these phases are referred to as phase 0, 40; phase 1, 42; phase 2, 44; and phase 3, 46. At each phase of the rotation, a tandem node is connected with exactly one source node and with exactly one destination node, all tandem nodes being connected with different source nodes, and with different destination nodes. Similarly, a source node is connected with exactly one tandem node, all source nodes being connected with different tandem nodes, and a destination node is connected with exactly one tandem node, all destination nodes being connected with different tandem nodes.


[0049] Referring to FIG. 3 there is illustrated a circular representation of the known rotator data flow corresponding to the phase 0 connectivity presented in FIG. 2. The three other phases, phase 1, phase 2, and phase 3, are obtained by turning clockwise the internal disk containing the tandem nodes in the middle of the figure, the rotating effect being physically obtained by reconfiguring deterministically the space-switch 48, that implements both space switches 18, 28 of FIG. 1. During a rotation, each tandem node is connected with all source nodes, one source node after the other, and with all destination nodes, one destination node after the other. The sequence of connections is the same at each rotation.


[0050] During a phase, a tandem node can accept one IU from the connected source node, and can transfer one IU to the connected destination node. In general, K IUs could be transferred during a phase, as discussed below.


[0051] Each tandem node can buffer one IU for each destination node. The IU for one destination node is stored in a buffer, named Tandem-Destination-Buffer (TDB), associated with this destination node. There are four TDBs per tandem, one associated with each destination node. When a tandem node is connected with a destination node, the IU on the tandem node, in the TDB associated with this destination node, is transferred to this destination node; then, the TDB is freed.


[0052] It is useful to define the sequence of rotation of the tandem nodes using the destination nodes and the source nodes as reference points.


[0053] With respect to a given destination node, a tandem node terminates a rotation when it is connected with this destination node. That is, a tandem node starts a new rotation with respect to a destination node the phase after emptying the TDB associated with this destination node.


[0054] With respect to a given source node, a tandem node terminates a rotation when it is connected with this source node. That is, a tandem node starts a new rotation with respect to a source node the phase after receiving an IU from this source node.


[0055] The scheduling algorithm is the process of deciding the destination node associated with the IU provided by each source node to the connected tandem node at each phase of the rotation. This process is equivalent to assigning a source node associated with the IU provided by each tandem node to the connected destination node at each phase of the rotation. The algorithm must satisfy two constraints related with the IU data flow through the rotator:


[0056] 1) During each rotation of a tandem node with respect to a given destination node, this tandem node can accept at most one IU for this destination node, regardless of the source node providing the IU.


[0057] 2) During each rotation of a tandem node with respect to a given source node, a source node can provide only one IU to the tandem node, regardless of the destination node associated with the provided IU.


[0058] Referring to FIG. 4 there is illustrated the functional partitioning of the scheduling algorithm for a rotator switch in accordance with an embodiment of the present invention. The algorithm is composed of three specific modules:


[0059] 1) Request Manager 50: the purpose of the request manager 50 is to inform the scheduler about the queue-fill status of the source nodes, the queue-fill status being the number of IUs queued by each source node for each destination node. We assume in the following that a communication path exists from the source nodes to the scheduler. Using this path, each source node can forward, as requests, to the request manager, the information about the IU arrivals at this source node.


[0060] 2) Core Scheduler 52: the core scheduler 52 is the module implementing the process of deciding which source nodes have provided the IUs arriving at each destination node from its connected tandem node at each phase of the rotator. The scheduling decisions are based on the queue-fill status of the source nodes provided by the request manager, and they must satisfy the two above scheduling constraints. The scheduling decisions are then forwarded to the grant manager.


[0061] 3) Grant Manager 54: the purpose of the grant manager 54 is to inform the source nodes about the scheduling decisions. We assume in the following that a communication path exists from the scheduler back to the source nodes. For each rotator phase 40, 42, 44, 46, each source node must receive a grant from the grant manager that specifies the destination node for which this source node must provide an IU to the connected tandem node.


[0062] The core of the scheduling algorithm must be optimised from a traffic performance point of view, the best achievable traffic performance of the rotator switch architecture being the one achievable by an output buffer switch architecture. For this optimal switch architecture, the order in which IUs arrive at a destination node corresponds to the order in which IUs have entered the switch, regardless of which source nodes the IUs effectively arrived from.


[0063] The destination-based scheduling (DBS) algorithm presented herein is devised to optimise the IU data flow through the rotator switch such that the traffic performance approximates that achieved by an output buffer switch architecture. The basic principle of the algorithm is that the destination node selects the source node that will use the destination-buffer associated with this destination node, on each tandem node and for each rotation.


[0064] For each rotation of a tandem node with respect to a given destination node, the DBS algorithm reserves (or allocates) to a source node the TDB associated with this destination node. This decision must be completed before the tandem node starts the rotation with respect to the destination node, and the reservation will be consumed by the source node during this rotation of the tandem with respect to the destination node.


[0065] Therefore, at each phase of the rotation, one TDB on each tandem node can be reserved for a source node, one for each destination node. The process of reserving a TDB for a source node is called a source-selection.


[0066] From a destination node point of view, a source-selection is completed at each phase, one on each tandem node, one tandem node after the other. Since a source-selection is performed only once per rotation on a given tandem node for a given destination node, the above scheduling constraint 1 is satisfied.


[0067] From a tandem node point of view, a source-selection is completed at each phase, one for each destination node, one destination node after the other. To satisfy the above scheduling constraint 2, it is sufficient to select a source node that is not yet selected to send an IU on this tandem node for the current rotation of this tandem node with respect to this source node. At each phase of the rotation, the tandem node starts a new rotation with respect to a source node, one source node after the other. At this reference point, the source node can be considered as eligible to send an IU to this tandem node, regardless of the destination node; the source node is eligible until its selection by a destination node.


[0068] In summary, the basic DBS algorithm principles are the following: During each rotator phase,


[0069] 1) Each source node becomes eligible to use the tandem node it is connected with for the next rotation of this tandem node with respect to this source node;


[0070] 2) Each destination node selects an eligible source node on the connected tandem node to send on this tandem node an IU for this destination node during the next rotation of this tandem node with respect to this destination node. The selected source node is no longer eligible to be selected on this tandem node for the remainder of its rotation with respect to this source node.


[0071] A.1.1 Basic Parameters


[0072] N: number of source nodes, number of destination nodes, number of tandem nodes, or number of rotator phases. In the above example related with FIG. 1, N is 4.


[0073] K: number of IUs transferred per phase, from each source node to the connected tandem node, as well as from each tandem node to the connected destination node. In the example related with FIG. 1 discussed previously, K=1 was assumed.


[0074] In general, a source node can transfer K IUs to the connected tandem node at each phase, and a destination node can receive K IUs from the connected tandem node at each phase. Thus, there are NK TDBs per tandem node, K TDBs associated with each destination node. When a tandem node is connected with a destination node, the IUs on this tandem node, in the K TDBs associated with this destination node, are transferred to this destination node, in the order they arrived at the tandem node; then, the K TDBs are freed.


[0075] The two generalised constraints to be satisfied by the scheduling algorithm become:


[0076] 1) During each rotation of a tandem node with respect to a given destination node, this tandem node can accept at most K IUs for this destination node, regardless of the source nodes providing the IUs.


[0077] 2) During each rotation of a tandem node with respect to a given source node, this source node can provide at most K IUs to the tandem node, regardless of the destination nodes associated with the IUs.


[0078] A.1.2 Basic Notation


[0079] The source nodes are numbered 0, 1, . . . , N−1.


[0080] The destination nodes are numbered 0, 1, . . . , N−1.


[0081] The tandem nodes are numbered 0, 1, . . . , N−1.


[0082] The rotator phases are numbered 0, 1, . . . , N−1.


[0083] ST(p,z): the source node connected with tandem node z during rotator phase p.


[0084] DT(p,z): the destination node connected with tandem node z during rotator phase p.
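
The notation above does not fix the rotation pattern itself; any deterministic pattern in which each tandem node visits every source node and every destination node exactly once per rotation is acceptable. As a hedged illustration, the C sketch below assumes a simple modular rotation. It is consistent with the connectivity described for tandem node t0 in FIG. 2 (at phase p, source sp feeds t0 and t0 feeds destination dp), but the offsets used for the other tandem nodes are assumptions, not part of the original disclosure.

#include <stdio.h>

#define N 4   /* 4-node configuration of FIG. 1 */

/* Hypothetical modular rotation pattern (an assumption). */
static int ST(int p, int z) { return (p + z) % N; }   /* source feeding tandem z at phase p */
static int DT(int p, int z) { return (p + z) % N; }   /* destination fed by tandem z at phase p */

int main(void) {
    for (int p = 0; p < N; p++)
        for (int z = 0; z < N; z++)
            printf("phase %d: s%d -> t%d -> d%d\n", p, ST(p, z), z, DT(p, z));
    return 0;
}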


[0085] Q(x,y): the Queue-fill status of source node x for destination node y. On one hand, the value Q(x,y) is increased by the information forwarded by the request manager giving the number of IU arrivals at source node x for destination node y since the last update; on the other hand, it is decreased by one each time the scheduler selects source node x for destination node y (line 9 of the DBS_1 function below). The request manager provides this information at each period RP (request period) for each source-destination node combination.


[0086] TDS(y,z): the Tandem-Destination-Status (TDS) for destination node y on tandem node z. TDS(y,z) corresponds to the number of IUs the destination node y is already scheduled to receive from tandem node z during the current rotation of z with respect to destination y. This value is updated during scheduling to guarantee that the above scheduling constraint 1 is satisfied.


[0087] TSS(x,z): the Tandem-Source-Status (TSS) for source node x on tandem node z. TSS(x,z) corresponds to the number of IUs the source node x is already scheduled to send on tandem node z during the current rotation of z with respect to the source node x. This value is updated during scheduling to guarantee that the scheduling constraint 2 above is satisfied.
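
The per-phase state manipulated by the basic algorithm (Q, TDS, TSS and, later, the round-robin pointers LSS) can be held in a few small arrays. The C sketch below is only one possible layout for the classless case; the MAX_N bound, the array-of-arrays representation and the initialisation of the round-robin pointers are assumptions introduced for illustration.

/* One possible layout of the scheduler state (assumed, for illustration).
 * Intended to be statically allocated given its size. */
#include <string.h>

#define MAX_N 256                   /* assumed upper bound on N */

struct dbs_state {
    int N;                          /* number of source/destination/tandem nodes */
    int K;                          /* IUs transferred per phase */
    int Q[MAX_N][MAX_N];            /* Q[x][y]: IUs queued at source x for destination y */
    int TDS[MAX_N][MAX_N];          /* TDS[y][z]: IUs already scheduled for y on tandem z */
    int TSS[MAX_N][MAX_N];          /* TSS[x][z]: IUs already scheduled from x on tandem z */
    int LSS[MAX_N];                 /* LSS[y]: last source selected for destination y */
};

void dbs_state_init(struct dbs_state *st, int N, int K) {
    memset(st, 0, sizeof(*st));
    st->N = N;
    st->K = K;
    for (int y = 0; y < N; y++)
        st->LSS[y] = N - 1;         /* so the first round-robin scan starts at source 0 */
}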


[0088] A.1.3 Basic DBS Algorithm


[0089] The basic DBS algorithm consists in making K source-selections at each phase for each destination node, the source-selections being performed on the tandem node connected with the destination node. The core scheduler of the DBS algorithm is presented below as a function DBS1 (line 0 to line 15); this function is executed at each phase p of the rotator.
 0: function DBS_1 (p) {
 1:   for each tandem node z {
 2:     y = DT(p,z);
 3:     TDS(y,z) = 0;
 4:     x = ST(p,z);
 5:     TSS(x,z) = 0;
 6:     while (TDS(y,z) < K) {
 7:       s = select_source(z, y);
 8:       if (s non-existing) then exit while;
 9:       Q(s,y) = Q(s,y) − 1;
10:       TSS(s,z) = TSS(s,z) + 1;
11:       TDS(y,z) = TDS(y,z) + 1;
12:       record_grant(z, y, s);
13:     }
14:   }
15: }


[0090] For each rotator-phase, source-selections are computed on each tandem node (line 1 to line 14). For each tandem node z, the source-selections are made for the destination node y connected with this tandem node (line 2); before making the source-selections for the destination node y, the TDBs on the tandem node z associated with this destination node become available; thus, the associated TDS value is reset (line 3). Since the tandem node z starts a new rotation with respect to the source node x (line 4), the reservation status of this source node on this tandem node is reset (line 5).


[0091] Then, up to K source-selections are completed for destination node y on tandem node z (line 6 to line 13); the source-selections are completed one at a time, using the function select_source, which returns the selected source node s (line 7). There exist different schemes to select the source node, ranging from random selection to pure round-robin selection, as discussed below.


[0092] If no source node s is selected, then the source-selections for the destination node y on the tandem node z are terminated (line 8). Otherwise, the data structures are updated in accordance with the selected source node (line 9 to line 12): the queue-fill status of the selected source node s for the destination node y is decremented (line 9); the reservation status for the selected source node s on the tandem node z is incremented (line 10); the reservation status of the destination node y on the tandem node z is incremented (line 11); finally, the grant information corresponding with the selection of source node s for destination node y on tandem node z is forwarded to the grant manager; the recording of the grant by the grant manager is performed by the function record_grant (line 12).


[0093] As discussed above, there are many possible ways to select a source node for a given destination node on a given tandem node; the only requirement of the function is to guarantee that the above scheduling constraints 1 and 2 are satisfied. Constraint 1 is automatically satisfied given that the select_source function is called only when TDS(y,z) is smaller than K; to satisfy constraint 2, it is sufficient to select a source node s such that TSS(s,z) is smaller than K.


[0094] A round-robin implementation of the select_source function is presented below (line 0 to line 8). One round-robin pointer is used per destination node, LSS(y), which records the Last-Selected Source for destination node y (regardless of the tandem node).
0: function select_source (z, y) {
1:   for s = LSS(y)+1, LSS(y)+2, ..., N−1, 0, 1, ..., LSS(y) {
2:     if ((Q(s,y) > 0) && (TSS(s,z) < K)) {
3:       LSS(y) = s;
4:       return (success(s));
5:     }
6:   }
7:   return (failure);
8: }


[0095] The round-robin selection is implemented by considering all the source nodes, in increasing order, starting after the last selected source (line 1 to line 6). A source node s is a candidate for destination node y on tandem node z if TSS(s,z) is smaller than K, and if Q(s,y) is greater than 0; the selected source node is the first candidate considered following the round-robin order. If such a source node s exists (line 2), the value of the round-robin pointer for destination node y is set to s (line 3), and s is successfully returned by the function select_source (line 4). Otherwise, no source node is selected, and the function select_source returns a failure (line 7).
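
The DBS_1 and select_source listings above translate almost directly into executable code. The following self-contained C sketch is one hedged rendering for the classless, age-groupless case; the modular ST/DT rotation pattern, the record_grant stub (which here only prints the grant) and the small example backlog in main are illustrative assumptions rather than part of the original listing.

#include <stdio.h>

#define N 4                       /* nodes, as in FIG. 1 */
#define K 1                       /* IUs per phase */

static int Q[N][N];               /* Q[x][y]: queue fill of source x for destination y */
static int TDS[N][N];             /* TDS[y][z] */
static int TSS[N][N];             /* TSS[x][z] */
static int LSS[N];                /* round-robin pointer per destination */

/* Assumed modular rotation; the real ST/DT follow the commutator pattern. */
static int ST(int p, int z) { return (p + z) % N; }
static int DT(int p, int z) { return (p + z) % N; }

/* Stub: a real grant manager would forward this grant to source node s. */
static void record_grant(int z, int y, int s) {
    printf("grant: source %d -> tandem %d for destination %d\n", s, z, y);
}

/* Round-robin source-selection for destination y on tandem z; -1 on failure. */
static int select_source(int z, int y) {
    for (int i = 1; i <= N; i++) {
        int s = (LSS[y] + i) % N;            /* LSS[y]+1, ..., wrapping back to LSS[y] */
        if (Q[s][y] > 0 && TSS[s][z] < K) {
            LSS[y] = s;
            return s;
        }
    }
    return -1;
}

/* One execution of the core scheduler for rotator phase p (function DBS_1). */
static void DBS_1(int p) {
    for (int z = 0; z < N; z++) {
        int y = DT(p, z);
        TDS[y][z] = 0;                       /* TDBs for y on z become available */
        int x = ST(p, z);
        TSS[x][z] = 0;                       /* z starts a new rotation w.r.t. source x */
        while (TDS[y][z] < K) {
            int s = select_source(z, y);
            if (s < 0) break;                /* no eligible source with demand */
            Q[s][y]--;
            TSS[s][z]++;
            TDS[y][z]++;
            record_grant(z, y, s);
        }
    }
}

int main(void) {
    Q[0][2] = 3;                             /* assumed backlog: source 0 has 3 IUs for destination 2 */
    Q[1][2] = 2;                             /* assumed backlog: source 1 has 2 IUs for destination 2 */
    for (int p = 0; p < N; p++)
        DBS_1(p);
    return 0;
}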


[0096] Many variants of the select_source function are possible, such as:


[0097] 1) Considering the source nodes either in increasing order or in decreasing order, setting randomly the round-robin pointer each time the order is reversed;


[0098] 2) Considering the source nodes following a completely random order.


[0099] Referring again to FIG. 2, the relationships between the scheduling decisions and the IU data flow with respect to tandem node t0 are as follows:


[0100] At phase 0, 40:


[0101] 1) s0 sends an IU to t0, the IU being dequeued from the queue associated with the destination node (including d0 itself) by which s0 was selected during the current rotation of t0 with respect to s0; thus, t0 will start a new rotation with respect to s0.


[0102] 2) t0 sends the IU for d0, the IU having been received from a source node (including s0 itself) that was previously selected by d0 during the current rotation of t0 with respect to d0; thus, t0 will start a new rotation with respect to d0.


[0103] 3) The scheduler selects a source node to send an IU on t0 for d0 during the next rotation of t0 with respect to d0.


[0104] At phase 1, 42:


[0105] 1) s1 sends an IU to t0; thus, t0 will start a new rotation with respect to s1.


[0106] 2) t0 sends the IU for d1; thus, t0 will start a new rotation with respect to d1.


[0107] 3) The scheduler selects a source node to send an IU on t0 for d1 during the next rotation of t0 with respect to d1.


[0108] At phase 2, 44:


[0109] 1) s2 sends an IU to t0; thus, t0 will start a new rotation with respect to s2.


[0110] 2) t0 sends the IU for d2; thus, t0 will start a new rotation with respect to d2.


[0111] 3) The scheduler selects a source node to send an IU on t0 for d2 during the next rotation of t0 with respect to d2.


[0112] At phase 3, 46:


[0113] 1) s3 sends an IU to t0; thus, t0 will start a new rotation with respect to s3.


[0114] 2) t0 sends the IU for d3; thus, t0 will start a new rotation with respect to d3.


[0115] 3) The scheduler selects a source node to send an IU on t0 for d3 during the next rotation of t0 with respect to d3.


[0116] The sequencing of source-selections on the other tandem nodes is similar.


[0117] When considering the traffic performance (IU delay variation) achievable with a rotator switch architecture, the DBS algorithm significantly improves this performance with respect to the known source-based scheduling algorithm. When there is severe output contention for a destination node y, the DBS algorithm provides a fair distribution amongst the contending source nodes of the bandwidth available to reach this destination node y; this is because the scheduling decisions are performed from a destination-node point of view.


[0118] By contrast, under such severe output contention, the known source-based scheduling algorithm is unfair, since a source node can reserve all the bandwidth available to reach the destination node y, leaving little or no bandwidth at all for the other contending source nodes.


[0119] A.2 Extension of DBS Algorithm to Consider Traffic Priority


[0120] Assume C classes of traffic are supported by the core fabric of the rotator switch architecture. The classes are numbered 1, 2, . . . , C, in decreasing order of priority, the class 1 being the highest priority class, and the class C being the lowest priority class.


[0121] It is possible to support more than C classes of traffic in the source and destination nodes; however, in that case, this superset of classes must be mapped onto the C classes provided by the core switch to be routed from the source nodes to the destination nodes.


[0122] The basic principle for extension of the DBS algorithm to support C classes of traffic is to consider each class of traffic, one after the other, following the decreasing order of priority.


[0123] To support strict priority between two adjacent classes, the source-selections for the higher-priority class traffic for all destination nodes on a given tandem node must be completed before considering any lower-priority class traffic. For a given future rotation of a tandem node, the highest class traffic is first scheduled for each destination node, one destination node after the other; then, for the unassigned (residue) bandwidth on the tandem node (either from source nodes to the tandem node, or from the tandem node to destination nodes), the second class traffic is scheduled for each destination node; this process is repeated until the lowest class traffic is scheduled.


[0124] Since the source-selections on a tandem node are completed for one destination node after the other, following the order in which the tandem node is connected with the destination nodes, many classes of service can be scheduled by making source-selections for C rotations of the tandem node at a time; for a given destination node, the source-selections for the highest class traffic can be completed for a rotation of the tandem node with respect to the destination node that will start in C rotations, while the source-selections for the lowest class traffic can be completed for the next rotation of the tandem node with respect to the destination node. In this way, when scheduling a given class of service for a given rotation of a given tandem node, the source-selections for higher class traffic have already been completed for all destination nodes for this rotation of this tandem node.
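
In other words, under this pipelining the class-c selections being made at the current phase are consumed during the rotation of the tandem node, with respect to the connected destination node, that starts C − c + 1 rotations later. The small helper below merely restates that offset; it is an interpretation of the text, not part of the original listing, and it assumes the numbering convention of classes 1 (highest priority) through C (lowest).

/* Offset, in rotations of the tandem node with respect to the destination node,
 * of the rotation for which class-c selections are currently being made.
 * Interpretation of the pipelining described above (an assumption). */
int target_rotation_offset(int c, int C) {
    return C - c + 1;   /* class 1 -> C rotations ahead, class C -> next rotation */
}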


[0125] To support the scheduling for many classes of service, the data structures of the basic DBS algorithm (DBS1) are extended in the class dimension:


[0126] Q(x,c,y): the Queue-fill status of source node x of class c for destination node y.


[0127] TDS(y,c,z): the Tandem-Destination-Status (TDS) for destination node y on tandem node z for traffic of class c or higher.


[0128] TDS(y,c,z) corresponds to the number of IUs of class c or higher the destination node y is already scheduled to receive from tandem node z during a future rotation of z with respect to destination y. This value is updated during scheduling to guarantee that the above scheduling constraint 1 is satisfied.


[0129] TSS(x,c,z): the Tandem-Source-Status (TSS) for source node x on tandem node z for traffic of class c or higher. TSS(x,c,z) corresponds to the number of IUs of class c or higher the source node x is already scheduled to send on tandem node z during a future rotation of z with respect to the source node x. This value is updated during scheduling to guarantee that the scheduling constraint 2 above is satisfied.


[0130] The extension of the DBS1 algorithm to consider C classes of service consists in making K source-selections at each phase for each destination node and for each class, the source-selections being performed on the tandem node connected with the destination node, but for C different rotations of the tandem node, one class per rotation. The core scheduler of the algorithm is presented below as a function DBS2 (line 0 to line 17); this function is executed at each phase p of the rotator.
 0: function DBS_2 (p) {
 1:   for each tandem node z {
 2:     y = DT(p,z);
 3:     update_TDS(z, y);
 4:     x = ST(p,z);
 5:     update_TSS(z, x);
 6:     for each class c {
 7:       while (TDS(y,c,z) < K) {
 8:         s = select_source(z, y, c);
 9:         if (s non-existing) then exit while;
10:         Q(s,c,y) = Q(s,c,y) − 1;
11:         TSS(s,c,z) = TSS(s,c,z) + 1;
12:         TDS(y,c,z) = TDS(y,c,z) + 1;
13:         record_grant(z, y, s, c);
14:       }
15:     }
16:   }
17: }


[0131] For each rotator-phase, source-selections are computed on each tandem node (line 1 to line 16). For each tandem node z, the source-selections are made for the destination node y connected with this tandem node (line 2). The availability of the TDBs on the tandem node z associated with this destination node y is updated accordingly for each class of service (line 3); the update is computed by the function update_TDS presented below:
0: function update_TDS (z, y) {
1:   for class c = C, C−1, ..., 2 {
2:     TDS(y,c,z) = TDS(y,c−1,z);
3:   }
4:   TDS(y,1,z) = 0;
5: }


[0132] That is, the TDS value of the tandem node z for the destination node y associated with a class of service takes the residual TDS value associated with the next higher class of service, except for the highest class of service, for which the TDS value is reset.


[0133] Similarly, since the tandem node z starts a new rotation with respect to the source node x (line 4), the reservation status of this source node on this tandem node is updated accordingly for each class of service (line 5); the update is computed by the function update_TSS presented below:
0: function update_TSS (z, x) {
1:   for class c = C, C−1, ..., 2 {
2:     TSS(x,c,z) = TSS(x,c−1,z);
3:   }
4:   TSS(x,1,z) = 0;
5: }


[0134] The source-selections are computed for each class of service, each for a different rotation of the tandem node (line 6 to line 15). For each class of service c, up to K source-selections are completed for destination node y on tandem node z (line 7 to line 14); the source-selections are completed one at a time, using the function select_source, which returns the selected source node s (line 8). The extension of this function to consider class of service is discussed below.


[0135] If no source node s is selected, then the source-selections for the destination node y on the tandem node z are terminated for this class of service c (line 9). Otherwise, the data structures are updated in accordance with the selected source node (line 10 to line 13): the queue-fill status of the selected source node s for the destination node y and class of service c is decremented (line 10); the reservation status for the selected source node s on the tandem node z for the class of service c is incremented (line 11); the reservation status of the destination node y on the tandem node z for the class of service c is incremented (line 12); finally, the grant information corresponding with the selection of source node s for destination node y on tandem node z for the class of service c is forwarded to the grant manager; the recording of the grant by the grant manager is performed by the function record_grant (line 13), which is extended to consider class of service, i.e., to relate the grant with the effective rotation of the tandem node.


[0136] A round-robin implementation of the select_source function which considers the class of service c is presented below (line 0 to line 8). One round-robin pointer is used per destination node and class of service, LSS(y,c), which records the Last-Selected Source for destination node y and class of service c (regardless of the tandem node).
0: function select_source (z, y, c) {
1:   for s = LSS(y,c)+1, ..., N−1, 0, 1, ..., LSS(y,c) {
2:     if ((Q(s,c,y) > 0) && (TSS(s,c,z) < K)) {
3:       LSS(y,c) = s;
4:       return (success(s));
5:     }
6:   }
7:   return (failure);
8: }


[0137] The round-robin selection is implemented by considering all the source nodes, in increasing order, starting after the last selected source (line 1 to line 6). A source node s is a candidate for destination node y on tandem node z for class of service c if TSS(s,c,z) is smaller than K, and if Q(s,c,y) is greater than 0; the selected source node is the first candidate considered following the round-robin order. If such a source node s exists (line 2), the value of the round-robin pointer for destination node y and class of service c is set to s (line 3), and s is successfully returned by the function select_source (line 4). Otherwise, no source node is selected, and the function select_source returns a failure (line 7).


[0138] As for the classless DBS algorithm (DBS1), many variants of the select_source function are possible.


[0139] Note that with C=1, the DBS2 function degenerates into the DBS1 function.


[0140] Although class priority is an important feature to be supported by a switch architecture, a strict priority between classes may not always be acceptable. For instance, it is not possible to guarantee a minimum bandwidth for a low class service. This is because high class traffic can always prevent allocation of bandwidth for traffic of a lower class.


[0141] However, to guarantee minimum bandwidth to any class of traffic, the same algorithm as proposed for strict class priority can be used. In that scheme, the highest priority class can be dedicated to any class of traffic for which a minimum allocation of bandwidth must be guaranteed. That is, all the classes of traffic share the highest class such that each class can make “high-class” requests at the rate corresponding with its minimum bandwidth guarantee. The minimum bandwidth allocation can be guaranteed because the high priority class requests are satisfied strictly before the requests of a lower priority class, assuming that the aggregate of the minimum bandwidth guarantees is not overbooked (this assumption is required for any scheduling algorithm to honor the minimum bandwidth guarantees).


[0142] It is the responsibility of the source nodes to map IUs of any class to the first class of service in order to guarantee minimum bandwidth. There are many ways to implement this scheme, a simple way being to associate one counter with each logical input queue (per destination and class), where this counter represents the credit available for the corresponding traffic flow. The counter is incremented at a rate corresponding to the minimum bandwidth to guarantee, up to a given limit of credit. Each time an IU is received, the high-class request is performed if the corresponding credit counter is not zero, and the counter is decremented; otherwise, a normal request is performed. The source node must record the number of high-class requests it has made for a destination node for each class of service, such that when high-class grants are received for this destination node, the source node can provide an IU corresponding to a class having pending high-class requests.
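
One possible rendering of this per-queue credit scheme is sketched below in C. The tick-driven refill mechanism, the integer credit granularity and the names used are assumptions; the original only requires that the counter grow at the guaranteed rate up to a limit and be spent one unit per high-class request.

/* Hedged sketch of the per-(destination, class) credit counter described above. */
struct flow_credit {
    int credits;            /* current credit, in IUs */
    int credit_limit;       /* upper bound on accumulated credit */
    int refill_per_tick;    /* credit added per refill tick, derived from the
                               minimum bandwidth to guarantee (assumed granularity) */
    int pending_high;       /* high-class requests issued but not yet granted */
};

/* Called periodically at the assumed refill tick. */
void credit_refill(struct flow_credit *f) {
    f->credits += f->refill_per_tick;
    if (f->credits > f->credit_limit)
        f->credits = f->credit_limit;
}

/* Called when an IU arrives for this (destination, class) queue.
 * Returns 1 if a high-class (class 1) request should be issued,
 * 0 if a normal request at the IU's own class should be issued. */
int on_iu_arrival(struct flow_credit *f) {
    if (f->credits > 0) {
        f->credits--;
        f->pending_high++;
        return 1;
    }
    return 0;
}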


[0143] A.3 Request Ageing


[0144] During a source-selection of class c for a destination node y, since the queue-fill status Q(x,c,y) as seen by the scheduler does not include time information, two source nodes x1 and x2 are considered of equivalent priority when both Q(x1,c,y) and Q(x2,c,y) are greater than 0, regardless of the queue-fill history of the source nodes. The source-selection is only based on the current round-robin pointer LSS(y,c), as well as whether or not the source nodes are eligible for the current tandem node z, i.e., the values of TSS(x1,c,z) and TSS(x2,c,z).


[0145] During severe output contention for a destination node y, the queue-fill values Q(x,c,y) may become large for the set of source nodes contending for the destination node y. Thus, it can be advantageous from a traffic performance point of view to consider the history of the queue-fill values when performing the source-selection, since the source node having a queue-fill value corresponding with the oldest IU having entered the switch should be considered first.


[0146] It is not practical for the scheduler to associate an exact historical information with the queue-fill values. However, it is possible to approximate the history of queue-fill using age-groups. Assume J age-groups are supported, numbered 1, 2, . . . , J, in the decreasing order of age, the age-group 1 being used for the requests associated with the oldest IUs, and the age-group J being used for the requests associated with the youngest IUs.


[0147] To consider the queue-fill history during the scheduling, the queue-fill data structure of the DBS algorithm (DBS2) is extended in the age-group dimension:


[0148] Q(x,j,c,y): the Queue-fill status of source node x from the age-group j of class c for destination node y.


[0149] The extension of the DBS2 algorithm to consider the queue-fill history consists only in providing a select_source function which considers the age-group dimension and returns the age-group component associated with the selected source node. The core scheduler of the algorithm is presented below as a function DBS3 (line 0 to line 17); this function is executed at each phase p of the rotator.
 0: function DBS_3 (p) {
 1:   for each tandem node z {
 2:     y = DT(p,z);
 3:     update_TDS(z, y);
 4:     x = ST(p,z);
 5:     update_TSS(z, x);
 6:     for each class c {
 7:       while (TDS(y,c,z) < K) {
 8:         (s, j) = select_source(z, y, c);
 9:         if (s non-existing) then exit while;
10:         Q(s,j,c,y) = Q(s,j,c,y) − 1;
11:         TSS(s,c,z) = TSS(s,c,z) + 1;
12:         TDS(y,c,z) = TDS(y,c,z) + 1;
13:         record_grant(z, y, s, c);
14:       }
15:     }
16:   }
17: }


[0150] A round-robin implementation of the select_source function which considers the age-groups is presented below (line 0 to line 10).
 0: function select_source (z, y, c) {
 1:   for j = 1 to J {
 2:     for s = LSS(y,c)+1, ..., N−1, 0, 1, ..., LSS(y,c) {
 3:       if ((Q(s,j,c,y) > 0) && (TSS(s,c,z) < K)) {
 4:         LSS(y,c) = s;
 5:         return (success(s, j));
 6:       }
 7:     }
 8:   }
 9:   return (failure);
10: }


[0151] As for the age-groupless DBS algorithm (DBS2), many variants of the select_source function are possible.


[0152] Note that with J=1, the DBS3 function degenerates into the DBS2 function.


[0153] The quality of approximating the queue-fill history using the age-group dimension is dependent on the number J of age-groups, and on the relation between these age-groups and the queue-fill history. The best approximation would be achieved using an infinite number of age-groups, which is not practical.


[0154] Given a finite number J of age-groups, two possible ageing schemes for the age-groups are the following:


[0155] 1) The ageing of each age-group is performed at a specified rate, named the ageing rate, given as a parameter. Many combinations of these parameters are possible, forming many ageing configurations. For instance, a non-linear ageing scheme can be implemented by ageing each age-group at a rate two times slower than the ageing rate of the younger age-group; a sketch of such rate-based ageing is given after this list.


[0156] 2) The ageing of each age-group is performed when the older age-group is empty.
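
A minimal sketch of the first scheme (rate-based ageing) is given below for one source-destination-class combination of the age-grouped queue-fill status. The tick-driven countdown trigger and the wholesale promotion of a group's requests into the next older group are assumed mechanisms; the original only specifies that each age-group is aged at its own configured rate.

/* Hedged sketch of rate-based ageing, age-groups numbered 1 (oldest) .. J (youngest). */
#define J 4                           /* number of age-groups (assumed) */

struct aged_queue {
    int Q[J + 1];                     /* Q[j]: requests currently counted in age-group j */
    int period[J + 1];                /* ageing period of group j, in ticks (parameter) */
    int countdown[J + 1];             /* ticks remaining before group j ages */
};

/* Called once per assumed ageing tick.  A non-linear configuration would set,
 * for example, period[j-1] = 2 * period[j]. */
void ageing_tick(struct aged_queue *q) {
    for (int j = 2; j <= J; j++) {    /* group 1 is already the oldest */
        if (--q->countdown[j] <= 0) {
            q->Q[j - 1] += q->Q[j];   /* promote requests into the next older group */
            q->Q[j] = 0;
            q->countdown[j] = q->period[j];
        }
    }
}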


[0157] A.4 Physical Implementation


[0158] The physical implementation of the DBS3 algorithm depends on the phase duration, which is dependent on the bandwidth supported by each source node (or equally by each destination node), as well as on the IU size.


[0159] For instance, with 2.5 Gb/s source-destination nodes and 64-byte IUs, the phase duration is approximately K × 205 ns. For a given destination node y and a class of service c, since K source-selections must be computed at each phase (line 7 to line 14 of the DBS3 function), each one must be computed in 205 ns. Thus, the DBS3 function must compute NC source-selections per 205 ns for a rotator switch architecture configuration with N source-destination nodes and C classes of service; for a 640 Gb/s switch configuration (i.e., with N=256) and 4 classes of service (C=4), the computation rate corresponds to one source-selection per 0.2 ns, which is approximately a 5 GHz rate.
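
As a quick check of the figures quoted above, the phase duration and the per-selection time budget can be recomputed from the stated parameters (64-byte IU, 2.5 Gb/s per node, N = 256, C = 4); the snippet below does only that arithmetic.

#include <stdio.h>

int main(void) {
    const double iu_bits   = 64.0 * 8.0;                  /* 64-byte IU */
    const double node_rate = 2.5e9;                       /* 2.5 Gb/s per node */
    const int    N = 256, C = 4;                          /* 640 Gb/s configuration, 4 classes */

    double phase_ns_per_K = iu_bits / node_rate * 1e9;    /* ~204.8 ns for K = 1 */
    double per_selection  = phase_ns_per_K / (N * C);     /* ~0.2 ns per source-selection */

    printf("phase duration (K=1): %.1f ns\n", phase_ns_per_K);
    printf("budget per source-selection: %.2f ns (about %.1f GHz)\n",
           per_selection, 1.0 / per_selection);
    return 0;
}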


[0160] To achieve such a high processing rate, the DBS3 algorithm can be distributed such that many source-selections can be computed in parallel. A natural distribution of the algorithm is per destination node.


[0161] Referring to FIG. 5, there is illustrated in a circular representation a distributed implementation of the DBS3 algorithm. The destination-based scheduler associated with a destination node y (DBSy entity 60, 62, 64, 66) is physically collocated with the destination node y. Besides being used as usual for the IU data flow, a tandem node z is used to carry the requests, via the RMz entity 70, 72, 74, 76, from the source nodes to the DBS entities, and to carry the grants, via the GMz entity 80, 82, 84, 86, from the DBS entities back to the source nodes. Furthermore, the tandem node is used as well to carry its associated TSS value, via the TSSz entity 90, 92, 94, 96, from destination node DBS entity to destination node DBS entity.


[0162] During each rotator-phase, the GMz entity sends the grants to the connected source node x, indicating to the source node which IUs it must send to the tandem node z, while the RMz entity sends the previously received requests for the connected destination node y to the DBSy entity. At the same time, the source node x can send its requests to the RMz entity, such that the requests will be forwarded to the appropriate destination node DBS entity. Furthermore, the TSSz entity sends the current TSS value associated with the tandem node z to the DBSy entity. Based on the TSS value, the DBSy entity can compute the source-selections on the tandem node z for the destination node y (line 6 to line 15 of the DBS3 function). Concurrently, the normal IU data flow can proceed from the source node x to the tandem node z, and from the tandem node z to destination node y. To complete the phase, the DBSy entity sends the grants (source-selections) to the GMz entity, as well as the resulting TSS value to the TSSz entity.


[0163] Using the above distributed implementation, each DBS entity needs to compute KC source-selections per phase, i.e., K source-selections for each class of service; thus, each DBSy entity needs to implement the Q(x,j,c,y) data structure restricted to destination node y. Furthermore, since the K source-selections for each class of service are computed for different rotations of the tandem node z, the functionality of the DBS entity can be distributed per class, named the DBSy,c entity, where each DBSy,c entity needs to compute K source-selections per phase, and thus needs to implement only the Q(x,j,c,y) data structure restricted to destination node y and class of service c.


[0164] The above distributed implementation is advantageous because the existing IU data path is used to implement the communication path from the source nodes to the scheduler and from the scheduler back to the source nodes. Furthermore, the request-manager function and grant-manager function are both distributed amongst the tandem nodes.


[0165] However, the above distributed implementation is problematic because of the relatively long latency required to transfer the TSS values between DBS entities. The longer the latency of the TSS transfer, the less time remains for the DBS entity to make the source-selections on the connected tandem node for the associated destination node. Worse, the size of the TSS values may be significant, in particular for large switch configurations, and the bandwidth required to transfer these values is taken from the bandwidth that would otherwise be available for transferring user data IUs.


[0166] To overcome the above problem related with the transfer of TSS values, all the DBS entities can be centralised at the same physical location.


[0167] Referring to FIG. 6, a centralised implementation of the DBS3 function is illustrated in a circular representation. The destination-based scheduler associated with a destination node y (DBSy entity) is collocated with all the other DBS entities. Besides being used as usual for the IU data flow, a tandem node z is used to carry the requests, via the RMz entity, from the source nodes to the DBS entities, and to carry the grants, via the GMz entity, from the DBS entities back to the source nodes.


[0168] In the centralised implementation, the IU data space-switch bandwidth is used only to transfer the requests from the source nodes to the RM entities, and the grants from the GM entities back to the source nodes. Another space switch is dedicated to transferring the requests from the RMz entities to the DBS entities, and the grants from the DBS entities to the GMz entities. Furthermore, the TSS values are directly transferred from DBS entity to DBS entity. Schematically, the source-destination node ring as well as the DBS entity ring are fixed, while the tandem node ring rotates between the two.


[0169] As for the distributed implementation, the centralised implementation is advantageous because the existing IU data path is used to implement the communication path from the source nodes to the request manager and from the grant manager to the source nodes. Contrary to the distributed implementation, however, the latency to transfer the TSS values can be minimised, because the DBS entities are collocated.


[0170] Referring to FIG. 7, a centralised implementation of the DBS3 algorithm is illustrated in more detail. For this implementation, one physical device is used to implement the functionality associated with exactly one DBSy,c entity (line 7 to line 14 of the DBS3 function). Thus, 12 physical devices 110, 112, 114, 116, 120, 122, 124, 126, 130, 132, 134, 136 are needed, since 4 destination nodes and 3 classes of service are assumed in the example. The implementation is composed of 3 identical rows of 4 DBS devices, each row being responsible for the source-selections of one class of service. The requests from the source nodes, for all classes of service, forwarded by the request manager for a destination node, enter via the corresponding DBS device of class 1, and are then forwarded to the corresponding DBS devices of class 2 and class 3.


[0171] At each phase, each DBS device computes the source-selections for its associated destination node on a given tandem node. For a given class of service (row), each DBS device computes source-selections for its associated destination node, each on a different tandem node; then, the resulting TSS value associated with the tandem node is transferred to the DBS device associated with the next destination node the tandem node will be connected with at the next phase of the rotation; an efficient electronic link can be used to carry the TSS values from one DBS device to the next. For a given destination node (column), each DBS device computes source-selections for its associated destination node on the same tandem node, each for a different target rotation of this tandem node, corresponding to the class of service the DBS device is responsible for.


[0172] Before making the source-selections on a given tandem node, a DBS device transfers the TSS residue associated with the tandem node to the corresponding next-class DBS device, as required in the algorithm for updating the TSS values (line 5 of the DBS3 function). After making the source-selections on a given tandem node, the selected source nodes are forwarded to the corresponding next-class DBS device; the grant forwarding implicitly implements the transfer of the TDS residue associated with the tandem node, as required in the algorithm for updating the TDS values (line 3 of the DBS3 function); furthermore, the grant forwarding implements part of the record_grant function, which consists of forwarding the grants to the grant manager.


[0173] The above centralised implementation can be further optimised, since many DBS entities of the same class of service can be implemented on the same physical ASIC device. This reduces the number of devices and further minimises the latency related to the transfer of the TSS values. The number of DBS entities that can share the same physical ASIC device is mainly limited by the memory requirement for implementing the Q(x,j,c,y) data structure. The size of this data structure depends on the size of each queue-fill counter as well as on the number of source nodes and age-groups, and the limitation is technology dependent.


[0174] B. Load-Share DBS Algorithm


[0175] A weakness of the centralised architecture implementation of the DBS algorithm described in Section A is related to its fault tolerance. If one DBS device fails, no more source-selections are possible for the associated destination nodes, regardless of the tandem nodes. Worse, the faulty DBS device can make the whole scheduler faulty, since the TSS flow is broken.


[0176] Redundant interconnection between DBS devices can be provided to minimise the impact of a faulty DBS device. Depending on the number of redundant links provided, this solution can allow the scheduler to continue making the source-selections for all the destination nodes, excluding those associated with one or more faulty DBS devices.


[0177] A better solution is to duplicate all the DBS devices, i.e., the whole scheduler, where one scheduler is considered the active one, while the other is considered the stand-by one. In that protection scheme, each scheduler must receive the same requests from the source nodes, and must compute the same grants for these source nodes. This solution requires both schedulers to behave in exactly the same way, which can be very difficult to guarantee. For instance, a request can be lost for only one scheduler, causing both schedulers to behave differently for a certain period of time, even if neither scheduler is faulty; the synchronisation of the schedulers is mandatory but very difficult to achieve.


[0178] An even better solution using scheduler duplication is to make each scheduler responsible for computing the source-selections on only half of the tandem nodes. That is, the traffic load of the switch can be shared between two disjoint physical partitions of the switch fabric, each having its own scheduler. Thus, each scheduler can perform the source-selections at half the rate required when a single scheduler is used. In the case where one scheduler becomes faulty, either half of the switch capacity is lost, or the other scheduler can become responsible for scheduling all the traffic load on all the tandem nodes, provided that it was implemented to compute the source-selections at the full rate.


[0179] The performance of the rotator switch using the load-share DBS algorithm depends on the efficiency of the load sharing between the two schedulers. It is the responsibility of the source node to evenly distribute its requests between both schedulers. This can be achieved in many ways; for instance, a simple random distribution scheme can be used in which, for each incoming IU, the source node randomly selects, following a uniform distribution, the scheduler to which it will send the request corresponding to the arrival of this IU. When the requests are evenly distributed, the performance of the rotator switch using the load-share DBS scheduler is similar to that of the single DBS scheduler.
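
A minimal Python sketch of this random distribution is given below; the two-scheduler set-up and the request format are assumptions made only for illustration.

    import random

    SCHEDULERS = ("scheduler_0", "scheduler_1")    # two load-shared schedulers (assumed)

    def dispatch_request(destination, class_of_service):
        # For each incoming IU, pick a scheduler uniformly at random and
        # send it the request corresponding to the arrival of this IU.
        target = random.choice(SCHEDULERS)
        return target, {"dest": destination, "cos": class_of_service}

    print(dispatch_request(destination=3, class_of_service=1))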


[0180] The degree of load-sharing can be increased beyond two schedulers, up to the number of tandem nodes N. That is, a scheduler can be associated with each tandem node; in that case, the requests from a source node must be evenly distributed amongst all the tandem nodes.


[0181] To distribute the load amongst all the tandem nodes, the queue-fill data structure of the DBS algorithm (DBS3) is extended in the tandem node dimension:


[0182] Q(x,j,c,y,z): the share of the Queue-fill status of source node x on tandem node z from the age-group j of class c for destination node y.


[0183] The extension of the DBS3 algorithm to consider the load-share amongst the tandem nodes consists only in providing a select_source function which considers the share of the queue-fill status associated with the tandem node. The core scheduler of the algorithm is presented below as a function DBS4 (line 0 to line 17); this function is executed at each phase p of the rotator.
 0: function DBS_4 (p) {
 1:   for each tandem node z {
 2:     y = DT(p,z);
 3:     update_TDS(z, y);
 4:     x = ST(p,z);
 5:     update_TSS(z, x);
 6:     for each class c {
 7:       while (TDS(y,c,z) < K) {
 8:         (s, j) = select_source(z, y, c);
 9:         if (s non-existing) then exit while;
10:         Q(s,j,c,y,z) = Q(s,j,c,y,z) - 1;
11:         TSS(s,c,z) = TSS(s,c,z) + 1;
12:         TDS(y,c,z) = TDS(y,c,z) + 1;
13:         record_grant(z, y, s, c);
14:       }
15:     }
16:   }
17: }


[0184] A round-robin implementation of the select_source function which considers the load-sharing is presented below (line 0 to line 10). One round-robin pointer is used per destination node, tandem node, and class of service, LSS(y,z,c), which records the Last Selected Source for destination node y on tandem node z for class of service c.
 0: function select_source (z, y, c) {
 1:   for j = 1 to J {
 2:     for s = LSS(y,z,c)+1, ..., N-1, 0, 1, ..., LSS(y,z,c) {
 3:       if ((Q(s,j,c,y,z) > 0) && (TSS(s,c,z) < K)) {
 4:         LSS(y,z,c) = s;
 5:         return (success(s, j));
 6:       }
 7:     }
 8:   }
 9:   return (failure);
10: }
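
For illustration only, the following Python sketch renders the DBS4 and select_source functions above in runnable form. The DT and ST phase mappings, the dictionary-based data structures and the simplified per-class reset performed by update_TDS and update_TSS are assumptions made for this sketch (the patent schedules each class for a different target rotation); it is not a definitive implementation.

    from collections import defaultdict

    N, J, C, K = 4, 2, 2, 1              # nodes, age-groups, classes of service, TDBs per destination

    Q = defaultdict(int)                 # Q[(x, j, c, y, z)]: queue-fill share
    TSS = defaultdict(int)               # TSS[(x, c, z)]: tandem-source status
    TDS = defaultdict(int)               # TDS[(y, c, z)]: tandem-destination status
    LSS = defaultdict(lambda: N - 1)     # LSS[(y, z, c)]: last selected source
    grants = []

    def DT(p, z):
        return (p - z) % N               # destination connected to tandem z at phase p (assumed mapping)

    def ST(p, z):
        return (p - z + 1) % N           # source connected to tandem z at phase p (assumed mapping)

    def update_TDS(z, y):                # simplified: reset all classes at once
        for c in range(C):
            TDS[(y, c, z)] = 0

    def update_TSS(z, x):
        for c in range(C):
            TSS[(x, c, z)] = 0

    def select_source(z, y, c):          # round-robin scan starting after LSS(y,z,c)
        for j in range(J):
            start = LSS[(y, z, c)]
            for step in range(1, N + 1):
                s = (start + step) % N
                if Q[(s, j, c, y, z)] > 0 and TSS[(s, c, z)] < K:
                    LSS[(y, z, c)] = s
                    return s, j
        return None                      # "failure": no eligible source node

    def DBS_4(p):
        for z in range(N):
            y, x = DT(p, z), ST(p, z)
            update_TDS(z, y)
            update_TSS(z, x)
            for c in range(C):
                while TDS[(y, c, z)] < K:
                    sel = select_source(z, y, c)
                    if sel is None:
                        break
                    s, j = sel
                    Q[(s, j, c, y, z)] -= 1
                    TSS[(s, c, z)] += 1
                    TDS[(y, c, z)] += 1
                    grants.append((z, y, s, c))   # record_grant

    Q[(1, 0, 0, 2, 2)] = 5               # example: source 1 has IUs for destination 2 on tandem 2
    DBS_4(0)
    print(grants)                        # [(2, 2, 1, 0)]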


[0185] As for the DBS3 algorithm, many variants of the select_source function are possible.


[0186] Notice that the DBS4 algorithm can be adapted for any degree of load-sharing between 1 and N, where source-selections on a given tandem node are made by exactly one scheduler, which receives a load share corresponding to the fraction of tandem nodes it is responsible for. For the case of a load-sharing degree of 1, the DBS4 algorithm degenerates into the DBS3 algorithm.


[0187] The main advantage of using a load-sharing degree of N (i.e., associating one scheduler with each tandem node) is the highly fault-tolerant implementation of the architecture that can be achieved.


[0188] Referring to FIG. 8, an N-degree load-sharing implementation of the DBS4 function is illustrated in a circular representation. The destination-based scheduler associated with a destination node y (DBSy entity), for a given tandem node z, is collocated with the tandem node z and with all the other DBS entities associated with the tandem node z. That is, each tandem node is collocated with its own scheduler 100, 102, 104, 106. Besides being used as usual for the IU data flow, a tandem node z is used to carry the requests, via the RMz entity, from the source nodes to its local DBS entities, and to carry the grants, via the GMz entity, from its local DBS entities back to the source nodes.


[0189] Notice that the rate at which source-selections must be computed for a tandem node by its associated scheduler (load-sharing degree of N) is N times lower than the rate required in the case of a single scheduler for all the tandem nodes (load-sharing degree of 1). Each scheduler can be implemented as illustrated in FIG. 7, but the implementation can be less complex (in terms of number of ASIC devices) since the required processing rate of the scheduler is N times lower.


[0190] A high degree of fault tolerance can be achieved because:


[0191] 1) If a scheduler becomes faulty, its associated tandem node can be considered as faulty, resulting in a bandwidth penalty of 1/N.


[0192] 2) If a tandem node becomes faulty, its associated scheduler can be considered as faulty, resulting in a bandwidth penalty of 1/N.


[0193] This bandwidth penalty can easily be compensated by having a rotator switch fabric that provides some bandwidth expansion with respect to the user traffic.


[0194] C. DBS Algorithm Extension for Rotator Architecture with Compound-Tandem Nodes


[0195] Referring to FIG. 9 there is illustrated a 4-node configuration of the rotator switch extension using compound-tandem nodes of degree 2. In operation, each tandem node is connected at the same time with two source nodes as well as with two destination nodes, reducing by a factor of 2 the rotation latency with respect to the known rotator switch. Detailed descriptions of this rotator switch are given in the above referenced copending patent application.


[0196] In general, using compound-tandem nodes of degree u, a tandem node is connected with u source nodes at a time and with u destination nodes at a time.


[0197] At each scheduling phase (each call of the function DBS3), a tandem node z terminates a scheduling rotation with respect to u destination nodes, and with respect to u source nodes. It is thus possible to perform source-selections for these u destination nodes on the tandem node z. From an implementation point of view, referring to FIG. 7, the TSS value associated with the tandem node z needs to be considered by two DBS devices at each phase.


[0198] The N-degree load-sharing DBS4 algorithm is extended in a similar way. In that case, because of the compound-tandem nodes, there are fewer tandem nodes and thus fewer schedulers, but there is always one scheduler associated with each tandem node. At each phase, each scheduler must complete the source-selections for u destination nodes on its associated tandem node.


[0199] D. DBS Algorithm Extension for Rotator Architecture with Parallel Rotator Slices


[0200] Referring to FIG. 10 there is illustrated a 4-node configuration of the rotator switch extension using parallel rotator slices of degree 2. In operation, each source node is connected at the same time with two tandem nodes, and similarly for each destination node, increasing by a factor of 2 the number of physical paths between each combination of source-destination nodes with respect to the known rotator switch. Detailed descriptions of this rotator switch are given in the above referenced copending patent application.


[0201] In general, using parallel rotator slices of degree v, a source node is connected with v tandem nodes at a time and a destination node is connected with v tandem nodes at a time as well. That is, v independent rotator switch fabrics are used.


[0202] At each scheduling phase (each call of the function DBS3), v tandem nodes terminate a scheduling rotation with respect to the same destination node y, and with respect to the same source node x. It is thus possible to perform source-selections for this destination node y on these v tandem nodes. From an implementation point of view, referring to FIG. 7, a DBS device needs to consider the TSS values associated with 2 tandem nodes at each phase.


[0203] The N-degree load-sharing DBS4 algorithm is extended in a similar way. In that case, because of the parallel rotator slices, there are more tandem nodes and thus more schedulers, but there is always one scheduler associated with each tandem node. At each phase, each scheduler must complete the source-selections for a destination node on its associated tandem node.


[0204] E. DBS Algorithm Extension for Rotator Architecture with Compound-Tandem Nodes and Parallel Rotator Slices


[0205] Normally, the compound-tandem node extension and the parallel rotator slice extension should be used together. The parallel rotator slices increase the number of physical paths from each source node to each destination node, which results in an architecture that is inherently fault-tolerant with respect to the data flow. However, the latency of the rotator switch (rotation delay of one tandem node) is increased by a factor v corresponding to the number of parallel rotator slices. On the other hand, the advantage of the compound-tandem node architecture is to reduce this latency of the rotator switch by a factor of u, where u is the number of source or destination nodes connected at the same time with a tandem node.


[0206] Referring to FIG. 11 there is illustrated a 4-node configuration of the rotator switch extension combining compound-tandem nodes of degree 2 and parallel rotator slices of degree 2. In operation, each tandem node is connected at the same time with two source nodes as well as with two destination nodes, while each source node is connected at the same time with two tandem nodes, and similarly for each destination node. Detailed descriptions of this rotator switch are given in the above referenced copending patent application. In general, combining compound-tandem nodes of degree u and parallel rotator slices of degree v, a tandem node is connected at the same time with u source nodes as well as with u destination nodes, while a source node is connected with v tandem nodes at a time and a destination node is connected with v tandem nodes at a time as well. The DBS algorithm for the known rotator architecture can be easily extended for this architecture.


[0207] At each scheduling phase (each call of the function DBS3), v tandem nodes terminate a scheduling rotation with respect to the same set of u destination nodes, and with respect to the same set of u source nodes. It is thus possible to perform source-selections for these u destination nodes on these v tandem nodes. From an implementation point of view, referring to FIG. 7, a DBS device needs to consider the TSS values associated with 2 tandem nodes at each phase, while the TSS value associated with a tandem node needs to be considered by two DBS devices at each phase.


[0208] The N-degree load-sharing DBS4 algorithm is extended in a similar way. Since there is always one scheduler associated with each tandem node, at each phase, each scheduler must complete the source-selections for u destination nodes on its associated tandem node.


[0209] F. DBS Algorithm Extension for Rotator Architecture with Double-Bank Tandem Nodes


[0210] As discussed previously, when considering the traffic performance achievable with a rotator switch architecture, the DBS algorithm improves this performance significantly with respect to the known source-based scheduling algorithm. This is because the DBS algorithm fairly distributes amongst the source nodes the bandwidth available to reach a destination node.


[0211] Although the improvement is very significant, the proposed DBS algorithm is inherently biased from a source-node point of view. Because a tandem node starts a new rotation at a different phase with respect to each destination node, there exists a fixed dependency between the time a source node becomes eligible to be selected on a given tandem node, and the time a destination node performs a source-selection on this tandem node. For a given source node, this time dependency is different for each destination node; thus, a source node x is more likely to be eligible for a source-selection by a destination node closer to x than by a destination node further from x.


[0212] For instance, when completing source-selections for destination node 1, on any tandem node, source node 1 has not yet been considered as a candidate for any destination node for this rotation of the tandem node with respect to source node 1. On the other hand, when completing source-selections for destination node 0, source node 1 has been considered as a candidate for all destination nodes except destination node 0 for this rotation of the tandem node with respect to source node 1. In the case where source node 1 has IU traffic for destination node 0 and destination node 1, source node 1 is less likely to be eligible for a source-selection by destination node 0 than by destination node 1, since destination node 1 always makes its source-selection before destination node 0, from the point of view of source node 1.


[0213] The double-bank tandem node architecture is proposed as an extension of the known rotator architecture to eliminate the above problem. In the double-bank architecture each tandem node has two banks of TDBs, one for receiving IUs from the source nodes, and one for sending IUs to the destination nodes. The banks are swapped once per rotation. To guarantee a correct IU ordering at the destination node, the banks must be swapped at a fixed position of the rotation for all tandem nodes; we suppose in the following that the swapping occurs when the tandem node is connected with source node 0.


[0214] Referring to FIG. 12 there is illustrated a 4-node configuration of the rotator switch extension using double-bank tandem nodes. In operation, each tandem node stores the IU received from the connected source node in one bank, while the IU sent to the connected destination node is read from the other bank. The tandem node swaps its banks when it is connected with source node 0. Detailed descriptions of this rotator switch are given in the above referenced copending patent application.


[0215] In the following, we do not consider the compound-tandem node and parallel rotator slice architectural extension, although the double-bank tandem node architecture as well as the proposed scheduler can be extended for both the compound-tandem nodes and parallel rotator slices; the extensions for the DBS scheduling algorithm are similar to those proposed in the case of the compound-tandem node and parallel rotator slice architectural extension of the known rotator switch.


[0216] In the double-bank tandem node architecture, when a tandem node is connected with destination node 0, it terminates a rotation with respect to all the destination nodes. At each phase of the rotator, there is one tandem node starting a new rotation with respect to all the destination nodes. The objective of the scheduling algorithm is to select a source node for each destination node on a tandem node before this tandem node starts a rotation. The destination node order for making the source-selections is no longer constrained by the IU data flows.


[0217] For each tandem node z and target rotation of this tandem node, the scheduler must select K source nodes for each destination node to use the K TDBs associated with this destination node during this target rotation of z. For each rotation, the tandem node starts with an empty bank of TDBs for incoming IUs, and the destination node order for the source-selections is no longer constrained by the rotator IU flow. However, a source node can be selected at most K times for each rotation of the tandem node, regardless of the destination nodes it is selected for.


[0218] The core scheduler of the algorithm is presented below as a function DBS5 (line 0 to line 20); this function is executed at each phase p of the rotator.
 0: function DBS_5 (p) {
 1:   z = DT^-1(p,0);
 2:   for each destination node y {
 3:     TDS(y,z) = 0;
 4:   }
 5:   for each source node x {
 6:     TSS(x,z) = 0;
 7:   }
 8:   for class c = 1, 2, ..., C {
 9:     for each destination node y {
10:       while (TDS(y,z) < K) {
11:         (s, j) = select_source(z, y, c);
12:         if (s non-existing) then exit while;
13:         Q(s,j,c,y) = Q(s,j,c,y) - 1;
14:         TSS(s,z) = TSS(s,z) + 1;
15:         TDS(y,z) = TDS(y,z) + 1;
16:         record_grant(z, y, s, c);
17:       }
18:     }
19:   }
20: }


[0219] Contrary to the DBS3 algorithm, only one tandem node is scheduled per phase, and it is the tandem node z connected with destination node 0 (line 1); the inverse function DT^-1(p,y) of the function DT(p,z) gives the tandem node connected with destination node y at phase p; in fact, DT and its inverse are the same function, following our node numbering, since when tandem node z is connected with destination node y, tandem node y is connected with destination node z.
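
The self-inverse property can be checked with a small numerical example; the modular mapping below is only a hypothetical numbering consistent with the property stated above, not necessarily the one used here.

    N = 4

    def DT(p, z):
        # Hypothetical mapping: destination node connected to tandem node z at phase p.
        return (p - z) % N

    # DT is its own inverse: if DT(p, z) = y then DT(p, y) = z.
    assert all(DT(p, DT(p, z)) == z for p in range(N) for z in range(N))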


[0220] On this tandem node z, the TDS values are updated for each destination node y (line 2 to line 4), and the TSS values are updated for each source node x (line 5 to line 7). Since all destination nodes are scheduled during the same phase on one tandem node, it is no longer required to schedule different classes of service for different rotations of the tandem node.


[0221] Then, source-selections are performed on the tandem node z for each class of service, from the highest priority class to the lowest priority class, since the source-selections are for the same target rotation of the tandem node z (line 8 to line 19).


[0222] Given a class of service, each destination node is considered for source-selections, one after the other (line 9 to line 18). The destination nodes can be considered in any order; for instance, a random order can be used, and in that case, from a source node point of view, the probability of being selected by a destination node is evenly distributed amongst all the destination nodes.


[0223] For a given destination node y, up to K source-selections are completed (line 10 to line 17); this part of the algorithm is the same as in the DBS3 function, except that the class dimension no longer needs to be associated with the TSS and TDS data structures.


[0224] A round-robin implementation of the select_source function which does not consider the class dimension of TSS is presented below (line 0 to line 10).
 0: function select_source (z, y, c) {
 1:   for j = 1 to J {
 2:     for s = LSS(y,c)+1, ..., N-1, 0, 1, ..., LSS(y,c) {
 3:       if ((Q(s,j,c,y) > 0) && (TSS(s,z) < K)) {
 4:         LSS(y,c) = s;
 5:         return (success(s, j));
 6:       }
 7:     }
 8:   }
 9:   return (failure);
10: }


[0225] When the DBS5 function is used as the core scheduler for the double-bank tandem node rotator switch architecture, the destination node bias as seen by a source node disappears, since a source node has the same probability of being selected by any destination node, provided a random ordering of destination nodes is used for the source-selections.


[0226] This algorithm can be easily extended for the N-degree load-sharing scheduler architecture. As before, the source-selections for a given tandem node must consider only the queue-fill share associated with this tandem node.


[0227] From a practical point of view, however, the DBS5 function is much more complex to implement than the DBS3 function. Each source-selection is a time-consuming task, and they must be performed one after the other on a given tandem node for all the destination nodes and classes. This is because there is a data dependency between source-selections on the same tandem node, since a source node can be selected up to K times on this tandem node, regardless of the destination node.


[0228] Furthermore, it is difficult to compute source-selections at the same time for the same destination node on two or more tandem nodes. This is because for each source-selection the queue-fill status associated with the destination node must be considered and updated, regardless of the tandem node for which the source-selection is computed.


[0229] In the case of the DBS3 function, this problem of data dependency is not significant, since at each phase the source-selections are performed on different tandem nodes, and for different destination nodes. Furthermore, in order to perform source-selections for different classes of service at the same time, the source-selections for each class are performed for different rotations of the tandem nodes.


[0230] In the case of the DBS4 function, there is one scheduler associated with each tandem node. Thus, there is no problem of data dependency related to concurrent source-selections for the same destination node on two or more tandem nodes, since the source-selections on each tandem node are based on the local queue-fill status associated with the tandem node. However, the source-selections for different destination nodes on the same tandem node must be completed one after the other.


[0231] The above DBS5 function can be modified to meet the constraint where at each phase source-selections are computed on all tandem nodes, all for a different destination node. We assume in the following, as for the DBS3 algorithm implementation, that only K source-selections can be performed per phase on a given tandem node.


[0232] For a given class of service, since there are N destination nodes for which up to K source-selections must be completed on a tandem node for a target rotation of this tandem node, these source-selections must be started N phases ahead of the target rotation (i.e., one rotation ahead of the target rotation). Furthermore, source-selections for different classes of service can be performed for different target rotations of the tandem nodes, as in the DBS3 algorithm.


[0233] The basic principle of the extension of the DBS5 function is that the source-selections for a given target rotation are started at the same time for all the tandem nodes, although the tandem nodes will effectively start this rotation each at a different rotator-phase. The scheduling process becomes a sequence of scheduling rotations, where during each scheduling rotation each destination node makes source-selections on each tandem node, one tandem node per phase, for the same target rotation of the tandem nodes for a given class of service, each class of service being scheduled for a different target rotation. For each scheduling rotation, an ordering of destination nodes can be assigned to each tandem node, such that at each scheduling phase all destination nodes are making K source-selections, each on a different tandem node. We assume phase 0 is used as the starting scheduling phase.


[0234] The core scheduler of the algorithm satisfying the above constraint for the rotator switch architecture with double-bank tandem nodes is presented below as a function DBS6 (line 0 to line 23); this function is executed at each phase p of the rotator.
 0: function DBS_6 (p) {
 1:   for each tandem node z {
 2:     if (p == 0) {
 3:       for each destination node y {
 4:         update_TDS(z, y);
 5:       }
 6:       for each source node x {
 7:         update_TSS(z, x);
 8:       }
 9:       set_destination_node_order(z);
10:     }
11:     y = next_destination_node(z, p);
12:     for each class c {
13:       while (TDS(y,c,z) < K) {
14:         (s, j) = select_source(z, y, c);
15:         if (s non-existing) then exit while;
16:         Q(s,j,c,y) = Q(s,j,c,y) - 1;
17:         TSS(s,c,z) = TSS(s,c,z) + 1;
18:         TDS(y,c,z) = TDS(y,c,z) + 1;
19:         record_grant(z, y, s, c);
20:       }
21:     }
22:   }
23: }


[0235] For each rotator-phase, source-selections are computed on each tandem node (line 1 to line 22). As discussed previously, scheduling for the next target rotation is started at rotator-phase 0 (line 2). Thus, for each tandem node z, the TDS values are updated for each destination node y (line 3 to line 5), the TSS values are updated for each source node x (line 6 to line 8), and an ordering of the destination nodes for making the source-selections on the tandem node z, one destination node per phase, is generated (line 9); this ordering is generated with the set_destination_node_order function, which is discussed below. The requirement of this function, as described previously, is that the generated destination node ordering is such that at each scheduling phase all destination nodes are making K source-selections, each on a different tandem node, and, furthermore, during each scheduling rotation, each destination node is making K source-selections on each tandem node, one tandem node per scheduling phase.
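
The requirement amounts to a Latin-square assignment between phases, tandem nodes and destination nodes. The Python sketch below shows one hypothetical way to satisfy it (a randomly shuffled base permutation shifted by the tandem node index); the helper names and the global, rather than per-tandem-node, state are assumptions of the sketch, not the implementation of set_destination_node_order.

    import random

    N = 8                                   # number of nodes (illustrative)
    perm = list(range(N))                   # base permutation of destination nodes

    def set_destination_node_order():
        random.shuffle(perm)                # new random base order, once per scheduling rotation

    def next_destination_node(z, p):
        return perm[(p + z) % N]            # destination scheduled on tandem node z at phase p

    set_destination_node_order()
    for p in range(N):                      # one scheduling rotation
        scheduled = {next_destination_node(z, p) for z in range(N)}
        assert scheduled == set(range(N))   # each phase: all destinations, each on a different tandem node
    for z in range(N):
        visited = {next_destination_node(z, p) for p in range(N)}
        assert visited == set(range(N))     # each rotation: every destination scheduled once on tandem node z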


[0236] Then, the destination node y for which source-selections can be performed on the tandem node z is computed (line 11); the destination node y is given by the function next_destination_node which returns the destination node y to schedule on tandem node z during the rotator phase p, as previously generated during the last rotator phase 0 by the function set_destination_node_order.


[0237] The source-selections for destination node y on tandem node z, for each class of service, each for a different target rotation of the tandem node z (line 12 to line 21) are computed in exactly the same way as in the DBS3 algorithm (line 6 to line 15).


[0238] The fairness of the scheduling algorithm, from a source node point of view with respect to the destination nodes, is directly and only dependent on the perturbation of the destination node ordering as provided by the function set_destination_node_order. Theoretically, the ordering generated can be totally random, and the achievable performance is the same as that achievable with the above DBS5 algorithm. Although the DBS6 algorithm is less efficient from a latency point of view, since scheduling for a target rotation of a given tandem node is performed much further in advance than in the DBS5 algorithm, this latency can be kept small enough in a physical implementation of the rotator switch that it becomes insignificant from a traffic performance point of view.


[0239] The DBS6 algorithm can be optimised for the N-degree load-sharing architecture, because all tandem nodes are independently scheduled. Thus, the destination node ordering for making the source-selections on a given tandem node is not constrained by the ordering used for the other tandem nodes. This makes it possible to relax the constraint of starting the source-selections on all the tandem nodes at the same time for a target rotation of these tandem nodes, although each tandem node will effectively start the target rotation at a different phase. Instead, at each phase, the scheduling for a target rotation can be started only for the tandem node effectively starting a rotation, i.e., the tandem node connected with destination node 0.


[0240] The core scheduler of the N-degree load-sharing DBS algorithm for the rotator switch architecture with double-bank tandem nodes is presented below as a function DBS7 (line 0 to line 23); this function is executed at each phase p of the rotator.
 0: function DBS_7 (p) {
 1:   for each tandem node z {
 2:     if (z == DT^-1(p,0)) {
 3:       for each destination node y {
 4:         update_TDS(z, y);
 5:       }
 6:       for each source node x {
 7:         update_TSS(z, x);
 8:       }
 9:       set_destination_node_order(z);
10:     }
11:     y = next_destination_node(z, p);
12:     for each class c {
13:       while (TDS(y,c,z) < K) {
14:         (s, j) = select_source(z, y, c);
15:         if (s non-existing) then exit while;
16:         Q(s,j,c,y,z) = Q(s,j,c,y,z) - 1;
17:         TSS(s,c,z) = TSS(s,c,z) + 1;
18:         TDS(y,c,z) = TDS(y,c,z) + 1;
19:         record_grant(z, y, s, c);
20:       }
21:     }
22:   }
23: }


[0241] For each rotator-phase, source-selections are computed on each tandem node (line 1 to line 22). As discussed previously, scheduling for the next target rotation is started for the tandem node connected with destination node 0 (line 2). Thus, only for this tandem node z, the TDS values are updated for each destination node y (line 3 to line 5), the TSS values are updated for each source node x (line 6 to line 8), and an ordering of the destination nodes for making the source-selections on the tandem node z, one destination node per phase, is generated (line 9); this ordering is generated with the set_destination_node_order function, which is discussed below. The requirement of this function, as described previously, is that the generated destination node ordering is such that during the scheduling rotation, each destination node makes K source-selections on the tandem node z, one destination node per scheduling phase.


[0242] Then, the destination node y for which source-selections can be performed on the tandem node z is computed (line 11); the destination node y is given by the function next_destination_node which returns the destination node y to schedule on tandem node z during the rotator phase p, as previously generated by the function set_destination_node_order for the tandem node z.


[0243] The source-selections for destination node y on tandem node z, for each class of service, each for a different target rotation of the tandem node z (line 12 to line 21), are computed in exactly the same way as in the DBS4 algorithm (line 6 to line 15).


[0244] Practically, only a subset of all the possible destination node orderings may be generated by the function set_destination_node_order, either for the DBS6 algorithm or for the DBS7 algorithm. In a practical implementation of the scheduling algorithm, the DBS entities (each one associated with a destination node and a class of service) are distributed amongst many physical devices; that is, each physical device is responsible for making source-selections for a fixed subset of destination nodes for a given class of service (and for a given tandem node in the case of the DBS7 algorithm). In that case, the connectivity between these devices for transferring the TSS values constrains the possible perturbations that can be applied to the destination node ordering.


[0245] In the following, we discuss some practical implementations of the DBS6 and DBS7 functions with respect to the set_destination_node_order function.


[0246] F.1 One-Way DBS


[0247] In the one-way DBS scheme, an implementation as illustrated in FIG. 7 is proposed where, in general, each DBS device is responsible for making the source-selections for M destination nodes, 0<M<N+1 (for a given class of service). Without loss of generality, suppose that N is a multiple of M; thus, the N DBS entities are distributed between N/M DBS devices. Because of the strict connectivity between the DBS devices, after the source-selections on a given tandem node, each DBS device can transfer the residue of the TSS value only to the DBS device located physically at its right.


[0248] At the beginning of each scheduling rotation, the set_destination_node_order function can generate a random order of destination nodes for each group of M destination nodes associated with a DBS device. This random generation produces a global ordering of the destination nodes such that the destination node following a destination node y of a DBS device D is either the next one of its group of M destination nodes, if y is not the last one of its group, or, otherwise, the first one of the group of M destination nodes associated with the DBS device located physically at the right of D.
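
The group-constrained ordering described above can be sketched as follows in Python; the group layout and the helper name are assumptions used only to illustrate the idea (random order inside each device's group, devices visited in their fixed physical order).

    import random

    N, M = 8, 2                            # illustrative: 8 destination nodes, 2 per DBS device
    groups = [list(range(d * M, (d + 1) * M)) for d in range(N // M)]

    def one_way_order():
        order = []
        for group in groups:               # DBS devices visited in their fixed left-to-right order
            g = group[:]
            random.shuffle(g)              # random order inside each device's group
            order.extend(g)
        return order

    print(one_way_order())                 # e.g. [1, 0, 3, 2, 5, 4, 7, 6]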


[0249] In the case of the DBS6 function, a tandem node is associated (randomly) with each destination node at the beginning of the scheduling rotation; thus, M tandem nodes are associated with each DBS device. At each phase each DBS device computes K source-selections for each of the destination nodes it is responsible for, on the tandem node currently associated with the destination node; then, each TSS residue is transferred to the destination node at the right of the current destination node, following the previously generated random ordering. Thus, at each phase, there is always one TSS residue being transferred from a DBS device to its right neighbour. As required in the DBS6 algorithm, each destination node can make K source-selections on each tandem node during each scheduling rotation, and each tandem node receives K source-selections from each destination node during each scheduling rotation.


[0250] The larger M is, the more closely the destination node ordering perturbation can approximate a completely random perturbation, and a perfect approximation can be achieved with M greater than or equal to N/2 (i.e., with one or two DBS devices per class of service). For a smaller value of M, because there are more than two DBS devices (per class of service) and because the transfer of the TSS values follows a strict order of the DBS devices, the destination nodes associated with a DBS device always make their source-selections after the destination nodes associated with the DBS device at its left, on N-M tandem nodes. This ordering scheme results in a bias, which is more significant when M is relatively small compared to N.


[0251] For instance, suppose that M=1 and N=256, a source node x has a large number of IUs queued for destination node 0 and destination node 1, and has IUs only for these two destination nodes, and no other source node has IUs queued for these two destination nodes. Suppose furthermore that the DBS device responsible for destination node 1 is located physically at the right of the DBS device responsible for destination node 0. In that case, the DBS device for destination node 0 always selects the source node x on all the tandem nodes before the DBS device for destination node 1, except for one tandem node per rotation. That is, the bandwidth from the point of view of source node x will not be fairly distributed between all the destination nodes.


[0252] In the case of the DBS7 algorithm, there is only one tandem node scheduled by each DBS device at each scheduling rotation, and a destination node can be randomly selected as the starting one to make its source-selections on the tandem node. Since the tandem node must be considered in an order of the destination nodes similar to that of the DBS6 function described above, the same bias exists. However, since each DBS entity is less complex in that case, many more can share the same DBS device, making M larger, and the bias problem becomes much less significant. Furthermore, the logical mapping of the DBS entities onto the DBS devices can be different in each scheduler, making the bias even less significant.


[0253] F.2 Two-Way DBS


[0254] A simple extension of the one-way DBS scheme is to provide a duplex communication path between neighbouring DBS devices, and to reverse the flow direction of the TSS values at each scheduling rotation. Thus, it will no longer be the case that one destination node can make its source-selections before another destination node on almost all the tandem nodes (when M is relatively small with respect to N). Instead, for each pair of destination nodes, half of the time one destination node has priority over the other destination node, and half of the time it is the reverse.


[0255] The larger M is, the more closely the destination node ordering perturbation can approximate a completely random perturbation, and a perfect approximation can be achieved with M greater than or equal to N/3 (i.e., with one, two or three DBS devices per class of service). For a smaller value of M, because there are more than three DBS devices (per class of service) and because the transfer of the TSS values follows a strict order of the DBS devices, the destination nodes associated with a DBS device always make their source-selections after the destination nodes associated with the DBS device at its left, on N-M tandem nodes, for half of the rotations, and after the destination nodes associated with the DBS device at its right, also on N-M tandem nodes, for the other half of the rotations. This ordering scheme results in a bias, which is more significant when M is relatively small compared to N.


[0256] For instance, suppose M=1 and N=256, and that a source node x has a large number of IUs queued for destination node 0, destination node 1 and destination node 2, has IUs only for these three destination nodes, and no other source node has IUs queued for these three destination nodes. Suppose furthermore that the DBS device responsible for destination node 0 is just at the left (following the TSS value flow in the right direction) of the DBS device responsible for destination node 1, which is just at the left of the DBS device responsible for destination node 2. In that case, half of the time the DBS device for destination node 0 always selects the source node x on all the tandem nodes before the DBS devices for destination node 1 and destination node 2, except for two tandem nodes per rotation. The other half of the time, the DBS device for destination node 2 always selects the source node x on all tandem nodes before the DBS devices for destination node 1 and destination node 0, except for two tandem nodes per rotation. That is, the bandwidth from the point of view of source node x will not be totally fairly distributed, even if it is fairly distributed between destination nodes 0 and 2.


[0257] Notice that the probability of bias is much less likely in the case of the two-way DBS scheme than in the case of the one-way DBS scheme.


[0258] In the case of the DBS7 algorithm, a similar bias exists. However, since each DBS entity is less complex in that case, many more DBS entities can share the same DBS device, making M larger and the bias much less likely. Furthermore, the logical mapping of the DBS entities onto the DBS devices can be different in each scheduler, making the bias even less significant.


[0259] F.3 H-Way DBS


[0260] The extension from the one-way DBS scheme to the two-way DBS scheme can be further extended to an H-way DBS scheme, which can be practically implemented when H=(N/M-1) is sufficiently small. In the case where the DBS devices can be fully mesh connected, it is possible to make the TSS values flow through the DBS devices in a different order at each scheduling rotation. Combined with the local random ordering of the destination nodes in each DBS device, this scheme yields a totally random scheme for ordering the destination nodes.


[0261] One way to implement the cross-connect is to use a demand-driven space switch between all the DBS devices, in which only a subset of the configurations is needed, the configuration being generated randomly for each scheduling rotation.


[0262] F.4 M-Pass DBS


[0263] Another scheme to perturb the destination node ordering is to combine an H-way scheme (for H=1, 2, . . . ) with an M-pass scheme. In an M-pass DBS scheme, a tandem node is scheduled by all the destination nodes using multiple passes of the tandem node through the DBS devices implementing the DBS entities.


[0264] For each scheduling rotation, each destination node selects randomly, for each tandem node, during which pass of this tandem node it will make its source-selections.


[0265] The larger M is, the better the approximation of the random perturbation. The number of passes M is dependent on the effective latency to transfer the TSS values between DBS devices.


[0266] G. DBS Algorithm Extension for Demand-Driven Space Switch Architecture


[0267] In the following, we argue that, besides the fixed transport delay, the functionality of the rotator switch architecture with double-bank tandem nodes is identical to the functionality of an input-buffer demand-driven space-switch architecture; thus, the same schedulers and implementations proposed for this rotator switch architecture can be used for the demand-driven space-switch architecture.


[0268] In the demand-driven space-switch architecture, the IUs are queued in the source nodes, as in the rotator switch architecture. The switch fabric is a demand-driven space switch that can be configured dynamically in any one-to-one mapping between the source nodes and the destination nodes. The IU data flow for this architecture is composed of an infinite sequence of bursts; for each burst the demand-driven space switch is reconfigured, and each source node can send IUs to the connected destination node. Usually, the duration of each burst is the same, which corresponds to the time of sending a given number of IUs, say L. The configuration at each burst is demand-driven to increase the throughput of the switch fabric. We assume first that L=1, which achieves the best performance with a demand-driven space-switch architecture, yet is not really practical because of the delay penalty involved in reconfiguring the space switch.


[0269] As in the case of the rotator switch, the switch fabric can be composed of many parallel demand-driven space-switches, where each one can be configured independently.


[0270] A configuration in the case of the demand-driven space switch corresponds with a tandem node rotation in the case of the rotator switch with double-bank tandem nodes, when K=1. In general, the tandem node can implement K different configurations of a demand-driven space switch during each rotation. Without loss of generality, we suppose K=1 in the following.


[0271] Hence, during each rotation, a tandem node can implement any one-to-one connection mapping between the source nodes and the destination nodes. The only difference with the demand-driven space switch resides in the fact that the one-to-one connection mapping implemented by the tandem node is spread in time, over two rotations: during the first rotation, at each phase, the connected source node sends to the tandem node an IU for the destination node the source node is mapped with, while during the second rotation, at each phase, the tandem node sends to the connected destination node the IU that was previously sent by the source node the destination node is mapped with. In fact, at each rotation, the tandem node implements half of two (different) one-to-one connection mappings.


[0272] Many tandem nodes are used; each tandem node implements a space switch of capacity 1/N, yet all tandem nodes can implement different one-to-one connection mappings. In fact, the N tandem nodes implement an N-stage pipelined architecture of a demand-driven space switch.


[0273] Since a tandem node rotation implements a one-to-one mapping of a demand-driven space switch, the set of source-selections on a tandem node, one for each destination node, as computed by the DBS algorithm for a target rotation of this tandem node, can be used directly to configure the demand-driven space-switch for a given burst.


[0274] To use directly the proposed DBS algorithms, as well as the corresponding implementations, for the demand-driven space-switch architecture, it is sufficient to map each tandem node rotation to a burst of the demand-driven space switch.


[0275] Assuming there are N tandem nodes, the source-selections for rotation R of tandem node t can be used directly for the configuration of the demand-driven space switch for burst N·R+t.
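
As a worked illustration of this mapping (the numeric values are illustrative only):

    N = 4                                  # number of tandem nodes (assumed)

    def burst_index(R, t):
        # Rotation R of tandem node t configures burst N*R + t of the space switch.
        return N * R + t

    print(burst_index(0, 3))               # burst 3
    print(burst_index(2, 1))               # burst 9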


[0276] Thus, the proposed DBS6 algorithm, together with the proposed implementations for destination node ordering perturbation, can be used directly as the scheduler for a demand-driven space switch architecture.


[0277] Furthermore, the DBS7 algorithm can be used as well to distribute the load amongst many schedulers for error-protection purposes, in particular in the case of an architecture with parallel demand-driven space switches.


[0278] The proposed DBS algorithms can be used directly in the case of a demand-driven space-switch architecture with a burst length L of 1 IU. For the case L>1, it is also possible to use the proposed DBS algorithms directly, assuming that the source nodes make requests to the scheduler for groups of L IUs for each combination of destination node and class of service.


[0279] In one scheme, a source node makes a request to the scheduler for transferring one IU to a specified destination node as soon as possible, even if it does not yet have a group of L IUs ready to be sent for this destination node. Then, the source node will refrain from making another request to the scheduler for the same destination node until it receives L more IUs for this destination node, unless the source node has been granted permission to send IUs to this destination node without having L IUs to send. This scheme minimises the latency an IU can experience through the switch, yet the switch throughput is not optimised since fewer than L IUs may be transferred per burst.


[0280] In another scheme, a source node makes a request to the scheduler for transferring one IU to a specified destination node only for each group of L IUs received for this destination node. This scheme optimises the switch throughput, yet the latency an IU can experience through the switch is not optimised, since an IU must wait for L-1 other companions before the scheduler is informed of its presence at the source node. A time-out counter can be used to guarantee a maximum waiting period and thus optimise the latency as well.
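
The batching-plus-timeout policy just described can be sketched as follows in Python; the class, the per-destination scope and the numeric values are assumptions chosen only to make the idea concrete.

    L = 8                       # burst length in IUs (illustrative)
    TIMEOUT = 1.0e-3            # maximum waiting period in seconds (illustrative)

    class RequestBatcher:
        """Per-destination policy: request on a full group of L IUs, or on time-out."""

        def __init__(self):
            self.pending = 0
            self.first_arrival = None

        def on_iu_arrival(self, now):
            if self.pending == 0:
                self.first_arrival = now
            self.pending += 1
            if self.pending >= L:
                return self.flush()        # full group: inform the scheduler immediately
            return None

        def on_tick(self, now):
            if self.pending and now - self.first_arrival >= TIMEOUT:
                return self.flush()        # time-out: request a partial group
            return None

        def flush(self):
            count, self.pending, self.first_arrival = self.pending, 0, None
            return ("request", count)      # sent to the scheduler

    b = RequestBatcher()
    print([b.on_iu_arrival(now=i * 1e-4) for i in range(9)])   # the request fires on the 8th IU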


[0281] Thus, the proposed DBS6 and DBS7 algorithms, together with the proposed implementations for destination node ordering perturbation, can be used directly as the scheduler for a demand-driven space-switch architecture.


[0282] H. Fault Tolerant Switch Architecture


[0283] We have already described an S-degree load-sharing variant of the destination-based scheduling algorithm as a means of increasing the fault tolerance of the architecture with respect to scheduler faults as well as with respect to the part of the switch fabric (tandem nodes, space switch) the scheduler is responsible for (S=1, . . . , N). That is, if either the scheduler or its associated fabric part becomes faulty, both of them can be disabled, resulting in a loss of capacity of C/S, where C is the fault-free capacity of the switch.


[0284] Furthermore, because the scheduler is the entity deciding by which physical path each IU travels through the switch fabric from its source node to its target destination node, it is possible to make the architecture even more fault tolerant by informing the scheduler about each faulty physical path discovered in the switch fabric.


[0285] For instance, in the case of the rotator switch, a bit-vector can be associated with each tandem node having a bit value associated with each source node, such that the bit is set only if the physical connection from the associated source node to the tandem node is known to be fault-free. A similar bit-vector can be associated with each tandem node for the destination nodes. Using these masking tables, the scheduling algorithm can be easily extended to avoid IUs travelling through a faulty path.


[0286] On the one hand, when the TDS value of a given tandem node is updated (e.g., line 4 of the DBS7 function), the entry corresponding to a destination node y for which the connection with the tandem node z is faulty is not reset to 0, but is set to K instead (only in the case of the TDS value for the class 1 source-selections). This guarantees that a destination node will never select a source node for sending an IU to a tandem node having a known faulty connection between the tandem node and the destination node.


[0287] On the other hand, when the TSS value of a given tandem node is updated (e.g., line 7 of the DBS7 function), the entry corresponding to a source node x for which the connection with tandem node z is faulty is not reset to 0, but is set to K instead (only in the case of the TSS value for the class 1 source-selections). This guarantees that a source node will never be granted permission to send an IU to a tandem node over a known faulty connection.
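
A minimal Python sketch of these masked updates is given below; the bit-vector layout, the function names and the value of K are assumptions used only to show how pre-loading a counter with K excludes a faulty link from scheduling.

    N, K = 4, 1
    dst_link_ok = [[True] * N for _ in range(N)]   # dst_link_ok[z][y]: tandem node z to destination y
    src_link_ok = [[True] * N for _ in range(N)]   # src_link_ok[z][x]: source x to tandem node z
    TDS = [[0] * N for _ in range(N)]              # TDS[z][y], class 1 source-selections
    TSS = [[0] * N for _ in range(N)]              # TSS[z][x], class 1 source-selections

    def update_TDS(z, y):
        # Faulty link: pre-load with K so destination y never schedules on tandem node z.
        TDS[z][y] = 0 if dst_link_ok[z][y] else K

    def update_TSS(z, x):
        # Faulty link: pre-load with K so source x is never granted on tandem node z.
        TSS[z][x] = 0 if src_link_ok[z][x] else K

    dst_link_ok[2][1] = False                      # example: mark one link faulty
    update_TDS(2, 1)
    print(TDS[2][1])                               # K, i.e. no selections possible on this link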


[0288] The same masking tables can be used as well for a demand-driven space-switch architecture.


[0289] The relation between the physical links and the logical links, either in the rotator switch architecture or in the demand-driven space switch architecture, is implementation dependent.


[0290] Furthermore, the DBS scheduler can be used to detect faulty logical links. Assuming that the switch fabric provides some bandwidth expansion with respect to the user traffic, the DBS scheduler can schedule a deterministic background traffic, using only a part of the switch fabric bandwidth expansion. The purpose of the background traffic is to traverse all the possible logical paths from the source nodes to the destination nodes. Since the traffic is deterministic, any IU which does not arrive at a destination node can be flagged as missing to the scheduler (e.g., via the request communication path), permitting the scheduler to mark as faulty the logical link corresponding to the missing IU.


[0291] The above scheme yields a fault-tolerant switch architecture in which faulty logical links are automatically and efficiently detected, permitting the scheduler to avoid scheduling the transfer of user data IUs over the faulty links. Furthermore, since the deterministic background traffic can always be scheduled, regardless of the link status, the same scheme automatically and efficiently detects the repair of a faulty logical link.
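
A small Python sketch of the bookkeeping this implies is shown below; the path key, the report channel and the function names are assumptions, since the request communication path is left unspecified above.

    link_ok = {}                           # (source, tandem, destination) -> known link status

    def on_background_iu_scheduled(path):
        # The scheduler knows which background IU should arrive over which logical path.
        link_ok.setdefault(path, True)

    def on_background_iu_report(path, arrived):
        # A missing background IU marks the link faulty; a later arrival marks it repaired.
        link_ok[path] = arrived

    on_background_iu_scheduled((1, 2, 3))
    on_background_iu_report((1, 2, 3), arrived=False)
    print(link_ok[(1, 2, 3)])              # False: avoid this logical link for user data IUs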


Claims
  • 1. In a switch for transferring information units and having a plurality of source nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node via a shared link to a desired destination node, said method comprising the steps of: determining availability of a destination node; determining demand for connection from each source node to the destination node; determining availability of each source node; and selecting an available source node in dependence upon the availability of and demand for the destination node.
  • 2. A method as claimed in claim 1 wherein the step of selecting includes scanning the source nodes in round-robin fashion until one requesting the desired destination node is found.
  • 3. A method as claimed in claim 1 wherein the step of determining availability of a destination node considers the destination nodes in random order.
  • 4. A method as claimed in claim 1 wherein the step of determining demand for connection considers a portion of the demand associated with the shared link and the time interval during which the connection will use the shared link.
  • 5. A method as claimed in claim 1 wherein the step of determining availability of a destination node considers a known faulty shared link as not available for supporting the connection with the destination node, wherein the step of determining availability of a source node considers a known faulty shared link as not available for supporting the connection with the source node, the method further comprising the step of periodically probing the status of each shared link by deterministically scheduling transfer of a background information unit via the shared link.
  • 6. In a switch for transferring information units and having a plurality of source nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node via a shared link to a desired destination node, said method comprising the steps of: determining availability of a destination node; determining a class of traffic being scheduled; determining demand for connection from each source node to the destination node; determining availability of each source node; and selecting an available source node in dependence upon the availability of and demand for the destination node and the class of traffic.
  • 7. A method as claimed in claim 6 wherein the step of selecting includes scanning the source nodes in round-robin fashion until one requesting the desired destination node is found.
  • 8. A method as claimed in claim 6 wherein the step of determining availability of a destination node considers the destination nodes in random order.
  • 9. A method as claimed in claim 6 wherein the step of determining demand for connection considers a portion of the demand associated with the shared link and the time interval during which the connection will use the shared link.
  • 10. A method as claimed in claim 6 wherein the step of determining availability of a destination node considers a known faulty shared link as not available for supporting the connection with the destination node, wherein the step of determining availability of a source node considers a known faulty shared link as not available for supporting the connection with the source node, the method further comprising the step of periodically probing the status of each shared link by deterministically scheduling transfer of a background information unit via the shared link.
  • 11. In a switch for transferring information units and having a plurality of source nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node via a shared link to a desired destination node, said method comprising the steps of: determining availability of a destination node; determining age of traffic being scheduled; determining demand for connection from each source node to the destination node; determining availability of each source node; and selecting an available source node in dependence upon the availability of and demand for the destination node and age of traffic.
  • 12. A method as claimed in claim 11 wherein the step of selecting includes scanning the source nodes in round-robin fashion until one requesting the desired destination node is found.
  • 13. A method as claimed in claim 11 wherein the step of determining availability of a destination node considers the destination nodes in random order.
  • 14. A method as claimed in claim 11 wherein the step of determining demand for connection considers a portion of the demand associated with the shared link and the time interval during which the connection will use the shared link.
  • 15. A method as claimed in claim 11 wherein the step of determining availability of a destination node considers a known faulty shared link as not available for supporting the connection with the destination node, wherein the step of determining availability of a source node considers a known faulty shared link as not available for supporting the connection with the source node, the method further comprising the step of periodically probing the status of each shared link by deterministically scheduling transfer of a background information unit via the shared link.
  • 16. In a rotator switch for transferring information units and having a plurality of source nodes, double-bank tandem nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node to a tandem node for a desired destination node, said method comprising the steps of: determining availability of a tandem node for a destination node; determining demand for connection from each source node to the destination node; determining availability of each source node; and selecting an available source node in dependence upon the availability of the tandem node for the destination node and demand for the destination node.
  • 17. A method as claimed in claim 16 wherein the step of selecting includes scanning the source nodes in round-robin fashion until one requesting the desired destination node is found.
  • 18. A method as claimed in claim 16 wherein the step of determining availability of a destination node considers the destination nodes in random order.
  • 19. A method as claimed in claim 16 wherein the step of determining demand for connection considers a portion of the demand associated with the tandem node.
  • 20. A method as claimed in claim 16 wherein the step of determining availability of a tandem node considers a known faulty link from the tandem node to the destination node as not available for supporting the connection with the destination node, wherein the step of determining availability of a source node considers a known faulty link from the source node to the tandem node as not available for supporting the connection with the source node, the method further comprising the step of periodically probing the status of each link with the tandem node by deterministically scheduling transfer of a background information unit via the link.
  • 21. In a switch for transferring information units and having a plurality of source nodes, double-bank tandem nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node to a tandem node for a desired destination node, said method comprising the steps of: determining availability of a tandem node for a destination node; determining a class of traffic being scheduled; determining demand for connection from each source node to the destination node; determining availability of each source node; and selecting an available source node in dependence upon the availability of the tandem node for the destination node, and demand for the destination node and the class of traffic.
  • 22. A method as claimed in claim 21 wherein the step of selecting includes scanning the source nodes in round-robin fashion until one requesting the desired destination node is found.
  • 23. A method as claimed in claim 21 wherein the step of determining availability of a destination node considers the destination nodes in random order.
  • 24. A method as claimed in claim 21 wherein the step of determining demand for connection considers a portion of the demand associated with the tandem node.
  • 25. A method as claimed in claim 21 wherein the step of determining availability of a tandem node considers a known faulty link from the tandem node to the destination node as not available for supporting the connection with the destination node, wherein the step of determining availability of a source node considers a known faulty link from the source node to the tandem node as not available for supporting the connection with the source node, the method further comprising the step of periodically probing the status of each link with the tandem node by deterministically scheduling transfer of a background information unit via the link.
  • 26. In a switch for transferring information units and having a plurality of source nodes, double-bank tandem nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node to a tandem node for a desired destination node, said method comprising the steps of: determining availability of a tandem node for a destination node; determining an age group of traffic being scheduled; determining demand for connection from each source node to the destination node; determining availability of each source node; and selecting a source node in dependence upon availability of the tandem node for the destination node, and demand for the destination node and the age group.
  • 27. A method as claimed in claim 26 wherein the step of selecting includes scanning the source nodes in round-robin fashion until one requesting the desired destination node is found.
  • 28. A method as claimed in claim 26 wherein the step of determining availability of a destination node considers the destination nodes in random order.
  • 29. A method as claimed in claim 26 wherein the step of determining demand for connection considers a portion of the demand associated with the tandem node.
  • 30. A method as claimed in claim 26 wherein the step of determining availability of a tandem node considers a known faulty link from the tandem node to the destination node as not available for supporting the connection with the destination node, wherein the step of determining availability of a source node considers a known faulty link from the source node to the tandem node as not available for supporting the connection with the source node, the method further comprising the step of periodically probing the status of each link with the tandem node by deterministically scheduling transfer of a background information unit via the link.
  • 31. In a rotator switch for transferring information units and having a plurality of source nodes, tandem nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node to a tandem node for a desired destination node, said method comprising the steps of: determining availability of a tandem node for a destination node; determining demand for connection from each source node to the destination node; determining availability of each source node; and selecting an available source node in dependence upon the availability of the tandem node for the destination node and demand for the destination node.
  • 32. A method as claimed in claim 31 wherein the step of selecting includes scanning the source nodes in round-robin fashion until one requesting the desired destination node is found.
  • 33. A method as claimed in claim 31 wherein the step of determining demand for connection considers a portion of the demand associated with the tandem node.
  • 34. A method as claimed in claim 31 wherein the step of determining availability of a tandem node considers a known faulty link from the tandem node to the destination node as not available for supporting the connection with the destination node, wherein the step of determining availability of a source node considers a known faulty link from the source node to the tandem node as not available for supporting the connection with the source node, the method further comprising the step of periodically probing the status of each link with the tandem node by deterministically scheduling transfer of a background information unit via the link.
  • 35. In a rotator switch for transferring information units and having a plurality of source nodes, tandem nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node to a tandem node for a desired destination node, said method comprising the steps of: determining availability of a tandem node for a destination node; determining a class of traffic being scheduled; determining demand for connection from each source node to the destination node; determining availability of each source node; and selecting an available source node in dependence upon the availability of the tandem node for the destination node, and demand for the destination node and the class of traffic.
  • 36. A method as claimed in claim 35 wherein the step of selecting includes scanning the source nodes in round-robin fashion until one requesting the desired destination node is found.
  • 37. A method as claimed in claim 35 wherein the step of determining demand for connection considers a portion of the demand associated with the tandem node.
  • 38. A method as claimed in claim 35 wherein the step of determining availability of a tandem node considers a known faulty link from the tandem node to the destination node as not available for supporting the connection with the destination node, wherein the step of determining availability of a source node considers a known faulty link from the source node to the tandem node as not available for supporting the connection with the source node, the method further comprising the step of periodically probing the status of each link with the tandem node by deterministically scheduling transfer of a background information unit via the link.
  • 39. In a rotator switch for transferring information units and having a plurality of source nodes, tandem nodes and destination nodes and selectable connectivity therebetween, a method of scheduling transfer of an information unit from a source node to a tandem node for a desired destination node, said method comprising the steps of: determining availability of a tandem node for a destination node; determining an age group of traffic being scheduled; determining demand for connection from each source node to the destination node; determining availability of each source node; and selecting a source node in dependence upon availability of the tandem node for the destination node, and demand for the destination node and the age group.
  • 40. A method as claimed in claim 39 wherein the step of selecting includes scanning the source nodes in round-robin fashion until one requesting the desired destination node is found.
  • 41. A method as claimed in claim 39 wherein the step of determining demand for connection considers a portion of the demand associated with the tandem node.
  • 42. A method as claimed in claim 39 wherein the step of determining availability of a tandem node considers a known faulty link from the tandem node to the destination node as not available for supporting the connection with the destination node, wherein the step of determining availability of a source node considers a known faulty link from the source node to the tandem node as not available for supporting the connection with the source node, the method further comprising the step of periodically probing the status of each link with the tandem node by deterministically scheduling transfer of a background information unit via the link.