The present invention relates to optoelectronic switches, and in particular to the topology according to which the constituent switching elements are arranged within that switch.
Large-scale packet switches can be built in a scalable fashion from smaller switching elements by connecting the switching elements according to the interconnection pattern of a given network topology. Examples of such network topologies are Folded Clos networks (also called k-ary n-trees), Torus (also called k-ary n-cubes) and “RPFabric topologies” such as those topologies disclosed in PCT/GB2016/051127. These network topologies are hierarchical in nature, meaning that a given implementation of the topology having L tiers or dimensions can be extended (scaled) by adding another tier or dimension to the topology, in such a fashion that the larger (L+1) dimension topology includes a number of identical L-dimensional sub-topologies that are interconnected by the (L+1)th dimension in a recursive fashion.
For an important class of topologies, the maximum scale (i.e. the maximum number of endpoints or nodes) supported by the topology is determined by the radix (i.e. the number of ports) of the constituent switching elements. This is true for e.g. Folded Clos and RPFabric topologies, but not for Torus topologies. Consequently, for topologies in this class, the factor by which the maximum scale of a given topology increases when adding a dimension is also determined by the radix of the switching elements.
It may be preferable and moreover economically advantageous to implement a given network topology using switching elements that are identical. This enables important economies of scale, as the individual switching element is generally implemented as an ASIC which is costly to design, manufacture and test. Requiring different ASICs to build a switch fabric would multiply the associated monetary and temporal costs.
As mentioned above, an example of a known network topology is a Folded Clos topology. In an L-dimensional Folded Clos topology made up of switching elements having a radix R, the maximum number of endpoints is given by:
Similarly, an RPFabric topology having L dimensions has a maximum number of endpoints given by:
Correspondingly, adding a dimension increases the maximum scales of Folded Clos and RPFabric topologies by a factor of R/2 and R respectively. Depending on the target scale for a specific instantiation, this granularity may be too coarse, in the sense that the maximum scale for L dimensions may be just too small for the target size, whereas the maximum scale for (L+1) dimensions may be much too large. For example, consider a situation where N=8,000 endpoints are required. First, consider the case of an RPFabric topology with R=24 and L=2. The maximum scale in this case is 8×24²=4,608 endpoints, which is clearly too small. Adding an extra dimension leads to a network with a maximum scale of 6×24³=82,944, which is greater than an order of magnitude too large, and therefore highly wasteful.
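The arithmetic in this example can be checked with a short sketch. The function name and the split C=R/(L+1) client ports per leaf are assumptions inferred from the factors 8 and 6 in the worked figures; the text above does not state this split.

```python
def rpfabric_max_endpoints(radix: int, dims: int) -> int:
    """Maximum endpoints, assuming C = R // (L + 1) client ports per leaf
    and R leaf switches along each of the L dimensions (an inferred model)."""
    clients_per_leaf = radix // (dims + 1)
    return clients_per_leaf * radix ** dims


target = 8000
for dims in (2, 3):
    scale = rpfabric_max_endpoints(24, dims)
    # Ratio > 1 means the fabric is large enough; a ratio far above 1 is wasteful.
    print(f"L={dims}: {scale} endpoints, {scale / target:.2f}x the target")
```

Under these assumptions the two-dimensional fabric falls short of the 8,000-endpoint target and the three-dimensional fabric overshoots it by roughly a factor of ten.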
This problem can be addressed by not fully populating some of the dimensions. For instance, in the example set out above, it would be possible to populate just two of the 24 possible two-dimensional sub-topologies of the three-dimensional network. This leads to topologies where the sizes or cardinalities of the dimensions may vary, thus providing finer granularity in scaling the size of the network. This principle can be applied to a single dimension, or to multiple dimensions. However, it is preferable to apply this approach to scale down an (L+1)-dimension network to sizes that are larger than what the same topology can support with L dimensions. Otherwise it would be more economical to use an L-dimensional network instead. This approach can still incur some inefficiency: because the switching elements all have the same radix R, the switching elements belonging to a scaled-down dimension have unconnected ports.
Accordingly, at its most general, embodiments of the present invention provide an optoelectronic switch architecture which provides incremental network scalability while minimizing the number of unused ports on the constituent switching elements. Broadly speaking, embodiments of the present invention achieve this by utilizing the concept of link bundling (also known as “link aggregation”, “parallel linking”, or “link trunking”). Link bundling is a technique wherein two or more physical ports on a given switching element are treated equivalently in terms of packet forwarding, which allows more generalized topologies, leading to greater efficiency at a finer granularity of switch configurations. The signals transferred from the input to the output device may be either optical or electronic signals, since it is not this feature which is at the heart of embodiments of the invention, rather it is the arrangement of switching elements within the optoelectronic switch which achieves the advantageous technical effects. This is described in greater detail in the remainder of the application.
Throughout this application, the terms “leaf switch” and “leaf” are used interchangeably, as are “spine switch” and “spine”.
Accordingly, a first aspect of the present invention provides an optoelectronic switch for transferring a signal from an input device to an output device, the optoelectronic switch including:
By providing more than one connection between a given spine switch and leaf switch (“link bundling”), the two connections or links are treated equivalently in terms of e.g. packet forwarding. As discussed in the Background section above, when a reduced-size or underpopulated dimension is used in a switch architecture employing constant-radix switching elements, there are unused ports leading to inefficiency. However, the same connectivity, i.e. in terms of bisection bandwidth and path diversity can be achieved by reducing the number of spine switches and employing link bundling. In certain embodiments of the present invention, the number of leaf switches in a given sub-array is therefore greater than the number of spine switches connected to that sub-array.
Switching elements which may be used in embodiments of the present invention are described in U.S. patent application Ser. No. 15/072,314.
In arrangements of spine switches and leaf switches according to embodiments of the present invention, it is possible to send data from any leaf switch (i.e. a source leaf switch) to any other leaf switch in the array (i.e. a final destination leaf switch) using a maximum of L hops (here, “hop” is the transfer of a signal from one leaf switch in a sub-array to another leaf-switch, which is in the same array, the transfer taking place via a spine switch connected to the array). This is possible because the leaf switches are able to act as intermediate switching elements, which can forward a signal coming into one of its F fabric ports, to another of its own fabric ports. This internal forwarding may be performed by an integrated switch inside the leaf switch, e.g. an electronic crossbar switch or an electronic shared-memory switch. Thus during a data transfer operation, data can perform a hop from one leaf switch to another leaf switch (via a spine switch), and then an internal electronic hop within the leaf switch to another fabric port, and then a second hop, along a different dimension (i.e. in a different sub-array of which the (intermediate) leaf switch is also a member). This process may be repeated up to L times, until the data reaches the final destination leaf switch, wherein it is then transferred to an output device, via a client port on that leaf switch.
Optoelectronic switches according to embodiments of the present invention include a plurality of sub-arrays, and more specifically, the number of sub-arrays associated with each dimension (i.e. the j-th dimension) is given by the product of the sizes of all the dimensions bar the dimension in question, or:
$$N_j = \prod_{i \neq j} R_i$$
Accordingly, the total number of sub-arrays in the whole optoelectronic switch is given by the sum of the number of sub-arrays for each dimension, over all L dimensions:
$$N_{\mathrm{total}} = \sum_{j=1}^{L} \prod_{i \neq j} R_i$$
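As a minimal sketch of the two counts described above, the following computes the number of sub-arrays per dimension and in total. The function names and the dimension sizes (24, 24, 2) are illustrative assumptions.

```python
from math import prod


def subarrays_per_dimension(sizes):
    """For each dimension j, the product of all dimension sizes except R_j."""
    total_leaves = prod(sizes)
    return [total_leaves // r for r in sizes]


def total_subarrays(sizes):
    """Sum of the per-dimension sub-array counts over all dimensions."""
    return sum(subarrays_per_dimension(sizes))


sizes = [24, 24, 2]  # hypothetical dimension sizes R_1, R_2, R_3
print(subarrays_per_dimension(sizes))
print(total_subarrays(sizes))
```

Dividing the total leaf count by R_j is equivalent to taking the product over all other dimensions, because every size is a positive integer.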
In some embodiments, the layout or structure, i.e. the interconnectivity between the spine switches and the leaf switches, is identical or substantially identical for each sub-array associated with a given dimension. There may only be two different sizes Ri of dimension, and in some embodiments, all but one of the dimensions may have a size Rlarge and the remaining dimension has a size Rsmall, which is smaller than Rlarge. For example, in a 3-dimensional optoelectronic switch: R1=R2=Rlarge=24, and R3=Rsmall=2. Such embodiments are easier to manufacture, since only one dimension is reduced in size. Having just one dimension reduced in size still provides improved granularity.
In such embodiments, the layout or structure of all of the sub-arrays associated with the dimensions having equal size may be identical or substantially identical. By having identical layouts, the control process, i.e. to determine the path which a given signal takes when traversing a given sub-array, is simplified as the same process can be applied to a plurality of sub-arrays, and a bespoke control process is not required for switching in different dimensions. Details of the methods by which the switching may be controlled may be found later in the application.
In some embodiments, the aggregate client port bandwidth per leaf switch is equal to the fabric port bandwidth available per dimension. Thus, if all of the ports on the switching element have the same bandwidth, this means that one fabric port per dimension should be provided for each client port. In other embodiments, the switching elements may be oversubscribed, i.e., there may be fewer than one fabric port per dimension for each client port, or the switching elements may be overprovisioned, i.e., there may be more than one fabric port per dimension for each client port. In some embodiments, the value of R is a number which is evenly divisible by 2, 3, 4, 5 or 6. In a subset of these embodiments, the value of R is divisible by more than one of 2, 3, 4, 5, and 6. For example, R may be equal to 12, 24, 30, 36, or 60.
In some embodiments, the number of unused ports is minimized, where “unused” refers to fabric ports on the spine switches which are not connected to any fabric ports on any other spine switches or leaf switches (though spine switches may in any event not be connected to other spine switches). Accordingly, in some embodiments, for a given sub-array associated with the reduced dimension, all of the fabric ports included on the plurality of Si spine switches connected to the sub-array are connected to a fabric port on a leaf switch in that sub-array.
Depending on the number of client ports, the radix of the switches, the number of spine switches connected to the sub-array associated with the reduced dimension, and the number of leaf switches in the sub-array, it may not always be possible to arrange for each of the fabric ports included on the plurality of Si spine switches to be connected to a respective fabric port on a leaf switch in that sub-array.
If Fli is the number of fabric ports per leaf switch in dimension i, and Fsi is the number of fabric ports per spine switch in dimension i, then in dimension i the total number of spine fabric ports is equal to the total number of leaf fabric ports (and unused fabric ports can be avoided) when the following constraint is met:
SiFsi=FliRi.
If Fli=C (i.e., the leaf switches are neither oversubscribed nor overprovisioned) and Fsi=R, the constraint is met (and it is possible to avoid unused fabric ports) if and only if SiR=CRi, in which Si, R, C and Ri are integer values, having the same meanings as described above. In this way, bisection bandwidth can be maintained (i.e. SiR≥CRi) while all of the fabric ports are connected to a fabric port on a leaf switch, to provide connectivity therebetween. In some embodiments, in a given sub-array associated with the reduced dimension, at least one spine switch of the plurality of Si spine switches connected to the sub-array has a plurality of connections to each of the Ri leaf switches. In other words, one of the spine switches may have two or three connections to each of the leaf switches. In some embodiments, all of the spine switches connected to the sub-array may have a plurality, e.g. two or three, connections to each of the leaf switches in the sub-array. The greater the extent to which the reduced dimension is reduced in size relative to the other dimensions, the greater the number of connections which the spine switches may have to each of the leaf switches.
If C>Fli (i.e., the leaf switches are oversubscribed) or if C<Fli (i.e., the leaf switches are overprovisioned), then the constraint to be met to make it possible to avoid having unused ports may instead be SiR=FliRi. If each spine switch is connected to more than one sub-array (see, e.g.,
In some embodiments, at least one spine switch, or alternatively each spine switch, connected to a given sub-array (associated with the reduced dimension) may have the same number of connections to each leaf switch in the array. This is possible when the number of client ports per leaf switch on the sub-array is divisible by the number of spine switches connected to the sub-array, with integer result. Such embodiments have a high degree of topological regularity, and therefore associated advantages in terms of routing and load balancing. In other embodiments, the number of connections may not be uniform across all of the leaf switches. This is the case when the number of client ports per leaf switch on the sub-array is not divisible by the number of spine switches connected to the sub-array. In these cases, each spine switch connected to a given sub-array associated with the reduced dimension may have:
It should be noted that it is not necessary that there are a plurality of connections between each spine switch and each leaf switch. In other words, the second number may be exactly one.
Here, “disjoint” means that, for a given spine switch, the first subset and the second subset of leaf switches have no members in common. However, the constituents of the first and second subset of leaf switches for one spine switch may be different from the constituents of the first and second subset of leaf switches for another spine switch, as long as there are the same numbers of leaf switches in each. For example, for one spine switch, there may be three connections to each of a subset of (i.e. containing) two leaf switches, and one connection to each of a subset of (i.e. containing) four leaf switches. These groups of connections may be referred to as “bundles” or “link bundles”, and may contain one connection. In some embodiments, the first number is greater than the second number by one. By having the first number and the second number as close as possible, the degree of topological regularity is maximized for those embodiments in which it is not possible to have equal numbers of connections to each leaf switch.
Embodiments among those described above that fulfil the criterion wherein SiR=CRi may have no unused ports. However, this is not possible with all arrangements. In some embodiments SiR>CRi, and accordingly there are U=SiR−CRi unused ports. Note that if SiR is less than CRi, bisection bandwidth may not be preserved. Even though there are some unused fabric ports in the sub-arrays which are associated with the reduced dimension, the number of unused ports is still reduced relative to configurations in which there is a maximum of one connection between each leaf switch and spine switch. The one connection may be a bidirectional connection, which may be in the form of a single cable or wire containing two bundled optical fibres, in other words a bidirectional connection providing physical media allowing full-duplex communication.
It may be possible to maintain both efficiency and topological regularity, even with unused ports in a given sub-array associated with the reduced dimension. In particular, when (as above) the number of client ports is exactly divisible by the number of spine switches connected to the sub-array, the connections and the unused ports can be spread evenly across the spine switches. In other words, each of the spine switches connected to a given sub-array associated with the reduced dimension may have the same number of unused ports, given by U/Si.
In other cases, e.g., where SiR>CRi and C/Si is not an integer, it is still possible to maximize both the efficiency and topological regularity by adopting a configuration wherein for a given sub-array associated with the reduced dimension:
Here when the first and second subset of spine switches are “disjoint”, this means that no spine switch is a member of both. Similarly, when “the second subset of leaf switches is disjoint from the first subset, with respect to each spine switch in the first subset of spine switches”, this means that for a given spine switch in the first subset of spine switches, the first and second subset of leaf switches have no members in common. However, it is possible for a leaf switch which is in the first subset for a first spine switch in the first subset of spine switches to be in the following:
The same definition of “disjoint” applies for the third and fourth subset of leaf switches. This is explained in detail with reference to the drawings later on in the application. For the same reasons as above, the first number may be greater than the second number by one, and/or the third number may be greater than the fourth number by one.
The principle of combining spine switches with unused ports is not limited to parallel spines connected to the same sub-array. Accordingly, at its most general, in embodiments of a second aspect of the present invention, instead of combining parallel spine switches connected to the same sub-array, a spine switch may connect to leaf switches in more than one sub-array. More specifically, embodiments of a second aspect of the present invention provide an optoelectronic switch for transferring a signal from an input device to an output device, the optoelectronic switch including:
In some embodiments, a single spine switch may connect to all leaf switches in a plurality of sub-arrays, each sub-array associated with the same dimension. In addition to reducing the total number of spine switches, such embodiments of the second aspect of the present invention introduce additional connectivity, since the consolidated spines permit movement along two dimensions in a single hop. This can therefore also shorten the average path length (where the path length is the smallest number of hops that may be used to send a signal from a source leaf switch to its final destination leaf switch). Each “hop” is a transfer of data directly between two switches. For example, if a packet of data is sent from a first leaf switch to a first spine switch, and from there to a second leaf switch, the packet has executed two hops. Given a spine switch having a radix R and a given dimension for which Ri<R (i.e. a reduced dimension), with
x=⌊R/Ri⌋, one spine associated with the ith dimension can be used to connect up to x sub-arrays along a second dimension j≠i. As used herein, “along a second dimension” does not mean that the sub-arrays are associated with a different dimension, but that a second dimension is traversed in order to connect to the sub-arrays in e.g. an adjacent sub-array. This is shown visually later in the application.
The number of unused ports per spine is given by U=R−xRi. It is also possible to partially combine sub-arrays. For example, in some embodiments, a spine switch connected to a first sub-array may be connected to a leaf switch (or plurality of leaves) in a second sub-array (in addition to all of the leaf switches in the first sub-array) associated with the same dimension as the first. The spine switch may be connected to all leaf switches having the same co-ordinate in the dimension in question. It is noted that embodiments of the first and second aspects of the present invention may be combined, and each sub-array may be connected to a plurality of spine switches, wherein each spine switch connected to a given sub-array associated with the reduced dimension may have a connection to each leaf switch in the sub-array and a plurality of connections to at least one leaf switch in the sub-array. Accordingly, any of the optional features presented above with reference to embodiments of the first aspect of the present invention may also apply to embodiments of the second aspect of the invention, to the extent that they are compatible.
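The consolidation arithmetic can be sketched as follows. The sketch assumes that x is the largest integer for which x·Ri does not exceed R, which is consistent with U=R−xRi being non-negative; the function name and the example values are hypothetical.

```python
def consolidate_spine(radix: int, dim_size: int):
    """Return (x, U): sub-arrays one spine can serve, and ports left unused.

    Assumes x = floor(R / R_i), so that U = R - x * R_i is never negative.
    """
    if not 0 < dim_size <= radix:
        raise ValueError("dimension size must satisfy 0 < R_i <= R")
    x = radix // dim_size
    unused = radix - x * dim_size
    return x, unused


print(consolidate_spine(24, 2))  # reduced dimension that divides the radix
print(consolidate_spine(24, 5))  # reduced dimension that does not
```

When Ri divides R exactly, a consolidated spine has no unused ports; otherwise the remainder of the division is left unconnected.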
The following optional features are compatible with embodiments of both the first and second aspects of the present invention.
The leaf switches may contain a packet processor configured to perform packet fragmentation, wherein packets of data having the same next destination switch module (i.e. those packets which are intended for the same leaf switch after the next hop, whether that leaf switch module be the final destination or just the next intermediate switch module in the journey of that packet of data) are arranged into frames having a predetermined size, and wherein packets of data may be split up into a plurality of packet fragments, which are then arranged in a corresponding plurality of frames. Optionally, one frame may contain data from more than one packet of data. Each packet fragment may have its own packet fragment header which includes information at least identifying the packet to which that packet fragment originally belonged, so that the packet may be reconstructed when all of its constituent fragments reach their final destination module.
For example, consider the case where the packet processor is configured so that the frame payload size is 1000 B, and three packets of 400 B, 800 B and 800 B are input into the switch module. If each of these were to be sent in separate frames, of one packet each, this would represent an efficiency of (400+800+800)/3000=67%. However, by using packet fragmentation, a first frame may include the 400 B packet, and 600 B of the first 800 B packet, and then a second frame may include the second 800 B packet and the remaining 200 B of the first 800 B packet. This leads to an efficiency of 100%. The frames that are constructed by this process represent packets of data in their own right, and so further fragmentation may occur at intermediate switch modules, when the packet undergoes more than one hop (e.g., more than one optical hop) in order to reach the destination switch module.
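The frame-filling behaviour in this example can be illustrated with a greedy packing sketch. This is an assumed strategy for illustration only: the function names are hypothetical, fragment headers are ignored, and all packets are taken to share the same next destination.

```python
def pack_frames(packet_sizes, frame_payload):
    """Greedily fill fixed-size frames, splitting packets across frame boundaries.

    Returns a list of frames; each frame is a list of (packet_index, byte_count).
    """
    frames = [[]]
    free = frame_payload
    for index, size in enumerate(packet_sizes):
        remaining = size
        while remaining > 0:
            if free == 0:  # current frame is full, so start a new one
                frames.append([])
                free = frame_payload
            take = min(remaining, free)
            frames[-1].append((index, take))
            remaining -= take
            free -= take
    return frames


def efficiency(packet_sizes, frame_count, frame_payload):
    """Fraction of the transmitted frame payload that carries packet data."""
    return sum(packet_sizes) / (frame_count * frame_payload)


packets = [400, 800, 800]
frames = pack_frames(packets, 1000)
print(frames)
print(efficiency(packets, len(packets), 1000))  # one packet per frame
print(efficiency(packets, len(frames), 1000))   # with fragmentation
```

With one packet per frame the three packets occupy three frames; with fragmentation they fit exactly into two.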
In order to maximize efficiency, subsequent processing of a frame (e.g. forwarding said frame to be converted into a first plurality of optical signals) may not occur until the filling proportion of a frame reaches a set or predetermined threshold, e.g. more than 80%, more than 90%, or when the frame is filled to 100%. The packets may alternatively be sent for subsequent processing after a set or predetermined amount of time has elapsed. In this way, if packets of data for a given switch module cease to arrive at the packet processor, a frame which is still below the threshold filling proportion may still be sent for subsequent processing rather than lying stagnant on the packet processor. The set or predetermined amount of time may be between 50 and 1000 ns, or between 50 and 200 ns. In some embodiments, the time interval is approximately 100 ns. Accordingly, the packet processor may include or be associated with a transmission side memory in which to temporarily store incomplete frames during their construction. The set or predetermined amount of time may be varied depending upon traffic demand; typically, the higher the rate of traffic flow, the shorter will be the set or predetermined amount of time and lower rates of traffic flow may lead to an increase in the set or predetermined amount of time. The leaf switches may correspondingly include another packet processor, which may be the same as the first packet processor, or may be a different packet processor, which is arranged to recombine the packet fragments upon receiving them, to recreate the original packet of data for subsequent processing and transmission.
Leaf switches may be configured to operate in burst mode, in which the leaf switches send data (e.g. in the form of packets, packet fragments or frames as described above) in a series of successive bursts, each burst containing only data having the same next destination leaf switch. Each successive burst may include a frame of data having a different next destination leaf switch. Pairs of sequential bursts may be separated by a predetermined time interval between 50 and 1000 ns, or between 50 and 200 ns, e.g. 100 ns. All of the leaf switches sending signals within a given sub-array may be able to “fire” a burst synchronously.
To control the switching of data by the spine switches connected to a given sub-array, that sub-array may include an arbiter, configured to control the operation of the spine switches connected to that sub-array, based on destination information contained in the data to be transferred. This control allows the provision of a route which can ensure that all data reaches its next destination leaf switch in a non-blocking fashion to minimize bottlenecking. The arbiter may be connected to a packet processor in each of the leaf switches, either directly or via a controller, or the like. When, for example, a packet of data is received by a leaf switch, a request is sent by the packet processor to the arbiter. The request may optionally identify the next destination leaf switch of a given packet of data. The arbiter is configured to establish a scheme which ensures that, to the greatest extent possible, each packet is able to perform its next hop. The arbiter may accordingly be configured to perform a bipartite graph matching algorithm in order to calculate pairings between the inputs and outputs of the spine switches, such that each input is paired with at most one output, and vice versa. In such embodiments, there may be an arbiter associated with each spine switch, which is configured to control the routing of signals from the inputs to the outputs of the spine switch. Each spine with its respective arbiter may be able to operate independently of the other spine switches connected to the sub-array. Naturally, in some cases, where e.g. several leaf switches send large amounts of data all of which is intended for the same output of a given spine switch, the request cannot be met. Accordingly, the arbiter may be configured to store information relating to requests that cannot be met, in a request queue. Then, until these requests are met, the associated data is buffered on the corresponding leaf switch, e.g. in the packet processor or in a separate memory.
In this way, requests that cannot be met are delayed rather than dropped, e.g. when a local bottleneck occurs at one or more of the spine switches. In other words, the arbiter maintains the state of a buffer memory or a virtual output queue (VOQ) on the leaf switches or spine switches. This state can be in the form of counters (counting e.g. the number of packets or bytes per VOQ), or in the form of FIFOs (first-in, first-out) that store packet descriptors. However, the actual packets themselves remain stored on the leaf switch(es) rather than at the arbiter.
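The matching performed by an arbiter for one spine switch might be sketched as below. The text does not specify which bipartite matching algorithm is used; this sketch substitutes a simple augmenting-path search (Kuhn's algorithm) purely for illustration, and the function name and request format are assumptions.

```python
def match_requests(requests):
    """Pair spine inputs with outputs so each output serves at most one input.

    requests maps an input port to the list of output ports it is asking for.
    Returns {input_port: granted_output_port}; inputs absent from the result
    stay queued, with their data buffered on the originating leaf switch.
    """
    owner = {}  # output port -> input port currently granted that output

    def try_grant(inp, visited):
        for out in requests[inp]:
            if out in visited:
                continue
            visited.add(out)
            # Take a free output, or displace its owner if the owner can move.
            if out not in owner or try_grant(owner[out], visited):
                owner[out] = inp
                return True
        return False

    for inp in requests:
        try_grant(inp, set())
    return {inp: out for out, inp in owner.items()}


grants = match_requests({0: [0, 1], 1: [0], 2: [1, 2]})
print(grants)
deferred = [inp for inp in (0, 1, 2) if inp not in grants]
print(deferred)
```

When several inputs request only the same output, the result contains a single grant and the remaining requests are the ones an arbiter would hold in its request queue.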
When it is necessary for a packet to perform more than one hop in order to reach its final destination leaf switch, the route may be deduced entirely from a comparison between the coordinates of the source leaf switch and the final destination leaf switch. For example, in a process known as dimension ordered routing, the first hop may match the first coordinate of the source and final destination leaf switches, the second hop may match the second coordinate of the source and final destination leaf switches and so on, until all of the coordinates match, i.e. until the packet has been transferred to the final destination leaf switch. For example, in a four-dimensional network, if the source leaf switch were to have coordinates (a, b, c, d) and the final destination leaf switch were to have coordinates (w, x, y, z), then the dimension-ordered route might be: (a, b, c, d)→(w, b, c, d)→(w, x, c, d)→(w, x, y, d)→(w, x, y, z). At any point along the route, the packet processor may compare the coordinates of the source leaf switch against the coordinates of the final destination leaf switch, and determine which coordinates do not yet match. Then it will decide to route along the non-matching directions, e.g. with the lowest index, or the highest index.
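Dimension-ordered routing as described above can be sketched in a few lines. The function name is hypothetical, and the sketch always resolves the lowest-index non-matching coordinate first, which is one of the two options mentioned.

```python
def dimension_ordered_route(source, destination):
    """List the leaf coordinates visited, fixing one coordinate per hop.

    Coordinates that already match are skipped, so the number of hops equals
    the number of coordinates in which source and destination differ.
    """
    if len(source) != len(destination):
        raise ValueError("coordinates must have the same number of dimensions")
    current = list(source)
    route = [tuple(current)]
    for axis, target in enumerate(destination):
        if current[axis] != target:
            current[axis] = target
            route.append(tuple(current))
    return route


for step in dimension_ordered_route(("a", "b", "c", "d"), ("w", "x", "y", "z")):
    print(step)
```

Because each hop needs only the current and final coordinates, an intermediate leaf switch can pick the next hop without any stored path state.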
An example demonstrating this resulting inefficiency is shown in
The RPFabric, which is employed in embodiments of the present invention, includes spines and leaves in which the leaves are connected only to clients and spines, and the spines are connected only to leaves. Each leaf switch provides C client ports and F fabric ports, where C+F≤R. Each spine switch connected to sub-arrays associated with the ith dimension provides R fabric ports, where Ri of those ports are used to connect to leaf switches within a given sub-array, and where Ri≤R. The number of unused ports per switching element is given by the following expressions:
Uleaf=R−F−C
Uspine=R−Ri
The size of a dimension i is denoted by Ri, meaning that Ri leaf switches are arranged along the ith axis of the grid. The total number of leaf switches equals the product of all Ri. For each Ri, Ri≤R holds, meaning that each spine switch can be connected to all leaf switches in a given sub-array.
In embodiments of the first aspect of the invention, there are multiple spine switches along each dimension, i.e. there are multiple spine switches connected to each sub-array. The number of spine switches for the ith dimension is denoted by Si. If the leaf switches are neither overprovisioned nor oversubscribed, then:
In some embodiments, a larger value of C may be used, so that the leaf switches are oversubscribed. This may result in an optoelectronic switch that provides a larger number of client connections, possibly resulting in a reduction in performance at the client ports. In other embodiments, a larger value of F may be used, resulting in leaf switches that are overprovisioned.
Then consider a sub-array including Ri<R leaf switches. There are then U=C(R−Ri) unused ports in total on the set of Si spine switches connected to that sub-array. Assuming that all sub-arrays associated with that dimension each contain the same number of leaf switches, the number of unused ports is the same for all sub-arrays associated with that dimension. Then, if C(R−Ri)≥R, then at least one spine switch may be removed without affecting the available bandwidth, as long as the existing connections are distributed over the remaining spines. This is where the concept of link bundling, and therefore the technical effect of embodiments of the present invention comes into play, and it will become apparent that four distinct cases arise, all falling within the scope of embodiments of the first aspect of the present invention.
Denoting the actual number of spines for dimension i (i.e., the number of spines connected to each sub-array associated with dimension i, in an embodiment in which the number of spines connected to each such sub-array is the same) by Si, the total number of ports available on the spines connected to a dimension-i sub-array is given by SiR. The total number of ports sufficient for full bisection bandwidth is given by CRi. In order to have zero unused ports, SiR=CRi. In some embodiments, however (e.g., along the reduced dimension, which includes fewer than the full number of leaf switches), it may instead be the case that SiR>CRi, e.g., it may be the case that
Si=⌈CRi/R⌉,
where the ceiling operator is used to select the smallest value of Si that ensures that every leaf fabric port in dimension i can be used, i.e., connected to a spine fabric port. Thus, there may still be some unused ports, given by U=SiR−CRi, across the spine switches connected to the same sub-array. Even if there are unused ports, embodiments of the present invention still provide an arrangement in which this number is minimized and the spine switches are utilized in as efficient a manner as possible.
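As a hedged illustration of the two quantities just described, the sketch below computes the smallest Si satisfying SiR≥CRi and the resulting number of unused spine ports U=SiR−CRi. The function names are hypothetical and integer arithmetic is used in place of a floating-point ceiling.

```python
def min_spines(C: int, R: int, Ri: int) -> int:
    """Smallest Si with Si * R >= C * Ri (integer ceiling of C*Ri/R)."""
    return (C * Ri + R - 1) // R


def unused_spine_ports(C: int, R: int, Ri: int) -> int:
    """U = Si*R - C*Ri for the minimal Si, summed over the sub-array's spines."""
    return min_spines(C, R, Ri) * R - C * Ri


# Illustrative values: R=12, C=4, Ri=4 gives Si=2 and U=24-16=8.
print(min_spines(4, 12, 4), unused_spine_ports(4, 12, 4))
```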
Four cases may be identified, one corresponding to each of
The case to which a given configuration of switching elements belongs can be determined by the following two criteria: (1) is SiR=CRi, and (2) is the bundling factor b an integer?
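The two criteria can be sketched as a small classifier. This is an illustration only: the function name `classify_case` is hypothetical, and the bundling factor is assumed to be b=C/Si (the number of fabric ports per leaf divided evenly over the spines), which agrees with the worked examples in this section but is an assumption where the original expression is not available.

```python
def classify_case(C: int, R: int, Ri: int, Si: int) -> int:
    """Return 1, 2, 3 or 4 according to the two criteria.

    Assumption: bundling factor b = C / Si, so "b is an integer"
    is tested as C % Si == 0.
    """
    if Si * R < C * Ri:
        raise ValueError("too few spine ports: Si*R must be >= C*Ri")
    ports_all_used = (Si * R == C * Ri)  # criterion 1
    b_is_integer = (C % Si == 0)         # criterion 2 (assumed b = C/Si)
    if ports_all_used:
        return 1 if b_is_integer else 2
    return 3 if b_is_integer else 4


# R=8, C=4: four leaves on two spines, then six leaves on three spines.
print(classify_case(4, 8, 4, 2), classify_case(4, 8, 6, 3))
```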
Case 1: In this case, the answer to each of the above two questions is yes. The bundling factor b is an integer, which means that exactly b ports from each leaf switch are connected to each spine connected to the sub-array in question. This case is illustrated in
Here there are R1=4 leaf switches in the sub-array, connected using S1=2 spine switches. The bundling factor is b=2,
and therefore, it can be seen that 2 links from each spine switch are connected to each of the leaf switches. Accordingly, all 8 of the fabric ports on each spine switch are used, maximizing efficiency, especially as compared to the case shown in
Case 2: In this case, SiR=CRi but the bundling factor b is not an integer. Thus, all of the fabric ports on all of the spine switches are used, but unlike in the previous example, the links are not distributed evenly amongst the spine switches. More specifically, on each spine switch there are a1 bundles having b1=⌈b⌉ links in them, and a2 bundles having b2=⌊b⌋ links in them. In these cases, the following is true:
a1b1+a2b2=R
a1+a2=Ri
It can then be shown that: a1=R−b2Ri and a2=b1Ri−R.
This is illustrated in
which is non-integer. Using the expressions defined above, it can be seen that a1=2, b1=2, a2=4, b2=1. In
Therefore, the 8 fabric ports on each spine switch are distributed amongst the 6 leaf switches with 1 connection to 4 of the leaf switches and 2 connections to the remaining two leaf switches. The same is true for all of the spine switches, and accordingly each leaf switch has 1 connection to each of 2 of the spine switches, and 2 connections to the third. These connections are distributed evenly so that all 8 of the fabric ports are utilized for each spine switch.
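One way to obtain such a distribution is a round-robin assignment of the larger bundles. The sketch below is an illustration under stated assumptions, not the connection pattern of the figure: the function name `round_robin_wiring` is hypothetical, each leaf is assumed to have C fabric ports in the dimension, and the bundle sizes are taken as b2=⌊C/Si⌋ and b1=b2+1.

```python
def round_robin_wiring(C: int, Ri: int, Si: int) -> list[list[int]]:
    """Build links[leaf][spine], the number of parallel links per leaf-spine pair.

    Every leaf first gets b2 = C // Si links to every spine. The remaining
    C % Si ports of each leaf are handed out to consecutive spines, continuing
    where the previous leaf stopped, so the larger bundles spread evenly.
    """
    b2, extra = divmod(C, Si)
    links = [[b2] * Si for _ in range(Ri)]
    for leaf in range(Ri):
        for k in range(extra):
            links[leaf][(leaf * extra + k) % Si] += 1
    return links


# Case 2 example from the text: R=8, C=4, Ri=6 leaves, Si=3 spines.
wiring = round_robin_wiring(C=4, Ri=6, Si=3)
for row in wiring:
    print(row)
```

For the example values, each spine column contains two bundles of 2 links and four bundles of 1 link (a1=2, a2=4), which uses all 8 fabric ports on each spine.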
The table below sets out which leaf switches are in the first and second subsets, as described earlier in the application (accordingly, the first number is the first bundling factor, b1=2, and the second number is the second bundling factor, b2=1; a1 and a2 represent the number of leaf switches in the first and second subset respectively):
Case 3: In this case, SiR>CRi, but the bundling factor b is an integer value. Thus, there still remain some unused fabric ports on each of the spine switches, but these are evenly distributed among all of the spine switches. Equivalently, b fabric ports from each leaf switch are connected to each spine connected to the given sub-array. It therefore follows that the number of unused ports per spine is also uniform in this case: U/Si=R−bRi.
This example is shown in
links from each spine switch are connected to each of the leaf switches. This leaves 2 unused ports on each of the spine switches. It can be seen that this arrangement provides the optimum connectivity between the spine switches and leaf switches, and minimizes the number of unused ports.
Case 4: The final case is the most irregular, in which SiR>CRi and the bundling factor b is non-integer. In this case, there are some bundles with b1=⌈b⌉ links in them, and other bundles with b2=⌊b⌋ links in them. Moreover, there are two disjoint sets of spines: the first set includes u1 spines with v1 unused ports and the second set includes u2 spines with v2 unused ports, wherein u1=U−v2Si and u2=v1Si−U, such that u1+u2=Si.
Each spine in the first set has a1=R−v1−b2Ri bundles of b1 links, and a2=Ri−a1 bundles of b2 links.
Correspondingly, each spine in the second set has a3=R−v2−b2Ri bundles of b1 links and a4=Ri−a3 bundles of b2 links.
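The Case 4 quantities can be evaluated together as in the following sketch. It is illustrative only: the function name `case4_parameters` is hypothetical, and it assumes b2=⌊C/Si⌋ with b1=b2+1, and v2=⌊U/Si⌋ with v1=v2+1 (the relation v1=v2+1 follows from u1+u2=Si; the floor is an assumption).

```python
def case4_parameters(C: int, R: int, Ri: int, Si: int) -> dict[str, int]:
    """Evaluate the Case 4 expressions for one sub-array.

    Assumptions: b = C/Si is non-integer, b2 = floor(b), b1 = b2 + 1,
    v2 = floor(U/Si), v1 = v2 + 1.
    """
    U = Si * R - C * Ri
    if U <= 0 or C % Si == 0:
        raise ValueError("not a Case 4 configuration")
    b2 = C // Si
    b1 = b2 + 1
    v2 = U // Si
    v1 = v2 + 1
    u1 = U - v2 * Si        # spines with v1 unused ports
    u2 = v1 * Si - U        # spines with v2 unused ports
    a1 = R - v1 - b2 * Ri   # bundles of b1 links on a first-set spine
    a2 = Ri - a1            # bundles of b2 links on a first-set spine
    a3 = R - v2 - b2 * Ri   # bundles of b1 links on a second-set spine
    a4 = Ri - a3            # bundles of b2 links on a second-set spine
    return dict(U=U, b1=b1, b2=b2, v1=v1, v2=v2, u1=u1, u2=u2,
                a1=a1, a2=a2, a3=a3, a4=a4)


# Illustrative values (not taken from a figure): R=12, C=4, Ri=7, Si=3.
p = case4_parameters(C=4, R=12, Ri=7, Si=3)
print(p)
```

For these values the used ports per spine are a1b1+a2b2=R−v1 in the first set and a3b1+a4b2=R−v2 in the second set, which is a quick consistency check on the expressions.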
This example is shown in
The following tables set out which switches are present in which subsets, to use the terminology used earlier in the application. Accordingly, the first number and the third number are equal to the bundling factor b1=2, and the second number and the fourth number are equal to the bundling factor b2=1; u1 and u2 give the number of spines in each of the subsets of spine switches; a1 and a2 give the number of leaf switches in the first and second subset of leaf switches respectively, and a3 and a4 give the number of leaf switches in the third and fourth subsets respectively.
which is an integer. Accordingly, there are U=24−16 unused ports across the two spine switches, i.e. 4 on each, and two connections to each spine switch SS1-2 from each leaf switch LS1-4. In other words, each of the spine switches has 4 bundles of 2 links per leaf.
In
still. Each leaf switch LS1-4 therefore has two links connected to each of the spine switches SS1-2, or in other words, each spine switch has 5 bundles of 2 links per leaf. Again, this falls into case 3 as described above.
is not integer valued, these unused ports are not evenly distributed across the spine switches. There is therefore an irregular connection pattern, as compared to the previous examples. The connections are as follows:
As above, the following tables summarize which leaf switches fall within which subset, as defined earlier in the application:
The table below shows examples of which cases various configurations of switching elements fall into, in an optoelectronic switch having 2 dimensions (L=2), a switching element radix R=12, and for C=4 client ports. In particular the values of all of the other parameters described above are shown, when a given dimension is reduced to Ri=2, 3, . . . , 12, for Si=1, 2, 3, 4 spine switches.
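The classification for these parameters can be enumerated with a short sketch. It is not a reproduction of the table itself: the function name `enumerate_cases` is hypothetical, Si is assumed to be the minimal value ⌈CRi/R⌉, and the bundling factor is assumed to be b=C/Si.

```python
def enumerate_cases(R: int, C: int, Ri_values) -> list[tuple[int, int, int, int]]:
    """Return (Ri, Si, U, case) for each Ri, using the minimal Si.

    Assumptions: Si = ceil(C*Ri/R); case numbering follows the two criteria
    "is U zero?" and "is b = C/Si an integer?".
    """
    rows = []
    for Ri in Ri_values:
        Si = (C * Ri + R - 1) // R           # smallest Si with Si*R >= C*Ri
        U = Si * R - C * Ri                  # unused spine ports
        case = 1 + 2 * (U > 0) + (C % Si != 0)
        rows.append((Ri, Si, U, case))
    return rows


for Ri, Si, U, case in enumerate_cases(R=12, C=4, Ri_values=range(2, 13)):
    print(f"Ri={Ri:2d}  Si={Si}  U={U}  case={case}")
```

Under these assumptions, Ri=7 and Ri=8 fall into Case 4, Ri=9 into Case 2, and the remaining values into Case 1 or Case 3.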
and so each of the spine switches SS1-3 is able to connect to all of the leaf switches in the two sub-arrays. Although in
Number | Date | Country | Kind |
---|---|---|---|
1611197.3 | Jun 2016 | GB | national |
1611433.2 | Jun 2016 | GB | national |
The present application (a) claims the benefit of U.S. Provisional Application 62/309,425, filed Mar. 16, 2016, and (b) claims the benefit of U.S. Provisional Application 62/354,600, filed Jun. 24, 2016, and (c) is a continuation-in-part of U.S. patent application Ser. No. 15/072,314, filed Mar. 16, 2016, which claims the benefit of U.S. Provisional Application 62/251,572, filed Nov. 5, 2015, and (d) is a continuation in part of PCT Application PCT/EP2016/076755, filed Nov. 4, 2016, which claims priority to (i) U.S. patent application Ser. No. 15/072,314, filed Mar. 16, 2016, which claims the benefit of U.S. Provisional Application 62/251,572, filed Nov. 5, 2015, and to (ii) U.S. Provisional Application 62/309,425, filed Mar. 16, 2016, and to (iii) PCT Application PCT/GB2016/051127, filed Apr. 22, 2016, which claims priority to (1) U.S. Provisional Application 62/152,696, filed Apr. 24, 2015, and to (2) U.S. Provisional Application 62/234,454, filed Sep. 29, 2015, and to (3) U.S. patent application Ser. No. 15/072,314, filed Mar. 16, 2016, which claims the benefit of U.S. Provisional Application 62/251,572, filed Nov. 5, 2015, and to (iv) U.S. Provisional Application 62/354,600, filed Jun. 24, 2016, and to (v) foreign application No. 1611433.2, filed in Great Britain on Jun. 30, 2016, which claims priority to (1) U.S. patent application Ser. No. 15/072,314, filed Mar. 16, 2016, which claims the benefit of U.S. Provisional Application 62/251,572, filed Nov. 5, 2015, and to (2) U.S. Provisional Application 62/309,425, filed Mar. 16, 2016, and to (3) U.S. Provisional Application 62/354,600, filed Jun. 24, 2016, and to (4) PCT Application PCT/GB2016/051127, filed Apr. 22, 2016, which claims priority to (A) U.S. Provisional Application 62/152,696, filed Apr. 24, 2015, and to (B) U.S. Provisional Application 62/234,454, filed Sep. 29, 2015, and to (C) U.S. patent application Ser. No. 15/072,314, filed Mar. 16, 2016, which claims the benefit of U.S. 
Provisional Application 62/251,572, filed Nov. 5, 2015, and to (vi) foreign application No. 1611197.3, filed in Great Britain on Jun. 28, 2016, which claims priority to U.S. patent application Ser. No. 15/072,314, filed Mar. 16, 2016, which claims the benefit of U.S. Provisional Application 62/251,572, filed Nov. 5, 2015, and (e) is a continuation in part of PCT Application PCT/GB2016/051127, filed Apr. 22, 2016, which claims priority to (i) U.S. Provisional Application 62/152,696, filed Apr. 24, 2015, and to (ii) U.S. Provisional Application 62/234,454, filed Sep. 29, 2015, and to (iii) U.S. patent application Ser. No. 15/072,314, filed Mar. 16, 2016, which claims the benefit of U.S. Provisional Application 62/251,572, filed Nov. 5, 2015, and (f) claims priority to foreign application No. 1611433.2, filed in Great Britain on Jun. 30, 2016, which claims priority to (i) U.S. patent application Ser. No. 15/072,314, filed Mar. 16, 2016, which claims the benefit of U.S. Provisional Application 62/251,572, filed Nov. 5, 2015, and to (ii) U.S. Provisional Application 62/309,425, filed Mar. 16, 2016, and to (iii) U.S. Provisional Application 62/354,600, filed Jun. 24, 2016, and to (iv) PCT Application PCT/GB2016/051127, filed Apr. 22, 2016, which claims priority to (1) U.S. Provisional Application 62/152,696, filed Apr. 24, 2015, and to (2) U.S. Provisional Application 62/234,454, filed Sep. 29, 2015, and to (3) U.S. patent application Ser. No. 15/072,314, filed Mar. 16, 2016, which claims the benefit of U.S. Provisional Application 62/251,572, filed Nov. 5, 2015, and (g) claims the benefit of U.S. Provisional Application 62/364,233, filed Jul. 19, 2016, the entire content of each of which is incorporated herein by reference.
Number | Date | Country | |
---|---|---|---|
20170195758 A1 | Jul 2017 | US |
Number | Date | Country | |
---|---|---|---|
62309425 | Mar 2016 | US | |
62354600 | Jun 2016 | US | |
62251572 | Nov 2015 | US | |
62152696 | Apr 2015 | US | |
62234454 | Sep 2015 | US | |
62364233 | Jul 2016 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 15072314 | Mar 2016 | US |
Child | 15461421 | US | |
Parent | PCT/EP2016/076755 | Nov 2016 | US |
Child | 15072314 | US | |
Parent | 15072314 | Mar 2016 | US |
Child | PCT/EP2016/076755 | US | |
Parent | PCT/GB2016/051127 | Apr 2016 | US |
Child | 15072314 | US | |
Parent | 15072314 | Mar 2016 | US |
Child | PCT/GB2016/051127 | US | |
Parent | 15461421 | US | |
Child | PCT/GB2016/051127 | US | |
Parent | PCT/GB2016/051127 | Apr 2016 | US |
Child | 15461421 | US | |
Parent | 15072314 | Mar 2016 | US |
Child | PCT/GB2016/051127 | US |