FABRIC MODULES FOR HIGH-RADIX NETWORKS

Information

  • Patent Application Publication Number
    20240171886
  • Date Filed
    November 17, 2022
  • Date Published
    May 23, 2024
Abstract
An apparatus having a plurality of multifiber connector interfaces, where some of these multifiber connector interfaces can connect to network equipment in a network using multifiber cables, has an internal mesh implemented in two tiers. The first tier is configured to rearrange, and the second to recombine, the individual fibers of the different fiber groups. The light path of each transmitter and receiver is matched in order to provide proper optical connections from transmitting to receiving fibers, and complex arbitrary network topologies can be implemented with at most 1/N of the point-to-point interconnections otherwise required, where N is the number of channels per multifiber connector interface.
Description
FIELD OF INVENTION

Disclosed is an apparatus and method to improve the scalability of Data Center networks using mesh network topologies, switches of various radixes, tiers, and oversubscription ratios. The disclosed apparatus and method reduce the number of manual network connections while simplifying cabling installation, improving the flexibility and reliability of the data center, and reducing cost.


BACKGROUND AND PRIOR ART EVALUATION

The use of optical fiber for transmitting communication signals has been rapidly growing in importance due to its high bandwidth, low attenuation, and other distinct advantages, including radiation immunity, small size, and lightweight. Datacenter architectures using optical fiber are evolving to meet the global traffic demands and the increasing number of users and applications. The rise of cloud data centers, particularly the hyperscale cloud, has significantly changed the enterprise information technology (IT) business structure, network systems, and topologies. Moreover, cloud data center requirements are impacting technology roadmaps and standardization.


The wide adoption of server virtualization and advancements in data processing and storage technologies have produced the growth of East-West traffic within the data center. Traditional three-tier switch architectures comprising Core, Aggregation, and Access (CAA) layers cannot provide the low and equalized latency channels required for East-West traffic. Moreover, since the CAA architecture utilizes spanning tree protocol to disable redundant paths and build a loop-free topology, it underutilizes the network capacity.


The Folded Clos network (FCN) or Spine-and-Leaf architecture is a better-suited topology to overcome the limitations of three-tier CAA networks. A Clos network is a multilevel circuit switching network introduced by Charles Clos in 1953. Initially, this network was devised to increase the capacity of crossbar switches. It became less relevant due to the development and adoption of Very Large Scale Integration (VLSI) techniques. The use of complex optical interconnect topologies, initially for high-performance computing (HPC) and later for cloud data centers, makes this architecture relevant again. The Folded-Clos network topology utilizes two types of switch nodes, Spine and Leaf. Each Spine is connected to each Leaf. The network can scale horizontally to enable communication between a large number of servers, while minimizing latency and non-uniformity, by simply adding more Spine and Leaf switches.


An FCN depends on k, the switch radix, i.e., the ratio of Leaf switch server downlinks to Spine switch uplinks, and m, the number of tiers or layers of the network. The selection of (k, m) has a significant impact on the number of switches, the reliability and latency of the network, and the cost of deploying the data center network. FIG. 1 shows the number of servers as a function of switch radix and number of layers, assuming all switches have the same radix and a total oversubscription of 1:1.
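As an illustration only (not derived from FIG. 1 itself), the commonly used folded-Clos scaling rule for identical radix-k switches at 1:1 oversubscription gives roughly 2·(k/2)^m servers for an m-tier fabric. A minimal Python sketch, with names of our own choosing:

```python
# Illustrative sketch, assuming the standard folded-Clos scaling rule for
# identical radix-k switches with 1:1 oversubscription; an m-tier fabric then
# supports at most 2 * (k/2)**m servers. Not taken from FIG. 1 itself.

def max_servers(radix: int, tiers: int) -> int:
    """Upper bound on server count for an m-tier folded Clos of radix-k switches."""
    return 2 * (radix // 2) ** tiers

for k in (32, 64, 128, 256):
    print(k, max_servers(k, 2), max_servers(k, 3))
# e.g. radix 32 -> 512 servers with two tiers, 8192 with three tiers.
```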



FIGS. 2A and 2B show an example of two FCNs with a similar number of hosts, using different radixes and numbers of levels. The higher radix, 32 in this example, connects 32 edge switches to 16 spine switches in a two-layer network, as shown in FIG. 2A. The two-level FCN provides the lowest latency at the cost of requiring a denser network (512 interconnections). By using a three-layer network, the interconnection layout simplifies (256 interconnections); however, more switches are needed, and more latency is introduced in the network. Over recent years, the need for flatter networks to address the growing traffic among machines has favored increasing the radix of the switches' application-specific integrated circuits (ASICs). Currently, ASICs can handle radix-256 switches at a speed of 100 Gb/s per port. Those switches support 64×400 GbE, 128×200 GbE, or 256×100 GbE, enabling flatter networks with at most three layers.


Based on industry telecommunications infrastructure Standard TIA-942-A, the locations of leaf and spine switches can be separated by tens or hundreds of meters. Typically, Spine switches are located in the main distribution area (MDA), whereas Leaf switches are located in the equipment distribution area (EDA) or horizontal distribution area (HDA).


This architecture has been proven to deliver high-bandwidth and low latency (only two hops to reach the destination), providing low oversubscription connectivity. However, for large numbers of switches, the Spine-Leaf architecture requires a complex mesh with large numbers of fibers and connectors, which increases the cost and complexity of the installation.


Future data centers will require more flexible and adaptable networks than the traditional mesh currently implemented to accommodate highly distributed computing, machine learning (ML) training loads, high levels of virtualization, and data replication.


The deployment of new data centers or the scaling of data center networks with several hundred or thousands of servers is not an easy task. A large number of interconnections from Spine to Leaf switches is needed, as shown in FIG. 3. In this example, a fabric 100 can have 576 paths. Each line in the inset 120 can represent a group of eight or 12 fibers terminated in multi-fiber MPO connectors. The fibers can be ribbonized in traditional flat or rollable ribbons. The inset 110 shows a zoom-in on a small area of the fabric 100.


An interconnecting fabric similar to or larger than fabric 100 can be prone to errors, which can be accentuated in many cases by challenging deployment deadlines or the lack of training of installers. Although the Spine-Leaf topology is resilient to misplaced connections, a large number of interconnection errors will produce a noticeable impact due to performance degradation resulting in the loss of some server links. Managing large-scale network configurations usually requires a dedicated crew to check the interconnections, which causes delays and increases the cost of the deployment.


Using transpose boxes, as shown in the prior art, can help to reduce installation errors. However, the prior art cannot be easily adapted to different network topologies, switch radixes, or oversubscription levels.


A new mesh method and apparatus that utilizes modular, flexible, and better-organized interconnection mapping, and that can be quickly and reliably deployed in the data center, is disclosed here.


In U.S. Pat. No. 8,621,111, US 2012/0250679 A1, and US 2014/0025843 A1, a method of providing scalability in a data transmission network using a transpose box was disclosed. This box can connect the first tier and second tier of a network and facilitates the deployment of the network. However, a dedicated box for a selected network is required. As described in those applications, the network topology dictates the type of transpose box to be used. Changes in the topology can require swapping the transpose boxes. Based on the description, a different box will be needed if the number of Spine or Leaf switches, the oversubscription, or other parameters of the network change.


Once the topology is selected, the application provides a method for scaling. This requires connecting the port of one box to another with a cable. This adds losses to the network and cannot efficiently accommodate the scaling of the network.


The approach disclosed in US 2014/0025843 A1 can work well for a large data center that has already selected the type of network architecture to be implemented and can prepare and maintain stock of different kinds of transpose boxes for its needs. A more flexible or modular approach is needed for a broader deployment of mesh networks in data centers.


In WO2019099771A1, an interconnection box is disclosed. This application shows exemplary wiring to connect individual Spine and Leaf switches using a rack-mountable 1 RU module. The ports of these modules are connected internally using internal multi-fiber cables that incorporate a specific mesh. However, the module appears to be tuned to a particular topology, such as providing a mesh among four Spine and Leaf switch ports. The application does not describe how the device can be used for topologies with a variable number of Leaf or Spine switches or with a variable number of ports.


US 2015/0295655 A1 describes an optical interconnection assembly that uses a plurality of multiplexers and demultiplexers at each side of the network, one set on the Spine side and another set near the Leaf switches. Each mux and demux is configured to work together in the desired topology. However, the application does not demonstrate the flexibility and scalability of this approach.


U.S. Pat. No. 11,269,152 describes a method to circumvent the limitations of optical shuffle boxes, which, according to the application, do not easily accommodate reconfiguration or expansion of switch networks. The application describes apparatuses and methods for patching the network links using multiple distribution frames. At least two chassis are needed to connect switches from one layer of a network to another. Each chassis can accommodate a multiplicity of modules, e.g., cassettes arranged in a vertical configuration. The connection from a first-tier switch to one side of the modules is made using breakout cables. One side of the breakout cables is terminated in MPO (24 fibers) and the other in LC or other duplex connectors. One side of the modules has one or two MPO ports, and the other has six duplex LC connectors or newer very-small-form-factor (VSFF) connectors.


Similarly, the second-tier switch is connected to modules in the other chassis. The patching needed to connect the switches is performed using a plurality of jumper assemblies configured to connect to the plurality of optical modules. The jumpers are specially designed to fix their relative positions since they must maintain the correct (linear) order. U.S. Pat. No. 11,269,152 describes a method for patching, and it can make networks more scalable depending on the network radix. However, the network deployment is still challenging and susceptible to interconnection errors.


SUMMARY

An apparatus having a plurality of multifiber connector interfaces, where some of these multifiber connector interfaces can connect to network equipment in a network using multifiber cables, has an internal mesh implemented in two tiers. The first tier is configured to rearrange, and the second to recombine, the individual fibers of the different fiber groups. The light path of each transmitter and receiver is matched in order to provide proper optical connections from transmitting to receiving fibers, and complex arbitrary network topologies can be implemented with at most 1/N of the point-to-point interconnections otherwise required, where N is the number of channels per multifiber connector interface.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 shows the number of servers as a function of switch radix and the number of switch layers of the network.



FIG. 2A shows a first example of two FCNs with a similar number of hosts, using different radixes and levels.



FIG. 2B shows a second example of two FCNs with a similar number of hosts using different radixes and levels.



FIG. 3 shows interconnections of an example mesh that contains 576 interconnects (each with 12 or 8 fibers).



FIG. 4A shows a front view of the disclosed module 400.



FIG. 4B shows the rear view of module 400.



FIG. 5 shows a top view of module 400.



FIG. 6 shows the interconnections of module 400.



FIG. 7 illustrates the interconnection scheme from input to output ports of region 305.



FIG. 8 is a top view of submodule 500 showing interconnection arrangements.



FIG. 9 shows 8 possible configurations that can be implemented in a submodule.



FIG. 10A illustrates the Leaf-facing side of a stack of modules 400 in a simple method for implementing networks with up to 32 Leaf switches and 32 Spine switches.



FIG. 10B illustrates the Spine-facing side of the same stack of modules 400.



FIG. 10C helps illustrate the connections for this method.



FIG. 10D further illustrates the method.



FIG. 11A shows an example of interconnections between Spine port and Modules 400.



FIG. 11B shows an interconnection table for the example of FIG. 11A.



FIG. 12A shows an example of interconnections between ports of modules 400 and Spine chassis ports for sixteen Spines with two linecards each.



FIG. 12B shows an example of interconnections between ports of modules 400 and Spine chassis ports for eight Spines with four linecards each.



FIG. 12C shows an example of interconnections between ports of modules 400 and Spine chassis ports for four Spines with eight linecards each.



FIG. 12D shows an example of interconnections between ports of modules 400 and Spine chassis ports for two Spines with sixteen linecards each.



FIG. 13A illustrates a method for implementing a two-tier FCN (combined with FIG. 13B).



FIG. 13B illustrates a method for implementing a two-tier FCN (combined with FIG. 13A).



FIG. 14A illustrates a method for implementing a two-tier FCN (combined with FIG. 14B).



FIG. 14B illustrates a method for implementing a two-tier FCN (combined with FIG. 14A).



FIG. 15A illustrates a method for implementing a three-tier FCN (combined with FIG. 15B).



FIG. 15B illustrates a method for implementing a three-tier FCN (combined with FIG. 15A).



FIG. 16A illustrates a method for implementing a three-tier FCN (combined with FIG. 16B).



FIG. 16B illustrates a method for implementing a three-tier FCN (combined with FIG. 16A).



FIG. 17A illustrates a method for implementing a three-tier FCN (combined with FIG. 17B).



FIG. 17B illustrates a method for implementing a three-tier FCN (combined with FIG. 17A).



FIG. 18A shows part 1 (of 4) of Table I, which details interconnection configurations of submodule 500.



FIG. 18B shows part 2 of Table I, which details interconnection configurations of submodule 500.



FIG. 18C shows part 3 of Table I, which details interconnection configurations of submodule 500.



FIG. 18D shows part 4 of Table I, which details interconnection configurations of submodule 500.



FIG. 19 shows Table II, the mesh configuration of a module 400.



FIG. 20 shows Table III, parameters for two-layer FCNs with 16 Spine switches; for each number of switches and servers, the required number of modules 400 and their size in rack units are shown.



FIG. 21 shows Table IV, parameters for three-layer FCNs with 1:1 oversubscription and 1024 Spine switches; for each number of switches and servers, the required number of modules 400 and their size in rack units are shown.





DESCRIPTION OF INVENTION

A modular apparatus and a general method to deploy optical networks of a diversity of tiers and radixes are disclosed in this document. The module and method can be used with standalone, stacked, or chassis network switches as long as the modular connections utilize MPO connectors with 16 fibers. In particular, switches with Ethernet-specified SR or DR transceivers in their ports, such as 400GBASE-SR8, 800GBASE-SR8, or 800GBASE-DR8, can use these modules without any change in connectivity. Other types of transceivers, such as 400GBASE-FR4/LR4, can also be used by combining four transceiver ports with a harness or breakout cassette.



FIG. 4A shows a front view of the disclosed module 400, which is the key element in facilitating optical network deployment, reshaping, and scaling. In this embodiment, the module has 64 MPO connectors that can be divided between front and rear sections, as shown in the figures. Alternatively, the 64 ports could be located on one face of the device (not shown here).


For the sake of illustration, we assume that ports 420 to 451, each representing an MPO connector, are located on the front side of the module, facing the Leaf switches, as shown in FIG. 4A. On the other side of the module, ports 460 to 491 (opposite to the 420-451 ports), each representing one MPO connector, face the Spine switch connections. The MPO dimensions allow a module width, W, in the range of 12 inches up to 19 inches, and a height, H, in the range of 0.4 to 0.64 inches. The MPO connectors can be placed vertically, as shown in the figure, to allocate more space. Machine-readable labels 410 and 412 can help deploy or check the network interconnection, as described later in this application. Also, lateral rails 405 on both sides of the module enable the modules to be inserted into a chassis structure if required. Alternatively, using brackets 406, the modules can be directly attached to the rack. By using the specified height range for this embodiment, up to four modules can be stacked in less than 1.6 RU, depending on density requirements.



FIG. 5 shows a top view of the module, showing additional machine-readable labels 410 and 412. A laser scanner or a camera can read the labels. The read code can link to a database that has the interconnection maps of all modules in the network. The information can be displayed on a portable device, tablet, phone, or augmented reality lens to facilitate the deployment. See RSs 16563 and 25512 for more specific information on this.



FIG. 6 shows the whole interconnection scheme of the modules according to the present invention. To simplify the module structure, the mesh is divided into two regions, 305 and 310. Region 305 re-orders groups of fibers, e.g., 380, which can be standard or rollable ribbons or simply cable units of 16 fibers. The interconnection method of region 305, from ports 420 to 451 to ports 320 to 351, shown in FIG. 7, is needed to increase the mesh diversity and provide an optimum combination of individual fibers in region 310. FIG. 6 shows how individual fibers, such as 370, are paired with groups of fibers, e.g., 380.


Region 310 contains four submodules 500, which produce the interconnection mesh of the eight groups of fibers. FIG. 8 shows an interconnection diagram for one submodule 500, indicating how each of the fibers from port groups 510 to 545, which connect to region 305, is distributed to ports 550 to 585.


For an MPO transmitting eight parallel channels, the mesh of submodule 500 internally has 64 duplex fiber interconnections. The interconnections that enable full-duplex bandwidth communication among ports can be implemented in several million ways; eight of them are shown in FIG. 9. Any of these configurations can be used to implement the submodules 500 as long as they are employed consistently in the modules 400. The interconnection maps for these eight configurations are listed in Table I (FIGS. 18A-18D).


The groups of fibers from regions 305 and 310 are connected at interfaces 307. The interfaces 307 can be implemented using mechanical splices, multi-fiber connectors, or mass fusion splices; the latter is preferable to achieve lower losses. Similarly, the groups of fibers from region 310 (32 groups of fibers from the four submodules 500) connect to ports 460 to 491 using fusion splices, mechanical splices, or multifiber connectors, represented by 302.


The resultant interconnection map of module 400, comprising the meshing of groups of fibers in region 305 and the meshing of individual fibers in region 310, is shown in Table II (FIG. 19). As described in the table, each output port is meshed with exactly eight input ports. For example, the first eight ports, 460 to 467, are meshed with ports 420, 424, 428, 432, 436, 440, 444, and 448. The last eight ports, 484 to 491, are meshed with ports 423, 427, 431, 435, 439, 443, 447, and 451. Hence, according to the present invention, the apparatus mixes the Ethernet physical media dependent (PMD) lanes of one transceiver with the PMD lanes of other transceivers to facilitate interconnections of Spine and Leaf switches and distribute the traffic flow among multiple redundant paths. The two-step mesh incorporated in each module 400, combining sections 305 and 310, increases the degree of fiber connections inside each module. This apparatus simplifies network deployment since a significant part of the network complexity is moved from the structured cabling fabric to one or more modules 400. Using module 400 and following simple rules to connect a group of uplinks or downlinks horizontally or vertically, the installation becomes cleaner and cable management is highly improved, as shown in the following description of this application.
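For illustration, the stride-of-four grouping suggested by the two example rows of Table II can be generated with a short sketch. This is a hypothetical reconstruction (the function name is ours); the fiber-level pairing performed inside region 310 (Table I) is not modeled here.

```python
# Hypothetical sketch of the port-level mesh of module 400 (Table II).
# Assumes the stride-of-four pattern implied by the two example rows quoted in
# the text; the fiber-level mesh inside region 310 is not modeled.

FRONT_BASE, REAR_BASE, GROUP = 420, 460, 8

def front_ports_for(rear_port: int) -> list[int]:
    """Return the eight front ports meshed with a given rear port (460-491)."""
    group = (rear_port - REAR_BASE) // GROUP          # 0..3
    return [FRONT_BASE + group + 4 * i for i in range(GROUP)]

print(front_ports_for(460))  # [420, 424, 428, 432, 436, 440, 444, 448]
print(front_ports_for(491))  # [423, 427, 431, 435, 439, 443, 447, 451]
```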


A stack of several modules 400 can enable networks of diverse configurations and radixes, with various numbers of Spine and Leaf switches. For example, FIGS. 10A-10D show a stack of four modules 400. FIG. 10A shows the module side that is connected to the Leaf switches; for simplicity, we label this the front side. FIG. 10B shows the opposite side of the same module 400, the backside, which is connected to the Spine switches.


The diagrams in FIGS. 10A-10D assume that thirty-two Leaf switches, each with four MPO uplinks, need to be connected to the fabric shown in FIG. 10C. In this illustrative example, the uplinks of the Leaf switches are connected horizontally in groups of four until the last port of each module 400 is used. For example, ports 710 connect to the first four uplinks of the first Leaf switch, and ports 712 connect to the first four uplinks of the second Leaf switch. The last four ports of the first module 400 in the stack, 714, connect to the four uplinks of the eighth Leaf switch. Following the previous description and FIG. 10D, we can say that module ports 710, 712, and 714 connect to four uplinks from Leaf switches L1, L2, and L8, respectively. The first ports of the second module 400, ports 718, connect to the uplinks of the ninth Leaf switch (L9), and the last ports of the bottom module 400 in the stack, ports 720, connect to the four uplinks of the thirty-second Leaf switch (L32).
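For illustration, the horizontal placement rule described above can be expressed as a small helper. This is a sketch under stated assumptions (32 front-side MPO ports per module 400, four uplinks per Leaf switch); the helper name and the stack-relative port slots are ours, not the reference numerals of the figures.

```python
# Illustrative helper for the horizontal Leaf placement rule of FIGS. 10A-10D.
# Assumes 32 front-side MPO ports per module 400 and four uplinks per Leaf switch.

PORTS_PER_MODULE = 32
UPLINKS_PER_LEAF = 4

def leaf_front_ports(leaf_index: int) -> tuple[int, list[int]]:
    """Map Leaf switch L<leaf_index> (1-based) to (module in stack, port slots)."""
    slot = (leaf_index - 1) * UPLINKS_PER_LEAF
    module = slot // PORTS_PER_MODULE + 1             # 1-based module in the stack
    first = slot % PORTS_PER_MODULE
    return module, list(range(first, first + UPLINKS_PER_LEAF))

print(leaf_front_ports(1))    # (1, [0, 1, 2, 3])     -> ports 710 in the figure
print(leaf_front_ports(8))    # (1, [28, 29, 30, 31]) -> ports 714
print(leaf_front_ports(32))   # (4, [28, 29, 30, 31]) -> ports 720
```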


The Spine ports are assigned on the backside of the stacked modules 400, as shown in FIG. 10B. For example, if standalone Spine switches are used, 730, 732, 734, and 736 correspond to ports of the first, second, third, and thirty-second Spine switch, respectively, labeled as S1, S2, S3, and S32 in FIGS. 10A-10D. A more detailed description of the connections from the module to the Spines is shown in FIGS. 11A and 11B. FIG. 11A shows the connection of ports S1, S2, and S32 to standalone Spine switches. FIG. 11B shows an interconnection table for all the ports.


Alternatively, the Spines can be implemented using chassis switches. Although more expensive than standalone systems, chassis switches can provide several advantages, such as scalability, reliability, and performance. The port connectivity of the Spines using chassis switches can follow various configurations, some of which are described in FIGS. 12A-12D. For example, using sixteen Spine switches with two linecards each, all S1 and S2 ports can connect to the first Spine, S3 and S4 to the second Spine, and S31 and S32 ports to the sixteenth Spine, as shown in FIG. 12A. Using eight Spine switches with four linecards each, all S1, S2, S3, and S4 ports can connect to the first Spine, S5, S6, S7, and S8 to the second Spine, and S29, S30, S31, and S32 to the last Spine, as shown in FIG. 12B. Using four Spine switches with eight linecards each, ports S1 to S8 can connect to the first Spine, ports S9 to S16 to the second Spine, and S25 to S32 to the last Spine, as shown in FIG. 12C. Otherwise, by using two Spines with sixteen linecards each, S1 to S16 will be connected to the first Spine and the rest to the second Spine, as shown in FIG. 12D.
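For illustration, the chassis groupings of FIGS. 12A-12D amount to dividing the 32 Spine columns evenly among the chassis according to the number of linecards per chassis. The sketch below is illustrative only; the names are ours.

```python
# Illustrative sketch of the chassis groupings in FIGS. 12A-12D: the 32 Spine
# columns (S1..S32) on the back of the module stack are divided among chassis
# switches according to the number of linecards per chassis.

SPINE_COLUMNS = 32

def chassis_for_column(column: int, linecards_per_chassis: int) -> int:
    """Return the 1-based chassis index that column S<column> connects to."""
    return (column - 1) // linecards_per_chassis + 1

for linecards in (2, 4, 8, 16):                       # FIGS. 12A, 12B, 12C, 12D
    chassis_count = SPINE_COLUMNS // linecards
    print(f"{linecards:>2} linecards/chassis -> {chassis_count} Spine chassis;",
          "S1 ->", chassis_for_column(1, linecards),
          "S32 ->", chassis_for_column(32, linecards))
```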


Using modules 400 and the described method, each of the thirty-two Spine switches interconnects with all thirty-two Leaf switches, as shown in FIGS. 10C and 10D. The network deployment using stacks of modules 400 enables significantly broader configuration alternatives while providing the required number of Spine-Leaf connections with various bandwidth and oversubscription requirements. Moreover, using the stack of modules 400 increases the resilience to interconnection mistakes during network deployments. In all the tables shown in FIGS. 12A-12D, permutation of the columns has no impact on the network performance. For example, interchanging the S1 ports with the S2, S3, or S32 ports on the backside of the stack of modules 400 preserves the interconnectivity maps to the Leaf ports connected to the front side of the module stack. In general, module 400 reduces the complexity of scaling out, scaling up, or even de-scaling the network, as shown in the following examples.


The interconnections inside modules 400 can transmit signals at any wavelength from 830 nm to 1650 nm. Moreover, the signals assigned to each wavelength can propagate in one direction, e.g., from a transmitter to a receiver, or in a bidirectional way, e.g., using bidirectional transceivers.


An important metric to characterize the degree of complexity reduction in the modules is the aggregated data rate per module, estimated as Da = f × Nf × Nc × D, where Nf is the number of fibers used per connector, e.g., Nf = 16, Nc is the number of adapters in module 400, e.g., Nc = 32, D is the data rate per fiber in one direction, and f accounts for bidirectional communication if bidirectional transceivers are used. For example, using the typical case shown in this document, Nf = 16, Nc = 32, and f = 1, giving Da = 512 D. For current transceivers operating at D = 100 Gbps per lambda, Da = 51.2 Tbps. Assuming next-generation transceivers operating at D = 200 Gbps per lambda, Da = 102.4 Tbps. Using VSFF connectors such as SN-MT or MMC, three modules can fit in 1 RU, enabling mesh data rate densities of 307.2 Tbps per RU of mesh connectivity between switches. One rack (assuming 50 RUs) full of modules 400 can potentially provide up to 15 Pbps of mesh connectivity.
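As a quick check of these figures under the stated assumptions (f = 1, Nf = 16, Nc = 32), the arithmetic can be written out as a short sketch; the function name is ours.

```python
# Illustrative computation of the aggregated-rate estimate Da = f * Nf * Nc * D
# using the example values quoted above (not a general model of module 400).

def aggregated_rate_tbps(f: int, n_fibers: int, n_connectors: int, d_gbps: float) -> float:
    """Aggregate mesh capacity of one module in Tbps."""
    return f * n_fibers * n_connectors * d_gbps / 1000.0

da_100 = aggregated_rate_tbps(1, 16, 32, 100)   # 51.2 Tbps at 100 Gbps per lambda
da_200 = aggregated_rate_tbps(1, 16, 32, 200)   # 102.4 Tbps at 200 Gbps per lambda
print(da_100, da_200)
print(3 * da_200, "Tbps per RU (three modules per RU)")   # 307.2
print(50 * 3 * da_200 / 1000, "Pbps for a 50 RU rack")    # ~15.4
```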


Examples of Network Deployments Using Modules 400

The examples in FIGS. 13 to 15 show the implementation of two-tier and three-tier FCNs of various radixes, oversubscriptions, and sizes using modules 400. A detailed description of the number of modules needed for each network and an estimation of the rack space required for the modules are given in Tables III and IV.


Starting with two-tier FCNs, FIGS. 13A and 13B show a fabric 850 with 32 Spine and 64 Leaf switches, implemented using eight modules 400. The connection scheme of the stack is shown from both sides of the modules 400, one side labeled front, 855, and the other labeled back, 857. The 855 side connects to 64 Leaf switches, each with four MPO uplinks and sixteen fibers per connector. The switches are labeled Li, where i is the index of the switch; for this example, i is in the range of 1 to 64. As shown on the 855 front side, the Leaf switches connect horizontally following the methods shown previously. For example, the four L1 uplinks are connected adjacently in the first four ports of the first module 400. All L64 uplinks are connected to the last four ports of the eighth module 400. From side 857, the backside of the same module stack, 32 Spine switches connect vertically, as shown in the figure. Based on the disclosed dimension ranges of module 400, this fabric can be implemented in less than 4 RU.


Table III (FIG. 20) shows that the two-tier network can be scaled out to a large number of Leaf switches. This table also shows the number of modules 400 and the rack space required to implement those networks. FIGS. 14A and 14B show an example from that table: a fabric 870 with up to 32 Spine and 512 Leaf switches, implemented using 64 modules 400. The connection scheme of the stack is shown from both sides of the modules 400, one side labeled front, 875, and the other labeled back, 877. The 875 side connects to 512 Leaf switches, each with four MPO uplinks. For example, the four L1 uplinks are connected adjacently in the first four ports of the first module 400. All L512 uplinks are connected to the last four ports of the 64th module 400. From side 877, the backside of the same module stack, 32 Spine switches connect vertically, as shown in the figure. As described previously in FIGS. 12A-12D, only two Spine switches are required when using chassis switches with 16 linecards.
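For illustration, the module counts for these two-tier examples follow from simple port arithmetic. The sketch below assumes the port counts and module heights stated earlier (32 front-side MPO ports per module 400, four uplinks per Leaf switch, four stacked modules per 1.6 RU); the fuller figures are tabulated in Table III.

```python
# Rough sketch of the module-count and rack-space arithmetic behind fabrics 850
# and 870, assuming 32 front-side MPO ports per module 400, four uplinks per
# Leaf switch, and four stacked modules per 1.6 RU (values from this document).

def modules_needed(leaf_switches: int, uplinks_per_leaf: int = 4,
                   ports_per_module: int = 32) -> int:
    return -(-leaf_switches * uplinks_per_leaf // ports_per_module)  # ceiling division

def rack_units(n_modules: int) -> float:
    return n_modules / 4 * 1.6

print(modules_needed(64), rack_units(modules_needed(64)))    # 8 modules, 3.2 RU (fabric 850)
print(modules_needed(512), rack_units(modules_needed(512)))  # 64 modules, 25.6 RU (fabric 870)
```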


As shown in Table III (FIG. 20), using two-layer networks, the network can be scaled to support thousands of Leaf switches that can interconnect tens of thousands of servers. Scaling beyond that number requires using a three-layer FCN. FIG. 2B shows the topology of a three-layer network with 16 Spine and 32 Leaf switches. In a three-layer network, the Spines do not need to be connected to all Leaf switches but only to groups of them called PODs.


Module 400 can also be used to implement three-layer FCNs, as shown in FIGS. 15A and 15B. In this figure, a three-layer network with 1024 Spines (or 64 chassis with 16 linecards each) and 2048 Leaf switches is shown. This network has 64 PODs, and each POD of the network has 32 fabric and Leaf switches. Each POD's mesh can be fully implemented with four stacked modules 400, as shown previously (see FIGS. 10A-10D). Therefore, 64×4=256 modules are required to implement all the PODs. The connections between the Spines and the 64 PODs require another set of 256 modules 400, since each Spine switch needs to connect to only one Leaf switch in each POD.
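For illustration, the module-count arithmetic for this three-layer example can be restated as follows, using only the values given in the text; the variable names are ours.

```python
# Rough sketch of the module-count arithmetic for the three-layer example of
# FIGS. 15A-15B: 64 PODs, four stacked modules 400 per POD mesh, and an equal
# number of modules between the Spines and the PODs (values from the text).

PODS = 64
MODULES_PER_POD_MESH = 4

pod_modules = PODS * MODULES_PER_POD_MESH     # 256 modules for the POD meshes
spine_modules = 256                           # 256 modules for Spine-to-POD links
total_modules = pod_modules + spine_modules   # 512 modules 400 in total
print(total_modules, total_modules / 4 * 1.6, "RU (approx., four modules per 1.6 RU)")
```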


The interconnection method for the Leaf and Spine switches is shown in FIGS. 16A and 16B for the front side of the modules and in FIGS. 17A and 17B for the backside of the modules. Following the method described above, the uplinks of the Leaf switches in each POD populate the modules horizontally. For example, in FIGS. 16A and 16B, the four MPO uplinks of the first Leaf switch from POD 1, L1p1, occupy the first four MPO ports of the first module 400. The four MPO uplinks of Leaf switch 32 from POD 64, L32p64, occupy the last four MPO ports of the last module 400 in the stack.


From the opposite side of the stack, the columns of the module stack connect to the linecard MPO ports of the Spine switches. For example, as shown in FIGS. 17A and 17B, the MPO ports of the first linecard of Spine switch 1, S1c1, connect to eight MPO ports of the first column of the stack. The MPO ports of the second linecard of the same switch, S1c2, connect to eight MPO ports of the second column of the stack. The MPO ports of the sixteenth linecard of the last Spine switch, S32c16, connect to eight ports of the last column of the stack.


This three-layer fabric, with 1024 Spines (or 64 chassis with 16 linecards each) and 2048 Leaf switches, requires 512 modules 400, with an equivalent rack space equal to or smaller than 220 RU. The method to scale this network to a larger number of Leaf switches (and servers), the required number of modules 400, and the rack space are shown in Table IV (FIG. 21).


As shown in this section, modules 400 and the disclosed method of interconnection for two- and three-tier FCNs simplify the deployment of optical networks of different sizes and configurations. The risk of interconnection errors during the deployment is highly reduced since the groups of cables representing uplinks/downlinks for the same switches are connected in close proximity, and also due to the high degree of mesh in the networks. For example, in FIGS. 16A and 16B, all Lipj connections, where i is the Leaf uplink index ranging from 1 to 4 and j the POD index ranging from 1 to 64, are interchangeable. During the network deployment, an unplanned change from L1p1 to L1p2, L1p1 to L1p3, L1p1 to L1p4, or in general any combination inside that group, will not have an impact on the network operation. The topology will still connect all the Leaf switches from the PODs to the Spine switches with the same number of paths and identical allocated bandwidth. Similarly, in FIGS. 17A and 17B, all the Spine columns are interchangeable, as can be derived from FIG. 10D. The level of redundancy provided by the stack of modules 400 highly reduces the risk of fabric failures or performance degradation caused by errors in the interconnection.


While this invention has been described as having a preferred design, the present invention can be further modified within the spirit and scope of this disclosure. This application is therefore intended to cover any variations, uses, or adaptations of the invention using its general principles. Further, this application is intended to cover such departures from the present disclosure as come within known or customary practice in the art to which this invention pertains and which fall within the limits of the appended claims.

Claims
  • 1. An apparatus having a plurality of multifiber connector interfaces, where some of these multifiber connector interfaces can connect to network equipment in a network using multifiber cables, comprising an internal mesh implemented in two tiers, wherein the first is configured to rearrange and the second is configured to recombine individual fibers of the different fiber groups, further wherein the light path of each transmitter and receiver is matched in order to provide proper optical connections from transmitting to receiving fibers, and wherein complex arbitrary network topologies can be implemented with at least 1/N less point-to-point interconnections, where N=4 for MPOs with 8 fibers.
  • 2. The apparatus of claim 1 wherein the apparatus is further configured to be stacked to provide two-tier or three-tier CLOS network topology of various spine and leaf switch radixes.
  • 3. The apparatus of claim 1 wherein the apparatus is further configured to enable networks with different levels of oversubscription from 1:1 to 1:12.
  • 4. The apparatus of claim 1 wherein the apparatus is further configured to be used to scale optical networks from eight to a hundred thousand switches.
  • 5. The apparatus of claim 1 wherein the apparatus is further configured to provide redundant paths, reducing the risk of network failure due to interconnection errors.
  • 6. The apparatus of claim 1 wherein the apparatus is further configured to have a small form factor that enables stacking of three modules in one RU, allowing the stacking of up to 132 modules per rack.
  • 7. The apparatus of claim 1 further comprising external labels that can provide interconnection maps of the network to portable devices when the labels are read by label readers such as laser scanners or cameras.
  • 8. The apparatus of claim 1 wherein the apparatus is further configured to distribute the traffic load of the switches efficiently.