 
                 Patent Grant
                     9965429
                    The present invention relates to computer network topology and architecture. In particular, the present invention relates to a method and apparatus for managing the wiring and growth of a direct interconnect switch implemented on, for example, a torus or higher radix wiring structure.
The term Data Centers (DC) generally refers to facilities used to house large computer systems (often contained on racks that house the equipment) and their associated components, all connected by an enormous amount of structured cabling. Cloud Data Centers (CDC) is a term used to refer to large, generally off-premise facilities that similarly store an entity's data.
Network switches are computer networking apparatus that link network devices for communication/processing purposes. In other words, a switch is a telecommunication device that is capable of receiving a message from any device connected to it, and transmitting the message to the specific device for which it is intended. A network switch is also commonly referred to as a multi-port network bridge that processes and routes data. Here, a port refers to an interface (outlet for a cable or plug) between the switch and the computer/server/CPU to which it is attached.
Today, DCs and CDCs generally implement data center networking using a set of layer two switches. Layer two switches process and route data at layer 2, the data link layer, which is the protocol layer that transfers data between nodes (e.g. servers) on the same local area network or adjacent nodes in a wide area network. A key problem to solve, however, is how to build a large capacity computer network that is able to carry a very large aggregate bandwidth (hundreds of TB) containing a very large number of ports (thousands), that requires minimal structure and space (i.e. minimizing the need for a large room to house numerous cabinets with racks of cards), that is easily scalable, and that may assist in minimizing power consumption.
The traditional network topology implementation is based on totally independent switches organized in a hierarchical tree structure as shown in 
Many attempts have been made, however, to improve switching scalability, reliability, capacity and latency in data centers. For instance, efforts have been made to implement more complex switching solutions by using a unified control plane (e.g. the QFabric System switch from Juniper Networks; see, for instance, http://www.juniper.net/us/en/products-services/switching/qfabric-system/), but such a system still uses and maintains the traditional hierarchical architecture. In addition, given the exponential increase in the number of system users and data to be stored, accessed, and processed, processing power has become the most important factor when determining the performance requirements of a computer network system. While server performance has continually improved, a single server is not powerful enough to meet these needs, which is why parallel processing has become of paramount importance. As a result, traffic that once flowed predominantly north-south now flows primarily east-west, in many cases accounting for up to 80% of traffic. Despite this change in traffic flows, network architectures have not evolved to be optimal for this model. It is therefore still the topology of the communication network (which interconnects the computing nodes (servers)) that determines the speed of interactions between CPUs during parallel processing communication.
The need for increased east-west traffic communications led to the creation of newer, flatter network architectures, e.g. toroidal/torus networks. A torus interconnect system is a network topology for connecting network nodes (servers) in a mesh-like manner in parallel computer systems. A torus topology can have nodes arranged in 2, 3, or more (N) dimensions that can be visualized as an array wherein processors/servers are connected to their nearest neighbor processors/servers, and wherein processors/servers on opposite edges of the array are connected. In this way, each node has 2N connections in a N-dimensional torus configuration (
The present invention seeks to overcome the deficiencies in such prior art network topologies by providing a system and architecture that is beneficial and practical for commercial deployment in DCs and CDCs.
In one aspect, the present invention provides a method for managing the wiring and growth of a direct interconnect network implemented on a torus or higher radix interconnect structure, comprising: populating a passive patch panel comprising at least one connector board having multiple connectors with an interconnect plug at each of said connectors; removing an interconnect plug from a connector and replacing said plug with a connecting cable attached to a PCIe card housed in a server to add said server to the interconnect structure; discovering connectivity of the server to the interconnect structure; and discovering topology of the interconnect structure based on the servers added to the interconnect structure.
In another aspect, the present invention provides a passive patch panel for use in the implementation of a torus or higher radix interconnect, comprising: a passive backplane that houses node to node connectivity for the torus or higher radix interconnect; and at least one connector board plugged into the passive backplane comprising multiple connectors. The passive patch panel may be electrical, optical, or a hybrid of electrical and optical. The optical passive patch panel is capable of combining multiple optical wavelengths on the same fiber. Each of the multiple connectors of the at least one connector board is capable of receiving an interconnecting plug that may be electrical or optical, as appropriate, to maintain the continuity of the torus or higher radix topology.
In yet another aspect, the present invention provides a PCIe card for use in the implementation of a torus or higher radix interconnect, comprising: at least 4 electrical or optical ports for the torus or higher radix interconnect; a local switch; a processor with RAM and ROM memory; and a PCI interface. The local switch may be electrical or optical. The PCIe card is capable of supporting port to PCI traffic, hair pinning traffic, and transit with add/drop traffic. The PCIe card is further capable of combining multiple optical wavelengths on the same fiber.
The embodiment of the invention will now be described, by way of example, with reference to the accompanying drawings in which:
The present invention uses a torus mesh or higher radix wiring to implement direct interconnect switching for data center applications. Such architecture is capable of providing a high performance flat layer 2/3 network to interconnect tens of thousands of servers in a single switching domain.
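As noted in the background, each node of an N-dimensional torus is connected to its nearest neighbors, with nodes on opposite edges also connected, giving 2N connections per node. The following is a minimal sketch of that wiring rule, assuming nodes are addressed by integer coordinates; the function names `torus_neighbors` and `torus_links` are hypothetical and are not taken from the patent.

```python
from itertools import product


def torus_neighbors(node, dims):
    """Return the 2N neighbors of `node` in a torus whose sizes are `dims`.

    Assumption: a node is a tuple of integer coordinates, one per dimension.
    If a dimension has size 2, the +1 and -1 neighbors coincide.
    """
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coord = list(node)
            # The modulo implements the wrap-around link between opposite edges.
            coord[axis] = (coord[axis] + step) % size
            neighbors.append(tuple(coord))
    return neighbors


def torus_links(dims):
    """Return every undirected node-to-node link of the torus."""
    links = set()
    for node in product(*(range(size) for size in dims)):
        for other in torus_neighbors(node, dims):
            if other != node:
                links.add(frozenset((node, other)))
    return links
```

For a 4×4×4 torus, for example, `torus_neighbors((0, 0, 0), (4, 4, 4))` contains six nodes, including the wrap-around neighbor `(3, 0, 0)`.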
With reference to 
One problem to be addressed in such a topology, however, is how to reduce deployment complexity by simplifying the wiring and by making it simple to add new nodes to the network without impacting the existing implementation. This is one aspect of the present invention, and this disclosure addresses the wiring issues when implementing large torus or higher radix structures.
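The summary above describes populating every connector of the patch panel with an interconnect plug, and adding a server by replacing a plug with a cable. The following sketch illustrates why this lets nodes be added without disturbing existing ones. It is an assumption-laden model: it represents only a single ring (one torus dimension), and the class `PatchPanelRing` and its methods are hypothetical names, not the patent's discovery mechanism.

```python
class PatchPanelRing:
    """One ring (a single torus dimension) of patch panel connectors.

    Hypothetical model: a connector holds either an interconnect plug,
    which passes traffic straight through, or an attached server.
    """

    PLUG = None  # marker for an interconnect plug

    def __init__(self, num_connectors):
        # Populate every connector with an interconnect plug.
        self.connectors = [self.PLUG] * num_connectors

    def add_server(self, position, server):
        # Remove the plug at `position` and attach the server's cable.
        if self.connectors[position] is not self.PLUG:
            raise ValueError(f"connector {position} is already in use")
        self.connectors[position] = server

    def neighbors(self, position):
        """Return the nearest server on each side, skipping plugs."""
        n = len(self.connectors)
        found = []
        for step in (-1, 1):
            pos = (position + step) % n
            while self.connectors[pos] is self.PLUG and pos != position:
                pos = (pos + step) % n
            found.append(self.connectors[pos])
        return tuple(found)

    def topology(self):
        """Return the logical ring: attached servers in connector order."""
        return [s for s in self.connectors if s is not self.PLUG]
```

In this model the ring stays closed at every step: with servers at connectors 0 and 3 of a six-connector ring, each server sees the other on both sides, and adding a third server at connector 4 only changes the neighbors of the servers adjacent to it.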
The novel method of generating a netlist of the connectivity of the TPP is explained with the aid of pseudocode as shown at 
If a person skilled in the art of network architecture desires to interconnect all the servers in a rack 33 (up to 42 servers; see the middle section of 
As shown in 
The TPP identification mechanism (patch panel ID) is also implemented using the electronic device 240 which may be programmed at installation. The local persistent memory of device 240 may also hold other information—such as manufacturing date, version, configuration and ID. The connectivity of device 240 to the PCIe cards permits the transfer of this information at software request.
At card initialization, the software applies power to the IC 230 and reads the connector 21 ID. A practical implementation requires three wires: two for power and ground, and a third to read the connector 21 ID using “1-Wire” technology.
In a similar fashion, the patch panel ID, programmed at installation with the management software, can be read using the same wiring as with IC 230. The unpowered device 240 has non-volatile memory with the ability to support read/write transactions under software control. IC 240 may hold manufacturing information, TPP version, and TPP ID.
  
This implementation can significantly increase the number of servers in the rack and also provides flexibility in connector/wiring selection.
The printed circuit board 23 supporting the connectors 21 is plugged via high capacity connectors 22 to the backplane 26. The printed circuit board 24 also has high capacity connectors 22 and is also plugged into the backplane 26 to provide connectivity to the connector board 23. The high capacity connectors 21 on the board 24 can be used to interconnect the TPPs rack 33 to rack 33.
The direct interconnect wiring is implemented on the backplane 26. Any time the wiring changes (for different reasons) the only device to change is the backplane 26. For example, where a very large torus implementation needs to change (e.g. for a 10,000 server configuration the most efficient 4D torus would be a 10×10×10×10 configuration as opposed to trying to use a 6×7×16×15; and for a 160,000 server deployment the most efficient configuration would be a 20×20×20×20), you can accommodate these configurations by simply changing the backplane 26 while maintaining the connector boards 23 and 24 the same.
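The text does not define "most efficient", so the comparison below is a sketch under a stated assumption: efficiency is measured by the torus diameter (the worst-case hop count) and by the average hop count between nodes. The function names are hypothetical.

```python
from math import prod


def ring_average_hops(k):
    """Average shortest-path distance from one node to all k nodes of a ring."""
    return sum(min(d, k - d) for d in range(k)) / k


def torus_metrics(dims):
    """Return (node count, diameter, average hop count) for a torus.

    Distances in a torus add up per dimension, so both the diameter and
    the average hop count are sums over the individual rings.
    """
    nodes = prod(dims)
    diameter = sum(k // 2 for k in dims)
    avg_hops = sum(ring_average_hops(k) for k in dims)
    return nodes, diameter, avg_hops


for dims in [(10, 10, 10, 10), (6, 7, 16, 15), (20, 20, 20, 20)]:
    nodes, diameter, avg_hops = torus_metrics(dims)
    print(dims, nodes, diameter, round(avg_hops, 2))
```

Under these measures the balanced 10×10×10×10 torus has a diameter of 20 hops and an average of 10 hops, while the 6×7×16×15 torus, which has a similar node count (10,080), has a diameter of 21 hops and an average of about 10.95 hops.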
  
Another implementation option for the optical TPP is presented in 
The TPP can also be deployed as an electrical/optical hybrid implementation. In such a case, the torus nodes would have optical ports and electrical ports. A hybrid implementation would usually be used to provide connectivity to very large data centers. Electrical connectivity could be used at the rack level, and optical connectivity in all rack to rack or geographically distributed data center interconnects. Electrical cables are frequently used for low rate connectivity (e.g. 1 Gbps, or lower rates of 10/100 Mbps). Special electrical cables can be used for higher rate connectivity (e.g. 10 Gbps). The higher rate interconnect network may use optical transmission, as it can offer longer reach and can support very high rates (e.g. 100 Gbps or 400 Gbps).
  
The ToR switch 38 is an ordinary layer 2 Ethernet switch. The switch provides connectivity to the servers and connectivity to other ToR switches in a torus configuration where the ToR switch is a torus node. According to this embodiment of the invention the ToR switches 38 and the PCIe cards 41 are interconnected further using a modified version of the TPP 31.
  
  
A second type of traffic supported by the card 41 is the hair pinning traffic (as shown by 410). This occurs where traffic is switched from one port to another port; the traffic is simply transiting the node. A third type of traffic supported by the card 41 is transit with add/drop traffic (as shown at 420). This occurs when incoming traffic from one port is partially dropped to the PCI port and partially redirected to another port, or where the incoming traffic is merged with the traffic from the PCI port and redirected to another port.
The transit and add/drop traffic capability implements the direct interconnect network, whereby each node can be a traffic add/drop node.
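The three traffic types supported by the card (port to PCI, hair pinning, and transit with add/drop) can be distinguished by which interfaces a flow enters and leaves through. The sketch below is illustrative only: the label `PCI`, the `classify` function, and the reading of port to PCI traffic as a flow between exactly one torus port and the PCI interface are assumptions, not details taken from the patent.

```python
from enum import Enum


class TrafficType(Enum):
    PORT_TO_PCI = "port to PCI"
    HAIR_PINNING = "hair pinning"
    TRANSIT_ADD_DROP = "transit with add/drop"


PCI = "pci"  # hypothetical label for the card's PCI interface


def classify(sources, destinations):
    """Classify a flow by the interfaces it enters and leaves through."""
    endpoints = set(sources) | set(destinations)
    ports = endpoints - {PCI}
    if not ports:
        raise ValueError("a flow must involve at least one torus port")
    if PCI not in endpoints:
        # Port to port: the traffic is simply transiting the node.
        return TrafficType.HAIR_PINNING
    if len(ports) == 1:
        # Assumed meaning: between a single torus port and the host.
        return TrafficType.PORT_TO_PCI
    # Partially dropped to PCI, or merged with PCI traffic, and forwarded.
    return TrafficType.TRANSIT_ADD_DROP
```

For example, `classify(["p1"], ["pci", "p2"])` models incoming traffic that is partially dropped to the PCI port and partially redirected, and `classify(["p1", "pci"], ["p2"])` models incoming traffic merged with traffic from the PCI port; both are classified as transit with add/drop.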
| Filing Document | Filing Date | Country | Kind | 
|---|---|---|---|
| PCT/CA2014/000652 | 8/29/2014 | WO | 00 | 
| Publishing Document | Publishing Date | Country | Kind | 
|---|---|---|---|
| WO2015/027320 | 3/5/2015 | WO | A | 
| Number | Date | Country |
|---|---|---|
| 20160210261 A1 | Jul 2016 | US |
| Number | Date | Country |
|---|---|---|
| 61871721 | Aug 2013 | US |