This invention relates to computer systems and, in particular, to computer network clusters having an increased bandwidth.
Computer cluster networks constructed with a fat-tree topology are very often used to interconnect the client nodes in massively parallel computer systems. This type of topology is called a “tree” because its structure has a trunk, branches, and leaves that ultimately connect to client nodes. In addition, fat-tree networks typically provide communication between client nodes at a constant bandwidth, because the number of connections from each switch level to the next higher level equals the number of connections into that level from the next lower level. The lowest level includes the “leaves,” whose ports connect to the client nodes. High-performance computers are increasingly used for essential functions that require higher bandwidth and reliability, and for applications that use very large numbers of processors and therefore require higher availability. To meet these needs, conventional network clusters typically include duplicate fat-tree networks stemming off the client nodes, that is, dual-rail network configurations. However, the cost of this improved capability is typically double that of a single-rail network.
In one embodiment, a computer cluster network includes at least three switches, each communicatively coupled to at least one respective client node. At least one of the at least three switches communicatively couples together at least two other ones of the at least three switches.
In a method embodiment, a method of networking client nodes includes communicatively coupling each switch of at least three switches to at least one respective client node. The method also includes communicatively coupling together at least two switches of the at least three switches through at least one other switch of the at least three switches.
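The arrangement recited above can be illustrated with a toy adjacency model. This is a minimal sketch, not the claimed implementation: the switch names `S1`–`S3`, the client names `c1`–`c3`, and the helper functions are all hypothetical. "Directly coupled" is modeled as sharing a link (no intervening switch or client node) and "coupled" as the existence of any path, following the definitions given later in this disclosure.

```python
from collections import deque

# Hypothetical toy topology: three switches, each with one client node,
# where S2 couples S1 and S3 together.
EDGES = [("S1", "c1"), ("S2", "c2"), ("S3", "c3"),
         ("S1", "S2"), ("S2", "S3")]


def build_graph(edges):
    """Build an undirected adjacency map from a list of links."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def directly_coupled(graph, a, b):
    """No intervening switch or client node: a and b share a link."""
    return b in graph.get(a, set())


def coupled(graph, a, b):
    """With or without intervening nodes: breadth-first search for any path."""
    seen, queue = {a}, deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def couples_together(graph, middle, a, b):
    """True if switch `middle` links switches a and b through itself."""
    return directly_coupled(graph, middle, a) and directly_coupled(graph, middle, b)


g = build_graph(EDGES)
print(couples_together(g, "S2", "S1", "S3"))  # True
print(directly_coupled(g, "S1", "S3"))        # False
print(coupled(g, "c1", "c3"))                 # True
```

Here S1 and S3 are coupled but not directly coupled; their only connection runs through S2, which is the relationship the embodiment recites.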
Technical advantages of some embodiments of the invention may include an increased bandwidth and redundancy over that provided by conventional single-rail fat-tree networks at a much lower cost than can be realized with conventional dual-rail networks. In addition, various embodiments may be able to use conventional software developed to manage fat-tree networks.
It will be understood that the various embodiments of the present invention may include some, all, or none of the enumerated technical advantages. In addition, other technical advantages of the present invention may be readily apparent to one skilled in the art from the figures, description, and claims included herein.
For a more complete understanding of the present invention and features and advantages thereof, reference is now made to the following description, taken in conjunction with the accompanying drawings, in which:
In accordance with the teachings of the present invention, a network cluster having an improved network fabric and a method for the same are provided. By utilizing a particular network fabric configuration, particular embodiments are able to realize an increased bandwidth and redundancy at reduced costs. Embodiments of the present invention and its advantages are best understood by referring to
Client nodes 102 generally refer to any suitable device or devices operable to communicate with each other through network fabric 104, including one or more of the following: switches, processing elements, memory elements, or input-output elements. In the example embodiment, client nodes 102 include computer processors. Network fabric 104 generally refers to any interconnecting system capable of transmitting audio, video, signals, data, messages, or any combination of the preceding. In this particular embodiment, network fabric 104 comprises a plurality of switches interconnected by copper cables.
Conventional network fabrics generally include dedicated edge switches and dedicated core switches. The edge switches couple the core switches to client nodes, while the core switches couple other core switches and/or edge switches together. For example, a message from a client node may route through the respective edge switch, then through a core switch, and then to a destination edge switch connected to the destination client node. Core switches by definition are not directly coupled to client nodes. For purposes of this disclosure and in the following claims, the term “directly coupled” means communicatively networked together without any intervening switches or client nodes, while the term “coupled” means communicatively networked together with or without any intervening switches or client nodes. Conventionally, systems that require higher bandwidth and redundancy often include duplicate fat-tree networks stemming off the client nodes, or dual-rail network configurations. However, the cost of this improved capability is often double that of a single-rail network, at least partially due to the utilization of twice the number of switches. Accordingly, teachings of some of the embodiments of the present invention recognize ways to increase bandwidth and redundancy over that provided by conventional single-rail fat-tree networks at a much lower cost than can be realized with conventional dual-rail networks. As will be shown, in various embodiments, the enhanced bandwidth, redundancy, and cost-efficiency may be effected by reducing the total number of switches over conventional architectures and increasing switch functionality relative to conventional routing schemes. Example embodiments of such improved network clusters are illustrated in
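The conventional edge/core arrangement described above can be sketched as a two-level fat tree built from P-port switches. This sketch assumes, for simplicity, a two-level tree in which each edge switch devotes half its ports to client nodes and half to up-links, and in which core switches use all their ports for edge switches; the function names and the rule for choosing a core switch are hypothetical. Under these assumptions the switch count works out to S = 3N/P, that is, S*P = 3N, and the up-link count at each level equals the down-link count, which is the constant-bandwidth property of a fat tree.

```python
def fat_tree_size(n: int, p: int) -> tuple[int, int]:
    """Return (edge switches, core switches) for N client nodes and P-port switches."""
    if p % 2 or n % p:
        raise ValueError("sketch assumes even P and N divisible by P")
    half = p // 2
    n_edge = n // half  # each edge switch: P/2 client ports + P/2 up-links
    n_core = n // p     # core switches absorb every edge up-link on all P ports
    return n_edge, n_core


def route(src: int, dst: int, n: int, p: int) -> list[str]:
    """Path for a message from client src to client dst (clients numbered 0..N-1)."""
    if not (0 <= src < n and 0 <= dst < n):
        raise ValueError("client index out of range")
    _, n_core = fat_tree_size(n, p)
    half = p // 2
    e_src, e_dst = src // half, dst // half  # client i attaches to edge switch i // (P/2)
    if e_src == e_dst:
        return [f"edge{e_src}"]              # both clients hang off one edge switch
    core = (e_src + e_dst) % n_core          # hypothetical static choice of core switch
    return [f"edge{e_src}", f"core{core}", f"edge{e_dst}"]


edge, core = fat_tree_size(144, 24)
print(edge, core, (edge + core) * 24 == 3 * 144)  # 12 6 True
print(route(0, 143, 144, 24))                     # ['edge0', 'core5', 'edge11']
```

The core switches never appear at either end of a path; they are not directly coupled to any client node, which is exactly what makes them dedicated core switches.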
Connectors 270 generally refer to any interconnecting medium capable of transmitting audio, video, signals, data, messages, or any combination of the preceding. Connectors 270 generally couple switches 212, 214, 216, 218, 222, 224, 226, and 228 and client nodes 232, 234, 236, 238, 240, 242, 244, 246, 248, 250, 252, 254, 256, 258, 260, and 262 of computer cluster network 200 together. In the example embodiment, connectors 270 comprise copper cables. However, any suitable connectors 270 may be used, including, for example, fiber optic cables or metallic traces on a circuit board.
In the example embodiment, each switch 212, 214, 216, 218, 222, 224, 226, and 228 includes a plurality of ports (e.g., ports 272, 274, 276, 278, 280, 282, 284, and 286 of switch 212) and an integrated circuit that allows data coming into any port to exit any other port. At least one port of each switch 212, 214, 216, 218, 222, 224, 226, and 228 couples that switch to a respective client node (e.g., port 272 couples switch 212 to client node 232). At least one other port of each switch couples the switch to another switch (e.g., port 280 couples switch 212 to switch 222). Although the switches 212, 214, 216, 218, 222, 224, 226, and 228 in this example each have eight ports (e.g., ports 272, 274, 276, 278, 280, 282, 284, and 286 of switch 212), any appropriate number of ports may be used without departing from the scope of the present disclosure. For example, network fabric 104 may include switches having twenty-four ports, as illustrated in
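The switch behavior described above, where data entering any port may leave through any other port, can be sketched as a small crossbar model. This is a hedged illustration only: the `Switch` class and its methods are hypothetical, the port numbers simply reuse reference numerals 272 through 286, and the assignment of port 282 to switch 224 is an assumed example rather than the wiring shown in the drawings.

```python
class Switch:
    """Toy model of a multi-port switch whose internal crossbar lets traffic
    entering any port leave through any other port."""

    def __init__(self, name: str, ports: list[int]):
        self.name = name
        # Map each port number to the attached peer (None means the port is free).
        self.links = {port: None for port in ports}

    def attach(self, port: int, peer: str) -> None:
        """Cable a client node or another switch to a free port."""
        if port not in self.links:
            raise ValueError(f"switch {self.name} has no port {port}")
        if self.links[port] is not None:
            raise ValueError(f"port {port} is already in use")
        self.links[port] = peer

    def forward(self, in_port: int, out_port: int) -> str:
        """Pass a message arriving on in_port out through out_port; return the peer reached."""
        if in_port == out_port:
            raise ValueError("a message must leave by a different port")
        for port in (in_port, out_port):
            if self.links.get(port) is None:
                raise ValueError(f"port {port} is not connected")
        return self.links[out_port]


sw212 = Switch("212", ports=list(range(272, 288, 2)))  # eight ports: 272, 274, ..., 286
sw212.attach(272, "client 232")  # stated in the text
sw212.attach(282, "switch 224")  # hypothetical port assignment
print(sw212.forward(272, 282))   # switch 224
```

Changing the `ports` list is all it takes to model a twenty-four-port switch; nothing else in the sketch depends on the port count.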
Client nodes 202 are substantially similar in structure and function to client nodes 102 of
To effect the routing of communication paths between switches 212, 214, 216, 218, 222, 224, 226, and 228, the example embodiment uses static routing tables, meaning that a message communicated between two client nodes 202 that are not directly coupled to the same switch 212, 214, 216, 218, 222, 224, 226, or 228 follows at least one predetermined communication path. In the example embodiment, each predetermined communication path includes respective origin and destination switches 212, 214, 216, 218, 222, 224, 226, or 228 of one of the switch sets 210 or 220, and a middle switch 212, 214, 216, 218, 222, 224, 226, or 228 of the other switch set 210 or 220. For purposes of this disclosure and in the following claims, “origin switch” refers to the switch directly coupled to the client node communicating a particular message, “destination switch” refers to the switch directly coupled to the client node receiving a particular message, and “middle switch” refers to the switch communicatively coupling together the origin and destination switches. To illustrate, a message communicated from client node 240 to client node 232 may route through origin switch 214, then through middle switch 224, then through destination switch 212, which is directly coupled to client node 232. For simplicity, the connectors 270 and static routing tables of the example embodiment are arranged in
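A static routing table of the kind described above can be sketched as a lookup keyed by (origin switch, destination switch) that returns the middle switch from the other switch set. This is a minimal sketch under stated assumptions: the rule used to pick a middle switch (index (i + j) mod k in the other set) is hypothetical, chosen only because it reproduces the single path given in the text (214 to 224 to 212); only the attachments of client nodes 232 and 240 come from the text; and, like the text, the sketch covers only paths whose origin and destination switches lie in the same set.

```python
from itertools import permutations

# Switch sets 210 and 220 of the example embodiment.
SET_210 = [212, 214, 216, 218]
SET_220 = [222, 224, 226, 228]

# Client-to-origin/destination-switch attachments; only these two appear in the text.
ATTACH = {232: 212, 240: 214}


def build_static_table(set_a, set_b):
    """Return {(origin, destination): middle} for paths within either set.

    Hypothetical rule: for origin index i and destination index j within one
    set, the middle switch is entry (i + j) mod k of the other set.
    """
    table = {}
    for own, other in ((set_a, set_b), (set_b, set_a)):
        k = len(other)
        for i, j in permutations(range(len(own)), 2):
            table[(own[i], own[j])] = other[(i + j) % k]
    return table


def route(table, src_client, dst_client):
    """Predetermined origin -> middle -> destination path between two client nodes."""
    origin, destination = ATTACH[src_client], ATTACH[dst_client]
    if origin == destination:
        return [origin]  # both clients directly coupled to the same switch
    return [origin, table[(origin, destination)], destination]


TABLE = build_static_table(SET_210, SET_220)
print(route(TABLE, 240, 232))  # [214, 224, 212]
```

Because the table is computed once and never changes, every switch can forward by a single lookup; the price of static routing is that traffic cannot be rebalanced at run time.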
In this particular embodiment, at least a majority of the client nodes 202 are interconnected by redundant communication paths. Providing redundant communication paths may be effected by merging two virtual networks, as shown in
Because each switch 212, 214, 216, 218, 222, 224, 226, or 228 may function as an origin, middle, or destination switch, depending on the communication path, the example embodiment reduces the total number of switches compared to conventional dual-rail fat-tree networks, at least in part, by eliminating conventional dedicated core switches. In various embodiments, this reduction in switch count may enhance the reliability and cost efficiency of computer cluster network 200. Various other embodiments may advantageously use the teachings of the present disclosure in conjunction with core switches. For example, core switches may couple together multiple sub-arrays, each having merged virtual networks similar to that illustrated in
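The claim that every switch can serve as origin, middle, or destination can be checked with a short enumeration. This sketch assumes that every switch of one set is cabled to every switch of the other set, so that any switch of the other set is a candidate middle switch; that full cabling and the helper name are assumptions, not details taken from the drawings. The switch comparison uses the dual-rail relation S*P = 6N given later in this description, applied to the 16 client nodes and 8-port switches of the example.

```python
from itertools import permutations

SET_210 = [212, 214, 216, 218]
SET_220 = [222, 224, 226, 228]


def roles_per_switch(set_a, set_b):
    """Collect the roles each switch plays over all origin-middle-destination paths."""
    roles = {s: set() for s in set_a + set_b}
    for own, other in ((set_a, set_b), (set_b, set_a)):
        for origin, destination in permutations(own, 2):
            for middle in other:  # assumed: any switch of the other set may be the middle
                roles[origin].add("origin")
                roles[destination].add("destination")
                roles[middle].add("middle")
    return roles


roles = roles_per_switch(SET_210, SET_220)
print(all(r == {"origin", "middle", "destination"} for r in roles.values()))  # True

# Switch counts for 16 client nodes and 8-port switches.
N, P = 16, 8
dual_rail = 6 * N // P                # S*P = 6N gives 12 switches
merged = len(SET_210) + len(SET_220)  # 8 switches, none reserved for the core role
print(dual_rail, merged)              # 12 8
```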
The example configuration of
The network fabric 104 configuration illustrated in
Although the example embodiments of
Various embodiments using switches with at least twenty-four ports may make fully redundant networks a more viable solution. To illustrate, the example embodiment of
The network configuration of computer cluster network 400 provides several advantages over conventionally configured single-rail or dual-rail networks. To illustrate, configuring 144 client nodes in a conventional single-rail network typically requires eighteen 24-port switches and 288 cables, with some of the switches functioning as core switches. This may be expressed mathematically as S*P=3N, where S is the number of switches, P is the number of ports per switch, and N is the number of client nodes. Likewise, the number of connectors typically utilized by conventional single-rail networks may be expressed as (S*P)−N. For conventional dual-rail networks, the corresponding expression typically is S*P=6N, while the number of connectors typically is (S*P)−2N. Although conventional dual-rail networks typically provide twice the bandwidth of comparable single-rail networks, they typically require twice as many switches and hence double the cost, as shown by the above equations. Accordingly, teachings of some embodiments of the present invention recognize a 1.2× to 1.5× increase in bandwidth over conventional single-rail networks, while generally reducing the number of switches and connectors by over 30% relative to conventional dual-rail fat-tree networks. Thus, in various embodiments, the proportional increase in bandwidth may be greater than the proportional increase in cost relative to comparable single-rail networks. For example, computer cluster network 400 redundantly networks 132 client nodes 402 using only twenty-four 24-port switches 410 and 396 connectors 470. The network configuration of this particular embodiment may be expressed mathematically as S*(P−2)=4N, or equivalently (S*P)=4N+2S, with the number of connectors expressed as [S*(P−2)]−N.
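The sizing relations above lend themselves to a quick calculator. The figures quoted for computer cluster network 400 (24 switches of 24 ports, 132 client nodes, 396 connectors) satisfy S*(P−2)=4N, equivalently S*P=4N+2S, together with the connector count [S*(P−2)]−N, so this sketch uses that relation for the merged configuration. The function and type names are hypothetical, and the sketch assumes N is chosen so that each switch count comes out as a whole number; it rejects other inputs rather than guessing how spare ports would be used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricSize:
    switches: int
    connectors: int


def _exact_div(num: int, den: int) -> int:
    """Integer division that refuses to round (sketch assumes integral switch counts)."""
    if num % den:
        raise ValueError(f"{num} is not divisible by {den}")
    return num // den


def single_rail(n: int, p: int) -> FabricSize:
    s = _exact_div(3 * n, p)             # S*P = 3N
    return FabricSize(s, s * p - n)      # connectors = (S*P) - N


def dual_rail(n: int, p: int) -> FabricSize:
    s = _exact_div(6 * n, p)             # S*P = 6N
    return FabricSize(s, s * p - 2 * n)  # connectors = (S*P) - 2N


def merged(n: int, p: int) -> FabricSize:
    s = _exact_div(4 * n, p - 2)         # S*(P-2) = 4N
    return FabricSize(s, s * (p - 2) - n)  # connectors = [S*(P-2)] - N


print(single_rail(144, 24))  # FabricSize(switches=18, connectors=288)
print(dual_rail(132, 24))    # FabricSize(switches=33, connectors=528)
print(merged(132, 24))       # FabricSize(switches=24, connectors=396)
```

For the same 132 client nodes, the dual-rail relation calls for 33 switches and 528 connectors, against 24 switches and 396 connectors for the configuration of network 400.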
Although particular embodiments of the method and apparatus of the present invention have been illustrated in the accompanying drawings and described in the foregoing detailed description, it will be understood that the invention is not limited to the embodiments disclosed, but is capable of numerous rearrangements, modifications, and substitutions without departing from the spirit of the invention as set forth and defined by the following claims.