Governments, companies, educational institutions, and others increasingly rely on large numbers of computers located in data centers. These data centers may comprise hundreds or even thousands of interconnected servers.
Interconnecting these servers has traditionally been an expensive prospect. A tree-based interconnection infrastructure relies on multiple servers feeding commodity switches, which in turn feed traffic into high-capacity switches. However, high-capacity switches are expensive and introduce a single point of failure for the servers which depend on them. Placing additional redundant switches to mitigate the single point of failure further increases the cost.
Furthermore, continuous data center growth is expected. This growth in the number of servers in a data center may exceed the capacity and cost effectiveness of existing infrastructures.
As described above, data centers are growing to incorporate an ever-increasing number of servers. The interconnections between those servers have required expensive hardware with finite limits on how many servers may be interconnected.
Disclosed is a method for interconnecting servers in a highly scalable interconnection structure which utilizes low-cost network infrastructure hardware. The resulting interconnection structure has a relatively low diameter; that is, the maximum distance between any two servers is small relative to the overall size of the structure. The interconnection structure is thus able to support real-time applications, and it exhibits a high bisection width, indicating robust link fault tolerance.
The disclosure is made with reference to the accompanying figures. In the figures, the leftmost reference number digit identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.
Large numbers of servers can be inexpensively interconnected using low-cost commodity network switches, a first network port on each commodity server, a second network port on each commodity server, and a traffic-aware routing module executed on each commodity server.
Connecting two or more servers, including commodity servers, via the first network port on each server to a commodity network switch forms a “unit.” Connecting two commodity servers of different units via the second network ports forms a “group.” Each unit has a direct connection to another unit via the second network port on a server in the unit. Additionally or alternatively, each group may have a direct connection via a second network port on a server in the group to another group. Traffic-aware routing modules executing on each commodity server use a greedy approach to determine routing of data between servers and to balance traffic across the first and second network ports. This greedy approach optimizes each traffic-aware routing module's individual output with low computational overhead while providing good overall performance across the interconnection structure.
Similar to unit 102, unit 112 comprises a four port switch 114 connected via level-0 links 110 to the first network ports on servers 116A, 116B, 116C, and 116D.
Similar to unit 102 above, unit 118 comprises a four port switch 120 connected via level-0 links 110 to the first network ports on servers 122A, 122B, 122C, and 122D.
Units are connected via level-1 links 124 between second network ports on servers in different units. In this application, at levels 1 and greater, one-half of all available servers may link to servers at a same level. An available server is one whose second network port is unused.
For example, before interconnection, unit 102 has four available servers (106A-106D) as none have their second ports in use. One-half of these four is two. Therefore, two servers from each unit having four servers may be used as unit-connecting servers to link with other units at a same level. In this example, four servers in each unit results in a group limited to three units.
These links to other units are illustrated as follows: Level-1 link 126 connects from the second port on server 122D in unit 118 to the second port on server 106C in unit 102. Level-1 link 128 connects from the second port on server 122B in unit 118 to the second port on server 116C in unit 112. Level-1 link 130 connects from the second port on server 106A in unit 102 to the second port on server 116A in unit 112. Thus, each unit has one direct level-1 link to every other unit and forms a level-1 group 132.
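The wiring of a level-1 group described above can be illustrated with a minimal sketch. It assumes units and servers are identified by zero-based indices and that the servers at even indices within a unit carry the level-1 links; both are arbitrary choices made for illustration, and the function name `build_level1_group` is hypothetical. The sketch connects every pair of units exactly once and uses each server's second port at most once.

```python
from itertools import combinations


def build_level1_group(ports_per_switch):
    """Return (num_units, links) for one level-1 group.

    Each link is ((unit_a, server_a), (unit_b, server_b)), joining the
    second network ports of two servers in different units.
    """
    half = ports_per_switch // 2   # servers per unit that carry level-1 links
    num_units = half + 1           # each unit reaches `half` other units
    links = []
    for a, b in combinations(range(num_units), 2):  # every pair, a < b
        # Position of the partner unit in each unit's list of other units.
        slot_in_a = b - 1
        slot_in_b = a
        # Assumption: slot s is served by the server at index 2 * s.
        links.append(((a, 2 * slot_in_a), (b, 2 * slot_in_b)))
    return num_units, links


num_units, links = build_level1_group(4)
print(num_units, links)
```

With four-port switches the sketch yields three units joined by three links, matching the group limited to three units in the example above.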
Groups may link to other groups in similar fashion, with one-half of all available servers used for linking. In this example, after accounting for the level-1 links, there are six available servers: 106B and 106D in unit 102, 116B and 116D in unit 112, and 122A and 122C in unit 118. One-half of these six available servers may provide links, providing three links to other groups. Links are distributed across units or groups to prevent more than a single server in one unit or group from connecting to the other unit or group.
For example, server 106B in unit 102 may provide one end of a level-2 link 134 between groups, leading to connection 136 described in more depth below. Similarly, server 116B in unit 112 may provide one end of a level-2 link 134 between groups, leading to connection 138, also described in more depth below. Finally, server 122C in unit 118 may provide one end of a level-2 link 134 between groups, leading to connection 140, also described below. Thus, in this example three links to three different groups at the same level are possible. Note that this arrangement leaves servers 116D, 106D, and 122A available for additional links 142.
These available additional links 142 are a result of constructing an interconnection structure in the fashion described in
Additionally, the exponential nature of the interconnection structure allows rapid scaling to large numbers of servers. For example, if 48-port switches are used instead of the four-port switches described above, a two-level interconnection structure may support 361,200 servers. Given this exponential nature, the number of levels may be relatively small, such as 2 or 3, thus resulting in a relatively small overall diameter as described above. Furthermore, use of the second network port, traditionally thought of as a “backup” port, does not adversely affect the reliability of a server in the event of a failure of one of the network ports. This is because the server may still use the remaining network port to carry traffic.
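The scaling arithmetic can be checked with a short sketch that applies the one-half rule level by level. The function name `servers_supported` is hypothetical, and the sketch assumes every switch port hosts one server and that the maximum number of groups is formed at each level.

```python
def servers_supported(ports_per_switch, levels):
    """Count servers in a structure built to the given number of levels."""
    servers = ports_per_switch     # level-0 unit: one server per switch port
    available = ports_per_switch   # servers whose second port is unused
    for _ in range(levels):
        linking = available // 2   # one-half of available servers link out
        groups = linking + 1       # one link to each of the other groups
        servers *= groups
        available = (available - linking) * groups
    return servers


print(servers_supported(4, 2))    # four-port example above
print(servers_supported(48, 2))   # 48-port switches, two levels
```

Under these assumptions the four-port example grows from 4 to 12 to 48 servers, and 48-port switches give 1,200 servers at level 1 and 361,200 at level 2.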
In addition to the level-1 group 132 as described above in
Interconnecting level-1 groups forms a level-2 group 200. One server from each group connects to a server in a different group. No connections are duplicated, i.e., a group does not directly connect more than once to another group. In this example the connections are as follows:
Pseudo-code describes the building of the recursively defined interconnection structure of this application. The following variables are defined as:
Using these variables, the following pseudo-code constructs $\mathrm{Group}_k$ (where $k > 0$) upon $g_k$ $\mathrm{Group}_{k-1}$ groups. In each $\mathrm{Group}_{k-1}$, the servers satisfying
$(u_{k-1} - 2^{k-1} + 1) \bmod 2^{k} = 0$ (Equation 1)
are selected as level-k servers and interconnected as described in pseudo-code 1 below.
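The selection rule of Equation 1 can be illustrated with a brief sketch. It assumes that u_{k−1} is the zero-based sequential index of a server within its Group_{k−1}; that reading is an assumption, and the function name `level_k_servers` is hypothetical.

```python
def level_k_servers(group_size, k):
    """Return indices u in a Group_(k-1) that satisfy Equation 1."""
    offset = 2 ** (k - 1) - 1      # u - 2^(k-1) + 1 equals u - offset
    return [u for u in range(group_size) if (u - offset) % (2 ** k) == 0]


print(level_k_servers(4, 1))    # level-1 servers in a four-server unit
print(level_k_servers(12, 2))   # level-2 servers in a twelve-server level-1 group
```

Under this assumption a four-server unit contributes servers 0 and 2 at level 1, and a twelve-server level-1 group contributes servers 1, 5, and 9 at level 2, leaving three servers available for further links.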
This interconnection structure allows for routing via multiple links. For example, data flow may have a source of server 122A and a destination of 212N. In this example, the data flow could traverse the following route:
The interconnected nature of the network provides robustness and redundancy. Should a level-2 link fail, data flow may still flow to a destination via other level-2 links. For example, assume level-2 link 138 fails or has insufficient bandwidth. One alternate route could comprise:
Because each element, such as a server, a unit, or a group, in the interconnected structure has two connections, alternate routes remain available so long as one of those two connections is functional. The bisection width of an interconnection structure is the minimum number of links that must be removed to break it into two equally sized disconnected networks. In the case of the interconnection structure described in this application, the lower bound of the bisection width of a Groupk is determined as follows:
This high bisection width indicates many possible paths exist between a given pair of servers, illustrating the inherent fault tolerance and possibility to provide multi-path routing in dynamic network environments, such as data centers.
At 304, N/2 servers in the first unit are connected via level-1 links to servers in each other unit using port 1 forming a level-1 group, wherein each level-1 link is to a different server in a different unit.
At 306, N/4 servers are connected via level-2 links in each level-1 group to servers in each other level-1 group to form a level-2 group, wherein each level-2 link is to a different server in a different group.
At 308, levels may continue to be added by connecting up to one-half of all available servers in each level “L” group to available servers in every other level L group to form a level L+1 group using level L+1 links, where each level L+1 link is to a server in a different group.
At 404, the source server sends a path-probing packet (PPP) towards the destination server using a traffic-aware routing (TAR) module. TAR provides effective link utilization by routing traffic based on dynamic traffic state. TAR does not require a centralized server for traffic scheduling, eliminating a single point of failure. TAR also does not require the exchange of traffic state information among even neighboring servers, thus reducing network traffic. Each intermediate server uses a TAR module to compute a traffic-aware path (TAP) on a hop-by-hop basis, based on available bandwidth of each port on the intermediate server. TAR will be discussed in more depth later in this application.
The PPP may also incorporate a progressive route (PR) field in the packet header. The PR field prevents two problems: routing back and multiple bypassing. The routing back problem arises when an intermediate server chooses to bypass its level-L (where L>0) link and routes the PPP to a next-hop server in the same unit, which then routes the same PPP back using level-recursive routing, forming a loop. The multiple bypassing problem occurs when one level-L (where L>0) link is bypassed and a third server at a lower level is chosen as the relay, so that two other level-L links in the current level will be bypassed. Those two level-L links may themselves need to be bypassed again, resulting in a path which is too long or potentially generating a loop.
The PR field prevents these problems by providing a counter for the TAR. Intermediate servers may modify the PR field. A PR field may have m entries, where m is the lowest common level of the source and destination servers. PRL denotes the Lth entry of the PR field, where 1≤L≤m. Each PRL plays two roles. First, when bypassing a level-L link, the level-L server in a selected third Group(L−1) is chosen as a proxy server and is set in the PRL. Intermediate servers check the PR field and route the packet to the lowest-level proxy server. Thus, the PPP will not be routed back.
Second, PRL may carry bypass information about bypassing in the current GroupL. If a number of bypasses exceeds a bypass threshold, the PPP jumps out of the current GroupL and another GroupL is chosen for relay. Generally, the higher the bypass threshold, the more likely that the PPP finds a balanced path because with a higher bypass threshold there are more opportunities to find a lower-utilized link within a group.
For example, where the bypass threshold is 1, two special identifiers for a PRL may be specified: BYZERO and BYONE. These special identifiers are different from server identifiers. BYZERO indicates no level-L link is bypassed in the current GroupL, so PRL is set to BYZERO when the packet is initialized or after crossing a level-i link if i>L. The BYONE value indicates there is already one level-L link bypassed in the current GroupL, so PRL is set to BYONE after traversing the level-L proxy server in the current GroupL. PRL is set as the identifier of the level-L proxy server between the selection of the proxy server and the arrival to the proxy server. The source server initializes the PR entry in a PPP as BYZERO.
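The PR entry transitions for a bypass threshold of 1 can be sketched as follows. The list representation of the PR field, the string server identifiers, and the helper names are assumptions made for illustration only; the sketch is not the routing module itself.

```python
BYZERO = "BYZERO"  # no level-L link bypassed yet in the current GroupL
BYONE = "BYONE"    # one level-L link already bypassed in the current GroupL


def init_pr(m):
    """Source server initializes all m entries to BYZERO."""
    return [BYZERO] * m


def set_proxy(pr, level, proxy_id):
    """Record the chosen level-L proxy server in PRL (levels are 1-based)."""
    pr[level - 1] = proxy_id


def on_arrival(pr, level, server_id):
    """On reaching the level-L proxy server, mark one bypass as used."""
    if pr[level - 1] == server_id:
        pr[level - 1] = BYONE


def on_cross_link(pr, i):
    """After crossing a level-i link, reset every PRL with L < i."""
    for level in range(1, min(i, len(pr) + 1)):
        pr[level - 1] = BYZERO


def lowest_level_proxy(pr):
    """Return the proxy server in the lowest entry, or None if there is none."""
    for entry in pr:
        if entry not in (BYZERO, BYONE):
            return entry
    return None


pr = init_pr(2)
set_proxy(pr, 2, "s7")          # bypass a level-2 link via hypothetical server s7
print(lowest_level_proxy(pr))   # intermediate servers route towards s7
on_arrival(pr, 2, "s7")
print(pr)
```

Keeping the proxy identifier in the same slot as the BYZERO and BYONE markers is what allows a single entry per level to serve both roles.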
At 406, the destination server receives a PPP. Once received, at 408 the destination server sends a reply-PPP (RPPP) back to the source server by exchanging the original PPP's source and destination fields.
At 410, the source server's receipt of the RPPP confirms that a path is available for transmission, and data flow may begin. Intermediate servers then forward the flow based on established entries in their routing tables built during the transit of the PPP.
At 412, periodically during a data transfer session between the source and destination server, a PPP may be sent to update the routing path. This update provides for changing the routing path based on dynamic traffic states within the interconnection structure. For example, failures or congestion elsewhere in the network may render the original routing path less efficient than a new path determined by the TAR. Thus, the PPP updates provide a mechanism to discover new paths in response to changing network conditions during a session.
Where the destination server is not server s, at 508 the TAR module tests whether the previous hop for the PPP is equal to the next hop in a routing table, and if so, processes the PPP using a Source Re-Route (SRR) module 510. SRR provides a mechanism for a PPP to bypass a busy or non-functional link.
When a server s decides to bypass its level-L (where L>0) link and choose a proxy server, server s may modify the PR field and re-route the PPP back to the previous hop from which server s received the packet. Original intermediate servers from the source server to s will then all receive the PPP from the next hop server for the flow in the routing table. The source server receives the PPP packet, and clears the routing entry for the flow, then re-routes the PPP to a lowest-level proxy server in the PR field for the PPP.
At 512, when s=level-L proxy server in the current level, the PRL is modified to BYONE at 514. Once PRL has been modified, or when s is not equal to the level-L proxy server in the current level, at 516 the next hop is determined using level-recursive routing. Another implementation is to randomly select a third GroupL−1 server when the outgoing link using level-recursive routing is the level-L link and the available bandwidth of the level-0 link is greater. This randomly selected third GroupL−1 server then relays the PPP.
Level-recursive routing at 516 comprises determining the next hop in the route. A lowest-level proxy server in the PR field of the PPP is returned. When no proxy server is present, the destination server of the packet is returned and the next hop towards the destination is computed using level-recursive routing. In the case of a server s routing a packet to a desired destination dst, a recursively computed routing may be described with the following pseudo code:
At 518, the TAR module determines whether it is executing at the source server for the PPP; if so, at 520 a special case for computing the next hop applies. A source server selects the level-L neighboring server as the next hop when the next hop determined using level-recursive routing is within the same unit but the available bandwidth of the source server's level-L link is greater than that of its level-0 link. Computation of the available bandwidth includes consideration of a virtual flow.
Virtual flow (VF) alleviates an imbalance trap problem. Assume that a level-L server s routes a flow to its level-L outgoing link and there is no traffic on its level-0 outgoing link. All subsequent flows that arrive from the level-0 incoming link will bypass the level-L link because the available bandwidth of the level-0 outgoing link is always higher. In this case, the outgoing bandwidth of the level-L link cannot be well utilized even though the other level-L links in the GroupL are heavily loaded. This imbalance trap problem arises because the TAR seeks to balance the local outgoing links of a server, not links among servers.
VF compares the available bandwidth between two outgoing links. VFs for a server s indicate flows that once arrived at s from the level-0 link but are not routed by s because of bypassing. That is, s is removed from the path by SRR. Each server initializes a Virtual Flow Counter (VFC) at 0. When a flow bypasses a level-L link, VFC is incremented by one. A non-zero VFC is reduced by one when a flow is routed by the level-0 outgoing link.
Available bandwidth of an outgoing link and virtual flows for the level-0 link are considered when evaluating available bandwidth. Setting the traffic volume of a virtual flow to the average traffic volume of routed flows avoids the imbalance trap problem.
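The virtual flow accounting can be sketched as below. The class name `OutgoingLinks`, the equal link capacity, and the use of a simple mean over routed flow volumes are assumptions made for illustration; the sketch shows only how a non-zero VFC lowers the apparent available bandwidth of the level-0 link.

```python
class OutgoingLinks:
    """Tracks the two outgoing links of one server plus its virtual flows."""

    def __init__(self, capacity):
        self.capacity = capacity      # assumed equal for both links
        self.level0_load = 0.0
        self.levelL_load = 0.0
        self.routed_volumes = []      # volumes of flows routed by this server
        self.vfc = 0                  # Virtual Flow Counter, starts at 0

    def _virtual_volume(self):
        # A virtual flow is charged the average volume of routed flows.
        if not self.routed_volumes:
            return 0.0
        return sum(self.routed_volumes) / len(self.routed_volumes)

    def available_level0(self):
        # Virtual flows count against the level-0 link only.
        return self.capacity - self.level0_load - self.vfc * self._virtual_volume()

    def available_levelL(self):
        return self.capacity - self.levelL_load

    def record_bypass(self):
        self.vfc += 1                 # a flow bypassed the level-L link

    def route_on_levelL(self, volume):
        self.levelL_load += volume
        self.routed_volumes.append(volume)

    def route_on_level0(self, volume):
        self.level0_load += volume
        self.routed_volumes.append(volume)
        if self.vfc > 0:
            self.vfc -= 1             # a non-zero VFC is reduced by one


links = OutgoingLinks(capacity=1000.0)
links.route_on_levelL(400.0)
links.record_bypass()
print(links.available_level0(), links.available_levelL())
```

After one bypass, the level-0 link no longer appears strictly less loaded than the level-L link, so later flows are not all steered away from the level-L link.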
When a proxy server is found which bypasses the level-L link of s, the PR field is updated and a next hop towards the proxy server is returned. At 522, when a level-L link is to be bypassed, the link is bypassed and the VFC is incremented at 524, and the next-hop server is returned at 526. When no proxy server is found, the level-L link is not bypassed. When no bypass of a level-L link is necessary at 522, the VFC is decremented at 528.
This process may also be described using the following pseudo-code:
Although specific details of illustrative systems and methods are described with regard to the figures and other flow diagrams presented herein, it should be understood that certain acts or elements of the systems and methods shown in the figures need not be performed in the order described, and may be modified, and/or may be omitted entirely, depending on the circumstances. As described in this application, modules may be implemented using software, hardware, firmware, or a combination of these. Moreover, the acts and methods described may be implemented by a computer, processor or other computing device based on instructions stored on memory, the memory comprising one or more computer-readable storage media (CRSM).
The CRSM may be any available physical media accessible by a computing device to implement the instructions stored thereon. CRSM may include, but is not limited to, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state memory technology, compact disk read-only memory (CD-ROM), digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computing device.