The present invention relates generally to data processing and, in particular, to an improved interconnection network topology for large scale, high performance computing (HPC) systems.
Scalable, cost-effective, and high performance interconnection networks are a prerequisite for large scale HPC systems. The dragonfly topology, described, for example, in US 2010/0049942, is a two-tier hierarchical interconnection network topology. At the first tier, a number of routers are connected in a group to form a large virtual router, with each router providing one or more ports to connect to other groups. At the second tier, multiple such groups of routers are connected such that the groups form a complete graph (full mesh), with each group having at least one link to every other group.
The main motivation for a dragonfly topology is that it effectively leverages large-radix routers to create a topology that scales to very high node counts with a low diameter of just three hops, while providing high bisection bandwidth. Moreover, the dragonfly minimizes the number of expensive long optical links, which provides a clear cost advantage over fat tree topologies, which require more long links to scale to similar-size networks.
However, when considering exascale systems, fat tree and two-tier dragonfly topologies run into scaling limits. Assuming a per-node peak compute capacity Rn=10 TFLOP/s, an exascale system would require N=100,000 nodes. A non-blocking fat tree network with N end nodes built from routers with r ports requires n=1+log(N/r)/log(r/2) levels (with n rounded up to the next integer); therefore, using current Infiniband routers with r=36 ports, this system scale requires a network with n=4 levels, which amounts to 2n−1=7 router ports per end node and (2n−1)/r=0.19 routers per end node. To achieve this scale in just three levels, routers with a radix r=74 are needed, which corresponds to 0.068 routers per node.
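The fat tree figures quoted above can be reproduced with a short calculation. The following is a minimal sketch; the function name `fat_tree_cost` is hypothetical and introduced only for illustration, and the formulas are those stated in the preceding paragraph.

```python
import math

def fat_tree_cost(num_nodes, radix):
    """Return (levels, ports_per_node, routers_per_node) for a non-blocking fat tree."""
    # n = 1 + log(N/r) / log(r/2), rounded up to the next integer
    levels = math.ceil(1 + math.log(num_nodes / radix) / math.log(radix / 2))
    ports_per_node = 2 * levels - 1            # 2n - 1 router ports per end node
    routers_per_node = ports_per_node / radix  # (2n - 1) / r routers per end node
    return levels, ports_per_node, routers_per_node

for r in (36, 74):
    n, ports, routers = fat_tree_cost(100_000, r)
    print(f"r={r}: n={n}, ports/node={ports}, routers/node={routers:.3f}")
```

With r=36 the sketch yields four levels, and with r=74 it yields three, in line with the values given above.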
A balanced two-tier dragonfly network (i.e., one providing a theoretical throughput bound of 100% under uniform traffic) with (p, a, h)=(12, 26, 12) can also scale to about 100,000 nodes, where p is the “bristling factor” indicating the number of terminals connected to each router, a is the number of routers in each group, and h is the number of channels in each router used to connect to other groups. This corresponds to 1/12=0.083 routers per node and 49/12=4.1 ports per node, which is significantly more cost-effective than the four-level fat tree, and about on par with the three-level fat tree, which requires much larger routers.
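The two-tier dragonfly figures above can be checked in the same way. This sketch assumes a maximum-size dragonfly in which each pair of groups shares exactly one global link, so that the number of groups is a·h+1; that assumption and the function name `dragonfly_two_tier` are illustrative and not taken from the text.

```python
def dragonfly_two_tier(p, a, h):
    """Size and cost of a two-tier dragonfly (p, a, h).

    Assumption: maximum-size configuration with a*h + 1 groups,
    i.e. exactly one global link between every pair of groups.
    """
    groups = a * h + 1
    routers = a * groups
    nodes = p * routers
    radix = p + (a - 1) + h  # terminal ports + local ports + global ports
    return {
        "groups": groups,
        "nodes": nodes,
        "radix": radix,
        "routers_per_node": 1 / p,
        "ports_per_node": radix / p,
    }

print(dragonfly_two_tier(12, 26, 12))
```

For (12, 26, 12) this gives a router radix of 49 and a node count just under 100,000.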
The present disclosure appreciates that as HPC systems scale to ever increasing node counts, integrating the router on the CPU chip would improve total system cost, density, and power, which are all important aspects of HPC systems. However, this next step in CPU integration reverses the industry trend, in the sense that the practical radix of such on-chip routers is much smaller than what has been previously predicted in the art. Consequently, it would be useful and desirable to deploy direct interconnection networks, such as dragonfly networks, that scale to high node counts and arbitrary numbers of tiers, while reducing router radices to a level that supports commercially practical integration of the routers into the CPU chips.
In at least some embodiments, a multiprocessor computer system includes a plurality of processor nodes and at least a three-tier hierarchical network interconnecting the processor nodes. The hierarchical network includes a plurality of routers interconnected such that each router is connected to a subset of the plurality of processor nodes; the plurality of routers are arranged in a hierarchy of n≧3 tiers (T1, . . . , Tn); the plurality of routers are partitioned into disjoint groups at the first tier T1, the groups at tier Ti being partitioned into disjoint groups (of complete Ti groups) at the next tier Ti+1 and a top tier Tn including a single group containing all of the plurality of routers; and for all tiers 1≦i≦n, each tier-Ti−1 subgroup within a tier Ti group is connected by at least one link to all other tier-Ti−1 subgroups within the same tier Ti group.
As noted above, dragonfly topologies are highly scalable direct networks with a good cost-performance ratio and are one of the principal options for future exascale machines. Dragonfly networks are hierarchical networks, and in principle, at each level of the hierarchy, a different connection pattern could be chosen. Most prior art dragonfly networks have adopted the fully connected mesh at each level of the hierarchy, but others have employed alternative connectivity patterns, such as a flattened butterfly. Connection patterns other than the fully-connected mesh increase scalability at the cost of longer shortest paths and/or an increased number of virtual channels required to avoid deadlock.
With reference now to
In accordance with the present disclosure, the prior art dragonfly topology can be generalized to an arbitrary number of tiers n. Using the enhanced notation introduced supra, a generalized dragonfly (GDF) topology is specified by GDF(p;
Gi is further defined as the number of tier i−1 groups that constitute a fully connected topology at tier i. For convenience, G0 can be utilized to indicate the individual routers, which can each be considered as “groups” at tier 0:
$$G_0 = 1, \qquad G_i = \Bigl(\prod_{j=0}^{i-1} G_j\Bigr) h_i + 1, \quad 1 \le i \le n.$$
The total number of routers and nodes scales very rapidly as a function of the parameters hi, as the total number of routers Si at tier i is given by the product of the group sizes up to and including tier i:
$$S_i = \prod_{j=1}^{i} G_j,$$
where the total number of routers equals Sn and the total number of nodes N equals pSn.
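The scaling of a GDF can be illustrated numerically. The sketch below assumes the recurrence G_i = S_{i−1}·h_i + 1 with G_0 = S_0 = 1, which is inferred from the closed-form expressions for S2 and S3 given later in this description; the function name `gdf_sizes` is hypothetical.

```python
def gdf_sizes(p, h):
    """Return (G, S, N) for a generalized dragonfly with bristling factor p.

    G[i] is the number of tier-(i-1) groups per tier-i group, S[i] is the
    number of routers in a tier-i group, and N is the total node count.
    Assumed recurrence: G[i] = S[i-1] * h[i] + 1, with G[0] = S[0] = 1.
    """
    G, S = [1], [1]
    for h_i in h:
        g_i = S[-1] * h_i + 1   # one link from the group to every other group
        G.append(g_i)
        S.append(S[-1] * g_i)   # product of group sizes up to this tier
    return G, S, p * S[-1]

G, S, N = gdf_sizes(2, [8, 4, 2])
print(G, S, N)
```

For p=2 and (h1, h2, h3)=(8, 4, 2), the resulting node count agrees with the 444,222 nodes quoted later for the corresponding three-tier topology.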
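Each router in the topology can be uniquely identified by an n-value coordinate vector (g1, g2, . . . , gn), with 0≦gi<Gi, with each coordinate indicating the relative group position at each tier of the hierarchy.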
Referring now to
It should be appreciated that the routers within a GDF topology are subject to a number of different interconnection patterns. In one exemplary implementation, a convenient interconnection pattern in which a particular router sa with coordinates
where the global link index Γia(x)=Δia·hi+x, where Δia is the relative index of router sa within its group at tier i and is given by:
This interconnection pattern allows each router sa to easily determine the positions of its neighbors based on its own position
From the foregoing equations, it is clear that GDF interconnection network topologies scale to extremely large node counts even with small hi values utilizing just a few tiers. Given a fixed router radix r excluding the end-node-facing ports, the values of hi can be selected to maximize the total number of routers Sn, subject to the constraint $\sum_{i=1}^{n} h_i = r$. For example, in the two-tier case, the total number of routers equals $S_2(h_1, h_2) = h_1^2 h_2 + 2h_1h_2 + h_1 + h_2 + 1$. Substituting $h_2 = r - h_1$ in the relation yields $S_2(h_1) = -h_1^3 + (r-2)h_1^2 + 2rh_1 + r + 1$. Differentiating this relation with respect to h1 yields
$$\frac{dS_2}{dh_1} = -3h_1^2 + 2(r-2)h_1 + 2r.$$
Setting the derivative to zero and solving for h1 finally yields:
$$h_1^{opt} = \tfrac{1}{3}\bigl(r - 2 + \sqrt{r^2 + 2r + 4}\bigr),$$
which for large r can be approximated by
$$h_1^{opt} \approx \tfrac{2}{3}\,r.$$
Because each hi must be an integer value, the real maximum can be determined by evaluating the integer combinations around the exact mathematical maximum. In the three-tier case, the total number of routers equals
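$$S_3(h_1, h_2, h_3) = \bigl(((h_1+1)h_2+1)(h_1+1)h_3 + 1\bigr)\cdot\bigl((h_1+1)h_2+1\bigr)(h_1+1),$$
which for large h1 scales as $\tilde{S}_3(h_1, h_2, h_3) = h_1^4 h_2^2 h_3$. Substituting $h_3 = r - h_1 - h_2$ yields
$$\tilde{S}_3(h_1, h_2) = r h_1^4 h_2^2 - h_1^5 h_2^2 - h_1^4 h_2^3.$$
Applying partial differentiation of $\tilde{S}_3(h_1, h_2)$ with respect to h1 and h2 gives
$$\frac{\partial \tilde{S}_3}{\partial h_1} = 4r h_1^3 h_2^2 - 5h_1^4 h_2^2 - 4h_1^3 h_2^3, \qquad \frac{\partial \tilde{S}_3}{\partial h_2} = 2r h_1^4 h_2 - 2h_1^5 h_2 - 3h_1^4 h_2^2.$$
Setting each of these partial derivatives to zero yields $h_1 = 4(r-h_2)/5$ and $h_1 = (2r-3h_2)/2$, which after simple manipulations results in
$$h_1 = \tfrac{4}{7}r, \qquad h_2 = \tfrac{2}{7}r, \qquad h_3 = \tfrac{1}{7}r,$$
meaning that for large r the optimal ratio h1:h2:h3 equals 4:2:1.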
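Using a similar analysis, it can be shown that for a four-tier dragonfly topology, where the total number of routers S4 scales as $h_1^8 h_2^4 h_3^2 h_4$, the optimal choice for the parameters hi is as follows:
$$h_1 = \tfrac{8}{15}r, \qquad h_2 = \tfrac{4}{15}r, \qquad h_3 = \tfrac{2}{15}r, \qquad h_4 = \tfrac{1}{15}r.$$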
Although these values are approximations, combinations of integer values that are close to these approximations and sum up to r indeed provide the best scaling.
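The search over integer combinations described above can be sketched as an exhaustive enumeration. The sketch below assumes the total router count follows the recurrence S_i = S_{i−1}·(S_{i−1}·h_i + 1), which is consistent with the closed forms for S2 and S3 given above; the function names are hypothetical.

```python
import math
from itertools import product

def total_routers(h):
    """Total routers S_n, assuming S_i = S_{i-1} * (S_{i-1} * h_i + 1), S_0 = 1."""
    s = 1
    for h_i in h:
        s *= s * h_i + 1
    return s

def best_split(r, n):
    """Exhaustively find integers h_1..h_n >= 1 with sum r that maximize S_n."""
    best = None
    for head in product(range(1, r), repeat=n - 1):
        last = r - sum(head)
        if last < 1:
            continue
        h = head + (last,)
        if best is None or total_routers(h) > total_routers(best):
            best = h
    return best

def h1_opt_two_tier(r):
    """Closed-form real-valued optimum for h_1 in the two-tier case."""
    return (r - 2 + math.sqrt(r * r + 2 * r + 4)) / 3

print(best_split(30, 2), h1_opt_two_tier(30))
print(best_split(21, 3))  # compare with the approximate ratio 4:2:1
```

The exhaustive search is only practical for small r and n, but it suffices to confirm that the integer optimum lies next to the real-valued approximation.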
The bisection B (expressed in the number of links) of a two-tier dragonfly DF(p, h1, h2) equals the minimum number of links that need to be cut to separate the network into two equal-sized halves. For balanced networks, this equals the worst-case cut between groups. As each group is connected to all groups in the other half by exactly one link, where the number of groups G=(h1+1)h2+1, the bisection is given as follows:
$$B = \begin{cases} G^2/2, & G \bmod 2 = 0 \\ (G-1)^2/2 + (h_1+1)^2/2, & G \bmod 2 = 1 \end{cases}$$
Note that each link is counted twice because the links are bidirectional.
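For a generalized dragonfly GDF(p; (h1, . . . , hn)), the top-tier bisection for even Gn follows analogously as
$$B_n = G_n^2/2.$$
The relative bisection per node Bn/N can be expressed as follows:
$$\frac{B_n}{N} = \frac{G_n^2}{2pS_n} = \frac{G_n}{2pS_{n-1}} = \frac{h_n}{2p} + \frac{1}{2pS_{n-1}}.$$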
Thus, $B_n/N > h_n/(2p)$. Therefore, to ensure full bisection bandwidth, $h_n \ge 2p$. Note that this relation depends only on the top-tier $h_n$ and the bristling factor p, but not on the number of tiers n nor on any of the lower-tier values $h_i$, $i<n$. However, to ensure that the lower tiers do not impose a bottleneck, the number of times each type of link is traversed must be considered. For shortest-path (direct) routing, most paths traverse $2^{n-i}$ links at tier i, where $1 \le i < n$. Therefore, the values for $h_i$ at each tier should satisfy $h_i = 2h_{i+1}$, where $1 \le i < n$. Thus, $h_i/h_n = 2^{n-i}$ for $1 \le i < n$, or $h_i = 2^{n-i+1}p$ for $1 \le i < n$.
The average distance davg (i.e., the number of inter-router hops) in a two-tier GDF(p; h1, h2) follows from the path count breakdown listed in Table I shown in
Hence, the average distance in a two-tier dragonfly network is given by:
Similarly, the average distance in a three-tier dragonfly network is expressed as:
As discussed above, satisfying $h_i = 2^{n-i+1}p$, for $1 \le i < n$, provides full bisection bandwidth. However, the notion of a balanced dragonfly network requires only half bisection bandwidth, because a balanced dragonfly network assumes uniform traffic. For uniform traffic, only half of the traffic crosses the bisection. Correspondingly, the hi/p values for a balanced multi-tier dragonfly network are halved: $h_i = 2^{n-i}p$, for $1 \le i < n$. Note that these ratios between the hi values are also the optimal ratios between hi values to achieve the maximum total system size.
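The two dimensioning rules can be written down directly. In this sketch the formulas are also applied at the top tier i=n, which gives hn=2p for full bisection (as required above) and hn=p for the balanced case; extending the balanced rule to i=n is an assumption of the sketch, and the function names are hypothetical.

```python
def full_bisection_h(p, n):
    """h_i = 2^(n-i+1) * p for tiers i = 1..n (top tier: h_n = 2p)."""
    return [2 ** (n - i + 1) * p for i in range(1, n + 1)]

def balanced_h(p, n):
    """h_i = 2^(n-i) * p for tiers i = 1..n (top tier h_n = p is assumed)."""
    return [2 ** (n - i) * p for i in range(1, n + 1)]

print(full_bisection_h(2, 3))
print(balanced_h(2, 3))
```

For p=2 and n=3 the balanced rule yields (8, 4, 2), which matches the three-tier example parameters used later in this description.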
Formally, a GDF(p; (h1, . . . , hn)) is referred to as balanced if for all i, $1 \le i < n$, $h_i/h_{i+1} = H_i/H_{i+1}$, where $H_i$ represents the total number of tier-i hops across all shortest paths from a given node to every other node, which is equivalent to the hop-count distribution under uniform traffic. The balance ratios $\beta_{i,j}$ are defined as $\beta_{i,j} = (h_i H_j)/(h_j H_i)$. A network is perfectly balanced when for all i, $1 \le i < n$: $\beta_{i,i+1} = 1$. In practice, achieving ratios exactly equal to one may not be possible, so it is preferable if the ratios are as close to 1 as possible. If $\beta_{i,i+1} > 1$, the network (at tier i) is overdimensioned, whereas the network is underdimensioned if $\beta_{i,i+1} < 1$. It should be noted that as employed herein the term “balance” relates specifically to bandwidth, not router port counts; however, as equal bandwidth on all links is assumed herein (unless specifically stated otherwise), router port count and link bandwidth can appropriately be viewed as equivalent concepts.
The relation $h_i = 2^{n-i}p$, for $1 \le i < n$, for balanced dragonfly networks is sufficient but conservative for a balanced network because, depending on h, a certain fraction of minimal paths are shorter than $2^n - 1$ hops. More precisely, for a two-tier dragonfly network, the ratio between the number of l1 versus l2 hops equals
$$\frac{H_1}{H_2} = \frac{h_1(1 + 2h_2 + 2h_1h_2)}{h_2(h_1+1)^2}.$$
As a consequence, for small first-tier groups, the effective balance factor is clearly smaller than 2. The exact ratio between h1 and h2 for a balanced network can be determined by solving $\beta_{1,2} = 1$. In particular, since $h_1 = h_2H_1/H_2$, $h_1 = h_2 - 1 + \sqrt{h_2^2 + 1} \approx 2h_2 - 1$.
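The two-tier balance condition can be checked numerically. The sketch below counts shortest paths from one router by path class (l1; l2; l1→l2; l2→l1; l1→l2→l1). These counts are an assumption reconstructed from the topology definition rather than quoted from Table I, and the function names are hypothetical.

```python
import math

def hop_totals_two_tier(h1, h2):
    """Total l1 and l2 hops over all shortest paths from one router.

    Assumed path classes and counts: l1 (h1 paths), l2 (h2 paths),
    l1->l2 (h1*h2), l2->l1 (h1*h2), l1->l2->l1 (h1^2 * h2).
    """
    H1 = h1 + h1 * h2 + h1 * h2 + 2 * h1 * h1 * h2
    H2 = h2 + 2 * h1 * h2 + h1 * h1 * h2
    return H1, H2

def balance_ratio(h1, h2):
    """beta_{1,2} = (h1 * H2) / (h2 * H1); 1 means perfectly balanced."""
    H1, H2 = hop_totals_two_tier(h1, h2)
    return (h1 * H2) / (h2 * H1)

def balanced_h1(h2):
    """Real-valued h1 that solves beta_{1,2} = 1."""
    return h2 - 1 + math.sqrt(h2 * h2 + 1)

h2 = 4
print(balanced_h1(h2), balance_ratio(balanced_h1(h2), h2))
print(balance_ratio(8, h2), balance_ratio(7, h2))
```

Integer choices of h1 on either side of the real-valued solution give ratios slightly above and slightly below one, respectively.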
For three-tier balance, Table II shown in
From Table II, the total number of hops Hi per link type li can be obtained as follows:
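$$H_1 = h_1\bigl(1 + 2h_2 + 2h_1h_2 + h_3(2 + 2h_1 + 6h_2 + 12h_1h_2 + 6h_1^2h_2 + 4h_2^2 + 12h_1h_2^2 + 12h_1^2h_2^2 + 4h_1^3h_2^2)\bigr)$$
$$H_2 = h_2\bigl(1 + 2h_1 + h_1^2 + h_3(2 + 6h_1 + 2h_2 + 8h_1h_2 + 6h_1^2 + 12h_1^2h_2 + 2h_1^3 + 8h_1^3h_2 + 2h_1^4h_2)\bigr)$$
$$H_3 = h_3\bigl(1 + 2h_1 + 2h_2 + h_1^2 + 6h_1h_2 + h_2^2 + 4h_1h_2^2 + 6h_1^2h_2 + 6h_1^2h_2^2 + 2h_1^3h_2 + 4h_1^3h_2^2 + h_1^4h_2^2\bigr)$$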
From these relations, the balance ratios H1/H2 and H2/H3 can be determined. Although the value of H1/H2 is not entirely independent of h3, it can be shown that the derivative
$$\frac{\partial}{\partial h_3}\left(\frac{H_1}{H_2}\right)$$
is extremely close to zero (<4e−16) for any valid combination of h1, h2, and h3, implying that the two-tier balance condition for h1 and h2 also holds in the three-tier case. The condition for h3 can be determined by solving h3=h2H3/H2, which following reductions yields
$$h_3 = \frac{h_2\bigl((h_1+1)h_2 + 2\bigr)}{2\bigl((h_1+1)h_2 + 1\bigr)} \approx \frac{h_2}{2}.$$
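The reduction for h3 can be sketched as follows. The sketch assumes that every shortest path to a router in a different tier-2 group consists of a two-tier shortest path to the gateway router, one l3 hop, and a two-tier shortest path inside the destination group. The symbols A1 and A2 (the two-tier hop totals) are introduced here for illustration only.

```latex
\begin{align*}
A_1 &= h_1(1 + 2h_2 + 2h_1h_2), \qquad A_2 = h_2(h_1+1)^2, \\
G_2 &= (h_1+1)h_2 + 1, \qquad S_2 = (h_1+1)\,G_2, \\
H_1 &= A_1(1 + 2h_3S_2), \qquad H_2 = A_2(1 + 2h_3S_2), \qquad H_3 = h_3S_2^2, \\
\frac{H_1}{H_2} &= \frac{A_1}{A_2}, \\
h_3 = h_2\frac{H_3}{H_2}
  \;&\Longrightarrow\; (h_1+1)^2(1 + 2h_3S_2) = S_2^2
  \;\Longrightarrow\; 1 + 2h_3S_2 = G_2^2, \\
h_3 &= \frac{G_2^2 - 1}{2S_2}
     = \frac{h_2\bigl((h_1+1)h_2 + 2\bigr)}{2\bigl((h_1+1)h_2 + 1\bigr)}
     \approx \frac{h_2}{2}.
\end{align*}
```

Under this reading, the ratio H1/H2 does not depend on h3 at all, which is consistent with the numerically negligible derivative noted above, and the balanced value of h3 approaches h2/2 as the tier-2 group size grows.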
It should further be appreciated that the condition $h_i = 2h_{i+1}$ for a balanced network is satisfied by any combination $h_i \cdot BW_i = 2\,(h_{i+1} \cdot BW_{i+1})$, where $BW_i$ is the bandwidth per port at tier i, which could be achieved by doubling $BW_{i+1}$ rather than $h_{i+1}$.
Substituting the conditions for a balanced dragonfly network into the expression for the total number of nodes N yields an expression for N that depends only on a single variable, allowing the network parameters p and hi to be uniquely determined as a function of N. For example, for a two-tier topology, $N = 4h_2^4 + 2h_2^2$. Solving this equation for h2 yields
$$h_2 = \tfrac{1}{2}\sqrt{\sqrt{4N+1} - 1},$$
which for large N is approximately equal to
$$h_2 \approx (N/4)^{1/4}.$$
Therefore using the conditions for a balanced dragonfly topology, the balanced router radix for the two-tier case equals
$$r = p + h_1 + h_2 = 4h_2 - 1 \approx 4\,(N/4)^{1/4} - 1.$$
It can similarly be shown that for the three-tier case the respective expressions are:
Network costs in terms of number of routers and inter-router links per end node equal 1/p routers/node and
links/node. For balanced networks these ratios amount to 1/h2 and (4h2−1)/h2≈4 for two tiers and 1/h3 and (8h3−1)/h3≈8 for three tiers.
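The two-tier sizing relations can be evaluated for a target node count as follows. The sketch assumes the balanced parameters p=h2 and h1=2h2−1, which are consistent with N=4h2^4+2h2^2 and with the ratio (4h2−1)/h2 given above, and it returns real-valued parameters that would still need rounding to integers. The function name is hypothetical.

```python
import math

def balanced_two_tier(num_nodes):
    """Real-valued balanced two-tier parameters for a target node count N.

    Assumes p = h2 and h1 = 2*h2 - 1, so that N = 4*h2^4 + 2*h2^2.
    """
    h2 = 0.5 * math.sqrt(math.sqrt(4 * num_nodes + 1) - 1)
    p = h2
    h1 = 2 * h2 - 1
    radix = p + h1 + h2  # equals 4*h2 - 1, including end-node-facing ports
    return {
        "p": p,
        "h1": h1,
        "h2": h2,
        "radix": radix,
        "routers_per_node": 1 / p,
        "ports_per_node": radix / p,
    }

print(balanced_two_tier(40_200))
print(balanced_two_tier(1_000_000))
```

For one million nodes the resulting radix is slightly below 90, and the number of ports per node approaches four.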
Turning now to routing considerations, routing packets in an interconnection network having a dragonfly topology can include both shortest-path, i.e., minimal or direct routing, as well as non-minimal or indirect routing. Because exactly one shortest path exists between any pair of routers (or processing nodes), minimal routing is the more straightforward case. In a preferred embodiment, a dimension order routing approach is implemented in which each dimension corresponds to a tier of the dragonfly topology. The routers comprising the minimal route determine the minimal route in n phases, where at each phase, one or more of the routers in the minimal route compares its coordinates and those of the destination router and routes the packet to the correct position (group) in the current dimension (tier) before the packet is passed to the next dimension. In a GDF, routers preferably order the dimensions from most to least significant, because routing to a destination group at a certain tier i may require hops through some or all of the dimensions lower than i as well.
As an example, assuming n=3, source coordinates (g1s, g2s, g3s) and destination coordinates (g1d, g2d, g3d), the routing algorithm implemented by the routers in the dragonfly interconnection network first routes to the correct destination group at tier 3 (x, y, g3d), then within that top-tier group to the correct subgroup at tier 2 (x, g2d, g3d), and finally to the correct sub-subgroup at tier 1, (g1d, g2d, g3d). Because of the particular structure of a GDF, in which exactly one link connects two given groups at a given tier, each of the n routing phases (one per dimension) may require intermediate hops to get to the router that connects the current group to the destination group.
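The ordering of the routing phases can be sketched as follows. The sketch only lists the phases and their target coordinates; it does not model the intermediate hops needed to reach the gateway router in each phase. Coordinates not yet fixed in a phase (the x and y of the example above) are represented by `None`. The function name `routing_phases` is hypothetical.

```python
def routing_phases(src, dst):
    """List (tier, target) phases of dimension-order routing, top tier first.

    src and dst are coordinate tuples (g1, ..., gn). Routing starts at the
    highest tier in which the two differ; after a hop at that tier the lower
    coordinates are arbitrary, so every lower tier gets its own phase.
    """
    if len(src) != len(dst):
        raise ValueError("coordinate vectors must have the same length")
    differing = [i for i in range(len(src)) if src[i] != dst[i]]
    if not differing:
        return []  # source and destination router coincide
    top = max(differing) + 1
    return [
        (tier, (None,) * (tier - 1) + tuple(dst[tier - 1:]))
        for tier in range(top, 0, -1)
    ]

for phase in routing_phases((0, 1, 2), (3, 4, 5)):
    print(phase)
```

When source and destination already share the same tier-2 group, only the tier-1 phase remains.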
With reference now to
In a generalized dragonfly topology, the diameter at tier i is denoted by $d_i$. From the above discussion, it follows that the diameter at successive tiers can be derived by the simple recursive formula $d_i = 2d_{i-1} + 1$, $1 \le i \le n$, with $d_0 = 0$. It follows that the diameter $d_n$ is given by
$$d_n = 2^n - 1.$$
In this example, the distance from source processing node 502a to first destination node 502b equals three, that is, the diameter at tier 2 (i.e., d2) equals 3. The distance from source processing node 502a to second destination processing node 540c equals twice the diameter at tier 2 plus the hop connecting the two tier-2 groups 508a-508b. Hence, the diameter at tier 3 (i.e., d3) equals 7.
For non-minimal routing, path diversity in a GDF network is immense, and many different techniques of performing indirect routing, such as Valiant routing, are possible. In Valiant routing, a packet is first routed to a randomly selected intermediate group at the “nearest common ancestor” tier between the source and the destination processing nodes, and from there to the destination processing node. In other words, given that both source and destination processing nodes are within the same group at tier i, Valiant routing allows an indirect path to visit one intermediate group at tier i−1. For instance, if source and destination processing nodes are within the same tier 1 group, the longest indirect path is l1→l1 (the intermediate “group” in this case comprises just a single router). If source and destination processing nodes are within the same tier 2 group, then the longest (5-hop) indirect path is l1→l2→l1→l2→l1, whereas if the direct path leads up to tier 3, a corresponding (11-hop) indirect path is l1→l2→l1→l3→l1→l2→l1→l3→l1→l2→l1. It follows that the longest indirect path according to Valiant routing policy has a length of $2^n - 1 + 2^{n-1} = 3 \cdot 2^{n-1} - 1$ hops. A variant of Valiant routing, Valiant-Any, which can mitigate certain adverse traffic patterns, allows misrouting to any intermediate router rather than any intermediate group. In this case the longest indirect path length equals $2(2^n - 1) = 2^{n+1} - 2$ hops.
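The worst-case hop sequences described above follow a simple recursive pattern, which can be sketched as follows. The function names are hypothetical; tiers are represented by their index, so that 1 stands for an l1 hop, 2 for an l2 hop, and so on.

```python
def shortest_pattern(n):
    """Worst-case shortest path in an n-tier GDF as a list of tier indices."""
    if n == 0:
        return []
    lower = shortest_pattern(n - 1)
    return lower + [n] + lower  # d_n = 2 * d_{n-1} + 1

def valiant_pattern(n):
    """Longest Valiant path: via one intermediate tier-(n-1) group."""
    lower = shortest_pattern(n - 1)
    return lower + [n] + lower + [n] + lower

def valiant_any_length(n):
    """Longest Valiant-Any path: two full worst-case shortest paths."""
    return 2 * len(shortest_pattern(n))

print(shortest_pattern(3))
print(valiant_pattern(2))
print(len(valiant_pattern(3)), valiant_any_length(3))
```

The lengths of the generated sequences agree with the closed-form expressions given above for every n.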
Shortest-path routing in dragonfly networks is inherently prone to deadlocks because shortest-path routing induces cyclic dependencies in the channel dependency graph. Without a proper deadlock avoidance policy in place, forward progress cannot be guaranteed. It has previously been demonstrated that to guarantee deadlock freedom the basic two-tier dragonfly requires for the tier-1 links two virtual channels for shortest-path routing and three virtual channels for indirect routing.
To guarantee deadlock freedom in GDF topologies with an arbitrary number of tiers, a sufficient number of virtual channels (VC) must be allocated. Referring now to
In the general case, a GDF with n tiers requires $2^{n-i}$ virtual channels for deadlock avoidance with shortest-path routing on tier i. This follows from the maximum number of times that hops at a given tier are visited, which is exactly $2^{n-i}$ for tier i for the worst-case shortest paths.
Given an incoming port corresponding to tier l1 and channel vc1, an outbound port corresponding to tier l2 and channel vc2, with λ=|l2−l1| being the number of tiers crossed, then the outbound virtual channel vc2 is given by
Note that assigning the next VC according to this deadlock-free general-case VC assignment policy only requires knowledge of the current VC and the current and next dimension, and can be implemented by the routers in a GDF network using simple shifting and addition operations.
The indirect routing paths described above require additional virtual channels for deadlock-free operation, as the additional available routes can create additional cycles in the channel dependency graphs. For Valiant routing, the maximum number of hops at a given tier i of such an indirect path equals $2^{n-i-1} + 2^{n-i}$ (for $n - i \ge 1$; at tier n there are two hops at maximum). Correspondingly, each link at tier i should provide at least $2^{n-i-1} + 2^{n-i}$ virtual channels. For the tier-1 links, this policy implies three channels for n=2 and six for n=3.
Consequently, the virtual channel assignment policy set forth above changes. An indirect path is composed of an initial Valiant component from the source processing node to the selected intermediate destination, followed by the shortest path from the intermediate destination to the actual destination processing node. The number of tier-i hops of the first and second components are at most $2^{n-i-1}$ and $2^{n-i}$, respectively. For the second component, the VC assignment policy set forth above is applied, with VCs numbered from 0 to $2^{n-i} - 1$ for each tier as before. For the first component, which is roughly half a shortest path, the shortest-path assignment is also applied, except that the VCs are numbered from $2^{n-i}$ to $2^{n-i} + 2^{n-i-1} - 1$. Therefore, prior to applying the VC assignment policy set forth above, a value of $2^{n-i}$ is subtracted, and added back in afterwards. To correctly compute the VC, the VC assignment policy must therefore be aware of whether the routing phase is Valiant (first part) or not (second part).
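The virtual channel budget per tier and the numbering ranges of the two path components can be sketched as follows. The sketch does not implement the per-hop VC assignment itself. Giving the Valiant component a single extra VC at the top tier is an assumption derived from the two-hop maximum at tier n noted above. The function names are hypothetical.

```python
def vcs_required(n, tier, valiant=False):
    """Number of virtual channels needed on tier-`tier` links of an n-tier GDF."""
    if not 1 <= tier <= n:
        raise ValueError("tier must be in 1..n")
    direct = 2 ** (n - tier)
    if not valiant:
        return direct
    if tier == n:
        return 2  # at most two top-tier hops on a Valiant path
    return 2 ** (n - tier - 1) + direct

def vc_range(n, tier, component):
    """VC numbers used at `tier` by the 'first' (Valiant) or 'second' component."""
    direct = 2 ** (n - tier)
    if component == "second":
        return range(0, direct)
    if component == "first":
        # Assumption: one extra VC at the top tier, 2^(n-tier-1) below it.
        extra = 1 if tier == n else 2 ** (n - tier - 1)
        return range(direct, direct + extra)
    raise ValueError("component must be 'first' or 'second'")

for n in (2, 3):
    print(n, [vcs_required(n, t, valiant=True) for t in range(1, n + 1)])
print(list(vc_range(3, 1, "second")), list(vc_range(3, 1, "first")))
```

Keeping the two ranges disjoint is what allows the same shortest-path assignment to be reused for both components after the offset is subtracted.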
The present disclosure further appreciates that the GDF topology described previously, which employs only a single link between each pair of groups at a given tier, can be extended to employ multiple links between groups. This arrangement gives rise to a new class of interconnection networks referred to herein as extended generalized dragonfly (XGDF) topologies. XGDF networks cover the entire spectrum of network topologies from fully-scaled dragonflies to Hamming graphs, in which each router is directly connected with its peers in each of the other groups at each tier. Note that XGDF topologies can all be built using different configurations of the same “building block” router.
An extended generalized dragonfly is specified by XGDF(p;
To obtain a completely regular topology, the constraint is imposed that, at each tier i, the number of outbound links from a tier-(i−1) group must be an integer multiple of bi, so that each bundle has exactly the same number of subgroups and links:
$$\Bigl(\prod_{j=0}^{i-1} G'_j\Bigr) h_i \bmod b_i = 0, \qquad G'_0 = 1.$$
Moreover, it is assumed b1=1, because there are no subgroups at the first tier and from a topological perspective it is not useful to introduce multiple links between a pair of routers. (Although such redundant links can be present for non-topological reasons.)
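The number of groups at each tier G′i is given by the following equations:
$$G'_0 = 1, \qquad G'_i = \frac{\bigl(\prod_{j=0}^{i-1} G'_j\bigr) h_i}{b_i} + 1, \quad 1 \le i \le n.$$
The total number of switches S′i in one of the groups at tier i is given by the product of the group sizes up to tier i:
$$S'_i = \prod_{j=1}^{i} G'_j.$$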
The total number of nodes N′ in the system equals N′=pS′n. Each switch in the topology can be uniquely identified by an n-value coordinate vector (g1, g2, . . . , gn), with 0≦gi<G′i, with each coordinate indicating the relative group position at each tier of the hierarchy.
“Full bundling” at tier i is defined herein as the case where
$$b_i = \prod_{j=0}^{i-1} G'_j,$$
such that G′i=hi+1. This relation implies that every router has a direct link to its respective peers in each other group at tier i. If all tiers have full bundling, the topology is equivalent to a prior art HyperX (i.e., Hamming graph) topology. Consequently, herein it is assumed that an XGDF does not have full bundling at all tiers. It should be noted that the number of groups at each tier and therefore the total number of routers and end nodes decreases with increasing values of bi. By virtue of the XGDF definition, the switch radix is independent of the bi value, $r = p + \sum_{j=1}^{n} h_j$. Therefore, bundling trades off network scale for average distance, and a bundled network requires a larger router radix than one without bundling in order to scale to the same size.
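The effect of the bundling factors on network scale can be illustrated numerically. The sketch below assumes the group-count recurrence G′_i = S′_{i−1}·h_i/b_i + 1 with G′_0 = S′_0 = 1, which is consistent with the full-bundling case G′_i = h_i + 1 described above; the function name `xgdf_sizes` is hypothetical.

```python
def xgdf_sizes(p, h, b):
    """Return (G, S, N, radix) for XGDF(p; h_1..h_n; b_1..b_n).

    Assumed recurrence: G[i] = S[i-1] * h[i] / b[i] + 1, G[0] = S[0] = 1.
    """
    if len(h) != len(b):
        raise ValueError("h and b must have the same length")
    G, S = [1], [1]
    for tier, (h_i, b_i) in enumerate(zip(h, b), start=1):
        links = S[-1] * h_i  # outbound links of one tier-(i-1) group
        if links % b_i != 0:
            raise ValueError(f"tier {tier}: b_i must divide the outbound link count")
        if b_i > S[-1]:
            raise ValueError(f"tier {tier}: b_i exceeds full bundling")
        g_i = links // b_i + 1
        G.append(g_i)
        S.append(S[-1] * g_i)
    radix = p + sum(h)  # independent of the bundling factors
    return G, S, p * S[-1], radix

print(xgdf_sizes(2, [8, 4, 2], [1, 1, 1]))
print(xgdf_sizes(2, [8, 4, 2], [1, 9, 1]))  # full bundling at tier 2
```

With unchanged radix, increasing b2 from 1 to the full-bundling value reduces the tier-2 group count to h2+1 and shrinks the network accordingly.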
The interconnection pattern for an arbitrary XGDF is the same as that for a GDF topology, with one minor modification to $\Gamma_i^a(x)$:
$$\Gamma_i^a(x) = (\Delta_i^a \cdot h_i + x) \bmod (G'_i - 1)$$
to account for the fact that there are now more global links than the number of (other) remote groups $G'_i - 1$.
As every group is connected to every other group by bn links, the number of links that cross the bisection at the top tier equals $B'_n = (G'_n)^2\, b_n/2$ (assuming G′n is even). Hence, the relative bisection per node can be given as:
The significance of this result is that the top-tier bisection does not depend on bn, but only on hn. Similarly, the bisection bandwidths at lower tiers also do not depend on b. In other words, the relative bisection of an XGDF is not affected by the bundling factors, but is determined fully by
The average distance in a two-tier XGDF(p; h1, h2; 1, b2) is given by
Here, the correction factor x accounts for the fact that—unlike in GDFs without bundling—there may be multiple shortest paths to a given node:
Routing in an XGDF is in principle the same as in a GDF as groups are traversed in a hierarchical manner. The longest direct routes are identical to those for the GDF without bundling. However, in the case where full bundling is implemented at a certain tier, the route at that tier “degenerates” in the sense that no hops at a lower tier are necessary to reach the destination group because each switch has a direct link to all its peers. Even in a full-scale non-bundled dragonfly topology, a certain fraction of the routes is (partially) degenerate. An example of this in a two-tier dragonfly topology is the four routes l1, l2, l1→l2, and l2→l1. As the bundling factor increases, the relative fraction of such routes will increase, because there are more switches at a given tier that have a link to a given destination group at that tier.
In the extreme case of the Hamming graph, all shortest paths have exactly n hops vs. $2^n - 1$ for the full-scale dragonfly topology. Naturally, this also has implications for network balance and deadlock avoidance. Deadlocks disappear entirely in Hamming graphs as long as a strict dimension-order routing policy is applied, so multiple virtual channels would not be required for direct, shortest path routing. A second virtual channel is required, however, for indirect routing.
As there are multiple paths between peer groups in a bundled network, at each tier a specific link must be selected. In a preferred embodiment, routers in an XGDF topology implement a shortest-path routing scheme that load balances the traffic equally across the available paths. For example, routers at every tier may select a link from a bundle utilizing a hashing function on packet ID, source, and destination. This load-balancing is preferably implemented when the source router is not connected to the destination group and the destination router is not connected to the source group to ensure the shortest-path property. This shortest-path routing algorithm induces an imbalanced usage of local links, however. To illustrate this point, consider a simple case with n=2 and b2>1, XGDF(p; h1, h2; 1, b2). In such a system, a given router with coordinates (x1, x2) is one of b2 local routers all connecting to each specific remote group to which this router is also connected. The local links of the designated router can be classified into two types: Type 1 links connect to a local router that does not connect to the same remote groups, and Type 2 links connect to a local router that does connect to the same remote groups. Assuming uniform random traffic (without self-traffic), the relative load on each remote link arriving at the router equals ρ=G1/b2G2.
Given a traffic arrival intensity of μ, the total load on local links of Types 1 and 2 equals
where the first terms correspond to local traffic, the second terms to traffic arriving from remote links, and the third terms to traffic with a destination in a remote group other than the one to which the designated router connects. Note that the third terms are the ones that cause the imbalance: because of the shortest-path requirement, a router may not load-balance remote traffic to remote groups to which it itself has a link.
From the above, it follows that the load ratio between the two types of local links equals
which for large values of the ratio (G1h2)/b2 approaches two. This ratio represents the worst-case imbalance, which only applies to networks in which the third term is either zero or maximum. This third term, which represents traffic generated at the local source router with a destination in a remote group, depends on how many remote groups the local source router and the local intermediate router have in common. This overlap Ω(l) can range from 0 to h2. If Ω(l)=0 then the total load for Type 1 links applies, and if Ω(l)=h2 then the total load for Type 2 links applies. As the overlap may be different for every local link, the load on links can be generalized as follows:
To determine the maximum load λl across all local links l of a given router, the minimum overlap $\Omega_{min} = \min_l \Omega(l)$ must be determined. If the number of remote groups G1h2/b2 is at least twice the number of remote links h2, i.e., G1h2/b2≧2h2, then there is at least one link for which Ω(l) equals zero, namely the link to the next local router, and therefore Ωmin=0, such that the generalized link load equation reduces to that of the Type 1 links. This condition can be simplified to G1/b2≧2. If b2=G1, the tier implements full bundling, in which all local routers have links to all remote groups, i.e., Ω(l)=h2 for all l and therefore Ωmin=h2, such that the link load reverts to that of Type 2 links as each path contains only one local hop. For the intermediate range 1<G1/b2<2, it can be shown that
and, taking into account the other cases above, for G1/b2≧1
This expression is required to compute the correct topology values for a balanced XGDF.
From the foregoing discussions on network bisection and shortest path routing, it follows that the conditions for a balanced multi-tier dragonfly network set forth above are sufficient for a balanced XGDF network. However, higher values for bi also increase the number of “degenerated” minimal paths, thus reducing average minimal path length. Therefore, the usage ratios of lower tier to higher-tier links also decrease. Table III depicted in
For an XGDF network of two tiers, the corresponding balance ratio equals
The balance ratio quickly decreases as a function of b2. The extreme case of b2=2h2+1 (x=0) corresponds to a HyperX topology; correspondingly, the balance ratio equals (h1h2+h1)/(h1h2+h2) at that point, which is close to 1 for large h1, h2. To find a balanced ratio between h1 and h2, a solution for h1=h2H1/H2 is computed. This equation has an intricate closed-form solution, for which a good approximation is given by h1≈2h2−b2, so the tier-1 group size can be decreased relative to h2 as b2 increases.
Based on the foregoing discussion regarding link loading, the condition for p can be derived as follows. The utilization of the higher-utilized “Type 1” tier-1 links is given by the equation set forth above. To prevent these links from becoming a bottleneck, the following relation must hold:
which yields
This can be approximated by
which implies that in the extreme cases of the fully-scaled dragonfly (b2=1) and the Hamming graph (b2=h1+1) the balanced bristling factor equals h2, but for intermediate values of b2, it can be as low as 2h2/3.
For XGDF topologies having three tiers, bundling is too complex to be readily analyzed in closed-form. Therefore, an analysis was performed using a C++ simulated implementation of a generic XGDF topology and its corresponding shortest-path routing algorithm. By traversing all paths from a selected source processing node to every other destination processing node, the number of times a hop at each tier is traversed can be determined. This yields the per-tier hop counts Hi, from which the relative balance ratios β1,2 and β2,3 can be determined.
To develop a suitable sample set, this simulation analysis was performed for 457 different three-tier XGDF topologies, subject to the constraint h3=2, which is sufficient to scale to balanced networks with up to 444,222 nodes, i.e., XGDF(2; 8, 4, 2; 1, 1, 1). Given the upper bound on the balance conditions, every h2 ∈ [h3, 2h3] and for every value of h2 every h1 ∈ [h2, 2h2] was analyzed. For each of these triples (h1, h2, h3), every value of b2 such that G1h2 mod b2=0 and b2≦G1 was analyzed, and for each of those values of b2 every value of b3 such that G1G2h3 mod b3=0 and b3≦G1G2. Out of the 457 topologies analyzed, 23 are nearly optimally balanced, in the sense that they satisfy 0.96≦β1,2, β2,3≦1.1. For these 23 topologies summarized in
Using the foregoing approximation for the bristling factor, the number of ports per processing (end) node in the balanced two-tier case can be expressed as:
for b2≦h1+1.
The number of routers per end node equals 1/p; hence
The deadlock avoidance policy described above for GDF topologies is also valid for XGDF topologies. Only in the extreme case of full bundling at all tiers does minimal dimension-order routing require just a single virtual channel, because each dimension is then visited only once.
Good system balance is key to achieving sustained exaflop performance. Providing sufficient communication bandwidth will be critical to enabling a wide range of workloads to benefit from exascale performance. With current HPC interconnect technology, the byte-to-FLOP ratio will likely be orders of magnitude lower than in current petascale systems, which would pose significant barriers to performance portability for many workloads, particularly communication-intensive ones. For reasons of cost and density, integrated routers are preferred for achieving acceptable byte/FLOP ratios. Based on IO pin and power arguments, the prior art two-tier dragonfly is not amenable to an integrated exascale network. Although three-tier networks actually increase the overall number of links and their associated power, a drastically lower router radix is sufficient to scale to million-node networks (radix 20 vs. radix 90 for two tiers). This enables low-radix routers that are amenable to integration on the main compute node, because they require only modest link IO power and pin budgets.
Referring now to
The operation of each processor core 1002 is supported by a multi-level volatile memory hierarchy having at its lowest level a shared system memory 1004 accessed via an integrated memory controller 1006, and at its upper levels, one or more levels of cache memory, which in the illustrative embodiment include a store-through level one (L1) cache within and private to each processor core 1002 and a respective store-in level two (L2) cache 1008a, 1008b for each processor core 1002a, 1002b. Although the illustrated cache hierarchy includes only two levels of cache, those skilled in the art will appreciate that alternative embodiments may include additional levels (L3, etc.) of on-chip or off-chip, private or shared, in-line or lookaside cache, which may be fully inclusive, partially inclusive, or non-inclusive of the contents of the upper levels of cache.
Processing node 1000 further includes an I/O (input/output) controller 1010 supporting the attachment of one or more I/O devices (not depicted). Processing node 1000 additionally includes a local interconnect 1012 supporting local communication among the components integrated within processing node 1000, as well as one or more integrated routers 1016 that support communication with other processing nodes 1000 or other external resources via an interconnection network having a GDF or XGDF topology as previously described. As shown, a router 1016 includes a plurality of ports 1018 and a controller 1020 that controls transfer of packets between ports 1018 and between router 1016 and local interconnect 1012 according to one or more routing policies, as previously described. Router 1016 may optionally further include one or more data structures referenced by controller 1020 in the course of making routing determinations, such as forwarding database (FDB) 1022 and routing database (RDB) 1024.
In operation, when a hardware thread under execution by a processor core 1002 of a processing node 1000 includes a memory access (e.g., load or store) instruction requesting a specified memory access operation to be performed, processor core 1002 executes the memory access instruction to determine the target address (e.g., an effective address) of the memory access request. After translation of the target address to a real address, the cache hierarchy is accessed utilizing the target address. Assuming the indicated memory access cannot be satisfied by the cache memory or system memory 1004 of the processing node 1000, router 1016 of processing node 1000 may transmit the memory access request to one or more other processing nodes 1000 of a multi-node data processing system (such as those shown in
With reference now to
Design flow 1100 may vary depending on the type of representation being designed. For example, a design flow 1100 for building an application specific IC (ASIC) may differ from a design flow 1100 for designing a standard component or from a design flow 1100 for instantiating the design into a programmable array, for example a programmable gate array (PGA) or a field programmable gate array (FPGA) offered by Altera® Inc. or Xilinx® Inc.
Design process 1110 preferably employs and incorporates hardware and/or software modules for synthesizing, translating, or otherwise processing a design/simulation functional equivalent of the components, circuits, devices, or logic structures shown herein to generate a netlist 1180 which may contain design structures such as design structure 1120. Netlist 1180 may comprise, for example, compiled or otherwise processed data structures representing a list of wires, discrete components, logic gates, control circuits, I/O devices, models, etc. that describes the connections to other elements and circuits in an integrated circuit design. Netlist 1180 may be synthesized using an iterative process in which netlist 1180 is resynthesized one or more times depending on design specifications and parameters for the device. As with other design structure types described herein, netlist 1180 may be recorded on a machine-readable storage medium or programmed into a programmable gate array. The medium may be a non-volatile storage medium such as a magnetic or optical disk drive, a programmable gate array, a compact flash, or other flash memory. Additionally, or in the alternative, the medium may be a system or cache memory, or buffer space.
Design process 1110 may include hardware and software modules for processing a variety of input data structure types including netlist 1180. Such data structure types may reside, for example, within library elements 1130 and include a set of commonly used elements, circuits, and devices, including models, layouts, and symbolic representations, for a given manufacturing technology (e.g., different technology nodes, 32 nm, 45 nm, 90 nm, etc.). The data structure types may further include design specifications 1140, characterization data 1150, verification data 1160, design rules 1170, and test data files 1185 which may include input test patterns, output test results, and other testing information. Design process 1110 may further include, for example, standard mechanical design processes such as stress analysis, thermal analysis, mechanical event simulation, process simulation for operations such as casting, molding, and die press forming, etc. One of ordinary skill in the art of mechanical design can appreciate the extent of possible mechanical design tools and applications used in design process 1110 without deviating from the scope and spirit of the invention. Design process 1110 may also include modules for performing standard circuit design processes such as timing analysis, verification, design rule checking, place and route operations, etc.
Design process 1110 employs and incorporates logic and physical design tools such as HDL compilers and simulation model build tools to process design structure 1120 together with some or all of the depicted supporting data structures along with any additional mechanical design or data (if applicable), to generate a second design structure 1190. Design structure 1190 resides on a storage medium or programmable gate array in a data format used for the exchange of data of mechanical devices and structures (e.g., information stored in an IGES, DXF, Parasolid XT, JT, DRG, or any other suitable format for storing or rendering such mechanical design structures). Similar to design structure 1120, design structure 1190 preferably comprises one or more files, data structures, or other computer-encoded data or instructions that reside on transmission or data storage media and that, when processed by an ECAD system, generate a logically or otherwise functionally equivalent form of one or more of the embodiments of the invention shown herein. In one embodiment, design structure 1190 may comprise a compiled, executable HDL simulation model that functionally simulates the devices shown herein.
Design structure 1190 may also employ a data format used for the exchange of layout data of integrated circuits and/or symbolic data format (e.g., information stored in a GDSII (GDS2), GL1, OASIS, map files, or any other suitable format for storing such design data structures). Design structure 1190 may comprise information such as, for example, symbolic data, map files, test data files, design content files, manufacturing data, layout parameters, wires, levels of metal, vias, shapes, data for routing through the manufacturing line, and any other data required by a manufacturer or other designer/developer to produce a device or structure as described above and shown herein. Design structure 1190 may then proceed to a stage 1195 where, for example, design structure 1190: proceeds to tape-out, is released to manufacturing, is released to a mask house, is sent to another design house, is sent back to the customer, etc.
As has been described, the present disclosure formally generalizes the dragonfly topology to an arbitrary number of tiers, yielding the generalized dragonfly (GDF) topology. These networks are characterized herein in terms of their connectivity pattern, scaling properties, diameter, average distance, cost per node, direct and indirect routing policies, number of virtual channels (VCs) required for deadlock freedom, and a corresponding VC assignment policy. Moreover, specific configurations are disclosed that reflect balance, i.e., configurations that theoretically provide 100% throughput under uniform traffic. Closed-form expressions for the topological parameters of balanced two- and three-tier GDF networks are provided. Moreover, the present disclosure introduces the notion of a balanced router radix and derives closed-form expressions as a function of network size for two- and three-tier networks. The present disclosure also demonstrates that the parameters that provide a balanced network also happen to lead to a maximum-sized network, given a fixed router radix.
In a second generalization, the present disclosure extends the GDF framework to encompass networks with more than one link between groups, referred to herein as the extended generalized dragonfly (XGDF). This extension recognizes that most practical installations of a given dragonfly network are much smaller than the theoretical maximum size, which could leave many links unused. Rather than leaving such links unused, the XGDF topology employs these links to provide additional bandwidth between groups. To this end, bundling factors that specify the number of links between groups at each tier are introduced. XGDFs are analyzed in terms of the same criteria as GDFs, again paying special attention to the notion of balance and quantifying the effect of bundling on network balance. In particular, it has been found that the balance between the first and second tiers depends linearly on the bundling factor at the second tier, and that the bristling factor exhibits non-monotonic behavior as a function of the bundling factor.
In at least one embodiment, a multiprocessor computer system includes a plurality of processor nodes and at least a three-tier hierarchical network interconnecting the processor nodes. The hierarchical network includes a plurality of routers interconnected such that each router is connected to a subset of the plurality of processor nodes; the plurality of routers are arranged in a hierarchy of n≧3 tiers (T1, . . . , Tn); the plurality of routers are partitioned into disjoint groups at the first tier T1, the groups at tier Ti being partitioned into disjoint groups (of complete Ti groups) at the next tier Ti+1 and a top tier Tn including a single group containing all of the plurality of routers; and for all tiers 1≦i≦n, each tier-Ti−1 subgroup within a tier Ti group is connected by at least one link to all other tier-Ti−1 subgroups within the same tier Ti group.
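The connectivity condition stated above can be checked mechanically. The following minimal sketch assumes each router is identified by a hypothetical address tuple (a_n, ..., a_1), most significant tier first, so that routers in the same tier-Ti group share their first n−i digits and a tier-T0 subgroup is a single router. The function names and the toy eight-router example are illustrative assumptions; the toy graph is merely a small graph that satisfies the condition, not a balanced GDF or XGDF configuration from the disclosure.

```python
from itertools import combinations, product


def satisfies_hierarchical_connectivity(routers, links, n_tiers):
    """Return True if, for every tier i in 1..n_tiers, each pair of
    tier-(i-1) subgroups inside the same tier-i group shares >= 1 link."""
    routers = list(routers)
    links = list(links)
    for tier in range(1, n_tiers + 1):
        group_len = n_tiers - tier   # digits identifying the tier-i group
        sub_len = group_len + 1      # digits identifying the tier-(i-1) subgroup
        connected = set()
        for u, v in links:
            same_group = u[:group_len] == v[:group_len]
            if same_group and u[:sub_len] != v[:sub_len]:
                connected.add(frozenset((u[:sub_len], v[:sub_len])))
        groups = {}
        for r in routers:
            groups.setdefault(r[:group_len], set()).add(r[:sub_len])
        for subgroups in groups.values():
            for s, t in combinations(sorted(subgroups), 2):
                if frozenset((s, t)) not in connected:
                    return False
    return True


def toy_three_tier():
    """Eight routers with addresses in {0,1}^3 and the fewest links
    that satisfy the condition (hypothetical example)."""
    routers = list(product((0, 1), repeat=3))
    links = [((a, b, 0), (a, b, 1)) for a in (0, 1) for b in (0, 1)]  # tier 1
    links += [((a, 0, 0), (a, 1, 0)) for a in (0, 1)]                 # tier 2
    links += [((0, 0, 0), (1, 0, 0))]                                 # tier 3
    return routers, links


if __name__ == "__main__":
    routers, links = toy_three_tier()
    print(satisfies_hierarchical_connectivity(routers, links, n_tiers=3))
```

Removing the single tier-3 link from the toy example makes the check fail, because the two tier-T2 subgroups within the top-tier group are then no longer directly connected.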
While various embodiments have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the appended claims and these alternate implementations all fall within the scope of the appended claims.
This application is a continuation of U.S. patent application Ser. No. 14/326,208 entitled “IMPROVED INTERCONNECTION NETWORK TOPOLOGY FOR LARGE SCALE HIGH PERFORMANCE COMPUTING (HPC) SYSTEMS,” filed on Jul. 8, 2014, the disclosure of which is incorporated herein by reference in its entirety for all purposes.
Number | Name | Date | Kind |
---|---|---|---|
8489718 | Brar | Jul 2013 | B1 |
20030088696 | McCanne | May 2003 | A1 |
20090106529 | Abts et al. | Apr 2009 | A1 |
20100049942 | Kim et al. | Feb 2010 | A1 |
20120020242 | McLaren et al. | Jan 2012 | A1 |
20120072614 | Marr et al. | Mar 2012 | A1 |
20120144064 | Parker | Jun 2012 | A1 |
20120144065 | Parker et al. | Jun 2012 | A1 |
Entry |
---|
Jiang, Nan et al., “Indirect Adaptive Routing on Large Scale Interconnection Networks”, ISCA'09, Austin, Texas, USA Copyright 2009 ACM, Jun. 20-24, 2009. |
Kim, John et al., “Cost-Efficient Dragonfly Topology for Large-Scale Systems”, Published by the IEEE Computer Society 2009 IEEE, Feb. 1, 2009. |
Faanes, Greg et al., “Cray Cascade: a Scalable HPC System based on a Dragonfly Network”, Salt Lake City, Utah, USA © 2012 IEEE, Nov. 10-16, 2012. |
Garcia, M., “Global Misrouting Policies in Two-level Hierarchical Networks”, INA-OCMC '13, Berlin, Germany. Copyright 2013 ACM, Jan. 23, 2013. |
Kim, John et al., “Technology-Driven, Highly-Scalable Dragonfly Topology”, International Symposium on Computer Architecture 2008 IEEE. |
Abts, Dennis et al., “The Cray BlackWidow: A Highly Scalable Vector Multiprocessor”, Reno, Nevada, USA ISBN, Nov. 10-16, 2007. |
Scott, Steve et al., “The BlackWidow High-Radix Clos Network”, Proceedings of the 33rd International Symposium on Computer Architecture © 2006 IEEE. |
Number | Date | Country | |
---|---|---|---|
20160012002 A1 | Jan 2016 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 14326208 | Jul 2014 | US |
Child | 14486719 | | US |