The invention relates generally to the interconnection of nodes in a network. More specifically, the invention relates to interconnected nodes of a communication network that provide some combination of computation and/or data storage and provide an efficient interchange of data packets between the interconnected nodes. The network can be defined by a base network structure that can be optimized by selectively defining and connecting long hops between sections of the network, for example, to reduce the network diameter.
The invention provides a system and method for creating a cost-effective way of connecting together a very large number of producers and consumers of data streams.
One real-world analogy is the construction of roadway networks that allow drivers to get from a starting location to a destination while satisfying real-world constraints such as 1) a ceiling on the amount of tax people are willing to pay to fund roadway construction; 2) a desire to maximize speed of travel subject to safety constraints; and 3) a desire to avoid traffic jams at peak travel times of day.
In this analogy, the cars are similar to the data sent over a computer network and the starting locations and destinations represent the host computers connected to the network. The constraints translate directly into cost, speed and congestion constraints in computer networks.
A basic property of these types of problems is that they get much harder to solve efficiently as the number of starting locations and destinations (e.g., host computers) increases. As a starting point, consider how three destinations can be connected together.
As can be seen from the figures, both the number of connections between nodes and the number of different ways of making those connections grow faster than the number of nodes. For example, a set of 6 nodes can have more than twice as many alternative ways to connect the nodes as a set of 3 nodes. Also, the possible number of connections between the nodes can vary from, on the low side, the number of nodes (N) minus 1 for destinations connected, for example, along a single line as shown in
Another measure of the performance of a network is the diameter of the network, which refers to how many connections need to be traveled in order to get from any one destination to another. In the network shown in
The two networks shown in
Another difficulty arises in the construction of computer networks: It is difficult to have a large number of connections converging on a single point, such as shown in
The sample network layouts shown in
The emergence of “cloud computing”, supported by huge data centers where hundreds of thousands of computers all connected to one network provide economies of scale and thereby reduced costs, has stressed the ability of current network designs to provide a reliable and cost effective way of allowing data to be exchanged between the computers.
A number of approaches have been tried by both academia and industry, but to date, all the approaches fall short of theoretical limits by a factor of 2 to 5 times. Some embodiments of the invention include a method for constructing networks that can be within 5-10% of the theoretical maximum for data throughput across networks with multiple simultaneously communicating hosts, a highly prevalent use case in modern data centers.
In accordance with some embodiments of the invention, methods for constructing highly ordered networks of hosts and switches are disclosed that make maximum use of available switch hardware and interconnection wiring. The basic approach can include the following: selecting a symmetrical network base design, such as a hypercube, a star, or another member of the Cayley graph family; developing an appropriate topological routing method that simplifies data packet forwarding; and adding short cuts or long hops to the base symmetrical network to reduce the network diameter.
The regularity of symmetrical networks makes them well suited for topological addressing schemes.
It is one of the objects of the present invention to provide an improved network design that can be expanded greatly without performance penalty.
It is another object of the present invention to provide an improved network design that allows the network to be more easily operated and managed. In some embodiments, the entire network can be operated and managed as a single switch.
It is another object of the present invention to provide an improved network design that provides improved network performance. In some embodiments, the network can have 2 to 5 times greater bisection bandwidth than conventional network architectures that use the same number of component switches and ports.
The invention also includes flexible methods for constructing physical embodiments of the networks using commercially available switches and methods for efficiently, accurately and economically interconnecting (wiring) the switches together to form a high performance network having improved packet handling.
The present invention is directed to methods and systems for designing large networks and to the resulting large networks. Some embodiments of the invention provide a way of connecting large numbers of nodes, each consisting of some combination of computation and data storage, while providing improved behaviors and features. These behaviors and features can include: a) a practically unlimited number of nodes, b) throughput which scales nearly linearly with the number of nodes, without bottlenecks or throughput restriction, c) simple incremental expansion where increasing the number of nodes requires only a proportional increase in the number of switching components, while maintaining the throughput per node, d) maximized parallel multipath use of available node interconnection paths to increase node-to-node bandwidth, e) long hop topology enhancements which can simultaneously minimize latency (average and maximum path lengths) and maximize throughput at any given number of nodes, f) a unified and scalable control plane, g) a unified management plane, h) simple connectivity: nodes connected to an interconnection fabric do not need to have any knowledge of topology or connection patterns, i) streamlined interconnection path design: dense interconnections can be between physically near nodes, combined with a reduced number of interconnections between physically distant nodes, resulting in simple interconnection or wiring.
In one embodiment of the invention, the nodes can represent servers or hosts and network switches in a networked data center, and the interconnections represent the physical network cables connecting the servers to network switches, and the network switches to each other.
In another embodiment of the invention, the nodes can represent geographically separated clusters of processing or data storage centers and the network switches that connect them over a wide area network. The interconnections in this case can be the long distance data transfer links between the geographically separated data centers.
Those skilled in the art will realize that the described invention can be applied to many other systems where computation or data storage nodes require high bandwidth interconnection, such as central processing units in a massively parallel supercomputer or other multiple CPU or multi-core CPU processing arrays.
In accordance with some embodiments of the invention, component switches can be used as building blocks, wherein the component switches are not managed by data center administrators as individual switches. Instead, switches can be managed indirectly via the higher level parameters characterizing collective behavior of the network, such as latency (maximum and average shortest path lengths), bisection (bottleneck capacity), all-to-all capacity, aggregate oversubscription, ratio of external and topological ports, reliable transport behavior, etc. Internal management software can be used to translate selected values for these collective parameters into the internal configuration options for the individual switches and if necessary into rewiring instructions for data center technicians. This approach makes management and monitoring scalable.
Hypercubes and their variants have attracted a great deal of attention within the parallel computing and supercomputer fields, and recently for data center architectures as well, due to their highly efficient communications, high fault tolerance and reliable diagnostics, lack of bottlenecks, simple routing and processing logistics, and simple, regular construction. In accordance with some embodiments of the invention, a method of designing an improved network includes modifying a basic hypercube network structure in order to optimize latency and bandwidth across the entire network. Similar techniques can be used to optimize latency and bandwidth across other Cayley graph symmetrical networks such as star, pancake and truncated hypercube networks.
A symmetrical network is one that, from the perspective of a source or a destination, looks the same no matter where you are in the network. This property allows some powerful methods to be applied both for developing routing methods for moving traffic through the network and for adding short cuts to improve throughput and reduce congestion. One commonly known symmetrical network structure is based on the structure of a hypercube. The hypercube structured network can include a set of destinations organized as the corners of a cube, such as shown in
Hypercubes are just one form of symmetrical network. Another form of symmetrical network is the star graph shown in
In accordance with some embodiments of the present invention, topological routing can be used to route messages through the symmetrical network. Topological routing can include a method for delivering messages from a source node to a destination node through a series of intermediate locations or nodes where the destination address on the message describes how to direct the message through the network. A simple analogy is the choice of method for labeling streets and numbering houses in a city. In some planned areas such as Manhattan, addresses not only describe a destination location, “425 17th Street”, but also describe how to get there from a starting point. If it is known that house numbers are allocated 100 per block, and the starting location is 315 19th Street, it can be determined that the route includes going across one block and down two streets to get to the destination. Similarly, for the organization shown in
In contrast, a typical unplanned town like Concord, Mass., shown in
Topological addressing is important in large networks because it means that a large map does not have to be both generated and then consulted at each step along the way of sending a message to a destination. Generating a map is time consuming and consumes a lot of computing resources, and storing a map at every step along the way between destinations consumes a lot of memory storage resources and requires considerable computation to look up the correct direction on the map each time a message needs to be sent on its way towards its destination. The small maps required by topological addressing are not just a matter of theoretical concern. Present day data centers have to take drastic, performance impacting measures to keep their networks divided into small enough segments that the switches that control the forwarding of data packets do not get overwhelmed with building a map for the large number of destinations for which traffic flows through each switch.
The regularity of symmetrical networks makes them excellent candidates for having topological addressing schemes applied to them, just as a regular, basically symmetrical, arrangement of streets allows addresses to provide implied directions for getting to them.
In accordance with some embodiments of the invention, the performance of these symmetrical networks can be greatly improved by the select placement of “short cuts” or long hops according to the invention. The long hops can simultaneously reduce the distance between destinations and improve the available bandwidth for simultaneous communication. For example,
In accordance with some embodiments of the invention, this method can be applied to hypercubes of higher order with many more destinations. In accordance with some of the embodiments of the invention, a method for identifying select long hops in higher order hypercube networks and symmetric networks can include determining a generator matrix using linear error correcting codes to identify potential long hops within the network.
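The effect of a long hop on network diameter can be illustrated with a small sketch. It treats the network as a graph on d-bit labels whose links are XOR with a fixed set of hop vectors, and measures the diameter by breadth-first search from node 0 (by symmetry, the result is the same from any node). The single all-ones long hop used here is an assumption chosen only for illustration; it is not the generator-matrix selection method referred to above, and the function name `diameter` is hypothetical.

```python
from collections import deque


def diameter(d, hops):
    """Diameter of the graph on d-bit labels whose links are XOR with each hop."""
    dist = {0: 0}
    queue = deque([0])
    while queue:
        v = queue.popleft()
        for h in hops:
            w = v ^ h  # neighbor reached by taking hop h
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    if len(dist) != 1 << d:
        raise ValueError("the given hops do not connect all nodes")
    return max(dist.values())


if __name__ == "__main__":
    d = 4
    base = [1 << i for i in range(d)]        # ordinary hypercube hops
    long_hop = (1 << d) - 1                  # illustrative all-ones long hop
    print(diameter(d, base))                 # plain 4-cube
    print(diameter(d, base + [long_hop]))    # 4-cube plus one long hop per node
```

In this example one extra port per switch halves the diameter of the 4-cube, which is the kind of gain the long hops are intended to provide.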
All real world network implementations are limited by the physical constraints of constructing switches and wiring them together. With the limitations of conventional wiring techniques, one of the parameters that can be adjusted to improve network performance is to increase the number of ports per network switch, which allows that group of ports to exchange data with very high throughput within the single physical device. Problems then arise maintaining that high throughput when groups of switches have to be assembled in order to connect a large number of servers together. Switch manufacturers have been able to increase the number of ports per switch into the several hundreds (e.g., 500), and some new architectures claim the ability to create switch arrays that have several thousand ports. However, that is two to three orders of magnitude less than the number of servers in large data centers. The number of switch ports is referred to as the “radix” of the switch.
One difference between networks according to some embodiments of the invention and the prior art is that networks according to the invention can be expanded (increasing the number of host computer ports) practically without limit or performance penalty. The expansion can be flexible, using commodity switches having a variable radix. Although there are presently switches which can be upgraded from an initial configuration with a smaller radix to a configuration with a higher radix, the latter maximum radix is fixed in advance to at most a few hundred ports. Further, the ‘radix multiplier’ switching fabric for the maximum configuration is hardwired in the switch design. For example, a typical commercial switch such as the Arista 7500 can be expanded to 384 ports by adding up to 8 line cards, each providing 48 ports; but the switching fabric gluing the 8 separate 48 port switches into one 384 port switch is rigidly fixed by the design and it is even included in the basic unit. In contrast, the networks constructed according to some embodiments of the invention have no upper limit on the maximum number of ports they can provide. This holds for an initial network design as well as any subsequent expansion of the same network. In accordance with some embodiments of the invention, for any given type of switch having radix R, the upper limit for simple expansion without performance penalty is $2^{R-1}$ component switches. Since typical R is at least 48, even this conditional limit of $2^{47} \approx 1.4\cdot10^{14}$ on the radix expansion is already far larger than the number of ports in the entire Internet, let alone in any existing or contemplated data center.
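As a quick check of the arithmetic above, the following sketch evaluates the stated expansion limit for a given radix. The function name `max_component_switches` is hypothetical and used only for illustration.

```python
def max_component_switches(radix):
    """Stated upper limit for simple expansion: 2**(R-1) component switches."""
    return 2 ** (radix - 1)


if __name__ == "__main__":
    limit = max_component_switches(48)
    print(limit)             # exact integer value of 2**47
    print(f"{limit:.1e}")    # same value in scientific notation
```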
Another difference between networks according to some embodiments of the invention and prior art data centers is that data center layer 2 networks are typically operated and managed as networks of individual switches where each switch requires individual installation, configuration, monitoring and management. In accordance with some embodiments of the invention, the data center network can be operated and managed as a single switch. This allows the invention to optimize all aspects of performance and costs (of switching fabric, cabling, operation and management) to a far greater degree than existing solutions.
In addition, networks according to some embodiments of the invention can provide improved performance over any existing data center Layer 2 networks, on the order of 2 to 5 times greater bisection bandwidth than conventional network architectures that use the same number of component switches and ports.
The invention also describes novel and flexible methods for realizing physical embodiments of the network systems described, both in the area of wiring switches together efficiently, accurately and economically, as well as ways to use existing functionality in commercial switches to improve packet handling.
Hypercubes can be characterized by their number of dimensions, d. To construct a (d+1)-cube, take two d-cubes and connect all $2^d$ corresponding nodes between them, as shown in
For purpose of illustrating one embodiment of the invention, a d-cube can be a d-dimensional binary cube (or Hamming cube, hypercube graph) with network switches as its nodes, using d ports per switch for the d connections per node. By convention, coordinate values for nodes can be 0 or 1, e.g. a 2-cube has nodes at (x, y)=(0,0), (0,1), (1,0), (1,1), or written concisely as binary 2-bit strings: 00, 01, 10 and 11.
Each switch can have some number of ports dedicated to interconnecting switches, and hosts can be connected to some or all of the remaining ports not used to interconnect switches. Since the maximum number of switches N in a d-cube is $N = 2^d$, the dimensions d of interest for typical commercial scalable data center applications can include, for example, d=10 . . . 16, i.e. d-cubes with 1K-64K switches, which corresponds to a range of 20K-1280K physical hosts (computers or servers), assuming a typical subscription of 20 hosts per switch.
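The recursive construction and the sizing figures above can be sketched as follows. Nodes are represented as integers whose binary digits are the cube coordinates; the function names and the edge-list representation are assumptions made for illustration only.

```python
def grow_cube(edges, d):
    """Build the (d+1)-cube edge list from a d-cube edge list.

    Nodes are integers 0 .. 2**d - 1; the copy of node v is v + 2**d.
    """
    n = 1 << d
    copy = [(u + n, v + n) for (u, v) in edges]   # the second d-cube
    bridges = [(v, v + n) for v in range(n)]      # 2**d links joining the copies
    return edges + copy + bridges


def build_cube(d):
    """Return the edge list of a d-cube, starting from a single node (0-cube)."""
    edges = []
    for k in range(d):
        edges = grow_cube(edges, k)
    return edges


def sizing(d, hosts_per_switch=20):
    """Switch and host counts for a full d-cube (20 hosts per switch, as in the text)."""
    switches = 1 << d
    return switches, switches * hosts_per_switch


if __name__ == "__main__":
    print(len(build_cube(3)))   # number of links in a 3-cube
    for d in (10, 16):
        print(d, sizing(d))
```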
In accordance with some embodiments of the invention, a concise binary d-bit notation for nodes (and node labels) of a d-cube can be used. The hops, defined as the difference vectors between directly connected nodes, can be d-bit strings with a single bit=1 and (d−1) bits=0. The jump (difference vector) between any two nodes $S_1$ and $S_2$ can be $J_{12} = S_1 \oplus S_2$ (where $\oplus$ is a bitwise XOR), and the minimum number of hops (distance or the shortest path) L between them is the Hamming weight (count of 1's) of the jump $J_{12}$, i.e. $L \equiv L(J_{12}) = |J_{12}|$. There are exactly L! distinct shortest paths of equal length L between any two nodes $S_1$ and $S_2$ at distance L. The diameter D (maximum shortest path over all node pairs) of a d-cube is $D = \log_2(N) = d$ hops, which is also realized for each node. For any node S, its bitwise complement (~S) is at the maximum distance D from S. The average number of hops between two nodes is d/2 and the bisection (minimum number of links to cut in order to split a d-cube into 2 equal halves) is N/2.
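These quantities map directly onto integer bit operations. The following is a minimal sketch, assuming node labels are stored as Python integers; the function names are hypothetical.

```python
from math import factorial


def jump(s1, s2):
    """Difference vector between two node labels (bitwise XOR)."""
    return s1 ^ s2


def distance(s1, s2):
    """Minimum number of hops: the Hamming weight of the jump."""
    return bin(jump(s1, s2)).count("1")


def shortest_path_count(s1, s2):
    """Number of distinct shortest paths: L! for nodes at distance L."""
    return factorial(distance(s1, s2))


def farthest(s, d):
    """Bitwise complement of s within d bits, the node at maximum distance d."""
    return s ^ ((1 << d) - 1)


if __name__ == "__main__":
    d = 4
    s = 0b0101
    print(distance(s, 0b0110))             # hops between 0101 and 0110
    print(shortest_path_count(0, 0b111))   # shortest paths across a 3-cube
    print(distance(s, farthest(s, d)))     # equals the diameter d
```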
In accordance with some embodiments of the invention, the d-cube coordinates of the switches (d-bit strings with $d \approx 10 \ldots 16$) can be used as their physical MAC addresses M, and the optimal routing becomes very simple. Routing can be done entirely locally, within each switch using only O(log(N)) resources (where O can be ______ and N is the maximum number of switches). When a frame with destination address $M_{dst}$ arrives at a switch M, the switch M computes $J = M \oplus M_{dst}$ (bitwise XOR), and if J=0, then the switch M is the destination. Otherwise it selects the next hop h corresponding to any bit=1 in J, which will bring the frame one hop closer to $M_{dst}$, since the next node after the hop, at $M_{nxt} = M \oplus h$, will have one less bit=1, hence one less hop, in its jump vector to $M_{dst}$, which is $J_{nxt} = M_{nxt} \oplus M_{dst}$.
In accordance with some embodiments of the invention, the total number of switches Ns in the network is not an exact power of 2. In this case, the d-cubes can be truncated so that for any accessible M the relation M<Ns holds, where the bit string M is interpreted as an integer (instead of $M < 2^d$, which is used for a complete d-cube). Hence, instead of the O(N) size forwarding table and an O(N) routing tree, the switches only need one number Ns and their own MAC address M to forward frames along the shortest paths.
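The local forwarding decision described above can be sketched as follows. The sketch assumes addresses are integers, scans the set bits of the jump vector from the lowest bit upward, and skips any hop that would lead to a label at or above Ns in a truncated cube. The scan order and the function names are assumptions for illustration; the text only requires that some bit=1 of J be chosen.

```python
def next_hop(m, m_dst, n_switches):
    """Return the neighbor label to forward to, or None if m is the destination.

    Only m, m_dst and n_switches are needed; no forwarding table is consulted.
    """
    j = m ^ m_dst                     # jump vector to the destination
    if j == 0:
        return None                   # frame has arrived
    bit = 1
    while bit <= j:
        if j & bit:
            nxt = m ^ bit             # one hop closer to m_dst
            if nxt < n_switches:      # skip nodes missing from a truncated cube
                return nxt
        bit <<= 1
    raise ValueError("no valid hop; both labels must be below n_switches")


def route(src, dst, n_switches):
    """Follow next_hop from src to dst and return the list of visited labels."""
    path = [src]
    while True:
        nxt = next_hop(path[-1], dst, n_switches)
        if nxt is None:
            return path
        path.append(nxt)


if __name__ == "__main__":
    # Truncated 3-cube with 6 switches (labels 0..5).
    print(route(4, 3, 6))
```

A valid hop always exists when both labels are below Ns: clearing a bit lowers the label, and once only bits remain to be set, every intermediate label is no larger than the destination.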
In accordance with some embodiments of the invention, one useful parameter of the hypercubic network topology is the port distribution or ratio of the internal topology ports (T-ports) used to interconnect switches and external ports (E-ports) that the network uses to connect the network to hosts (servers and routers). Networks built according to some embodiments of the invention can use a fixed ratio: E/T (E=#E-ports, T=#T-ports) for all IPA switches. In accordance with one embodiment, the ratio is λ=1 (ports are split evenly between E and T), is shown in
For a hypercube of dimension d there are $m \equiv d\cdot 2^d$ total T-ports and S-ports, d ports per switch for either type. Since the ports are duplex, each E-port can be simultaneously a source and a sink (destination) of data packets. Hence, there are m sources X1, X2, . . . Xm and m destinations Y1, Y2 . . . Ym. The non-blocking (NB) property of a network or a switch can usually be defined via performance on the ‘permutation task’: each source Xi (i=1 . . . m) sends data to a distinct destination Yj (where $j = \pi_m[i]$ and $\pi_m$ is a permutation of m elements), and if these m transmissions can occur without collisions or blocking for all m! permutations of the Ys, the network is NB. The evaluation of the NB property of a network can depend on the specific meaning of “sending data” as defined by the queuing model. Based on the kinds of “sending data”, there can be two forms of NB, Circuit NB (NB-C) and Packet NB (NB-P). For NB-C, each source X can send a continuous stream at its full port capacity to its destination Y. For NB-P, each source can send one frame to its destination Y. In both cases, for NB to hold, for any π(Y) there exists a set of m paths (sequences of hops), each path connecting its XY pair. The difference in these paths for the two forms of NB is that for NB-C each XY path has to have all its hops reserved exclusively for its XY pair at all times, while for NB-P, the XY path needs to reserve a hop only for the packet forwarding step in which the XY frame is using it. Hence NB-C is a stronger requirement, i.e. if a network is NB-C then it is also NB-P.
In accordance with some embodiments of the invention, a hypercube network with λ=1 has the Packet Non-Blocking property. This is self-evident for d=1, where there are only 2 switches, two ports per switch, one T-port and one E-port. In this case m=2, hence there are only 2!=2 sets of XY pairing instances to consider: I1=[X1→Y1, X2→Y2] and I2=[X1→Y2, X2→Y1]. The set of m=2 paths for I1 is {(X1 0 Y1), (X2 1 Y2)}, each taking 0 hops to reach its destination (i.e. there were no hops between switches, since the entire switching function in each path was done internally within the switch). The paths are shown as (X S1 S2 . . . Sk Y), where the sequence Si specifies the switches visited by the frame in each hop from X. This path requires k−1 hops between the switches (X and Y are not switches but ports on S1 and Sk respectively). For the pairing I2, the two paths are {(X1 0 1 Y2), (X2 1 0 Y1)}, each 1 hop long. Since there were no collisions in either instance I1 or I2, the d=1 network is NB-P. For the next size hypercube, d=2, m=8 and there are 8! (40320) XY pairings, so we will look at just one instance (selected to maximize the demands over the same links) and show the selection of the m=8 collision free paths, before proving the general case.
To prove that in the general case all m=d·N frames sent by X1, X2, . . . Xm can be delivered to proper destinations, in a finite time and without collisions or dropped frames, the following routing algorithm can be used. In the initial state when m frames are injected by the sources into the network, each switch receives d frames from its d S-ports. If there were just one frame per switch instead of d, the regular hypercube routing could solve the problem, since there are no conflicts between multiple frames targeting the same port of the same switch. Since each switch also has exactly d T-ports, if each switch sends d frames, one frame to each port in any order, in the next stage each switch again has exactly d frames (received via its d T-ports), without collisions or frame drops so far. While such ‘routing’ can go on forever without collisions or frame drops, it does not guarantee delivery. In order to assure a finite time delivery, each switch must pick, out of the maximum d frames it can have in each stage, the frame closest to its destination (the one with the lowest Hamming weight of its jump vector $Dst \oplus Current$, where $\oplus$ is a bitwise XOR) and send it to the correct port. The remaining d−1 frames (at most; there may be fewer) are sent on the remaining d−1 ports applying the same rule (the closest one gets highest priority, etc). Hence after this step is done on each of the N switches, there are at least N frames (the N “winners” on N switches) which are now closer by 1 hop to their destinations, i.e. which are now at most d−1 hops away from their destination (since the maximum hop distance on a hypercube is d). After k such steps, there will be at least N frames which are at most d−k hops away from their destinations. Since the maximum distance on a hypercube is d hops, in at most d steps from the start at least N frames are delivered to their destinations and there are no collisions or drops.
Since the total number of frames to deliver is d·N, the above sequence of steps need not be repeated more than d times; therefore all frames are delivered in at most $d^2$ steps after the start. QED.
In accordance with some embodiments of the invention, load balancing can be performed locally at each switch. For each arriving frame, the switch can select the next hop along a different d-cube dimension than the last one sent, if one is available. Since for any two points with distance (shortest path) L there are L! alternative paths of equal length L, there are plenty of alternatives to avoid congestion, especially if aided by a central control and management system with a global picture of traffic flows.
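A minimal sketch of this local rule follows. It assumes each switch remembers the dimension it used for the last frame it sent (`last_dim`, a hypothetical name) and prefers any other dimension that still brings the frame closer to its destination; the tie-breaking order is an assumption.

```python
def productive_dims(current, dst):
    """Dimensions along which one hop moves a frame closer to dst."""
    j = current ^ dst
    return [i for i in range(j.bit_length()) if (j >> i) & 1]


def pick_dim(current, dst, last_dim):
    """Choose the next-hop dimension, avoiding last_dim when an alternative exists."""
    dims = productive_dims(current, dst)
    if not dims:
        return None              # frame is already at its destination
    for i in dims:
        if i != last_dim:
            return i             # a different dimension than the last one used
    return dims[0]               # only one productive dimension is left


if __name__ == "__main__":
    last = None
    for _ in range(3):
        last = pick_dim(0b000, 0b111, last)
        print(last)              # successive frames alternate between dimensions
```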
Much of this look ahead at the packet traffic flow and density at adjacent nodes required to decide which among the equally good alternatives to pick can be done completely locally between switches with a suitable lightweight one-hop (or few hops) self-terminating (time to live set to 1 or 2) broadcast through all ports, notifying neighbors about its load. The information packet broadcast in such manner by a switch M can also combine its knowledge about other neighbors (with their weight/significance scaled down geometrically, e.g. by a factor 1/d for each neighbor). The division of labor between this local behavior of switches and a central control and management system can be that switching for short distance and near time regions can be controlled by switches and that switching for long distance and long time behavior can be controlled by the central control and management system.
In accordance with some embodiments of the invention, symmetrical networks with long hop shortcuts are used to achieve high performance in the network; however, additional forwarding management can be used to optimize the network and achieve higher levels of performance. As the size of the network (number of hosts) becomes large, it is useful to optimize the forwarding processes to improve network performance.
One reason for current data center scaling problems is the non-scalable nature of the forwarding tables used in current switches. These tables grow as $O(N^2)$ where N is the number of edge devices (hosts) connected to the network. For large networks, this quickly leads to forwarding tables that cannot be economically supported with current hardware, leading to various measures to control forwarding table size by segmenting networks, which leads to further consequences and sub-optimal network behavior.
In accordance with some embodiments of the invention, each switch can maintain a single size forwarding table (of size O(N)) and network connection matrix (of size O(N·R), where R is the switch radix and N the number of switches). The scalable layer 2 topology and forwarding tables maintained by the switches can be based on hierarchical labeling and corresponding hierarchical forwarding behavior of the switches, which require only $m\cdot N^{1/m}$ table entries for the m-level hierarchy (where m is a small integer parameter, typically m=2 or 3).
In accordance with one embodiment of the invention, the network can be divided into a hierarchy of clusters, which for performance reasons align with the actual network connectivity. The 1st level clusters contain R nodes (switches) each, while each higher level cluster contains R sub-clusters of the previous lower level. Hence, each node belongs to exactly one 1st level cluster, which belongs to exactly one 2nd level cluster, etc. The number of levels m needed for a network with N nodes and a given R is then determined from the relations $R^{m-1} < N \le R^m$, i.e. $m = \lceil \log(N)/\log(R) \rceil$. The forwarding identifier (FID or Forwarding ID) of a node consists of m separate fields (digits of the node ordinal 0 . . . N−1 expressed in radix R), FID=F1.F2 . . . Fm, where F1 specifies the node index (number 0 . . . R−1) within its 1st level cluster, F2 the index of the node's 1st level cluster within its second level cluster, etc.
For example, in an N=100 node network and selecting R=10, each node is labeled via two decimal digits, e.g. a node 3.5 is a node with index 3 in a cluster with index 5. In this embodiment, if node 3.5 needs to forward to some node 2.8, all that 3.5 needs to know is how to forward to a single node in cluster 8, as long as each node within the cluster 8 knows how to forward within its own cluster. For multi-path topologies, nodes have more than a single destination forwarding address. In general, each node needs to know how to forward to 9 nodes in its own cluster and to a single node in each of the other 9 clusters, hence it needs tables with only 2*9=18 elements (instead of the 99 elements that conventional forwarding uses).
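The labeling scheme can be sketched as follows, assuming node ordinals 0 . . . N−1 and returning the FID as a list [F1, F2, ..., Fm] with the lowest radix-R digit first. The function names are hypothetical.

```python
def levels(n_nodes, radix):
    """Smallest m with radix**m >= n_nodes, i.e. R**(m-1) < N <= R**m."""
    m, capacity = 1, radix
    while capacity < n_nodes:
        capacity *= radix
        m += 1
    return m


def fid(ordinal, radix, m):
    """Digits [F1, ..., Fm] of a node ordinal in the given radix, lowest digit first."""
    digits = []
    for _ in range(m):
        ordinal, digit = divmod(ordinal, radix)
        digits.append(digit)
    return digits


def table_entries(radix, m):
    """Entries needed per node: (radix - 1) other targets at each of m levels."""
    return m * (radix - 1)


if __name__ == "__main__":
    m = levels(100, 10)
    print(m, fid(53, 10, m), table_entries(10, m))   # node 3.5 of the example
```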
In accordance with some embodiments of the invention, the forwarding tables for each node can consist of m arrays Ti, i=1 . . . m, each of size R elements (the elements are forwarding ports). For example, for R=16 and a network with N=64*1024 switches (corresponding to a network with 20*N=1280*1024 hosts), the forwarding tables in each switch consist of 4 arrays, each of size 16 elements, totaling 64 elements.
For any node F with FID(F)=F1. F2 . . . Fm the array T1[R] contains ports which F needs to use to forward to each of the R nodes in its own 1st level cluster. This forwarding is not assumed to be a single hop, so the control algorithm can seek to minimize the number of hops when constructing these tables. A convenient topology, such as hypercube type, makes this task trivial since each such forwarding step is a single hop to the right cluster. In accordance with some embodiments of the invention, in the hypercube network, the control algorithm can harmonize node and cluster indexing with port numbers so that no forwarding tables are needed at all. The array T2 contains ports F needed for forwarding to a single node in each of the R 2nd level clusters belonging to the same third level cluster as node F;T3 contains ports F needed for forwarding to a single node in each of the R 3rd level clusters belonging to the same 4th level cluster as F, . . . and finally Tm contains ports F needs to use to forward to a single node in each of the Rmth level cluster belonging to the same (m+1)th cluster (which is a single cluster containing the whole network).
In accordance with some embodiments of the invention, forwarding can be accomplished as follows. A node F with FID(F)=F1. F2 . . . Fm receiving a frame with final destination FID(Z)=Z1. Z2 . . . Zm determines the index i=1 . . . m of the highest ‘digit’ Zi that differs from its own corresponding ‘digit’ F, and forward the frame to the port Ti[Zi]. The receiving node G then has (from the construction of tables Ti) for its i-th digit the value Gi=Zi. Hence, repeating the procedure, node G determines the index j<i of the highest digit Zj differing from corresponding Gj and forwards to port Tj[Zj] . . . etc., until j=1, at which point the node is performing the final forwarding within its own cluster.
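The digit-comparison step can be sketched as follows, with FIDs stored as lists [F1, ..., Fm]. To trace a frame without modeling physical ports, the sketch assumes that forwarding via Ti[Zi] lands on a node whose i-th digit equals Zi, whose higher digits are unchanged, and whose lower digits all equal a fixed `gateway_digit`. That landing rule and the function names are assumptions made only for illustration; each traced step may correspond to several physical hops.

```python
def forwarding_level(f, z):
    """1-based index of the highest digit where f and z differ, or 0 if f == z."""
    for i in range(len(f), 0, -1):
        if f[i - 1] != z[i - 1]:
            return i
    return 0


def trace(f, z, gateway_digit=0):
    """List the FIDs visited while forwarding from f to z under the assumed landing rule."""
    path = [tuple(f)]
    cur = list(f)
    while True:
        i = forwarding_level(cur, z)
        if i == 0:
            return path                      # frame has reached its destination
        # Table lookup T_i[Z_i]: enter the cluster whose level-i index is Z_i.
        cur = [gateway_digit] * (i - 1) + [z[i - 1]] + cur[i:]
        path.append(tuple(cur))


if __name__ == "__main__":
    # Node 3.5 forwarding to node 2.8 in the N=100, R=10 example.
    print(trace([3, 5], [2, 8]))
```

Because the level index strictly decreases at each step, the trace takes at most m table lookups.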
In accordance with some embodiments of the invention, the implementation of this technique can involve the creation of hierarchical addresses. Since the forwarding to clusters at levels >1 involves approximation (a potential loss of information, and potentially sub-optimal forwarding), for the method to forward efficiently it can be beneficial to a) reduce the number of levels m to the minimum needed to fit the forwarding tables into the CAMs (content addressable memories) and b) reduce the forwarding approximation error for m>1 by selecting the formal clustering used in the construction of the network hierarchy to match as closely as possible the actual topological clustering of the network.
Forwarding efficiency can be improved by reducing the number of levels m to the minimum needed to fit the forwarding tables into the CAMs. In situations where one can modify only the switch firmware but not the forwarding hardware to implement hierarchical forwarding logic, the conventional CAM tables can be used. The difference from the conventional use is that instead of learning the MAC addresses, which introduces additional approximation and forwarding inaccuracy, the firmware can program the static forwarding tables directly with the hierarchical tables.
Since m levels reduce the size of the tables from N to $m \cdot N^{1/m}$ entries (e.g. m=2 reduces the tables from N entries to $2\sqrt{N}$ entries), a 2-3 level hierarchy may be sufficient to fit the resulting tables in a CAM memory of C=16K entries (e.g. m=2, C=16K allows 2·8K entries, or $N=(8\text{K})^2=64\text{M}$ nodes). Generally, m is the lowest value satisfying the inequality $m \cdot N^{1/m} \le C$.
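As a minimal sketch of the inequality above, the following function searches for the lowest m with m·N^(1/m) ≤ C. The function name `min_levels` and the search cap `max_levels` are assumptions made for illustration; floating-point roots are used, so values exactly at the boundary should be checked by hand.

```python
def min_levels(n_nodes, cam_entries, max_levels=64):
    """Lowest m such that m * n_nodes**(1/m) <= cam_entries, or None if not found."""
    for m in range(1, max_levels + 1):
        table_size = m * n_nodes ** (1.0 / m)   # m arrays of N^(1/m) entries each
        if table_size <= cam_entries:
            return m
    return None


print(min_levels(64 * 1024, 16 * 1024))   # 2 (two arrays of 256 entries)
print(min_levels(1000, 16 * 1024))        # 1 (a flat table already fits)
```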
In order to reduce the forwarding approximation error for m>1, the formal clustering used in the construction of the hierarchy should match as closely as possible the actual topological clustering of the network. For enhanced hypercube topologies used by the invention, optimum clustering is possible since hypercubes are a clustered topology with m=log(N). In practice, where minimum m is preferred, the hypercubes of dimension d are intrinsically clustered into lower level hypercubes corresponding to a partition of d into m parts. E.g. the partition d=a+b corresponds to $2^a$ clusters (a hypercube of dim=a) of size $2^b$ each (hypercubes of dim=b). The following clustering algorithm performs well in practice and can be used for general topologies:
A node which is the farthest node from the existent complete clusters is picked as the seed for the next cluster (the first pick, when there are no other clusters, is arbitrary). The new cluster is grown by adding to it one of the unassigned nearest neighbors x based on the scoring function: V(x)=#i−#e, where #i is the number of intra-cluster links and #e is the number of extra-cluster links in the cluster resulting from adding node x to it. The neighbor x with max value of V(x) score is then assigned to the cluster. The cluster growth stops when there are no more nodes or when the cluster target size is reached (whichever comes first). When no more unassigned nodes are available the clustering layer is complete. The next layer clusters are constructed by using the previous lower layer clusters as the input to this same algorithm.
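One layer of the clustering procedure above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the graph is a dictionary from node to neighbor list, ties are broken by lowest node label, the first seed is the lowest label, and growth stops early if the cluster has no unassigned neighbors left. All function names are hypothetical.

```python
from collections import deque


def hop_distances(adj, sources):
    """Breadth-first hop counts from a set of source nodes."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def score(adj, cluster):
    """V = (#intra-cluster links) - (#links leaving the cluster)."""
    intra = sum(1 for u in cluster for v in adj[u] if v in cluster) // 2
    extra = sum(1 for u in cluster for v in adj[u] if v not in cluster)
    return intra - extra


def grow_clusters(adj, target_size):
    """Partition the nodes of adj (dict: node -> neighbor list) into clusters."""
    unassigned = set(adj)
    clusters = []
    while unassigned:
        assigned = [v for v in adj if v not in unassigned]
        if assigned:
            # Seed: the unassigned node farthest from all completed clusters.
            dist = hop_distances(adj, assigned)
            seed = max(sorted(unassigned), key=lambda v: dist.get(v, float("inf")))
        else:
            seed = min(unassigned)           # first pick is arbitrary
        cluster = {seed}
        unassigned.discard(seed)
        while len(cluster) < target_size:
            frontier = sorted({v for u in cluster for v in adj[u]} & unassigned)
            if not frontier:
                break
            best = max(frontier, key=lambda x: score(adj, cluster | {x}))
            cluster.add(best)
            unassigned.discard(best)
        clusters.append(cluster)
    return clusters


# Example: two 4-cliques joined by the single link 3-4.
adj = {i: [j for j in range(4) if j != i] for i in range(4)}
adj.update({i: [j for j in range(4, 8) if j != i] for i in range(4, 8)})
adj[3].append(4)
adj[4].append(3)
print(grow_clusters(adj, target_size=4))     # [{0, 1, 2, 3}, {4, 5, 6, 7}]
```

The next layer would be obtained by contracting each cluster into a single node and running the same function on the contracted graph.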
In accordance with some embodiments of the invention, networks can be considered to include n “switches” (or nodes) of radix (number of ports per switch) $R_i$ for the i-th switch, where i=1 . . . n. The network thus has a total of $P_T=\sum_i R_i$ ports. Some number of ports $P_I$ is used for internal connections between switches (“topological ports”), leaving $P=P_T-P_I$ ports free (“external ports”), available for use by servers, routers, storage, etc. The number of cables $C_I$ used by the internal connections is $C_I=P_I/2$. For regular networks (graphs), those in which all nodes have the same number of topological links per node m (i.e. m is the node degree), it follows that $P_I=n \cdot m$.
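The port and cable accounting above can be checked with a few lines of arithmetic. The sketch below assumes a regular network in which every switch has the same radix and the same degree m; the function name and the radix value 24 in the example are hypothetical.

```python
def port_budget(n_switches, radix, degree):
    """Port and cable counts for a regular network (same radix and degree everywhere)."""
    if degree > radix:
        raise ValueError("degree cannot exceed the switch radix")
    internal_ports = n_switches * degree          # P_I = n * m
    if internal_ports % 2:
        raise ValueError("n * m must be even: each cable uses two ports")
    total_ports = n_switches * radix              # P_T = sum of all radices
    return {
        "total_ports": total_ports,
        "internal_ports": internal_ports,
        "external_ports": total_ports - internal_ports,   # P = P_T - P_I
        "cables": internal_ports // 2,                    # C_I = P_I / 2
    }


# 16 switches with 7 topological ports each; radix 24 is an illustrative value.
print(port_budget(16, 24, 7))
```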
The network capacity or throughput is commonly characterized via the bisection (bandwidth), which is defined in the following manner: the network is partitioned into two equal subsets (an equipartition) S1+S2 so that each subset contains n/2 nodes (within ±1 for odd n). The total number of links connecting S1 and S2 is called a cut for the partition S1+S2. The bisection B is defined as the smallest cut (min-cut) over all possible equipartitions S1+S2 of the network.
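For very small networks the definition can be evaluated directly by enumerating every equipartition. The following brute-force sketch assumes an even number of nodes labeled 0..n−1 and an undirected edge list; the function name is hypothetical and the approach is only practical for a few dozen nodes.

```python
from itertools import combinations


def bisection(n, edges):
    """Smallest cut over all equipartitions of nodes 0..n-1 (n even), by brute force."""
    if n % 2:
        raise ValueError("this sketch handles even n only")
    best = None
    # Node 0 is pinned to the first half so each partition is visited once.
    for rest in combinations(range(1, n), n // 2 - 1):
        half = {0, *rest}
        cut = sum(1 for u, v in edges if (u in half) != (v in half))
        if best is None or cut < best:
            best = cut
    return best


# 3-dimensional hypercube: node u is linked to u with one bit flipped.
cube_edges = [(u, u ^ (1 << b)) for u in range(8) for b in range(3) if u < u ^ (1 << b)]
print(bisection(8, cube_edges))   # 4
```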
Bisection is thus an absolute measure of the network bottleneck throughput. A related, commonly used relative throughput measure is the network oversubscription φ, defined by considering the P/2 free ports in each min-cut half, S1 and S2, with each port sending and receiving at its maximum capacity to/from the ports in the opposite half. The maximum traffic that can be sent in each direction this way without overloading the network is B link (port) capacities, since that is how many links the bisection has between the halves. Any additional demand that the free ports are capable of generating is thus considered to be an “oversubscription” of the network. Hence, the oversubscription φ is defined as the ratio:
$$\varphi \equiv \frac{P/2}{B} = \frac{P}{2B} \tag{3.1}$$
The performance comparisons between network topologies, such as [1]-[5], [9]-[10], typically use non-oversubscribed networks (φ=1) and compare the costs in terms of the number of switches n of common radix R and the number of internal cables $C_I$ used in order to obtain a given target number of free ports P. Via eq. (3.1), that is equivalent to comparing the costs n and $C_I$ needed to obtain a common target bisection B.
Therefore, the fundamental underlying problem is how to maximize B given the number of switches n each using some number of topological ports per switch m (node degree). This in turn breaks down into two sub-problems:
For general networks (graphs), both sub-problems are computationally intractable, i.e. NP-complete problems. The ‘easier’ of the two tasks is (i), since (ii) requires multiple evaluations of (i) as the algorithm for (ii) iterates/searches for the optimum B. Task (i) involves finding the graph equipartition H0+H1 which has the minimum number of links between the two halves. In the general case, one would have to examine every possible equipartition H0+H1, count the links between the two halves in each case, and then pick the one with the lowest count. Since there are $C(n, n/2) \approx 2^n/\sqrt{\pi n/2}$ ways to split the set of n nodes into two equal halves, the exact brute force solution has exponential complexity. The problem with approximate bisection algorithms is poor solution quality as network size increases: polynomial-complexity bisection algorithms applied to general graphs cannot guarantee finding an approximate cut even to within a constant factor of the actual minimum cut as n increases. Without an accurate enough measure of network throughput, subtask (ii) cannot even begin to optimize the links.
An additional problem with (ii) becomes apparent even for small networks, such as those with a few dozen nodes, for which one can compute the exact B via brute force and also compute the optimum solution by examining all combinations of the links. Namely, a greedy approach for solving (ii) successively computes B for all possible additions of the next link, then picks the link which produces the largest increment of B among all possible additions. That procedure continues until the target number of links per node is reached. Numerical experiments on small networks show that in order to get the optimum network in the step m→m+1 links per node, one often needs to replace one or more existing links as well, namely links which were required for the optimum at previous, smaller values of m.
In addition to bandwidth optimization for a given number of switches and cables, the latency, average or maximum (diameter), is another property that is often a target of optimization. Unlike the B optimization, where an optimum solution dramatically reduces network costs, yielding ~2-5 times fewer switches and cables compared to conventional and approximate solutions, the improvements in latency are less sensitive to the distinction between the optimal and approximate solutions, with typical advantage factors of only 1.2-1.5. Accordingly, greater optimization can be achieved in LH networks by optimizing the bisection than by optimizing the network to improve latency.
The present invention is directed to Long Hop networks and methods of creating Long Hop networks. The description provides illustrative examples of methods for constructing a Long Hop network in accordance with the invention. In accordance with one embodiment, one function of a Long Hop network is to create a network interconnecting a number of computer hosts to transfer data between computer hosts connected to the network. In accordance with some embodiments, the data can be transferred simultaneously and with specified constraints on the rate of data transmission and the components (e.g., switches and switch interconnect wiring) used to build the network.
In accordance with the invention, a Long Hop network includes any symmetrical network whose topology can be represented by a Cayley graph whose generators correspond to the columns of Error Correcting Code (ECC) generator matrices G (or their isometric equivalents; instead of G, one can also use equivalent components of the parity check matrix H). In addition, the Long Hop networks in accordance with some embodiments of the invention can have performance (bisection in units of n/2) within 90% of the lower bounds of the related ECC, as described by the Gilbert-Varshamov bound theorem. In accordance with some embodiments of the invention, Long Hop networks will include networks having 128 or more switches (e.g., dimension 7 hypercube or greater) and/or direct networks. In accordance with some embodiments of the invention, Long Hop networks can include networks having the number of interconnections m not equal to d, d+1, . . . d+d−1 and m not equal to n−1, n−2. In accordance with some embodiments of the invention, the wiring pattern for connecting the switches of the network can be determined from a generator matrix that is produced from the error correcting code that corresponds to the hypercube dimension and the number of required interconnections determined as a function of the oversubscription ratio.
In other embodiments of the invention, similar methods can be used to create networks for interconnecting central processing units (CPUs) as is typically used in supercomputers, as well as to interconnect data transfer channels within integrated circuits or within larger hardware systems such as backplanes and buses.
In accordance with some embodiments of the invention, the Long Hop network can include a plurality of network switches and a number of network cables connecting ports on the network switches to ports on other network switches or to host computers.
Each cable connects either a host computer to a network switch or a network switch to another network switch. In accordance with some embodiments of the invention, the data flow through a cable can be bidirectional, allowing data to be sent simultaneously in both directions. In accordance with some embodiments of the invention, the rate of data transfer can be limited by the switch or host to which the cable is connected. In accordance with other embodiments of the invention, the data flow through the cable can be uni-directional. In accordance with other embodiments of the invention, the rate of data transfer can be limited only by the physical capabilities of the physical cable media (e.g., the construction of the cable). In accordance with some embodiments, the cable can be any medium capable of transferring data, including metal wires, fiber optic cable, and wired and wireless electromagnetic radiation (e.g., radio frequency signals and light signals). In accordance with some embodiments, different types of cable can be used in the same Long Hop network.
In accordance with some embodiments of the invention, each switch has a number of ports and each port can be connected via a cable to another switch or to a host. In accordance with some embodiments of the invention, at least some ports can be capable of sending and receiving data, and at least some ports can have a maximum data rate (bits per second) at which they can send or receive. Some switches can have ports that all have the same maximum data rate, and other switches can have groups of ports with different data rates or different maximum data transfer rates for sending or receiving. In accordance with some embodiments, all switches can have the same number of ports, and all ports can have the same send and receive maximum data transfer rate. In accordance with other embodiments of the invention, at least some of the switches in a Long Hop network can have different numbers of ports, and at least some of the ports can have different maximum data transfer rates.
The purpose of a switch is to receive data on one of its ports and to send that data out on another port based on the content of the packet header fields. Switches can receive data and send data on all their ports simultaneously. A switch can be thought of as similar to a rail yard where incoming train cars on multiple tracks can be sent onward on different tracks by using a series of devices that control which track among several options a car continues onto.
In accordance with some embodiments of the invention, the Long Hop network is constructed of switches and cables. Data is transferred between a host computer or a switch and another switch over a cable. The data received from a sending host computer enters a switch, which can then forward the data either directly to a receiving host computer or to another switch, which in turn decides whether to continue forwarding the data to another switch or directly to a host computer connected to the switch. In accordance with some embodiments of the invention, all switches in the network can be connected both to other switches and to hosts. In accordance with other embodiments of the invention, there can also be interior switches that only send and receive to other switches and not to hosts.
In accordance with some embodiments, the Long Hop network can include a plurality of host computers. A host computer can be any device that sends and/or receives data to or from a switch over a cable. In accordance with some embodiments of the invention, host computers can be considered the source and/or destination of the data transferred through the network, but not considered to be a direct part of the Long Hop network being constructed. In accordance with some embodiments of the invention, host computers cannot send or receive data faster than the maximum data transfer rate of the switch port to which they are connected.
In accordance with some embodiments of the invention, at least some of the following factors can influence the construction of the network. The factors can include 1) the number of hosts that must be connected; 2) the number of switches available; 3) the number of ports on each switch; 4) the maximum data transfer rate for switch ports; and 5) the sum total rate of simultaneous data transmission by all hosts. Other factors, such as the desired level of fault tolerance and redundancy, can also be factors in the construction of a Long Hop network.
In accordance with some embodiments of the invention, the desired characteristics of the Long Hop network can limit the combinations of the above factors that can actually be used to build a Long Hop network. For example, it is not possible to connect more hosts to a network than the total number of switches multiplied by the number of ports per switch, minus the number of ports used to interconnect switches. As one of ordinary skill would appreciate, a number of different approaches can be used to design a network depending on the desired outcome. For example, for a specified number of hosts, switches with a given maximum data transfer rate, and ports per switch, how many switches are needed and how should they be connected in order to allow all hosts to send and receive simultaneously at 50% of their maximum data transfer rate? Alternatively, for a specified number of hosts and a number of switches with a given number of ports and maximum data transfer rate, how much data can be simultaneously transferred across the network and what switch connection pattern(s) support that performance?
For purposes of illustration, the following description explains how to construct a Long Hop network according to some embodiments of the invention. In this embodiment, the Long Hop network includes 16 switches and uses up to 7 ports per switch for network interconnections (between switches). As one of ordinary skill will appreciate, any number of switches can be selected and the number of ports for network interconnection can be selected in accordance with the desired parameters and performance of the Long Hop network.
In accordance with some embodiments of the invention, the method includes determining how to wire the switches (or change the wiring of an existing network of switches) and the relationship between the number of attached servers per switch and the oversubscription ratio.
In accordance with some embodiments of the invention, the ports on each switch can be allocated to one of two purposes: external connections (e.g., for connecting the network to external devices including host computers, servers and external routers or switches that serve as sources and destinations within the network), and topological or internal connections. An external network connection is a connection between a switch and a source or destination device that enables data to enter the network from a source or exit the network to a destination. A topological or internal network connection is a connection between the network switches that form the network (e.g., one that enables data to be transferred across the network).
In accordance with some embodiments of the invention, the oversubscription ratio can be determined as the ratio between the total number of host connections (or more generally, external ports) and the bisection (given as number of links crossing the min-cut partition). In accordance with some embodiments of the invention, an oversubscription ratio of 1 indicates that in all cases, all hosts can simultaneously send at the maximum data transfer rate of the switch port. In accordance with some embodiments of the invention, an oversubscription ratio of 2 indicates that the network can only support a sum total of all host traffic equal to half of the maximum data transfer rate of all host switch ports. In accordance with some embodiments of the invention, an oversubscription ratio of 0.5 indicates that the network has twice the capacity required to support maximum host traffic, which provides a level of failure resilience such that if one or more switches or connections between switches fails, the network will still be able to support the full traffic volume generated by hosts.
In accordance with some embodiments of the invention, the base network can be an n-dimensional hypercube. In accordance with other embodiments of the invention, the base network can be another symmetrical network such as a star, a pancake or another Cayley graph based network structure. In accordance with some embodiments of the invention, an n-dimensional hypercube can be selected as a function of the desired number of switches and interconnect ports.
In accordance with some embodiments of the invention, a generator matrix is produced for the linear error correcting code that matches the underlying hypercube dimension and the number of required interconnections between switches as determined by the network oversubscription ratio. In accordance with some embodiments of the invention, the generator matrix can be produced by retrieving it from one of the publicly available lists, such as the one maintained by the MinT project (http://mint.sbg.ac.at/index.php). In accordance with other embodiments of the invention, the generator matrix can be produced using a computer algebra system such as the Magma package (available from http://magma.maths.usyd.edu.au/magma/). For example, with the Magma package, a command can be entered into the Magma calculator (http://magma.maths.usyd.edu.au/calc/):
Generator Matrix:
In accordance with some embodiments of the invention, a linear error correcting code generator matrix can be converted into a wiring pattern matrix by rotating the matrix counterclockwise 90 degrees, for example, as shown in Table 4.9.
In the illustrative example shown in Table 4.9, there are 16 total switches and each switch has 7 ports connected to other switches, corresponding to an LH augmented dimension 4 hypercube. Generators h1 through h7 correspond to the original columns from the rotated [G4,7] matrix and can be used to determine how the switches are connected to each other by cables.
In accordance with some embodiments of the invention, the 16 switches can be labeled with binary addresses, 0000, 0001, through 1111. The switches can be connected to each other using the 7 ports assigned for this purpose, labeled h1 through h7, by performing the following procedure for each of the sixteen switches. For example, connect a cable between each source switch network port (1-7) and the same port number on the destination switch whose number is determined by performing an exclusive or logical operation between the source switch number and the value of the Cayley graph generator h1 to h7 (column 2 in the table below) for the network port number.
For example, to determine how to connect the 7 wires going from switch number 3 (binary 0011), take the graph generator (number in 2nd column) and exclusive or (XOR) it with 0011 (the source switch number), which results in “Destination switch number” in the following connection map (the XOR of columns 2 and 3 yields column 4):
This wiring procedure describes how to place the connections to send from a source switch to a destination switch, so for each connection from a source switch to a destination switch, there is also a connection from a destination switch to a source switch. As a practical matter, in this embodiment, a single bi-directional cable is used for each pair of connections.
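The XOR wiring rule can be sketched in a few lines. Note the assumption: the seven generator values in `GENERATORS` are placeholders chosen only so that the example runs (any seven distinct nonzero 4-bit values would do); they are not taken from Table 4.9, and the function names are hypothetical.

```python
# Placeholder generators h1..h7 (NOT the values of Table 4.9).
GENERATORS = [0b0001, 0b0010, 0b0100, 0b1000, 0b0111, 0b1011, 0b1101]


def destinations(src, generators):
    """Destination switch for each network port of switch src (port k uses h_k)."""
    return [src ^ h for h in generators]


def cable_list(d, generators):
    """One bidirectional cable per (switch pair, port number) for 2**d switches."""
    cables = set()
    for src in range(1 << d):
        for port, h in enumerate(generators, start=1):
            dst = src ^ h
            # src->dst and dst->src on the same port share a single cable.
            cables.add((min(src, dst), max(src, dst), port))
    return sorted(cables)


print([format(x, "04b") for x in destinations(0b0011, GENERATORS)])
print(len(cable_list(4, GENERATORS)))   # 56 cables = 16 switches * 7 ports / 2
```

Because XOR with a fixed generator is its own inverse, the same port number is used at both ends of every cable, which is what allows one bidirectional cable to serve each pair of connections.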
The LH networks are direct networks constructed using general Cayley graphs $\mathrm{Cay}(G_n, S_m)$ for the topology of the switching network. The preferred embodiment for LH networks belongs to the most general hypercubic-like networks, with a uniform number of external (E) and topological (m) ports per switch (where E+m=R=‘switch radix’), which retain the vertex and edge symmetries of the regular d-cube. The resulting LH network with $n=2^d$ switches in that case is a Cayley graph of type $\mathrm{Cay}(Z_2^d, S_m)$ with n−1>m>d+1 (these restrictions on m exclude well known networks such as the d-cube, which has m=d, the folded d-cube with m=d+1, as well as fully meshed networks with m=n and m=n−1). It will become evident that the construction method shown on the $Z_2^d$ example applies directly to the general group $Z_q^d$ with q>2. For q>2, the resulting $\mathrm{Cay}(Z_q^d, S_m)$ is the most general LH type construction of d-dimensional hyper-torus-like or flattened butterfly-like networks of extent q (which is equivalent to a hyper-mesh-like network with cyclic boundary conditions). The preferred embodiment will use q=2, since $Z_2^d$ is the best choice from a practical perspective due to the shortest latency (average and max), highest symmetry, simplest forwarding and routing, simplest job partitioning (e.g. for multi-processor clusters), and easiest and most economical wiring in the $Z_q^d$ class.
Following the overall task breakdown in section 3, the LH construction proceeds in two main phases:
For the sake of clarity, the main phases are split further into smaller subtasks, each described in the sections that follow.
A network built on the $\mathrm{Cay}(Z_q^d, S_m)$ graph has $n=q^d$ vertices (syn. nodes), and for q=2, which is the preferred embodiment, $n=2^d$ nodes. These n nodes make up the n-element vertex set $V=\{\nu_0, \nu_1, \ldots, \nu_{n-1}\}$. We are using 0-based subscripts since we need to do modular arithmetic with them.
The nodes $\nu_i$ are labeled using d-tuples in an alphabet of size q: $\nu_i \equiv i \in \{0, 1, \ldots, n-1\}$, expressed as d-digit integers in base q. The group operation, denoted as ⊕, is not the same as integer addition mod n; rather, it is component-wise addition modulo q done on the d components separately. For q=2, this is equivalent to a bitwise XOR operation between the d-tuples, as illustrated in Table 2.1 (Appendix A), which shows the full $Z_2^d$ group operation table for d=4.
Table 4.1 illustrates the analogous $Z_3^d$ group operation table for d=2 and q=3; hence there are $n=3^2=9$ group elements and the operation table has n×n=9×9=81 entries. The 2-digit entries have digits from the alphabet {0,1,2}. The n rows and n columns are labeled using 2-digit node labels. The table entry at row r and column c contains the result of r⊕c (component-wise addition mod q=3). For example, the 3rd row, labeled 02, and the 6th column, labeled 12, yield the table entry 02⊕12 = ((0+1)%3, (2+2)%3) = (1,1) = 11.
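The component-wise group operation can be sketched as below, with node labels held as integers whose base-q digits are the d components. The helper names `group_add` and `label` are hypothetical, and `label` assumes q ≤ 10 so that each digit prints as one character.

```python
def group_add(r, c, q, d):
    """Component-wise addition modulo q of two d-digit base-q labels."""
    result, weight = 0, 1
    for _ in range(d):
        result += ((r % q + c % q) % q) * weight   # add one digit pair, no carry
        r, c, weight = r // q, c // q, weight * q
    return result


def label(x, q, d):
    """d-digit base-q string for node x (assumes q <= 10)."""
    digits = []
    for _ in range(d):
        digits.append(str(x % q))
        x //= q
    return "".join(reversed(digits))


# Row 02, column 12 of the q=3, d=2 table:
print(label(group_add(int("02", 3), int("12", 3), q=3, d=2), q=3, d=2))   # 11

# For q=2 the operation coincides with bitwise XOR:
print(all(group_add(a, b, 2, 4) == a ^ b for a in range(16) for b in range(16)))  # True
```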
It can be noted in Table 4.1 for $Z_3^d$ and in Table 2.1 (Appendix A) for $Z_2^d$ that each row r and column c contains all n group elements, but in a unique order. The 0-th row and 0-th column contain the unmodified c and r values, since the ‘identity element’ is $I_0=0$. Both tables are symmetrical since the operation r⊕c=c⊕r is symmetrical (which is a characteristic of the abelian group $Z_q^d$ used in the example).
The generator set $S_m$ contains m “hops” $h_1, h_2, \ldots, h_m$ (they are also elements of the group $G_n$ in $\mathrm{Cay}(G_n, S_m)$), which can be viewed as the labels of the m nodes to which the “root” node $\nu_0 \equiv 0$ is connected. Hence, the row r=0 of the adjacency matrix [A] has m ones, at columns A(0,h) for the m hops $h \in S_m$, and 0 elsewhere. Similarly, the column c=0 has m ones, at rows A(h,0) for the m hops $h \in S_m$, and 0 elsewhere. In the general case, some row r=y has m ones at columns A(y, y⊕h) for $h \in S_m$ and 0 elsewhere. Similarly, a column c=x has m ones at rows A(x⊕h, x) for $h \in S_m$ and 0 elsewhere. Denoting the contribution of a single generator $h \in S_m$ to the adjacency matrix [A] as a matrix T(h), these conclusions can be written more compactly via Iverson brackets and the bitwise OR operator ‘|’ as:
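$$T(a)_{i,j} \equiv [\,i \oplus a = j\,] \;|\; [\,j \oplus a = i\,], \qquad a \in G_n \tag{4.1}$$
$$[A] = \sum_{h \in S_m} T(h) = \sum_{s=1}^{m} T(h_s) \tag{4.2}$$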
Note that eq. (4.1) defines T(a) for any element a (or vertex) of the group $G_n$. Since the right hand side expression in eq. (4.1) is symmetric in i and j, it follows that T(a) is a symmetric matrix, hence it has a real, complete eigenbasis:
$$T(a)_{i,j} = T(a)_{j,i} \tag{4.3}$$
For the group $G_n=Z_2^d$, the group operator ⊕ becomes the regular bitwise XOR operation (‘^’ in C-like notation), simplifying eq. (4.1) to:
$$T(a)_{i,j} \equiv [\,i \oplus j = a\,], \qquad a \in Z_2^d \tag{4.4}$$
Table 4.2 illustrates the T(a) matrices for q=2, d=3, n=8 and all group elements a=0 . . . 7. For a given a=0 . . . 7, the value 1 is placed on row r and column c iff $r \oplus c = a$, and 0 otherwise (0s are shown as ‘-’).
Table 4.3 (a) shows the 8×8 adjacency matrix [A] obtained for the generator set $S_4 \equiv \{1, 2, 4, 7\}_{hex} \equiv \{001, 010, 100, 111\}_{bin}$ by adding the 4 generators from Table 4.2: [A]=T(1)+T(2)+T(4)+T(7), via eq. (4.2). For pattern clarity, values 0 are shown as ‘-’. Table 4.3 (b) shows the indices of the 4 generators (1, 2, 3, 4) which contributed 1 to a given element of [A] in Table 4.3 (a).
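The construction of [A] from the single-generator matrices can be reproduced with a short sketch for the q=2 case, where a 1 is placed at row i and column j iff i XOR j equals the generator. The function names `T` and `adjacency` mirror the notation of the text but are otherwise hypothetical.

```python
def T(a, n):
    """Single-generator matrix for q=2: entry (i, j) is 1 iff i XOR j == a."""
    return [[1 if (i ^ j) == a else 0 for j in range(n)] for i in range(n)]


def adjacency(generators, n):
    """[A] as the sum of T(h) over the generator set."""
    A = [[0] * n for _ in range(n)]
    for h in generators:
        Th = T(h, n)
        for i in range(n):
            for j in range(n):
                A[i][j] += Th[i][j]
    return A


A = adjacency([1, 2, 4, 7], 8)
print(A[0])                              # [0, 1, 1, 0, 1, 0, 0, 1]
print(all(sum(row) == 4 for row in A))   # True: every node has m = 4 links
```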
To solve the eigen-problem of [A], a couple of additional properties of T(a) are derived from eq. (4.4) (using $x \oplus x = 0$ and $x \oplus y = y \oplus x$):
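As a sketch of how these two properties follow from eq. (4.4), with ⊕ denoting the bitwise XOR of $Z_2^d$ and the sum over $\ell$ running over all n group elements (the equation numbers are those cited in the surrounding text):

$$\big(T(a)\,T(b)\big)_{i,j} = \sum_{\ell=0}^{n-1} [\,i \oplus \ell = a\,]\,[\,\ell \oplus j = b\,] = [\,i \oplus a \oplus j = b\,] = [\,i \oplus j = a \oplus b\,] \;\Rightarrow\; T(a)\,T(b) = T(a \oplus b) \tag{4.5}$$

$$T(a)\,T(b) = T(a \oplus b) = T(b \oplus a) = T(b)\,T(a) \tag{4.6}$$

The first bracket in the sum is nonzero only for $\ell = i \oplus a$, which is why the sum collapses to a single Iverson bracket.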
Eq. (4.5) shows that the T(a) matrices are a representation of the group $G_n$, and eq. (4.6) shows that they commute with each other. Since, via eq. (4.2), [A] is the sum of T(a) matrices, [A] commutes with all T(a) matrices as well. Therefore, since they are all also symmetric matrices, the entire set {[A], T(a) ∀a} has a common eigenbasis (via result (M4) in section 2.F). The next sequence of equations shows that the Walsh functions, viewed as n-dimensional vectors $|U_k\rangle$, are the eigenvectors of the T(a) matrices. Using eq. (4.4) for the matrix elements of T(a), the action of T(a) on the Walsh ket vector $|U_k\rangle$ yields for the i-th component of the resulting vector:
The result $U_k(i \oplus a)$ is transformed via eq. (2.5) for the general function values of $U_k(x)$:
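Written out as a sketch, and assuming that the Walsh functions satisfy the multiplicative property $U_k(x \oplus y) = U_k(x)\,U_k(y)$ (the role played here by eq. (2.5)), the two steps referenced as eqs. (4.7) and (4.8) read:

$$\big(T(a)\,|U_k\rangle\big)_i = \sum_{j=0}^{n-1} T(a)_{i,j}\,U_k(j) = \sum_{j=0}^{n-1} [\,i \oplus j = a\,]\,U_k(j) = U_k(i \oplus a) \tag{4.7}$$

$$U_k(i \oplus a) = U_k(a)\,U_k(i) \tag{4.8}$$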
Collecting all n components of the left side of eq. (4.7) and the right side of eq. (4.8) yields, in vector form:
$$T(a)\,|U_k\rangle = U_k(a)\,|U_k\rangle \tag{4.9}$$
Hence, the orthogonal basis set $\{|U_k\rangle,\ k=0 \ldots n-1\}$ is the common eigenbasis for all T(a) matrices and for the adjacency matrix [A]. The n eigenvalues of T(a) are the Walsh function values $U_k(a)$, k=0 . . . n−1. The eigenvalues of [A] are obtained by applying eq. (4.9) to the expansion of [A] via T(h), eq. (4.2):
Since $U_0(x)=1$ is constant (for x=0 . . . n−1), the eigenvalue $\lambda_0$ of [A] for the eigenvector $|U_0\rangle$ is:
$$\lambda_0 = m \ge \lambda_k \tag{4.12}$$
From eq. (4.11) it also follows that $\lambda_0 \ge \lambda_k$ for k=1 . . . n−1, since the sum in eq. (4.11) may contain one or more negative addends $U_k(h_s)=-1$ for k>0, while for the k=0 case all addends are equal to +1.
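The eigenvalue statement can be checked numerically for a small case. The sketch below assumes the Walsh functions in Hadamard (natural) ordering, $U_k(x)=(-1)^{\mathrm{popcount}(k \,\&\, x)}$, which is a common convention but is an assumption here since eqs. (2.5)-(2.6) are outside this section; it also takes the eigenvalue of [A] for $|U_k\rangle$ to be the sum of $U_k(h_s)$ over the generators, as described above. Function names are hypothetical.

```python
def walsh(k, x):
    """Walsh function value U_k(x) = (-1)^popcount(k & x) (assumed ordering)."""
    return -1 if bin(k & x).count("1") % 2 else 1


def adjacency(generators, n):
    """Adjacency matrix of Cay(Z_2^d, S): 1 iff i XOR j is a generator."""
    return [[1 if (i ^ j) in generators else 0 for j in range(n)] for i in range(n)]


def eigenvalue(k, generators):
    """lambda_k as the sum of U_k(h) over the generators h."""
    return sum(walsh(k, h) for h in generators)


def is_eigenvector(k, generators, n):
    """Check [A]|U_k> == lambda_k |U_k> component by component."""
    A = adjacency(generators, n)
    u = [walsh(k, x) for x in range(n)]
    lam = eigenvalue(k, generators)
    Au = [sum(A[i][j] * u[j] for j in range(n)) for i in range(n)]
    return Au == [lam * ui for ui in u]


S4 = [1, 2, 4, 7]
print([eigenvalue(k, S4) for k in range(8)])             # [4, 0, 0, 0, 0, 0, 0, -4]
print(all(is_eigenvector(k, S4, 8) for k in range(8)))   # True
```

The first eigenvalue equals m=4 and no other eigenvalue exceeds it, in line with eq. (4.12).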
Cuts from Adjacency Matrix and Partition Vector
The bisection B is computed by finding the minimum cut C(X) in the set E={X} of all possible equipartitions $X=S_1+S_2$ of the set of n vertices. An equipartition X can be represented by an n-dimensional vector $|X\rangle$ containing n/2 values +1 selecting the nodes of group $S_1$, and n/2 values −1 selecting the nodes of group $S_2$. Since the cut value of a given equipartition X does not depend on the particular +1/−1 labeling convention (e.g. changing the sign of all elements defines the same graph partition), all vectors $|X\rangle$ will by convention have the 1st component set to 1, and only the remaining n−1 components need to be varied (permuted) to obtain all possible distinct equipartitions from E. Hence, the equipartition set E consists of all vectors $X=(x_0, x_1, \ldots, x_{n-1})$, where $x_0=1$, $x_i \in \{+1,-1\}$ and $\sum_{i=0}^{n-1} x_i = 0$.
The cut value C(X) for a given partition $X=(x_0, x_1, \ldots, x_{n-1})$ is obtained as the count of links which cross between nodes in $S_1$ and $S_2$. Such links can be easily identified via X and the adjacency matrix [A], since $[A]_{i,j}$ is 1 iff nodes i and j are connected and 0 if they are not connected. The group membership of some node i is stored in the component $x_i$ of the partition X. Therefore, the links (i,j) that are counted have $[A]_{i,j}=1$, i.e. nodes i and j must be connected, and they must be in opposite partitions, i.e. $x_i \ne x_j$. Recalling that $x_i$ and $x_j$ have values +1 or −1, the condition “$x_i \ne x_j$” is equivalent to “$x_i \cdot x_j = -1$”. To express that condition as a contribution +1 when $x_i \ne x_j$ and a contribution 0 when $x_i = x_j$, the expression $(1 - x_i \cdot x_j)/2$ is constructed, which yields precisely the desired contributions +1 and 0 for any $x_i, x_j = \pm 1$. Hence, the values added to the link count can be written as $C_{i,j} \equiv (1 - x_i \cdot x_j) \cdot [A]_{i,j}/2$, since $C_{i,j}=1$ iff nodes i and j are connected ($[A]_{i,j}=1$) and they are in different groups ($x_i \cdot x_j = -1$). Otherwise $C_{i,j}$ is 0, thus adding no contribution to C(X).
A counting detail that needs a bit of care arises when adding the $C_{i,j}$ terms for all i,j=0 . . . n−1. Namely, if the contribution of e.g. $C_{3,5}$ for nodes 3 and 5 is 1, because $[A]_{3,5}=1$ (3,5 linked), $x_3=-1$ and $x_5=+1$, then the same link will also contribute via the $C_{5,3}$ term, since $[A]_{5,3}=1$, $x_5=+1$, $x_3=-1$. Hence the sum of $C_{i,j}$ for all i,j=0 . . . n−1 counts the contribution of each link twice. Therefore, to compute the cut value C(X) for some partition X, the sum of the $C_{i,j}$ terms must be divided by 2. Noting also that for any vector $X \in E$, $\langle X|X\rangle = \sum_{i=0}^{n-1} x_i x_i = n$ and $\sum_{i,j=0}^{n-1} [A]_{i,j} = \sum_{j=0}^{n-1} m = n \cdot m$, this yields for the cut C(X):
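Carrying out the steps just described, with $C_{i,j} = (1 - x_i x_j)\,[A]_{i,j}/2$ and the halving for double counting, gives the following sketch of the resulting expression for the cut:

$$C(X) = \frac{1}{2}\sum_{i,j=0}^{n-1} C_{i,j} = \frac{1}{4}\sum_{i,j=0}^{n-1} (1 - x_i x_j)\,[A]_{i,j} = \frac{n\,m}{4} - \frac{1}{4}\langle X|A|X\rangle = \frac{n}{4}\left(m - \frac{\langle X|A|X\rangle}{\langle X|X\rangle}\right) \tag{4.14}$$

The last step uses $\langle X|X\rangle = n$ to write the quadratic form as a Rayleigh-type quotient.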
To illustrate the operation of formula (4.14), Table 4.5 shows the adjacency matrix [A] for $\mathrm{Cay}(Z_2^4, S_5)$, which reproduces the folded 4-cube, with d=4, $n=2^d=2^4=16$ nodes and m=5 links per node, produced by the generator set $S_5=\{1, 2, 4, 8, F\}_{hex}=\{0001, 0010, 0100, 1000, 1111\}_{bin}$. The row and column headers show the sign pattern of the example partition X=(1, 1, 1, 1, −1, −1, −1, −1, 1, 1, 1, 1, −1, −1, −1, −1), and the shaded areas indicate the blocks of [A] in which eq. (4.14) counts ones, i.e. the elements of [A] where row r and column c have opposite signs of the X components $x_r$ and $x_c$. The cut is computed as C(X)=½ (sum of ones in shaded blocks)=½*(4*8)=16, which is the correct B for the folded 4-cube (2*n/2=2*8=16). Note that the zeros in the matrix [A] (which do not contribute to C(X)) are shown as the ‘-’ symbol.
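The folded 4-cube example can be reproduced numerically. The sketch below computes the cut of the example partition in two ways: by counting crossing links directly, and via the quadratic form C(X) = (n·m − ⟨X|A|X⟩)/4, which is the form the counting argument above leads to (stated here as an assumption about eq. (4.14)). Function names are hypothetical.

```python
def adjacency(generators, n):
    """Adjacency matrix of Cay(Z_2^d, S): 1 iff i XOR j is a generator."""
    return [[1 if (i ^ j) in generators else 0 for j in range(n)] for i in range(n)]


def cut_by_counting(A, x):
    """Number of links whose endpoints carry opposite signs in x."""
    n = len(x)
    return sum(A[i][j] for i in range(n) for j in range(i + 1, n) if x[i] != x[j])


def cut_by_formula(A, x, m):
    """Cut via the quadratic form: (n*m - <X|A|X>) / 4."""
    n = len(x)
    xAx = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    return (n * m - xAx) // 4


S5 = {0x1, 0x2, 0x4, 0x8, 0xF}          # folded 4-cube generators
A = adjacency(S5, 16)
X = [1, 1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1]
print(cut_by_counting(A, X), cut_by_formula(A, X, m=5))   # 16 16
```

Only the generators with bit 2 set (0100 and 1111) cross this partition, each contributing n/2 = 8 links, which gives the value 16.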
Bisection B is computed as the minimum cut C(X) for all XεE, which via eq. (4.14) yields:
Despite the apparent similarity between the max{ } term ME in eq. (4.16) and the max{ } term MV in eq. (2.46), the Rayleigh-Ritz eqs. (2.45)-(2.46) do not directly apply to the min{ } and max{ } expressions in eq. (4.15). Namely, the latter extrema are constrained to the set E of equipartitions, which is a proper subset of the full vector space to which Rayleigh-Ritz applies. The ME≡max{ } in eq. (4.16) can be smaller than the MV≡max{ } computed by eq. (2.46), since MV can be attained by a vector of the full vector space which doesn't belong to E (the set containing only the equipartition vectors X), i.e. if MV is solved only by some vectors Y which do not consist of exactly n/2 elements +1 and n/2 elements −1.
As an illustration of the problem, ME is analogous to the “tallest programmer in the world” while MV is analogous to the “tallest person in the world.” Since the set of “all persons in the world” (analogous to the full vector space) includes as a proper subset the set of “all programmers in the world” (analogous to E), the tallest programmer may be shorter than the tallest person (e.g. the latter might be a non-programmer). Hence in the general case the relation between the two extrema is ME≦MV. The equality holds only if at least one solution for MV also belongs to E, or in the analogy, if at least one person among the “tallest persons in the world” is also a programmer. Otherwise, strict inequality holds: ME<MV.
In order to evaluate ME≡max{ } in eq. (4.16), the full n-dimensional vector space (the space to which the vectors |X⟩ belong) is decomposed into a direct sum of two mutually orthogonal subspaces:
(full space) = (span of ⟨1|) ⊕ (orthogonal complement of ⟨1|) (4.17)
The first subspace is the one-dimensional space spanned by a single ‘vector of all ones’ ⟨1| defined as:
⟨1|≡(1,1,1, . . . ,1) (4.18)
while the second is the (n−1)-dimensional orthogonal complement of the first within the full space, i.e. it is spanned by some basis of n−1 vectors which are orthogonal to ⟨1|. Using the eq. (2.6) for Walsh function U0(x), it follows:
⟨1|≡(1,1,1, . . . ,1)=⟨U0| (4.19)
Hence, the orthogonal complement is spanned by the remaining orthogonal set of n−1 Walsh functions |Uk⟩, k=1 . . . n−1. For convenience the latter subset of Walsh functions is labeled as set Φ below:
Φ≡{|Uk⟩: k=1 . . . n−1} (4.20)
Since all vectors X∈E contain n/2 components equal +1 and n/2 components equal −1, then via (4.18):
⟨1|X⟩=Σ_{i=0}^{n−1} 1·xi=0, ∀X∈E (4.21)
i.e. ⟨1| is orthogonal to all equipartition vectors X from E, hence the entire set E is a proper subset of the orthogonal complement of ⟨1| (which is the set of all vectors of the full space orthogonal to ⟨1|). Using ME in eq. (4.16) and eq. (2.46) results in:
The MV in eq. (4.22) is solved by an eigenvector |Y⟩ of [A] for which [A]|Y⟩=λmax|Y⟩ since:
Recalling, via eq. (4.10), that the eigenbasis of the adjacency matrix [A] in eq. (4.22) is the set of Walsh functions |Uk⟩, and that the orthogonal complement of ⟨1|, in which the MV=max{ } is searched for, is spanned by the n−1 Walsh functions |Uk⟩∈Φ, it follows that the eigenvector |Y⟩ of [A] in eq. (4.23) can be selected to be one of these n−1 Walsh functions from Φ (since they form a complete eigenbasis of [A] in that subspace) i.e.:
|Y⟩∈Φ≡{|Uk⟩: k=1 . . . n−1} (4.24)
The equality in (4.22) holds if at least one solution |Y⟩ from the orthogonal complement of ⟨1| is also a vector from the set E. In terms of the earlier analogy, this can be stated as: in the statement “the tallest programmer”≦“the tallest person”, the equality holds if at least one among the “tallest persons” happens to be a “programmer.”
Since |Y⟩ is one of the Walsh functions from Φ and since all |Uk⟩∈Φ have, via eqs. (2.5) and (2.7), exactly n/2 components equal +1 and n/2 components equal −1, |Y⟩ belongs to the set E. Hence the exact solution for ME in eq. (4.22) is the Walsh function |Uk⟩∈Φ with the largest eigenvalue λk. Returning to the original bisection eq. (4.15), where ME is the second term, it follows that B is solved exactly by this same solution |Y⟩=|Uk⟩∈Φ. Combining thus eq. (4.15) with the equality case for ME in eq. (4.22) yields:
Therefore, the computation of B is reduced to evaluating the n−1 eigenvalues λk of [A] for k=1 . . . n−1 and finding a t≡(k with the largest λk), i.e. a t such that λt≧λk for k=1 . . . n−1. The corresponding Walsh function Ut provides the equipartition which achieves this bisection B (the exact minimum cut). The evaluation of λk in eq. (4.25) can be written in terms of the m generators hs∈Sm via eq. (4.11) as:
Although the function values Uk(x) above can be computed via eq. (2.5) as Uk(x)=(−1)^parity(k&x), due to the parallelism of binary operations on a regular CPU, it is computationally more efficient to use the binary form of Walsh functions, Wk(x). The binary algebraic translations in eqs. (2.8) can be rewritten in vector form for Uk and Wk, with the aid of the definition of ⟨1| from eq. (4.18), as:
Hence, the B formula (4.26) can be written in terms of Wk via eq. (4.28) and Wk formula eq. (2.10) as:
The final expression in (4.29) is particularly convenient since for each k=1 . . . n−1 it merely adds parities of the bitwise AND terms: (k&hs) for all m Cayley graph generators hs∈Sm. The parity function in eq. (4.29) can be computed efficiently via a short C function ([14] p. 42) as follows:
Using a parity implementation Parity(x), the entire computation of B via eq. (4.29) can be done by a small C function Bisection(n,hops[ ],m) as shown in code (4.31).
The inner loop in (4.31) executes m times and the outer loop (n−1) times, yielding a total of ~m·n steps. Hence, for n−1 values of k, the total computational complexity of B is ~O(m·n^2).
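The listings (4.30) and (4.31) are not reproduced in this text. The following is a minimal sketch of what such functions might look like, assuming the bisection is evaluated as B=(n/2)·min over k=1 . . . n−1 of the sum of parities of (k&hs), as described above. Only the names Parity and Bisection and the argument order (n,hops[ ],m) come from the text; the bodies are an assumption, not the original code.

```c
#include <stdint.h>

/* Parity of a 32-bit word: 1 if x has an odd number of 1 bits, else 0.
   Each step folds the upper half of the remaining bits onto the lower half. */
unsigned Parity(uint32_t x)
{
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return (unsigned)(x & 1u);
}

/* Bisection B (in links) of Cay(Z_2^d, S_m) with n = 2^d nodes.
   hops[] holds the m generators; each is assumed to satisfy 0 < h < n. */
unsigned Bisection(unsigned n, const uint32_t hops[], unsigned m)
{
    unsigned b = m;                       /* no relative cut can exceed m */
    for (uint32_t k = 1; k < n; k++) {    /* all non-constant Walsh functions */
        unsigned cut = 0;
        for (unsigned s = 0; s < m; s++)  /* sum of W_k(h_s) over generators */
            cut += Parity(k & hops[s]);
        if (cut < b) b = cut;
    }
    return b * (n / 2);                   /* b is in units of n/2 links */
}
```

For the folded 4-cube of Table 4.5, the call Bisection(16, hops, 5) with hops={1, 2, 4, 8, 0xF} evaluates the same quantity as the cut example above.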
A significant further speedup can be obtained by taking full advantage of the symmetries of Walsh functions Wk particularly evident in the recursive definition of Hadamard matrix Hn in eq. (2.1). The corresponding recursion for the binary Walsh matrix [Wn] can be written as:
where [
Analogue of the above ‘halving’ optimization of B computation can be formulated for the algebraic form of Walsh functions Uk by defining a function ƒ(x) for x=0, 1, . . . n−1 as:
where 0≦x<n and Sm={h1, h2, . . . hm} is the set of m graph generators. Hence, f(x) is 1 when x is equal to one of the generators hs∈Sm and 0 elsewhere. This function can be viewed as a vector |f⟩, with components fi=f(i). Recalling the computation of adjacency matrix [A] via eq. (4.2), vector |f⟩ can also be recognized as the 0-th column of [A], i.e. fi=[A]0,i. With this notation, the eq. (4.26) for B becomes:
Therefore, the B computation consists of finding the largest element in the set {Fk} of n−1 elements. Using the orthogonality and completeness of the n vectors |Uk⟩, ⟨Uj|Uk⟩=n·δj,k from eq. (2.3), an important property of the set {Fk} follows:
The eqs. (4.35),(4.36) can be recognized as the Walsh transform ([14] chap. 23) of function ƒ(x), with the n coefficients Fk/n as the transform coefficients. Hence, evaluation of all n coefficients Fk, which in direct (4.35) computation requires O(n^2) steps, can be done via Fast Walsh Transform (FWT) in O(n·log(n)). Note that FWT will produce n coefficients Fk, including F0, even though F0 is not needed, i.e. according to eq. (4.34), the max{ } is still sought in the set {F1, F2, . . . Fn-1}. Since each step involves adding of m points, the net complexity of the B computation via (4.34) using FWT is O(m·n·log(n)), which is the same as the “symmetry optimization” result in the previous section.
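A minimal sketch of the transform-based evaluation is given below. It assumes that Fk is the sum over x of Uk(x)·f(x) with Uk(x)=±1 according to the parity of (k&x), and that B=(n/4)·(m−max Fk) over k=1 . . . n−1, which is consistent with the folded 4-cube example (n=16, m=5, B=16). The function name BisectionFWT and the in-place butterfly layout are illustrative choices, not the original implementation.

```c
#include <stdlib.h>
#include <stdint.h>

/* Bisection B via an in-place Fast Walsh (Hadamard) Transform of f(x).
   n must be a power of two, n >= 2; hops[] holds m distinct generators
   with 0 < h < n. Returns B in links, or -1 if memory allocation fails. */
long BisectionFWT(unsigned n, const uint32_t hops[], unsigned m)
{
    long *F = calloc(n, sizeof *F);
    if (F == NULL) return -1;

    for (unsigned s = 0; s < m; s++)      /* f(x) = 1 iff x is a generator */
        F[hops[s]] = 1;

    /* Butterfly passes: after the last pass,
       F[k] = sum over x of (-1)^parity(k & x) * f(x). */
    for (unsigned len = 1; len < n; len <<= 1)
        for (unsigned i = 0; i < n; i += 2 * len)
            for (unsigned j = i; j < i + len; j++) {
                long a = F[j], b = F[j + len];
                F[j] = a + b;
                F[j + len] = a - b;
            }

    long fmax = F[1];                     /* F[0] = m is excluded */
    for (unsigned k = 2; k < n; k++)
        if (F[k] > fmax) fmax = F[k];

    free(F);
    return (long)n * ((long)m - fmax) / 4;
}
```

The transform visits every coefficient once per pass, so the log2(n) passes replace the explicit loop over all k in the direct evaluation.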
Although both methods above achieve a speedup by a factor n/log(n) over the direct use of eqs. (4.26) and (4.29), the far greater saving has already occurred in the original eq. (4.26). Namely, the eq. (4.26) computes B by computing only the n−1 cuts for equipartitions Uk∈Φ, instead of computing the cuts for all equipartitions in the set E of all possible equipartitions. The size of the full set E of “all possible equipartitions” is (the factor ½ is due to the convention that all partitions in E have +1 as the 1st component):
To appreciate the savings by eq. (4.26) alone, consider a very small network of merely n=32 nodes. To obtain the exact B for this network the LH method needs to compute n−1=31 cuts, while the exact enumeration would need to compute |E|=0.5·C(32,16)=300,540,195 cuts i.e. 9,694,845 times greater number of cuts. Further, this ratio via eq. (2.37) grows exponentially in the size of the network n, nearly doubling for each new node added.
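The figures quoted for n=32 can be reproduced with the short sketch below, which assumes |E|=½·C(n,n/2) as stated above. The function names are hypothetical, and the 64-bit arithmetic is only adequate for small n such as this example.

```c
#include <stdint.h>

/* Binomial coefficient C(n,k) by the multiplicative formula.
   After step i the running value equals C(n-k+i, i), so every
   division is exact. Intended for small n only (64-bit range). */
uint64_t Binomial(unsigned n, unsigned k)
{
    uint64_t c = 1;
    for (unsigned i = 1; i <= k; i++)
        c = c * (n - k + i) / i;
    return c;
}

/* Number of equipartitions of n nodes with the first component fixed to +1. */
uint64_t EquipartitionCount(unsigned n)
{
    return Binomial(n, n / 2) / 2;
}
```

For n=32, EquipartitionCount(32) divided by the n−1=31 cuts of the Walsh-function method gives the ratio quoted in the text.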
With the two O(m·n·log(n)) complexity methods for computation of bisection B for a given set of generators Sm described in the previous sections, the next task is the optimization of the generator set Sm={h1, h2, . . . hm}, i.e. finding the Sm with the largest B. The individual hops hs are constrained to n−1 values: 1, 2, . . . n−1 (0 is eliminated since no node is connected to itself), i.e. Sm is an m element subset of the integer sequence 1 . . . n−1. For convenience, this set of all m-subsets of integer sequence 1 . . . n−1 is labeled as follows:
With this notation and using the binary formula for B, eq. (4.29), the B optimization task is:
For convenience, eq. (4.42) also defines a quantity b which is the bisection in units n/2. The worst case computational complexity of the B optimization is thus O((m·n·log(n))^m), which is polynomial in n, hence, at least in principle, it is a computationally tractable problem as n increases. (The actual exponent would be (m−log(n)−1), not m, since the Cayley graphs are highly symmetrical and one would not have to search over the symmetrically equivalent subsets Sm.) Note that m is typically a hardware characteristic of the network components, such as switches, which usually don't get replaced often as network size n increases.
Since for large enough n, even a low power polynomial can render ‘an in principle tractable’ problem practically intractable, approximate methods for the max{ } part of the computation (4.42) would be used in practice. Particularly attractive for this purpose would be genetic algorithms and simulated annealing techniques used in [12] (albeit for the task of computing B, which the methods of this invention solve efficiently and exactly). Some of the earlier implementations of this invention have used fast greedy algorithms, which work fairly well. The ‘preferred embodiment’ for the invention which is described next does not perform any such direct optimization of eq. (4.42), but uses a more effective method instead.
In order to describe this method, the inner-most term within the nested max{min{ }} expression in the eq. (4.42) is identified and examined in more detail. For convenience, this term, which has a meaning of a cut for a partition defined via the pattern of ones in the Walsh function Wk(x), is labeled as:
Eq. (4.43) also expresses Wk(x) in terms of the parity function via eq. (2.10). The parity function for some d-bit integer x=(xd-1 . . . x1x0)binary is defined as:
The last expression in eq. (4.44) shows that the parity of x=(xd-1 . . . x1 x0) is a “linear combination”, in terms of the selected field GF(2)^d, of the field elements provided in the argument. The eq. (4.43) contains a modified argument of type (k&h), for h∈Sm, which can be reinterpreted as: the ‘ones’ from the integer k are selecting a subset of bits from the d-bit integer h, then the parity function performs the linear combination of the selected subset of bits of h. For example, if k=11dec=1011bin then the action of W1011(h)≡parity(1011&h) is to compute the linear combination of bits 0, 1 and 3 of h (bit numbering is zero based, from low/right to high/left significance). Since eq. (4.43) performs the above “linear combination via ones in k” action of Wk on a series of d-bit integers hs, s=1 . . . m, the Wk “action” on such a series of integers is interpreted as the parallel linear combination on the bit-columns of the list of hs as shown in the Table 4.6, for k=1011 and W1011 acting on a set of generators S5={0001, 0010, 0100, 1101}. The 3 bit-columns V3, V1 and V0 selected by ones in k are combined via XOR into the resulting bit-column V: |V3⟩⊕|V1⟩⊕|V0⟩=|V⟩.
Therefore, the action of a Wk on the generator set Sm={h1, h2, . . . hm} can be seen as a “linear combination” of the length-m columns of digits (columns selected by ones in k from Wk) formed by the m generators hs. If, instead of the group Z_2^d used in the example of Table 4.6, there were a more general Cayley graph group, such as Z_q^d, then instead of the bit-columns there would be length-m columns made of digits in an alphabet of size q (i.e. integers 0 . . . q−1) and the XOR would be replaced with the appropriate GF(q) field arithmetic, e.g. addition modulo q on m-tuples for Z_q^d as illustrated in an earlier example in Table 4.1. The construction of column vectors |Vμ⟩ of Table 4.6 can be expressed more precisely via an m×d matrix [Rm,d] defined as:
where: (|Vμ⟩)s≡hs,μ=(⟨hs|)μ for μ=0 . . . d−1, s=1 . . . m (4.46)
Hence the m rows of matrix [Rm,d] are the m generators ⟨hs|∈Sm and its d columns are the d column vectors |Vμ⟩. The above ‘linear combination of columns via ones in k’ becomes in this notation:
where the linear combination of kμ|Vμ⟩ is performed in GF(q), i.e. mod q on each component of the m-tuples kμ|Vμ⟩. The sum computing the cut Ck in eq. (4.43) is then simply adding (without mod q) all components of the vector |V(k)⟩ from eq. (4.47). Recalling the definition of Hamming weight as the number of non-zero digits, this cut Ck is recognizable as the Hamming weight ⟨V(k)⟩ of the vector |V(k)⟩:
Ck=⟨V(k)⟩ (4.48)
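The column view can be sketched in C for the binary case q=2 as follows. The sketch assumes m≦32 so that one bit-column fits in a 32-bit word (bit s of a column word holds the digit contributed by generator hs); the function names are hypothetical.

```c
#include <stdint.h>

/* Bit-column mu of the generator list: bit s of the result is bit mu
   of hops[s]. Assumes m <= 32. */
uint32_t Column(const uint32_t hops[], int m, int mu)
{
    uint32_t v = 0;
    for (int s = 0; s < m; s++)
        v |= ((hops[s] >> mu) & 1u) << s;
    return v;
}

/* XOR of the columns selected by the ones in k: the GF(2) linear
   combination of the d columns with coefficients k_mu. */
uint32_t CombineColumns(const uint32_t hops[], int m, int d, uint32_t k)
{
    uint32_t v = 0;
    for (int mu = 0; mu < d; mu++)
        if ((k >> mu) & 1u)
            v ^= Column(hops, m, mu);
    return v;
}

/* Hamming weight: the number of non-zero bits of v. */
int Weight(uint32_t v)
{
    int w = 0;
    for (; v != 0; v >>= 1)
        w += (int)(v & 1u);
    return w;
}
```

With the four generators listed in the text, {0001, 0010, 0100, 1101}, and k=1011, CombineColumns returns the column V whose components are the parities of (k&hs), and Weight of that column is the cut Ck.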
The next step is to propagate the new “linear combination” interpretation of the Wk action back one more level, to the original optimization problem in eq. (4.42), in which the cut Ck was only the innermost term. The min{ } block of eq. (4.42) seeks a minimum value of Ck for all k=1 . . . n−1. The set of n vectors |V(k)⟩ obtained via eq. (4.47) when k runs through all possible integers 0 . . . n−1 is a d-dimensional vector space, a linear span (a subspace of the vector space of m-tuples), which is denoted as (d,m,q):
(d,m,q)≡{|V(k)⟩: k=0 . . . n−1} (4.49)
Therefore, the min{ } level optimization in eq. (4.42) computing bisection b seeks a non-zero vector |V(k)⟩ from the linear span (d,m,q) with the smallest Hamming weight ⟨V(k)⟩:
b=min{⟨V(k)⟩: (V(k)∈(d,m,q)) and (V(k)≠0)} (4.50)
While Hamming weight can be used in some embodiments of the invention, any other weight, such as Lee weight, which would correspond to other Cayley graph groups Gn and generator sets Sm, can also be used.
But b in eq. (4.50) is precisely the definition eq. (2.25) of the minimum weight wmin in the codeword space (linear span) (_k,_n,q) of non-zero codewords Y. Note: In order to avoid the mix up in the notation between the two fields, the overlapping symbols [n,k] which have a different meaning in ECC, will in this section have an underscore prefix, i.e. the linear code [n,k] is relabeled as [_n,_k].
The mapping between the ECC quantities and LH quantities is then: wmin↔b, _k↔d, _n↔m, and the _k vectors ⟨gi| spanning the linear space (_k,_n,q) of _n-tuples and constructing the code generator matrix [G] (eq. (2.20)) ↔ the d columns |Vμ⟩ for μ=0 . . . d−1 spanning the linear space (d,m,q) of m-tuples (digit-columns in the generator list). Since, via eq. (2.26), the minimum weight of the code wmin is the same as the minimum distance Δ between the codewords Y, it follows that the bisection b is also the same quantity as the ECC Δ (even numerically). Table 4.7 lists some of the elements of this mapping.
ECC linear space (_k,_n,q) ↔ LH linear span (d,m,q)
The optimization of a linear code [_n,_k, Δ] that maximizes Δ is thus the same optimization as the outermost level of the LH optimization, the max{ } level in eq. (4.42) that seeks the Cayley graph generator set Sm with the largest bisection b. Other than the difference in labeling conventions, both optimizations seek the d-dimensional subspace (d,m,q) of some vector space which maximizes the minimum non-zero weight wmin of the subspace. The two problems are mathematically one and the same.
Therefore, the vast numbers of good/optimal linear ECC codes computed over the last six decades (such as EC code tables [17] and [22]) are immediately available as good/optimal solutions for the b optimization problem of the LH networks, such as eq. (4.42) for Cayley graph group Gn=Zqd. Similarly any techniques, algorithms and computer programs (e.g. MAGMA ECC module http://magma.maths.usyd.edu.au/magma/handbook/linear_codes_over_finite_fields) used for constructing and combining of good/optimum linear EC codes, such as quadratic residue codes, Goppa, Justesen, BCH, cyclic codes, Reed-Muller codes, . . . [15],[16], via translation Table 4.7, automatically become techniques and algorithms for constructing good/optimum LH networks.
As an illustration of the above translation procedure, a simple parity check EC code [4,3,2]2 with generator matrix [G3,4] is shown in Table 4.8. The codeword has 1 parity bit followed by 3 message bits and is capable of detecting all single bit errors. The translation to the optimum network shown on the right is obtained by rotating the 3×4 generator matrix [G3,4] by 90° counter-clockwise. The obtained block of 4 rows with 3 bits per row is interpreted as 4 generators hs, each 3 bits wide, for the Cay(Z_2^3,C4) graph. The resulting network thus has d=3, n=2^3=8 nodes and m=4 links/node. The actual network is a folded 3-cube shown within an earlier example in Table 4.4. Its bisection is: b=2 and B=b·n/2=8 links.
A slightly larger and denser network using EC code [7,4,3]2 from Table 2.4 (Appendix A) is converted into an optimum solution, a graph Cay(Z_2^4,C7), with d=4, n=16 nodes and m=7 links/node as shown in Table 4.9.
The 4 row, 7 column generator matrix [G4,7] of the linear EC code [7,4,3]2 on the left side was rotated 90° counter-clockwise and the resulting 7 rows of 4 digits are binary values for the 7 generators hs (also shown in hex) of the 16 node Cayley graph. The resulting n=16 node network has relative bisection (in n/2 units) b=Δ=3 and absolute bisection (in # of links) of: B=b·n/2=3·16/2=24 links. Since the network is a non-planar 4-dimensional cube with a total of n·m/2=16·7/2=56 links, it is not drawn.
The above examples are captured by the following simple, direct translation recipe:
EC code [_n,_k,Δ]q → LH Cay(Z_q^d,Sm) (4.45)
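For binary codes, the recipe can be sketched in C as follows. The sketch assumes that each of the _n columns of the generator matrix becomes one generator hs, with row i of the matrix supplying bit i; the rotation described in the text fixes one particular bit order, and a different order yields an equivalent network up to relabeling. The matrices used in the usage note are standard systematic forms chosen for illustration and may differ from Tables 4.8 and 2.4. All function names are hypothetical.

```c
#include <stdint.h>

/* Translate a binary generator matrix G (kdim rows, nlen columns,
   row-major, entries 0 or 1) into nlen hops of kdim bits each:
   column s of G becomes hops[s], with row i supplying bit i. */
void CodeToHops(const unsigned char *G, int kdim, int nlen, uint32_t hops[])
{
    for (int s = 0; s < nlen; s++) {
        hops[s] = 0;
        for (int i = 0; i < kdim; i++)
            if (G[i * nlen + s])
                hops[s] |= 1u << i;
    }
}

static unsigned ParityBit(uint32_t x)
{
    x ^= x >> 16; x ^= x >> 8; x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
    return (unsigned)(x & 1u);
}

/* Relative bisection b (units of n/2 links) of Cay(Z_2^d, S_m):
   the minimum over k = 1..2^d-1 of the sum of parities of (k & h_s),
   which equals the minimum non-zero codeword weight of the code. */
unsigned RelBisection(int d, const uint32_t hops[], int m)
{
    unsigned b = (unsigned)m;
    for (uint32_t k = 1; k < (1u << d); k++) {
        unsigned w = 0;
        for (int s = 0; s < m; s++)
            w += ParityBit(k & hops[s]);
        if (w < b) b = w;
    }
    return b;
}
```

As a usage example, the matrix with rows 1100, 1010, 1001 (one parity bit followed by three message bits) translates into the hops {7, 1, 2, 4}, and the matrix with rows 1000110, 0100101, 0010011, 0001111 translates into {1, 2, 4, 8, B, D, E}hex; RelBisection then returns the minimum distance of the respective code.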
The methods of determining the bisection B can be implemented using a computer program or set of computer programs organized to perform the various steps described herein. The computer can include one or more processors and associated memory, including volatile and non-volatile memory to store the programs and data. For example, a conventional IBM compatible computer running the Windows or Linux operating system or an Apple computer system can be used and the programs can be written, for example, in the C programming language.
The order of elements in a generator set Sm={h1, h2, . . . hm} is clearly a matter of convention and network performance characteristics don't depend on a particular ordering. Similarly, the subspace (d,m,q) of the column vectors can be generated using any linearly independent set of d vectors from (d,m,q) instead of the original subset {Vμ}. All these transformations of a given network yield equivalent networks, differing only in labeling convention but all with the same distribution of cuts (including min-cut and max-cut) and the same network paths distribution (e.g. same average and max paths). This equivalence is used to compute specific generators optimized for some other objective, beyond the cuts and paths. Some of these other objectives are listed in the notes below.
During expansion of the network, it is useful that the next larger network is produced with the minimum change from the previous configuration, e.g. requiring the fewest cables to be reconnected to other switches or ports. The equivalence transforms of N-1 are used to “morph” the two configurations, initial and final, toward each other, using the number of different links in Sm as the cost function being minimized. Techniques and algorithms of “Compressive Sensing” [CS] (see [20]) are particularly useful as the source for the efficient “morphing” algorithms.
It is often useful, especially in physical wiring, discovery and routing, to have a Zqd based network in which (usually first) d hops from Sm are powers of q. This property of generator set Sm corresponds to systematic generator matrix [G—k,
A simple, efficient method for computing a “systematic generator” from a non-systematic one is to select for each column c=0 . . . d−1 a row r(c)=1 . . . m that contains a digit 1 in column c. If row r(c) doesn't contain any other ones, then we have one column with the desired property (the hop hr(c) is a power of 2). If there are any other columns, such as c′, which contain ones in row r(c), the column Vc is XOR-ed into these columns Vc′, clearing the excess ones in r(c). Finally, when there is a single 1 in row r(c) and column c, the hop hr(c) is swapped with hop hc+1 so that the resulting matrix contains generator hc+1=2^c. The process is repeated for the remaining columns c<d.
The number of XOR operations between columns needed to reduce some row r(c) to a single 1 in column c is the Hamming weight of hr(c) minus 1. Therefore, to reduce the number of required XOR-s (columns are m bits long, which can be much larger than the machine word), for each new c to diagonalize, the algorithm picks the row which has the smallest Hamming weight among the rows containing a 1 in column c.
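A minimal sketch of this diagonalization for the binary case is shown below. It stores the generators row-wise as d-bit integers, so XOR-ing column c into every column c′ marked in a mask amounts to XOR-ing that mask into each row that has bit c set. The 0-based indexing (hops[c] ends up equal to 2^c) and the function names are assumptions made for illustration.

```c
#include <stdint.h>

static int BitCount(uint32_t v)
{
    int w = 0;
    for (; v != 0; v >>= 1)
        w += (int)(v & 1u);
    return w;
}

/* Column-reduce the generator list so that hops[c] == 2^c for c = 0..d-1.
   Column XORs keep the span of the bit-columns, hence the distribution
   of cuts, unchanged. Returns 0 on success, -1 if the columns are
   linearly dependent (no row with a 1 in some column remains). */
int Systematize(uint32_t hops[], int m, int d)
{
    for (int c = 0; c < d; c++) {
        uint32_t bit = 1u << c;
        int r = -1;
        /* Rows 0..c-1 already hold 2^0..2^(c-1); search the rest for
           the lightest row with a 1 in column c. */
        for (int s = c; s < m; s++)
            if ((hops[s] & bit) &&
                (r < 0 || BitCount(hops[s]) < BitCount(hops[r])))
                r = s;
        if (r < 0) return -1;

        /* Columns c' in which row r has excess ones. */
        uint32_t extra = hops[r] & ~bit;
        for (int s = 0; s < m; s++)
            if (hops[s] & bit)
                hops[s] ^= extra;   /* XOR column c into each column c' */

        uint32_t t = hops[r];       /* move the pivot row into place c */
        hops[r] = hops[c];
        hops[c] = t;
    }
    return 0;
}
```

For example, the list {3, 5, 6, 7} with d=3 is reduced to a list that starts with the hops 1, 2, 4.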
N-4. Digital or (t,m,s) Nets (or Designs, Orthogonal Arrays)
This research field is closely related to the design of optimal linear codes [_n,_k,Δ]q (cf. [21],[22]). The basic problem in the field of ‘digital nets’ is to find a distribution of points on an s-dimensional hypercubic (fish-) net with a “binary intervals” layout of ‘net eyes’ (or generally analogous b-ary intervals via powers of any base b, not only for b=2) which places the same number of points into each net eye. There is a mapping between (t,_m,s)b digital nets and [_n,_k]q codes via the identities: _n=s, _k=s−_m, q=b. A large database of optimal (t,_m,s) nets, which includes linear code translations, is available via a web site [22]. Therefore, the solutions, algorithms and computer programs for constructing good/optimal (t,_m,s) nets are immediately portable to construction of good/optimal LH networks via this mapping followed by the [_n,_k]q→LH mapping in Table 4.7.
The linear codes with q>2 generate hyper-torus/-mesh type of networks of extent q when the Δ metrics of the code is Lee distance. When Hamming distance is used for q>2 codes, the networks are of generalized hypercube/flattened butterfly type [3]. For q=2, which is the binary code, the two types of distance metrics are one and the same.
Walsh functions readily generalize to other groups, besides cyclic group Z_2^d used here (cf. [23]). A simple generalization to base q>2 for groups Z_q^d, for any integer q, is based on defining function values via the q-th primitive root of unity ω:
Uq,k(x)=ω^(Σ_{μ=0}^{d−1} kμ·xμ), where: ω≡e^(2πi/q) (4.51)
For q=2, eq. (4.51) yields ω=(−1), which reduces Uq,k(x) from eq. (4.51) to the regular Walsh functions Uk(x), eq. (2.5). The q discrete values of Uq,k(x) can also be mapped into integers in the [0,q) interval to obtain integer-valued Walsh functions Wq,k(x) (the analogue of the binary form Wk(x)), which is useful for efficient computer implementation, via a mapping analogous to the binary case, e.g. via the mapping a=ω^b for integer b=0 . . . q−1, where b: integer, a: algebraic value, as in eq. (2.8) where this same mapping (expressed differently) was used for q=2.
The non-binary Walsh functions Uq,k can also be used to define a graph partition into f parts, where f is any divisor of q (including q). For even q, this allows for efficient computation of bisection. The method is a direct generalization of the binary case: the q distinct function values of Uq,k(x) define partition arrays Xk[x]≡Uq,k(x) containing n=q^d elements indexed by x=0 . . . n−1. Each of the q values of Uq,k(x) indicates that a node x belongs to one of the q parts. The partitions Xk for k=1 . . . n−1 are examined and cuts computed using the adjacency matrix [A] for the Cay(Z_q^d,Sm) graph, as in eq. (4.14) for q=2. The generators T(a) and adjacency matrix [A] are computed via the general eqs. (4.1),(4.2), where the ⊕ operator is GF(q) addition (mod q).
The algorithmic speed optimizations via “symmetry optimization” and “Fast Walsh Transform optimization” apply here as well (see [14] pp. 465-468 on fast transforms for multi-valued Walsh functions).
Once the optimum solution for (4.42) is obtained (via ECC, digital nets, or via direct optimization), secondary optimizations, such as seeking the minimum diameter (max distance) or minimum average distance or largest max-cut, can be performed on the solution via local, greedy algorithms. Such algorithms were used in construction of our solutions database, where each set of parameters (d, m, q) has alternate solutions optimized for some other criteria (usually diameter, then average distance).
The basic algorithm attempts replacement of typically 1 or 2 generators hs∈Sm, and for each new configuration it evaluates (incrementally) the target utility function, such as diameter, average distance or max-cut (or some hierarchy of these, used for tie-breaking rules). The number of simultaneous replacements depends on n, m and available computing resources. Namely, there are ~n^r possible combinations for r simultaneous deletions and insertions (assuming the “best” deletion is followed by the “best” insertion). The utility function also uses indirect measures (analogous to sub-goals) as a tie-breaking selection criterion, e.g. when minimizing diameter, it was found that an effective indirect measure is the number of nodes #F in the farthest (from node 0) group of nodes. The indirect objective in this case would be to minimize the #F of such nodes, whenever the examined change (swap of 1 or 2 generators) leaves the diameter unchanged.
In addition to incremental updates to the networks after each evaluated generators replacement, these algorithms rely on vertex symmetry of Cayley graphs to further reduce computations. E.g. all distance tables are only maintained and updated for n−1 distances from node 0 (“root”), since the table is the same for all nodes (with mere permutation of indices, obtainable via T(a) representation of Gn if needed).
Depending on network application, the bisection b can be maintained fixed for all replacements (e.g. if bisection is the highest valued objective), or one can allow b to drop by some value, if the secondary gains are sufficiently valuable.
After generating and evaluating all replacements to a given depth (e.g. replacement of 1 or 2 generators), the “best” one is picked (according to the utility/cost function) and replacement is performed. Then the outer iteration loop would continue, examining another set of replacements seeking the best one, etc. until no more improvements to the utility/cost function can be obtained in the last iteration.
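A heavily simplified sketch of such a local search is shown below, for the binary case and single-generator replacements only. It uses the diameter as the sole utility function, recomputes it from scratch by a breadth-first search from node 0 (rather than incrementally), and keeps the bisection b from dropping below its starting value. The names, the MAXN limit and these simplifications are assumptions for illustration and do not reproduce the algorithm used to build the solutions database.

```c
#include <stdint.h>

#define MAXN 1024   /* illustrative limit on the number of nodes */

static unsigned Par(uint32_t x)
{
    x ^= x >> 16; x ^= x >> 8; x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
    return (unsigned)(x & 1u);
}

/* Relative bisection b: minimum over k of the sum of parities (k & h_s). */
unsigned RelBisection(unsigned n, const uint32_t hops[], unsigned m)
{
    unsigned b = m;
    for (uint32_t k = 1; k < n; k++) {
        unsigned w = 0;
        for (unsigned s = 0; s < m; s++) w += Par(k & hops[s]);
        if (w < b) b = w;
    }
    return b;
}

/* Diameter via BFS from node 0 (sufficient by vertex symmetry).
   Returns n if the graph is disconnected or n exceeds MAXN. */
unsigned Diameter(unsigned n, const uint32_t hops[], unsigned m)
{
    static unsigned dist[MAXN], queue[MAXN];
    unsigned head = 0, tail = 0, far = 0;
    if (n > MAXN) return n;
    for (unsigned i = 0; i < n; i++) dist[i] = n;   /* n marks "unvisited" */
    dist[0] = 0;
    queue[tail++] = 0;
    while (head < tail) {
        unsigned x = queue[head++];
        for (unsigned s = 0; s < m; s++) {
            unsigned y = x ^ hops[s];
            if (dist[y] == n) {
                dist[y] = dist[x] + 1;
                far = dist[y];
                queue[tail++] = y;
            }
        }
    }
    return (tail == n) ? far : n;
}

static int Contains(const uint32_t hops[], unsigned m, uint32_t h)
{
    for (unsigned s = 0; s < m; s++)
        if (hops[s] == h) return 1;
    return 0;
}

/* Repeatedly apply the best single-generator replacement that lowers the
   diameter without lowering b; stop when no replacement improves it.
   Modifies hops[] in place and returns the final diameter. */
unsigned GreedyImprove(unsigned n, uint32_t hops[], unsigned m)
{
    unsigned b0 = RelBisection(n, hops, m);
    unsigned best = Diameter(n, hops, m);
    for (;;) {
        unsigned bestS = m, bestD = best;
        uint32_t bestH = 0;
        for (unsigned s = 0; s < m; s++) {
            uint32_t old = hops[s];
            for (uint32_t h = 1; h < n; h++) {
                if (Contains(hops, m, h)) continue;
                hops[s] = h;                        /* trial replacement */
                if (RelBisection(n, hops, m) >= b0) {
                    unsigned dia = Diameter(n, hops, m);
                    if (dia < bestD) { bestD = dia; bestS = s; bestH = h; }
                }
                hops[s] = old;                      /* undo the trial */
            }
        }
        if (bestS == m) break;                      /* no improvement left */
        hops[bestS] = bestH;
        best = bestD;
    }
    return best;
}
```

Tie-breaking measures such as #F, multi-generator swaps and incremental distance-table updates described above are omitted from this sketch.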
This section describes several optimum LH solutions with particularly useful parameters or simple construction patterns.
This is a special case of LH networks with high topological link density, suitable for combining a smaller number of high radix switches into a single large radix modular switch. This is a specialized domain of network parameters where the 2-layer Fat Tree (FT-2) networks are currently used since they achieve the yield of E=R/3 external ports/switch, which is the maximum mathematically possible for the worst case traffic patterns. The ‘high density’ LH networks (LH-HD) match the FT-2 in this optimum E=R/3 external ports/switch yield for the worst case traffic patterns, while achieving substantially lower average latency and lower cost per Gb/s of throughput on random or ‘benign’ (non-worst case) traffic.
In our preferred embodiment using the Cay(Z_2^d,Sm) graph, where the network size is n=2^d switches and the number of links per node m is one of the numbers: n/2, n/2+n/4, n/2+n/4+n/8, . . . , n/2+n/4+n/8+ . . . +1, the optimum m generators for LH-HD are constructed as follows:
(ii) Optionally diagonalize and sort Sm via procedure (N-3) (Of course, there are a large number of equivalent configurations obtained via equivalence transforms N-1.)
The resulting bisection is: b=└(m+1)/2┘ or B=b·n/2, diameter is 2 and average hops is 2−m/n. The largest LH-HD m=n/2+n/4+n/8+ . . . +1=n−1 has b=n/2 and corresponds to a fully meshed network.
Table 4.10 shows an example of LH-HD generators for n=2^6=64 nodes and m=n/2=32 hops/node, with the hops shown in hex and binary (binary 0s are shown as the ‘-’ character). Table 4.10(a) shows the non-diagonalized hops after the step (i), and Table 4.10(b) shows the equivalent network with m=32 hops after diagonalization in step (ii) and sorting. Other possible LH-HD m values for the same n=64 node network are m=32+16=48, m=48+8=56, m=56+4=60, m=60+2=62 and m=62+1=63 hops.
Additional modified LH-HD networks are obtained from any of the above LH-HD networks via removal of any one or two generators, which yields networks LH-HD1 with m1=m−1 and LH-HD2 with m2=m−2 generators. Their respective bisections are b1=b−1 and b2=b−2. These two modified networks may be useful when an additional one or two server ports are needed on each switch compared to the unmodified LH-HD network.
These three types of high density LH networks are useful for building modular switches, networks on a chip in multi-core or multi-processor systems, flash memory/storage network designs, or generally any of the applications requiring very high bisection from a small number of high radix components and where FT-2 (two level Fat Tree) is presently used. In all such cases, LH-HD will achieve the same bisections at a lower latency and lower cost for Gb/s of throughput.
S-2. Low Density LH Networks with b=3
This subset of LH networks is characterized by comparatively low link density and low bisection b=3 i.e. B=3n/2 links. They are constructed as a direct augmentation of regular hypercubic networks which have bisection b=1. The method is illustrated in Table 4.11 using augmentation of the 4-cube.
The d=4 hops h1, h2, h3 and h4 for the regular 4-cube are enclosed in a 4×4 box on the top. The augmentation consists of 3 additional hops h5, h6 and h7 added in the form of 4 columns C1, C2, C3 and C4, where each column Cμ (μ=1 . . . d) has length of L=3 bits. The resulting network has n=16 nodes with 7 links per node and it is identical to an earlier example in Table 4.9 with b=3 obtained there via translation from a [7,4,3]2 EC code into the LH network. General direct construction of the b=3 LH network from a d-cube is done by appending d columns Cμ (μ=1 . . . d) of length L bits, such that each bit column has at least 2 ones and L is the smallest integer satisfying inequality:
2^L−L−1≧d (4.60)
The condition in eq. (4.60) expresses the requirement that the d columns Cμ must have at least 2 ones. Namely, there is a total of 2^L distinct bit patterns of length L. Among all 2^L possible L-bit patterns, 1 pattern has 0 ones (00 . . . 0) and L patterns have a single one. By removing these two types, with 0 or a single one, there are 2^L−(L+1) remaining L-bit patterns with two or more ones, which is the left hand side of eq. (4.60). Any subset of d distinct patterns out of these 2^L−(L+1) remaining patterns can be chosen for the above augmentation. The Table 4.12 shows values L (number of added hops to a d-cube) satisfying eq. (4.60) for dimensions d of practical interest.
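The construction can be sketched in C as follows. The sketch picks the first d patterns with two or more ones in increasing numeric order, which is just one of the admissible choices and need not match the patterns shown in Tables 4.11 and 4.12; the function names are hypothetical.

```c
#include <stdint.h>

/* Smallest L with 2^L - L - 1 >= d, as required by eq. (4.60). */
unsigned SmallestL(unsigned d)
{
    unsigned L = 2;
    while ((1u << L) - L - 1 < d)
        L++;
    return L;
}

static unsigned Ones(uint32_t v)
{
    unsigned w = 0;
    for (; v != 0; v >>= 1) w += (unsigned)(v & 1u);
    return w;
}

/* Build the d-cube hops 2^0..2^(d-1) followed by L augmenting hops.
   Column C_mu is the mu-th L-bit pattern with at least two ones; bit j
   of C_mu becomes bit mu of the added hop hops[d + j].
   hops[] must have room for d + L entries. Returns m = d + L. */
unsigned AugmentCube(unsigned d, uint32_t hops[])
{
    unsigned L = SmallestL(d), mu = 0;
    for (unsigned i = 0; i < d; i++) hops[i] = 1u << i;
    for (unsigned j = 0; j < L; j++) hops[d + j] = 0;
    for (uint32_t pat = 3; mu < d; pat++) {
        if (Ones(pat) < 2) continue;          /* skip single-one patterns */
        for (unsigned j = 0; j < L; j++)
            if ((pat >> j) & 1u)
                hops[d + j] |= 1u << mu;
        mu++;
    }
    return d + L;
}

/* Relative bisection b, used here only to check the result. */
unsigned RelBisection(unsigned n, const uint32_t hops[], unsigned m)
{
    unsigned b = m;
    for (uint32_t k = 1; k < n; k++) {
        unsigned w = 0;
        for (unsigned s = 0; s < m; s++) {
            uint32_t x = k & hops[s];
            x ^= x >> 16; x ^= x >> 8; x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
            w += (unsigned)(x & 1u);
        }
        if (w < b) b = w;
    }
    return b;
}
```

For d=4 this choice of patterns gives L=3 and the added hops B, D, E (hex), i.e. the same 7-generator set as obtained from the [7,4,3]2 code example with a standard systematic generator matrix.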
S-3. Augmentation of LH Networks with b=Odd Integer
This is a very simple, yet optimal, augmentation of an LH network which has m links per node and bisection b=odd integer into LH network with bisection b1=b+1 and m1=m+1 links per node. The method is illustrated in Table 4.14 using the augmented 4-cube (d=4, n=16 nodes) with m=7 links per node and bisection b=3, which was used in earlier examples in Tables 4.9 and 4.12.
A single augmenting link h8=h1^h2^ . . . ^h7 (bitwise XOR of the list) is added to the network, which increases bisection from b=3 to b=4, i.e. it increases the absolute bisection B by n/2=16/2=8 links. The general method for Cay(Z_2^d,Sm) with b=‘odd integer’ consists of adding the link hm+1=h1^h2^ . . . ^hm (the bitwise XOR of the previous m hops) to the generator set Sm. The resulting LH network Cay(Z_2^d,Sm+1) has bisection b1=b+1.
The only case which requires additional computation, beyond merely XOR-ing the hop list, is the case in which the resulting hop hm+1 happens to come out as 0 (which is an invalid hop value, a self-link of node 0 to itself). In such case, it is always possible to perform a single hop substitution in the original list Sm which will produce the new list with the same b value but a non-zero value for the list XOR result hm+1.
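The XOR step of this augmentation is small enough to sketch directly. The function name `augment_odd_bisection` is hypothetical, and the sketch only reports the zero-XOR case rather than performing the hop substitution, since the substitution procedure is not detailed here. The plain 4-cube hop list is used as input purely for illustration.

```python
from functools import reduce
from operator import xor


def augment_odd_bisection(hops: list[int]) -> list[int]:
    """Append the bitwise XOR of all hops as one additional hop.

    Raises ValueError when the XOR is 0 (a self-link); the hop
    substitution needed in that case is not implemented in this sketch.
    """
    extra = reduce(xor, hops, 0)
    if extra == 0:
        raise ValueError("XOR of the hop list is 0; a hop substitution is needed first")
    return list(hops) + [extra]


if __name__ == "__main__":
    # Plain 4-cube hops 1, 2, 4, 8: the added hop is the long diagonal 0xF.
    print([format(h, "X") for h in augment_odd_bisection([1, 2, 4, 8])])
```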
In practice one would often need to construct a network satisfying requirements expressed in terms of some target number of external ports P having oversubscription φ, obtained using switches of radix R. The resulting construction would compute the number n of radix-R switches needed, as well as the list for detailed wiring between switches. For concreteness, each radix-R switch will be assumed to have R ports labeled as port #1, #2, . . . #R. Each switch will be connected to other switches using ports #1, #2, . . . #m (these are topological ports or links) and leave E=R−m ports: #m+1, #m+2, . . . #R as “external ports” per switch available to the network users for servers, routers, storage, etc. Hence, the requirement of having a total of P external ports is expressed in terms of E and the number of switches n as:
E=P/n (4.70)
The oversubscription of eq. (3.1) is then expressed, via the definition of bisection b in eq. (4.42), as:
φ=E/b=(R−m)/b (4.71)
The illustrative construction below will use non-oversubscribed networks, φ=1, simplifying eq. (4.71):
E=b=R−m (4.72)
i.e. for non-oversubscribed networks, the number of external ports/switch E must be equal to the relative bisection b (this is the bisection in units of n/2), or equivalently, the number of links/switch is m=R−b.
In order to find appropriate n=2^d and m parameters, the LH solutions database, obtained by translating the optimum EC code tables [17] and [22] via recipe (4.45), groups solutions by network dimension d into record sets D_d, where d=3, 4, . . . 24. These dimensions cover the range of network sizes n=2^d that are of practical interest, from n=2^3=8 to n=2^24≈16 million switches. Each record set D_d contains solution records for m=d, d+1, . . . m_max links/switch, where the present database has m_max=256 links/switch. Each solution record contains, among others, the value m, bisection b and the hop list h1, h2, . . . hm.
For given P, R and φ, the LH constructor scans the record sets D_d for d=3, 4, . . . and in each set inspects the records for m=d, d+1, . . . , computing for each (d,m) record the values E(d,m)=R−m ports/switch, total ports P(d,m)=n·E(d,m)=2^d·(R−m) and oversubscription φ(d,m)=E(d,m)/b (the value b is in each (d,m) record). The relative errors δP=|P(d,m)−P|/P and δφ=|φ(d,m)−φ|/φ are computed and the best match (the record (d,m) with the lowest combined error) is selected as the solution to use. If the requirement is “at least P ports” then the constraint P(d,m)−P≥0 is imposed for the admissible comparisons. The requirements can also prioritize δP and δφ via weights for each (e.g. 0.7·δP+0.3·δφ for the total error). After finding the best matching (d,m) record, the hop list h1, h2, . . . hm is retrieved from the record and the set of links L(ν) is computed for each node ν, where ν=0, 1, . . . n−1, as: L(ν)={ν^h_s for s=1 . . . m}. Given n such sets of links, L(0), L(1), . . . , L(n−1), the complete wiring for the network is specified. The examples below illustrate the described construction procedure.
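The record scan described above can be sketched as follows. This is not the actual LH constructor: the `Record` type, the `best_record` function and the equal default weights are assumptions made for illustration, and the tiny record list holds only the three (d, m, b) triples quoted in the examples of this section rather than the full database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    d: int  # network dimension, n = 2**d switches
    m: int  # topological links per switch
    b: int  # relative bisection


# Stand-in for the LH solutions database (three records from the examples).
RECORDS = [Record(5, 9, 3), Record(8, 18, 6), Record(16, 38, 10)]


def best_record(P, R, phi, records=RECORDS, w_ports=0.5, w_phi=0.5, at_least=False):
    """Return (record, error) minimising the weighted relative error."""
    best, best_err = None, float("inf")
    for rec in records:
        E = R - rec.m                      # external ports per switch
        if E <= 0:
            continue                       # radix too small for this record
        ports = (1 << rec.d) * E           # P(d,m) = 2**d * (R - m)
        if at_least and ports < P:
            continue                       # "at least P ports" constraint
        err = (w_ports * abs(ports - P) / P
               + w_phi * abs(E / rec.b - phi) / phi)
        if err < best_err:
            best, best_err = rec, err
    return best, best_err


if __name__ == "__main__":
    print(best_record(P=96, R=12, phi=1.0))
    print(best_record(P=1536, R=24, phi=1.0))
```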
The LH database search finds the exact match (δP=0, δφ=0) for the record d=5, m=9, hence requiring n=2^d=2^5=32 switches of radix R=12. The bisection is b=3 and the hop list (in hex base) for the record is: S9={1, 2, 4, 8, 10, E, F, 14, 19}hex. The number of external ports per switch is E=b=3, which combined with m=9 topological ports/switch results in radix R=3+9=12 total ports/switch as specified. The total number of external ports is P=E·n=3·32=96 as required. Diameter (max hops) for the network is D=3 hops, and the average hops (latency) is Avg=1.6875 hops. Table 4.15 shows the complete connection map for the network of 32 switches, stacked in a 32-row rack one below the other, labeled in the leftmost column “Sw” as 0, 1, . . . 1F (in hex). Switch 5 is outlined with connections shown for its ports #1, #2, . . . #9 to switches (in hex) 04, 07, 01, 0D, 15, 0B, 0A, 11 and 1C. These 9 numbers are computed by XOR-ing 5 with the 9 generators (row 0): 01, 02, 04, 08, 10, 0E, 0F, 14, 19. The free ports are #10, #11 and #12.
To illustrate the interpretation of the links via numbers, the outlined switch “5:” indicates on its port #2 a connection to switch 7 (the encircled number 07 in the row 5:). In the row 7:, labeled as switch “7:”, there is an encircled number 05 at its port #2 (column #2), which refers back to this same connection between the switch 5 and the switch 7 via port #2 on each switch. The same pattern can be observed between any pair of connected switches and ports.
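The XOR rule behind the connection map can be checked with a short Python sketch. The generator list is the S9 quoted above; the function names `links` and `hop_stats` are hypothetical, and the average is taken over all n nodes including the origin, an assumed convention that reproduces the quoted figure.

```python
from collections import deque

# Generators of the d=5, m=9 record, as quoted in the text (hex).
S9 = [0x01, 0x02, 0x04, 0x08, 0x10, 0x0E, 0x0F, 0x14, 0x19]


def links(node: int, hops: list[int]) -> list[int]:
    """Targets on ports #1..#m of a switch: node XOR each generator."""
    return [node ^ h for h in hops]


def hop_stats(d: int, hops: list[int]) -> tuple[int, float]:
    """Breadth-first search from node 0; returns (diameter, average hops).

    The network is a Cayley graph, so distances from node 0 are
    representative of every node.
    """
    n = 1 << d
    dist = {0: 0}
    queue = deque([0])
    while queue:
        v = queue.popleft()
        for w in links(v, hops):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return max(dist.values()), sum(dist.values()) / n


if __name__ == "__main__":
    print([format(t, "02X") for t in links(0x05, S9)])   # row of switch 5
    print(links(0x07, S9)[1] == 0x05)                    # switch 7, port #2
    print(hop_stats(5, S9))
```

The same two functions apply unchanged to any other hop list, since only the generators and the dimension enter the computation.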
The LH solutions database search finds an exact match for d=8, n=256 switches of radix R=24 and m=18 topological ports/switch. Diameter (max hops) of the network is D=3 hops, and average latency is Avg=2.2851562 hops. The bisection is b=6, thus providing E=6 free ports per switch at φ=1. The total number of ports provided is E·n=6·256=1536 as required. The set of 18 generators is: S18={01, 02, 04, 08, 10, 20, 40, 80, 1A, 2D, 47, 78, 7E, 8E, 9D, B2, D1, F13}hex. Note that the first 8 links are regular 8-cube links (powers of 2), while the remaining 10 are LH augmentation links. These generators specify the target switches (as index 00 . . . FFhex) connected to switch 00 via ports #1, #2, . . . #18 (switches on both ends of a link use the same port number for mutual connections). To compute the 18 links (to 18 target switches) for some other switch x≠00, one would simply XOR the number x with the 18 generators. Table 4.16 shows the connection table only for the first 16 switches of the resulting network, illustrating this computation of the links. For example, switch 1 (row ‘1:’) has on its port #4 target switch 09, which is computed as 1^8=9, where 8 was the generator in row ‘0:’ for port #4. Checking then switch 9 (in row ‘9:’), on its port #4 is switch 01 (since 9^8=1), i.e. switches 1 and 9 are connected via port #4 on each. The table also shows that each switch has 6 ports #19, #20, . . . #24 free.
The database lookup finds the exact match using d=16, n=2^16=65,536=64K switches of radix R=48. Each switch uses m=38 ports for connections with other switches, leaving E=48−38=10 ports/switch free and yielding a total of P=E·n=10·64K=640K available ports as required. Bisection is b=10, resulting in φ=E/b=1. The list of m=38 generators S38={h1, h2, . . . h38} is shown in Table 4.17 in hex and binary base. The 38 links for some switch x (where x: 0 . . . FFFF) are computed as S38(x)≡{x^h1, x^h2, . . . x^h38}. Diameter (max hops) of the network is D=5 hops, and the average latency is Avg=4.061691 hops.
The LH solutions database was used to compare LH networks against several leading alternatives from industry and research across broader spectrum of parameters. The resulting spreadsheet charts are shown in
In
For example, the Ports/Switch chart in
It is desirable to maximize λ since λ quantifies the external port yield of each switch. Namely if each switch's port count (radix) is R, then R=E+T (where E is the number of external ports and T number of topological ports) and the E-port yield per IPA port is: Yield≡E/R=λ/(λ+1), i.e. increasing λ increases the Yield. But increasing λ for a given N also lowers the bisection for that N, hence in practical applications, data center administrators need to select a balance of Yield vs. bisection and N suitable for the usage patterns in the data center. The centralized control and management software provides modeling tools for such evaluations.
Denoting the number of external ports and topology ports per switch as E and T, the radix (number of ports) R of a switch is R=E+T. The topology ports in turn consist of the d ports needed to connect a d-dimensional hypercube HCd and of h long hop ports used for trunking, so T=d+h. If the number of switches is N, then N=2^d or d=log(N), where log(x) is the logarithm base 2 of x, i.e. log(x)=ln(x)/ln(2)≈1.443·ln(x). In order to relate formally the invention's long hops to terminology used with conventional trunking (where each of the d HCd cables is replaced with q cables, a trunk quantum), define q≡T/d, i.e. T=q·d. Hence q and h are related as: q=1+h/d and h=d·(q−1). Using the ratio λ≡E/T, E and T are expressed as T=R/(1+λ) and E=λ·R/(1+λ). Restating the bisection formula:
B≡B(N)=(N/2)·q·C=(N/2)·(1+h/d)·C (5)
where C is a single IPA switch port capacity (2·<Port Bit Rate> for duplex ports). Bisection B is the smallest total capacity of links connecting two halves of the network (i.e. it is the minimum over all possible network cuts into halves). Consider two network halves with N/2 switches each and E external ports per switch, so that there are E·N/2 external ports in each half. If these two sets of external ports were to transmit to each other at full port capacity C, the total capacity needed to support it is E·(N/2)·C.
Since bisection limits the worst case capacity between halves to B, the oversubscription φ is defined as the ratio between the capacity needed, E·(N/2)·C, and the capacity available for the job via B:
φ≡E·(N/2)·C/B=E/q=λ·d=λ·log(N) (6)
Eq. (6) shows in what ratio λ=E/T the ports must be divided in order to obtain oversubscription φ using N switches: λ=φ/log(N). The quantity most often of interest is the total number of external ports provided by the network, P=N·E, which in terms of the other quantities typically given as constraints (φ, N and radix R), and recalling that E=λ·R/(1+λ), is then:
P=N·E=N·R·φ/(φ+log(N)) (7)
Although Eq. (7) doesn't yield a closed form expression for N, it does allow computation of the number of IPA switches N needed to get some target number of total network ports P at IB-oversubscription φ, knowing the radix R of the switches being used. Qualitatively, the number of total network ports P increases slightly slower than linearly in N (when φ is kept fixed) due to the denominator D≡(φ+log(N)), which also increases with N. Its effects diminish as N increases (or if φ is large or grows with N), since doubling N increments D by +1 (which is only by ~5% for N=64K and φ=4). Within the log(log(P)) error margin, the N above grows as N~P·log(P), which is an unavoidable mathematical limit on the performance of larger switches combined from N smaller switches at fixed φ.
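Because N appears both linearly and inside a logarithm, the port count is easiest to handle numerically. The sketch below substitutes λ=φ/log(N) into E=λ·R/(1+λ), giving P=N·R·φ/(φ+log(N)), and then searches over powers of two. The function names `total_ports` and `switches_needed` and the `max_dim` cap are hypothetical, and the radix R=20 in the example is an arbitrary illustrative value.

```python
import math


def total_ports(N: int, R: int, phi: float) -> float:
    """External ports P = N*R*phi / (phi + log2(N)) for N = 2**d switches."""
    return N * R * phi / (phi + math.log2(N))


def switches_needed(P: float, R: int, phi: float, max_dim: int = 32):
    """Smallest N = 2**d providing at least P external ports, or None."""
    for d in range(1, max_dim + 1):
        N = 1 << d
        if total_ports(N, R, phi) >= P:
            return N
    return None


if __name__ == "__main__":
    N, R, phi = 2**16, 20, 4
    print(total_ports(N, R, phi))
    # Doubling N raises the denominator from 20 to 21, i.e. by 5%.
    print((phi + math.log2(2 * N)) / (phi + math.log2(N)))
    print(switches_needed(262144, R, phi))
```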
i.e. we get a fixed cost and power per port as N grows. In this case the tradeoff is that it is φ which now grows as λ·log(N) as N grows. Recalling that typical aggregate oversubscriptions on core switches and routers are ˜200+ in the current data centers, log(N) is quite moderate in comparison. The network bandwidth properties for λ=1 are shown in
By using mathematically convenient topologies such as an enhanced hypercube connection pattern or its hierarchical variants, the switch forwarding port can be computed on the fly via simple hardware performing a few bitwise logical operations on the destination address field, without any expensive and slow forwarding Content Addressable Memory (CAM) tables being required. Hence, for customized switches, price and power use advantages can be gained by removing CAM hardware entirely.
Although the most favorable embodiments of the invention can eliminate CAMs completely, a much smaller (by at least 3 orders of magnitude smaller) CAM hardware can still be useful to maintain forwarding exceptions arising from faults or congestion. Since the enhanced hypercubic topology allows for forwarding via simple, small logic circuits (in the ideal, exception free case), the only complication arises when some port P is faulty due to a fault at the port or failure/congestion at the nearest neighbor switch connected to it. Since number of such exceptions is limited by the radix R of the switch, the necessary exception table needs a space for at most R small entries (typical R=24 . . . 128, entry size 5-7 bits). A match of a computed output port with an entry in the reduced CAM overrides the routine forwarding decision based on the Jump Vector computed by the logic circuit. Such a tiny table can be implemented in the substantially reduced residual CAMs, or even within the address decoding logic used in forwarding port computation. This exception table can also be used to override the routine forwarding decisions for local and global traffic management and load balancing.
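A minimal sketch of table-free forwarding with a small exception set is given below. It assumes a plain d-cube without long hops, takes the jump vector to be the XOR of the current and destination switch addresses, and picks the lowest differing dimension; these choices, along with the names `forwarding_port` and `route`, are illustrative assumptions and not the circuit described here. The exception table is modeled simply as a set of port numbers to avoid.

```python
def forwarding_port(current: int, destination: int, exceptions=frozenset()):
    """Output port for a plain d-cube, or None if no usable hop exists.

    Port p corresponds to hypercube dimension p. Ports listed in
    `exceptions` (e.g. faulty or congested) override the routine choice.
    """
    jump = current ^ destination      # jump vector: address bits still to fix
    port = 0
    while jump:
        if jump & 1 and port not in exceptions:
            return port
        jump >>= 1
        port += 1
    return None                       # already at destination, or all blocked


def route(source: int, destination: int) -> list[int]:
    """Sequence of switches visited in the exception-free case."""
    path = [source]
    while path[-1] != destination:
        port = forwarding_port(path[-1], destination)
        path.append(path[-1] ^ (1 << port))
    return path


if __name__ == "__main__":
    print(forwarding_port(0b0101, 0b0110))        # routine choice
    print(forwarding_port(0b0101, 0b0110, {0}))   # port 0 marked as exception
    print(route(0, 7))
```

The exception set never needs more entries than the switch has ports, which mirrors the bound of at most R small entries stated above.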
In order to increase the pipe capacity along overloaded paths, while under the tree topology constraints, the conventional data center solution is trunking (or link aggregation in the IEEE 802.1AX standard, or Cisco's commercial EtherChannel product), which amounts to cloning the link between two switches, resulting in multiple parallel links between the two switches using additional pairs of ports. The invention shows a better version of trunking for increasing the bisection with a fixed number of switches.
With the invention, this problem arises when number of switches in a network is fixed for some reason so bisection cannot be increased by increasing N. Generally, this restriction arises when the building block switches are a smaller number of high radix switches (such as the Arista 7500) rather than the larger number of low radix switches that allow the desirable high bisection bandwidth as provided by the invention. Data centers making use of the invention can use conventional trunking by building hypercubes using multiple parallel cables per hypercube dimension. While that will increase the bisection as it does for regular tree based data center networks, there are better approaches that can be used.
The procedure is basically the opposite of the approach used for traditional trunking. When adding a link from some switch A, instead of picking the target switch B from those closest to A, B is picked such that it is the farthest switch from A. Since the invention's topologies maintain uniform bisection across the network, any target switch will be equally good from the bisection perspective, which is not true for conventional trees or fat trees. By taking advantage of this uniformity, picking the farthest switch B also maximally reduces the longest and the average hop counts across the network. For example, with a hypercube topology, the farthest switch from any switch A is the switch B which is on the long diagonal from A. Adding that one link to A cuts its longest path by half, and reduces the average path by at least 1 hop. When the long hops are added uniformly to all switches (hence N/2 wires are added per new long hop), the resulting topology is called an enhanced hypercube.
The table was obtained by a simple ‘brute force’ counting and updates of distance tables as the new long hops were added. At each stage, the farthest node from the origin is used as a new link (a variety of tiebreaking rules were explored to provide a pick when multiple ‘farthest’ nodes are equally far, which is the common occurrence). After each link is added the distance table is updated. For Dim=4, N=16, adding long hops beyond 11 doesn't have an effect since the small network becomes fully meshed (when total number of links is N−1), hence all distances become 1 hop.
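The brute-force procedure can be sketched as a loop that recomputes the distance table and adds the farthest node as a new hop. The tie-breaking rule here (smallest node label) is only one arbitrary choice among the variety mentioned above, and the function names `distances` and `greedy_long_hops` are hypothetical.

```python
from collections import deque


def distances(d: int, hops: list[int]) -> list[int]:
    """Hop counts from node 0 to every node, by breadth-first search."""
    dist = [-1] * (1 << d)
    dist[0] = 0
    queue = deque([0])
    while queue:
        v = queue.popleft()
        for h in hops:
            w = v ^ h
            if dist[w] < 0:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def greedy_long_hops(d: int, count: int) -> list[int]:
    """Add up to `count` long hops to a d-cube, farthest node first.

    Assumed tie-break: the smallest label among the farthest nodes.
    Stops early once the network is fully meshed.
    """
    hops = [1 << i for i in range(d)]     # regular d-cube links
    added = []
    for _ in range(count):
        dist = distances(d, hops)
        far = max(dist)
        if far <= 1:
            break                          # every node is already 1 hop away
        target = min(v for v, x in enumerate(dist) if x == far)
        hops.append(target)
        added.append(target)
    return added


if __name__ == "__main__":
    print(greedy_long_hops(4, 1))          # first long hop of the 4-cube
    print(len(greedy_long_hops(4, 20)))    # hops added before full mesh
```

For Dim=4 the first hop added is the long diagonal, and the loop stops after 11 additions, in line with the fully meshed limit noted above.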
In some embodiments of the invention, where systems are implemented via a set of switches in a data center (e.g. available as line cards in a rack), wiring such dense networks can easily become very complex, error prone and inefficient. With (d!)^N topologically equally correct mappings between ports and dimensions for a d-dimensional hypercube using N=2^d switches with d ports per switch, there are many ways to create an unmanageable, error prone, wasteful tangle. The invention optimizes the mapping between the ports and HC/FB dimensions using the following rules:
The resulting wiring pattern shown in
Provided the cables and corresponding port connectors in the same column are color coded using matching colors (properties (b) and (c) make such coding possible), and the cables are of the minimum length necessary in each vertical column, this port-dimension mapping makes the wiring of a rack of switches easy to learn, easy to connect and virtually error proof (any errors can be spotted at a glance). The total length of cables is also the minimum possible (requiring no slack) and it has the fewest number of distinct cable lengths allowed by the topology. In addition to economizing the quantity and complexity of the wiring, the shortening and uniformity of cables reduces the power needed to drive the signals between the ports, a factor identified as having commercial relevance in industry research.
In
The 6 numbers inside a row #k show the 6 switches connected to the 6 ports of switch #k. E.g. row #7 shows that switch #7 is connected to switches #6, 5, 3, 15, 23, 39 on its ports 0, 1, 2, . . . 5. Picking, say, port (column) #4 for switch (row) #7, it connects on port 4 to switch #23. Looking down to switch (row) #23, its port (column) #4 connects back to switch #7, i.e. switch 7 and switch 23 are connected to each other's port #4. This simple rule, that two switches always connect to each other on the same port #, holds generally for hypercubes. This leads to the proposed port and cable color coding scheme. E.g. green:4 cables connect green ports #4 on some pair of switches, red:0 cables connect red ports #0 on some other pair of switches, blue:1 cables connect blue ports #1, etc.
The wiring pattern is just as simple. All wires of the same color have the same length L=2^port#, e.g. an orange:2 wire (always connecting ports #2, the orange:2 ports) has length 2^2=4, green:4 has 2^4=16, red:0 has 2^0=1, etc. Hence switch pairs connected to each other on their port #2 are 4 rows apart, e.g. switch (row) 0 connects on its port #2 to switch 4 on its port #2 and they use an orange:2 wire (the color of port #2). This connection is shown as the top orange:2 arc connecting numbers 4 and 0. The next orange:2 (port #2) wire starts at the next unconnected row, which is row #1 (switch #1), and connects to row 1+4=5 (switch #5), and so on until the first row already connected on port #2 is reached, which is row #4 (steps 1-4). At that point the 8 top rows are connected on port #2. Then proceed down to the next row with a free port #2, which is row 8. That port #2 is now connected to the port #2 down 4 rows, i.e. in row 8+4=12, which is shown with the orange:2 wire linking numbers 12 and 8. The next two rows follow (orange:2 arc connecting numbers 13 and 9), etc., until column (port) #2 is connected on all switches. Port #3 (purple:3) follows, using purple:3 wires 2^3=8 slots long and repeating the same procedure with longer wires, and so on.
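The column-by-column wiring order described above can be generated mechanically. In this sketch, which assumes the plain d-cube port mapping in which port p spans 2^p rows, the function names `wiring_plan` and `neighbors` are hypothetical; each tuple lists the port number and the two rows joined by one cable.

```python
def wiring_plan(d: int) -> list[tuple[int, int, int]]:
    """Cables of a d-cube rack as (port, upper_row, lower_row) tuples.

    Cables are listed port by port, top row first. A cable on port p
    joins rows that are 2**p apart, both ends using port p.
    """
    plan = []
    for port in range(d):
        length = 1 << port                 # cable length in rows
        for row in range(1 << d):
            if row & length == 0:          # row is the upper end of a cable
                plan.append((port, row, row + length))
    return plan


def neighbors(row: int, d: int) -> list[int]:
    """Switches reached from `row` on ports 0..d-1."""
    return [row ^ (1 << port) for port in range(d)]


if __name__ == "__main__":
    plan = wiring_plan(6)
    print(len(plan))                                  # cables in H(64)
    print([c for c in plan if c[0] == 2][:6])         # first orange:2 cables
    print(neighbors(7, 6))                            # row #7 of the table
```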
While the above wiring of a 64-switch hypercube H6≡H(64) is not difficult since errors are unlikely because starting at the top row and going down, any new wire can go into just one port of the matching color, the pattern above suggests a simple way to design easily connectable internally prewired containers, which eliminate much of the tedium and expense of this kind of dense manual wiring.
Consider the above H(64) as being composed of two prewired H(32) boxes A and B (separated by the dotted horizontal line at 32/32). The first 5 dimensions, ports 0, 1, . . . 4, of each H(32) are already fully wired and the only missing connections are the 32 wires connecting ports #5 on the 32 switches from one to the other container, in perfectly orderly manner (row 0 of container A to row 0 of container B, row 1 from A to row 1 from B, . . . etc). Hence, instead of wiring 32×6=192 wires for H(64), two prewired containers and 32 wires now connect between them in a simple 1, 2, 3 . . . order. The job is made even easier with a bundled, thick cable with these 32 lines and a larger connector on each box, requiring thus only one cable to be connected.
Looking further at the wiring relation between port #4 and port #5, it is obvious that these thick cables (each carrying e.g. 64 or 128 Cat 5 cables) follow the exact pattern as ports #1 and #2, except with cable bundles and big connectors instead of single Cat 5 cables and individual ports. Hence, if one had a row of internally prewired (e.g. via ASIC) 128-switch containers (e.g. one rack 64RU tall, 2 line cards per slot), each container having 8 color coded big connectors lined up vertically on its back panel, matching color thick cables may be used that repeat the above exact wiring pattern between these 2^8=256 containers (except it goes horizontally) to create a network with 2^(7+8)=32K IPA switches (for only $393 million), providing 786,432×10G ports (1 port per 10G virtual server with 32 virtual machines (VMs), totaling 25,165,824 VMs; i.e. switching cost <$16/VM). For large setups a single frame may be used where any newly added container can just be snapped into the frame (without any cables), which has built in frame-based connectors (with all the inter-container thick cabling prewired inside the frame base).
The ultimate streamlining of the wiring (and of a lot more) is achieved by using “merchant silicon”, where all such dense wiring, along with the connectors and their supporting hardware on the switch is replaced with ASICs tying together the bare switching fabric chips. This approach not only eliminates the wiring problem, but also massively reduces the hardware costs and power consumption.
For ASIC wiring of the
The above manual wiring scheme can also be used to build a network that has a number of switches N which is not a power of 2 (thus it cannot form a conventional hypercube). Consider a network that has 32 switches (d=5, using ports #0 . . . #4, rows 0 . . . 31) to which two more switches, (rows) #32 and #33, are to be added. This starts the 6th dimension (port #5, long cyan wires), but with only two of the 32 cyan lines connected on port #5 (the two connect port #5 in rows 0↔32 and 1↔33 for the 2 new switches #32 and #33). The first 5 ports #0-#4 of the two new switches have no switches to go to, since these haven't been filled in (these will come later in the rows 34-63).
The problem with such partial wiring is that it severely restricts forwarding to and from the new switches (just 1 link instead of 6 links), along with reduced bandwidth and fragility (due to single points of failure). This problem can be eliminated by using port (column) #4 of the first new switch (row) #32. The port #32:4 normally connects (via a green wire going down to row #48) to switch:port #48:4, but the switch #48 isn't there yet. Switch #48 also connects on port #5 (via a dotted cyan wire) back to the existent switch #16:5. Thus, there are two broken links #32:4↔#48:4 and #48:5↔#16:5, with the missing switch #48 in the middle. Therefore, the two ends on existing switches can be connected directly to each other, i.e. #32:4↔#16:5 as shown by the top dotted green wire (which happens to be just the right length, too). Later, when switch #48 is finally added, the shortcut (green dotted wire going up) moves down to #48:4 while #16:5, which becomes free as well (after moving the green wire down), now connects to #48:5 (dotted cyan wire). The same maneuver applies to switch #33 as shown with the 2nd green dotted wire. Analogous shortcuts follow for the lower ports of #32 and #33, e.g. the broken pairs #32:3↔#40:3 and #40:5↔#8:5 are short-circuited via #32:3↔#8:5 etc., resulting in full (with natural forwarding) 6-D connectivity for the new switches and their neighbors. The general technique is to first construct correct links for the target topology (e.g. hypercube), which include the non-existent nodes. Then one extends all shortest paths containing the non-existent nodes until they reach existent nodes on both ends. The existent nodes terminating such “virtual” shortest paths (made of non-existent nodes on the inner links) are connected directly, using the available ports (reserved on existent nodes for connections with as yet non-existent ones).
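The shortcut rule for a partly filled top dimension can be sketched as below. This covers only the simple two-link case worked through above (a new switch whose neighbor is missing, bridged through that neighbor's top-dimension link); the general extension of virtual shortest paths is not implemented, and a reserved port already claimed by an earlier shortcut is simply skipped. The function name `shortcut_links` is hypothetical.

```python
def shortcut_links(n_switches: int, d: int):
    """Shortcuts for a d-cube with only rows 0..n_switches-1 present.

    Returns a list of ((switch, port), (switch, port)) pairs. Assumes
    2**(d-1) < n_switches <= 2**d, i.e. only the top dimension is partial.
    """
    top = d - 1                            # port of the partly filled dimension
    top_bit = 1 << top
    claimed = set()                        # lower-half switches whose top port is used
    shortcuts = []
    for s in range(top_bit, n_switches):   # switches present in the upper half
        for port in range(top):
            missing = s ^ (1 << port)      # normal neighbor on this port
            if missing < n_switches:
                continue                   # neighbor exists, normal link applies
            partner = missing ^ top_bit    # existing switch behind the missing one
            if partner in claimed:
                continue                   # conflict: not resolved in this sketch
            claimed.add(partner)
            shortcuts.append(((s, port), (partner, top)))
    return shortcuts


if __name__ == "__main__":
    for link in shortcut_links(34, 6):     # 32 switches plus #32 and #33
        print(link)
```

When a missing switch is later installed, the corresponding shortcut is removed and both freed ports are connected to the new switch, as described above.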
Another approach according to embodiments of the invention for interconnecting switches can include building large, software controlled super-connectors (“C-Switches”), where making any desired connections between the physical connectors can be controlled by software.
Unlike a standard switch, which forwards packets dynamically based on the destination address in the packet frame header, a C-Switch forwards packets statically, where the settings for the network of crossbar connections within the C-Switch can be provided by an external program at initialization time. Without any need for high speed dynamic forwarding and buffering of data packets, the amount of hardware or power used by a C-Switch is several orders of magnitude smaller than a standard switch with the same number of ports.
The individual connectors (or per-switch bundles of for example 48 individual circuit cables brought in via trunked thick cables, plugged into a large single connector), plug into the C-Switch's panel (which can cover 3-5 sides of the C-Switch container), which can include a matrix containing hundreds or thousands of receptacles. Beyond the simple external physical connection, everything else can be done via software controls. Any desired topology can be selected via an operator using software to select from a library of topologies or topology modules or topology elements.
To facilitate physical placement and heat management, C-Switches can be modular, meaning that a single C-Switch module can combine several hundred to several thousand connectors, and the modules can be connected via single or few cables (or fiber links), depending on the internal switching mechanism used by the C-Switch. In such a modular implementation, the inter-module cabling can be done via the cabling built into the frame where the connections can be established indirectly, by snapping a new module into the frame.
There is a great variety of possible ways to implement the core functionality of a C-Switch, ranging from telephony style crossbar switches, to arrays of stripped down, primitive hub or bridge elements, to nanotech optical switches and ASIC/FPGA techniques. Since the internal distances within a C-Switch are several orders of magnitude smaller than standard Ethernet connections, it is useful (for heat and power reduction) that the incoming signal power be downscaled by a similar factor before entering the crossbar logic (the signals can be amplified back to the required levels on the output from the crossbar logic). In other embodiments, for example using MEMS based devices, power reduction may not be necessary where optical signals are switched via piezo-electrically controlled nano-mirrors or other purely optical/photonic techniques such as DLP normally used for projection screens, where such down/up-scaling is implicit in the transceivers.
The internal topology of the C-Switch can be multi-staged since the complexity of a single, flat crossbar grows as O(X^2) for X external ports. For example, a rearrangeable non-blocking hypercubic topology requires a hypercube dimension d, connecting N=2^d smaller crossbars, which is twice the number of external ports p per smaller crossbar, i.e. d=2p. Hence each small crossbar of radix 3p has a circuit complexity (number of cross points) of O(9p^2). The number of external ports X=N·p=2^(2p)·p determines the value p needed for a given X in implicit form, where approximately p≈½ log(X)+O(log(log(X))). Hence, the number of small crossbars is N=2^d≈X·log(X). With the small crossbar radix p=72, the C-Switch hardware scales to X=2^24≈16 million ports.
This kind of software controlled multi-connector has a much wider applicability than data centers, or even than Ethernet LANs, since cabling and connectors are a major problem in many other settings and at much smaller scales of connectivity.
The traffic patterns in a data center are generally not uniform all-to-all traffic. Instead, smaller clusters of servers and storage elements often work together on a common task (e.g. servers and storage belonging to the same client in a server farm). The integrated control plane of the current invention allows traffic to be monitored, these types of traffic clusters to be identified, and the C-Switch to be reprogrammed so that the nodes within a cluster become topologically closer within the enhanced hypercube of Ethernet switches. By reducing the path lengths of the more frequent traffic patterns or flows by using a C-Switch, the load on the switching network is reduced since fewer switching operations are needed on average from ingress to egress, hence increasing capacity. The C-Switch is used in this new division of labor between the dynamic switching network of the Layer 2 switches and the crossbar network within the C-Switch, in which the less expensive network (crossbars) offloads and increases the capacity of the more expensive network (switches). This is a similar kind of streamlining of the switching network by the C-Switch to that which layer 2 switching networks perform relative to the more expensive router/layer 3 networks. In both cases, a lower level, more primitive and less expensive form of switching takes over some of the work of the more expensive form of switching.
Although the d-cube wiring is highly regular and can be performed mechanically (a la weaving), the ‘long hops’ do complicate the simple pattern enough to make it error prone for brute force manual wiring. Since this problem is shared by many other desirable topologies, a general solution is desirable to make networks built according to the invention practical in the commercial world.
In this method, the switches are numerically labeled in a hierarchical manner tailored to the packaging and placement system used, allowing technicians to quickly locate the physical switch. A wiring program displays the wiring instructions in terms of the visible numbers on the switches (containers, racks, boxes, rooms) and ports. The program seeks to optimize localization/clustering of the wiring steps, so that all that is needed in one location is grouped together and need not be revisited.
This is a more attainable lower tech variation of the C-Switch in the form of a connector box with pre-wired topologies, such as enhanced hypercubes, within certain range of sizes. Front panels of the C-Box provide rows of connectors for each switch (with 10-20 connectors per switch) with numbered rows and columns for simple, by the numbers, wiring for the entire rows of rack switches and hosts.
C-Box is as easy to hook up and functions exactly as the C-Switch (e.g. with a built in processor and a unified control plane per box), except that the topology is fixed. As with the C-Switch, multiple C-Boxes can be connected via thick cables to form a larger network.
This facility is useful for the manual wiring methods described above. Diagnostic software connected to the network can test the topology and connections, and then indicate which cables are not connected properly and what corrective actions need to be taken.
The network spanned by the T-Lines is the network backbone. The encircled “A” above the top-of-rack (TOR) switches represents fabric aggregation for parts of the TOR fabric which reduces the TOR inefficiencies.
The control and management software, MMC (Management, Monitoring and Control module), CPX (Control Plane Executive) and IDF (Data Factory), can run on one or more servers connected to the network switching fabric.
In a data center using virtual machine instances, the MMC and CPX can cooperate to observe and analyze the traffic patterns between virtual machine instances. Upon discovering a high volume of data communication between two virtual machine instances separated by a large number of physical network hops, the MMC and/or CPX can issue instructions to the virtual machine supervisor that results in one or more virtual machine instances being moved to physical servers separated by a smaller number of network hops or network hops that are less used by competing network communication. This function both optimizes the latency between the virtual machines and releases usage of some network links for use by other communicating entities.
The most commonly used layer 3 (or higher) reliable communication protocols, such as TCP and HTTP, which have large communication overheads and non-optimal behaviors in data center environments, can be substantially optimized in managed data center networks with a unified control plane such as in the current invention.
The optimization consists of replacing the conventional multi-step sequence of protocol operations (such as the three-way handshake and later ACKs in TCP, or large repetitive request/reply headers in HTTP) which have source and destination addresses within the data center, with streamlined, reliable Layer 2 virtual circuits managed by the central control plane, where such circuits fit naturally into the flow-level traffic control. In addition to reducing communication overhead (number of frames sent, or frame sizes via removal of repetitive, large headers) and short-circuiting the slow error detection and recovery (the problem known as “TCP incast performance collapse”), this approach also allows for better, direct implementation of the QoS attributes of the connections (e.g., via reservation of the appropriate network capacity for the circuit). The network-wide circuit allocation provides an additional mechanism for global anticipatory traffic management and load balancing that operates temporally ahead of the traffic, in contrast to reactive load balancing. This approach of tightly integrating with the underlying network traffic management is a considerable advance over current methods of improving layer 3+ protocol performance by locally “spoofing” remote responses without visibility into the network behavior between the spoofing appliances at the network end points.
Further, by operating in the network stacks/hypervisor, the virtualized connections cooperate with the Layer 2 flow control, allowing for congestion/fault triggered buffering to occur at the source of the data (the server memory), where the data is already buffered for transmission, instead of consuming additional and far more expensive and more limited fast frame buffers in the switches. This offloading of the switch frame buffers further improves the effective network capacity, allowing switches to handle much greater fluctuations of the remaining traffic without having to drop frames.
The FRS Control Plane (FRS-CP) makes use of the advanced routing and traffic management capabilities of the Infinetics Super Switch (ISS) architecture. It can also be used to control conventional switches, although some of the capabilities for Quality of Service control and congestion control may be limited.
FRS-CP provides:
Performance
Management
Security
Cost Savings
Control Plane Architecture
FRS-CP can include a central control system that connects directly to all the switches in the network, which may be replicated for redundancy and failover. Each switch can run an identical set of services that discover network topology and forward data packets.
Switches can be divided into three types based upon their role in the network, as shown in
ARP and broadcast squelching. When a specific machine attempts to locate another machine on the network in a classic network, it sends out a broadcast ARP (a “where are you?” type of message), which is transmitted across the entire network. This message needs to be sent to every machine on every segment of the network, which significantly lowers the throughput capacity of the network. Instead, a master list of every host on the network is kept and distributed to every switch, so that any host can find any other host immediately. Any other broadcast-type packets that would have been sent across the entire network are likewise blocked. (** See CPX Controller/Data Factory)
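A minimal sketch of the squelching idea follows, assuming each switch holds a copy of the master host list as a simple IP-to-MAC mapping. The class and method names are hypothetical, and dropping a request for an unknown host is an assumption of this sketch rather than a behavior stated above.

```python
class HostDirectory:
    """Hypothetical per-switch copy of the master host list."""

    def __init__(self):
        self._mac_by_ip = {}

    def register(self, ip, mac):
        """Record a host entry distributed by the control plane."""
        self._mac_by_ip[ip] = mac

    def handle_arp_request(self, target_ip):
        """Answer from the local list; the request is never flooded."""
        mac = self._mac_by_ip.get(target_ip)
        if mac is None:
            return ("drop", None)   # assumption: unknown targets are squelched
        return ("reply", mac)       # known host: answer without a broadcast
```

Because the lookup is local to the switch, the cost of resolving an address does not grow with the number of network segments.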
Data Factory (IDF)
Control Plane Executive (CPX)
The Data Factory communicates with the Control Plane Executive (CPX) through a service interface using a communication mechanism such as Thrift or JSON as shown in
Universal Boundary Manager (UBM)
In accordance with some embodiments of the invention, the UBM can provide some or all of the following functions:
A UBM entry can describe a name for an organization or a specific service. A UBM entry could be a company name like ReedCO which would contain all the machines that the company ReedCO would use in the data center. A UBM entry can also be used to describe a service available in that data center. A UBM entry has the following attributes:
To allow external access, a flag can be provided in or associated with the Node definition that indicates that this Node can be accessible from anybody without restrictions. So a typical company with a Database server, Backup Database server, WWW server, and Backup server could look like the following:
A machine table contains at least the following information:
The firewall rules that are necessary to allow dataflow across the network can be created from this table. Only flows that are allowed will be sent to the KLM.
UBM Service
The Universal Boundary Manager service can provide membership services, security services and QoS. There can be two or more types of UBM groups:
Transparent UBM Group
A transparent group can be used as an entry point into the IPA Eco-System. It can be visible and allow standard IP traffic to flow over its interface—UBM Interfaces can be determined by port number—e.g. Port 80. This type of group can be used to handle legacy IP applications such as Mail and associated Web Services. Since a Web Service can be tied to an IP port, limited security (at the Port Level) and QoS attributes (such as Load Balancing) can be attributes of the UBM structure.
Opaque UBM Group
An opaque group can have all the attributes of the Transparent group, but allows for the extension of pure IPA security, signaling (switch layer) and the ability to provide guaranteed QoS.
The major extensions to the Opaque group can include the security attributes along with the guaranteed QoS attributes. Multiple opaque or visible groups can be defined from this core set of attributes.
Firewall
The firewall can be a network-wide mechanism to pre-authorize data flows from host to host. Since every host on the network must be previously configured by the network administrator before it can be used, no host can successfully transmit or receive data unless it has been authorized in the network. Furthermore because of the built in security model applied to all devices connected to the network, hosts can only communicate with other authorized hosts. There is no way a rogue host can successfully communicate with any unauthorized host. The data defined in the UBM can control all access to hosts. The KLM loaded into each Hypervisor can provide this functionality. Alternatively, this functionality can be provided on each switch for each attached physical host.
The ingress switch where a data packet from a host first arrives in the network can use the following rules to determine whether the data packet will be admitted to the network as shown in
This is the opposite of how traditional firewalls work, where data is allowed to enter the network from any source, traverses the network, and is prevented from reaching a destination host only once the data packet has nearly reached its intended destination. Rejecting unauthorized packets at the ingress switch significantly lowers “backbone” traffic on the network.
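The pre-authorization principle can be sketched as a check performed before a packet enters the network. This is a simplified illustration only: the function name and the two data structures are assumptions, and the sketch does not reproduce the specific admission rules referenced above.

```python
def admit_at_ingress(src, dst, registered_hosts, allowed_flows):
    """Decide at the ingress switch whether a packet may enter the network.

    registered_hosts: set of hosts configured by the network administrator
    allowed_flows:    set of (source, destination) pairs derived from UBM data
    Both structures are hypothetical stand-ins for the data held by the switch.
    """
    if src not in registered_hosts or dst not in registered_hosts:
        return False                     # unconfigured hosts never reach the backbone
    return (src, dst) in allowed_flows   # only pre-authorized flows are admitted
```

A rejected packet consumes no backbone capacity, which is the difference from a firewall placed near the destination.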
Central Services
Data Factory
This is the starting point for full control of the network. All static and dynamic data is stored here, and a user interface is used to view and modify this data.
CPX Controller
The CPX computer is the Control Plane Executive, which controls all switches and exchanges data with them. This data includes the information needed to route data, firewall information, etc. It also controls the ICP (Integrated Control Plane) module, which determines topology, and controls the IFX (Firmware eXtensions), which are installed on every switch and hypervisor.
CPX connects to the Data Factory to read all of the configuration data necessary to make the entire network work. It also writes both log data and current configuration data to the Data Factory for presentation to users.
ICP (Integrated Control Plane)
This module controls each instance of IFX on each switch. It takes the neighbor data from each IFX instance and generates cluster data, which is then sent back to each IFX instance on each switch.
CPX Interaction with ICP
The types of data that will flow through CPX for the data plane are:
Triplets (which contain the Host IP Address, Switch ID, and MAC address of the host) are generated by the Host detector that runs on each switch. The detected triplets are sent through the Host Controller to the CPX controller. First, the triplet's data is validated to make sure that the host MAC address (and IP address, if defined) is a valid one. Once validated, the triplet is enabled in the network. Optionally, before a host's triplet is added to the database, the host can be forced to validate itself using various standard methods such as 802.1x.
The triplets can be sent to the Data Factory for permanent storage, and are also sent to other switches that have previously requested that triplet. The sends will be timed out, so that if a switch has not requested a specific triplet for a specific time, the CPX will not automatically send it if it changes again unless the ICP requests it.
When a switch needs to route data to a host that it does not have a triplet for, the host controller sends a request for the triplet associated with the specific IP address. The CPX looks up that triplet and sends it to the IFX which in turn sends it to the KLM module so that the KLM can route data.
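The triplet handling described above (validation at CPX, then request on a cache miss at the switch) can be sketched as follows. The class names, the set of valid MAC addresses, and the in-memory dictionaries are assumptions for illustration; the actual transport between the host controller, CPX, IFX and the KLM is not modeled.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    host_ip: str
    switch_id: str
    host_mac: str


class Cpx:
    """Hypothetical stand-in for the CPX triplet store."""

    def __init__(self, valid_macs):
        self.valid_macs = set(valid_macs)
        self.triplets = {}

    def report(self, triplet):
        """Validate a detected triplet and enable it only if its MAC is known."""
        if triplet.host_mac not in self.valid_macs:
            return False
        self.triplets[triplet.host_ip] = triplet
        return True

    def lookup(self, host_ip):
        return self.triplets.get(host_ip)


class SwitchTripletCache:
    """Hypothetical switch-side cache that asks CPX only on a miss."""

    def __init__(self, cpx):
        self.cpx = cpx
        self.cache = {}

    def resolve(self, host_ip):
        if host_ip not in self.cache:
            triplet = self.cpx.lookup(host_ip)
            if triplet is None:
                return None          # CPX has no enabled triplet for this address
            self.cache[host_ip] = triplet
        return self.cache[host_ip]
```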
Firewall rules and Quality of Service (QOS) data travel along the same route as triplets. A switch always receives all the firewall rules involving hosts that are connected to that switch so that quick decisions can be made by the KLM module. If a firewall rule changes, then it is sent to the IFX which sends it to the KLM module. In cases where there are firewall rules with schedules or other “trigger points”, the firewall rules are sent to the IFX and IFX sends them to the KLM module at the appropriate time.
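One way to picture the handling of rules that carry a schedule is a holding queue ordered by activation time, from which due rules are released to the KLM. The sketch below is an assumption about how such a queue could be built; the class name, the numeric timestamps, and the string rule format are hypothetical.

```python
import heapq


class ScheduledRuleQueue:
    """Hypothetical IFX-side holding queue for rules with an activation time."""

    def __init__(self):
        self._pending = []   # heap of (activate_at, sequence, rule)
        self._seq = 0        # tie-breaker so rules themselves are never compared

    def add(self, activate_at, rule):
        heapq.heappush(self._pending, (activate_at, self._seq, rule))
        self._seq += 1

    def release_due(self, now):
        """Pop and return every rule whose activation time has arrived."""
        due = []
        while self._pending and self._pending[0][0] <= now:
            due.append(heapq.heappop(self._pending)[2])
        return due
```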
Logging data, such as data sent/received, errors, etc., is sent from the KLM (or some other module) to IFX, and then to CPX, which sends it to the Data Factory.
ICP Interaction with IFX on Switches
CPX controls ICP, which in turn controls each instance of IFX on each switch, telling it to send “discover” packets and to return neighbor topology data to ICP. All this data is stored in the Data Factory for permanent storage and for presentation to users. This topology data is used by IFX to generate routes. When link states change, the IFX module notifies ICP, and a new routing table is generated by IFX. Initially, IFX will reroute the data around the affected path.
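The text does not specify the algorithm IFX uses to generate routes. As a hedged illustration, the sketch below uses a plain breadth-first search over neighbor data to find the first hop toward every reachable switch, and shows that rerunning it after a link change yields routes around the failed link. The function name and the adjacency format are assumptions.

```python
from collections import deque


def next_hops(neighbors, source):
    """Breadth-first search giving the first hop from source to each reachable switch.

    neighbors: dict mapping switch ID -> iterable of adjacent switch IDs,
               as might be assembled from per-switch discovery reports.
    """
    first_hop = {}
    visited = {source}
    queue = deque()
    for n in neighbors.get(source, ()):
        if n not in visited:
            visited.add(n)
            first_hop[n] = n          # direct neighbors are their own first hop
            queue.append(n)
    while queue:
        current = queue.popleft()
        for n in neighbors.get(current, ()):
            if n not in visited:
                visited.add(n)
                first_hop[n] = first_hop[current]   # inherit the first hop of the parent
                queue.append(n)
    return first_hop
```

Breadth-first search minimizes hop count only; a production route generator would also need to weigh link load and the multiple logical topologies mentioned elsewhere in this description.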
CPX Interaction with Data Factory
CPX reads the following data from the Data Factory:
The following information is needed by ICP.
This can happen at a very high rate upon startup, and can recur on a regular basis very slowly.
This can happen on a very regular basis (e.g., at least once per second, and possibly more often), but the writes can be buffered and delayed if need be. The data will not be read on a regular basis, except at startup, but will need to be updated on all other switches. The data will also be read by the User for network status monitoring.
The following information will be written by the switches
All of these reads can occur as fast as possible. Any slowness in these reads may slow down the data path.
The following services can run on all switches in the network.
IFX (Firmware eXtensions)
This module runs on each switch and is responsible for determining the topology of the neighbors. It sends data back to the ICP module about its local physical connectivity, and also receives topology data from ICP. It supports multiple simultaneous network logical topologies, including n-cube, butterfly, torus, etc as shown in
IFXS (Firmware eXtensions for Servers)
This module runs on each hypervisor and interacts with the Hypervisor/KLM module to control the KLM. Flow data describing how many bytes flow from this hypervisor to various destinations is accepted by this module and used to calculate forwarding tables.
This can include a Linux kernel loadable module (KLM) that implements the Data plane. It can be controlled by the Switch Controller.
The inputs to this module are:
The KLM can route packets from hosts to either other hosts, or to outside the network if needed (and allowed by rules). All packets sent across the “backbone” can be encrypted, if privacy is required.
The KLM switch module can have access to caches of the following data: triplets (which map IPv4 addresses into (Egress Switch ID, host Ethernet Address) pairs); routes (which define the outbound interfaces and next hop Ethernet Address to use to reach a given Egress Switch); and firewall rules (which define which IPv4 flows are legal, and how much bandwidth they may utilize).
The KLM can eavesdrop on all IP traffic that flows from VM instances (that are supported by the local hypervisor). It can, for example, use functionality (defined in the Linux netfilter library) to STEAL, DROP, or ACCEPT individual IP datagrams that are transmitted by any VM.
When a datagram is transmitted by a VM, the KLM switch can intercept (STEAL) it and determine whether firewall rules classify the corresponding flow as legal. If the flow is illegal, the packet is dropped. If the flow is legal and its destination is local to the hypervisor, the packet is made to obey QoS rules and is delivered. If the flow is legal and exogenous, the local triplet cache is consulted with the destination IP address as an index. If a triplet exists, it determines the Egress Switch ID (which is just a six-byte Ethernet address). If a route also exists to the Egress Switch, then the packet is forwarded with the destination switch Topological MAC address put into the Ethernet frame.
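The outbound decision sequence can be summarized in a short sketch. This is user-space Python meant only to show the order of the checks, not kernel code; the function name, the dictionary-based caches, and the returned action tuples are assumptions. The "request_triplet" outcome follows the triplet request described earlier, while the "no_route" outcome is an assumption, since the text does not state what happens when no route exists.

```python
def forward_outbound(dst_ip, flow_allowed, local_vms, triplets, routes):
    """Return the action taken for a datagram sent by a local VM.

    flow_allowed: result of the firewall check for this flow
    local_vms:    set of IP addresses hosted on this hypervisor
    triplets:     dict mapping destination IP -> (egress_switch_id, host_mac)
    routes:       dict mapping egress_switch_id -> (out_interface, next_hop_mac)
    """
    if not flow_allowed:
        return ("drop", "firewall")              # illegal flow: packet is dropped
    if dst_ip in local_vms:
        return ("deliver_local", dst_ip)         # QoS is applied, then local delivery
    triplet = triplets.get(dst_ip)
    if triplet is None:
        return ("request_triplet", dst_ip)       # cache miss: ask for the triplet
    egress_switch, _host_mac = triplet
    route = routes.get(egress_switch)
    if route is None:
        return ("no_route", egress_switch)       # assumption: outcome not given in the text
    out_interface, _next_hop_mac = route
    # The frame carries the egress switch address as its destination.
    return ("forward", out_interface, egress_switch)
```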
The KLM can use a dedicated Ethernet frame type to make it impossible for any backbone switch or rogue host to send a received frame up its protocol stack.
When a frame arrives at a hypervisor, it can be intercepted by the kernel's protocol handler (functionality inside the KLM) for the defined Ethernet frame type. The protocol handler can examine the IP datagram, extract the destination IP address, and then use it as an index into its triplet cache to extract the Ethernet address of the local VM. If no triplet exists, the frame can be dropped. The socket buffer's protocol type can be switched from 0xbee5 to 0x0800, and the packet can be made to obey QoS rules before it is queued for transmission to the local host.
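The receive path can be sketched in the same simplified style. The two frame type values are taken from the text; the function name, the dictionary cache, and the choice to return None for frames of any other type are assumptions of this sketch, which again shows only the decision logic and not an actual kernel protocol handler.

```python
BACKBONE_ETHERTYPE = 0xBEE5   # dedicated frame type used on the backbone
IPV4_ETHERTYPE = 0x0800       # standard IPv4 frame type


def receive_frame(ethertype, dst_ip, local_triplets):
    """Return (ethertype, destination MAC) for local delivery, or None to drop.

    local_triplets: dict mapping destination IP -> Ethernet address of a local VM
    """
    if ethertype != BACKBONE_ETHERTYPE:
        return None                      # assumption: this handler ignores other types
    vm_mac = local_triplets.get(dst_ip)
    if vm_mac is None:
        return None                      # no triplet: the frame is dropped
    return (IPV4_ETHERTYPE, vm_mac)      # restore the IPv4 type before delivery
```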
The KLM can use IFXS, for example, as its method to talk with CPX to access the data factory.
Additional supporting information relating to the construction of Long Hop networks is provided in attached Appendix A, which is hereby incorporated by reference.
Those skilled in the art will realize that the methods of the invention may be used to develop networks that interconnect devices or nodes with arbitrary functionality and with arbitrary types of information being exchanged between the nodes. For example, nodes may implement any combination of storage, processing or message forwarding functions, and the nodes within a network may be of different types with different behaviors and types of information exchanged with other nodes in the network or devices connected to the network.
This application claims any and all benefits as provided by law, including benefit under 35 U.S.C. §119(e) of U.S. Provisional Application Nos. 61/483,686 and 61/483,687, both filed on May 8, 2011, both of which are hereby incorporated by reference in their entirety.
Number | Date | Country
---|---|---
61483686 | May 2011 | US
61483687 | May 2011 | US