The disclosed system and operating method are related to subject matter disclosed in the following patents and patent applications that are incorporated by reference herein in their entirety:
Prior to widespread networking of computing capacity, computers such as traditional mainframes scaled performance by balancing processing power and communications throughput in an environment of predictable workloads. As computer networking has expanded enormously in both variety and load, multiple-tier server systems distribute communications across a range of architectures and interconnect technologies. The Internet has fundamentally changed the nature of computing management. Before widespread Internet usage, all but a fraction of computing was performed local to a particular computer. Widespread Internet connectivity has fundamentally changed the character of usage so that most computing is now performed over a network. Service providers have responded by improving connectivity, enabling transfer of massive amounts of data throughout the world.
The impressive increase in capability and capacity enabled by the evolution from local to highly networked computing brings challenges to providers of computing capability and services. Workloads have evolved from highly predictable to vastly unpredictable. Not only consumers of computing power but all of society has become highly dependent on networked computing and database operations.
One aspect of reliable communications and computing is the efficiency of message communication through the network to transfer data. The traditional “load-store” model for transferring data is insufficient to meet today's networking demands.
In accordance with an embodiment, an interconnect device is adapted for usage in an interconnect structure. The interconnect device includes a data switch and a control switch coupled in parallel between multiple input lines and a plurality of output ports. The interconnect device comprises an input logic element coupled between the multiple input lines and the data switch. The input logic element can receive a data stream composed of ordered data segments, insert the data segments into the data switch, and regulate data segment insertion to delay insertion of a data segment subsequent in order until a signal is received designating exit from the data switch of a data segment previous in order.
Embodiments of the illustrative systems and associated techniques, relating to both structure and method of operation, may best be understood by referring to the following description and accompanying drawings:
The disclosed system relates to structures and operating methods for interconnecting a plurality of devices for the purpose of passing data between said devices. Devices that can be interconnected include but are not limited to: 1) computing units such as work stations; 2) processors in a supercomputer; 3) processor and memory modules located on a single chip; 4) storage devices in a storage area network; and 5) portals to a wide area network, a local area network, or the Internet. The disclosed system also can include aspects of self-management of the data passing through the interconnect structure. Self-management can accomplish several functions including: 1) ensuring that individual packets of a message and segments of a packet leave the interconnect structure in the order of entry, and 2) controlling packets entering the interconnect structure to prevent overloading of individual output ports.
The interconnect structures described in the related patents and patent applications are excellent for usage in interconnecting a large number of devices when low latency and high bandwidth are important. The self-routing characteristics and a capability to simultaneously deliver multiple packets to a selected output port of the referenced interconnect structures and networks can also be favorably exploited.
References 1, 2, 3, 4, 6 and 7 disclose topology, logic, and use of the variations of a revolutionary interconnect structure that is termed a “Multiple Level Minimum Logic” (MLML) network in the first reference and has been referred to elsewhere as the “Data Vortex”. References 8 and 10 show how the Data Vortex can be used to build next generation communication products, for example including routers. A Hybrid Technology Multi Threaded (HTMT) petaflop computer can use an optical version of the MLML network in an architecture in which all message packets are of the same length. Reference 5 teaches a method of parallel computation and parallel memory access within a network.
Reference 11 teaches a method of scheduling messages through an interconnect structure including a Data Vortex data structure. One consequence of the scheduling is a highly desirable capability of guaranteeing that output ports of a scheduled switch are not overloaded and also guaranteeing that message packet segments exit a switch in the same order as the segments entered the switch.
The present disclosure describes a structure and technique for regulating data flow through an interconnect structure or network that eliminates overloading of output ports and ensures that segments of a data packet and packets of a data stream pass through the interconnect structure in the same order as the segments entered the structure.
In some embodiments, the disclosed structure and technique can be configured using the interconnect structures, networks, systems, and techniques described in the referenced and related patents and patent applications to further exploit the performance and advantages of the referenced systems. For example, the present system can be constructed using a Data Vortex switch in a configuration as an unscheduled switch so that overloading of output ports is eliminated and data segments and packets pass through the Data Vortex switch in order. Overloading and/or mis-ordering events that occur with very low probability in systems using structures and methods of the referenced patents and patent applications are eliminated using the structures and methods of the present disclosure.
Because InfiniBand switching standards require that segments leave a switch in the same order as the segments entered the switch, the structures and methods described herein are highly useful in InfiniBand applications.
The present disclosure teaches interconnect structures, networks, and switches, and associated operating methods for regulating data flow through an interconnect structure or switch, in some embodiments a Data Vortex switch, so that output ports do not become overloaded and message packet segments are guaranteed to exit the switch in the same order as the segments entered. In one embodiment two Data Vortex switches are connected. A first Data Vortex switch is used for data transfer and a second switch is used to regulate the first switch. In a particular configuration, a data switch has N input ports and N output ports. K data transmission lines connect to each of the output ports. Output port overloading and/or data mis-ordering conventionally arises when, over a prolonged time, more than K input ports send data to a single output port.
The following terminology is used in the present description.
A message is a data stream comprising a plurality of packets. A packet is a data stream comprising a plurality of segments. Messages and packets can have different lengths. All segments have the same length. Segments are inserted into the switch at segment insertion times.
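The hierarchy just defined, messages composed of packets and packets composed of equal-length segments, can be sketched as follows. The segment length, the zero-padding scheme, and the function name are illustrative assumptions and are not taken from the disclosure.

```python
SEGMENT_LEN = 8  # illustrative fixed segment length in bytes (assumption)

def segment_packet(packet: bytes, seg_len: int = SEGMENT_LEN) -> list:
    """Split a packet into equal-length segments, zero-padding the final
    segment so that all segments have the same length, as stated above."""
    segments = [packet[i:i + seg_len] for i in range(0, len(packet), seg_len)]
    if segments and len(segments[-1]) < seg_len:
        segments[-1] = segments[-1].ljust(seg_len, b"\x00")
    return segments
```

Each element of the returned list would be inserted into the switch at a distinct segment insertion time.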
For example, a packet P can arrive at a data switch input port IP and the packet P is targeted for output port OP. The packet can include multiple segments arranged in a sequence S0, S1, S2, . . . , SJ so that segment S0 is the first segment to be inserted into the data switch DS; S1 is the second segment to be inserted into the data switch and so forth whereby SJ is the final segment inserted into the data switch. When the segment Sn, for entries 0≦n<J, is inserted into the data switch, a lock is placed on input port IP forbidding insertion of Sn+1 until the lock is removed by a signal indicating that Sn has exited the data switch.
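The lock discipline on input port IP can be sketched as below; the class and method names are hypothetical, and the sketch models only the lock itself, not the data switch.

```python
class InputPortLock:
    """Models the lock on input port IP: after segment Sn is inserted,
    segment Sn+1 is held back until a signal reports that Sn has exited
    the data switch."""

    def __init__(self):
        self.locked = False

    def try_insert(self, segment) -> bool:
        """Insert a segment if the port is unlocked; set the lock on success."""
        if self.locked:
            return False  # previous segment has not yet exited the data switch
        self.locked = True
        return True

    def exit_signal(self):
        """Signal indicating the previous segment has exited; remove the lock."""
        self.locked = False
```

In use, an attempt to insert Sn+1 while Sn is still in flight simply fails and is retried after the exit signal arrives.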
A header is attached to the individual segments. The header of a segment has a leading bit set to one to indicate message presence, followed by a binary address of a selected target output port, followed by a binary address of a selected target input port, followed by a single bit set to one. In case multiple data lines are connected to a particular input port, the input lines are distinguished by additional header bits between the address of the input port and the single bit, which functions as a payload.
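The header layout described above (a leading presence bit, the target output port address, the source input port address, and a trailing bit set to one) can be sketched as a bit list; the address width and function names are illustrative assumptions.

```python
def to_bits(value: int, width: int) -> list:
    """Most-significant-bit-first binary expansion of value in `width` bits."""
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]

def make_header(out_port: int, in_port: int, addr_width: int) -> list:
    """Header per the description: leading 1 (message present), binary address
    of the target output port, binary address of the source input port, then
    a single bit set to one."""
    return [1] + to_bits(out_port, addr_width) + to_bits(in_port, addr_width) + [1]
```

For a switch with eight ports (three address bits), output port 5 and input port 3 would yield the bit sequence shown in the test below.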
When the first header bit arrives at the target output port from the data switch, the header address bits have been removed by the switch. A packet with a header having a leading bit set to one, followed by the binary address of the input port, followed by a payload of a single bit set to one, is called a control packet and is sent through a control switch CS to the input logic unit IL specified in the header of the control packet. The single-bit payload arrives at the input logic unit IL and unlocks the control lock so that the next segment of the packet is free to enter data switch DS. When the single payload bit of the control packet exits the data switch, the contents of the payload are sent out of the data switch output port OP to a downstream device. Smooth operation of the system is possible even for short messages, in part because the control switch is self-routing and has extremely low latency. Moreover, when a group of segments is inserted into the control switch, no more segments are targeted for a single input port than the number of data lines into data switch DS from the input device at input port IP. In case multiple data lines are connected into the data switch, the individual lines can be reached by small crossbars arranged into an auxiliary switch AS. If the data switch has only a single input line from each of the input logic devices IL, the auxiliary switch AS can be omitted. References 10 and 11 describe an example of crossbars in an auxiliary switch AS. In case the data switch DS has N inputs and N outputs and each output has K data lines, the control packet can conveniently be sent through a K·N concentrator prior to sending the control packet into the control switch CS. A suitable concentrator is described in more detail in reference 4.
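The formation of the control packet at the output port can be sketched as below. The exact bit layout of the arriving remainder is an assumption: the data switch is taken to have consumed the output-address bits, leaving the presence bit, the input-port address bits, and the trailing bit.

```python
def form_control_packet(arrived_bits: list, addr_width: int) -> list:
    """At output port OP the data switch has pared away the output-port
    address, so the arriving remainder begins [1, input-address bits, ...].
    The control packet keeps the presence bit and input address and carries
    a single payload bit set to one (an assumed layout)."""
    presence_and_input_addr = arrived_bits[: 1 + addr_width]
    return presence_and_input_addr + [1]
```

The resulting control packet is then routed through control switch CS back to the input logic unit named by its address bits.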
Referring to
The illustrative system 100 includes a plurality of networks including the network DS 110, the network CS 130, and the network AS 140. Data enters the system through input lines 102 and exits the system through output lines 118. Data packets arriving on input lines 102 enter input logic units IL 150. The input logic units IL 150 divide the packet into segments of fixed length. The logic unit IL 150 applies a header to a segment prior to sending the segment into the data switch DS 110. The input logic unit IL governs the insertion of a first data segment S into the data switch DS 110 and a second segment T into data switch DS 110 when segment T directly follows segment S and is destined for the same output port as segment S, either because segments S and T are two adjacent data segments of the same packet or because segment S is the last segment of one packet P and segment T is the first segment of a following packet Q of the same message. When the input logic IL sends segment S through line 104 into the data switch DS, input logic element IL sets a lock prohibiting the entrance of segment T into data switch DS until input logic element IL receives a signal indicating that segment S has reached an output port of data switch DS.
In some embodiments, the data switch DS 110 can be a Data Vortex type interconnect structure.
A control line 116 from the data switch DS 110 to the input logic ILJ may also prevent a segment from entering, for example by sending a one-bit signal to input logic ILJ to indicate that the entry node is busy. In a particular embodiment, data in multiple lines 116 can be passed through a reduced number of pins.
The packet segment S, along with other packets from the input logic units, enters the data switch DS 110 at a segment insertion time. In a first case, the first bit of segment S can exit data switch DS 110 through line 106 before the beginning of the next segment insertion time or, in a second case, segment S circulates around the data switch through a first-in first-out (FIFO) buffer and, as a result, the first bit of segment S does not exit the data switch before the beginning of the next segment insertion time. The second case occurs when the number of segments inserted into data switch DS at the same segment insertion time as segment S exceeds the number of data-carrying lines 106 exiting data switch DS. The system can be designed so that, in the first case, segment T is allowed to enter data switch DS directly after segment S and, in the second case, insertion of segment T is delayed.
In one embodiment, when the insertion of segment T is delayed, the line that carries segment S from input logic unit IL to data switch DS is idle. A single bit sent to an input logic unit IL through a line 114 is sufficient to manage data transfer. In a second embodiment, the line that carries segment S from input logic unit IL to data switch DS may be enabled to carry a segment U of a packet R from input logic unit IL to data switch DS when no other segment of packet R is presently in the data switch DS. More information is generally sent to the input logic unit IL to enable the additional functionality of the second embodiment.
Data travels on lines 106 from the data switch DS to the output switches OS. As the data traverses the data switch DS 110, output port address information is pared away bit-by-bit while passing through nodes in the data switch DS. When a data segment reaches an output switch, at least a portion of the header is discarded. Output switch OS sends the remainder of the header and a single payload bit through a line 108 to the control switch CS. In some embodiments, the payload bit may be omitted and a timing bit may be used to inform the input logic unit that the data packet segment sent from the input logic unit to the data switch DS has exited the data switch DS. Segments enter the output switch OS at staggered times. Therefore, some method of managing the time that the control segments enter the control switch is used. In some embodiments, a concentrator may be incorporated into control switch CS 130 to reduce the number of output lines exiting the control switch in comparison to the number of input lines to the control switch.
Control segments travel through lines 112 from the control switch CS 130 to the auxiliary switch AS 140. The auxiliary switch may include one or more crossbar switches. Examples of suitable cross-bar switches and timing interfaces between the control switch CS and the auxiliary switch AS are described in more detail in the discussion of
In the depicted embodiment, only a single bit exits the auxiliary switch AS and travels on line 114 to the input logic unit IL 150. The single control bit is used to unlock an input logic gate to enable another segment to flow through that gate into the data switch DS 110.
A control line 112 from the external device to data switch DS is used to prevent packet segments from being sent from data switch DS to the output switch and from the output switch to the external device. In one embodiment, the output switch sends a single-bit set to one at a segment sending time when the output port is not prepared to receive the data (possibly resulting from a full input buffer). In a second embodiment, the external device sends a two-bit locking signal (for example [1,1]) indicating that no data is to be sent until further notice and sends an unlocking signal (for example [1,0]) indicating that data may be sent until the next locking signal is received.
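The two-bit locking protocol of the second embodiment can be sketched as follows, using the example signal values from the text; the class name and interface are assumptions.

```python
LOCK_SIGNAL = (1, 1)    # example locking signal from the text
UNLOCK_SIGNAL = (1, 0)  # example unlocking signal from the text

class OutputFlowGate:
    """Gate between the output switch and an external device: the device
    locks the gate until further notice and later unlocks it."""

    def __init__(self):
        self.blocked = False

    def receive_control(self, signal):
        if signal == LOCK_SIGNAL:
            self.blocked = True    # no data is to be sent until further notice
        elif signal == UNLOCK_SIGNAL:
            self.blocked = False   # data may be sent until the next lock

    def may_send(self) -> bool:
        return not self.blocked
```

Compared with the first embodiment's per-segment busy bit, this scheme requires no signal at all during the (typically long) intervals when the device is ready to receive.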
In many embodiments, for example the embodiment shown in
The second listed reference incorporated herein, U.S. Pat. No. 6,289,021, describes multiple-level switches in which the bottom level has multiple output rings. Each node on the output ring can receive data in a bit-serial arrangement, either from a neighboring node on the ring or from a node not on the output ring. The node NA on the bottom output ring is positioned to send data directly to a node NA+1 on the bottom ring. The timing is such that if the first bit of a packet segment arrives at node NA at tick t, also termed time t, and node NA forwards the segment to node NA+1, then the first bit of the segment arrives at node NA+1 at time t+2. Node NA+1 is also positioned to receive data from a node distinct from node NA. A segment arriving at node NA+1 from a node distinct from node NA also arrives at node NA+1 at time t+2.
In the embodiment illustrated in
Referring to
The input logic units insert data into data switch DS 110 through lines 104. The data switch DS can be of the type disclosed in the cited references incorporated herein. The nodes 402 of data switch DS are arranged into node arrays 404. Data can pass from a node array on one level to another array on the same level, or can pass to a node array on a lower level. The data lines connecting same-level node arrays pass through a permutation π 406. In case a data packet segment passes through all of the top level node arrays without dropping to a lower level, then the packet is inserted into a FIFO delay line 420. In the illustrative embodiment the FIFO delay line 420 has single-bit delay elements 422. Because of the permutations π between node arrays, a data packet segment entering a FIFO on the row J typically enters data switch DS in a node on a level K≠J. Data exiting a FIFO re-circulates back into a leftmost node array on lines 412. Although only one of lines 412 is shown in
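The single-bit delay elements 422 of the FIFO delay line can be sketched as a shift structure in which a bit inserted at tick t exits length ticks later; the class name is hypothetical and the sketch ignores the recirculation path back into the node arrays.

```python
from collections import deque

class FifoDelayLine:
    """FIFO delay line built from single-bit delay elements: each shift
    advances every stored bit one element, so a bit entering at tick t
    exits at tick t + length."""

    def __init__(self, length: int):
        self.cells = deque([0] * length, maxlen=length)

    def shift(self, bit_in: int) -> int:
        bit_out = self.cells[0]    # the oldest bit leaves the delay line
        self.cells.append(bit_in)  # maxlen drops the leftmost element
        return bit_out
```

In the switch itself, bits leaving the delay line re-circulate into a leftmost node array rather than being discarded.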
In the illustrative example shown in
In a first alternate embodiment, data segments from different message packets are interleaved in a single line 104. The message packets are labeled and the labels are carried in the payload of the control packet.
In a second alternate embodiment that is particularly useful for switches in which a large proportion of the data packets from different input ports are targeted to a common output port, the input logic units IL can implement a slightly more complex operating technique. If a particular message packet segment S of a message packet is inserted at a given message insertion time, and a number of segment insertion times pass before the control signal is returned to the input port, then the input port logic can continue to stay latched for several message insertion times. The number of consecutive latched insertion times after receiving the control packet can be a function of the time interval beginning with the packet segment insertion time and ending with the control packet receiving time.
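One way to realize the dependence just described is to hold the port latched for a number of further insertion times computed from the observed round trip; the proportional policy and the factor below are illustrative assumptions, not a rule taken from the disclosure.

```python
def extra_latch_times(insert_time: int, control_arrival_time: int,
                      factor: float = 0.5) -> int:
    """Number of additional segment insertion times the input port stays
    latched after the control packet returns, as a function of the interval
    from segment insertion to control-packet arrival (factor is assumed)."""
    round_trip = control_arrival_time - insert_time
    return max(0, int(factor * round_trip))
```

A long round trip suggests congestion toward the common output port, so the port backs off longer; a one-tick round trip yields no extra latching.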
The interconnect structures, networks, and switches described herein may be used in various applications. For example, the structures can be used in the unscheduled portion of the network described in reference 11. In the controlled portion of the reference 11 network, packets with multiple segments are guaranteed to arrive in sequence. For an embodiment using switches disclosed herein in the unscheduled portion of a reference 11 network, multiple-segment packets traversing the unscheduled portion of the network are also guaranteed to have segments arrive in order, so that packets reach the processors in sequence and never require re-sequencing.
Referring to
Referring to
Unscheduled or Uncontrolled Switch
Unscheduled or uncontrolled network switch U receives data from devices 530 through lines 512. Switch U sends data to devices through lines 514. Scheduled or controlled network switch S 520 receives data from devices through lines 522 and sends data to external devices through auxiliary switches AS 540. Data passes from network S 520 to the auxiliary switch 540 via line 524 and passes from the auxiliary switch 540 to the device D via lines 526.
Referring to
An embodiment has N pins that carry the control signals to the external devices, with one pin corresponding to each device. In other embodiments, fewer or more pins can be dedicated to the task of carrying control signals.
In another embodiment, that is not shown, a first-in-first-out (FIFO) buffer with a length greater than N and a single pin, or a pair of pins in case differential logic is employed, are used for carrying control signals to the devices D0, D1, . . . , DN−1. At a time T0 the pin carries a control signal to device D0. At time T0+1 the pin carries a control signal for device D1, and so forth, so that at time T0+k the pin carries the control signal for device Dk. The control signals are delivered to a control signal dispersing device, not shown, that delivers the signals to the proper devices.
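The time-division schedule for the single control pin can be sketched as a mapping from time slot to device index; the function name is hypothetical, and the modular wraparound after N slots is an assumption about how the cycle repeats.

```python
def device_for_slot(t0: int, t: int, n_devices: int) -> int:
    """Single control pin serving devices D0..D(N-1): the signal carried
    on the pin at time t0 + k is destined for device D(k mod N), assuming
    the schedule repeats every N slots."""
    return (t - t0) % n_devices
```

The control signal dispersing device would evaluate this mapping to steer each time slot's bit to the proper device.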
In a third embodiment, also not shown, the pin that delivers data from line 512 to the network U 510 also passes control signals from network U to the external devices. In the third embodiment, the timing is arranged so that a time interval separates the last bit of one message and the first bit of a next message to allow the pin to carry data in the opposite direction. The second and third embodiments reduce the number of pins. In addition to the control signals from network U to the external devices, control signals connect from the external devices into network U. The purpose of the control signals is to guarantee that the external device input buffers do not overflow. In case the buffers have insufficient capacity to accept additional packets from network U, the external device 530 sends a signal via line 518 to network U to indicate the condition. In a simple embodiment, the signal, for example comprising a single bit, is sent when the device D input buffers have insufficient capacity to hold all the data that can be received in a single cycle through all of the lines 514 from network U 510 to device D 530. If a blocking signal is sent, the signal is broadcast to all of the nodes that are positioned to send data through lines 514. The two techniques for reducing pin count for the control signals out of network U can be used to reduce the pin count for signals into network U.
The Controlled Switch
Referring to
Data passes from devices 530 into the switch 520 in a single column through lines 522 and exits the switch 520 in multiple columns through lines 524 into the auxiliary switches AS 540 shown in
One method of controlling the traffic through switch S 520 is to send request packets through switch U 510, an effective method for many applications, including storage area network (SAN) applications. In another application involving parallel computing (including cluster computing), data through switch S is scheduled by a compiler that manages the computation. The system has the flexibility to enable a portion of the scheduled network to be controlled by the network U and a portion of the scheduled network to be controlled by a compiler.
The Auxiliary Output Switch
Referring to
In the illustrative example of
Many control algorithms are usable with the illustrative architecture. Algorithms can be implemented in hardware, software, or a combination of hardware and software.
Using Multiple Switches to Lower Pin Count
Referring to
While the present disclosure describes various embodiments, these embodiments are to be understood as illustrative and do not limit the claim scope. Many variations, modifications, additions and improvements of the described embodiments are possible. For example, those having ordinary skill in the art will readily implement the steps necessary to provide the structures and methods disclosed herein, and will understand that the process parameters, materials, and dimensions are given by way of example only. The parameters, materials, components, and dimensions can be varied to achieve the desired structure as well as modifications, which are within the scope of the claims. Variations and modifications of the embodiments disclosed herein may also be made while remaining within the scope of the following claims.
Number | Date | Country
---|---|---
60486883 | Jul 2003 | US