Interconnect network technology is a fundamental component of computational and communications products ranging from supercomputers to grid computing switches to a growing number of routers. However, characteristics of existing interconnect technology significantly limit the scalability of systems that rely on it.
An interconnect structure comprises a plurality of network-connected devices and a logic adapted to control a first subset of the network-connected devices to transmit data and simultaneously control a second subset of the network-connected devices to prepare for data transmission at a future time. The logic can execute an operation that activates a data transmission action upon realization of at least one predetermined criterion.
Embodiments of the illustrative systems and associated techniques, relating to both structure and method of operation, may best be understood by referring to the following description and accompanying drawings.
The disclosed structures and methods may be used to couple multiple devices using a plurality of interconnects and may be used for the controlled interconnection of devices over an optical or wireless medium. An aspect of the illustrative structures and methods involves control of a set of interconnection mediums wherein, at a given time, one subset of the interconnection mediums transmits data while another subset is set for transmission of data at a future time.
A wide variety of next-generation parallel computing and data storage systems may be implemented on a high-bandwidth, low-latency interconnect network capable of connecting an extremely large number of devices. Optical and wireless network fabrics enable a very high-bandwidth, large-port-count switch. However, these fabrics have not been widely employed in packet-based systems because conventional usage lacks an efficient management scheme. The present disclosure describes an efficient solution to the problem that is based on the Data Vortex™ switch described in related patents and applications 1, 2, 3, 6, 7, 13, 15, and 17.
References 8 and 10 show how the flow of telecommunication data through a switch fabric, including a stack of Data Vortex™ stair-step switch chips, can be managed by a system incorporating Data Vortex™ switches. References 11, 15, and 16 show how, in computing and storage area network systems, the flow of data through a collection of data carrying stair-step Data Vortex™ switch chips can be managed by another Data Vortex™ chip that carries control information. Reference 14 shows how the flow of data through a collection of optical telecommunication switches can be controlled by a system employing an electronic Data Vortex™ switch. The structures and methods disclosed herein describe how the flow of data through a collection of optical or wireless switches for computing and data management purposes can be managed by a system employing an electronic Data Vortex™ switch.
Referring to
One type of packet that may be used in operation of the system is a “request-to-send” data packet (RTS). The packet has multiple fields. In one illustrative embodiment, the request-to-send packet includes a field F1 that describes the data to be sent. The field F1 may point to the physical location of the data, may indicate the amount of data to be sent, or may give some other information that identifies the data to be sent. A field F2 can designate the target device for the data. In embodiments in which the devices have multiple input ports, a field F3 can indicate the target input port of the target device. A field F4 can be used to assign priority to the request. A field F5 designates one or more criteria that are to be realized to enable sending of the data. The criteria may include the time for the data to be transmitted by the sending device or the time that the data is to be received by the receiving device. In another mode of operation, the field F5 can indicate the earliest time that the receiving device will be prepared to receive the data.
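The field layout just described can be illustrated in a short sketch. The following Python fragment is a minimal, hypothetical rendering of the RTS fields F1 through F5; the names and types are illustrative assumptions, not a prescribed packet encoding.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RequestToSend:
    """Minimal sketch of the RTS packet fields F1-F5 described above."""
    data_descriptor: str              # F1: identifies the data, e.g. its physical location
    data_length: int                  # F1 (continued): amount of data to be sent
    target_device: int                # F2: target device for the data
    target_input_port: Optional[int]  # F3: target input port, if devices have several
    priority: int                     # F4: priority assigned to the request
    send_criteria: dict               # F5: criteria to be realized before sending,
                                      #     e.g. {"earliest_receive_time": t}
```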
The fields may be exploited in multiple ways. In a system wherein a device is scheduled to receive data at a designated time at a designated device input port and the receiving device has access to the designated time and the port information, the operation code prescribed for the incoming data may be embedded in the time and location fields. The RTS packet can be sent to a device through an unscheduled network or can be embedded in a long packet being sent to the device. In the latter case, the RTS may inform the receiving device what action to take after the long packet is received.
In a first example, the system can be used in a message-passing computing environment wherein the computational devices perform the same function on different data sets. In a general case, the processing times for the various data sets are not equal. When all of the processors have completed their tasks and reported to a master processor, the master processor sends RTS packets to all processors that are to send or receive data. The master processor has information relating to the status of all input ports and output ports of the computational devices. Therefore, for each packet to be sent, the associated RTS packet can designate the target input port of a target processor. In case a message longer than a single packet is to be sent, the entire stream of packets containing the message can be scheduled for sending in consecutive time intervals. The sending processor has the instruction from the RTS to send when a certain condition is satisfied, and the receiving processor has the instruction to be prepared to receive during the receiving time interval specified in the RTS packet.
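For a message longer than a single packet, scheduling in consecutive time intervals amounts to reserving a contiguous run of slots at both the sender and the receiver. A minimal sketch, with hypothetical slot numbering:

```python
def schedule_message(first_slot: int, num_packets: int) -> list[int]:
    """Reserve consecutive time intervals for a multi-packet message, as the
    master processor does when issuing the RTS packets for the message."""
    return list(range(first_slot, first_slot + num_packets))

# A 4-packet message scheduled to begin in slot 10 occupies slots 10..13;
# the sender transmits and the receiver listens in the same intervals.
slots = schedule_message(first_slot=10, num_packets=4)
assert slots == [10, 11, 12, 13]
```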
In a second, shared-memory example, a receiving processor sends an RTS packet to a sending processor requesting certain data to be sent as soon as possible. In case the receiving processor requests the data be sent through the controlled network, the receiving processor designates a target input port and holds that port open until the data has arrived. In case the receiving processor requests data through the uncontrolled network, the receiving processor does not indicate a receiving processor target input port. The data is sent by the sending processor as soon as all of the criteria in the RTS packet are realized. The criteria include the following: 1) the data is available at the sending processor, and 2) the sending processor has a free output port into the scheduled network. In case the data is transmitted over the controlled network, the receiving processor does not request another message be sent to the input port designated for the incoming data packet until that packet has begun to arrive. Once the data begins to arrive at the receiving processor, the receiving processor has information relating to when the transmission of the message is to end, and thus can make a request that data from another sending processor be sent to the same receiving port. In this case, one of the fields in the RTS packet designates the earliest time that the data can be accepted at this input port by the receiving processor. The model of computation in the second mode of operation may be possible using a parallel programming language such as UPC.
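The two sending criteria listed above can be made concrete in a small simulation. The following sketch is illustrative only; the class and method names are assumptions, not part of the disclosed system:

```python
class SendingProcessor:
    """Minimal sketch: holds an RTS request until both criteria are realized."""

    def __init__(self):
        self.ready_data = set()   # descriptors of data available locally
        self.free_ports = []      # free output ports into the scheduled network

    def try_send(self, rts):
        # Criterion 1: the requested data is available at the sending processor.
        if rts["data"] not in self.ready_data:
            return False
        # Criterion 2: a free output port into the scheduled network exists.
        if not self.free_ports:
            return False
        port = self.free_ports.pop()
        print(f"sending {rts['data']} to device {rts['target']} via port {port}")
        return True

p = SendingProcessor()
rts = {"data": "block42", "target": 7}
assert not p.try_send(rts)   # neither criterion realized yet: packet held
p.ready_data.add("block42")
p.free_ports.append(0)
assert p.try_send(rts)       # both criteria realized: data is sent
```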
In a third mode of operation, the flow of data among all or a subset of all devices is handled by a master processor that controls the time and location for sending and receiving of each packet. The model of computation enables streams of data to arrive at processors at the exact time that the data is used to perform the computations. The mode is enabled because the time of flight of messages is known in advance. The following small example illustrates the operation mode. A designated device DC is scheduled to receive data stream A from device DA through device DC data input port IPA, commencing at time t0 and ending at time tE. Device DC is also scheduled to receive data stream B from device DB through device DC data input port IPB, also commencing at time t0 and ending at time tF. Device DC is scheduled to perform a function on the streams A and B to produce a stream X that is scheduled to be transmitted to a given input port of another device DD, commencing at time tU and ending at time tV, where tU>t0. The device DD may also be scheduled to receive a plurality of data streams concurrently with the stream X. The method of systolic processing is enabled by the ability of the system to transmit multiple messages to a designated device with the arrival time of the various messages known because of the deterministic latency through the controlled network. The model of computation described in the third illustrative example can be enabled by extending a parallel language such as UPC to handle the scheduling of times.
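Because latency through the controlled network is deterministic, a proposed systolic schedule such as the example above can be checked for conflicts in advance. A minimal sketch, assuming reservations are expressed as (device, input port, start, end) tuples:

```python
def overlaps(a_start, a_end, b_start, b_end):
    # Two reservations conflict if their time intervals intersect.
    return a_start < b_end and b_start < a_end

def check_schedule(reservations):
    """reservations: list of (device, input_port, start, end) tuples.
    Returns True if no input port is double-booked."""
    for i, (dev1, port1, s1, e1) in enumerate(reservations):
        for dev2, port2, s2, e2 in reservations[i + 1:]:
            if dev1 == dev2 and port1 == port2 and overlaps(s1, e1, s2, e2):
                return False
    return True

# The example from the text: device DC receives stream A on port IPA over
# [t0, tE] and stream B on port IPB over [t0, tF]; both fit because they
# arrive on distinct input ports of DC.
t0, tE, tF = 0, 100, 120
assert check_schedule([("DC", "IPA", t0, tE), ("DC", "IPB", t0, tF)])
```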
The illustrative structures and methods enable a wide range of computation models.
Permission to send a packet from a device DA to a device DB through the controlled network is obtained by sending a request-to-send data packet RTS through the uncontrolled network to DB. In response to the request-to-send packet, device DB reserves an input line for the incoming data during the proper data receiving interval or intervals in case a message comprising multiple packets is sent.
The uncontrolled network manages traffic through the controlled network. The entire system works effectively because, in some embodiments, the Data Vortex™ is a building block of the uncontrolled network. In response to an RTS packet traveling through the uncontrolled network to a sending device DS, the sending device sends information that is used, along with information from other sending devices, to set the proper switches in the set of switches S0, S1, . . . , SK−1. As soon as the data passes through one of the switches SA, all devices may send switch setting information to switch SA. An entire message comprises PN packets that can be sent in contiguous order through the switches, with the first packet sent through SA, the second packet sent through SA+1, and so forth, until the last packet is sent through SA+PN−1. The illustrative subscripts are expressed modulo K.
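The contiguous, modulo-K rotation of a message through the switch stack reduces to a one-line index computation. A minimal sketch, with hypothetical function names:

```python
def switch_for_packet(a: int, packet_index: int, k: int) -> int:
    """Switch used by the packet at position packet_index (0-based) of a
    message whose first packet is sent through switch S_a, with subscripts
    expressed modulo K as described in the text."""
    return (a + packet_index) % k

# A 5-packet message entering at S_2 in a stack of K = 4 switches uses
# switches S_2, S_3, S_0, S_1, S_2 in contiguous order.
assert [switch_for_packet(2, p, 4) for p in range(5)] == [2, 3, 0, 1, 2]
```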
In one optical embodiment, switch SA has the topology of a stair-step Data Vortex™ switch. ESA, an electronic, stair-step Data Vortex™ copy of SA, uses copies of the headers of messages that are sent through the switch SA to determine how to set the nodes in SA. Nodes in the optical switch SA are then set to the same setting as the nodes in ESA. Nodes in the optical Data Vortex™ switch can be of a type that switches slowly, and are therefore relatively inexpensive and have low power requirements. In other embodiments, the switch SA is some other type of optical switch. While the switch SA is being set, data travels through the switches SA+1, SA+2, . . . , SK−1, S0, . . . , SA−1, with the subscripts expressed modulo K.
Devices 130 each have a plurality of input ports. Some of the input ports may be positioned to receive packets that pass through the uncontrolled switch 120, shown in
The switch 290 has two components. The first component is a Data Vortex™ switch DV 250 that receives data packets from the devices D0, D1, . . . , DN-1 on lines 272 and sends the data packets to the appropriate output line 274 as specified in the header of the packet. In the example illustrated, the leftmost input line 272 receives packets from device D0, the second from left input line receives packets from device D1, and so forth, so that the rightmost line receives packets from DN-1. Likewise, the output lines 274 from DV are ordered from left to right and send packets to the devices D0, D1, . . . , DN-1 respectively.
The second component of the system is a unit 260 that contains N−1 rows of switches 262, one row for each possible group G0, G1, . . . , GN−2, with the row associated with G0 at the top and the row associated with GN−2 at the bottom. Each row K, for 0≤K≤N−2, contains N−K switches, one switch for each possible member of group GK. Switches in each row are arranged in ascending order from left to right in device order. Lines 276 exiting the system from the component are also ordered from left to right and send packets to the devices D0, D1, . . . , DN−1 respectively. The rightmost line 274 passes through unit 260, sending packets directly to device DN−1 on the rightmost line 276. The first switch 262 on each row K is labeled gK and performs two simple functions: 1) gK sends each packet received down line 276 to device DK, and 2) gK examines the multicast bit in the header of the packet and sends the packet on line 278 to the next switch in the row, associated with device DK+1, only if the bit is turned on, for example equal to one. Other switches in row K also perform two simple functions: 1) if the switch is not the last switch in the row, the packet or a copy of the packet is sent to the switch to the right, and 2) if the group bit for the switch is set on, for example equal to one, the packet is sent on line 276 to the device associated with the switch. Group bits for the switches 262 are set by the multicast logic element previously discussed.
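The two functions of the row switches can be captured in a short simulation. The sketch below is illustrative and assumes the group bits have already been set by the multicast logic element:

```python
def route_row(k, n, multicast_bit, group_bits):
    """Simulate row K of unit 260: the first switch g_K always delivers to
    device D_K and forwards rightward only if the multicast bit is on;
    subsequent switches deliver to their device when that device's group
    bit is set, and pass the packet to the right otherwise.
    group_bits maps device index -> 0 or 1."""
    delivered = [k]                    # g_K always sends the packet to D_K
    if not multicast_bit:
        return delivered               # multicast bit off: no rightward copy
    for device in range(k + 1, n):     # remaining switches in row K
        if group_bits.get(device, 0) == 1:
            delivered.append(device)   # group bit on: deliver on line 276
    return delivered

# Row 1 of an N = 5 system with group bits set for D_2 and D_4 delivers
# the multicast packet to devices D_1, D_2, and D_4.
assert route_row(1, 5, multicast_bit=1, group_bits={2: 1, 4: 1}) == [1, 2, 4]
```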
In one embodiment, a separate switch chip is used to carry multicast messages through the uncontrolled switch. The electronic uncontrolled switch is therefore able to handle short multicast messages efficiently.
One method of multicasting longer messages in the controlled network includes building an optical version of the electronic switch illustrated in
Management of the system illustrated in
Permission to send a packet from a device DA to a device DB through the controlled network is obtained by sending a request-to-send data packet RTS through the uncontrolled network to DB. In response to the request-to-send packet, device DB reserves an input line for the incoming data during the proper data receiving interval or intervals in case a message comprising several packets is sent. In the tunable output laser embodiment, packets are sent in K different time slots and a designated device can simultaneously receive J data packets.
In a second optical embodiment illustrated by
Input ports 240 and output ports 230 of a device DM 130 are illustrated in
In an illustrative embodiment, N devices may include computing or data management devices. A device DA sends a short data packet to device DB via the uncontrolled network. In the present embodiment, the connection between the uncontrolled network and the devices may be a wireless connection. In some examples the uncontrolled network may be a Data Vortex™ network. Computing device data output ports DO 402 send data in the form of packets to the uncontrolled network data input device DI 404. In a simple embodiment, only one uncontrolled network S may be used and each computing device D may have a unique output port that sends data to switch S. In the simple embodiment, the uncontrolled switch S has N input devices with each input device tuned to receive data from a unique output transmitter of a sending device. In other embodiments, a computing device may have multiple output devices and correspondingly more input devices on an uncontrolled switch S. In still other embodiments, multiple uncontrolled networks may be implemented as described in incorporated references 1, 2, 3, 6, 7, 13, 15 and 17. A control signal input device CI 414 may be associated with each data output device 402. The Data Vortex™ switch has the ability to send a control signal from the control sending device CO 412 to a control signal input device CI 414. In case a control signal input device receives a blocking signal, the device informs an associated data sending device 402 not to transmit at a specific message packet transmission time.
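The effect of a blocking signal arriving at a control signal input device CI 414 can be sketched as a simple guard on each transmission time. The names below are hypothetical:

```python
class DataOutputDevice:
    """Sketch of a device output port DO paired with a control input CI.
    A blocking signal received on CI suppresses transmission at the
    specified message packet transmission time."""

    def __init__(self):
        self.blocked_times = set()

    def control_input(self, blocked_time):
        # CI received a blocking signal for this transmission time.
        self.blocked_times.add(blocked_time)

    def transmit(self, packet, at_time):
        if at_time in self.blocked_times:
            return False               # do not transmit at a blocked time
        print(f"t={at_time}: packet {packet} sent to the uncontrolled network")
        return True

out = DataOutputDevice()
out.control_input(blocked_time=3)
assert out.transmit("p0", at_time=2)      # unblocked time slot
assert not out.transmit("p1", at_time=3)  # suppressed by the blocking signal
```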
In the uncontrolled portion of the network each switch input port 404 may be paired with a specific device output port 402 and the uncontrolled network operates as if the computing devices are hard-wired to the uncontrolled network. The Data Vortex™ switch has the ability to send multiple messages to the same receiving device, and therefore, the uncontrolled Data Vortex™ switch has multiple data output devices DO 422, each tuned to send data to a specific data input device DI 424 of a device DM 130.
As in the other embodiments, data may be scheduled for sending through the controlled network. In a case in which a receiving device DR is scheduled to receive information from a sending device DS when a certain criterion is met, prior to transmission of the packet the receiving device DR tunes one of its data input devices DI 434 to a pre-arranged frequency of the data output device DO 432 of the sending device DS.
Referring to
In certain embodiments described herein, devices have a single output or input port which is capable of processing packets during each time interval. In alternate embodiments, multiple output or input ports of the type may be employed. In some embodiments described herein, devices have K inputs or outputs that process data, with only one input or output processing data at a given time. In alternate embodiments, the devices have K·J inputs with the device capable of processing data through J inputs at a designated time. Other modifications may be implemented to design a wide variety of systems using the techniques taught in the present description.
In
The structures and operating methods disclosed herein have an error correction capability for correcting errors in payloads of data packet segments and for correcting errors resulting from misrouted data packet sub-segments. In some embodiments, the illustrative system performs error correction for data packet segments that are routed through stacks of networks, including network stacks with individual networks in the stack having the stair-step configuration depicted in
Various embodiments of the disclosed system correct errors in data packet segments that are routed through stacks of networks with individual networks in the stack having the stair-step design illustrated in
Some of the illustrative structures and operating methods correct errors occurring in systems that decompose data packet segments into sub-segments when a sub-segment fails to exit through an output port of a stair-step interconnect structure, for example when the sub-segment is discarded by the switch. Various embodiments can correct errors for packets entering request and answer switches disclosed in
The network has two kinds of transmission paths: one for data, and another for control information. In an illustrative embodiment, all nodes in the network may have the same design. In other embodiments, the nodes may have mutually different designs and characteristics. A node accepts data from a node on the same cylinder or from a cylinder outward from the node's cylinder, and sends data to a node on the same cylinder or to a cylinder inward from the node's cylinder. Messages move in uniform rotation around the central axis in the sense that the first bit of a message at a given level uniformly moves around the cylinder. When a message bit moves from a cylinder to a more inward cylinder, the message bits synchronize exactly with messages at the inward cylinder. Data can enter the interconnect or network at one or more columns or angles, and can exit at one or more columns or angles, depending upon the application or embodiment.
A node sends control information to a more outward positioned cylinder and receives control information from a more inward positioned cylinder. Control information is transmitted to a node at the same angle or column. Control information is also transmitted from a node on the outermost cylinder to an input port to notify the input port when a node on the outermost cylinder that is capable of receiving a message from the input port is unable to accept the message. Similarly, an output port can send control information to a node on the innermost cylinder whenever the output port cannot accept data. In general, a node on any cylinder sends a control signal to inform a node or input port that the control signal sending node cannot receive a message. A node receives a control signal from a node on a more inward positioned cylinder or an output port. The control signal informs the recipient of the control signal whether the recipient may send a message to a third node on a cylinder more inward from the cylinder of the recipient node.
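The direction rules for data and control traffic can be summarized as small predicates. The sketch below numbers cylinders so that 0 is outermost and larger indices are more inward, an assumption made purely for illustration:

```python
def data_move_allowed(from_cylinder, to_cylinder):
    # A node accepts data from the same cylinder or from a more outward
    # cylinder, and sends data to the same cylinder or more inward.
    return to_cylinder >= from_cylinder

def control_move_allowed(from_node, to_node):
    # Control information travels outward, to a node at the same angle.
    (from_cyl, from_angle), (to_cyl, to_angle) = from_node, to_node
    return to_cyl < from_cyl and to_angle == from_angle

assert data_move_allowed(0, 1)                  # inward data move: allowed
assert not data_move_allowed(2, 1)              # outward data move: not allowed
assert control_move_allowed((2, 5), (1, 5))     # outward control, same angle
assert not control_move_allowed((2, 5), (1, 6)) # angle changes: not allowed
```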
In the network shown in
In U.S. Pat. No. 5,996,020 the terms “cylinder” and “angle” are used in reference to position. These terms are analogous to “level” and “column,” respectively, used in U.S. Pat. No. 6,289,021, and in the present description. Data moves horizontally or diagonally from one cylinder to the next, and control information is sent outward to a node at the same angle.
In another embodiment of the stair-step interconnect, multicasting of messages is supported by the use of multiple headers for a single payload. Multicasting occurs when a payload from a single input port is sent to multiple output ports during one time cycle. Each header specifies the target address for the payload, and the address can be any output port. The rule that no output port can receive a message from more than one input port during the same cycle is still observed. The first header is processed as described hereinbefore and the control logic sets an internal latch which directs the flow of the subsequent payload. Immediately following the first header, a second header follows the path of the first header until reaching a cell where the address bits determinative of the route for that level are different. Here the second header is routed in a different direction than the first. An additional latch in the cell represents and controls a bifurcated flow out of the cell. Stated differently, the second header follows the first header until the address indicates a different direction and the cell makes connections such that subsequent traffic exits the cell in both directions. Similarly, a third header follows the path established by the first two until the header bit determinative for the level indicates branching in a different direction. When a header moves left to right through a cell, the header always sends a busy signal upward indicating an inability to receive a message from above.
The rule is always followed for the first, second, and any other headers. Stated differently, when a cell sends a busy signal upward, the control signal is maintained until all headers are processed, preventing a second header from attempting to use the path established by a first header. The number of headers permitted is a function of timing signals, which can be external to the chip. The multicasting embodiment of the stair-step interconnect can accommodate messages with one, two, three or more headers at different times under control of an external timing signal. Messages that are not multicast have only a single header followed by an empty header, for example all zeros, in the place of the second and third headers. Once all the headers in a cycle are processed, the payload immediately follows the last header, as discussed hereinabove. In other embodiments, multicasting is accomplished by including a special multicast flag in the header of the message and sending the message to a target output that in turn sends copies of the message to a set of destinations associated with said target output.
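The latch behavior can be sketched for a single cell. The following illustration assumes, purely for concreteness, that each header carries one bit that is determinative at this cell's level, with 0 routing straight and 1 routing downward:

```python
class MulticastCell:
    """Sketch of one stair-step cell processing multiple headers.
    Each header's determinative bit for this level selects an exit;
    latches record every exit taken so the payload is later replicated
    out of all latched exits (bifurcated flow)."""

    def __init__(self):
        self.latched_exits = set()

    def process_header(self, determinative_bit):
        exit_dir = "down" if determinative_bit else "straight"
        self.latched_exits.add(exit_dir)   # latch the route for the payload
        return exit_dir

cell = MulticastCell()
cell.process_header(0)   # first header routes straight and sets a latch
cell.process_header(0)   # second header follows the path of the first
cell.process_header(1)   # third header branches downward: bifurcated flow
# The payload now exits the cell in both latched directions.
assert cell.latched_exits == {"straight", "down"}
```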
While the present disclosure describes various embodiments, these embodiments are to be understood as illustrative and do not limit the claim scope. Many variations, modifications, additions and improvements of the described embodiments are possible. For example, those having ordinary skill in the art will readily implement the steps necessary to provide the structures and methods disclosed herein, and will understand that the process parameters, materials, and dimensions are given by way of example only. The parameters, materials, components, and dimensions can be varied to achieve the desired structure as well as modifications, which are within the scope of the claims. Variations and modifications of the embodiments disclosed herein may also be made while remaining within the scope of the following claims.
The disclosed system and operating method are related to subject matter disclosed in the following patents and patent applications, which are incorporated by reference herein in their entirety:
1. U.S. Pat. No. 5,996,020 entitled, “A Multiple Level Minimum Logic Network”, naming Coke S. Reed as inventor;
2. U.S. Pat. No. 6,289,021 entitled, “A Scaleable Low Latency Switch for Usage in an Interconnect Structure”, naming John Hesse as inventor;
3. U.S. Pat. No. 6,754,207 entitled, “Multiple Path Wormhole Interconnect”, naming John Hesse as inventor;
4. U.S. Pat. No. 6,687,253 entitled, “Scalable Wormhole-Routing Concentrator”, naming John Hesse and Coke Reed as inventors;
5. U.S. patent application Ser. No. 09/693,603 entitled, “Scaleable Interconnect Structure for Parallel Computing and Parallel Memory Access”, naming John Hesse and Coke Reed as inventors;
6. U.S. patent application Ser. No. 09/693,358 entitled, “Scalable Interconnect Structure Utilizing Quality-Of-Service Handling”, naming Coke Reed and John Hesse as inventors;
7. U.S. patent application Ser. No. 09/692,073 entitled, “Scalable Method and Apparatus for Increasing Throughput in Multiple Level Minimum Logic Networks Using a Plurality of Control Lines”, naming Coke Reed and John Hesse as inventors;
8. U.S. patent application Ser. No. 09/919,462 entitled, “Means and Apparatus for a Scaleable Congestion Free Switching System with Intelligent Control”, naming John Hesse and Coke Reed as inventors;
9. U.S. patent application Ser. No. 10/123,382 entitled, “A Controlled Shared Memory Smart Switch System”, naming Coke S. Reed and David Murphy as inventors;
10. U.S. patent application Ser. No. 10/289,902 entitled, “Means and Apparatus for a Scaleable Congestion Free Switching System with Intelligent Control II”, naming Coke Reed and David Murphy as inventors;
11. U.S. patent application Ser. No. 10/798,526 entitled, “Means and Apparatus for a Scalable Network for Use in Computing and Data Storage Management”, naming Coke Reed and David Murphy as inventors;
12. U.S. patent application Ser. No. 10/866,461 entitled, “Means and Apparatus for Scalable Distributed Parallel Access Memory Systems with Internet Routing Applications”, naming Coke Reed and David Murphy as inventors;
13. U.S. patent application Ser. No. 10/887,762 entitled, “Means and Apparatus for a Self-Regulating Interconnect Structure”, naming Coke Reed as inventor;
14. U.S. patent application Ser. No. ______ entitled, “Means and Apparatus for a Scaleable Congestion Free Switching System with Intelligent Control III”, naming John Hesse, Coke Reed and David Murphy as inventors;
15. U.S. patent application Ser. No. ______ entitled, “Highly Parallel Switching Systems Utilizing Error Correction”, naming Coke Reed and David Murphy as inventors;
16. U.S. patent application Ser. No. ______ entitled, “Highly Parallel Switching Systems Utilizing Error Correction II”, naming Coke Reed and David Murphy as inventors; and
17. U.S. patent application Ser. No. ______ entitled, “Apparatus for Interconnecting Multiple Devices to a Synchronous Device”, naming Coke Reed as inventor.
Provisional application: No. 60/638,068, Dec. 2004, US.