Claims
- 1. Apparatus for performing collective reductions, broadcasts, and point-to-point message passing during parallel algorithm operations executing in a computing structure comprising a plurality of processing nodes, said apparatus comprising:
a global tree network including routing devices interconnecting said nodes in a tree configuration, said tree configuration including one or more virtual tree networks thereof, said global tree network enabling global processing operations including one or more of: global broadcast operations downstream from a root node to leaf nodes of specified virtual tree networks, global reduction operations upstream from leaf nodes to root node in said virtual tree, and point-to-point message passing from any node of said virtual tree to the root node of said virtual tree as required, wherein said global tree network and routing device configuration are optimized for providing low-latency communications in said computing structure.
- 2. The apparatus as claimed in claim 1, wherein said computing structure includes a plurality of processing nodes interconnected to form a first network, said one or more virtual tree networks and said first network are collaboratively or independently utilized according to bandwidth and latency requirements of a parallel algorithm for optimizing parallel algorithm processing performance.
- 3. The apparatus as claimed in claim 1, wherein a root node of a virtual tree network functions as an I/O node including a high speed connection to an external system, said I/O node performing I/O operations for that virtual tree network independent of processing performed in said first network.
- 4. The apparatus as claimed in claim 3, wherein each said router includes input devices for receiving packets from other nodes of a virtual tree, output devices for forwarding packets to other nodes of said tree, a local injection device for injecting packets into said tree, and a local reception device for removing packets from said tree, said apparatus further including means for configuring said router to either participate or not participate in said virtual tree.
- 5. The apparatus as claimed in claim 4, wherein said means for configuring said router further specifies participation of said node as a root of a virtual tree for reduction operations.
- 6. The apparatus as claimed in claim 5, wherein said means for configuring said router further specifies participation of input devices and the local injection device for providing operands during reduction operations.
- 7. The apparatus as claimed in claim 6, wherein said router further includes means for computing a specified reduction operation on packet contents received by contributing input devices and the local injection device if it is contributing, and means for causing transmission of computation results to that node's upstream parent node via said output device.
- 8. The apparatus as claimed in claim 7, wherein said virtual tree network is programmed for recursively causing a global combined result to be computed up said virtual tree for completion as a single packet at said root node.
- 9. The apparatus as claimed in claim 8, further including means for broadcasting a single, combined packet at said root to each of the participating children configured to contribute operands to reductions on that virtual tree.
- 10. The apparatus as claimed in claim 3, further including a mechanism for enabling compute nodes to send point-to-point packets to an I/O node at the root of a virtual tree that are destined for an external system via said high-speed connection.
- 11. The apparatus as claimed in claim 9, further comprising a filter mechanism for controlling reception of broadcast packets at nodes of a virtual tree, said reception being based upon a node address and participation in said virtual tree.
- 12. The apparatus as claimed in claim 11, wherein each node includes an address, said system further comprising programmable means enabling point-to-point messaging among nodes of each said virtual tree, said address enabling an external host system to directly communicate to every node or a subset of the nodes.
- 13. The apparatus as claimed in claim 9, further including a mechanism for generating a hardware interrupt to a processor of a processing node based on the contents of a packet received by the local reception device.
- 14. The apparatus as claimed in claim 9, further including a mechanism for blocking unnecessary downtree traffic on each virtual tree independently.
- 15. The apparatus as claimed in claim 11, further comprising a mechanism for providing flow control between routers when communicating packets.
- 16. The apparatus as claimed in claim 15, further comprising means enabling broadcasting of packets on individual downstream links decoupled from said flow control mechanism to perform aggressive broadcasting.
- 17. The apparatus as claimed in claim 2, wherein said first network includes an n-dimensional torus, where n is greater than or equal to one.
- 18. A method for performing collective reductions, broadcasts, and message passing during parallel algorithm operations executing in a computer structure having a plurality of interconnected processing nodes, said method comprising:
providing router devices for interconnecting said nodes via links according to a global tree network structure, said tree structure including one or more virtual sub-tree structures; and, enabling low-latency global processing operations to be performed at nodes of said virtual tree structures, said global operations including one or more of: global broadcast operations downstream from a root node to leaf nodes of specified virtual sub-tree networks, global reduction operations upstream from leaf nodes to root node in said tree, and point-to-point message passing from any node of said virtual tree to the root node of said virtual tree as required when performing said parallel algorithm operations.
- 19. The method as claimed in claim 18, wherein said computing structure includes a plurality of processing nodes interconnected to form a first network, said method further including the step of collaboratively or independently utilizing said global tree network and first network in accordance with bandwidth and latency requirements of a parallel algorithm for optimizing parallel algorithm processing performance.
- 20. The method as claimed in claim 18, wherein a root node of each virtual tree network functions as an I/O node including a high-speed connection to an external system, said method including the step of performing node I/O operations for that virtual tree network independent of operations performed in said first network.
- 21. The method as claimed in claim 20, wherein each said router includes input devices for receiving packets from other nodes of a virtual tree, output devices for forwarding packets to other nodes of said tree, a local injection device for injecting packets into said tree, and a local reception device for removing packets from said tree, said method further including the step of configuring said router to either participate or not participate in a virtual tree.
- 22. The method as claimed in claim 20, wherein said router configuring step further comprises the step of specifying participation of a node as a root of a virtual tree when performing reduction operations.
- 23. The method as claimed in claim 22, wherein said router configuring step further comprises one or more steps of:
specifying participation of said processing node coupled to said router for injecting operands during reduction operations; and, specifying participation of said processing node coupled to said router for injecting operands during reduction operations.
- 24. The method as claimed in claim 23, further comprising the steps of: configuring said router to compute a specified reduction operation on packet contents received from contributing children nodes and said processing node, and causing transmission of computation results to that node's upstream parent node via an output device.
- 25. The method as claimed in claim 24, further including the step of recursively causing a global combined result to be computed up said virtual tree for completion as a single packet at said root node.
- 26. The method as claimed in claim 25, further including the step of broadcasting a single, combined packet at said root to each of the participating children configured to contribute operands to reductions on that virtual tree.
- 27. The method as claimed in claim 20, further including the step of enabling compute nodes to send point-to-point packets to an I/O node at the root of a virtual tree that are destined for an external system via said high-speed connection.
- 28. The method as claimed in claim 26, further comprising the step of controlling reception of broadcast packets at nodes of a virtual tree, said reception being based upon said address of said node and its participation in said virtual tree.
- 29. The method as claimed in claim 28, wherein each node includes an address, said method further comprising the step of enabling point-to-point and sub-tree messaging among nodes of each said virtual tree, said address enabling a host system to directly communicate to every node or a subset of the nodes.
- 30. The method as claimed in claim 26, further including the step of: generating a hardware interrupt to a processor of a processing node based on the contents of a packet received by the local reception device.
- 31. The method as claimed in claim 26, further including the step of: independently blocking unnecessary downtree traffic on each virtual tree.
- 32. The method as claimed in claim 28, further comprising the step of providing flow control between routers when communicating packets.
- 33. The method as claimed in claim 32, further comprising the step of enabling aggressive broadcasting of packets on individual downstream links by decoupling said flow control mechanism.
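The reduce-then-broadcast flow recited in claims 24 to 26 (and in apparatus claims 7 to 9) can be illustrated with a minimal software sketch. This is only a model under stated assumptions, not the claimed hardware: the `Node` class, its fields, and the functions `reduce_up`, `broadcast_down`, and `all_reduce` are hypothetical names introduced here, a node is treated as contributing whenever its `operand` is set, and integer operands stand in for packet contents.

```python
from dataclasses import dataclass, field
from functools import reduce
from typing import Callable, List, Optional


@dataclass
class Node:
    """Hypothetical model of one processing node and its router."""
    name: str
    operand: Optional[int] = None        # locally injected value; None = not contributing
    children: List["Node"] = field(default_factory=list)
    received: Optional[int] = None       # value delivered by the broadcast, if any


def reduce_up(node: Node, op: Callable[[int, int], int]) -> Optional[int]:
    """Combine the partial results of the children with the local operand and
    return the partial result this node would forward to its parent."""
    parts = [reduce_up(child, op) for child in node.children]
    if node.operand is not None:
        parts.append(node.operand)
    parts = [p for p in parts if p is not None]  # non-contributors add nothing
    return reduce(op, parts) if parts else None


def broadcast_down(node: Node, value: int) -> None:
    """Deliver the combined result to every node that contributed an operand."""
    if node.operand is not None:
        node.received = value
    for child in node.children:
        broadcast_down(child, value)


def all_reduce(root: Node, op: Callable[[int, int], int]) -> Optional[int]:
    """Reduce up to the root, then broadcast the single combined result down."""
    result = reduce_up(root, op)
    if result is not None:
        broadcast_down(root, result)
    return result


# Example tree: root has children a and b; a has children c and d.
leaf_c = Node("c", operand=3)
leaf_d = Node("d")                       # participates in routing only
node_a = Node("a", operand=2, children=[leaf_c, leaf_d])
node_b = Node("b", operand=4)
root = Node("root", operand=1, children=[node_a, node_b])

total = all_reduce(root, lambda x, y: x + y)
```

In this sketch the combined value is completed at the root as a single result and only then sent back down, which mirrors the recursive structure of the claims; the choice of addition as the reduction operation is arbitrary, and any associative binary operation could be passed instead.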
CROSS-REFERENCE TO RELATED APPLICATION
[0001] The present invention claims the benefit of commonly-owned, co-pending U.S. Provisional Patent Application Serial No. 60/271,124 filed Feb. 24, 2001 entitled MASSIVELY PARALLEL SUPERCOMPUTER, the whole contents and disclosure of which is expressly incorporated by reference herein as if fully set forth herein. This patent application is additionally related to the following commonly-owned, co-pending United States Patent Applications filed on even date herewith, the entire contents and disclosure of each of which is expressly incorporated by reference herein as if fully set forth herein: U.S. patent application Ser. No. (YOR920,020,027US1, YOR920,020,044US1 (15270)), for “Class Networking Routing”; U.S. patent application Ser. No. (YOR920,020,028US1 (15271)), for “A Global Tree Network for Computing Structures”; U.S. patent application Ser. No. (YOR920,020,029US1 (15272)), for “Global Interrupt and Barrier Networks”; U.S. patent application Ser. No. (YOR920,020,030US1 (15273)), for “Optimized Scalable Network Switch”; U.S. patent application Ser. Nos. (YOR920,020,031US1, YOR920,020,032US1 (15258)), for “Arithmetic Functions in Torus and Tree Networks”; U.S. patent application Ser. Nos. (YOR920,020,033US1, YOR920,020,034US1 (15259)), for “Data Capture Technique for High Speed Signaling”; U.S. patent application Ser. No. (YOR920,020,035US1 (15260)), for “Managing Coherence Via Put/Get Windows”; U.S. patent application Ser. Nos. (YOR920,020,036US1, YOR920,020,037US1 (15261)), for “Low Latency Memory Access And Synchronization”; U.S. patent application Ser. No. (YOR920,020,038US1 (15276)), for “Twin-Tailed Fail-Over for Fileservers Maintaining Full Performance in the Presence of Failure”; U.S. patent application Ser. No. (YOR920,020,039US1 (15277)), for “Fault Isolation Through No-Overhead Link Level Checksums”; U.S. patent application Ser. No. (YOR920,020,040US1 (15278)), for “Ethernet Addressing Via Physical Location for Massively Parallel Systems”; U.S. patent application Ser. No. (YOR920,020,041US1 (15274)), for “Fault Tolerance in a Supercomputer Through Dynamic Repartitioning”; U.S. patent application Ser. No. (YOR920,020,042US1 (15279)), for “Checkpointing Filesystem”; U.S. patent application Ser. No. (YOR920,020,043US1 (15262)), for “Efficient Implementation of Multidimensional Fast Fourier Transform on a Distributed-Memory Parallel Multi-Node Computer”; U.S. patent application Ser. No. (YOR9-20010211 US2 (15275)), for “A Novel Massively Parallel Supercomputer”; and U.S. patent application Ser. No. (YOR920,020,045US1 (15263)), for “Smart Fan Modules and System”.
PCT Information
Filing Document | Filing Date | Country | Kind
PCT/US02/05586 | 2/25/2002 | WO |