High performance computing system

Information

  • Patent Grant
  • Patent Number
    11,625,393
  • Date Filed
    Wednesday, February 5, 2020
  • Date Issued
    Tuesday, April 11, 2023
  • CPC
    • G06F16/244
    • G06F16/214
    • G06F16/2237
    • G06F16/2246
  • Field of Search
    • US
    • 707/696
    • CPC
    • G06F16/244
    • G06F16/2237
    • G06F16/214
    • G06F16/2246
  • International Classifications
    • G06F16/242
    • G06F16/21
    • G06F16/22
    • G06F16/11
Abstract
A method including providing a SHARP tree including a plurality of data receiving processes and at least one aggregation node, designating a data movement command, providing a plurality of data input vectors to each of the plurality of data receiving processes, respectively, the plurality of data receiving processes each passing on the respective received data input vector to the at least one aggregation node, and the at least one aggregation node carrying out the data movement command on the received plurality of data input vectors. Related apparatus and methods are also provided.
Description
FIELD OF THE INVENTION

The present invention relates to high performance computing systems in general.


BACKGROUND OF THE INVENTION

In high performance computing (HPC) systems many applications are written in a manner that requires communication between the systems which perform portions of the work (termed herein “processes”).


Part of the communication comprises collective operations such as (by way of non-limiting example) computing the sum of multiple vectors (an element-wise add operation) contributed by multiple processes, one vector per process, and sending a copy of the resulting vector to all of the participating processes; this operation is called “all-reduce”. Another non-limiting example would be sending the result to only one of the processes; this operation is called “reduce”.


In addition to the compute operations (that is, reduce and all-reduce, which involve mathematical operators), there are data movement commands such as (by way of non-limiting example) all2all, gather, all gather, gather v, all gather v, scatter, etc. Commands of this type are defined in the well-known Message Passing Interface (MPI) specification and in other communication Application Programming Interface definitions.
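
For reference, the following are the C bindings of the gather-family commands as defined in the MPI specification (reproduced here for convenience; only the “v” variants accept per-process receive counts and displacements):

    /* Gather-family collectives as declared in <mpi.h>, per the MPI specification.
     * The "v" (vector) variants take per-rank receive counts and displacements,
     * allowing each process to contribute a different amount of data. */
    int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                   void *recvbuf, int recvcount, MPI_Datatype recvtype,
                   int root, MPI_Comm comm);
    int MPI_Gatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                    void *recvbuf, const int recvcounts[], const int displs[],
                    MPI_Datatype recvtype, int root, MPI_Comm comm);
    int MPI_Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                      void *recvbuf, int recvcount, MPI_Datatype recvtype,
                      MPI_Comm comm);
    int MPI_Allgatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                       void *recvbuf, const int recvcounts[], const int displs[],
                       MPI_Datatype recvtype, MPI_Comm comm);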


SUMMARY OF THE INVENTION

The present invention, in certain embodiments thereof, seeks to provide an improved system and method for high performance computing.


There is thus provided in accordance with an exemplary embodiment of the present invention a method including providing a SHARP tree including a plurality of data receiving processes and at least one aggregation node, designating a data movement command, providing a plurality of data input vectors to each of the plurality of data receiving processes, respectively, the plurality of data receiving processes each passing on the respective received data input vector to the at least one aggregation node, and the at least one aggregation node carrying out the data movement command on the received plurality of data input vectors.


Further in accordance with an exemplary embodiment of the present invention the data movement command includes one of the following: gather, all gather, gather v, and all gather v.


Still further in accordance with an exemplary embodiment of the present invention the at least one aggregation node produces an output vector.


Additionally in accordance with an exemplary embodiment of the present invention at least one of the plurality of data input vectors includes a sparse vector.


Moreover in accordance with an exemplary embodiment of the present invention the at least one aggregation node utilizes a SHARP protocol.


There is also provided in accordance with another exemplary embodiment of the present invention apparatus including a SHARP tree including a plurality of data receiving processes and at least one aggregation node, the SHARP tree being configured to perform the following: receiving a data movement command, receiving a plurality of data input vectors to each of the plurality of data receiving processes, respectively, the data receiving processes each passing on the respective received data input vector to the at least one aggregation node, and the at least one aggregation node carrying out the data movement command on the received plurality of data input vectors.


Further in accordance with an exemplary embodiment of the present invention the data movement command includes one of the following: gather, all gather, gather v, and all gather v.


Still further in accordance with an exemplary embodiment of the present invention the at least one aggregation node is configured to produce an output vector.


Additionally in accordance with an exemplary embodiment of the present invention at least one of the plurality of data input vectors includes a sparse vector.


Moreover in accordance with an exemplary embodiment of the present invention the at least one aggregation node utilizes a SHARP protocol.





BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be understood and appreciated more fully from the following detailed description, taken in conjunction with the drawings in which:



FIG. 1 is a simplified pictorial illustration of a high-performance computing system constructed and operative in accordance with an exemplary embodiment of the present invention;



FIG. 2 is a simplified pictorial illustration of a high-performance computing system constructed and operative in accordance with an alternative exemplary embodiment of the present invention;



FIG. 3A is a simplified pictorial illustration showing distribution of data, in a system similar to that of FIG. 1, to all processes; and



FIG. 3B is a simplified pictorial illustration showing distribution of data, in a system similar to that of FIG. 1, to a single process.





DETAILED DESCRIPTION OF AN EMBODIMENT
General Introduction

By way of introduction (but without limiting the generality of the present application), the concept behind exemplary embodiments of the present invention is to use the SHARP protocol to accelerate at least the gather, all gather, gather v, and all gather v operations.


The SHARP algorithm/protocol is described in US Published Patent Application 2017/0063613 of Bloch et al, the disclosure of which is hereby incorporated herein by reference.


In certain exemplary embodiments of the present invention, computations which may be described herein as if they occur in a serial manner may be executed in such a way that computations already occur while data is still being received.


The inventors of the present invention believe that current methodology for addressing the above-mentioned scenarios is to use software algorithms in order to perform the operations mentioned above. The software algorithms involve a large amount of data transfer/movement. Significant overhead may also be generated on a CPU which manages the algorithms, since multiple packets, each with a small amount of data, are sent; sometimes identical such packets are sent multiple times to multiple destinations, creating additional packet/bandwidth overhead, including header overhead. In addition, such software algorithms may create large latency when a large number of processes are involved.


In exemplary embodiments of the present invention, one goal is to offload work from the management CPU/s by simplification of the process using the SHARP algorithm/protocol (referred to above and described in US Published Patent Application 2017/0063613 of Bloch et al, the disclosure of which has been incorporated herein by reference); latency as well as bandwidth consumption may also be reduced in such exemplary embodiments.


Embodiment Description

In exemplary embodiments of the present invention, the SHARP algorithm/protocol (referred to above) is used to implement at least: gather; gather v; all gather; and all gather v.


In general, in exemplary embodiments of the present invention, the gather operation is treated as a regular reduction operation by using a new data format supporting sparse representation. When using the sparse representation, each process/aggregation point sends its data to the SHARP network, while allowing the aggregation to move forward; this is different from a regular aggregation operation, which assumes that each one of the processes/aggregation points contributes exactly the same vector size.


It is appreciated that, in certain exemplary embodiments, within the SHARP network, a sparse representation may be converted to a dense representation; in certain cases, a dense representation may be able to be processed with greater efficiency. It is also appreciated that, in such a case, both sparse and dense representations may co-exist simultaneously in different points within the SHARP network.
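
The text above does not specify how such a conversion is performed; purely as an illustration, the following is a minimal sketch of one possible sparse-to-dense conversion, assuming (for this sketch only) fixed-size per-index payloads so that a dense buffer can be addressed directly by index:

    #include <stdint.h>
    #include <string.h>

    /* Illustrative sparse-to-dense conversion: each index owns a fixed-size
     * slot in a dense buffer, so later processing can address slot i directly.
     * ELEM_BYTES and the layout are assumptions of this sketch only. */
    #define ELEM_BYTES 8

    void sparse_to_dense(const uint32_t *indexes, const uint8_t *payloads,
                         size_t nelems,
                         uint8_t *dense /* capacity: n_indexes * ELEM_BYTES */)
    {
        for (size_t i = 0; i < nelems; i++)
            memcpy(dense + (size_t)indexes[i] * ELEM_BYTES,
                   payloads + i * ELEM_BYTES, ELEM_BYTES);
    }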


In exemplary embodiments of the present invention, the following protocol updates are made relative to the SHARP protocol (see US Published Patent Application 2017/0063613 of Bloch et al, the disclosure of which has been incorporated herein by reference):

    • The SHARP operation header carries an indication that the SHARP payload is sparse
    • The SHARP operation header uses a “sum” operation, to eliminate the need to create a new operation (a schematic sketch of such a header follows this list)
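
The actual SHARP header layout is not reproduced in this description; purely as an illustration of the two updates listed above, a header might carry the existing operation code (reusing “sum”) plus a flag marking the payload as sparse. The field names and widths below are assumptions of this sketch:

    #include <stdint.h>

    /* Hypothetical header fields illustrating the two updates above; the real
     * SHARP header layout, field names and widths are not given in this text. */
    enum sharp_op { SHARP_OP_SUM = 0 /* existing "sum" operation is reused */ };

    struct sharp_op_header {
        uint8_t  op;             /* aggregation operation, e.g. SHARP_OP_SUM  */
        uint8_t  payload_sparse; /* non-zero: payload uses the sparse format  */
        uint16_t reserved;
    };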


The following is one possible non-limiting example of an appropriate data format useable in exemplary embodiments of the present invention:


The format of each element is:

    index | data size in bytes | data[ ]

where data[ ] represents a list of data elements, each of which can be one or more bytes or bits. A special index value is reserved to mark the end of the data.
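
For illustration, such an element might be laid out as follows; the field widths and the reserved end-of-data index value are assumptions of this sketch, not part of the described format:

    #include <stdint.h>

    /* One sparse element: index, data size in bytes, then the data itself.
     * Field widths and the end-of-data sentinel value are illustrative only. */
    #define SPARSE_INDEX_END UINT32_MAX  /* reserved index marking end of data */

    struct sparse_element {
        uint32_t index;      /* e.g., the contributing process's rank_id */
        uint32_t data_size;  /* number of payload bytes that follow      */
        uint8_t  data[];     /* data_size bytes (or packed bits) of data */
    };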


In exemplary embodiments, the following is an example of SHARP protocol behavior when using sparse data: an aggregation point examines the data that arrives. If an index in the arriving data matches an index already present in the result, the aggregation point implements the operation that is specified in the operation header; for an index that does not yet appear in the result, the aggregation point adds that single index's data to the result vector (that is, concatenates that data to the result vector).
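
The following is a schematic sketch of this behavior, reusing the illustrative struct sparse_element above; the container type, the bound, and the byte-wise “sum” are assumptions of the sketch, not the SHARP wire protocol itself:

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_RESULT_ELEMS 1024   /* illustrative bound for this sketch */

    struct result_vector {
        struct sparse_element *elems[MAX_RESULT_ELEMS];
        size_t count;
    };

    static struct sparse_element *find_index(struct result_vector *r, uint32_t idx)
    {
        for (size_t i = 0; i < r->count; i++)
            if (r->elems[i]->index == idx)
                return r->elems[i];
        return NULL;
    }

    /* Fold one arriving element into the partial result. */
    static void aggregate(struct result_vector *r, struct sparse_element *in)
    {
        struct sparse_element *match = find_index(r, in->index);
        if (match != NULL) {
            /* Matching index: apply the header operation (sum, byte-wise here). */
            uint32_t n = in->data_size < match->data_size ? in->data_size
                                                          : match->data_size;
            for (uint32_t i = 0; i < n; i++)
                match->data[i] += in->data[i];
        } else if (r->count < MAX_RESULT_ELEMS) {
            /* New index: concatenate the element to the result vector. */
            r->elems[r->count++] = in;
        }
    }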


The following, in exemplary embodiments, are non-limiting implementations of various operations:


1. How to implement gather:


Each process sends its data with index=rank_id and data_size=the size of the data it contributes; the result vector will include data from all of the processes, because each index is unique and the aggregation nodes will concatenate all of the received data.


In gather, each process sends data, and at the end of the gather that data is held by a single process.
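
To make this concrete, a process's contribution for gather might be built as follows, again using the illustrative struct sparse_element; the helper name and allocation scheme are assumptions of this sketch:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Build this process's sparse contribution for gather: index = rank_id,
     * data_size = number of payload bytes contributed.  Illustrative only. */
    struct sparse_element *make_gather_contribution(uint32_t rank_id,
                                                    const void *payload,
                                                    uint32_t nbytes)
    {
        struct sparse_element *e = malloc(sizeof(*e) + nbytes);
        if (e == NULL)
            return NULL;
        e->index = rank_id;      /* unique per process, so nodes concatenate */
        e->data_size = nbytes;
        memcpy(e->data, payload, nbytes);
        return e;
    }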


2. How to implement all-gather:


All-gather is similar to gather, but the SHARP protocol is asked to send the result to all of the processes that contributed to the operation. In all-gather, each process sends data, and that data is sent to all of the other processes.


3. How to implement gather v:


Each process sends a variable amount of data, together with an indication of the amount of data it sent; the SHARP protocol generates a result that includes all of the variable-size data. A rank id, identifying the sending process, may also be sent.


The addition of “v” indicates that each process may send a vector of whatever size that process wishes; otherwise, gather v is similar to gather.


4. How to implement all-gather v:


All-gather v is similar to gather v, but the SHARP protocol is asked to send the result to all of the processes that contributed to the operation. As in gather v, in certain exemplary embodiments the amount of data that is sent, and/or a rank id, may be provided by the processes which send data, although it is appreciated that including such information is generally optional.


The addition of “v” indicates that each process may send a vector of whatever size that process wishes; otherwise, all-gather v is similar to all-gather.


As indicated above, the data format is exemplary only, and is in no way meant to be limiting. Without limiting the generality of the foregoing, it is appreciated that certain optimizations, including compression of meta-data (index and data size), may be used.


In exemplary embodiments, the present invention utilizes the SHARP protocol all-reduce ability, in which processes send data for reduction. However, the all gather operation differs from previously-used examples of the SHARP protocol all-reduce: in all-reduce each process sends a vector of size X and the result is also of size X (an element-wise operation is performed). In all gather or all gather v, each process j sends a vector of size Xj, and the result vector size is Sum(Xj). In order to support this operation, each process sends its own data Xj and all of the data is gathered into a single large vector. Persons skilled in the art will appreciate how the same principles apply to the other operations described herein.
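
For comparison only, the following minimal program shows the same size relationship using standard MPI, without the SHARP offload described here: each rank j contributes Xj = j+1 elements, and every rank receives a result vector of size Sum(Xj).

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Rank j contributes Xj = j + 1 elements, so sizes differ per rank. */
        int my_count = rank + 1;
        int *sendbuf = malloc(my_count * sizeof(int));
        for (int i = 0; i < my_count; i++)
            sendbuf[i] = rank;

        /* Exchange the per-rank sizes, then compute displacements and Sum(Xj). */
        int *counts = malloc(nprocs * sizeof(int));
        int *displs = malloc(nprocs * sizeof(int));
        MPI_Allgather(&my_count, 1, MPI_INT, counts, 1, MPI_INT, MPI_COMM_WORLD);

        int total = 0;
        for (int j = 0; j < nprocs; j++) {
            displs[j] = total;
            total += counts[j];            /* result vector size = Sum(Xj) */
        }

        int *recvbuf = malloc(total * sizeof(int));
        MPI_Allgatherv(sendbuf, my_count, MPI_INT,
                       recvbuf, counts, displs, MPI_INT, MPI_COMM_WORLD);

        if (rank == 0)
            printf("each rank now holds a result vector of %d elements\n", total);

        free(sendbuf); free(counts); free(displs); free(recvbuf);
        MPI_Finalize();
        return 0;
    }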


Reference is now made to FIG. 1, which is a simplified pictorial illustration of a high-performance computing system, generally designated 100, constructed and operative in accordance with an exemplary embodiment of the present invention. FIG. 1, by way of non-limiting example, depicts a particularly small system including only 4 processes 110; it is appreciated that, in general, a much larger number of processes would be used.


In FIG. 1, each of the 4 processes 110 (designated P1, P2, P3, and P4) is depicted as sending vectors 120 (designated X1, X2, X3, and X4; corresponding respectively to P1, P2, P3, and P4), the vectors 120 being various sizes, for an operation to be carried out. An aggregation node 130 gathers all of the data received (the vectors 120, X1, X2, X3, and X4) from the 4 processes 110 (P1, P2, P3, and P4) into a single vector 140.


Assuming, as described above, that the sparse vector format includes indexes, the aggregation node will be able to generate an ordered vector as depicted in FIG. 1. In general, if (by way of non-limiting example) indexes range from 0 to 100, the resulting vector may be sorted in order of the indexes. This feature makes it easier to find the data relating to a given index.
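
A minimal sketch of such index ordering, using the illustrative types above, is to sort the collected elements by index before concatenation:

    #include <stdlib.h>

    /* Compare two collected elements by index (for qsort). */
    static int by_index(const void *a, const void *b)
    {
        const struct sparse_element *ea = *(const struct sparse_element *const *)a;
        const struct sparse_element *eb = *(const struct sparse_element *const *)b;
        return (ea->index > eb->index) - (ea->index < eb->index);
    }

    /* Sort in place; concatenating the payloads in this order yields an
     * index-ordered combined vector (X1, X2, X3, X4 in the FIG. 1 example). */
    static void order_by_index(struct sparse_element **elems, size_t count)
    {
        qsort(elems, count, sizeof(*elems), by_index);
    }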


The aggregation operation will be carried out (in a manner more complex than that shown for purposes of simplicity of illustration in FIG. 1) on multi-level aggregation, as is usual in SHARP (see US Published Patent Application 2017/0063613 of Bloch et al, the disclosure of which has been incorporated herein by reference), thus achieving scalability and hierarchical operation.


In the general case, the aggregation node will not assume that the indexes are consecutive; thus, in the particular non-limiting example of FIG. 1, not all of the vectors 120 need be present.


Reference is now made to FIG. 2, which is a simplified pictorial illustration of a high-performance computing system constructed and operative in accordance with an alternative exemplary embodiment of the present invention. The system of FIG. 2, generally designated 200, operates similarly to the system of FIG. 1, with a plurality of processes 210 sending vectors 220 towards aggregation nodes 230. In the system of FIG. 2, two different aggregation nodes 230 are shown as receiving the vectors 220 from the processes 210; this type of architecture allows better scaling. It will be appreciated that, while only two aggregation nodes 230 are shown, in practice a much larger number of aggregation nodes, arranged in a tree structure, may be used (see US Published Patent Application 2017/0063613 of Bloch et al, the disclosure of which has been incorporated herein by reference, for a more detailed discussion).


In the system of FIG. 2, an aggregation/root node 235 ultimately receives all of the vectors 220 and produces a single combined vector 240.


Reference is now made to FIG. 3A, which is a simplified pictorial illustration showing distribution of data, in a system similar to that of FIG. 1, to all processes; and to FIG. 3B, which is a simplified pictorial illustration showing distribution of data, in a system similar to that of FIG. 1, to a single process.



FIGS. 3A and 3B show a system, generally designated 300, which operates similarly to the system 100 of FIG. 1. For purposes of simplicity of illustration, the systems of FIGS. 3A and 3B are based on the system of FIG. 1, it being appreciated that systems based on the system of FIG. 2 may alternatively be used.


In FIGS. 3A and 3B, the systems are shown after the processes 310 have already sent their individual vectors (not shown) to the aggregation node 330, and the combined vector 340 has been produced. In FIG. 3A, a situation is depicted in which the combined vector 340 is sent to all of the processes 310 (as would be the case, by way of non-limiting example, in an all-gather operation); while in FIG. 3B, a situation is depicted in which the combined vector 340 is sent to only one of the processes 310 (as would be the case, by way of non-limiting example, in a gather operation).


In FIGS. 3A and 3B, inter alia, input data is depicted as being provided in order (X1, X2, X3, X4). While providing input data in order may be optimal in certain preferred embodiments, it is appreciated that it is not necessary to provide input data in order.


It is appreciated that software components of the present invention may, if desired, be implemented in ROM (read only memory) form. The software components may, generally, be implemented in hardware, if desired, using conventional techniques. It is further appreciated that the software components may be instantiated, for example, as a computer program product or on a tangible medium. In some cases, it may be possible to instantiate the software components as a signal interpretable by an appropriate computer, although such an instantiation may be excluded in certain embodiments of the present invention.


It is appreciated that various features of the invention which are, for clarity, described in the contexts of separate embodiments may also be provided in combination in a single embodiment. Conversely, various features of the invention which are, for brevity, described in the context of a single embodiment may also be provided separately or in any suitable subcombination.


It will be appreciated by persons skilled in the art that the present invention is not limited by what has been particularly shown and described hereinabove. Rather, the scope of the invention is defined by the appended claims and equivalents thereof.

Claims
  • 1. A method for computation, comprising: in a high-performance computing system that runs an application in which multiple processes perform portions of work of the application, defining a network comprising the multiple processes and at least one aggregation node; in response to a data movement command, passing respective vectors of data having different, respective data sizes from the multiple processes over the network to the at least one aggregation node; at the at least one aggregation node, in response to the data movement command, receiving and concatenating the respective vectors to generate a result vector having a result vector size equal to a sum of the respective data sizes of the received vectors; and outputting the result vector from the at least one aggregation node to the network, wherein defining the network comprises providing multiple aggregation nodes, comprising a root node, which outputs the result vector, and at least two intermediate aggregation nodes, each of which receives and concatenates the respective vectors from respective ones of the processes.
  • 2. The method according to claim 1, wherein the data movement command comprises a gather command.
  • 3. The method according to claim 1, wherein outputting the result vector comprises sending the result vector to the multiple processes.
  • 4. The method according to claim 1, wherein defining the network comprises providing a hierarchical tree having nodes corresponding to the multiple processes.
  • 5. The method according to claim 1, wherein passing the respective vectors comprises sending the data from the multiple processes with respective indexes, and wherein concatenating the respective vectors comprises adding the data to the result vector according to the respective indexes.
  • 6. The method according to claim 5, wherein sending the data comprises providing indications from the multiple processes to the at least one aggregation node of the respective data sizes of the respective vectors.
  • 7. A high-performance computing system comprising multiple nodes, which are configured to run an application in which multiple processes perform portions of work of the application, wherein a network comprising the multiple processes and at least one aggregation node is defined in the system, such that in response to a data movement command, the multiple processes pass respective vectors of data having different, respective data sizes over the network to the at least one aggregation node, and wherein the at least one aggregation node, in response to the data movement command, receives and concatenates the respective vectors to generate and output a result vector to the network having a result vector size equal to a sum of the respective data sizes of the received vectors, wherein the network comprises multiple aggregation nodes, including a root node, which outputs the result vector, and at least two intermediate aggregation nodes, each of which receives and concatenates the respective vectors from respective ones of the processes.
  • 8. The system according to claim 7, wherein the data movement command comprises a gather command.
  • 9. The system according to claim 7, wherein the at least one aggregation node is configured to send the result vector to the multiple processes.
  • 10. The system according to claim 7, wherein the network is defined as a hierarchical tree in which the nodes correspond to the multiple processes.
  • 11. The system according to claim 7, wherein the multiple processes pass the respective vectors to the at least one aggregation node together with respective indexes, and wherein the at least one aggregation node concatenates the respective vectors according to the respective indexes.
  • 12. The system according to claim 11, wherein the multiple processes provide indications to the at least one aggregation node of the respective data sizes of the respective vectors.
PRIORITY CLAIM

The present application claims priority from U.S. Provisional Patent Application Ser. No. 62/807,266 of Levi et al., filed 19 Feb. 2019.

US Referenced Citations (214)
Number Name Date Kind
4933969 Marshall et al. Jun 1990 A
5068877 Near et al. Nov 1991 A
5325500 Bell et al. Jun 1994 A
5353412 Douglas et al. Oct 1994 A
5404565 Gould Apr 1995 A
5606703 Brady et al. Feb 1997 A
5944779 Blum Aug 1999 A
6041049 Brady Mar 2000 A
6370502 Wu Apr 2002 B1
6483804 Muller et al. Nov 2002 B1
6507562 Kadansky Jan 2003 B1
6728862 Wilson Apr 2004 B1
6857004 Howard et al. Feb 2005 B1
6937576 Di Benedetto et al. Aug 2005 B1
7102998 Golestani Sep 2006 B1
7124180 Ranous Oct 2006 B1
7164422 Wholey, III et al. Jan 2007 B1
7171484 Krause et al. Jan 2007 B1
7313582 Bhanot et al. Dec 2007 B2
7327693 Rivers et al. Feb 2008 B1
7336646 Muller Feb 2008 B2
7346698 Hannaway Mar 2008 B2
7555549 Campbell et al. Jun 2009 B1
7613774 Caronni et al. Nov 2009 B1
7636424 Halikhedkar et al. Dec 2009 B1
7636699 Stanfill Dec 2009 B2
7738443 Kumar Jun 2010 B2
8213315 Crupnicoff et al. Jul 2012 B2
8380880 Gulley et al. Feb 2013 B2
8510366 Anderson et al. Aug 2013 B1
8738891 Karandikar May 2014 B1
8761189 Shachar et al. Jun 2014 B2
8768898 Trimmer et al. Jul 2014 B1
8775698 Archer et al. Jul 2014 B2
8811417 Bloch et al. Aug 2014 B2
9110860 Shahar Aug 2015 B2
9189447 Faraj Nov 2015 B2
9294551 Froese et al. Mar 2016 B1
9344490 Bloch et al. May 2016 B2
9563426 Bent et al. Feb 2017 B1
9626329 Howard Apr 2017 B2
9756154 Jiang Sep 2017 B1
10015106 Florissi et al. Jul 2018 B1
10158702 Bloch et al. Dec 2018 B2
10284383 Bloch et al. May 2019 B2
10296351 Kohn et al. May 2019 B1
10305980 Gonzalez et al. May 2019 B1
10318306 Kohn et al. Jun 2019 B1
10425350 Florissi Sep 2019 B1
10521283 Shuler et al. Dec 2019 B2
10541938 Timmerman et al. Jan 2020 B1
10621489 Appuswamy et al. Apr 2020 B2
20020010844 Noel et al. Jan 2002 A1
20020035625 Tanaka Mar 2002 A1
20020150094 Cheng et al. Oct 2002 A1
20020150106 Kagan et al. Oct 2002 A1
20020152315 Kagan et al. Oct 2002 A1
20020152327 Kagan et al. Oct 2002 A1
20020152328 Kagan et al. Oct 2002 A1
20030018828 Craddock et al. Jan 2003 A1
20030061417 Craddock et al. Mar 2003 A1
20030065856 Kagan et al. Apr 2003 A1
20040062258 Grow et al. Apr 2004 A1
20040078493 Blumrich et al. Apr 2004 A1
20040120331 Rhine et al. Jun 2004 A1
20040123071 Stefan et al. Jun 2004 A1
20040252685 Kagan et al. Dec 2004 A1
20040260683 Chan Dec 2004 A1
20050097300 Gildea et al. May 2005 A1
20050122329 Janus Jun 2005 A1
20050129039 Biran et al. Jun 2005 A1
20050131865 Jones et al. Jun 2005 A1
20050281287 Ninomi et al. Dec 2005 A1
20060282838 Gupta et al. Dec 2006 A1
20070127396 Jain et al. Jun 2007 A1
20070162236 Lamblin Jul 2007 A1
20080104218 Liang et al. May 2008 A1
20080126564 Wilkinson May 2008 A1
20080168471 Benner et al. Jul 2008 A1
20080181260 Vonog et al. Jul 2008 A1
20080192750 Ko et al. Aug 2008 A1
20080244220 Lin et al. Oct 2008 A1
20080263329 Archer et al. Oct 2008 A1
20080288949 Bohra et al. Nov 2008 A1
20080298380 Rittmeyer et al. Dec 2008 A1
20080307082 Cai et al. Dec 2008 A1
20090037377 Archer et al. Feb 2009 A1
20090063816 Arimilli et al. Mar 2009 A1
20090063817 Arimilli et al. Mar 2009 A1
20090063891 Arimilli et al. Mar 2009 A1
20090182814 Tapolcai et al. Jul 2009 A1
20090247241 Gollnick et al. Oct 2009 A1
20090292905 Faraj Nov 2009 A1
20100017420 Archer et al. Jan 2010 A1
20100049836 Kramer Feb 2010 A1
20100074098 Zeng et al. Mar 2010 A1
20100095086 Eichenberger et al. Apr 2010 A1
20100185719 Howard Jul 2010 A1
20100241828 Yu et al. Sep 2010 A1
20110060891 Jia Mar 2011 A1
20110066649 Berlyant et al. Mar 2011 A1
20110093258 Xu Apr 2011 A1
20110119673 Bloch et al. May 2011 A1
20110173413 Chen et al. Jul 2011 A1
20110219208 Asaad Sep 2011 A1
20110238956 Arimilli et al. Sep 2011 A1
20110258245 Blocksome et al. Oct 2011 A1
20110276789 Chambers et al. Nov 2011 A1
20120063436 Thubert et al. Mar 2012 A1
20120117331 Krause et al. May 2012 A1
20120131309 Johnson May 2012 A1
20120216021 Archer et al. Aug 2012 A1
20120254110 Takemoto Oct 2012 A1
20130117548 Grover et al. May 2013 A1
20130159410 Lee et al. Jun 2013 A1
20130318525 Palanisamy et al. Nov 2013 A1
20130336292 Kore et al. Dec 2013 A1
20140019574 Cardona et al. Jan 2014 A1
20140033217 Vajda Jan 2014 A1
20140047341 Breternitz et al. Feb 2014 A1
20140095779 Forsyth et al. Apr 2014 A1
20140122831 Uliel et al. May 2014 A1
20140189308 Hughes et al. Jul 2014 A1
20140211804 Makikeni et al. Jul 2014 A1
20140258438 Ayoub Sep 2014 A1
20140280420 Khan Sep 2014 A1
20140281370 Khan Sep 2014 A1
20140362692 Wu et al. Dec 2014 A1
20140365548 Mortensen Dec 2014 A1
20150106578 Warfield et al. Apr 2015 A1
20150143076 Khan May 2015 A1
20150143077 Khan May 2015 A1
20150143078 Khan et al. May 2015 A1
20150143079 Khan May 2015 A1
20150143085 Khan May 2015 A1
20150143086 Khan May 2015 A1
20150154058 Miwa et al. Jun 2015 A1
20150178211 Hiramoto Jun 2015 A1
20150180785 Annamraju Jun 2015 A1
20150188987 Reed et al. Jul 2015 A1
20150193271 Archer et al. Jul 2015 A1
20150212972 Boettcher et al. Jul 2015 A1
20150269116 Raikin et al. Sep 2015 A1
20150278347 Meyer Oct 2015 A1
20150365494 Cardona et al. Dec 2015 A1
20150379022 Puig et al. Dec 2015 A1
20160055225 Xu et al. Feb 2016 A1
20160092362 Barron et al. Mar 2016 A1
20160105494 Reed et al. Apr 2016 A1
20160112531 Milton et al. Apr 2016 A1
20160117277 Raindel et al. Apr 2016 A1
20160179537 Kunzman Jun 2016 A1
20160219009 French Jul 2016 A1
20160248656 Anand et al. Aug 2016 A1
20160299872 Vaidyanathan et al. Oct 2016 A1
20160342568 Burchard et al. Nov 2016 A1
20160352598 Reinhardt et al. Dec 2016 A1
20160364350 Sanghi et al. Dec 2016 A1
20170063613 Bloch Mar 2017 A1
20170093715 McGhee et al. Mar 2017 A1
20170116154 Palmer et al. Apr 2017 A1
20170187496 Shalev et al. Jun 2017 A1
20170187589 Pope et al. Jun 2017 A1
20170187629 Shalev et al. Jun 2017 A1
20170187846 Shalev et al. Jun 2017 A1
20170199844 Burchard et al. Jul 2017 A1
20170262517 Horowitz Sep 2017 A1
20170344589 Kafai Nov 2017 A1
20180004530 Vorbach Jan 2018 A1
20180046901 Xie et al. Feb 2018 A1
20180047099 Bonig et al. Feb 2018 A1
20180089278 Bhattacharjee et al. Mar 2018 A1
20180091442 Chen et al. Mar 2018 A1
20180097721 Matsui et al. Apr 2018 A1
20180173673 Daglis et al. Jun 2018 A1
20180262551 Demeyer et al. Sep 2018 A1
20180285316 Thorson et al. Oct 2018 A1
20180287928 Levi et al. Oct 2018 A1
20180302324 Kasuya Oct 2018 A1
20180321912 Li et al. Nov 2018 A1
20180321938 Boswell et al. Nov 2018 A1
20180349212 Liu et al. Dec 2018 A1
20180367465 Levi Dec 2018 A1
20180375781 Chen et al. Dec 2018 A1
20190018805 Benisty Jan 2019 A1
20190026250 Das Sarma et al. Jan 2019 A1
20190044889 Serres et al. Feb 2019 A1
20190065208 Liu et al. Feb 2019 A1
20190068501 Schneider et al. Feb 2019 A1
20190102179 Fleming et al. Apr 2019 A1
20190102338 Tang et al. Apr 2019 A1
20190102640 Balasubramanian Apr 2019 A1
20190114533 Ng et al. Apr 2019 A1
20190121388 Knowles et al. Apr 2019 A1
20190138638 Pal et al. May 2019 A1
20190147092 Pal et al. May 2019 A1
20190149488 Bansal et al. May 2019 A1
20190171612 Shahar et al. Jun 2019 A1
20190235866 Das Sarma et al. Aug 2019 A1
20190303168 Fleming, Jr. et al. Oct 2019 A1
20190303263 Fleming, Jr. et al. Oct 2019 A1
20190324431 Cella et al. Oct 2019 A1
20190339688 Cella et al. Nov 2019 A1
20190347099 Eapen et al. Nov 2019 A1
20190369994 Parandeh Afshar et al. Dec 2019 A1
20190377580 Vorbach Dec 2019 A1
20190379714 Levi et al. Dec 2019 A1
20200005859 Chen et al. Jan 2020 A1
20200034145 Bainville et al. Jan 2020 A1
20200057748 Danilak Feb 2020 A1
20200103894 Cella et al. Apr 2020 A1
20200106828 Elias et al. Apr 2020 A1
20200137013 Jin et al. Apr 2020 A1
20210203621 Ylisirnio et al. Jul 2021 A1
Non-Patent Literature Citations (41)
Entry
Danalis et al., “PTG: an abstraction for unhindered parallelism”, 2014 Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, pp. 1-10, Nov. 17, 2014.
Cosnard et al., “Symbolic Scheduling of Parameterized Task Graphs on Parallel Machines,” Combinatorial Optimization book series (COOP, vol. 7), pp. 217-243, year 2000.
Jeannot et al., “Automatic Multithreaded Parallel Program Generation for Message Passing Multiprocessors using paramerized Task Graphs”, World Scientific, pp. 1-8, Jul. 23, 2001.
Stone, “An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations,” Journal of the Association for Computing Machinery, vol. 10, No. 1, pp. 27-38, Jan. 1973.
Kogge et al., “A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations,” IEEE Transactions on Computers, vol. C-22, No. 8, pp. 786-793, Aug. 1973.
Hoefler et al., “Message Progression in Parallel Computing—To Thread or not to Thread?”, 2008 IEEE International Conference on Cluster Computing, pp. 1-10, Tsukuba, Japan, Sep. 29-Oct. 1, 2008.
U.S. Appl. No. 16/357,356 office action dated May 14, 2020.
European Application # 20156490.3 search report dated Jun. 25, 2020.
Bruck et al., “Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems”, IEEE Transactions on Parallel and Distributed Systems, vol. 8, No. 11, pp. 1143-1156, Nov. 1997.
Bruck et al., “Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems”, Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures, pp. 298-309, Aug. 1, 1994.
Chiang et al., “Toward supporting data parallel programming on clusters of symmetric multiprocessors”, Proceedings International Conference on Parallel and Distributed Systems, pp. 607-614, Dec. 14, 1998.
Gainaru et al., “Using InfiniBand Hardware Gather-Scatter Capabilities to Optimize MPI All-to-All”, Proceedings of the 23rd European MPI Users' Group Meeting, pp. 167-179, Sep. 2016.
Pjesivac-Grbovic et al., “Performance Analysis of MPI Collective Operations”, 19th IEEE International Parallel and Distributed Processing Symposium, pp. 1-19, 2015.
Mellanox Technologies, “InfiniScale IV: 36-port 40GB/s Infiniband Switch Device”, pp. 1-2, year 2009.
Mellanox Technologies Inc., “Scaling 10Gb/s Clustering at Wire-Speed”, pp. 1-8, year 2006.
IEEE 802.1D Standard “IEEE Standard for Local and Metropolitan Area Networks—Media Access Control (MAC) Bridges”, IEEE Computer Society, pp. 1-281, Jun. 9, 2004.
IEEE 802.1AX Standard “IEEE Standard for Local and Metropolitan Area Networks—Link Aggregation”, IEEE Computer Society, pp. 1-163, Nov. 3, 2008.
Turner et al., “Multirate Clos Networks”, IEEE Communications Magazine, pp. 1-11, Oct. 2003.
Thayer School of Engineering, “An Slightly Edited Local Elements of Lectures 4 and 5”, Dartmouth College, pp. 1-5, Jan. 15, 1998 http://people.seas.harvard.edu/˜jones/cscie129/nu_lectures/lecture11/switching/clos_network/clos_network.html.
“MPI: A Message-Passing Interface Standard,” Message Passing Interface Forum, version 3.1, pp. 1-868, Jun. 4, 2015.
Coti et al., “MPI Applications on Grids: a Topology Aware Approach,” Proceedings of the 15th International European Conference on Parallel and Distributed Computing (EuroPar'09), pp. 1-12, Aug. 2009.
Petrini et al., “The Quadrics Network (QsNet): High-Performance Clustering Technology,” Proceedings of the 9th IEEE Symposium on Hot Interconnects (Hotl'01), pp. 1-6, Aug. 2001.
Sancho et al., “Efficient Offloading of Collective Communications in Large-Scale Systems,” Proceedings of the 2007 IEEE International Conference on Cluster Computing, pp. 1-10, Sep. 17-20, 2007.
Infiniband Trade Association, “InfiniBand™ Architecture Specification”, release 1.2.1, pp. 1-1727, Jan. 2008.
InfiniBand Architecture Specification, vol. 1, Release 1.2.1, pp. 1-1727, Nov. 2007.
Deming, “Infiniband Architectural Overview”, Storage Developer Conference, pp. 1-70, year 2013.
Fugger et al., “Reconciling fault-tolerant distributed computing and systems-on-chip”, Distributed Computing, vol. 24, Issue 6, pp. 323-355, Jan. 2012.
Wikipedia, “System on a chip”, pp. 1-4, Jul. 6, 2018.
Villavieja et al., “On-chip Distributed Shared Memory”, Computer Architecture Department, pp. 1-10, Feb. 3, 2011.
Ben-Moshe et al., U.S. Appl. No. 16/750,019, filed Jan. 23, 2020.
Chapman et al., “Introducing OpenSHMEM: SHMEM for the PGAS Community,” Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, pp. 1-4, Oct. 2010.
Priest et al., “You've Got Mail (YGM): Building Missing Asynchronous Communication Primitives”, IEEE International Parallel and Distributed Processing Symposium Workshops, pp. 221-230, year 2019.
Wikipedia, “Nagle's algorithm”, pp. 1-4, Dec. 12, 2019.
U.S. Appl. No. 16/430,457 Office Action dated Jul. 9, 2021.
Yang et al., “SwitchAgg: A Further Step Toward In-Network Computing,” 2019 IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking, pp. 36-45, Dec. 2019.
EP Application # 20216972 Search Report dated Jun. 11, 2021.
U.S. Appl. No. 16/789,458 Office Action dated Jun. 10, 2021.
U.S. Appl. No. 16/750,019 Office Action dated Jun. 15, 2021.
U.S. Appl. No. 17/147,487 Office Action dated Jun. 30, 2022.
U.S. Appl. No. 17/147,487 Office Action dated Nov. 29, 2022.
U.S. Appl. No. 17/495,824 Office Action dated Jan. 27, 2023.
Related Publications (1)
Number Date Country
20200265043 A1 Aug 2020 US
Provisional Applications (1)
Number Date Country
62807266 Feb 2019 US