High performance computing system

Information

  • Patent Grant
  • Patent Number
    11,625,393
  • Date Filed
    Wednesday, February 5, 2020
  • Date Issued
    Tuesday, April 11, 2023
  • CPC
    • G06F16/244
    • G06F16/214
    • G06F16/2237
    • G06F16/2246
  • Field of Search
    • US
    • 707/696
    • CPC
    • G06F16/244
    • G06F16/2237
    • G06F16/214
    • G06F16/2246
  • International Classifications
    • G06F16/242
    • G06F16/21
    • G06F16/22
    • G06F16/11
Abstract
A method including providing a SHARP tree including a plurality of data receiving processes and at least one aggregation node, designating a data movement command, providing a plurality of data input vectors to each of the plurality of data receiving processes, respectively, the plurality of data receiving processes each passing on the respective received data input vector to the at least one aggregation node, and the at least one aggregation node carrying out the data movement command on the received plurality of data input vectors. Related apparatus and methods are also provided.
Description
FIELD OF THE INVENTION

The present invention relates to high performance computing systems in general.


BACKGROUND OF THE INVENTION

In high performance computing (HPC) systems many applications are written in a manner that requires communication between the systems which perform portions of the work (termed herein “processes”).


Part of the communication comprises collective operations such as (by way of non-limiting example) computing the sum of multiple vectors (an element-wise add operation) contributed by multiple processes, one vector per process, and sending a copy of the resulting vector to all of the participating processes; this operation is called “all-reduce”. Another non-limiting example would be sending the result to only one of the processes; this operation is called “reduce”.


In addition to the compute operations (that is, reduce and all-reduce, which involve mathematical operators), there are data movement commands such as (by way of non-limiting example) all2all, gather, all gather, gather v, all gather v, scatter, etc. Commands of this type are defined in the well-known Message Passing Interface (MPI) specification and in other communication Application Programming Interface definitions.
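
For reference, the following are the C bindings of the gather-family commands as defined in the MPI specification (reproduced here for convenience; only the “v” variants accept per-process receive counts and displacements):

    /* Gather-family collectives as declared in <mpi.h>, per the MPI specification.
     * The "v" (vector) variants take per-rank receive counts and displacements,
     * allowing each process to contribute a different amount of data. */
    int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                   void *recvbuf, int recvcount, MPI_Datatype recvtype,
                   int root, MPI_Comm comm);
    int MPI_Gatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                    void *recvbuf, const int recvcounts[], const int displs[],
                    MPI_Datatype recvtype, int root, MPI_Comm comm);
    int MPI_Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                      void *recvbuf, int recvcount, MPI_Datatype recvtype,
                      MPI_Comm comm);
    int MPI_Allgatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                       void *recvbuf, const int recvcounts[], const int displs[],
                       MPI_Datatype recvtype, MPI_Comm comm);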


SUMMARY OF THE INVENTION

The present invention, in certain embodiments thereof, seeks to provide an improved system and method for high performance computing.


There is thus provided in accordance with an exemplary embodiment of the present invention a method including providing a SHARP tree including a plurality of data receiving processes and at least one aggregation node, designating a data movement command, providing a plurality of data input vectors to each of the plurality of data receiving processes, respectively, the plurality of data receiving processes each passing on the respective received data input vector to the at least one aggregation node, and the at least one aggregation node carrying out the data movement command on the received plurality of data input vectors.


Further in accordance with an exemplary embodiment of the present invention the data movement command includes one of the following: gather, all gather, gather v, and all gather v.


Still further in accordance with an exemplary embodiment of the present invention the at least one aggregation node produces an output vector.


Additionally in accordance with an exemplary embodiment of the present invention at least one of the plurality of data input vectors includes a sparse vector.


Moreover in accordance with an exemplary embodiment of the present invention the at least one aggregation node utilizes a SHARP protocol.


There is also provided in accordance with another exemplary embodiment of the present invention apparatus including a SHARP tree including a plurality of data receiving processes and at least one aggregation node, the SHARP tree being configured to perform the following: receiving a data movement command, receiving a plurality of data input vectors to each of the plurality of data receiving processes, respectively, the data receiving processes each passing on the respective received data input vector to the at least one aggregation node, and the at least one aggregation node carrying out the data movement command on the received plurality of data input vectors.


Further in accordance with an exemplary embodiment of the present invention the data movement command includes one of the following: gather, all gather, gather v, and all gather v.


Still further in accordance with an exemplary embodiment of the present invention the at least one aggregation node is configured to produce an output vector.


Additionally in accordance with an exemplary embodiment of the present invention at least one of the plurality of data input vectors includes a sparse vector.


Moreover in accordance with an exemplary embodiment of the present invention the at least one aggregation node utilizes a SHARP protocol.





BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be understood and appreciated more fully from the following detailed description, taken in conjunction with the drawings in which:



FIG. 1 is a simplified pictorial illustration of a high-performance computing system constructed and operative in accordance with an exemplary embodiment of the present invention;



FIG. 2 is a simplified pictorial illustration of a high-performance computing system constructed and operative in accordance with an alternative exemplary embodiment of the present invention;



FIG. 3A is a simplified pictorial illustration showing distribution of data, in a system similar to that of FIG. 1, to all processes; and



FIG. 3B is a simplified pictorial illustration showing distribution of data, in a system similar to that of FIG. 1, to a single process.





DETAILED DESCRIPTION OF AN EMBODIMENT
General Introduction

By way of introduction (but without limiting the generality of the present application), the concept behind exemplary embodiments of the present invention is to use the SHARP protocol to accelerate at least the gather, all gather, gather v, and all gather v operations.


The SHARP algorithm/protocol is described in US Published Patent Application 2017/0063613 of Bloch et al, the disclosure of which is hereby incorporated herein by reference.


In certain exemplary embodiments of the present invention, computations which may be described herein as if they occur in a serial manner may be executed in such a way that computations already occur while data is still being received.


The inventors of the present invention believe that current methodology for addressing the above-mentioned scenarios is to use software algorithms in order to perform the operations mentioned above. The software algorithms involve a large amount of data transfer/movement. Significant overhead may also be generated on a CPU which manages the algorithms, since multiple packets, each with a small amount of data, are sent; sometimes identical such packets are sent multiple times to multiple destinations, creating additional packet/bandwidth overhead, including header overhead. In addition, such software algorithms may create large latency when a large number of processes are involved.


In exemplary embodiments of the present invention, one goal is to offload work from the management CPU/s by simplification of the process using the SHARP algorithm/protocol (referred to above and described in US Published Patent Application 2017/0063613 of Bloch et al, the disclosure of which has been incorporated herein by reference); latency as well as bandwidth consumption may also be reduced in such exemplary embodiments.


Embodiment Description

In exemplary embodiments of the present invention, the SHARP algorithm/protocol (referred to above) is used to implement at least: gather; gather v; all gather; and all gather v.


In general, in exemplary embodiments of the present invention, the gather operation is treated as a regular reduction operation by using a new data format supporting sparse representation. When using the sparse representation, each process/aggregation point sends its data to the SHARP network, while allowing the aggregation to move forward; this is different from a regular aggregation operation, which assumes that each one of the processes/aggregation points contributes exactly the same vector size.


It is appreciated that, in certain exemplary embodiments, within the SHARP network, a sparse representation may be converted to a dense representation; in certain cases, a dense representation may be able to be processed with greater efficiency. It is also appreciated that, in such a case, both sparse and dense representations may co-exist simultaneously in different points within the SHARP network.
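
The text above does not specify how such a conversion is performed; purely as an illustration, the following is a minimal sketch of one possible sparse-to-dense conversion, assuming (for this sketch only) fixed-size per-index payloads so that a dense buffer can be addressed directly by index:

    #include <stdint.h>
    #include <string.h>

    /* Illustrative sparse-to-dense conversion: each index owns a fixed-size
     * slot in a dense buffer, so later processing can address slot i directly.
     * ELEM_BYTES and the layout are assumptions of this sketch only. */
    #define ELEM_BYTES 8

    void sparse_to_dense(const uint32_t *indexes, const uint8_t *payloads,
                         size_t nelems,
                         uint8_t *dense /* capacity: n_indexes * ELEM_BYTES */)
    {
        for (size_t i = 0; i < nelems; i++)
            memcpy(dense + (size_t)indexes[i] * ELEM_BYTES,
                   payloads + i * ELEM_BYTES, ELEM_BYTES);
    }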


In exemplary embodiments of the present invention, the following protocol updates are made relative to the SHARP protocol (see US Published Patent Application 2017/0063613 of Bloch et al, the disclosure of which has been incorporated herein by reference):

    • The SHARP operation header carries an indication that the SHARP payload is sparse
    • The SHARP operation header uses a “sum” operation, to eliminate the need to create a new operation (a schematic sketch of such a header follows this list)
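
The actual SHARP header layout is not reproduced in this description; purely as an illustration of the two updates listed above, a header might carry the existing operation code (reusing “sum”) plus a flag marking the payload as sparse. The field names and widths below are assumptions of this sketch:

    #include <stdint.h>

    /* Hypothetical header fields illustrating the two updates above; the real
     * SHARP header layout, field names and widths are not given in this text. */
    enum sharp_op { SHARP_OP_SUM = 0 /* existing "sum" operation is reused */ };

    struct sharp_op_header {
        uint8_t  op;             /* aggregation operation, e.g. SHARP_OP_SUM  */
        uint8_t  payload_sparse; /* non-zero: payload uses the sparse format  */
        uint16_t reserved;
    };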


The following is one possible non-limiting example of an appropriate data format useable in exemplary embodiments of the present invention:


The format of each element is:

    index | data size in bytes | data[ ]

where data[ ] represents a list of data elements, each of which can be one or more bytes or bits. A special index value is reserved to mark the end of the data.
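
For illustration, such an element might be laid out as follows; the field widths and the reserved end-of-data index value are assumptions of this sketch, not part of the described format:

    #include <stdint.h>

    /* One sparse element: index, data size in bytes, then the data itself.
     * Field widths and the end-of-data sentinel value are illustrative only. */
    #define SPARSE_INDEX_END UINT32_MAX  /* reserved index marking end of data */

    struct sparse_element {
        uint32_t index;      /* e.g., the contributing process's rank_id */
        uint32_t data_size;  /* number of payload bytes that follow      */
        uint8_t  data[];     /* data_size bytes (or packed bits) of data */
    };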


In exemplary embodiments, the following is an example of SHARP protocol behavior when using sparse data: an aggregation point examines the data that arrives. If an index in the arriving data matches an index already present in the result, the aggregation point implements the operation that is specified in the operation header; for an index that does not yet appear in the result, the aggregation point adds that single index's data to the result vector (that is, concatenates that data to the result vector).
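
The following is a schematic sketch of this behavior, reusing the illustrative struct sparse_element above; the container type, the bound, and the byte-wise “sum” are assumptions of the sketch, not the SHARP wire protocol itself:

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_RESULT_ELEMS 1024   /* illustrative bound for this sketch */

    struct result_vector {
        struct sparse_element *elems[MAX_RESULT_ELEMS];
        size_t count;
    };

    static struct sparse_element *find_index(struct result_vector *r, uint32_t idx)
    {
        for (size_t i = 0; i < r->count; i++)
            if (r->elems[i]->index == idx)
                return r->elems[i];
        return NULL;
    }

    /* Fold one arriving element into the partial result. */
    static void aggregate(struct result_vector *r, struct sparse_element *in)
    {
        struct sparse_element *match = find_index(r, in->index);
        if (match != NULL) {
            /* Matching index: apply the header operation (sum, byte-wise here). */
            uint32_t n = in->data_size < match->data_size ? in->data_size
                                                          : match->data_size;
            for (uint32_t i = 0; i < n; i++)
                match->data[i] += in->data[i];
        } else if (r->count < MAX_RESULT_ELEMS) {
            /* New index: concatenate the element to the result vector. */
            r->elems[r->count++] = in;
        }
    }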


The following, in exemplary embodiments, are non-limiting implementations of various operations:


1. How to implement gather:


Each process sends its data with index=rank_id and data_size=the size of the data it contributes; the result vector will include data from all of the processes, because each index is unique and the aggregation nodes will concatenate all of the received data.


In gather, each process sends data, and at the end of the gather that data is held by a single process.
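
To make this concrete, a process's contribution for gather might be built as follows, again using the illustrative struct sparse_element; the helper name and allocation scheme are assumptions of this sketch:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Build this process's sparse contribution for gather: index = rank_id,
     * data_size = number of payload bytes contributed.  Illustrative only. */
    struct sparse_element *make_gather_contribution(uint32_t rank_id,
                                                    const void *payload,
                                                    uint32_t nbytes)
    {
        struct sparse_element *e = malloc(sizeof(*e) + nbytes);
        if (e == NULL)
            return NULL;
        e->index = rank_id;      /* unique per process, so nodes concatenate */
        e->data_size = nbytes;
        memcpy(e->data, payload, nbytes);
        return e;
    }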


2. How to implement all-gather:


All-gather is similar to gather, but the SHARP protocol is asked to send the result to all of the processes that contributed to the operation. In all-gather, each process sends data, and that data is sent to all of the other processes.


3. How to implement gather v:


Each process sends a variable amount of data, together with an indication of the amount of data it sent; the SHARP protocol generates a result that includes all of the variable-size data. A rank id, identifying the sending process, may also be sent.


The addition of “v” indicates that each process may send a vector of whatever size that process wishes; otherwise, gather v is similar to gather.


4. How to implement all-gather v:


All-gather v is similar to gather v, but the SHARP protocol is asked to send the result to all of the processes that contributed to the operation. As in gather v, in certain exemplary embodiments the amount of data that is sent, and/or a rank id, may be provided by the processes which send data, although it is appreciated that including such information is generally optional.


The addition of “v” indicates that each process may send a vector of whatever size that process wishes; otherwise, all-gather v is similar to all-gather.


As indicated above, the data format is exemplary only, and is in no way meant to be limiting. Without limiting the generality of the foregoing, it is appreciated that certain optimizations, including compression of meta-data (index and data size), may be used.


In exemplary embodiments, the present invention utilizes the SHARP protocol all-reduce ability, in which processes send data for reduction. However, the all gather operation differs from previously-used examples of the SHARP protocol all-reduce: in all-reduce each process sends a vector of size X and the result is also of size X (an element-wise operation is performed). In all gather or all gather v, each process j sends a vector of size Xj, and the result vector size is Sum(Xj). In order to support this operation, each process sends its own data Xj and all of the data is gathered into a single large vector. Persons skilled in the art will appreciate how the same principles apply to the other operations described herein.
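
For comparison only, the following minimal program shows the same size relationship using standard MPI, without the SHARP offload described here: each rank j contributes Xj = j+1 elements, and every rank receives a result vector of size Sum(Xj).

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Rank j contributes Xj = j + 1 elements, so sizes differ per rank. */
        int my_count = rank + 1;
        int *sendbuf = malloc(my_count * sizeof(int));
        for (int i = 0; i < my_count; i++)
            sendbuf[i] = rank;

        /* Exchange the per-rank sizes, then compute displacements and Sum(Xj). */
        int *counts = malloc(nprocs * sizeof(int));
        int *displs = malloc(nprocs * sizeof(int));
        MPI_Allgather(&my_count, 1, MPI_INT, counts, 1, MPI_INT, MPI_COMM_WORLD);

        int total = 0;
        for (int j = 0; j < nprocs; j++) {
            displs[j] = total;
            total += counts[j];            /* result vector size = Sum(Xj) */
        }

        int *recvbuf = malloc(total * sizeof(int));
        MPI_Allgatherv(sendbuf, my_count, MPI_INT,
                       recvbuf, counts, displs, MPI_INT, MPI_COMM_WORLD);

        if (rank == 0)
            printf("each rank now holds a result vector of %d elements\n", total);

        free(sendbuf); free(counts); free(displs); free(recvbuf);
        MPI_Finalize();
        return 0;
    }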


Reference is now made to FIG. 1, which is a simplified pictorial illustration of a high-performance computing system, generally designated 100, constructed and operative in accordance with an exemplary embodiment of the present invention. FIG. 1, by way of non-limiting example, depicts a particularly small system including only 4 processes 110; it is appreciated that, in general, a much larger number of processes would be used.


In FIG. 1, each of the 4 processes 110 (designated P1, P2, P3, and P4) is depicted as sending vectors 120 (designated X1, X2, X3, and X4; corresponding respectively to P1, P2, P3, and P4), the vectors 120 being various sizes, for an operation to be carried out. An aggregation node 130 gathers all of the data received (the vectors 120, X1, X2, X3, and X4) from the 4 processes 110 (P1, P2, P3, and P4) into a single vector 140.


Assuming, as described above, that the sparse vector format includes indexes, the aggregation node will be able to generate an ordered vector as depicted in FIG. 1. In general, if (by way of non-limiting example) indexes range from 0 to 100, the resulting vector may be sorted in order of the indexes. This feature makes it easier to find the data relating to a given index.
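
A minimal sketch of such index ordering, using the illustrative types above, is to sort the collected elements by index before concatenation:

    #include <stdlib.h>

    /* Compare two collected elements by index (for qsort). */
    static int by_index(const void *a, const void *b)
    {
        const struct sparse_element *ea = *(const struct sparse_element *const *)a;
        const struct sparse_element *eb = *(const struct sparse_element *const *)b;
        return (ea->index > eb->index) - (ea->index < eb->index);
    }

    /* Sort in place; concatenating the payloads in this order yields an
     * index-ordered combined vector (X1, X2, X3, X4 in the FIG. 1 example). */
    static void order_by_index(struct sparse_element **elems, size_t count)
    {
        qsort(elems, count, sizeof(*elems), by_index);
    }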


The aggregation operation will be carried out (in a manner more complex than that shown for purposes of simplicity of illustration in FIG. 1) on multi-level aggregation, as is usual in SHARP (see US Published Patent Application 2017/0063613 of Bloch et al, the disclosure of which has been incorporated herein by reference), thus achieving scalability and hierarchical operation.


In the general case, the aggregation node will not assume that the indexes are consecutive; thus, in the particular non-limiting example of FIG. 1, not all of the vectors 120 need be present.


Reference is now made to FIG. 2, which is a simplified pictorial illustration of a high-performance computing system constructed and operative in accordance with an alternative exemplary embodiment of the present invention. The system of FIG. 2, generally designated 200, operates similarly to the system of FIG. 1, with a plurality of processes 210 sending vectors 220 towards aggregation nodes 230. In the system of FIG. 2, two different aggregation nodes 230 are shown as receiving the vectors 220 from the processes 210; this type of architecture allows better scaling. It will be appreciated that, while only two aggregation nodes 230 are shown, in practice a much larger number of aggregation nodes, arranged in a tree structure, may be used (see US Published Patent Application 2017/0063613 of Bloch et al, the disclosure of which has been incorporated herein by reference, for a more detailed discussion).


In the system of FIG. 2, an aggregation/root node 235 ultimately receives all of the vectors 220 and produces a single combined vector 240.


Reference is now made to FIG. 3A, which is a simplified pictorial illustration showing distribution of data, in a system similar to that of FIG. 1, to all processes; and to FIG. 3B, which is a simplified pictorial illustration showing distribution of data, in a system similar to that of FIG. 1, to a single process.



FIGS. 3A and 3B show a system, generally designated 300, which operates similarly to the system 100 of FIG. 1. For purposes of simplicity of illustration, the systems of FIGS. 3A and 3B are based on the system of FIG. 1, it being appreciated that systems based on the system of FIG. 2 may alternatively be used.


In FIGS. 3A and 3B, the systems are shown after the processes 310 have already sent their individual vectors (not shown) to the aggregation node 330, and the combined vector 340 has been produced. In FIG. 3A, a situation is depicted in which the combined vector 340 is sent to all of the processes 310 (as would be the case, by way of non-limiting example, in an all-gather operation); while in FIG. 3B, a situation is depicted in which the combined vector 340 is sent to only one of the processes 310 (as would be the case, by way of non-limiting example, in a gather operation).


In FIGS. 3A and 3B, inter alia, input data is depicted as being provided in order (X1, X2, X3, X4). While providing input data in order may be optimal in certain preferred embodiments, it is appreciated that it is not necessary to provide input data in order.


It is appreciated that software components of the present invention may, if desired, be implemented in ROM (read only memory) form. The software components may, generally, be implemented in hardware, if desired, using conventional techniques. It is further appreciated that the software components may be instantiated, for example, as a computer program product or on a tangible medium. In some cases, it may be possible to instantiate the software components as a signal interpretable by an appropriate computer, although such an instantiation may be excluded in certain embodiments of the present invention.


It is appreciated that various features of the invention which are, for clarity, described in the contexts of separate embodiments may also be provided in combination in a single embodiment. Conversely, various features of the invention which are, for brevity, described in the context of a single embodiment may also be provided separately or in any suitable subcombination.


It will be appreciated by persons skilled in the art that the present invention is not limited by what has been particularly shown and described hereinabove. Rather, the scope of the invention is defined by the appended claims and equivalents thereof.

Claims
  • 1. A method for computation, comprising: in a high-performance computing system that runs an application in which multiple processes perform portions of work of the application, defining a network comprising the multiple processes and at least one aggregation node; in response to a data movement command, passing respective vectors of data having different, respective data sizes from the multiple processes over the network to the at least one aggregation node; at the at least one aggregation node, in response to the data movement command, receiving and concatenating the respective vectors to generate a result vector having a result vector size equal to a sum of the respective data sizes of the received vectors; and outputting the result vector from the at least one aggregation node to the network, wherein defining the network comprises providing multiple aggregation nodes, comprising a root node, which outputs the result vector, and at least two intermediate aggregation nodes, each of which receives and concatenates the respective vectors from respective ones of the processes.
  • 2. The method according to claim 1, wherein the data movement command comprises a gather command.
  • 3. The method according to claim 1, wherein outputting the result vector comprises sending the result vector to the multiple processes.
  • 4. The method according to claim 1, wherein defining the network comprises providing a hierarchical tree having nodes corresponding to the multiple processes.
  • 5. The method according to claim 1, wherein passing the respective vectors comprises sending the data from the multiple processes with respective indexes, and wherein concatenating the respective vectors comprises adding the data to the result vector according to the respective indexes.
  • 6. The method according to claim 5, wherein sending the data comprises providing indications from the multiple processes to the at least one aggregation node of the respective data sizes of the respective vectors.
  • 7. A high-performance computing system comprising multiple nodes, which are configured to run an application in which multiple processes perform portions of work of the application, wherein a network comprising the multiple processes and at least one aggregation node is defined in the system, such that in response to a data movement command, the multiple processes pass respective vectors of data having different, respective data sizes over the network to the at least one aggregation node, and wherein the at least one aggregation node, in response to the data movement command, receives and concatenates the respective vectors to generate and output a result vector to the network having a result vector size equal to a sum of the respective data sizes of the received vectors, wherein the network comprises multiple aggregation nodes, including a root node, which outputs the result vector, and at least two intermediate aggregation nodes, each of which receives and concatenates the respective vectors from respective ones of the processes.
  • 8. The system according to claim 7, wherein the data movement command comprises a gather command.
  • 9. The system according to claim 7, wherein the at least one aggregation node is configured to send the result vector to the multiple processes.
  • 10. The system according to claim 7, wherein the network is defined as a hierarchical tree in which the nodes correspond to the multiple processes.
  • 11. The system according to claim 7, wherein the multiple processes pass the respective vectors to the at least one aggregation node together with respective indexes, and wherein the at least one aggregation node concatenates the respective vectors according to the respective indexes.
  • 12. The system according to claim 11, wherein the multiple processes provide indications to the at least one aggregation node of the respective data sizes of the respective vectors.
PRIORITY CLAIM

The present application claims priority from U.S. Provisional Patent Application Ser. No. 62/807,266 of Levi et al., filed 19 Feb. 2019.

US Referenced Citations (214)
Number Name Date Kind
4933969 Marshall et al. Jun 1990 A
5068877 Near et al. Nov 1991 A
5325500 Bell et al. Jun 1994 A
5353412 Douglas et al. Oct 1994 A
5404565 Gould Apr 1995 A
5606703 Brady et al. Feb 1997 A
5944779 Blum Aug 1999 A
6041049 Brady Mar 2000 A
6370502 Wu Apr 2002 B1
6483804 Muller et al. Nov 2002 B1
6507562 Kadansky Jan 2003 B1
6728862 Wilson Apr 2004 B1
6857004 Howard et al. Feb 2005 B1
6937576 Di Benedetto et al. Aug 2005 B1
7102998 Golestani Sep 2006 B1
7124180 Ranous Oct 2006 B1
7164422 Wholey, III et al. Jan 2007 B1
7171484 Krause et al. Jan 2007 B1
7313582 Bhanot et al. Dec 2007 B2
7327693 Rivers et al. Feb 2008 B1
7336646 Muller Feb 2008 B2
7346698 Hannaway Mar 2008 B2
7555549 Campbell et al. Jun 2009 B1
7613774 Caronni et al. Nov 2009 B1
7636424 Halikhedkar et al. Dec 2009 B1
7636699 Stanfill Dec 2009 B2
7738443 Kumar Jun 2010 B2
8213315 Crupnicoff et al. Jul 2012 B2
8380880 Gulley et al. Feb 2013 B2
8510366 Anderson et al. Aug 2013 B1
8738891 Karandikar May 2014 B1
8761189 Shachar et al. Jun 2014 B2
8768898 Trimmer et al. Jul 2014 B1
8775698 Archer et al. Jul 2014 B2
8811417 Bloch et al. Aug 2014 B2
9110860 Shahar Aug 2015 B2
9189447 Faraj Nov 2015 B2
9294551 Froese et al. Mar 2016 B1
9344490 Bloch et al. May 2016 B2
9563426 Bent et al. Feb 2017 B1
9626329 Howard Apr 2017 B2
9756154 Jiang Sep 2017 B1
10015106 Florissi et al. Jul 2018 B1
10158702 Bloch et al. Dec 2018 B2
10284383 Bloch et al. May 2019 B2
10296351 Kohn et al. May 2019 B1
10305980 Gonzalez et al. May 2019 B1
10318306 Kohn et al. Jun 2019 B1
10425350 Florissi Sep 2019 B1
10521283 Shuler et al. Dec 2019 B2
10541938 Timmerman et al. Jan 2020 B1
10621489 Appuswamy et al. Apr 2020 B2
20020010844 Noel et al. Jan 2002 A1
20020035625 Tanaka Mar 2002 A1
20020150094 Cheng et al. Oct 2002 A1
20020150106 Kagan et al. Oct 2002 A1
20020152315 Kagan et al. Oct 2002 A1
20020152327 Kagan et al. Oct 2002 A1
20020152328 Kagan et al. Oct 2002 A1
20030018828 Craddock et al. Jan 2003 A1
20030061417 Craddock et al. Mar 2003 A1
20030065856 Kagan et al. Apr 2003 A1
20040062258 Grow et al. Apr 2004 A1
20040078493 Blumrich et al. Apr 2004 A1
20040120331 Rhine et al. Jun 2004 A1
20040123071 Stefan et al. Jun 2004 A1
20040252685 Kagan et al. Dec 2004 A1
20040260683 Chan Dec 2004 A1
20050097300 Gildea et al. May 2005 A1
20050122329 Janus Jun 2005 A1
20050129039 Biran et al. Jun 2005 A1
20050131865 Jones et al. Jun 2005 A1
20050281287 Ninomi et al. Dec 2005 A1
20060282838 Gupta et al. Dec 2006 A1
20070127396 Jain et al. Jun 2007 A1
20070162236 Lamblin Jul 2007 A1
20080104218 Liang et al. May 2008 A1
20080126564 Wilkinson May 2008 A1
20080168471 Benner et al. Jul 2008 A1
20080181260 Vonog et al. Jul 2008 A1
20080192750 Ko et al. Aug 2008 A1
20080244220 Lin et al. Oct 2008 A1
20080263329 Archer et al. Oct 2008 A1
20080288949 Bohra et al. Nov 2008 A1
20080298380 Rittmeyer et al. Dec 2008 A1
20080307082 Cai et al. Dec 2008 A1
20090037377 Archer et al. Feb 2009 A1
20090063816 Arimilli et al. Mar 2009 A1
20090063817 Arimilli et al. Mar 2009 A1
20090063891 Arimilli et al. Mar 2009 A1
20090182814 Tapolcai et al. Jul 2009 A1
20090247241 Gollnick et al. Oct 2009 A1
20090292905 Faraj Nov 2009 A1
20100017420 Archer et al. Jan 2010 A1
20100049836 Kramer Feb 2010 A1
20100074098 Zeng et al. Mar 2010 A1
20100095086 Eichenberger et al. Apr 2010 A1
20100185719 Howard Jul 2010 A1
20100241828 Yu et al. Sep 2010 A1
20110060891 Jia Mar 2011 A1
20110066649 Berlyant et al. Mar 2011 A1
20110093258 Xu Apr 2011 A1
20110119673 Bloch et al. May 2011 A1
20110173413 Chen et al. Jul 2011 A1
20110219208 Asaad Sep 2011 A1
20110238956 Arimilli et al. Sep 2011 A1
20110258245 Blocksome et al. Oct 2011 A1
20110276789 Chambers et al. Nov 2011 A1
20120063436 Thubert et al. Mar 2012 A1
20120117331 Krause et al. May 2012 A1
20120131309 Johnson May 2012 A1
20120216021 Archer et al. Aug 2012 A1
20120254110 Takemoto Oct 2012 A1
20130117548 Grover et al. May 2013 A1
20130159410 Lee et al. Jun 2013 A1
20130318525 Palanisamy et al. Nov 2013 A1
20130336292 Kore et al. Dec 2013 A1
20140019574 Cardona et al. Jan 2014 A1
20140033217 Vajda Jan 2014 A1
20140047341 Breternitz et al. Feb 2014 A1
20140095779 Forsyth et al. Apr 2014 A1
20140122831 Uliel et al. May 2014 A1
20140189308 Hughes et al. Jul 2014 A1
20140211804 Makikeni et al. Jul 2014 A1
20140258438 Ayoub Sep 2014 A1
20140280420 Khan Sep 2014 A1
20140281370 Khan Sep 2014 A1
20140362692 Wu et al. Dec 2014 A1
20140365548 Mortensen Dec 2014 A1
20150106578 Warfield et al. Apr 2015 A1
20150143076 Khan May 2015 A1
20150143077 Khan May 2015 A1
20150143078 Khan et al. May 2015 A1
20150143079 Khan May 2015 A1
20150143085 Khan May 2015 A1
20150143086 Khan May 2015 A1
20150154058 Miwa et al. Jun 2015 A1
20150178211 Hiramoto Jun 2015 A1
20150180785 Annamraju Jun 2015 A1
20150188987 Reed et al. Jul 2015 A1
20150193271 Archer et al. Jul 2015 A1
20150212972 Boettcher et al. Jul 2015 A1
20150269116 Raikin et al. Sep 2015 A1
20150278347 Meyer Oct 2015 A1
20150365494 Cardona et al. Dec 2015 A1
20150379022 Puig et al. Dec 2015 A1
20160055225 Xu et al. Feb 2016 A1
20160092362 Barron et al. Mar 2016 A1
20160105494 Reed et al. Apr 2016 A1
20160112531 Milton et al. Apr 2016 A1
20160117277 Raindel et al. Apr 2016 A1
20160179537 Kunzman Jun 2016 A1
20160219009 French Jul 2016 A1
20160248656 Anand et al. Aug 2016 A1
20160299872 Vaidyanathan et al. Oct 2016 A1
20160342568 Burchard et al. Nov 2016 A1
20160352598 Reinhardt et al. Dec 2016 A1
20160364350 Sanghi et al. Dec 2016 A1
20170063613 Bloch Mar 2017 A1
20170093715 McGhee et al. Mar 2017 A1
20170116154 Palmer et al. Apr 2017 A1
20170187496 Shalev et al. Jun 2017 A1
20170187589 Pope et al. Jun 2017 A1
20170187629 Shalev et al. Jun 2017 A1
20170187846 Shalev et al. Jun 2017 A1
20170199844 Burchard et al. Jul 2017 A1
20170262517 Horowitz Sep 2017 A1
20170344589 Kafai Nov 2017 A1
20180004530 Vorbach Jan 2018 A1
20180046901 Xie et al. Feb 2018 A1
20180047099 Bonig et al. Feb 2018 A1
20180089278 Bhattacharjee et al. Mar 2018 A1
20180091442 Chen et al. Mar 2018 A1
20180097721 Matsui et al. Apr 2018 A1
20180173673 Daglis et al. Jun 2018 A1
20180262551 Demeyer et al. Sep 2018 A1
20180285316 Thorson et al. Oct 2018 A1
20180287928 Levi et al. Oct 2018 A1
20180302324 Kasuya Oct 2018 A1
20180321912 Li et al. Nov 2018 A1
20180321938 Boswell et al. Nov 2018 A1
20180349212 Liu et al. Dec 2018 A1
20180367465 Levi Dec 2018 A1
20180375781 Chen et al. Dec 2018 A1
20190018805 Benisty Jan 2019 A1
20190026250 Das Sarma et al. Jan 2019 A1
20190044889 Serres et al. Feb 2019 A1
20190065208 Liu et al. Feb 2019 A1
20190068501 Schneider et al. Feb 2019 A1
20190102179 Fleming et al. Apr 2019 A1
20190102338 Tang et al. Apr 2019 A1
20190102640 Balasubramanian Apr 2019 A1
20190114533 Ng et al. Apr 2019 A1
20190121388 Knowles et al. Apr 2019 A1
20190138638 Pal et al. May 2019 A1
20190147092 Pal et al. May 2019 A1
20190149488 Bansal et al. May 2019 A1
20190171612 Shahar et al. Jun 2019 A1
20190235866 Das Sarma et al. Aug 2019 A1
20190303168 Fleming, Jr. et al. Oct 2019 A1
20190303263 Fleming, Jr. et al. Oct 2019 A1
20190324431 Cella et al. Oct 2019 A1
20190339688 Cella et al. Nov 2019 A1
20190347099 Eapen et al. Nov 2019 A1
20190369994 Parandeh Afshar et al. Dec 2019 A1
20190377580 Vorbach Dec 2019 A1
20190379714 Levi et al. Dec 2019 A1
20200005859 Chen et al. Jan 2020 A1
20200034145 Bainville et al. Jan 2020 A1
20200057748 Danilak Feb 2020 A1
20200103894 Cella et al. Apr 2020 A1
20200106828 Elias et al. Apr 2020 A1
20200137013 Jin et al. Apr 2020 A1
20210203621 Ylisirnio et al. Jul 2021 A1
Non-Patent Literature Citations (41)
Entry
Danalis et al., “PTG: an abstraction for unhindered parallelism”, 2014 Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, pp. 1-10, Nov. 17, 2014.
Cosnard et al., “Symbolic Scheduling of Parameterized Task Graphs on Parallel Machines,” Combinatorial Optimization book series (COOP, vol. 7), pp. 217-243, year 2000.
Jeannot et al., “Automatic Multithreaded Parallel Program Generation for Message Passing Multiprocessors using paramerized Task Graphs”, World Scientific, pp. 1-8, Jul. 23, 2001.
Stone, “An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations,” Journal of the Association for Computing Machinery, vol. 10, No. 1, pp. 27-38, Jan. 1973.
Kogge et al., “A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations,” IEEE Transactions on Computers, vol. C-22, No. 8, pp. 786-793, Aug. 1973.
Hoefler et al., “Message Progression in Parallel Computing—To Thread or not to Thread?”, 2008 IEEE International Conference on Cluster Computing, pp. 1-10, Tsukuba, Japan, Sep. 29-Oct. 1, 2008.
U.S. Appl. No. 16/357,356 office action dated May 14, 2020.
European Application # 20156490.3 search report dated Jun. 25, 2020.
Bruck et al., “Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems”, IEEE Transactions on Parallel and Distributed Systems, vol. 8, No. 11, pp. 1143-1156, Nov. 1997.
Bruck et al., “Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems”, Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures, pp. 298-309, Aug. 1, 1994.
Chiang et al., “Toward supporting data parallel programming on clusters of symmetric multiprocessors”, Proceedings International Conference on Parallel and Distributed Systems, pp. 607-614, Dec. 14, 1998.
Gainaru et al., “Using InfiniBand Hardware Gather-Scatter Capabilities to Optimize MPI All-to-All”, Proceedings of the 23rd European MPI Users' Group Meeting, pp. 167-179, Sep. 2016.
Pjesivac-Grbovic et al., “Performance Analysis of MPI Collective Operations”, 19th IEEE International Parallel and Distributed Processing Symposium, pp. 1-19, 2015.
Mellanox Technologies, “InfiniScale IV: 36-port 40GB/s Infiniband Switch Device”, pp. 1-2, year 2009.
Mellanox Technologies Inc., “Scaling 10Gb/s Clustering at Wire-Speed”, pp. 1-8, year 2006.
IEEE 802.1D Standard “IEEE Standard for Local and Metropolitan Area Networks—Media Access Control (MAC) Bridges”, IEEE Computer Society, pp. 1-281, Jun. 9, 2004.
IEEE 802.1AX Standard “IEEE Standard for Local and Metropolitan Area Networks—Link Aggregation”, IEEE Computer Society, pp. 1-163, Nov. 3, 2008.
Turner et al., “Multirate Clos Networks”, IEEE Communications Magazine, pp. 1-11, Oct. 2003.
Thayer School of Engineering, “An Slightly Edited Local Elements of Lectures 4 and 5”, Dartmouth College, pp. 1-5, Jan. 15, 1998 http://people.seas.harvard.edu/˜jones/cscie129/nu_lectures/lecture11/switching/clos_network/clos_network.html.
“MPI: A Message-Passing Interface Standard,” Message Passing Interface Forum, version 3.1, pp. 1-868, Jun. 4, 2015.
Coti et al., “MPI Applications on Grids: a Topology Aware Approach,” Proceedings of the 15th International European Conference on Parallel and Distributed Computing (EuroPar'09), pp. 1-12, Aug. 2009.
Petrini et al., “The Quadrics Network (QsNet): High-Performance Clustering Technology,” Proceedings of the 9th IEEE Symposium on Hot Interconnects (Hotl'01), pp. 1-6, Aug. 2001.
Sancho et al., “Efficient Offloading of Collective Communications in Large-Scale Systems,” Proceedings of the 2007 IEEE International Conference on Cluster Computing, pp. 1-10, Sep. 17-20, 2007.
Infiniband Trade Association, “InfiniBand™ Architecture Specification”, release 1.2.1, pp. 1-1727, Jan. 2008.
InfiniBand Architecture Specification, vol. 1, Release 1.2.1, pp. 1-1727, Nov. 2007.
Deming, “Infiniband Architectural Overview”, Storage Developer Conference, pp. 1-70, year 2013.
Fugger et al., “Reconciling fault-tolerant distributed computing and systems-on-chip”, Distributed Computing, vol. 24, Issue 6, pp. 323-355, Jan. 2012.
Wikipedia, “System on a chip”, pp. 1-4, Jul. 6, 2018.
Villavieja et al., “On-chip Distributed Shared Memory”, Computer Architecture Department, pp. 1-10, Feb. 3, 2011.
Ben-Moshe et al., U.S. Appl. No. 16/750,019, filed Jan. 23, 2020.
Chapman et al., “Introducing OpenSHMEM: SHMEM for the PGAS Community,” Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, pp. 1-4, Oct. 2010.
Priest et al., “You've Got Mail (YGM): Building Missing Asynchronous Communication Primitives”, IEEE International Parallel and Distributed Processing Symposium Workshops, pp. 221-230, year 2019.
Wikipedia, “Nagle's algorithm”, pp. 1-4, Dec. 12, 2019.
U.S. Appl. No. 16/430,457 Office Action dated Jul. 9, 2021.
Yang et al., “SwitchAgg: A Further Step Toward In-Network Computing,” 2019 IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking, pp. 36-45, Dec. 2019.
EP Application # 20216972 Search Report dated Jun. 11, 2021.
U.S. Appl. No. 16/789,458 Office Action dated Jun. 10, 2021.
U.S. Appl. No. 16/750,019 Office Action dated Jun. 15, 2021.
U.S. Appl. No. 17/147,487 Office Action dated Jun. 30, 2022.
U.S. Appl. No. 17/147,487 Office Action dated Nov. 29, 2022.
U.S. Appl. No. 17/495,824 Office Action dated Jan. 27, 2023.
Related Publications (1)
Number Date Country
20200265043 A1 Aug 2020 US
Provisional Applications (1)
Number Date Country
62807266 Feb 2019 US