The present invention, in exemplary embodiments thereof, relates to collective communication systems and methods, and particularly but not exclusively to message passing operations, and also particularly but not exclusively to all-to-all operations.
The present invention, in certain embodiments thereof, seeks to provide improved systems and methods for collective communication, and in particular, but not only, for message passing operations, including all-to-all operations.
There is thus provided in accordance with an exemplary embodiment of the present invention a method including providing a plurality of processes, each of the plurality of processes being configured to hold a block of data destined for others of the plurality of processes, providing at least one instance of data repacking circuitry including receiving circuitry configured to receive at least one block of data from at least one source process of the plurality of processes, repacking circuitry configured to repack received data in accordance with at least one destination process of the plurality of processes, and sending circuitry configured to send the repacked data to the at least one destination process of the plurality of processes, receiving a set of data for all-to-all data exchange, the set of data being configured as a matrix, the matrix being distributed among the plurality of processes, and transposing the data by each of the plurality of processes sending matrix data from the process to the repacking circuitry, and the repacking circuitry receiving, repacking, and sending the resulting matrix data to destination processes.
Further in accordance with an exemplary embodiment of the present invention the method also includes providing a control tree configured to control the plurality of processes and the repacking circuitry.
Still further in accordance with an exemplary embodiment of the present invention the control tree is further configured to receive registration messages from each of the plurality of processes, mark a given subgroup of the plurality of processes as ready for operation when registration messages have been received from all members of the given subgroup, when a given subgroup which is a source subgroup and a corresponding subgroup which is a destination subgroup are ready for operation, pair the given source subgroup and the given destination subgroup and assign the given source subgroup and the given destination subgroup to an instance of repacking circuitry, and notify each source subgroup and each destination subgroup when operations relating to each source subgroup and each destination subgroup have completed.
Additionally in accordance with an exemplary embodiment of the present invention the control tree is configured, in addition to pairing the given source subgroup and the given destination subgroup, to assign the given source subgroup and the given destination subgroup to an instance of data repacking circuitry.
Moreover in accordance with an exemplary embodiment of the present invention the method also includes assigning circuitry other than the control tree, the assigning circuitry being configured to assign the given source subgroup and the given destination subgroup to an instance of data repacking circuitry.
Further in accordance with an exemplary embodiment of the present invention the control tree includes a reduction tree.
There is also provided in accordance with another exemplary embodiment of the present invention apparatus including receiving circuitry configured to receive at least one block of data from at least one source process of a plurality of processes, each of the plurality of processes being configured to hold a block of data destined for others of the plurality of processes, at least one instance of data repacking circuitry configured to repack received data in accordance with at least one destination process of the plurality of processes, and sending circuitry configured to send the repacked data to the at least one destination process of the plurality of processes, the apparatus being configured to receive a set of data for all-to-all data exchange, the set of data being configured as a matrix, the matrix being distributed among the plurality of processes, and the apparatus being further configured to transpose the data by receiving, from each of the plurality of processes, matrix data from the process at the repacking circuitry, and the data repacking circuitry receiving, repacking, and sending the resulting matrix data to destination processes.
Further in accordance with an exemplary embodiment of the present invention the apparatus also includes a control tree configured to control the plurality of processes and the repacking circuitry.
Still further in accordance with an exemplary embodiment of the present invention the control tree is further configured to receive registration messages from each of the plurality of processes, mark a given subgroup of the plurality of processes as ready for operation when registration messages have been received from all members of the given subgroup, when a given subgroup which is a source subgroup and a corresponding subgroup which is a destination subgroup are ready for operation, pair the given source subgroup and the given destination subgroup and assign the given source subgroup and the given destination subgroup to an instance of data repacking circuitry, and notify each source subgroup and each destination subgroup when operations relating to each source subgroup and each destination subgroup have completed.
Additionally in accordance with an exemplary embodiment of the present invention the control tree is configured, in addition to pairing the given source subgroup and the given destination subgroup, to assign the given source subgroup and the given destination subgroup to a given instance of data repacking circuitry.
Moreover in accordance with an exemplary embodiment of the present invention the apparatus also includes assigning circuitry other than the control tree, the assigning circuitry being configured to assign the given source subgroup and the given destination subgroup to a given instance of data repacking circuitry.
Further in accordance with an exemplary embodiment of the present invention the control tree includes a reduction tree.
The present invention will be understood and appreciated more fully from the following detailed description, taken in conjunction with the drawings in which:
The all-to-all operation, defined in communication standards such as the Message Passing Interface (MPI) (Message Passing Interface Forum, 2015), is a collective data operation in which each process sends data to every other process in the collective group, and receives the same amount of data from each process in the group. The data sent to each process is of the same length, a, and is unique, originating from distinct memory locations. In communications standards such as MPI, the concept of operations on processes is decoupled from any particular hardware infrastructure. A collective group, as discussed herein, refers to a group of processes over which a (collective) operation is defined. In the MPI specification a collective group is called a “communicator”, while in OpenSHMEM (see, for example, www.openshmem.org/site/) a collective group is called a “team”.
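By way of non-limiting illustration only, the data movement defined by the all-to-all operation may be sketched in software as follows. The sketch simulates all processes within a single Python program; the function name `all_to_all`, the flat list buffers, and the element labels are assumptions made for illustration and are not part of the MPI interface. As in MPI, the block a process addresses to itself is included in the exchange.

```python
def all_to_all(send_buffers, a):
    """Simulate all-to-all over N processes with flat buffers.

    send_buffers[i] is the flat send buffer of process i, holding N
    consecutive blocks of length a; block j is destined for process j.
    Returns the flat receive buffer of every process.
    """
    n = len(send_buffers)
    for buf in send_buffers:
        if len(buf) != n * a:
            raise ValueError("each send buffer must hold N blocks of length a")
    recv_buffers = []
    for j in range(n):
        recv = []
        for i in range(n):
            # Block j of sender i becomes block i of receiver j.
            recv.extend(send_buffers[i][j * a:(j + 1) * a])
        recv_buffers.append(recv)
    return recv_buffers


# Three processes, block length a = 2. The label "s->d.k" marks the k-th
# element of the block that process s sends to process d.
send = [[f"{s}->{d}.{k}" for d in range(3) for k in range(2)] for s in range(3)]
recv = all_to_all(send, a=2)
print(recv[1])
```

In this sketch, the receive buffer of process 1 holds the blocks addressed to process 1 by processes 0, 1, and 2, in that order, each block keeping its internal element order.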
Reference is now made to
Exemplary methods of operation of the system of
Reference is now made to
The algorithms used to implement the all-to-all operation tend to fall into two categories: direct exchange algorithms and aggregation algorithms.
All-to-all aggregation algorithms are aimed at reducing latency costs, which dominate short data transfers. The all-to-all aggregation algorithms employ data forwarding approaches, to cut down on the number of messages sent, thus reducing latency costs. Such approaches gather/scatter the data from/to multiple sources, producing fewer larger data transfers, but send a given piece of data multiple times. As the number of communication contexts participating in the collective operation becomes too large, aggregation techniques become less efficient than a direct data exchange; this is due to the growing cost of transferring a given piece of data multiple times. The all-to-all algorithms take advantage of the fact that the data length a is a constant of the algorithm, providing sufficient global knowledge to coordinate data exchange through intermediate processes.
The direct exchange algorithms are typically used for all-to-all instances where the length of data being transferred, a, is above a threshold where bandwidth contributions dominate, or when the aggregation techniques aggregate data from too many processes, causing the aggregation techniques to be inefficient.
With growing system sizes, the need to support efficient implementations of small data all-to-all exchanges is increasing, as this is a data exchange pattern used by many High-Performance Computing (HPC) applications. The present invention, in exemplary embodiments thereof, presents a new all-to-all algorithm designed to increase the efficiency of small data exchanges over the full range of communicator sizes. This includes a new aggregation-based algorithm suitable for small data individualized all-to-all data exchange, which may be viewed as transposing a distributed matrix. While references to transposing, in various grammatical forms, are used throughout the present specification and claims, it is appreciated that transposing comprises a way to conceptualize algorithms in accordance with exemplary embodiments of the present invention; for example and without limiting the generality of the foregoing statement, there may be no such conceptualization at the level of (for example) the MPI standard. Such transposing comprises, in exemplary embodiments, changing the position of blocks relative to other blocks, without changing the structure within any block. The algorithms described herein with reference to exemplary embodiments of the present invention benefit from the large amount of concurrency available in the network and are designed to be simple and efficient for implementation by network hardware. Both switching hardware and Host-Channel-Adapter implementations are, in exemplary embodiments, targeted by this new design.
The individualized all-to-all-v/w algorithm is in certain respects similar to the individualized all-to-all data exchange. The individualized all-to-all-w algorithm differs from the all-to-all-v algorithm, in that the data type of each individual transfer may be unique across the function. A change is made to the all-to-all algorithm to support this collective operation. More specifically regarding data type: data being transferred using the MPI standard's interface specifies a data type for all data, such as MPI_DOUBLE for a double precision word. The alltoallv interface specifies that all data elements are of the same data type. Alltoallw allows a different data type to be specified for each block of data, such as, for example, specifying a data type for data going from process i to process j.
The all-to-all-v/w operation is used for each process to exchange unique data with every other process in the group of processes participating in this collective operation. The size of data exchanged between two given processes may be asymmetric, and each pair may have a different data pattern than other pairs, with large variations in the data sizes being exchanged. A given rank need only have local, API-level information on the data exchanges in which it participates.
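By way of non-limiting illustration only, the variable-size exchange may be sketched as follows. The per-destination counts and displacements are modeled loosely on the count and displacement arguments of the MPI alltoallv interface; the function name `all_to_all_v`, the in-program simulation of all processes, and the sample data are assumptions made for illustration. The receive side is returned as a list of blocks, one per source, rather than as a flat buffer, to keep the sketch short.

```python
def all_to_all_v(send_buffers, send_counts, send_displs):
    """Simulate a variable-size all-to-all exchange.

    send_counts[i][j] is the number of elements process i sends to
    process j, and send_displs[i][j] is the offset of that block within
    send_buffers[i]. Blocks need not appear in increasing order.
    Returns recv[j][i], the block process j receives from process i.
    """
    n = len(send_buffers)
    recv = []
    for j in range(n):
        blocks = []
        for i in range(n):
            start = send_displs[i][j]
            blocks.append(send_buffers[i][start:start + send_counts[i][j]])
        recv.append(blocks)
    return recv


send_buffers = [["a", "b", "c", "d", "e", "f"], ["x", "y", "z"], ["p", "q"]]
send_counts = [[1, 2, 3], [2, 0, 1], [0, 1, 1]]
# Process 1 stores its block for process 2 before its block for process 0.
send_displs = [[0, 1, 3], [1, 0, 0], [0, 1, 0]]

recv = all_to_all_v(send_buffers, send_counts, send_displs)
print(recv[0])
```

The exchange is asymmetric in this example: process 0 sends one element to process 0 and three elements to process 2, while process 2 sends nothing to process 0.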
The individualized all-to-all-v/w algorithm aimed at the hardware implementation is somewhat similar to the individualized all-to-all algorithm, but requires more meta-data describing the detailed data lengths for implementation. In addition, only messages below a prespecified threshold are handled with this algorithm. A direct data exchange is used for the larger messages.
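By way of non-limiting illustration only, the division of transfers between the aggregation-based path and direct data exchange may be sketched as follows. The function name `split_by_threshold`, the unit of the counts, the threshold value, and the decision to skip empty transfers are all assumptions made for illustration; the present description does not prescribe a particular threshold.

```python
def split_by_threshold(send_counts, threshold):
    """Partition (source, destination, size) transfers by message size.

    Transfers strictly below the threshold are routed to the
    aggregation-based path; all others use direct data exchange.
    Empty transfers are skipped (an assumption of this sketch).
    """
    aggregated, direct = [], []
    for src, row in enumerate(send_counts):
        for dst, count in enumerate(row):
            if count == 0:
                continue
            if count < threshold:
                aggregated.append((src, dst, count))
            else:
                direct.append((src, dst, count))
    return aggregated, direct


# Hypothetical per-pair sizes for three processes and a hypothetical threshold.
counts = [[0, 8, 4096], [16, 0, 32], [2048, 64, 0]]
aggregated, direct = split_by_threshold(counts, threshold=1024)
print(aggregated)
print(direct)
```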
Previously, the algorithms for all-to-all function implementation have fallen into two broad categories:
The base algorithm definition describes data exchange between all pairs of processes in the collective group, or MPI communicator in the MPI definition. The term “base algorithm” refers to an algorithm definition at the interface level: logically what the function is/does, not how the function result is accomplished. Thus, by way of particular non-limiting example, the base description for alltoallv would be each process sending a block of data to all processes in the group. In certain exemplary embodiments of the present invention, by way of particular non-limiting example, methods are described for carrying out particular functions by aggregating data and by using communication patterns described herein. In general, the algorithm definition conceptually requires O(N²) data exchanges, where N is the group size.
Reference is now made to
The direct data exchange implementation of the function is the simplest implementation of the all-to-all function. A naïve implementation puts many messages on the network and has the potential to severely degrade network utilization by causing congestion and end-point n→1 contention. (The term “end-point”, as used herein, denotes an entity, such as a process or thread, which contributes data to a collective operation). As a result, algorithms that implement the direct data exchange use a communication pattern, such as pair-wise exchange, as shown in
Aggregation algorithms (Gainaru et al., 2016) have been used to implement the small data aggregation, with the Bruck algorithm (Bruck et al., 1997) being perhaps the most well-known algorithm in this class. The number of data exchanges in which each process is involved using this approach is O((k−1)*log_k(N)), where N is the collective group size and k is the algorithm radix.
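By way of non-limiting illustration only, the difference in growth between direct exchange and radix-k aggregation may be made concrete with a short calculation. The sketch takes the asymptotic expression at face value, using a constant factor of one and rounding the logarithm up to an integer; both choices are assumptions, since the order notation fixes only the growth rate. The function names are hypothetical.

```python
def ceil_log(n, k):
    """Smallest integer e such that k**e >= n, using integer arithmetic."""
    e, power = 0, 1
    while power < n:
        power *= k
        e += 1
    return e


def direct_exchanges_per_process(n):
    """Each process exchanges data with every other process once."""
    return n - 1


def aggregated_exchanges_per_process(n, k):
    """(k - 1) * ceil(log_k(N)), with the constant factor taken as one."""
    return (k - 1) * ceil_log(n, k)


n = 1024
print(direct_exchanges_per_process(n))          # 1023
print(aggregated_exchanges_per_process(n, 2))   # 10
print(aggregated_exchanges_per_process(n, 4))   # 15
```

The reduction in exchanges per process is obtained at the cost of forwarding a given piece of data more than once, which is the trade-off noted above for aggregation algorithms.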
In exemplary embodiments of the present invention, the all-to-all and all-to-all-v/w algorithm is aimed at optimizing the small data exchange by:
The present invention, in exemplary embodiments thereof, may be viewed as using aggregation points within the network to collect data from a non-contiguous portion of a distributed matrix, transpose the data, and send the data to their destinations.
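By way of non-limiting illustration only, the role of an aggregation point may be sketched in software as follows. The sketch models only the repacking step: blocks collected from a set of source processes are regrouped into one outgoing message per destination process. The function name `repack`, the dictionary keyed by (source, destination), and the sample process sets are assumptions made for illustration and do not describe the repacking circuitry itself.

```python
def repack(received, sources, destinations):
    """Regroup received blocks by destination.

    received[(s, d)] is the block sent by source process s and destined
    for process d. The result maps each destination d to a list of
    (source, block) entries, i.e. one outgoing message per destination.
    Block contents are passed through unchanged.
    """
    outgoing = {}
    for d in destinations:
        outgoing[d] = [(s, received[(s, d)]) for s in sources]
    return outgoing


# A non-contiguous portion of the distributed matrix: sources {0, 3},
# destinations {1, 2}.
sources, destinations = [0, 3], [1, 2]
received = {(s, d): f"block {s}->{d}" for s in sources for d in destinations}

outgoing = repack(received, sources, destinations)
print(outgoing[1])
```

In this sketch four incoming blocks from two sources become two outgoing messages, one per destination, which corresponds to transposing a 2×2 submatrix of blocks.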
In exemplary embodiments, the invention may be summarized as follows:
In all-to-all and all-to-all-v/w algorithms, each process has a unique block of data destined for each other process in the group. The primary way all-to-all differs from all-to-all-v is in the data layout pattern. All-to-all data blocks are all of the same size, whereas the all-to-all-v/w algorithms support data blocks of differing sizes, and the data blocks need not be ordered in a monotonically increasing order within the user buffer.
The layout of blocks of data for the all-to-all algorithm may be viewed as a distributed matrix, with the all-to-all algorithm transposing this block distribution. It is important to note that, in exemplary embodiments of the present invention, the data within each block is not rearranged in the transposition, just the order of the data blocks themselves.
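By way of non-limiting illustration only, the block-level transposition may be sketched as follows. The sketch adopts the convention that each column of the matrix holds the blocks owned by one process and each row corresponds to a destination; the function name `transpose_blocks` and the block labels are assumptions made for illustration.

```python
def transpose_blocks(matrix):
    """Transpose a square matrix whose elements are opaque blocks."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("block matrix must be square")
    # Only block positions change; each block is moved as a unit.
    return [[matrix[c][r] for c in range(n)] for r in range(n)]


n = 3
# before[dest][owner] is the two-element block that process `owner`
# holds for process `dest`.
before = [[[f"{owner}{dest}a", f"{owner}{dest}b"] for owner in range(n)]
          for dest in range(n)]
after = transpose_blocks(before)

# Column 2 of `after` now holds the blocks destined for process 2.
print(after[0][2])
```

The block sent from process 0 to process 2 moves from column 0 to column 2, while its two elements keep their original order.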
After the all-to-all operation is applied to the data in the example of
With the all-to-all-v/w algorithms a similar data transposition is performed. Such transform differs as follows:
The actual matrix transform is performed over sub-blocks of data. The term “actual matrix transform” is used herein because the blocks of data transfer defined by the operation can be viewed as a matrix transform, when each element in the matrix is a block of data. The columns of the matrix are the blocks of data owned by each process. Each process has a block of data associated with every process in the group, so the matrix can be viewed as a square matrix. For alltoall, the size of all the blocks is identical; for alltoall-v and alltoall-w, block sizes may be different. From a block-like view of the data layout (not the actual size of each block) alltoall-v and alltoall-w still are square.
For the purpose of the transform, horizontal submatrix dimension, dh, and vertical submatrix dimension, dv, are defined. The sub-block dimensions need not be an integer divisor of the full matrix dimension, and dh and dv need not be equal. Incomplete sub-blocks are permitted; that is, for a given group size, there are subgroups for which the ratio of the group size to the sub-block size is not an integer. This situation gives “leftover” blocks at the edges. By way of particular non-limiting example, such “leftover” blocks would be present in a matrix of size 11, with sub-blocks of size 3. Finally, the vertical and horizontal ranges of values in the full matrix need not be contiguous, e.g., when mapped onto the full matrix, such a submatrix may be distributed into several different contiguous blocks of data over the matrix.
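By way of non-limiting illustration only, the partition of a matrix of size 11 into sub-blocks of size 3 may be sketched as follows. The sketch restricts itself to contiguous index ranges for brevity, which is an assumption of the example only, since non-contiguous ranges are also permitted as stated above. The function names are hypothetical.

```python
def sub_block_ranges(n, d):
    """Split indices 0..n-1 into half-open ranges of length d.

    The last range is shorter when d does not divide n, which yields
    the "leftover" blocks at the edge of the matrix.
    """
    return [(start, min(start + d, n)) for start in range(0, n, d)]


def sub_blocks(n, dh, dv):
    """All (vertical range, horizontal range) sub-blocks of an n x n matrix."""
    return [(v, h)
            for v in sub_block_ranges(n, dv)
            for h in sub_block_ranges(n, dh)]


print(sub_block_ranges(11, 3))   # [(0, 3), (3, 6), (6, 9), (9, 11)]
print(len(sub_blocks(11, 3, 3))) # 16
```

The final range (9, 11) covers only two indices, so the sub-blocks along the last row and last column of the partition are incomplete.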
As an example, if we take dh=dv=2, and we use process groups {1,2}, {0,3} and {4,5} to sub-block the matrix,
The full end-to-end all-to-all is orchestrated, in exemplary embodiments of the present invention, using a reduction tree. As processes make a call to the collective operation, the reduction tree is used by each process to register with the collective operation. When all members of a sub-group have registered with the operation, the sub-group is marked as active. When both the source subgroup and the destination subgroup are active, that subgroup pair may be transposed.
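By way of non-limiting illustration only, the registration and pairing logic may be sketched in software as follows. The sketch collapses the reduction tree into a single Python object and does not model assignment to an instance of data repacking circuitry or completion notification; the class name `ControlTree`, its method names, and the subgroup labels are assumptions made for illustration.

```python
class ControlTree:
    """Tracks registrations and reports subgroup pairs ready to transpose."""

    def __init__(self, subgroups):
        # subgroups maps a subgroup label to the set of member ranks.
        self.members = {g: set(m) for g, m in subgroups.items()}
        self.registered = {g: set() for g in subgroups}
        self.rank_to_group = {r: g for g, m in self.members.items() for r in m}
        self.scheduled = set()

    def is_active(self, group):
        """A subgroup is active once all of its members have registered."""
        return self.registered[group] == self.members[group]

    def register(self, rank):
        """Record a registration; return newly ready (source, destination) pairs."""
        group = self.rank_to_group[rank]
        self.registered[group].add(rank)
        if not self.is_active(group):
            return []
        new_pairs = []
        for other in (g for g in self.members if self.is_active(g)):
            for pair in ((group, other), (other, group)):
                if pair not in self.scheduled:
                    self.scheduled.add(pair)
                    new_pairs.append(pair)
        return new_pairs


tree = ControlTree({"A": {1, 2}, "B": {0, 3}, "C": {4, 5}})
first = tree.register(1)    # subgroup A is still incomplete
second = tree.register(2)   # subgroup A becomes active
tree.register(0)
third = tree.register(3)    # subgroup B becomes active
print(first, second, third)
```

In this sketch a pair is reported as soon as both of its subgroups are active, so transposition of some sub-blocks may begin before every process has entered the collective operation.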
In certain exemplary embodiments of the present invention, the collective operation is executed in the following manner:
It is appreciated that software components of the present invention may, if desired, be implemented in ROM (read only memory) form. The software components may, generally, be implemented in hardware, if desired, using conventional techniques. It is further appreciated that the software components may be instantiated, for example: as a computer program product or on a tangible medium. In some cases, it may be possible to instantiate the software components as a signal interpretable by an appropriate computer, although such an instantiation may be excluded in certain embodiments of the present invention.
It is appreciated that various features of the invention which are, for clarity, described in the contexts of separate embodiments may also be provided in combination in a single embodiment. Conversely, various features of the invention which are, for brevity, described in the context of a single embodiment may also be provided separately or in any suitable subcombination.
It will be appreciated by persons skilled in the art that the present invention is not limited by what has been particularly shown and described hereinabove. Rather the scope of the invention is defined by the appended claims and equivalents thereof:
Number | Date | Country | Kind |
---|---|---|---|
20156490 | Feb 2020 | EP | regional |
The present application claims priority from U.S. Provisional Patent Application Ser. No. 62/809,786 of Graham et al, filed 25 Feb. 2019.
Entry |
---|
U.S. Appl. No. 16/357,356 office action dated May 14, 2020. |
European Application # 20156490.3 search report dated Jun. 25, 2020. |
Bruck et al., “Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems”, Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures, pp. 298-309, Aug. 1, 1994. |
Chiang et al., “Toward supporting data parallel programming on clusters of symmetric multiprocessors”, Proceedings International Conference on Parallel and Distributed Systems, pp. 607-614, Dec. 14, 1998. |
Chapman et al., “Introducing OpenSHMEM: SHMEM for the PGAS Community,” Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, pp. 1-4, Oct. 2010. |
Priest et al., “You've Got Mail (YGM): Building Missing Asynchronous Communication Primitives”, IEEE International Parallel and Distributed Processing Symposium Workshops, pp. 221-230, year 2019. |
Wikipedia, “Nagle's algorithm”, pp. 1-4, Dec. 12, 2019. |
Mellanox Technologies, “InfiniScale IV: 36-port 40GB/s Infiniband Switch Device”, pp. 1-2, year 2009. |
Mellanox Technologies Inc., “Scaling 10Gb/s Clustering at Wire-Speed”, pp. 1-8, year 2006. |
IEEE 802.1D Standard “IEEE Standard for Local and Metropolitan Area Networks—Media Access Control (MAC) Bridges”, IEEE Computer Society, pp. 1-281, Jun. 9, 2004. |
IEEE 802.1AX Standard “IEEE Standard for Local and Metropolitan Area Networks—Link Aggregation”, IEEE Computer Society, pp. 1-163, Nov. 3, 2008. |
Turner et al., “Multirate Clos Networks”, IEEE Communications Magazine, pp. 1-11, Oct. 2003. |
Thayer School of Engineering, “An Slightly Edited Local Elements of Lectures 4 and 5”, Dartmouth College, pp. 1-5, Jan. 15, 1998 http://people.seas.harvard.edu/~jones/cscie129/nu_lectures/lecture11/switching/clos_network/clos_network.html. |
“MPI: A Message-Passing Interface Standard,” Message Passing Interface Forum, version 3.1, pp. 1-868, Jun. 4, 2015. |
Coti et al., “MPI Applications on Grids: a Topology Aware Approach,” Proceedings of the 15th International European Conference on Parallel and Distributed Computing (EuroPar'09), pp. 1-12, Aug. 2009. |
Petrini et al., “The Quadrics Network (QsNet): High-Performance Clustering Technology,” Proceedings of the 9th IEEE Symposium on Hot Interconnects (HotI'01), pp. 1-6, Aug. 2001. |
Sancho et al., “Efficient Offloading of Collective Communications in Large-Scale Systems,” Proceedings of the 2007 IEEE International Conference on Cluster Computing, pp. 1-10, Sep. 17-20, 2007. |
InfiniBand Trade Association, “InfiniBand™ Architecture Specification”, release 1.2.1, pp. 1-1727, Jan. 2008. |
InfiniBand Architecture Specification, vol. 1, Release 1.2.1, pp. 1-1727, Nov. 2007. |
Deming, “Infiniband Architectural Overview”, Storage Developer Conference, pp. 1-70, year 2013. |
Fugger et al., “Reconciling fault-tolerant distributed computing and systems-on-chip”, Distributed Computing, vol. 24, Issue 6, pp. 323-355, Jan. 2012. |
Wikipedia, “System on a chip”, pp. 1-4, Jul. 6, 2018. |
Villavieja et al., “On-chip Distributed Shared Memory”, Computer Architecture Department, pp. 1-10, Feb. 3, 2011. |
Ben-Moshe et al., U.S. Appl. No. 16/750,019, filed Jan. 23, 2020. |
Bruck et al., “Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems”, IEEE Transactions on Parallel and Distributed Systems, vol. 8, No. 11, pp. 1143-1156, Nov. 1997. |
Gainaru et al., “Using InfiniBand Hardware Gather-Scatter Capabilities to Optimize MPI All-to-All”, EuroMPI '16, Edinburgh, United Kingdom, pp. 1-13, year 2016. |
Pjesivac-Grbovic et al., “Performance analysis of MPI collective operations”, Cluster Computing, pp. 1-25, 2007. |
U.S. Appl. No. 16/181,376 office action dated May 1, 2020. |
Graham et al., U.S. Appl. No. 16/782,118, filed Feb. 5, 2020. |
Danalis et al., “PTG: an abstraction for unhindered parallelism”, 2014 Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, pp. 1-10, Nov. 17, 2014. |
Cosnard et al., “Symbolic Scheduling of Parameterized Task Graphs on Parallel Machines,” Combinatorial Optimization book series (COOP, vol. 7), pp. 217-243, year 2000. |
Jeannot et al., “Automatic Multithreaded Parallel Program Generation for Message Passing Multiprocessors using Parameterized Task Graphs”, World Scientific, pp. 1-8, Jul. 23, 2001. |
Stone, “An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations,” Journal of the Association for Computing Machinery, vol. 10, No. 1, pp. 27-38, Jan. 1973. |
Kogge et al., “A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations,” IEEE Transactions on Computers, vol. C-22, No. 8, pp. 786-793, Aug. 1973. |
Hoefler et al., “Message Progression in Parallel Computing—To Thread or not to Thread?”, 2008 IEEE International Conference on Cluster Computing, pp. 1-10, Tsukuba, Japan, Sep. 29-Oct. 1, 2008. |
U.S. Appl. No. 16/430,457 Office Action dated Jul. 9, 2021. |
Yang et al., “SwitchAgg: A Further Step Toward In-Network Computing,” 2019 IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking, pp. 36-45, Dec. 2019. |
EP Application No. 20216972 Search Report dated Jun. 11, 2021. |
U.S. Appl. No. 16/782,118 Office Action dated Jun. 3, 2021. |
U.S. Appl. No. 16/750,019 Office Action dated Jun. 15, 2021. |
Number | Date | Country | |
---|---|---|---|
20200274733 A1 | Aug 2020 | US |
Number | Date | Country | |
---|---|---|---|
62809786 | Feb 2019 | US |