The present disclosure relates to data processing in general. More specifically, the present disclosure relates to a processing device for a parallel computing system comprising a plurality of such processing devices for performing collective operations, as well as a corresponding method.
Collective operations, which have become an important part of parallel computing frameworks, describe common patterns of communication and computation in parallel computing systems, where data is simultaneously sent to and/or received from a plurality of processing devices (also referred to as processing nodes). As collective operations usually require communication from all N processing devices of a parallel computing system, up to N² communication steps may be necessary, thereby resulting in a large latency for a collective operation involving a large number of processing devices.
It is an objective of the present disclosure to provide a processing device for a parallel computing system comprising a plurality of processing devices for performing collective operations with a reduced latency, as well as a corresponding parallel computing method.
The foregoing and other objectives are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect, a processing device for a parallel computing system is provided, wherein the parallel computing system comprises a plurality of processing devices for performing an application, including one or more collective operations. The processing device is configured to obtain a local processing result, wherein a global processing result of a collective operation depends on the local processing results of the plurality of processing devices. The processing device is further configured to distribute the local processing result of the processing device to one or more of the other processing devices, if: (a) it is certain that the global processing result of the collective operation is based only on, i.e. uniquely determined by the local processing result of the processing device irrespective of the local processing results of the other processing devices; or (b) a likelihood that the global processing result of the collective operation is based only on, i.e. uniquely determined by the local processing result of the processing device is larger than a likelihood threshold value; or (c) it is certain that the global processing result of the collective operation is based only on, i.e. uniquely determined by the local processing result of the processing device and a further local processing result of a further processing device of the plurality of processing devices irrespective of the local processing results of the other processing devices. As used herein, the global processing result is the final result of the collective operation, whereas a local processing result of a processing device is a result initially known to the respective processing device only.
Thus, advantageously, a processing device for a parallel computing system for performing collective operations with a reduced latency is provided.
In a further possible implementation form of the first aspect, the processing device is further configured to broadcast the local processing result of the processing device to all the other processing devices, only if it is certain that the global processing result of the collective operation is based only on, i.e. uniquely determined by the local processing result of the processing device.
In a further possible implementation form of the first aspect, the collective operation is a logical or bitwise “AND” operation or a logical or bitwise “OR” operation.
In a further possible implementation form of the first aspect, the collective operation is a logical or bitwise “XOR” operation, wherein the processing device is further configured to broadcast the local processing result of the processing device to the other processing devices, if it is certain that the global processing result of the collective operation is based only on, i.e. uniquely determined by the local processing result of the processing device and the further local processing result of the further processing device.
In a further possible implementation form of the first aspect, the processing device is further configured to receive the further local processing result from the further processing device and to perform the logical or bitwise “XOR” operation based on the local processing result of the processing device and the further local processing result of the further processing device.
In a further possible implementation form of the first aspect, the processing device is configured to distribute the local processing result of the processing device to a selected subset of the other processing devices for performing the collective operation only with the selected subset of the other processing devices, wherein, for each processing device of the selected subset of the other processing devices, a likelihood that the global processing result of the collective operation is based only on, i.e. uniquely determined by the local processing result of the processing device is larger than a likelihood threshold value. Thus, advantageously, only the important processing nodes may be selected for performing the collective operation, thereby further improving the latency.
In a further possible implementation form of the first aspect, the parallel computing system is configured to adjust the selected subset during run-time of the application. Thus, advantageously, the set of important processing devices may be adjusted depending on the state of the parallel computing system.
In a further possible implementation form of the first aspect, the processing device is configured to store for each collective operation of the application the global processing result of the collective operation and/or an identifier of the processing device providing the global processing result of the collective operation.
In a further possible implementation form of the first aspect, the processing device is configured to determine the likelihood that the global processing result of the collective operation is based only on, i.e. uniquely determined by the local processing result of the processing device based on a comparison between the local processing result and one or more global processing results recorded for one or more preceding collective operations of the application. Thus, advantageously, the global processing results of earlier collective operations may be used for determining the important processing devices.
In a further possible implementation form of the first aspect, the collective operation is a maximum operation for obtaining a maximum of the local processing results of the plurality of processing devices or a minimum operation for obtaining a minimum of the local processing results of the plurality of processing devices.
In a further possible implementation form of the first aspect, the processing device is configured to execute an iterative loop of operations and to terminate, i.e. exit the iterative loop based on a conditional statement depending on the global processing result of the collective operation.
In a further possible implementation form of the first aspect, the processing device is configured to broadcast the local processing result of the processing device to all the other processing devices, if it is certain that the global processing result of the collective operation is based only on, i.e. uniquely determined by the local processing result of the processing device and the global processing result of the collective operation being equal to the local processing result triggers the processing device to terminate the iterative loop.
In a further possible implementation form of the first aspect, the processing device is configured to store for the iterative loop the number of iterations before terminating the iterative loop and/or a threshold value defined by the conditional statement for terminating the iterative loop.
In a further possible implementation form of the first aspect, for each iteration of the iterative loop, the processing device is configured to determine a likelihood that the conditional statement of the iterative loop will be fulfilled in a further iteration of the iterative loop, wherein the processing device is configured to broadcast the local processing result of the processing device to all the other processing devices and to terminate the iterative loop, if the likelihood that the conditional statement will be fulfilled in a further iteration of the iterative loop is larger than a likelihood threshold value.
In a further possible implementation form of the first aspect, the processing device is configured to determine the likelihood that the conditional statement of the iterative loop will be fulfilled in a further iteration of the iterative loop based on the stored number of iterations for terminating one or more preceding iterative loops.
In a further possible implementation form of the first aspect, in case the conditional statement of the iterative loop is not fulfilled in a further iteration of the iterative loop, the processing device is configured to continue executing the iterative loop, if the likelihood that the conditional statement will be fulfilled in a further iteration of the iterative loop is larger than a likelihood threshold value.
In a further possible implementation form of the first aspect, the collective operation is a sum operation of the plurality of local processing results.
In a further possible implementation form of the first aspect, the collective operation is a reduce operation for providing the global processing result at a selected root processing device of the plurality of processing devices or an all reduce operation for providing the global processing result at all of the plurality of processing devices.
According to a second aspect, a parallel computing system comprising a plurality of processing devices according to the first aspect and any one of the implementation forms of the first aspect is provided.
In a possible implementation form of the second aspect, the plurality of processing devices of the parallel computing system are configured to define a tree topology.
According to a third aspect, a method for performing an application, including one or more collective operations, in a parallel computing system having a plurality of processing devices is provided. For each processing device, the method comprises the steps of: obtaining a local processing result, wherein a global processing result of a collective operation depends on one or more of the local processing results of the plurality of processing devices; and distributing the local processing result of the processing device to one or more of the other processing devices, if: (a) it is certain that the global processing result of the collective operation is based only on, i.e. uniquely determined by the local processing result of the processing device irrespective of the local processing results of the other processing devices; or (b) a likelihood that the global processing result of the collective operation is based only on, i.e. uniquely determined by the local processing result of the processing device is larger than a likelihood threshold value; or (c) it is certain that the global processing result of the collective operation is based only on, i.e. uniquely determined by the local processing result of the processing device and a further local processing result of a further processing device of the plurality of processing devices irrespective of the local processing results of the other processing devices.
The method according to the third aspect of the present disclosure can be performed by the processing device according to the first aspect of the present disclosure and the parallel computing system according to the second aspect of the present disclosure. Further features of the method according to the third aspect of the present disclosure result directly from the functionality of the processing device according to the first aspect of the present disclosure and the parallel computing system according to the second aspect of the present disclosure and their different implementation forms described above and below.
Embodiments of the present disclosure can be implemented in hardware and/or software.
Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims.
In the following, embodiments of the present disclosure are described in more detail with reference to the attached figures and drawings.
In the following, identical reference signs refer to identical or at least functionally equivalent features.
In the following description, reference is made to the accompanying figures, which form part of the disclosure, and which show, by way of illustration, specific aspects of embodiments of the disclosure or specific aspects in which embodiments of the present disclosure may be used. It is understood that embodiments of the disclosure may be used in other aspects and comprise structural or logical changes not depicted in the figures. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims.
For instance, it is to be understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if one or a plurality of specific method steps are described, a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures. On the other hand, for example, if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units), even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise.
As illustrated in the figures, the parallel computing system 200 comprises a plurality of processing devices 201, also referred to as processing nodes, which are configured to perform an application including one or more collective operations.
A collective operation is a concept in parallel computing, according to which data is simultaneously sent to or received from many processing nodes, such as the processing devices 201 of the parallel computing system 200. A “broadcast operation” is an example of a collective operation for moving data among the plurality of processing devices 201. A “reduce operation” is an example of a collective operation that executes arithmetic or logical functions on data distributed among the plurality of processing devices 201. In an embodiment, the parallel computing system 200 may implement the Message Passing Interface (MPI) framework, i.e. a well-known standard for providing data communications between the plurality of processing devices 201 of the parallel computing system 200. Although MPI terminology may be used in the following for ease of explanation, MPI as such is not a requirement or limitation of the various embodiments disclosed herein.
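As a point of reference for the embodiments described below, the following minimal C sketch, assuming the MPI C interface, shows a conventional blocking all reduce with a logical “AND” operation, in which every processing device contributes its local result and blocks until the global result is available at all devices; the variable names and example values are illustrative only.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Local processing result of this processing device (node),
       here an illustrative logical value: rank 3 holds "FALSE". */
    int local_result = (rank != 3);
    int global_result = 0;

    /* Conventional blocking all reduce: every node communicates,
       regardless of whether one local "FALSE" already fixes the outcome. */
    MPI_Allreduce(&local_result, &global_result, 1, MPI_INT,
                  MPI_LAND, MPI_COMM_WORLD);

    printf("rank %d: global AND result = %d\n", rank, global_result);

    MPI_Finalize();
    return 0;
}
```

In such a conventional call, all N devices take part in the reduction even if a single local “FALSE” value already fixes the global result, which is exactly the situation the embodiments described below exploit.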
As will be described in more detail further below, generally, a processing device 201 of the parallel computing system 200 is configured to obtain a local processing result, wherein a global processing result of the collective operation depends on the local processing results of the plurality of processing devices 201. Thus, as used herein, the global processing result is the final result of the collective operation, whereas a local processing result is a result initially known to, i.e. available at the respective processing device 201 only. For instance, each processing device 201 may be configured to perform a local data processing operation for obtaining the local processing result. A local processing result may be, for instance, an integer value, a real value, a logical “TRUE” or “FALSE” value or the like.
The processing device 201 of the parallel computing system 200 is further configured to distribute the local processing result of the processing device 201 to one or more of the other processing devices 201, if one of the following conditions is met: (a) it is certain that the global processing result of the collective operation is based only on, i.e. uniquely determined by the local processing result of the processing device 201 irrespective of the local processing results of the other processing devices 201; or (b) a likelihood that the global processing result of the collective operation is based only on, i.e. uniquely determined by the local processing result of the processing device 201 is larger than a likelihood threshold value; or (c) it is certain that the global processing result of the collective operation is based only on, i.e. uniquely determined by the local processing result of the processing device 201 and a further local processing result of a further processing device of the plurality of processing devices 201 irrespective of the local processing results of the other processing devices 201. To verify whether one of these conditions is met, the processing device 201 is configured to check conditions (a), (b) and (c) as defined above.
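By way of illustration only, the following minimal C sketch, assuming the MPI C interface, shows how the check of condition (a) might look for some common reduction operators. The helper name result_determined_locally and the enumeration of operators are assumptions introduced for this sketch, not part of the disclosure.

```c
#include <stdbool.h>
#include <mpi.h>

/* Hypothetical helper: returns true if the local processing result alone
   uniquely determines the global result of the reduction, irrespective of the
   local results of all other processing devices (condition (a) above). */
static bool result_determined_locally(MPI_Op op, int local_result)
{
    if (op == MPI_LAND)
        return local_result == 0;   /* a single "FALSE" forces the AND to FALSE */
    if (op == MPI_LOR)
        return local_result != 0;   /* a single "TRUE" forces the OR to TRUE */
    /* For MPI_MAX or MPI_MIN this holds only if the local result already equals
       a known global bound of the value range; for MPI_SUM or MPI_LXOR a single
       local result is never sufficient on its own. */
    return false;
}
```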
In the embodiment shown in the figures, the plurality of processing devices 201 of the parallel computing system 200 are configured to perform a reduce or all reduce operation in the form of a logical or bitwise “AND” operation or a logical or bitwise “OR” operation, wherein the local processing results of the processing devices 201 are logical “TRUE” or “FALSE” values.
Instead of distributing the local processing results, e.g. the “TRUE” or “FALSE” values, to all the other processing devices 201 in every case, each processing device, i.e. node 201 of the parallel computing system 200 is configured to first check whether its local processing result alone already uniquely determines the global processing result of the collective operation, for instance a “FALSE” value in the case of a logical “AND” operation or a “TRUE” value in the case of a logical “OR” operation.
In case the processing device P3 knows that the global processing result of the collective operation is based only on, i.e. uniquely determined by its local processing result, the processing device P3, as illustrated in the figures, is configured to broadcast its local processing result to all the other processing devices 201.
Thus, in the embodiment shown in the figures, the collective operation can be completed with a reduced number of communication steps and, therefore, a reduced latency.
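Since a standard MPI_Bcast requires all ranks to agree on the root in advance, a hedged sketch of the early notification described above can instead use small point-to-point messages; the function name distribute_if_decisive, the tag value and the restriction to a logical “AND” reduction are assumptions for this sketch rather than the disclosed implementation.

```c
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical sketch: a processing device whose local "FALSE" value already
   determines the global result of a logical "AND" reduction notifies all other
   devices directly, instead of taking part in a full reduction first.
   TAG_EARLY_RESULT is an arbitrary tag chosen for this sketch. */
#define TAG_EARLY_RESULT 42

static void distribute_if_decisive(int local_result, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (local_result != 0)      /* condition (a) not met for a logical AND */
        return;

    MPI_Request *reqs = malloc((size_t)(size - 1) * sizeof *reqs);
    int n = 0;
    for (int dst = 0; dst < size; ++dst) {
        if (dst == rank)
            continue;
        /* Non-blocking sends, completed together below. */
        MPI_Isend(&local_result, 1, MPI_INT, dst,
                  TAG_EARLY_RESULT, comm, &reqs[n++]);
    }
    MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```

The other processing devices could, for instance, poll for such an early-result message with MPI_Iprobe while the regular reduction is still in progress.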
In another embodiment, the plurality of processing devices 201 of the parallel computing system 200 are configured to perform a reduce or all reduce operation in the form of a logical or bitwise “XOR” operation, such as the MPI_LXOR or the MPI_BXOR operation. In this embodiment, the processing device 201 is configured to broadcast its local processing result, such as a “TRUE” or “FALSE” value, to the other processing devices 201, if it is certain that the global processing result is based only on, i.e. uniquely determined by the local processing result of the processing device and a further local processing result of a further processing device 201. For instance, in the example shown in the figures, the processing device 201 may receive the further local processing result from the further processing device 201 and perform the logical or bitwise “XOR” operation based on its own local processing result and the further local processing result.
As can be taken from the figures, in case a processing device 201, such as, by way of example, the processing device P4, determines that it is certain that the global processing result of the collective operation is based only on, i.e. uniquely determined by its local processing result and the further local processing result of a further processing device 201, the processing device P4 is configured to broadcast its local processing result to the other processing devices 201.
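As a minimal sketch of the pairwise combination described above for the “XOR” case, the following C fragment assumes an even number of ranks paired as neighbours (0 with 1, 2 with 3, and so on); the pairing scheme and the function name combine_pair_xor are assumptions for illustration only.

```c
#include <mpi.h>

/* Hypothetical sketch for the "XOR" case: ranks are paired as neighbours; the
   even rank of each pair receives its partner's local result and applies the
   bitwise XOR locally, yielding a pairwise partial result.  An even number of
   ranks is assumed. */
static int combine_pair_xor(int local_result, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;

    if (rank % 2 == 0) {
        int further_result;
        MPI_Recv(&further_result, 1, MPI_INT, partner, 0, comm,
                 MPI_STATUS_IGNORE);
        return local_result ^ further_result;   /* combined partial result */
    } else {
        MPI_Send(&local_result, 1, MPI_INT, partner, 0, comm);
        return local_result;
    }
}
```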
In many parallel computing applications, iterative schemes form the core of the overall algorithm. Over the lifetime of a specific parallel computing application, the number of iterations varies only mildly. Therefore, if a specific parallel computing application requires on average about 100 iterations at each step, it is very unlikely to require fewer than 80 at a specific step. Conventionally, however, a blocking reduction operation is used at every iteration nonetheless, even though the outcome of the convergence test can be predicted with near-perfect certainty for the first 80 iterations. Furthermore, the local error value of a single process can be high enough to invalidate the global convergence test. This case is illustrated in the figures.
As can be taken from the figures, instead of distributing the local processing results, e.g. the real values, at every iteration, each processing device 201 of the parallel computing system 200 is configured to check whether its local processing result alone already determines the outcome of the global convergence test, for instance because its local error value is high enough to invalidate the global convergence test, and to distribute its local processing result to the other processing devices 201 in that case.
In an embodiment, each processing device 201 is configured to store for the iterative loop the number of iterations before terminating the iterative loop and/or a threshold value defined by the conditional statement for terminating the iterative loop. Moreover, in an embodiment, each processing device 201 may be configured to determine the likelihood that the conditional statement of the iterative loop is fulfilled in a further iteration of the iterative loop, wherein the processing device 201 is configured to broadcast the local processing result of the processing device 201 to the other processing devices 201 and to terminate the iterative loop, if the likelihood that the conditional statement is fulfilled in a further iteration of the iterative loop is larger than a likelihood threshold value. Thus, in an embodiment, the parallel computing system 200 including the plurality of processing devices 201 may implement a branch prediction algorithm. In an embodiment, each processing device 201 is configured to determine the likelihood that the conditional statement of the iterative loop is fulfilled in a further iteration of the iterative loop based on the number of iterations recorded for terminating one or more preceding iterative loops. In an embodiment, in case the conditional statement of the iterative loop is not fulfilled in a further iteration of the iterative loop, each processing device 201 may be further configured to continue executing the iterative loop, if the likelihood that the conditional statement is fulfilled in a further iteration of the iterative loop is larger than the likelihood threshold value.
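A minimal sketch of how such a likelihood estimate might be derived from the recorded iteration counts is given below; the simple empirical-frequency estimator and all names are assumptions, not the disclosed method.

```c
#include <stddef.h>

/* Hypothetical estimator: the likelihood that the iterative loop terminates by
   iteration `current_iter`, computed as the empirical frequency over the
   iteration counts recorded for preceding iterative loops. */
static double termination_likelihood(const int *recorded_iters, size_t n,
                                     int current_iter)
{
    if (n == 0)
        return 0.0;
    size_t terminated_by_now = 0;
    for (size_t i = 0; i < n; ++i)
        if (recorded_iters[i] <= current_iter)
            ++terminated_by_now;
    return (double)terminated_by_now / (double)n;
}
```

The result of such an estimator could then be compared against the likelihood threshold value mentioned above.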
Thus, in an embodiment, the parallel computing system 200 is configured to automatically recognize the blocking collective reduction call followed by a conditional statement on the returned reduced value as one global operation. This concatenation makes it possible to take advantage of previously unused information to reduce the communication time. Moreover, as already described above, statistics may be gathered on the number of iterations required for the conditional statement to be fulfilled. If the number of iterations is high enough, the blocking collective call may be transformed into a non-blocking collective call, which enables an overlap of communication and computation. Furthermore, branch prediction may be applied to the overall global collective reduction operation, i.e. the outcome of the conditional statement is assumed to be false and the processing device 201 may proceed with the computation of the next iteration while the communication is performed simultaneously. If the conditional statement is indeed found to be false, as initially assumed, the processing device 201 is not interrupted and continues with its computation. In the other case, where the conditional statement is true, the processing device 201 may retrace its steps and exit the iterative scheme.
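The following C sketch, assuming the MPI C interface, illustrates the transformation described above: the blocking convergence test is replaced by a non-blocking MPI_Iallreduce, the conditional statement is predicted as not fulfilled, and the next iteration is computed while the reduction completes in the background. The function compute_iteration, the tolerance handling and the roll-back behaviour are placeholders introduced for illustration only.

```c
#include <mpi.h>
#include <math.h>

/* Placeholder: advances the local state of the solver by one iteration and
   returns the local error value of this processing device. */
extern double compute_iteration(void);

static double run_solver(double tol, int max_iter, MPI_Comm comm)
{
    double global_err = INFINITY;
    double local_err = compute_iteration();        /* first iteration */

    for (int it = 1; it < max_iter; ++it) {
        /* Start the convergence test for the previous iteration without blocking. */
        MPI_Request req;
        MPI_Iallreduce(&local_err, &global_err, 1, MPI_DOUBLE,
                       MPI_MAX, comm, &req);

        /* Predicted branch: assume "not yet converged" and compute the next
           iteration while the reduction progresses in the background. */
        double next_err = compute_iteration();

        MPI_Wait(&req, MPI_STATUS_IGNORE);
        if (global_err < tol)
            break;              /* prediction wrong: the speculative iteration is
                                   superfluous (a real implementation would roll
                                   back the extra work here) and the loop exits */
        local_err = next_err;   /* prediction right: reuse the overlapped work */
    }
    return global_err;
}
```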
The person skilled in the art will understand that the “blocks” (“units”) of the various figures (method and apparatus) represent or describe functionalities of embodiments of the disclosure (rather than necessarily individual “units” in hardware or software) and thus describe equally functions or features of apparatus embodiments as well as method embodiments (unit = step).
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiment is merely exemplary. For example, the unit division is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, functional units in the embodiments of the disclosure may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
This application is a continuation of International Application No. PCT/EP2020/062872, filed on May 8, 2020, the disclosure of which is hereby incorporated by reference in its entirety.
Relation | Number | Date | Country
---|---|---|---
Parent | PCT/EP2020/062872 | May 2020 | US
Child | 17980851 | | US