The present disclosure relates to a computer system and corresponding method for distributed training of a machine learning (ML) model. In particular, the system and method of the present disclosure extend a Bulk Synchronous Parallel (BSP) system to support asynchronous gradient computation in a Parameter Server (PS)-based distributed machine learning approach.
Today, distributed computation is required to speed up iterative training of large scale machine learning problems. In order to efficiently support, for instance, distributed model training, the model training process is distributed over a cluster, as is depicted in
Each ML worker computes model updates using the following iteration. First, the ML worker extracts a model replica (copy) M from the PS. Second, it computes a gradient of the model M, i.e. ΔM=computeGrad(M). Third, it updates the model in the PS based on the computed gradient: M+=ΔM.
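By way of non-limiting illustration, the three-step iteration described above may be sketched as follows. The class `ParameterServer`, its methods `pull` and `push`, the quadratic toy loss and the learning rate are assumptions made only for this sketch and are not taken from the disclosure. The sketch further assumes that ΔM is a signed model update that is added to the PS model.

```python
from typing import List


class ParameterServer:
    """Toy in-process stand-in for a PS holding the model M as a list of floats."""

    def __init__(self, model: List[float]) -> None:
        self._model = list(model)

    def pull(self) -> List[float]:
        # Step 1: hand out a replica (copy) of the PS model.
        return list(self._model)

    def push(self, delta: List[float]) -> None:
        # Step 3: merge the update into the PS model (M += delta).
        self._model = [m + d for m, d in zip(self._model, delta)]


def compute_grad(model: List[float], target: List[float], lr: float = 0.5) -> List[float]:
    # Step 2: signed update for the assumed loss 0.5 * ||M - target||^2.
    # The raw gradient is (M - target); the update is its negative, scaled by lr.
    return [-lr * (m - t) for m, t in zip(model, target)]


def worker_iteration(ps: ParameterServer, target: List[float]) -> None:
    replica = ps.pull()
    delta = compute_grad(replica, target)
    ps.push(delta)


ps = ParameterServer([0.0, 0.0])
for _ in range(3):
    worker_iteration(ps, target=[1.0, 2.0])
final_model = ps.pull()
```

In this sketch each iteration halves the remaining distance between the PS model and the assumed target.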
Three main approaches to building such a PS-based system exist to date, these approaches being shown in
The BSP system works in iterations, where an iteration consists of two phases: A computation phase and a synchronization barrier. During the computation phase, each worker (here labelled as executor) uses its local model replica and a part of training data to compute a gradient. At the synchronization barrier, the system waits until all workers complete their gradient computation. Then it uploads the computed gradients to the PS, and merges them with the PS model, i.e. the model residing in the PS. Subsequently, each worker downloads the current (updated) PS model, in order to start the next computation phase. Notably, at the synchronization barrier each worker is idle, and needs to wait for the updated PS model. Thus, the system resources are underutilized.
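By way of non-limiting illustration, one possible rendering of the two-phase BSP iteration uses the barrier primitive of the Python standard library. The single-parameter model, the per-worker targets in `shards`, the averaging of the uploaded gradients and the step size are assumptions of this sketch only.

```python
import threading
from typing import List

NUM_WORKERS = 3
ps_model = [0.0]                              # PS model (one parameter for brevity)
uploaded: List[float] = [0.0] * NUM_WORKERS   # gradient slot per worker
shards = [1.0, 2.0, 3.0]                      # hypothetical per-worker training targets


def merge_at_barrier() -> None:
    # Runs exactly once per iteration, after all workers have arrived
    # and before any of them is released again.
    ps_model[0] += sum(uploaded) / NUM_WORKERS


barrier = threading.Barrier(NUM_WORKERS, action=merge_at_barrier)


def worker(rank: int, iterations: int) -> None:
    for _ in range(iterations):
        replica = ps_model[0]                              # download current PS model
        uploaded[rank] = -0.5 * (replica - shards[rank])   # computation phase
        barrier.wait()                                     # idle until all workers finish


threads = [threading.Thread(target=worker, args=(r, 2)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The call to `barrier.wait()` is where a fast worker sits idle, which corresponds to the underutilization noted above.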
The AP system operates in a similar manner as the BSP system, except that the synchronization barrier is eliminated. Here, when a worker completes its computation of a gradient, it uploads the computed gradient to the PS, in order to merge it with the PS model. Then, it downloads the resulting updated PS model without synchronizing with other workers. While the AP solution thus eliminates the waiting at the synchronization barrier of the BSP system, it suffers from two different problems. Firstly, during the gradient upload and the merging of the uploaded gradient with the PS model, and also during the downloading of the updated PS model, the worker is idle, thus wasting resources. Secondly, the probably even more severe problem is that workers may merge delayed gradients on different models.
The SSP system is a compromise between the BSP and AP systems, trying to mitigate the disadvantages of both. On the one hand, it allows the model replicas at different workers to become stale or to diverge. On the other hand, it bounds this model staleness by a so-called staleness factor s. To enforce bounded staleness, the SSP system introduces a local clock at each worker, wherein the clock amounts to a local iteration number at the worker. That is, the local clock of a worker is incremented each time the worker completes a single gradient computation. A faster worker is allowed to proceed with gradient computation only if its local clock is at most s iterations beyond the clock of the slowest worker. Notably, too high values of the staleness factor s will lead to model staleness problems, while too small values of the staleness factor s will lead to too high a synchronization overhead. Thus, the SSP system achieves the highest convergence rate at a sweet spot value of the staleness factor s, which is neither too high nor too low.
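By way of non-limiting illustration, the bounded-staleness condition may be expressed as a simple predicate over the local clocks. The function name `may_proceed` and the worker identifiers are hypothetical.

```python
from typing import Dict


def may_proceed(worker_id: str, clocks: Dict[str, int], s: int) -> bool:
    """True if the worker's local clock is at most s iterations beyond the slowest clock."""
    return clocks[worker_id] - min(clocks.values()) <= s


# Local iteration counts of three hypothetical workers.
clocks = {"w0": 7, "w1": 5, "w2": 4}

fast_allowed_s3 = may_proceed("w0", clocks, s=3)   # 7 - 4 = 3, within the bound
fast_allowed_s2 = may_proceed("w0", clocks, s=2)   # 7 - 4 = 3, exceeds the bound
slowest_allowed = may_proceed("w2", clocks, s=0)   # the slowest worker may always proceed
```

With s = 0 only workers at the minimum clock may proceed, which approximates the lock-step behaviour of a BSP system; a very large s approaches the behaviour of an AP system.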
According to the above discussion, the advantages and disadvantages of the BSP and AP systems, respectively, become evident, and also how the SSP system leads to a compromise between the BSP and AP solutions.
However, while the AP and SSP solutions have significant performance advantages as compared to the BSP solution, they are much harder to implement than the BSP solution. Currently, the AP and SSP solutions both require the construction of a specialized system. The BSP solution is much simpler to implement and also to use for programming use cases. Furthermore, a few mature open source BSP platforms exist today, such as Hadoop and Apache Spark. All BSP solutions, however, suffer from system underutilization due to the synchronization barriers.
Therefore, combining the simplicity of a BSP system at workers and PS level with the performance advantages of an AP or SSP system at the workers level is highly desired. There is a particular need to leverage a mature BSP system, in order to construct an AP or SSP system.
Many organizations are nowadays heavily based on existing BSP solutions. BSP-based asynchronous training systems lower an adoption barrier in such organizations. Today there are two major approaches. One approach leverages existing BSP systems to construct a BSP solution, e.g. CaffeOnSpark of Yahoo. Another approach is to construct a dedicated AP or SSP solution, such as CMU's Petuum or DistBelief over Google's TensorFlow.
In view of the above-mentioned problems and disadvantages, the present disclosure aims to improve the conventional systems and methods. The present disclosure has thereby the object to provide a system, which combines the simplicity of a BSP system at the workers and PS level with the performance advantages of an AP or SSP system at the workers level. In particular, the high overhead at the synchronization barrier of a BSP system should be avoided. Also, the wasteful requirement of heavy distributed core capabilities in AP and SSP systems should be eliminated. In addition, the scalability of the system is to be improved over conventional systems. Moreover, gradient merge delays should be minimized, in order to improve the convergence rate.
The object of the present disclosure is achieved by the solution provided in the enclosed independent claims. Advantageous implementations of the present disclosure are further defined in the dependent claims.
In particular the present disclosure proposes a solution that uses a shared memory module to decouple asynchronous gradient computation at machine learning modules from a synchronous periodic model download and model update merge with the PS model. The shared memory module is used to synchronize these two components, and to facilitate a data flow between them.
A first aspect of the present disclosure provides a computer system for distributed training of a machine learning model, the computer system comprising: a bulk synchronous parallel, BSP, system including a central BSP control module and at least one local BSP module; at least one machine learning module associated with exactly one local BSP module; and a shared memory module associated with exactly one pair of a local BSP module and a machine learning module; wherein the central BSP control module is configured to instruct the at least one local BSP module to store, in its associated shared memory module, a local model; wherein the at least one machine learning module is configured to read, from its associated shared memory module, the local model, compute a gradient based on the local model, and aggregate the gradient immediately after its computation into an aggregated gradient in its associated shared memory module; and wherein the central BSP control module is further configured to instruct the at least one local BSP module to periodically read out its associated shared memory module.
The computer system of the first aspect uses the shared memory module to decouple asynchronous gradient computation by the at least one machine learning module from synchronous and periodic gradient read out by the BSP system. After computing a gradient based on the local model, the at least one machine learning module uses it to update the local model, and aggregates it into the aggregated gradient in the shared memory module. The next gradient is then computed on the updated local model. After computing this next gradient, the at least one machine learning module uses it to further update the local model, and aggregates it into the aggregated gradient in the shared memory module. The gradient read out is made to merge it with a PS model residing in the PS, and to subsequently update the local model by the updated PS model. The shared memory module is also used to synchronize the BSP system and the machine learning module, and to facilitate a data flow between these two components.
The computer system advantageously introduces asynchrony specifically at three levels: Firstly, each machine learning module (also called ML worker) runs independently from all other machine learning modules, i.e. calculates gradients without the need to wait for other machine learning modules. Secondly, the process of gradient collection, and consequently also the merging of the gradients into the PS model, is asynchronous with the gradient computation. Thirdly, the subsequent distribution of an updated PS model is asynchronous with the gradient computation. All these asynchrony levels in the system contribute to a significant reduction in wait times and, consequently, to a speeding up of the model training.
Further, the computer system of the first aspect allows using mature BSP systems, in order to rapidly implement systems for asynchronous training of machine learning models. Specifically, it allows reusing complex distributed components, thus saving time to implement them as compared to the implementation of dedicated asynchronous solutions. Actually, in the computer system of the first aspect, the only non-BSP components are the machine learning modules.
The computer system of the first aspect is highly scalable, due to its distributed approach, and due to the fact that data flows do not need to pass through a Master node.
The computer system of the first aspect also allows two further optimizations seamlessly: Firstly, it allows reassignment of cluster machines to host the PS model for a better load balancing on production clusters. Secondly, it allows an efficient gradient merge procedure, in order to increase network usage efficiency using tree-merge operation. Details of these optimizations are explained further below.
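By way of non-limiting illustration, a tree-merge may be understood as a pairwise reduction in which the gradients are combined in rounds rather than being sent one by one to a single node. The following sketch shows a generic pairwise reduction with element-wise summation; it is an assumption about the general technique and not a description of the specific merge procedure of the disclosure.

```python
from typing import List

Gradient = List[float]


def merge_pair(a: Gradient, b: Gradient) -> Gradient:
    # Element-wise sum of two gradients of equal length.
    return [x + y for x, y in zip(a, b)]


def tree_merge(gradients: List[Gradient]) -> Gradient:
    """Reduce the gradients pairwise, halving their number in each round."""
    if not gradients:
        raise ValueError("need at least one gradient")
    level = list(gradients)
    while len(level) > 1:
        nxt = [merge_pair(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])   # the odd one out is carried to the next round
        level = nxt
    return level[0]


merged = tree_merge([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

For n gradients the reduction completes in roughly log2(n) rounds, and the merges within a round are independent of each other.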
The computer system of the first aspect advantageously lowers adoption barriers in organizations, which are heavily based on existing BSP solutions.
Finally, the implementation of the computer system of the first aspect is simple, and the simplicity of its implementation results in very low implementation efforts and a higher stability.
In a first implementation form of the system according to the first aspect, the at least one machine learning module is configured to compute a plurality of gradients based on the local model, and aggregate the plurality of gradients into the aggregated gradient stored in its associated shared memory module.
Therefore, when a shared memory module is read out, all gradients so far computed by the associated machine learning module are obtained in the form of the aggregated gradient. Machine learning modules can thus compute gradients with different computational speeds, without any wait times. That is, each machine learning module can, without having to wait for any other machine learning module, compute a gradient and aggregate it into the aggregated gradient, as soon as its computation is finalized.
In a second implementation form of the system according to the first aspect as such or according to the first implementation form of the first aspect, the at least one machine learning module is configured to read training data from the BSP system; or the BSP system is configured to push training data to the at least one machine learning module via its associated shared memory module; and the at least one machine learning module is configured to compute the gradient based on the local model and the training data.
The training data can thereby be distributed more efficiently to the machine learning modules, without introducing any wait times or delays.
In a third implementation form of the system according to the first aspect as such or according to any previous implementation form of the first aspect, the at least one local BSP module is further configured to communicate with a parameter server, PS, in order to receive a PS model that is to be stored as the local model.
In this manner, the local models can be efficiently provided to the individual machine learning modules for training.
In a fourth implementation form of the system according to the first aspect as such or according to any previous implementation form of the first aspect, after every step of periodically reading out a shared memory module, the central BSP control module is further configured to instruct the associated local BSP module to provide, to a PS, the aggregated gradient for updating, in the PS, the PS model.
Thus, the PS model in the PS can be updated with the computed and aggregated gradients. The PS model update happens periodically after read out of a shared memory module, and is decoupled from the asynchronous gradient computations.
In a fifth implementation form of the system according to the first aspect as such or according to any previous implementation form of the first aspect, the central BSP control module is configured to notify the at least one local BSP module on the availability of an updated PS model residing in a PS, and the at least one local BSP module is configured to download the updated PS model from the PS and to use it to update the local model stored in its associated shared memory module.
In this manner, updated local models can be efficiently provided to the individual machine learning modules for further training.
In a sixth implementation form of the system according to the first aspect as such or according to any previous implementation form of the first aspect, the at least one machine learning module is further configured to, when storing in its associated shared memory module the aggregated gradient, set a gradient available flag; and the central BSP control module is further configured to, when periodically instructing the at least one local BSP module to read out its associated shared memory module, instruct the at least one local BSP module to read out an aggregated gradient only if the gradient available flag is set.
Thereby, the overall system efficiency is improved.
In a seventh implementation form of the system according to the first aspect as such or according to any previous implementation form of the first aspect, the central BSP control module is further configured to instruct the at least one local BSP module, when storing or updating the local model in its associated shared memory module, to set a model available flag; and the at least one machine learning module is further configured to read, from its associated shared memory module, the local model only if the model available flag is set.
Thereby, the overall system efficiency is further improved.
In an eighth implementation form of the system according to the first aspect as such or according to any previous implementation form of the first aspect, the central BSP control module is further configured to instruct the at least one local BSP module to store, in its associated shared memory module, a global minimum clock calculated based on clock information obtained from all machine learning modules; and the at least one machine learning module is further configured to read, from its associated shared memory module, the global minimum clock, and to interrupt its computation, if a difference of its local clock and the global minimum clock exceeds a predefined threshold, until the global minimum clock advances and the difference of its local clock and the global minimum clock is bounded by the predefined threshold.
Thus, it can be ensured that the local models do not become too stale, in order to avoid, for instance, that computed gradients are merged with a PS model only with a high delay. In essence, a controllable staleness factor is introduced into the computer system, achieving advantages of an SSP system.
A second aspect of the present disclosure provides a method for operating a computer system for distributed training of a machine learning model, the method comprising the steps of: instructing, by a central bulk synchronous parallel, BSP, control module of a BSP system, a local BSP module of the BSP system to store a local model in a shared memory module associated with the local BSP module; reading, by a machine learning module associated with the local BSP module, the local model from the shared memory module associated with the local BSP module; computing, by the machine learning module, a gradient based on the local model; aggregating, by the machine learning module, the gradient immediately after its computation into an aggregated gradient in its associated shared memory module; and instructing, by the central BSP control module, the local BSP module to periodically read out its associated shared memory module.
In a first implementation form of the method according to the second aspect, the at least one machine learning module is configured to compute a plurality of gradients based on the local model, and aggregate the plurality of gradients into the aggregated gradient stored in its associated shared memory module.
In a second implementation form of the method according to the second aspect as such or according to the first implementation form of the second aspect, the at least one machine learning module is configured to read training data from the BSP system; or the BSP system is configured to push training data to the at least one machine learning module via its associated shared memory module; and the at least one machine learning module is configured to compute the gradient based on the local model and the training data.
In a third implementation form of the method according to the second aspect as such or according to any previous implementation form of the second aspect, the at least one local BSP module is further configured to communicate with a parameter server, PS, in order to receive a PS model that is to be stored as the local model.
In a fourth implementation form of the method according to the second aspect as such or according to any previous implementation form of the second aspect, after every step of periodically reading out a shared memory module, the central BSP control module is further configured to instruct the associated local BSP module to provide, to a PS, the aggregated gradient for updating, in the PS, the PS model.
In a fifth implementation form of the method according to the second aspect as such or according to any previous implementation form of the second aspect, the central BSP control module is configured to notify the at least one local BSP module on the availability of an updated PS model residing in a PS, and the at least one local BSP module is configured to download the updated PS model from the PS and to use it to update the local model stored in its associated shared memory module.
In a sixth implementation form of the method according to the second aspect as such or according to any previous implementation form of the second aspect, the at least one machine learning module is further configured to, when storing in its associated shared memory module the aggregated gradient, set a gradient available flag; and the central BSP control module is further configured to, when periodically instructing the at least one local BSP module to read out its associated shared memory module, instruct the at least one local BSP module to read out an aggregated gradient only if the gradient available flag is set.
In a seventh implementation form of the method according to the second aspect as such or according to any previous implementation form of the second aspect, the central BSP control module is further configured to instruct the at least one local BSP module, when storing or updating the local model in its associated shared memory module, to set a model available flag; and the at least one machine learning module is further configured to read, from its associated shared memory module, the local model only if the model available flag is set.
In an eighth implementation form of the method according to the second aspect as such or according to any previous implementation form of the second aspect, the central BSP control module is further configured to instruct the at least one local BSP module to store, in its associated shared memory module, a global minimum clock calculated based on clock information obtained from all machine learning modules; and the at least one machine learning module is further configured to read, from its associated shared memory module, the global minimum clock, and to interrupt its computation, if a difference of its local clock and the global minimum clock exceeds a predefined threshold, until the global minimum clock advances and the difference of its local clock and the global minimum clock is bounded by the predefined threshold.
The method of the second aspect and its implementation forms achieve the same advantages as the system of the first aspect and its respective implementation forms.
A third aspect of the present disclosure provides a computer program product comprising a program code for performing, when running on a computer, the method according to the second aspect as such or according to any implementation form of the second aspect.
The computer program product of the third aspect thus achieves all the advantages of the method of the second aspect.
It has to be noted that all devices, elements, units and means described in the present application could be implemented in software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.
The above described aspects and implementation forms of the present disclosure will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which:
The BSP system 101 includes a central BSP control module 102 and at least one local BSP module 103. The at least one machine learning module 105 is associated with the local BSP module 103. In case that more than one machine learning module 105 is included in the computer system 100 (as exemplarily indicated by the dotted shapes in
The central BSP control module 102 is configured to provide instruction to the at least one local BSP module 103. For instance, the central BSP control module 102 is configured to instruct the at least one local BSP module 103 to store, in its associated shared memory module 104, a local model. This local model may particularly be a copy or replica of a PS model residing in a PS that is to be trained by a distributed machine learning system.
The at least one machine learning module 105 is accordingly configured to read, from its associated shared memory module 104, the stored local model, and is configured to compute a gradient based on the local model. Once the gradient is computed, the at least one machine learning module 105 is configured to aggregate the gradient, preferably immediately after its computation, into an aggregated gradient in its associated shared memory module 104. The central BSP control module 102 is also configured to instruct the at least one local BSP module 103 to periodically read out its associated shared memory module 104, i.e. to read out the aggregated gradient stored therein. All shared memory modules 104 included in the computer system 100 could be read out with the same periodicity, but periodicities could also differ. For example, only those shared memory modules 104 whose aggregated gradient is non-empty may be read out. After the local BSP module 103 reads out the aggregated gradient of a shared memory module 104, the aggregated gradient becomes empty, until the machine learning module 105 aggregates into it a new gradient that it has computed. Also, the shared memory modules 104 are read out asynchronously to the gradient computation by the machine learning modules 105. The asynchronous gradient computation of the individual machine learning modules 105 is decoupled from the read out process by means of the shared memory module 104.
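By way of non-limiting illustration, the aggregate and read-out behaviour of a shared memory module 104 may be sketched as follows. The class name `SharedMemoryModule`, the use of a lock, the element-wise summation as the aggregation operation and the use of `None` to represent an empty aggregated gradient are assumptions of this sketch.

```python
import threading
from typing import List, Optional


class SharedMemoryModule:
    """Holds the aggregated gradient shared by one ML module and one local BSP module."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._aggregated: Optional[List[float]] = None   # None means "empty"

    def aggregate(self, gradient: List[float]) -> None:
        # Called by the ML module right after each gradient computation.
        with self._lock:
            if self._aggregated is None:
                self._aggregated = list(gradient)
            else:
                self._aggregated = [a + g for a, g in zip(self._aggregated, gradient)]

    def read_out(self) -> Optional[List[float]]:
        # Called periodically by the local BSP module; leaves the aggregate empty.
        with self._lock:
            out, self._aggregated = self._aggregated, None
            return out


shm = SharedMemoryModule()
shm.aggregate([1.0, 2.0])
shm.aggregate([0.5, 0.5])
first = shm.read_out()    # both gradients, summed element-wise
second = shm.read_out()   # nothing new since the previous read-out
```

Because neither call waits for the other party, the ML module may aggregate at its own pace while the read-out follows the period chosen by the BSP system.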
The method 200 comprises a step of instructing 201, by a central BSP control module 102 of a BSP system 101, a local BSP module 103 of the BSP system 101 to store a local model in a shared memory module 104, the shared memory module 104 being associated with the local BSP module 103. Further, the method 200 comprises a step of reading 202, by a machine learning module 105 associated with the local BSP module 103, the local model from the shared memory module 104 associated with the local BSP module 103. The method 200 comprises another step of computing 203, by the machine learning module 105, a gradient based on the local model, and a step of aggregating 204, by the machine learning module 105, the gradient immediately after its computation into an aggregated gradient in its associated shared memory module 104. Finally, the method 200 comprises a step of instructing, by the central BSP control module 102, the local BSP module 103 to periodically read out its associated shared memory module 104.
The aggregated gradient of a shared memory module is read out, in order to merge it with the PS model in the PS, and to subsequently update the local model by the updated PS model.
The system 100 uses a distributed data set, preferably Spark Resilient Distributed Dataset (RDD), to store a PS model. This RDD storing the PS model is referred to as PS RDD 300. The PS model may be distributed between several machines of a cluster (these machines being indicated in
Further, a ML distributed data set, preferably a ML RDD 303, is used to control the ML modules 105. In other words, the ML modules 105 are organized in the ML RDD 303 that controls them. The distributed data set accordingly includes a number of elements that is the same as the number of ML modules 105. In other words, the ML RDD 303 is partitioned into a number of elements corresponding to the number of the ML modules 105 (see (2) in
Each ML module 105 uses the local BSP module 103 to download a replica or copy of the PS model as local model at the request of said ML module 105. The data flows directly between the ML modules 105 and the PS RDD 300. Since all global metadata is also stored along with the PS model, particularly in the part with ID 0, and since the global metadata contains the global clock information, the global clock information can also be downloaded to the ML modules 105 during downloading of the PS model copy or replica as local model. Each ML module 105 can use this clock information to enforce, for example, staleness guarantees.
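By way of non-limiting illustration, the download of a partitioned PS model together with the global clock information may be sketched as follows. The plain dictionary standing in for the PS RDD 300, the `Part` type, the key `global_min_clock` and the function name `download_local_model` are assumptions of this sketch; no Spark API is used.

```python
from typing import Dict, List, NamedTuple, Tuple


class Part(NamedTuple):
    params: List[float]
    metadata: Dict[str, int]   # populated only for part ID 0 in this sketch


# Hypothetical PS model split into three parts; part 0 also carries global metadata.
ps_parts: Dict[int, Part] = {
    0: Part([0.1, 0.2], {"global_min_clock": 4}),
    1: Part([0.3], {}),
    2: Part([0.4, 0.5], {}),
}


def download_local_model(parts: Dict[int, Part]) -> Tuple[List[float], int]:
    # Concatenate the parts in ID order to form the local model replica.
    model: List[float] = []
    for part_id in sorted(parts):
        model.extend(parts[part_id].params)
    # The clock information arrives with part 0, so no extra request is needed.
    return model, parts[0].metadata["global_min_clock"]


local_model, global_min_clock = download_local_model(ps_parts)
```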
Additionally, the ML module 105 preferably reads training data from a storage 305 by using BSP components. The training data may preferably be stored in a Hadoop distributed file system (HDFS) as shown in
A ML module 105 then uses the local model, and preferably the obtained training data, to compute gradients (see (4) in
Thereby, the synchronous collection of aggregated gradients, and the asynchronous computation of gradients, is decoupled via a BSP/ML module communication layer 306 implemented on at least one shared memory module 104, one module 104 per pair of ML module 105 and local BSP module 103. The BSP system 101 subsequently uses a join operation (see (1) in
A learning iteration starts when the ML module 105 sets the ‘new model request’ flag to ask the local BSP module 103 to download a PS model. When the local BSP module 103 discovers that this flag is set, it downloads a PS model, stores it in the shared memory 104 and preferably raises the ‘new model available’ flag. When the ML module 105 discovers that this flag is set, it copies the new model from the shared memory 104 into its own memory, and starts computing gradients using this model. When a new gradient is computed, the ML module 105 extracts the gradient from its own main memory, aggregates it into the ‘aggregated gradient’ within the shared memory 104, and preferably raises the ‘model updates’ flag. Periodically, the central BSP control module 102 issues a gradient merge operation. This operation arrives at each local BSP module 103. The local BSP module 103 checks the ‘model updates’ flag and, if it is set, collects the aggregated gradients from the shared memory 104 and sends them to be joined with the PS model to complete the gradient merge process. When the local BSP module 103 downloads the new model, it preferably also downloads the global clock information and stores it in the ‘local state’ of the shared memory 104. The asynchronous ML module 105 uses this information for staleness enforcement.
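By way of non-limiting illustration, the flag-based exchange over the shared memory 104 may be sketched as a sequence of steps executed in a single thread. All function and field names are hypothetical. The clearing of each flag after it has been handled and the element-wise addition used for the merge are assumptions of this sketch, since they are not specified above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SharedMemory:
    new_model_request: bool = False
    new_model_available: bool = False
    model_updates: bool = False
    model: List[float] = field(default_factory=list)
    aggregated_gradient: List[float] = field(default_factory=list)
    local_state: Dict[str, int] = field(default_factory=dict)


def ml_request_model(shm: SharedMemory) -> None:
    shm.new_model_request = True


def bsp_serve_model_request(shm: SharedMemory, ps_model: List[float], global_clock: int) -> None:
    if not shm.new_model_request:
        return
    shm.model = list(ps_model)                        # download the PS model
    shm.local_state["global_clock"] = global_clock    # clock info travels with it
    shm.new_model_request = False                     # assumed: flag is cleared
    shm.new_model_available = True


def ml_take_model(shm: SharedMemory) -> Optional[List[float]]:
    if not shm.new_model_available:
        return None
    shm.new_model_available = False                   # assumed: flag is cleared
    return list(shm.model)                            # copy into the ML module's own memory


def ml_publish_gradient(shm: SharedMemory, gradient: List[float]) -> None:
    if shm.aggregated_gradient:
        shm.aggregated_gradient = [a + g for a, g in zip(shm.aggregated_gradient, gradient)]
    else:
        shm.aggregated_gradient = list(gradient)
    shm.model_updates = True


def bsp_gradient_merge(shm: SharedMemory, ps_model: List[float]) -> None:
    if not shm.model_updates:
        return                                        # nothing to collect this period
    for i, g in enumerate(shm.aggregated_gradient):
        ps_model[i] += g                              # assumed merge: element-wise add
    shm.aggregated_gradient = []
    shm.model_updates = False


ps_model = [1.0, 1.0]
shm = SharedMemory()
ml_request_model(shm)
bsp_serve_model_request(shm, ps_model, global_clock=0)
local = ml_take_model(shm)
ml_publish_gradient(shm, [-0.25, 0.5])
ml_publish_gradient(shm, [-0.25, 0.5])
bsp_gradient_merge(shm, ps_model)
```

In a deployment the ML-side and BSP-side functions would run in separate processes, so the accesses to the shared memory would additionally require a synchronization primitive, which is omitted here.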
For instance, each local BSP module 103 may store in its associated shared memory module 104 a global minimum clock calculated based on clock information obtained from all ML modules 105. The ML module 105 reads the global minimum clock, and interrupts its computation if the difference between its local clock and the global minimum clock exceeds a predefined threshold. The computation may be resumed once the global minimum clock has advanced far enough that the difference between the local clock and the global minimum clock is again bounded by the predefined threshold.
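By way of non-limiting illustration, the interruption and resumption of the computation may be sketched with a condition variable of the Python standard library. The class name `ClockGate`, its method names and the timeout values are hypothetical and serve only to keep the example terminating.

```python
import threading
from typing import Optional


class ClockGate:
    """Blocks an ML module while its local clock is too far ahead of the global minimum."""

    def __init__(self, threshold: int) -> None:
        self._threshold = threshold
        self._global_min_clock = 0
        self._cond = threading.Condition()

    def publish_global_min_clock(self, value: int) -> None:
        # Called on the local BSP module side when new clock information arrives.
        with self._cond:
            self._global_min_clock = value
            self._cond.notify_all()

    def wait_until_allowed(self, local_clock: int, timeout: Optional[float] = None) -> bool:
        # Returns True once the difference is within the threshold,
        # or False if the timeout expires first.
        with self._cond:
            return self._cond.wait_for(
                lambda: local_clock - self._global_min_clock <= self._threshold,
                timeout=timeout,
            )


gate = ClockGate(threshold=2)
blocked = not gate.wait_until_allowed(local_clock=5, timeout=0.05)   # 5 - 0 exceeds 2
gate.publish_global_min_clock(3)
resumed = gate.wait_until_allowed(local_clock=5, timeout=0.05)       # 5 - 3 is within 2
```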
The present disclosure has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed disclosure, from the studies of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.
This application is a continuation of International Application No. PCT/EP2017/055602, filed on Mar. 9, 2017, the disclosure of which is hereby incorporated by reference in its entirety.
Relation | Number | Date | Country
---|---|---|---
Parent | PCT/EP2017/055602 | Mar 2017 | US
Child | 16387247 | | US