The present application is concerned with distributed learning of neural networks such as federated learning or data-parallel learning, and concepts which may be used therein such as concepts for transmission of parameterization updates.
In most common machine learning scenarios it is assumed, or even required, that all the data on which the algorithm is trained is gathered and localized in a central node. However, in many real-world applications, the data is distributed among several nodes, e.g., in IoT or mobile applications, implying that it can only be accessed through these nodes. That is, it is assumed that the data cannot be collected in a single central node. This might be, for instance, for efficiency reasons and/or privacy reasons. Consequently, the training of machine learning algorithms is modified and accommodated to this distributed scenario.
The field of distributed deep learning is concerned with the problem of training neural networks in such a distributed learning setting. In principle, the training is usually divided into two stages: first, the neural network is trained at each node on the local data and, second, a communication round takes place in which the nodes share their training progress with each other. The process may be cyclically repeated. The last step is essential because it merges the learnings made at each node into the neural network, eventually allowing it to generalize throughout the entire distributed data set.
It becomes immediately clear that distributed learning, while spreading the computational load onto several entities, comes at the cost of having to communicate data to and from the individual nodes or clients. Thus, in order to achieve an efficient learning scenario, the communication overhead needs to be kept at a reasonable amount. If lossy coding is used for the communication, care should be taken as coding loss may slow down the learning progress and, accordingly, increase the cycles entailed in order to attain a converged state of the neural network's parameterization.
Accordingly, it is an object of the present invention to provide concepts for distributed learning which render distributed learning more efficient.
According to an embodiment, a method for federated learning of a neural network by clients in cycles may have the steps of, in each cycle: downloading, to a predetermined client, information on a setting of a parameterization of the neural network, the predetermined client, updating the setting of the parameterization of the neural network using training data at least partially individually gathered by the respective client to obtain a parameterization update, and uploading information on the parameterization update, merging the parameterization update with further parametrization updates of other clients to obtain a merged parameterization update defining a further setting for the parameterization for a subsequent cycle, wherein the uploading of the information on the parameterization update has lossy coding of an accumulated parametrization update corresponding to a first accumulation of the parameterization update of a current cycle on the one hand and coding losses of uploads of information on parameterization updates of previous cycles on the other hand.
Another embodiment may have a system for federated learning of a neural network in cycles, the system having a server and clients and configured to, in each cycle download, from the server to a predetermined client, information on a setting of a parameterization of the neural network, the predetermined client, updating the setting of the parameterization of the neural network using training data at least partially individually gathered by the respective client to obtain a parameterization update, and uploading information on the parameterization update, merge, by the server, the parameterization update with further parametrization updates of other clients to obtain a merged parameterization update defining a further setting for the parameterization for a subsequent cycle, wherein the uploading of the information on the parameterization update has lossy coding of an accumulated parametrization update corresponding to a first accumulation of the parameterization update of a current cycle on the one hand and coding losses of uploads of information on parameterization updates of previous cycles on the other hand.
Another embodiment may have a client device for decentralized training contribution to federated learning of a neural network in cycles, the client device being configured to, in each cycle, receive information on a setting of a parameterization of the neural network, gather training data, update the setting of the parameterization of the neural network using the training data to obtain a parameterization update, and upload information on the parameterization update for being merged with the parameterization updates of other client devices to obtain a merged parameterization update defining a further setting of the parameterization for a subsequent cycle, wherein the client device is configured to, in uploading the information on the parameterization update, lossy code an accumulated parametrization update corresponding to a first accumulation of the parameterization update of a current cycle on the one hand and coding losses of uploads of information on parameterization updates of previous cycles on the other hand.
According to another embodiment, a method for decentralized training contribution to federated learning of a neural network in cycles may have the steps of, in each cycle: receiving information on a setting of a parameterization of the neural network, gathering training data, updating the setting of the parameterization of the neural network using the training data to obtain a parameterization update, and uploading information on the parameterization update for being merged with the parameterization updates of other client devices to obtain a merged parameterization update defining a further setting of the parameterization for a subsequent cycle, wherein the method has, in uploading the information on the parameterization update, lossy coding an accumulated parametrization update corresponding to a first accumulation of the parameterization update of a current cycle on the one hand and coding losses of uploads of information on parameterization updates of previous cycles on the other hand.
According to still another embodiment, a method for distributed learning of a neural network by clients in cycles may have the steps of, in each cycle: downloading, to a predetermined client, information on a setting of a parameterization of the neural network, the predetermined client updating the setting of the parameterization of the neural network using training data to obtain a parameterization update, and uploading information on the parameterization update, merging the parameterization update with further parametrization updates of the other clients to obtain a merged parameterization update which defines a further setting of the parameterization for a subsequent cycle, wherein, in a predetermined cycle, the downloading the information on the setting of the parameterization of the neural network has downloading information on the merged parametrization update of a preceding cycle by lossy coding of an accumulated merged parametrization update corresponding to a first accumulation of the merged parametrization update of the preceding cycle on the one hand and coding losses of downloads of information on merged parametrization updates of cycles preceding the preceding cycle on the other hand.
Another embodiment may have a system for distributed learning of a neural network in cycles, the system having a server and clients and configured to, in each cycle download, from the server to a predetermined client, information on a setting of a parameterization of the neural network, the predetermined client updating the setting of the parameterization of the neural network using training data to obtain a parameterization update, and uploading information on the parameterization update, merge, by the server, the parameterization update with further parametrization updates of the other clients to obtain a merged parameterization update which defines a further setting of the parameterization for a subsequent cycle, wherein, in a predetermined cycle, the downloading the information on the setting of the parameterization of the neural network has downloading information on the merged parametrization update of a preceding cycle by lossy coding of an accumulated merged parametrization update corresponding to a first accumulation of the merged parametrization update of the preceding cycle on the one hand and coding losses of downloads of information on merged parametrization updates of cycles preceding the preceding cycle on the other hand.
Another embodiment may have an apparatus for coordinating a distributed learning of a neural network by clients in cycles, the apparatus configured to, per cycle, download, to a predetermined client, information on a setting of a parameterization of the neural network for sake of the clients updating the setting of the parameterization of the neural network using training data to obtain a parameterization update, receive information on the parameterization update from the predetermined client, merge the parameterization update with further parametrization updates from other clients to obtain a merged parameterization update which defines a further setting of the parameterization for a subsequent cycle, wherein the apparatus is configured to, in a predetermined cycle, in downloading the information on the setting of the parameterization of the neural network, download the merged parametrization update of a preceding cycle by lossy coding of an accumulated merged parametrization update corresponding to a first accumulation of the merged parametrization update of the preceding cycle on the one hand and coding losses of downloads of information on merged parametrization updates of cycles preceding the preceding cycle on the other hand.
According to another embodiment, a method for coordinating a distributed learning of a neural network by clients in cycles may have the step of, per cycle, downloading, to a predetermined client, information on a setting of a parameterization of the neural network for sake of the clients updating the setting of the parameterization of the neural network using training data to obtain a parameterization update, receiving information on the parameterization update from the predetermined client, merging the parameterization update with further parametrization updates from other clients to obtain a merged parameterization update which defines a further setting of the parameterization for a subsequent cycle, wherein the method has, in a predetermined cycle, in downloading the information on the setting of the parameterization of the neural network, downloading the merged parametrization update of a preceding cycle by lossy coding of an accumulated merged parametrization update corresponding to a first accumulation of the merged parametrization update of the preceding cycle on the one hand and coding losses of downloads of information on merged parametrization updates of cycles preceding the preceding cycle on the other hand.
Still another embodiment may have a non-transitory digital storage medium having stored thereon a computer program for performing a method for federated learning of a neural network by clients in cycles, the method having in each cycle downloading, to a predetermined client, information on a setting of a parameterization of the neural network, the predetermined client, updating the setting of the parameterization of the neural network using training data at least partially individually gathered by the respective client to obtain a parameterization update, and uploading information on the parameterization update, merging the parameterization update with further parametrization updates of other clients to obtain a merged parameterization update defining a further setting for the parameterization for a subsequent cycle, wherein the uploading of the information on the parameterization update has lossy coding of an accumulated parametrization update corresponding to a first accumulation of the parameterization update of a current cycle on the one hand and coding losses of uploads of information on parameterization updates of previous cycles on the other hand, when said computer program is run by a computer.
Another embodiment may have a non-transitory digital storage medium having stored thereon a computer program for performing a method for coordinating a distributed learning of a neural network by clients in cycles, the method having, per cycle, downloading, to a predetermined client, information on a setting of a parameterization of the neural network for sake of the clients updating the setting of the parameterization of the neural network using training data to obtain a parameterization update, receiving information on the parameterization update from the predetermined client, merging the parameterization update with further parametrization updates from other clients to obtain a merged parameterization update which defines a further setting of the parameterization for a subsequent cycle, wherein the method has, in a predetermined cycle, in downloading the information on the setting of the parameterization of the neural network, downloading the merged parametrization update of a preceding cycle by lossy coding of an accumulated merged parametrization update corresponding to a first accumulation of the merged parametrization update of the preceding cycle on the one hand and coding losses of downloads of information on merged parametrization updates of cycles preceding the preceding cycle on the other hand, when said computer program is run by a computer.
The present application is concerned with several aspects of improving the efficiency of distributed learning. In accordance with a first aspect, for example, a special type of distributed learning scenario, namely federated learning, is improved by performing the upload of parameterization updates obtained by the individual nodes or clients using at least partially individually gathered training data by use of lossy coding. In particular, an accumulated parameterization update corresponding to an accumulation of the parameterization update of a current cycle on the one hand and coding losses of uploads of information on parameterization updates of previous cycles on the other hand is lossy coded. The inventors of the present application found that accumulating the coding losses of previous parameterization update uploads onto the current parameterization update increases the coding efficiency even in cases of federated learning where the training data is, at least partially, gathered individually by the respective clients or nodes, i.e., circumstances where the amount of training data and the sort of training data is non-evenly distributed over the various clients/nodes and where the individual clients typically perform their training in parallel without merging their training results more intensively. The accumulation offers, for instance, an increase of the coding loss at equal learning convergence rate or, vice versa, an increased learning convergence rate at equal communication overhead for the parameterization updates.
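Merely as an illustration, the accumulation of coding losses on the upload path may be sketched as follows. The sketch assumes a simple top-k sparsification as the lossy coder, which is only one of many possible choices; the names `top_k_sparsify` and `AccumulatingUploader` are hypothetical and are not taken from the present application.

```python
from typing import List


def top_k_sparsify(values: List[float], k: int) -> List[float]:
    """Hypothetical lossy coder: keep the k largest-magnitude entries, zero the rest."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    kept = set(order[:k])
    return [v if i in kept else 0.0 for i, v in enumerate(values)]


class AccumulatingUploader:
    """Client-side state: the coding loss of previous uploads is carried forward."""

    def __init__(self, num_params: int, k: int) -> None:
        self.residual = [0.0] * num_params  # accumulated coding loss so far
        self.k = k

    def encode(self, update: List[float]) -> List[float]:
        # Accumulate the current update onto the coding losses of previous cycles.
        accumulated = [u + r for u, r in zip(update, self.residual)]
        # Lossy code the accumulated update (this is what would be uploaded).
        coded = top_k_sparsify(accumulated, self.k)
        # Whatever was not transmitted becomes the coding loss for later cycles.
        self.residual = [a - c for a, c in zip(accumulated, coded)]
        return coded


uploader = AccumulatingUploader(num_params=4, k=1)
first = uploader.encode([0.5, 0.25, -0.125, 0.0])
second = uploader.encode([0.125, 0.25, 0.0, 0.0])
```

In this example the second parameter is not transmitted in the first cycle, but its untransmitted value is added to the update of the second cycle, where it then becomes the largest entry and is uploaded.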
In accordance with a further aspect of the present application, distributed learning scenarios, irrespective of being of the federated or data-parallel learning type, are made more efficient by performing the download of information on parameterization settings to the individual clients/nodes by downloading merged parameterization updates resulting from merging the parameterization updates of the clients in each cycle and, additionally, performing this download of merged parameterization updates using lossy coding of an accumulated merged parameterization update. That is, in order to inform clients on the parameterization setting in a current cycle, a merged parameterization update of a preceding cycle is downloaded. To this end, an accumulated merged parameterization update corresponding to an accumulation of the merged parameterization update of a preceding cycle on the one hand and coding losses of downloads of merged parameterization updates of cycles preceding the preceding cycle on the other hand is lossy coded. The inventors of the present invention have found that even the downlink path for providing the individual clients/nodes with the starting point for the individual trainings forms a possible occasion for improving the learning efficiency in a distributed learning environment. By rendering the merged parameterization update download aware of coding losses of previous downloads, the download data amount may, for instance, be reduced at the same or almost the same learning convergence rate or, vice versa, the learning convergence rate may be increased using the same download overhead.
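As a non-limiting illustration of the download path, the following sketch lets a server lossy code the accumulated merged parameterization update and shows that the parameterization held at a client then differs from the exact parameterization at the server only by the currently stored coding loss. Uniform quantization is assumed as the lossy coder purely for simplicity; `quantize` and `AccumulatingDownloader` are hypothetical names.

```python
from typing import List


def quantize(values: List[float], step: float) -> List[float]:
    """Hypothetical lossy coder: uniform quantization to multiples of `step`."""
    return [round(v / step) * step for v in values]


class AccumulatingDownloader:
    """Server-side state: the coding loss of previous downloads is carried forward."""

    def __init__(self, num_params: int, step: float) -> None:
        self.residual = [0.0] * num_params
        self.step = step

    def encode(self, merged_update: List[float]) -> List[float]:
        # Accumulate the merged update onto the coding losses of earlier downloads.
        accumulated = [m + r for m, r in zip(merged_update, self.residual)]
        coded = quantize(accumulated, self.step)
        self.residual = [a - c for a, c in zip(accumulated, coded)]
        return coded


server_model = [0.0, 0.0]  # exact parameterization kept at the server
client_model = [0.0, 0.0]  # parameterization reconstructed at a client
downloader = AccumulatingDownloader(num_params=2, step=0.5)

for merged in ([0.375, -0.625], [0.25, 0.5]):
    server_model = [w + m for w, m in zip(server_model, merged)]
    coded = downloader.encode(merged)  # what is actually downloaded
    client_model = [w + c for w, c in zip(client_model, coded)]
```

The design choice illustrated here is that the coding loss is not discarded: the gap between the client-side and server-side parameterization is exactly the stored residual, which is offered for transmission again in the next cycle.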
Another aspect which the present application relates to is concerned with parameterization update coding in general, i.e., irrespective of being used relating to downloads of merged parameterization updates or uploads of individual parameterization updates, and irrespective of being used in distributed learning scenarios of the federated or data-parallel learning type. In accordance with this aspect, consecutive parameterization updates are lossy coded and entropy coding is used. The probability distribution estimates used for entropy coding for a current parameterization update are derived from an evaluation of the lossy coding of previous parameterization updates or, in other words, depending on an evaluation of portions of the neural network's parameterization for which no update values are coded in previous parameterization updates. The inventors of the present invention have found that evaluating, for instance, for each parameter of the neural network's parameterization whether, and for which cycle, an update value has been coded in the previous parameterization updates, i.e., the parameterization updates in the previous cycles, makes it possible to gain knowledge about the probability distribution for lossy coding the current parameterization update. Owing to the improved probability distribution estimates, the entropy coding of the lossy coded consecutive parameterization updates is rendered more efficient. The concept works with or without coding loss aggregation. For example, based on the evaluation of the lossy coding of previous parameterization updates, a probability might be determined, for each parameter of the parameterization, as to whether an update value is coded in a current parameterization update or not, i.e., is left uncoded. 
A flag may then be coded for each parameter to indicate whether for the respective parameter an update value is coded by the lossy coding by the current parameterization update, or not, and the flag may be coded using entropy coding using the determined probability of the respective parameter. Alternatively, the parameters for which update values are comprised by the lossy coding of the current parameterization update may be indicated by pointers or addresses coded using a variable length code, the code word length of which increases for the parameters in an order which depends on, or increases with, the probability for the respective parameter to have an update value included by the lossy coding of the current parameterization update.
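The effect of history-dependent probability estimates on the flag coding may be illustrated by the following sketch. It assumes, purely for illustration, that the probability of a parameter being coded grows linearly with the number of cycles since it was last coded; this estimator, its constants and the function names are hypothetical. The cost of the flags is measured as the ideal entropy-coded length, i.e., minus the base-2 logarithm of the estimated probability of each flag value, rather than by running an actual arithmetic coder.

```python
import math
from typing import List


def flag_probabilities(cycles_since_coded: List[int],
                       base: float = 0.05,
                       gain: float = 0.1,
                       cap: float = 0.95) -> List[float]:
    """Hypothetical estimator: P(update value is coded) per parameter,
    derived from how many cycles ago the parameter was last coded."""
    return [min(cap, base + gain * c) for c in cycles_since_coded]


def ideal_flag_bits(flags: List[bool], probs: List[float]) -> float:
    """Ideal entropy-coded length of the flags under the given estimates."""
    return sum(-math.log2(p if f else 1.0 - p) for f, p in zip(flags, probs))


# History: parameters 0 and 3 have been left uncoded for a long time.
cycles_since_coded = [8, 0, 1, 9]
flags = [True, False, False, True]  # flags of the current parameterization update

adaptive_bits = ideal_flag_bits(flags, flag_probabilities(cycles_since_coded))
uniform_bits = ideal_flag_bits(flags, [0.5] * len(flags))
```

When the estimates match the actual flags well, as in this constructed example, the adaptive cost falls well below the one bit per flag needed without any probability model.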
An even further aspect of the present application relates to the coding of parameterization updates irrespective of being used in download or upload direction, and irrespective of being used in a federated or data-parallel learning scenario, wherein the coding of consecutive parameterization updates is done using lossy coding, namely by coding identification information which identifies the coded set of parameters for which the update values belong to the coded set of update values along with an average value for representing the coded set of update values, i.e. they are quantized to that average value. The scheme is very efficient in terms of weighing up between data amount spent per parametrization update on the one hand and convergence speed on the other hand. In accordance with an embodiment, the efficiency, i.e. the weighing up between data amount on the one hand and convergence speed on the other hand, is increased even further by determining the set of coded parameters for which an update value is comprised by the lossy coding of the parameterization update in the following manner: two sets of update values in the current parameterization update are determined, namely a first set of highest update values and a second set of lowest update values. Among these, the largest set is selected as the coded set of update values, namely selected in terms of absolute average, i.e., the set the average of which is largest in magnitude. The average value of this largest set is then coded along with identification information as information on the current parameterization update, the identification information identifying the coded set of parameters of the parameterization, namely the ones the corresponding update value of which is included in the largest set. In other words, each round or cycle, either the highest (or positive) update values are coded, or the lowest (negative) update values are coded. 
Thereby, a signaling of any sign information for the coded update values in addition to the average value coded for the coded update values is unnecessary, thereby saving signaling overhead even further. The inventors of the present application have found that toggling or alternating between signaling highest and lowest update value sets in lossy coding consecutive parameterization updates in a distributed learning scenario (not in a regular sense, but in a statistical sense, as the selection depends on the training data) does not significantly impact the learning convergence rate, while the coding overhead is significantly reduced. This holds true both when applying coding loss accumulation with lossy coding the accumulated parameterization updates, and when coding the parameterization updates without coding loss accumulation.
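The selection between the set of highest and the set of lowest update values may be sketched as follows. The set size `k` and the function names are assumptions made for this illustration; how the identification information and the average value are actually written into a data stream is left open.

```python
from typing import List, Tuple


def encode_update(update: List[float], k: int) -> Tuple[List[int], float]:
    """Return (indices of the coded parameters, average value of their updates).

    Assumes 1 <= k <= len(update). The k highest and the k lowest update
    values are determined; the set whose average is larger in magnitude wins.
    """
    order = sorted(range(len(update)), key=lambda i: update[i])
    lowest, highest = order[:k], order[-k:]
    mean_low = sum(update[i] for i in lowest) / k
    mean_high = sum(update[i] for i in highest) / k
    if abs(mean_high) >= abs(mean_low):
        return sorted(highest), mean_high
    return sorted(lowest), mean_low


def decode_update(indices: List[int], mean: float, num_params: int) -> List[float]:
    """Reconstruct the update: coded parameters get the average, all others zero."""
    decoded = [0.0] * num_params
    for i in indices:
        decoded[i] = mean
    return decoded


update = [0.5, -2.0, 0.25, 1.5, -1.0, 0.0]
indices, mean = encode_update(update, k=2)
decoded = decode_update(indices, mean, len(update))
```

Here the two lowest values (-2.0 and -1.0) have the average of largest magnitude, so only their positions and the single value -1.5 are coded; the sign is carried by the average value itself.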
As should have become readily clear from the above brief outline of the aspects of the present application, these aspects, although being advantageous when implemented individually, may also be combined pairwise, in triplet or all of them.
Embodiments of the present application are described below with respect to the figures among which:
Before proceeding with the description of embodiments of the present application with respect to the various aspects of the present application, the following description briefly presents and discusses general arrangements and steps involved in a distributed learning scenario.
Just as an aside, it is noted that the input data which the neural network 16 is designed for may be picture data, video data, audio data, speech data and/or textual data, and the neural network 16 may, in a manner outlined in more detail below, be trained in such a manner that the one or more output nodes are indicative of certain characteristics associated with this input data such as, for instance, the recognition of a certain content in the respective input data, the prediction of some user action of a user confronted with the respective input data or the like. A concrete example could be, for instance, a neural network 16 which, when being fed with a certain sequence of alphanumeric symbols typed by a user, suggests possible alphanumeric strings most likely wished to be typed in, thereby attaining an auto-correction and/or auto-completion function for a user-written textual input.
As illustrated in
The clients 14 receive the information on the parameterization setting. The clients 14 are not only able to parameterize an internal instantiation of the neural network 16 accordingly, i.e., according to this setting, but the clients 14 are also able to train this neural network 16 thus parametrized using training data available to the respective client. Accordingly, in step 34, each client trains the neural network, parameterized according to the downloaded parameterization setting, using training data available to the respective client. In other words, the respective client updates the parameterization setting using the training data. Depending on whether the distributed learning is a federated learning or data-parallel learning, the source of the training data may be different: in case of federated learning, for example, each client 14 gathers its training data individually or separately from the other clients, or at least a portion of its training data is gathered by the respective client in this individual manner while a remainder is gained otherwise, such as by distribution by the server as done in data-parallel learning. The training data may, for example, be gained from user inputs at the respective client. In case of data-parallel learning, each client 14 may have received the training data from the server 12 or some other entity. That is, the training data then does not comprise any individually gathered portion. The splitting-up of a reservoir of training data into portions may be done evenly in terms of, for instance, amount of data and statistics of the data. Details in this regard are set out in more detail below. Most of the embodiments described herein below may be used in both types of distributed learning so that, unless otherwise stated, the embodiments described herein below shall be understood as being not specific for either one of the distributed learning types. 
As outlined in more detail below, the training 34 may, for instance, be performed using a stochastic gradient descent method. However, other possibilities exist as well.
Next, each client 14 uploads its parameterization update, i.e., the modification of the parameterization setting downloaded at 32. Each client, thus, informs the server 12 on the update. The modification results from the training in step 34 performed by the respective client 14. The upload 36 involves a sending or transmission from the clients 14 to server 12 and a reception of all these transmissions at server 12 and accordingly, step 36 is shown in
In step 38, the server 12 then merges all the parameterization updates received from the clients 14, the merging representing a kind of averaging such as by use of a weighted average with the weights considering, for instance, the amount of training data using which the parameterization update of a respective client has been obtained in step 34. The parameterization update thus obtained at step 38 at this end of cycle i indicates the parameterization setting for the download 32 at the beginning of the subsequent cycle i+1.
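One possible form of the merging in step 38, namely a weighted average with weights proportional to the amount of training data used by each client, may be sketched as follows. The function name and the representation of an update as a flat list of numbers are assumptions made for illustration only.

```python
from typing import List


def merge_updates(updates: List[List[float]], num_samples: List[int]) -> List[float]:
    """Weighted average of client updates; weights are the clients' data amounts."""
    total = sum(num_samples)
    num_params = len(updates[0])
    return [
        sum(n * u[j] for u, n in zip(updates, num_samples)) / total
        for j in range(num_params)
    ]


# Two clients: the second one trained on three times as much data.
merged = merge_updates([[1.0, 0.0], [0.0, 2.0]], num_samples=[1, 3])
```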
As already indicated above, the download 32 may be rendered more efficient and details in this regard are described in more detail below. One such task is, for instance, the performance of the download 32 in a manner so that the information on the parameterization setting is downloaded to the clients 14 in form of a prediction update or, to be more precise, merged parameterization update rather than downloading the parameterization setting again completely. While some embodiments described herein below relate to the download 32, others relate to the upload 36 or may be used in connection with both transmissions of parameterization updates. Insofar,
After having described the general framework of distributed learning, examples with respect to the neural networks which may form the subject of the distributed learning, the steps performed during such distributed learning and so forth, the following description of embodiments of the present application starts with a presentation of an embodiment dealing with federated learning which makes use of several of the aspects of the present application in order to provide the reader with a sort of overview of the individual aspects and an outline of their advantages, thereby rendering easier the subsequent description of embodiments which form kind of generalizations of this outline. Thus, the description brought forward first concerns a particular training method, namely the federated learning as described, for instance, in [2]. Here, it is proposed to train neural networks 16 in the distributed setting in the manner outlined with respect to
Extensive experiments have shown that one can accurately train neural networks in a distributed setting via the federated learning procedure. In Federated learning, the training data and computation resources are, thus, distributed over multiple nodes 14. The goal is to learn a model from the joint training data of all nodes 14. One communication round 30 of synchronized distributed SGD consists of the steps of (
However, usually, in order to accurately train a neural network via the federated learning method, many communication rounds 30 (that is, many download and upload steps) are used. This implies that the method can be very inefficient in practice if the goal is to train large and deep neural networks (which is usually the desired case). For example, standard deep neural networks which solve state of the art computer vision tasks are around 500 MB in size. Extended experiments have confirmed that the federated learning uses at least 100 communication rounds to solve these computer vision tasks. Hence, in total, we would have to send/receive at least 100 GB (=2×100×500 MB) during the entire training procedure. Hence, reducing the communication cost is critical for being able to make use of this method in practice.
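The figure of 100 GB follows directly from the numbers given above, as the following short calculation illustrates. It assumes that the full parameterization is transmitted once in each direction per communication round and that 1 GB is counted as 1000 MB; the function name is hypothetical.

```python
def total_traffic_mb(model_size_mb: float, rounds: int, directions: int = 2) -> float:
    """Traffic per client when the full model is sent in each direction every round."""
    return directions * rounds * model_size_mb


traffic_mb = total_traffic_mb(model_size_mb=500, rounds=100)
traffic_gb = traffic_mb / 1000  # decimal gigabytes assumed
```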
A possible solution for solving this communication inefficiency is to lossy compress the gradients and upload/download a compressed version of the change of the neural network [6]. However, the compression induces quantization noise into the gradients, which decreases the training efficiency of the federated learning method (either by decreasing the accuracy of the network or using a higher number of communication rounds). Hence, in the standard federated learning we face this efficiency-performance bottleneck, which hinders its practicality for real case scenarios.
Considering the above-mentioned drawbacks, the embodiments and aspects described further below individually or together solve the efficiency-performance bottleneck in the following manner.
Using the above concepts individually or together we are able to reduce the communication costs by a high factor. When using them all together, for instance, the communication cost reduction may be of a factor of at least 1000 without affecting the training performance in some of the standard computer vision tasks.
Before starting with a description of embodiments which relate to federated learning, and then subsequently broadening this description with respect to certain embodiments of the various aspects of the present application, the following section provides some description of neural networks and their learning in general, using mathematical notation which will subsequently be used.
On the highest level of abstraction, a Deep Neural Network (DNN), which network 16 may represent, is a function
$f_W : \mathbb{R}^{S_{in}} \to \mathbb{R}^{S_{out}}$ (1)
that maps real-valued input tensors x (i.e., the input applied onto the nodes of the input layer of the neural network 16) with shape Sin to real-valued output tensors of shape Sout (i.e., the output values or activations resulting after prediction by the neural network 16 at the nodes of the output layer, i.e., layer J in
$\mathrm{dist} : \mathbb{R}^{S_{out}} \times \mathbb{R}^{S_{out}} \to \mathbb{R}$ (2)
The goal in supervised learning is to find parameters W, a setting for the parameterization 18, for which the DNN most closely matches the desired output on the training data D={(x_i, y_i) | i=1, . . . , n}, i.e. to solve the optimization problem
$W^* = \arg\min_W l(W, D)$ (3)
with
$l(W, D) = \sum_{i=1}^{n} \mathrm{dist}(f_W(x_i), y_i)$ (4)
being called the loss-function. The hope is that model W*, resulting from solving optimization problem (3), will also generalize well to unseen data $\hat{D}$ that is disjoint from the data D used for training, but that follows the same distribution. The generalization capability of any machine learning model generally depends heavily on the amount of available training-data.
Solving the problem (3) is highly non-trivial, because the loss function is usually non-linear, non-convex and extremely high-dimensional. By far the most common way to solve (3) is to use an iterative optimization technique called stochastic gradient descent (SGD). The algorithm for vanilla SGD is given in
While many adaptations to the algorithm of
W′=SGD(W,D,θ) (5)
with θ being the set of all optimization-specific hyperparameters (such as the learning-rate or the number of iterations). The quality of the improvement usually depends both on the amount of data available and on the amount of computational resources that is invested.
The weights and weight-updates are typically calculated and stored in 32-bit floating-point arithmetic.
In many real-world scenarios the training data D and the computational resources are distributed over a multitude of entities (called “clients” 14 in the following). This distribution of data and computation can either be an intrinsic property of the problem setting (for example, because the data is collected and stored on mobile or embedded devices) or it can be deliberately induced by a machine learning practitioner (e.g., to speed up computations via a higher level of parallelism). The goal in distributed training is to train a global model, using all of the clients' training data, without sending this data around. This is achieved by performing the following steps: Clients that want to contribute to the global training first synchronize with the current global model by downloading 32 it from a server. They then compute 34 a local weight-update using their own local data and upload 36 it to the server. At the server, all weight-updates are aggregated 38 to form a new global model.
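The four steps just listed can be sketched in a few lines of Python. This is only an illustration under assumptions that are not taken from the description: the weights are a flat list of floats, the aggregation 38 is a plain element-wise average, and `toy_local_update` is a hypothetical stand-in for the local training 34.

```python
from typing import Callable, List, Sequence

Weights = List[float]


def aggregate(updates: Sequence[Weights]) -> Weights:
    """Aggregate 38: element-wise average of the clients' weight-updates (assumed rule)."""
    n = len(updates)
    return [sum(column) / n for column in zip(*updates)]


def communication_round(
    global_weights: Weights,
    client_datasets: Sequence[list],
    local_update: Callable[[Weights, list], Weights],
) -> Weights:
    """One synchronous round: download 32, compute 34, upload 36, aggregate 38."""
    uploads = []
    for data in client_datasets:
        local_weights = list(global_weights)       # download 32: copy of the global model
        delta = local_update(local_weights, data)  # compute 34: uses local data only
        uploads.append(delta)                      # upload 36: only the update leaves the client
    mean_delta = aggregate(uploads)                # aggregate 38 at the server
    return [w + d for w, d in zip(global_weights, mean_delta)]


def toy_local_update(weights: Weights, data: list, lr: float = 0.5) -> Weights:
    """Hypothetical local step: one gradient step on the squared distance to the data mean."""
    target = sum(data) / len(data)
    return [-lr * 2.0 * (w - target) for w in weights]


# Two clients with different local data; the raw data never reaches the server.
new_weights = communication_round([0.0], [[1.0, 3.0], [5.0]], toy_local_update)
print(new_weights)
```

Only the weight-updates cross the client boundary in this sketch, which is the property the distributed setting relies on.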
Below, we will give a short description of two typical settings in which distributed Deep Learning occurs:
Federated Learning: In the Federated Learning setting, the clients 14 are embodied as data-collecting mobile or embedded devices. Already today, these devices collect huge amounts of data that could be used to train Deep Neural Networks. However, this data is often privacy-sensitive and therefore cannot be shared with a centralized server (private pictures or text messages on a user's phone, . . . ). Distributed Deep Learning enables training a model with the shared data of all clients 14, without any of the clients having to reveal their training data to a centralized server 12. While information about the training data could theoretically be inferred from the parameter updates, [3] shows that it is possible to come up with a protocol that even conceals these updates, such that it is possible to jointly train a DNN without compromising the privacy of the contributors of the data at all. Since the training data on a given client will typically be based on the usage of the mobile device by its user, the distribution of the data among the clients 14 will usually be non-iid, and any particular user's local dataset will not be representative of the whole distribution. The amount of data will also typically be unbalanced, since different users make use of a service or app to a different extent, leading to varying amounts of local training data. Furthermore, many scenarios are imaginable in which the total number of clients participating in the optimization is much larger than the average number of examples per client. In the Federated Learning setting, communication cost is typically a crucial factor, since mobile connections are often slow, expensive and unreliable.
Data-Parallel Learning: Training modern neural network architectures with millions of parameters on huge data-sets such as ImageNet [4] can take a very long time, even on the most high-end hardware. A very common technique to speed up training is to make use of increased data-parallelism by letting multiple machines compute weight-updates simultaneously on different subsets of the training data. To do so, the training data D is split over all clients 14 in an even and balanced manner, as this reduces the variance between the individual weight-updates in each communication round. The splitting may be done by the server 12 or some other entity. Every client computes, in parallel, a new weight-update on its local data, and the server 12 then averages over all weight-updates. Data-parallel training is the most common way to introduce parallelism into neural network training, because it is very easy to implement and has great scalability properties. Model-parallelism, in contrast, scales much worse with bigger datasets and is tedious to implement for more complicated neural network architectures. Still, the number of clients in data-parallel training is relatively small compared to federated learning, because the speed-up achievable by parallelization is limited by the non-parallelizable parts of the computation, most prominently the communication used after each round of parallel computation. For this reason, reducing the communication time is the most crucial factor in data-parallel learning. As a side note, if the local batch-size and the number of local iterations are equal to one for all clients, one communication round of data-parallel SGD is mathematically equivalent to one iteration of regular SGD with a batch-size equal to the number of participating clients.
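The equivalence stated in the side note can be checked numerically. The following sketch assumes a hypothetical one-parameter model with squared-error loss and a made-up set of four samples, one per client; it is meant as an illustration of the argument, not as part of the described method.

```python
import math
from typing import List, Tuple

Sample = Tuple[float, float]


def gradient(w: float, sample: Sample) -> float:
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    x, y = sample
    return 2.0 * (w * x - y) * x


def data_parallel_round(w: float, samples: List[Sample], lr: float) -> float:
    """Each client holds one sample and runs one local iteration with batch-size one;
    the server averages the resulting weight-updates."""
    updates = [-lr * gradient(w, s) for s in samples]
    return w + sum(updates) / len(updates)


def minibatch_sgd_step(w: float, samples: List[Sample], lr: float) -> float:
    """One regular SGD iteration whose batch contains all samples."""
    mean_gradient = sum(gradient(w, s) for s in samples) / len(samples)
    return w - lr * mean_gradient


samples = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 6.0)]  # hypothetical data
w_parallel = data_parallel_round(0.5, samples, lr=0.01)
w_sgd = minibatch_sgd_step(0.5, samples, lr=0.01)
print(math.isclose(w_parallel, w_sgd))
```

Both routes average the same per-sample gradients, so the two results agree up to floating-point rounding.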
We systematically compare the two settings in the subsequent table.
The above table compares the two main settings in which training from distributed data occurs. These two settings form the two ends of the spectrum of situations in which learning from distributed data occurs. Many scenarios that lie in between these two extremes are imaginable.
Distributed training as described above may be performed in a synchronous manner. Synchronized training has a benefit in that it ensures that no weight update is outdated at the time it arrives at the server. Outdated weight-updates may otherwise destabilize the training. Therefore, synchronous distributed training might be performed, but the subsequently described embodiments may also be different in this regard. We describe the general form of Synchronous Distributed SGD in
During every communication round or cycle of synchronous distributed SGD, every client 14 should once download 32 the global model (parameterization setting) from the server 12 and later upload 36 its newly computed local weight-update back to the server 12. If this is done naively, the number of bits that have to be transferred at up- and download can be severe. Imagine a modern neural network 16 with 10 million parameters being trained using synchronous distributed SGD. If the global weights W and local weight-updates ΔWi are stored and transferred as 32-bit floating-point numbers, this leads to 40 MB of traffic at every up- and download. This is much more than the typical data plan of a mobile device can support in the federated learning setting, and it can cause a severe bottleneck in data-parallel learning that significantly limits the amount of parallelization possible.
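The 40 MB figure follows directly from the parameter count and the word size. The short calculation below reproduces it, assuming 1 MB = 10^6 bytes; the helper name `traffic_megabytes` is introduced only for this illustration.

```python
def traffic_megabytes(num_parameters: int, bits_per_parameter: int = 32) -> float:
    """Uncompressed size of one up- or download in megabytes (1 MB = 10**6 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e6


per_transfer = traffic_megabytes(10_000_000)  # 10 million parameters at 32 bit each
per_round = 2 * per_transfer                  # one download 32 plus one upload 36 per client
print(per_transfer, per_round)
```

Every client thus moves twice the per-transfer amount in each communication round when nothing is compressed.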
An impressive amount of scientific work has been published in the last couple of years that investigates ways to reduce the amount of communication in distributed training. This underlines the relevance of the problem.
[8] identifies the problem setting of Federated Learning and proposes a technique called Federated Averaging to reduce the number of communication rounds used to achieve a certain target accuracy. In Federated Averaging, the number of iterations for every client is increased from one single iteration to multiple iterations. The authors claim that their method can reduce the number of communication rounds used by a factor of 10×-100× on different convolutional and recurrent neural network architectures.
The authors of [10] propose a training scheme for federated learning with iid data in which the clients only upload a fraction of their local gradients with the biggest magnitude and download only the model parameters that are most frequently updated. Their method results in a drop of convergence speed and final accuracy of the trained model, especially at higher sparsity levels.
In [6], the authors investigate structured and sketched updates as a means to reduce the amount of communication in Federated Averaging. For structured updates, the clients are restricted to learn low-rank or sparse updates to their weights. For sketched updates, the authors investigate random masking and probabilistic quantization. Their methods can reduce the amount of communication used by up to two orders of magnitude, but also incur a drop in accuracy and convergence speed.
In [7], the authors demonstrate that it is possible to achieve up to 99.9% gradient sparsity in the upload for the Data-Parallel Learning setting on modern architectures. They achieve this by only sending the 0.1% of gradients with the biggest magnitude and accumulating the rest of the gradients locally. They additionally apply four tricks to ensure that their method does not slow down the convergence or reduce the final accuracy achieved by the model. These tricks include using a curriculum to slowly increase the amount of sparsity in the first couple of communication rounds and applying momentum factor masking to overcome the problem of gradient staleness. They report results for modern convolutional and recurrent neural network architectures on big data-sets.
In [1], a “Deep Gradient Compression” concept is presented, but no use of the additional four tricks is made. Consequently, their method entails a loss in convergence speed and final accuracy.
Paper [12] proposes to stochastically quantize the gradients to three (ternary) values. By doing so, a moderate compression rate of approximately ×16 is achieved, while accuracy drops marginally on big modern architectures. The convergence of the method is mathematically proven under the assumption of gradient-boundedness.
In [9], the authors show empirically that it is possible to quantize the weight-updates in distributed SGD to 1 bit without harming convergence speed, if the quantization errors are accumulated. The authors report results on a language-modeling task, using a recurrent neural network.
In [2], QSGD (communication-efficient SGD) is presented. QSGD explores the trade-off between accuracy and gradient precision. The effectiveness of gradient quantization is justified and the convergence of QSGD is proven.
In an approach presented in [11], only gradients with a magnitude greater than a certain predefined threshold are sent to the server. All other gradients are aggregated in a residual.
Other authors such as in [5] and [14] investigated the effects of reducing the precision of both weights and gradients. The results they get are considerably worse than the ones achievable if only the weight-updates are compressed.
The framework presented below relies on the following observations:
therefore the stochastic gradient is a noisy approximation of the true gradient
$\Delta\tilde{W}_i = \mathrm{compress}(\Delta W_i)$ (9)
$\Delta\tilde{W} = \mathrm{compress}(\Delta W)$ (10)
$A_i \leftarrow \alpha A_i + \Delta W_i$ (11)
$\Delta\tilde{W}_i \leftarrow \mathrm{compress}_C(A_i)$ (12)
$A_i \leftarrow A_i - \Delta\tilde{W}_i$ (13)
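The interplay of (11) to (13) may be easier to follow in code. The sketch below is a minimal illustration, assuming flat lists of floats and using a hypothetical top-k sparsifier as the compression operator; any other lossy operator could take its place, and the function names are not taken from the description.

```python
from typing import List, Tuple


def top_k_sparsify(values: List[float], k: int) -> List[float]:
    """Hypothetical compress(): keep the k largest-magnitude entries, zero the rest."""
    ranked = sorted(range(len(values)), key=lambda j: abs(values[j]), reverse=True)
    keep = set(ranked[:k])
    return [v if j in keep else 0.0 for j, v in enumerate(values)]


def accumulate_and_compress(
    residual: List[float], update: List[float], k: int, alpha: float = 1.0
) -> Tuple[List[float], List[float]]:
    """Apply (11)-(13); returns (coded update to transmit, new residual)."""
    accumulated = [alpha * a + d for a, d in zip(residual, update)]  # (11)
    coded = top_k_sparsify(accumulated, k)                           # (12)
    new_residual = [a - c for a, c in zip(accumulated, coded)]       # (13)
    return coded, new_residual


residual = [0.0, 0.0, 0.0]
sent = []
for update in ([0.5, 0.25, -1.0], [0.5, 0.25, 0.0], [0.5, 0.25, 0.0]):
    coded, residual = accumulate_and_compress(residual, update, k=1)
    sent.append(coded)
print(sent, residual)
```

With α=1, the transmitted updates plus the final residual add up to the sum of all raw updates, so nothing that was left uncoded in one cycle is lost; it is merely delayed.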
A framework which makes use of all of the above-discussed insights and concepts is shown in
Each client uses lossy coding 36′ for the upload of the just-obtained parameterization update ΔWi. To this end, each client i locally manages an accumulation of coding losses or coding errors of the parameterization update during preceding cycles. The accumulated sum of client i is indicated in
It should be noted that there are two sources for the coding loss: firstly, not all of the accumulated parameterization update values 80 are actually coded. For example, in
Even the accumulated parameterization update values 80 comprised by the lossy coding, however, the positions of parameters 26 of which are indicated non-hatched in the coded parameterization update 66 in
The upload of the parameterization update as transmitted by the client i at 36a is completed by the reception at the server at 36b. As just-described: parameterization values left uncoded in the lossy coding 64 are deemed to be zero at the server.
The server then merges the gathered parameterization updates at 38 by using, as illustrated in
As a result of performing the distributed learning in the manner as depicted in
In the following, some notes are made with respect to possibilities with respect to the determination as to which parameterization update values 80 should actually be coded and how they should be coded or quantized. Examples are provided and they may be used in the example of
In quantization, compression is achieved by reducing the number of bits used to store the weight-update. Every quantization method Q is fully defined by the way it computes the different quantiles q and by the rounding scheme it applies.
$\tilde{W} = \mathrm{quantize}(W, q(W, m)), \quad q(W, m) = \{q_1 < q_2 < \dots < q_m\}$ (14)
The rounding scheme can be deterministic
or stochastic
Possible Quantization schemes include
q(W)={−max(|W|),0,max(|W|)}
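As an illustration of the two rounding variants, the following sketch quantizes onto a sorted list of levels. The concrete rounding rules are assumptions made for this example: the deterministic variant rounds to the nearest level, and the stochastic variant picks one of the two neighbouring levels with a probability proportional to proximity, so that the result equals the input in expectation.

```python
import bisect
import random
from typing import List, Sequence


def ternary_levels(weights: Sequence[float]) -> List[float]:
    """Levels {-max|W|, 0, max|W|}."""
    m = max(abs(w) for w in weights)
    return [-m, 0.0, m]


def quantize_deterministic(weights: Sequence[float], levels: Sequence[float]) -> List[float]:
    """Round every value to its nearest level (assumed deterministic rule)."""
    return [min(levels, key=lambda q: abs(q - w)) for w in weights]


def quantize_stochastic(
    weights: Sequence[float], levels: Sequence[float], rng: random.Random
) -> List[float]:
    """Round to a neighbouring level at random so that the expectation equals the input.
    `levels` must be sorted in increasing order."""
    out = []
    for w in weights:
        hi = bisect.bisect_left(levels, w)  # index of the first level >= w
        if hi == 0:                         # at or below the lowest level
            out.append(levels[0])
        elif hi == len(levels):             # above the highest level
            out.append(levels[-1])
        else:
            lo_q, hi_q = levels[hi - 1], levels[hi]
            p_up = (w - lo_q) / (hi_q - lo_q)
            out.append(hi_q if rng.random() < p_up else lo_q)
    return out


weights = [0.9, -0.2, 0.4, -1.0]
levels = ternary_levels(weights)
print(quantize_deterministic(weights, levels))
print(quantize_stochastic(weights, levels, random.Random(0)))
```

The deterministic rule minimizes the error of each individual value, whereas the stochastic rule trades a larger per-value error for an unbiased estimate.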
In sparsification, compression is achieved by limiting the number of non-zero elements used to represent the weight-update. Sparsification can be viewed as a special case of quantization, in which one quantile is zero and many values fall into that quantile. Possible sparsification schemes include
To communicate a set of sparse binary weight-updates produced by SBC, we only need to transfer the positions of the non-zero elements, along with either the respective positive or negative mean. Instead of communicating the absolute non-zero positions, it is favorable to only communicate the distances between them. Under the assumption that the sparsity pattern is random for every weight-update, it is easy to show that these distances are geometrically distributed with success probability p equal to the sparsity rate. Geometrically distributed sequences can be optimally encoded using the Golomb code (this last lossless compression step can also be applied in the Deep Gradient Compression and Smart Gradient Compression schemes).
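A minimal sketch of the distance coding is given below. For simplicity it uses the Rice variant of the Golomb code, i.e. a Golomb code whose parameter is a power of two, 2^b; the choice of b and all function names are assumptions of this example rather than part of the described scheme.

```python
from typing import List


def position_gaps(positions: List[int]) -> List[int]:
    """Distances between consecutive non-zero positions (the first gap counts from index -1)."""
    gaps, previous = [], -1
    for pos in positions:
        gaps.append(pos - previous)
        previous = pos
    return gaps


def positions_from_gaps(gaps: List[int]) -> List[int]:
    """Inverse of position_gaps."""
    positions, current = [], -1
    for gap in gaps:
        current += gap
        positions.append(current)
    return positions


def rice_encode(gaps: List[int], b: int) -> str:
    """Golomb code with parameter 2**b: unary quotient, then b-bit binary remainder."""
    bits = []
    for gap in gaps:
        q, r = divmod(gap - 1, 1 << b)  # gaps are >= 1, so code gap - 1
        bits.append("1" * q + "0")      # quotient in unary, terminated by a 0
        if b:
            bits.append(format(r, f"0{b}b"))
    return "".join(bits)


def rice_decode(bitstring: str, b: int) -> List[int]:
    """Inverse of rice_encode."""
    gaps, i = [], 0
    while i < len(bitstring):
        q = 0
        while bitstring[i] == "1":
            q += 1
            i += 1
        i += 1                          # skip the terminating 0
        r = int(bitstring[i:i + b], 2) if b else 0
        i += b
        gaps.append(q * (1 << b) + r + 1)
    return gaps


gaps = position_gaps([2, 3, 10, 17])
code = rice_encode(gaps, b=2)
print(gaps, code, rice_decode(code, b=2))
```

Short gaps cost b+1 bits each, and each further multiple of 2^b adds one bit, which matches the geometric decay of the gap probabilities.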
The different lossy coding schemes are summarized in
As can be seen, the sparse binary compression causes a slightly larger coding loss or coding error than compared to smart gradient compression, but on the other hand, the transmission overhead is reduced, too, owing to the fact that all transmitted coded values 82 are of the same sign or, differently speaking, correspond to the also transmitted mean value in both magnitude and sign. Again, instead of using the mean, another average measure could be used. Let's go back to
The choice of the encoding plays a crucial role in determining the final bit-size of a compressed weight-update. Ideally, we would like to design lossless codec schemes which come as close as possible to the theoretical minimum.
To recall, we will briefly derive the minimal bit-length that is needed in order to losslessly encode an entire array of gradient values. For this, we assume that each element of the gradient matrix is an output of a random vector $\Delta W \in \mathbb{R}^N$, where N is the total number of elements in the gradient matrix (that is, N=mn, where m is the number of rows and n the number of columns). We further assume that each element is sampled from an independent random variable (thus, no correlations between the elements are assumed). The corresponding joint probability distribution is then given by
$P(\Delta W_1 = g_1, \dots, \Delta W_N = g_N) = \prod_{i=1}^{N} P(\Delta W_i = g_i)$ (23)
where $g_i \in \mathbb{R}$ are concrete sample values of the random variables $\Delta W_i$, which belong to the random vector $\Delta W$.
It is well known [13] that if suitable lossless codecs are used, the minimal average bit-length needed to send such a vector is bounded by
NH(ΔWi)<
where
$H(X) = -\sum_j P(x_j) \log_2 P(x_j)$ (25)
denotes the entropy of a random variable X.
Uniform Quantization
If we use uniform quantization with $K = 2^b$ grid points and assume a uniform distribution over these points, we have $P(\Delta W_i = g_i) = 1/K$ and consequently
$H(\Delta W_i) = -\sum_{j=1}^{K} \frac{1}{K} \log_2 \frac{1}{K} = \log_2 K = b$ (26)
That is, b is the minimum number of bits that are sent per element of the gradient vector G.
Deep Gradient Compression
In the DGC training procedure, a certain fraction p∈(0,1) of the gradient elements is set to 0 and only the rest are exchanged in the communication phase. Hence, the probability that a particular number is sent/received is given by
$P(\Delta W_i = g_i) = p$ if $g_i = 0$, and $P(\Delta W_i = g_i) = (1-p)/K$ otherwise,
where we uniformly quantize the non-zero values with $K = 2^b$ bins. The respective entropy is then
$H(\Delta W_i) = -p \log_2(p) - (1-p) \log_2(1-p) + b(1-p)$ (27)
In other words, the minimum average bit-length is determined by the minimum bit-length used to identify whether an element is a zero or a non-zero element (the first two summands), plus the bits used to send the actual value whenever the element was identified as a non-zero value (the last summand).
Smart Gradient Compression
In our framework we further reduce the entropy by reducing the number of non-zero weight values to one. That is, $K = 2^0 = 1$. Hence, we only have to send the positions of the non-zero elements. Therefore, our theoretical bound is lower than (27) and given by
$H(\Delta W_i) = -p \log_2(p) - (1-p) \log_2(1-p)$ (28)
In practice, the receiver does not know the value, so we would have to send it too, which induces an additional, often negligible cost of b bits.
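The three per-element bounds (26), (27) and (28) can be compared numerically. The sketch below evaluates them for an illustrative, assumed setting in which a fraction p=0.999 of the elements is zero and b=32 bits are used per non-zero value; the function names are introduced only for this example.

```python
import math


def binary_entropy(p: float) -> float:
    """Bits needed on average to signal zero versus non-zero, for 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bits_uniform(b: int) -> float:
    """Per-element bound (26) for uniform quantization with 2**b grid points."""
    return float(b)


def bits_dgc(p: float, b: int) -> float:
    """Per-element bound (27): zero/non-zero flag plus b bits for each non-zero value."""
    return binary_entropy(p) + b * (1 - p)


def bits_sgc(p: float) -> float:
    """Per-element bound (28): only the zero/non-zero information is sent."""
    return binary_entropy(p)


p, b = 0.999, 32  # assumed example values
print(bits_uniform(b), bits_dgc(p, b), bits_sgc(p))
```

For these values the bounds come out at 32 bits, roughly 0.043 bits and roughly 0.011 bits per element, respectively, which illustrates how much of the remaining cost in (27) is due to the values rather than the positions.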
We have just described how we can model the gradient values of the neural network as being a particular outcome of an N-long independent random process. In addition, we also described the models of the probability distributions when different quantization methods are used in the communication phase of the training. Now it remains to design low-redundancy lossless codecs (low-redundancy in the sense that their average bit-length per element is close to the theoretical lower bound (24)). Efficient codecs for these cases have been well studied in the literature [13]. In particular, binary arithmetic coding techniques have been shown to be particularly efficient and are widely used in the fields of image and video coding. Hence, once we have selected a probability model, we may code the gradient values using these techniques.
We can further reduce the cost of sending/receiving the gradient matrix ΔW by making use of predictive coding methods. To recall, in the sparse communication setting we specify a percentage of gradients with the highest absolute values and send only those (at both the server and the client side). Then, the gradients that have been sent are set back to 0 and the others are accumulated locally. This means that we can make some estimates regarding the probability that a particular element is going to be sent at the next iteration (or the next τ iterations), and consequently reduce the communication cost.
Let ρi(g|μi(t),σi(t),t) be the probability density function of the absolute value of the gradients of the i-th element at time t, where μi(t) and σi(t) are the mean and variance of the distribution. Then, the probability that the i-th element will be updated is given by the cumulative probability distribution
$P(i=1 \mid t) = \int_{\varepsilon}^{\infty} \rho_i(g \mid \mu_i(t), \sigma_i(t), t)\, dg$ (29)
where ε is selected such that P(i=1|t)>0.5 for a particular percentage of elements. A sketch of this model is depicted in
Now we can easily imagine that different elements have different gradient probability distributions (even if we assume that all have the same type, they might have different means and variances), leading to them having different update rates. This is actually supported by experimental evidence, as can be seen in
Hence, a more suitable probability model of the update frequency of the gradients would be to assign a particular probability rate pi to each element (or to a group of elements). We could estimate the element specific update rates pi by keeping track of the update frequency over a period of time and calculate it according to these observations.
However, the above simple model makes the naive assumption that the probability density functions do not change over time. We know that this is not true for two reasons. One, the mean of the gradients tends to 0 as training time grows (and experiments have shown that with the SGD optimizer the variances grow over time). And two, as mentioned before, we accumulate the gradient values of those elements that have not been updated. Thus, we get an increasing sum of random variables over time. Hence, the probability density function at time t*+τ (where t* is the time of the last update and τ is the time elapsed since then) corresponds to the convolution of all probability density functions in the time span t*→t*+τ. If we further assume that the random variables are independent along the time axis, we then know that the mean and variance of the resulting probability density function correspond to the sums of the individual means and variances:
$\mathbb{E}[\rho_i(g \mid t^*+\tau)] = \sum_{t=t^*}^{t^*+\tau} \mu_i(t)$
$\mathrm{var}[\rho_i(g \mid t^*+\tau)] = \sum_{t=t^*}^{t^*+\tau} \sigma_i(t)$
Consequently, as long as one of those sums does not converge as τ→∞, it is guaranteed that the probability of an element being updated in the next iteration round tends to 1 (that is, P(i=1|t*+τ)→1 as τ→∞).
However, modeling the real time-dependent update rate can be too complex. Therefore, we may model it via simpler distributions. For example, we might assume that the probability of encountering τ consecutive zeros follows the geometric distribution $(1-p_i)^{\tau}$, where $p_i$ indicates the update rate of element i in the stationary mode. But other models in which the probability increases over time might be assumed as well (e.g. P(i=1|τ, ai,bi)=1−aieb
Furthermore, we can use adaptive coding techniques in order to estimate the probability parameters in an online fashion. That is, we use the information about the updates at each communication round in order to fine-tune the parameters of the assumed probability. For example, if we model the update rate of the gradients as a stationary (not time-dependent) Bernoulli distribution $P(i=1) = p_i$, then the values $p_i$ can be learned in an online fashion by taking the sample mean (that is, if $x_t \in \{0,1\}$ is the particular outcome at time (or cycle) t and $p_{i,t}$ is the mean of the preceding t−1 outcomes, then $p_{i,t+1} = (x_t + (t-1)\,p_{i,t})/t$).
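The online estimation of a stationary update rate can be sketched as a running sample mean that the sender and the receiver maintain independently from the same observed outcomes. The class below is an illustrative assumption (including its name and the initial value of 0.5 used before any observation), not a prescribed implementation.

```python
class OnlineUpdateRate:
    """Running sample mean of binary outcomes x_t (1 = the element was updated)."""

    def __init__(self, initial: float = 0.5):
        self.p = initial  # hypothetical estimate used before the first observation
        self.count = 0

    def observe(self, x: int) -> float:
        """Fold one outcome into the estimate and return the new estimate."""
        self.count += 1
        self.p += (x - self.p) / self.count  # incremental form of the sample mean
        return self.p


# Sender and receiver see the same decoded outcomes, so their estimates stay in step
# without any probability parameter ever being transmitted.
outcomes = [1, 0, 0, 1]
sender, receiver = OnlineUpdateRate(), OnlineUpdateRate()
for x in outcomes:
    sender.observe(x)
    receiver.observe(x)
print(sender.p, receiver.p)
```

Because both sides apply the identical rule to identical data, the entropy coder on each side can use the current estimate as its probability model for the next cycle.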
The advantage of these methods is that the parameter estimation occurs at the sender and receiver side simultaneously, resulting in no communication overhead. However, this comes at the cost of increasing the complexity of the encoder and decoder (for more complex models the online parameter update rule can be fairly complex). Therefore, the trade-off between model complexity and communication cost has to be chosen depending on the situation.
E.g., in the distributed setting where the number of communication rounds is high and the communication latency ought to be minimal, simple models like the static rate frequency model $p_i$ or the geometric distribution $(1-p_i)^{\tau}$ for predictive coding might be a good choice (perhaps any distribution belonging to the exponential family, since online update rules for their parameters are simple and well known). On the other hand, we may be able to increase the complexity of the model (and with it the compression gains) in the federated learning scenario, since it is assumed that the computational resources are high in comparison to the communication costs.
The above idea can be generalized to non-smart gradient matrices $G \in \mathbb{R}^{m \times n}$. Again, we think of each element $G_i \sim g_i$, $i \in \{1, \dots, N\}$ with $N = m \times n$, of the matrix G as a random variable that outputs real-valued gradients $g_i$. In our case, we are only interested in matrices whose elements can only output values from a finite set $g_i \in \Omega := \{\omega_0 = 0, \omega_1, \dots, \omega_{S-1}\}$. Each element $\omega_k$ of the set $\Omega$ has a probability mass value $p_k \in \mathcal{P} := \{p_0, \dots, p_{S-1}\}$ assigned to it. We encounter these cases when we use other forms of quantization for the gradients, such as uniform quantization schemes.
We further assume that the sender and receiver share the same sets $\Omega$. They either agreed on the set of values before training started, or new tables might be sent during training (the latter should only be applied if the cost of updating the set is negligible compared to the cost of sending the gradients). Each element of the matrix might have an independent set $\Omega_i$, or a group (or all) of the elements might share the same set of values.
As for the probabilities (that is, the probability mass function $\mathcal{P}_i$ of the set $\Omega_i$, which depends on element i), we can analogously model them and apply adaptive coding techniques in order to update the model parameters in accordance with the gradient data sent/received during training. For example, we might model a stationary (not time-dependent) probability mass distribution $\mathcal{P}_i = \{p_0^i, \dots, p_{S-1}^i\}$ for each i-th element in the network, where we update the values $p_k^i$ according to their frequency of appearance during training. Naturally, the resulting codec will then depend on the values $\mathcal{P}_i$.
Furthermore, we might as well model a time dependence of the probabilities $p_k^i(t)$. Let $f_k^i(t) \in (0,1)$ be a monotonically decreasing function. Also, let $t^*_{k,i}$ be the time step at which the i-th gradient changed its value to $\omega_k$, and τ the time elapsed after that point. Then, we can write
$p_k^i(t^*_{k,i} + \tau) = f_k^i(\tau)$
That is, the probability that the same value will be chosen at τ consecutive time steps decreases, consequently progressively increasing the probability of the other values over time.
Now we have to find suitable models for each function $f_k^i(t)$, where we have to trade off codec complexity against compression gain. For example, we might model the retention time of each value k with a geometric distribution, that is, $p_k^i(t^*_{k,i} + \tau) = (p_k^i)^{\tau}$, and take advantage of adaptive coding techniques in order to estimate the parameters $p_k^i$ during training.
Experimental results are depicted in
Now, after having described certain embodiments with respect to the preceding figures, some broadening embodiments shall be described. For example, in accordance with an embodiment, federated learning of a neural network 16 is done using the coding loss aware upload of the clients' parameterization updates. The general procedure might be as depicted in
Another embodiment results from the above description in the following manner. Although the above description primarily concerned federated learning, irrespective of the exact type of distributed learning, advantages may be achieved by applying the coding loss aware parameterization update transmission 56 and the downlink step 32. Here, the coding loss accumulation and awareness is performed on the side of the server rather than the client. It should be noted that the achievable reduction in amount of downloaded parameterization update information is considerable by applying the coding loss awareness as offered by procedure 56 into the download direction of a distributed learning scenario, whereas the convergence speed is substantially maintained. Thus, while in
Another embodiment which may be derived from the above-description by taking advantage of the advantageous nature of the respective concept independent from the other details set out in the above embodiments pertains to the way the lossy coding of consecutive parameterization updates may be performed with respect to a quantization and sparsification of the lossy coding. In
Module 130, thus, forms an apparatus for lossy coding consecutive parameterization updates. The sequence of parameterization updates is illustrated in
Apparatus 130 starts its operation by determining a first set of update values and a second set of update values, namely sets 104 and 106. The first set 104 may be a set of highest update values 138 in the current parameterization update 136, while set 106 may be a set of lowest update values. In other words, when the update values 138 are ordered by value, set 104 may form the continuous run of highest values 138 in the resulting ordered sequence, while set 106 may form a continuous run at the opposite end of the sequence of values, namely the lowest update values 138. The determination may be done in a manner so that both sets coincide in cardinality, i.e., they have the same number of update values 138 therein. The predetermined cardinality may be fixed or set by default, or may be determined by module 130 in a manner and on the basis of information also available to the decoder 132. For instance, the number may explicitly be transmitted. A selection 140 is performed among sets 104 and 106 by averaging, separately, the update values 138 in both sets 104 and 106, comparing the magnitudes of both averages, and finally selecting the set whose average is larger in absolute value. As indicated above, the mean, such as the arithmetic mean or some other mean value, may be used as the average measure, or some other measure such as the mode or median. In particular, module 130 then codes 142, as information on the current parameterization update 136, the average value 144 of the selected set, along with identification information 146 which identifies, or locates, the coded set of parameters 26 of the parameterization 18 whose corresponding update values 138 in the current parameterization update 136 are included in the selected set.
As already described above, it is merely a minor impact on convergence speed, that per parameterization update of the sequence 134, merely one of sets 104 and 106 is actually coded, while the other is left uncoded, because along the sequence of cycles, the selection toggles, depending on the training outcomes in the consecutive cycles—between the set 104 of highest update values and the set 106 of lowest update values. On the other hand, signaling overhead for the transmission is reduced owing to the fact that it is not necessary to code information on the signed relationship between each coded update value and the average value 144.
The decoder 132 decodes the identification information 146 and the average value 144 and sets the largest set of update values indicated by the identification information 146, i.e., the largest set, to be equal in sign and magnitude to the average value 144, while the other update values are set to be a predetermined value such as zero.
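The encoder and decoder behaviour just described can be sketched as follows. The sketch assumes the arithmetic mean as the average measure, a set cardinality k with 2k not exceeding the number of update values, zero as the predetermined value for uncoded positions, and a tie-break in favour of the highest set; the function names are hypothetical.

```python
from typing import List, Tuple


def sbc_encode(update: List[float], k: int) -> Tuple[float, List[int]]:
    """Return (average value 144, identification information 146 as sorted positions)."""
    order = sorted(range(len(update)), key=lambda j: update[j])
    lowest, highest = order[:k], order[-k:]          # sets 106 and 104, equal cardinality
    mean_low = sum(update[j] for j in lowest) / k
    mean_high = sum(update[j] for j in highest) / k
    if abs(mean_high) >= abs(mean_low):              # selection 140 (tie-break is assumed)
        return mean_high, sorted(highest)
    return mean_low, sorted(lowest)


def sbc_decode(mean: float, positions: List[int], size: int) -> List[float]:
    """Set the identified positions to the average; all others to zero."""
    decoded = [0.0] * size
    for j in positions:
        decoded[j] = mean
    return decoded


update = [0.1, -0.5, 2.0, 1.0, -3.0, 0.0]
mean, positions = sbc_encode(update, k=2)
print(mean, positions, sbc_decode(mean, positions, len(update)))
```

In this example the two lowest values have the larger average magnitude, so only their positions and their common (negative) mean are coded; no per-value sign information is needed.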
As illustrated in
A modification of the embodiment of
In the description of
Thus, apparatus 150 represents an apparatus for coding consecutive parametrization updates 134 of a neural network's 16 parameterization 18 for distributed learning and is configured, to this end, to lossy code the consecutive parameterization updates using entropy coding using probability distribution estimates. To be more precise, the apparatus 150 firstly subjects the current parameterization update 136 to a lossy coding 154 which may be, but is not necessarily implemented as described with respect to
At the decoding side, the apparatus for decoding the consecutive parameterization updates does the reverse, i.e., it entropy decodes 164 the information 146 and 164 using probability estimates which a probability estimator 162′ determines from preceding coded versions 148 of preceding parameterization updates in exactly the same manner as the probability distribution estimator 162 at the encoder side did.
Thus, as noted above, the four aspects specifically described herein may be combined in pairs, triplets or all of them, thereby improving the efficiency in distributed learning in the manner outlined above.
Summarizing, above embodiments enable to achieve improvements in Distributed Deep Learning (DDL) which has gotten a lot of attention in the last couple of years as it is the core concept underlying both privacy-preserving deep learning and the latest successes in speeding up neural network training via increased data-parallelism. The relevance of DDL is very likely going to increase even further in the future as more and more distributed devices are expected to be able to train Deep Neural Networks, due to advances both in hardware and software. In almost all applications of DDL the communication-cost between the individual computation nodes is a limiting factor for the performance of the whole system. As a result of this, a lot of research has gone into trying to reduce the amount of communication used between the nodes via lossy compression schemes. The embodiments described herein may be used in such framework for DDL and may extend past approaches in a manner so as to improve the communication-efficiency in distributed training. Compression at both up- and download was involved and efficient encoding and decoding of the compressed data has been featured.
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
The inventive codings of parameterization updates can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods may be performed by any hardware apparatus.
The apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
The apparatus described herein, or any components of the apparatus described herein, may be implemented at least partially in hardware and/or in software.
The methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
The methods described herein, or any components of the apparatus described herein, may be performed at least partially by hardware and/or by software.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which will be apparent to others skilled in the art and which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
18173020.1 | May 2018 | EP | regional |
This application is a continuation of copending International Application No. PCT/EP2019/062683, filed May 16, 2019, which is incorporated herein by reference in its entirety, and additionally claims priority from European Application No. EP18173020.1, filed May 17, 2018, which is also incorporated herein by reference in its entirety.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/EP2019/062683 | May 2019 | US |
Child | 17096887 | | US |