Embodiments of the present invention generally relate to machine learning and distilled datasets. More particularly, at least some embodiments of the invention relate to systems, hardware, software, computer-readable media, and methods for generating distilled datasets for training machine learning models.
Performing machine learning in a distributed and/or federated manner can be complicated. For example, performing machine learning tasks at the edge requires handling massive amounts of data and dealing with thousands, if not millions, of nodes that each train models and generate inferences. Self-driving vehicles are one example: each node handles its own data stream, which is large in terms of both duration and dimensionality.
There are situations where it may be desirable to deploy a common machine learning model to all (or at least a subgroup of) nodes such that the models can be fine-tuned at each of the nodes. Because the same model is deployed to a large number of nodes, it is advantageous to ensure that the models are aligned and up to date with respect to the data at each of the nodes. This helps ensure that the model, once the learning has been federated, can generalize to future data and to new nodes to which the model may be deployed.
To enable the ability to train and deploy machine learning models to edge nodes in a dynamic manner while preserving privacy, it may be possible to rely on a distilled dataset. A distilled dataset can reduce transfer requirements between the edge and the cloud, preserve privacy (data is not shared between nodes) and reduce storage requirements at the nodes. However, generating a distilled dataset that is sufficient for federated learning requirements remains a difficult task.
In order to describe the manner in which at least some of the advantages and features of the invention may be obtained, a more particular description of embodiments of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, embodiments of the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings, in which:
Embodiments of the present invention generally relate to federated machine learning, federated distilled datasets, and training machine learning models. More particularly, at least some embodiments of the invention relate to systems, hardware, software, computer-readable media, and methods for robustly aggregated federated dataset distillation and federated machine learning training.
In general, example embodiments of the invention relate to providing a distilled dataset for training machine learning models in edge environments. Embodiments of the invention further relate to obtaining or generating a distilled dataset {tilde over (x)} that is generalized while ensuring coherence (e.g., with respect to drift, malicious attacks, or other deviations) across data from multiple edge nodes.
Embodiments of the invention further relate to robust aggregation in the context of generating distilled datasets and machine learning model training. Robust aggregation, for example, may determine whether a particular set of updates (e.g., gradients) should be kept or discarded in order to ensure and/or preserve the coherence of the federated training.
The efficiency of deploying machine learning models, particularly to dynamically added nodes, is affected by the amount of data transferred to the nodes, the amount of data required to be stored for training, and the time required for training. Embodiments of the invention save costs and communication overhead by reducing pre-deployment training and fine-tuning training at the edge nodes. Embodiments of the invention also enable fine-tuning in low compute nodes. To achieve these goals and efficiencies, embodiments of the invention generate and/or manage a distilled dataset.
A distilled dataset is a dataset that can be used to train machine learning models. A distilled dataset is often synthetic in nature in the sense that the initial data may be synthetic or random. Once generated, however, a distilled training dataset is substantially smaller than a conventional training dataset and is distinct from a small sample of a training dataset. A distilled dataset helps ensure that the ability of machine learning models to generate sufficiently accurate inferences generalizes. The distilled dataset should be representative of the data from as many nodes as possible.
However, there is no guarantee of homogeneity in the distributions of data collected at the edge nodes and adding nodes to a federated environment increases the chance of acquiring data that has a different distribution. Embodiments of the invention generate a generalized distilled dataset that can adapt to different data distributions. Embodiments of the invention further ensure that malicious updates to the distilled dataset and/or the model are avoided or removed by robustly aggregating the updates to the distilled dataset and/or updates to the machine learning model.
The node 106 includes an edge model 108 that is a copy of the central model 104. The node 110 may include an edge model 112 that is a copy of the central model 104. The models 108 and 112, once trained, may operate using, respectively, local datasets 116 and 118. More specifically, the edge model 108 may generate inferences or predictions using the dataset 116 or data associated with the node 106 specifically. The edge model 112 may be fine-tuned during training using the dataset 118, which includes data associated with the node 110 specifically. The central node 102 may use a dataset 120, which may represent data from all of the nodes in the system 100.
Conventionally, the central model 104 may be trained using a training dataset such as the dataset 120. The central model 104, once trained, is distributed to the nodes 106 and 110 as the edge models 108 and 112. Additional training or fine-tuning may be performed at the nodes 106 and 110. The central node 102 may receive updates from the edge nodes 106 and 110. The model 104 is updated using the updates from the nodes 106 and 110. The global update is then distributed back to the nodes 106 and 110. This process may continue until convergence is achieved.
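By way of illustration only, the conventional federated cycle described above may be sketched as follows. This is a minimal FedAvg-style sketch in which a model is represented as a flat list of weights and each node's contribution is a simple weight delta; these representations are assumptions for illustration and not part of any claimed embodiment.

```python
def aggregate_updates(node_updates):
    """Elementwise average of the weight updates reported by the nodes."""
    return [sum(column) / len(column) for column in zip(*node_updates)]


def federated_round(global_weights, node_updates):
    """One round: fold the averaged node updates into the central model,
    which would then be redistributed to the nodes."""
    averaged = aggregate_updates(node_updates)
    return [w + u for w, u in zip(global_weights, averaged)]
```

For example, `federated_round([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]])` yields `[2.0, 3.0]`; repeating such rounds until the updates become small corresponds to the convergence condition described above.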
Embodiments of the invention also perform aspects of federated learning using a distilled dataset 114. Further embodiments of the invention may perform aspects of federated learning at the central node 102 and/or at the edge nodes 106 and 110.
Generally, dataset distillation is a process of generating a synthetic dataset that can be used to train a machine learning model. The distilled dataset 114 is smaller in size compared to a conventional training dataset. Embodiments of the invention relate to generating the distilled dataset 114 such that the distilled dataset 114 is generalized to the nodes in the system 100 and such that the distilled dataset 114 can be deployed to new nodes for training purposes. A model trained on the distilled dataset may have an accuracy comparable to that of a model trained with a conventional dataset, such that the newly trained model is suitable for production purposes.
More specifically, a model is initially optimized using the synthetic dataset. After the model is optimized using the synthetic dataset, a loss is calculated or determined with respect to known real data (e.g., by running real data through the model optimized on the synthetic dataset). Further optimization is performed with respect to the synthetic dataset based on the calculated loss in order to generate the distilled dataset. Many models may be sampled in order to obtain a distilled dataset that is robust to in-distribution changes across a family of models.
The pseudocode 200 describes a method for generating a distilled dataset. The elements 202, 204, 206, and 208 of the method are also referenced in
An example method includes creating or initializing 202 a distilled dataset and an initial learning rate. After the distilled dataset and learning rate are initialized 202 or created, the model is optimized 204 for T distillation rounds. Next, the loss is evaluated or determined 206. Stated differently, a gradient computation is performed. Finally, the gradients are optimized 208. This includes updating the distilled dataset and the learning rate. After the T distillation rounds, the optimized distilled dataset and learning rate are returned.
As shown at 310, generating the federated distilled dataset includes aspects that are performed at different locations. In this specific example, the model optimization 304 is performed centrally (e.g., at the central node), the loss computation 306 is performed at the edge (e.g., at the edge nodes), and the distillation optimization 308 is performed centrally at the central node. Embodiments of the invention may perform these aspects at different locations. Thus, the model optimization, the loss computation, and the distillation optimization can each be performed centrally at the central node or at the edge nodes. Multiple combinations are possible.
Rather than provide a bootstrap model to the node 416 and adapt the bootstrap model using locally available data, embodiments of the invention may use a distilled dataset 422, to which multiple nodes may have contributed. The distilled dataset 422 allows a model 420 deployed to the node 416 to be trained with the distilled dataset 422. Because multiple nodes, each associated with data having a distribution (Di), contributed to the generation of the distilled dataset 422, the model 420 achieves a similar level of accuracy when deployed to the node 416, and the distilled dataset 422 is orders of magnitude smaller than the original training dataset. This reduces transmission requirements and saves time when training the model 420.
Embodiments of the invention further extend the generation of a federated distilled dataset to include robust aggregation.
The pseudocode 500, however, includes additional orchestration to determine which nodes receive which models for evaluation. Embodiments of the invention do not require a one-to-one correspondence of nodes to models. Thus, the same iteration of a model may be provided to multiple nodes. The pseudocode 500 also illustrates communication parameters such that less data traverses the network between the edge nodes and the central node. The pseudocode 500 further provides a robust aggregation element at 508 to determine which loss evaluations are considered and/or used in the gradient computation. Embodiments of the invention discard or exclude loss evaluations that are outliers, malicious, or potentially malicious from the loss evaluation and gradient computation.
Next, model optimization 618 is performed by the central node 602. In this example, model optimization 618 may include obtaining sample weights θ1 from p(θ) and determining an initial model. The model may be selected from a set of models and may be indexed by i.
Model optimization 618 may then perform a training iteration of E epochs, which adjusts the weights θ1i to obtain θEi. This model optimization 618 is performed with respect to the randomly initialized distilled dataset 612 {tilde over (x)} and does not require real data. The resulting weights (θEi) may be communicated to the edge nodes for additional training. This may be performed for multiple models. The resulting model weights after optimizing multiple models, after E epochs, may be represented by θEa, θEb, . . . θEZ. Next, the trained model weight samples (or more generally the trained models) are communicated to the edge nodes for loss evaluation. Using multiple nodes, a trained model weight sample may be sent to a set of nodes. In embodiments of the invention as previously stated, multiple nodes may each receive one or more model weight samples and some of the nodes may receive some of the same model weight samples.
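By way of illustration only, the sampling of initial weights from p(θ), the E epochs of training on the distilled data, and the sending of trained model weight samples to sets of nodes may be sketched as follows for the same kind of toy one-parameter linear model; the Gaussian prior on θ1 and the random assignment policy are assumptions for illustration.

```python
import random


def optimize_models(x_s, y_s, eta, n_models=3, epochs=20, seed=0):
    """Sample initial weights theta_1 from p(theta) (here, a unit Gaussian)
    and train each sampled model for E epochs on the distilled pair only,
    yielding theta_E for each model index i."""
    rng = random.Random(seed)
    trained = {}
    for i in range(n_models):
        w = rng.gauss(0.0, 1.0)  # theta_1 ~ p(theta)
        for _ in range(epochs):  # E epochs; no real data required
            w -= eta * 2.0 * (w * x_s - y_s) * x_s
        trained[i] = w  # theta_E for model i
    return trained


def assign_to_nodes(model_ids, node_ids, seed=0):
    """Send trained model weight samples to nodes; the same sample may be
    sent to several nodes (no one-to-one correspondence is required)."""
    rng = random.Random(seed)
    return {node: rng.choice(model_ids) for node in node_ids}
```

Because the assignment draws with replacement, more nodes than models simply means some nodes receive the same model weight sample, consistent with the orchestration described above.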
In this example, the nodes 704 and 712 (nodes 0 and 2 of nodes 0, 1, 2, . . . j . . . ) receive the model sample weights 720 and 724, which are both θEa for a model a. The node 708 (node 1) receives the sample weight 722 θEb and the node 716 (node j) receives the sample weight 726 θEi.
Because the sample weights 720 and 724 are the same in this example,
The process for determining which nodes receive which models may be domain-dependent and may require the central node 702 to track characteristics of the nodes. Example characteristics may include, by way of example, the expected response times by the nodes, their resource availability, and/or their data availability.
In one example, this operation may be performed multiple times and each node may receive multiple versions of the model. As a result, the contents of the table 730 are amended to track this information.
When the trained model weight samples are received, each of the nodes performs a loss evaluation. More specifically, each node performs an assessment of the optimized model against real data: xt⊂Xi. The loss may be defined or determined using a loss function of the form Lji=ℓ(xt, θEi), where ℓ is a loss function, xt is a batch of real data at node j, and θEi are the optimized weights of model i.
The loss 732 is communicated back to the central node 702 and the table 730 is updated. The table 730 thus stores loss values from each of the nodes. The table 730 illustrates, for example, the model a at the node 0 (node 704) had a loss of L0a.
At this stage, in one embodiment, the loss values are not immediately used to perform the gradient computation (e.g., to obtain the gradients (∇{tilde over (x)}, ∇{tilde over (η)}) for updating, respectively, the distilled data and the learning rate). Rather, the loss values are stored in the table 730 and may be processed at a later stage during robust aggregation.
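By way of illustration only, the deferred storage of loss values may be sketched as a simple record keyed by (node, model), mirroring the role of the table 730; the class and method names are hypothetical.

```python
from collections import defaultdict


class LossTable:
    """Central-node record of loss evaluations, keyed by (node, model),
    mirroring the role of the table 730."""

    def __init__(self):
        self._losses = defaultdict(list)

    def record(self, node_id, model_id, loss):
        """Store a loss value reported by a node for one model sample."""
        self._losses[(node_id, model_id)].append(loss)

    def losses_for_model(self, model_id):
        """All recorded losses for one model, across nodes."""
        return [loss for (node, model), values in self._losses.items()
                if model == model_id for loss in values]

    def losses_for_node(self, node_id):
        """All recorded losses reported by one node, across models."""
        return [loss for (node, model), values in self._losses.items()
                if node == node_id for loss in values]
```

Storing every node-model loss evaluation in one place allows the later robust-aggregation stage to examine the full distribution of losses before any gradients are computed.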
The loss evaluation may be performed iteratively.
With a large number (e.g., w) of optimized models, and because each node may generate a loss value for multiple models, the number of loss evaluations may be greater than the number of nodes E. In one example, these iterations occur in the same training round t (e.g., in the while loop of the pseudocode 500).
Once the loss values are generated, the table 730 may be used in robust distillation and aggregation. Robust aggregation (e.g., robust aggregation (R) function in element 508 of
Next, for each node, the loss gradients and the learning rate gradients are averaged. For each node i and its assessed models M, the average of the loss gradients is ∇{tilde over (x)}i=(1/|M|)Σm∈M∇{tilde over (x)}m, and the average of the learning rate gradients is ∇{tilde over (η)}i=(1/|M|)Σm∈M∇{tilde over (η)}m. This results in two lists, ∇{tilde over (x)}0, . . . , ∇{tilde over (x)}j and ∇{tilde over (η)}0, . . . , ∇{tilde over (η)}j, with one averaged gradient per node.
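By way of illustration only, the per-node averaging over assessed models may be sketched as follows, with each gradient represented as a flat list of components; the function name and data layout are assumptions for illustration.

```python
def average_per_node(grads_by_node):
    """For each node i, average the gradients reported for its assessed
    models M, yielding one averaged gradient vector per node."""
    return {
        node: [sum(component) / len(component) for component in zip(*model_grads)]
        for node, model_grads in grads_by_node.items()
    }
```

The same helper can be applied once to the loss gradients and once to the learning rate gradients, producing the two per-node lists described above.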
After the method of
In one example, the distilled dataset, by itself, would be sufficient to obtain a generic version of a model. Leveraging local data at the new node may refine or fine-tune the model for that node. Consequently, embodiments of the invention allow a model to be deployed and trained at a new node when no or very little data is available at the node. The model may iteratively be adapted as new data becomes available at the node.
The distilled dataset is distinct from the direct deployment of a generic model trained conventionally. The distilled dataset obtained with robust aggregation is more general and coherent with the nodes in the environment and can provide a suitable starting point for model adaptation in the domain.
In one example, the concept of T rounds is eliminated and robust distillation and aggregation is performed periodically and/or continuously. This ensures that a most recent distilled dataset is available for deployment to a new node. In contrast to deploying a pre-trained model, a new node is able to receive an up-to-date starting model that can be adapted. The starting model, which is subject to periodic and/or continuous robust distillation and aggregation, reflects drift or other recent changes in data at other nodes.
Embodiments of the invention relate to a robust dataset distillation and aggregation system in which data privacy is protected and communication requirements are reduced. Embodiments of the invention can also be implemented in edge nodes that do not have the computation resources for model optimization but can perform loss evaluation. Embodiments of the invention also reduce or eliminate the impact of malicious interference and/or drift at certain nodes. Embodiments of the invention also allow nodes with more resources to provide more loss evaluations for more model parameters.
Initially, the method may perform 802 initializations. This may include selecting or defining an initial learning rate and may include generating a random distilled dataset. Next, for I iterations, the model is optimized 804 beginning with the random distilled dataset. Each round may optimize the model using the distilled dataset generated/updated/improved during the previous round. The optimization is performed, in one example, for multiple models in a set or distribution of models.
After the model is optimized, the model (or the model weights) is transmitted to the nodes. As previously stated, embodiments of the invention can send the same model to multiple nodes, and each node can participate using more than one model. After transmission of the models or the weights, loss evaluations are performed 806 at the nodes. The loss values determined at the nodes are transmitted back to the central node and stored, for example, in a table. The table may store loss values for all node-model loss evaluations that were performed in the method 800.
Next, the losses are aggregated 808 at the central node. This may include generating average loss gradients and average learning rate gradients. The loss evaluations or values are, in addition, robustly aggregated 810 such that outliers, malicious values, or the like are prevented from impacting the generation of the distilled dataset. Gradients that are outliers or that fall outside a standard deviation are excluded from the update. At the end of each round, the distilled dataset and/or learning rate is updated based on the robustly aggregated loss evaluations. After the final round, the distilled dataset is generated 812. The distilled dataset can be deployed with a model to new nodes. The distilled dataset accounts for distributions of data at all of the nodes, and the models can be adapted or individualized at each node based on local data. The distilled dataset allows a model to be trained such that its inferences are likely to be as accurate as if the model were trained on a larger, undistilled dataset. However, the distilled dataset is orders of magnitude smaller, which overcomes transmission/communication problems in a network that includes multiple edge nodes.
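By way of illustration only, robust aggregation by standard-deviation filtering, followed by an update of a distilled parameter, may be sketched as follows for scalar per-node gradients; the function names and the fallback of keeping all gradients when every value would otherwise be excluded are assumptions for illustration.

```python
import statistics


def robust_aggregate(per_node_grads, num_std=1.0):
    """Exclude per-node averaged gradients that lie more than num_std
    standard deviations from the mean (outliers or potentially malicious
    updates), then average the surviving gradients."""
    mu = statistics.mean(per_node_grads)
    sigma = statistics.pstdev(per_node_grads)
    kept = [g for g in per_node_grads if abs(g - mu) <= num_std * sigma]
    kept = kept or per_node_grads  # fallback: never discard everything
    return sum(kept) / len(kept)


def update_distilled(value, per_node_grads, lr=0.1):
    """Apply one robustly aggregated gradient step to a distilled parameter."""
    return value - lr * robust_aggregate(per_node_grads)
```

For example, a malicious or drifted node reporting a gradient of 10.0 alongside honest gradients near 1.0 would be excluded, so the distilled parameter is updated using only the coherent gradients.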
The following is a discussion of aspects of example operating environments for various embodiments of the invention. This discussion is not intended to limit the scope of the invention, or the applicability of the embodiments, in any way.
In general, embodiments of the invention may be implemented in connection with systems, software, and components, that individually and/or collectively implement, and/or cause the implementation of, machine learning operations, machine learning training operations, federated learning operations, distilled dataset generation operations, robust aggregation operations, or the like or combinations thereof. More generally, the scope of the invention embraces any operating environment in which the disclosed concepts may be useful.
New and/or modified data collected and/or generated in connection with some embodiments may be stored in a data protection environment that may take the form of a public or private cloud storage environment, an on-premises storage environment, or a hybrid storage environment that includes public and private elements. Any of these example storage environments may be partly, or completely, virtualized. The storage environment may comprise, or consist of, a datacenter which is operable to service one or more clients or other elements of the operating environment.
Example cloud computing environments, which may or may not be public, include storage environments that may provide data protection functionality for one or more clients. Another example of a cloud computing environment is one in which processing, data protection, and other, services may be performed on behalf of one or more clients. Some example cloud computing environments in connection with which embodiments of the invention may be employed include, but are not limited to, Microsoft Azure, Amazon AWS, Dell EMC Cloud Storage Services, and Google Cloud. More generally however, the scope of the invention is not limited to employment of any particular type or implementation of cloud computing environment.
In addition to the cloud environment, the operating environment may also include one or more clients that are capable of collecting, modifying, and creating, data. As such, a particular client may employ, or otherwise be associated with, one or more instances of each of one or more applications that perform such operations with respect to data. Such clients may comprise physical machines, containers, or virtual machines (VMs).
Particularly, devices in the operating environment may take the form of software, physical machines, containers, or VMs, or any combination of these, though no particular device implementation or configuration is required for any embodiment. Similarly, data protection system components such as databases, storage servers, storage volumes (LUNs), storage disks, replication services, backup servers, restore servers, backup clients, and restore clients, for example, may likewise take the form of software, physical machines or virtual machines (VMs), though no particular component implementation is required for any embodiment.
It is noted that any operation(s) of any of the methods disclosed herein and in the Figures may be performed in response to, as a result of, and/or, based upon, the performance of any preceding operation(s). Correspondingly, performance of one or more operations, for example, may be a predicate or trigger to subsequent performance of one or more additional operations. Thus, for example, the various operations that may make up a method may be linked together or otherwise associated with each other by way of relations such as the examples just noted. Finally, and while it is not required, the individual operations that make up the various example methods disclosed herein are, in some embodiments, performed in the specific sequence recited in those examples. In other embodiments, the individual operations that make up a disclosed method may be performed in a sequence other than the specific sequence recited.
Following are some further example embodiments of the invention. These are presented only by way of example and are not intended to limit the scope of the invention in any way.
Embodiment 1. A method comprising: initializing a distilled dataset at a central node and performing one or more rounds of: optimizing a model using the distilled dataset, communicating model weights of the optimized model to edge nodes, causing each of the edge nodes to generate a loss evaluation value, receiving the loss evaluation values at the central node, robustly aggregating the loss evaluation values into a robust aggregated loss evaluation gradient value, wherein loss evaluation values that are outliers are omitted from the robust aggregated loss evaluation gradient value, and updating the distilled dataset using the robust aggregated loss evaluation gradient value.
Embodiment 2. The method of embodiment 1, further comprising initializing the distilled dataset with random data.
Embodiment 3. The method of embodiment 1 and/or 2, further comprising optimizing multiple models using the distilled dataset.
Embodiment 4. The method of embodiment 1, 2, and/or 3, further comprising communicating model weights of one or more models to each of the edge nodes, wherein some of the nodes receive some of the same model weights for performing the loss evaluation.
Embodiment 5. The method of embodiment 1, 2, 3, and/or 4, further comprising generating the loss evaluation using real data at the edge nodes, wherein each of the edge nodes has different data.
Embodiment 6. The method of embodiment 1, 2, 3, 4, and/or 5, wherein robustly aggregating the loss evaluation values comprises generating an average loss evaluation for each model based on the loss evaluations for each model.
Embodiment 7. The method of embodiment 1, 2, 3, 4, 5, and/or 6, further comprising excluding loss evaluations from the robust aggregated loss evaluation gradient value that are outliers from a distribution of the loss evaluations.
Embodiment 8. The method of embodiment 1, 2, 3, 4, 5, 6, and/or 7, further comprising excluding loss evaluations from the robust aggregated loss evaluation gradient value that are more than one standard deviation from a mean loss value.
Embodiment 9. The method of embodiment 1, 2, 3, 4, 5, 6, 7, and/or 8, further comprising defining an initial learning rate and updating the learning rate in each round.
Embodiment 10. The method of embodiment 1, 2, 3, 4, 5, 6, 7, 8, and/or 9, further comprising storing the loss evaluations in a table such that the robust aggregation is performed for each round.
Embodiment 11. A system, comprising hardware and/or software, operable to perform any of the operations, methods, or processes, or any portion of any of these, or any combination thereof disclosed herein.
Embodiment 12. A non-transitory storage medium having stored therein instructions that are executable by one or more hardware processors to perform operations comprising the operations of any one or more of embodiments 1-11.
The embodiments disclosed herein may include the use of a special purpose or general-purpose computer including various computer hardware or software modules, as discussed in greater detail below. A computer may include a processor and computer storage media carrying instructions that, when executed by the processor and/or caused to be executed by the processor, perform any one or more of the methods disclosed herein, or any part(s) of any method disclosed.
As indicated above, embodiments within the scope of the present invention also include computer storage media, which are physical media for carrying or having computer-executable instructions or data structures stored thereon. Such computer storage media may be any available physical media that may be accessed by a general purpose or special purpose computer.
By way of example, and not limitation, such computer storage media may comprise hardware storage such as solid state disk/device (SSD), RAM, ROM, EEPROM, CD-ROM, flash memory, phase-change memory (“PCM”), or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other hardware storage devices which may be used to store program code in the form of computer-executable instructions or data structures, which may be accessed and executed by a general-purpose or special-purpose computer system to implement the disclosed functionality of the invention. Combinations of the above should also be included within the scope of computer storage media. Such media are also examples of non-transitory storage media, and non-transitory storage media also embraces cloud-based storage systems and structures, although the scope of the invention is not limited to these examples of non-transitory storage media.
Computer-executable instructions comprise, for example, instructions and data which, when executed, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. As such, some embodiments of the invention may be downloadable to one or more systems or devices, for example, from a website, mesh topology, or other source. As well, the scope of the invention embraces any hardware system or device that comprises an instance of an application that comprises the disclosed executable instructions.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts disclosed herein are disclosed as example forms of implementing the claims.
As used herein, the term module, component, engine, agent, client or the like may refer to software objects or routines that execute on the computing system. The different components, modules, engines, and services described herein may be implemented as objects or processes that execute on the computing system, for example, as separate threads. While the system and methods described herein may be implemented in software, implementations in hardware or a combination of software and hardware are also possible and contemplated. In the present disclosure, a ‘computing entity’ may be any computing system as previously defined herein, or any module or combination of modules running on a computing system.
In at least some instances, a hardware processor is provided that is operable to carry out executable instructions for performing a method or process, such as the methods and processes disclosed herein. The hardware processor may or may not comprise an element of other hardware, such as the computing devices and systems disclosed herein.
In terms of computing environments, embodiments of the invention may be performed in client-server environments, whether network or local environments, or in any other suitable environment. Suitable operating environments for at least some embodiments of the invention include cloud computing environments where one or more of a client, server, or other machine may reside and operate in a cloud environment.
With reference briefly now to
In the example of
Such executable instructions may take various forms including, for example, instructions executable to perform any method or portion thereof disclosed herein, and/or executable by/at any of a storage site, whether on-premises at an enterprise, or a cloud computing site, client, datacenter, data protection site including a cloud storage site, or backup server, to perform any of the functions disclosed herein. As well, such instructions may be executable to perform any of the other operations and methods, and any portions thereof, disclosed herein.
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.