The following are incorporated by reference for all purposes as if fully set forth herein:
The disclosure relates generally to a federated cloud learning system and method that has a privacy-preserving machine learning protocol, whereby inferences on data can be transacted or exchanged without ever revealing the data.
The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.
Federated Cloud Learning is a distributed machine learning approach that enables model training on a large corpus of secure data residing in one or more clouds to which the party training the model does not have access. By applying the right balance of privacy and security techniques, it is possible to keep the data secure on the cloud, with minimal leakage of the data itself in the trained model.
The world is becoming increasingly data-driven. Machine learning is driving more automation into businesses, allowing the delivery of new levels of efficiency and products that are tailored to business outcomes and individual customer preferences. This results in dramatically accelerated volumes of data generation.
The global datasphere, defined by International Data Corporation (“IDC”) as the summation of all the world's data, whether it is created, captured, or replicated, is predicted to grow from 33 Zettabytes (ZB) in 2018 to 175 ZB by 2025.
Reliance on cloud services for both enterprises and consumers continues to increase. Companies continue to pursue the cloud for data processing needs, and cloud data centers are quickly becoming the new enterprise data repositories. IDC expects that by 2021, there will be more data stored in the cloud than in traditional data centers.
For example, account and transactional data is one of the most valuable assets of a large bank. The lending and other product data generated across millions of users, both individual and corporate, over decades, and well curated, is a rich knowledge graph of information that is valuable to many players in the finance industry. Access to this data would help a private equity fund or a hedge fund build or enhance investment models.
Yet today, significant amounts of such data remain predominantly inaccessible to derive valuable insights via machine learning due to privacy and security concerns, as well as regulatory limitations, for example under the EU General Data Protection Regulation (GDPR) and similar regulations in other jurisdictions. There are also concerns about the difficulty of moving big data around, de-identifying the data, and structuring the process as a continuous data sale versus a one-time sale, as well as reputational risks. Such concerns exist widely across many industries and are only becoming more pronounced with the advancement of Big Data.
In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the drawings.
The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.
A federated training system is described. The system comprises a plurality of models (e.g., models 1-n in model repository 536), a plurality of training datasets, and a runtime intermediary that matches the models with the training datasets and trains the models on the matched training datasets.
In one implementation, the training datasets are domain-specific (e.g., health, financial, computing, hospitality). In one implementation, the training datasets include raw data, processed data, derived data, and market data. In one implementation, the training datasets are modifiable in real-time, and the training provisioning parameters are responsive to the real-time modifications.
In one implementation, the dataset metadata identifies dataset schema, dataset usage examples, dataset purposes, and dataset ratings. In one implementation, the models are provided by model servers, and the training datasets are provided by dataset servers. The runtime intermediary creates a secure tunnel to receive the models, and the secure tunnel prevents the model servers from accessing the training datasets.
In one implementation, the runtime intermediary returns the trained models to the model servers. In one implementation, the runtime intermediary trains the models on the matched training datasets using a plurality of edge devices, edge devices in the plurality of edge devices including user endpoints and servers, and configured to receive the matched training datasets, the model coefficients, the model hyperparameters, and the privacy parameters to train the models on the matched training datasets in accordance with the model hyperparameters and the privacy parameters to generate a plurality of the gradients with respect to the model coefficients.
In one implementation, the runtime intermediary, upon matching of the models with the training datasets, is further configured to generate a data instrument that specifies transaction updates, including overtime changes to the training acquisition parameters and the training provisioning parameters, memorialization of the training acquisition parameters and the training provisioning parameters that brought about the matching, ownership details of the model servers and the dataset servers, transactional details of the matching, data schema of the training datasets, including input features and precision and recall measures, terms and conditions of the training, including lifetime of the matching, training duration, and privacy specifications, and ratings, including feedback based on prior instances of the matching and third-party opinion on the matching.
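As an illustration only, the elements of such a data instrument might be collected in a record like the following minimal sketch; all field names are hypothetical and merely mirror the elements recited above, not a disclosed schema.

```python
# Illustrative sketch of a data instrument record; every field name here is
# hypothetical and chosen only to mirror the elements listed above.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DataInstrument:
    acquisition_params: Dict[str, str]   # training acquisition parameters that brought about the match
    provisioning_params: Dict[str, str]  # training provisioning parameters that brought about the match
    model_owner: str                     # ownership details of the model servers
    dataset_owner: str                   # ownership details of the dataset servers
    transaction_log: List[str] = field(default_factory=list)   # over-time changes, transactional details
    data_schema: Dict[str, str] = field(default_factory=dict)  # input features, precision/recall measures
    terms: Dict[str, str] = field(default_factory=dict)        # lifetime, training duration, privacy specs
    ratings: List[str] = field(default_factory=list)           # feedback and third-party opinions
```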
In one implementation, the model servers are configured to receive and aggregate the plurality of the gradients, and to update the model coefficients based on the aggregated plurality of the gradients to generate the trained models. In one implementation, trusted third-party servers are configured to receive and aggregate the plurality of the gradients, to update the model coefficients based on the aggregated plurality of the gradients to generate the trained models, and to send the trained models to the model servers. In one implementation, the trusted third-party servers apply a plurality of privacy enhancers on the gradients prior to making the gradients available to the model servers. In one implementation, the model servers are configured to test the trained models on validation sets, and to request the runtime intermediary to further train the trained models based on results of the test.
In one implementation, the request for further training specifies a training duration, and is accompanied with updated model hyperparameters. In one implementation, the runtime intermediary provides a dashboard for configuration of the privacy parameters. In one implementation, the runtime intermediary applies a plurality of privacy enhancers on the gradients prior to making the gradients available to the model servers. In some implementations, privacy enhancers in the plurality of privacy enhancers include differential privacy addition, multi-party computation, and homomorphic encryption.
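As an illustration of the first of these privacy enhancers, the following is a minimal sketch of differentially private gradient release: each gradient is clipped to a maximum norm and perturbed with Gaussian noise before being made available. The function name, clipping norm, and noise multiplier are illustrative assumptions, not parameters prescribed by the disclosure.

```python
# Minimal sketch of one privacy enhancer named above: differentially private
# noise added to gradients before they are released to the model servers.
# The clipping norm and noise multiplier are illustrative defaults only.
import numpy as np

def privatize_gradients(grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Clip each gradient to a maximum L2 norm, then add Gaussian noise."""
    rng = rng or np.random.default_rng()
    private = []
    for g in grads:
        norm = np.linalg.norm(g)
        g = g * min(1.0, clip_norm / (norm + 1e-12))  # bound each update's influence
        g = g + rng.normal(0.0, noise_multiplier * clip_norm, size=g.shape)
        private.append(g)
    return private
```

The clipping step bounds how much any single contribution can move the model, which is what lets calibrated noise translate into a formal privacy guarantee.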
A computer-implemented method of federated training is described. The method includes receiving requests for training models in a plurality of models on training datasets in a plurality of training datasets, the requests accompanied with request metadata, including training acquisition parameters, the models having model coefficients responsive to training, the models accompanied with model metadata, including model hyperparameters, and the training datasets annotated with ground truth labels to train the models, the training datasets accompanied with dataset metadata, including training provisioning parameters and privacy parameters; responding to the requests by matching the models with the training datasets based on evaluating the training acquisition parameters against the training provisioning parameters; training the models on the matched training datasets in accordance with the model hyperparameters and the privacy parameters to generate gradients with respect to the model coefficients, the gradients generated based on computing error between predictions by the models on the training datasets and the ground truth labels; and making the gradients available for updating the model coefficients and generating the trained models.
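A rough sketch of the matching step follows. It is illustrative only: the parameter names (domain, price bounds, epsilon budget) are assumptions standing in for whatever acquisition and provisioning parameters the parties actually exchange.

```python
# Hedged sketch of the matching step: evaluate each request's training
# acquisition parameters against each dataset's provisioning parameters.
# The parameter names used here are illustrative assumptions.

def match(requests, datasets):
    """Pair model-training requests with datasets whose terms they satisfy."""
    matches = []
    for req in requests:
        for ds in datasets:
            same_domain = req["domain"] == ds["domain"]
            price_ok = req["max_price"] >= ds["min_price"]
            # requested privacy budget must fit within the dataset's limit
            privacy_ok = req["epsilon"] <= ds["max_epsilon"]
            if same_domain and price_ok and privacy_ok:
                matches.append((req["model_id"], ds["dataset_id"]))
    return matches
```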
One or more implementations of the technology disclosed, or elements thereof can be implemented in the form of a computer product including a non-transitory computer readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more implementations of the technology disclosed, or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. Yet further, in another aspect, one or more implementations of the technology disclosed or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media).
Federated Learning has redefined how machine learning is used in any industry where data cannot easily be transferred away from its source. Federated Learning allows for the training of machine learning models by bringing the models and computation directly to the data, rather than moving the data to a central location for training.
In a typical machine learning architecture, all data is transferred to a central location for training. With Federated Learning, only model parameters are transferred to and from the data location in the cloud. With that, Federated Learning allows each party to keep their data private.
Initial Federated Learning implementations were focused on structures with a large number of edge devices contributing to the training of a combined model (for example, Google using Federated Learning for mobile keyboard prediction).
The disclosure is particularly applicable to a federated cloud learning system and method and it is in this context that the disclosure will be described. It will be appreciated, however, that the system and method in accordance with the invention have greater utility. Furthermore, the Federated Cloud Learning system and method are applicable to a broad range of industries (as described below in more detail) including, for example: 1) Pharmaceutical companies: train models on healthcare insurers and hospitals for clinical trials, drug adherence, rare disease diagnostics; 2) Health insurers: monetize claims data; 3) Investment management: train and back-test models on various datasets currently unavailable for purchase; 4) Banking: monetize lending and other banking data; 5) Governments: allow pharmaceutical and other companies to train models on citizens' data, with predictions from those models benefitting various aspects of citizens' lives; 6) Large enterprises: build privacy-safe analytics, analyses, and models, taking advantage of more data across the enterprise, while preserving the privacy of respective departments; and 7) Companies in possession of large data assets: monetize their data without losing control of it.
Federated Cloud Learning focuses on a smaller number of parties (as few as two: one the data owner, the other the model owner), typically with one of the parties owning large amounts of data located in its respective cloud or data center, on which the other party is interested in training its models.
With Federated Cloud Learning, unlike traditional machine learning methodologies, the data owner no longer has to sell the data to the model owner and instead only leases the data temporarily on its own cloud for the purposes of training the model owner's models. By using aggregated updates to train algorithms instead of raw data, Federated Cloud Learning empowers companies from sectors where data cannot be transferred to third parties for confidentiality reasons with data network effects.
Both parties benefit significantly from such a setup: the data owner generates revenue by training the model owner's models on its data without revealing any of the data; and the model owner creates a robust, comprehensive, and specific model using its own data as well as the data owner's data, without ever having received the data itself.
To summarize, Federated Cloud Learning is a system in which multiple parties (as few as two) build a machine learning model under a data federation system, each gaining benefit from the data of all parties in the setup. The model can either be shared between the parties or not, depending on the agreement between them.
To take this one step further, we have developed the concept of a marketplace for data instruments, where inferences on the data can be transacted among multiple parties, with those transactions priced automatically based on bid and offer quotes between the parties, just as securities are traded on a regulated exchange. This marketplace is TensorXchange (tensor exchange) (“TXE”), also called a runtime intermediary 224.
The typical understanding of data is just the raw data. We extend the meaning of data into data instruments by introducing the following entities:
The above-mentioned data instruments have certain characteristics that need to be considered when they are exchanged in a marketplace:
The data instrument can be continuously updated, as the data owner's data may be a constantly growing data set. Alternatively, the data could be a one-time piece of information. TXE supports both configurations, and there are different financial configurations to support them.
The critical aspect of TXE is that it is intended as a marketplace for model owners and data owners. The same data instrument offered by a data owner could interest multiple model owners. TXE facilitates marketplace behavior (buy/sell/bid, etc.) to build a close-to-efficient market, so that model owners and data owners do not have to deal with many counterparties directly.
There are four key pieces in the data flow and technical architecture of Federated Cloud Learning and TXE.
The protocol dictates how the two parties, the model owner and the data owner, interact in a secure way.
The protocol ensures that there is minimal or no real data leakage to the model owner as part of the federated workflow, by enforcing the required privacy and security controls.
The protocol ensures that all the privacy and security controls rest with the data owner and not the model owner. This is important because it guarantees that the data owner decides what levels of privacy and security protection are acceptable during training.
The protocol allows the model owner to specify the hyperparameters of the training (except security and privacy settings) in order to train a successful model.
Then, the data owner 112 validates and verifies the model (automatically via the federated cloud learning runtime), which guarantees that the model is verified. The admin of the data owner 112 can also manually review the model as an additional security measure.
The data owner 112 configures the privacy and security settings. The applied settings let the data owner 112, not the model owner, control what levels of data guards have to be applied to the model. This is a key step for the data owner, as it guarantees that the trained model will not expose the identity of any user in the data, even in combination with other publicly available datasets. See the section on guarantees on data security and privacy for more information.
The model is then primed to start training on secure data that is only hosted and available on the data owner cloud/data center 602. Note that the model owner has no access to this environment or the data. It is a walled garden behind the firewall of the data owner. The Federated Cloud Environment will facilitate the training of the model on data provided by the data owner.
The training step supports all standard machine learning and deep learning training operations. Once the training is complete, the newly computed tensors (weights/gradients) of the model are packaged and sent to the model owner.
The Federated Cloud Environment will also automatically capture all the key information for the data owner, such as audit logs, metrics, and security and privacy settings. The packaged tensors (weights/gradients) are then sent to the model owner, as sketched below.
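The following is a minimal sketch of this data-owner-side step, assuming a simple linear model: compute the gradient tensors from one pass of standard training, package them for the model owner, and capture audit information that stays with the data owner. The function and field names are illustrative, not the disclosed runtime API.

```python
# Sketch of the data-owner-side step: compute gradient tensors, package them
# for the model owner, and capture audit information for the data owner.
import io
import json
import time
import numpy as np

def train_and_package(model_coeffs, features, labels):
    """One gradient computation for a linear model, then package the update."""
    preds = features @ model_coeffs              # predictions on the data owner's data
    error = preds - labels                       # error vs. the ground truth labels
    grads = features.T @ error / len(labels)     # gradients w.r.t. the model coefficients
    audit = json.dumps({"timestamp": time.time(), "rows_used": int(len(labels))})
    buf = io.BytesIO()
    np.save(buf, grads)                          # packaged tensors for the model owner
    return buf.getvalue(), audit                 # audit log is retained by the data owner
```

Note that only the gradient tensors leave the data owner's environment; the raw features and labels never do.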
The model owner aggregates these tensors using the Federated Cloud Learning aggregator module to produce a newly improved model.
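A minimal sketch of that aggregation step follows, assuming a plain federated-averaging style update; the actual aggregator module may weight, secure, or schedule the updates differently.

```python
# Sketch of the aggregator module's core operation: average the gradient
# tensors returned by data owners and apply them to the baseline model.
import numpy as np

def aggregate_and_update(model_coeffs, gradient_packages, lr=0.01, weights=None):
    """Weighted-average the received gradients, then update the coefficients."""
    if weights is None:
        weights = [1.0 / len(gradient_packages)] * len(gradient_packages)
    avg_grad = sum(w * g for w, g in zip(weights, gradient_packages))
    return model_coeffs - lr * avg_grad          # the newly improved model
```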
This model is put to the test to see its performance against any validation set of data that the model owner might have. This improved model can become the new baseline model. If the model owner determines that the model can be further improved, they will request another round (epoch) of training on the data owner's data (with potentially changed hyperparameters). This cycle continues until the model owner is happy with the model, as sketched below.
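The cycle can be sketched as follows; the stopping criterion (a target validation score) and the round budget are illustrative assumptions, since in practice the model owner decides when to stop.

```python
# Sketch of the train-validate-repeat cycle: keep requesting training rounds
# until the model owner is satisfied or a round budget is exhausted.
def training_rounds(model, request_round, validate, target=0.9, max_rounds=10):
    """request_round runs one round on the data owner's data; validate scores
    the model against the model owner's own validation set."""
    for epoch in range(max_rounds):
        model = request_round(model)   # one round of federated training
        score = validate(model)        # test on the owner's validation set
        if score >= target:            # model owner is happy with the model
            break
    return model
```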
A fully scalable federated cloud learning aggregation server 546 and model repository 536 for the model owner 503 is a service that allows the model owner to aggregate weights when the training operations are completed on the data owners' side. The model repository 536 and the model aggregation server 546 are hosted on model server(s) 556. The model developer(s) 518 add the models 538 to the model repository 536. In one implementation, the fully scalable federated cloud learning aggregation server is installed on a trusted third-party server 936.
The data flow between the parties proceeds as follows:
1. From the model owner side 603:
The above describes the core Federated Cloud Learning infrastructure. However, we believe that many TXE participants will require auxiliary services to meet their needs, for example pre-training services for data summarization and post-training services for model leakage analysis. Accordingly, we have developed TXE as a product that includes further modules to cover those needs, as presented in the accompanying figures.
For any machine learning technique, including Federated Learning, it is important to prevent situations permitting the training data to be estimated with varying degrees of accuracy (“model inversion”), or recovering information about whether or not a particular individual was in the training set (“membership inference”).
In a consumer federated learning setup, the data is distributed on edge devices with many users. However, in Federated Cloud Learning the data is mostly centralized at the data owner's side. This means that the data owners can provide greater guarantees to keep the identity of individuals private.
Key techniques enabling security and privacy in the context of federated learning include differential privacy, secure multi-party computation, and homomorphic encryption.
It is important to frame the problem of privacy and security in two ways when performing Federated Cloud Learning. First, a newly trained or improved model on data from the data owner must not leak identities from the data. For example, if a model is being trained on medical records using federated cloud learning, it is critical that the holder of the model cannot reverse-engineer the identities of the users whose data was used for training (even though no personally identifiable information was included during training). The challenge here is providing this guarantee even when the holder of the model tries to use any publicly available dataset or knowledge about the users (for example, Facebook or Twitter data). Second, the model must not decompose back into the data; only overall statistics should be computable from the model.
Essentially what this means is that after the analysis of a federated cloud learned model, the analyzer does not know anything about the people in the dataset. They remain “unobserved”.
A more formal definition is as follows. A randomized mechanism M: D → R satisfies (ε, δ)-differential privacy if for any two adjacent datasets X, X′ ∈ D and for any measurable subset of outputs Y ⊆ R it holds that Pr[M(X) ∈ Y] ≤ e^ε · Pr[M(X′) ∈ Y] + δ. Refer to https://en.wikipedia.org/wiki/Differential_privacy#%CE%B5-differentially_private_mechanisms, which is incorporated here by reference for all purposes.
The interpretation of adjacent datasets above determines the unit of information that is protected by the algorithm: a differentially private mechanism guarantees that two datasets differing only by the addition or removal of a single unit produce outputs that are nearly indistinguishable.
Differentially private systems are assessed by a single value, represented by the Greek letter epsilon (ε). ε is a measure of how private, and how noisy, a data release is. Higher values of ε indicate more accurate, less private answers; low ε systems give highly random answers that do not let would-be attackers learn much at all. One of differential privacy's great successes is that it reduces the essential trade-off in privacy-preserving data analysis—accuracy vs. privacy—to a single number.
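As a standard textbook illustration, not specific to this disclosure, the Laplace mechanism shows how a concrete ε guarantee is achieved: noise calibrated to the query's sensitivity yields exactly the bound in the formal definition above, with δ = 0.

```latex
% Standard illustration (not from the disclosure): the Laplace mechanism
% achieves (\varepsilon, 0)-differential privacy.
\[
M(X) = f(X) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right),
\qquad
\Delta f = \max_{X, X' \text{ adjacent}} \lVert f(X) - f(X') \rVert_1 ,
\]
\[
\frac{\Pr[M(X) \in Y]}{\Pr[M(X') \in Y]} \le e^{\varepsilon}
\quad \text{for all adjacent } X, X' \text{ and all measurable } Y \subseteq R .
\]
```

Smaller ε forces larger noise (since the Laplace scale is Δf/ε), which is the accuracy-versus-privacy trade-off described above.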
Differential privacy promises to protect individuals from any additional harm that they might face due to their data being in the private database x that they would not have faced had their data not been part of x.
Differential privacy provides privacy by process; in particular, it introduces randomness. Here the privacy comes from plausible deniability of any outcome. By introducing random events (like a coin toss) when training on an individual user's data, any subsequent attack on a trained model cannot be used to triangulate the identity of the individual with a high level of certainty. Take the classic coin-toss example: a respondent answers truthfully only if a first coin flip lands heads, and otherwise reports the outcome of a second flip, so any individual answer can be attributed to chance.
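A minimal sketch of this randomized-response mechanism follows, with the resulting privacy level worked out in the comments; the fair-coin probabilities are the standard textbook choice, not values prescribed by the disclosure.

```python
# Sketch of the coin-toss (randomized response) mechanism described above.
# With fair coins: Pr[report "yes" | truth "yes"] = 3/4 and
# Pr[report "yes" | truth "no"] = 1/4, so the likelihood ratio is at most 3,
# giving (ln 3)-differential privacy: epsilon = ln 3, approximately 1.10.
import random

def randomized_response(truth: bool) -> bool:
    if random.random() < 0.5:        # first coin: heads, so tell the truth
        return truth
    return random.random() < 0.5     # tails: report the second, random coin

# Plausible deniability: any single "yes" may simply be the random coin.
```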
This is extremely important because we can now guarantee that learning on a company's data does not reveal who any real person is, irrespective of any other public data source available on any individual. This can truly protect the privacy of individuals in the data set.
The Federated Cloud Learning admin panel makes these choices easy for the data owner.
A computer-implemented method of federated training is also described with reference to the drawings.
Currently, data owners typically sell or hand over the data to parties that are interested in training their models on that data. The individuals in that data set could potentially be easily identifiable, thus compromising their privacy and breaching regulations such as GDPR and HIPAA. Federated Cloud Learning allows the data owner to keep the data private in its location and instead allows the model owner to train a model in the data owner's own or privately leased datacenter; the data never leaves the controlled environment. In an extreme case, training may occur on a single powerful machine disconnected from the internet for maximum private data security.
To ensure maximum privacy and security of the data, while enabling machine learning, the disclosed solution:
Accordingly, Federated Cloud Learning permits learning to be done on multiple data sets, keeping those data sets in their respective locations without any need to perform any dataset exchange, and ensures the privacy and security of the datasets and their derivatives.
In one implementation, the runtime intermediary 224 is communicably linked to the storage subsystem 1313 and the user interface input devices 1338.
User interface input devices 1338 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 1300.
User interface output devices 1376 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 1300 to the user or to another machine or computer system.
Storage subsystem 1313 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by deep learning processors 1378.
Deep learning processors 1378 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Deep learning processors 1378 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™. Examples of deep learning processors 1378 include Google's Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX13 Rackmount Series™, NVIDIA DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon Processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamicIQ™, IBM TrueNorth™, Lambda GPU Server with Tesla V100s™, and others.
Memory subsystem 1322 used in the storage subsystem 1313 can include a number of memories including a main random access memory (RAM) 1332 for storage of instructions and data during program execution and a read only memory (ROM) 1334 in which fixed instructions are stored. A file storage subsystem 1336 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by file storage subsystem 1336 in the storage subsystem 1313, or in other machines accessible by the processor.
Bus subsystem 1355 provides a mechanism for letting the various components and subsystems of computer system 1300 communicate with each other as intended. Although bus subsystem 1355 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
Computer system 1300 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 1300 is intended only as a specific example for purposes of illustrating the technology disclosed; many other configurations of computer system 1300 are possible.
While the present invention is disclosed by reference to the preferred embodiments and examples detailed above, it is to be understood that these examples are intended in an illustrative rather than in a limiting sense. It is contemplated that modifications and combinations will readily occur to those skilled in the art, which modifications and combinations will be within the spirit of the invention and the scope of the following claims.
This application claims the benefit of U.S. Patent Application No. 62/883,639, entitled “FEDERATED CLOUD LEARNING SYSTEM AND METHOD”, filed Aug. 6, 2019 (Attorney Docket No. DCAI 1014-1). The provisional application is incorporated by reference for all purposes.