The following materials are incorporated by reference as if fully set forth herein:
U.S. Provisional Patent Application No. 62/883,639, titled “FEDERATED CLOUD LEARNING SYSTEM AND METHOD,” filed on Aug. 6, 2019 (Atty. Docket No. DCAI 1014-1);
U.S. Provisional Patent Application No. 62/816,880, titled “SYSTEM AND METHOD WITH FEDERATED LEARNING MODEL FOR MEDICAL RESEARCH APPLICATIONS,” filed on Mar. 11, 2019 (Atty. Docket No. DCAI 1008-1);
U.S. Provisional Patent Application No. 62/481,691, titled “A METHOD OF BODY MASS INDEX PREDICTION BASED ON SELFIE IMAGES,” filed on Apr. 5, 2017 (Atty. Docket No. DCAI 1006-1);
U.S. Provisional Patent Application No. 62/671,823, titled “SYSTEM AND METHOD FOR MEDICAL INFORMATION EXCHANGE ENABLED BY CRYPTO ASSET,” filed on May 15, 2018;
Chinese Patent Application No. 201910235758.60, titled “SYSTEM AND METHOD WITH FEDERATED LEARNING MODEL FOR MEDICAL RESEARCH APPLICATIONS,” filed on Mar. 27, 2019;
Japanese Patent Application No. 2019-097904, titled “SYSTEM AND METHOD WITH FEDERATED LEARNING MODEL FOR MEDICAL RESEARCH APPLICATIONS,” filed on May 24, 2019;
U.S. Nonprovisional patent application Ser. No. 15/946,629, titled “IMAGE-BASED SYSTEM AND METHOD FOR PREDICTING PHYSIOLOGICAL PARAMETERS,” filed on Apr. 5, 2018 (Atty. Docket No. DCAI 1006-2);
U.S. Nonprovisional patent application Ser. No. 16/816,153, titled “SYSTEM AND METHOD WITH FEDERATED LEARNING MODEL FOR MEDICAL RESEARCH APPLICATIONS,” filed on Mar. 11, 2020 (Atty. Docket No. DCAI 1008-2);
U.S. Nonprovisional patent application Ser. No. 16/987,279, titled “TENSOR EXCHANGE FOR FEDERATED CLOUD LEARNING,” filed on Aug. 6, 2020 (Atty. Docket No. DCAI 1014-2); and
U.S. Nonprovisional patent application Ser. No. 16/167,338, titled “SYSTEM AND METHOD FOR DISTRIBUTED RETRIEVAL OF PROFILE DATA AND RULE-BASED DISTRIBUTION ON A NETWORK TO MODELING NODES,” filed on Oct. 22, 2018.
The technology disclosed relates to the use of machine learning techniques on distributed data using federated learning, and more specifically to federated learning in which different data sources owned by different parties are used to train one machine learning model.
The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.
Insufficient data and labels can result in weak performance by machine learning models. In many applications, such as healthcare, data related to the same users or entities (such as patients) is maintained by separate departments within one organization or by separate organizations, resulting in data silos. A data silo is a situation in which only one group or department in an organization can access a data source. Raw data regarding the same users from multiple data sources cannot be combined due to privacy regulations and laws. Examples of different data sources include health insurance data, medical claims data, mobility data, genomic data, environmental or exposomic data, laboratory test and prescription data, data from trackers and bedside monitors, etc. Therefore, raw data from different sources, owned by respective departments and organizations, cannot be combined to train powerful machine learning models that can provide insights and predictions for providing better services and products to users.
An opportunity arises to train high-performance machine learning models by utilizing different and heterogeneous data sources without violating privacy regulations and laws.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee. The color drawings also may be available in PAIR via the Supplemental Content tab.
The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
Traditionally, to take advantage of a dataset using machine learning, all the data for training had to be gathered in one place. However, as more of the world becomes digitized, this approach fails to scale to the vast ecosystem of potential data sources that could augment machine learning (ML) models in ways limited only by the imagination. To solve this, we resort to federated learning ("FL").
The federated learning approach aggregates model weights across multiple devices without those devices explicitly sharing their data. However, horizontal federated learning assumes a shared feature space, with independently distributed samples stored on each device. Because of the true heterogeneity of information across devices, relevant information can exist in different feature spaces. In many such scenarios, the input feature space is not aligned across devices, making it extremely difficult to realize the benefits of horizontal FL. When the feature space is not aligned, two other types of federated learning apply: vertical and transfer. The technology disclosed incorporates vertical learning to enable machine learning models to learn across distributed data silos with different features representing the same set of users. FL is a set of techniques to perform machine learning on distributed data, which may lie in highly different engineering, economic, and legal (e.g., privacy) landscapes. In the literature, it is mostly conceived as making use of entire samples found across a sea of devices (i.e., horizontal federated learning) that never leave their home device. The ML paradigm remains otherwise the same.
Federated Cloud Learning ("FCL") is a form of vertical federated learning, a broader perspective on FL in which different data sources, which are keyed to each other but owned by different parties, are used to train one model simultaneously while maintaining the privacy of each component dataset from the others. That is, the samples are composed of parts that live in (and never leave) different places. Model instances only ever see a part of the entire sample, yet perform comparably to models given the entire feature space, due to the way the model stores its knowledge. This results in tight system coupling, but opens up a wide range of system possibilities not seen before.
Vertical federated learning (VFL) is best applicable in settings where two or more data silos store a different set of features describing the same population, which will be hereafter referred to as the overlapping population (OP). Assuming the OP is sufficiently large for the specific learning task of interest, vertical federated learning is a viable option for securely aggregating different feature sets across multiple data silos.
Healthcare is one among many industries that can benefit from VFL. User data is fragmented between different institutions, organizations, and departments. Most of these organizations or departments will never be allowed to share their raw data due to privacy regulations and laws. Even when such data is accessible, it is not homogeneous and cannot be combined directly into a single ML model; vertical federated learning is a better fit for heterogeneous data because it trains a joint model on encoded embeddings. VFL can leverage the private datasets or data silos to learn a joint model. The joint model can learn a holistic view of the users and create a powerful feature space for each user, which in turn trains a more powerful model.
Many alternative embodiments of the present aspects may be appropriate and are contemplated, including as described in these detailed embodiments, though also including alternatives that may not be expressly shown or described herein but as obvious variants or obviously contemplated according to one of ordinary skill based on reviewing the totality of this disclosure in combination with other available information. For example, it is contemplated that features shown and described with respect to one or more embodiments may also be included in combination with another embodiment even though not expressly shown and described in that specific combination.
For purpose of efficiency, reference numbers may be repeated between figures where they are intended to represent similar features between otherwise varied embodiments, though those features may also incorporate certain differences between embodiments if and to the extent specified as such or otherwise apparent to one of ordinary skill, such as differences clearly shown between them in the respective figures.
We describe a system 100 for Federated Cloud Learning (FCL). The system is described with reference to
The hardware modules 151 can be computing devices or edge devices such as mobile computing devices or embedded computing systems, etc. The technology disclosed deploys a processing engine on a hardware module. For example, as shown in
A federated cloud learning (FCL) trainer 127 includes the components to train processing engines. The FCL trainer 127 includes a deployer 130, a forward propagator 132, a combiner 134, a backward propagator 136, a gradient accumulator 138, and a weight updater 140. We present details of the components of the FCL trainer in the following sections.
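The following is a minimal, illustrative sketch (in PyTorch-style Python, which the experiments below also use) of how these trainer components could be composed; the class, attribute, and method names are hypothetical and not taken from the actual implementation.

```python
# Illustrative sketch only: hypothetical composition of the FCL trainer components.
import torch


class FCLTrainer:
    def __init__(self, processing_engines, hardware_modules):
        # Each engine is assumed to be an nn.Module holding a unique first module
        # and a replicated second module; hardware modules are represented here as
        # torch devices standing in for edge devices or servers.
        self.engines = processing_engines
        self.hardware_modules = hardware_modules

    def deploy(self):
        # Deployer 130: place each processing engine on its respective hardware module.
        for engine, device in zip(self.engines, self.hardware_modules):
            engine.to(device)

    def forward_pass(self, inputs):
        # Forward propagator 132: each first module processes only its own inputs.
        intermediates = [eng.first_module(x) for eng, x in zip(self.engines, inputs)]
        # Combiner 134: combine the intermediate outputs across first modules.
        combined = torch.cat(intermediates, dim=-1)
        # Forward propagator 132: each second module processes the combined intermediate output.
        return [eng.second_module(combined) for eng in self.engines]

    def backward_pass(self, final_outputs, targets, loss_fn):
        # Backward propagator 136: gradients flow back from every final output.
        sum(loss_fn(out, targets) for out in final_outputs).backward()
        # Gradient accumulator 138 and weight updater 140 then average the second-module
        # gradients across engines and apply the updates (see the training-step sketch below).
```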
Completing the description of
We present details of the components of the FCL trainer 127 in
Encoder/First Processing Module
Encoder is a processor that receives information characterizing input data and generates an alternative representation and/or characterization of the input data, such as an encoding. In particular, encoder is a neural network such as a convolutional neural network (CNN), a multilayer perceptron, a feed-forward neural network, a recursive neural network, a recurrent neural network (RNN), a deep neural network, a shallow neural network, a fully-connected neural network, a sparsely-connected neural network, a convolutional neural network that comprises a fully-connected neural network (FCNN), a fully convolutional network without a fully-connected neural network, a deep stacking neural network, a deep belief network, a residual network, echo state network, liquid state machine, highway network, maxout network, long short-term memory (LSTM) network, recursive neural network grammar (RNNG), gated recurrent unit (GRU), pre-trained and frozen neural networks, and so on.
In implementations, encoder includes individual components of a convolutional neural network (CNN), such as a one-dimensional (1D) convolution layer, a two-dimensional (2D) convolution layer, a three-dimensional (3D) convolution layer, a feature extraction layer, a dimensionality reduction layer, a pooling encoder layer, a subsampling layer, a batch normalization layer, a concatenation layer, a classification layer, a regularization layer, and so on.
In implementations, encoder comprises learnable components, parameters, and hyperparameters that can be trained by backpropagating errors using an optimization algorithm. The optimization algorithm can be based on stochastic gradient descent (or other variations of gradient descent like batch gradient descent and mini-batch gradient descent). Some examples of optimization algorithms that can be used to train the encoder are Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, and Adam.
In implementations, encoder includes an activation component that applies a non-linearity function. Some examples of non-linearity functions that can be used by the encoder include a sigmoid function, rectified linear units (ReLUs), hyperbolic tangent function, absolute of hyperbolic tangent function, leaky ReLUs (LReLUs), and parametrized ReLUs (PReLUs).
In some implementations, the encoder/first processing module and decoder/second processing module can include a classification component, though it is not necessary. In preferred implementations, the encoder/first processing module and decoder/second processing module are convolutional neural networks (CNNs) without a classification layer such as softmax or sigmoid. Some examples of classifiers that can be used by the encoder/first processing module and decoder/second processing module include a multi-class support vector machine (SVM), a sigmoid classifier, a softmax classifier, and a multinomial logistic regressor. Other examples of classifiers that can be used by the encoder/first processing module include a rule-based classifier.
Some examples of the encoder/first processing module and decoder/second processing module are:
In a processing engine, the encoder/first processing module produces an output, referred to herein as “encoding”, which is fed as input to each of the decoders. When the encoder/first processing module and decoder/second processing module is a convolutional neural network (CNN), the encoding/decoding is convolution data. When the encoder/first processing module and decoder/second processing module is a recurrent neural network (RNN), the encoding/decoding is hidden state data.
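As a minimal sketch (illustrative only; the layer sizes and encoding dimension are assumptions, not taken from the disclosure), an MLP encoder that maps one silo's input features to an encoding could look like this in PyTorch:

```python
import torch.nn as nn


class Encoder(nn.Module):
    """First processing module: maps one silo's raw input features to an encoding."""

    def __init__(self, in_features: int, encode_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 64),
            nn.ReLU(),
            nn.Linear(64, encode_dim),
            nn.Sigmoid(),  # keeps the encoding in the 0-1 range, as in the use case below
        )

    def forward(self, x):
        return self.net(x)  # the "encoding" that is fed (after combining) to the decoders
```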
Decoder/Second Processing Module
Each decoder/second processing module is a processor that receives, from the encoder/first processing module, information characterizing input data (such as the encoding) and generates an alternative representation and/or characterization of the input data, such as classification scores. In particular, each decoder is a neural network such as a convolutional neural network (CNN), a multilayer perceptron, a feed-forward neural network, a recursive neural network, a recurrent neural network (RNN), a deep neural network, a shallow neural network, a fully-connected neural network, a sparsely-connected neural network, a convolutional neural network that comprises a fully-connected neural network (FCNN), a fully convolutional network without a fully-connected neural network, a deep stacking neural network, a deep belief network, a residual network, echo state network, liquid state machine, highway network, maxout network, long short-term memory (LSTM) network, recursive neural network grammar (RNNG), gated recurrent unit (GRU), pre-trained and frozen neural networks, and so on.
In implementations, each decoder/second processing module includes individual components of a convolutional neural network (CNN), such as a one-dimensional (1D) convolution layer, a two-dimensional (2D) convolution layer, a three-dimensional (3D) convolution layer, a feature extraction layer, a dimensionality reduction layer, a pooling encoder layer, a subsampling layer, a batch normalization layer, a concatenation layer, a classification layer, a regularization layer, and so on.
In implementations, each decoder/second processing module comprises learnable components, parameters, and hyperparameters that can be trained by backpropagating errors using an optimization algorithm. The optimization algorithm can be based on stochastic gradient descent (or other variations of gradient descent like batch gradient descent and mini-batch gradient descent). Some examples of optimization algorithms that can be used to train each decoder are Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, and Adam.
In implementations, each decoder/second processing module includes an activation component that applies a non-linearity function. Some examples of non-linearity functions that can be used by each decoder include a sigmoid function, rectified linear units (ReLUs), hyperbolic tangent function, absolute of hyperbolic tangent function, leaky ReLUs (LReLUs), and parametrized ReLUs (PReLUs).
In implementations, each decoder includes a classification component. Some examples of classifiers that can be used by each decoder include a multi-class support vector machine (SVM), a sigmoid classifier, a softmax classifier, and a multinomial logistic regressor. Other examples of classifiers that can be used by each decoder include a rule-based classifier.
The numerous decoders/second processing modules can all be the same type of neural networks with matching architectures, such as fully-connected neural networks (FCNN) with an ultimate sigmoid or softmax classification layer. In other implementations, they can differ based on the type of the neural networks. In yet other implementations, they can all be the same type of neural networks with different architectures.
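Continuing the same illustrative sketch (names and sizes are hypothetical), a decoder with the fully-connected, classification-ending architecture described above could be written as follows; every processing engine would hold its own copy of this same architecture:

```python
import torch.nn as nn


class Decoder(nn.Module):
    """Second processing module: identical architecture across all processing engines."""

    def __init__(self, combined_dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(combined_dim, 32),
            nn.ReLU(),
            nn.Linear(32, num_classes),  # logits; softmax is applied inside the cross-entropy loss
        )

    def forward(self, combined_encoding):
        return self.net(combined_encoding)
```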
We now present an example use case in which the technology disclosed can be deployed to solve a problem in the field of health care.
Problem
To demonstrate the capabilities of FCL in the intra-company scenario for a Health Insurer, we present the use case of fraud detection. We imagine a world where health plan members have visits with healthcare providers. This results in some fraud, which we would like to classify. This information lives in two silos: (1) claims submitted by providers, and (2) claims submitted by members, which always correspond one-to-one. Either or both of providers and members may be fraudulent, and accordingly the data needed to answer the fraud question lies in either or both of the two datasets.
We have broken down our synthetic fraud into six types: three for members (unnecessarily going to providers for visits), and three for providers (unnecessarily performing procedures on members). These types have very specific criteria, which we can use to enrich a synthetic dataset appropriately.
In this example, the technology disclosed can identify potential fraud broken down into six types, grouped into simple analytics, complex analytics, and prediction analytics. The goal is to identify users (or members) and providers in the following two categories.
1. Users who are unnecessarily going to providers for visits
2. Providers that are unnecessarily performing a certain procedure on many users
Simple Analytics:
Complex Analytics:
Prediction Analytics:
The six types of fraud are summarized in table 1 below:
Accordingly, we are assuming that the data required to analyze fraud types 5 and 6 exists on separate clusters:
Dataset
The data is generated by a two-step process, which is decoupled for faster experimentation:
1. Create the raw provider, member, and visit metadata, including fraud.
2. Collect into two partitions (provider claims vs member claims) and featurize.
Many fields are realized categorically, with randomized distributions of correlations between provider/member attributes and the odds of different types of fraud. Some are more structured, such as our fake ICD10 codes and ZIP codes, which are used to connect members to local providers. Fraud is decided on a per-visit basis (6 potential reasons). Tables are related by provider, member, and visit ID. Getting to specifics, we generate the following columns:
Execution steps with timings in seconds:
Features
The second dataset generation stage, collection and featurization, makes this a good vertically federated learning problem. There is only partial overlap between the features present in the provider claims data and the member claims data. In practice, this means that detecting all types of fraud with high accuracy requires both partitions of the feature space.
In practice, much of the gap between the "perfect information" learning curve and 100% accuracy is to be found in inadequate featurization. Providers and members are realized as the group of visits that belong to them. Visit groups are then featurized in the same way. Cost, visit count, date, ICD10, num rx, etc. are all considered relevant. Numeric values are often log2-transformed and one-hot encoded. This results in a feature dimensionality of around 100-200.
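A hedged sketch of this featurization step follows; the field names, bucket sizes, and code count are assumptions used only to illustrate the log2 and one-hot treatment described above.

```python
import numpy as np


def featurize_visit_group(total_cost, visit_count, icd10_index, num_icd10_codes=50):
    """Featurize one provider's or member's group of visits (illustrative only)."""
    # Numeric fields are log2-transformed to compress their dynamic range.
    log_cost = np.log2(1.0 + total_cost)
    log_visits = np.log2(1.0 + visit_count)
    # Categorical fields, such as the fake ICD10 codes, are one-hot encoded.
    icd10_onehot = np.zeros(num_icd10_codes)
    icd10_onehot[icd10_index] = 1.0
    # Concatenating blocks like these yields a feature vector on the order of 100-200 dimensions.
    return np.concatenate([[log_cost, log_visits], icd10_onehot])
```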
Models
For this problem, the provider claim and member claim encoder networks are both stock multilayer perceptrons (MLPs) with sigmoid outputs (for quantizing in the 0-1 range). The output network is also an MLP, as is almost always the case, since this is a classification problem. The models are trained with categorical cross-entropy loss.
Training
We default to a 20% validation split, 50 epochs, batch size 1024, encoding dimension 8, and no quantization. We observe approximately half-minute epochs for A, B, and AB, and one-minute epochs for F, on an unladen NVIDIA RTX 2080. The models were implemented in PyTorch 1.3.1 with CUDA 10.1.
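The sketch below wires the stated defaults into a single-process training loop that simulates the split architecture; the dataset layout, module names, and optimizer choice are assumptions, not the actual implementation.

```python
import torch
from torch.utils.data import DataLoader, random_split


def train_fcl(provider_enc, member_enc, output_net, dataset,
              epochs=50, batch_size=1024, val_fraction=0.2, lr=1e-3):
    """Simulated FCL training run; dataset yields (provider_feats, member_feats, label)."""
    n_val = int(len(dataset) * val_fraction)             # 20% validation split
    train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)

    params = (list(provider_enc.parameters()) + list(member_enc.parameters())
              + list(output_net.parameters()))
    optimizer = torch.optim.Adam(params, lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(epochs):
        for provider_x, member_x, y in loader:
            optimizer.zero_grad()
            # Each encoder sees only its own partition of the feature space.
            combined = torch.cat([provider_enc(provider_x), member_enc(member_x)], dim=-1)
            loss = loss_fn(output_net(combined), y)
            loss.backward()
            optimizer.step()
    return val_set  # held out for computing the learning curves
```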
Results
Explanation:
The A and B learning curves are their respective datasets taken alone. As these data sources are insufficient when used independently, they form the low-end baselines as shown in
AB is the traditional (non-federated) machine learning task, taking both A and B as input. This is the high-end baseline, shown at the top end of the graphical plot in
F is the federated cloud learning or FCL curve. Notice how, with uninitialized model memory, it performs as well as either A or B taken alone, then improves as this information forms and stabilizes.
On this challenging dataset, the FCL curve approaches but does not match the AB curve.
Architecture Overview
An overview of the FL architecture is presented below; it ensures that no information is leaked during training.
Network Architecture
The networks, for each system, are split into two parts: an encoder that is built specifically for the feature subset that it addresses, and a “shared” deeper network that takes the encodings as inputs to produce an output. The encoder networks are fully isolated from one another and do not need to share their architecture. For example, the encoder on the left (labeled 904) could be a convolutional network that works with image data while the encoder on the right (labeled 954) could be a recurrent network that addresses natural language inputs. The encoding from encoder 904 is labeled as 905 and encoding from encoder 954 is labeled as 955.
The “shared” portion of each network, on the other hand, has the same architecture, and the weights will be averaged across the networks during training so that they converge to the same values. Data is fed into each network row-wise, that is to say, by sample, but with each network only having access to its subset of the feature space. The rows of data from separate data sets but belonging to same sample are shown in a table in
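A minimal sketch of the weight averaging applied to the "shared" portions, assuming each copy is a PyTorch module with an identical architecture (the function name is hypothetical):

```python
import torch


@torch.no_grad()
def average_shared_weights(shared_copies):
    """Average parameters across the identical 'shared' networks so they converge to the same values."""
    param_lists = [list(copy.parameters()) for copy in shared_copies]
    for params in zip(*param_lists):  # corresponding parameters across all copies
        mean = torch.stack([p.data for p in params]).mean(dim=0)
        for p in params:
            p.data.copy_(mean)
```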
Architecture Properties
One of the important features of this federated architecture is that the separate systems do not need to know anything about each other's dataset.
Each network runs separately from the other networks; therefore, each network has access to the target output. The labels and the values (from the target output) that the federated system will be trained to predict are shared across networks. In less ideal cases where there is overlap in the feature subsets, it may be necessary to coordinate on decisions about how the overlap will be addressed. For example, one of the subsets could simply be designated as the canonical representation of the shared feature, so that it is ignored in the other subset, or the values could be averaged or otherwise combined prior to processing by the encoders.
Federated cloud learning (FCL) is about a basic architecture and training mechanism. The actual neural networks used are custom to the problem at hand. The unifying elements, in order of execution, are:
We have applied federated cloud learning (FCL) and vertical federated learning (VFL) to the following problems that have very different characteristics and have found common themes and generalized our learnings:
1. Parity
Using the technology disclosed, we predict the parity of a collection of bits that have been partitioned into multiple shards using the FCL architecture. We detected a yawning gap between one-shard knowledge (50% accuracy) and total knowledge (100% accuracy). FCL is a little slower to converge, especially at higher quantizations, more sample bits, and tighter encoding dimensionalities, but it does converge. It displays some oscillatory behavior due to the long memory update/short batch update tick-tock, combined with the model sensitivity that arises because the encodings must preserve the sample bits so efficiently.
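For concreteness, a small data-generation sketch for this parity experiment follows (the sample count, bit width, and shard count are arbitrary assumptions): the label is the parity of the full bit vector, so no single shard suffices on its own.

```python
import numpy as np


def make_parity_shards(n_samples=10000, n_bits=16, n_shards=2, seed=0):
    """Generate parity data partitioned across shards (illustrative only)."""
    rng = np.random.default_rng(seed)
    bits = rng.integers(0, 2, size=(n_samples, n_bits))
    labels = bits.sum(axis=1) % 2                       # parity of the *full* bit vector
    shards = np.array_split(bits, n_shards, axis=1)     # each shard holds only a subset of the bits
    return shards, labels
```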
2. CLEVR
CLEVR is an image dataset for a synthetic visual question-and-answer challenge and lends itself to (a) a questions dataset and (b) an associated images dataset, which together we can use with the FCL architecture. It is also notable for the different encoder architectures we can use (CONV2D+CONV1D/RNN/Transformer), which the optimizer favors in different ways.
3. Higgs Boson
The Higgs boson detection dataset can be cleaved into what it describes as low-level features and a set of derived high-level features, which can be fed to respective multilayer perceptrons (MLPs). It showcases the overlap and correlations so often present in real-world data, which deep learning is well suited to exploit.
4. Other Data Sources and Use Cases
The technology disclosed can be applied to other data sources listed below.
We present below in table 3 some example use cases of the technology disclosed using the data listed in table 2 above.
In
A bus system 1005 is connected to the plurality of prediction engines. The bus system is configurable to partition the respective prediction engines into respective processing pipelines. The bus system 1005 can block input feature exchange via the bus system between an encoder within a particular processing pipeline and encoders outside the particular processing pipeline. For example, the bus system 1005 can block exchange of input features 902 and 952 with encoders outside their respective processing pipelines. Therefore, the encoder 904 does not have access to input features 952 and the encoder 954 does not have access to input features 902.
The system presented in
The system includes a joint prediction generator connected to the plurality of prediction engines. The joint prediction generator is configurable to process input features from the respective feature spaces of the respective data silos through encoders of corresponding allocated processing pipelines to generate corresponding encodings. The joint prediction generator can combine the corresponding encodings across the processing pipelines to generate combined encodings. The joint prediction generator can process the combined encodings through the decoders to generate a unified prediction for members of the overlapping population. Examples of such predictions are presented in Table 3 above. For example, the system can predict a person's likelihood of following a medical protocol, or predict whether a person can experience burnout or productivity issues.
The technology disclosed provides a platform to jointly train a plurality of prediction engines as described above and illustrated in
The trained system can be used to execute joint prediction tasks. The system comprises a joint prediction generator connected to a plurality of prediction engines. The joint prediction generator is configurable to process input features from respective feature spaces of respective data silos through encoders of corresponding allocated prediction engines in the plurality of prediction engines to generate corresponding encodings. The prediction generator can combine the corresponding encodings across the prediction engines to generate combined encodings. The prediction generator can process the combined encodings through respective decoders of the prediction engines to generate a unified prediction for members of an overlapping population that spans the respective feature space.
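A minimal sketch of such a joint prediction pass, assuming the encoder/decoder modules sketched earlier (all names hypothetical): each encoder sees only its silo's features, the encodings are combined, and a shared decoder produces the unified prediction.

```python
import torch


def joint_predict(encoders, decoders, silo_inputs):
    """Generate a unified prediction from per-silo inputs (illustrative only)."""
    with torch.no_grad():
        # Each encoder processes only the input features of its own data silo.
        encodings = [enc(x) for enc, x in zip(encoders, silo_inputs)]
        combined = torch.cat(encodings, dim=-1)
        # The decoders share the same weights after training, so any one yields the unified prediction.
        return decoders[0](combined)
```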
We describe implementations of a system for training processing engines.
The technology disclosed can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections—these recitations are hereby incorporated forward by reference into each of the following implementations.
A computer-implemented method implementation of the technology disclosed includes accessing a plurality of processing engines. Each processing engine in the plurality of processing engines can have at least a first processing module and a second processing module. The first processing module in each processing engine is different from a corresponding first processing module in every other processing engine. The second processing module in each processing engine is the same as a corresponding second processing module in every other processing engine.
The computer-implemented method includes deploying each processing engine to a respective hardware module in a plurality of hardware modules for training.
During the forward pass stage of the training, the computer-implemented method includes processing inputs through the first processing modules of the processing engines and producing an intermediate output for each first processing module.
During the forward pass stage of the training, the computer-implemented method includes combining intermediate outputs across the first processing modules and producing a combined intermediate output for each first processing module.
During the forward pass stage of the training, the computer-implemented method includes processing combined intermediate outputs through the second processing modules of the processing engines and producing a final output for each second processing module.
During the backward pass stage of the training, the computer-implemented method includes determining gradients for each second processing module based on corresponding final outputs and corresponding ground truths.
During the backward pass stage of the training, the computer-implemented method includes accumulating the gradients across the second processing modules and producing accumulated gradients.
During the backward pass stage of the training, the computer-implemented method includes updating weights of the second processing modules based on the accumulated gradients and producing updated second processing modules.
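The following single-process sketch follows these forward and backward pass stages step by step in PyTorch; the engine attributes, the use of concatenation as the combiner, and gradient averaging as the accumulation rule are assumptions consistent with the implementations described below, not a definitive implementation.

```python
import torch


def fcl_training_step(engines, inputs, targets, optimizers,
                      loss_fn=torch.nn.CrossEntropyLoss()):
    """One training step over processing engines, each with a unique .first_module
    and a replicated .second_module (illustrative only)."""
    # Forward pass: intermediate outputs per first module, combined, then final outputs.
    intermediates = [eng.first_module(x) for eng, x in zip(engines, inputs)]
    combined = torch.cat(intermediates, dim=-1)
    finals = [eng.second_module(combined) for eng in engines]

    # Backward pass: gradients for each second module from its final output and the ground truths;
    # the first modules also receive gradients through the combined intermediate outputs.
    for opt in optimizers:
        opt.zero_grad()
    sum(loss_fn(out, targets) for out in finals).backward()

    # Accumulate gradients across the second processing modules by averaging them.
    second_params = [list(eng.second_module.parameters()) for eng in engines]
    for param_group in zip(*second_params):
        grads = [p.grad for p in param_group if p.grad is not None]
        if grads:
            mean_grad = torch.stack(grads).mean(dim=0)
            for p in param_group:
                p.grad = mean_grad.clone()

    # Update weights: second modules use the accumulated gradients, first modules their own.
    for opt in optimizers:
        opt.step()
```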
This method implementation and other methods disclosed optionally include one or more of the following features. This method can also include features described in connection with systems disclosed. In the interest of conciseness, alternative combinations of method features are not individually enumerated. Features applicable to methods, systems, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.
One implementation of the computer-implemented method includes determining gradients for each first processing module during the backward pass stage of the training based on the combined intermediate outputs, the corresponding final outputs, and the corresponding ground truths. The method includes, during the backward pass stage of the training, updating weights of the first processing modules based on the determined gradients and producing updated first processing modules.
In one implementation, the computer-implemented method includes storing the updated first processing modules and the updated second processing modules as updated processing engines. The method includes making the updated processing engines available for inference.
The hardware module can be a computing device and/or edge device. The hardware module can be a chip or a part of a chip.
In one implementation, the computer-implemented method includes accumulating the gradients across the second processing modules and producing the accumulated gradients by determining weighted averages of the gradients.
In one implementation, the computer-implemented method includes accumulating the gradients across the second processing modules and producing the accumulated gradients by determining averages of the gradients.
In one implementation, the computer-implemented method includes combining the intermediate outputs across the first processing modules and producing the combined intermediate output for each first processing module by concatenating the intermediate outputs across the first processing modules.
In another implementation, the computer-implemented method includes combining the intermediate outputs across the first processing modules and producing the combined intermediate output for each first processing module by summing the intermediate outputs across the first processing modules.
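These two combining strategies can be illustrated with a short sketch (assuming the intermediate outputs are tensors; summation additionally assumes they share the same shape):

```python
import torch


def combine_by_concatenation(intermediates):
    # Concatenation places every first module's intermediate output side by side.
    return torch.cat(intermediates, dim=-1)


def combine_by_summation(intermediates):
    # Summation element-wise adds the intermediate outputs across first modules.
    return torch.stack(intermediates, dim=0).sum(dim=0)
```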
In one implementation, the inputs processed through the first processing modules of the processing engines can be a subset of features selected from a plurality of training examples in a training set. In such an implementation, the inputs processed through the first processing modules of the processing engines can be a subset of the plurality of the training examples in the training set.
In one implementation, the computer-implemented method includes selecting and encoding inputs for a particular first processing module based at least on an architecture of the particular first processing module and/or a task performed by the particular first processing module.
In one implementation, the computer-implemented method includes using parallel processing for performing the training of the plurality of processing engines.
In one implementation, the computer-implemented method includes the first processing modules that have different architectures and/or different weights.
In one implementation, the computer-implemented method includes the second processing modules that are copies of each other such that they have a same architecture and/or same weights.
The first processing modules can be neural networks, deep neural networks, decision trees, or support vector machines.
The second processing modules can be neural networks, deep neural networks, classification layers, or regression layers.
In one implementation, the first processing modules are encoders, and the intermediate outputs are encodings.
In one implementation, the second processing modules are decoders and the final outputs are decodings.
In one implementation, the computer-implemented method includes iterating the training until a convergence condition is reached. In such an implementation, the convergence condition can be a threshold number of training iterations.
Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform a method as described above. Yet another implementation may include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform a method as described above.
Computer readable media (CRM) implementations of the technology disclosed include a non-transitory computer readable storage medium impressed with computer program instructions that, when executed on a processor, implement the method described above.
Each of the features discussed in this particular implementation section for the method implementation applies equally to the CRM implementation. As indicated above, all the system features are not repeated here and should be considered repeated by reference.
A system implementation of the technology disclosed includes one or more processors coupled to memory. The memory is loaded with computer instructions to train processing engines. The system comprises a memory that can store a plurality of processing engines. Each processing engine in the plurality of processing engines can have at least a first processing module and a second processing module. The first processing module in each processing engine is different from a corresponding first processing module in every other processing engine. The second processing module in each processing engine is the same as a corresponding second processing module in every other processing engine.
The system comprises a deployer that deploys each processing engine to a respective hardware module in a plurality of hardware modules for training.
The system comprises a forward propagator which can process inputs during the forward pass stage of the training. The forward propagator can process inputs through the first processing modules of the processing engines and produce an intermediate output for each first processing module.
The system comprises a combiner which can combine intermediate outputs during the forward pass stage of the training. The combiner can combine intermediate outputs across the first processing modules and produce a combined intermediate output for each first processing module.
The forward propagator, during the forward pass stage of the training, can process combined intermediate outputs through the second processing modules of the processing engines and produce a final output for each second processing module.
The system comprises a backward propagator which, during the backward pass stage of the training, can determine gradients for each second processing module based on corresponding final outputs and corresponding ground truths.
The system comprises a gradient accumulator which, during the backward pass stage of the training, can accumulate the gradients across the second processing modules and can produce accumulated gradients.
The system comprises a weight updater which, during the backward pass stage of the training, can update weights of the second processing modules based on the accumulated gradients and can produce updated second processing modules.
This system implementation optionally includes one or more of the features described in connection with method disclosed above. In the interest of conciseness, alternative combinations of method features are not individually enumerated. Features applicable to methods, systems, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.
Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform functions of the system described above. Yet another implementation may include a method performing the functions of the system described above.
A computer readable storage medium (CRM) implementation of the technology disclosed includes a non-transitory computer readable storage medium impressed with computer program instructions to train processing engines. The instructions, when executed on a processor, implement the method described above.
Each of the features discussed in this particular implementation section for the method implementation applies equally to the CRM implementation. As indicated above, all the method features are not repeated here and should be considered repeated by reference.
Other implementations may include a method of aggregating feature spaces from disparate data silos to execute joint training and prediction tasks using the systems described above. Yet another implementation may include non-transitory computer readable storage medium storing instructions executable by a processor to perform the method described above.
Computer readable media (CRM) implementations of the technology disclosed include a non-transitory computer readable storage medium impressed with computer program instructions that, when executed on a processor, implement the method described above.
Each of the features discussed in this particular implementation section for the system implementation applies equally to the method and CRM implementations. As indicated above, all the system features are not repeated here and should be considered repeated by reference.
Particular Implementations—Aggregating Feature Spaces from Data Silos
We describe implementations of a system for aggregating feature spaces from disparate data silos to execute joint training and prediction tasks.
The technology disclosed can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections—these recitations are hereby incorporated forward by reference into each of the following implementations.
A first system implementation of the technology disclosed includes one or more processors coupled to memory. The memory is loaded with computer instructions to aggregate feature spaces from disparate data silos to execute joint prediction tasks. The system comprises a plurality of prediction engines, respective prediction engines in the plurality of prediction engines having respective encoders and respective decoders. The system comprises a plurality of data silos, respective data silos in the plurality of data silos having respective feature spaces that have input features for an overlapping population that spans the respective feature spaces. The system comprises a bus system connected to the plurality of prediction engines. The bus system is configurable to partition the respective prediction engines into respective processing pipelines. The bus system is configurable to block input feature exchange via the bus system between an encoder within a particular processing pipeline and encoders outside the particular processing pipeline.
The system comprises a memory access controller connected to the bus system. The memory access controller is configurable to confine access of the encoder within the particular processing pipeline to input features of a feature space of a data silo allocated to the particular processing pipeline. The memory access controller is configurable to allow access of a decoder within the particular processing pipeline to encoding generated by the encoder within the particular processing pipeline. The memory access controller is configurable to allow access of a decoder to encodings generated by the encoders outside the particular processing pipeline.
The system comprises a joint prediction generator connected to the plurality of prediction engines. The joint prediction generator is configurable to process input features from the respective feature spaces of the respective data silos through encoders of corresponding allocated processing pipelines to generate corresponding encodings. The joint prediction generator is configurable to combine the corresponding encodings across the processing pipelines to generate combined encodings. The joint prediction generator is configurable to process the combined encodings through the decoders to generate a unified prediction for members of the overlapping population.
This system implementation and other systems disclosed optionally include one or more of the following features. This system can also include features described in connection with methods disclosed. In the interest of conciseness, alternative combinations of system features are not individually enumerated. Features applicable to methods, systems, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.
The prediction engines can comprise convolutional neural networks (CNNs), long short-term memory (LSTM) neural networks, attention-based models like Transformer deep learning models and Bidirectional Encoder Representations from Transformers (BERT) machine learning models, etc.
One or more data silos in the plurality of data silos can store medical images, claims data from a health insurer, mental health data from a mental health application, data from wearable devices, trackers or bedside monitors, genomics data, banking data, mobility data, clinical trials data, etc.
One or more feature spaces in the respective feature spaces of the plurality of data silos include prescription drugs information, insurance plans information, activity information from wearable devices, etc.
The unified prediction can include a survival score predicting a person's survival over the next time period. The unified prediction can include a burnout prediction indicating a person's likelihood of experiencing productivity issues. The unified prediction can include predicting whether a person will experience a mental health episode or manic depression. The unified prediction can include the likelihood that a person will default on a loan. The unified prediction can include predicting the efficacy of a new drug or a new medical protocol.
A second system implementation of the technology disclosed includes one or more processors coupled to memory. The memory is loaded with computer instructions to aggregate feature spaces from disparate data silos to execute joint prediction tasks. The system comprises a joint prediction generator connected to a plurality of prediction engines. The prediction engines in the plurality of prediction engines have respective encoders and respective decoders that are configurable to process input features from respective feature spaces of respective data silos through the respective encoders to generate respective encodings, to combine the respective encodings to generate combined encodings, and to process the combined encodings through the respective decoders to generate a unified prediction for members of an overlapping population that spans the respective feature spaces.
This system implementation and other systems disclosed optionally include one or more of the features listed above for the first system implementation. In the interest of conciseness, the individual features of the first system implementation are not enumerated for the second system implementation.
A third system implementation of the technology disclosed includes one or more processors coupled to memory. The memory is loaded with computer instructions to aggregate feature spaces from disparate data silos to execute joint training tasks. The system comprises a plurality of prediction engines, respective prediction engines in the plurality of prediction engines can have respective encoders and respective decoders configurable to generate gradients during training. The system comprises a plurality of data silos, respective data silos in the plurality of data silos can have respective feature spaces that have input features for an overlapping population that spans the respective feature spaces. The input features are configurable as training samples for use in the training. The system comprises a bus system connected to the plurality of prediction engines and configurable to partition the respective prediction engines into respective processing pipelines. The bus system is configurable to block training sample exchange and gradient exchange via the bus system during the training between an encoder within a particular processing pipeline and encoders outside the particular processing pipeline.
The system comprises a memory access controller connected to the bus system and configurable to confine access of the encoder within the particular processing pipeline to input features of a feature space of a data silo allocated as training samples to the particular processing pipeline and to gradients generated from the training of the encoder within the particular processing pipeline. The memory access controller is configurable to allow access of a decoder within the particular processing pipeline to gradients generated from the training of the decoder within the particular processing pipeline and to gradients generated from the training of decoders outside the particular processing pipeline.
The system comprises a joint trainer connected to the plurality of prediction engines and configurable to process, during the training, input features from the respective feature spaces of the respective data silos through the respective encoders of corresponding allocated processing pipelines to generate corresponding encodings. The joint trainer is configurable to combine the corresponding encodings across the processing pipelines to generate combined encodings. The joint trainer is configurable to process the combined encodings through the respective decoders to generate respective predictions for members of the overlapping population. The joint trainer is configurable to generate a combined gradient set from respective gradients of the respective decoders generated based on the respective predictions. The joint trainer is configurable to generate respective gradients of the respective encoders based on the combined encodings. The joint trainer is configurable to update the respective decoders based on the combined gradient set, and to update the respective encoders based on the respective gradients.
This system implementation and other systems disclosed optionally include one or more of the features listed above for the first system implementation. In the interest of conciseness, the individual features of the first system implementation are not enumerated for the third system implementation.
A fourth system implementation of the technology disclosed includes a system comprising a joint trainer connected to a plurality of prediction engines having respective encoders and respective decoders that are configurable to process, during training, input features from respective feature spaces of respective data silos through the respective encoders to generate respective encodings. The joint trainer is configurable to combine the respective encodings across encoders to generate combined encodings. The joint trainer is configurable to process the combined encodings through the respective decoders to generate respective predictions for members of an overlapping population. The joint trainer is configurable to generate a combined gradient set from respective gradients of the respective decoders generated based on the respective predictions. The joint trainer is configurable to generate respective gradients of the respective encoders based on the combined encodings. The joint trainer is configurable to update the respective decoders based on the combined gradient set, and to update the respective encoders based on the respective gradients.
This system implementation and other systems disclosed optionally include one or more of the features listed above for the first system implementation. In the interest of conciseness, the individual features of the first system implementation are not enumerated for the fourth system implementation.
Other implementations may include a method of aggregating feature spaces from disparate data silos to execute joint training and prediction tasks using the systems described above. Yet another implementation may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform the method described above.
Method implementations of the technology disclosed include aggregating feature spaces from disparate data silos to execute joint training and prediction tasks by using the system implementations described above.
Each of the features discussed in this particular implementation section for the system implementation applies equally to the method implementation. As indicated above, all the method features are not repeated here and should be considered repeated by reference.
Computer readable media (CRM) implementations of the technology disclosed include a non-transitory computer readable storage medium impressed with computer program instructions that, when executed on a processor, implement the method described above.
Each of the features discussed in this particular implementation section for the system implementation applies equally to the method and CRM implementations. As indicated above, all the system features are not repeated here and should be considered repeated by reference.
Computer System
In one implementation, the processing engines are communicably linked to the storage subsystem 1110 and the user interface input devices 1138.
User interface input devices 1138 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 1100.
User interface output devices 1176 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 1100 to the user or to another machine or computer system.
Storage subsystem 1110 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. Subsystem 1178 can be graphics processing units (GPUs) or field-programmable gate arrays (FPGAs).
Memory subsystem 1122 used in the storage subsystem 1110 can include a number of memories including a main random access memory (RAM) 1132 for storage of instructions and data during program execution and a read only memory (ROM) 1134 in which fixed instructions are stored. A file storage subsystem 1136 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by file storage subsystem 1136 in the storage subsystem 1110, or in other machines accessible by the processor.
Bus subsystem 1155 provides a mechanism for letting the various components and subsystems of computer system 1100 communicate with each other as intended. Although bus subsystem 1155 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
Computer system 1100 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 1100 depicted in
The computer system 1100 includes GPUs or FPGAs 1178. It can also include machine learning processors hosted by machine learning cloud platforms such as Google Cloud Platform, Xilinx, and Cirrascale. Examples of deep learning processors include Google's Tensor Processing Unit (TPU), rackmount solutions like GX4 Rackmount Series, GX8 Rackmount Series, NVIDIA DGX-1, Microsoft's Stratix V FPGA, Graphcore's Intelligent Processor Unit (IPU), Qualcomm's Zeroth platform with Snapdragon processors, NVIDIA's Volta, NVIDIA's DRIVE PX, NVIDIA's JETSON TX1/TX2 MODULE, Intel's Nirvana, Movidius VPU, Fujitsu DPI, ARM's DynamicIQ, IBM TrueNorth, and others.
This application claims the benefit of U.S. Patent Application No. 62/942,644, entitled “SYSTEMS AND METHODS OF TRAINING PROCESSING ENGINES,” filed Dec. 2, 2019 (Attorney Docket No. DCAI 1002-1). The provisional application is incorporated by reference for all purposes.