Systems and Methods for Supplementing Data With Generative Models

FIELD OF THE INVENTION

The present invention generally relates to supplementing data for analysis and, more specifically, to training or adapting generative conditional generative models to supplement data for analysis.

BACKGROUND

In a world of uncertainty, it is difficult to properly model probability distributions across multiple dimensions based on diverse and heterogeneous data sets. For example, in the health industry, individual health outcomes are never certain. The condition of one patient with a disease may deteriorate rapidly, while another patient quickly recovers. The inherent stochasticity of individual health outcomes implies that health informatics must aim to predict health risks rather than deterministic outcomes. The ability to quantify and predict health risks has important implications for business models that depend on the health of a population. As such, generative models can be trained to generate potential outcome data based on characteristics of entities from individuals to entire populations.

Generative models are a class of machine learning models that learns to sample from, potentially multivariate and/or time-dependent, probability distributions that are consistent with the observed data. Generative models have various applications in a variety of additional fields, such as economic forecasting, climate modeling, and medical research. There are a variety of instances in which it is important to obtain information surrounding outcomes that are conditional on sets of pre-determined features, by modeling the entire (conditional) probability distributions. These models, generally applied to classification or regression, are usually called discriminative or conditional generative models.

SUMMARY OF THE INVENTION

Systems and techniques for adjusting experiment parameters are illustrated. One embodiment includes a method for training a conditional generative model. The method defines a joint distribution, wherein the joint distribution corresponds to a combination of a probabilistic model and a point prediction model, and wherein the point prediction model is configured to obtain a measurement of regression accuracy. The method derives an energy function for the joint distribution. The method obtains, from the energy function for the joint distribution, an approximation for a conditional distribution, wherein an output of the point prediction model is a parameter of the approximation. The method determines, from a loss function, at least one training parameter. The method trains the combination based on the at least one parameter to operate as a conditional generative model, wherein the conditional generative model follows the conditional distribution. The method applies the trained probabilistic model to a dataset corresponding to a randomized trial.

In a further embodiment, the probabilistic model is a conditional restricted Boltzmann machine (CRBM).

In a further embodiment, applying the trained probabilistic model to a dataset corresponding to a randomized trial includes using the CRBM to generate a set of samples of a target population.

In a still further embodiment, the joint distribution is represented as: p(y, h|x)=Z⁻¹(x)e^−U(y,h|x), wherein y represents visible units of the CRBM, h represents hidden units of the CRBM, x represents feature units of the CRBM, Z(x) represents a normalization constant, and U(y, h|x) is the energy function; and wherein the normalization constant is represented as: Z(x)=∫dy Σ_He^−U(y,h|x).

In a further embodiment, the combination is trained by using gradient descent.

In another further embodiment, deriving, from the joint distribution, the energy function for the probabilistic model includes summing over states of hidden units of the CRBM.

In another embodiment, the measurement of regression accuracy is a minimum mean squared error prediction.

In still another embodiment, the approximation is a Laplace approximation.

In another embodiment, the mode of the conditional distribution is identified by the point prediction model; and the point prediction model includes at least one selected from the group consisting of a linear model, a neural network, a decision tree, and a differential model.

In still another embodiment, the loss function is a negative log-likelihood function.

One embodiment includes a non-transitory computer-readable medium for training a conditional generative model, wherein the program instructions are executable by one or more processors to perform a process. The process defines a joint distribution, wherein the joint distribution corresponds to a combination of a probabilistic model and a point prediction model, and wherein the point prediction model is configured to obtain a measurement of regression accuracy. The process derives an energy function for the joint distribution. The process obtains, from the energy function for the joint distribution, an approximation for a conditional distribution, wherein an output of the point prediction model is a parameter of the approximation. The process determines, from a loss function, at least one training parameter. The process trains the combination based on the at least one parameter to operate as a conditional generative model, wherein the conditional generative model follows the conditional distribution. The process applies the trained probabilistic model to a dataset corresponding to a randomized trial.

In a further embodiment, the probabilistic model is a conditional restricted Boltzmann machine (CRBM).

In a further embodiment, applying the trained probabilistic model to a dataset corresponding to a randomized trial includes using the CRBM to generate a set of samples of a target population.

In a further embodiment, the combination is trained by using gradient descent.

In another further embodiment, deriving, from the joint distribution, the energy function for the probabilistic model includes summing over states of hidden units of the CRBM.

In another embodiment, the measurement of regression accuracy is a minimum mean squared error prediction.

In still another embodiment, the approximation is a Laplace approximation.

In still another embodiment, the loss function is a negative log-likelihood function.

Additional embodiments and features are set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the specification or may be learned by the practice of the invention. A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings, which forms a part of this disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

The description and claims will be more fully understood with reference to the following figures and data graphs, which are presented as exemplary embodiments of the invention and should not be construed as a complete recitation of the scope of the invention.

FIGS. 1A-1C illustrate machine learning models that may be used to implement unsupervised learning processes in accordance with many embodiments of the invention.

FIG. 2 conceptually illustrates a process for deriving and applying conditional generative models in accordance with many embodiments of the invention.

FIG. 3 illustrates the application of Neural Conditional Restricted Boltzmann Machines, configured in accordance with multiple embodiments of the invention, to time-series.

FIG. 4 illustrates a system that provides for the gathering and distribution of data for modeling probability distributions in accordance with numerous embodiments of the invention.

FIGS. 5A-5B illustrates data processing configurations constructed in accordance with various embodiments of the invention.

DETAILED DESCRIPTION

Systems and methods configured in accordance with some embodiments of the invention may produce conditional generative models by combining two or more machine learning models. In accordance with many embodiments of the invention, preliminary point prediction models may be applied to produce expected values for the outcome y given the features x. In doing so, secondary, probabilistic, models including but not limited to Conditional Restricted Boltzmann Machines (CRBMs) may be applied to further determine the corresponding distribution through describing the variability around the point prediction models. In accordance with many embodiments, untrained point prediction models can be combined with probabilistic models while the two models are trained simultaneously. Additionally or alternatively, the combined machine learning models may be used to refine time-series.

Machine learning is one potential approach to modeling complex probability distributions. In the following description, many examples are described with reference to medical applications, but one skilled in the art will recognize that techniques described herein can be readily applied in a variety of different fields including (but not limited to) health informatics, image/audio processing, marketing, sociology, and lab research. One of the most pressing problems is that one often has little, or no, labeled data that directly addresses a particular question of interest. Consider the task of predicting how a patient will respond to an investigational therapeutic in a clinical trial. In a supervised learning setting, one would give the therapeutic to many patients and observe how each patient responds. Then, one would use this data to build a model that predicts how a new patient will respond to the therapeutic. For example, a nearest neighbor classifier would look through the pool of previously treated patients to find a patient that is most similar to the new patient, then it would predict the new patient's response based on the previously treated patient's response. However, supervised learning requires significant amounts of labeled data and, particularly where sample sizes are small or labeled data is not readily available, unsupervised learning is critical to the successful application of machine learning.

Many machine learning applications, such as computer vision, require the use of homogeneous information (e.g., images of the same shape and resolution), which must be pre-processed or otherwise manipulated to normalize the input and training data. However, in many applications, it is desirable to combine data of various types (e.g., images, numbers, categories, ranges, text samples, etc.) from many sources. For example, medical data can include a variety of different types of information from a variety of different sources, including (but not limited to) demographic information (e.g., a patient's age, ethnicity, etc.), diagnoses (e.g., binary codes that describe whether or not a patient has a particular disease), laboratory values (e.g., results from laboratory tests, such as blood tests), doctor's notes (e.g., handwritten notes taken by a physician or entered into a medical records system), images (e.g., x-rays, CT scans, MRIs, etc.), and 'omics data (e.g., data from DNA sequencing studies that describe a patient's genetic background, the expression of his/her genes, etc.). Some of these data are binary, some are continuous, and some are categorical. Integrating all of these different types and sources of data is critical, but treating a variety of data types with traditional approaches to machine learning is quite challenging. Typically, the data have to be heavily pre-processed so that all of the features used for machine learning are of the same type. Data pre-processing steps can take up a large portion of an analyst's time in training and implementing a machine learning model.

Many embodiments of the invention provide novel and innovative systems and methods for the use of heterogeneous, irregular, and unlabeled data to train and implement stochastic, unsupervised machine-learning models of complex probability distributions.

Boltzmann Machine Architectures

With many traditional machine learning techniques, supervised learning is used to train a model on a large set of labeled data to make predictions and classifications. However, in many cases, it is not feasible or possible to gather such large samples of labeled data. In many cases, the data cannot be readily labeled or there are simply not enough samples of an event to meaningfully train a supervised learning model. For example, clinical trials often face difficulties in gathering such labeled data. A clinical trial typically proceeds through three main phases. In phase I, the therapeutic is given to healthy volunteers to assess its safety. In phase II, the therapeutic is given to approximately 100 patients to obtain initial estimates for safety and efficacy. Finally, in phase III, the therapeutic is given to a few hundred to a few thousand patients to rigorously investigate the efficacy of the drug. Before phase II, there is no in-human data on the effect of the investigational drug for the desired indication, making supervised learning impossible. After phase II, there is some in-human data on the effect of the investigational drug, but the sample size is quite limited, rendering supervised learning techniques ineffective. For comparison, a phase II clinical trial may have 100-200 patients, whereas a typical application of machine learning in computer vision may use millions of labeled images. As with many situations with limited data, the lack of large labeled datasets for many important problems implies that health informatics must heavily rely on methods for unsupervised learning.

1. Restricted Boltzmann Machines (RBMs)

FIGS. 1A-1C illustrate machine learning models that may be used to implement unsupervised learning processes in accordance with many embodiments of the invention. Such models may include but are not limited to Restricted Boltzmann Machines (RBMs), as illustrated in FIG. 1A. RBMs may refer to bidirectional neural networks, where the neurons (also called units) are divided into two layers, a visible layer 104 and a hidden layer 102. The visible layer 104 (v) can describe the observed data. The hidden layer 102 (h) may include one or more set(s) of unobserved latent variables that capture the interactions between the visible units. RBMs may describe the joint probability distribution of v and h using an exponential form,

p(v, h)=Z⁻¹e^−E(v,h). (1)

Here, E(v, h) may be called the energy function, and be used to train the RBM. Additionally or alternatively, Z=∫ dvdhe^−E(v,h)may be called the partition function, and used for normalization of the energy function. In many embodiments, processes can use the integral operator, ∫ dx, to denote both standard integration or a sum over all of the elements in a discrete set.

In a traditional RBM, both the visible 104 and hidden 102 units may be binary. Each can only take on the values 0 or 1. The energy function in accordance with numerous embodiments of the invention can then be written as,

$\begin{matrix} E (v, h) = - \sum_{i} a_{i} v_{i} - \sum_{μ} b_{μ} h_{μ} - \sum_{i μ} W_{i μ} v_{i} h_{μ} & (2) \end{matrix}$

and/or, in vector notation, as E(v,h)=−a^Tv−b^Th−v^TWh, wherein a_i∈a and b_i∈b are unconstrained, real-valued learnable parameters. In accordance with numerous embodiments of the invention, visible units 104 may interact with the hidden units 102 through the weights, W. However, in accordance with some embodiments, there may not be visible-visible and/or hidden-hidden interactions. Instead the layers 102, 104 can be restricted to interactions between layers.

A key feature of an RBM configured in accordance with certain embodiments may be the ease of computing conditional probabilities for the layers,

$\begin{matrix} p (v ❘ h) = \prod_{i} \frac{e^{(a_{i} + \sum_{μ} W_{i μ} h_{μ}) v_{i}}}{1 + e^{a_{i} + \sum_{μ} W_{i μ} h_{μ}}} & (3) \end{matrix}$

$\begin{matrix} and, p (h ❘ v) = \prod_{μ} \frac{e^{(b_{μ} + \sum_{i} W_{i μ} v_{i}) h_{μ}}}{1 + e^{b_{μ} + \sum_{i} W_{i μ} v_{i}}} . & (4) \end{matrix}$

Similarly, it can easy to compute the conditional moments,

$\begin{matrix} v_{p (v ❘ h)} = \frac{1}{1 + e^{- (a + W h)}} & (5) \end{matrix}$

$\begin{matrix} and, h_{p (h ❘ v)} = \frac{1}{1 + e^{- (b + W^{T} v)}} . & (6) \end{matrix}$

RBMs can be trained by maximizing a log-likelihood function custom-character :=log p(v)_data=log ∫ dhp(v, h)_dataHere, ·_datamay denote an average over all of the observed samples. The derivative of the log-likelihood with respect to some parameter of the model θ is:

$\begin{matrix} \frac{\partial ℒ}{\partial θ} = \frac{\partial}{\partial θ} \log \int {dhp (v, h)}_{data} = \frac{\partial}{\partial θ} \log \int {dhe}_{data}^{- E (v, h)} - \frac{\partial}{\partial θ} \log Z = {\frac{\int {dhe}^{- E (v, h)} (- \frac{\partial E}{\partial θ})}{\int {dhe}^{- E (v, h)}}}_{data} - \frac{\int {dvdhe}^{- E (v, h)} (- \frac{\partial E}{\partial θ})}{\int {dvdhe}^{- E (v, h)}} = {\frac{\partial E}{\partial θ}}_{p (v, h)} - {\frac{\partial E}{\partial θ}}_{{p (h ❘ v)}_{data}} & (7) \end{matrix}$

In the standard formulation of an RBM, there are three parameters a, b, and W. The derivatives are:

$\begin{matrix} \frac{\partial ℒ}{\partial a} = v_{p (v, h)} - v_{data} \frac{\partial ℒ}{\partial b} = h_{p (v, h)} - h_{{p (h ❘ v)}_{data}} & (8) \end{matrix}$

$\frac{\partial ℒ}{\partial W} = {vh}_{p (v, h)}^{T} - {vh}_{{p (h ❘ v)}_{data}}^{T}$

Computing expectations from the joint distribution is generally computationally intractable. Therefore, statistics from the joint distribution including but not limited to the derivatives may be estimated using random sampling processes such as Markov Chain Monte Carlo (MCMC) processes.

2. Conditional Restricted Boltzmann Machines (CRBMs)

In accordance with many embodiments of the invention, a Conditional RBM (CRBM) may refer to an RBM where some of the parameters are not free are instead parametrized functions of a conditioning random variable (i.e., may be predicted by an RV with significant levels of precision). As such, newly obtained (temporal) information may be added to CBRMs as delayed units on the visible layer.

Turning now to the drawings, a CRBM configured in accordance with a number of embodiments of the invention is illustrated in FIG. 1B. The example of FIG. 1B shows a conditional RBM with a visible layer 108, 110, 112, and a hidden layer 106. As is the case for a traditional RBM, the nodes (also referred to herein as “units”) of the visible layer 108, 110, 112, may be connected to nodes of the hidden layer 106. In accordance with many embodiments of the invention, nodes of the hidden layer 106 (hidden units) may be used to create latent spaces to model the data distributions of interest. The visible layer, may refer to a composite layer comprised of several nodes (visible units) of various types (e.g., continuous, categorical, and/or binary).

In accordance with certain embodiments, a Conditional RBM (CRBM) can be defined using the energy function

$\begin{matrix} ℋ (v_{t + 1}, v_{t}, h) = ℋ (v_{t + 1}, h) + ℋ (v_{t}, h) + \sum_{μ} b_{μ} (h_{μ}) & (9) \end{matrix}$

where each component energy is of the same form as the RBM energy function above. Additionally or alternatively, within the energy function, v_tmay represent the visible units at time step t represented in vector form.

As a result, RBMs may be extended to include a notion of temporal history, in the form of CRBMs. In accordance with many embodiments of the invention, a single input vector may contain x features, which may be mapped to the visible random variables v_tcorresponding to the visible units 112 in the current time iteration (t). There are undirected connections between these visible units and the hidden units. Alone, these connections form an unaltered RBM for the input vector at time step t. However, the CRBM can also incorporate additional directed or undirected connections from the input vectors at the previous time steps (e.g., t−1). As a result, systems may define CRBMs as models encompassing RBMs whose probability distributions depend conditionally on the visible random variables (e.g., v_t−1110, v₀108) corresponding to the visible units of a number of previous time points. For CRBMs, the joint distribution of the (current) visible and hidden units 106 conditioned on the previous visible units 108, 110 can be reordered as:

$\begin{matrix} ℋ (v_{t + 1}, h ❘ v_{t}) = ℋ (v_{t + 1}, h) - \sum_{i μ} W_{i μ}^{(t)} a_{i}^{(t)} (v_{t, i}) b_{μ} (h_{μ}) . & (10) \end{matrix}$

3. Neural Conditional Restricted Boltzmann Machines (nCRBMs)

Systems and methods in accordance with many embodiments of the invention may be applied, through the conditional distributions including but not limited to Equation (1) to turning pre-existing point prediction models into conditional generative models. FIG. 1C illustrates the implementation of CRBMs (and RBMs) incorporating point prediction models in accordance with various embodiments of the invention. In particular, such cases may use the point prediction models 118 (ƒ_θ(x)) as CRBM components. In accordance with many embodiments of the invention, point prediction models may refer to various models capable of providing point estimates of outcomes based on input features, including but not limited to linear models, neural networks, decision trees, and/or other predictive model classifications. In using such point prediction models, systems may be applied to determining mathematical models of probability distribution (p(y|x)) for multivariate outcome vectors 122, 124, 126 (y_i∈y), conditioned on input features 114 (x), which may also be referred to as conditional generative models. In accordance with many embodiments of the invention, values in outcome vectors may be continuous, binary, ordinal, one-hot encoded categorical, and/or various other types of variables.

Additional CRBM components may include but are not limited to hidden layers 122 (e.g., h), weight matrices 116 (e.g., W), and precision matrices 120 (e.g., P). In accordance with a number of embodiments of the invention, weight matrices 116 and precision matrices 120 may, additionally or alternatively, be represented as functions. For example, P 120 may instead be a function of the input features 114 (P(x)). Additionally or alternatively, W 116 may instead be a function of the input features 114 (W(x)).

Systems in accordance with a number of embodiments of the invention may be configured to produce conditional generative models obtained from a combination of a probabilistic model (including but not limited to RBMs and/or CRBMs) and a point prediction model 118. The combination of probabilistic models and point prediction models 118 may, in this application, be referred to as “combinations”, “conditional generative models” and/or “neural Conditional Restricted Boltzmann Machines” (nCRBMs) in this application. Nevertheless, in accordance with certain embodiments of the invention, prospective point prediction models are not limited to neural networks.

A process for deriving and applying conditional generative models, obtained from nCRBMs configured in accordance with many embodiments of the invention, is illustrated in FIG. 2. Process 200 defines (205) a joint distribution corresponding to a combination of a probabilistic model and a point prediction model. In accordance with many embodiments of the invention, the combination of the probabilistic model and the point prediction model may correspond to the nCRBM. As indicated above, probabilistic models may include but are not limited to CRBMs. In accordance with several embodiments of the invention, point prediction models may be configured to derive conditional means (E[y|x]) in terms of a function ƒ_θ(x) parameterized by θ. In accordance with numerous embodiments of this invention, point predictions may be derived based on estimates including but not limited to the minimization of ordinary least squares (OLS) for y|x:

OLS(θ)=E_p_data_(y,x)[(y−ƒ_θ(x))²] (11)

In accordance with some embodiments, when the probabilistic model is a CRBM, the joint distribution of the resulting nCRBM, with visible units y, hidden units h, and feature units x, may be represented as: y_t+δt

p(y, h|x)=Z⁻¹(x)e^−U(y,h|x) (12)

where Z(x)=∫ dy Σ_He^−U(y,h|x)is the normalization constant.

Process 200 derives (210), from the joint distribution, an energy function for the conditional generative model. In accordance with many embodiments of the invention, process 200 may obtain the energy function for the conditional generative model ( custom-character (y|x)) from the energy function for the joint distribution. As such, the term U(y,h|x) may represent the energy function for the joint distribution and take the form:

$\begin{matrix} U (y, h ❘ x) = \frac{1}{2} {(y - f_{θ} (x))}^{'} P (y - f_{θ} (x)) - {(y - f_{θ} (x))}^{'} W h & (13 A) \end{matrix}$

where P, the precision matrix, is a diagonal positive definite matrix and W is a weight matrix. In accordance with a few embodiments, the hidden units (h) may take forms including but not limited to Ising spins where h_i=±1. When y is a continuous, real-valued vector, the energy function of y|x (i.e., the conditional generative model) can be derived by marginalizing and/or summing over the states of the hidden units (e.g., p(y|x)=Σ_Hp(y, h|x)). In doing so, the resulting marginal energy function ( custom-character (y|x)) may take the form:

$\begin{matrix} 𝒰 (y | x) = \frac{1}{2} {(y - f_{θ} (x))}^{'} P (y - f_{θ} (x)) - \log {Tr}_{h} e^{{(y - f_{θ} (x))}^{'} W h} = \frac{1}{2} {(y - f_{θ} (x))}^{'} P (y - f_{θ} (x)) - \sum_{j} \log \cosh (\sum_{i} {W_{ij} (y - f_{θ} (x))}_{i}) = \frac{1}{2} {(y - f_{θ} (x))}^{'} P (y - f_{θ} (x)) - \sum_{j} \log \cosh (w_{j^{'}} (y - f_{θ} (x))) = \frac{1}{2} {(y - f_{θ} (x))}^{'} P (y - f_{θ} (x)) - 1^{'} \log \cosh (W^{'} (y - f_{θ} (x))) . & (13 B) \end{matrix}$

Additionally or alternatively, P may instead be a function of the input features (x) parameterized by parameter ϕ (P_ϕ(x)). Additionally or alternatively, W may instead be a function of the input features (x) parameterized by parameter ψ (W_ψ(x)). In such cases:

$\begin{matrix} 𝒰 (y, h ❘ x) = \frac{1}{2} {(y - f_{θ} (x))}^{'} P_{ϕ} (x) (y - f_{θ} (x)) - {(y - f_{θ} (x))}^{'} W_{ψ} (x) h . & (13 C) \end{matrix}$

$\begin{matrix} 𝒰 (y ❘ x) = \frac{1}{2} {(y - f_{θ} (x))}^{'} P_{ϕ} (x) (y - f_{θ} (x)) - 1^{'} \log \cosh (W_{ψ}^{'} (x) (y - f_{θ} (x))) . & (13 D) \end{matrix}$

Process 200 obtains (215), from the energy function (for the conditional generative model) and the point prediction model, an approximation for a conditional distribution of the conditional generative model parameterized around the point prediction model output. In accordance with many embodiments of the invention, approximations may include but are not limited to Laplace approximations. For an example with a continuous y, taking the derivatives of the energy function with respect to y:

$\begin{matrix} \frac{\partial 𝒰}{\partial y} = P (y - f_{θ} (x)) - W \tanh (W^{'} (y - f_{θ} (x))), & (14 A) \end{matrix}$

$\begin{matrix} \frac{\partial^{2} 𝒰}{\partial y^{2}} = P - W diag (1 - \tanh^{2} (W^{'} (y - f_{θ} (x)))) W^{'} . & (15 A) \end{matrix}$

and evaluating the derivatives at y=ƒ_θ(x) may yield:

$\begin{matrix} {\frac{\partial 𝒰}{\partial y} ❘}_{y = f_{θ} (x)} = 0, & (14 B) \end{matrix}$

$\begin{matrix} {\frac{\partial^{2} 𝒰}{\partial y^{2}} ❘}_{y = f_{θ} (x)} = P - {WW}^{'} & (15 B) \end{matrix}$

This method may be used to conclude that y=ƒ_θ(x) is a local minimum of the energy function when y is continuous and real-valued and P−WW′ is positive definite. As a result, the Laplace approximation for y conditioned on x may take the form:

y|x˜
custom-character (ƒ_θ(x), (P−WW′)⁻¹). (16)

where P is a precision matrix and W is a weight matrix. As described above, both precision matrices (e.g., P) and weight matrices (e.g., W) may be represented as functions (e.g., P_ϕ(x), W_ψ(x)). In accordance with certain embodiments of the invention, y=ƒ_θ(x) may be a local minimum of the energy function when P_ϕ(x)−W_ψ(x) W_ψ(x)′ is positive definite.

In accordance with a number of embodiments of the invention, nCRBMs may normalize values in the weight matrix. For example, when W may be normalized by values including but not limited to

${(n_{y})}^{- \frac{1}{2}},$

wherein n_ymay represent the number of participants in a particular arm of a clinical trial for example. This normalization may be performed in order to better condition the matrix P−WW′ (where P may be P_ϕ(x), W may be W_ψ(x), and/or (P−WW′)⁻¹may be a covariance matrix representing the covariance of the residual noise process). Systems configured in accordance with multiple embodiments of the invention may add additional loss during training including but not limited to logarithms of the determinant of the P−WW′, logarithms of conditions numbers, and/or other constraints on positive definiteness.

Process 200 determines (220), from the approximation, terms used in training the combination (e.g., the nCRBM) to operate as a conditional generative model. In accordance with some embodiments, nCRBMs may be trained by minimizing loss functions that may include but are not limited to negative log-likelihood functions. In accordance with many embodiments of the invention, the negative log-likelihood function may take the form:

$\begin{matrix} ℒ = - E_{p_{data} (x)} [E_{p_{data}} (y | x) [\log {Z (x)}^{- 1} \sum_{H} e^{- U (y, h ❘ x)}]], & (17) \end{matrix}$

$= E_{p_{data} (x)} [\log Z (x) - E_{p_{data}} (y | x) [\log \sum_{H} e^{- U (y, h | x)}]] .$

The training of nCRBMs, configured in accordance with certain embodiments of the invention, is expounded upon below.

Process 200 trains (225) the conditional generative model. In many embodiments, energy-based models can be trained using gradient descent. Gradients used in training the conditional generative model may be obtained from various derivatives of the loss function. In minimizing the negative log-likelihood function, the derivative of Equation (17) with respect to a particular parameter ϕ may take the form:

$\begin{matrix} \frac{\partial L}{\partial ϕ} = E_{p_{data} (x)} [\log Z (x) - E_{p_{data} (y ❘ x)} [\log {Tr}_{h} e^{- U (y, h | x)}]] = E_{p_{data} (x)} [E_{p (y, h ❘ x)} [- \partial_{ϕ} U] - E_{p_{data} (y ❘ x)} [E_{p (h ❘ x, y)} [- \partial_{ϕ} U]]] = E_{p_{data} (x)} [E_{p_{data} (y ❘ x)} [E_{p (h ❘ x, y)} [\partial_{ϕ} U]] - E_{p (y | x)} [E_{p (h | x, y)} [\partial_{ϕ} U]] .] & (18) \end{matrix}$

which may be used to minimize the loss function and thereby optimize the conditional generative model. In training the conditional generative model, process 200 may need to determine the terms that optimize Equation (18). This may be done using information including but not limited to data from historical datasets. In accordance with certain embodiments of the invention, expected values for p(y|x) may be obtained and/or refined using obtained Monte Carlo samples. Additionally or alternatively, estimates for p(h|x, y) can be obtained from integrating h*p(h|x, y) over h, using Equations (12) and (13A). The result may be the following:

E
_p(h|x,y)
[h]=tanh(W′(y−ƒ_θ(x))). (19)

In doing so, process 200 may derive values for P and W in order to further improve the approximation of the conditional distribution p(y|x). As the precision matrix, P may be diagonal and positive definite. As such, systems in accordance with some embodiments may define P in terms of a vector b, a learned parameter, using P=diag(e^b). In such a case, the gradients for vectors b and W may take the form:

and be used to train the conditional generative model accordingly.

Systems and methods configured in accordance with various embodiments, may facilitate the training of point prediction components and the RBM component of an nCRBM simultaneously (which may be referred to as “end-to-end training”). When point prediction models ƒ_θ(x) are differentiable with respect to the parameters θ, the above gradient formulas may take the forms:

$\begin{matrix} \frac{\partial ℒ}{\partial b} = \frac{1}{2} e^{b} ⊙ (E_{p_{data} (x)} [E_{p_{data} (y ❘ x)} [{(y - f_{θ} (x))}^{2}]] - E_{p_{data} (x)} [E_{p (y | x)} [{(y - f_{θ} (x))}^{2}]]), & (20 B) \end{matrix}$

$\begin{matrix} \frac{\partial ℒ}{\partial W} = E_{p_{data} (x)} [E_{p_{data} (y ❘ x)} [- (y - f_{θ} (x)) \tanh ({(y - f_{θ} (x))}^{'} W)]] - E_{p_{data} (x)} [E_{p (y | x)} [- (y - f_{θ} (x)) \tanh ({(y - f_{θ} (x))}^{'} W)]], & (21 B) \end{matrix}$

$\begin{matrix} \frac{\partial ℒ}{\partial θ} = E_{p_{data} (x)} [E_{p_{data} (y | x)} [- \frac{\partial f_{θ}^{'} (x)}{\partial θ} (e^{b} ⊙ (y - f_{θ} (x)) - W \tanh (W^{'} (y - f_{θ} (x))))]] - E_{p_{data} (x)} [E_{p (y | x)} [- \frac{\partial f_{θ}^{'} (x)}{\partial θ} (e^{b} ⊙ (y - f_{θ} (x)) - W \tanh (W^{'} (y - f_{θ} (x))))]] & (22) \end{matrix}$

allowing all the parameters of the conditional generative model to be learned via stochastic gradient descent.

In accordance with numerous embodiments, as mentioned above, P and W may operate as parameters that depend on x. As such, P(x) and W(x) may take the form of parameterized functions of the features x. Additionally or alternatively, process 200 may apply Equation 7 and/or Equation 5 to compute the gradients with respect to the parameters for training. In doing so, process 200 may define the general energy function,

$\begin{matrix} 𝒰 (y | x) = \frac{1}{2} {(y - f_{θ} (x))}^{'} P_{ϕ} (x) (y - f_{θ} (x)) - 1^{'} \log \cosh (W_{ψ}^{'} (x) (y - f_{θ} (x))), & (23) \end{matrix}$

and use automatic differentiation to compute the gradients with respect to the parameters θ, ϕ, and ψ.

Process 200 may, in certain cases, apply (230) the trained conditional generative model to a dataset corresponding to a randomized trial. In accordance with numerous embodiments, y may correspond to certain randomized trial treatments, while x corresponds to pre-treatment covariates. In accordance with various embodiments, averages from the model conditional distribution can be estimated using Monte Carlo samples from the conditional distribution. As such, any Monte Carlo algorithm can be used for this. Additionally or alternatively, sampling methods including but not limited to Gibbs sampling, Persistent Contrastive Divergence sampling, and Gibbs-Langevin sampling may be applied to obtain these averages.

Systems and methods in accordance with many embodiments of the invention may be used to train nCRBMs. In particular, systems may use negative log-likelihood functions to train nCRBMs.

custom-character (θ, ϕ, ψ)=−E_data[log p(y|x)], (24)

based on the assumptions that:

p(y|x)=Z⁻¹(x)e^−U(y|x) (25A)

and

Z(x)=∫dy Σ_He^−U(y|x). (25B)

In accordance with several embodiments of the invention, based on the above loss function, gradients may be derived according to particular model parameters (e.g., θ, ϕ, ψ) in the following form:

$\begin{matrix} \frac{\partial ℒ}{\partial θ} = - \frac{\partial}{\partial θ} E_{data} [\log p (y | x)] = E_{data} [\frac{\partial}{\partial θ} U (y | x)] + E_{data} [\frac{\partial}{\partial θ} \log Z (x)] = E_{data} [\frac{\partial}{\partial θ} U (y | x)] + E_{data} [\frac{\int d y \frac{\partial}{\partial θ} e^{- U (y | x)}}{\int d y e^{- U (y | x)}}] = E_{data} [\frac{\partial}{\partial θ} U (y | x)] - E_{data} [E_{p (y ❘ x)} [\frac{\partial}{\partial θ} U (y | x)]] & (26) \end{matrix}$

where the first term

$E_{data} [\frac{\partial}{\partial θ} U (y | x)]$

Herein “the positive phase”) may be obtained by taking the gradient of the energy function and averaging over observed (x, y) values. The positive phase integral may be comparatively easy to estimate using seeded Markov Chain Monte Carlo samples from the data distribution. Additionally or alternatively, the second term

$E_{data} [E_{p (y ❘ x)} [\frac{\partial}{\partial θ} U (y | x)]]$

may be obtained through deriving gradients of the energy function and averaging that value over observed x values and/or generated y|x values.

In accordance with certain embodiments of the invention, backpropagation may be used to derive gradients in situations where P_ϕ(x), W_ψ(x), and ƒ_θ(x) are differentiable functions of ϕ,ψ and θ, respectively. In accordance with various embodiments of the invention, values for, P_ϕ(x), may be configured to remain non-negative through reparameterizations to learn log(P_ϕ(x)) in place of P_ϕ(x).

Systems configured in accordance with some embodiments of the invention may apply penalties including but not limited to L2 penalties to functions (e.g., ƒ_θ(x), P_ϕ(x), and W_ψ(x)). In accordance with certain embodiments, systems may set the penalty on W_ψ(x) to be larger than ƒ_θ(x) and P_ϕ(x). For example, L2 penalties of 1.0 on W_ψ(x) and 0.5 on ƒ_θ(x) and P_θ(x) may be utilized in practice across a wide variety of problems. Implementation of such configurations are referenced in disclosure “Neural Boltzmann Machines” by Alex Lang et al., incorporated by reference in its entirety.

Systems and methods configured in accordance with a number of embodiments of the invention, may be trained in order to update model parameters. Training may involve sampling minibatches of data (e.g., (x_i, y_i)). The obtained samples may be used to perform initial backward passes to obtain values for U(y|x) from using at least one of Equations (13B) and (13D). Additionally, k-steps of block Gibbs sampling may be used to generate {tilde over (y)}_iconditioned on x_i. When these values are obtained, additional backward passes may be used to obtain values for U(y|x), again from using at least one of Equations (13B) and (13D). The first term in the gradient

$(E_{data} [\frac{\partial}{\partial θ} U (y | x)])$

may be estimated by performing the initial backward passes. Additionally or alternatively, the second term in the gradient

$(E_{data} [E_{p (y ❘ x)} [\frac{\partial}{\partial θ} U (y | x)]])$

may be estimated by sampling the aforementioned values for {tilde over (y)}_iconditioned on x_iand using those samples to estimate the integrals.

FIG. 3 illustrates the application of nCRBMs configured in accordance with many embodiments of the invention, to learning conditional generative models of time-series. Many problems (e.g., modeling patient trajectories) addressed in accordance with some embodiments of the invention may utilize the capacity to generate time-series. In doing so, they may generate a sequence of states L_t₀_:t_ƒ={l(t)}_t=t₀^t=t^ƒ330. In one example, L_t₀_:t_ƒ={l(t)}_t=t₀^t=t^ƒmay denote a realization of a stochastic process 330 from time t₀to time t_ƒ. Additionally or alternatively, c may denote a vector of time-independent context variables 320. In the example, the conditional generative model for the time-series can be trained by learning the transition density p(l(t+δt)|L_t₀_:t, c, δt). In accordance with some embodiments of the invention, transition density may be represented with nCRBMs.

In accordance with many embodiments of the invention, ƒ_θ(L_t₀_:t, c, t, δt) may represent a point predictor 350 for l(t+δt) 380. As such, point predictors may take the form of deterministic functions that predict the states 380 of stochastic processes at time t+δt from the entire history up until time t_ƒ. As suggested above, P 360 and/or W 340 may be represented as parameterized functions. In some instances, systems configured in accordance with various embodiments may define P_ϕ(L_t₀_:t, c, t, δt) and W_ψ(L_t₀_:t, c, t, δt) as parameterized functions of the entire temporal history. As such, ƒ_θ, P_ϕ and/or W_ψ may input values including but not limited to L_t₀_:t_ƒ={l(t)}_t=t₀^t=t^ƒ, c, t={t₁, . . . , t_ƒ}, and/or δt. Additionally or alternatively, simple approximations may assume that the covariance matrix approximately scales like a diffusion, wherein (P−WW′)⁻¹˜δt. In such an example, systems may determine that P˜1/δt and ˜1/√{square root over (δt)}. Other approximations used may include but are not limited to general approximations. Regardless, when diffusion approximations are used the resulting joint energy function may take the form:

$\begin{matrix} U (l (t + δ t), h ❘ L_{t_{0} : t}, c) = \frac{1}{2} {(l (t + δ t) - f_{θ} (L_{t_{0} : t}, c, t, δ t))}^{' \frac{P}{δ t}} (l (t + δ t) - f_{θ} (L_{t_{0} : t}, c, t, δ t)) - {(l (t + δ t) - f_{θ} (L_{t_{0} : t}, c, t, δ t))}^{' \frac{W}{\sqrt{dt}}} h, & (27 A) \end{matrix}$

which corresponds to the marginal energy function:

$\begin{matrix} U (l (t + δt) ❘ L_{t_{0} : t}, c) = \frac{1}{2} {(l (t + δ t) - f_{θ} (L_{t_{0} : t}, c, t, δ t))}^{'} \frac{P}{δ t} (l (t + δ t) - f_{θ} (L_{t_{0} : t}, c, t, δ t)) - 1^{'} \log \cosh (\frac{W^{'}}{\sqrt{dt}} (l (t + δ t) - f_{θ} (L_{t_{0} : t}, c, t, δ t))) . & (27 B) \end{matrix}$

With this energy function derived, systems may apply Equation 7 to compute the gradients with respect to the parameters for training.

While specific modules for modeling complex probability distributions are described above, any of a variety of processes can be utilized to generate models as appropriate to the requirements of specific applications. In certain embodiments, steps may be executed or performed in any order or sequence not limited to the order and sequence shown and described. In a number of embodiments, some of the above steps may be executed or performed substantially simultaneously where appropriate or in parallel to reduce latency and processing times. In some embodiments, one or more of the above steps may be omitted.

System for Modeling Probability Distributions

A system that provides for the gathering and distribution of data for modeling probability distributions in accordance with some embodiments of the invention is shown in FIG. 4. Network 400 includes a communications network 460. The communications network 460 is a network such as the Internet that allows devices connected to the network 460 to communicate with other connected devices. Server systems 410, 440, and 470 are connected to the network 460. Each of the server systems 410, 440, and 470 is a group of one or more servers communicatively connected to one another via internal networks that execute processes that provide cloud services to users over the network 460. For purposes of this discussion, cloud services are one or more applications that are executed by one or more server systems to provide data and/or executable applications to devices over a network. The server systems 410, 440, and 470 are shown each having three servers in the internal network. However, the server systems 410, 440 and 470 may include any number of servers and any additional number of server systems may be connected to the network 460 to provide cloud services. In accordance with various embodiments of this invention, a network that uses systems and methods that model complex probability distributions in accordance with an embodiment of the invention may be provided by a process (or a set of processes) being executed on a single server system and/or a group of server systems communicating over network 460.

Users may use personal devices 480 and 420 that connect to the network 460 to perform processes for providing and/or interacting with a network that uses systems and methods that model complex probability distributions in accordance with various embodiments of the invention. In the shown embodiment, the personal devices 480 are shown as desktop computers that are connected via a conventional “wired” connection to the network 460. However, the personal device 480 may be a desktop computer, a laptop computer, a smart television, an entertainment gaming console, or any other device that connects to the network 460 via a “wired” connection. The mobile device 420 connects to network 460 using a wireless connection. A wireless connection is a connection that uses Radio Frequency (RF) signals, Infrared signals, or any other form of wireless signaling to connect to the network 460. In FIG. 4, the mobile device 420 is a mobile telephone. However, mobile device 420 may be a mobile phone, Personal Digital Assistant (PDA), a tablet, a smartphone, or any other type of device that connects to network 460 via wireless connection without departing from this invention.

A data processing element for training and utilizing a stochastic model in accordance with a number of embodiments is illustrated in FIG. 5A. In various embodiments, data processing element 500 is one or more of a server system and/or personal devices within a networked system similar to the system described with reference to FIG. 4. Data processing element 500 includes a processor (or set of processors) 510, network interface 520, and memory 530. The network interface 520 is capable of sending and receiving data across a network over a network connection. In a number of embodiments, the network interface 520 is in communication with the memory 530. In several embodiments, memory 530 is any form of storage configured to store a variety of data, including, but not limited to, a data processing application 540, data files 550, and model parameters 560. Data processing application 540 in accordance with some embodiments of the invention directs the processor 510 to perform a variety of processes, such as (but not limited to) using data from data files 550 to update model parameters 560 in order to model complex probability distributions.

A data processing application in accordance with a number of embodiments of the invention is illustrated in FIG. 5B. In this example, data processing application 540 includes a data gathering engine 541, database 542, a model trainer 543, a generative model 544, a discriminator model 545, and a simulator engine 546. Model trainer 543 includes a schema processor 547 and a sampling engine 548. Data processing applications in accordance with many embodiments of the invention process data to train stochastic models that can be used to model complex probability distributions.

Databases in accordance with various embodiments of the invention store data for use by data processing applications, including (but not limited to) input data, pre-processed data, model parameters, schemas, output data, and simulated data. In some embodiments, databases are located on separate machines (e.g., in cloud storage, server farms, networked databases, etc.) from a data processing application.

Model trainers in accordance with a number of embodiments of the invention are used to train generative and/or discriminator models. In many embodiments, model trainers utilize schema processors to build the generator and/or discriminator models based on schemas that are defined for the various data available to the system. Schema processors in accordance with some embodiments of the invention build composite layers for a generative model (e.g., restricted Boltzmann machine) that are made up of several different layers for handling different types of data in different ways. In some embodiments, model trainers train the generative and discriminator models by optimizing a compound objective function based on log-likelihood and adversarial objectives. Training generative models in accordance with certain embodiments of the invention may utilize sampling engines to draw samples from the models to measure the probability distributions of the data and/or the models. Various methods for sampling from such models to train and/or draw generated samples from a model are described in greater detail below.

In many embodiments, generative models are trained to model complex probability distributions, which can be used to generate predictions/simulations of various probability distributions. Discriminator models discriminate between data-based samples and model-generated samples based on the visible and/or hidden states.

Simulator engines in accordance with several embodiments of the invention are used to generate simulations of complex probability distributions. In some embodiments, simulator engines are used to simulate patient populations, disease progressions, and/or predicted responses to various treatments. Simulator engines in accordance with several embodiments of the invention use a sampling engine for drawing samples from the generative models that simulate the probability distribution of the data.

As described above, as a part of the data-gathering process, the data in accordance with several embodiments of the invention is pre-processed in order to simplify the data. Unlike other pre-processing which is often highly manual and specific to the data, this can be performed automatically based on the type of data, without additional input from another person.

Applications and methods in accordance with various embodiments of the invention are not limited to modeling complex probability distributions or implementing generative models. Accordingly, it should be appreciated that the data collection capabilities of any system, application, and/or element described herein can also be implemented outside the context of generative modelling. Various systems and methods for configuring probability distributions in accordance with numerous embodiments of the invention are discussed further below.

Although specific methods of producing conditional generative models are discussed above, many different methods of model production can be implemented in accordance with many different embodiments of the invention. It is, therefore, to be understood that the present invention may be practiced in ways other than specifically described, without departing from the scope and spirit of the present invention. Thus, embodiments of the present invention should be considered in all respects as illustrative and not restrictive. Accordingly, the scope of the invention should be determined not by the embodiments illustrated, but by the appended claims and their equivalents.

Systems and techniques for producing conditional generative models, are not limited to use for randomized controlled trials. Accordingly, it should be appreciated that applications described herein can be implemented outside the context of generative model architecture and in contexts unrelated to RCTs. Moreover, any of the systems and methods described herein with reference to FIGS. 1A-5B can be utilized within any of the generative models described above.

Systems and Methods for Supplementing Data With Generative Models

Information

Publication Number

Date Filed

Date Published

Inventors

Original Assignees

CPC

International Classifications

Abstract

Description

Claims

CROSS-REFERENCE TO RELATED APPLICATIONS

Provisional Applications (1)