The present disclosure relates generally to systems and methods for computer learning that can provide improved computer performance, features, and uses. More particularly, the present disclosure relates to systems and methods to learn deep latent variable models for improved performance.
Deep generative models have achieved great successes in many domains, such as image generation, image recovery, image representation, image disentanglement, anomaly detection, etc. Such models typically include simple and expressive generator networks, which are latent variable models assuming that each observed example is generated by a low-dimensional vector of latent variables, and the latent vector follows a non-informative prior distribution, such as Gaussian distribution. Since high dimensional visual data (e.g., images) usually lie on low-dimensional manifolds embedded in the high-dimensional space, learning latent variable models of visual data is of fundamental importance in the field of computer vision for the sake of unsupervised representation learning. The challenge mainly comes from the inference of the latent variables for each observation, which typically relies on Markov chain Monte Carlo (MCMC) methods to draw fair samples from the analytically intractable posterior distribution (i.e., the conditional distribution of the latent variables given the observed example). Since the posterior distribution of the latent variables is parameterized by a highly non-linear deep neural network, the MCMC-based inference may suffer from non-convergence and inefficiency problems, thus affecting the accuracy of the model parameter estimation.
Accordingly, what is needed are systems and methods to learn deep latent variable models with improved efficiency.
References will be made to embodiments of the disclosure, examples of which may be illustrated in the accompanying figures. These figures are intended to be illustrative, not limiting. Although the disclosure is generally described in the context of these embodiments, it should be understood that it is not intended to limit the scope of the disclosure to these particular embodiments. Items in the figures may not be to scale.
In the following description, for purposes of explanation, specific details are set forth in order to provide an understanding of the disclosure. It will be apparent, however, to one skilled in the art that the disclosure can be practiced without these details. Furthermore, one skilled in the art will recognize that embodiments of the present disclosure, described below, may be implemented in a variety of ways, such as a process, an apparatus, a system, a device, or a method on a tangible computer-readable medium.
Components, or modules, shown in diagrams are illustrative of exemplary embodiments of the disclosure and are meant to avoid obscuring the disclosure. It shall be understood that throughout this discussion that components may be described as separate functional units, which may comprise sub-units, but those skilled in the art will recognize that various components, or portions thereof, may be divided into separate components or may be integrated together, including, for example, being in a single system or component. It should be noted that functions or operations discussed herein may be implemented as components. Components may be implemented in software, hardware, or a combination thereof.
Furthermore, connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It shall also be noted that the terms “coupled,” “connected,” “communicatively coupled,” “interfacing,” “interface,” or any of their derivatives shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections. It shall also be noted that any communication, such as a signal, response, reply, acknowledgement, message, query, etc., may comprise one or more exchanges of information.
Reference in the specification to “one or more embodiments,” “preferred embodiment,” “an embodiment,” “embodiments,” or the like means that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in at least one embodiment of the disclosure and may be in more than one embodiment. Also, the appearances of the above-noted phrases in various places in the specification are not necessarily all referring to the same embodiment or embodiments.
The use of certain terms in various places in the specification is for illustration and should not be construed as limiting. A service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated. The terms “include,” “including,” “comprise,” “comprising,” or any of their variants shall be understood to be open terms, and any lists that follow are examples and not meant to be limited to the listed items. A “layer” may comprise one or more operations. The words “optimal,” “optimize,” “optimization,” and the like refer to an improvement of an outcome or a process and do not require that the specified outcome or process has achieved an “optimal” or peak state. The use of memory, database, information base, data store, tables, hardware, cache, and the like may be used herein to refer to a system component or components into which information may be entered or otherwise recorded.
In one or more embodiments, a stop condition may include: (1) a set number of iterations have been performed; (2) an amount of processing time has been reached; (3) convergence (e.g., the difference between consecutive iterations is less than a first threshold value); (4) divergence (e.g., the performance deteriorates); (5) an acceptable outcome has been reached; and (6) all of the data has been processed.
One skilled in the art shall recognize that: (1) certain steps may optionally be performed; (2) steps may not be limited to the specific order set forth herein; (3) certain steps may be performed in different orders; and (4) certain steps may be done concurrently.
Any headings used herein are for organizational purposes only and shall not be used to limit the scope of the description or the claims. Each reference/document mentioned in this patent document is incorporated by reference herein in its entirety.
It shall be noted that any experiments and results provided herein are provided by way of illustration and were performed under specific conditions using a specific embodiment or embodiments; accordingly, neither these experiments nor their results shall be used to limit the scope of the disclosure of the current patent document.
Deep generative models have achieved great successes in many domains, such as image generation, image recovery, image representation, image disentanglement, anomaly detection, etc. Such models typically include simple and expressive generator networks, which are latent variable models assuming that each observed example is generated by a low-dimensional vector of latent variables, and the latent vector follows a non-informative prior distribution, such as Gaussian distribution.
Since high dimensional visual data (e.g., images) usually lie on low-dimensional manifolds embedded in the high-dimensional space, learning latent variable models of visual data is of fundamental importance in the field of computer vision for the sake of unsupervised representation learning. However, learning such a model is challenging due to non-linear parameterization of g.
Variational auto-encoders (VAEs) and generative adversarial networks (GANs) are currently popular ways to train a deep latent variable model. Both train the generator by recruiting an extra model to assist in training, and discard it at test time. To avoid inefficient MCMC sampling from the posterior, variational inference becomes an attractive alternative by approximating the intractable posterior via a tractable network. Despite the growing prevalence and popularity of the VAE, its drawbacks are increasingly obvious. First, it parameterizes the intrinsic iterative inference process by an extrinsic feedforward inference model 220. These extra parameters due to the reparameterization have to be estimated together with those of the generator network. Second, such joint training is accomplished by maximizing the variational lower bound. Thus, the accuracy of the VAE heavily depends on the accuracy of the inference model as an approximation of the true posterior distribution. Only when the Kullback-Leibler (KL) divergence between the inference distribution and the posterior distribution is equal to zero is variational inference equivalent to the desired maximum likelihood estimation. This goal is usually infeasible in practice. Third, extra effort is required in designing the inference model of the VAE, especially for generators that have complicated dependency structures with the latent variables; e.g., some proposed a top-down generator with multiple layers of latent variables, and some proposed dynamic generators with time sequences of latent variables. It is not a simple task to design inference models that infer latent variables for such models. An arbitrary design of the inference model cannot guarantee the performance. The GAN approach of training involves a discriminator 230 besides the generator, and thus has two sets of parameters during training. Mode collapse may happen during the training process.
Furthermore, an effective inference model is hard to design for the GAN approach.
In the present disclosure, the idea of reparameterizing the inference process is totally abandoned. Instead, embodiments of an MCMC-based inference for training deep latent variable models are disclosed. Specifically, embodiments of a short-run MCMC, such as a short-run Langevin dynamics, are used to perform the inference of the latent vectors during training. However, considering that the convergence of finite-step Langevin dynamics in each iteration might be a concern, embodiments of optimal transport (OT) are used to correct bias that may exist in such a short-run MCMC. The OT may be adopted to transform an arbitrary probability distribution to a desired distribution with a minimum transport cost. Thus, the OT cost may be used to measure the difference between two probability distributions. In one or more embodiments of the present disclosure, the short-run MCMC is treated as a learned flow model whose parameters are from the latent variable model. Bias of the short-run MCMC may be corrected by performing an optimal transport from the result distribution produced by the short-run MCMC to the prior distribution. Such an operation is to minimize the OT cost between the inference distribution and the prior distribution, in which parameters in the flow model are updated instead of optimized. With the corrected inference output, the parameters of the latent variable model may be updated more accurately.
There are several advantages to using the disclosed short-run MCMC inference with OT correction: (1) efficiency: the learning and inference of the model are efficient with a short-run MCMC; (2) convenience: the approximate inference model represented by the short-run MCMC is automatic in the sense that no separate inference model needs to be designed or trained, and both bottom-up inference and top-down generation are governed by the same set of parameters; and (3) accuracy: the optimal transport corrects the errors of the non-convergent short-run MCMC inference, thus improving the accuracy of the model parameter estimation.
Contributions of the present patent disclosure include at least the following: (1) embodiments are disclosed to train a deep latent variable model by a non-convergent short-run MCMC inference with OT correction; (2) embodiments of a semi-discrete OT methodology are extended to approximate the one-to-one map between the inferred latent vectors and the samples drawn from the prior distribution; and (3) strong empirical results are provided in various experiments to verify the effectiveness of the disclosed strategy to train deep latent variable models.
1. Variational Inference
VAE is a popular method to learn a generator network by simultaneously training a tractable inference network to approximate the intractable posterior distribution of the latent variables. In VAE, one needs to design an inference model for the latent variables, which is a non-trivial task for a generator network with a complex architecture. In the present patent document, by contrast, the disclosed method does not rely on an extra inference model to assist the training. It performs inference by Langevin sampling from the posterior distribution, followed by an optimal transport correction.
Alternating back-propagation algorithm. The maximum likelihood learning of the generator network, including its dynamic version, may be achieved by the alternating backpropagation (ABP) algorithm without resorting to an inference model. The ABP algorithm trains the generator model by alternating the following two steps: (1) inference step: inferring the latent variables by Langevin sampling from the posterior distribution, and (2) learning step: updating the model parameters based on the training data and the inferred latent variables by gradient descent. Both steps compute the gradients with the help of back-propagation. The ABP algorithm has been successfully applied to saliency detection, zero-shot learning, and disentangled representation learning, etc.
2. Optimal Transport
Optimal transport (OT) is used to compute the distance between two measures and is able to push forward the source distribution to the target distribution. Recently, OT has been widely used in the generative models to help generate high quality samples. For example, by replacing the original KL-divergence in the GAN models with the W1 distance, some proposed the Wasserstein GAN (WGAN) model to achieve better convergence and generate higher quality samples. Some proposed the Wasserstein VAE that minimizes the Wasserstein distance between the inference model and the posterior distribution. Besides the Wasserstein distance, the optimal transport is also used to transport a simple uniform distribution to the complex latent feature distribution extracted by the auto-encoder for image generation.
Let I be a D-dimensional observed data example, such as an image. Let z be the d-dimensional vector of continuous latent variables. Generalizing from the traditional factor analysis model, the generator network assumes the observed example I is generated from a latent vector z by a non-linear transformation I=gθ(z)+ϵ, where gθ is a top-down convolutional neural network (sometimes called a deconvolutional neural network) with parameters θ that comprise all trainable weights and bias terms in the network, ϵ~N(0, σ²I_D) is the observation error, and z~N(0, I_d). I_d and I_D are the d-dimensional and D-dimensional identity matrices, respectively, and it is assumed that d<<D. The generator network may essentially be a non-linear latent variable model that defines the joint distribution of (I, z),
$$p_\theta(I, z) = p_\theta(I \mid z)\, p(z) \tag{1}$$
where it is assumed that the prior distribution p(z)=N(0, I_d) and pθ(I|z)=N(gθ(z), σ²I_D). The standard deviation σ takes an assumed value. Following Bayes' rule, the marginal distribution pθ(I)=∫pθ(I, z)dz and the posterior distribution pθ(z|I)=pθ(I, z)/pθ(I) may be obtained.
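The generative process of Equation (1) amounts to ancestral sampling: draw z from the prior, push it through the generator, and add Gaussian observation noise. The following is a minimal sketch, assuming a toy linear map `W` in place of the convolutional network gθ. The names `sample_prior`, `generator`, and `sample_observation`, as well as the weights and σ value, are illustrative assumptions and not part of the disclosure.

```python
import random

def sample_prior(d, rng):
    """Draw z ~ N(0, I_d) as a list of d independent standard normals."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def generator(W, z):
    """Toy stand-in for g_theta: a linear map W (D rows, d columns) applied to z."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in W]

def sample_observation(W, sigma, rng):
    """Ancestral sampling from p_theta(I, z) = p_theta(I | z) p(z)."""
    d = len(W[0])
    z = sample_prior(d, rng)                              # z ~ p(z)
    mean = generator(W, z)                                # g_theta(z)
    I = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]   # I ~ N(g_theta(z), sigma^2 I_D)
    return I, z

rng = random.Random(0)
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical weights, D = 3 and d = 2
I, z = sample_observation(W, sigma=0.3, rng=rng)
```

Here d is much smaller than D only in spirit; a real embodiment would replace `generator` with a deep top-down network.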
Given a set of training examples {Ii, i=1, ..., n}~pdata(I), where pdata(I) is the unknown data distribution, pθ may be trained by maximizing the log-likelihood of the training samples:
$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(I_i) \tag{2}$$
which is equivalent to the minimization of KL(pdata∥pθ) when the number of training examples n is large enough.
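In one or more embodiments, the maximization of the log-likelihood function presented in Equation (2) may be accomplished by a gradient ascent algorithm that iterates
$$\theta_{t+1} = \theta_t + \gamma_t \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \log p_\theta(I_i) \Big|_{\theta=\theta_t} \tag{3}$$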
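where γt is the learning rate depending on time t and the gradient of the log probability is given by:
$$\nabla_\theta \log p_\theta(I) = \mathbb{E}_{p_\theta(z \mid I)}\left[\nabla_\theta \log p_\theta(I, z)\right] \tag{4}$$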
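To compute ∇θ log pθ(I) in Equation (4), it is necessary to estimate ∇θ log pθ(I, z). According to Equation (1), the logarithm of the joint distribution is given by:
$$\log p_\theta(I, z) = -\frac{1}{2\sigma^2} \lVert I - g_\theta(z) \rVert^2 - \frac{1}{2} \lVert z \rVert^2 + \text{constant} \tag{5}$$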
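where the constant term is independent of z or θ, thus
$$\nabla_\theta \log p_\theta(I, z) = \frac{1}{\sigma^2} \left(I - g_\theta(z)\right) \nabla_\theta g_\theta(z)$$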
where ∇θgθ(z) can be efficiently computed by back-propagation.
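The log joint density and its parameter gradient can be checked numerically on a small example. The sketch below assumes a toy linear generator gθ(z)=Wz, so that θ is the matrix `W` and the gradient has a closed form; a deep network would obtain the same quantity by back-propagation. The function names are hypothetical.

```python
def generator(W, z):
    """Toy stand-in for g_theta: linear map W z (an assumption for illustration)."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in W]

def log_joint(I, z, W, sigma):
    """log p_theta(I, z) up to the additive constant that depends on neither z nor theta."""
    g = generator(W, z)
    recon = sum((i - gi) ** 2 for i, gi in zip(I, g))   # ||I - g_theta(z)||^2
    prior = sum(zj ** 2 for zj in z)                    # ||z||^2
    return -recon / (2.0 * sigma ** 2) - 0.5 * prior

def grad_theta_log_joint(I, z, W, sigma):
    """(1 / sigma^2) (I - g_theta(z)) d g_theta / d theta.

    For g_theta(z) = W z this is the outer product of the residual and z.
    """
    g = generator(W, z)
    return [[(i - gi) * zj / sigma ** 2 for zj in z] for i, gi in zip(I, g)]
```

The prior term contributes nothing to the parameter gradient because it does not depend on θ.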
1. Long-Run Langevin Dynamics Embodiments
To learn the model parameter θ by using Equation (3), the key is to compute the intractable expectation term in Equation (4), which can be achieved by first drawing samples from the posterior pθ(z|I) and then using the Monte Carlo sample average to approximate it. Given a step size s>0 and an initial value z0, Langevin dynamics, a gradient-based MCMC method, may produce samples from the posterior density pθ(z|I) by recursively computing
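In Equation (6), k indexes the time step of the Langevin dynamics, and ξk~N(0, I_d) is a random noise diffusion term. Also,
$$\nabla_z \log p_\theta(z \mid I) = \frac{1}{\sigma^2} \left(I - g_\theta(z)\right) \nabla_z g_\theta(z) - z$$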
where ∇zgθ(z) may be efficiently computed by back-propagation.
In one or more embodiments, K is used to denote the number of Langevin steps. When s→0 and K→∞, no matter what an initial distribution of z0 is, zK will converge to the posterior distribution pθ(z|I) and become a fair sample from pθ(z|I).
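For reference, one standard discretization consistent with this description can be written as follows. This is a sketch that assumes the common Euler-Maruyama convention, in which a drift step of size s is paired with noise of scale √(2s); the exact scaling used in the disclosed embodiments is an assumption here.

```latex
z_{k+1} = z_k + s\, \nabla_z \log p_\theta(z_k \mid I) + \sqrt{2s}\, \xi_k,
\qquad \xi_k \sim \mathcal{N}(0, I_d), \quad k = 0, 1, \ldots, K-1
```

Because log pθ(z|I) = log pθ(I, z) − log pθ(I) and the last term does not depend on z, the drift ∇z log pθ(z|I) equals ∇z log pθ(I, z), which is why the intractable marginal pθ(I) never has to be evaluated.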
2. Short-Run Langevin Dynamics Embodiments
It may not be sensible or realistic to use a long-run MCMC to train a deep latent variable model. Within each iteration, running a finite number of Langevin dynamics steps for inference toward pθ(z|I) appears to be practical. Thus, a short-run K-step Langevin dynamics is given by:
In one or more embodiments, an initial distribution p0 is assumed to be the Gaussian distribution. Such dynamics may be treated as a conditional generator that transforms a random noise z0 to the target distribution under the condition I. The transformation itself may also be treated as a K-layer residual network, where each layer shares the same parameters θ and has a noise injection. κθ is used to denote the K-step MCMC transition kernel. The conditional distribution of zK given I is:
$$q_\theta(z_K \mid I) = \int p_0(z_0)\, \kappa_\theta(z_K \mid z_0, I)\, dz_0 \tag{8}$$
The corresponding marginal distribution of zK is
$$q_\theta(z_K) = \int q_\theta(z_K \mid I)\, p_{\text{data}}(I)\, dI \tag{9}$$
If the MCMC converges, qθ(zK) should be close to the prior distribution p(z); otherwise, there is a gap between them.
Equation (7) is also called the noise-initialized short-run MCMC, where for each step of parameter update, the short-run MCMC starts from the noise distribution z0˜p0(z). If the short-run MCMC is initialized by the inferred results obtained in previous iteration, it is called the persistent short-run MCMC.
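The difference between the noise-initialized and the persistent variants is only the starting point of the chain. The sketch below illustrates both, assuming a toy linear generator gθ(z)=Wz (so the posterior gradient is available in closed form) and the √(2s) noise scaling, which is a common convention rather than something confirmed by the text. The function names, weights, step size, and K are hypothetical.

```python
import math
import random

def generator(W, z):
    """Toy stand-in for g_theta: linear map W z (an assumption for illustration)."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in W]

def grad_z_log_posterior(I, z, W, sigma):
    """grad_z log p(z | I) = (1/sigma^2) W^T (I - W z) - z for the linear toy generator."""
    resid = [i - gi for i, gi in zip(I, generator(W, z))]
    return [sum(W[a][b] * resid[a] for a in range(len(W))) / sigma ** 2 - z[b]
            for b in range(len(z))]

def short_run_langevin(I, W, sigma, s, K, rng, z_init=None):
    """Run K Langevin steps toward p(z | I).

    z_init=None  -> noise-initialized chain, z_0 ~ N(0, I_d)
    z_init=given -> persistent chain, started from a previously inferred z
    """
    d = len(W[0])
    z = list(z_init) if z_init is not None else [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(K):
        grad = grad_z_log_posterior(I, z, W, sigma)
        z = [zj + s * gj + math.sqrt(2.0 * s) * rng.gauss(0.0, 1.0)
             for zj, gj in zip(z, grad)]
    return z

rng = random.Random(0)
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical weights
I = [1.0, -1.0, 0.0]
z_noise = short_run_langevin(I, W, sigma=0.3, s=0.01, K=50, rng=rng)
z_persist = short_run_langevin(I, W, sigma=0.3, s=0.01, K=50, rng=rng, z_init=z_noise)
```

In this toy setting the step size must stay well below the inverse of the largest posterior curvature for the chain to remain stable, which is why a small s is used here.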
Despite the efficiency of the short-run MCMC inference in Equation (8), it might not converge to the true posterior distribution pθ(z|I). Some treat the short-run MCMC as an approximate inference model and optimize the step size s by variational inference, via either a grid search or gradient descent, such that the short-run MCMC qs(z|I) (here s is the learned parameter) best approximates the posterior distribution pθ(z|I).
In one or more embodiments, optimal transport is used to correct the bias of the short-run inference results. In one or more embodiments, instead of minimizing the difference between the short-run inference model and the true posterior, i.e., KL(qθ(zK|I)∥pθ(z|I)), OT is used to minimize the transport cost between the marginal distribution qθ(zK) of the latent variables inferred by the short-run Langevin dynamics and the prior distribution p(z).
1. OT Correction for Biased Short-Run MCMC Embodiments
In one or more embodiments, for learning a top-down latent variable model I=gθ(z) that generates an observed image I from a latent vector z, the following three steps are iterated.
(1) Inference step: the latent vector is first inferred for each observed image Ii by a K-step short-run MCMC, i.e., ẑi~qθ(zK|Ii), and then a population {ẑi} of the inferred latent vectors is obtained for all observed data {Ii}, where {ẑi}~qθ(zK).
(2) Correction step: OT is used to move {ẑi} to the desired prior distribution, closing the gap between them caused by non-convergent inference.
(3) Learning step: Given the observed images and their corresponding inferred latent vectors, θ is updated by Equation (3) and Equation (4). As θ becomes increasingly well-trained, the inference engine qθ(zK) becomes more accurate and the correction made by OT becomes smaller. An illustration of the disclosed strategy using OT correction is presented in the aforementioned
In practice, either the noise-initialized short-run MCMC or the persistent short-run MCMC may be used in the inference step. In one or more experiments, the latter is chosen for the purpose of quick convergence. As to the correction stage, the one-to-one OT map is learned from {ẑi} to {zi}, which is a population sampled from the prior Gaussian distribution and of the same size as {ẑi}. Computing the optimal transport at each iteration is time-consuming and unnecessary in practice. In one or more embodiments, to make the whole pipeline more efficient, the correction step may be performed after every L iterations. After the bijective OT map T(ẑi)=zj is obtained, instead of directly updating the model through the paired data {(T(ẑi), Ii)}, ẑi may be corrected by using a mixture of the OT result and the old value to avoid unstable learning due to a sudden change of ẑi, i.e.,
$$\hat{z}_i \leftarrow \alpha T(\hat{z}_i) + (1 - \alpha)\hat{z}_i \tag{10}$$
In Equation (10), α∈[0,1] is a hyperparameter that controls the percentage of the OT result used for correction. Accordingly, the corrected paired data {(ẑi, Ii)} may be obtained to update the model parameter θ. It shall be noted that when α=0, the disclosed model embodiment may be considered to degenerate to the traditional ABP model. If α is set to be 1, the short-run outputs are corrected totally with the OT results. A moderate 0<α<1 is typically helpful to gradually pull the marginal distribution qθ(zK) to the prior distribution p(z) for ensuring a smooth correction. Methodology 1 summarizes the whole pipeline of a learning strategy embodiment with the detailed process for short-run MCMC inference and OT correction shown in
Methodology 1: Short-run MCMC inference with OT correction embodiment
Methodology 2: Short-Run MCMC Inference with OT Correction Embodiment
It shall be noted that although Methodology 2 shows an updating process according to Adam method with β1=0.9 and β2=0.5, parameters β1, an exponential decay rate for the first moment estimates, and β2, an exponential decay rate for the second-moment estimates, may be other values and other methods may be used. Such variations shall be still within the scope of the present patent document.
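The mixing update of Equation (10) is a per-vector interpolation between the inferred latent vector and its OT target. A minimal sketch follows; the function name `ot_blend` and the example vectors are hypothetical.

```python
def ot_blend(z_hat, z_target, alpha):
    """Return alpha * T(z_hat) + (1 - alpha) * z_hat for one latent vector.

    z_hat    -- latent vector inferred by the short-run MCMC
    z_target -- its image T(z_hat) under the OT map (a prior sample)
    alpha    -- fraction of the OT result used for the correction, in [0, 1]
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * t + (1.0 - alpha) * z for z, t in zip(z_hat, z_target)]

corrected = ot_blend([0.0, 2.0], [2.0, 4.0], alpha=0.5)
```

With alpha=0 the inferred vector is returned unchanged, which corresponds to the ABP-style degenerate case described above, and with alpha=1 the vector is replaced entirely by its OT target.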
2. Optimal Transport
Given the latent codes sampled from qθ(zK), namely {ẑi}, i=1, ..., n, and the randomly generated samples {zj}, j=1, ..., n, from the prior N(0, I_d), the one-to-one map from {ẑi} to {zj} is computed through the optimal transport. Specifically, in one or more embodiments, the cost function is set to be the squared Euclidean distance cij=∥ẑi−zj∥₂² because of its clear geometric meaning, and the following assignment problem is then solved:
$$\min_{\pi} \sum_{i=1}^{n} \sum_{j=1}^{n} \pi_{ij} c_{ij} \quad \text{s.t.} \quad \sum_{j=1}^{n} \pi_{ij} = \frac{1}{n},\ \sum_{i=1}^{n} \pi_{ij} = \frac{1}{n},\ \pi_{ij} \ge 0 \tag{11}$$
According to linear programming theory, there will be only one nonzero element in each row and column of π, and all of the nonzero elements are equal to 1/n. Thus, the map from {ẑi} to {zj} may be defined as: T: ẑi→zj if πij≠0. When n is large, directly solving the above problem with linear programming is problematic, since the computational complexity O(n^2.5) is prohibitively high. Similarly, the classical Hungarian algorithm for the assignment problem cannot be used to solve this problem due to its high computational complexity O(n^3). It is also impossible to solve the above problem with approximate OT solvers, e.g., the Sinkhorn algorithm, since these solvers tend to give a dense transport plan, from which it is impossible to recover the OT map. Moreover, the approximate algorithms are not suitable for large-scale problems with n>20,000. Thus, the dual problem of Equation (11) is used. In one or more embodiments, the original dual formula for the semi-discrete OT may be extended to the following minimization problem in a discrete setting:
The above problem is convex as it is the maximum of the summation of n hyperplanes. Thus, it may be solved by the gradient descent optimization. The gradient is computed by
and #Jj is the number of elements in Jj. If h* is an optimal solution of E(h), then h=h*+(c, c, ..., c)ᵀ is also an optimal solution. To remove this ambiguity, the gradient is centered, i.e., ∇E(h) is redefined as ∇E(h)−mean(∇E(h)). With the gradient information, the energy E(h) may be minimized by the Adam gradient descent algorithm.
Since Equation (12) is the dual of the assignment problem, with the optimal solution h*, it is easy to reconstruct the one-to-one OT map from {ẑi} to {zj} by
During the optimization, the process stops when the norm of the gradient ∇E(h) is less than a predetermined threshold ε. Ideally, if ε=0, the map T becomes injective and surjective, and each Jj includes only one element, namely the corresponding i. In that case, the OT map T is well defined. In practice, ε is usually set to a value greater than 0, so T is neither injective nor surjective. In such a situation, for some zj there may be more than one corresponding ẑi, and for some other zj a corresponding ẑi may not exist. To remove the ambiguity and reconstruct the one-to-one map, it is necessary to handle each set Jj that is empty or includes more than one element. The approximate OT map T̂ is thus given by: (1) if there is only one element in Jj, namely i, then T̂(ẑi)=zj; (2) when Jj includes more than one element, one i∈Jj is randomly selected and the others are abandoned, and T̂(ẑi)=zj is defined for the selected i; (3) the abandoned ẑi and the zj corresponding to the empty Jj are removed from the domain and range of T̂, respectively. In such a way, a new injective and surjective map T̂ that approximates the OT map T well may be built.
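The three rules for building T̂ can be sketched as follows. The sketch assumes that a (possibly many-to-one) assignment from each ẑi to some zj is already available from the dual solution; how that assignment is computed is left outside the example. The function name `approximate_one_to_one` and the input format are hypothetical.

```python
import random
from collections import defaultdict

def approximate_one_to_one(assign, n_targets, rng):
    """Build an injective map from source indices i to target indices j.

    assign[i] = j means z_hat_i is currently sent to z_j (not necessarily one-to-one).
    Returns a dict {i: j}; abandoned sources and unmatched targets are simply absent.
    """
    J = defaultdict(list)              # J[j] collects every i mapped to target j
    for i, j in enumerate(assign):
        J[j].append(i)

    T_hat = {}
    for j in range(n_targets):
        members = J.get(j, [])
        if not members:
            continue                   # rule (3): empty J_j, so z_j leaves the range
        if len(members) == 1:
            i = members[0]             # rule (1): exactly one source
        else:
            i = rng.choice(members)    # rule (2): keep one at random, abandon the rest
        T_hat[i] = j
    return T_hat

T_hat = approximate_one_to_one([0, 0, 2, 3], n_targets=4, rng=random.Random(0))
```

In this example sources 0 and 1 compete for target 0 and target 1 receives nothing, so the resulting map pairs three sources with three distinct targets.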
It shall be noted that in embodiments of the disclosed OT methodology, the prior distribution is not limited to the Gaussian distribution. Any prior distribution may be chosen as long as it is easy to sample from. Additionally, the computational complexity to solve the non-smooth dual problem in Equation (12) is O(n²/√ε). In the context of training complex neural networks with a large number of parameters, the time used to optimize the OT problem is negligible. Finally, the number of samples removed from T̂ should not be larger than nε. In one or more experiments, ε is usually set to 0.05. With such a small ε, a good approximation of the OT map may be obtained.
It shall be noted that these experiments and results are provided by way of illustration and were performed under specific conditions using a specific embodiment or embodiments; accordingly, neither these experiments nor their results shall be used to limit the scope of the disclosure of the current patent document.
In the experiments, embodiments of the disclosed model were tested in terms of whether it may (1) successfully correct the marginal distribution qθ(zK) of the latent vectors inferred by the short-run Langevin dynamics, (2) learn an expressive generator that synthesizes visually realistic images from the prior distribution, and (3) successfully perform anomaly detection. To show the performance of the disclosed method, experiments were done on various datasets. Details about the design of the generator architecture, the choices of the model hyperparameters and the optimization method for each dataset can be found in the supplementary material. Moreover, to investigate the influence of different hyperparameters, Dataset A was mainly used due to its simplicity and representativeness. To quantify the performance of the model, the mean squared error (MSE) and the Frechet Inception Distance (FID) score were adopted to measure the quality of the reconstructed and generated images. FID score is a metric for evaluating the quality of generated images.
Datasets: Various datasets were used for training and/or testing in one or more experiments. Some sample data, such as image data, were randomly selected for the purpose of quick convergence. All of the training images were resized and scaled to the range of [−1,1].
Model architectures: The architectures of the models are presented in Table 1, where the numbers of latent dimensions are set to be 30, 64, 64 for the Dataset A, Dataset B, and Dataset C, respectively.
Optimization: The parameters for the generators are initialized with Xavier normal and then optimized with the Adam optimizer with β1=0.5 and β2=0.99. For all of the experiments, the batch size was set to be 2,000. In Methodology 1, both L and K are set to be 50. The hyperparameter α is set to be 0.5 for Dataset A, and 0.3 for Dataset B and Dataset C. The step sizes s for Datasets A, B, and C are set to be 0.3, 3.0, and 3.0, respectively. The standard deviation σ was set to 0.3 for all of the models.
Computational cost: Due to the involvement of the short-run MCMC and the optimal transport, it is necessary to consider the running time of the whole pipeline. Here the Dataset B including multiple images with the size 32×32×3 was taken as an example. Embodiments of the disclosed model were trained on two NVIDIA TitanX GPUs. For each iteration, the inference step with K=30 takes about 124 minutes, the correction step by optimal transport takes about 10 minutes and the learning step with L2=2 takes 5 minutes. Generally, it is necessary to run 10-15 iterations for the model, which will consume about one day.
1. Latent Space Analysis
To verify that the proposed method does correct the short-run marginal distribution qθ(zK) of the latent variables, the classes “0” and “1” of Dataset A were selected. From these classes, embodiments of the disclosed model were learned with the latent space dimension set to be 2 for better visualization.
2. Image Modeling
In one or more experiments, the quality of both the reconstructed and generated images was evaluated. With a well-learned model, the marginal distribution qθ(zK) should match the prior distribution well. In such a case, the generator will be a probability transformation from the prior Gaussian distribution to the image distribution, and a high quality image may be synthesized by I=gθ(z) with a latent vector z sampled from the prior distribution. Additionally, the model may be useful for reconstruction. In the following, embodiments of the disclosed model were compared to the VAE and its variants, the two-stage VAE (2sVAE) and the regularized autoencoder (RAE). Comparisons were also made with the ABP model and its variant short-run inference (SRI), whose generator has multiple layers of latent variables. The last model for comparison is the latent space energy-based model (LEBM), which uses an energy-based short-run MCMC to infer the latent variables of each observed image.
Given the reconstructed images and the images generated from latent vectors sampled from the given prior distribution, it is evident that the generated images are realistic and comparable to the real ones in the training datasets. In Table 2, the MSE was used to test the quality of the reconstructed images and the FID score to quantify the quality of the generated images. From the table it was found that embodiments of the disclosed method (shown as “Present” column) outperformed the other methods in the tasks of reconstruction and generation.
3. Anomaly Detection
Anomaly detection is another task that may help evaluate embodiments of the disclosed model. With a model well learned from the normal data, anomalous data may be detected by first sampling the latent code z of the given testing image I from the conditional distribution qθ(zK|I) by the short-run Langevin dynamics, and then computing the logarithm of the joint probability log pθ(I, z) in Equation (5). Under this formulation, the joint probability should be high for the normal images and low for the anomalous ones.
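The scoring procedure described above can be sketched on a toy model. The example below is not the disclosed network: it assumes a linear generator g(z)=Wz, Gaussian observation noise with standard deviation sigma, a standard normal prior, and a Langevin update of the common form z ← z + (s²/2)∇z log p(I, z) + s·ε. These forms are assumed to correspond to Equations (5) and (7), and all names and default values are hypothetical.

```python
import numpy as np


def log_joint(image, z, weights, sigma):
    """Unnormalized log p(I, z): Gaussian likelihood around W z plus N(0, I) prior."""
    residual = image - weights @ z
    return float(-0.5 * residual @ residual / sigma**2 - 0.5 * z @ z)


def grad_log_joint(image, z, weights, sigma):
    """Gradient of log p(I, z) with respect to z (equal to the posterior score)."""
    return weights.T @ (image - weights @ z) / sigma**2 - z


def short_run_langevin(image, weights, sigma, num_steps, step_size, rng):
    """Run K = num_steps Langevin updates starting from a prior draw; return z_K."""
    z = rng.standard_normal(weights.shape[1])
    for _ in range(num_steps):
        noise = rng.standard_normal(z.shape)
        drift = 0.5 * step_size**2 * grad_log_joint(image, z, weights, sigma)
        z = z + drift + step_size * noise
    return z


def anomaly_score(image, weights, sigma, num_steps=30, step_size=0.1, seed=0):
    """Infer z_K for the image, then return log p(I, z_K); lower suggests an anomaly."""
    rng = np.random.default_rng(seed)
    z = short_run_langevin(image, weights, sigma, num_steps, step_size, rng)
    return log_joint(image, z, weights, sigma)
```

In this toy setting, an image drawn from the model itself should receive a higher score than an image that lies far from the range of the generator.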
In the following experiments, each class in Dataset A was in turn treated as the anomalous class and the others were left as normal. The model was trained only with the normal data, and then tested with both the normal and the anomalous data. To evaluate the performance, log pθ(I, z) was used as a decision function to compute the area under the precision-recall curve (AUPRC). In the test stage, each experiment was run 10 times to obtain the mean and variance. In Table 3, embodiments of the disclosed method (shown as the “Present” column) were compared with related models on this task, including the VAE, MEG, BiGAN-σ, LEBM and the ABP model, the last of which can be treated as a special case of the disclosed model without the OT calibration. From the table, it was found that the tested method embodiment may get much better results than those of the other methods.
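One common way to estimate the AUPRC from a decision function is the step-wise average precision, sketched below. The sketch makes assumptions not stated above: the anomalous class is treated as the positive class, the score passed in is higher for more anomalous images (for example, the negative of log pθ(I, z)), and tied scores receive no special handling. The function name is illustrative.

```python
import numpy as np


def average_precision(scores, is_positive):
    """Step-wise AUPRC estimate: mean of the precision at each positive's rank."""
    scores = np.asarray(scores, dtype=np.float64)
    hits = np.asarray(is_positive, dtype=bool)
    if scores.shape != hits.shape or scores.ndim != 1:
        raise ValueError("scores and is_positive must be 1-D and the same length")
    if not hits.any():
        raise ValueError("at least one positive example is required")
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = hits[order]
    precision_at_rank = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float(precision_at_rank[hits].mean())
```

The values from repeated runs can then be summarized with `np.mean` and `np.var` to report a mean and variance.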
4. Influence of the Number of Latent Dimensions
This subsection shows the influence of the number of dimensions of the latent space under the same architecture. Dataset B was used with different numbers of dimensions of the latent space, e.g., 20, 40 and 64, respectively. As shown in Table 4, with more latent dimensions, much better results may be obtained in terms of both reconstruction and generation.
5. Ablation Study
This subsection explores the performance of the proposed model under different values of the parameter α introduced in Equation (10), different step sizes of the Langevin dynamics (the s of Equation (7)), different numbers of Langevin steps (K in Equation (7)) and different numbers of iterations for the learning step that seeks to maximize the joint probability in Equation (5) using the paired data {(ẑi, Ii)}.
The influence of α. Firstly, the influence of α in Equation (10) was investigated with results shown in
The influence of the Langevin step size. Table 5 shows the performance of embodiments of the disclosed model with different Langevin step sizes (s in Equation (7)). In the table, “Before” means that the model was used before the OT correction, and “After” means that the trained model was used after the OT correction. With a small s, the MSE loss is indeed very small, but the FID is relatively large, meaning that the quality of the generated images is not very good. When s is large, e.g., s=6e−2 in the last column, both the MSE loss and the FID are large, which means that even high quality reconstructed images cannot be obtained. In this situation, the model does not converge well. Only with an appropriate Langevin step size (in this experiment, s=3e−2) may a good balance between the MSE and the FID be obtained for satisfactory reconstruction and generation results.
The influence of the number of Langevin steps. The number of Langevin steps K in Equation (7) is another key factor that influences the performance of the proposed method. Theoretically, a larger K brings the MCMC inference closer to convergence, which helps obtain more accurate latent variables. To test this point, K was set to 30, 50 and 100, respectively, and the other parameters were kept fixed. The results are shown in Table 6. Indeed, a larger K leads to a better result. However, a larger K also increases the running time of the whole pipeline linearly. Thus, to get a good balance between the running time and the performance, a suitable K should be chosen for each dataset.
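The trade-off can be illustrated with a simplified worked example. It relies on assumptions that are not part of the experiments above: a one-dimensional Gaussian posterior with mean \mu and variance \tau^2, and a Langevin update of the common form with step size s, assumed here to match the form of Equation (7). Taking the expectation over the injected noise gives the bias of the chain mean after K steps:

```latex
z_{k+1} = z_k + \frac{s^2}{2}\,\frac{\partial}{\partial z}\log p(z_k \mid I) + s\,\epsilon_k
        = z_k - \frac{s^2}{2\tau^2}\,(z_k - \mu) + s\,\epsilon_k,
\qquad \epsilon_k \sim \mathcal{N}(0, 1)

\mathbb{E}[z_{k+1}] - \mu = \Bigl(1 - \frac{s^2}{2\tau^2}\Bigr)\bigl(\mathbb{E}[z_k] - \mu\bigr)
\quad\Longrightarrow\quad
\mathbb{E}[z_K] - \mu = \Bigl(1 - \frac{s^2}{2\tau^2}\Bigr)^{K}\bigl(\mathbb{E}[z_0] - \mu\bigr)
```

For 0 < s^2 < 4\tau^2 the factor has magnitude below one, so in this toy case the bias shrinks geometrically as K grows, while each additional step costs one more gradient evaluation.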
The influence of the number of iterations inside the learning step. In Methodology 1, several iterations, denoted by L2, of gradient ascent were run inside the learning step to maximize the joint probability in Equation (5) using the paired data {(ẑi, Ii)}. The results are shown in Table 7. From the table, it was found that by increasing L2, much better performance may be obtained for image reconstruction and generation.
The present document discloses embodiments of using the OT to correct the bias of the short-run MCMC-based inference in training deep latent variable models. Specifically, in one or more embodiments, the marginal distribution of the latent variables of the short-run Langevin dynamics is corrected step by step through the OT map between this distribution and the prior distribution. In such a way, the distribution of the inferred latent vectors may finally converge to the prior distribution, thus improving the accuracy of the subsequent parameter learning. Experimental results show that the disclosed training method embodiments performed better than the ABP and VAE models on tasks such as image reconstruction, image generation and anomaly detection.
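The idea of mapping inferred latent vectors toward the prior through an OT map can be illustrated in one dimension, where the optimal coupling between two equal-sized samples under a squared-distance cost is obtained by sorting. This is only an illustrative special case: the experiments above use multi-dimensional latent spaces, for which a general OT solver would be needed, and the function name below is hypothetical.

```python
import numpy as np


def ot_match_1d(inferred, prior_samples):
    """Assign each inferred latent the prior sample of equal rank (1-D optimal coupling)."""
    inferred = np.asarray(inferred, dtype=np.float64)
    prior_samples = np.asarray(prior_samples, dtype=np.float64)
    if inferred.ndim != 1 or inferred.shape != prior_samples.shape:
        raise ValueError("both inputs must be 1-D arrays of the same length")
    ranks = np.argsort(np.argsort(inferred))  # rank of each inferred latent
    return np.sort(prior_samples)[ranks]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    biased = 0.5 * rng.standard_normal(1000) + 1.0  # stand-in for short-run latents
    corrected = ot_match_1d(biased, rng.standard_normal(1000))
    print(corrected.mean(), corrected.std())
```

The corrected values are a permutation of the prior samples, so their empirical distribution matches that of the prior draws while the ordering of the inferred latents is preserved.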
In one or more embodiments, aspects of the present patent document may be directed to, may include, or may be implemented on one or more information handling systems (or computing systems). An information handling system/computing system may include any instrumentality or aggregate of instrumentalities operable to compute, calculate, determine, classify, process, transmit, receive, retrieve, originate, route, switch, store, display, communicate, manifest, detect, record, reproduce, handle, or utilize any form of information, intelligence, or data. For example, a computing system may be or may include a personal computer (e.g., laptop), tablet computer, mobile device (e.g., personal digital assistant (PDA), smart phone, phablet, tablet, etc.), smart watch, server (e.g., blade server or rack server), a network storage device, camera, or any other suitable device and may vary in size, shape, performance, functionality, and price. The computing system may include random access memory (RAM), one or more processing resources such as a central processing unit (CPU) or hardware or software control logic, read only memory (ROM), and/or other types of memory. Additional components of the computing system may include one or more drives (e.g., hard disk drive, solid state drive, or both), one or more network ports for communicating with external devices as well as various input and output (I/O) devices, such as a keyboard, mouse, touchscreen, stylus, microphone, camera, trackpad, display, etc. The computing system may also include one or more buses operable to transmit communications between the various hardware components.
As illustrated in
A number of controllers and peripheral devices may also be provided, as shown in
In the illustrated system, all major system components may connect to a bus 1116, which may represent more than one physical bus. However, various system components may or may not be in physical proximity to one another. For example, input data and/or output data may be remotely transmitted from one physical location to another. In addition, programs that implement various aspects of the disclosure may be accessed from a remote location (e.g., a server) over a network. Such data and/or programs may be conveyed through any of a variety of machine-readable media including, for example: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as compact discs (CDs) and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, other non-volatile memory (NVM) devices (such as 3D XPoint-based devices), and ROM and RAM devices.
Aspects of the present disclosure may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed. It shall be noted that the one or more non-transitory computer-readable media shall include volatile and/or non-volatile memory. It shall be noted that alternative implementations are possible, including a hardware implementation or a software/hardware implementation. Hardware-implemented functions may be realized using ASIC(s), programmable arrays, digital signal processing circuitry, or the like. Accordingly, the “means” terms in any claims are intended to cover both software and hardware implementations. Similarly, the term “computer-readable medium or media” as used herein includes software and/or hardware having a program of instructions embodied thereon, or a combination thereof. With these implementation alternatives in mind, it is to be understood that the figures and accompanying description provide the functional information one skilled in the art would require to write program code (i.e., software) and/or to fabricate circuits (i.e., hardware) to perform the processing required.
It shall be noted that embodiments of the present disclosure may further relate to computer products with a non-transitory, tangible computer-readable medium that have computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present disclosure, or they may be of the kind known or available to those having skill in the relevant arts. Examples of tangible computer-readable media include, for example: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CDs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as ASICs, PLDs, flash memory devices, other non-volatile memory devices (such as 3D XPoint-based devices), and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher level code that are executed by a computer using an interpreter. Embodiments of the present disclosure may be implemented in whole or in part as machine-executable instructions that may be in program modules that are executed by a processing device. Examples of program modules include libraries, programs, routines, objects, components, and data structures. In distributed computing environments, program modules may be physically located in settings that are local, remote, or both.
One skilled in the art will recognize that no computing system or programming language is critical to the practice of the present disclosure. One skilled in the art will also recognize that a number of the elements described above may be physically and/or functionally separated into modules and/or sub-modules or combined together.
It will be appreciated by those skilled in the art that the preceding examples and embodiments are exemplary and not limiting to the scope of the present disclosure. It is intended that all permutations, enhancements, equivalents, combinations, and improvements thereto that are apparent to those skilled in the art upon a reading of the specification and a study of the drawings are included within the true spirit and scope of the present disclosure. It shall also be noted that elements of any claims may be arranged differently including having multiple dependencies, configurations, and combinations.