This disclosure relates generally to computer modeling of high-dimensional data spaces, and more particularly to probabilistic modeling of high-dimensional data that lies on a low-dimensional manifold embedded in a high-dimensional ambient space.
As machine learning techniques and infrastructures become more sophisticated and achieve higher performance on data sets, machine-learned models are increasingly tasked with processing high-dimensional data sets and generating new instances (also termed “data points”). Existing solutions struggle to represent the complete range of a high-dimensional data set, or to do so in a low-dimensional space (e.g., representing a manifold of the relatively higher-dimensional data in a lower-dimensional space), while simultaneously permitting effective probabilistic modeling of the data. For example, while generative adversarial network (GAN) models have been used to learn to generate data in conjunction with feedback from a discriminative model, the generative model can neglect to learn how to generate certain types of content from the training data and does not model underlying probabilities. In other examples, models such as variational autoencoders (VAEs) may represent high-dimensional data points with latent variables in a low-dimensional space, but because the observational space remains the high-dimensional space, the model may still implicitly assign non-zero densities across the entire high-dimensional space and thus may not correctly learn that the probability should be zero for positions off the manifold.
Alternative solutions that do provide probabilistic information, such as normalizing flows, do not effectively learn densities for complex high-dimensional spaces in which the high-dimensional data lies on a manifold describable with a low-dimensional representation. In some examples, learning high-dimensional densities when the underlying data lies on a low-dimensional manifold can result in a trained model that fails to properly detect out-of-distribution data, that is, a model that assigns higher densities to out-of-distribution data than to the training data.
As such, while likelihood-based or explicit deep generative models use neural networks to construct flexible high-dimensional densities, this formulation is ineffective when the true data distribution lies on a manifold. Maximum-likelihood training (e.g., for density estimation) performed directly in the high-dimensional space yields degenerate optima, such that a model may learn the manifold, but not the distribution on it. This type of error is termed herein “manifold overfitting.” There is thus a need for an approach that effectively models data points of a high-dimensional space with effective probability density information while accounting for the data lying on a manifold in the high-dimensional space.
To address manifold overfitting and more effectively model high-dimensional data, such as data for images, video, and other complex data items, the model operates in two stages: first, an autoencoder that may encode and decode data between the high-dimensional space and the low-dimensional space, and second, a density model that learns a probability density of the data in the low-dimensional space. By reducing the data's dimensionality and then modeling the density, the model may effectively recover the position of data points within the high-dimensional space and assign them a probability density based on the probability density learned in the low-dimensional space, thus avoiding manifold overfitting. With this approach, density estimation can be applied to model structures that reduce dimensionality implicitly, such as various types of generative networks, including generative adversarial networks (GANs).
To train these models, a training data set in a high-dimensional space is used to first train parameters of an autoencoder model that learns an encoder from the high-dimensional space to the low-dimensional space and a decoder from the low-dimensional space to the high-dimensional space. The autoencoder model may thus learn a mapping of the embedded manifold from the high-dimensional space to the low-dimensional space. That is, the autoencoder may learn to transform points on the manifold (in the high-dimensional space) to the low-dimensional space, and to transform points in the low-dimensional space to the manifold in the high-dimensional space. In one embodiment, the autoencoder may be bijective only between the manifold of the high-dimensional space and the low-dimensional space, and in some circumstances only a region of the low-dimensional space (and may thus not be homeomorphic for other regions of the high-dimensional or low-dimensional space). The density model may then be trained with the training data as translated to the low-dimensional space, such that the density model is learned in the low-dimensional space, may be trained sequentially after the autoencoder model, and may use maximum-likelihood training approaches. This permits effective instance generation in the high-dimensional space (e.g., sampling in the low-dimensional space and outputting in the high-dimensional space), along with density estimation and/or out-of-distribution evaluation of data in the high-dimensional space.
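As a minimal illustrative sketch of this two-stage procedure (assuming, for concreteness, a small fully connected autoencoder and a single full-covariance Gaussian as the density model; the layer sizes, optimizer settings, and density model choice are illustrative assumptions rather than requirements of the approach):

```python
# Stage 1: fit an autoencoder between the D-dimensional space and a d-dimensional
# latent space by minimizing reconstruction error.
# Stage 2: fit a density model by maximum likelihood on the encoded training data.
import torch
import torch.nn as nn

D, d = 784, 16                      # ambient and latent dimensionalities (illustrative)
X = torch.rand(10_000, D)           # stand-in for the high-dimensional training data

encoder = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, d))
decoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, D))

# Stage 1: minimize the reconstruction error ||F(f(X)) - X|| over the training data.
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
for _ in range(200):
    batch = X[torch.randint(len(X), (256,))]
    loss = ((decoder(encoder(batch)) - batch) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: maximum-likelihood density estimation in the low-dimensional space.
# A single Gaussian is used only for brevity; more expressive density models
# (e.g., normalizing flows) may be trained on the encoded points instead.
with torch.no_grad():
    Z = encoder(X)
mu = Z.mean(dim=0)
centered = Z - mu
cov = centered.T @ centered / (len(Z) - 1) + 1e-4 * torch.eye(d)
density = torch.distributions.MultivariateNormal(mu, cov)
```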
The figures depict various embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
The computer model 160 is trained by the training module 120 to learn parameters for the autoencoder model and the probability density model based on the training data of the training data store 140. Individual training data items are referred to as data points or data instances and may be represented in a “high-dimensional” space having D dimensions (e.g., with real values varying across the D dimensions, represented as ℝ^D). The computer model 160 learns a manifold ℳ ⊂ ℝ^D in the high-dimensional space of the positions of the training data items by learning parameters for the autoencoder, including an encoder portion that encodes a high-dimensional data item to a position in the low-dimensional space and a decoder portion that decodes a position in the low-dimensional space to the high-dimensional space. The low-dimensional space has a dimensionality d (e.g., ℝ^d). The dimensionality d of the low-dimensional space is lower than that of the high-dimensional space (d < D) and in some instances may be significantly lower. As such, the encoder may represent a function ƒ for converting the high-dimensional space to the low-dimensional space, ƒ: ℝ^D → ℝ^d. Similarly, the decoder may represent a function F for converting the low-dimensional space to the high-dimensional space, F: ℝ^d → ℝ^D.
The computer model 160 also includes a density model that learns a probability density in the low-dimensional space based on the training data in the low-dimensional space (as converted to the low-dimensional space by the encoder). This enables the model to simultaneously capture the appearance of the training data within a sub-region of the high-dimensional space (as the learned manifold) while also enabling effective probabilistic applications for the model using the probability density. As example applications, a point may be obtained (e.g., sampled) from the probability density to generate a data instance with the decoder, or the probability for a point in the high-dimensional space may be determined by transforming a probability in the low-dimensional space (as given by the density model) to the high-dimensional space with the decoder. For example, a point in the high-dimensional space may be encoded to determine the respective position in the low-dimensional space, the probability density determined in the low-dimensional space at the encoded position, and the probability density transformed back to the high-dimensional space (e.g., to account for the change of variables) to represent the probability density in the high-dimensional space.
By separating these functions and learning them separately, the computer model 160 may effectively learn a probability density even though the training data may be located on a previously undefined or unknown manifold of the high-dimensional space. As further discussed below, the models may also be trained with respective training objectives. The autoencoder may be trained with respect to a reconstruction error (e.g., based on minimizing a difference between the original position of a training data point in the high-dimensional space and its position after the point is encoded and subsequently decoded). The density model may be trained with a maximum-likelihood training objective over the distribution of training points in the low-dimensional space, which, due to manifold overfitting, often cannot be performed effectively in the high-dimensional space.
After training, the sampling module 130 may sample outputs from the probabilistic computer model 160 by sampling a value from the density model in the low-dimensional space and transforming the sampled value to an output in the high-dimensional space, enabling the model to generatively create outputs similar in structure to the data points of the training data store 140 while accurately representing the probability density of the training data set (i.e., overcoming manifold overfitting). Similarly, an inference module 150 may evaluate probabilities using the density model, e.g., by receiving one or more new data points in the high-dimensional space and converting them to points in the low-dimensional space for evaluation with respect to the learned probability density. This may be used to determine, for example, whether the new data point or set of points may be considered “in-distribution” or “out-of-distribution” with respect to the trained probability density. Further details of each of these aspects are discussed below.
In many cases, however, high-dimensional data lies on a manifold of the high-dimensional space, such that directly learning a probability density on the high-dimensional data may prove ineffective and may require many parameters to describe, particularly for very high-dimensional data sets. In general, the high-dimensional space has a number of dimensions referred to as D, and the low-dimensional space has a number of dimensions referred to as d. While the concepts discussed herein may apply to any situation in which the high-dimensional space has more dimensions than the low-dimensional space (i.e., d < D), and may thus apply to dimensions of D=3 and d=2, in many cases the high-dimensional space may have tens or hundreds of thousands, or millions, of dimensions, and the low-dimensional space may have fewer dimensions by an order of magnitude or more.
Model structures that learn a probability density for data that lies on a manifold in the high-dimensional space may exhibit manifold overfitting. As a one-dimensional illustration, consider a ground truth probability density 400 given by:
p* = 0.3·δ−1 + 0.7·δ1   Equation 1
In which δ−1 and δ1 are the point masses for −1 and 1, respectively.
p(x) = λ·𝒩(x; m1, σ²) + (1−λ)·𝒩(x; m2, σ²)   Equation 2
In which 𝒩(x; m, σ²) denotes a Gaussian density with mean m and variance σ².
The density model of Equation 2 is capable of correctly and exactly modeling the ground truth probability density 400 by learning a value of −1 for the first mean m1, a value of 1 for the second mean m2, a variance σ² approaching 0, and a mixture weight λ of 0.3 (to sample 30% from the first Gaussian and 70% from the second Gaussian). In training the parameters of the model to learn the ground truth probability density 400, the intended behavior 430 may thus be to learn the respective means, variance, and mixture weight by iteratively revising the parameters with a likelihood-maximization training objective (which may also be termed “maximum-likelihood”), in which the parameters are revised at each training iteration with the intent of maximizing the likelihood of correctly capturing the ground truth probability density p*(x) 400 (as observed from the sampled points). It may be possible to learn the correct distribution when the model is initialized with the correct mixture weight:
pt(x) = 0.3·𝒩(x; −1, 1/t) + 0.7·𝒩(x; 1, 1/t)   Equation 3
When training a model initialized with the starting parameters of Equation 3, the model parameters could possibly converge to the correct ground truth distribution as t increases (i.e., because the mixture weights are set a priori to the correct values); however, when actually training the model with maximum-likelihood, the objective does not encourage this desired behavior over other distributions that can also be learned (i.e., training of Equation 2 does not necessarily encourage learning a value of 0.3 for λ).
That is, this maximum-likelihood approach, which can be effective for training complex models when there is no dimensionality mismatch (e.g., when the data does not lie on a manifold effectively represented in a lower dimensionality), may in this instance recover many distributions that are not the correct ground-truth probability density 400. Instead, maximum-likelihood training may yield parameters exhibiting a manifold overfitting distribution 440, for example of the form:
p′t(x) = 0.8·𝒩(x; −1, 1/t) + 0.2·𝒩(x; 1, 1/t)   Equation 4
As such, when trained with maximum-likelihood, the model may instead learn parameters for a distribution that lies on the manifold but with an incorrect distribution on it, e.g., the limit 0.8δ−1 + 0.2δ1 of Equation 4, and may further be trained to arbitrarily high likelihoods, i.e., p′t(x) → ∞ as t → ∞ for x on the manifold (here, x ∈ {−1, 1}).
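A brief numerical illustration (an assumed example, not taken from the disclosure) shows this divergence: even with the incorrect mixture weight of 0.8, the average log-likelihood of samples drawn from p* grows without bound as the variance 1/t shrinks.

```python
# Evaluate the incorrectly weighted mixture of Equation 4 on samples from
# p* = 0.3*delta_{-1} + 0.7*delta_{1}; its mean log-likelihood diverges as t grows.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.choice([-1.0, 1.0], size=5000, p=[0.3, 0.7])   # samples from p*

def mean_log_likelihood(t, w=0.8):
    # p'_t(x) = w*N(x; -1, 1/t) + (1 - w)*N(x; 1, 1/t)
    p = w * norm.pdf(x, -1, np.sqrt(1 / t)) + (1 - w) * norm.pdf(x, 1, np.sqrt(1 / t))
    return np.log(p).mean()

for t in (1, 10, 100, 1000):
    print(t, mean_log_likelihood(t))   # increases without bound even though w != 0.3
```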
This may occur, for example, because the distribution exhibiting manifold overfitting achieves high likelihoods with respect to p*(x) due to the dimensionality mismatch. Because the sampled points never include off-manifold points and the manifold is significantly smaller (i.e., representable in fewer dimensions) than the total high-dimensional space, training iterations may reach optima that fail to learn the correct distribution, as the probability density of any point on the manifold can diverge towards infinity relative to the density of off-manifold points. As such, as further discussed below, even with infinite data samples from the sampled distribution p*(x), subsequent training iterations with maximum-likelihood may be dominated by terms for learning the manifold rather than learning the distribution on the manifold.
A Gaussian variational autoencoder was trained on this data sample to provide another example of manifold overfitting, shown as a learned VAE distribution 450. Although it learned the manifold (spiking densities at −1 and 1), the VAE distribution 450 has densities that begin to diverge towards infinity (i.e., individual regions spike well above 1) and incorrectly learns the relative frequencies of −1 and 1. Because the sampled data from p*(x) lies on the manifold, a small portion of the total high-dimensional space, the maximum-likelihood training approach under a dimensionality mismatch (the data in fact lies on a manifold) may thus recover only the manifold (e.g., as in the manifold overfitting distribution 440 or the VAE distribution 450) and can iteratively “learn” parameters for incorrect relative distributions within the manifold. Stated another way, as more and more points are sampled from p*(x), the number of sampled points (in dimension D) that are on the manifold approaches infinity, while the number of off-manifold points remains zero. As such, iterations of maximum-likelihood training may fail to converge on the correct probability density because the maximum-likelihood evaluation is dominated by correctly identifying the manifold itself, yielding likelihood-maximization training that may only learn the manifold correctly.
As another way of understanding manifold overfitting, when the probability model attempts to learn a continuous probability density in a space having dimensionality D (e.g., points are represented as having respective values varying along each dimension of ℝ^D), it attempts to learn a probability density as a continuous function that can be evaluated at points in ℝ^D as non-zero values. In addition, as a probability, integrating the density across the entirety of ℝ^D is intended to yield an accumulated probability of 1. That is, accumulating the probability density of a region in ℝ^D as an integral is the accumulation of each respective “volume” with respect to ℝ^D multiplied by the probability density for each point in the volume.
However, when the data lies on a manifold of dimensionality d, the volume of the manifold with respect to ℝ^D is zero. Described intuitively, this is like measuring the three-dimensional volume of a circle or the two-dimensional area of a line segment: lacking a value in the additional dimension, the volumetric measurement of the lower-dimensional object with respect to the higher dimension is zero.
As suggested by the above, this effect is not resolved by additional data samples from p*(x) and is not the result of traditional notions of the model parameters “overfitting” individual data points. Rather, it arises from the dimensionality mismatch, which is not cured by additional data: increasing the number of samples causes the number of samples on the manifold to approach infinity while the number of off-manifold samples remains zero, and it does not change the zero measured “volume” of the manifold in ℝ^D that allows the probability density to diverge towards infinity. This problem with manifold overfitting may thus occur even for model structures that represent data with low-dimensional latent variables, such as variational autoencoder (VAE) or Adversarial Variational Bayes (AVB) models, because these models may still evaluate maximum-likelihood directly in the high-dimensional space and imply that each point in the high-dimensional space has a positive density.
Although termed an “autoencoder,” the autoencoders used herein may include more than typical or traditional autoencoder models. Other model types that provide (or may be modified to provide) effective encoding of the manifold and recovery therefrom may be used. In general, the encoder and decoder may be bijective along the manifold.
Accordingly, additional types of models beyond a traditional autoencoder (AE) may be used as an encoder and/or decoder, including other types of models that learn lower-dimensional representations that may be returned to high-dimensional representations. These may include continuously differentiable injective functions that are bijective onto their image (i.e., the corresponding region in the low-dimensional space). In addition, the models may learn such functions explicitly, or may learn them implicitly as a result of learning low-dimensional representations, for example in generative models that learn a “decoder” in the form of a function that generates a high-dimensional position from a low-dimensional representation.
Types of models that may be used as the autoencoder, and from which encoder and/or decoder model parameters may be learned, include traditional autoencoders as well as other models that learn invertible low-dimensional representations, for example as discussed below.
Additional types of generative models (e.g., generative adversarial models) may be used to form the autoencoder by learning the decoder F from the parameters of the GAN (e.g., a generator that produces a high-dimensional output from a point in the low-dimensional space), and learning an encoder ƒ based on a reconstruction error of the training data points. In some embodiments, rather than learning an explicit encoding function ƒ, the respective low-dimensional point for a high-dimensional data point (e.g., a training data point) may be determined as the position in the low-dimensional space for which the decoder recovers the high-dimensional data point (e.g., for xn in the high-dimensional space, determining zn in the low-dimensional space such that F(zn)=xn). As such, the autoencoder may generally be described as learning the decoder function F and an encoder function ƒ either explicitly or, alternatively, implicitly as respective points {zn}n=1N for input points {xn}n=1N (e.g., the training data or, in certain applications, the test data).
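One possible (assumed) realization of such an implicit encoding, given only a differentiable decoder F, is to recover zn for a data point xn by gradient-based optimization of ‖F(z) − xn‖; the function below is an illustrative sketch, and the name invert_decoder is hypothetical.

```python
import torch

def invert_decoder(decoder, x, d, steps=500, lr=1e-2):
    """Recover low-dimensional codes z for data points x (shape (N, D)) such that
    decoder(z) approximately reconstructs x, by minimizing ||F(z) - x||^2."""
    z = torch.zeros(x.shape[0], d, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = ((decoder(z) - x) ** 2).sum(dim=1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach()
```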
Thus, the encoder and decoder may be trained (as an explicit training objective or as a result of the training process) in a way that minimizes the expected reconstruction error, one example of which is shown in Equation 5:
𝔼X~P*[∥F(ƒ(X))−X∥]   Equation 5
As shown in Equation 5, the reconstruction error may be measured in one embodiment as the expected distance between each training data point Xi in the training data set (sampled from the unknown density P*) and the reconstructed position of that data point after applying the encoder ƒ(·) and subsequently the decoder F(·). That is, training may aim to minimize the difference between Xi and F(ƒ(Xi)) across the data set.
After determining the encoder and decoder as just discussed, the density model may be learned on the training data as encoded into the low-dimensional space, as discussed below.
The respective model architectures may be independently selected (e.g., the model architecture for the autoencoder model 510 and for the density model 520), enabling a wide variety of model architecture combinations that may overcome the manifold overfitting issue. As such, this framework may accommodate many types of density models, including those that do not require injective transformations over the entire low-dimensional space.
As such, the respective model architectures for the autoencoder model 510 and the density model 520 (e.g., as components of the computer model 160) may be trained by the training module 120 based on the training data in the training data store 140. That is, the autoencoder model may be trained to learn the encoder and decoder based on a reconstruction error of the training data points (which lie on the manifold in the high-dimensional space). Then, the training data may be converted to respective positions in the low-dimensional space by applying the learned encoder and used to learn the probability density as the parameters of a learned density model using, e.g., a maximum-likelihood training loss. This permits the model as a whole to correctly learn both a low-dimensional representation and a probability density thereon, enabling, e.g., a generative model that successfully models probability densities for data on a manifold in the high-dimensional space.
After training, to generate data points in high-dimensional space, the sampling module 130 may sample a point in the low-dimensional space from the density model and then apply the decoder to convert the low-dimensional point to a data instance in the high-dimensional space as an output of the computer model 160.
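Continuing the illustrative sketch above (and assuming the density and decoder objects trained there), generation may be as simple as the following:

```python
import torch

# Sample latent points from the learned low-dimensional density and decode them
# into the high-dimensional space to produce generated data instances.
with torch.no_grad():
    z_samples = density.sample((64,))   # 64 points in the d-dimensional space
    x_generated = decoder(z_samples)    # corresponding outputs in the D-dimensional space
```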
In addition, the inference module 150 may use the computer model 160 to perform various probabilistic/density measures on high-dimensional data points. To evaluate probabilities in the high-dimensional space, the probability density for a point in the high-dimensional space may be determined by a change-of-variables formula applied to the respective density in the low-dimensional space:
pX(x) = pZ(ƒ(x))·det(JFT(ƒ(x))·JF(ƒ(x)))^(−1/2)   Equation 6
The change-of-variables formula in Equation 6 provides that the probability density at a point x in the high-dimensional space, pX(x) (for a point x on the low-dimensional manifold), may be evaluated by determining the encoded position of x in the low-dimensional space (i.e., ƒ(x)), determining the probability density pZ in the low-dimensional space (as given by the density model) evaluated at the encoded position (together forming pZ(ƒ(x))), and returning the density to the high-dimensional space based on a change-of-variables term formed from the Jacobian of the decoder F evaluated at the encoded position of x. That is, the Jacobian JF of the decoder function F at the low-dimensional point ƒ(x) is multiplied by its transpose JFT, and the determinant of the product is raised to the power −1/2 (together the det(JFT·JF)^(−1/2) term). In some embodiments in which the decoder architecture does not directly provide the Jacobian at ƒ(x), the Jacobian may be determined by automatic differentiation.
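A hedged sketch of Equation 6 for a single data point follows; it assumes the encoder, decoder, and density objects from the earlier sketch, and obtains the Jacobian by automatic differentiation as noted above.

```python
import torch
from torch.autograd.functional import jacobian

def log_density_high_dim(x, encoder, decoder, density):
    """log p_X(x) = log p_Z(f(x)) - 0.5 * log det(J_F(f(x))^T J_F(f(x)))
    for a single point x of shape (D,)."""
    z = encoder(x).detach()                       # encoded position f(x) in the low-dimensional space
    J = jacobian(lambda z_: decoder(z_), z)       # D x d Jacobian of the decoder F at f(x)
    _, logabsdet = torch.linalg.slogdet(J.T @ J)  # log det(J_F^T J_F)
    return density.log_prob(z) - 0.5 * logabsdet
```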
The probability density evaluation in the high-dimensional space may permit, for example, evaluation of various density/probabilistic functions by the inference module 150 to successfully evaluate data points in the high-dimensional space based on the low-dimensional density. While Equation 6 may be used to obtain a probability density for a point in the high-dimensional space, the inference module in some embodiments may perform probability measurements by encoding the high-dimensional points to the low-dimensional space and evaluating the probability density there using the density model.
For example, analysis may be performed to evaluate a test data set's correspondence to the original training data set and whether the test data set was likely to have been obtained from the same underlying (typically unknown) probability density P*. This may also be termed out-of-distribution analysis: determining the extent to which the test data set is in-distribution or out-of-distribution with respect to the training data set. The out-of-distribution analysis may be performed in a variety of ways, some examples of which are provided below.
As one example of out-of-distribution analysis, the test data set may be analyzed with respect to the autoencoder to determine whether the test data set lies on a different manifold than the original training data set. To do so, the test data set may be encoded by ƒ and decoded by F to determine whether the encoder and decoder yield different reconstruction errors for the test data set than for the training data set (e.g., as an average, an accumulated total, a maximum reconstruction error, or another metric). Because the encoder and decoder are generally trained to encode data on the manifold and to recover points on the manifold, a test data set lying off the learned manifold may be expected to yield relatively higher reconstruction errors, suggesting that the test data is out-of-distribution.
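A minimal sketch of this reconstruction-error check, assuming the trained encoder and decoder from the earlier sketch and a hypothetical candidate set X_test:

```python
import torch

def reconstruction_error(x, encoder, decoder):
    """Per-point reconstruction error ||F(f(x)) - x|| for a batch x of shape (N, D)."""
    with torch.no_grad():
        return (decoder(encoder(x)) - x).norm(dim=1)

# Hypothetical usage: if the test-set errors are much larger than the training-set
# errors, the test data likely lies off the learned manifold.
# reconstruction_error(X_test, encoder, decoder).mean()  vs.
# reconstruction_error(X, encoder, decoder).mean()
```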
As another example, the data points in the test data set may be encoded to the low-dimensional space for evaluation with respect to the learned probability density of the training data set in the low-dimensional space. For example, the probability of points in the test data set may be determined based on the change-of-variables formula of Equation 6 and compared with the corresponding probabilities of points in the training data set. In another example, the test data points in the low-dimensional space may be compared with the density distribution to determine, for example, whether the assigned likelihoods are low relative to the training data, and thus whether the test data points are likely from a different data distribution. In another example, another density distribution may be learned for the test data set based on the encoded test data points (i.e., in the low-dimensional space), and the test density distribution may be evaluated against the density distribution of the training data to determine the divergence of the test density distribution.
As another example, the probability of each data point in a first data set (e.g., the training data set) and in a second data set (e.g., a validation data set, which may be known to differ in composition from the first data set) may be evaluated according to the trained density model, such as via Equation 6. A classifier (e.g., a decision stump) may be trained on the resulting probabilities to learn a threshold probability value for predicting membership in the first data set, using the probability values of the first data set as in-class examples and the probability values of the second data set as out-of-class examples. The individual data samples of each data set may then be evaluated against the threshold to determine the frequency with which members of the first or second data set are correctly predicted as belonging to the first data set. This approach may be used, for example, to evaluate the frequency with which instances of the second data set are predicted to belong to the density learned from the first data set. When the first and second data sets are known to have significantly different compositions, this frequency may be used to evaluate how well the model learned the actual density of the first data set.
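A sketch of this evaluation, assuming per-point log-densities (e.g., computed via Equation 6) are already available for each data set; the function name and inputs are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def stump_membership_rates(logp_first, logp_second):
    """Fit a single-threshold classifier (decision stump) on per-point log-densities
    and return how often each data set is predicted to belong to the first set."""
    features = np.concatenate([logp_first, logp_second]).reshape(-1, 1)
    labels = np.concatenate([np.ones_like(logp_first), np.zeros_like(logp_second)])
    stump = DecisionTreeClassifier(max_depth=1).fit(features, labels)
    predictions = stump.predict(features)
    return predictions[labels == 1].mean(), predictions[labels == 0].mean()
```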
The foregoing description of the embodiments of the invention has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above disclosure.
Some portions of this description describe the embodiments of the invention in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times, to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.
Any of the steps, operations, or processes described herein may be performed or implemented with one or more hardware or software modules, alone or in combination with other devices. In one embodiment, a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
Embodiments of the invention may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus. Furthermore, any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
Embodiments of the invention may also relate to a product that is produced by a computing process described herein. Such a product may comprise information resulting from a computing process, where the information is stored on a non-transitory, tangible computer readable storage medium and may include any embodiment of a computer program product or other data combination described herein.
Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments of the invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.
This application claims the benefit of provisional U.S. application No. 63/305,481, filed Feb. 1, 2022, the contents of which is incorporated herein by reference in its entirety.