This disclosure relates generally to density modeling of data on a manifold of high-dimensional space, and particularly to density modeling with implicit manifold modeling and energy-based densities.
Natural data is often observed, captured, or otherwise represented in a "high-dimensional" space of n dimensions (ℝ^n). While the data may be represented in this high-dimensional space, data of interest typically exists on a manifold ℳ having lower dimensionality m than the high-dimensional space (n > m). For example, the manifold hypothesis states that real-world high-dimensional data tends to have low-dimensional submanifold structure. Elsewhere, data from engineering or the natural sciences can be manifold-supported due to smooth physical constraints. In addition, data samples in these contexts are often drawn from an unknown probability distribution, such that effective modeling of the data must both account for the manifold structure of the data and estimate probability only on the manifold, a task that is challenging to perform directly because the manifold may be "infinitely thin" in the high-dimensional space.
Typical approaches struggle to effectively model both the density and the shape of the manifold in the high-dimensional space. In general, approaches do not attempt to model the manifold and the probability density with respect to the high-dimensional space directly. Instead, many approaches model a probability density in an m-dimensional latent space and map points in the latent space to the higher-dimensional output space with a learned mapping f_θ: ℝ^m → ℝ^n.
These approaches are referred to herein as "pushforward" models because they "push" sampled points from the m-dimensional space into the high-dimensional output space. There are many challenges with these approaches. Manifolds cannot, in general, be represented with a single parameterization, meaning that attempts to do so will incur either computational instability or the inability to learn probability densities within the manifold. In other cases, the manifold itself is not effectively modeled with m dimensions. As such, while such pushforward modeling approaches have provided significant results, e.g., for use as generative models, there remain significant practical and theoretical challenges to further improvement within this paradigm.
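For illustration only, a minimal sketch of such a pushforward model is shown below in Python using the PyTorch library; the network sizes, the standard-normal latent density, and the chosen dimensions are assumptions made for the example rather than part of the approaches described in this disclosure.

```python
import torch
import torch.nn as nn

# Minimal sketch of a "pushforward" model: a density is modeled in an
# m-dimensional latent space and pushed into the n-dimensional data space
# by a learned mapping f_theta. Sizes and layers here are illustrative only.
n_dim, m_dim = 100, 5

f_theta = nn.Sequential(              # learned mapping f_theta: R^m -> R^n
    nn.Linear(m_dim, 64), nn.ReLU(),
    nn.Linear(64, n_dim),
)

latent_density = torch.distributions.Normal(
    torch.zeros(m_dim), torch.ones(m_dim)   # simple latent density p(z)
)

z = latent_density.sample((16,))      # sample in the m-dimensional latent space
x = f_theta(z)                        # "push" samples into the n-dimensional space
print(x.shape)                        # (16, n_dim)
```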
To model a much broader class of topologies more effectively, a manifold-defining function is trained to learn the manifold as a zero set (the points at which the output of the manifold-defining function is zero), and an energy function is trained for the training data with respect to the learned manifold. The energy function and the manifold-defining function may each comprise computer models, such as neural networks, with trainable parameters. Both the manifold-defining function and the energy function may be trained natively with inputs in the same dimensionality as the training data set. Because the manifold-defining function defines the manifold as the positions at which it outputs zero, the manifold-defining function may effectively define various geometries of a manifold in the high-dimensional space, while allowing off-manifold points to have non-zero output values. Similarly, although the energy function may be defined for and can generate an energy across the high-dimensional space, the values of interest for probabilistic functions are constrained to the manifold (as defined by the zero set of the manifold-defining function). The energy function in conjunction with the manifold-defining function may thus be considered a probability model for the training data, such that the energy function evaluated on the manifold may serve as a probability density. Embodiments of the combined energy function and manifold-defining function are referred to as an energy-based implicit manifold (EBIM).
To train the models, the manifold-defining function is initially trained to learn the manifold, which is then used in training the energy function. Training the manifold-defining function may include training with an energy-based training function based on the training data points. A loss function for the manifold-defining function may include terms that encourage the function to evaluate to zero for training data points, to evaluate to non-zero values for points that are not a part of the training data, and to be smooth around the training data points. Because the manifold of a manifold-defining function is defined as its zero set, in some embodiments the manifold for the data set as a whole is defined as a combination (e.g., a union or intersection) of the zero sets of multiple manifold-defining functions.
Using the manifold-defining function, the energy function may be trained to learn an energy density that, on the manifold, may represent a probability density. The energy function may be trained with a contrastive divergence loss function. The contrastive divergence loss function may use data points sampled from the energy density on the manifold. To effectively sample these points, a constrained Hamiltonian Monte Carlo sampling algorithm may be applied that accounts for the energy density and constrains sampled points to the manifold.
This provides an energy-based model with an implicitly-defined manifold that may be suitable for effective density modeling and may also be used as a probabilistic generative model. Because the manifold-defining function defines the manifold implicitly and the energy function is allowed to take non-zero values off-manifold (which do not affect on-manifold evaluation), this approach is able to model densities on a manifold more effectively.
The figures depict various embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
of an n-dimensional space. The n-dimensional space may also be referred to as a "high-dimensional" space to reflect that the manifold ℳ may be representable in an m-dimensional space. Although some examples are provided below in simple examples of 2 or 3 dimensions, in practice, the high-dimensional space may represent images, chemical or biological modeling, or other data having thousands or millions of independent dimensions. As such, the manifold of the data in the high-dimensional space is typically "infinitely thin" with respect to the high-dimensional space. Formally, a training data store 150 contains a set of points x_i represented in n dimensions: {x_i} ⊂ ℝ^n. The points x_i may also be referred to as training data samples and may be considered to be drawn from an unknown probability density p*(x) to be modeled by the computer model 160. The model is trained to learn a probability density p(x), as represented by trained/learned parameters of the computer model, based on the data points {x_i}. Formally, the data set may be considered drawn from a probability measure supported on the manifold ℳ (e.g., having a "volume" with respect to the dimensionality of the manifold), but which, in the high-dimensional space, may lack a standard Lebesgue measure (i.e., there is no effective measure of "volume" with respect to the n-dimensional space). As such, the training data is modeled as existing on an m-dimensional manifold of the high-dimensional space, in which the manifold is smooth and in which m is typically significantly smaller than n (m << n). An example of such data is shown in the figures.
After training, a sampling module 130 may sample outputs from the probability density represented by the combination of the manifold-defining function 170 and the energy function 180. The samples may represent probabilistic sampling on the learned manifold and thus provide "generative" modeling in the output space, yielding outputs that differ from the individual data points in the training data store 150. This enables the model to generatively create outputs similar in structure and distribution to the data points of the training data in the training data store. Similarly, an inference module 140 may receive a data point or a set of data points to perform probabilistic evaluations with respect to the learned probability density represented by the computer model 160. For example, data points off-manifold may be represented as having a probability measure of zero, and data points on-manifold may have a probability measure as described by the energy function at that point. Similarly, a group of data points may be evaluated with respect to whether it may be considered "in-distribution" or "out-of-distribution" with respect to the trained probability density. Further details of each of these aspects are discussed below.
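As a hedged illustration of how such probabilistic evaluation might be queried in practice, the sketch below assumes a trained manifold-defining network f_theta and energy network e_psi (hypothetical names) and uses a small tolerance on the norm of the manifold-defining output to treat a point as on-manifold; the tolerance-based test and the returned score format are assumptions made for the example.

```python
import torch

def ebim_score(x, f_theta, e_psi, tol=1e-3):
    """Illustrative scoring of a single point x with a trained EBIM.

    f_theta: maps R^n -> R^(n-m); its (approximate) zero set is the manifold.
    e_psi:   maps R^n -> R; unnormalized (negative log) density on the manifold.
    The tolerance-based on-manifold test is an assumption for illustration;
    exactly off-manifold points have probability measure zero.
    """
    with torch.no_grad():
        residual = torch.linalg.vector_norm(f_theta(x))
        if residual > tol:
            return {"on_manifold": False, "energy": None}
        return {"on_manifold": True, "energy": e_psi(x).item()}
```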
To model the manifold, a manifold-defining function 330 is trained to learn the manifold as particular output values of the manifold-defining function. In various embodiments, the manifold is defined by the zero set of the manifold-defining function 330. The zero set is the set of points in the n-dimensional space that, when input to the manifold-defining function 330, yield an output of zero. Formally, the preimage of zero under the manifold-defining function F_θ with parameters θ defines a respective manifold ℳ_θ: ℳ_θ := F_θ^(−1)({0}). Although throughout this disclosure the zero set is used with the value of zero as the manifold-defining output value, other values may equivalently be used that permit the manifold-defining function 330 to learn parameters that define the manifold. The manifold may also be referred to as "implicitly-defined" because, rather than specifying the manifold itself, the manifold is defined with respect to the output of the manifold-defining function.
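As a simple worked illustration of an implicitly-defined manifold (a fixed function rather than a learned one), the unit circle in ℝ² is the zero set of the function F(x) = x₁² + x₂² − 1; the specific points below are chosen only for the example.

```python
import torch

def F(x):
    # Implicitly defines the unit circle in R^2 as the zero set F^{-1}({0}):
    # points with x1^2 + x2^2 = 1 map to 0; all other points map to non-zero values.
    return x[..., 0] ** 2 + x[..., 1] ** 2 - 1.0

on_manifold = torch.tensor([0.6, 0.8])    # 0.36 + 0.64 - 1 ≈ 0 (on the zero set, up to float precision)
off_manifold = torch.tensor([1.0, 1.0])   # 1 + 1 - 1 = 1 (off the zero set)
print(F(on_manifold).item(), F(off_manifold).item())
```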
In one embodiment, the manifold-defining function 330 outputs values in multiple dimensions to account for the independent dimensions in which the n-dimensional space may vary that are not accounted for by the dimensionality of the manifold. In one embodiment, the manifold-defining function is thus defined as F_θ: ℝ^n → ℝ^(n−m), such that the output of the manifold-defining function 330 has a dimensionality based on the difference between the dimensionality n of the high-dimensional space and the dimensionality m of the manifold. Where the zero set defines the manifold, this allows the manifold-defining function to learn to distinguish non-manifold points from the zero set in any of the different output dimensions, enabling the manifold-defining function 330 to learn more complex geometries. That is, because points on the manifold are evaluated as zero across all n−m output dimensions of the manifold-defining function, any nearby points that are not on the manifold may be evaluated as non-zero in any of the n−m output dimensions. In further embodiments, the manifold-defining function 330 may provide a different output dimensionality that provides for effective representation of the manifold ℳ and sufficient dimensional freedom to represent the manifold shape. Because the manifold-defining function 330 can receive points in the native high-dimensional space to evaluate the zero set, the native high-dimensional space is not distorted in defining or making use of the manifold, and the implicit definition allows complex contours of the manifold to be learned with the flexibility of the different output dimensions allowed by the manifold-defining function 330.
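A minimal sketch of such a manifold-defining function with n−m outputs is shown below; the multilayer-perceptron architecture, layer sizes, and activation choice are illustrative assumptions only.

```python
import torch
import torch.nn as nn

# Hedged sketch of a manifold-defining function F_theta: R^n -> R^(n-m).
# The MLP architecture and sizes are assumptions; the approach only requires
# a trainable function whose zero set defines the manifold.
n_dim, m_dim = 100, 5

class ManifoldDefiningFunction(nn.Module):
    def __init__(self, n_dim, m_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, n_dim - m_dim),  # n - m output dimensions
        )

    def forward(self, x):
        return self.net(x)

F_theta = ManifoldDefiningFunction(n_dim, m_dim)
x = torch.randn(8, n_dim)
print(F_theta(x).shape)  # (8, n - m): a point is treated as on-manifold when all outputs are ~0
```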
The energy function 320 outputs an energy density for points in the n-dimensional space: E_ψ: ℝ^n → ℝ. As shown in
Accordingly, the energy-based implicit model 340 is a function of both the energy function 320 and the manifold-defining function 330 (and their parameters); together, the trained models (designated by *) may be represented as an energy-based implicit model, i.e., a pair of trained models (F_θ*, E_ψ*) that defines a probability density P_θ*,ψ* of the trained density model E_ψ* with respect to the manifold ℳ_θ* defined by the trained manifold-defining function F_θ*.
As shown below with respect to the examples in
As shown by
To train the manifold-defining function 330 so that it effectively defines the manifold as its zero set, the manifold-defining function 330 is trained such that it evaluates to zero for the training data and is smoothly defined on the manifold. In one embodiment, training of the manifold-defining function 330 aims to satisfy three conditions: (1) the manifold-defining function evaluates to zero at the training data points; (2) only on-manifold points belong to the zero set, such that off-manifold points evaluate to non-zero values; and (3) the manifold-defining function is smooth, with a Jacobian that does not vanish, when evaluated on the manifold.
To satisfy condition 1, the loss function encourages the manifold-defining function to learn parameters that output zero for each training data point x_i. Since ℳ is the support of P*, condition 1 can be encouraged in one embodiment by minimizing the expectation 𝔼_{x~P*}[∥F_θ(x)∥], evaluated with respect to the data points x_i. That is, P* represents the unknown probability distribution from which training data samples are drawn, such that ∥F_θ(x)∥ is evaluated with respect to the data points x_i.
Condition 2 ensures that only on-manifold points belong to the zero set of the manifold-defining function. In one embodiment, this may be performed by identifying off-manifold points (i.e., points not in the training data) having a low magnitude (e.g., points for which ∥F_θ(x)∥ is close to zero). Where the training data points x_i represent "positive" points for which the manifold-defining function should output zero, these identified points represent the most-relevant "negative" points for which the manifold-defining function should be encouraged to output non-zero values. To implement this, the model is encouraged to increase the output value of the manifold-defining function at these "negative" points, for example by maximizing the norm ∥F_θ(x)∥ for these points. To identify these low-magnitude points, the manifold-defining function may be sampled as though it described an energy density, by applying Langevin dynamics sampling that drifts toward minimized values of ∥F_θ(x)∥ for points that are not in the training data. The application of this approach to obtain off-manifold points (i.e., points not in the training data) based on the manifold-defining function F_θ may be considered a sampling distribution P_θ.
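One plausible instantiation of this negative-sampling step is sketched below; the step size, noise scale, and number of steps are assumptions made for the example rather than prescribed values.

```python
import torch

def sample_negatives(F_theta, x_init, n_steps=20, step_size=1e-2, noise_scale=1e-2):
    """Hedged sketch: Langevin-style dynamics that drift points toward low values
    of ||F_theta(x)||, yielding off-manifold "negative" samples near the zero set
    that the loss can then push away from it. Hyperparameters are illustrative."""
    x = x_init.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        energy = torch.linalg.vector_norm(F_theta(x), dim=-1).sum()
        (grad,) = torch.autograd.grad(energy, x)
        with torch.no_grad():
            x -= step_size * grad                   # descend ||F_theta(x)||
            x += noise_scale * torch.randn_like(x)  # Langevin noise
    return x.detach()
```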
Satisfying condition 3, that the manifold-defining function is smooth when evaluated on the manifold, may be equivalent to ensuring that the Jacobian of the manifold-defining function is non-zero at manifold points. In one embodiment, to do so, the Jacobian evaluated on the manifold is bounded away from zero by encouraging non-zero magnitudes of directional derivatives ∥v^T J_{F_θ}(x)∥ for directions v.
Combining these terms yields a loss function L(θ) for the parameters θ of the manifold-defining function F_θ, which is minimized as an expectation over triples (x, x′, v) ~ (P*, P_θ, ·), where x is a training data point drawn from P*, x′ is a "negative" sample drawn from the sampling distribution P_θ described above, and v is a direction at which the Jacobian term is evaluated (Equation 1). The expectation combines the on-manifold term ∥F_θ(x)∥ with regularization terms that penalize negative samples x′ for which ∥F_θ(x′)∥ is small and penalize directions v for which ∥v^T J_{F_θ}(x)∥ is small; in one embodiment, a ReLU is applied within these regularization terms. In a further embodiment, the ReLU function is replaced with the identity function, particularly for relatively high-dimensional applications. The loss function of Equation 1 is one example loss function that may be minimized in training the parameters of the manifold-defining function.
Including the additional two conditions as regularization terms may help avoid degeneracy in the manifold-defining function, such as losing a smooth manifold definition or allowing off-manifold points to join the zero set.
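A hedged sketch of how the three conditions might be combined into a single training loss is given below; the hinge margins, term weights, and the use of random unit directions v are assumptions made for illustration and are not the exact form of Equation 1.

```python
import torch

def mdf_loss(F_theta, x_data, x_neg, alpha=1.0, beta=1.0, margin=1.0):
    """Hedged sketch of a manifold-defining-function loss with three terms:
    (1) drive ||F_theta(x)|| to zero on training data,
    (2) keep ||F_theta(x')|| away from zero on negative samples,
    (3) keep directional derivatives ||v^T J_F(x)|| away from zero on-manifold.
    Margins, weights, and the random-direction scheme are illustrative assumptions."""
    # (1) On-manifold term: training points should lie in the zero set.
    on_manifold = torch.linalg.vector_norm(F_theta(x_data), dim=-1).mean()

    # (2) Negative term: penalize negative samples whose outputs are close to zero.
    neg_norm = torch.linalg.vector_norm(F_theta(x_neg), dim=-1)
    negative = torch.relu(margin - neg_norm).mean()

    # (3) Jacobian term: penalize small ||v^T J_F(x)|| for a random unit direction v.
    x = x_data.clone().requires_grad_(True)
    out = F_theta(x)
    v = torch.randn_like(out)
    v = v / torch.linalg.vector_norm(v, dim=-1, keepdim=True)
    vjp = torch.autograd.grad((out * v).sum(), x, create_graph=True)[0]  # rows v^T J_F(x)
    jacobian_term = torch.relu(margin - torch.linalg.vector_norm(vjp, dim=-1)).mean()

    return on_manifold + alpha * negative + beta * jacobian_term
```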
Examples of these combinations are shown in
Returning to the energy function 320 and the manifold ℳ_θ: when using the energy function 320 for probability inference or sampling (e.g., to obtain new points in the high-dimensional space), the density of the energy function 320 is only considered for the region defined by the manifold ℳ_θ. As the energy function can evaluate inputs across the high-dimensional space but is trained with points on the manifold, the energy function 320 may freely allow the energy density of off-manifold points to be affected by training gradients, optimizing its parameters for evaluation as a probability density on the manifold. Considered another way, because the energy function is "filtered" through the manifold-defining function 330, the energy function is not constrained to minimizing or otherwise accounting for the energy density of off-manifold points, because these "off-manifold" densities are discarded when "filtered" through the manifold-defining function 330.
As such, because the energy function is evaluated as a probability density only on the defined manifold, the energy function may be considered a constrained energy-based model, in which the energy is constrained to the region of the manifold. As a function of the manifold-defining function and the energy function, the density may be defined, for example, as

p_θ*,ψ(x) = exp(−E_ψ(x)) / ∫_{ℳ_θ*} exp(−E_ψ(y)) dy for x ∈ ℳ_θ*,

where dy can be equivalently thought of as the Riemannian volume form or Riemannian measure of ℳ_θ*. Similarly, the resulting probability measure is represented as P_θ*,ψ*.
As the energy function is defined on a manifold of the high-dimensional space, directly optimizing its parameters with respect to the data distribution is typically intractable. Instead, the energy function may be trained in some embodiments with a contrastive divergence loss that estimates gradients based on the training data points and points sampled from the current energy function. In one embodiment, the contrastive divergence gradient may take the form

∇_ψ L_CD(ψ) = 𝔼_{x~P*}[∇_ψ E_ψ(x)] − 𝔼_{x~P_θ*,ψ}[∇_ψ E_ψ(x)],

in which points are sampled from the probability distribution P_θ*,ψ for the expectation of the right-most term.
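The sketch below shows one plausible contrastive-divergence update under the convention p(x) ∝ exp(−E_ψ(x)); the sample_on_manifold function is a hypothetical placeholder for the constrained sampling procedure described below, and the overall update form is an assumption for illustration.

```python
import torch

def contrastive_divergence_step(E_psi, optimizer, x_data, sample_on_manifold):
    """Hedged sketch of one contrastive-divergence update for the constrained EBM.

    E_psi:              energy network mapping R^n -> R.
    x_data:             batch of training points (on the manifold).
    sample_on_manifold: hypothetical function returning samples from the current
                        model density restricted to the manifold (e.g., via the
                        constrained MCMC procedure described below).
    Assumes the convention p(x) ∝ exp(-E_psi(x)) on the manifold."""
    x_model = sample_on_manifold(E_psi, x_data.shape[0]).detach()

    # CD objective: lower energy of data relative to model samples.
    loss = E_psi(x_data).mean() - E_psi(x_model).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```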
To sample points from the probability distribution P_θ*,ψ, individual points may be sampled with manifold-aware Markov Chain Monte Carlo (MCMC) methods, such as constrained Hamiltonian Monte Carlo (CHMC) sampling. These approaches permit sampling from a probability distribution by exploring the space point-to-point based on the local energy density. Although CHMC is typically applied to analytically known manifolds, it is adapted here to manifolds implicitly defined by neural networks. Points are sampled by iteratively calculating a momentum and determining a subsequent step. The sampling process begins with an initial position x that is updated over k iterations, each of which applies a step to update the position of x.
First, a momentum r may be determined at the current point x by initializing the momentum with a Gaussian sample r′ ~ N(0, I_n) and then projecting it to the null space of J_F(x^(t)) (written as J_F for clarity). The projection to the null space for the momentum r may be defined as:

r ← r′ − J_F^T (J_F J_F^T)^(−1) J_F r′
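A direct sketch of this projection for a single point is shown below; it forms the Jacobian explicitly (practical only for moderate n), and the small diagonal jitter is an assumption added for numerical stability.

```python
import torch
from torch.autograd.functional import jacobian

def project_momentum(F_theta, x, r_prime, eps=1e-6):
    """Hedged sketch: project a Gaussian momentum r' onto the null space of
    J_F(x), i.e. r = r' - J^T (J J^T)^{-1} J r'. Forms the Jacobian densely,
    so it is illustrative rather than efficient for high-dimensional data."""
    J = jacobian(F_theta, x)                                  # shape (n - m, n) for a single point
    JJt = J @ J.T + eps * torch.eye(J.shape[0], device=J.device)  # small jitter for stability
    correction = J.T @ torch.linalg.solve(JJt, J @ r_prime)
    return r_prime - correction
```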
Next, a new position may be determined by determining a constrained Lagrange multiplier λ* ∈ ℝ^(n−m) that satisfies the requirement that the next step is on the manifold, such that the manifold-defining function evaluates the next point to zero: F(x^(t+1)) = 0. In one embodiment, this may be determined by minimizing the magnitude of F evaluated at the candidate next position, expressed as a function of λ, for example via stochastic gradient descent or L-BFGS.
Finally, the next position for x can be determined with a Leapfrog step of step size ε using the constrained Lagrange multiplier λ*.
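One plausible instantiation of such a constrained position update is sketched below; the exact Leapfrog form, the squared-norm objective for λ, and the gradient-descent solver settings are assumptions for illustration rather than the precise update of this embodiment.

```python
import torch

def constrained_step(F_theta, E_psi, x, r, step_size=0.05, n_solver_steps=50, lr=0.1):
    """Hedged sketch of a constrained position update for a single point.

    A candidate Leapfrog-style move x + eps * (r - eps/2 * grad_E - J^T lam) is
    treated as a function of a Lagrange multiplier lam, and lam is chosen by
    gradient descent so that the new position (approximately) satisfies
    F_theta(x') = 0. The update form and solver are illustrative assumptions."""
    x = x.detach().requires_grad_(True)
    grad_E = torch.autograd.grad(E_psi(x).sum(), x)[0]

    def candidate(lam):
        # J^T lam via a vector-Jacobian product; create_graph keeps it differentiable in lam.
        xc = x.detach().requires_grad_(True)
        Jt_lam = torch.autograd.grad((F_theta(xc) * lam).sum(), xc, create_graph=True)[0]
        return x.detach() + step_size * (r - 0.5 * step_size * grad_E - Jt_lam)

    lam = torch.zeros(F_theta(x).shape[-1], requires_grad=True)
    for _ in range(n_solver_steps):
        constraint = torch.linalg.vector_norm(F_theta(candidate(lam))) ** 2
        (g,) = torch.autograd.grad(constraint, lam)
        with torch.no_grad():
            lam -= lr * g                     # drive F(x^{t+1}) toward zero
    return candidate(lam).detach()
```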
In some embodiments, the Jacobian J_F is not explicitly constructed, as doing so may be impractical in high dimensions; instead, vector-Jacobian products J_F^T v = (v^T J_F)^T for v ∈ ℝ^(n−m), and Jacobian-vector products J_F u for u ∈ ℝ^n, are computed with automatic differentiation, which is tractable. Furthermore, the inverse term (J_F J_F^T)^(−1) need not be formed explicitly; it may be applied with an iterative method that requires only the ability to compute the product Av for a vector v, not the matrix A itself. In this case, b = J_F r′ is a Jacobian-vector product, and the operation Av is J_F J_F^T v, which is again computable as a vector-Jacobian product followed by a Jacobian-vector product. Since J_F is a wide matrix, this operation may be most efficiently performed using backward-mode followed by forward-mode auto-differentiation.
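A matrix-free sketch of the same projection, using automatic differentiation for the Jacobian-vector and vector-Jacobian products and a few conjugate-gradient iterations in place of the explicit inverse, is shown below; the solver choice and iteration counts are illustrative assumptions.

```python
import torch
from torch.autograd.functional import jvp, vjp

def project_momentum_matrix_free(F_theta, x, r_prime, n_cg_steps=20):
    """Hedged sketch of the null-space projection without forming J_F.

    b = J r' is a Jacobian-vector product; A v = J J^T v is a vector-Jacobian
    product followed by a Jacobian-vector product; (J J^T)^{-1} b is then
    approximated with a few conjugate-gradient iterations. Illustrative only."""
    def f(z):
        return F_theta(z)

    _, b = jvp(f, (x,), (r_prime,))            # b = J r'

    def A(v):
        _, Jt_v = vjp(f, x, v)                 # J^T v (backward mode)
        _, JJt_v = jvp(f, (x,), (Jt_v,))       # J (J^T v) (forward mode)
        return JJt_v

    # Conjugate gradient for (J J^T) u = b, using only matrix-vector products.
    u = torch.zeros_like(b)
    res = b - A(u)
    p = res.clone()
    rs_old = torch.dot(res, res)
    for _ in range(n_cg_steps):
        Ap = A(p)
        alpha = rs_old / torch.dot(p, Ap)
        u = u + alpha * p
        res = res - alpha * Ap
        rs_new = torch.dot(res, res)
        if torch.sqrt(rs_new) < 1e-8:
            break
        p = res + (rs_new / rs_old) * p
        rs_old = rs_new

    _, Jt_u = vjp(f, x, u)                     # J^T (J J^T)^{-1} J r'
    return r_prime - Jt_u
```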
The two steps described above constitute a single iteration of constrained Langevin dynamics. In practice, many iterations are required to obtain a sample resembling the probability distribution P_θ*,ψ*. To obtain completely new samples (e.g., by the sampling module 130), a similar process may be followed by sampling random noise in the ambient space and projecting it to the manifold, for example by minimizing ∥F_θ*(x)∥, before running the constrained dynamics.
In each of these examples, the manifolds were learned only from data samples, without additional knowledge of the manifold. Quantitative comparisons of density estimates are challenging when manifolds are unknown, because likelihood values are not comparable across different learned manifolds. Fortunately, these manifolds may be examined visually to illustrate the benefits of the EBIM approach.
The class of pushforward density estimation models is large; any of them can serve as a basis of comparison. In these experiments, a simple pushforward energy-based model was used, consisting of an autoencoder with an energy-based model for the density in the latent space.
The first example, shown in
In
Geospatial data
Finally,
The EBIM in this example uses a manifold dimension of 16, which is close to intrinsic dimension estimates of MNIST and Fashion MNIST. The manifold-defining function is parameterized with a small U-Net architecture modified from the implementation in the labml.ai Python package. The U-Net architecture includes skip connections that give it full rank with a large output dimensionality (28×28−16=768). The constrained EBM has a simple convolutional architecture. Two baseline comparisons are provided: an ordinary EBM 900, 905 and a pushforward EBM 910, 915. Samples from all models are provided in FIG. 9, with Fréchet Inception Distance (FID) scores (Heusel et al., 2017) in Table 1 for reference.
The pushforward EBM 910, 915 consists of an autoencoder trained as a Gaussian VAE, with an EBM 900, 905 on the latent space serving as a prior. Although its latent dimension should equal the manifold dimension to provide correct density estimates, reconstructions were poor with a latent dimension of 16. To improve performance of the pushforward EBM, the latent space instead used 30 dimensions to obtain reasonable samples. This mismatch points to an inability of pushforward models to accurately reflect the true geometric structure of complex data sets.
The foregoing description of the embodiments of the invention has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above disclosure.
Some portions of this description describe the embodiments of the invention in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times, to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.
Any of the steps, operations, or processes described herein may be performed or implemented with one or more hardware or software modules, alone or in combination with other devices. In one embodiment, a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
Embodiments of the invention may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus. Furthermore, any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
Embodiments of the invention may also relate to a product that is produced by a computing process described herein. Such a product may comprise information resulting from a computing process, where the information is stored on a non-transitory, tangible computer readable storage medium and may include any embodiment of a computer program product or other data combination described herein.
Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments of the invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.
This application claims the benefit of U.S. Provisional Application No. 63/346,814, filed May 27, 2022, and U.S. Provisional Application No. 63/350,337, filed Jun. 8, 2022, the contents of each of which are hereby incorporated by reference in the entirety.