The present application claims the benefit under 35 U.S.C. § 119 of European Patent Application No. EP 22 21 2910.8 filed on Dec. 12, 2022, which is expressly incorporated herein by reference in its entirety.
The present invention relates to a system and computer-implemented method for training an autoencoder on training data to obtain a generative model for synthesizing new data. The present invention further relates to a system and computer-implemented method for synthesizing the new data using the decoder of the trained autoencoder as generative model. The present invention further relates to a computer-readable medium comprising data representing instructions for a system to perform any computer-implemented method.
A common task in machine learning is generative modeling, which involves developing a probability distribution model of given real-world samples. Such a generative model is intended to allow new data to be synthesized which follows (approximately) the same probability distribution as the real-world data. Data synthesis may be valuable in various application areas, in particular those where insufficient real-world data is available. For example, for the testing or training of machine learnable models in the field of autonomous driving, it is desirable to be able to use sensor data of so-called corner cases, which may represent unexpected and possibly dangerous situations (e.g., near-accidents), to ensure the correct handling of such corner cases by machine learnable models. However, since such corner cases inherently occur only seldom, little sensor data may be available for the testing or training. To address this problem, a generative model may be trained on the available sensor data and used to synthesize new sensor data of corner cases.
Among many different approaches for generative modeling, such as graphical models and generative adversarial networks, the family of variational auto-encoders (VAEs) has become a popular and effective tool for modelling complex distributions. A VAE comprises an encoder to map input data instances to representations in a latent space and a decoder to obtain reconstructed versions of the input data instances from the representations in the latent space. However, there are also problems with VAEs, such as the latent representations being insufficiently informative. This may hinder the synthesis of new data as it may be difficult or even impossible to generate sensible latent representations when synthesizing new data. References [1] and [2] describe deterministic autoencoders which address this problem to some degree. Namely, reference [1] proposes various regularization heuristics to induce a sensible distribution of the latent representations and to prevent collapse into a single latent code. However, the relative performance of individual regularization heuristics varies across datasets. As such, there is a significant risk that a given regularization heuristic does not work on a new dataset. Reference [2] uses a Gaussian Mixture Model (GMM) prior which can also be used to induce a sensible distribution of the latent representations and to prevent collapse into a single latent code. However, to define the GMM, additional domain knowledge may be required, which is not available in many applications, for example when modeling driver behavior [3].
It would be advantageous to be able to train an autoencoder in a manner which addresses one or more of the above-mentioned drawbacks.
In accordance with a first aspect of the present invention, a computer-implemented method and system are provided, for training an autoencoder on training data. In accordance with a further aspect of the present invention, a computer-readable medium is provided, including instructions for causing a processor system to perform the computer-implemented method.
In accordance with the above measures, an autoencoder is trained on training data. The autoencoder may be a deterministic autoencoder which comprises an encoder and a decoder, with the encoder being configured to receive input data and encode the input data to a latent representation in a latent space and the decoder being configured to receive the latent representation and decode the latent representation to obtain a reconstructed version of the input data. As is conventional, the encoder and decoder may be trained on the training data, e.g., using mini-batches of the training data, by optimizing the parameters of the encoder and the decoder so as to reduce a reconstruction error between the input data and its reconstructed version. Autoencoders and their training in as far as described in this paragraph may be conventional.
Unlike conventional autoencoders, according to an example embodiment of the present invention, an affine transformation, which is also known as an affine whitening transformation, is applied to an output of the encoder during the training. This means that the latent representations which are output by the encoder are affine transformed to obtain further latent representations, and the further latent representations are then fed into the decoder, instead of the latent representations directly output by the encoder. The affine transformation may be a parameterized transformation of which the parameters may initially have default or random values. However, the parameters of the affine transformation may be optimized together with the training of the autoencoder, in that during the training of the autoencoder, a mean and covariance of the latent representations of the training data instances processed thus far may be determined, and the parameters of the affine transformation may be updated so that this mean and covariance is shifted towards a target. This may for example involve maintaining a running mean and running covariance and after the completion of each mini-batch of training data adjusting the parameters of the affine transformation.
In this way, a sensible probability distribution of the latent representations is obtained, namely a probability distribution which has a desired mean and desired covariance, without running the risk, as in [1], that a given heuristic does not work on a novel dataset. In addition, unlike [2], no additional domain knowledge is required. Furthermore, the above measures are well suited for mini-batch optimization, where it can be guaranteed that the target mean and target covariance can be achieved, as will be explained in more detail in the detailed description including a comparison to [2].
According to an example embodiment of the present invention, optionally, the mean and the covariance are determined as a running mean and a running covariance across the subsets of the training data. By maintaining a running mean and running covariance during the training, the parameters of the affine transformation may be adjusted several times during the training so that when reaching the end of the training data, the mean and covariance of the latent representations of all the training data at least approximate the target mean and the target covariance.
According to an example embodiment of the present invention, optionally, the target mean is zero. Optionally, the target covariance is an identity covariance. A zero mean, identity covariance is a suitable target.
According to an example embodiment of the present invention, optionally, the updating of the parameters of the encoder and the decoder is further based on a regularization term which penalizes a deviation from a probability distribution defined by the target mean and the target covariance. The periodic updating of the affine transformation may not guarantee that the target mean and the target covariance are reached. To this end, an additional regularization term may be used in the training of the encoder and decoder which penalizes deviations from a probability distribution which is defined by the target mean and the target covariance.
According to an example embodiment of the present invention, optionally, the regularization term is a loss term which is based on Kullback-Leibler divergence.
According to an example embodiment of the present invention, optionally, the encoder and/or the decoder is a neural network. In a specific example embodiment of the present invention, both the encoder and the decoder may be deep neural networks.
According to an example embodiment of the present invention, optionally, the training data comprises audio data and/or image data. For example, the image data may be video data, radar data, LiDAR data, ultrasonic data, motion data, thermal image data, etc.
According to an example embodiment of the present invention, optionally, after the training, the parameters of the affine transformation are output. The parameters of the affine transformation may be considered as auxiliary parameters of the trained autoencoder. By exporting the affine transformation's parameters, the affine transformation may be applied during use of the autoencoder, for example when using the autoencoder to detect outliers based on reconstruction loss.
In a further aspect of the present invention, a method is provided of synthesizing new data using a decoder of an autoencoder as trained in a manner as described elsewhere in this specification. According to an example embodiment of the present invention, the method may include:
The trained decoder may be used to synthesize new data which has a same or at least similar probability distribution as the input data on which the autoencoder was trained. According to an example embodiment of the present invention, for that purpose, the latent space of the autoencoder may be sampled by sampling from a probability distribution defined by the target mean and the target covariance. The samples may then be fed into the decoder to obtain synthesized data instances, such as synthetic image data instances or audio data instances.
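By way of a non-limiting illustration, the sampling and decoding described above may be sketched as follows. The function name `synthesize`, its parameters, and the linear `toy_decoder` are assumptions made for this sketch only; a trained decoder network would take the place of the toy decoder.

```python
import numpy as np

def synthesize(decoder, n_samples, latent_dim, target_mean=None, target_cov=None, seed=None):
    """Sample latent codes from a Gaussian defined by the target mean and
    target covariance, then decode them into synthesized data instances."""
    rng = np.random.default_rng(seed)
    # Default to the zero mean, identity covariance target.
    mean = np.zeros(latent_dim) if target_mean is None else np.asarray(target_mean)
    cov = np.eye(latent_dim) if target_cov is None else np.asarray(target_cov)
    z = rng.multivariate_normal(mean, cov, size=n_samples)
    return decoder(z)

# Hypothetical stand-in for a trained decoder: a fixed linear map from a
# 2-dimensional latent space to a 3-dimensional data space.
W = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
toy_decoder = lambda z: z @ W.T

samples = synthesize(toy_decoder, n_samples=4, latent_dim=2, seed=0)
```

Each row of `samples` is one synthesized data instance; in this toy setting it is simply a linear image of a latent sample.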
According to an example embodiment of the present invention, optionally, the one or more synthesized data instances are used as input data to a test or simulation of a system, device, or machine. The newly generated data may be used for data augmentation purposes, for example for testing or simulating a physical entity which may for example be a system, device, or machine.
According to an example embodiment of the present invention, optionally, the one or more synthesized data instances are used as training data to train a machine learnable model.
In a further aspect of the present invention, a method is provided of performing anomaly detection using an autoencoder as trained in a manner as described elsewhere in this specification. According to an example embodiment of the present invention, the method may include:
In a further aspect of the present invention, a method is provided of performing anomaly detection using an encoder of an autoencoder as trained in a manner as described elsewhere in this specification. According to an example embodiment of the present invention, the method may include:
There may be two ways of performing anomaly detection using the autoencoder. One may involve using (only) the encoder of the trained autoencoder to determine if the probability distribution of the latent representations of input data differs more than a predetermined amount from the probability distribution defined by the target mean and the target covariance used during the training of the autoencoder. Another way may be to calculate the reconstruction loss of the full autoencoder when applied to the input data. Since the autoencoder has been trained to minimize the reconstruction loss on the training data, an excessive reconstruction loss may be indicative of the input data not sharing the same characteristics as the training data and thus representing an outlier with respect to the common characteristics of the training data instances.
It will be appreciated by those skilled in the art that two or more of the above-mentioned embodiments, implementations, and/or optional aspects of the present invention may be combined in any way deemed useful.
Modifications and variations of any system, any computer-implemented method or any computer-readable medium, which correspond to the described modifications and variations of another one of said entities, can be carried out by a person skilled in the art on the basis of the present description.
These and other aspects of the present invention will be apparent from and elucidated further with reference to the example embodiments described by way of example in the following description and with reference to the figures.
It should be noted that the figures are purely diagrammatic and not drawn to scale. In the figures, elements which correspond to elements already described may have the same reference numerals.
The following list of reference numbers is provided for facilitating the interpretation of the figures and shall not be construed as limiting the present invention.
The following describes with reference to
The system 100 may further comprise a processor subsystem 120 which may be configured to, during operation of the system 100, train the autoencoder 154 on the training data 152 in the manner as described with reference to
In some embodiments, the system 100 of
The method 200 is shown to comprise, in a step titled “PROVIDING AUTOENCODER”, providing 210 an autoencoder comprising an encoder to map input data instances to representations in a latent space and a decoder to obtain reconstructed versions of the input data instances from the representations in the latent space, in a step titled “ACCESSING TRAINING DATA”, accessing 220 training data comprising a plurality of training data instances, and in a step titled “TRAINING AUTOENCODER ON TRAINING DATA”, training 230 the autoencoder on the training data. The training may comprise, for a respective subset of the training data, in a step titled “FEEDING TRAINING DATA INTO ENCODER”, feeding 240 training data instances from the subset into the encoder and obtaining representations of the training data instances in the latent space from output of the encoder, in a step titled “APPLYING AFFINE TRANSFORMATION TO LATENT REPRESENTATIONS”, applying 250 an affine transformation to the output of the encoder to obtain the representations of the training data instances in the latent space, in a step titled “DETERMINING MEAN AND COVARIANCE”, determining 255 a mean and covariance of the representations of the training data instances in the latent space, in a step titled “UPDATING PARAMETERS OF AFFINE TRANSFORMATION”, updating 260 parameters of the affine transformation to shift the mean and covariance of the representations of the training data instances towards a target mean and a target covariance, and in a step titled “FEEDING LATENT REPRESENTATIONS INTO DECODER”, feeding 270 the representations of the training data instances into the decoder to obtain reconstructed versions of the training data instances.
The method 200 is further shown to comprise, in a step titled “UPDATING PARAMETERS OF ENCODER AND DECODER”, determining a reconstruction loss based on differences between the training data instances and the reconstructed versions of the training data instances, and based on the reconstruction loss, updating 280 parameters of the encoder and the decoder so as to reduce the reconstruction loss, and in a step titled “OUTPUTTING DATA REPRESENTATION OF DECODER”, after the training, outputting 290 a data representation of the decoder for use as generative model to synthesize new data based on a sampling of the latent space. The training 230 may comprise a number of iterations to loop over different subsets of the training data, as shown by arrow 280 in
With continued reference to the training of the autoencoder, the following is noted. A common task in machine learning is generative modeling, that is, developing a probability distribution model of given real-world samples. Preferably, a generative model should make it easy to generate new samples following (approximately) the same probability distribution as the real-world data. Among many different approaches such as graphical models and generative adversarial networks, the family of variational auto-encoders (VAEs) has become a very popular and effective tool for modelling complex distributions. The variational auto-encoding approach is based on the assumption that the data distribution x∼p(⋅|z) is mainly determined by a latent variable z which follows a prior distribution p(z). While the latent codes (elsewhere also simply referred to as ‘latent representations’) z are not contained in the data, the distribution p(z) is typically assumed to be a known Gaussian. p(x|z) may be learned by considering the posterior p(z|x). That is, p(x|z) may be modeled by a decoder network gθ(x,z) and p(z|x) may be modeled by an encoder network hψ(z,x), both of which may be trained by maximizing the Evidence Lower Bound (ELBO) ELBO(x,θ,ψ)=−λKL(hψ(z,x)∥p(z))+Ez∼h
While being superior to alternative generative models, VAEs still have limitations. For example, VAEs may suffer from over-regularization which requires careful tuning of the weight λ of the Kullback-Leibler (KL) loss term. Furthermore, the computation of the ELBO may involve drawing random variables z∼hψ(⋅,x), which may introduce additional randomness into the mini-batch training and consequently make it more brittle than standard mini-batch neural network training. As an alternative, reference [1] proposes deterministic autoencoders where the decoder network x=gθ(z) and the encoder network z=hψ(x) are deterministic functions which are trained by a loss l(x,θ,ψ)=Δ(x−gθ(hψ(x)))+ρ(gθ,hψ) which is composed of a reconstruction loss Δ(x−gθ(hψ(x))) (e.g., squared loss) and a regularization term ρ(gθ,hψ) (e.g., squared norm of the networks' gradients or squared norm of parameters θ,ψ). This loss may be combined with ex-post density estimation, meaning that once the encoder-decoder pair has been trained, the predicted codes ẑ=hψ(x) for all the training data may be collected and a simple generative model such as a Gaussian Mixture Model (GMM) may be fitted to the set {ẑ} of codes. New samples may be drawn by first sampling ẑ from the fitted GMM and then feeding the sample through the decoder, x̂=gθ(ẑ), thereby obtaining synthesized data instances for further use. The authors of reference [2] extend the idea of the deterministic autoencoder. Here, a loss ρ(gθ,hψ)=ρ(hψ(x))=ρ1(hψ(x))+ρ2(hψ(x)) is proposed where the first part penalizes deviation of the cumulative distribution function of the marginals p(ẑi) of the predicted latent codes ẑ=hψ(x) from the cumulative distribution function of the marginals p(zi) of an assumed prior Gaussian Mixture Model. The second part penalizes the deviation of the covariance of the predicted codes from the covariance of the prior Gaussian Mixture Model. The approach described in [2] demonstrated improved sample quality.
The autoencoder as described in this specification may also be considered a deterministic autoencoder, but one which allows using a simple and uninformative zero mean, identity covariance Gaussian prior. A zero mean, identity covariance Gaussian prior has been demonstrated to be quite effective in complex real-world applications such as modeling human driver behavior [3]. However, even for this simple prior, typical discrepancy measures such as the KL divergence, the Wasserstein distance, or the Hellinger distance have no known general closed-form expression for arbitrary continuous distributions of the predicted codes. Hence, the goal of exactly matching the prior distribution p(z) by the distribution of the predicted latent codes p(ẑ) may be relaxed, and instead, only the mean and covariance of the predicted latent codes may be matched with the prior, i.e., moment matched. If the distribution of the latent codes p(ẑ) is a Gaussian, this moment matching results in perfect distribution matching.
To obtain the zero mean, identity covariance Gaussian prior, an affine transformation may be used of which the parameters may be adjusted during training of the autoencoder so that the predicted latent codes of the training data have approximately zero mean and identity covariance. This may for example involve training the autoencoder on mini-batches of training data, keeping a running mean and running covariance across the mini-batches, and adjusting the parameters of the affine transformation between the mini-batches so that once the training has passed over the entire training data, the affine transformation's parameters are such that they yield an approximately zero mean and identity covariance for the predicted latent codes of the entire training data. This affine transformation may be used with a regularization loss on the mean and the covariance of latent codes, and a simple scheduler for the regularization loss that ensures that the iterative optimization over reconstruction and weighted regularization loss finally results in accurate moment-matching. The affine transformation, the regularization loss term and the scheduler typically fit easily into a standard deep learning training workflow and thereby enable an encoder and decoder pair to be trained that induces a latent code distribution with zero mean and identity covariance. When synthesizing samples using the trained autoencoder, latent codes may be sampled either from a zero mean, identity covariance Gaussian or from a post-hoc estimated Gaussian Mixture Model similar to that proposed in reference [1].
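As a minimal sketch of the running-statistics idea described above, the following assumes an exponential moving average over mini-batches and a Cholesky-based whitening; the class name `RunningWhitening`, the `momentum` parameter, and the choice to track the raw encoder outputs are assumptions of this sketch and are not prescribed by this specification. A fixed random linear map stands in for the encoder output.

```python
import numpy as np

class RunningWhitening:
    """Affine map T(z - t) whose parameters follow a running mean and covariance."""

    def __init__(self, dim, momentum=0.99):
        self.momentum = momentum
        self.mean = np.zeros(dim)   # running mean of raw encoder outputs
        self.cov = np.eye(dim)      # running covariance of raw encoder outputs
        self.t = np.zeros(dim)      # shift parameter of the affine transformation
        self.T = np.eye(dim)        # linear parameter of the affine transformation

    def update(self, raw_codes):
        """Update running statistics from one mini-batch, then refresh T and t."""
        m = self.momentum
        self.mean = m * self.mean + (1.0 - m) * raw_codes.mean(axis=0)
        self.cov = m * self.cov + (1.0 - m) * np.cov(raw_codes, rowvar=False)
        # With cov = L L^T, choosing T = L^{-1} gives T cov T^T = I.
        L = np.linalg.cholesky(self.cov + 1e-8 * np.eye(len(self.mean)))
        self.T = np.linalg.inv(L)
        self.t = self.mean.copy()

    def __call__(self, raw_codes):
        """Apply the affine transformation row-wise."""
        return (raw_codes - self.t) @ self.T.T

# Toy stand-in for encoder outputs: correlated, shifted Gaussian codes.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0], [1.0, 0.5]])
offset = np.array([3.0, -1.0])
whitening = RunningWhitening(dim=2, momentum=0.99)
for _ in range(1000):
    raw = rng.normal(size=(64, 2)) @ A.T + offset
    whitening.update(raw)
```

In this toy setting the encoder output distribution is fixed, so the running statistics converge; during actual training the encoder changes between mini-batches, so the running statistics lag behind and the moments are only approximately matched.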
In general, after training, the encoder may be used to predict latent codes of given input data and may be used to detect outliers by comparing the predicted latent codes to the prior distribution. The decoder may be used to synthesize new in-distribution data by first sampling from the prior and then decoding the samples by feeding the samples into the decoder. This new data may be used, for example, in the context of data augmentation or for simulation-based validation. Another way to detect outliers may be by passing input data through both the encoder and the decoder and calculating the reconstruction loss. If this reconstruction loss is too large, then the input data may be considered an outlier.
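The two outlier checks described above may be sketched as follows. The function names, the squared-error and squared-norm scores, and the thresholds are assumptions of this sketch; in practice, `encode` would include the affine transformation and the thresholds might be chosen, for example, from the scores observed on the training data.

```python
import numpy as np

def reconstruction_outliers(x, encode, decode, threshold):
    """Flag rows of x whose squared reconstruction error exceeds the threshold."""
    errors = np.sum((x - decode(encode(x))) ** 2, axis=1)
    return errors > threshold, errors

def latent_outliers(x, encode, threshold):
    """Flag rows whose latent code lies far from a zero mean, identity covariance prior."""
    z = encode(x)
    sq_norm = np.sum(z ** 2, axis=1)
    return sq_norm > threshold, sq_norm

# Hypothetical toy autoencoder: keep the first coordinate, drop the second.
encode = lambda x: x[:, :1]
decode = lambda z: np.hstack([z, np.zeros_like(z)])

x = np.array([[1.0, 0.0], [-2.0, 0.0], [0.5, 3.0]])
recon_flags, recon_errors = reconstruction_outliers(x, encode, decode, threshold=1.0)
latent_flags, latent_scores = latent_outliers(x, encode, threshold=3.0)
```

The two checks can disagree: the third row reconstructs poorly although its latent code is unremarkable, whereas the second row reconstructs perfectly but has a comparatively large latent code.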
With continued reference to the training of the autoencoder, it is noted that it is an important task in deterministic autoencoders to force the distribution of the predicted latent codes p(ẑ) with ẑ=hψ(x) towards a sensible prior distribution p(z). In the following, measures are described that force the distribution towards an uninformative zero mean and identity covariance Gaussian prior. In particular, instead of forcing the entire predictive distribution p(ẑ) towards the prior, the following measures may only be applied to the mean and the covariance of the predictive distribution. While this might not lead to full prior matching, an important advantage is that the loss computation is possible in closed form whereas full distribution losses in general require some approximation procedure.
It is noted that for any distribution p(ẑ) with finite mean μ and covariance Σ, there exist multiple affine so-called whitening transformations T(ẑ−t) such that p(T(ẑ−t)) has zero mean and identity covariance, such as for example t=μ and T being the inverse of the Cholesky factor of Σ, denoted Σ^(−1/2). Given an initially trained encoder ẑ=hψ(x), the mean and the covariance of the latent codes of the training data may be obtained by a single pass over the entire training data. Having determined the mean and the covariance, an affine transformation may be determined which maps the mean and the covariance to the zero mean and identity covariance. With this affine transformation in place, the autoencoder may then be trained again on the training data. Alternatively, as also described elsewhere in this specification, the parameters of the affine transformation may also be approximated during the training of the encoder using a rolling approach, such as an exponential rolling average. In general, using the affine whitening transform, the autoencoder may take the form of gθ(T(hψ(x)−t)) with T(hψ(x)−t) being the predicted latent codes.
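For illustration, the Cholesky-based whitening mentioned above may be sketched as follows, assuming the latent codes of the training data are available as rows of an array. The function name `whitening_from_codes` and the small `eps` stabilizer are assumptions of this sketch.

```python
import numpy as np

def whitening_from_codes(codes, eps=1e-8):
    """Return (T, t) such that T @ (z - t) has zero mean and identity covariance
    over the given codes (rows are codes, columns are latent dimensions)."""
    t = codes.mean(axis=0)
    cov = np.cov(codes, rowvar=False) + eps * np.eye(codes.shape[1])
    L = np.linalg.cholesky(cov)   # lower triangular, cov = L @ L.T
    T = np.linalg.inv(L)          # hence T @ cov @ T.T = I
    return T, t

# Toy latent codes with a non-trivial mean and covariance.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [-1.0, 0.3, 0.7]])
codes = rng.normal(size=(5000, 3)) @ A.T + np.array([1.0, -2.0, 0.5])

T, t = whitening_from_codes(codes)
whitened = (codes - t) @ T.T   # row-wise application of T(z - t)
```

Other choices of T, such as the inverse of the symmetric matrix square root of Σ, whiten equally well; any such T remains a valid whitening transformation when multiplied from the left by an orthogonal matrix.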
The place of the affine transformation during training is illustrated in
It is noted that when updating the affine transformation during training, there is no guarantee that the final distribution of predicted latent codes p(T(hψ(x)−t)) is very close to the desired zero mean and identity covariance. Furthermore, it might be desirable to stabilize training by also providing an explicit training signal to the encoder that targets the desired mean and covariance of the output distribution. Hence, additional regularization of the predicted latent codes may be used. Given the empirical mean and covariance
of the predicted codes in a batch, the regularization loss is given as
where |{ẑ}| is the batch size and where k is the dimensionality of z. This loss is sensible because if |{ẑ}| goes to infinity it holds
which is uniquely minimized by μ̂({ẑ})=0 and Σ̂({ẑ})=I. That is, the loss can only be optimized by reaching the desired zero mean and identity covariance. In case Lreg({ẑ}) is optimized on small mini-batches as commonly done in autoencoder training [1], [2], the factor
makes sure that mini-batch optimization targets the same objective as full training set optimization. This is because even for ẑ being distributed according to mean μ̂ and covariance Σ̂, the expected loss Ep(ẑ)[Lreg({ẑ})] is given as
with constant C. If the
correction were not applied, the mini-batch optimization may be biased towards a small covariance of the latent codes.
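As a hedged illustration of a moment-based regularizer of this kind, the following sketch uses the standard closed-form KL divergence between a Gaussian with the empirical batch mean and covariance and a zero mean, identity covariance Gaussian, namely ½(tr(Σ̂)+μ̂ᵀμ̂−k−log det Σ̂). This is an assumption made for illustration: it shares the stated property of being uniquely minimized at zero mean and identity covariance, but it is not asserted to be the exact loss of this specification, and it does not include the batch-size-dependent correction factor discussed above. The function name `moment_kl_loss` is hypothetical.

```python
import numpy as np

def moment_kl_loss(codes):
    """KL( N(mu_hat, Sigma_hat) || N(0, I) ) computed from a batch of latent codes.

    Assumes at least two latent dimensions and more rows than dimensions,
    so that the empirical covariance is a non-singular matrix."""
    k = codes.shape[1]
    mu = codes.mean(axis=0)
    sigma = np.cov(codes, rowvar=False)
    _, logdet = np.linalg.slogdet(sigma)
    return 0.5 * (np.trace(sigma) + mu @ mu - k - logdet)

# Build a batch with exactly zero empirical mean and identity empirical covariance.
rng = np.random.default_rng(0)
raw = rng.normal(size=(256, 3))
centered = raw - raw.mean(axis=0)
L = np.linalg.cholesky(np.cov(centered, rowvar=False))
matched = centered @ np.linalg.inv(L).T

loss_matched = moment_kl_loss(matched)               # approximately zero
loss_mismatched = moment_kl_loss(2.0 * matched + 1.0)  # strictly positive
```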
The encoder-decoder pair may thus be trained on the sum of losses l(x,θ,ψ)=Δ(x−gθ(hψ(x)))+λLreg(T(hψ(x)−t)) where Δ is any established reconstruction loss, e.g., an L2 loss, and where λ is the weight of the latent space regularization. For the autoencoder as described in this specification, λ is preferably set to a small value initially, e.g., 0.001, and gradually increased, e.g., by a factor of 10, over the course of the training. As a variant of the penalty method for constrained optimization, this procedure may ensure that mean and covariance are closely matched after the training.
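One possible scheduler for the weight λ may be sketched as follows. The initial value 0.001 and the factor 10 follow the examples given above, whereas the step interval `every`, the cap `maximum`, and the function names are assumptions of this sketch.

```python
def lambda_schedule(epoch, initial=1e-3, factor=10.0, every=10, maximum=1e3):
    """Weight of the latent regularization: starts small and is multiplied
    by `factor` every `every` epochs, up to `maximum`."""
    return min(initial * factor ** (epoch // every), maximum)

def total_loss(reconstruction_loss, regularization_loss, epoch):
    """Weighted sum of the reconstruction loss and the latent regularization loss."""
    return reconstruction_loss + lambda_schedule(epoch) * regularization_loss

weights = [lambda_schedule(e) for e in (0, 10, 20, 100)]
```

A stepwise schedule is only one option; any schedule that increases λ monotonically plays the same role in a penalty-method setting.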
In general, the autoencoder described in this specification may receive an input data instance, such as an image, and try to reconstruct the image after passing it through a latent space bottleneck. To this end, the autoencoder may map the input space onto a latent space of reduced dimension, and then back to the original space. In particular, an encoder, which may for example be a neural network, may map the input data to a latent code in an intermediate latent space. Using an estimated affine whitening transformation, the latent code in this intermediate latent space may be mapped to the final latent space. The resulting latent code may then be input into a decoder, which may for example be a neural network, to obtain a reconstructed version of the input data instance, e.g., a reconstructed image. The encoder or the decoder or both may for example be given by deep neural networks, but any other differentiable model can also be used. During training, the parameters of the affine transformation may be updated, and the encoder-decoder pair may be trained to minimize a linear combination of reconstruction error and regularization.
It is noted that while the affine transformation is described to be used to obtain an approximately zero mean, identity covariance of the probability distribution of the latent space, any other target mean and target covariance may be used as well.
The method 400 is shown to comprise, in a step titled “PROVIDING TRAINED DECODER”, providing 410 a decoder of a trained autoencoder as described elsewhere in this specification, in a step titled “OBTAINING LATENT REPRESENTATION THROUGH SAMPLING OF PROBABILITY DISTRIBUTION”, sampling 420 a latent space to obtain one or more samples, wherein the sampling assumes a latent sample distribution defined by a target mean and a target covariance, and in a step titled “FEEDING LATENT REPRESENTATION INTO DECODER TO OBTAIN SYNTHETIC DATA”, feeding 430 the one or more samples into the decoder to obtain one or more synthesized data instances. The method 400 may comprise a number of iterations to generate several new data instances, as shown by arrow 440 in
In general, each system described in this specification, including but not limited to the system 100 of
It will be appreciated that, in general, the operations or steps of the computer-implemented methods 200 and 400 of respectively
Each method, algorithm or pseudo-code described in this specification may be implemented on a computer as a computer implemented method, as dedicated hardware, or as a combination of both. As also illustrated in
Examples, embodiments or optional features, whether indicated as non-limiting or not, are not to be understood as limiting the present invention.
Mathematical symbols and notations are provided for facilitating the interpretation of the present invention and shall not be construed as limiting the present invention.
It should be noted that the above-mentioned embodiments illustrate rather than limit the present invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the present invention. Any reference signs placed between parentheses shall not be construed as limiting the present invention. Use of the verb “comprise” and its conjugations does not exclude the presence of elements or stages other than those stated. The article “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. Expressions such as “at least one of” when preceding a list or group of elements represent a selection of all or of any subset of elements from the list or group. For example, the expression, “at least one of A, B, and C” should be understood as including only A, only B, only C, both A and B, both A and C, both B and C, or all of A, B, and C. The present invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. When the device is described as including several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different embodiments does not indicate that a combination of these measures cannot be used to advantage.
| Number | Date | Country | Kind |
|---|---|---|---|
| 22 21 2910.8 | Dec 2022 | EP | regional |