The present disclosure relates generally to the field of image analysis and image processing. More particularly, the present disclosure relates to systems and methods for disentangling factors of variation in computer vision systems using cycle-consistent variational auto-encoders.
Natural images can be thought of as samples from an unknown distribution with different factors of variation. The appearance of objects in images is influenced by factors of variation that may correspond to shape, geometric attributes, illumination, texture and pose. Depending on a task that is being performed (e.g., image classification), many of these factors serve as a distraction for computer vision systems (including prediction models), and are often referred to as nuisance variables. These nuisance variables are sometimes referred to as uninformative factors of variation.
One solution for mitigating the confusion caused by uninformative factors of variation is to design representations that ignore all nuisance variables. This approach, however, is limited by the quantity and quality of training data available for the computer vision system.
Another solution for mitigating the confusion caused by uninformative factors of variation is to train a classifier of a computer vision system to learn representations, including uninformative factors of variation, by providing sufficient diversity via data augmentation. Generative models that are driven by a "disentangled" (separated) latent space can be an efficient way of controlled data augmentation. Although Generative Adversarial Networks (hereinafter "GANs") have proven to be excellent at generating new data samples, standard GAN architecture does not support inference over latent variables. This prevents control over different factors of variation during data generation. DNA-GANs introduce a fully supervised architecture to disentangle factors of variation; however, acquiring labels for each factor, even when possible, is cumbersome and time-consuming.
Some solutions combine auto-encoders with adversarial training to “disentangle” or separate informative and uninformative factors of variation and map them to separate sets of latent variables. The informative factors, typically specified by the task of interest, are associated with the available source of supervision (e.g. class identity or pose), and are referred to as the specified factors of variation. The remaining uninformative factors are grouped together as unspecified factors of variation. Computer vision using such a model has two benefits. First, the encoder learns to factor out nuisance variables (e.g., unspecified factors of variation) for the task that is being performed. Second, the decoder can be used as a generative model that can generate new samples of images with controlled specified factors of variation and randomized unspecified factors of variation.
Other solutions utilize the EM framework to discover independent factors of variation which describe the observed data. Other solutions learn bilinear maps from style and content parameters to images. Moreover, some solutions use Restricted Boltzmann Machines to separately map factors of variation in images. Further, some solutions model vision as an inverse graphics problem by using a network that disentangles transformation and lighting variations. Still further, some other solutions utilize identity and pose labels to disentangle facial identity from pose by using a modified GAN architecture. SD-GANs introduce a siamese network architecture over DC-GANs and BE-GANs, which simultaneously generates pairs of images with a common identity but different unspecified factors of variation. However, like standard GANs, they lack any method for inference over the latent variables. Yet another solution develops an architecture for visual analogy making, which transforms a query image according to the relationship between the images of an example pair. DNA-GANs present a fully supervised approach to learn disentangled representations. Adversarial auto-encoders use a semi-supervised approach to disentangle style and class representations; however, this approach cannot generalize to unseen object identities. Moreover, another approach combines auto-encoders with adversarial training to disentangle factors of variation in a fully unsupervised manner.
Some solutions have also explored a non-adversarial approach to disentangle factors of variation. These solutions demonstrate that severely restricting the dimensionality of the unspecified latent space discourages the encoder from encoding information related to the specified factors of variation in it. However, this approach is extremely sensitive to the dimensionality of the unspecified space. As shown in
Therefore, in view of existing technology in this field, what would be desirable are systems and methods for disentangling factors of variation in computer vision systems using cycle-consistent variational auto-encoders, which address the foregoing needs.
The present disclosure relates to systems and methods for disentangling factors of variation in computer vision systems using cycle-consistent variational auto-encoders. By sampling from the disentangled latent sub-space of interest, the systems and methods can efficiently generate new data necessary for a particular task. The systems and methods disentangle the latent space into two complementary subspaces by using only weak supervision in the form of pairwise similarity labels. The systems and methods use cycle-consistency in a variational auto-encoder framework to accomplish the objectives discussed herein. A non-adversarial approach used in the systems and methods of the present disclosure provides significant advantage over other prior art solutions.
The foregoing features of the invention will be apparent from the following Detailed Description, taken in connection with the accompanying drawings, in which:
The present disclosure relates to systems and methods for disentangling factors of variation with cycle-consistent variational auto-encoders, as discussed in detail below in connection with
As will be apparent below, the systems and methods of the present disclosure use variational auto-encoders. A variational inference approach for an auto-encoder based latent factor model can be used. The system can define a dataset X={xi}, i=1, . . . , N, which can contain N i.i.d. samples. Each sample can be associated with a continuous latent variable zi drawn from some prior p(z), usually having a simple parametric form. The approximate posterior qϕ(z|x) can be parameterized using the encoder, while the likelihood term pθ(x|z) can be parameterized by the decoder. The architecture, popularly known as Variational Auto-Encoders (VAEs), optimizes the following variational lower-bound equation:
L(θ,ϕ;x)=𝔼qϕ(z|x)[log pθ(x|z)]−DKL(qϕ(z|x)∥p(z))  (1)
The first term in the equation is the expected value of the data likelihood, while the second term, the KL divergence, acts as a regularizer that aligns the approximate posterior with the prior distribution of the latent variables. Employing a linear transformation based reparameterization enables end-to-end training of the VAE using back-propagation. At test time, VAEs can be used as a generative model by sampling from the prior p(z), followed by a forward pass through the decoder. The present systems and methods use the VAE framework to model the unspecified latent subspace.
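The reparameterization and the closed-form KL term of the lower bound above can be sketched as follows. This is a minimal NumPy illustration (not the disclosure's PyTorch implementation), assuming a diagonal Gaussian posterior and the standard normal prior N(0, 1); the function names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (the reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=-1)

# Toy posterior parameters for a batch of 2 samples with 3 latent dimensions.
mu = np.zeros((2, 3))
logvar = np.zeros((2, 3))

z = reparameterize(mu, logvar, rng)
kl = kl_to_standard_normal(mu, logvar)
print(kl)  # KL is 0 when the posterior already equals the prior
```

Because the sampling noise enters only through a linear transformation of mu and logvar, gradients flow through both functions, which is what enables end-to-end back-propagation.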
The systems and methods of the present disclosure also use generative adversarial networks ("GANs"). GANs can model complex, high-dimensional data distributions and generate novel samples from them. GANs include two artificial neural networks, a generator and a discriminator, both of which can be trained together in a min-max game setting by optimizing the loss in the below equation:
minG maxD V(D,G)=𝔼x∼p(x)[log D(x)]+𝔼z∼p(z)[log(1−D(G(z)))]  (2)
The discriminator outputs the probability that a given sample comes from the true data distribution as opposed to being a sample from the generator. The generator learns to map random samples from a simple parametric prior distribution in the latent space to samples from the true distribution. The generator is successfully trained when the output of the discriminator is ½ for all generated samples. DCGANs use CNNs to replicate complex image distributions and can be used for successful adversarial training. Non-adversarial training can be used in the systems and methods of the present disclosure.
Cycle-consistency methods are also used herein. Cycle-consistency has been used to enable a Neural Machine Translation system to learn from unlabeled data by following a closed loop of machine translation. Cycle-consistency can be used to establish cross-instance correspondences between pairs of images depicting objects of the same category. Cycle-consistent architectures can also be used in depth estimation, unpaired image-to-image translation and unsupervised domain adaptation. The present systems and methods also leverage cycle-consistency in the unspecified latent space and explicitly train the encoder to reduce leakage of information associated with specified factors of variation.
The systems and methods of the present disclosure can combine auto-encoders with non-adversarial training to disentangle specified and unspecified factors of variation based on a single source of supervision, like class labels. In particular, the present disclosure introduces a non-adversarial approach to disentangle factors of variation under a weak source of supervision which uses only pairwise similarity labels.
The systems and methods of the present disclosure use an Lp norm based cyclic loss defined in the below equation:
Lcyclic=Lforward+Lreverse
Lcyclic=𝔼x∼p(x)[∥G(F(x))−x∥p]+𝔼y∼p(y)[∥F(G(y))−y∥p]  (3)
Cycle-consistency naturally fits into the variational auto-encoder training framework, where the KL divergence regularized reconstruction constitutes the forward cycle. The systems and methods also use a reverse cycle-consistency loss to train the encoder to disentangle better. As is typical for such loss functions, the model is trained by alternating between the forward and reverse losses.
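The cyclic loss above can be sketched numerically as follows. This is an illustrative NumPy version, with F and G standing in for the two mappings; the toy functions below are hypothetical and chosen only so that the cycle is exactly invertible.

```python
import numpy as np

def lp_cyclic_loss(x, y, F, G, p=1):
    """L_cyclic = E[||G(F(x)) - x||_p] + E[||F(G(y)) - y||_p] over a batch."""
    forward = np.mean(np.linalg.norm(G(F(x)) - x, ord=p, axis=-1))
    reverse = np.mean(np.linalg.norm(F(G(y)) - y, ord=p, axis=-1))
    return forward + reverse

# Toy mappings: F doubles and G halves, a perfectly cycle-consistent pair.
F = lambda a: 2.0 * a
G = lambda a: 0.5 * a
x = np.ones((4, 3))
y = np.ones((4, 3))
print(lp_cyclic_loss(x, y, F, G))  # 0.0 for an exactly invertible pair
```

In the present architecture, F and G play the roles of the encoder and decoder, and the forward term corresponds to the usual reconstruction loss.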
The systems and methods of the present disclosure use a conditional variational auto-encoder based model, where the latent space is partitioned into two complementary subspaces. The first subspace is "s," which controls the specified factors of variation associated with the available supervision in the dataset. The second subspace is "z," which models the remaining unspecified factors of variation. The systems and methods keep s as a real-valued vector space, while z is assumed to have a standard normal prior distribution p(z)=N(0,1). Such an architecture enables explicit control in the specified subspace, while permitting random sampling from the unspecified subspace. Marginal independence between z and s is assumed, which implies complete disentanglement between the factors of variation associated with the two latent subspaces.
The encoder can be written in the following equation: Enc(x)=(ƒz(x), ƒs(x)), where ƒz(x)=(μ,σ) and ƒs(x)=s. Function ƒs(x) is a standard encoder with a real-valued vector latent space, while ƒz(x) is an encoder whose vector outputs parameterize the approximate posterior qϕ(z|x). Since the same set of features extracted from x can be used to create mappings to z and s, the systems and methods can define a single encoder with shared weights for all but the last layer, which branches out to generate the outputs of the two functions ƒz(x) and ƒs(x).
The decoder, x′=Dec(z,s), in this VAE is represented by the conditional likelihood pθ(x|z,s). Maximizing the expectation of this likelihood w.r.t the approximate posterior and s is equivalent to minimizing the squared reconstruction error.
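The shared-trunk encoder and the conditional decoder described above can be sketched as follows. This is a minimal NumPy forward-pass illustration, not the disclosure's PyTorch implementation; all layer sizes, weight initializations and class names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

class Encoder:
    """Shared trunk that branches into f_z(x) -> (mu, log-variance) and f_s(x) -> s."""
    def __init__(self, x_dim=16, h_dim=8, z_dim=4, s_dim=4):
        self.W_h = rng.standard_normal((x_dim, h_dim)) * 0.1   # shared weights
        self.W_mu = rng.standard_normal((h_dim, z_dim)) * 0.1  # branch: posterior mean
        self.W_lv = rng.standard_normal((h_dim, z_dim)) * 0.1  # branch: posterior log-variance
        self.W_s = rng.standard_normal((h_dim, s_dim)) * 0.1   # branch: specified factors

    def __call__(self, x):
        h = np.tanh(x @ self.W_h)  # features shared by both branches
        return h @ self.W_mu, h @ self.W_lv, h @ self.W_s

class Decoder:
    """x' = Dec(z, s): reconstructs from the concatenated latent subspaces."""
    def __init__(self, z_dim=4, s_dim=4, x_dim=16):
        self.W = rng.standard_normal((z_dim + s_dim, x_dim)) * 0.1

    def __call__(self, z, s):
        return np.concatenate([z, s], axis=-1) @ self.W

enc, dec = Encoder(), Decoder()
x = rng.standard_normal((2, 16))
mu, logvar, s = enc(x)
x_rec = dec(mu, s)   # here z is taken at the posterior mean for simplicity
print(x_rec.shape)   # (2, 16)
```

The key design choice mirrored here is that the last layer alone is branch-specific, so the features feeding both ƒz and ƒs are computed once by the shared trunk.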
It is worth noting that the forward cycle does not demand actual class labels at any given time. This can result in a weaker form of supervision. Accordingly, it may be desirable to use images which are annotated with pairwise similarity labels. The forward cycle mentioned above is similar to an auto-encoder reconstruction loss system or method.
During the reverse cycle, an unspecified latent variable z1 is sampled from the prior N(0,1) over the unspecified latent space. Specified latent variables s1=ƒs(x1) and s2=ƒs(x2) are also sampled. The specified latent variables and the sampled unspecified variable are passed through a decoder 44 to obtain reconstructions x1″=Dec(z1,s1) and x2″=Dec(z1,s2), respectively. Unlike the forward cycle, x1 and x2 need not have the same label and can be sampled independently. A third image x1″ 46 and a fourth image x2″ 48 can be passed through the encoder 43. Since both images x1″ 46 and x2″ 48 are generated using the same z1, their corresponding unspecified latent embeddings z1″=ƒz(x1″) and z2″=ƒz(x2″) should be mapped close to each other, regardless of their specified factors. Such a constraint promotes marginal independence of z from s, as images generated using different specified factors could potentially be mapped to the same point in the unspecified latent subspace. This step directly drives the encoder to produce disentangled representations by only retaining information related to the unspecified factors in the z latent space.
The variational loss in the below equation enables sampling of the unspecified latent variables and aids the generation of novel images.
In some cases, the encoder may not necessarily learn a unique mapping from the image space to the unspecified latent space. In other words, samples with similar unspecified factors may get mapped to different unspecified latent variables. Accordingly, to address this observation, the above pairwise reverse cycle loss penalizes the encoder if the unspecified latent embeddings z1″ and z2″ have a large pairwise distance, but not if they are mapped farther away from the originally sampled point z1. Minimizing the pairwise reverse cycle loss can be more beneficial than its absolute counterpart (∥z1−z1″∥+∥z1−z2″∥), both in terms of the loss value and the extent of disentanglement.
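The pairwise formulation described above can be sketched as follows. This is an illustrative NumPy snippet assuming z1″ and z2″ are the re-encoded unspecified embeddings from the reverse cycle; the function name is hypothetical.

```python
import numpy as np

def pairwise_reverse_cycle_loss(z1_pp, z2_pp):
    """Penalize only the pairwise distance between the re-encoded unspecified
    embeddings z1'' and z2'', not their distance to the sampled point z1."""
    return np.mean(np.linalg.norm(z1_pp - z2_pp, axis=-1))

# Two batches of embeddings that coincide, i.e. the encoder mapped both
# generated images to the same unspecified point regardless of s.
z1_pp = np.ones((4, 3))
z2_pp = np.ones((4, 3))
print(pairwise_reverse_cycle_loss(z1_pp, z2_pp))  # 0.0 when the embeddings coincide
```

Note that, unlike the absolute variant (∥z1−z1″∥+∥z1−z2″∥), this loss is zero as long as the two embeddings agree with each other, wherever they land in the unspecified subspace.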
Testing of the above systems and methods will now be explained in greater detail. The performance of the above systems and methods are evaluated on three datasets: MNIST, 2D Sprites and LineMod. The experiments are divided into two parts. The first part evaluates the performance of the systems and methods in terms of the quality of disentangled representations. The second part evaluates the image generation capabilities of the systems and methods.
The MNIST dataset includes hand-written digits distributed among 10 classes. The specified factor in the case of MNIST is the digit identity, while the unspecified factors control digit slant, stroke width, etc.
The 2D Sprites dataset includes game characters (sprites) animated in different poses for use in small-scale indie game development. The dataset includes 480 unique characters according to variation in gender, hair type, body type, armor type, arm type and greaves type. Each unique character is associated with 298 different poses, 120 of which have weapons and the remainder do not. In total, there are 143,040 images in the dataset. The training, validation and test sets contain 320, 80 and 80 unique characters, respectively. This implies that character identity in each of the training, validation and test splits is mutually exclusive, and the dataset presents an opportunity to test the model on completely unseen object identities. The specified factors latent space for 2D Sprites is associated with the character identity, while the pose is associated with the unspecified factors.
LineMod is an object recognition and 3D pose estimation dataset with the following 15 unique objects photographed in a highly cluttered environment: 'ape', 'benchviseblue', 'bowl', 'cam', 'can', 'cat', 'cup', 'driller', 'duck', 'eggbox', 'glue', 'holepuncher', 'iron', 'lamp' and 'phone.' The synthetic version of the dataset is used, which has the same objects rendered under different viewpoints. There are 1,541 images per category, and a split of 1,000 images for training is used, along with 241 images for validation and 300 images for testing. The specified factors in latent space can correspond to the object identity in this dataset. The unspecified factors in latent space can correspond to the remaining factors of variation in the dataset.
During the forward cycle, image pairs are randomly selected which are defined by the same specified factors of variation. During the reverse cycle, the selection of images is completely random. All of the models were implemented using the PyTorch framework.
The quality of the disentangled representations will now be explained in greater detail. A two-layer neural network classifier is trained separately on the specified and unspecified latent embeddings generated by each competing model. Since the specified factors of variation are associated with the available labels in each dataset, the classifier accuracy gives a fair measure of the information related to specified factors of variation present in the two latent subspaces. If the factors were completely disentangled, it would be expected that the classification accuracy in the specified latent space would be perfect, while that in the unspecified latent space would be close to chance. In this experiment, the effect of change in the dimensionality of the latent spaces is also investigated.
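The chance-level baseline referenced above can be made concrete with a small NumPy sketch. This is not the evaluation code from the disclosure; it only illustrates that a classifier carrying no label information over 10 classes (as in MNIST) scores near 0.1, which is the accuracy a fully disentangled unspecified space is expected to approach.

```python
import numpy as np

rng = np.random.default_rng(0)

def accuracy(logits, labels):
    """Top-1 accuracy of a classifier's raw outputs against integer labels."""
    return np.mean(np.argmax(logits, axis=-1) == labels)

# Random logits stand in for a classifier trained on embeddings that contain
# no information about the specified factors (the ideal unspecified space).
n, classes = 10000, 10
logits = rng.standard_normal((n, classes))
labels = rng.integers(0, classes, size=n)
acc = accuracy(logits, labels)
print(round(acc, 2))  # close to the chance level of 1/10
```

In the actual experiment, the logits would come from the two-layer classifier trained on the unspecified embeddings; accuracy well above this baseline would indicate leakage of specified-factor information.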
As discussed in greater detail above, the systems and methods of the present disclosure provide a simple yet effective way to disentangle specified and unspecified factors of variation by leveraging the idea of cycle-consistency. The systems and methods include an architecture that needs only weak supervision in the form of pairs of data having similar specified factors. The architecture does not produce degenerate results and is not impacted by the choice of dimensionality of the latent space. Through the experimental evaluations, it has been shown that the present systems and methods achieve compelling quantitative results on three different datasets and show good image generation capabilities as a generative model. It should also be noted that the cycle-consistent VAE could be trained as the first step, followed by training the decoder with a combination of adversarial and reverse cycle-consistency loss. This training strategy can improve the sharpness of the generated images while maintaining the disentangling capability of the encoder.
Having thus described the system and method in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present disclosure described herein are merely exemplary and that a person skilled in the art may make any variations and modification without departing from the spirit and scope of the disclosure. All such variations and modifications, including those discussed above, are intended to be included within the scope of the disclosure.
This application claims priority to U.S. Provisional Patent Application Ser. No. 62/962,455 filed on Jan. 17, 2020 and U.S. Provisional Patent Application Ser. No. 62/991,862 filed on Mar. 19, 2020, each of which is hereby expressly incorporated by reference.