The present disclosure relates generally to the field of image analysis and processing. More specifically, the present disclosure relates to computer vision systems and methods for diverse image-to-image translation via disentangled representations.
In the field of computer vision, image-to-image (“I2I”) translation aims to enable computers to learn the mapping between different visual domains. Many vision and graphics problems can be formulated as I2I problems, such as colorization (e.g., grayscale to color), super-resolution (e.g., low-resolution to high-resolution), and photo-realistic image rendering (e.g., label to image). Furthermore, I2I translation has recently shown promising results in facilitating domain adaptation.
In existing computer vision systems, learning the mapping between two visual domains is challenging for two main reasons. First, corresponding training image pairs are either difficult to collect (e.g., day scene and night scene) or do not exist (e.g., artwork and real photos). Second, many such mappings are inherently multimodal (e.g., a single input may correspond to multiple possible outputs). To handle multimodal translation, low-dimensional latent vectors are commonly provided along with the input images to model the distribution of the target domain. However, mode collapse can still occur easily, since the generator often ignores the additional latent vectors.
Several efforts have been made to address these issues. In a first example, the “Pix2pix” system applies a conditional generative adversarial network to I2I translation problems; however, its training process requires paired data. In a second example, the “CycleGAN” and “UNIT” systems relax the dependence on paired training data. These methods, however, produce a single output conditioned on the given input image. Further, simply incorporating noise vectors as additional inputs to the model is still not effective in capturing the output distribution due to the mode collapse issue: the generators in these methods are inclined to overlook the added noise vectors. More recently, the “BicycleGAN” system tackled the problem of generating diverse outputs in I2I problems; nevertheless, its training process also requires paired images.
The computer vision systems and methods disclosed herein solve these and other needs by using a disentangled representation framework for machine learning to generate diverse outputs without paired training datasets. Specifically, the computer vision systems and methods disclosed herein map images onto two disentangled spaces: a shared content space and a domain-specific attribute space.
The present disclosure relates to computer vision systems and methods for diverse image-to-image translation via disentangled representations. Specifically, the system first performs a content and attribute disentanglement processing phase, in which the system projects input images onto a shared content space and domain-specific attribute spaces. The system then performs a cross-cycle consistency loss processing phase. During the cross-cycle consistency loss processing phase, the system performs a forward translation stage and a backward translation stage. Finally, the system performs a loss function processing phase. During the loss function processing phase, the system determines an adversarial loss function, a self-reconstruction loss function, a Kullback-Leibler divergence loss (“KL loss”) function, and a latent regression loss function. These processing phases allow the system to perform diverse translation between any two collections of digital images without aligned training image pairs, and to perform translation with a given attribute from an example image.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The foregoing features of the invention will be apparent from the following Detailed Description of the Invention, taken in connection with the accompanying drawings, in which:
The present disclosure relates to computer vision systems and methods for diverse image-to-image translation via disentangled representations, as described in detail below in connection with
Specifically, the computer vision systems and methods disclosed herein map images onto two disentangled spaces: a shared content space and a domain-specific attribute space. A machine learning generator learns to produce outputs using a combination of a content feature and a latent attribute vector. To allow for diverse output generation, the mapping between the latent attribute vector and the corresponding outputs is made invertible, thereby avoiding many-to-one mappings. The attribute space encodes domain-specific information while the content space captures information shared across domains. Representation disentanglement is achieved by applying a content adversarial loss (for encouraging the content features not to carry domain-specific cues) and a reconstruction loss (for modeling the diverse variations within each domain). To handle unpaired datasets, the systems and methods disclosed herein use a cross-cycle consistency loss function based on the disentangled representations. Given a non-aligned pair, the system performs a cross-domain mapping to obtain intermediate results by swapping the attribute vectors from both images. The system then applies the cross-domain mapping again to recover the original input images. The system can generate diverse outputs using random samples from the attribute space, or using desired attributes provided from existing example images. More specifically, the system translates one type of image (e.g., an input image) into one or more different output images using a machine learning architecture.
It should also be noted that the computer vision systems and methods disclosed herein provide a significant technological improvement over existing mapping and translation models. In prior art systems, such as generative adversarial network (“GAN”) systems used for image generation, the core feature lies in the adversarial loss that enforces the distribution of generated images to match that of the target domain. However, many existing GAN frameworks require paired training data. The system of the present disclosure produces diverse outputs without requiring any paired data, and thus has wider applicability to problems where paired datasets are scarce or unavailable, thereby improving computer image processing and vision systems. Further, to train with unpaired data, frameworks such as the CycleGAN, DiscoGAN, and UNIT systems leverage cycle consistency to regularize the training. These methods all perform deterministic generation conditioned on an input image alone, and thus produce only a single output. The system of the present disclosure, on the other hand, enables image-to-image translation with multiple outputs for a given content in the absence of paired data.
Even further, the task of disentangled representation focuses on modeling different factors of data variation with separate latent vectors. Previous work leverages labeled data to factorize representations into class-related and class-independent components. The system of the present disclosure models image-to-image translation as adapting domain-specific attributes while preserving domain-invariant information. Further, the system of the present disclosure disentangles the latent representations into domain-invariant content representations and domain-specific attribute representations. This is achieved by applying a content adversarial loss to the encoders to disentangle domain-invariant and domain-specific features.
It should be understood that
$$\{z_x^c, z_x^a\} = \{E_x^c(x), E_x^a(x)\}, \qquad z_x^c \in \mathcal{C},\; z_x^a \in \mathcal{A}_x$$
$$\{z_y^c, z_y^a\} = \{E_y^c(y), E_y^a(y)\}, \qquad z_y^c \in \mathcal{C},\; z_y^a \in \mathcal{A}_y$$
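By way of a minimal, hypothetical sketch of this encoding step in PyTorch (the module names E_x_c and E_x_a, channel counts, and layer shapes are illustrative assumptions, not the disclosed architecture):

```python
import torch
import torch.nn as nn

# Hypothetical encoder modules for domain X; the architectures actually used
# (convolutions plus residual blocks for content, convolutions plus
# fully-connected layers for attributes) are described later in this disclosure.
E_x_c = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=1, padding=3), nn.ReLU())           # content encoder for domain X
E_x_a = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 8))        # attribute encoder for domain X

x = torch.randn(1, 3, 216, 216)   # an image from domain X
z_x_c = E_x_c(x)                  # content representation z_x^c in the shared content space C
z_x_a = E_x_a(x)                  # attribute representation z_x^a in the domain-specific space A_x
```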
To achieve representation disentanglement, the system applies two strategies. First, in step 26, the system shares weights between the last neural network layer of E_x^c and E_y^c and the first neural network layer of G_x and G_y. In an example, the sharing is based on the assumption that the two domains share a common latent space. It should be understood that, through weight sharing, the system forces the content representations to be mapped onto the same space. However, sharing the same high-level mapping functions cannot guarantee that the same content representations encode the same information for both domains. Next, in step 28, the system uses a content discriminator D^c to distinguish between z_x^c and z_y^c. It should be understood that the content encoders learn to produce encoded content representations whose domain membership cannot be distinguished by the content discriminator. This is expressed as a content adversarial loss via the formula:
$$L_{\mathrm{adv}}^{c}(E_x^c, E_y^c, D^c) = \mathbb{E}_x\!\left[\tfrac{1}{2}\log D^c(E_x^c(x)) + \tfrac{1}{2}\log\big(1 - D^c(E_x^c(x))\big)\right] + \mathbb{E}_y\!\left[\tfrac{1}{2}\log D^c(E_y^c(y)) + \tfrac{1}{2}\log\big(1 - D^c(E_y^c(y))\big)\right]$$
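A minimal sketch of this content adversarial term, assuming a toy content discriminator D_c that outputs a single probability per encoded content map (the module design and tensor shapes below are illustrative only):

```python
import torch
import torch.nn as nn

# Illustrative content discriminator: maps an encoded content map to the
# probability that the content came from domain X.
D_c = nn.Sequential(
    nn.Conv2d(64, 1, 3, padding=1), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Sigmoid())

def content_adversarial_loss(z_x_c, z_y_c, eps=1e-8):
    """L_adv^c: the discriminator should be unable to tell which domain a
    content code came from; each log term is weighted by 1/2 as in the formula."""
    p_x, p_y = D_c(z_x_c), D_c(z_y_c)
    term_x = 0.5 * torch.log(p_x + eps) + 0.5 * torch.log(1 - p_x + eps)
    term_y = 0.5 * torch.log(p_y + eps) + 0.5 * torch.log(1 - p_y + eps)
    return (term_x + term_y).mean()

z_x_c = torch.randn(1, 64, 216, 216)   # encoded content from domain X
z_y_c = torch.randn(1, 64, 216, 216)   # encoded content from domain Y
loss_content_adv = content_adversarial_loss(z_x_c, z_y_c)
```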
It is noted that, since the content space is shared, the encoded content representation is interchangeable between the two domains. In contrast to the cycle consistency constraint (i.e., X → Y → X), which assumes a one-to-one mapping between the two domains, a cross-cycle consistency can be used to exploit the disentangled content and attribute representations for cyclic reconstruction. Using a cross-cycle reconstruction allows the model to train with unpaired data.
During the forward translation stage, the system performs a first translation by swapping the attribute representations (z_x^a and z_y^a) via the following formula:
$$u = G_x(z_y^c, z_x^a), \qquad v = G_y(z_x^c, z_y^a)$$
In step 34, the system performs a backward translation stage. Specifically, the system performs a second translation by exchanging the content representations (z_u^c and z_v^c) via the following formula:
$$\hat{x} = G_x(z_v^c, z_u^a), \qquad \hat{y} = G_y(z_u^c, z_v^a)$$
where $\{z_u^c, z_u^a\} = \{E_x^c(u), E_x^a(u)\}$ and $\{z_v^c, z_v^a\} = \{E_y^c(v), E_y^a(v)\}$.
It should be noted that, intuitively, after two stages of image-to-image translation, the cross-cycle should result in the original images. As such, the cross-cycle consistency loss is formulated as:
$$L_1^{\mathrm{cc}}(G_x, G_y, E_x^c, E_y^c, E_x^a, E_y^a) = \mathbb{E}_{x,y}\!\left[\lVert G_x(E_y^c(v), E_x^a(u)) - x\rVert_1 + \lVert G_y(E_x^c(u), E_y^a(v)) - y\rVert_1\right]$$
where $u = G_x(E_y^c(y), E_x^a(x))$ and $v = G_y(E_x^c(x), E_y^a(y))$.
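A sketch of the cross-cycle procedure and its consistency loss, assuming encoder modules of the kind sketched above; the ToyGenerator below, which tiles the attribute vector and concatenates it with the content map, is an illustrative design choice rather than the generator disclosed herein:

```python
import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    """Illustrative generator G: tiles the attribute vector spatially and
    concatenates it with the content map before a single convolution.
    This is one possible design, not the generator disclosed herein."""
    def __init__(self, content_ch=64, attr_dim=8):
        super().__init__()
        self.net = nn.Conv2d(content_ch + attr_dim, 3, 3, padding=1)

    def forward(self, z_c, z_a):
        z_a_map = z_a.view(z_a.size(0), -1, 1, 1).expand(-1, -1, z_c.size(2), z_c.size(3))
        return torch.tanh(self.net(torch.cat([z_c, z_a_map], dim=1)))

G_x, G_y = ToyGenerator(), ToyGenerator()

def cross_cycle_consistency_loss(x, y, E_x_c, E_x_a, E_y_c, E_y_a):
    # Forward translation: swap the attribute representations.
    u = G_x(E_y_c(y), E_x_a(x))   # u carries the content of y and the attribute of x
    v = G_y(E_x_c(x), E_y_a(y))   # v carries the content of x and the attribute of y
    # Backward translation: swap again using the encodings of u and v.
    x_hat = G_x(E_y_c(v), E_x_a(u))
    y_hat = G_y(E_x_c(u), E_y_a(v))
    # Cross-cycle consistency: the recovered images should match the originals.
    return (x_hat - x).abs().mean() + (y_hat - y).abs().mean()
```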
In addition to training the network via the content adversarial loss and the cross-cycle consistency loss, the system can further train the network via other loss functions. In this regard,
The system determines a self-reconstruction loss L_1^rec. With the encoded content and attribute representations, the system reconstructs each input image within its own domain via the following formula:
$$\hat{x} = G_x(E_x^c(x), E_x^a(x)) \quad \text{and} \quad \hat{y} = G_y(E_y^c(y), E_y^a(y)).$$
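A corresponding sketch of this self-reconstruction term, reusing the illustrative encoder and generator handles from the sketches above:

```python
def self_reconstruction_loss(x, y, E_x_c, E_x_a, E_y_c, E_y_a, G_x, G_y):
    x_hat = G_x(E_x_c(x), E_x_a(x))   # decode x from its own content and attribute codes
    y_hat = G_y(E_y_c(y), E_y_a(y))   # decode y from its own content and attribute codes
    return (x_hat - x).abs().mean() + (y_hat - y).abs().mean()
```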
In step 46, the system determines a Kullback-Leibler (“KL”) divergence loss (“L_KL”). It should be understood that the KL divergence loss brings the attribute representation close to a prior Gaussian distribution, which aids stochastic sampling at a testing stage. The KL divergence loss can be determined using the following formula:
$$L_{\mathrm{KL}} = \mathbb{E}\!\left[D_{\mathrm{KL}}\!\left(p(z^a)\,\big\Vert\,\mathcal{N}(0, I)\right)\right]$$
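One way to realize such a KL term, under the assumption (not stated explicitly above) that the attribute encoder outputs a Gaussian mean and log-variance in the style of a variational autoencoder, is the closed-form divergence to a standard normal:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), averaged over the batch.
    Assumes a Gaussian attribute encoder, which is an illustrative assumption."""
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()

mu = torch.zeros(4, 8)       # hypothetical attribute means, |z^a| = 8
logvar = torch.zeros(4, 8)   # hypothetical attribute log-variances
loss_kl = kl_to_standard_normal(mu, logvar)   # equals 0 when already standard normal
```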
In step 48, the system determines a latent regression loss L_1^latent to fully explore the latent attribute space. Specifically, the system draws a latent vector z from the prior Gaussian distribution as the attribute representation and attempts to reconstruct the latent vector z using the following formula:
$$\hat{z} = E_x^a(G_x(E_x^c(x), z)) \quad \text{and} \quad \hat{z} = E_y^a(G_y(E_y^c(y), z)).$$
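A sketch of this latent regression term under the same illustrative modules, where a latent code drawn from the prior is decoded, re-encoded, and pulled back toward the original draw with an L1 penalty:

```python
import torch

def latent_regression_loss(x, y, E_x_c, E_x_a, E_y_c, E_y_a, G_x, G_y, attr_dim=8):
    z = torch.randn(x.size(0), attr_dim)   # draw z from the prior Gaussian N(0, I)
    z_hat_x = E_x_a(G_x(E_x_c(x), z))      # generate in domain X, then re-encode the attribute
    z_hat_y = E_y_a(G_y(E_y_c(y), z))      # generate in domain Y, then re-encode the attribute
    return (z_hat_x - z).abs().mean() + (z_hat_y - z).abs().mean()
```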
In step 50, the system 10 determines a full objective function using the loss functions from steps 42-48. To determine the full objective function, the system uses the following formula, where the hyper-parameters λ control the importance of each term:
$$\min_{G, E}\max_{D}\; \lambda_{\mathrm{adv}}^{c}\, L_{\mathrm{adv}}^{c} + \lambda_{\mathrm{cc}}\, L_1^{\mathrm{cc}} + \lambda_{\mathrm{adv}}\, L_{\mathrm{adv}} + \lambda_{\mathrm{rec}}^{1}\, L_1^{\mathrm{rec}} + \lambda_{\mathrm{latent}}^{1}\, L_1^{\mathrm{latent}} + \lambda_{\mathrm{KL}}\, L_{\mathrm{KL}}$$
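As an illustration of how the reported weights combine the individual terms (placeholder loss values are used so the snippet runs standalone; in practice the adversarial terms are optimized in an alternating min-max fashion rather than as a plain sum):

```python
import torch

# Placeholder loss values standing in for the terms sketched above.
loss_content_adv = loss_domain_adv = torch.tensor(0.0)
loss_cross_cycle = loss_self_rec = loss_latent = loss_kl = torch.tensor(0.0)

# Hyper-parameter values reported in the training details below.
lambda_c_adv, lambda_cc, lambda_adv = 1.0, 10.0, 1.0
lambda_rec, lambda_latent, lambda_kl = 10.0, 10.0, 0.01

total_loss = (lambda_c_adv * loss_content_adv + lambda_cc * loss_cross_cycle
              + lambda_adv * loss_domain_adv + lambda_rec * loss_self_rec
              + lambda_latent * loss_latent + lambda_kl * loss_kl)
```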
Testing of the above systems and methods will now be explained in greater detail. It should be understood that the systems and parameters discussed below are provided for example purposes only, and that any suitable systems or parameters can be used with the systems and methods discussed above. The system can be implemented using a machine learning framework such as, for example, PyTorch. An input image size of 216×216 is used, except for domain adaptation. For the content encoder E^c, the system uses a neural network architecture consisting of three convolution layers followed by four residual blocks. For the attribute encoder E^a, the system uses a convolutional neural network (“CNN”) architecture with four convolution layers followed by fully-connected layers. The size of the attribute vector is |z^a| = 8. The generator G uses an architecture containing four residual blocks followed by three fractionally strided convolution layers.
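A hypothetical PyTorch sketch of these architectures follows; the layer counts mirror the description above, but the channel widths, kernel sizes, strides, and normalization choices are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Basic residual block (channel count and normalization are assumptions)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

# Content encoder: three convolution layers followed by four residual blocks.
content_encoder = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=1, padding=3), nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    *[ResBlock(256) for _ in range(4)])

# Attribute encoder: four convolution layers followed by fully-connected layers, |z^a| = 8.
attribute_encoder = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(256, 64), nn.ReLU(inplace=True), nn.Linear(64, 8))

# Generator: four residual blocks followed by three fractionally strided
# (transposed) convolution layers; injection of the attribute vector
# (e.g., by concatenation) is omitted here for brevity.
generator = nn.Sequential(
    *[ResBlock(256) for _ in range(4)],
    nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, 3, 7, stride=1, padding=3), nn.Tanh())

x = torch.randn(1, 3, 216, 216)
z_c = content_encoder(x)      # e.g., (1, 256, 54, 54)
z_a = attribute_encoder(x)    # (1, 8)
x_rec = generator(z_c)        # back to (1, 3, 216, 216)
```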
For training, the system uses an Adam optimizer with a batch size of 1, a learning rate of 0.0001, and momentum terms of 0.5 and 0.99. The system 10 sets the hyper-parameters as follows: λ_adv^c = 1, λ_cc = 10, λ_adv = 1, λ_rec^1 = 10, λ_latent^1 = 10, and λ_KL = 0.01. The system 10 further applies L1 regularization on the content representation with a weight of 0.01. The system 10 uses the training procedure of the DCGAN system for training the model with the adversarial loss.
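These training settings can be sketched as follows, reusing the illustrative modules from the architecture sketch above (the adversarial training loop itself is omitted):

```python
import itertools
import torch

# Illustrative optimizer setup matching the settings reported above;
# content_encoder, attribute_encoder, and generator are the sketch modules
# defined in the preceding architecture example.
params = itertools.chain(content_encoder.parameters(),
                         attribute_encoder.parameters(),
                         generator.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.5, 0.99))

# L1 regularization on the content representation with a weight of 0.01,
# added to the total loss before back-propagation.
z_c = content_encoder(torch.randn(1, 3, 216, 216))
content_l1_reg = 0.01 * z_c.abs().mean()
```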
The functionality provided by the system of the present disclosure could be provided by an image-to-image (“I2I”) translation program/engine 106, which could be embodied as computer-readable program code stored on the storage device 104 and executed by the CPU 112 using any suitable, high or low level computing language, such as Python, Java, C, C++, C#, .NET, MATLAB, etc. The network interface 108 could include an Ethernet network interface device, a wireless network interface device, or any other suitable device which permits the server 102 to communicate via the network. The CPU 112 could include any suitable single- or multiple-core microprocessor of any suitable architecture that is capable of implementing and running the I2I translation program/engine 106 (e.g., an Intel microprocessor). The random access memory 114 could include any suitable, high-speed, random access memory typical of most modern computers, such as dynamic RAM (DRAM), etc. The input device 116 could include a microphone for capturing audio/speech signals, for subsequent processing and recognition performed by the engine 106 in accordance with the present disclosure.
Having thus described the system and method in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present disclosure described herein are merely exemplary and that a person skilled in the art can make any variations and modification without departing from the spirit and scope of the disclosure. All such variations and modifications, including those discussed above, are intended to be included within the scope of the disclosure.
This application claims priority to U.S. Provisional Patent Application Ser. No. 62/962,376 filed on Jan. 17, 2020 and U.S. Provisional Patent Application Ser. No. 62/991,271 filed on Mar. 18, 2020, each of which is hereby expressly incorporated by reference.