The present embodiments generally relate to a method and an apparatus for unfolding a first latent space onto a second latent space, and more particularly to unfolding latent space based on neural network. The present embodiments also generally relate to methods and apparatuses for encoding or decoding an image or a video based on neural network.
Generative models such as GANs (Generative Adversarial Networks) (Creswell, A. et al., "Generative adversarial networks: An overview", IEEE Signal Processing Magazine, 35(1), 53-65) are machine learning techniques that learn the distribution of given objects (e.g. images) and generate plausible new ones. Recently, GANs have attracted interest not only for their generative capability, but also because their latent (aka hidden) space exhibits good properties emerging from its disentangled nature: the generation factors (attributes) appear to be more "linearly" separable, or disentangled, than in the original space of the objects.
Hence, many techniques have been developed to project an object onto the GAN latent representation and manipulate it. For instance, in the case of a facial image, a single facial attribute such as "lipstick" can be changed.
In image editing, StyleGAN is a GAN architecture whose intermediate latent space provides interpretability and disentanglement properties. This means that to change an attribute, only the related components of the intermediate latent space have to be changed, which is useful in image editing tasks. Recent state-of-the-art methods in image editing (e.g. InterFaceGAN) rely on the StyleGAN latent space due to this property, and generally consist of two steps:
InterFaceGAN (Shen, Y. et al., "InterFaceGAN: Interpreting the disentangled face representation learned by GANs", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020) assumes that the attributes are linearly separable and performs edits along the direction orthogonal to the separating hyperplane. The quality of the edited image depends on how well the image of interest is represented in the latent space of the GAN, and such a representation can lose the geometrical and semantic relationships of the perceptual image space. In other words, two geometric limitations of the latent space have been identified: (a) Euclidean distances differ from image perceptual distances, and (b) disentanglement is not optimal, so that separating facial attributes with a linear model is a limiting hypothesis. For instance, an edit on one attribute of an image may impact other attributes in the original space.
Therefore, there is a need for improving the state of the art.
According to an embodiment, a method for unfolding a first latent space onto a second latent space is provided, which comprises:
According to an embodiment, an apparatus for unfolding a first latent space onto a second latent space is provided, which comprises one or more processors configured for:
According to an embodiment, the first latent space is obtained from a Generative Adversarial Network. According to another embodiment, the at least one constraint is at least one of a global constraint or a local constraint. According to another embodiment, the unfolding is a semantic unfolding or a geometrical unfolding or both.
According to another embodiment, the unfolding uses a neural network. In a variant, the unfolding is based on an invertible transformation. In a further variant, the transformation is a normalizing flow.
According to another embodiment, the at least one object is an image.
According to another embodiment, a method for encoding at least one image is provided, wherein encoding at least one image includes obtaining a first latent representation of the image, in a first latent space, obtaining a second latent representation of the image in a second latent space, encoding the second latent representation as image or video data.
According to another embodiment, a method for decoding at least one image is provided, wherein decoding at least one image from image or video data includes decoding from the image or video data a latent representation of the image, obtaining another latent representation of the image from the decoded latent representation, generating the decoded image from the other latent representation.
According to another embodiment, a method for video encoding and a method for video decoding are provided.
One or more embodiments also provide an apparatus comprising one or more processors configured for performing any one of the embodiments of the methods cited above.
One or more embodiments also provide a computer program comprising instructions which, when executed by one or more processors, cause the one or more processors to perform any one of the methods according to any of the embodiments described above. One or more of the present embodiments also provide a computer readable storage medium having stored thereon instructions for editing a video shot, encoding at least one image or a video, or decoding at least one image or a video according to any of the embodiments described above.
One or more embodiments also provide a bitstream comprising image or video data encoded according to any one of the embodiments of the encoding method cited above. One or more of the present embodiments also provide a computer readable storage medium having stored thereon a bitstream described above.
One or more embodiments also provide a method for transmitting a bitstream comprising image or video data encoded according to any one of the embodiments of the encoding method described herein. One or more embodiments also provide an apparatus for transmitting a bitstream comprising image or video data encoded according to any one of the embodiments of the encoding method described herein.
A method for unfolding a latent space is proposed, and more particularly a latent space of GANs, using semantic and/or geometrical constraints. Such a method provides a new, desired proxy space wherein operations on an object's attributes, such as image manipulation, are made easier and more efficient.
According to an embodiment, the method unfolds (geometrically speaking) the latent space of any given GAN by imposing additional constraints on the semantics of the objects and/or on their geometrical relationship. To do so, a continuous and invertible (bijective) transformation (i.e. a normalizing flow) is learned from the original latent space (W+) to a new proxy latent space (W*). Known methods for normalizing flows are described in "Normalizing Flows: An Introduction and Review of Current Methods", I. Kobyzev, S.J.D. Prince, M.A. Brubaker, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
To learn the desired transformation for unfolding, at least one of the following constraints is added:
The properties of this new space make it more suitable for operations on objects projected onto this new space. For instance, such operations comprise manipulation on images. According to this example, image editing is made easier and more efficient.
Image/Video Editing: When editing a natural object, one can project it to its hidden representation and manipulate it (for faces, beautification/de-aging/social media editing). Editing in this new space is more efficient because of the properties that have been enforced. Such methods could either be embedded on a user's smartphone or deployed on the cloud of social networks. Since the editing is more disentangled, the user has more editing capabilities with better results.
At 320, the first latent space is unfolded onto a second latent space, based on at least one constraint.
As discussed above, the constraint may be a global constraint or local constraint. The constraint may be a semantic constraint which satisfies, in the second latent space, a linear separation of the attributes of the object that has been projected onto the first latent space at 310.
In another variant, the constraint may be a geometrical constraint which enforces a match between the Euclidean distance determined between latents in the second latent space and the corresponding distance in the original space.
As will be discussed further below, according to an embodiment, the unfolding at 320 is based on a neural network that learns an invertible transformation, such as a normalizing flow.
According to an aspect of the present disclosure, the method for unfolding provided herein avoids retraining the GAN, which would be difficult and computationally expensive, in order to overcome the aforementioned limitations. The method learns a transformation that maps objects into the second latent space wherein the attributes of the objects are linearly separable and disentangled, i.e. they can be separated by hyperplanes, which was not perfectly the case in previous approaches, and wherein the latent Euclidean distance mimics the perceptual distance in the original space, e.g. the image space when the objects are images.
Normalizing Flows (NFs) are another type of generative model consisting of diffeomorphic transformations between a simple known distribution and an arbitrarily complex distribution. Due to the constraints that must be satisfied (e.g. bijectivity, tractable inverse and Jacobian determinant), the expressivity of such models is limited compared to others (e.g. GANs).
E, G and C are fixed during the training while only T is learned. Dashed arrows mean that the corresponding modules are used only during training.
It is discussed below how the latent space (W*) that satisfies the two aforementioned properties is learned, and more particularly the transformation T that maps a latent code into the new latent space W*:
In addition, according to the present principles, other properties that are useful for editing could also be satisfied (i.e. Wa*-ID). Note that the proposed approach only requires the bijectivity of the NF; the prior distribution in the latent space is thus not imposed, as density estimation is of no interest here.
It is assumed that a pretrained StyleGAN2 generator G is available; such a generator G takes a latent code w ∈ W+ and generates a high resolution image I (i.e. 1024×1024). A bijective transformation T: W+→W* is thus learned, which maps a latent code w ∈ W+ to w* ∈ W*. To return to W+, the inverse T−1: W*→W+ is used. The focus is on real images, so it is assumed that a pretrained encoder E is available that embeds the image in W+ such that G(E(I)) ≈ I.
An objective here is to learn the mapping T that maps the latent codes to Wd* such that the latent distance in this space is similar to the perceptual distance in the image space. This property is obtained by minimizing the discrepancy between the latent distance and the perceptual distance as below:
S1 and S2 are two disjoint sets of image samples of size N. The first term is the squared latent Euclidean distance (Dlatent) and Dperceptual(Ii, Ij) is the perceptual distance between Ii and Ij. Dperceptual could be any perceptual distance; as an example, VGG16 could be used. λs is used to rescale Dperceptual to be in the same range as Dlatent. This scaling factor could however be omitted if the NF learns the normalization factor.
In some cases the normalization factor is needed; for instance in image editing the scaling factor needs to be known. Thus, in a variant, one scaling factor may be chosen and the NF model forced to have a negligible effect on scaling. An example of a scaling factor value is λs=10, but other values are also possible.
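For illustration, a minimal PyTorch-style sketch of this distance-unfolding objective is given below. It assumes a bijective mapping T, latent codes flattened per sample, and a perceptual distance callable (e.g. built on VGG16 features); names and the absolute-difference form of the penalty are assumptions, not the exact implementation.

```python
import torch

def distance_unfolding_loss(T, w_a, w_b, img_a, img_b, d_perceptual, lambda_s=10.0):
    """Encourage Euclidean distances in W* to match (rescaled) perceptual image distances.

    w_a, w_b: two disjoint batches of latent codes in W+, flattened to shape (N, D).
    d_perceptual: callable returning a per-pair perceptual distance (e.g. from VGG16 features).
    """
    ws_a, ws_b = T(w_a), T(w_b)                        # map both batches to the proxy space W*
    d_latent = ((ws_a - ws_b) ** 2).sum(dim=-1)        # squared Euclidean distance in W*
    d_percep = lambda_s * d_perceptual(img_a, img_b)   # perceptual distance, rescaled by lambda_s
    return (d_latent - d_percep).abs().mean()          # penalize the mismatch between the two
```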
An objective here is to obtain two main properties. T is trained to map the latent codes to Wa* where it is possible to fit a hyperplane between the positive and negative regions of each attribute (a positive example is when the attribute is present in the image and a negative one when it is not). In addition, it is desired that the attributes are separated (i.e. disentangled). These properties are enforced by minimizing the classification loss of a linear attribute classifier C: W*→{0,1}K, where K is the number of attributes labeled in the image dataset. Choosing a linear model mainly enforces the first property, while reducing the loss in general leads to better attribute separation/disentanglement.
Instead of using one classification model for all the attributes, one binary classification model is used for each attribute and these models are trained jointly. For each sample w, the objective is to minimize:
where Ci: W*→{0,1} is the classifier for the ith attribute and yi ∈ {0,1} is the label of the sample w corresponding to the ith attribute. In Eq. (2), the classifier is fixed and only T is optimized; since the goal is to obtain the linear separation between attributes, it could be any fixed linear classifier.
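A possible sketch of this attribute-separation objective is shown below, assuming one frozen linear (single-layer) head per attribute and a binary cross-entropy loss; the interface and the flattening of the latent code are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def attribute_separation_loss(T, classifiers, w, y):
    """Linear-separability objective: K frozen linear heads C_i operate in W* and only
    the transformation T receives gradients.

    classifiers: list of K pretrained, frozen nn.Linear(D, 1) heads (one per attribute).
    w: batch of latent codes in W+ flattened to shape (B, D).
    y: binary attribute labels of shape (B, K).
    """
    w_star = T(w)                                      # map the latent codes to W*
    loss = w_star.new_zeros(())
    for i, C_i in enumerate(classifiers):
        logit = C_i(w_star).squeeze(-1)                # fixed linear classifier for attribute i
        loss = loss + F.binary_cross_entropy_with_logits(logit, y[:, i].float())
    return loss / len(classifiers)
```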
In an optional variant, the linear classifiers are first pretrained in W+. The motivation is to keep the same hyperplanes between the two spaces while "re-organizing" the new space in such a way that the objective is satisfied.
Having a space that shares some properties of W+ is important for image editing, as W+ already enjoys good properties. Furthermore, it helps the training converge faster.
Combining Eq (1) and Eq (2), the total loss for W* can be written as:
where λd allows a trade-off between the two losses.
For image editing applications, additional regularization can be introduced in Eq (3) to better condition the properties of W*.
According to a variant, the person's identity should be preserved after editing the latent codes. Identity preservation is thus enforced by minimizing the loss between the features extracted from a pretrained face recognition model F before and after editing; for a given image sample I, the loss can be written as:
where ε ∼ N(0, I) is a normal distribution with zero mean and the identity matrix I as covariance matrix, and simulates the editing effect.
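A hedged sketch of this identity regularization follows, assuming a frozen face recognition network F_id that returns feature embeddings; the L2 feature distance and the perturbation scale are assumptions on the exact form of the loss.

```python
import torch

def identity_loss(T, T_inv, G, F_id, w, sigma=1.0):
    """Identity preservation under a simulated edit: eps ~ N(0, I) perturbs the code in W*,
    and the features of a frozen face recognition model F_id should stay unchanged."""
    w_star = T(w)
    eps = sigma * torch.randn_like(w_star)      # eps ~ N(0, I), simulates an editing step
    w_edited = T_inv(w_star + eps)              # edit in W*, map back to W+ with T^-1
    feat_ref = F_id(G(w)).detach()              # identity features before editing (frozen target)
    feat_edit = F_id(G(w_edited))               # identity features after editing
    return ((feat_ref - feat_edit) ** 2).mean()
```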
As the mapping function in StyleGAN2 is trained to obtain a latent space (i.e. W+) where the images generated from this space are of high quality and with almost no artifacts, the proposed approach benefits from this by ensuring that the new space is not very different from the original one. To this end, the magnitude of the vectors in W* should be the same as in W+.
The magnitude regularization for a given image sample can be as follows:
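A minimal sketch of such a magnitude regularization is given below; the L1 penalty on the norm gap between W+ and W* is an assumption on the exact form of the regularizer.

```python
def magnitude_regularization(T, w):
    """Keep the norm of each latent vector in W* close to its norm in W+, so that the
    proxy space is neither contracted nor expanded."""
    return (T(w).norm(dim=-1) - w.norm(dim=-1)).abs().mean()
```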
An example of implementation details is discussed in the following. A StyleGAN2 generator (G) pretrained on the FFHQ dataset (Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401-4410, 2019) is used. The images are encoded in W+ using the pretrained StyleGAN2 encoder (E). The parameters of the generator and the encoder remain fixed in all the experiments. The latent vector dimension in W+ and W* is (18, 512). Celeba-HQ (Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017) is the image dataset that is used; it has labels for K=40 attributes.
A single-layer MLP (Multi-Layer Perceptron, also known as fully connected layer) model per attribute (Ci) is used as the linear classifier, pretrained in W+. For the NF model, Real NVP (Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016) is used without batch normalization (which would lead to a normalized space and significantly affect image editing). The NF model comprises several blocks or coupling layers; each coupling layer comprises two submodules or mapping functions: the scale function (s func) and the translation function (t func). Each mapping function is a small neural network comprising, in a variant, 3 fully connected (FC) layers with LeakyReLU as hidden activation and Tanh (hyperbolic tangent) as output activation. VGG16 (Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694-711. Springer, 2016) is used as the perceptual loss and λs=1. VGG16 comprises several blocks, each one comprising several layers; the outputs (feature maps) of intermediate blocks 2, 3 and 4 are taken. For the face recognition model F, a VGG16 pretrained on a face recognition dataset is used. The Adam optimizer is used with β1=0.9 and β2=0.999, learning rate=1e-4 and batch size=8.
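For illustration, a possible coupling layer matching this description is sketched below (3 FC layers per sub-network, LeakyReLU hidden activations, Tanh outputs, no batch normalization); the dimensions and the row-wise processing of the (18, 512) code are assumptions.

```python
import torch
import torch.nn as nn

class CouplingLayer(nn.Module):
    """One Real NVP affine coupling layer following the described setup: scale (s_func)
    and translation (t_func) sub-networks of 3 FC layers, LeakyReLU hidden activations,
    Tanh outputs, and no batch normalization."""

    def __init__(self, dim=512, hidden=512):
        super().__init__()

        def mlp():
            return nn.Sequential(
                nn.Linear(dim // 2, hidden), nn.LeakyReLU(0.2),
                nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
                nn.Linear(hidden, dim // 2), nn.Tanh())

        self.s_func, self.t_func = mlp(), mlp()

    def forward(self, w):
        w1, w2 = w.chunk(2, dim=-1)                    # split the vector into two halves
        s, t = self.s_func(w1), self.t_func(w1)
        return torch.cat([w1, w2 * torch.exp(s) + t], dim=-1)

    def inverse(self, z):
        z1, z2 = z.chunk(2, dim=-1)                    # the first half passes through unchanged
        s, t = self.s_func(z1), self.t_func(z1)
        return torch.cat([z1, (z2 - t) * torch.exp(-s)], dim=-1)
```

Stacking several such layers (with alternating splits) yields the bijective transformation T, whose inverse is obtained layer by layer.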
At 420, a second representation w* of the image I is determined by projecting the first representation w+ onto the second latent space W* using the trained transformation T. At 430, an edit is made on at least one attribute of the image in the second latent space W*, providing a modified second representation w*+ε. At 440, the modified second representation is remapped onto the first latent space using the inverse transformation T−1 and, at 450, a new image is generated by the GAN generator module.
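The editing pipeline of these steps can be sketched as follows; the attribute direction (e.g. the normal of an InterFaceGAN hyperplane) and all argument names are illustrative.

```python
import torch

@torch.no_grad()
def edit_image(I, E, G, T, T_inv, direction, step=10.0):
    """Sketch of the editing pipeline: project the image to W+, map to W*, move along an
    attribute direction, remap with T^-1 and regenerate."""
    w_plus = E(I)                               # project the image onto W+
    w_star = T(w_plus)                          # 420: second representation in W*
    w_star_edited = w_star + step * direction   # 430: edit one attribute in W*
    w_plus_edited = T_inv(w_star_edited)        # 440: remap onto W+ with the inverse of T
    return G(w_plus_edited)                     # 450: generate the edited image
```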
Classification Accuracy: An SVM (Support Vector Machine, a machine learning technique used for classification), or any other classification technique, is trained from scratch for each attribute on 15000 latent codes in the corresponding space (which contains the validation set and a portion of the training set that was used for the NF training). In W+, these codes are obtained by encoding the Celeba-HQ images with the pretrained encoder. In W*, after the encoding, the codes are mapped using the trained NF model T. The split ratio is 0.8 for the training set. Three numbers are reported: the minimum (Min Acc) and maximum (Max Acc) accuracy among the 40 attributes as well as the average (Avg Acc).
For DCI (Disentanglement, Completeness and Informativeness), a metric introduced to quantify the disentanglement of the attributes (C. Eastwood and C. K. Williams. A framework for the quantitative evaluation of disentangled representations. In ICLR, 2018), 40 Lasso regressors from the scikit-learn library are used, with α=0.02 as the multiplier of the L1 regularizer. The dataset size is 2000 and is composed of the validation set of Celeba-HQ encoded using the pretrained encoder. The train and validation sets are split as 80% and 20% respectively. The RMSE loss is used.
In these experiments, both objectives are optimized: latent distance unfolding and attribute separation. Real NVP consists of 13 coupling layers (size=20.4 M parameters). λd=1 at the beginning and is set to λd=10 after 40 epochs. It is to be noted that the Real NVP has no batch normalization (BN); this is important as BN normalizes the data, whereas the hyperplanes of the pretrained classifiers are obtained on the unnormalized space W+. From Table 1, a significant improvement in all the quantitative metrics can be noticed in W*.
[Table 1: classification accuracy and DCI metrics in W+ and in W*]
To assess qualitatively the new space, InterFaceGAN (Yujun Shen, Ceyuan Yang, Xiaoou Tang, and Bolei Zhou. Interfacegan: Interpreting the disentangled face representation learned by gans. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020) was retrained to manipulate the attributes of a given real image in both W+ and Wa*.
InterFaceGAN assumes that the positive and negative examples of each attribute are linearly separable and that the editing direction is simply the normal to the hyperplane separating the positive and negative regions. Specifically, to obtain these hyperplanes, an SVM is trained for each attribute in both spaces and the latent code of the image encoded by the pretrained encoder is edited. In Wa*, the image is first encoded and then the latent codes are mapped using T. To generate the image with the pretrained StyleGAN2 generator after editing in Wa*, the latent codes are remapped using the inverse of Real NVP (T−1). The total loss for attribute separation and identity regularization (i.e. Wa*) is as follows:
Implementation details: Real NVP consists of 3 coupling layers (size=4.7 M parameters) and is usually trained with additional objectives; according to a variant, only the loss (objective) defined in Eq. (6) is minimized. The same setup as above is adopted. The editing directions are obtained after training an SVM on 15000 images of Celeba-HQ encoded using the pretrained encoder, in W+ and in Wa*. The editing step is 6 for W+ and 10 for Wa*.
Results:
It can be noticed from the qualitative results that editing an attribute in W+ also affects other attributes, while in Wa* these attributes are better disentangled. The identity is also better preserved in Wa*. Finally, it is clear that high quality images are still obtained even though the generator has not been retrained.
Quantitative evaluation: In this section, the effect of some design choices is investigated. The setup is the same as above except when stated otherwise; λd=10 from the beginning of the training and kept constant. Four experiments are analyzed, which differ from the main setup as follows: H (high model capacity, 13 coupling layers),
[Table 2: results for W+ and for the W* variants (H), (L), (R) and (1R)]
From Table 2:
Qualitative evaluation, image editing: In general, the magnitude and identity preservation losses help to preserve the identity and allow high quality image editing, although the effect of the identity loss alone is better; when it is combined with the magnitude loss, the results get slightly worse. The latent distance unfolding loss does not give any benefit for editing.
Image Editing: It is noticed in some experiments that the new space should not be very different from the original one to obtain good editing results. For instance, if the latent and perceptual distances are not on the same scale, an editing step in W* could be equivalent to a step 10 times larger or smaller in W+. In this regard, some constraints are added on the model, such as the magnitude regularization (to ensure that the new space is not contracted/expanded), and the same boundaries (editing directions) are kept in W*. However, using the identity regularization is enough to replace these two constraints. When using the latter, it is important to choose its weight carefully: for example, if the weight is high and the model is trained for too long, the editing effect will be smaller.
Beyond StyleGAN: Some effort has been devoted to image editing and to improving attribute disentanglement for other generative models such as GANs and VAEs. The proposed attribute separation approach could be extended in a straightforward way to such models. For distance unfolding, the scope of models is larger, as any model with a latent space could be adopted. Other properties could be enforced as well; for instance, for image editing, a head pose preservation loss could be adopted.
Retraining the whole model: Enforcing these properties is also possible by optimizing the latent space directly while training the generator/discriminator, as is done in many recent works on attribute disentanglement. According to another embodiment, the method for unfolding a latent space described in reference with
Several embodiments are provided below which provide a new compression scheme using inverted GAN. In the image/video compression scheme provided below, a GAN encoder, for instance a StyleGAN encoder, is used for mapping each video frame to a latent point in the GAN latent space, for instance with dimension 18×512.
According to an embodiment, an intra coding scheme, or image compression method, is provided wherein an entropy model is learned in the proxy latent space. According to another embodiment, an inter coding scheme for video compression is provided wherein the latent codes of intermediate frames are linearly interpolated in the proxy latent space from intra coded latent codes.
According to another embodiment, an inter coding scheme for video compression is provided wherein an entropy model for successive differences between latent codes is learned.
At low bitrates, traditional image codecs tend to produce blocking artifacts, while other deep compression systems are unable to reconstruct sharp, unblurred and high quality images. To remedy this, it is proposed to leverage the generative power of Generative Adversarial Networks (GANs) for image compression. To alleviate the burden of adversarial training, a proxy latent space dedicated to compression is learned while the pretrained, off-the-shelf GAN encoder and decoder are frozen.
In other words, it is learned how to efficiently compress the latent code associated with a given image, for example a face image, although the method is not limited to this kind of image.
In addition, a new perceptual distortion loss is proposed that is more efficient to compute than other counterparts (such as LPIPS, defined in Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586-595, 2018, or VGG16). The method proposed herein (SGANC) is simple, faster to train and shows better qualitative results than state-of-the-art codecs such as VVC, AV1 and recent deep learning-based ones at low bitrates.
Image compression can be formulated as an optimization problem whose objective is to find a codec with minimal bitrate for a given distortion level between the reconstructed image at the decoder side and the original one. The distortion is mainly due to quantization, as compression codecs work with discrete data. On the other hand, as the bitrate is lower bounded by the entropy, a mismatch between the predicted data distribution and the real one leads to a higher bitrate. Thus, good codecs are the ones with good probability models of the underlying data. Because images live in a high dimensional space, optimization in this space is intractable; they are therefore usually first transformed to a latent code of lower dimension before quantization/compression. This scheme is classically called transform coding.
Traditional image codecs (e.g., JPEG, JPEG2000) are based on handcrafted and linear transformations, unlike the recent deep learning-based codecs or deep compression systems (Johannes Balle, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704,2016a. [Ballé et al., 2016a]) which learn nonlinear transformations that are more adapted to the processed data. These recent models optimize jointly a rate-distortion loss:
where x and x̂ are the original and the reconstructed images, z is the corresponding latent code, pz(·) is the data distribution and d(x, x̂) is any distortion loss.
Usually, the distortion loss is chosen to be one of the traditional metrics used to assess compression systems, such as PSNR or MS-SSIM. However, these metrics capture pixel-wise distortion and focus on texture rather than on perceptual distortion or global appearance. Moreover, it has been shown that there is a trade-off between pixel-wise distortion and perceptual quality. This observation is clearly visible at very low bitrates or bits per pixel (bpp), where traditional codecs produce blocking artifacts and deep compression systems show blurring and other types of artifacts.
According to the embodiment described herein, the encoding/decoding method leverages the generative power of StyleGAN and GAN inversion techniques, such as in Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a StyleGAN encoder for image-to-image translation. arXiv preprint arXiv:2008.00951, 2020, or in Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Weiming Zhang, Lu Yuan, Gang Hua, and Nenghai Yu. A simple baseline for StyleGAN inversion. arXiv preprint arXiv:2104.07661, 2021, for high quality, low perceptual distortion and training-efficient image compression.
According to an embodiment illustrated on
On the encoding side, the images are projected into the latent space (W+). The latent code obtained from the projection is then mapped to the proxy latent space W*c where the quantization/compression is done, providing coded image data. In an embodiment, the coded image data can then be transmitted in a bitstream to a decoder. On the decoding side, the coded image data are obtained from the bitstream and decompressed/decoded. The decoded image data is then mapped from the proxy latent space W*c back to the latent space W+, before generating the reconstructed image Irec using the GAN generator.
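A minimal sketch of this pipeline is given below. The compress/decompress interface stands in for a fully factorized entropy coder (e.g. CompressAI-style); this exact API, the rounding-based quantization and the argument names are assumptions.

```python
import torch

@torch.no_grad()
def encode_image(I, E, T, entropy_model):
    """Intra coding sketch: project to W+, map to the proxy space W*c with T,
    then quantize and entropy code."""
    w_plus = E(I)                                         # first latent representation, e.g. (18, 512)
    w_star = T(w_plus)                                    # second latent representation in W*c
    return entropy_model.compress(torch.round(w_star))    # coded image data

@torch.no_grad()
def decode_image(coded, G, T_inv, entropy_model):
    """Mirror of the encoder: entropy decode in W*c, map back to W+ with T^-1 and
    generate the reconstructed image with the frozen GAN generator."""
    w_star_hat = entropy_model.decompress(coded)
    w_plus_hat = T_inv(w_star_hat)
    return G(w_plus_hat)                                  # reconstructed image I_rec
```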
According to the embodiments for encoding/decoding images described herein, the burden of retraining the StyleGAN encoder/decoder is avoided as a proxy latent space dedicated for compression is learned while using off the shelf pretrained StyleGAN encoder/decoder models.
The proposed scheme shows high quality and lower perceptually distorted reconstructed images for low bitrates, better quantitative metrics for medium and high bitrates in terms of MS-SSIM and LPIPS and better PSNR metrics for high bitrates.
According to a variant of the embodiment illustrated on
In particular, the method relies on computing a bijective normalizing flow transformation T so that an optimal coding scheme can be learned in this new latent space (W*c). In the following, the GAN approximation as well as the training of the intra compression scheme are described.
The Generator StyleGAN is a state of the art unconditional GAN in high quality image generation. It consists of a mapping function that takes a noise vector and maps it to an intermediate latent space (i.e., W) before feeding it to multiple stages of the generator to generate the image. It is shown that the latent space of StyleGAN is semantically rich and the generative factors are better disentangled thus making it better for interpolation. According to an embodiment, a StyleGAN2 encoder/generator (Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110-8119, 2020.) is used, which is an improved version of StyleGAN discussed above. However, the method for encoding an image proposed herein is not limited to the StyleGAN2 networks, and any GAN model can be used.
The StyleGAN encoder's role is to project an image in the latent space of StyleGAN (e.g., W, W+) in such a way that the image reconstructed by the generator is minimally distorted. According to this embodiment, the image is projected in W+ with dimension (18×512).
Normalizing Flows (NFs) are another type of generative model consisting of diffeomorphic transformations between a simple known distribution and an arbitrarily complex one. In this embodiment, the same parametrization as the one used in the embodiments described in reference with
Regarding the Intra Compression, an objective is to minimize the rate-distortion loss. In addition, to avoid the burden of retraining the StyleGAN encoder/generator, a proxy space W*c is introduced.
As in the embodiments described in reference with
A pretrained StyleGAN2 generator G is assumed, which takes a latent code w ∈ W+ and generates a high resolution image I (for instance 1024×1024).
A bijective transformation T: W+→W*c is trained to map a latent code w ∈ W+ to w* ∈ W*c. T is a Normalizing Flow (NF) model and can be inverted explicitly. The focus is on real images, so it is assumed that a pretrained encoder E is available that embeds the image in W+ such that G(E(I)) ≈ I.
Although the transformation T is modelled as an NF, the proposed method only requires bijectivity; as such, no maximum likelihood term is included in the training objective.
The entropy model is based on a fully factorized probability distribution as in [Ballé et al., 2016a].
The entropy model takes as input the latent code provided by the transformation T and outputs a probability value pi for each dimension.
To obtain the coded image data, the latent code is quantized by applying a rounding operation and compressed using Range Asymmetric Numeral System (rANS) bindings as proposed in Duda, Jarek. “Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding.” arXiv preprint arXiv:1311.2540 (2013), which is a coder based on entropy coding. The entropy model takes a latent vector in W*c of dimension (18×512) and it is trained jointly with the transformation T.
To train the entropy model, a method similar to the one used in [Ballé et al., 2016a] is used and the hard quantization is replaced by adding uniform noise to the latent vectors.
As the compression is done in W*c, the rate loss is minimized after mapping the latent codes from W+ using T. The rate loss is as follows:
where pi is the ith dimension of the probability density function in W*c, Dm is the latent vector dimension, x is the input image and ε is sampled from a uniform distribution U[−0.5, 0.5]. The distortion loss is applied in the original latent space W+ and can be written as follows:
where d is any distortion measure between the latent code in W+ and the reconstructed latent code in W+ after mapping T, encoding and inverse mapping T−1.
The total loss is a trade-off between rate and distortion:
where λ is the trade-off parameter.
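For illustration, a possible training step combining these losses is sketched below. The entropy model returning per-dimension likelihoods directly, the MSE distortion and the placement of λ are assumptions consistent with the rate-distortion formulation above.

```python
import torch

def rate_distortion_loss(w_plus, T, T_inv, entropy_model, lam=1.0):
    """Training objective sketch for the intra scheme: uniform noise stands in for hard
    rounding, the rate is estimated from the per-dimension likelihoods of the factorized
    entropy model in W*c, and the distortion is measured back in W+."""
    w_star = T(w_plus)
    noise = torch.empty_like(w_star).uniform_(-0.5, 0.5)       # quantization proxy during training
    w_star_noisy = w_star + noise
    likelihoods = entropy_model(w_star_noisy)                  # p_i for each latent dimension
    rate = -torch.log2(likelihoods).sum(dim=-1).mean()         # estimated coding cost in bits
    distortion = ((T_inv(w_star_noisy) - w_plus) ** 2).mean()  # distortion in the original space W+
    return rate + lam * distortion
```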
In the above variant, the distortion loss is determined in the latent space W+ which allows for faster training. Moreover, computing the distortion in the latent space is equivalent to computation of the distortion in the image space in terms of mean squared error.
In some variant, the distortion loss can be determined in the image space (between original picture and reconstructed picture) using any distortion metrics, either pixel-based or a perceptual metric or a combination of both.
As described above, the encoding of the second latent representation is performed using an entropy network model that has been trained jointly with the transformation T for mapping the first latent representation in the proxy space.
In a variant, decoding of the latent representation comprises entropy decoding. In another variant, decoding of the latent representation also comprises dequantization.
As described above, the decoding of the latent representation is performed using an entropy network model that has been trained jointly with the transformation T/T−1 used for mapping the first latent representation into the proxy space in which it is encoded. The latent representation is thus decoded in the proxy space.
At 1120, another latent representation of the image is obtained from the decoded latent representation. In a variant, the decoded latent representation is mapped using the transformation T−1 from the proxy space to the target latent space. The target latent space corresponds here to the original latent space onto which the image has been projected on the encoder side. The target latent space is the GAN latent space. At 1130, a decoded image is generated from the latent representation that has been mapped on the target latent space, using the GAN generator.
In the following, some qualitative and quantitative results of the proposed method compared to other ones are presented.
Implementation details: a StyleGAN2 generator (G) pretrained on the FFHQ dataset is used. The images are encoded in W+ using a pretrained StyleGAN2 encoder (E); the parameters of the generator and the encoder remain fixed in all the experiments. The latent vector dimension in W+ and W*c is 18×512. Celeba-HQ is the image dataset used for training and consists of 30000 high quality images (i.e. 1024×1024) of faces.
For the NF model, Real NVP is used without batch normalization. Each coupling layer consists of 3 fully connected (FC) layers for the translation function and 3 FC layers for the scale function, with LeakyReLU as hidden activation and Tanh as output activation. A fully factorized entropy model is trained as in [Ballé et al., 2016a]. For all the experiments, the Adam optimizer is used with β1=0.9 and β2=0.999, learning rate=1e−4 and batch size=8.
Datasets: the method is evaluated on different datasets: FILMPAC: This dataset consists of video clips with high resolution and length between 60 and 260 frames.
MEAD intra: MEAD dataset, defined in Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. Mead: A large-scale audio-visual dataset for emotional talking-face generation. In ECCV, 2020, is a high resolution talking face video corpus for many actors with different emotions and poses. MEAD intra consists of 200 frames selected from these videos with frontal pose. It contains frames from around 40 actors, with different expressions (i.e., neutral, happy, sad).
Dataset preprocessing: All the frames are cropped around the face and aligned. As the reconstructed image is compared with the projected one for SGANC, the other methods are fed with the projected image instead of the original image. All frames have a resolution of 1024×1024.
In these experiments, each frame of the videos is quantized/compressed independently (intra coding) and the average of the metrics is reported over all the frames of a given video.
The method is compared with the Versatile Video Coding Test Model (VTM), AV1, and the factorized model with scale and mean hyperpriors (MeanHP) described in David Minnen, Johannes Ballé, and George Toderici. Joint autoregressive and hierarchical priors for learned image compression. arXiv preprint arXiv:1809.02736, 2018.
The following metrics are used: Peak Signal to Noise Ratio (PSNR), Multi Scale Structural Similarity (MS-SSIM) defined in H. Zhao, O. Gallo, I. Frosio, and J. Kautz. Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 3(1):47-57, 2017. doi:10.1109/TCI.2016.2644865, and Learned Perceptual Image Patch Similarity (LPIPS). The size of the compressed images in bits per pixel (BPP) is reported.
Note that all the distortion metrics of the proposed method (SGANC) are reported with respect to the projected image, while all the others are reported with respect to the original one.
Qualitative results: As can be seen in
Even though the proposed encoding scheme uses an off-the-shelf encoder/generator trained on different datasets (FFHQ for StyleGAN, Celeba-HQ for the compression training, and the evaluation is done on a third dataset), it is still possible to obtain artifact-free images that are perceptually close to the original image.
Quantitative results:
As can be seen in
In the methods for encoding/decoding at least one image described above, results are provided for face images; however, the present principles are not limited to this kind of image and the methods provided herein apply to any other kind of image, as long as a network model is available for projecting the image into the first latent space from which a similar image can be generated, such as with a GAN network.
In the present application, the terms “reconstructed” and “decoded” may be used interchangeably, the terms “encoded” or “coded” may be used interchangeably, the terms “pixel” or “sample” may be used interchangeably, and the terms “image,” “picture” and “frame” may be used interchangeably. Usually, but not necessarily, the term “reconstructed” is used at the encoder side while “decoded” is used at the decoder side.
Before being encoded, the video sequence may go through pre-encoding processing (701), for example, applying a color transform to the input color picture (e.g., conversion from RGB 4:4:4 to YCbCr 4:2:0), or performing a remapping of the input picture components in order to get a signal distribution more resilient to compression (for instance using a histogram equalization of one of the color components). Metadata can be associated with the pre-processing and attached to the bitstream.
In the encoder 700, a picture is encoded by the encoder elements as described below. The picture to be encoded is partitioned (702) and processed in units of, for example, CUs. Each unit is encoded using, for example, either an intra or inter mode. When a unit is encoded in an intra mode, it performs intra prediction (760). In an inter mode, motion estimation (775) and compensation (770) are performed. The encoder decides (705) which one of the intra mode or inter mode to use for encoding the unit, and indicates the intra/inter decision by, for example, a prediction mode flag. The encoder may also blend (763) intra prediction result and inter prediction result, or blend results from different intra/inter prediction methods.
Prediction residuals are calculated, for example, by subtracting (710) the predicted block from the original image block. The motion refinement module (772) uses already available reference pictures in order to refine the motion field of a block without reference to the original block. A motion field for a region can be considered as a collection of motion vectors for all pixels within the region. If the motion vectors are sub-block-based, the motion field can also be represented as the collection of all sub-block motion vectors in the region (all pixels within a sub-block have the same motion vector, and the motion vectors may vary from sub-block to sub-block). If a single motion vector is used for the region, the motion field for the region can also be represented by the single motion vector (same motion vector for all pixels in the region).
The prediction residuals are then transformed (725) and quantized (730). The quantized transform coefficients, as well as motion vectors and other syntax elements, are entropy coded (745) to output a bitstream. The encoder can skip the transform and apply quantization directly to the non-transformed residual signal. The encoder can bypass both transform and quantization, i.e., the residual is coded directly without the application of the transform or quantization processes.
The encoder decodes an encoded block to provide a reference for further predictions. The quantized transform coefficients are de-quantized (740) and inverse transformed (750) to decode prediction residuals. Combining (755) the decoded prediction residuals and the predicted block, an image block is reconstructed. In-loop filters (765) are applied to the reconstructed picture to perform, for example, deblocking/SAO (Sample Adaptive Offset) filtering to reduce encoding artifacts. The filtered image is stored at a reference picture buffer (780).
In particular, the input of the decoder includes a video bitstream, which can be generated by video encoder 700. The bitstream is first entropy decoded (830) to obtain transform coefficients, motion vectors, and other coded information. The picture partition information indicates how the picture is partitioned. The decoder may therefore divide (835) the picture according to the decoded picture partitioning information. The transform coefficients are de-quantized (840) and inverse transformed (850) to decode the prediction residuals. Combining (855) the decoded prediction residuals and the predicted block, an image block is reconstructed.
The predicted block can be obtained (870) from intra prediction (860) or motion-compensated prediction (i.e., inter prediction) (875). The decoder may blend (873) the intra prediction result and inter prediction result, or blend results from multiple intra/inter prediction methods. Before motion compensation, the motion field may be refined (872) by using already available reference pictures. In-loop filters (865) are applied to the reconstructed image. The filtered image is stored at a reference picture buffer (880).
The decoded picture can further go through post-decoding processing (885), for example, an inverse color transform (e.g. conversion from YCbCr 4:2:0 to RGB 4:4:4) or an inverse remapping performing the inverse of the remapping process performed in the pre-encoding processing (701).
The post-decoding processing can use metadata derived in the pre-encoding processing and signaled in the bitstream.
According to an embodiment, when an image is to be intra-coded using the encoder and decoder described above in reference to
According to another embodiment, the method for unfolding a latent space described in reference with
Video compression methods try to reduce the temporal redundancy (TR) and spatial redundancy (SR) as much as possible.
Some works have proposed to reduce the TR in the feature or latent space, such as by computing the feature space residual as in Abdelaziz Djelouah, Joaquim Campos, Simone Schaub-Meyer, and Christopher Schroers. Neural inter-frame compression for video coding. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6420-6428, 2019. In this method, TR is exploited by interpolating the intermediate frames given two reference frames. The approach is based on motion estimation, which makes it complex to train and requires compressing additional information (flow maps).
In Shibani Santurkar, David Budden, and Nir Shavit. Generative compression. In 2018 Picture Coding Symposium (PCS), pages 258-262, 2018, it is proposed to interpolate intermediate frames in the latent space; however, no entropy coding is used, a small image resolution (64×64) is used, and the interpolation gap is constant, without any strategy to adapt it to the type or the dynamics of the video.
The present embodiment allows remedying these drawbacks by leveraging the properties of the StyleGAN latent space for efficient and high quality video compression. In this embodiment, the video is divided into temporal segments of adapted lengths, and the first and last frames of each segment are compressed and sent to a receiver. On the receiver side, the intermediate frames are obtained by an interpolation in the latent space, e.g. a linear interpolation. An example of the embodiment is illustrated in
In this embodiment, the properties of the latent space of StyleGAN are leveraged for simple, efficient and high quality video compression. High quality reconstructed images with lower perceptual distortion for low bitrates is achieved. Some better quantitative metrics for high bitrates in terms of MS-SSIM and LPIPS are also obtained.
For each temporal segment, the first and last images are encoded as in the intra coding scheme explained above. E and G are the StyleGAN2 encoder and generator respectively. The image is projected in the latent space of the GAN (W+) and mapped to the proxy latent space W*c using the transformation T, where the quantization/compression is done to produce coded video data, for instance in a bitstream. According to this embodiment, only the latent codes of the first and last frames of the temporal segment are encoded in the coded video data. The first and last frames are encoded using the encoding method illustrated on
On the decoder side, the coded video data are obtained, for instance from a received bitstream or retrieved from memory. The coded video data are decompressed and the decoded latents are mapped (using T−1) from the proxy latent space W*c to the GAN latent space W+, wherein the first and last frames of the temporal segment are reconstructed by the generator G. The first and last frames are decoded using the decoding method illustrated on
To obtain the intermediate frames located between the first and last frames, a linear interpolation in the latent space W+ is performed using the latent codes of the first and last frames. Then, an intermediate frame is generated by the generator using the interpolated latent code as input. In this way, a set of reconstructed frames Irec is obtained for the temporal segment.
According to this embodiment, there is no need for a specific training for video compression as the same models (T and the entropy model) trained for intra coding are used. E and G are pretrained StyleGAN2 encoder and decoder respectively, and remain fixed in all of the trainings.
The latent space, or manifold, of GANs is semantically rich and enables several applications such as image editing. Moreover, image interpolation on this manifold produces high quality and pleasant images. This property is leveraged to reduce the temporal redundancy of a frame sequence, and a method for video compression is provided wherein intra coding is combined with linear interpolation in the latent space so as to also reduce spatial redundancy.
In this embodiment for video compression, the intra coding part is the same as described above, and the training of the transformation T is performed in the same way, using the same rate-distortion losses (Equations 8, 9 and 10).
For the inter coding part of the scheme, the video is divided into non-overlapping segments of size GAP. The first and last frames (i.e., I1 and I2 respectively) are encoded as illustrated with
where w1=T−1(Q(T(E(I1)))) and w2=T−1(Q(T(E(I2)))) are the two received latent codes in W+ and i ∈ {1, . . . , GAP−1}; here Q denotes the quantization, compression coding and decoding.
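A short sketch of the receiver-side interpolation of intermediate frames is given below; the equal-step linear interpolation in W+ follows the description above, while function and argument names are illustrative.

```python
import torch

@torch.no_grad()
def interpolate_segment(w1, w2, gap, G):
    """Inter coding by interpolation: given the two decoded latent codes w1 and w2 in W+
    (first and last frames of a segment of size GAP), the intermediate frames are
    generated from linearly interpolated latent codes."""
    frames = []
    for i in range(1, gap):
        alpha = i / gap
        w_i = (1.0 - alpha) * w1 + alpha * w2   # linear interpolation in the GAN latent space
        frames.append(G(w_i))                   # intermediate reconstructed frame
    return frames
```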
The size GAP of the temporal segment is a parameter of the method to tune. The value of GAP could depend on the motion or the dynamics of the video, as well as on what type of objects are changing. In the following, several variants are provided to adapt the GAP temporally and layer-wise.
Layer specific adaptive gap (LA-GAP)
In Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401-4410, 2019, it is shown that each stage of the StyleGAN generator corresponds to a specific scale of details. Specifically, the first layers, which correspond to coarse resolutions (e.g., 4²-8²), mainly affect high level aspects of the image such as the pose and face shape, while the last layers affect low level aspects such as textures, colors and small micro structures. This property is used to adapt the GAP layer-wise. Specifically, a small GAP value (GAPl) is used for the first layers (e.g., 1-7) and a larger one (GAPh) for the last layers (e.g., 7-18).
This is motivated by noticing that what usually changes in the videos corresponds to the high level aspects, while the textures and colors change slowly and their change may not be noticeable or important. It is to be noted that, as an example, in this variant the latent code dimension is (18, 512), but other dimensions could be envisaged. The intermediate frames can be obtained by the following equations, wherein the latent code of an intermediate frame is obtained by two interpolations between the first and last frames of two temporal segments of sizes GAPl and GAPh:
where wl1, wl2, wh1 and wh2 correspond respectively to the encoded frames Il1, Il2, Ih1 and Ih2 for GAPl and GAPh, and can be written as follows:
where n = [1, 0] ∈ ℕ18, with 1 ∈ ℕs and 0 ∈ ℕ18−s, is a mask used to compress only the first s dimensions (layers) of the latent codes. It is to be noted that the choice of s, GAPl and GAPh can be adapted to the processed videos (e.g., if the main changes are the object colors, the opposite assignment may be adopted).
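The layer-wise interpolation can be sketched as follows for a (18, 512) latent code; the value of s, the interpolation positions alpha_l = i/GAPl and alpha_h = i/GAPh, and the masking implementation are illustrative assumptions.

```python
import torch

@torch.no_grad()
def la_gap_latent(alpha_l, alpha_h, wl1, wl2, wh1, wh2, s=7):
    """Layer-specific adaptive gap (LA-GAP) sketch: the first s layers are interpolated
    within the short segment and the remaining layers within the long segment."""
    mask = torch.zeros(18, 1)
    mask[:s] = 1.0                              # 1 for the first s layers, 0 for the others
    w_coarse = wl1 + alpha_l * (wl2 - wl1)      # coarse layers: short gap (pose, face shape)
    w_fine = wh1 + alpha_h * (wh2 - wh1)        # fine layers: long gap (textures, colors)
    return mask * w_coarse + (1.0 - mask) * w_fine
```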
At 1910, the first and last images of a first temporal segment GAPl are decoded, using the method for decoding illustrated on
At 1920, the first and last images of a second temporal segment GAPh are decoded, using the method for decoding illustrated on
At 1930, the intermediate latent code is obtained by interpolation wherein a first set of layers of the latent code is obtained by interpolation using the corresponding layers of the latent codes of the first and last frames of the first temporal segment and a second set of layers of the latent code is obtained by interpolation using the corresponding layers of the latent codes of the first and last frames of the second temporal segment, as explained above with Equation (12).
At 1940, the intermediate frame Iinter is generated by the GAN generator.
Temporal adaptive gap (TA-GAP)
Another variant to adapt the GAP is provided, wherein the GAP is determined according to the motion or dynamics of each frame segment. To this end, the GAPs are determined for each temporal segment as a preprocessing step, and then the video is compressed based on the determined GAPs.
At 2010, for a set of images of the video, a size of a temporal segment (GAP) between intra coded images is determined. At 2020, the first and last frames of the determined temporal segment are encoded, using the method illustrated in
An algorithm for determining sizes of temporal segments is provided below, wherein a result of the algorithm provides a list of the determined temporal segments of the video for encoding.
At initialization, a default size GAP0, a metric M, a metric threshold TM and a threshold tolerance eps are set, and the number of iterations N is initialized to 0.
where Interpolation(GAP, i, M) performs linear interpolation of the frames in the temporal segment of size GAP starting from frame i, and returns an average metric.
Specifically, in the above algorithm, the average metric (e.g., PSNR) is computed to assess the reconstruction of the intermediate frames given a GAP; if the reconstruction is good, the motion is relatively steady and the GAP can be increased. If there is high motion, the reconstruction is poor and the GAP is reduced.
The threshold TM depends on the processed video; TM is thus set to be below, by a margin m, the best reconstruction metric that can be obtained. It is assumed that the best metric is obtained with the minimal GAP=2.
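For illustration only, a possible sketch of such a gap search is given below; the names, the doubling/halving schedule and the stopping rule are assumptions, not the exact algorithm of the text.

```python
def determine_gaps(num_frames, interpolation_metric, gap0=10, tm=30.0, eps=0.5, max_iter=5):
    """Sketch of the temporal adaptive gap (TA-GAP) search: starting from a default GAP0,
    each segment is grown while the average reconstruction metric (e.g. PSNR) of the
    interpolated frames stays above the threshold TM, and shrunk otherwise."""
    gaps, i = [], 0
    while i < num_frames - 1:
        gap = min(gap0, num_frames - 1 - i)
        for _ in range(max_iter):
            score = interpolation_metric(gap, i)              # average metric over the segment
            if score > tm + eps and i + 2 * gap < num_frames:
                gap *= 2                                      # steady motion: try a larger segment
            elif score < tm - eps and gap > 2:
                gap = max(2, gap // 2)                        # high motion: reduce the segment size
            else:
                break
        gaps.append(gap)
        i += gap                                              # move to the next segment
    return gaps
```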
Once the GAPs have been computed, they are used to compress the video using the method illustrated in
According to another variant, the variants for determining the GAPs described above (layer-wise, temporal) can be combined to reduce the compressed size. Specifically, instead of using a fixed and small GAPl for the first layers as in LA-GAP, GAPl is determined as explained for the temporal adaptation TA-GAP.
In another variant, this can also be done for the last layers, but as the GAPh used for these layers is already high (e.g., 60), it can be kept constant.
In the following, some results of the embodiments provided above are shown.
A StyleGAN2 generator (G) pretrained on FFHQ dataset is used. The images are encoded in W+ using a pretrained StyleGAN2 encoder (E), the parameters of the generator and the encoder remain fixed in all the experiments. The latent vector dimension in W+ and W*c is 18×512. Celeba-HQ is the image dataset that is used for training and consists of 30000 high quality images (i.e. 1024×1024) of faces.
For the NF model, Real NVP is used without batch normalization. Each coupling layer consists of 3 fully connected (FC) layers for the translation function and 3 FC for the scale one with LeakyReLU as hidden activation and Tanh as output one. A fully factorized entropy model is trained.
The Range Asymmetric Numeral System coder is used to obtain the bitstream. The entropy model is based on the implementation in the CompressAI library (Jean Begaint, Fabien Racape, Simon Feltman, and Akshay Pushparaja. CompressAI: a PyTorch library and evaluation platform for end-to-end compression research). For all the experiments, two Adam optimizers with the same parameters are used for the entropy model and T, with β1=0.9 and β2=0.999, learning rate=1e−4 and batch size=8.
Datasets: the method for encoding/decoding a video are evaluated on the MEAD dataset which is a high resolution talking face video corpus for many actors with different emotions and poses. MEAD inter consists of 10 videos of different actors with frontal pose.
The dataset is preprocessed as follows: all the frames are cropped around the face and aligned. As the reconstructed image is compared with the projected one for SGANC, all the frames are projected and reconstructed using StyleGAN2 before being fed to the other methods, while the method provided herein takes the original frames as input. All frames have a resolution of 1024×1024.
The average of the metrics over all the frames of a given video is reported. For the MEAD inter dataset, the average of the metrics over all the videos is used.
The following metrics are used: Peak Signal to Noise Ratio (PSNR), Multi Scale Structural Similarity (MS-SSIM) and Learned Perceptual Image Patch Similarity (LPIPS). The size of the compressed images in bits per pixel (BPP) is reported.
It is to be noted that all the distortion metrics of the provided method (SGANC) are reported with respect to the projected image while all the others are with respect to the original one.
The following methods are compared:
Quantitative results: From
For the LPIPS loss, the methods provided herein perform better than VTM. Note that, for SGANC, the distortion is measured from the quantization (Projected vs SGANC).
Implementation details: the following variants are compared:
From
As illustrated in A first latent code of the sequence is intra coded:
using the same entropy model as the one described for image compression or another entropy model trained for image compression. The first latent code can be the latent code of the first image of the video sequence or a first image of a group of frames when the video sequence is fragmented into groups of frames.
The following steps are repeated until the end of the video sequence or group of frames. The difference between the current latent code w*t and the previous one w*t−1 is quantized: d̂t=Q(w*t−w*t−1).
At 2430, a prediction (estimate) w̃*t of the current latent code w*t is determined from the previously reconstructed code ŵ*t−1 and the reconstructed difference with: w̃*t=ŵ*t−1+d̂t.
At 2440, the residual between the prediction and the current latent code is computed and, at 2450, the residual is quantized and entropy coded (for all the frames or every GAP frames): r̂t=Q(w*t−w̃*t).
The quantized difference d̂t and the residual r̂t are compressed using entropy coding and sent to a receiver. The current latent code is reconstructed as ŵ*t=w̃*t+r̂t from the prediction and the reconstructed residual and stored for compressing the subsequent latent codes.
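For illustration, one encoder iteration (steps 2430-2450) can be sketched as below; Q, EC and ED denote the quantizer, entropy coder and entropy decoder, the latent codes are assumed to be tensors in W*c, and the symbol names mirror the notation above.

```python
def encode_step(w_t, w_prev, w_prev_rec, Q, EC, ED):
    """One inter-coding step on the latent codes in W*c: quantize the
    difference, predict the current code, then code the residual."""
    # Quantized/reconstructed difference between consecutive latent codes.
    d_hat = ED(EC(Q(w_t - w_prev)))
    # 2430: prediction of the current latent code from the previously
    # reconstructed code and the reconstructed difference.
    w_pred = w_prev_rec + d_hat
    # 2440/2450: residual between the current code and its prediction,
    # quantized and entropy coded (then decoded for the local reconstruction).
    r_hat = ED(EC(Q(w_t - w_pred)))
    # Reconstruction stored for compressing the subsequent latent codes.
    w_rec = w_pred + r_hat
    return d_hat, r_hat, w_rec
```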
On the decoder side, a prediction w̃*t of the current latent code is determined from the previously reconstructed latent code ŵ*t−1 and the reconstructed difference d̂t. At 2540, the current latent code is reconstructed from the decoded residual and the prediction of the latent code or, depending on the variant, only from the prediction: ŵ*t=w̃*t+r̂t. At 2550, the reconstructed latent code in the latent space W*c is remapped to W+ to generate the decoded image using the pretrained generator G, for instance the StyleGAN2 generator.
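The corresponding decoder step (2540-2550) could be sketched as follows, under the same notational assumptions; T_inv stands for the inverse transformation T−1 and G for the pretrained generator.

```python
def decode_step(d_hat, r_hat, w_prev_rec, G, T_inv):
    """Reconstruct the current latent code and decode the image."""
    w_pred = w_prev_rec + d_hat   # prediction from the decoded difference
    w_rec = w_pred + r_hat        # 2540: add the decoded residual
    # (in the variant without residual, w_rec = w_pred)
    x_rec = G(T_inv(w_rec))       # 2550: remap to W+ and generate the image
    return w_rec, x_rec
```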
According to the video encoding and video decoding methods described above, the transformation T mapping the latent codes from W+ to the proxy latent space W*c and the entropy model (p) are learned (trained) to optimize a rate-distortion loss which can be written as follows:
L = d(w, T−1(ŵ*)) − λ·E[Σi=1…D log2 pi(ŵ*i)], with ŵ*=Q(T(w)),
where d is any distortion measure between the latent code from W+ and the reconstructed latent code in W+ after mapping T, encoding and inverse mapping T−1, λ is a trade-off parameter, and −E[Σi=1…D log2 pi(ŵ*i)] is an estimate of the coding cost, where E is the expectation and pi is the dimension i of the entropy model (entropy model p of dimension D, D being the dimension of the latent code). One entropy model is trained for the differences. In operation, the learned entropy model is also used for both the differences and the residuals.
During training, the quantization/compression is replaced by adding noise, in a similar manner as in the embodiment using interpolation to obtain intermediate frames. It is to be noted that using one entropy model for both the latent code differences and the residuals (during test) leads to better results; thus, according to a variant, a same entropy model is trained for the differences and the residuals. Having only few dimensions that change between two consecutive latent codes is efficient for entropy coding; thus, according to a variant, an L1 regularization is added on the latent code differences and the final loss becomes the above rate-distortion loss augmented with a term λ1·∥w*t−w*t−1∥1, where λ1 weights the L1 regularization.
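A sketch of the training objective in PyTorch is given below, assuming an entropy model that returns per-dimension likelihoods (in the spirit of CompressAI entropy models) and additive uniform noise as the differentiable proxy for quantization; the function and argument names, as well as the default weights, are illustrative.

```python
import torch

def noisy_quantize(x):
    """Training-time proxy for quantization: add uniform noise in [-0.5, 0.5)."""
    return x + torch.rand_like(x) - 0.5

def rd_loss(w, w_rec, likelihoods, w_diff, lam=0.01, lam_l1=0.0):
    """Rate-distortion loss with optional L1 regularization on the latent
    code differences."""
    distortion = torch.mean((w - w_rec) ** 2)          # d(w, T^-1(w*))
    # Estimated coding cost in bits per sample: -sum_i log2 p_i(.)
    rate = -torch.sum(torch.log2(likelihoods)) / w.shape[0]
    l1 = torch.mean(torch.abs(w_diff))                 # sparsify the differences
    return distortion + lam * rate + lam_l1 * l1
```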
It has been shown that each stage/layer of the StyleGAN generator corresponds to a specific scale of details. Specifically, the first layers, which correspond to coarse resolutions (e.g., 4²-8²), mainly affect high level aspects of the image such as the pose and the face shape, while the last layers affect low level aspects such as textures, colors and small micro structures. According to a variant, such a hierarchical structure is exploited and a different distortion is used for each layer of the generator.
For instance, the latent codes in W+ or W*c consist of 18 latent codes of dimension 512 and each one corresponds to one layer in the generator, hence its dimension is (18, 512).
Specifically, in this variant, smaller values of λ are used for the last layers and larger ones for the first layers.
It is to be noted that when using different distortions, it is better to also use different entropy models and normalizing flows. As a trade-off between complexity and compression efficiency, stage-specific entropy models/NFs are used (i.e., 3 stages are used: layers 1-8, 8-13, 13-18), while using a different λ for each layer.
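The layer-wise weighting can be sketched as follows: the 18 layers are grouped into 3 stages (assumed here to be layers 0-7, 8-12 and 13-17 in zero-based indexing), each stage having its own entropy model/NF, and a per-layer λ weights the rate term; the λ values are purely illustrative.

```python
import torch

# Stage boundaries over the 18 StyleGAN layers (zero-based, illustrative).
STAGES = [(0, 8), (8, 13), (13, 18)]
# Larger lambda for the first (coarse) layers, smaller for the last ones.
LAYER_LAMBDA = torch.tensor([0.05] * 8 + [0.02] * 5 + [0.005] * 5)

def layerwise_rate(likelihoods):
    """Per-layer coding cost weighted by a layer-specific lambda.
    likelihoods has shape (batch, 18, 512)."""
    bits_per_layer = -torch.log2(likelihoods).sum(dim=-1)        # (batch, 18)
    return (LAYER_LAMBDA * bits_per_layer).sum(dim=-1).mean()
```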
Below, algorithms for video compression using inter coding with residual as described above are provided.
Algorithm for the video encoding/decoding method with residual inter-coding (SGANC IC): The result of the method is coded video data comprising a sequence of N compressed frames, or a bitstream comprising coded data representative of the compressed frame sequence {x̂0, x̂1, . . . , x̂t−1, x̂t, . . . , x̂N}, with N being the number of frames. In the following, E stands for the GAN encoder, G the GAN generator, T the learned transformation, EC the entropy coder, ED the entropy decoder and Q the quantizer, GAP being a number of frames in a group of frames.
According to an embodiment, residual coding is performed by groups of frames. In other words, the residual is determined and coded only for the first frame of the group of frames.
At initialization, the frame sequence {x0, x1, x2, . . . , xt−1, xt, . . . , xN} is input to the method, and the first frame is intra coded with ŵ*0 = ED(EC(Q(T(E(x0)))));
Then, for each subsequent frame xt, with w*t = T(E(xt)), the following steps are performed:
d̂t = ED(EC(Q(w*t − w*t−1))); this quantizes, compresses and decompresses the difference;
w̃*t = ŵ*t−1 + d̂t; this determines an estimate (prediction) of the latent code of the current frame;
r̂t = ED(EC(Q(w*t − w̃*t))); this quantizes, compresses and decompresses the residual;
ŵ*t = w̃*t + r̂t; this reconstructs the latent code of the current frame, which is stored for compressing the subsequent latent codes;
x̂t = G(T−1(ŵ*t)); this reconstructs the current frame.
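Putting the steps above together, the SGANC IC encoder over a video sequence can be sketched as below; E, T, Q, EC and ED are the operators defined above, the residual is coded only every GAP frames following the group-of-frames variant, and GAP=4 is an illustrative value.

```python
def encode_video(frames, E, T, Q, EC, ED, gap=4):
    """SGANC IC encoding sketch: intra-code the first latent code, inter-code
    the differences, and add a residual only every `gap` frames."""
    w_prev = T(E(frames[0]))
    code0 = EC(Q(w_prev))
    w_rec = ED(code0)                        # intra-coded first latent code
    bitstream = [code0]
    for t, frame in enumerate(frames[1:], start=1):
        w_t = T(E(frame))
        diff_code = EC(Q(w_t - w_prev))      # coded latent code difference
        d_hat = ED(diff_code)
        bitstream.append(diff_code)
        w_pred = w_rec + d_hat               # prediction of the current code
        if t % gap == 0:                     # residual only every GAP frames
            res_code = EC(Q(w_t - w_pred))
            bitstream.append(res_code)
            w_rec = w_pred + ED(res_code)
        else:
            w_rec = w_pred
        w_prev = w_t
    return bitstream
```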
Below is provided the corresponding Algorithm 3 of the method used for training. The results of the algorithm are the learned transformation T and the entropy model EM.
As input to the training, a video dataset encoded as latent codes in the GAN latent space is provided, with {w1, w2, . . . , wt−1, wt, . . . } being the latent codes of a video sequence, N being the number of frames in each video sequence, S being the size of the dataset, E the GAN encoder and G the GAN generator.
For each video sequence and each latent code w*t = T(wt), the training iteration comprises the following steps:
d̃t = Q(w*t − w*t−1); this quantizes (by adding noise) the difference;
w̃*t = ŵ*t−1 + d̃t; this determines an estimate (prediction) of the latent code;
ŵ*t = w̃*t + Q(w*t − w̃*t); this quantizes (by adding noise) the residual and reconstructs the latent code used in the rate-distortion loss.
A StyleGAN2 generator (G) pretrained on FFHQ dataset is used. The images are encoded in W+ using a pretrained StyleGAN2 encoder (E). The parameters of the generator and the encoder remain fixed in all the experiments. The latent vector dimension in W+ and W*c is 18×512. Celeba-HQ is the image dataset that is used for training and consists of 30000 high quality images (i.e. 1024×1024) of faces. To accelerate the training, all the images are encoded once and the training is done using the latent codes.
For the NF model, Real NVP is used without batch normalization. Each coupling layer consists of 3 fully connected (FC) layers for the translation function and 3 FC for the scale one with LeakyReLU as hidden activation and Tanh as output one.
For the SGANC IC, the models were trained on 2.5k videos from the MEAD dataset, where each batch contains video slices of 9 frames. All the frames are pre-processed as in the embodiment of the SGANC with interpolation. A fully factorized entropy model is trained.
Range Asymmetric Numeral System coder is used to obtain the bitstream. The entropy model is based on the implementation in the CompressAI library. For all the experiments, 2 Adam optimizers with the same parameters are used for both the entropy model and T, with β1=0.9, β2=0.999, learning rate=1e-4 and batch size=8.
In the following, the ablation study for SGANC IC investigates the effect of the following:
According to an embodiment, the methods described above are implemented as instructions causing one or more processors to perform the steps of the methods.
According to an embodiment, the system 100 includes at least one processor 110 configured to execute instructions loaded therein for implementing, for example, the various aspects described in this application. Processor 110 may include embedded memory, an input/output interface, and various other circuitries as known in the art. The system 100 includes at least one memory 120 (e.g., a volatile memory device, and/or a non-volatile memory device). System 100 includes a storage device 140, which may include non-volatile memory and/or volatile memory, including, but not limited to, EEPROM, ROM, PROM, RAM, DRAM, SRAM, flash, magnetic disk drive, and/or optical disk drive. The storage device 140 may include an internal storage device, an attached storage device, and/or a network accessible storage device, as non-limiting examples.
According to an embodiment, system 100 includes an encoder/decoder module 130 configured, for example, to process data to provide an encoded video or decoded video, and the encoder/decoder module 130 may include its own processor and memory. The encoder/decoder module 130 represents module(s) that may be included in a device to perform encoding and/or decoding functions. As is known, a device may include one or both of the encoding and decoding modules. Additionally, encoder/decoder module 130 may be implemented as a separate element of system 100 or may be incorporated within processor 110 as a combination of hardware and software as known to those skilled in the art.
Program code to be loaded onto processor 110 to perform the various aspects described in this application may be stored in storage device 140 and subsequently loaded onto memory 120 for execution by processor 110. In accordance with various embodiments, one or more of processor 110, memory 120, storage device 140, and encoder/decoder module 130 may store one or more of various items during the performance of the processes described in this application. Such stored items may include, but are not limited to, one or more of input video shots, mosaic images, warpings, 3D models, color transform information, visibility maps, matrices, variables, and intermediate or final results from the processing of equations, formulas, operations, and operational logic.
In several embodiments, memory inside of the processor 110 and/or the encoder/decoder module 130 is used to store instructions and to provide working memory for processing that is needed during pre-processing steps of the method described herein and/or video editing. In other embodiments, however, a memory external to the processing device (for example, the processing device may be either the processor 110 or the encoder/decoder module 130) is used for one or more of these functions. The external memory may be the memory 120 and/or the storage device 140, for example, a dynamic volatile memory and/or a non-volatile flash memory. In several embodiments, an external non-volatile flash memory is used to store the operating system of a television. In at least one embodiment, a fast external dynamic volatile memory such as a RAM is used as working memory for video coding and decoding operations, such as for MPEG-2, HEVC, or VVC.
The input to the elements of system 100 may be provided through various input devices as indicated in block 105. Such input devices include, but are not limited to, (i) an RF portion that receives an RF signal transmitted, for example, over the air by a broadcaster, (ii) a Composite input terminal, (iii) a USB input terminal, and/or (iv) an HDMI input terminal.
In various embodiments, the input devices of block 105 have associated respective input processing elements as known in the art. For example, the RF portion may be associated with elements suitable for (i) selecting a desired frequency (also referred to as selecting a signal, or band-limiting a signal to a band of frequencies), (ii) down converting the selected signal, (iii) band-limiting again to a narrower band of frequencies to select (for example) a signal frequency band which may be referred to as a channel in certain embodiments, (iv) demodulating the down converted and band-limited signal, (v) performing error correction, and (vi) demultiplexing to select the desired stream of data packets. The RF portion of various embodiments includes one or more elements to perform these functions, for example, frequency selectors, signal selectors, band-limiters, channel selectors, filters, downconverters, demodulators, error correctors, and demultiplexers. The RF portion may include a tuner that performs various of these functions, including, for example, down converting the received signal to a lower frequency (for example, an intermediate frequency or a near-baseband frequency) or to baseband. In one set-top box embodiment, the RF portion and its associated input processing element receives an RF signal transmitted over a wired (for example, cable) medium, and performs frequency selection by filtering, down converting, and filtering again to a desired frequency band. Various embodiments rearrange the order of the above-described (and other) elements, remove some of these elements, and/or add other elements performing similar or different functions. Adding elements may include inserting elements in between existing elements, for example, inserting amplifiers and an analog-to-digital converter. In various embodiments, the RF portion includes an antenna.
Additionally, the USB and/or HDMI terminals may include respective interface processors for connecting system 100 to other electronic devices across USB and/or HDMI connections. It is to be understood that various aspects of input processing, for example, Reed-Solomon error correction, may be implemented, for example, within a separate input processing IC or within processor 110 as necessary. Similarly, aspects of USB or HDMI interface processing may be implemented within separate interface ICs or within processor 110 as necessary. The demodulated, error corrected, and demultiplexed stream is provided to various processing elements, including, for example, processor 110, and encoder/decoder 130 operating in combination with the memory and storage elements to process the datastream as necessary for presentation on an output device.
Various elements of system 100 may be provided within an integrated housing. Within the integrated housing, the various elements may be interconnected and transmit data therebetween using a suitable connection arrangement 115, for example, an internal bus as known in the art, including the I2C bus, wiring, and printed circuit boards.
The system 100 includes communication interface 150 that enables communication with other devices via communication channel 190. The communication interface 150 may include, but is not limited to, a transceiver configured to transmit and to receive data over communication channel 190. The communication interface 150 may include, but is not limited to, a modem or network card and the communication channel 190 may be implemented, for example, within a wired and/or a wireless medium.
Data is streamed to the system 100, in various embodiments, using a Wi-Fi network such as IEEE 802.11. The Wi-Fi signal of these embodiments is received over the communications channel 190 and the communications interface 150 which are adapted for Wi-Fi communications. The communications channel 190 of these embodiments is typically connected to an access point or router that provides access to outside networks including the Internet for allowing streaming applications and other over-the-top communications. Other embodiments provide streamed data to the system 100 using a set-top box that delivers the data over the HDMI connection of the input block 105. Still other embodiments provide streamed data to the system 100 using the RF connection of the input block 105.
The system 100 may provide an output signal to various output devices, including a display 165, speakers 175, and other peripheral devices 185. The other peripheral devices 185 include, in various examples of embodiments, one or more of a stand-alone DVR, a disk player, a stereo system, a lighting system, and other devices that provide a function based on the output of the system 100. In various embodiments, control signals are communicated between the system 100 and the display 165, speakers 175, or other peripheral devices 185 using signaling such as AV.Link, CEC, or other communications protocols that enable device-to-device control with or without user intervention. The output devices may be communicatively coupled to system 100 via dedicated connections through respective interfaces 160, 170, and 180. Alternatively, the output devices may be connected to system 100 using the communications channel 190 via the communications interface 150. The display 165 and speakers 175 may be integrated in a single unit with the other components of system 100 in an electronic device, for example, a television. In various embodiments, the display interface 160 includes a display driver, for example, a timing controller (T Con) chip.
The display 165 and speaker 175 may alternatively be separate from one or more of the other components, for example, if the RF portion of input 105 is part of a separate set-top box. In various embodiments in which the display 165 and speakers 175 are external components, the output signal may be provided via dedicated output connections, including, for example, HDMI ports, USB ports, or COMP outputs.
Processor 210 is also configured to either receive an image or output a generated image, and to implement a GAN encoder, a GAN generator, or the learnt transformation T or T−1, in order to unfold the first latent space onto the second latent space, to encode at least one image, or to decode at least one image, using the aforementioned methods.
According to an example of the present principles, illustrated in
In accordance with an example, the network is a broadcast network, adapted to broadcast/transmit encoded images from device A to decoding devices including the device B.
A signal, intended to be transmitted by the device A, carries at least one bitstream comprising coded data representative of at least one image.
Various methods are described herein, and each of the methods comprises one or more steps or actions for achieving the described method. Unless a specific order of steps or actions is required for proper operation of the method, the order and/or use of specific steps and/or actions may be modified or combined. Additionally, terms such as “first”, “second”, etc. may be used in various embodiments to modify an element, component, step, operation, etc., for example, a “first decoding” and a “second decoding”. Use of such terms does not imply an ordering to the modified operations unless specifically required. So, in this example, the first decoding need not be performed before the second decoding, and may occur, for example, before, during, or in an overlapping time period with the second decoding.
Unless indicated otherwise, or technically precluded, the aspects described in this application can be used individually or in combination.
Various numeric values are used in the present application. The specific values are for example purposes and the aspects described are not limited to these specific values.
The implementations and aspects described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed may also be implemented in other forms (for example, an apparatus or program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device.
Processors also include communication devices, for example, computers, cell phones, portable/personal digital assistants (“PDAs”), and other devices that facilitate communication of information between end-users.
Reference to “one embodiment” or “an embodiment” or “one implementation” or “an implementation”, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of the phrase “in one embodiment” or “in an embodiment” or “in one implementation” or “in an implementation”, as well any other variations, appearing in various places throughout this application are not necessarily all referring to the same embodiment.
Additionally, this application may refer to “determining” various pieces of information. Determining the information may include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.
Further, this application may refer to “accessing” various pieces of information. Accessing the information may include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, moving the information, copying the information, calculating the information, determining the information, predicting the information, or estimating the information.
Additionally, this application may refer to “receiving” various pieces of information. Receiving is, as with “accessing”, intended to be a broad term. Receiving the information may include one or more of, for example, accessing the information, or retrieving the information (for example, from memory). Further, “receiving” is typically involved, in one way or another, during operations, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
It is to be appreciated that the use of any of the following “/”, “and/or”, and “at least one of”, for example, in the cases of “A/B”, “A and/or B” and “at least one of A and B”, is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B). As a further example, in the cases of “A, B, and/or C” and “at least one of A, B, and C”, such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C). This may be extended, as is clear to one of ordinary skill in this and related arts, for as many items as are listed.
Also, as used herein, the word “signal” refers to, among other things, indicating something to a corresponding decoder. For example, in certain embodiments the encoder signals a quantization matrix for de-quantization. In this way, in an embodiment the same parameter is used at both the encoder side and the decoder side. Thus, for example, an encoder can transmit (explicit signaling) a particular parameter to the decoder so that the decoder can use the same particular parameter. Conversely, if the decoder already has the particular parameter as well as others, then signaling can be used without transmitting (implicit signaling) to simply allow the decoder to know and select the particular parameter. By avoiding transmission of any actual functions, a bit savings is realized in various embodiments. It is to be appreciated that signaling can be accomplished in a variety of ways. For example, one or more syntax elements, flags, and so forth are used to signal information to a corresponding decoder in various embodiments. While the preceding relates to the verb form of the word “signal”, the word “signal” can also be used herein as a noun.
As will be evident to one of ordinary skill in the art, implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted. The information may include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal may be formatted to carry the bitstream of a described embodiment. Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links, as is known. The signal may be stored on a processor-readable medium.
Number | Date | Country | Kind |
---|---|---|---|
21305845.6 | Jun 2021 | EP | regional |
21306026.2 | Jul 2021 | EP | regional |
21306163.3 | Aug 2021 | EP | regional |
21306276.3 | Sep 2021 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2022/066476 | 6/16/2022 | WO |