This application claims priority to Italian Application No. 102023000018537, filed Sep. 11, 2023, which is incorporated herein by specific reference in its entirety.
The present invention relates to a method for learned image compression. The present invention also relates to an autoencoder implementing the method for learned image compression.
End-to-end image compression is gaining momentum as it enables learning the encoder and decoder functions jointly, instead of relying on handcrafted transformations that standard codecs are based on (see Ma S. et al., “Image and video compression with neural networks: A review”, IEEE TCSVT, 2019 [reference 1]).
Most designs depend on an autoencoder architecture, leveraging recent advances in deep artificial neural networks.
At the transmitter side, a convolutional encoder extracts, from the image, a vector of features, known as latent space. This vector has lower dimensionality than the image, achieving preliminary compression. The vector is further quantized, yielding a compressed representation of the image in the form of a bitstream.
At the receiver side, this representation is projected back to the original dimension by a decoder network, recovering the original image.
Encoder and decoder are jointly trained via gradient backpropagation, minimizing some RD (Rate Distortion) cost function in the form λD+R, where D is the reconstruction error, R is the rate of the quantized latent space, i.e. the estimated entropy, and λ regulates the trade-off between these two competing terms.
Several approaches have been proposed to estimate the rate of the latent space.
In Ballé J., Laparra V., and Simoncelli E. P., “End-to-end optimized image compression”, ICLR, 2017, [reference 2], the authors built a simple autoencoder with one single latent space, estimating the rate using a parametric function, while in Ballé J., Minnen D., Singh S., Hwang S. J., and Johnston N., “Variational image compression with a scale hyperprior”, ICLR, 2018, [reference 3], an ad-hoc neural network has been trained within the overall framework.
Upon this seminal idea, more advanced architectures have been built to improve quantitative results in terms of both rate and distortion.
In reference [3], a scale hyperprior latent space was introduced to capture spatial dependencies within an image, while in Minnen D. et al., “Joint autoregressive and hierarchical priors for learned image compression”, Advances in neural information processing systems, 2018, [reference 4], in Lee J. et al., “Context-adaptive entropy model for end-to-end optimized image compression”, International Conference on Learning Representations (ICLR), 2019, [reference 5], and in Minnen D., Saurabh S., “Channel-wise autoregressive entropy models for learned image compression”, IEEE International Conference on Image Processing, 2020, [reference 6], the authors exploited an autoregressive context model, inspired by their success in probabilistic generative models.
Other proposed solutions were for example to add graph-based modules in the autoencoder to capture non-spatial correlations as in Yang C. et al., “Graph-Convolution Network for Image Compression”, IEEE International Conference on Image Processing (ICIP), 2021, [reference 7], to add attention modules as in Cheng, Z. et al., “Learned image compression with discretized gaussian mixture likelihoods and attention modules”, CVPR, 2020, [reference 8], or to exploit Swin-transformers as in Zou R. et al., “The devil is in the details: Window-based attention for image compression”, CVPR 2022, [reference 9].
With reference to
The autoencoder 1 is described in reference [2].
A learnable encoder fa projects an image x into a latent space y=fa(x, θf)∈ℝ^N, where θf represents the learnable parameters of the encoder.
The latent space y has a lower dimension than the image x, achieving preliminary compression.
Then, the latent space y is quantized in a quantizer (U|Q) using a function Q, obtaining a quantized latent space ŷ=Q(y).
The quantized latent space ŷ is entropy coded by an entropy encoder, e.g. by arithmetic coding, producing an actual bitstream.
At the receiver side, the bitstream is entropy-decoded by an entropy decoder and is then fed as input to a decoder fs that projects it to the original image dimension and recovers the reconstruction x̂=fs(ŷ, θg), where θg represents the learnable parameters of the decoder.
In the context of a learned image compression framework, the entropy model used to encode the latent space y is represented by a probability distribution pŷ, whose role is to approximate the real marginal distribution, which is unknown a priori.
The autoencoder 1 should be trained end-to-end via standard gradient descent on the backpropagated error gradient. Namely, training the autoencoder 1 amounts to finding the learnable parameters (θf, θg) that minimize the cost function:
$$\mathcal{L} = \lambda\, d(x, \hat{x}) + R(\hat{y}) \qquad (1)$$
where d is a distortion metric, R(ŷ)=E[−log2 pŷ(ŷ)] is an estimation of the rate of the latent space, and λ controls the RD trade-off between these two competing terms.
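As a rough illustration of how a cost of the form λD+R is evaluated, the following Python sketch combines a distortion term with an ideal code length. The function names, the choice of mean squared error as the distortion d, the toy probability table and the normalization of the rate by the number of pixels are assumptions made for this example only; they are not taken from the references.

```python
import math

def mse(x, x_hat):
    """Mean squared error between an image and its reconstruction (flat lists)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def rate_bits(symbols, pmf):
    """Ideal code length, in bits, of the quantized symbols under the entropy model."""
    return sum(-math.log2(pmf[s]) for s in symbols)

def rd_cost(x, x_hat, symbols, pmf, lam):
    """Rate-distortion cost lam * D + R, with R expressed in bits per pixel (assumption)."""
    return lam * mse(x, x_hat) + rate_bits(symbols, pmf) / len(x)

# Toy data: four pixels, two quantized latent symbols.
x = [0.0, 1.0, 2.0, 3.0]
x_hat = [0.0, 1.0, 2.0, 5.0]
pmf = {0: 0.5, 1: 0.25, 2: 0.25}
cost = rd_cost(x, x_hat, symbols=[0, 1], pmf=pmf, lam=0.01)
```

A larger λ makes the distortion term dominate, which pushes training toward higher quality at a higher rate.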
However, training the autoencoder via backpropagation requires all the terms of the cost function to be differentiable, which is not the case here, because the quantization step is not differentiable.
In reference [2], rounding quantization is replaced by adding uniform noise to the latent space y, obtaining ỹ=y+Δ, where Δ~U(−0.5, 0.5). This approach has two advantages: first, the density function pỹ is a continuous relaxation of the discrete probability mass function pŷ (see reference [2]); second, the moments of the quantized random variable are the same as those of the original signal plus an additive signal-independent uniform noise (see Gray R. M. and Neuhoff D. L., “Quantization”, IEEE Transactions on Information Theory, 1998, [reference 11]).
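The second advantage can be checked numerically. The sketch below is only an illustration under stated assumptions: the latent values are drawn from a hypothetical zero-mean Gaussian with standard deviation 2, and the sample size and seed are arbitrary. It compares the variance of the noisy values with the variance of the rounded values; both should exceed the variance of the original signal by roughly 1/12, the variance of U(−0.5, 0.5).

```python
import math
import random

def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)  # fixed seed so that the sketch is repeatable

# Hypothetical latent values (assumption): zero-mean Gaussian, standard deviation 2.
y = [rng.gauss(0.0, 2.0) for _ in range(100_000)]

# Training-time relaxation: add noise drawn from U(-0.5, 0.5).
y_tilde = [v + rng.uniform(-0.5, 0.5) for v in y]

# Inference-time quantization: round to the nearest integer.
y_hat = [math.floor(v + 0.5) for v in y]

var_noisy = variance(y_tilde)
var_rounded = variance(y_hat)
extra_variance = var_noisy - variance(y)  # expected to be close to 1/12
```

The agreement holds when the density of y is smooth at the scale of the quantization step; for very narrow distributions the approximation degrades.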
Another crucial step is to define an effective proxy of the rate R. To tackle this problem, works like those described in references [2], [3], [5] and [8] introduce a parametric function to estimate the density function pỹ during training, modeling it as a fully factorized model defined as
$$p_{\tilde{y}\mid\Psi}(\tilde{y}\mid\Psi) = \prod_i \left( p_{y_i\mid\Psi^{(i)}}\big(\Psi^{(i)}\big) * \mathcal{U}\!\left(-\tfrac{1}{2}, \tfrac{1}{2}\right) \right)(\tilde{y}_i) \qquad (2)$$
where * represents the convolution operation, and Ψ represents the learnable parameters related to the entropy model.
In particular, reference [2] models each marginal of the density function pỹ with a piecewise linear function whose parameters represent the values at specific sampling points.
Conversely, in references [3], [8] and Lee J. et al., “DPICT: deep progressive image compression using trit-planes”, IEEE/CVF CVPR, 2022 [reference 12], the density function pỹ is modeled via its cumulative distribution function by an auxiliary neural network Ψ trained jointly with the entire image compression framework. The neural network Ψ must guarantee the theoretical characteristics of a cumulative function, namely the positivity and the boundedness between 0 and 1. In particular, the neural network Ψ is modeled as a cascade of K parametric vector functions τk, obtaining Ψ=τK∘τK−1∘…∘τ1.
The actual choice of τ in references [3], [4] and [8] is:
$$\tau_k(x) = m_k(T_k x + b_k), \quad 1 \le k < K; \qquad \tau_K(x) = \mathrm{sigmoid}(T_K x + b_K) \qquad (3)$$
where mk(x)=x+ak⊙tanh(x), tanh represents the non-linearity, ⊙ represents the elementwise multiplication, and (Tk, bk, ak) are vectors of trainable parameters which form the auxiliary neural network Ψ.
To respect the cumulative conditions mentioned above, a reparametrization step is performed at each step, by applying the softplus and tanh functions to the vectors of trainable parameters Tk and bk, respectively.
The methods according to the references [3], [8] and [12] implemented an auxiliary neural network Ψ with K=4, obtaining an architecture with around 20000 parameters, depending on the dimension of the latent space.
As more complex architectures were developed, the entropy model also became more precise.
With reference to
In this case, while the probability distribution pẑ modelling the latent space is a fully-factorized model like in the above indicated equation (2), p(ŷ|z̃) is parameterized as a zero-mean Gaussian distribution with the scale factor equal to σ².
To train the autoencoder 2 of reference [3] described in
where R(ẑ) is equivalent to the rate term in equation (1), while R(ŷ|ẑ)=E[−log2 pŷ(ŷ|ẑ)] is the rate term related to the Gaussian distributed latent space.
Following that, references [4] and [6] exploit a context model Cm, formed by a masked convolution and a parameter estimation module, with a mean-scale hyperprior to extract a more accurate entropy model, while reference [8] modeled p(ŷ|ẑ) as a mixture of Gaussians and added attention modules to the autoencoder 2: both of these architectures are represented in
As already mentioned, the above approaches rely on a neural network Ψ for learning a parametric entropy model referring to ẑ at training time.
However, this not only increases the training complexity but also requires a fine-tuning step whenever a context from which updated statistics can be extracted becomes available.
At inference time, uniform scalar quantization over integers is performed, considering L different quantization levels uniformly distributed in a symmetric range
obtaining a bounded latent representation, which is enforced during training.
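A minimal sketch of bounded uniform scalar quantization over integers follows. The exact bounds of the symmetric range are not reproduced here, so the clamp to [−L/2, L/2] is an assumption, chosen to agree with the statement elsewhere in this description that L=120 bounds the latent space between −60 and 60; the function name is likewise hypothetical.

```python
import math

def quantize(value, num_levels=120):
    """Round to the nearest integer, then clamp to the assumed range [-L/2, L/2]."""
    half_range = num_levels // 2
    rounded = math.floor(value + 0.5)  # round half up, avoiding Python's banker's rounding
    return max(-half_range, min(half_range, rounded))

# Quantize a short, hypothetical latent channel.
channel = [0.4, 0.5, -0.6, 75.2, -100.0]
quantized = [quantize(v) for v in channel]
```

Clamping keeps every symbol inside a finite alphabet, which is what allows a fixed probability table per channel to be used by the arithmetic coder.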
To deal with the quantization step, uniform noise during training is added as in reference [2] and the entropy model is fully factorized as in equation (2), meaning that each channel of the latent space has its distribution over the quantization levels, and there is no correlation between different channels.
Indicating with lij the i-th quantization level of the j-th channel and with
its associated quantization interval, the first order entropy Hp̂ of the probability distribution pŷ is expressed as
Unfortunately, the fact that the probability distribution pŷ is a discrete distribution makes equation (5) non-differentiable, making it impossible to minimize it within a gradient-based optimization framework.
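The non-differentiability can be made concrete with a small Python sketch. It computes the first order entropy of a channel from hard counts of the rounded values and shows that a small perturbation of the inputs leaves the result unchanged, so that a finite-difference gradient is zero almost everywhere. The function names and the toy channel are assumptions made for illustration.

```python
import math
from collections import Counter

def first_order_entropy(symbols):
    """Entropy, in bits per symbol, of the empirical distribution of the symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def hard_entropy(channel):
    """Entropy of a channel after rounding each value to the nearest integer."""
    return first_order_entropy([math.floor(v + 0.5) for v in channel])

channel = [0.1, 0.2, 0.9, 1.1, -0.3, 2.4]
shifted = [v + 0.01 for v in channel]  # small perturbation of every value

entropy_before = hard_entropy(channel)
entropy_after = hard_entropy(shifted)  # same rounded symbols, hence same entropy
finite_difference = (entropy_after - entropy_before) / 0.01
```

Because the hard counts are piecewise constant in the latent values, the gradient carries no information for the optimizer, which is the obstacle the relaxed formulation is meant to remove.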
To sum up, all the prior art approaches rely on a neural network to estimate the latent space rate, however not without drawbacks.
First, the neural network requires training itself, adding to the encoder complexity and learning time.
Second, assuming that a temporal context is available, the neural network must be refined, i.e. retrained, on the context.
The present invention aims at solving these and other problems by providing a method for learned image compression, and a related autoencoder, that reduce the overall training complexity.
A further object of the present invention is to provide a method for learned image compression, and a related autoencoder, that accelerate rate convergence at training time.
A further object of the present invention is to provide a method for learned image compression, and a related autoencoder, that are suitable for any learnable image compression scheme.
A further object of the present invention is to provide a method for learned image compression, and a related autoencoder, that do not require any additional training to update the entropy model when a temporal context is available.
In a nutshell, the method according to the invention proposes a non-parametric model of the latent space entropy distribution as a proxy of the encoding rate.
The method according to the invention models the rate of the quantized latent space as a differentiable function that can be optimized at training time through backpropagation.
The method according to the present invention, and the related autoencoder, estimate the latent space statistical frequencies in a differentiable way, so that the estimate can be plugged into the RD cost function as a proxy of the rate at training time.
The proposed technical solution not only fulfills the requirements for learning with standard gradient backpropagation, but is agnostic of the overall autoencoder architecture, and can be adapted without lengthy refinement procedures.
As the entropy model according to the present invention is non-parametric, it reduces the overall complexity and accelerates rate convergence at training time.
Moreover, whenever a temporal context is available, no additional training is required to update the entropy model.
The method has been tested on a variety of learned image compression architectures: similar performance was achieved with a static entropy model, with a slight improvement when the model is updated over a temporal context.
According to a first embodiment of the method according to the invention, a method for learned image compression is described, implemented in an autoencoder comprising a learnable encoder and a decoder, said method comprising the steps of:
According to an aspect of the first embodiment of the method according to the invention, the latent space comprises a number Nc of latent space channels having a dimension Nd, and wherein, given a j-th channel of the latent space and a quantization level lij of the j-th channel, the soft frequency counter associates every value of the latent space ynj to a weight inversely proportional to the distance with lij, where n varies within the same channel and ranges from 1 to Nd.
According to a further aspect of the first embodiment of the method according to the invention, the soft frequency counter relies on a scalar function ϕij and a first order entropy Hp̃ of a probability distribution p̃j for every single channel of the latent space is:
According to a further aspect of the first embodiment of the method according to the invention, the cost function L is
where d(x, x̂) is a reconstruction error, Hp̃
According to a second embodiment of the method according to the invention, a method for learned image compression is described, implemented in an autoencoder comprising a learnable encoder and a decoder, the method comprising the steps of:
According to an aspect of the second embodiment of the method according to the invention, the hyperprior representation comprises a number Nc of latent space channels having a dimension Nd, and wherein, given a j-th channel of the hyperprior representation and a quantization level lij of the j-th channel, the soft frequency counter associates every value of the latent hyperprior representation znj to a weight inversely proportional to the distance with lij, where n varies within the same channel and ranges from 1 to Nd.
According to a further aspect of the second embodiment of the method according to the invention, the soft frequency counter relies on a scalar function ϕij and the first order entropy Hp̃ of a probability distribution p̃j for every single channel of said latent space is:
According to a further aspect of the second embodiment of the method according to the invention, the cost function L is
where d(x, x̂) is the reconstruction error, Hp̃
According to a first embodiment of an autoencoder for learned image compression according to the invention, the autoencoder comprises:
According to a first aspect of the first embodiment of the autoencoder for learned image compression according to the invention, the latent space comprises a number Nc of latent space channels having a dimension Nd, and wherein, given a j-th channel of the latent space and a quantization level lij of the j-th channel, the soft frequency counter is adapted to associate every value of the latent space ynj to a weight inversely proportional to the distance with lij, where n varies within the same channel and ranges from 1 to Nd.
According to a further aspect of the first embodiment of the autoencoder for learned image compression according to the invention, the soft frequency counter relies on a scalar function ϕij and the first order entropy Hp̃ of a probability distribution p̃j for every single channel of the latent space is:
According to a further aspect of the first embodiment of the autoencoder for learned image compression according to the invention, the cost function L is
where d(x, x̂) is a reconstruction error, Hp̃
According to a second embodiment of an autoencoder for learned image compression according to the invention, the autoencoder comprises:
According to a first aspect of the second embodiment of an autoencoder for learned image compression according to the invention, the hyperprior representation comprises a number Nc of latent space channels having a dimension Nd, and wherein, given a j-th channel of the hyperprior representation and a quantization level lij of the j-th channel, the soft frequency counter associates every value of the hyperprior representation znj to a weight inversely proportional to the distance with lij, where n varies within the same channel and ranges from 1 to Nd.
According to a further aspect of the second embodiment of an autoencoder for learned image compression according to the invention, the soft frequency counter relies on a scalar function ϕij and the first order entropy Hp̃ of a probability distribution p̃j for every single channel of the latent space is:
According to a further aspect of the second embodiment of an autoencoder for learned image compression according to the invention, the cost function L is
where d(x, x̂) is a reconstruction error, Hp̃
The invention will be described in detail hereinafter through non-limiting embodiments with reference to the attached Figures, in which:
With reference to
Given the j-th channel and a quantization level lij, the desired formulation must associate every value of the latent space ynj to a weight inversely proportional to the distance with lij, where n varies within the same channel and ranges from 1 to Nd: in this way, by adding all the weights together, a soft counter is obtained which is higher for the most represented levels.
To model such a mechanism, the soft frequency counter SFC(lij) according to the invention relies on a scalar function ϕij, whose behaviour is depicted in
Given the considered level lij, any value of y that lies outside the quantization range has zero weight, thus not contributing to the soft frequency counter SFC(lij); on the contrary, values within the quantization range are linearly weighted according to their distance from the center, with maximum weight equal to 1 when y=lij. Since the soft frequency counter is a relaxed approximation of the frequency counter for y, its results can be normalized among the different levels, thus obtaining relaxed statistical frequencies for estimating the probability distribution p̃j for every single channel. In particular, we have that
Relaxation of the entropy represented by equation (6) makes it advantageously possible to directly minimize it during the training phase, so it can be inserted in RD cost functions.
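The mechanism described above can be sketched for a single channel in plain Python. This is an illustration under assumptions, not the formulation of the invention itself: the triangular form of the weight, the `half_width` parameter that sets where the weight reaches zero, the small `eps` used to avoid empty levels, and all function names are hypothetical choices made for this example.

```python
import math

def phi(value, level, half_width=0.5):
    """Triangular weight: 1 when value == level, decreasing linearly to 0 at the edge."""
    return max(0.0, 1.0 - abs(value - level) / half_width)

def soft_frequency_counter(channel, levels, half_width=0.5):
    """Sum of the weights of all channel values, one soft count per quantization level."""
    return [sum(phi(v, level, half_width) for v in channel) for level in levels]

def relaxed_probabilities(channel, levels, half_width=0.5, eps=1e-9):
    """Normalize the soft counts over the levels to obtain relaxed frequencies."""
    counts = soft_frequency_counter(channel, levels, half_width)
    total = sum(counts) + eps * len(levels)
    return [(c + eps) / total for c in counts]

def relaxed_entropy(channel, levels, half_width=0.5):
    """First order entropy, in bits, of the relaxed frequencies of one channel."""
    probs = relaxed_probabilities(channel, levels, half_width)
    return -sum(p * math.log2(p) for p in probs)

levels = [-1, 0, 1]
channel = [0.0, 0.1, 0.9, -1.0]
probs = relaxed_probabilities(channel, levels)
entropy = relaxed_entropy(channel, levels)
# Unlike hard counting, a small shift of the inputs changes the result.
shifted_entropy = relaxed_entropy([v + 0.01 for v in channel], levels)
```

In a training framework the same operations would be written with differentiable tensor primitives, so that the gradient of the relaxed entropy with respect to the latent values can be backpropagated.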
Since the formulation according to the invention is consistent with the frequency statistics, at inference time this relaxation is advantageously replaced by the actual frequency statistics of a limited batch of images.
The entropy model according to the invention can be plugged into the cost function minimized when training a generic learned image compression algorithm.
The model according to the invention is advantageously agnostic to the underlying autoencoder architecture, so it can be in principle plugged into any learnable image compression scheme.
To show some examples, the method is applied to the autoencoders of references [2], [3], [4], [8], depicted in
The replacement of the auxiliary network Ψ with the soft frequency counter SFC according to the invention makes it possible to obtain an autoencoder 3 according to a first embodiment of the invention (
In particular, for a model based on architecture according to reference [2], the cost function becomes:
For hyperprior-based architectures like those described in references [3], [4] and [8], the cost function turns into
Each architecture can then be advantageously trained via standard gradient descent as usual.
To use learned image compression architectures at inference time, it is necessary to first extract the entropy model for the arithmetic codec. While standard frameworks like those described in references [3], [4] and [8] exploit the neural network Ψ trained on the whole dataset, in the model according to the present invention it is enough to compute the entropy model by applying equation (7), using as input of equation (8) the quantized latent space ŷ of a small subset of the training set, whose size is denoted as ω, and to use the result as the actual probability distribution of the latent space.
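The extraction of an entropy model from a small batch can be sketched with hard frequency statistics as follows. The data layout (a batch is a list of images, an image is a list of channels, a channel is a list of quantized integers), the probability floor that keeps unseen levels encodable, and the function names are assumptions made for this illustration; the sketch also assumes every symbol has already been clamped to the listed levels.

```python
import math
from collections import Counter

def channel_pmfs(batch, levels, floor=1e-6):
    """Per-channel probability tables from the quantized latents of a small batch."""
    num_channels = len(batch[0])
    pmfs = []
    for j in range(num_channels):
        counts = Counter(v for image in batch for v in image[j])
        total = sum(counts.values())
        # Floor the frequencies (assumption) so that no level has zero probability.
        raw = [max(counts.get(level, 0) / total, floor) for level in levels]
        norm = sum(raw)
        pmfs.append({level: p / norm for level, p in zip(levels, raw)})
    return pmfs

def ideal_code_length_bits(image, pmfs):
    """Bits an ideal entropy coder would spend on one image under the given tables."""
    return sum(-math.log2(pmfs[j][v]) for j, ch in enumerate(image) for v in ch)

# A batch of two single-channel "images" (omega = 2 in this toy case).
levels = [-1, 0, 1]
batch = [[[0, 0, 1, 1]], [[0, 0, 0, -1]]]
pmfs = channel_pmfs(batch, levels)
bits = ideal_code_length_bits([[0, 1]], pmfs)
```

Since the tables are plain frequency counts, recomputing them on a different batch requires no training step.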
Later in this description, it will be proved how this strategy allows for easy adaptation of the entropy model with some temporal context.
The entropy model according to the invention is now evaluated over four state-of-the-art learnable image compression schemes described in references [2], [3], [4], [8], each of them with different characteristics and architectures.
At first, all the details are given about how both the autoencoders of references [2], [3], [4] and [8] and the architectures according to the present invention are trained; then the results obtained according to the present invention are evaluated in terms of rate-distortion performance on two distinct datasets, namely the Kodak dataset (Eastman Kodak Company, “Kodak Lossless True Color Image Suite”, 1999 [reference 13]) and the CLIC validation dataset (Toderici G. et al., “Workshop and challenge on learned image compression”, CVPR, 2021 [reference 14]).
It is also investigated how the method according to the present invention allows a faster convergence of the rate term. Finally, it is shown how the entropy model can be quickly adapted based on simple frequency statistics, reporting performance in terms of BD-Rate and BD-PSNR on the JVET dataset described in Joint Video Exploration Team (JVET) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JVET-G1010: “JVET common test conditions and software reference configurations”, 7th Meeting, 2017, Torino, Italy [reference 15].
As a final remark, the experiments according to the present invention aim at assessing the effectiveness of the entropy model according to the invention over different architectures, not at comparing their relative performance.
As far as the training details are concerned, for an unbiased comparison, each architecture used was trained from scratch as for the reference algorithm. For the concerned models, the entropy model according to the present invention is plugged in equation (6) in lieu of the auxiliary neural network according to the prior art, and each architecture was retrained from scratch.
All models were trained over the Vimeo-90K dataset described in Xue T. et al., “Video Enhancement with Task-Oriented Flow”, International Journal of Computer Vision (IJCV), 2019, [reference 16] over 256×256 patches cropped at random from the training images, and multiple RD tradeoffs were obtained by properly imposing different values of λ, which ranges between 0.0009 and 0.045.
L=120 was heuristically set, meaning that the latent space was automatically bounded in a range between −60 and 60. The initial learning rate was set to 1e-4 and it was halved whenever the cost function hit a plateau, with a patience of 20 epochs.
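The learning rate schedule can be sketched as below. The class name and the exact plateau rule (the rate is halved once the cost has failed to improve for more than `patience` consecutive epochs, after which the counter restarts) are assumptions for illustration; the description does not specify how the plateau is detected.

```python
class HalveOnPlateau:
    """Halve the learning rate when the monitored cost stops improving."""

    def __init__(self, learning_rate=1e-4, patience=20, factor=0.5):
        self.learning_rate = learning_rate
        self.patience = patience
        self.factor = factor
        self.best_cost = float("inf")
        self.bad_epochs = 0

    def step(self, cost):
        """Record the cost of one epoch and return the learning rate to use next."""
        if cost < self.best_cost:
            self.best_cost = cost
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.learning_rate *= self.factor
                self.bad_epochs = 0
        return self.learning_rate

scheduler = HalveOnPlateau()
scheduler.step(1.0)  # first epoch sets the best cost
rates = [scheduler.step(1.0) for _ in range(21)]  # 21 epochs without improvement
```

With the defaults above, the rate stays at 1e-4 for the first 20 stagnant epochs and drops to 5e-5 on the 21st.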
Each architecture was trained for 1-2 million steps and with batch sizes of 32 images. At inference time, the entropy model was extracted exploiting a subset of the training set, and a non-adaptive arithmetic coder was used to encode and decode the latent space, by fixing ω=32.
All the experiments were performed leveraging the CompressAI library codebase as described in Bégaint, J. et al., “CompressAI: a PyTorch library and evaluation platform for end-to-end compression research”, arXiv preprint arXiv:2011.03029, 2020, [reference 17], the code being publicly available at https://github.com/EIDOSLAB/SFC.
For clarity, MS-SSIM was converted to −10 log10(1 − MS-SSIM). For each considered architecture, the solid line represents the reference scheme with the auxiliary neural network Ψ, whereas the dotted line represents the results obtained with the non-parametric entropy model according to the invention. For all the considered baseline models, the method according to the invention performs close to, if not identically to, the original reference, especially at low bitrates.
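The conversion of MS-SSIM to a logarithmic scale is a one-line formula; the short sketch below, with a hypothetical function name, shows why it helps: scores that are all close to 1 become well separated once expressed in decibels.

```python
import math

def ms_ssim_to_db(ms_ssim):
    """Convert an MS-SSIM score in [0, 1) to decibels: -10 * log10(1 - MS-SSIM)."""
    return -10.0 * math.log10(1.0 - ms_ssim)

# Scores of 0.9, 0.99 and 0.999 map to roughly 10, 20 and 30 dB respectively.
scores_db = [ms_ssim_to_db(s) for s in (0.9, 0.99, 0.999)]
```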
It is pointed out that the performance gap is negligible for architectures where the entropy model according to the present invention is used to estimate the rate of a hyperprior latent space as in references [3], [4], and [8].
On the other hand, a little decrease in performance is visible concerning the autoencoder of reference [2]: however, it will be shown in the following how adapting the entropy model according to the present invention with a simple statistics computation closes this gap.
Besides the quantitative performance, it is investigated how the entropy model according to the present invention impacts the training process convergence.
Namely, the first 20 iterations of the architectures of references [2] and [8] are analyzed for λ=0.0018 and c=128.
With the formulation proposed in the present invention, the rate term, representing the estimated entropy, converges in just a few epochs, while distortion drops regularly as for the references: this fact means that the formulation proposed in the present invention advantageously leads to a faster convergence to the stable configuration in terms of rate.
Typically, such frameworks automatically allocate bits to different channels, shrinking to zero any useless ones. This also applies to the formulation proposed in the present invention, but redundant channels are discarded more quickly, as it could already be imagined from
While for reference [8] some intermediate steps are necessary to reach the final configuration, with the model according to the present invention the right distribution is advantageously achieved immediately.
While modern video codecs rely on a context-adaptive arithmetic coder (CABAC), recent learned image codecs like those of references [2], [3], [4] and [8] use a fixed probability distribution extracted through the auxiliary neural network, which must be retrained whenever the entropy model is to be adapted.
The entropy formulation according to the present invention is parameter-free, since it is only based on equation (5), and can be updated to a given context by recomputing simple statistics.
To prove this point, the long JVET video sequences, consisting of different contents at different resolutions (up to 4K, i.e. 2160p), were encoded using the very same architectures trained above (no retraining performed): the architectures of references [2] and [4] were taken as baselines, to consider cases where the method according to the present invention is applied to latent spaces of different types.
For the first frame only, the entropy model computed at training time using the Vimeo dataset was relied upon, since no temporal context was available. For extracting the distribution model, a batch of 32 images from the dataset was utilized.
However, for the following frames, the entropy model was adapted by averaging the current entropy distribution with frequency statistics of the previous frame, calculated using equation (7).
As two adjacent frames of the same sequence are in most cases very similar, the distributions of the respective latent spaces are expected to be similar as well.
To make a fairer comparison, this method was also tested with a sampling rate of 16, meaning that the entropy model was adapted every 16th frame only instead of exploiting every single one: this configuration is more similar to the one used by classic codecs.
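The adaptation described above can be sketched for a single channel as follows. The function names, the equal-weight average and the convention that the model is refreshed after frames 0, 16, 32 and so on when the sampling rate is 16 are assumptions made for this illustration; the update uses only frames that the decoder has already reconstructed, so both sides can stay synchronized.

```python
from collections import Counter

def frame_frequencies(channel, levels):
    """Hard frequency statistics of one quantized channel over the given levels."""
    counts = Counter(channel)
    return [counts.get(level, 0) / len(channel) for level in levels]

def adapt_pmf(current_pmf, previous_frame_freq):
    """Average the current distribution with the statistics of the previous frame."""
    return [(c + f) / 2 for c, f in zip(current_pmf, previous_frame_freq)]

def models_per_frame(frames, initial_pmf, levels, sampling_rate=1):
    """Return the probability table used to code each frame of a sequence."""
    pmf = list(initial_pmf)
    used = []
    for t, frame in enumerate(frames):
        used.append(pmf)  # frame t is coded with the model available so far
        if t % sampling_rate == 0:
            pmf = adapt_pmf(pmf, frame_frequencies(frame, levels))
    return used

levels = [0, 1]
frames = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 1, 1]]
every_frame = models_per_frame(frames, [0.5, 0.5], levels)
every_other = models_per_frame(frames, [0.5, 0.5], levels, sampling_rate=2)
```

In this toy sequence the first frame is always coded with the initial table, and with `sampling_rate=2` the third frame reuses the table computed after the first one.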
Thanks to adaptive entropy modeling, the BD-Rate improves by more than 10% for reference [2], which was previously the worst result, closing the gap with the reference model.
These gains are attributed to the fact that in reference [2] all the information required to reconstruct the image is encoded in the latent space whose entropy the model according to the present invention accounts for, whereas in reference [4] only the hyperprior latent representation is modeled.
As can be observed, in both cases refining the entropy model by exploiting temporal context yields about a 10% better rate, without affecting the distortion results.
According to the invention, a differentiable and non-parametric model of the latent space entropy is proposed as a proxy of the rate in the RD cost function.
The model according to the invention is built around a soft statistical counter that attributes to each quantization level a value proportional to its effective frequency in a specific channel of the latent space and which, once normalized, can be used as a proxy of the entropy model.
Experimental results with four different learned image compression architectures show performance similar to the case where a neural network estimates the latent space rate, and prove that the formulation according to the invention achieves a stable solution faster than the reference models.
Moreover, it is advantageously possible to update the entropy distribution by exploiting temporal content without any retraining, achieving overall slight improvements in the performance.
The present description has tackled some of the possible variants, but it will be apparent to the person skilled in the art that other embodiments may also be implemented, wherein some elements may be replaced with other technically equivalent elements. The present invention is not therefore limited to the explanatory examples described herein, but may be subject to many modifications, improvements or replacements of equivalent parts and elements without departing from the basic inventive idea, as set out in the following claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 102023000018537 | Sep 2023 | IT | national |