The efficient transmission of videos and images has driven an unprecedented surge in telecommunication traffic over the past decade. All coding technologies, applied in different use cases, address the same compression task: given a budget of R* bits for storage, the goal is to transmit the image (or video) with bitrate R ≤ R* and minimal distortion d. The optimization has an equivalent formulation as
min(d+λR), (1)
where λ is the Lagrange parameter, which depends on R*. Advanced video codecs like HEVC [1, 2] and VVC [3, 4] address the compression task by a hybrid, block-based approach. The current frame is partitioned into smaller sub-blocks, and intra-frame prediction or motion estimation is applied to each of these blocks. The resulting prediction residual is transform-coded, using a context-adaptive arithmetic coding engine. Here, the encoder performs a search among several coding options for selecting the block partition as well as the prediction signal, the transform and the transform coefficient levels; see, for example, [5]. This search is referred to as rate-distortion optimization (RDO): the encoder extensively tests different coding decisions and compares their impact on the Lagrangian cost (1). Algorithms for RDO are crucial to the performance of modern video coding systems and rely on approximations of d and R, disregarding certain interactions between the coding decisions; [6]. Considering the spatial and temporal dependencies inside video signals, the authors of [7] have investigated several techniques for optimal bit allocation. Furthermore, as the quantization has a strong impact on the Lagrangian cost (1), there are several algorithms for selecting the quantization indices of a transform block [8, 9]. In general, the performance of hybrid video encoders heavily relies on such signal-dependent optimizations.
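For illustration only, such a search may be sketched as follows (the helper functions rate and distortion are hypothetical placeholders for the encoder's internal estimates and are not part of any of the cited codecs):

# Minimal sketch of a rate-distortion decision among coding options.
# "options" could be block partitions, prediction modes, etc.; rate() and
# distortion() are hypothetical estimators returning R (in bits) and d.
def best_option(options, lam, rate, distortion):
    return min(options, key=lambda o: distortion(o) + lam * rate(o))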
In contrast to the aforementioned block-based hybrid approach, the data-driven training of non-linear transforms for image compression has become a promising prospect; [10]. The authors use stochastic gradient descent for jointly training an auto-encoder built from convolutional neural networks (CNN) with a conditional probability model for its quantized features. Ballé et al. have introduced generalized divisive normalizations (GDN) as non-linear activations. The auto-encoder has been enhanced by using a second auto-encoder (called hyper system) for compressing the parameters of the estimated probability density of the features; [13]. The authors have added an auto-regressive model for the probability density of the features and reported a compression efficiency which surpasses HEVC in an RGB setting for still image compression. In [15], the authors have successfully trained a compression system with octave convolutions and features at different scales, similar to the composition of natural images into high and low frequencies; [16].
The introduced concepts, in particular the ones of [10] to [16], such as the auto-encoder concept, GDNs as activation function, the hyper system, the auto-regressive entropy model and the octave convolutions and feature scales, may be implemented in embodiments of the present disclosure.
There is a general desire in video and image coding to improve the trade-off between a small size of the compressed image and a low distortion of the reconstructed image, as explained with respect to rate R and distortion d in the previous section.
This object is achieved by the subject matter of the independent claims enclosed herewith. Embodiments according to the independent claims provide a coding concept with a good rate-distortion trade-off.
An embodiment may have an apparatus for decoding a picture from a binary representation of the picture, wherein the decoder is configured for deriving a feature representation of the picture from the binary representation using entropy decoding, wherein the feature representation comprises a plurality of partial representations comprising first partial representations, second partial representations and third partial representations, wherein a resolution of the first partial representations is higher than a resolution of the second partial representations, and the resolution of the second partial representations is higher than a resolution of the third partial representations, and using a multi-layered convolutional neural network, CNN, for reconstructing the picture from the feature representation.
Another embodiment may have an apparatus for encoding a picture, configured for using a multi-layered convolutional neural network, CNN, for determining a feature representation of the picture, encoding the feature representation using entropy coding, so as to acquire a binary representation of the picture, wherein the CNN is configured for determining, on the basis of the picture, a plurality of partial representations of the feature representation comprising first partial representations, second partial representations and third partial representations, wherein a resolution of the first partial representations is higher than a resolution of the second partial representations, and the resolution of the second partial representations is higher than a resolution of the third partial representations.
Another embodiment may have a method for decoding a picture from a binary representation of the picture, the method comprising: deriving a feature representation of the picture from the binary representation using entropy decoding, wherein the feature representation comprises a plurality of partial representations comprising first partial representations, second partial representations and third partial representations, wherein a resolution of the first partial representations is higher than a resolution of the second partial representations, and the resolution of the second partial representations is higher than a resolution of the third partial representations, and using a multi-layered convolutional neural network, CNN, for reconstructing the picture from the feature representation.
Another embodiment may have a method for encoding a picture, the method comprising: using a multi-layered convolutional neural network, CNN, for determining a feature representation of the picture, encoding the feature representation using entropy coding, so as to acquire a binary representation of the picture, wherein the CNN is configured for determining, on the basis of the picture, a plurality of partial representations of the feature representation comprising first partial representations, second partial representations and third partial representations, wherein a resolution of the first partial representations is higher than a resolution of the second partial representations, and the resolution of the second partial representations is higher than a resolution of the third partial representations.
Another embodiment may have a bitstream into which a picture is encoded using an apparatus for encoding a picture, configured for using a multi-layered convolutional neural network, CNN, for determining a feature representation of the picture, encoding the feature representation using entropy coding, so as to acquire a binary representation of the picture, wherein the CNN is configured for determining, on the basis of the picture, a plurality of partial representations of the feature representation comprising first partial representations, second partial representations and third partial representations, wherein a resolution of the first partial representations is higher than a resolution of the second partial representations, and the resolution of the second partial representations is higher than a resolution of the third partial representations.
According to embodiments of the invention, a picture is encoded by determining a feature representation of the picture using a multi-layered convolutional neural network, CNN, and by encoding the feature representation.
Embodiments according to a first aspect of the invention rely on the idea of determining a feature representation of a picture to be encoded, which feature representation comprises partial representations of three different resolutions. Encoding such a feature representation using entropy coding facilitates a good rate-distortion trade-off for the encoded picture. In particular, using partial representations of three different resolutions may reduce redundancies in the feature representation, and therefore, this approach may improve the compression performance. Using partial representations of different resolutions allows for using a specific number of features of the feature representation for each of the resolutions, e.g. using more features for encoding higher resolution information of the picture compared to the number of features used for encoding lower resolution information of the picture. In particular, the inventors realized that, surprisingly, dedicating a particular number of features to an intermediate resolution, in addition to using particular numbers of features for a higher and for a lower resolution, may, despite an increased implementation effort, result in an improved trade-off between implementation effort and rate-distortion performance.
According to embodiments of a second aspect of the invention, the feature representation is encoded by determining a quantization of the feature representation. Embodiments of the second aspect rely on the idea of determining the quantization by estimating, for each of a plurality of candidate quantizations, a rate-distortion measure, and by determining the quantization based on the candidate quantizations. In particular, for estimating the rate-distortion measure, a polynomial function between a quantization error and an estimated distortion is determined. The invention is based on the finding that a polynomial function may provide a precise relation between the quantization error and a distortion related to the quantization error. Using the polynomial function enables an efficient determination of the rate-distortion measure, therefore allowing for testing a high number of candidate quantizations.
A further embodiment exploits the inventors' finding that the polynomial function can give a precise approximation of the contribution of a modified quantized feature of a tested candidate quantization to the approximated distortion of the tested candidate quantization. Further, the inventors found that the distortion of a candidate quantization may be precisely approximated by means of the individual contributions of the quantized features. An embodiment of the invention exploits this finding by determining a distortion contribution of a modified quantized feature, which is modified with respect to a predetermined quantization, e.g. a previously tested one, and by determining the approximated distortion of the candidate quantization based on this distortion contribution and the distortion of the predetermined quantization. This concept allows, for example, for an efficient, step-wise testing of a high number of candidate quantizations: starting from the predetermined quantization, for which the distortion is already determined, determining the distortion contribution of an individually modified quantized feature using the polynomial function provides a computationally efficient way of determining the approximated distortion of a further candidate quantization, namely the one which differs from the predetermined one in the modified quantized feature.
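Expressed as a worked equation (the symbols d, q, q′, p_l, e_l and e′_l are introduced here for illustration only): if d(q) denotes the approximated distortion of a predetermined quantization q, p_l the polynomial distortion contribution term of quantized feature l, and e_l and e′_l the quantization errors of feature l under q and under a candidate quantization q′ that differs from q only in feature l, then
d(q′) ≈ d(q) + p_l(e′_l) − p_l(e_l),
so that the approximated distortion of the candidate follows from a single polynomial evaluation per modified feature, without executing the neural network again.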
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
In the following, embodiments are discussed in detail; however, it should be appreciated that the embodiments provide many applicable concepts that can be embodied in a wide variety of image compression contexts, such as video and still image coding. The specific embodiments discussed are merely illustrative of specific ways to implement and use the present concept, and do not limit the scope of the embodiments. In the following description, a plurality of details is set forth to provide a more thorough explanation of embodiments of the disclosure. However, it will be apparent to one skilled in the art that other embodiments may be practiced without these specific details. In other instances, well-known structures and devices are shown in the form of a block diagram rather than in detail in order to avoid obscuring examples described herein. In addition, features of the different embodiments described herein may be combined with each other, unless specifically noted otherwise.
In the following description of embodiments, the same or similar elements or elements that have the same functionality are provided with the same reference sign or are identified with the same name, and a repeated description of elements provided with the same reference number or being identified with the same name is typically omitted. Hence, descriptions provided for elements having the same or similar reference numbers or being identified with the same names are mutually exchangeable or may be applied to one another in the different embodiments.
The following description of the figures starts with a presentation of a description of an encoder and a decoder for coding pictures such as still images or pictures of a video in order to form an example for a coding framework into which embodiments of the present invention may be built. The respective encoder and decoder are described with respect to
Internally, the encoder 10 may comprise an encoding stage 20 which generates a feature representation 22 on the basis of the picture 12. The feature representation 22 may include a plurality of features being represented by respective values. A number of features of the feature representation 22 may be different from a number of pixel values of pixels of the picture 12. The encoding stage 20 may comprise a neural network, having for example one or more convolutional layers, for determining the feature representation 22. The encoder 10 further comprises a quantizer 30 which quantizes the features of the feature representation 22 to provide a quantized representation 32, or quantization 32, of the picture 12. The quantized representation 32 may be provided to an entropy coder 40. The entropy coder 40 encodes the quantized representation 32 to obtain a binary representation 42 of the picture 12. The binary representation 42 may be provided to data stream 14.
The entropy coder 40 may use a probability model 52 for encoding the quantized representation 32. To this end, entropy coder 40 may apply an encoding order for quantized features of the quantized representation 32. The probability model 52 may indicate a probability for a quantized feature to be currently encoded, wherein the probability may depend on previously encoded quantized features. The probability model 52 may be adaptive. Thus, encoder 10 may further comprise an entropy module 50 configured to provide the probability model 52. For example, the probability may depend on a probability distribution of the previously encoded quantized features. Thus, the entropy module 50 may determine the probability model 52 on the basis of the quantized representation 32, e.g. on the basis of the previously encoded quantized features of the quantized feature representation. In examples, the probability model 52 may further depend on a spatial correlation within the feature representation. Thus, alternatively or additionally to the previously encoded quantized features 32, the entropy module 50 may use the feature representation 22 for determining the probability model 52, e.g. by determining a spatial correlation of features of the feature representation, e.g. as described with respect to
The decoder 11, as illustrated in
Similar to the entropy coder 40, the entropy decoder 41 may use the probability model 53 for decoding the binary representation 42. The probability model 53 may indicate a probability for a symbol to be currently decoded. The probability model 53 for a currently decoded symbol of the binary representation 42 may correspond to the probability model 52 using which the symbol has been encoded by entropy coder 40. Like the probability model 52, the probability model 53 may be adaptive and may depend on previously decoded symbols of the binary representation 42. The decoder 11 comprises an entropy module 51, which determines the probability model 53. The entropy module 51 may determine the probability model 53 for a quantized feature of the quantized representation 32, which is currently to be decoded, i.e. a currently decoded quantized feature, on the basis of previously decoded quantized features of the quantized feature representation 32. Optionally, the entropy module 51 may receive the side information 72 and use the side information 72 for determining the probability model 53. Thus, the entropy module 51 may rely on information about the feature representation 22 for determining the probability model 53.
The neural networks of encoding stage 20 of the encoder 10 and decoding stage 21 of the decoder 11, and optionally also respective neural networks of the entropy module 50 and the entropy module 51, may be trained using training data so as to determine coefficients of the neural networks. A training objective for training the neural networks may be to improve the trade-off between a distortion of the reconstructed picture 12′ and a rate of data stream 14, comprising the binary representation 42 and optionally the side information 72.
The distortion of the reconstructed picture 12′ may be derived on the basis of a (normed) difference between the picture 12 and the reconstructed picture 12′. An example of how the neural networks may be trained is given in section 3.
As described with respect to
According to an embodiment, the entropy module 50 comprises a feature encoding stage 60 which may generate a feature parametrization 62 on the basis of the feature representation 22. The feature encoding stage 60 may use an artificial neural network having one or more convolutional layers for determining the feature parametrization 62. The feature parametrization may represent a spatial correlation of the feature representation 22. To this end, for example, the feature encoding stage 60 may subject the feature representation 22 to a convolutional neural network, e.g. E′ described in section 2. The entropy module 50 may comprise a quantizer 64 which may quantize the feature parametrization 62 so as to obtain a quantized parametrization 66. Entropy coder 70 of the entropy module 50 may entropy code the quantized parametrization 66 to generate the side information 72. For entropy coding the quantized parametrization 66, the entropy coder 70 may optionally apply a probability model which approximates the true probability distribution of the quantized parametrization 66. For example, the entropy coder 70 may apply a parametrized probability model for coding a quantized parameter of the quantized parametrization 66 into the side information 72. For example, the probability model used by entropy coder 70 may depend on previously coded symbols of the side information 72.
The entropy module 50 further comprises a probability stage 80. The probability stage 80 determines the probability model 52 on the basis of the quantized parametrization 66 and on the basis of the quantized representation 32. In particular, the probability stage 80 may consider, for the determination of the probability model 52 for a currently coded quantized feature of the quantized representation 32, previously coded quantized features of the quantized representation 32, as explained with respect to
For example, the first probability estimation parameter 84 for the currently coded quantized feature of the quantized feature representation 32 may be determined by context module 82 on the basis of one or more quantized features of features that precede the currently coded one in the coding order. Similarly, the second probability estimation parameter 22′ may be determined by the feature decoding stage 61 in dependence on previously coded features. For example, feature encoding stage 60 may determine, for each of the features of the feature representation 22, e.g. according to the coding order, a parameterized feature of the feature parameterization 62, and quantizer 64 may quantize each of the parameterized features so as to obtain a respective quantized parameterized feature of the quantized parameterization 66. The feature decoding stage 61 may determine the second probability estimation parameter 22′ for the encoding of the current feature on the basis of one or more quantized parameterized features which have been derived from previous features of the coding order. For example, section 2 describes, by means of index l, an example of how the probability model for the current feature, e.g. the one having index l, may be determined.
It is noted that, according to embodiments, the entropy module 50 does not necessarily use both the feature representation 22 and the quantized feature representation 32 as an input for determining the probability model 52, but may rather use merely one of the two. For example, the probability module 86 may determine the probability model 52 on the basis of one of the first and the second probability estimation parameters, wherein the one used may nevertheless be determined as described before.
Accordingly, in an embodiment, the entropy module 50 determines the probability model 52 on the basis of previous quantized features of the quantized feature representation 32, e.g. using a neural network. Optionally, this determination may be performed by means of a first and a second neural network, e.g. a masked neural network followed by a convolutional neural network, e.g. as performed by exemplary implementations of the context module 82 and the probability module 86 illustrated in
According to an alternative embodiment, the entropy module 50 determines the probability model 52 on the basis of previous features of the feature representation 22, e.g. using the feature encoding stage 60, the quantizer 64, and the feature decoding stage 61, e.g. as described before. However, according to this embodiment, probability stage 80 may not receive the quantized feature representation 32 as an additional input, but may derive the probability model 52 merely on the basis of the information derived via the feature encoding stage 60, the quantizer 64, and the feature decoding stage 61, e.g. by processing the output of the feature decoding stage 61 by a convolutional neural network, as it may, e.g., be part of the probability module 86. In examples of this embodiment, the feature decoding stage 61 and the probability module 86 may be combined, e.g. the neural network of the feature decoding stage 61 and the neural network of the probability module 86 may be combined to determine the probability model 52 on the basis of the quantized parameterization 66 using one neural network.
Optionally, the latter two embodiments may be combined, as illustrated in
As described before, the entropy module 51 may determine a probability model 53 for the entropy decoding of a currently decoded feature of the feature representation 32. Accordingly, the features of the feature representation 32 may be decoded according to a coding order or scan order, e.g. according to which they are encoded into data stream 14.
According to an embodiment, the entropy module 51 according to
The entropy module 51 according to
As described with respect to
As described with respect to
Accordingly, in an embodiment, the probability stage 81 determines the probability model 53 based on previously decoded features of the feature representation, e.g. as described with respect to the probability stage 81, or as described with respect to
According to an alternative embodiment, the probability stage 81 determines the probability model 53 based on the quantized parameterization 66, e.g. as described with respect to the probability stage 81, or as described with respect to
Optionally, the latter two embodiments may be combined, as illustrated in
Neural networks of the feature encoding stage 60, as well as of the feature decoding stage 61, the context module 82, and the probability module 86 of the entropy module 50 and the entropy module 51 may be trained together with the neural networks of encoding stage 20 and decoding stage 21, as described with respect to
The feature encoding stage 60 and the feature decoding stage 61 may also be referred to as hyper encoder 60 and hyper decoder 61, respectively. Determining the feature parametrization 62 on the basis of the feature representation 22 may allow for exploiting spatial redundancies in the feature representation 22 in the determination of the probability model 52, 53. Thus, the rate of the data stream 14 may be reduced even though the side information 72 is transmitted in the data stream 14.
In the following, embodiments of the present disclosure are described in detail. All of the herein described embodiments may optionally be implemented on the basis of the encoder 10 and the decoder 11 of
Given the capabilities of massive GPU hardware, there has been a surge of using artificial neural networks (ANN) for still image compression. These compression systems usually consist of convolutional layers and can be considered as non-linear transform coding. Notably, these ANNs are based on an end-to-end approach where the encoder determines a compressed version of the image as features. In contrast to this, existing image and video codecs employ a block-based architecture with signal-dependent encoder optimizations. A basic requirement for designing such optimizations is estimating the impact of the quantization error on the resulting bitrate and distortion. For non-linear, multi-layered neural networks, this is a difficult problem. Embodiments of the present disclosure provide a well-performing auto-encoder architecture, which may, for example, be used for still image compression. Advantageous embodiments use multi-resolution convolutions so as to represent the compressed features at multiple scales, e.g. according to the scheme described in sections 4 and 5. Further advantageous embodiments implement an algorithm which tests multiple feature candidates, so as to reduce the Lagrangian cost and to increase or to optimize compression efficiency, as described in sections 6 and 7. The algorithm may avoid multiple network executions by pre-estimating the impact of the quantization on the distortion by a higher-order polynomial. In other words, the algorithm exploits the inventors' finding that the impact of small feature changes on the distortion can be estimated by a higher-order polynomial. Section 3 describes a simple RDO algorithm, which employs this estimate for efficiently testing candidates with respect to equation (1) and which significantly improves the compression performance. The multi-resolution convolution and the algorithm for RDO may be combined, which may further improve the rate-distortion trade-off.
Examples of the disclosure may be employed in video coding and may be combined with concepts of High Efficiency Video Coding (HEVC), Versatile Video Coding (VVC), Deep Learning, Auto-Encoder, Rate-Distortion-Optimization.
In this section, an implementation of an encoder and a decoder is described in more detail. The encoder and decoder described in this section may optionally be an implementation of encoder 10 as described with respect to
The presented deep image compression system may be closely related to the auto-encoder architecture in [14]. A neural network E, as it may be implemented in the encoding stage 20, maps the picture x to features z, which are rounded and mapped back to a reconstruction by a decoder network D, as it may be implemented in the decoding stage 21:
z = E(x), ẑ = round(z), x̂ = D(ẑ). (2)
Thus, x̂ in the notation used herein may correspond to the reconstructed picture 12′ of
As some of the herein described embodiments focus on an encoder optimization, the description is restricted to luma-only inputs, which do not require the weighting of different color channels for computing the bitrate and distortion. Nevertheless, in some embodiments the picture 12 may also comprise chroma channels, which may be processed similarly. Transmitting the quantized features ẑ requires a model for the true distribution p_ẑ, which is unknown. Therefore, a hyper system with a second encoder E′, as it may be implemented in the feature encoding stage 60, and a second decoder D′, as it may be implemented in the feature decoding stage 61, is used:
y = E′(z), ŷ = round(y), θ = D′(ŷ). (3)
Thus, within the herein used notation, y may correspond to the feature parametrization 62, ŷ may correspond to the quantized parametrization 66, and θ to the second probability estimation parameter 22′. Accordingly, the hyper encoder E′ may be implemented by means of the feature encoding stage 60, and the hyper decoder D′ may be implemented by means of the feature decoding stage 61.
An example for an implementation of the encoder E, decoder D, hyper encoder E′ and hyper decoder D′ is described in section 7.
The hyper parameters are fed into an auto-regressive probability model P_z̃(⋅; θ) during the coding stage of the features. The model employs normal distributions N(μ, σ²), which have proven to perform well in combination with GDNs as activation; [13]. As described in section 5, GDNs may be employed as activation functions in encoder E and decoder D. We fix a scan order among the features, according to which the quantized features are to be entropy coded, and map the context of ẑ_l and the hyper parameters θ to the Gaussian parameters μ_l, σ_l² via two neural networks
con(ẑ_{l−1}, . . . , ẑ_{l−L}) = θ*_l, (5)
est(θ*_l, θ_l) = (μ_l, σ_l²). (6)
Here, l is the index of the currently coded quantized feature ẑ_l, and L is the number of previously coded quantized features which are considered for the context of ẑ_l. The auto-regressive part (5) may, for example, use 5×5 masked convolutions. For the case that encoder E and decoder D implement the multi-resolution convolution described in section 4 or in section 5, three versions of the entropy models (5) and (6) may be implemented, as in this case the features consist of coefficients at three different scales. An exemplary implementation of the models con and est of (5) and (6) for a number of C input channels is shown in Table 2. In other words, the encoder and decoder may each implement three instances of each of the models con and est, one for each scale of coefficients, or feature representations.
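As an illustration of the auto-regressive part (5), a masked convolution may be sketched as follows (a PyTorch sketch under the assumption of a raster scan order; the class name and the mask layout are illustrative and not taken from the present description):

import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    # 5x5 convolution whose kernel is masked so that each output position only
    # depends on features that precede it in raster scan order.
    def __init__(self, in_ch, out_ch, kernel_size=5):
        super().__init__(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        mask = torch.ones_like(self.weight)
        k = kernel_size
        mask[:, :, k // 2, k // 2:] = 0  # current position and positions to its right
        mask[:, :, k // 2 + 1:, :] = 0   # all rows below the current one
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask    # zero out the "future" taps before convolving
        return super().forward(x)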
The estimated probability then becomes
P_ẑ(ẑ_l) ≈ P_z̃(ẑ_l; θ) = ∫_{ẑ_l−0.5}^{ẑ_l+0.5} N(t; μ_l, σ_l²) dt.
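The interval integral can be evaluated with the Gaussian cumulative distribution function; a small sketch (assuming SciPy is available; the function name is illustrative):

from scipy.stats import norm

def bin_probability(z_hat, mu, sigma):
    # Probability mass that the Gaussian N(mu, sigma^2) assigns to the
    # quantization interval [z_hat - 0.5, z_hat + 0.5].
    return norm.cdf(z_hat + 0.5, loc=mu, scale=sigma) - norm.cdf(z_hat - 0.5, loc=mu, scale=sigma)

The number of bits spent on coding ẑ_l is then approximately −log₂ of this probability.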
For example, with reference to
Finally, a parametrized probability model P_ỹ(⋅; ϕ) approximates the true distribution of the side information, for example as described in [13].
It is noted that the probability model for a currently coded quantized feature ẑ_l may alternatively be determined using either the hyper parameter θ_l alone or the context parameter θ*_l alone. In other words, according to an embodiment, the probability model is determined using the hyper parameter θ_l. According to this embodiment, the network con may be omitted. According to an alternative embodiment, the probability model is determined using the context parameter θ*_l, which is determined based on the previously coded quantized features ẑ_{l−1}, . . . , ẑ_{l−L} by the network con. In this alternative, the hyper encoder/hyper decoder path may be omitted. With respect to equation (6), these embodiments correspond to the cases in which est(θ*_l, θ_l) is independent of θ*_l (so that only the hyper parameter θ_l is used) and in which est(θ*_l, θ_l) is independent of θ_l (so that only the context parameter θ*_l is used), respectively.
The scheme described in this section may be used for implementing both an encoder and a decoder, wherein the implementation of the decoder may follow the correspondences of the encoder 10 and the decoder 11 as described with respect to
A concept for training the neural networks E, E′, D, D′, and the entropy models con, est, ϕ is described in the following section.
Referring to the encoder 10 and the decoder 11 of
Using the notation from the previous section, the compression task of equation (1) translates into the following, differentiable training objective:
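(The objective itself is not reproduced here; the following form is an assumption that is consistent with equation (1) and the notation of section 2, not a verbatim reproduction.)
J = ‖x − D(z̃)‖² + λ·(−log₂ P_z̃(z̃; θ) − log₂ P_ỹ(ỹ; ϕ)),
averaged over the training batch, i.e. the distortion term plus λ times the estimated rates of the features and the side information.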
Here, ∥⋅∥ may for example denote the Frobenius norm. For example, for each λ ∈ {128·2^i, i = 0, . . . , 4}, a separate auto-encoder may be trained. The optimization is performed via stochastic gradient descent over luma-only 256×256 patches from the ImageNet data set with batch size 8 and 2500 batches per training epoch. The step size for the Adam optimizer [19] was set as α_j = 10^−4·1.13^−j, where j = 0, . . . , 19.
For avoiding zero gradients during gradient computation; [12], the quantizations performed by the quantizers 30 and 64, e.g. the rounding of equations (2) and (3), may be replaced by a summation with noisy training variables for the processing of the training data, wherein 𝒰 may represent the uniform distribution:
Δ ~ 𝒰(−0.5, 0.5), z̃_l = z_l + Δ, ỹ_k = y_k + Δ. (4)
In this section, embodiments of an encoder 10 and a decoder 11 are described. The encoder 10 and the decoder 11 may optionally correspond to the encoder 10 and the decoder 11 according to
For example, for the purpose of entropy coding, the entropy coding stage 28 may comprise an entropy coder, for example entropy coder 40 as described with respect to
For example, for the purpose of entropy decoding, the entropy decoding stage 29 may comprise an entropy decoder, for example entropy decoder 41 as described with respect to
The following description of this section focuses on embodiments of the encoding stage 20 and the decoding stage 21. While encoding stage 20 of encoder 10 determines the feature representation 22 based on the picture 12, decoding stage 21 of decoder 11 determines the picture 12′ on the basis of the feature representation 32. The feature representation 32 may correspond to the feature representation 22, apart from quantization loss, which may be introduced by a quantizer, which may optionally be part of the entropy coding stage 28.
In other words, described with respect to
For example, the picture 12 may be represented by a two-dimensional array of samples, each of the samples having one or more sample values assigned to it. In some embodiments, each pixel may have a single sample value, e.g. a luma sample. For example, the picture 12 may have a height of H samples and a width of W samples, thus having a resolution of H×W samples.
The feature representation 32 may comprise a plurality of features, each of which is associated with one of the plurality of partial representations of the feature representation 22. Each of the partial representations may represent a two-dimensional array of features, so that each feature may be associated with a feature position. Each feature may be represented by a feature value. The partial representations may have a lower resolution than the picture 12, 12′. For example, the decoding stage 21 may obtain the picture 12′ by upsampling the partial representations using transposed convolutions. Equivalently, the encoding stage 20 may determine the partial representations by downsampling the picture 12 using convolutions. For example, a ratio between the resolution of the picture 12′ and the resolution of the first partial representations 321 corresponds to a first downsampling factor, a ratio between the resolution of the first partial representations 321 and the resolution of the second partial representations 322 corresponds to a second downsampling factor, and a ratio between the resolution of the second partial representations 322 and the resolution of the third partial representations 323 corresponds to a third downsampling factor. In embodiments, the first downsampling factor is equal to the second downsampling factor and to the third downsampling factor, and is equal to 2 or 4.
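As a numerical illustration (the picture size and the common downsampling factor of 4 are assumptions for this example only, not taken from the description above):

# Illustrative resolutions of the partial representations for a 768x512 picture,
# assuming all three downsampling factors are equal to 4.
H, W = 768, 512
first  = (H // 4,  W // 4)    # first partial representations:  (192, 128)
second = (H // 16, W // 16)   # second partial representations: (48, 32)
third  = (H // 64, W // 64)   # third partial representations:  (12, 8)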
As the first partial representations 321 have a higher resolution than the second partial representations 322 and the third partial representations 323, they may carry high frequency information of the picture 12, while the second partial representation 322 may carry medium frequency information and the third partial representations 323 may carry low frequency information.
According to embodiments, a number of the first partial representations 321 is at least one half or at least ⅝ or at least three quarters of the total number of the first to third partial representations. By dedicating a great part of the binary representation 42 to a high frequency portion of the picture 12, a particularly good rate-distortion trade-off may be achieved.
In some embodiments, the number of the first partial representations 321 is in a range from one half to 15/16, or in a range from five eighths to seven eighths, or in a range from three quarters to seven eighths of a total number of the first to third partial representations. These ranges may provide a good balance between high and medium/low frequency portions of the picture 12, so that a good rate-distortion trade-off may be achieved.
Additionally or alternatively to this ratio between the first partial representations 321 and the second and third partial representations 322, 323, a number of the second partial representations 322 may be at least one half or at least five eighths or at least three quarters of a total number of the second and third partial representations 322, 323.
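As a worked example (the total of 32 partial representations is an assumption for illustration only): with 32 first to third partial representations in total, choosing 24 first partial representations satisfies 24/32 = 3/4, and splitting the remaining 8 into 6 second and 2 third partial representations satisfies 6/8 = 3/4, so that both of the above criteria are met.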
The last layer 24N−1 comprises a first module 26N−11 which determines the first output representations, that is the first partial representations 221, on the basis of the first input representations 22N−11. A second module 26N−12 of the last layer 24N−1 determines the second output representations 222 on the basis of the first input representations 22N−11, the second input representations 22N−12, and the third input representations 22N−13. A third module 26N−13 of the last layer 24N−1 determines the third output representations 223 on the basis of the second input representations 22N−12, and the third input representations 22N−13. That is, the first module 26N−11 may use a plurality or all of the first input representations 22N−11 and the second input representations 22N−12 for determining one of the first output representations 221, which applies in an analogous manner to the second module 26N−12 and the third module 26N−13.
For example, the first to third modules 26N−11-3 may apply convolutions, followed by non-linear normalizations to their respective input representations.
According to embodiments, the encoding stage CNN 24 comprises a sequence of a number of N−1 layers 24n, with N>1, index n identifying the individual layers, and further comprises an initial layer which may be referred to using reference sign 240. Thus, according to these embodiments, the encoding stage CNN 24 comprises a number of N layers. The last layer 24N−1 may be the last layer of the sequence of layers. In other words, referring to
For example, for one or more or each of the layers of the sequence of layers 24n, or also, in embodiments in which block 24* is not implemented as shown in
For example, for one or more or each of the layers of the sequence of layers 24n, or also, in embodiments in which block 24* is not implemented as shown in
In advantageous embodiments, for one or more or each of the layers of the sequence of layers 24n, or also, in embodiments in which block 24* is not implemented as shown in
According to embodiments, each of the layers of the sequence of layers determines its output representations based on its input representations as described with respect to the last layer 24N−1. However, coefficients of applied transformations for determining the output representations may be mutually different between the layers of the sequence of layers.
The initial layer 240 determines the input representations 221 for the first layer 241, the input representations 221 comprising first input representations 2211, second input representations 2212 and third input representations 2213. The initial layer 240 determines the input representations 221 by applying convolutions to the picture 12.
According to embodiments, the sampling rate and the structure of the initial layer may be adapted to the structure of the picture 12. E.g., the picture may comprise one or more channels (i.e. two-dimensional sample arrays), e.g. a luma channel and/or one or more chroma channels, which may have mutually equal resolution, or, in particular for some chroma formats, may have different resolutions. Thus, the initial layer may apply a respective sequence of one or more convolutions to each of the channels to determine the first to third input representations for the first layer.
In advantageous embodiments, e.g. for cases in which the picture comprises one or more channels of equal resolution, the initial layer 240 determines, as indicated in
In general, a superposition of a plurality of input representations may refer to a representation (referred to as superposition) each feature of which is obtained by combining those features of the input representations that are associated with a feature position corresponding to the feature position of the feature within the superposition. The combination may be a sum or a weighted sum, wherein some coefficients may optionally be zero, so that not necessarily all of said features contribute to the superposition.
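In terms of a formula (the symbols are introduced here for illustration only): a superposition s of input representations r_1, . . . , r_K may be written as s[i, j] = Σ_k w_k · r_k[i, j] for every feature position (i, j), where the weights w_k may all be one (a plain sum) or may be learned coefficients, some of which may be zero.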
The first layer 23N comprises a first module 25N1, a second module 25N2 and a third module 25N3. The first module 25N1 determines the first output representations 32N−11 on the basis of the first input representations 321 and the second input representations 322. The second module 25N2 determines the second output representations 32N−12 on the basis of the first to third input representations 321-3. The third module 25N3 determines the third output representations 32N−13 on the basis of the second and third input representations 322-3. In other words, the first module 25N1 may use a plurality or all of the first and second input representations 321-2 for determining one of the first output representations 32N−11, which applies in an analogous manner to the second module 25N2 and the third module 25N3.
The output representations 32N−1 of the first layer 23N may have a lower resolution than the input representations 321-3 of the first layer 23N in the sense that the first output representations have a lower resolution than the first input representations, the second output representations have a lower resolution than the second input representations, and the third output representations have a lower resolution than the third input representations. For example, the resolution of the first to third output representations may be lower than the resolution of the first to third input representations by a downsampling factor of two or four, respectively.
For example, the first to third modules 25N1-3 may use transposed convolutions and/or convolutions, each of which may optionally be followed by a non-linear normalization, for determining their respective output representations on the basis of the respective input representations.
The decoding stage CNN 23 may comprise one or more further layers, which are represented by block 23* in
According to embodiments, the decoding stage CNN comprises a sequence of a number of N−1 layers 23n, with N>1, index n identifying the individual layers, and further comprises a final layer which may be referred to using reference sign 231. Thus, according to these embodiments, the decoding stage CNN 23 comprises a number of N layers. The first layer 23N may be the first layer of the sequence of layers. In other words, referring to
According to embodiments, the relations between the resolutions of the first to third input representations and between the resolutions of the first to third output representations of the layers 23n of the sequence of layers of the decoding stage CNN 23 may optionally be implemented as described with respect to the layers 24n of the encoding stage CNN 24. The same applies for the number of input representations and output representations of the layers of the sequence of layers. Note that the order of the index for the layers is reversed between the decoding stage CNN 23 and the encoding stage CNN 24.
According to embodiments, each of the layers of the sequence of layers determines its output representations based on its input representations as described with respect to the first layer 23N. However, coefficients of applied transformations for determining the output representations may be mutually different between the layers of the sequence of layers.
The final layer 231 determines the picture 12′ on the basis of the output representations 321 of the last layer 232 of the sequence of layers, being input representations 321 of the final layer 231. The output representations 321 may comprise, as indicated in
According to an advantageous embodiment, the final layer 231 applies transposed convolutions having an upsampling rate greater than one to its third input representations 3213 to obtain third representations. That is, the final layer 231 may determine each of the third representations by applying respective transposed convolutions having an upsampling rate greater than one to each of the third input representations 3213 to obtain the third representation. Further, the final layer 231 may determine second representations by superposition of upsampled third representations and upsampled second representations. The final layer 231 may determine each of the upsampled third representations by applying respective transposed convolutions having an upsampling rate greater than one to each of the third representations. The final layer 231 may determine each of the upsampled second representations by applying respective transposed convolutions having an upsampling rate greater than one to each of the second input representations 3212. Finally, the final layer 231 may determine the picture 12′ by superposition of further upsampled second representations and upsampled first representations. The final layer 231 may determine each of the further upsampled second representations by applying respective transposed convolutions having an upsampling rate greater than one to each of the second representations. The final layer 231 may determine each of the upsampled first representations by applying respective transposed convolutions having an upsampling rate greater than one to each of the first input representations 3211.
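A sketch of such a final layer (in PyTorch; the channel counts, the kernel size, the upsampling rate of 2 and the single-channel, i.e. luma-only, output are assumptions for illustration; non-linear activations are omitted):

import torch.nn as nn

class FinalLayer(nn.Module):
    # Progressively upsamples the third (lowest-resolution) input representations and merges
    # them, by superposition (addition), with the second and first input representations.
    def __init__(self, c1=24, c2=6, c3=2, k=5):
        super().__init__()
        up = dict(kernel_size=k, stride=2, padding=k // 2, output_padding=1)
        self.third      = nn.ConvTranspose2d(c3, c3, **up)   # third inputs -> third representations
        self.third_up   = nn.ConvTranspose2d(c3, c2, **up)   # upsampled third representations
        self.second_up  = nn.ConvTranspose2d(c2, c2, **up)   # upsampled second inputs
        self.second_up2 = nn.ConvTranspose2d(c2, 1, **up)    # further upsampled second representations
        self.first_up   = nn.ConvTranspose2d(c1, 1, **up)    # upsampled first inputs

    def forward(self, x1, x2, x3):
        third = self.third(x3)
        second = self.third_up(third) + self.second_up(x2)   # superposition forming the second representations
        return self.second_up2(second) + self.first_up(x1)   # superposition at the picture resolution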
According to an advantageous embodiment, each of the layers 23N to 232 may be implemented according to the exemplary embodiment described with respect to
The layer 23n comprises a first transposed convolution module 271, a second transposed convolution module 272 and a third transposed convolution module 273. The transposed convolutions performed by the first to third transposed convolution modules 271-3 may have a common upsampling rate. The layer 23n further comprises a first cross component convolution module 281 and a second cross component convolution module 282. The layer 23n further comprises a second cross component transposed convolution module 292 and a third cross component transposed convolution module 293.
The layer 23n is configured for determining each of the first output representations 32n−11 by superposing a plurality of first upsampled representations 971 provided by the first transposed convolution module 271 and a plurality of upsampled second upsampled representations 992 provided by the second cross component transposed convolution module 292. Each of the plurality of first upsampled representations 971 for the determination of the first output representation is determined by the first transposed convolution module 271 by superposing the results of transposed convolutions of each of the first input representations 32n1. The first upsampled representations 971 have a higher resolution than the first input representations 32n1. Further, each of the plurality of upsampled second upsampled representations 992 for determining the first output representation is determined by the second cross component transposed convolution module 292 by applying a transposed convolution to each of a respective plurality of second upsampled representations 972. Each of the respective plurality of second upsampled representations 972 for the determination of the upsampled second upsampled representation is determined by the second transposed convolution module 272 by superposing the results of transposed convolutions of each of the second input representations 32n2. The transposed convolutions applied by the second cross component transposed convolution module 292 have an upsampling rate which may correspond to the ratio between the resolutions of the first upsampled representations 971 and the second upsampled representations 972, which may correspond to the ratio between the resolutions of the first input representations 32n1 and the second input representations 32n2.
The layer 23n is configured for determining each of the second output representations 32n−12 by superposing a plurality of second upsampled representations 972 provided by the second transposed convolution module 272 and a plurality of downsampled first upsampled representations 981 provided by the first cross component convolution module 281, and a plurality of upsampled third upsampled representations 993. Each of the plurality of second upsampled representations 972 for the determination of the second output representation is determined by the second transposed convolution module 272 by superposing the results of transposed convolutions of each of the second input representations 32n2. The second upsampled representations 972 have a higher resolution than the second input representations 32n2. Further, each of the plurality of downsampled first upsampled representations 981 for determining the second output representation is determined by the first cross component convolution module 281 by applying a convolution to each of a respective plurality of first upsampled representations 971. Each of the respective plurality of first upsampled representations 971 for the determination of the downsampled first upsampled representation is determined by the first transposed convolution module 271 by superposing the results of transposed convolutions of each of the first input representations 32n1. The convolutions applied by the first cross component convolution module 281 have a downsampling rate which may correspond to the upsampling rate of the transposed convolutions applied by the second cross component transposed convolution module 292. Further, each of the plurality of upsampled third upsampled representations 993 for the determination of the second output representation is determined by the third cross component transposed convolution module 293 by applying a respective transposed convolution to each of a respective plurality of third upsampled representations 973. Each of the respective plurality of third upsampled representations 973 for the determination of the upsampled third upsampled representation is determined by the third transposed convolution module 273 by superposing the results of transposed convolutions of each of the third input representations 32n3. The transposed convolutions applied by the third cross component transposed convolution module 293 have an upsampling rate which may correspond to the ratio between the resolutions of the second upsampled representations 972 and the third upsampled representations 973, which may correspond to the ratio between the resolutions of the second input representations 32n2 and the third input representations 32n3.
The layer 23n is configured for determining each of the third output representations 32n−13 by superposing a plurality of third upsampled representations 973, and a plurality of downsampled second upsampled representations 982. Each of the plurality of third upsampled representations 973 for the determination of the third output representation is determined by the third transposed convolution module 273 by superposing the results of transposed convolutions of each of the third input representations 32n3. The third upsampled representations 973 have a higher resolution than the third input representations 32n3. Further, each of the plurality of downsampled second upsampled representations 982 for determining the third output representation is determined by the second cross component convolution module 282 by applying a convolution to each of a respective plurality of second upsampled representations 972. Each of the respective plurality of second upsampled representations 972 for the determination of the downsampled second upsampled representation is determined by the second transposed convolution module 272 by superposing the results of transposed convolutions of each of the second input representations 32n2. The convolutions applied by the second cross component convolution module 282 have a downsampling rate which may correspond to the upsampling rate of the transposed convolutions applied by the third cross component transposed convolution module 293.
Each of the transposed convolutions and the convolutions may sample the representation to which it is applied using a kernel. In examples, the kernel is quadratic with a number of k samples in each of two dimensions of the (transposed) convolution. That is, the (transposed) convolution may use a k×k kernel. Each sample of the kernel may have a respective coefficient, e.g. used for weighting the feature of the representation to which the sample of the kernel is applied at a specific position of the kernel. The coefficients of the kernel of the (transposed) convolution may be mutually different and may result from training of the CNN. Further, the coefficients of the kernels of the respective (transposed) convolutions applied by one of the (transposed) convolution modules 271-3, 281-2, 292-3 to the plurality of representations which are input to the (transposed) convolution module may be mutually different. That is, by example of the first cross component convolution module 281, the kernels of the convolutions applied to the plurality of first upsampled representations 971 for the determination of one of the downsampled first upsampled representations 981 may have mutually different coefficients. Same may apply to all of the (transposed) convolution modules 271-3, 281-2, 292-3.
Optionally, a nonlinear normalization function, or more generally an activation function, may be applied to the result of each of the convolutions and transposed convolutions. For example, a GDN function may be used as the nonlinear normalization function, for example as described in the introductory part of the description.
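For illustration, one such layer 23n could be sketched in PyTorch as follows (the channel counts, the kernel size k = 5 and the sampling rates are assumptions; the GDN-type activations mentioned above are omitted for brevity):

import torch.nn as nn

class MultiResUpLayer(nn.Module):
    # One decoder layer operating on three resolutions (H = first, M = second, L = third).
    def __init__(self, c_h, c_m, c_l, k=5, up=2):
        super().__init__()
        same = dict(kernel_size=k, stride=up, padding=k // 2, output_padding=up - 1)
        cross_up = dict(kernel_size=k, stride=2, padding=k // 2, output_padding=1)
        cross_down = dict(kernel_size=k, stride=2, padding=k // 2)
        self.h_h = nn.ConvTranspose2d(c_h, c_h, **same)      # first transposed convolution module 271
        self.m_m = nn.ConvTranspose2d(c_m, c_m, **same)      # second transposed convolution module 272
        self.l_l = nn.ConvTranspose2d(c_l, c_l, **same)      # third transposed convolution module 273
        self.h_m = nn.Conv2d(c_h, c_m, **cross_down)         # first cross component convolution module 281
        self.m_l = nn.Conv2d(c_m, c_l, **cross_down)         # second cross component convolution module 282
        self.m_h = nn.ConvTranspose2d(c_m, c_h, **cross_up)  # second cross component transposed convolution module 292
        self.l_m = nn.ConvTranspose2d(c_l, c_m, **cross_up)  # third cross component transposed convolution module 293

    def forward(self, x_h, x_m, x_l):
        up_h, up_m, up_l = self.h_h(x_h), self.m_m(x_m), self.l_l(x_l)
        out_h = up_h + self.m_h(up_m)                        # first output representations (superposition)
        out_m = up_m + self.h_m(up_h) + self.l_m(up_l)       # second output representations (superposition)
        out_l = up_l + self.m_l(up_m)                        # third output representations (superposition)
        return out_h, out_m, out_l

An encoder layer would use strided convolutions in place of the first to third transposed convolution modules, as described in the following paragraph.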
The scheme of layer 23n may equivalently be applied as an implementation of the last layer 24N−1 or of each layer 24n of the sequence of layers of the encoding stage CNN 24, the first to third input representations 32n1-3 being replaced by the first to third input representations 22n1-3 of the respective layer 24n, and the first to third output representations 32n−11-3 being replaced by the first to third output representations 22n+11-3 of the respective layer. In case of the encoding stage CNN 24, the first to third transposed convolution modules 271-3 are replaced by first to third convolution modules, which differ from the first to third transposed convolution modules 271-3 in that the transposed convolutions are replaced by convolutions performing a downsampling instead of an upsampling. It is noted that the orders of the indices of the layers of the encoding stage CNN 24 and the decoding stage CNN 23 are inverse to each other.
This section describes an embodiment of an auto-encoder E and an auto-decoder D, as they may be implemented within the auto-encoder architecture and the auto-decoder architecture described in section 2. The herein described auto-encoder E and auto-decoder D may be specific embodiments of the encoding stage 20 and the decoding stage 21 as implemented in the encoder 10 and the decoder 11 of
Natural images are usually composed of high and low frequency parts, which can be exploited for image compression purposes. In particular, having channels at different resolutions might help to remove redundancies in the feature representation. The encoder network consists of multi-resolution downsampling convolutions as follows:
E = E_{N−1} ∘ . . . ∘ E_0,
where the features are separated into three components at different resolutions, in short {H, M, L}. E.g., H may refer to the first partial/input/output representations, M may refer to the second partial/input/output representations and L may refer to the third partial/input/output representations. Further, E_n may represent the n-th layer of the encoding stage CNN 24.
The tuple (c_0, c_1, c_2) states the composition among the c total channels. For example, c_0 may represent the number of the first partial representations, c_1 the number of second partial representations, and c_2 the number of third partial representations. The outputs z_{n+1} = E_n(z_n) are computed as
Here,
The cross-component convolutions ensure an information exchange between the three components at every stage; see
Analogously, let z = E(x) be the features and z′_N := ẑ its quantized version. The decoder network consists of multi-resolution upsampling convolutions with functions g_n as
D = D_1 ∘ . . . ∘ D_N
Note that the order of the indices has been reversed here. In particular, the outputs z′_{n−1} = D_n(z′_n), n ≠ 1, are computed with
Here, g_{n,H→H}, g_{n,M→M}, g_{n,L→L} are transposed k×k convolutions with upsampling rates u_n = const. The sampling rates of the cross component convolutions are indicated by their indices. The maps g_{n,H→M}, g_{n,M→L} are k×k convolutions with constant spatial downsampling rate 2 and the maps g_{n,M→H}, g_{n,L→M} are k×k transposed convolutions with constant upsampling rate 2. Finally, the reconstruction is defined as x̂ := z′_{0,H}, where the last layer is computed as
Table 1 summarizes an example of the architecture of the maps in (2) and (3) on the basis of the multi-resolution convolution described in this section. It is noted that the number of channels may be chosen differently in further embodiments, and that the number of input and output channels of the individual layers, such as layers 2 and 3 of E, and layers 1 and 2 of D, is not necessarily identical, as described in section 4. Also, the kernel size is to be understood as an example. The same holds for the composition, which may alternatively be chosen according to the criteria described in section 4.
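As a non-authoritative sketch only, the following PyTorch module illustrates one possible reading of a single decoder layer D_n on the three components {H, M, L}: within-component transposed k×k convolutions with upsampling rate u, and cross-component convolutions with rate 2 applied to the upsampled representations. The class name, channel numbers, default kernel size and the summation of the paths are assumptions of this sketch, not the architecture of Table 1.

```python
import torch
import torch.nn as nn

class MultiResDecoderLayer(nn.Module):
    """One decoder layer D_n on three components (H, M, L) at decreasing resolution.
    Within-component paths are transposed k x k convolutions with upsampling rate u;
    cross-component paths, applied to the upsampled representations, move information
    to the next higher resolution (transposed convolution, rate 2) or to the next
    lower resolution (convolution, rate 2)."""
    def __init__(self, ch_h, ch_m, ch_l, k=5, u=2):
        super().__init__()
        pad = k // 2  # keeps spatial sizes exact for odd k
        def tconv(ci, co, s):
            return nn.ConvTranspose2d(ci, co, k, stride=s, padding=pad, output_padding=s - 1)
        def conv(ci, co):
            return nn.Conv2d(ci, co, k, stride=2, padding=pad)
        self.h_h, self.m_m, self.l_l = tconv(ch_h, ch_h, u), tconv(ch_m, ch_m, u), tconv(ch_l, ch_l, u)
        self.m_to_h, self.l_to_m = tconv(ch_m, ch_h, 2), tconv(ch_l, ch_m, 2)  # towards higher resolution
        self.h_to_m, self.m_to_l = conv(ch_h, ch_m), conv(ch_m, ch_l)          # towards lower resolution

    def forward(self, z_h, z_m, z_l):
        u_h, u_m, u_l = self.h_h(z_h), self.m_m(z_m), self.l_l(z_l)  # within-component upsampling
        out_h = u_h + self.m_to_h(u_m)
        out_m = u_m + self.h_to_m(u_h) + self.l_to_m(u_l)
        out_l = u_l + self.m_to_l(u_m)
        return out_h, out_m, out_l

# Hypothetical usage with components at resolutions 32, 16 and 8:
layer = MultiResDecoderLayer(ch_h=4, ch_m=8, ch_l=16)
h, m, l = layer(torch.randn(1, 4, 32, 32), torch.randn(1, 8, 16, 16), torch.randn(1, 16, 8, 8))
```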
In this section, embodiments of an encoder 10 are described. The encoder 10 according to
For example, the quantization error, for which the polynomial function 39 provides the estimated distortion, is a measure of a difference between quantized features of a candidate quantization, for which the estimated distortion is to be determined, and features of a feature representation to which the estimated distortion refers. According to embodiments, the polynomial function 39 provides a distortion approximation as a function of a displacement or modification of a single quantized feature. In other words, the estimated distortion may, according to these embodiments, represent a contribution to a total distortion of a quantization, which contribution results from a modification of a single quantized feature of the quantization.
According to an embodiment, the polynomial function 39 is a sum of distortion contribution terms, each of which is associated with one of the quantized features. Each of the distortion contribution terms may be a polynomial function between a quantization error of the associated quantized feature and a distortion contribution resulting from the quantization error of the associated quantized feature. Consequently, a difference between the estimated distortions of a first quantization and a second quantization, which estimated distortions are determined using the polynomial function, may be determined by considering the distortion contributions associated with those quantized features of the first and the second quantization which differ from each other. For example, the estimated distortion according to the polynomial function of a first quantization differing from a second quantization in one of the quantized features, i.e. a modified quantized feature, may be calculated on the basis of the distortion contribution terms of the modified quantized feature of the first and second quantizations.
According to embodiments, the polynomial function has a nonzero quadratic term and/or a nonzero biquadratic term. Additionally or alternatively, a constant term and a linear term of the polynomial function are zero. Additionally or alternatively, odd-order terms of the polynomial function are zero.
According to some embodiments, the quantization determination module 80 may determine a first predetermined quantization as the predetermined quantization 32′ by rounding the features of the feature representation 22 using a predetermined rounding scheme. According to alternative embodiments, the quantization determination module 80 may determine the first predetermined quantization by determining a low-distortion feature representation on the basis of the feature representation. To this end, the quantization determination module 80 may minimize a reconstruction error associated with the low-distortion feature representation to be determined, i.e. the unquantized low-distortion feature representation to be determined. That is, the quantization determination module 80 may, starting from the feature representation 22, adapt the feature representation so as to minimize the reconstruction error of the unquantized low-distortion feature representation. Minimizing may refer to adapting the feature representation so that the reconstruction error reaches a local minimum within a given accuracy. E.g., a gradient descent method may be used, or any recursive method for minimizing multi-dimensional data. The quantization determination module 80 may then determine the first predetermined quantization by quantizing the determined low-distortion feature representation, e.g. by rounding.
For determining the reconstruction error during minimization, the quantization determination module 80 may use a further CNN, e.g. CNN 23 such as implemented in decoding stage 21 for reconstructing the picture from the feature representation. That is, the quantization determination module 80 may use the further CNN for determining the reconstruction error for a currently tested unquantized low-distortion feature representation.
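The following minimal sketch illustrates how such a low-distortion feature representation could be obtained by gradient descent on the reconstruction error, assuming a differentiable decoder network is available as a callable; the function name, the optimizer and the hyper-parameters are illustrative assumptions, not part of the described embodiments.

```python
import torch

def refine_features(x, z_init, decoder, steps=100, lr=1e-2):
    """Adapt the unquantized features z so that the reconstruction error ||x - D(z)||^2
    reaches a local minimum, yielding a low-distortion feature representation.
    `decoder` is assumed to be a differentiable CNN (e.g. the decoding stage CNN)."""
    z = z_init.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = torch.sum((x - decoder(z)) ** 2)  # reconstruction error of the unquantized features
        loss.backward()
        optimizer.step()
    return z.detach()

# The first predetermined quantization may then be obtained by rounding the refined features:
# q = torch.round(refine_features(x, z, decoder))
```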
The rate-distortion estimation module 35 comprises a distortion estimation module 78. The distortion estimation module 78 is configured for determining a distortion contribution associated with the modified quantized feature of the tested candidate quantization 81. The distortion contribution represents a contribution of the modified quantized feature to an approximate distortion 91 associated with the tested candidate quantization 81. The distortion estimation module 78 determines the distortion contributions using the polynomial function 39. The rate-distortion estimation module 35 is configured for determining the rate-distortion measure 83 associated with the tested candidate quantization 81 on the basis of the distortion 90 of the predetermined quantization 32′ and on the basis of the distortion contribution associated with the tested candidate quantization 81.
According to embodiments, the rate-distortion estimation module 35 may comprise a distortion approximation module 79 which determines the approximated distortion 91 associated with the tested candidate quantization 81 on the basis of the distortion associated with the predetermined quantization 32′ and on the basis of a distortion modification information 85, which is associated with the modified quantized feature of the tested candidate quantization 81. The distortion modification information 85 may indicate an estimation for a change of the distortion of the tested candidate quantization 81 with respect to the distortion associated with the predetermined quantization 32′, which change results from the modification of the modified quantized feature.
The distortion modification information 85 may for example be provided as a difference between the distortion contribution to an estimated distortion of the tested candidate quantization 81 determined using the polynomial function 39, and a distortion contribution to an estimated distortion of the predetermined quantization 32′ determined using the polynomial function 39, the distortion contributions being associated with the modified quantized feature. In other words, the distortion approximation module 79 is configured for determining the distortion approximation 91 on the basis of the distortion 90 associated with the predetermined quantization, the distortion contribution associated with the modified quantized feature of the tested candidate quantization 81, and a distortion contribution associated with a quantized feature of the predetermined quantization 32′, which quantized feature is associated with the modified quantized feature, for example associated by its position within the respective quantizations. In other words, the distortion modification information 85 may correspond to a difference between a distortion contribution associated with a quantization error of a feature value of the modified quantized feature in the tested candidate quantization 81 and a distortion contribution of a quantization error associated with a feature value of a modified quantized feature in the predetermined quantization 32′. Thus, the distortion estimation module 78 may use the feature representation 22 to obtain quantization errors associated with feature values of the quantized features of the predetermined quantization 32′ and/or the tested candidate quantization 81.
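As a simple illustration of this update, the following sketch derives the approximated distortion of a tested candidate from the distortion of the predetermined quantization and the difference of two per-feature distortion contributions; `contrib` stands in for the polynomial function 39 (e.g. the biquadratic form of equation (12) below) and all names are hypothetical.

```python
def approximate_distortion(d_predetermined, contrib, err_predetermined, err_candidate, channel):
    """Approximated distortion of the tested candidate quantization: the distortion of the
    predetermined quantization plus the distortion modification, i.e. the difference of the
    per-feature distortion contributions for the modified quantized feature.
    contrib(err, j) stands for the polynomial distortion-contribution term of channel j."""
    modification = contrib(err_candidate, channel) - contrib(err_predetermined, channel)
    return d_predetermined + modification
```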
According to embodiments, the rate-distortion estimation module 35 comprises a rate-distortion evaluator 93, which determines the rate-distortion measure 83 on the basis of the approximated distortion 91 and a rate 92 associated with the tested candidate quantization 81.
The rate-distortion estimation module 35 comprises a distortion determination module 88. The distortion determination module 88 determines the distortion 90 associated with the predetermined quantization 32′ by determining a reconstructed picture based on the predetermined quantization 32′ using a further CNN, for example the decoding stage CNN 23. For example, the further CNN is trained together with the CNN of the encoding stage to reconstruct the picture 12 from a quantized representation of the picture 12, the quantized representation being based on the feature representation which has been determined using the encoding stage 20. The distortion determination module 88 may determine the distortion of the predetermined quantization 32′ as a measure of the difference between the picture 12 and the reconstructed picture.
According to embodiments, the rate-distortion estimation module 35 further comprises a rate determination module 89. The rate determination module 89 is configured for determining the rate 92 associated with the tested candidate quantization 81. The rate determination module 89 may determine a rate associated with the predetermined quantization 32′, and may further determine a rate contribution associated with the modified quantized feature of the tested candidate quantization 81. The rate contribution may represent a contribution of the modified quantized feature to the rate 92 associated with the tested candidate quantization 81. For example, the rate determination module 89 may determine the rate associated with the tested candidate quantization 81 on the basis of the rate determined for the predetermined quantization 32′, the rate contribution associated with the modified quantized feature of the tested candidate quantization, and a rate contribution associated with the quantized feature of the predetermined quantization 32′ which is associated with the modified quantized feature.
For example, the rate determination module 89 may determine the rate associated with the predetermined quantization on the basis of respective rate contributions of quantized features of the predetermined quantization 32′.
According to embodiments, the rate determination module 89 determines a rate contribution associated with a quantized feature of a quantization on the basis of a probability model 52 for the quantized feature. The probability model 52 for the quantized feature may be based on a plurality of previous quantized features according to a coding order for the quantization. For example, the probability model 52 may be provided by an entropy module 50, which may determine the probability model 52 for the currently considered quantized feature based on previous quantized features, and optionally further based on information about a spatial correlation of the feature representation 22, for example the second probability parameter 84 as described with respect to sections 1 to 3.
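For illustration, a rate contribution per quantized feature may be taken as the ideal code length under the probability model, as in the following sketch; the use of −log2 p as the contribution and the function names are assumptions of this sketch.

```python
import math

def rate_contribution(p):
    """Ideal code length, in bits, of a quantized feature to which the probability model
    assigns probability p; one possible way to obtain a per-feature rate contribution."""
    return -math.log2(p)

def rate_of_quantization(probabilities):
    # Rate of a quantization as the sum of the contributions of its quantized features.
    return sum(rate_contribution(p) for p in probabilities)
```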
According to embodiments, the quantization determination module 80 compares the estimated rate-distortion measure 83 determined for the tested candidate quantization 81 to a rate-distortion measure 83 of the predetermined quantization 32′. If the estimated rate-distortion measure 83 of the tested candidate quantization 81 indicates a lower rate at equal distortion, and/or a lower distortion at equal rate, the quantization determination module may define the tested candidate quantization as the predetermined quantization 32′, and may keep the predetermined quantization 32′ otherwise. In examples, the quantization determination module 80 may, after having tested each of the plurality of candidate quantizations, use the predetermined quantization 32′ as the quantization 32.
The quantization determination module 80 may use a predetermined set of candidate quantizations. Alternatively, the quantization determination module 80 may determine the tested candidate quantization 81 in dependence on a previously tested candidate quantization.
According to embodiments, the quantization determination module 80 may determine the candidate quantizations by rounding each of the features of the feature representation 22 so as to obtain a corresponding quantized feature of the candidate quantization. According to these embodiments, the quantization determination module may determine the tested candidate quantization by selecting, for one of the quantized features of the tested candidate quantization, a quantized feature candidate out of a set of quantized feature candidates. For example, the quantization determination module 80 may modify one of the quantized features with respect to the predetermined quantization 32′ by selecting the value for the quantized feature to be modified out of the set of quantized feature candidates.
The quantization determination module 80 may determine the set of quantized feature candidates for a quantized feature by one or more out of: rounding up the feature of the feature representation which is associated with the quantized feature, rounding down the feature of the feature representation which is associated with the quantized feature, and using an expectation value of the feature, the expectation value being determined on the basis of the entropy model 52, or being provided by the entropy model 52.
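A minimal sketch of such a candidate set could look as follows, assuming the expectation value is rounded to the integer quantization grid; names and the rounding of the expectation value are illustrative choices.

```python
import math

def quantized_feature_candidates(feature, expected_value=None):
    """Candidate values for one quantized feature: rounding down, rounding up and,
    optionally, an expectation value provided by the entropy model (rounded here
    to the integer grid as one possible choice)."""
    candidates = {math.floor(feature), math.ceil(feature)}
    if expected_value is not None:
        candidates.add(round(expected_value))
    return sorted(candidates)

# Example: quantized_feature_candidates(1.3, expected_value=0.1) -> [0, 1, 2]
```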
Accordingly, the quantizer 30 may be configured for determining the quantization 32 by testing, for each of the features 22′ of the feature representation 22, each out of the set of quantized feature candidates for quantizing the feature, wherein the quantizer 30 may perform the testing for the features according to the coding order. In other words, after the quantized feature has been determined for one of the features, this quantized feature may be entropy coded, and thus may be fixed for subsequently tested candidate quantizations 32′.
According to embodiments, the quantizer 30 comprises an initial predetermined quantization determination module 17 which determines an initial predetermined quantization 32′ which may be used as the predetermined quantization 32′ for testing a first quantized feature candidate for the first feature of the feature representation 22. For example, the initial predetermined quantization determination module 17 may determine the initial predetermined quantization 32′ by rounding each of the features of the feature representation 22, i.e. using the same rounding scheme for each of the features, or by determining the quantization of the low-distortion feature representation as described with respect to
The quantizer 30 according to
According to embodiments, the predetermined quantization determination module 16 may compare the estimated rate-distortion measure 83 determined for the tested quantized feature candidate 37 to a rate-distortion measure associated with the predetermined quantization 32′. The rate-distortion measure for the predetermined quantization 32′ may be determined on the basis of the distortion 90 associated with the predetermined quantization 32′ and on the basis of the rate of the predetermined quantization 32′ as it may be determined by the rate determination module 89. If the estimated rate-distortion measure 83 determined for the tested quantized feature candidate 37 indicates that the tested candidate quantization 81 is associated with a higher rate at equal distortion, and/or a higher distortion at equal rate, the predetermined quantization determination module may keep the predetermined quantization 32′ as the predetermined quantization 32′, and else may consider redefining the predetermined quantization.
According to embodiments, the quantized feature determination stage 13 may, in case that the estimated rate-distortion measure 83 determined for the tested quantized feature candidate 37 indicates that the tested candidate quantization 81 is associated with a lower rate at equal distortion and/or a lower distortion at equal rate, determine a rate-distortion measure associated with the tested candidate quantization 81. The rate-distortion measure may be determined by determining a reconstructed picture based on the tested candidate quantization 81 using the further CNN, as described with respect to the determination of the distortion of the predetermined quantization 32′. The quantizer 30 may be configured for determining the distortion as a measure of the difference between the picture and the reconstructed picture, e.g. by means of the distortion determination module 88, and for determining the rate-distortion measure associated with the tested candidate quantization 81 on the basis of the distortion determined on the basis of the reconstructed picture. The rate-distortion measure thus determined for the tested candidate quantization 81 may be more accurate than the estimated rate-distortion measure, as using the reconstructed picture may allow for an accurate determination of the distortion. The quantized feature determination stage 13 may compare the rate-distortion measure associated with the tested quantized feature candidate to the rate-distortion measure associated with the predetermined quantization. If the rate-distortion measure determined for the tested candidate quantization 81 indicates that the tested candidate quantization 81 is associated with a lower rate at equal distortion, or a lower distortion at equal rate, the predetermined quantization determination module 16 may use the tested candidate quantization 81 as the predetermined quantization 32′, and else may keep the predetermined quantization 32′ as the predetermined quantization 32′. Thus, in case that the tested candidate quantization 81 is used as the predetermined quantization 32′, the distortion 90 of the predetermined quantization 32′ may already be available.
This section describes an embodiment of a quantizer as it may optionally be implemented in the encoder architecture described in section 2, optionally and beneficially in combination with the implementation of the auto-encoder E and the auto-decoder D described in section 5. The quantizer described herein may be a specific embodiment of the quantizer 30 as implemented in the encoder 10 and the decoder 20 of
Compression systems like those used in [11] to [16] are based on a symmetry between encoder and decoder, and they are implemented without signal-dependent encoder optimizations. However, designing such optimizations requires an understanding of the impact of the quantization. For linear, orthogonal transforms, the rate-distortion performance of different quantizers is well-known; [17]. On the other hand, it is rather difficult to estimate the impact of feature changes on the distortion for non-linear transforms. The purpose of this section is to describe an RDO algorithm for refining the quantized features and improving the rate-distortion trade-off.
Suppose that the side information ŷ and the hyper parameters θ are fixed. We may consider
as a set of possible coding options. Provided we are able to efficiently compute the distortion and the expected bitrate, the rate-distortion loss can be expressed as
d(w) = ∥x − D(w)∥², R(w, θ) = Σ_l R_l(w_l; θ), (10)
J(w) = d(w) + λ(R′ + R(w, θ)). (11)
E.g., distortion determination module 88 of
In (11), R′ is the constant bitrate of the side information. It is important to note that ẑ ≠ argmin J(w) holds in general. In other words, the encoder typically does not minimize J, although ẑ certainly provides an efficient compression of the input image. Note that changing an entry w_l affects multiple bitrates due to (5). Furthermore, we simply assume uniform scalar quantization and disregard other quantization methods for optimizing the loss term (11). In existing video codecs, the impact of different coding options on d and R is well understood. This has enabled the design of tailor-made algorithms for finding optimal coding decisions. For end-to-end compression systems, understanding the impact of different coding decisions on (11) is rather difficult, due to the non-linearity of (2). However, it turns out that optimization is possible by exhaustively testing different candidates w. Therefore, our goal is to implement an efficient algorithm for optimizing the quantized features. Similar to the fast-search methods in video codecs, our algorithm should avoid the evaluation of less promising candidates. This can be accomplished by estimating the distortion d(w) without executing the decoder network. Furthermore, it may only be necessary to re-compute the bitrate R_l (and possibly R_{l+1}, . . . , R_{l+L}) when a single entry w_l is changed.
7.1 Distortion Estimation by a Biquadratic Polynomial
The biquadratic polynomial described within this section may optionally be applied as the polynomial function 39 introduced with respect to
A basic property of orthogonal transforms is perfect reconstruction, which auto-encoders do not satisfy in general. However, we can expect for inputs x ∼ p_x and features z = E(x) that D(z) is an estimate at least as good as D(ẑ), i.e.
0 ≤ ∥x − D(z)∥² ≤ ∥x − D(ẑ)∥².
In particular, it is desirable to ensure that z is close to a local minimum of the distortion d. This can be accomplished by adding the minimization of ∥x − D(E(x))∥² as a secondary condition to the training of the network or by training for smaller values of λ. Next, we define the following auxiliary function for displacements h as
ε(h) := ∥D(z) − D(z + h)∥².
Note that ε(0)=0 is a minimum and thus, the gradient is
∇ε(0)=0.
Thus, by Taylor's theorem, the impact of displacements h on ε can be approximated by a higher-order polynomial without constant and linear term. Given the feature channels z = (z^(1), . . . , z^(c)), we evaluated ε(h) for different single-channel displacements
h ∈ {(h^(1), 0, . . . , 0), (0, h^(2), . . . , 0), . . . , (0, 0, . . . , h^(c))}
on a set of sample images; see
Consequently, we fitted the following biquadratic polynomial to the data by a least-squares approximation
ε(h) ≈ Σ_{j=1}^{c} (γ_1^(j) ∥h^(j)∥² + γ_2^(j) ∥h^(j)∥⁴). (12)
For example, the distortion estimation module 78 may apply (12), or part of it such as one or more of the summand terms of (12), for determining the distortion contribution of the modified quantized feature of the tested candidate quantization 81, and optionally also the distortion contribution of the quantized feature of the predetermined quantization 32′ which is associated with the modified quantized feature. E.g., ε(h) may be referred to as the estimated distortion associated with the quantized representation represented by the displacement h.
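For illustration, the following sketch evaluates the fitted model (12) for a displacement tensor with per-channel coefficients; tensor shapes and names are assumptions of this sketch.

```python
import torch

def biquadratic_distortion(h, gamma1, gamma2):
    """Evaluate (12) for a displacement h of shape (channels, height, width):
    sum over channels j of gamma1[j] * ||h^(j)||^2 + gamma2[j] * ||h^(j)||^4.
    gamma1 and gamma2 are per-channel coefficients from the least-squares fit."""
    sq_norm = torch.sum(h ** 2, dim=(1, 2))  # ||h^(j)||^2 per channel
    return torch.sum(gamma1 * sq_norm + gamma2 * sq_norm ** 2)
```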
The inventors realized that, by using the triangle inequality, one can estimate the distortion of w = z + h as
d(w) ≤ d(z) + Σ_{j=1}^{c} (γ_1^(j) ∥h^(j)∥² + γ_2^(j) ∥h^(j)∥⁴). (13)
Thus, the upper bound may be used as an estimate of d(w). E.g., the distortion approximation 91 may be based on this estimation. Further note that for orthogonal transforms, the inequality holds with γ_1^(j) = 1 and γ_2^(j) = 0. In the case that z is not a local minimum of d, it may be beneficial to re-compute a different z which decreases the unquantized error ∥x − D(z)∥², for instance by using a gradient descent method. When z is close to a local minimum of d, we have the lower bound d(z) ≤ d(w) in addition to (13), which further improves the accuracy of the distortion approximation. The higher the accuracy of the distortion approximation, the more executions of the decoder may be avoided during determination of the quantization. The following algorithm, which optimizes the rate-distortion trade-off (11), avoids numerous executions of the decoder by estimating the distortion by the approximation (13).
The following algorithm 1 may represent an embodiment of the quantizer 30, and may optionally be an embodiment of the quantizer 30 as described with respect to
Algorithm 1: Fast rate-distortion optimization for the auto-encoder with user-defined step size δ.
The choice of δ is subject to the employed quantization scheme. According to embodiments, δ_l = 1 for each position. Remark that the candidate value μ_l can be considered as a prediction constructed from the initial features z. The expected bitrate R_l(μ_l, θ) is minimal due to (7). Note that each change of a feature technically requires updates of the hyper parameters and the entropy model. The stated algorithm disregards these dependencies of the coding decisions, similar to the situation in hybrid, block-based video codecs. Finally, note that an exhaustive search for each candidate requires a total of N ≈ 10HW decoder evaluations. Empirically, we have observed that Algorithm 1 reduces this number by a factor of approximately 25 to 50.
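Since the listing of Algorithm 1 itself is not reproduced above, the following sketch gives one possible reading of the described procedure: for each position, a few candidate values (e.g. rounding up, rounding down and the prediction μ_l) are screened with the estimate (13), and the decoder is executed only for candidates whose estimated cost is lower than the current cost. All function names and the exact control flow are assumptions of this sketch.

```python
def fast_rdo(q, d_exact, lam, candidates, rate_of, estimated_distortion, decoded_distortion):
    """One possible reading of the fast RDO loop. q: current quantization as a flat list
    of quantized features (e.g. rounded features), d_exact: its exact distortion,
    lam: Lagrange parameter, candidates(l): candidate values for position l,
    rate_of(q): bitrate of a quantization, estimated_distortion(q): distortion upper
    bound per (13) computed without the decoder, decoded_distortion(q): exact
    distortion obtained by executing the decoder network."""
    best_cost = d_exact + lam * rate_of(q)
    for l in range(len(q)):
        for w_l in candidates(l):
            if w_l == q[l]:
                continue
            trial = list(q)
            trial[l] = w_l
            estimated_cost = estimated_distortion(trial) + lam * rate_of(trial)
            if estimated_cost < best_cost:           # promising candidate: verify with the decoder
                d_trial = decoded_distortion(trial)
                cost = d_trial + lam * rate_of(trial)
                if cost < best_cost:                 # accept the modified quantization
                    q, d_exact, best_cost = trial, d_trial, cost
    return q
```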
The present disclosure thus provides an auto-encoder for image compression using multi-scale representations of the features, thus improving the rate-distortion trade-off. The disclosure further provides a simple algorithm for improving the rate-distortion trade-off, which increases the efficiency of the trained compression system.
The usage of Algorithm 1 of section 7 avoids multiple decoder executions by pre-estimating the impact of feature changes on the distortion by means of a higher-order polynomial. The same applies to the embodiments of
Although some aspects have been described as features in the context of an apparatus it is clear that such a description may also be regarded as a description of corresponding features of a method. Although some aspects have been described as features in the context of a method, it is clear that such a description may also be regarded as a description of corresponding features concerning the functionality of an apparatus.
Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
The inventive binary representation can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet. In other words, further embodiments provide a video bitstream product including the video bitstream according to any of the herein described embodiments, e.g. a digital storage medium having stored thereon the video bitstream.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software or at least partially in hardware or at least partially in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are performed by any hardware apparatus.
The apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
The methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
In the foregoing Detailed Description, it can be seen that various features are grouped together in examples for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed examples require more features than are expressly recited in each claim. Rather, as the following claims reflect, subject matter may lie in less than all features of a single disclosed example. Thus the following claims are hereby incorporated into the Detailed Description, where each claim may stand on its own as a separate example. While each claim may stand on its own as a separate example, it is to be noted that, although a dependent claim may refer in the claims to a specific combination with one or more other claims, other examples may also include a combination of the dependent claim with the subject matter of each other dependent claim or a combination of each feature with other dependent or independent claims. Such combinations are proposed herein unless it is stated that a specific combination is not intended. Furthermore, it is intended to include also features of a claim to any other independent claim even if this claim is not directly made dependent to the independent claim.
The above described embodiments are merely illustrative for the principles of the present disclosure. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the pending patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.
Number | Date | Country | Kind |
---|---|---|---|
21157003.1 | Feb 2021 | EP | regional |
This application is a continuation of copending International Application No. PCT/EP2022/053447, filed Feb. 11, 2022, which is incorporated herein by reference in its entirety, and additionally claims priority from European Application No. EP 21 157 003.1, filed Feb. 13, 2021, which is incorporated herein by reference in its entirety. Embodiments of the invention relate to encoders for encoding a picture, e.g. a still picture or a picture of a video sequence. Further embodiments of the invention relate to decoders for reconstructing a picture. Further embodiments relate to methods for encoding a picture and to methods for decoding a picture. Some embodiments of the invention relate to rate-distortion optimization for deep image compression. Some embodiments relate to an auto-encoder and an auto-decoder for image compression using multi-scale representations of the features. Further embodiments relate to an auto-decoder using an algorithm for determining a quantization of a picture.
Number | Date | Country | |
---|---|---|---|
Parent | PCT/EP2022/053447 | Feb 2022 | US |
Child | 18448485 | US |