Embodiments of the present invention relate to an apparatus and a method for encoding a video into a bitstream, an apparatus and a method for decoding a video from a bitstream, and a bitstream having a video encoded thereinto. Some embodiments relate to adaptive loop filtering using a CNN-based classification.
In-loop filters have always formed one of the key building blocks of modern video codecs such as H.264/AVC [1, 2], H.265/HEVC [3, 4] or the recently finalized H.266/VVC [5, 6]. The concept of in-loop filtering is motivated by the observation that in the decoding process of a video signal, specific artefacts, so-called coding artefacts, may occur after the addition of prediction and reconstructed residual. Therefore, one tries to find suitable signal-modifications, called in-loop filters, which can be applied to a reconstructed frame of a video sequence before it is either displayed or used as an input for the prediction of other frames.
A classical example of coding artefacts are artificial edges, which can be explained by the block-based structure of the underlying video codec and which can be mitigated by a deblocking filter [7]. On the other hand, the state-of-the-art Versatile Video Coding standard (VVC) is characterized by a large amount of different compression tools which together contribute to its compression efficiency. A simple description and mitigation of the coding artefacts that may be caused by specific combinations of some of these tools with the underlying signal becomes more and more difficult. For these reasons, recent approaches often proceed in a data-driven way [8-10] by training specific Convolutional Neural Networks (CNNs) for in-loop filtering. In [11], a specific design of a data-driven in-loop filter has been presented as a generalization of the Adaptive Loop Filter (ALF) of VVC [12, 13].
Still, there is an ongoing desire to improve video compression, e.g. in terms of a rate-distortion relation, computational effort, and/or complexity.
An embodiment may have an apparatus for decoding a video from a bitstream, wherein the apparatus is configured to: reconstruct, based on the bitstream, the video using block-based predictive decoding, transform-based residual decoding and a prediction loop into which an in-loop filter tool is serially connected, the in-loop filter tool having a serial connection of a first in-loop filter and a second in-loop filter, wherein the second in-loop filter is configured to subject pre-reconstructed samples of a current picture to an adaptive in-loop filtering, ALF, wherein the second in-loop filter is configured to perform, based on the bitstream, a mode switching between (alternative 1) one or more first modes of performing the adaptive in-loop filtering, and one or more second modes of performing the adaptive in-loop filtering, wherein the one or more first modes are computationally more complex than the one or more second modes, or between (alternative 2) one or more first modes of performing the adaptive in-loop filtering, and one or more second modes of performing the adaptive in-loop filtering, wherein the one or more first modes are computationally more complex than the one or more second modes, and a third mode of bypassing the second in-loop filter, or between (alternative 3) one or more first modes of performing the adaptive in-loop filtering, with each of the first modes using a CNN, and optionally, a second mode of bypassing the second in-loop filter.
Another embodiment may have an apparatus for encoding a video into a bitstream, wherein the apparatus is configured to: encode, into the bitstream, the video using block-based predictive encoding, transform-based residual encoding and a prediction loop into which an in-loop filter tool is serially connected, the in-loop filter tool having a serial connection of a first in-loop filter and a second in-loop filter, wherein the second in-loop filter is configured to subject pre-reconstructed samples of a current picture to an adaptive in-loop filtering, ALF, wherein the second in-loop filter is configured to perform, and signal in the bitstream, a mode switching between (alternative 1) one or more first modes of performing the adaptive in-loop filtering, and one or more second modes of performing the adaptive in-loop filtering, wherein the one or more first modes are computationally more complex than the one or more second modes, or between (alternative 2) one or more first modes of performing the adaptive in-loop filtering, and one or more second modes of performing the adaptive in-loop filtering, wherein the one or more first modes are computationally more complex than the one or more second modes, and a third mode of bypassing the second in-loop filter, or between (alternative 3) one or more first modes of performing the adaptive in-loop filtering, with each of the first modes using a CNN, and optionally, a second mode of bypassing the second in-loop filter.
Another embodiment may have methods performed by the above apparatus for decoding a video from a bitstream or encoding a video into a bitstream.
Another embodiment may have a method for decoding a video from a bitstream, the method having: reconstructing, based on the bitstream, the video using block-based predictive decoding, transform-based residual decoding and a prediction loop into which an in-loop filter tool is serially connected, the in-loop filter tool having a serial connection of a first in-loop filtering and a second in-loop filtering, wherein the second in-loop filtering is performed by subjecting pre-reconstructed samples of a current picture to an adaptive in-loop filtering, ALF, wherein the second in-loop filtering performs, based on the bitstream, a mode switching between (alternative 1) one or more first modes of performing the adaptive in-loop filtering, and one or more second modes of performing the adaptive in-loop filtering, wherein the one or more first modes are computationally more complex than the one or more second modes, or between (alternative 2) one or more first modes of performing the adaptive in-loop filtering, and one or more second modes of performing the adaptive in-loop filtering, wherein the one or more first modes are computationally more complex than the one or more second modes, and a third mode of bypassing the second in-loop filter, or between (alternative 3) one or more first modes of performing the adaptive in-loop filtering, with each of the first modes using a CNN, and optionally, a second mode of bypassing the second in-loop filter.
Another embodiment may have a method for encoding a video into a bitstream, the method having: encoding, into the bitstream, the video using block-based predictive encoding, transform-based residual encoding and a prediction loop into which an in-loop filter tool is serially connected, the in-loop filter tool having a serial connection of a first in-loop filter and a second in-loop filter, wherein the second in-loop filtering is performed by subjecting pre-reconstructed samples of a current picture to an adaptive in-loop filtering, ALF, wherein the second in-loop filtering performs, and signals in the bitstream, a mode switching between (alternative 1) one or more first modes of performing the adaptive in-loop filtering, and one or more second modes of performing the adaptive in-loop filtering, wherein the one or more first modes are computationally more complex than the one or more second modes, or between (alternative 2) one or more first modes of performing the adaptive in-loop filtering, and one or more second modes of performing the adaptive in-loop filtering, wherein the one or more first modes are computationally more complex than the one or more second modes, and a third mode of bypassing the second in-loop filter, or between (alternative 3) one or more first modes of performing the adaptive in-loop filtering, with each of the first modes using a CNN, and optionally, a second mode of bypassing the second in-loop filter.
Another embodiment may have a non-transitory digital storage medium having stored thereon a computer program for performing any of the above methods when the computer program is run by a computer.
Another embodiment may have a bitstream generated by the above apparatus for encoding.
Embodiments of the present invention rely on the idea to use, in a prediction loop, which is part of a coding concept using block-based predictive decoding and transform-based residual decoding, an in-loop filter tool, which performs, for an in-loop filter of the in-loop filter tool, a mode switching between a plurality of modes that differ in complexity. Such mode switching allows an adaptation to the coded video signal. The inventors realized that despite the fact that the controlling of the mode switching may increase complexity, the overall computational effort and/or complexity may be reduced, because the computational resources may be distributed more efficiently over individual portions of the video signal. For example, the effect of an in-loop filtering may be higher for some portions, but lower for other ones, so that the possibility of a mode switching between different complexity levels of the in-loop filter may improve the trade-off between a rate-distortion measure and computational effort/complexity. In a first alternative, the modes of different complexity may be provided by a first mode and a second mode of performing the adaptive in-loop filtering, which have different computational complexities. In a second alternative, in addition to the first and the second modes, a bypass mode is provided as a further option for the mode switching. In the bypass mode, computational effort for the respective in-loop filter may be avoided. In a third alternative, the modes of different complexity are provided by one or more first modes, each of which uses a CNN for performing the adaptive in-loop filtering, and optionally a bypass mode.
Embodiments of the invention provide an apparatus for decoding a video from a bitstream. The apparatus is configured to reconstruct, based on the bitstream, (e.g. according to H.266) the video using block-based predictive decoding, transform-based residual decoding and a prediction loop into which an in-loop filter tool is serially connected, the in-loop filter tool comprising a serial connection of a first in-loop filter and a second in-loop filter. The second in-loop filter is configured to subject pre-reconstructed samples of a current picture to an adaptive in-loop filtering, ALF (e.g. whose filter transfer function is locally adapted). In a first alternative of these embodiments, the second in-loop filter is configured to perform, based on the bitstream, a mode switching between one or more first modes of performing the adaptive in-loop filtering (e.g. mutually differing in terms of complexity), and one or more second modes of performing the adaptive in-loop filtering, wherein the one or more first modes are computationally more complex than the one or more second modes. In a second alternative of these embodiments, the second in-loop filter is configured to perform, based on the bitstream, a mode switching between one or more first modes of performing the adaptive in-loop filtering (e.g. mutually differing in terms of complexity), one or more second modes of performing the adaptive in-loop filtering, wherein the one or more first modes are computationally more complex than the one or more second modes, and a third mode of bypassing the second in-loop filter. In a third alternative of these embodiments, the second in-loop filter is configured to perform, based on the bitstream, a mode switching between one or more first modes of performing the adaptive in-loop filtering (e.g. more than one and mutually differing in terms of complexity), with each of the first modes using a CNN, and optionally, a second mode of bypassing the second in-loop filter. For example, in the third alternative, the mode switching may be performed between one first mode using a CNN and the bypass mode, or between a plurality of the first modes, each of which uses a CNN, the CNNs having different computational complexities, or between a plurality of the first modes, each of which uses a CNN, and the bypass mode.
According to embodiments, the one or more first modes and/or the one or more second modes involve the second in-loop filter assigning a classification to pre-reconstructed samples of the current picture and filtering the pre-reconstructed samples with a filter transfer function which is adapted to the classification. The manner of performing the classification and/or the filter transfer functions may be specific to the respective modes.
For example, the classification of the one or more first modes is a soft-classification and/or the classification of the one or more second modes is a hard-classification. A soft-classification may be computationally more complex compared to a hard-classification, but may provide a better adaptation to the video signal, thereby providing a more accurate prediction and, as a result, a better rate-distortion trade-off for the encoded video signal.
According to an embodiment, the apparatus is configured to perform the mode switching (e.g. inter alia) based on an estimation of a measure of complexity (e.g. number of multiplications per sample) incurred by the second in-loop filter or the one or more first modes of the second in-loop filter within a predetermined video or picture section so far by disabling the one or more first modes, or any first mode (i.e. all those) exceeding a predetermined complexity for the predetermined video or picture section if the estimation fulfills a predetermined criterion (e.g. exceeds a threshold). Such a switching may prevent the coding complexity from exceeding a certain complexity threshold and thus exceeding the resources available for the decoding, while the mode switching may still allow using higher complexity in-loop filtering modes, such as CNN based modes, for portions of the video in which the usage of these higher complexity modes incurs a comparably low complexity.
According to an embodiment, the apparatus is configured to perform the mode switching (e.g. inter alia) based on a measure for prediction quality or prediction imperfection within a predetermined picture area by disabling the one or more first modes, or any first mode exceeding the predetermined complexity, for the predetermined picture area if the measure for prediction quality or prediction imperfection fulfills a further predetermined criterion. For example, in case of a comparably high prediction quality, the impact of the in-loop filtering may be comparably low, so that the trade-off between complexity and rate-distortion may be improved by choosing a low-complexity filtering mode. For example, the prediction quality may be measured in terms of a metric of the prediction residuum.
According to an embodiment, the apparatus is configured to perform the mode switching based on prediction type or inter-prediction hierarchy level of a picture by disabling the one or more first modes, or any first mode exceeding a predetermined complexity, for the picture if the measure for prediction quality or prediction imperfection fulfills an even further predetermined criterion. That is, for example, the consideration of the prediction quality may be combined with a conditioning on the prediction type or hierarchy level, so that different criteria may be applied for different prediction types or hierarchy levels. This combination allows for using more computational resources in case of prediction types/hierarchy levels of higher impact compared to ones incurring lower impact on the prediction signal, so that the usage of resources may be controlled to provide a good trade-off between complexity and rate-distortion.
According to an embodiment, the apparatus is configured to perform the mode switching in dependence on (e.g. inter alia) whether, for a predetermined picture portion, at least one reference picture succeeds a picture of the predetermined picture portion in presentation time order by disabling the one or more first modes, or any first mode exceeding the predetermined complexity, for the predetermined picture portion if this is the case. Pictures not using reference pictures of later presentation times may be used as reference pictures more frequently, so that a high reconstruction quality of these pictures may have a higher impact compared to pictures using reference pictures of later presentation times. Therefore, spending more computational effort on reconstructing pictures not using reference pictures of later presentation times may provide a good trade-off between complexity and rate-distortion.
According to an embodiment, the apparatus is configured to perform the mode switching in dependence on (e.g. inter alia) whether a further predetermined picture portion is, within at least one block, or completely, intra coded by disabling the one or more first modes, or any first mode exceeding a predetermined complexity, for the further predetermined picture portion if this is the case. Artifacts introduced by prediction may be more severe in inter-prediction compared to intra-prediction, so that a restriction of the higher complexity first modes to inter-predicted blocks may improve the trade-off between complexity and rate-distortion.
According to embodiments, the above-mentioned signal-modification is generated by a weighted sum of FIR-filterings. The weights may vary per sample and are computed by an offline-trained CNN. They can be interpreted as probabilities for a sample to belong to a specific class.
Embodiments of this invention provide a reduction of the decoder-complexity of [11] by restricting the in-loop filtering process to a specific subset of all reconstructed blocks or by applying the proposed in-loop filter in different complexity configurations to different types of reconstructed blocks. For some embodiments, two main hypotheses that are verified by experiments motivate the design. First, it is assumed that due to the temporal prediction between frames, a removal of compression artefacts by the proposed in-loop filter is particularly important for those frames which are typically referenced most frequently in a typical Random-Access (RA) coding scenario with hierarchical B-pictures. Second, it is assumed that the proposed in-loop filter technology is most effective on those parts of a decoded video sequence where a prediction residual has been transmitted. Therefore, we introduce various settings where the proposed in-loop filter applied for I-pictures is more complex than the one applied for B-pictures. Furthermore, we disallow the CNN-based in-loop filters for some input blocks, especially ones where the quantized prediction residual is zero. We describe the gain-complexity trade-offs of those settings.
A further embodiment provides an apparatus for encoding a video into a bitstream. The apparatus is configured to encode, into the bitstream, (e.g. according to H.266) the video using block-based predictive encoding, transform-based residual encoding and a prediction loop into which an in-loop filter tool is serially connected, the in-loop filter tool comprising a serial connection of a first in-loop filter and a second in-loop filter. The second in-loop filter is configured to subject pre-reconstructed (in the prediction loop) samples of a current picture to an adaptive in-loop filtering, ALF, (e.g. whose filter transfer function is locally adapted). Further, the second in-loop filter is configured to perform (e.g. by means of RD optimization), and signal in the bitstream, a mode switching according to one of the first, second, and third alternatives described above.
A further embodiment provides a method for decoding a video from a bitstream, wherein the method comprises: reconstructing, based on the bitstream, (e.g. according to H.266) the video using block-based predictive decoding, transform-based residual decoding and a prediction loop into which an in-loop filter tool is serially connected, the in-loop filter tool comprising a serial connection of a first in-loop filtering and a second in-loop filtering. The second in-loop filtering is performed by subjecting pre-reconstructed samples of a current picture to an adaptive in-loop filtering, ALF (e.g. whose filter transfer function is locally adapted). Further, the second in-loop filtering performs, based on the bitstream, a mode switching according to one of the first, second, and third alternatives described above.
A further embodiment provides a method for encoding a video into a bitstream, wherein the method comprises: encoding, into the bitstream, (e.g. according to H.266) the video using block-based predictive encoding, transform-based residual encoding and a prediction loop into which an in-loop filter tool is serially connected, the in-loop filter tool comprising a serial connection of a first in-loop filter and a second in-loop filter. The second in-loop filtering is performed by subjecting pre-reconstructed samples of a current picture to an adaptive in-loop filtering, ALF (e.g. whose filter transfer function is locally adapted). Further, the second in-loop filtering performs, and signals in the bitstream, a mode switching according to one of the first, second, and third alternatives described above.
Embodiments of the present disclosure are described in more detail below with reference to the figures.
Embodiments of the present invention are now described in more detail with reference to the accompanying drawings, in which the same or similar elements or elements that have the same or similar functionality have the same reference signs assigned or are identified with the same name. In the following description, a plurality of details is set forth to provide a thorough explanation of embodiments of the disclosure. However, it will be apparent to one skilled in the art that other embodiments may be implemented without these specific details. In addition, features of the different embodiments described herein may be combined with each other, unless specifically noted otherwise.
The following description of the figures starts with a presentation of a description of an encoder and a decoder of a block-based predictive codec for coding pictures of a video in order to form an example for a coding framework into which embodiments of the present invention may be built in. The respective encoder and decoder are described with respect to
The encoder 10 is configured to subject the prediction residual signal to spatial-to-spectral transformation and to encode the prediction residual signal, thus obtained, into the data stream 14. Likewise, the decoder 20 is configured to decode the prediction residual signal from the data stream 14 and subject the prediction residual signal thus obtained to spectral-to-spatial transformation.
Internally, the encoder 10 may comprise a prediction residual signal former 22 which generates a prediction residual 24 so as to measure a deviation of a prediction signal 26 from the original signal, i.e. from the picture 12. The prediction residual signal former 22 may, for instance, be a subtractor which subtracts the prediction signal from the original signal, i.e. from the picture 12. The encoder 10 then further comprises a transformer 28 which subjects the prediction residual signal 24 to a spatial-to-spectral transformation to obtain a spectral-domain prediction residual signal 24′ which is then subject to quantization by a quantizer 32, also comprised by the encoder 10. The thus quantized prediction residual signal 24″ is coded into bitstream 14. To this end, encoder 10 may optionally comprise an entropy coder 34 which entropy codes the prediction residual signal as transformed and quantized into data stream 14. The prediction signal 26 is generated by a prediction stage 36 of encoder 10 on the basis of the prediction residual signal 24″ encoded into, and decodable from, data stream 14. To this end, the prediction stage 36 may internally, as is shown in
Likewise, decoder 20, as shown in
Although not specifically described above, it is readily clear that the encoder 10 may set some coding parameters including, for instance, prediction modes, motion parameters and the like, according to some optimization scheme such as, for instance, in a manner optimizing some rate and distortion related criterion, i.e. coding cost. For example, encoder 10 and decoder 20 and the corresponding modules 44, 58, respectively, may support different prediction modes such as intra-coding modes and inter-coding modes. The granularity at which encoder and decoder switch between these prediction mode types may correspond to a subdivision of picture 12 and 12′, respectively, into coding segments or coding blocks. In units of these coding segments, for instance, the picture may be subdivided into blocks being intra-coded and blocks being inter-coded. Intra-coded blocks are predicted on the basis of a spatial, already coded/decoded neighborhood of the respective block as is outlined in more detail below. Several intra-coding modes may exist and be selected for a respective intra-coded segment including directional or angular intra-coding modes according to which the respective segment is filled by extrapolating the sample values of the neighborhood along a certain direction which is specific for the respective directional intra-coding mode, into the respective intra-coded segment. The intra-coding modes may, for instance, also comprise one or more further modes such as a DC coding mode, according to which the prediction for the respective intra-coded block assigns a DC value to all samples within the respective intra-coded segment, and/or a planar intra-coding mode according to which the prediction of the respective block is approximated or determined to be a spatial distribution of sample values described by a two-dimensional linear function over the sample positions of the respective intra-coded block, with tilt and offset of the plane defined by the two-dimensional linear function being derived on the basis of the neighboring samples. Compared thereto, inter-coded blocks may be predicted, for instance, temporally. For inter-coded blocks, motion vectors may be signaled within the data stream, the motion vectors indicating the spatial displacement of the portion of a previously coded picture of the video to which picture 12 belongs, at which the previously coded/decoded picture is sampled in order to obtain the prediction signal for the respective inter-coded block. This means, in addition to the residual signal coding comprised by data stream 14, such as the entropy-coded transform coefficient levels representing the quantized spectral-domain prediction residual signal 24″, data stream 14 may have encoded thereinto coding mode parameters for assigning the coding modes to the various blocks, prediction parameters for some of the blocks, such as motion parameters for inter-coded segments, and optional further parameters such as parameters for controlling and signaling the subdivision of picture 12 and 12′, respectively, into the segments. The decoder 20 uses these parameters to subdivide the picture in the same manner as the encoder did, to assign the same prediction modes to the segments, and to perform the same prediction to result in the same prediction signal.
Again, data stream 14 may have an intra-coding mode coded thereinto for intra-coded blocks 80, which assigns one of several supported intra-coding modes to the respective intra-coded block 80. For inter-coded blocks 82, the data stream 14 may have one or more motion parameters coded thereinto. Generally speaking, inter-coded blocks 82 are not restricted to being temporally coded. Alternatively, inter-coded blocks 82 may be any block predicted from previously coded portions beyond the current picture 12 itself, such as previously coded pictures of a video to which picture 12 belongs, or a picture of another view or a hierarchically lower layer in the case of encoder and decoder being scalable encoders and decoders, respectively.
The prediction residual signal 24″″ in
In
Naturally, while transformer 28 would support all of the forward transform versions of these transforms, the decoder 20 or inverse transformer 54 would support the corresponding backward or inverse versions thereof:
The subsequent description provides more details on which transforms could be supported by encoder 10 and decoder 20. In any case, it should be noted that the set of supported transforms may comprise merely one transform such as one spectral-to-spatial or spatial-to-spectral transform.
As already outlined above,
As illustrated in
In the following, embodiments of the invention are described, which may optionally be implemented as described with respect to
Decoder 20 may further comprise, as illustrated in
Insofar, the block-based predictive decoding and the transform-based residual decoding may be performed by decoding stage 51, e.g. in combination with the prediction loop 70, in particular in combination with the prediction module 58. It is noted, however, that the splitting into decoding stage 51 and prediction loop 70, as it is illustrated in
Within the prediction loop 70, an in-loop filter tool 62 is serially connected. The in-loop filter tool comprises a serial connection of a first in-loop filter 64 and a second in-loop filter 66. The second in-loop filter 66 is configured to subject pre-reconstructed samples 12″ of a current picture to an adaptive in-loop filtering, ALF (e.g. whose filter transfer function is locally adapted). For example, the pre-reconstructed samples 12″ may represent reconstructed samples of the current picture before being filtered by the second in-loop filter 66.
For example, the pre-reconstructed samples 12″ may be provided by the first in-loop filter 64, which may derive the pre-reconstructed samples by filtering pre-reconstructed samples 12′″, which may be provided by operator 56 based on the reconstructed residual signal 24″″ and based on the prediction signal 26.
For example, the first in-loop filter 64 may be a static filter or an adaptive filter. The first in-loop filter may be
According to a first alternative, the second in-loop filter performs the mode-switching 68 between one or more first modes 72 of performing the adaptive in-loop filtering, the first modes 72, for example, mutually differing in terms of complexity, and one or more second modes 74 of performing the adaptive in-loop filtering. According to this embodiment, the one or more first modes 72 are computationally more complex than the one or more second modes 74.
According to a second alternative, the second in-loop filter may have, in addition to the one or more first modes 72 and one or more second modes 74, a third mode of bypassing the second in-loop filter, referred to as bypass mode 78, which is illustrated as an option in
According to a third alternative, the second in-loop filter may perform the mode switching 68 between the one or more first modes 72 (e.g., more than one first modes 72) and the bypass mode 78. According to this embodiment, each of the one or more first modes uses a CNN. Again, the first modes 72 may mutually differ in terms of complexity, e.g. in terms of complexity of the CNN.
Please note that for sample-wise classification, the input for classifier 81 may still include more than a single sample. For example, the classification may be performed on pre-reconstructed samples 12″ belonging to the entire current picture, or to a portion thereof, such as a block. As an output, a classification 83 may be provided individually for each sample. In examples, for the classification of each of the samples, a neighborhood of the sample may be considered. E.g., the neighborhood may be a region within a sample array of the current picture, within which region the sample is located.
Filtering module 80 further uses a filter 85 for filtering the pre-reconstructed samples 12″ to obtain reconstructed samples 12′. For each sample 12″, the filter 85 may be selected, or adapted (e.g. by selecting a parametrization for the filter), based on the classification 83 selected for the sample. Classifier 81 and filter 85 may be specific to the mode out of the first and second modes. In other words, filtering module 80 may represent a description for each of the first modes 72 and/or second modes 74, where the implementation of the classifier 81 and/or the filter 85 may differ between the modes.
Thus, according to an embodiment, the one or more first modes 72 involve the second in-loop filter 66 assigning a classification 83 to pre-reconstructed samples 12″ of the current picture and filtering 85 the pre-reconstructed samples 12″ with a filter transfer function which is adapted to the classification 83.
According to an embodiment, the classification 81 of the one or more first modes 72 is a soft-classification.
According to an embodiment, the classification 81 of the one or more first modes 72 is based on a convolutional neural network (CNN).
According to an embodiment, the one or more second modes 74 involve the second in-loop filter assigning 81 a further classification 83 to pre-reconstructed samples 12″ of the current picture and filtering 85 the pre-reconstructed samples with a filter transfer function which is adapted to the further classification 83.
According to an embodiment, the classification 81 of the one or more second modes 74 is a hard-classification.
According to an embodiment, the classification 81 of the one or more second modes 74 is CNN based.
According to an embodiment, the one or more first modes 72 are CNN based and/or the one or more second modes 74 are non-CNN based.
According to an embodiment, the classification 81 of the one or more second modes 74 is based on an analysis of local activity and directionality.
According to an embodiment, the second in-loop filter 66 is configured to perform the adaptive in-loop filtering by use of FIR filters adapted in a sample-wise manner.
For example, as already mentioned, the first modes 72 and/or second modes 74 may perform a sample-wise classification of the pre-reconstructed samples 12″, and the second in-loop filter 66 may use FIR filters for filtering the samples, the FIR filters being adapted for the filtering of the individual samples according to the classification of the respective samples.
In other words, the filtering as performed by filtering module 780 according to
According to the embodiment of
In other words, the filter results obtained by filtering sample 12″ with the filters associated with the classes of the first set, i.e. contributions of multiple classes, may contribute to the reconstructed sample 12′ according to the embodiment of
In more general words, according to an embodiment, the one or more first modes 72 involve the second in-loop filter 66 assigning a classification 83 to pre-reconstructed samples 12″ of the current picture and filtering 85 the pre-reconstructed samples 12″ with a filter transfer function which is adapted to the classification 83, the classification of the one or more first modes 72 being a soft-classification, wherein the second in-loop filter 66 is configured to perform the soft classification for first pre-reconstructed samples (e.g. those for which soft classification, i.e. any first mode, is to be used) by assigning 81, for each first pre-reconstructed sample, a classification value 84 to each of a first set of classes 82, with each of which an FIR filter is associated, and performing the adaptive in-loop filtering, in case of using the soft classification for the assigning the classification, by, at each first pre-reconstructed sample, applying, for each class of the first set of classes, the FIR filter associated with the respective class to the pre-reconstructed samples to obtain a filter result, and forming a weighted sum of the filter results of the first set of classes according to the classification values.
According to an embodiment, the one or more first modes involve the second in-loop filter assigning a classification to pre-reconstructed samples of the current picture and filtering the pre-reconstructed samples with a filter transfer function which is adapted to the classification, the one or more second modes involve the second in-loop filter assigning a further classification to pre-reconstructed samples of the current picture and filtering the pre-reconstructed samples with a filter transfer function which is adapted to the further classification, the classification of the one or more first modes is a soft-classification, e.g., as described with respect to
For example, the second in-loop filter 66 may determine the classification index based on a local activity and directionality information assigned to the current pre-reconstructed sample 12″. E.g., the assignment of the local activity and directionality information assigned to the current pre-reconstructed sample 12″ may be performed by the second in-loop filter 66, e.g. by the filtering module 80, in case that one of the second modes 74 is used.
In more general words, according to an embodiment, the second in-loop filter 66 performs the hard classification for second pre-reconstructed samples (e.g. those for which hard classification, i.e. any of the second modes, is to be used) by assigning a local activity and directionality information to each second pre-reconstructed sample and assigning to each second pre-reconstructed sample a classification index into a second set of classes, with each of which an associated FIR filter is associated, based on the local activity and directionality information assigned to the respective second pre-reconstructed sample, and performing the adaptive in-loop filtering, in case of using the hard classification for the assigning the classification, by applying to the pre-reconstructed samples, at each second pre-reconstructed sample, the associated FIR filter associated with a class of the second set of classes, onto which the classification index points which is assigned to the respective second pre-reconstructed sample.
In the following, further optional details of soft-classification are described. These details may optionally be combined with or implemented in the soft classification as performed by filtering module 780, but the details described with respect to filtering module 780 are optional, i.e. the details described in the following may alternatively refer to soft-classification performed differently.
According to an embodiment, the adaptive in-loop filtering, in case of using the soft classification for the assigning 81 the classification 83, is according to:

ŷ = Σ_{k=1,…,L} Φk(y; Θ) · (y * fk)
wherein ŷ are the samples resulting from the adaptive in-loop filtering; y are pre-reconstructed samples, L is the number of classes in the first set; Φk is the classification value for class k and fk is the FIR filter associated with class k of the first set.
For example, Θ defines a parametrization of the FIR filter.
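For illustration, the following is a minimal NumPy sketch of this weighted-sum filtering. The CNN producing the classification values Φ is abstracted away as a given array of per-sample probabilities; the function and argument names are illustrative only and not part of the described design.

```python
import numpy as np
from scipy.ndimage import convolve

def soft_classified_alf(y, phi, filters):
    """Weighted-sum in-loop filtering: y_hat = sum_k phi_k * (y * f_k).

    y:       (H, W) array of pre-reconstructed samples
    phi:     (L, H, W) per-sample classification values (e.g. softmax
             outputs of the CNN, summing to 1 over the class axis)
    filters: list of L two-dimensional FIR kernels f_k
    """
    y = y.astype(np.float64)
    y_hat = np.zeros_like(y)
    for k, f_k in enumerate(filters):
        # filter result of class k, weighted per sample by its classification value
        y_hat += phi[k] * convolve(y, f_k, mode="nearest")
    return y_hat
```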
According to an embodiment, the adaptive in-loop filtering, in case of using the hard classification for the assigning the classification, is according to:

ŷ = Σ_{k=1,…,L} χCk · (y * fk)

wherein ŷ are the samples resulting from the adaptive in-loop filtering; y are pre-reconstructed samples; L is the number of classes in the second set; χCk is the indicator function of the class Ck of the second set; and fk is the FIR filter associated with class k of the second set.
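The hard-classification branch may be sketched as follows. The gradient-based activity/directionality rule below is a simplified stand-in for a VVC-ALF-style classification (which uses 4x4 blocks, Laplacian gradients and 25 classes), so the thresholds and class counts here are assumptions for illustration only.

```python
import numpy as np
from scipy.ndimage import convolve

def hard_classified_alf(y, filters, n_act=5):
    """Hard-classified filtering sketch with len(filters) == 3 * n_act.

    Each sample gets a class index from quantized local activity and a
    coarse directionality decision; the FIR filter of that class is then
    applied (indicator-function selection chi_{C_k})."""
    y = y.astype(np.float64)
    gh = np.abs(convolve(y, np.array([[-1.0, 2.0, -1.0]]), mode="nearest"))      # horizontal Laplacian
    gv = np.abs(convolve(y, np.array([[-1.0], [2.0], [-1.0]]), mode="nearest"))  # vertical Laplacian
    direction = (gh > 2 * gv).astype(int) + 2 * (gv > 2 * gh).astype(int)        # 0: none, 1: horiz, 2: vert
    activity = np.minimum((gh + gv) // 32, n_act - 1).astype(int)                # quantized local activity
    class_idx = direction * n_act + activity
    y_hat = y.copy()
    for k, f_k in enumerate(filters):
        mask = class_idx == k                                                    # indicator chi_{C_k}
        if mask.any():
            y_hat[mask] = convolve(y, f_k, mode="nearest")[mask]
    return y_hat
```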
According to an embodiment, the CNN 91 comprises exactly one convolution layer and exactly 7, 9 or 11 basic layer groups.
According to an embodiment, a structure of the CNN is based on any of the following variants in column “7 layer”, “9 layer” or “11 layer”:
wherein (K, Nin, Nout) refers to kernel size K, a number of input channels Nin and a number of output channels Nout; wherein a type of the layer indicates a type of convolution as non-separable, NS; or depth-wise separable, DS.
According to an embodiment, Θ of the above formula defines the weights of at least one, some, or all layers of a CNN, e.g. CNN 981, used for the assigning of the classification value to each class of the first set 82 or the second set 82′.
According to an embodiment, the classification 81, when using soft-classification, e.g. as described with respect to
According to an embodiment, classifier 81 performs soft classification, e.g. as described with respect to
According to an embodiment, a 1st basic layer group of a CNN of the soft classification is adapted to receive 8 input channels, advantageously exactly 8 input channels.
According to an embodiment, the 8 input channels comprise:
According to an embodiment, the soft classification is to identify dominant features around a sample location.
According to an embodiment, the soft classification comprises a subsampler for providing a subsampling operator.
According to an embodiment, for implementing the subsampling operator, the soft classification comprises a CNN that comprises a max pooling operator with 3×3 window followed by a 2D downsampling with factor 2 that is applied to output channels of a second basic layer group of the CNN; wherein in a last layer of the CNN, the downsampling step is reverted by an upsampling with trained upsampling filters.
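A possible structure of such a subsampling path is sketched in PyTorch below. Only the 3×3 max pooling with factor-2 downsampling after the second basic layer group and the trained upsampling in the last layer follow the description above; the channel counts, number of middle groups, and kernel sizes are placeholders, not the described design.

```python
import torch.nn as nn

class SubsampledSoftClassifier(nn.Module):
    """Backbone sketch: pooling/downsampling after the second basic layer
    group, reverted by a trained upsampling (transposed convolution)."""

    def __init__(self, ch=16, n_classes=8):
        super().__init__()
        self.group1 = nn.Sequential(nn.Conv2d(8, ch, 3, padding=1), nn.ReLU())
        self.group2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        # max pooling with a 3x3 window followed by factor-2 downsampling
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.middle = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        # last layer: trained upsampling filters revert the downsampling
        self.up = nn.ConvTranspose2d(ch, n_classes, kernel_size=4, stride=2, padding=1)

    def forward(self, x):                      # x: (N, 8, H, W), H and W even
        x = self.group2(self.group1(x))
        x = self.middle(self.pool(x))
        return self.up(x)                      # back to (N, n_classes, H, W)
```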
According to an embodiment, the soft classification is configured for a depth-wise separable convolution.
According to an embodiment, the depth-wise separable convolution comprises a filtering process in two parts; wherein a first part comprises a 2D convolution with a k1×k2 kernel that is performed independently over each input channel of the soft classification; wherein a second part comprises a full convolution but with 1×1 kernels that is applied across all channels.
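This two-part filtering maps directly onto a grouped convolution followed by a 1×1 convolution, sketched here in PyTorch (the class name and default kernel sizes are illustrative):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Two-part filtering: a k1 x k2 convolution performed independently
    over each input channel (groups == channels), then a full convolution
    with 1x1 kernels applied across all channels."""

    def __init__(self, c_in, c_out, k1=3, k2=3):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, (k1, k2),
                                   padding=(k1 // 2, k2 // 2), groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```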
According to an embodiment, the soft classification is adapted for applying a softmax function 910 to an output channel of a last, e.g. seventh, basic layer group of the soft classification.
According to an embodiment, the softmax function 910 comprises a structure based on

Φk(i) = exp(ψk(i)) / Σ_{l=1,…,L} exp(ψl(i))
wherein Φk(i) is interpretable as an estimated probability that the corresponding sample location i∈I is associated with a class of index k; Φk is a classification output; and ψl are the output channels of the last basic layer group.
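A worked instance of this channel-wise softmax, e.g. in NumPy (the stabilization by the channel maximum is a standard numerical safeguard, not part of the formula):

```python
import numpy as np

def soft_class_probabilities(psi):
    """psi: (L, H, W) output channels psi_l of the last basic layer group.
    Returns Phi with Phi[k][i] = exp(psi_k(i)) / sum_l exp(psi_l(i))."""
    e = np.exp(psi - psi.max(axis=0, keepdims=True))  # numerically stabilized
    return e / e.sum(axis=0, keepdims=True)
```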
According to an embodiment, the ALF is adapted for applying multiple 2D filters (fk) for different classes k to the classified samples.
According to an embodiment, the ALF is adapted for filtering the classified samples with a clipping function to reduce the impact of neighbour sample values when they are too different from the current sample value being filtered.
According to an embodiment, the clipping function is based on the determination rule

(y * f)(x) → y(x) + Σ_{i≠0} f(i) · Clip(y(x+i) − y(x); ρ(i))

to modify the filtering of the input signal y with a 2D-filter f at sample location x, wherein ‘Clip’ is the clipping function defined by Clip(d; b) = min(b; max(−b; d)) and ρ(i) are trained clipping parameters used for the filtering process y*fk and for a first convolutional layer of a CNN of the soft classification.
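A sketch of such clipped filtering, assuming the VVC-ALF-style reading of the rule above in which neighbour differences are clipped per tap before weighting; border handling is simplified here and the dictionary-based tap layout is only for illustration:

```python
import numpy as np

def clip(d, b):
    """Clip(d; b) = min(b; max(-b; d))."""
    return np.minimum(b, np.maximum(-b, d))

def clipped_filtering(y, f, rho):
    """Clipped filtering of y at every sample location; f and rho map tap
    offsets (di, dj) != (0, 0) to a coefficient and a trained clipping
    parameter. np.roll wraps at borders; a codec would pad/clamp instead."""
    y = y.astype(np.float64)
    out = y.copy()
    for (di, dj), coeff in f.items():
        neighbour = np.roll(np.roll(y, -di, axis=0), -dj, axis=1)   # y(x + i)
        out += coeff * clip(neighbour - y, rho[(di, dj)])
    return out
```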
According to an embodiment, coefficients of the FIR filters associated with the classes 82 of the first set of classes are received as part of the bitstream 14.
According to an embodiment, the FIR filters associated with the classes of the first set 82 and the second set 82′ of classes comprise a diamond shape.
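For illustration, a diamond-shaped filter support of odd size can be constructed as the set of taps whose L1 distance from the center does not exceed the radius:

```python
import numpy as np

def diamond_mask(size):
    """Boolean mask of a diamond-shaped FIR support (size must be odd);
    diamond_mask(7) yields a 25-tap diamond as used by ALF luma filters."""
    r = size // 2
    i, j = np.indices((size, size))
    return (np.abs(i - r) + np.abs(j - r)) <= r
```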
In the following, referring to
According to an embodiment, referring to
According to an embodiment, decoder 20 performs the mode switching by use of a syntax element in the bitstream.
According to an embodiment, the syntax element is signalled in the bitstream 14 individually for
According to an embodiment, the second in-loop filter 66 performs the mode switching 68 by estimating a measure of complexity incurred by the second in-loop filter 66 or the one or more first modes 72 of the second in-loop filter within a predetermined video or picture section so far (e.g. number of multiplications per sample; e.g. by assuming a pre-set worst-case number of multiplications to be incurred each time the soft-classification is performed). The second in-loop filter 66 may check whether the estimation fulfills a predetermined criterion (e.g. exceeds a threshold), and if so, inferring that the syntax element, if same relates to (e.g. a block within . . . ) the predetermined video or picture section, assumes a predetermined value not corresponding to any first mode (e.g. “any of, i.e. each of, the one or more first modes”), or any first mode exceeding a predetermined complexity.
Alternatively, if the estimation fulfills the predetermined criterion, the second in-loop filter 66 may infer that the syntax element, if same relates to (e.g. a block within . . . ) the predetermined video or picture section, has a decreased value domain which excludes the one or more first modes, or any first mode (i.e. all those) exceeding the predetermined complexity, and is decreased relative to a complete value domain the syntax element has outside the predetermined video or picture section (e.g. if, or for sections for which, the predetermined criterion is not fulfilled), so that signaling at least one value in the decreased value domain, which does not correspond to any first mode, or any first mode exceeding the predetermined complexity, consumes a smaller bitrate than a corresponding value in the complete value domain (e.g. when using a truncated unary code).
According to an embodiment, the second in-loop filter 66 performs the mode switching 68 (e.g. inter alias) based on an estimation of a measure of complexity (e.g. number of multiplications per sample) incurred by the second in-loop filter or the one or more first modes of the second in-loop filter within a predetermined video or picture section so far by disabling the one or more first modes, or any first mode (i.e. all those) exceeding a predetermined complexity for the predetermined video or picture section if the estimation fulfills a predetermined criterion (e.g. exceeds a threshold).
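A decoder-side check of this kind might look as follows; the worst-case multiplication count and the budget are assumptions standing in for the pre-set values mentioned above, and the function name is illustrative:

```python
def first_modes_allowed(mults_so_far, samples_so_far,
                        worst_case_mults_per_sample,
                        budget_mults_per_sample, block_samples):
    """Returns True if a first mode may still be used for the next block of
    the predetermined section: applying the soft classification to the
    block (at the assumed worst-case cost) must not push the average
    number of multiplications per sample above the budget."""
    projected = mults_so_far + worst_case_mults_per_sample * block_samples
    return projected / (samples_so_far + block_samples) <= budget_mults_per_sample
```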
According to an embodiment, the second in-loop filter 66 performs the mode switching 68 by determining, within a predetermined picture area, a measure for prediction quality or prediction imperfection within the predetermined picture area. The second in-loop filter 66 may check whether the measure for prediction quality or prediction imperfection fulfills a further predetermined criterion (e.g. indicates that the prediction is poorer than a threshold), and if so, the second in-loop filter 66 may infer that the syntax element, if same relates to (e.g. a block within . . . ) the predetermined picture area, assumes a predetermined value not corresponding to any first mode, or any first mode exceeding a predetermined complexity. Alternatively, if the measure for prediction quality or prediction imperfection fulfills the further predetermined criterion, the second in-loop filter 66 may infer that the syntax element, if same relates to (e.g. a block within . . . ) the predetermined picture area, has a decreased value domain which excludes the one or more first modes, or any first mode exceeding the predetermined complexity, and is decreased relative to a complete value domain the syntax element has outside the predetermined picture area (e.g. if, or for areas for which, the further predetermined criterion is not fulfilled), so that signaling at least one value in the decreased value domain, which does not correspond to any first mode, or any first mode exceeding the predetermined complexity, consumes a smaller bitrate than a corresponding value in the complete value domain (e.g. when using a truncated unary code).
According to an embodiment, the second in-loop filter 66 performs the mode switching 68 (e.g. inter alia) based on a measure for prediction quality or prediction imperfection within a predetermined picture area by disabling the one or more first modes, or any first mode exceeding the predetermined complexity, for the predetermined picture area if the measure for prediction quality or prediction imperfection fulfills a further predetermined criterion.
According to an embodiment, the measure for prediction quality or prediction imperfection includes one or more of
According to an embodiment, the predetermined picture area is a coding tree-root block, a coding block, or a slice.
According to an embodiment, the second in-loop filter 66 performs the mode-switching 68 by determining a prediction type or inter-prediction hierarchy level of a picture. The second in-loop filter 66 may check whether the prediction type or inter-prediction hierarchy level fulfills an even further predetermined criterion, and if so, inferring that the syntax element, if same relates to (e.g. a block within . . . ) the picture, assumes a predetermined value not corresponding to any first mode, or any first mode exceeding a predetermined complexity. Alternatively, if the prediction type or inter-prediction hierarchy level fulfills the even further predetermined criterion, the second in-loop filter 66 may infer that the syntax element, if same relates to (e.g. a block within . . . ) the picture, has a decreased value domain which excludes the one or more first modes, or any first mode exceeding (i.e. whose complexity exceeds) a predetermined complexity, and is decreased relative to a complete value domain the syntax element has outside the picture (e.g. if, or for pictures for which, the even further predetermined criterion is not fulfilled), so that signaling at least one value in the decreased value domain, which does not correspond to any first mode, or any first mode exceeding the predetermined complexity, consumes a smaller bitrate than a corresponding value in the complete value domain (e.g. when using a truncated unary code).
According to an embodiment, the second in-loop filter 66 performs the mode switching 68 (e.g. inter alia) based on prediction type or inter-prediction hierarchy level of a picture by disabling the one or more first modes, or any first mode exceeding a predetermined complexity, for the picture if the measure for prediction quality or prediction imperfection fulfills an even further predetermined criterion.
According to an embodiment, the prediction type indicates whether the picture is inter-predicted based on reference pictures preceding and succeeding the picture in presentation time order, with the even further predetermined criterion being fulfilled if this is the case, and/or the inter-prediction hierarchy level of a picture indicates a temporal hierarchy level of the picture in a GOP, with the even further predetermined criterion being fulfilled if the hierarchy level exceeds some threshold.
According to an embodiment, the second in-loop filter 66 performs the mode switching 68 by checking whether a predetermined picture portion has at least one reference picture which succeeds a picture of the predetermined picture portion in presentation time order, and if so, inferring that the syntax element, if same relates to the predetermined picture portion, assumes a predetermined value not corresponding to any first mode, or any first mode exceeding a predetermined complexity. Alternatively, if the predetermined picture portion has at least one reference picture which succeeds a picture of the predetermined picture portion in presentation time order, the second in-loop filter 66 may infer that the syntax element, if same relates to the predetermined picture portion, has a decreased value domain which excludes the one or more first modes, or any first mode exceeding a predetermined complexity, and is decreased relative to a complete value domain the syntax element has outside the predetermined picture portion (e.g. if, or for picture portions for which, this is not the case), so that signaling at least one value in the decreased value domain, which does not correspond to any first mode, or any first mode exceeding the predetermined complexity, consumes a smaller bitrate than a corresponding value in the complete value domain (e.g. when using a truncated unary code).
According to an embodiment, the second in-loop filter 66 performs the mode switching 68 (e.g. inter alia) in dependence on whether for a predetermined picture portion at least one reference picture succeeds a picture of the predetermined picture portion in presentation time order by disabling the one or more first modes, or any first mode exceeding the predetermined complexity, for the predetermined picture portion if this is the case.
According to an embodiment, the predetermined picture portion is a slice or a whole picture.
According to an embodiment, the second in-loop filter 66 performs the mode switching 68 by checking whether a further predetermined picture portion is, within at least one block, or completely, intra coded, and if so, the second in-loop filter 66 may infer that the syntax element, if same relates to the further predetermined picture portion, assumes a predetermined value not corresponding to any first mode, or any first mode exceeding a predetermined complexity. Alternatively, if the further predetermined picture portion is, within at least one block, or completely, intra coded, the second in-loop filter 66 may infer that the syntax element, if same relates to the further predetermined picture portion, has a decreased value domain which excludes each first mode, or any first mode exceeding a predetermined complexity, and is decreased relative to a complete value domain the syntax element has outside the further predetermined picture portion (e.g. if, or for further picture portions for which, this is not the case), so that signaling at least one value in the decreased value domain, which does not correspond to any first mode, or any first mode exceeding the predetermined complexity, consumes a smaller bitrate than a corresponding value in the complete value domain (e.g. when using a truncated unary code).
According to an embodiment, the second in-loop filter 66 performs the mode switching 68 (e.g. inter alia) in dependence on whether a further predetermined picture portion is, within at least one block, or completely, intra coded by disabling the one or more first modes, or any first mode exceeding a predetermined complexity, for the further predetermined picture portion if this is the case.
According to an embodiment, the further predetermined picture portion is a slice, a whole picture, a CTU, or a CU.
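The preceding switching criteria can be combined into a single eligibility check. The following sketch is purely illustrative: the field names (residual_energy, temporal_level, has_future_reference, is_intra), the complexity ranks, and all thresholds are hypothetical placeholders, not signalled syntax.

```python
LOW_COMPLEXITY = 1  # illustrative rank of the least complex first mode

def eligible_first_modes(first_modes, block, picture, budget_ok):
    """Returns the subset of first modes whose complexity is still allowed
    for the given block/picture under the criteria discussed above."""
    max_complexity = float("inf")
    if not budget_ok:                    # multiplications-per-sample budget reached
        max_complexity = 0
    if block.residual_energy == 0:       # good prediction: little left to filter
        max_complexity = 0
    if picture.temporal_level > 2:       # high hierarchy level: rarely referenced
        max_complexity = min(max_complexity, LOW_COMPLEXITY)
    if picture.has_future_reference:     # references pictures of later presentation time
        max_complexity = min(max_complexity, LOW_COMPLEXITY)
    if block.is_intra:                   # intra blocks: milder prediction artefacts
        max_complexity = min(max_complexity, LOW_COMPLEXITY)
    return [m for m in first_modes if m.complexity <= max_complexity]
```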
According to an embodiment, the soft classification is adapted to provide for a number of at most 35000, e.g., 29873, trained parameters.
In the following, referring to
In more general words, according to an embodiment, each class of the first set of classes 82 has a first FIR filter and a second FIR filter associated therewith, and the second in-loop filter 66 performs the soft classification for first pre-reconstructed samples 12″ (e.g. those for which soft classification is to be used) by assigning 81, for each first pre-reconstructed sample, a classification value 84 to each of the first set of classes 82. According to this embodiment, the second in-loop filter 66 performs the adaptive in-loop filtering, in case of using the soft classification for the assigning the classification, by applying 87, for each class of the first set of classes 82, the first FIR filter associated with the respective class onto the pre-reconstructed samples 12″ to obtain a first filtered version, e.g. filter results 86 in
According to an embodiment, the second in-loop filter 66 switches, based on the bitstream 14, between the two alternatives of performing the soft-classification described with respect to
In other words, according to an embodiment, the second in-loop filter 66 switches, based on the bitstream 14, between
According to an embodiment, the second in-loop filter 66 performs the switching between performing the soft classification for first pre-reconstructed samples in the first or second manner in units of one or more of
According to an embodiment, the second in-loop filter 66 performs the switching between performing the soft classification for first pre-reconstructed samples in the first or second manner (e.g. inter alia) based on an estimation of a measure for multiplications per sample incurred by the second in-loop filter for the current picture so far by disabling the soft classification if the estimation fulfills a predetermined criterion (e.g. exceeds a threshold).
For example, encoder 10 may comprise an encoding module 31 to encode the video signal 12 representing the video. For example, the video signal may represent a sequence of pictures, which may be encoded by encoding module 31 according to a coding order. For example, the prediction loop 71 may be formed in that encoder 10 reconstructs the encoded signal provided by encoding module 31 to derive a reconstructed signal 12′, e.g. signal 46 of
Further in the description of the prediction loop 71, the prediction signal 26 may be used for predicting a portion of the signal 12 and for reconstructing the same portion in the prediction loop 71, see combiner 42. Combiner 42 may combine the prediction signal 26 with a reconstructed residual signal 24″″ derived by decoding module 33 from the encoded signal provided by the encoding module 31. For example, decoding module 33 may perform the inverse operation of encoding module 31, e.g. apart from coding loss introduced by quantization. For example, encoding module 31 may correspond to transformer 28 and quantizer 32 and decoding module 33 may correspond to dequantizer 38 and inverse transformer 40 of
It is noted that encoder 10 may comprise entropy coder 34, e.g. as illustrated in
Insofar, the block-based predictive encoding and the transform-based residual encoding may be performed by encoding module 31, e.g. in combination with the prediction loop 71, in particular in combination with the prediction module 44. It is noted, however, that the implementation of the prediction loop 71 illustrated in
The in-loop filtering tool 62 may be implemented as described with respect to
Further, referring to the embodiments described with respect to decoder 20, if it is described that decoder 20 infers the value of a syntax element, the encoder 10 may treat this syntax element as being required to be inferred by the decoder, and therefore may refrain from encoding the syntax element into the bitstream. Encoder 10 may derive the value of the syntax element based on the same measures/criteria as described with respect to the decoder, and may perform the mode switching accordingly.
Some aspects developed above shall be repeated hereinbelow again.
Aspect I (Switching between soft-classification based in-loop filters/conventional ALF/no ALF so that some complexity threshold in terms of average number of multiplications per sample is not exceeded):
One or several types of soft-classification based in-loop filters are supported, which may have different complexities. For each block of samples, at most one of these soft-classification based in-loop filters may be applied or none of them may be applied, where in the latter case, either the Adaptive Loop Filter with hard classification or no additional loop filter may be applied, and where the switching between all these configurations (the different soft-classification based in-loop filters and the hard-classification case/no-loop-filter case) is always done such that the number of multiplications per sample required by the execution of all soft-classification based in-loop filters, measured on average over some unit or sub-portion of the decoded video-sequence, does not exceed a specific maximal threshold.
The switching may be signaled on a block-level. If a maximal threshold in terms of number of multiplications for a given unit or sub-portion has been reached and if a block still belongs to the given unit or sub-portion, it is automatically inferred that no soft-classification based in-loop filter is supported for this block.
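As a non-normative illustration of Aspect I, the following Python sketch tracks such a multiplication budget; the filter identifiers, cost values and the accounting granularity are assumptions for illustration only, not part of any codec specification.

```python
# Sketch of Aspect I: a multiplication budget per unit (e.g. per picture)
# of the decoded video sequence. Names and numbers are illustrative.

FILTER_COST = {"cnn_small": 5000, "cnn_large": 13000}  # worst-case mult/sample

class MultiplicationBudget:
    def __init__(self, max_avg_mults_per_sample, unit_samples):
        self.budget = max_avg_mults_per_sample * unit_samples
        self.spent = 0

    def fits(self, filter_id, block_samples):
        """True if applying filter_id to the block keeps the average number
        of multiplications per sample of the unit below the threshold."""
        return self.spent + FILTER_COST[filter_id] * block_samples <= self.budget

    def commit(self, filter_id, block_samples):
        self.spent += FILTER_COST[filter_id] * block_samples

def decode_block_mode(budget, block_samples, parse_mode):
    """parse_mode: callable parsing the signalled mode from the bitstream.
    Once no soft-classification filter fits the remaining budget, the mode
    is inferred (hard-classification ALF or no loop filter) without parsing."""
    if not any(budget.fits(f, block_samples) for f in FILTER_COST):
        return "alf_or_off"                  # inferred, not signalled
    mode = parse_mode()
    if mode in FILTER_COST and budget.fits(mode, block_samples):
        budget.commit(mode, block_samples)
    return mode
```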
Aspect II: (Switching between soft-classification based in-loop filters/conventional ALF/no ALF depends on the prediction residual. The ‘more’ residual, the more complex the soft-classification based in-loop filter may be):
At least one soft-classification based in-loop filter+ALF+No-Inloop filter are supported, where the specific supported soft-classification based in-loop filter for a given block or the specific set of soft-classification based in-loop filters supported for the given block or the selection of whether any of the soft-classification based in-loop filters is to be applied at all on the given block depends on whether for the given block or for some sub-block of the given block, a prediction residual is coded in the bit-stream or where this selection depends on some specific quantity derived from the coded prediction residual for the given block or the sub-blocks of it, for example the number of coded non-zero transform coefficients, the energy of the coded transform coefficients etc.
In one specific embodiment, the application of any of the soft-classification based in-loop filters is completely prohibited for the case that for no sub-block of the given block, a prediction residual is coded in the bit-stream. In this case, any configuration flag indicating whether the soft-classification based in-loop filter is to be used at all is inferred at a decoder to be false.
In another specific embodiment, only a soft-classification based in-loop filter that requires a number of multiplications per sample which is strictly smaller than that of some other soft-classification based in-loop filter which is supported on some other blocks is supported for blocks which have the property that for no sub-block of them, a prediction residual was coded in the bit-stream.
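A minimal sketch of such a residual-based restriction, assuming a duck-typed block object carrying quantized transform coefficients per sub-block (attribute names and the threshold are illustrative):

```python
# Sketch of Aspect II: the permitted set of soft-classification based
# in-loop filters depends on the coded prediction residual of a block.

def permitted_soft_filters(block):
    """Return the soft-classification filters allowed for this block.
    block.subblocks: iterable of sub-blocks, each carrying its quantized
    transform coefficients in sb.coeffs (empty or all zero: no residual)."""
    has_residual = any(any(c != 0 for c in sb.coeffs) for sb in block.subblocks)
    if not has_residual:
        # No coded residual anywhere in the block: all soft-classification
        # filters prohibited; the decoder infers any enable flag to be false.
        return []
    num_nonzero = sum(sum(1 for c in sb.coeffs if c != 0)
                      for sb in block.subblocks)
    # The 'more' residual, the more complex the permitted filter may be.
    if num_nonzero > 16:                     # illustrative threshold
        return ["cnn_small", "cnn_large"]
    return ["cnn_small"]
```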
Aspect III: (Switching between soft-classification based in-loop filters/conventional ALF/no ALF depends on the position of the frame in the hierarchy between frames used for inter-prediction. Blocks on non-key frames may not use soft-classification based in-loop filters. Blocks on key frames may use the most complex soft-classification based in-loop filters. Here, key frames are characterized as those frames which may refer only to past but not to future frames in output order.):
At least one soft-classification based in-loop filter+ALF+No-Inloop filter are supported, where the specific supported soft-classification based in-loop filter for a given block or the specific set of soft-classification based in-loop filters supported for the given block or the selection of whether any of the soft-classification based in-loop filters is to be applied at all on the given block depends on whether for the given frame/slice etc. that the block belongs to, reference samples for inter-prediction are available that belong to other frames/slices etc. which in the temporal-output order of the sequence lie in the future of the given frame/slice etc. that the block belongs to.
In one specific embodiment, the application of any of the soft-classification based in-loop filters is completely prohibited for the case that for the given frame/slice etc. that the given block belongs to, reference samples for inter-prediction are available that belong to other frames/slices etc. which in the temporal-output order of the sequence lie in the future of the given frame/slice etc. that the block belongs to.
In another specific embodiment, only soft-classification based in-loop filters that require a number of multiplications per sample which is strictly smaller than that of some other soft-classification based in-loop filters which are supported on some other blocks are supported for blocks which have the property that for the frame/slice that they belong to, reference samples for inter-prediction are available that belong to other frames/slices which in the temporal-output order of the sequence lie in the future of the given frame/slice that the block belongs to.
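Analogously, a small sketch for Aspect III, assuming frames expose a picture order count and their reference frames (attribute names are illustrative):

```python
# Sketch of Aspect III: restrict soft-classification based in-loop filters
# for blocks of frames that may reference future frames in output order.

def permitted_soft_filters_for_frame(frame):
    references_future = any(ref.poc > frame.poc
                            for ref in frame.reference_frames)
    if references_future:
        # Non-key frame: only cheaper filters (or return [] to prohibit
        # soft classification entirely, cf. the first embodiment above).
        return ["cnn_small"]
    # Key frame (references only past frames in output order): the most
    # complex soft-classification based in-loop filters may be used.
    return ["cnn_small", "cnn_large"]
```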
Aspect IV (Switching between soft-classification based in-loop filters/conventional ALF/no ALF depends on whether intra-coded blocks are present or whether whole block is intra-coded. The ‘more intra’, the more complex the soft-classification based in-loop filter may be):
At least one soft-classification based in-loop filter+ALF+No-Inloop filter are supported, where the specific supported soft-classification based in-loop filter for a given block or the specific set of soft-classification based in-loop filters supported for the given block or the selection of whether any of the soft-classification based in-loop filters is to be applied at all on the given block depends on the number of intra-predicted samples in the given block.
In a specific embodiment, only soft-classification based in-loop filters that require a number of multiplications per sample which is strictly smaller than that of some other soft-classification based in-loop filters which are supported on some other blocks are supported for blocks which have the property that for none of their sub-blocks, intra-prediction was applied.
In the following a performance-complexity analysis of an adaptive loop filter is described, and based thereon, embodiments for video encoders/decoders are derived and described. The features of the embodiments described in the following may be combined with any of the embodiments described with respect to
According to embodiments, the ALF may use CNN-based Classification.
According to an embodiment, the signal-modification is generated by a weighted sum of FIR-filterings. The weights may vary per sample and are computed by an offline-trained CNN. They can be interpreted as probabilities for a sample to belong to a specific class.
Convolutional neural network (CNN)-based in-loop filters are used for video coding and show great potential. However, one of the main issues of this approach is the high computational complexity of these filters. In the following, we present various settings for CNN-based in-loop filters targeting the reduction of their decoder-side complexity and describe the corresponding gain-complexity trade-offs. To this end, an effective complexity measure is used. Experiments show that it is possible to notably reduce this value for some CNN-based in-loop filters while maintaining similar average BD-rate savings, e.g. over Versatile Video Coding (VVC).
The following part of the description is structured as follows. Firstly, an embodiment of an ALF algorithm and a CNN-based in-loop filter as introduced in [11] is described. Thereafter, various variants of the CNN-based in-loop filter providing a further reduction of its complexity are described. Finally, simulation results are shown.
In the following, an embodiment of a CNN-based in-loop filter is described, as it may optionally be implemented by the second in-loop filter 66. ALF partitions the reconstructed samples y into L=25 classes C_k. The samples of each such class are filtered with an FIR filter f_k. Thus, ALF generates the reconstructed filtered frame ŷ according to

ŷ(i) = Σ_{k=1}^{L} χ_{C_k}(i)·(y*f_k)(i), i ∈ I,  (1)

where χ_{C_k} denotes the indicator function of the class C_k and where I denotes the set of all sample locations.
In the following, an embodiment is described with respect to
A natural extension of (1), where the ALF classification χ_{C_k} is replaced by a soft classification provided by a CNN, is given by

ŷ(i) = Σ_{k=1}^{L} ϕ_k(y|Θ)(i)·(y*f_k)(i), i ∈ I.  (2)
Here, ϕ_1, . . . , ϕ_L denote the classification outputs of a trained CNN-based classifier with trained parameters Θ, and f_k denote FIR filters that are also determined during training. The process (2) can be seen as an extension of (1) where the ALF classification functions are replaced by more general classification functions ϕ_k. The model architecture of the CNN-based classifier ϕ_k(y|Θ) is described with respect to
As in the nonlinear ALF of VVC, the 2D-convolutions y*f_k comprise a clipping of sample differences,

(y*f)(j) = y(j) + Σ_i f(i)·Clip(y(j−i) − y(j), ρ(i)),  (3)

where j is the output sample location, i denotes the sample locations in the support of f and ρ(i) are trained parameters. Here, Clip is the clipping function defined by Clip(d, b) = min(b, max(−b, d)). For notational simplicity, we shall denote the 2D-convolution including the clipping still by y*f_k. A similar clipping operation is also applied for the first convolutional layer of the classifier, as displayed in
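The clipped 2D-convolution of (3) could, e.g., be realized as in the following numpy sketch; kernel shape, border handling and the correlation-style indexing are implementation assumptions:

```python
import numpy as np

def clip(d, b):
    """Clip(d, b) = min(b, max(-b, d)), applied elementwise."""
    return np.minimum(b, np.maximum(-b, d))

def clipped_conv(y, f, rho):
    """Clipped 2D-convolution y*f as in (3): differences to the center
    sample are clipped with trained bounds rho(i) before weighting.
    y: 2D frame; f, rho: (K, K) arrays, K odd. Borders are handled by
    edge padding; the i = 0 term vanishes since y(j) - y(j) = 0."""
    K = f.shape[0]
    r = K // 2
    ypad = np.pad(y, r, mode="edge").astype(np.float64)
    out = np.zeros(y.shape, dtype=np.float64)
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            window = ypad[r + di:r + di + y.shape[0], r + dj:r + dj + y.shape[1]]
            out += f[di + r, dj + r] * clip(window - y, rho[di + r, dj + r])
    return y + out   # center sample plus clipped, weighted differences
```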
Finally, in order to better adapt to specific signal characteristics, according to an embodiment, an additional filtering process is used, now with adaptive filters f̃_k that are transmitted in the bit-stream and are optimized at the encoder for each input frame. This additional 2nd filtering step is performed after filtering with the f_k and is defined as

ỹ(i) = Σ_{k=1}^{L} ϕ_k(y|Θ)(i)·(ŷ*f̃_k)(i), i ∈ I.  (4)
Here, the filters f̃_k are computed such that the mean squared error between the target frame and the filtered reconstructed frame is minimized. We refer to [11] for a more detailed description of the CNN-based in-loop filter defined in (2) and (4).
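Taken together, (2) and (4) amount to per-sample weighted sums of FIR filterings, as in this sketch; the CNN classifier is abstracted as precomputed per-class probability maps, and all shapes are assumptions:

```python
import numpy as np

def soft_classified_filtering(y, phis, filters, conv):
    """Weighted sum of FIR filterings as in (2) and (4): the weights vary
    per sample and are the soft classification outputs of the CNN.
    y: 2D frame; phis: (L, H, W) per-class probability maps summing to 1
    per sample; filters: L kernels; conv: 2D filtering routine, e.g.
    lambda y, f: clipped_conv(y, f, rho) from the sketch above."""
    out = np.zeros(y.shape, dtype=np.float64)
    for phi_k, f_k in zip(phis, filters):
        out += phi_k * conv(y, f_k)
    return out

# 1st stage (2) with trained filters f_k, 2nd stage (4) with adaptive
# filters signalled in the bitstream, applied to the 1st stage's output:
# y_hat   = soft_classified_filtering(y,     phis, trained_filters,  conv)
# y_tilde = soft_classified_filtering(y_hat, phis, adaptive_filters, conv)
```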
In the following, CNN-based In-Loop Filters with various complexities according to embodiments are described, which may be variants of the CNN-based in-loop filter discussed above with respect to equations (2) to (4), and which may optionally be embodiments of the second in-loop filter 66. For example, all of the following embodiments may share the same basic structure consisting of a CNN-based classifier ϕ_k and the filtering process (y*f_k) as described in (2). All variants are generated from the original 7-layer model presented in [11] and discussed above by modifying the number of channels for some of the BLGs, adding some further BLGs or introducing skip connections [18] between some of the BLGs. Here, a skip connection between the i-th and the j-th BLG is realized by adding the i-th BLG's input to the output of the (j−1)-th BLG's activation sub-layer and using the result as the input for the j-th layer. Note that, like the original 7-layer model, all variants make use of the additional input data, QP, yDBF and Pred which are fed as inputs to the first BLG. Furthermore, also like the original 7-layer model, all variants may optionally share the maximum pooling operation with a 3×3 window followed by a downsampling by a factor of two which is applied to the second BLG's output. For all variants, this subsampling may optionally be reverted by an upsampling step with trained interpolation filters in the last BLG which is again identical to the original 7-layer architecture.
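The described skip connection between BLGs could be sketched in PyTorch as follows, with a BLG modeled, for illustration only, as a convolution plus activation sub-layer (channel counts and kernel sizes are placeholders, not the architectures of Tables 1-2):

```python
import torch
import torch.nn as nn

class BLG(nn.Module):
    """Basic layer group, modeled here as convolution + activation only."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(self.conv(x))

class BLGsWithSkip(nn.Module):
    """Skip connection between the i-th and the j-th BLG: the i-th BLG's
    input is added to the output of the (j-1)-th BLG's activation
    sub-layer, and the sum is used as the input of the j-th BLG."""
    def __init__(self, channels, span=2):          # span = j - i
        super().__init__()
        self.inner = nn.ModuleList(BLG(channels, channels) for _ in range(span))
        self.blg_j = BLG(channels, channels)

    def forward(self, x):
        h = x                       # input of the i-th BLG
        for blg in self.inner:      # BLGs i, ..., j-1
            h = blg(h)
        return self.blg_j(h + x)    # add skip, feed the j-th BLG
```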
Exemplary embodiments, to which the experiments discussed below refer, include the following:
The total worst-case number of multiplications per luma-pixel for (2) associated with each of the models is illustrated in Table 3. These values can easily be derived from the model architectures given by Tables 1-2. We refer to [11] for more details about this.
According to embodiments, a residual-based criterion for the CNN-based in-loop filter is applied. One of the main targets of the proposed CNN-based in-loop filters is the reduction of the error introduced by inaccurate prediction signals and quantization noise in the reconstructed transform coefficients. Embodiments of the invention rely on the finding that it is a valid assumption to expect the filters to have only a minor effect for blocks where the prediction is accurate enough, i.e. where the prediction residual is zero. As this is often the case, especially for the deeper temporal levels of inter prediction, there are numerous blocks where one can expect the effect of the in-loop filters on the coding gain to be relatively small compared to the complexity overhead introduced by the CNNs. Therefore, one approach provided by embodiments is to improve the trade-off by disallowing the CNN-based in-loop filters for all input blocks where the quantized prediction residual is zero. This approach can be applied to any of the above-described CNN-based in-loop filter architectures. However, in order to show the effect of the residual-based criterion, we chose the 7-layer model described above and in Table 1 for the experiments presented below. In addition to the residual-based restrictions during inference, in embodiments, the training of the CNN was also slightly modified compared to the 7-layer model described above. In particular, all samples where the quantized prediction residual was zero were excluded from the training loss in order to put the focus on the samples with non-zero residual.
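The described exclusion of zero-residual samples from the training loss could, e.g., look as follows; a sketch assuming the quantized prediction residual is available per training sample:

```python
import torch

def residual_masked_mse(y_filtered, x_target, quant_residual):
    """MSE training loss restricted to samples with non-zero quantized
    prediction residual; zero-residual samples are excluded, mirroring
    the inference-time restriction of the residual-based criterion."""
    mask = (quant_residual != 0).to(y_filtered.dtype)
    n = mask.sum().clamp(min=1.0)            # avoid division by zero
    return (mask * (y_filtered - x_target) ** 2).sum() / n
```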
In the following, simulation results for some embodiments of in-loop filters are presented, which are based on various models with different complexities and provide performance-complexity analysis for them. For this, two models were selected among the models mentioned above and trained based on the BVI-DVC data set [19] where only the luma-components of the signals were used for training. The training data was generated by compressing the raw video data by the VVC test model version VTM-13.0 [20] under the RA configuration with QPs from the set {22, 27, 32, 37, 42} and extracting the reconstructed frames before ALF as well as the reconstructed frames before any in-loop filter and the prediction signal. The first model was trained on I-frames while the 2nd model was trained on B-frames.
In technical terms, the training made use of the Adam optimization [21] with the mean squared error (MSE) loss function

L(Θ) = ‖Σ_{k=1}^{L} ϕ_k(y|Θ)·(y*f_k) − x‖₂²

for the input and target frames y and x. For the 9-layer models, this loss function was modified to

L̃(Θ) = ‖Σ_{k=1}^{L} c_k·ϕ_k(y|Θ)·(y*f_k) − x‖₂²,

which adds scaling coefficients c_k for the individual classes which are derived by a Gram-Schmidt process [22]. The main purpose of this loss function is to simulate the 2nd filtering process (4) during the training of the CNN in-loop filter so that it is better adapted to that process. The training data batches were formed from randomly selected square blocks from the original sequences and the corresponding blocks in the reconstructed frames before ALF, the reconstructed frames before any in-loop filter and the prediction signal. In order to mitigate boundary effects, the blocks were extended by 8 samples on either side. The resulting extended block size was 166 for the 9-layer model and 80 for all other models.
After the training, the CNN-based in-loop filter was integrated into VTM-13.0 so that the first model is applied to frames of the lowest temporal level, which consists of I-frames and B-frames referencing only other frames of the lowest temporal level, while the second model is applied to all other frames. Whether the CNN-model corresponding to a frame's temporal level or the original ALF is to be applied is signalled on frame level and decided by an RD-decision at the encoder. If a CNN-model is applied, it can be switched on and off on CTU level, where the switch is signalled. Moreover, it is also signalled on CTU level whether, additionally, the 2nd filtering from (4) is to be applied or not. For the 2nd filtering, the filters f̃_k are determined at the encoder by conducting an RD-search that is similar to the determination of the filter coefficients in the ALF-encoder of VTM. The filter coefficients are then signalled per frame. The CNN-based in-loop filter described herein is applied to the luma component only. For the chroma components, chroma-ALF and Cross-Component ALF (CCALF) [13] of VVC are still applied.
All experiments were conducted using the AI and RA configurations of the JVET common test conditions [23] with two sets of QP values, {22, 27, 32, 37} (low QP) and {27, 32, 37, 42} (high QP).
From the models described above, the following combinations of first and second models were chosen for evaluation:
During the RD-search, when deciding whether to enable the CNN-based in-loop filter on CTU level, we replace the original RD-cost CTU_CNN_inloop_filter_cost for applying the CNN-based in-loop filter on the given CTU by ρ·CTU_CNN_inloop_filter_cost where ρ > 1 is a constant. Note that one can reduce the overall effective complexity of the CNN-based in-loop filter by choosing a larger value for ρ so that the filter is applied less frequently based on the RD-search. In particular, we have three test settings where we choose ρ1 = 1.005, ρ2 = 1.007 and ρ3 = 1.010, respectively.
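The scaled RD decision may be sketched as follows; the cost variables are placeholders for the encoder's internal RD costs:

```python
def apply_cnn_filter_on_ctu(ctu_cnn_cost, ctu_best_alternative_cost, rho=1.007):
    """Scaled RD decision: the CNN in-loop filter is only selected if its
    RD cost, penalized by a factor rho > 1, still beats the best
    alternative (ALF or no filtering). A larger rho means the CNN filter
    is chosen less often, lowering the average effective complexity."""
    return rho * ctu_cnn_cost < ctu_best_alternative_cost
```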
For the evaluation, the effective complexity of the CNN-based in-loop filter for a given input frame is measured as

C_eff = 128²·(n_CNN·m_CNN + n_2nd·m_2nd) / N_input-frame,

where n_CNN and n_2nd are the numbers of 128×128-CTU blocks where the CNN-based in-loop filters (2) and (4) are applied, respectively. m_CNN is the total worst-case number of multiplications per luma-pixel for the CNN-based in-loop filter (2) associated with the model applied for the input frame—see Table 3. Similarly, m_2nd is the total worst-case number of multiplications per luma-pixel for the CNN-based in-loop filter (4), given by the sum of m_CNN and the number of multiplications per luma-pixel for the 2nd filtering with the adaptive filters f̃_k—we refer to [11] for the complexity of the 2nd filtering. Finally, N_input-frame is the total number of samples in the input frame. The average effective complexity of the CNN-based in-loop filter is then given by taking the average of the effective complexities C_eff over all frames of all the input video sequences and over all QPs in the respective QP range (low/high-QP).
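Assuming the symbol names used above, the per-frame effective complexity and its average over frames and QPs could be computed as in this small sketch:

```python
CTU_SAMPLES = 128 * 128   # luma samples per 128x128 CTU block

def effective_complexity(n_cnn, n_2nd, m_cnn, m_2nd, n_frame_samples):
    """Per-frame effective complexity in multiplications per sample:
    n_cnn/n_2nd CTU blocks filtered with (2)/(4) at m_cnn/m_2nd worst-case
    multiplications per luma-pixel, normalized by the frame size."""
    return (n_cnn * m_cnn + n_2nd * m_2nd) * CTU_SAMPLES / n_frame_samples

def average_effective_complexity(per_frame_stats):
    """per_frame_stats: (n_cnn, n_2nd, m_cnn, m_2nd, n_frame_samples) per
    frame, over all sequences and all QPs of the respective QP range."""
    values = [effective_complexity(*s) for s in per_frame_stats]
    return sum(values) / len(values)
```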
Note that the highest BD-rate saving is obtained by the 11/11 setting at the cost of the highest average effective complexity as illustrated in
To summarize, the experimental results for the above-described analysis show that one can still achieve notable BD-rate savings over VVC with significantly reduced complexity compared to our previous work. In particular, using the above-described 9/9 (ρ1) setup may provide a similar BD-rate reduction of 4.41%/4.59% (for luma, low/high-QP) under the RA configuration with a reduced average effective complexity of only 6.79/7.00 kmul/sample, compared to 4.39%/4.33% at 13.95/13.17 kmul/sample for the 7/7 setting [11]. Thus, the effective complexity was reduced from about 14 kmul/sample to about 7 kmul/sample while the overall coding gain essentially remained the same.
In the following, further implementation alternatives are described, referring to all of the embodiments described above.
Although some aspects have been described as features in the context of an apparatus, it is clear that such a description may also be regarded as a description of corresponding features of a method. Although some aspects have been described as features in the context of a method, it is clear that such a description may also be regarded as a description of corresponding features concerning the functionality of an apparatus.
In particular, it is noted that
Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
The inventive encoded image signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet. In other words, further embodiments provide a video bitstream product including the video bitstream according to any of the herein described embodiments, e.g. a digital storage medium having stored thereon the video bitstream.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software or at least partially in hardware or at least partially in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods may be performed by any hardware apparatus.
The apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
The methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
In the foregoing Detailed Description, it can be seen that various features are grouped together in examples for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed examples require more features than are expressly recited in each claim. Rather, as the following claims reflect, subject matter may lie in less than all features of a single disclosed example. Thus the following claims are hereby incorporated into the Detailed Description, where each claim may stand on its own as a separate example. While each claim may stand on its own as a separate example, it is to be noted that, although a dependent claim may refer in the claims to a specific combination with one or more other claims, other examples may also include a combination of the dependent claim with the subject matter of each other dependent claim or a combination of each feature with other dependent or independent claims. Such combinations are proposed herein unless it is stated that a specific combination is not intended. Furthermore, it is intended to include also features of a claim to any other independent claim even if this claim is not directly made dependent to the independent claim.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
This application is a continuation of copending International Application No. PCT/EP2023/069589, filed Jul. 13, 2023, which is incorporated herein by reference in its entirety, and additionally claims priority from European Application No. 22185052.2, filed Jul. 14, 2022, which is also incorporated herein by reference in its entirety.
Related Application Data: parent application PCT/EP2023/069589 (WO), filed Jul. 2023; child application 19018293 (US).