The present invention generally relates to methods and systems for learned (learning-based) video compression.
Over the last few decades, video coding standards such as Advanced Video Coding (AVC) [1], High Efficiency Video Coding (HEVC) [2], and Versatile Video Coding (VVC) [3] have followed the classical block-based hybrid video compression framework and have been developed with various designs. However, with the rapid growth of video usage and the introduction of Ultra-High-Definition video services, the growth of video data has outpaced the improvement in compression ratio [4]. Therefore, it is important to further explore video compression, e.g., for delivering high-quality video data at a given bit rate under the limited capacity of networks and storage. As deep learning has achieved success in many fields [5]-[7] due to its powerful representation ability, learning-based compression has attracted increasing interest and achieved notable improvements, and hence can be further explored [8]-[10].
For learning-based image compression, Ballé et al. proposes a basic CNN framework in [11], which transforms the image into latent code and then reconstructs it with an inverse transformation. To make the whole framework trainable end-to-end, uniform noise is added to approximate quantization during the training process. With sufficient joint optimization under a single loss function, learning-based image compression with discretized Gaussian mixture likelihoods [12] has achieved performance comparable to the latest traditional video standard VVC, revealing the effectiveness of neural networks in removing spatial redundancy.
In addition to spatial redundancy, temporal redundancy is another component to be reduced in the video setting, and learning-based video compression has also made progress. Lu et al. proposes a predictive end-to-end learned video compression framework (DVC) in [13]. In DVC, the residual between the input frame and the predicted frame is calculated, and then the motion vector and residual are separately compressed with the entropy model. However, the entropy of residue coding is greater than or equal to that of conditional coding [14]. Therefore, Li et al. proposes the deep contextual video compression framework (DCVC), which extracts a valuable context as a condition for the contextual encoder-decoder and entropy model to compress the current frame [15]. Even though DCVC, with its testing code released, has shown the effectiveness of the conditional coding paradigm, improvements to DCVC may be desirable.
In DCVC, there are only a resblock and a convolution layer for context refinement. Also, the context is generated without supervision, and redundancy may exist among context channels. However, the context is important in the conditional framework, as it is introduced to the contextual encoder-decoder and entropy model for compression.
Therefore, some embodiments of the invention propose an enhanced context mining (ECM) model to reduce the redundancy across context channels. Specifically, to take advantage of the high-dimension context, some embodiments of the invention apply convolution and residual learning along the context channels. Thus, the latent clean context in the hidden layers is implicitly kept for the contextual and entropy coding models.
Also, the error propagation problem exists in DCVC, as shown in
In some embodiments of the invention, with the enhanced context mining model and transformer-based post-enhancement backend network, less error propagation and better compression efficiency can be obtained. Meanwhile, the proposed models in some embodiments may be extended to other learned coding methods that are extended from the DCVC framework. Some example contributions of some embodiments of the invention include:
According to an aspect of the invention, there is provided a computer-implemented method for learned video compression, which includes processing a current frame (xt) and previously decoded frame ({circumflex over (x)}t−1) of a video data using a motion estimation model to estimate a motion vector (vt) for every pixel, compressing the motion vector (vt) and reconstructing the motion vector (vt) to a reconstructed motion vector ({circumflex over (v)}t), applying an enhanced context mining (ECM) model to obtain enhanced context ({umlaut over (C)}E) from the reconstructed motion vector ({circumflex over (v)}t) and previously decoded frame feature (x̆t−1), compressing the current frame (xt) with the assistance of the enhanced context ({umlaut over (C)}E) to obtain a reconstructed frame ({circumflex over (x)}t′), and providing the reconstructed frame ({circumflex over (x)}t′) to a post-enhancement backend network to obtain a high-resolution frame ({circumflex over (x)}t).
In some embodiments, the motion estimation model may be based on a spatial pyramid network.
In some embodiments, applying the enhanced context mining (ECM) model may include utilizing cross-channel interaction and residual learning operation to reduce redundancy across context channels.
In some embodiments, applying the enhanced context mining model (ECM) may include obtaining the motion vector ({circumflex over (v)}t) and decoded frame feature (x̆t−1) based on the current input frame (xt) and previously decoded frame ({circumflex over (x)}t−1), warping the motion vector ({circumflex over (v)}t) and decoded frame feature (x̆t−1) to obtain a warped feature ({umlaut over (x)}t), and processing the warped feature ({umlaut over (x)}t) using a resblock and convolution layer to obtain a context ({umlaut over (C)}t).
In some embodiments, applying the enhanced context mining (ECM) model may further include incorporating convolution with ReLU on the context ({umlaut over (C)}t) to obtain residual context (RC), and adding the residual context (RC) to the context ({umlaut over (C)}t) to obtain the enhanced context ({umlaut over (C)}E).
In some embodiments, incorporating convolution with ReLU on the context ({umlaut over (C)}t) may include passing the context ({umlaut over (C)}t) through multiple layers of convolution and ReLU followed by one more convolution layer.
In some embodiments, the enhanced context mining (ECM) model may be designed without batch normalization layers.
In some embodiments, compressing the current frame (xt) may include concatenating the input frame (xt) and the enhanced context ({umlaut over (C)}E) together, processing the input frame (xt) and the enhanced context ({umlaut over (C)}E) to obtain latent code (yt) for the entropy model, and transforming the latent code back to pixel space with the assistance of the enhanced context ({umlaut over (C)}E) to obtain the reconstructed frame ({circumflex over (x)}t′).
In some embodiments, the post-enhancement backend network may be transformer-based.
In some embodiments, the post-enhancement backend network may include multiple transposed gated transformer blocks (TGTBs) and multiple convolution layers.
In some embodiments, providing the reconstructed frame ({circumflex over (x)}t′) to the post-enhancement backend network may include applying a convolution layer to the reconstructed frame ({circumflex over (x)}t′) to obtain a low-level feature embedding F0∈H×W×C, where H×W is the spatial height and width, and C denotes the number of channels, processing the low-level feature (F0) using one or more transformer blocks to obtain a refined feature (FR), and applying a convolution layer to the refined feature (FR) to obtain a residual image R∈H×W×3 to which the reconstructed frame ({circumflex over (x)}t′) is added to obtain {circumflex over (x)}t:{circumflex over (x)}t={circumflex over (x)}t′+R.
In some embodiments, at least one of the transformer blocks may include a transposed gated transformer block (TGTB) which is designed without layer normalization.
In some embodiments, at least one of the transformer blocks may be modified to contain a multi-head transposed attention (MTA) and a gated feed-forward network (GFN).
In some embodiments, the multi-head transposed attention (MTA) may include calculating self-attention across channels.
In some embodiments, the multi-head transposed attention (MTA) may include applying depth-wise convolution.
In some embodiments, the multi-head transposed attention (MTA) may generate, from a feature input X∈H×W×C, query (Q), key (K) and value (V) projections with the local context, and reshape the query (Q) to ĤŴ×Ĉ, and key (K) to Ĉ×ĤŴ, to obtain a transposed attention map of size Ĉ×Ĉ.
In some embodiments, the gated feed-forward network (GFN) may include a gating mechanism and depth-wise convolutions, the gating mechanism may be achieved as the element-wise product of two parallel paths of transformation layers, one of which is activated with the GELU non-linearity, and the depth-wise convolution may be applied to obtain information from spatially neighboring pixel positions.
In another aspect of the invention, there is provided a system for learned video compression, which includes one or more processors, and memory storing one or more programs configured to be executed by the one or more processors. The one or more programs include instructions for performing or facilitating performing of the computer-implemented method as aforementioned.
In some embodiments, there is provided a system for learned video compression. The system includes a motion estimation model configured to receive a current frame (xt) and previously decoded frame ({circumflex over (x)}t−1) of a video data to estimate a motion vector (vt) for every pixel, a motion vector (MV) encoder and decoder configured to compress the motion vector (vt) and to reconstruct the motion vector (vt) to a reconstructed motion vector ({circumflex over (v)}t), an enhanced context mining (ECM) model configured to obtain enhanced context ({umlaut over (C)}E) from the reconstructed motion vector ({circumflex over (v)}t) and previously decoded frame feature (x̆t−1), a contextual encoder and decoder configured to compress the current frame (xt) with the assistance of the enhanced context ({umlaut over (C)}E) to obtain a reconstructed frame ({circumflex over (x)}t′), and a post-enhancement backend network configured to obtain a high-resolution frame ({circumflex over (x)}t) based on the reconstructed frame ({circumflex over (x)}t′).
In some embodiments, the motion estimation model may be based on a spatial pyramid network.
In some embodiments, the enhanced context mining (ECM) model may be configured to utilize cross-channel interaction and residual learning operation to reduce redundancy across context channels.
In some embodiments, the enhanced context mining model (ECM) may be configured to obtain the motion vector ({circumflex over (v)}t) and decoded frame feature (x̆t−1) based on the current input frame (xt) and previously decoded frame ({circumflex over (x)}t−1), warp the motion vector ({circumflex over (v)}t) and decoded frame feature (x̆t−1) to obtain a warped feature ({umlaut over (x)}t), and process the warped feature ({umlaut over (x)}t) by using a resblock and convolution layer to obtain a context ({umlaut over (C)}t).
In some embodiments, the enhanced context mining (ECM) model may further be configured to incorporate convolution with ReLU on the context ({umlaut over (C)}t) to obtain residual context (RC), and add the residual context (RC) to the context ({umlaut over (C)}t) to obtain the enhanced context ({umlaut over (C)}E).
In some embodiments, the enhanced context mining (ECM) model may be configured to incorporate convolution with ReLU on the context ({umlaut over (C)}t) by passing the context ({umlaut over (C)}t) through multiple layers of convolution and ReLU followed by one more convolution layer.
In some embodiments, the enhanced context mining (ECM) model may be designed without batch normalization layers.
In some embodiments, the contextual encoder and decoder may be configured to compress the current frame (xt) by concatenating the input frame (xt) and the enhanced context ({umlaut over (C)}E) together, feeding the input frame (xt) and the enhanced context ({umlaut over (C)}E) into the contextual encoder to obtain latent code (yt) for the entropy model, and transforming the latent code back to pixel space with the assistance of the enhanced context ({umlaut over (C)}E) by the contextual decoder to obtain the reconstructed frame ({circumflex over (x)}t′).
In some embodiments, the post-enhancement backend network may be transformer-based.
In some embodiments, the post-enhancement backend network may include multiple transposed gated transformer blocks (TGTBs) and multiple convolution layers.
In some embodiments, the post-enhancement backend network may be configured to obtain the high-resolution frame ({circumflex over (x)}t) by applying a convolution layer to the reconstructed frame ({circumflex over (x)}t′) to obtain a low-level feature embedding F0∈H×W×C, where H×W is the spatial height and width, and C denotes the number of channels, feeding the low-level feature (F0) through one or more transformer blocks to obtain a refined feature (FR), and applying a convolution layer to the refined feature (FR) to obtain a residual image R∈H×W×3 to which the reconstructed frame ({circumflex over (x)}t′) is added to obtain {circumflex over (x)}t:{circumflex over (x)}t={circumflex over (x)}t′+R.
In some embodiments, at least one of the transformer blocks may include a transposed gated transformer block (TGTB) which is designed without layer normalization.
In some embodiments, at least one of the transformer blocks may be modified to contain a multi-head transposed attention (MTA) and a gated feed-forward network (GFN).
In some embodiments, the multi-head transposed attention (MTA) may be configured to calculate self-attention across channels.
In some embodiments, the multi-head transposed attention (MTA) may be configured to apply depth-wise convolution.
In some embodiments, the multi-head transposed attention (MTA) may be configured to generate, from a feature input X∈H×W×C, query (Q), key (K) and value (V) projections with the local context, and to reshape the query (Q) to ĤŴ×Ĉ, and key (K) to Ĉ×ĤŴ, to obtain a transposed attention map of size Ĉ×Ĉ.
In some embodiments, the gated feed-forward network (GFN) may include a gating mechanism and depth-wise convolutions, the gating mechanism may be achieved as the element-wise product of two parallel paths of transformation layers, one of which is activated with the GELU non-linearity, and the depth-wise convolution may be applied to obtain information from spatially neighboring pixel positions.
In yet another aspect of the invention, there is provided a non-transitory computer readable medium having instructions stored thereon which, when executed by one or more processors, cause the one or more processors to execute the computer-implemented method as aforementioned.
Other features and aspects of the invention will become apparent by consideration of the detailed description and accompanying drawings. Any feature(s) described herein in relation to one aspect or embodiment may be combined with any other feature(s) described herein in relation to any other aspect or embodiment as appropriate and applicable.
Embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings in which:
Before any embodiments of the invention are explained in detail, it is to be understood that the invention is not limited in its application to the details of embodiment and the arrangement of components set forth in the following description or illustrated in the following drawings. The invention is capable of other embodiments and of being practiced or of being carried out in various ways. Also, it is to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting.
Hereinafter, some embodiments of the invention will be described in detail with reference to the drawings.
The deep contextual video compression framework (DCVC) with the conditional coding paradigm extracts a context and takes the context as a condition of the contextual encoder-decoder and entropy model. However, the generation of a critical context remains challenging. Some embodiments of the invention propose an enhanced context mining model to reduce the redundancy across context channels, where cross-channel interaction and residual learning are utilized to obtain better context features. Moreover, inspired by the success of in-loop filtering in traditional codecs, some embodiments of the invention provide a transformer-based post-enhancement network designed to alleviate the error propagation problem. Specifically, some embodiments of the invention propose a full-resolution pipeline without down/up-sampling for in-loop filtering, avoiding information loss. In addition, to reduce the large computation and memory consumption of the transformer, some embodiments of the invention propose a transposed gated transformer block to calculate self-attention across channels rather than the spatial dimension. With the above designs, the whole framework is feasible for high-resolution frames and can be jointly optimized to exploit spatial and temporal information.
Deep learning-based image compression has been a popular topic with end-to-end training. Rather than manually designing components empirically, as in JPEG [24], JPEG2000 [25], and BPG [26], learning-based image compression maps the image to latent code with a nonlinear network-based transform. After that, the latent code is quantized and written into the bitstream. Subsequently, the image can be reconstructed from the latent code with an inverse transformation. In [11], a fully trainable framework, which can be jointly optimized by the estimated bit cost and reconstructed image quality, is proposed. Specifically, Ballé et al. adds uniform noise to the latent code as a soft approximation of quantization during the training process and estimates the bit cost with a factorized entropy model. Later, Ballé et al. further proposes to take advantage of a side-information-related hyperprior to reduce the spatial redundancy of the latent code in [27], where the latent code is modeled as a zero-mean Gaussian with its own standard deviation. Moreover, autoregressive priors are incorporated into the entropy model in [28], [29] with a masked convolution network. Meanwhile, Mentzer et al. proposes to use a 3D-CNN to learn a conditional probability model of the latent code in [30], and Hu et al. proposes a coarse-to-fine entropy model to take advantage of different layers' hyperpriors [31]. Furthermore, Cheng et al. proposes flexible Gaussian mixture likelihoods to parameterize the distributions of latent codes, achieving comparable performance with Versatile Video Coding (VVC) in [12]. In addition to CNN-based frameworks [32], recurrent neural network-based methods [33]-[35] and GAN-based frameworks [36], [37] are also proposed for learned image compression.
Besides image compression, the past few years have also witnessed a rapid development of learning-based video compression [38], [39]. Lu et al. proposes the first low-delay predictive end-to-end video compression framework (DVC) in [13], [40], where all components are implemented with neural networks and optimized with a single rate-distortion tradeoff loss. In DVC, optical flow replaces conventional block-based MVs for motion estimation, while motion compensation relies on the reconstructed MVs; the MVs and the residuals between prediction and ground truth are then separately coded by an image compression method. Based on DVC, Hu et al. further proposes adaptive flow coding, introducing multi-resolution representations at block and frame levels in [41]. Then, Hu et al. presents a feature-space video compression framework (FVC), achieving a great improvement in compression performance. In FVC, motion estimation, compensation, compression and residual compression are operated in feature space instead of pixel space [42]. Subsequently, a coarse-to-fine deep video compression framework based on multi-resolution and feature-space operation for better compensation is presented in [43]. To better handle motion, Agustsson et al. proposes a scale-space flow for learned video compression [44]. Moreover, Lin et al. proposes a decomposed motion paradigm (DMVC) for learned video compression in [45]. Enhanced motion compensation is proposed to generate a better predicted frame in [46]. In addition, Mentzer et al. proposes the Video Compression Transformer (VCT) to directly learn the relationship between frames [47].
In addition to the above methods, it is worth mentioning that Li et al. shifted the predictive coding paradigm to the conditional coding paradigm and proposed the deep contextual video compression framework (DCVC) in [15]. Subsequently, Sheng et al. proposes to extract multi-scale temporal contexts and maintain the propagated feature based on the conditional coding paradigm of DCVC [48]. Furthermore, on top of conditional coding, Li et al. proposes a hybrid spatial-temporal entropy model, combining the contribution of [48] for learned video compression in [49]. Different from [46], [48], [49], the embodiments of the present invention focus on the generation of a better context with the conditional coding paradigm [15].
Not limited to the previous frame as a reference, Lin et al. proposes multiple frame prediction to generate multiple MV fields in [50]. Yang et al. introduces recurrent neural networks into video compression in [51] and proposes a Hierarchical Learned Video Compression (HLVC) framework with three hierarchical quality layers and a recurrent enhancement network in [52]. Pourreza et al. proposes to extend P-frame codecs to B-frame coding with a frame interpolation method in [53].
The deblocking filter and sample adaptive offset [54] are two in-loop filters specified in HEVC [2] and VVC [3]. They are applied after the inverse quantization and before saving the frame to the decoded frame buffer. In particular, the deblocking filter is designed to weaken the discontinuities at prediction and transform block boundaries. Sample adaptive offset further improves the frame quality after the deblocking filter by attenuating ringing artifacts. Using the two in-loop filters, a better-quality reference frame can be obtained and hence the compression efficiency is improved.
With the development of deep learning, researchers have explored deep learning-based in-loop filtering enhancement in HEVC and VVC. Dai et al. [55] proposes a variable-filter-size residue-learning CNN (VRCNN) to improve the compression performance. Moreover, residual highway units [56], a switchable deep learning approach [57], a context-aware CNN [58] and enhanced deep convolutional neural networks [59] are proposed for in-loop filtering in HEVC. Later, Pham et al. proposes learning-based spatial-temporal in-loop filtering to improve the VVC default in-loop filtering by taking advantage of coding information [16]. Zhang et al. proposes a dedicated CNN to enhance the Random Access (RA) mode in VVC [17]. Moreover, Ma et al. designs MFRNet for post-processing and in-loop filtering in traditional video compression [18].
Although in-loop filtering has been investigated and exploited in traditional video compression, in-loop enhancement for end-to-end video compression still awaits further exploration. The embodiments of the present invention design a post-enhancement backend network that further improves end-to-end video compression efficiency.
The overall framework of the proposed method is shown in
First, the current frame xt and previously decoded frame {circumflex over (x)}t−1 are fed into the motion estimation model to estimate optical flow, which is treated as the estimated motion vector vt for every pixel. After obtaining the motion vector vt, the MV encoder-decoder is used to compress vt and obtain the reconstructed motion vector {circumflex over (v)}t. Next, enhanced context mining is applied to learn contexts {umlaut over (C)}E from the reconstructed motion vector {circumflex over (v)}t and previously decoded frame feature x̆t−1. Then the enhanced context {umlaut over (C)}E is refilled into the contextual encoder-decoder and entropy model to compress the current frame xt. After the contextual decoder, the reconstructed frame {circumflex over (x)}t′ is obtained. Considering that in-loop filtering can further improve compression efficiency, {circumflex over (x)}t′ is fed into the proposed transformer-based post-enhancement backend network to generate {circumflex over (x)}t. Then {circumflex over (x)}t is stored and propagated for the compression of the next frame.
According to some embodiments, the motion estimation model is based on the spatial pyramid network (Spynet) [60]. Moreover, some embodiments follow the DCVC [15] method to refill the context into the contextual encoder-decoder and entropy model.
The main components of the framework in
Motion Estimation: The current frame xt and previously decoded frame {circumflex over (x)}t−1 are fed into the motion estimation model to estimate optical flow, to exploit the temporal relationship. The optical flow is treated as the estimated motion vector vt for every pixel. According to an embodiment, the motion estimation model is based on the spatial pyramid network (Spynet).
MV Encoder-Decoder: After obtaining the motion vector vt, the MV encoder-decoder is used to compress and reconstruct the motion vector vt. {circumflex over (v)}t is the reconstructed motion vector.
Enhanced Context Mining: An enhanced context mining model is proposed to learn richer contexts {umlaut over (C)}E from the reconstructed motion vector {circumflex over (v)}t and previously decoded frame feature x̆t−1. Then the enhanced context {umlaut over (C)}E is refilled into the contextual encoder-decoder and entropy model to improve the compression efficiency. The details of the enhanced context mining model will be further described below in Item B.
Contextual Encoder-Decoder: With the assistance of the enhanced context {umlaut over (C)}E, the contextual encoder and decoder are used to compress the current frame xt. According to an embodiment, the DCVC approach can be used to concatenate the context {umlaut over (C)}E with the frame xt, which is then fed into the contextual encoder-decoder.
Post Enhancement Backend Network: After the contextual decoder, the reconstructed frame {circumflex over (x)}t′ is obtained. Considering that in-loop filtering can further improve compression efficiency, a transformer-based post-enhancement backend network is proposed, which will be further described below in Item C.
Entropy Model: For the entropy model, the hierarchical prior, spatial prior and temporal context prior are fused together, and a Laplace distribution is used to model the contextual latent code. In addition, the MV latent codes also have a corresponding entropy model; however, only the spatial and hyper priors are applied to the MV latent codes according to an embodiment. Moreover, an arithmetic coder is implemented to write and read the bitstream.
The details of the proposed enhanced context mining model and transformer-based post enhancement backend network are further provided below.
It can be seen from the overall framework that the context is critical, as it assists the contextual encoder-decoder and entropy model in compressing the current frame. However, the motion estimation information obtained via optical flow may be inaccurate [61], which may introduce artifacts in the bilinear warping phase with {circumflex over (v)}t and x̆t−1. To make full use of the high-dimension context feature, it is proposed to reduce the redundancy across context channels with convolution operations and residual learning. The proposed enhanced context mining model in one embodiment is shown in
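By way of illustration, a minimal PyTorch-style sketch of such an ECM block is given below. The channel count, the number of convolution/ReLU layers and the kernel sizes are assumptions chosen for illustration, not parameters taken from the disclosure; the motion estimation and warping steps are assumed to have been performed upstream.

```python
import torch
import torch.nn as nn

class EnhancedContextMining(nn.Module):
    """Sketch of an ECM-style block: a resblock + convolution produce an initial
    context from the warped feature, then a stack of convolution/ReLU layers
    followed by one more convolution produces a residual context that is added
    back, reducing redundancy across context channels (no batch normalization)."""

    def __init__(self, ctx_channels: int = 64, num_layers: int = 3):
        super().__init__()
        # Resblock + convolution that turn the warped feature into a context.
        self.resblock = nn.Sequential(
            nn.Conv2d(ctx_channels, ctx_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ctx_channels, ctx_channels, 3, padding=1),
        )
        self.context_conv = nn.Conv2d(ctx_channels, ctx_channels, 3, padding=1)
        # Cross-channel interaction: conv + ReLU layers, then one more conv.
        layers = []
        for _ in range(num_layers):
            layers += [nn.Conv2d(ctx_channels, ctx_channels, 3, padding=1),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(ctx_channels, ctx_channels, 3, padding=1))
        self.residual_branch = nn.Sequential(*layers)

    def forward(self, warped_feature: torch.Tensor) -> torch.Tensor:
        context = self.context_conv(warped_feature + self.resblock(warped_feature))
        residual_context = self.residual_branch(context)
        return context + residual_context  # enhanced context


if __name__ == "__main__":
    ecm = EnhancedContextMining(ctx_channels=64)
    x_warped = torch.randn(1, 64, 128, 128)  # warped feature from v_t and x_{t-1}
    print(ecm(x_warped).shape)               # torch.Size([1, 64, 128, 128])
```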
After obtaining the enhanced context {umlaut over (C)}E, the enhanced context {umlaut over (C)}E is refilled into the learned video compression framework, including contextual encoder-decoder and entropy model. With the enhanced context features, compression efficiency is improved.
Some embodiments of the invention follow DCVC [15] to assume that the distribution of the latent yt follows a Laplace distribution. The temporal prior Tt is fused with the hyper prior Ht [27] and spatial prior St [28] to estimate the mean and variance of the latent ŷt. Moreover, an arithmetic coder is implemented to write and read the bitstream. Meanwhile, the MV latent codes also have a corresponding entropy model; according to an embodiment, the spatial and hyper priors are applied to the MV latent codes, as in DCVC.
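As an illustration of how a Laplace-based entropy model can be used to estimate the bit cost of the latents, the following sketch converts per-symbol probabilities into bits. The latent shape is assumed for illustration, and the prior-fusion network that would predict the mean and scale is omitted.

```python
import torch

def laplace_bits(y_hat: torch.Tensor, mu: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Estimate the bit cost of quantized latents y_hat under a Laplace(mu, b)
    model, where mu and b would be predicted from the fused priors. The
    probability of each integer symbol is the Laplace CDF evaluated over a
    unit-width bin centred at y_hat."""
    b = b.clamp(min=1e-6)                      # keep the scale strictly positive
    laplace = torch.distributions.Laplace(mu, b)
    p = laplace.cdf(y_hat + 0.5) - laplace.cdf(y_hat - 0.5)
    p = p.clamp(min=1e-9)                      # avoid log(0)
    return (-torch.log2(p)).sum()              # total bits for the latent tensor


if __name__ == "__main__":
    y_hat = torch.randint(-5, 6, (1, 96, 16, 16)).float()  # assumed latent shape
    mu = torch.zeros_like(y_hat)
    b = torch.ones_like(y_hat)
    print(f"estimated bits: {laplace_bits(y_hat, mu, b).item():.1f}")
```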
Inspired by the success of the transformer in natural language [19], [20] and vision problems [21]-[23], a post-enhancement backend network, which is a transformer-based network, is proposed to capture long-range pixel interactions for further performance improvement. In the transformer, the complexity of self-attention grows quadratically with the spatial resolution of the input frame. Taking an image of H×W pixels as an example, the complexity of the key-query dot-product interaction is O(W2H2). Therefore, some embodiments of the invention propose the transposed gated transformer block (TGTB) to capture long-range pixel interactions while keeping the transformer and the whole framework feasible for high-resolution frames. The proposed post-enhancement backend network in one embodiment is shown in
Down-sampling and up-sampling are often used in transformer-based architectures. These operations can reduce the network parameters and accelerate the training process. Meanwhile, a larger receptive field for global feature extraction can be obtained with down-sampling. However, information may be lost during the down-sampling process, while artifacts will inevitably be added with up-sampling. Therefore, a full-resolution pipeline can be used for the transformer-based post-enhancement backend network.
Given a reconstructed frame {circumflex over (x)}t′∈H×W×3, a convolution layer is applied first to obtain the low-level feature embedding F0∈H×W×C, where H×W is the spatial height and width, and C denotes the number of channels. Then the feature F0 is passed through six transformer blocks to obtain the refined feature FR. Finally, a convolution layer is applied to FR to obtain the residual image R∈H×W×3, to which the reconstructed frame {circumflex over (x)}t′ is added to obtain {circumflex over (x)}t:{circumflex over (x)}t={circumflex over (x)}t′+R. The ablation study demonstrates that adding up/down-sampling operations to the pipeline clearly degrades compression performance, which verifies the benefit of the proposed full-resolution pipeline.
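A minimal sketch of this full-resolution pipeline is given below, assuming PyTorch. The TGTB here is a simple residual placeholder (the actual MTA/GFN structure is described in the following subsection), and the channel count is an illustrative assumption rather than the disclosure's configuration.

```python
import torch
import torch.nn as nn

class TGTB(nn.Module):
    """Placeholder for a transposed gated transformer block (MTA + GFN);
    a plain residual convolution stands in here to keep the sketch short."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class PostEnhancementNet(nn.Module):
    """Full-resolution post-enhancement backend: conv -> N transformer blocks ->
    conv -> residual image added to the reconstructed frame (no down/up-sampling)."""
    def __init__(self, channels: int = 48, num_blocks: int = 6):
        super().__init__()
        self.embed = nn.Conv2d(3, channels, 3, padding=1)          # x_t' -> F0
        self.blocks = nn.Sequential(*[TGTB(channels) for _ in range(num_blocks)])
        self.to_residual = nn.Conv2d(channels, 3, 3, padding=1)    # F_R -> R

    def forward(self, x_rec: torch.Tensor) -> torch.Tensor:
        f0 = self.embed(x_rec)
        fr = self.blocks(f0)
        return x_rec + self.to_residual(fr)                        # x_t = x_t' + R


if __name__ == "__main__":
    net = PostEnhancementNet()
    x_rec = torch.randn(1, 3, 256, 256)   # reconstructed frame from the decoder
    print(net(x_rec).shape)
```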
The transposed gated transformer block (TGTB), designed without layer normalization, is the basic unit of the post-enhancement network and includes a multi-head transposed attention (MTA) and a gated feed-forward network (GFN). The details of the TGTB are as follows.
First, the normalization layer of the transformer block is removed to prevent performance degradation in the learning-based video compression task. Layer normalization is usually applied to normalize the distributions of intermediate layers in the transformer, enabling faster model convergence at the cost of losing unimportant information [63]. However, it is pointed out in [64] that different tasks prefer different normalization methods and that inappropriate normalization may lead to performance degradation. Since the transformer-based enhancement network is incorporated into the whole compression framework and all modules are trained end-to-end, the unimportant information discarded by layer normalization may be restored by other modules. In this way, the whole framework wastes effort by simultaneously discarding unimportant information through normalization and restoring it through other modules, which leads to performance degradation. Therefore, the transformer block is designed without layer normalization in this embodiment. The influence of layer normalization is demonstrated in the ablation study, which verifies the benefit of removing the normalization layer.
Second, as mentioned, the major computation overhead of the transformer comes from the self-attention layer. In one embodiment, the multi-head transposed attention (MTA) is applied to alleviate the computation problem, as shown in
From a feature input X∈H×W×C, the MTA generates query (Q), key (K) and value (V) projections with local context. Specifically, a 1×1 convolution aggregates pixel-wise cross-channel context and a 3×3 depth-wise convolution encodes channel-wise spatial context. Next, the query (Q) is reshaped to ĤŴ×Ĉ, and the key (K) is reshaped to Ĉ×ĤŴ. Therefore, the dot-product interaction of query and key yields a transposed attention map of size Ĉ×Ĉ, instead of a spatial attention map of size ĤŴ×ĤŴ. In general, the whole process of MTA could be defined as follows:
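The formal equations are not reproduced here; the following PyTorch-style sketch illustrates the transposed (channel-wise) attention described above. The channel count, the number of heads, the query/key normalization and the learnable temperature are illustrative assumptions in the spirit of transposed attention, not necessarily the exact formulation of the embodiment.

```python
import torch
import torch.nn as nn

class MultiHeadTransposedAttention(nn.Module):
    """Sketch of MTA: a 1x1 conv aggregates pixel-wise cross-channel context,
    a 3x3 depth-wise conv encodes channel-wise spatial context, and attention is
    computed across channels (a C x C map) instead of across spatial positions."""

    def __init__(self, channels: int = 48, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.qkv_dw = nn.Conv2d(channels * 3, channels * 3, kernel_size=3,
                                padding=1, groups=channels * 3)  # depth-wise
        self.project_out = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv_dw(self.qkv(x)).chunk(3, dim=1)
        # Reshape to (heads, C/heads, HW): queries and keys interact over channels.
        def heads(t):
            return t.reshape(b, self.num_heads, c // self.num_heads, h * w)
        q, k, v = heads(q), heads(k), heads(v)
        q = torch.nn.functional.normalize(q, dim=-1)
        k = torch.nn.functional.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature   # channel x channel map
        attn = attn.softmax(dim=-1)
        out = (attn @ v).reshape(b, c, h, w)
        return self.project_out(out)


if __name__ == "__main__":
    mta = MultiHeadTransposedAttention(channels=48, num_heads=4)
    print(mta(torch.randn(1, 48, 64, 64)).shape)  # torch.Size([1, 48, 64, 64])
```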
In addition, the gated feed-forward network (GFN) is applied with a gating mechanism and depth-wise convolutions. The gating mechanism is achieved as the element-wise product of two parallel paths of transformation layers, one of which is activated with the GELU non-linearity, and depth-wise convolution is applied to obtain information from spatially neighboring pixel positions.
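A minimal sketch of such a gated feed-forward network, under the same PyTorch assumptions (the channel count and expansion factor are illustrative), is given below.

```python
import torch
import torch.nn as nn

class GatedFeedForward(nn.Module):
    """Sketch of a GFN: two parallel 1x1 + depth-wise 3x3 paths; one path is
    passed through GELU and used as an element-wise gate for the other, and the
    depth-wise convolution gathers information from spatially neighboring pixels."""

    def __init__(self, channels: int = 48, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.project_in = nn.Conv2d(channels, hidden * 2, kernel_size=1)
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3,
                                padding=1, groups=hidden * 2)  # depth-wise
        self.project_out = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = self.dwconv(self.project_in(x)).chunk(2, dim=1)
        gated = torch.nn.functional.gelu(x1) * x2  # element-wise product of two paths
        return self.project_out(gated)


if __name__ == "__main__":
    gfn = GatedFeedForward(channels=48)
    print(gfn(torch.randn(1, 48, 64, 64)).shape)
```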
The loss function is designed to optimize the whole framework with the rate-distortion (R-D) cost. The loss function is defined as:
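The exact equation is not reproduced here. A typical rate-distortion objective consistent with DCVC-style training, written below as an assumption rather than the disclosure's exact formula, weights the distortion of the enhanced frame against the bits spent on the motion-vector and contextual latents:

```python
import torch

def rd_loss(x_t, x_hat_t, bits_mv, bits_ctx, num_pixels, lam=1024.0):
    """Sketch of an R-D objective: distortion (MSE here, for PSNR-oriented models)
    weighted by lambda, plus the estimated rate in bits per pixel for the
    motion-vector and contextual latents."""
    distortion = torch.mean((x_t - x_hat_t) ** 2)
    rate_bpp = (bits_mv + bits_ctx) / num_pixels
    return lam * distortion + rate_bpp


if __name__ == "__main__":
    x = torch.rand(1, 3, 256, 256)
    x_hat = x + 0.01 * torch.randn_like(x)
    loss = rd_loss(x, x_hat, bits_mv=torch.tensor(2000.0),
                   bits_ctx=torch.tensor(60000.0), num_pixels=256 * 256)
    print(loss.item())
```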
Training dataset: According to an embodiment, Vimeo-90k [68] is used as the training data, which has been commonly applied for learning-based video compression tasks. The dataset consists of 91,701 sequences with a fixed resolution of 448×256, each containing seven frames. The video sequences are randomly cropped to 256×256 patches.
Testing dataset: The testing dataset includes HEVC standard sequences from the common test conditions [69] used by the standards community. To be specific, Class B (1920×1080 resolution), Class C (832×480 resolution), Class D (416×240 resolution), and Class E (1280×720 resolution) are used to evaluate performance. Testing is not performed on HEVC Class A sequences (2560×1600 resolution), as previous learning-based codecs do not evaluate on Class A sequences [13], [15], [42]. Moreover, 1920×1080 resolution videos from the UVG [70] and MCL-JCV [71] datasets are also tested. Overall, the compression performance is measured with sixteen HEVC sequences, seven UVG sequences and thirty MCL-JCV sequences, covering slow/fast motion, homogeneous/non-homogeneous scenes, object rotation, complex texture, etc.
Implementation details: The example models are implemented on NVIDIA 3090 GPUs with PyTorch [72] and the CompressAI [73] project. For comparison, this example follows DCVC [15] and trains four models with different λ values (in terms of PSNR, λ equals 256, 512, 1024, 2048; in terms of MS-SSIM, λ equals 8, 16, 32, 64). The AdamW [74] optimizer is used with an initial learning rate of 1e−4, and the batch size is set to 8.
Testing configuration settings: Following the settings in [15], the group of pictures (GOP) size is set to 10 for HEVC sequences and 12 for the others. In addition, this example tests 100 frames for HEVC sequences and 120 frames for the others. Because this disclosure focuses on inter-frame coding, existing learning-based image compression models from the CompressAI project [73] are used for intra-frame coding, as sketched below. In this example, cheng2020-anchor [12] is utilized for the PSNR target and hyperprior [27] is utilized for the MS-SSIM target in the proposed DEVC embodiment and DCVC. The quality levels of cheng2020-anchor and hyperprior are set to 3, 4, 5 and 6 respectively for the different bit-rate coding scenarios, where a larger level value means better frame quality, corresponding to a larger λ in learning-based inter-frame coding.
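For illustration, the resulting GOP-based coding loop might be sketched as follows; the names intra_codec and inter_codec are placeholders standing in for the CompressAI image model and the proposed inter-frame model, respectively.

```python
def code_sequence(frames, intra_codec, inter_codec, gop_size=10):
    """Sketch of GOP-based coding: the first frame of each GOP is coded with a
    learned image (intra) codec, and the remaining frames are coded with the
    learned inter codec conditioned on the previously decoded frame."""
    decoded, total_bits = [], 0.0
    for idx, frame in enumerate(frames):
        if idx % gop_size == 0:
            rec, bits = intra_codec(frame)                 # I frame
        else:
            rec, bits = inter_codec(frame, decoded[-1])    # P frame, one reference
        decoded.append(rec)
        total_bits += bits
    return decoded, total_bits


if __name__ == "__main__":
    # Toy stand-ins: "codecs" that pass frames through and report dummy bit costs.
    frames = list(range(25))
    dec, bits = code_sequence(frames,
                              intra_codec=lambda f: (f, 100.0),
                              inter_codec=lambda f, ref: (f, 10.0),
                              gop_size=10)
    print(len(dec), bits)
```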
The proposed DEVC method embodiment is compared with existing codecs, e.g., H.264 [1], H.265 [54] and VVC [3] to demonstrate the coding efficiency. One embodiment applies the x265 very slow profile and x264 very fast profile in FFmpeg for H.265 and H.264 as follows.
Regarding VVC, VTM-14.0 is considered a competitive baseline to compress the testing dataset. Because the GOP size of VTM with the encoder low-delay VTM configuration is 8, it only supports intra-periods that are multiples of 8. Meanwhile, the proposed method is a predictive coding framework. Therefore, the intra-period is set to 16 and the predictive configuration lowdelay_P_vtm.cfg is selected as the default VVC configuration file for performance comparison.
Since neural networks are trained and operate in the RGB domain for vision tasks, the RGB format is employed by many existing learning-based video compression works. This example also evaluates frame quality (PSNR or MS-SSIM) in the RGB domain. For H.264, H.265 and VVC, the YUV file is converted to PNG-format images and the corresponding frame quality is calculated.
At the same time, representative learning-based video coding methods are selected as baselines; these methods include DVC [13], FVC [42] and DCVC [15]. DVC is the first end-to-end predictive coding framework and pioneered learning-based video compression. FVC shifted the pixel-space framework to a feature-space video compression framework. DCVC enabled a conditional coding framework, which takes the high-dimension context as a condition to guide compression rather than calculating the residual between the predicted and input frames. In general, DVC, FVC and DCVC have made great improvements in learning-based video compression and are considered representative learning-based methods with state-of-the-art (SOTA) performance.
In addition, Li et al. proposes DCVC in [15] and further improves DCVC with hybrid spatial-temporal entropy modeling and a content-adaptive quantization mechanism [49], which shows comparable performance with VVC. In [49], the decoded frame and the feature obtained before reconstruction are propagated for temporal context mining in inter-frame prediction. On the contrary, only the decoded frame is propagated with the method in one embodiment. The proposed method could be treated as another way to improve compression efficiency and is extensible to other learning-based methods on top of conditional coding. Moreover, Lin et al. proposes multiple frame prediction to generate and propagate multiple MV fields in [50]. Learning-based multi-reference frameworks [52], [53] are also proposed for video compression. In this example, for a fair comparison with only one decoded frame propagated for reference, only the performance comparisons among the proposed DEVC embodiment and VVC, DVC, FVC, DCVC, H.265 and H.264 are shown.
The BD-Rate [75] is applied to measure the performance of the proposed method (DEVC) and other state-of-the-art methods. The x265 very slow profile is taken as the anchor, and the BD-Rate comparison in terms of PSNR is presented in TABLE I. Negative numbers indicate bitrate savings, and the best and second-best neural video coding methods are marked in bold within brackets and underlined, respectively. Furthermore, bits per pixel (bpp) is taken as the horizontal axis and reconstructed PSNR is taken as the vertical axis to visualize the coding performance curves as shown in
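For reference, the BD-Rate can be computed by fitting a cubic polynomial to each (PSNR, log-rate) curve and integrating the rate gap over the overlapping quality range. The sketch below uses numpy with purely illustrative rate/quality points, not results from the disclosure.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Sketch of Bjontegaard delta rate: fit log-rate as a cubic polynomial of
    PSNR for both curves, integrate over the overlapping PSNR range, and return
    the average bitrate difference of the test codec versus the anchor (in %)."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100.0


if __name__ == "__main__":
    # Illustrative numbers only.
    anchor_bpp = [0.10, 0.18, 0.30, 0.50]
    anchor_psnr = [32.0, 33.5, 35.0, 36.5]
    test_bpp = [0.08, 0.15, 0.26, 0.44]
    test_psnr = [32.1, 33.6, 35.1, 36.6]
    print(f"BD-Rate: {bd_rate(anchor_bpp, anchor_psnr, test_bpp, test_psnr):.2f}%")
```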
The proposed DEVC embodiment achieves a significant performance gain over the previous conventional video codecs, H.264 and H.265, which shows the tremendous potential of learning-based methods. DEVC outperforms H.265 by 36.40% average bit savings on all testing datasets, with up to 46.33% bit savings for UVG sequences and 41.06% bit savings for HEVC Class B sequences. Compared with the VTM LDP configuration, the proposed DEVC embodiment is inferior in terms of BD-Rate for PSNR, which can be explained as follows. The proposed DEVC embodiment only applies one reference frame and a flat QP for inter-prediction, whereas the LDP configuration in VTM denotes a stronger configuration using multiple references and dynamic QP. Meanwhile, there are fewer I frames in the VTM testing, as the intra period is set to 16 rather than 10/12, while 100 frames are tested for HEVC sequences and 120 frames are tested for the other sequences.
As can be observed from the experimental results, the proposed DEVC embodiment outperforms the listed end-to-end video codecs on all testing datasets, which demonstrates a strong generalization ability, as the testing datasets have different characteristics. In TABLE II, DCVC is taken as the anchor to show the performance improvement with the double-enhanced modeling scheme in one embodiment. A 9.85% bit rate reduction on average in terms of PSNR can be achieved with the proposed method.
The distortion metric is shifted from PSNR to MS-SSIM and the BD-Rate comparison is presented. The average bpp and MS-SSIM are calculated for each test dataset class and the RD curves are drawn, as shown in
(BD-Rate of the proposed DEVC embodiment versus the x265 anchor in terms of PSNR, from TABLE I: −41.06%, −17.58%, −27.40%, −35.90%, −46.33%, −37.07%, −36.40%.)
Regarding the subjective quality comparison, frames from RaceHorses, Cactus, and BQSquare are selected and local regions are enlarged for better visualization in
The model complexity is compared in terms of model size (number of parameters), MACs (multiply-accumulate operations) and encoding-decoding time cost. The 480p-resolution sequences of HEVC Class C are used in the complexity analysis. Moreover, the complexity analysis is conducted on a machine with an NVIDIA 3090 GPU and an Intel® Xeon® Silver 4210 CPU @ 2.20 GHz. The complexity comparison results are shown in TABLE IV. For the time cost of encoding and decoding one frame, the codec time, including the time for writing to and reading from the bitstream, is measured. As the time to encode and decode one frame is influenced by the frame's content and the status of the computer, the 95% confidence intervals of the time cost for DCVC and DEVC are calculated.
Moreover, define δ as follows:
The self-attention structure captures long-distance dependencies well, but it requires substantial computation. According to an embodiment, the transposed gated transformer block (TGTB) is proposed to alleviate the major computation overhead of self-attention, and how to reduce the computation further is worth exploring. With the proposed design embodiment, the compression efficiency is improved by an average of 9.85% and 12.19% in terms of PSNR and MS-SSIM, respectively, under the default GOP settings. Specifically, a 19.28% improvement for PSNR can be obtained when the GOP size is 32, as shown in TABLE V.
To demonstrate the effectiveness of the enhanced context mining model (ECM) and the transformer-based post-enhancement backend network (TPEN), the proposed DEVC embodiment is taken as a baseline. The influences of the proposed components are presented in TABLE VI, where positive numbers represent degradation of compression performance. As shown in TABLE VI, using ECM to remove the redundancy across context channels brings an 11% gain, and the post-enhancement backend network brings an extra 3% performance gain. This verifies that the proposed ECM can better learn the context and that the transformer-based post-enhancement network further improves the compression performance.
To study the benefit of the full-resolution pipeline for TPEN, up/down-sampling is added into the pipeline and the performance comparison is presented in TABLE VII. As seen from TABLE VII, adding down/up-sampling into the pipeline degrades compression performance by 2.29%. The experimental result corresponds to the design insight that information may be lost during the down-sampling process while artifacts are inevitably added with up-sampling, and verifies the benefit of the proposed full-resolution pipeline.
(TABLE VIII reports bpp/PSNR values of 0.38/31.64 and 0.26/30.76 for BQTerrace with and without layer normalization in TPEN.)
The BQTerrace sequence from HEVC Class B (1920×1080) is taken as an example to present the bpp/PSNR values with and without layer normalization in TPEN. The influence of the normalization layer is demonstrated in TABLE VIII. It can be observed that a larger bpp or poorer quality is obtained with layer normalization in TPEN. This phenomenon may be explained as follows. Since the transformer-based enhancement network is incorporated into the whole compression framework and all modules are trained end-to-end, the whole framework wastes effort by simultaneously discarding unimportant information through normalization and restoring it through other modules, resulting in performance degradation. Therefore, the transformer block is designed without layer normalization.
The ablation study of the transposed gated transformer block is presented in TABLE IX. There is a 1.43% performance degradation without MTA and a 2.82% decline without GFN, which verifies the benefit of MTA and GFN in the transposed gated transformer block.
E. Error Propagation Analysis
RaceHorses is taken from HEVC Class C as an example to present the PSNR and bitrate comparisons between DCVC and the proposed DEVC embodiment in
According to some embodiments of the invention, a double-enhanced modeling scheme for learned video compression is designed. In particular, a context mining model is designed to reduce redundancy across context channels. Residual learning and convolution operations are applied along the context channels to reduce the artifacts introduced by bilinear warping and inaccurate motion estimation. A transformer-based post-enhancement backend network is also designed to capture long-distance dependencies. Specifically, a full-resolution pipeline is provided for the post-enhancement network to avoid information loss. Moreover, a transposed gated transformer block without normalization is designed to alleviate the self-attention computation and make the whole framework feasible for high-resolution frames.
Extensive experiments show that the proposed embodiment outperforms VVC with around 6.7% bit savings and surpasses DCVC by 12.19% in terms of MS-SSIM. Regarding the PSNR metric, the proposed embodiment exceeds HEVC with an average of 36.40% bit savings, up to 46.33% for UVG sequences and 41.06% for HEVC Class B sequences. Compared with DCVC, the proposed embodiment achieves 9.85% bit savings. Specifically, a 19.28% improvement is obtained for PSNR when the GOP size is 32. The advantage of the proposed method/framework becomes more pronounced as the GOP size increases, and the time complexity increment of the proposed model is negligible.
Although not required, one or more embodiments described with reference to the Figures can be implemented as an application programming interface (API) or as a series of libraries for use by a developer or can be included within another software application, such as a terminal or computer operating system or a portable computing device operating system. In one or more embodiments, as program modules include routines, programs, objects, components, and data files assisting in the performance of particular functions, the skilled person will understand that the functionality of the software application may be distributed across a number of routines, objects and/or components to achieve the same functionality desired herein.
It will also be appreciated that where the methods and systems of the invention are either wholly implemented by computing system or partly implemented by computing systems then any appropriate computing system architecture may be utilized. This will include stand-alone computers, network computers, dedicated or non-dedicated hardware devices. Where the terms “computing system” and “computing device” are used, these terms are intended to include (but not limited to) any appropriate arrangement of computer or information processing hardware capable of implementing the function described.
It will be appreciated by a person skilled in the art that variations and/or modifications may be made to the described and/or illustrated embodiments of the invention to provide other embodiments of the invention. The described/or illustrated embodiments of the invention should therefore be considered in all respects as illustrative, not restrictive. Example optional features of some embodiments of the invention are provided in the summary and the description. Some embodiments of the invention may include one or more of these optional features (some of which are not specifically illustrated in the drawings). Some embodiments of the invention may lack one or more of these optional features (some of which are not specifically illustrated in the drawings).