The disclosure relates to the field of video coding, and in particular, to a device and a method of using a loop filter to process a decoded video based on a Deep Neural Network (DNN) with Temporal Deformable Convolutions (TDC).
Traditional video coding standards, such as H.264/Advanced Video Coding (H.264/AVC), High-Efficiency Video Coding (HEVC), and Versatile Video Coding (VVC), share a similar (recursive) block-based hybrid prediction/transform framework in which individual coding tools, such as intra/inter prediction, integer transforms, and context-adaptive entropy coding, are intensively handcrafted to optimize the overall efficiency. Basically, the spatiotemporal pixel neighborhoods are leveraged for predictive signal construction, to obtain corresponding residuals for subsequent transform, quantization, and entropy coding. On the other hand, the nature of Deep Neural Networks (DNN) is to extract different levels of spatiotemporal stimuli by analyzing spatiotemporal information from the receptive field of neighboring pixels. The capability to exploit highly nonlinear and nonlocal spatiotemporal correlations provides a promising opportunity for substantially improved compression quality.
However, compressed video inevitably suffers from compression artifacts, which severely degrade the Quality of Experience (QoE). DNN-based methods have been developed to enhance the visual quality of compressed images through tasks such as image denoising, super-resolution, and deblurring. When applied to videos, these image-based methods suffer from instability and fluctuation caused by changes in compressed video quality, video scene, and object motion. Accordingly, it is important to make use of information from neighboring frames in videos to stabilize and improve the enhanced visual quality.
One caveat of leveraging information from multiple neighboring video frames is the complex motion caused by a moving camera and dynamic scenes. Traditional block-based motion vectors do not work well for non-translational motions. Also, while learning-based optical flow methods can provide more accurate motion information at the pixel level, they are still prone to errors, especially along the boundaries of moving objects.
Therefore, one or more embodiments of the disclosure provide a DNN-based model with Temporal Deformable Convolutions (TDC) to handle arbitrary and complex motions in a data-driven fashion without explicit motion estimation.
According to an embodiment, there is provided a method of performing video coding using one or more neural networks with a loop filter. The method includes: obtaining a plurality of image frames in a video sequence; determining a feature map for each of the plurality of image frames and determining an offset map based on the feature map; determining an aligned feature map by performing a temporal deformable convolution (TDC) on the feature map and the offset map; and generating a plurality of aligned frames.
According to an embodiment, there is provided an apparatus including: at least one memory storing computer program code; and at least one processor configured to access the at least one memory and operate as instructed by the computer program code. The computer program code includes: obtaining code configured to cause the at least one processor to obtain a plurality of image frames in a video sequence; determining code configured to cause the at least one processor to: determine a feature map for each of the plurality of image frames and determine an offset map based on the feature map; determine an aligned feature map by performing a temporal deformable convolution (TDC) on the feature map and the offset map; and generating code configured to cause the at least one processor to generate a plurality of aligned frames.
According to an embodiment, there is provided a non-transitory computer-readable storage medium storing computer program code that, when executed by at least one processor, causes the at least one processor to: obtain a plurality of image frames in a video sequence; determine a feature map for each of the plurality of image frames and determine an offset map based on the feature map; determine an aligned feature map by performing a temporal deformable convolution (TDC) on the feature map and the offset map; and generate a plurality of aligned frames.
The following description briefly introduces the accompanying drawings, which illustrate the one or more embodiments of the disclosure.
Example embodiments are described in detail herein with reference to the accompanying drawings. It should be understood that the one or more embodiments of the disclosure described herein are only example embodiments, and should not be construed as limiting the scope of the disclosure.
Referring to
Referring to
The processor 210 is implemented in hardware, firmware, or a combination of hardware and software. The processor 210 is a central processing unit (CPU), a graphics processing unit (GPU), an accelerated processing unit (APU), a microprocessor, a microcontroller, a digital signal processor (DSP), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), or another type of processing component. In some implementations, the processor 210 includes one or more processors capable of being programmed to perform a function.
The memory 220 includes a random access memory (RAM), a read only memory (ROM), and/or another type of dynamic or static storage device (e.g., a flash memory, a magnetic memory, and/or an optical memory) that stores information and/or instructions for use by the processor 210.
The storage 230 stores information and/or software related to the operation and use of the computing device 200. For example, the storage 230 may include a hard disk (e.g., a magnetic disk, an optical disk, a magneto-optic disk, and/or a solid state disk), a compact disc (CD), a digital versatile disc (DVD), a floppy disk, a cartridge, a magnetic tape, and/or another type of non-transitory computer-readable medium, along with a corresponding drive.
The input interface 240 includes a component that permits the computing device 200 to receive information, such as via user input (e.g., a touch screen display, a keyboard, a keypad, a mouse, a button, a switch, and/or a microphone). Additionally, or alternatively, the input interface 240 may include a sensor for sensing information (e.g., a global positioning system (GPS) component, an accelerometer, a gyroscope, and/or an actuator). The output interface 250 includes a component that provides output information from the computing device 200 (e.g., a display, a speaker, and/or one or more light-emitting diodes (LEDs)).
The communication interface 260 includes a transceiver-like component (e.g., a transceiver and/or a separate receiver and transmitter) that enables the computing device 200 to communicate with other devices, such as via a wired connection, a wireless connection, or a combination of wired and wireless connections. The communication interface 260 may permit the computing device 200 to receive information from another device and/or provide information to another device. For example, the communication interface 260 may include an Ethernet interface, an optical interface, a coaxial interface, an infrared interface, a radio frequency (RF) interface, a universal serial bus (USB) interface, a Wi-Fi interface, a cellular network interface, or the like.
The bus 270 includes a component that permits communication among the components of the computing device 200.
The computing device 200 may perform one or more operations described herein. The computing device 200 may perform operations described herein in response to the processor 210 executing software instructions stored in a non-transitory computer readable medium, such as the memory 220 and/or the storage 230. A computer-readable medium is defined herein as a non-transitory memory device. A memory device may include memory space within a single physical storage device or memory space spread across multiple physical storage devices.
Software instructions may be read into the memory 220 and/or the storage 230 from another computer-readable medium or from another device via the communication interface 260. When executed, software instructions stored in the memory 220 and/or the storage 230 may cause the processor 210 to perform one or more processes described herein. Additionally, or alternatively, hardwired circuitry may be used in place of or in combination with software instructions to perform one or more processes described herein. Thus, implementations described herein are not limited to any specific combination of hardware circuitry and software.
The number and arrangement of components shown in
First, a typical video compression framework will be described herein. Assume that an input video x includes a plurality of original image frames x1, . . . , xt. In the first step, motion estimation, the plurality of original image frames are partitioned into spatial blocks, each spatial block can be further partitioned into smaller blocks iteratively, and a set of motion vectors mt between a current original image frame xt and a set of previous reconstructed frames {{circumflex over (x)}j}t−1 is computed for each spatial block. Here, the subscript t denotes the current t-th encoding cycle, which may not match the time stamp of the image frame. Also, the set of previous reconstructed frames {{circumflex over (x)}j}t−1 may include frames from multiple previous encoding cycles. In the second step, motion compensation, a predicted frame {tilde over (x)}t is obtained by copying the corresponding pixels of the previous reconstructed frames {{circumflex over (x)}j}t−1 based on the motion vectors mt, and a residual rt between the current original image frame xt and the predicted frame {tilde over (x)}t is obtained by rt=xt−{tilde over (x)}t. In the third step, after performing a Discrete Cosine Transform (DCT) on the spatial blocks, the DCT coefficients of the residual rt are quantized to obtain a quantized residual ŷt. Then, both the motion vectors mt and the quantized residual ŷt are encoded into a bitstream by entropy coding, and the bitstream is transmitted to one or more decoders. On the decoder side, the quantized residual ŷt is first de-quantized (e.g., through inverse transformation such as Inverse Discrete Cosine Transform (IDCT)) to obtain a recovered residual {circumflex over (r)}t, and then the recovered residual {circumflex over (r)}t is added back to the predicted frame {tilde over (x)}t to obtain a reconstructed frame by {circumflex over (x)}t={tilde over (x)}t+{circumflex over (r)}t.
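The residual coding loop described above can be illustrated with a minimal NumPy sketch for a single square block. The helper names (`dct_matrix`, `encode_block`, `decode_block`), the uniform quantization step `q_step`, and the noisy copy that stands in for a motion-compensated prediction are assumptions made for illustration only; motion estimation and entropy coding are omitted.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)   # frequency index (rows)
    i = np.arange(n).reshape(1, -1)   # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)        # DC row has a different scale factor
    return m

def encode_block(x_t, x_pred, q_step):
    """Quantized DCT coefficients of the residual r_t = x_t - x_pred."""
    d = dct_matrix(x_t.shape[0])
    r_t = x_t - x_pred                # residual
    coeffs = d @ r_t @ d.T            # 2D DCT of the residual
    return np.round(coeffs / q_step).astype(np.int64)

def decode_block(y_hat, x_pred, q_step):
    """Reconstructed block: prediction plus recovered residual."""
    d = dct_matrix(y_hat.shape[0])
    r_hat = d.T @ (y_hat * q_step) @ d   # de-quantize, then inverse DCT
    return x_pred + r_hat

rng = np.random.default_rng(0)
x_t = rng.uniform(0, 255, size=(8, 8))           # current original block
x_pred = x_t + rng.normal(0, 4, size=(8, 8))     # stand-in for a predicted block
q_step = 2.0
y_hat = encode_block(x_t, x_pred, q_step)
x_rec = decode_block(y_hat, x_pred, q_step)
max_err = float(np.max(np.abs(x_rec - x_t)))     # distortion left by quantization
```

The remaining error `max_err` is caused by quantization alone, which is the kind of distortion that the enhancement modules discussed next are intended to reduce.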
Moreover, additional components may be used to improve visual quality of the reconstructed frame {circumflex over (x)}t. One or more of the enhancement modules such as Deblocking Filter (DF), Sample-Adaptive Offset (SAO), Adaptive Loop Filter (ALF), Cross-Component Adaptive Filter (CCALF), etc. may be selected to process the reconstructed frame {circumflex over (x)}t. For example, the Deblocking Filter (DF) is a video filter that may be applied to a decoded video to improve visual quality and prediction performance by smoothing sharp edges formed between macroblocks when using block coding techniques. The Sample-Adaptive Offset (SAO) is an in-loop filter technique to reduce the mean sample distortion by adding an offset value to each sample. The SAO includes two types of offset techniques, which are edge offset (EO) and band offset (BO). The EO is driven by local directional structures in an image frame to be filtered, and the BO modifies the intensity values of image frames without a dependency on the neighborhood. The Adaptive Loop Filter (ALF) may be used to minimize the mean square error between original sample images and decoded sample images. The order of processing the enhancement modules and a selection of the enhancement modules may be variously modified according to a user setting.
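As a rough illustration of the band offset (BO) idea, the following sketch adds an offset to each sample according to the intensity band the sample falls into, independent of its neighbors. The function name `band_offset`, the choice of 32 equal-width bands, and the signalling of offsets for consecutive bands starting at `start_band` follow an HEVC-style design and are assumptions here, not details stated in this disclosure.

```python
import numpy as np

def band_offset(frame, start_band, offsets, bit_depth=8, num_bands=32):
    """Add offsets[i] to every sample whose band index is start_band + i."""
    shift = bit_depth - int(np.log2(num_bands))   # e.g., 8 - 5 = 3
    samples = frame.astype(np.int64)
    bands = samples >> shift                      # band index of each sample
    out = samples.copy()
    for i, off in enumerate(offsets):
        out[bands == start_band + i] += off       # depends on intensity only
    return np.clip(out, 0, (1 << bit_depth) - 1).astype(frame.dtype)

frame = np.array([[0, 8, 16, 255]], dtype=np.uint8)
filtered = band_offset(frame, start_band=1, offsets=[2, -3])
```

In this example only the samples in bands 1 and 2 are modified; samples in all other bands pass through unchanged.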
According to an embodiment, an overall method of training a DNN LF is provided. Referring to
In addition, an alignment loss Lalign({circumflex over (f)}i,t,
Furthermore, the high-quality frame {circumflex over (x)}th and the original image frame xt may be input to a discrimination module 340 so as to recognize and detect the difference between the high-quality frame {circumflex over (x)}th and the original image frame xt. That is, the discrimination module 340 may compute a discrimination loss Ldis({circumflex over (x)}th,xt) based on {circumflex over (x)}th and xt and transmit the discrimination loss to the back propagation module 330. The discrimination loss Ldis({circumflex over (x)}th,xt) may be fed back to the DNN LF module 310 and the discrimination module 340 through the back propagation module 330 to train the DNN LF module 310 and the discrimination module 340.
The discrimination DNN may be a classification network which uses at least one of {circumflex over (x)}th and xt as an input, to compute a discrimination feature map d({circumflex over (x)}th) or d(xt). Based on the discrimination feature map d({circumflex over (x)}th) or d(xt), the discrimination DNN classifies whether the input is the original image frame xt or the generated (or synthesized) high-quality frame {circumflex over (x)}th. A classification loss Lclassify({circumflex over (x)}th, xt) can be computed to measure a mis-classification loss, such as a categorical cross-entropy loss. Also, a feature discrimination loss Lfeature(d({circumflex over (x)}th), d(xt)) may be computed to measure the difference between a discrimination feature map computed based on the generated high-quality image frame {circumflex over (x)}th and a discrimination feature map computed based on the original image frame xt.
The overall discrimination loss Ldis({circumflex over (x)}th, xt) may be a linear combination of Lfeature(d({circumflex over (x)}th),d(xt)) and Lclassify({circumflex over (x)}th,xt), which is calculated according to the following Equation (1):
Ldis({circumflex over (x)}th,xt)=Lclassify({circumflex over (x)}th,xt)+γLfeature(d({circumflex over (x)}th),d(xt)) (1)
Here, γ is a weight associated with the discrimination feature maps d({circumflex over (x)}th) and d(xt).
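Equation (1) can be sketched as follows. The two-class cross-entropy used for the classification term and the mean squared error used for the feature term are assumptions chosen for illustration; the disclosure only requires a mis-classification loss and a measure of the difference between the discrimination feature maps. All function and parameter names are hypothetical.

```python
import numpy as np

def classification_loss(p_orig, p_gen, eps=1e-12):
    """Two-class cross-entropy.

    p_orig: predicted probability of "original" for the original frame
    p_gen:  predicted probability of "original" for the generated frame
    """
    return float(-np.log(p_orig + eps) - np.log(1.0 - p_gen + eps))

def feature_loss(d_gen, d_orig):
    """Mean squared difference between discrimination feature maps (assumed metric)."""
    return float(np.mean((d_gen - d_orig) ** 2))

def discrimination_loss(p_orig, p_gen, d_gen, d_orig, gamma):
    """L_dis = L_classify + gamma * L_feature, as in Equation (1)."""
    return classification_loss(p_orig, p_gen) + gamma * feature_loss(d_gen, d_orig)

d_gen = np.zeros((4, 4))    # feature map of the generated high-quality frame
d_orig = np.ones((4, 4))    # feature map of the original frame
l_dis = discrimination_loss(0.5, 0.5, d_gen, d_orig, gamma=0.5)
```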
As described above, the reconstruction quality Lquality({circumflex over (x)}th,xt) output by the reconstruction quality computation module 320, the alignment loss Lalign({circumflex over (f)}i,t,
Ljoint=Lquality({circumflex over (x)}th,xt)+λLalign({circumflex over (f)}i,t,
Here, λ is a weight associated with the alignment loss, and β is a weight associated with the discrimination loss.
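Assuming that the joint loss is a weighted sum of the three terms named above, the combination can be sketched with scalar loss values as inputs. The function name `joint_loss` and the example weights are hypothetical.

```python
def joint_loss(l_quality, l_align, l_dis, lam, beta):
    """Weighted sum of the quality, alignment, and discrimination losses.

    lam weights the alignment loss; beta weights the discrimination loss.
    """
    return l_quality + lam * l_align + beta * l_dis

l_joint = joint_loss(l_quality=1.0, l_align=2.0, l_dis=4.0, lam=0.5, beta=0.25)
```

Setting `lam` or `beta` to zero removes the corresponding term from training, which may be a convenient way to study the contribution of each loss.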
The gradient of the joint loss Ljoint can be back-propagated through the back propagation module 330 to update the DNN weight coefficients in the LF DNN (e.g., Feature Extraction DNN, Offset Generation DNN, TDC DNN, Frame Reconstruction DNN, Frame Synthesis DNN, Discrimination DNN, and TDC & Feature Fusion DNN).
Based on feeding back the joint loss Ljoint to the one or more DNN above, the predicted frame {tilde over (x)}t is added to update the set of N previous reconstructed frames {{circumflex over (x)}jt:{circumflex over (x)}jt ∈Nt}. For example, the oldest frame that is at the greatest distance away from the current frame may be removed from the set of N previous reconstructed frames, and the predicted frame {tilde over (x)}t may be added to replace the removed oldest frame. Thereafter, the encoder may enter the next encoding cycle from t to t+1.
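The update of the set of N previous frames can be sketched as a fixed-length buffer. This sketch assumes that the frames are kept in time order and that all of them precede the current frame, so that the oldest frame is also the one farthest from the current frame; the function name `update_reference_set` and the string placeholders for frames are hypothetical.

```python
from collections import deque

def update_reference_set(frames: deque, new_frame):
    """Append new_frame; a deque with maxlen=N drops its oldest entry when full."""
    frames.append(new_frame)
    return frames

n = 3
reference_set = deque(["frame_0", "frame_1", "frame_2"], maxlen=n)
update_reference_set(reference_set, "frame_3")   # "frame_0" is removed
```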
According to an embodiment, the DNN LF module 310 may be used in combination with one or more of the above-described additional components (e.g., DF, SAO, ALF, CCALF, etc.) to improve the visual quality of the reconstructed frame {circumflex over (x)}t. For example, the reconstructed frame {circumflex over (x)}t may be sequentially processed through the DF, the DNN LF module, the SAO and the ALF. However, the one or more embodiments are not limited thereto, and the order of processing the additional components may be variously configured. In an embodiment, the DNN LF module 310 may be used alone as a replacement for all the other additional components to enhance the visual quality of the reconstructed frame {circumflex over (x)}t.
Referring to
The feature extraction module 410 may receive a set of N previous reconstructed frames {{circumflex over (x)}jt:{circumflex over (x)}jt∈Nt} as an input, and may be configured to compute a feature map {circumflex over (f)}j,t, j=1, . . . , N, by using a feature extraction DNN through forward inference. For example, assuming that a frame {circumflex over (x)}it is used as a reference frame to which all other frames must be aligned, an offset generation module 420 may compute an offset map ΔPj→i,t based on {circumflex over (x)}it and {circumflex over (x)}jt by concatenating feature maps {circumflex over (f)}i,t and {circumflex over (f)}j,t and passing the concatenated feature map through an offset generation DNN. Here, the frame {circumflex over (x)}it may be any frame of the set of N previous reconstructed frames {{circumflex over (x)}jt:{circumflex over (x)}jt∈Nt}. Without loss of generality, the set of N previous reconstructed frames {{circumflex over (x)}jt:{circumflex over (x)}jt∈Nt} is ordered by time stamp in ascending order. Accordingly, a frame whose visual quality is to be enhanced may be selected based on the time stamps of the N reconstructed frames {{circumflex over (x)}jt:{circumflex over (x)}jt∈Nt}. For example, when a target is to enhance the current reconstructed frame {circumflex over (x)}t, then {circumflex over (x)}it={circumflex over (x)}t. That is, all other previously reconstructed neighboring frames may be prior to {circumflex over (x)}t. In another embodiment, a part of the previously reconstructed neighboring frames may be before {circumflex over (x)}t, and the remaining frames may be after {circumflex over (x)}t.
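The concatenation step that feeds the offset generation DNN can be sketched as follows. A single matrix `proj` acting as a 1×1 convolution stands in for the offset generation DNN, and the output is assumed to carry two offset components for each of K kernel locations (2K channels), as is common for deformable convolutions; both are assumptions for illustration, and all names are hypothetical.

```python
import numpy as np

def offset_map(f_ref, f_other, proj):
    """Concatenate two (C, H, W) feature maps and project them to (2K, H, W).

    proj: (2K, 2C) matrix acting as a 1x1 convolution (hypothetical stand-in
          for the offset generation DNN).
    """
    stacked = np.concatenate([f_ref, f_other], axis=0)   # (2C, H, W)
    return np.einsum("oc,chw->ohw", proj, stacked)       # (2K, H, W)

c, h, w, k = 4, 5, 6, 9
f_i = np.ones((c, h, w))          # feature map of the reference frame
f_j = np.zeros((c, h, w))         # feature map of the frame to be aligned
proj = np.ones((2 * k, 2 * c))
delta_p = offset_map(f_i, f_j, proj)
```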
The offset map ΔPj→i,t generated by the offset generation module 420 may be input to the TDC module 430. In
According to an embodiment, the TDC DNN may include two-dimensional (2D) TDC layers. For example, assume that wk denotes a weight coefficient of a 2D TDC kernel, where k is a natural number greater than or equal to 1 (e.g., k=1, . . . , K), and pk denotes a predetermined offset for the k-th location in the kernel (e.g., a 3×3 kernel is defined with K=9 and pk∈{(−1,−1), (−1, 0), . . . , (1,1)}). A 2D TDC layer may compute an output feature fout based on an input feature fin and a learnable offset ΔP, where the feature at a sampling location p0 is determined based on the following equation:
$$f_{out}(p_0)=\sum_{k=1}^{K} w_k\, f_{in}(p_0+p_k+\Delta p_k) \qquad (3)$$
Here, because the sampling position (p0+pk+Δpk) may be irregular and may not be an integer, the TDC operation can perform interpolation (e.g., bilinear interpolation) to obtain the feature value at that position.
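Equation (3) with bilinear interpolation can be sketched in NumPy for a single-channel feature and a 3×3 kernel. Treating positions outside the feature as zero, storing offsets as (dy, dx) pairs per location, and the explicit loops are simplifying assumptions; the function names are hypothetical, and a practical TDC layer would handle multiple channels and learn the offsets.

```python
import numpy as np

def bilinear_sample(f, y, x):
    """Bilinearly interpolate the 2D array f at real-valued (y, x); zero outside."""
    h, w = f.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1.0 - abs(y - yy)) * (1.0 - abs(x - xx))
                value += weight * f[yy, xx]
    return value

def deformable_conv2d(f_in, weights, offsets):
    """f_out(p0) = sum_k w_k * f_in(p0 + p_k + dp_k) for a 3x3 kernel.

    f_in:    (H, W) input feature
    weights: (9,) kernel weights w_k
    offsets: (H, W, 9, 2) offsets dp_k stored as (dy, dx) for each location p0
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]   # p_k
    h, w = f_in.shape
    f_out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            for k, (dy, dx) in enumerate(grid):
                sy = y + dy + offsets[y, x, k, 0]   # may be fractional
                sx = x + dx + offsets[y, x, k, 1]
                f_out[y, x] += weights[k] * bilinear_sample(f_in, sy, sx)
    return f_out

f_in = np.arange(12, dtype=float).reshape(3, 4)
weights = np.zeros(9)
weights[4] = 1.0                                   # keep only the center tap
half_pixel = np.zeros((3, 4, 9, 2))
half_pixel[..., 1] = 0.5                           # shift every sample by dx = 0.5
f_out = deformable_conv2d(f_in, weights, half_pixel)
```

With all offsets set to zero the operation reduces to a regular 3×3 convolution (in the cross-correlation sense) with zero padding, which is one way to check the sketch.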
Moreover, the alignment error computation module 460 may be configured to compute an alignment loss Lalign({circumflex over (f)}i,t,
While some specific embodiments of the DNN LF module have been described above, it should be understood that the one or more embodiments of the disclosure are not limited thereto. For example, a type of layer, a number of layers, a kernel size, etc. may be variously configured for each of the feature extraction DNN, the offset generation DNN, the TDC DNN, the frame reconstruction DNN and the frame synthesis DNN. For example, any backbone network, such as ResNet, may be used as the feature extraction DNN. For example, a set of regular convolution and bottleneck layers may be stacked as the offset generation DNN. For example, a set of TDC layers may be stacked as the TDC DNN, and a few convolution layers with skip connections may be stacked together as the frame reconstruction DNN. For example, a few residual block layers may be stacked together as the frame synthesis DNN.
Referring to
According to an embodiment, input frames {circumflex over (x)}1t, . . . , {circumflex over (x)}Nt may be stacked together to obtain a 4D input tensor of size (n, c, h, w), where n is the number of stacked frames, c is a number of channels (e.g., three for color frames) and (h, w) provides a resolution of a video frame. The feature extraction module 510 may be configured to compute a 4D feature tensor of feature maps {circumflex over (f)}1,t, . . . , {circumflex over (f)}N,t using the feature extraction DNN through forward inference. In an embodiment, the feature extraction DNN uses 3D convolution layers (e.g., C3D) to compute the feature maps {circumflex over (f)}1,t, . . . , {circumflex over (f)}N,t and capture spatiotemporal characteristics of a video. In another embodiment, each individual feature map may be computed using 2D convolution layers as described with reference to
For example, assume that wk denotes a weight coefficient of a 3D TDC kernel and pk denotes a predetermined offset for the k-th location in the kernel, where k is a natural number greater than or equal to 1 (e.g., k=1, . . . , K). The 3D TDC kernel may be defined with K=27 and pk ∈{(−1,−1,−1), (−1,−1, 0), . . . , (1,1,1)}. A 3D TDC layer may compute an output feature fout based on an input feature fin and a learnable offset ΔP, where the feature at a sampling location p0 is given by the same Equation (3) provided above.
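The predetermined kernel offsets pk for the 2D and 3D cases can be enumerated with a short sketch. The helper name `kernel_offsets` and its `radius` parameter are hypothetical; a radius of 1 corresponds to the 3×3 and 3×3×3 kernels mentioned in the text.

```python
from itertools import product
import numpy as np

def kernel_offsets(ndim, radius=1):
    """All integer offsets in {-radius, ..., radius}^ndim, in lexicographic order."""
    steps = range(-radius, radius + 1)
    return np.array(list(product(steps, repeat=ndim)))

p_2d = kernel_offsets(2)   # K = 9:  (-1,-1), (-1,0), ..., (1,1)
p_3d = kernel_offsets(3)   # K = 27: (-1,-1,-1), (-1,-1,0), ..., (1,1,1)
```

The same Equation (3) then applies in both cases; only the dimensionality of p0, pk, and Δpk changes.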
In
While some specific embodiments of the DNN LF module have been described above, it should be understood that the one or more embodiments of the disclosure are not limited thereto. For example, a type of layer, a number of layers, a kernel size, etc. may be variously configured for each of the feature extraction DNN, the TDC and feature fusion DNN, and the frame reconstruction DNN.
The apparatus 600 may include at least one memory storing computer program code and at least one processor configured to access the at least one memory and operate as instructed by the computer program code. The computer program code may include obtaining code 610, determining code 620 and generating code 630.
The obtaining code 610 may be configured to obtain a set of reconstructed image frames in a video sequence. According to an embodiment, the obtaining code 610 may be configured to perform operations of the feature extraction modules 410 and 510 described above with respect to
The determining code 620 may be configured to determine a feature map for each of the plurality of image frames, determine an offset map based on the feature map and determine an aligned feature map by performing a temporal deformable convolution (TDC) on the feature map and the offset map. According to an embodiment, the determining code 620 may be configured to perform operations of the offset generation module 420, the TDC 430 and alignment error computation module 460 described above with respect to
The generating code 630 may be configured to generate a plurality of aligned frames and synthesize the plurality of aligned frames to output a plurality of high-quality frames corresponding to the plurality of image frames. According to an embodiment, the generating code 630 may be configured to perform operations of the frame reconstruction module 440 and the frame synthesis module 450 of
Although the apparatus 600 is described as including only the obtaining code 610, the determining code 620 and the generating code 630, the one or more embodiments of the disclosure are not limited thereto. The one or more embodiments may include more or fewer components or parts than those shown in
A term used in the one or more embodiments of the disclosure, such as “unit” or “module”, indicates a unit for processing at least one function or operation, and may be implemented in hardware, software, or a combination of hardware and software.
A “unit”, “code” or “module” may be implemented by a program that is stored in an addressable storage medium and executable by a processor.
For example, the term “unit”, “code” or “module” may include software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of a program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and/or variables.
Some of the embodiments of the disclosure have been shown and described above. However, the one or more embodiments of the disclosure are not limited to the aforementioned specific embodiments. It may be understood that various modifications, substitutions, improvements and equivalents thereof can be made without departing from the spirit and scope of the disclosure. It should be understood that such modifications, substitutions, improvements and equivalents thereof shall fall within the protection scope of the disclosure, and should not be construed as independent from the inventive concept or prospect of the disclosure.
This application is based on and claims priority to U.S. Provisional Application No. 63/090,126 filed on Oct. 9, 2020, in the United States Patent and Trademark Office, the disclosure of which is incorporated by reference herein in its entirety.
Number | Date | Country | |
---|---|---|---|
20220116633 A1 | Apr 2022 | US |
Number | Date | Country | |
---|---|---|---|
63090126 | Oct 2020 | US |