The present disclosure relates to video compression schemes, including schemes involving deep omnidirectional video compression (DOVC). More specifically, the present disclosure is directed to a method for video processing, an encoder for video processing, and a decoder for video processing.
When transmitting images/videos, an encoder can utilize spatial correlation between pixels and temporal correlation between frames/pictures to compress the images/videos and transmit the compressed images/videos in a bitstream. A decoder can then reconstruct the images/videos from the bitstream. Researchers in this field have been committed to exploring a better compromise between a compression rate (e.g., bit-rate, R) and image distortion (D). In the past few decades, a series of image/video coding standards (e.g., JPEG, JPEG 2000, AVC, and HEVC) have been developed. Challenges remain in general compression performance and complexity.
Recent studies of video compression based on deep learning include two major aspects. First, some studies combine deep learning with traditional hybrid video compression, for example by replacing modules such as the loop filter, motion estimation, and motion compensation with neural networks. Disadvantages of traditional hybrid video compression frameworks include (1) high complexity and (2) limits on optimization. Consumers' demands for high-quality content (e.g., 4K, 6K, and 8K resolutions) result in remarkable increases in coding/decoding time and algorithm complexity. In addition, traditional hybrid video compression frameworks do not provide “end-to-end” global optimization. Second, some studies of deep learning-based video compression (DVC) frameworks focus only on 2D videos for compression. These traditional DVC frameworks use a pre-trained optical flow network to estimate motion information, and thus it is impossible to update their model parameters in real time to output optimal motion features. Also, the optical flow network only takes the previous frame as a reference, which means that only “uni-directional” motion estimation is performed. Further, these traditional DVC frameworks fail to address data transmission problems from encoders to decoders. In other words, the traditional DVC frameworks require video sequence parameters to be manually provided to the decoder side (otherwise their decoders will not be able to decode videos). Therefore, it is advantageous to have improved methods or systems to address the foregoing issues.
One aspect of the present disclosure provides a method for video processing, comprising: parsing a first bitstream to determine a first quantized motion feature, wherein the first quantized motion feature is formed from first motion information of a luma current picture, wherein the first motion information is determined based on the luma current picture and first bi-directional predictive (B/P) pictures in a first group of pictures (GOP) based on first sets of reference pictures of the luma current picture; and decoding, by a motion vector (MV) decoder, the first quantized motion feature to form luma motion information.
One aspect of the present disclosure provides an encoder for video processing, comprising: a processor; and a memory configured to store instructions that, when executed by the processor, cause the processor to: separate a video into a chroma component and a luma component; receive a luma current picture of the luma component of the video; determine luma bi-directional predictive (B/P) pictures in a group of pictures (GOP) associated with the luma current picture; and perform a motion estimation (ME) process based on the luma current picture and the luma bi-directional predictive pictures so as to generate luma motion information of the luma current picture.
One aspect of the present disclosure provides a decoder for video processing, comprising: a processor; and a memory configured to store instructions that, when executed by the processor, cause the processor to: parse a first bitstream to obtain a first quantized motion feature, wherein the first quantized motion feature is formed from first motion information of a luma current picture, wherein the first motion information is determined based on first bi-directional predictive (B/P) pictures in a first group of pictures (GOP) based on first sets of reference pictures of the luma current picture, and wherein the luma current picture is from a luma component separated from a video; and decode the first quantized motion feature, by an MV decoder, to form luma motion information.
To describe the technical solutions in the implementations of the present disclosure more clearly, the following briefly describes the accompanying drawings. The accompanying drawings show merely some aspects or implementations of the present disclosure, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
For example, picture 4 can be determined based on pictures 0 and 8. Picture 2 can be determined based on pictures 0 and 4, whereas picture 6 can be determined based on pictures 4 and 8. The rest of the pictures can then be determined by similar approaches. By this arrangement, the “non-key” pictures can be predicted in a “bi-directional” fashion. In some embodiments, the current picture xt discussed herein can be one of the pictures to be predicted. The order of coding the foregoing pictures 0-16 is indicated by “coding order” in
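For illustration only, the following sketch shows one way the bi-directional coding order described above could be derived for a GOP of size 16, by repeatedly coding the midpoint picture of each already-coded reference pair. The function name is hypothetical, and the exact order used in a given implementation may differ.

```python
def bidirectional_coding_order(gop_size=16):
    """Illustrative sketch: derive a hierarchical bi-directional coding order.

    The two key pictures (0 and gop_size) are coded first; every other
    picture is coded once both of its references are available, e.g.,
    picture 4 from pictures 0 and 8, and picture 2 from pictures 0 and 4.
    """
    order = [0, gop_size]          # key pictures come first
    spans = [(0, gop_size)]        # (left reference, right reference)
    while spans:
        next_spans = []
        for left, right in spans:
            mid = (left + right) // 2
            if mid != left:
                order.append(mid)  # predicted from pictures `left` and `right`
                next_spans += [(left, mid), (mid, right)]
        spans = next_spans
    return order

print(bidirectional_coding_order(16))
# [0, 16, 8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15]
```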
Referring now to
A motion compensation (MC) module 205 utilizes a bi-directional motion compensation process 207 to form a predicted picture {tilde over (x)}t according to the motion information {circumflex over (v)}t and the aforementioned reference pictures (e.g., reconstructed pictures xsC and xeC).
At a subtractor 209, the difference between the predicted picture {tilde over (x)}t and the current picture xt is calculated. The difference is defined as current residual information rt. The current residual information rt is then encoded (by a residual encoder/decoder 211) to generate a latent residual feature yt or qy. The latent residual feature yt or qy is quantized to obtain a quantized residual feature ŷt or {circumflex over (q)}y. The quantized residual feature ŷt or {circumflex over (q)}y is then decoded to obtain residual information {circumflex over (r)}t. The predicted picture {tilde over (x)}t and the residual information {circumflex over (r)}t are added (at an adder 210) to obtain a reconstructed picture xtC.
To repeat the foregoing bi-directional process for the whole video sequence, the reconstructed pictures xsC and xtC can be used as reference pictures, and then set the next picture
In some embodiments, similarly, the reconstructed pictures xeC and xtC can be used as reference pictures, and then set the next picture
In some embodiments, a bit-rate estimation module 213 can be used to determine a bit rate (R) for the encoder/decoder 203 and the residual encoder/decoder 211. The bit rate R can also be used to optimize the present DOVC framework by providing it as an input to a loss function module 215. The loss function module 215 can improve the present DOVC framework by optimizing a loss function such as “λD+R,” wherein “D” is a distortion parameter, “λ” is a Lagrangian coefficient, and R is the bit rate.
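By way of a non-limiting illustration, a minimal sketch of the “λD+R” objective is shown below, assuming the distortion D is measured as mean-squared error and the bit rate R is expressed in bits per pixel; the function and parameter names are hypothetical.

```python
import torch
import torch.nn.functional as F

def rate_distortion_loss(x_rec, x_orig, bits, num_pixels, lam=0.01):
    """Sketch of the "lambda*D + R" objective: D is taken as mean-squared
    error (an assumption; other distortion measures can be used), and R is
    an estimated bit-rate in bits per pixel."""
    D = F.mse_loss(x_rec, x_orig)   # distortion between reconstruction and original
    R = bits / num_pixels           # estimated rate in bits per pixel
    return lam * D + R

# Example usage with random tensors standing in for pictures:
x_orig, x_rec = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
loss = rate_distortion_loss(x_rec, x_orig, bits=2048.0, num_pixels=64 * 64)
```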
Referring now to
Referring back to
After the video sequence 301 is examined and/or processed (e.g., projected), current picture xt and reference (or reconstructed) pictures from previous picture {circumflex over (x)}t−1 (from a reference buffer 325) can be combined as input for a bi-directional ME module 307. The bi-directional ME module 307 generates current motion information vt. The current motion information vt is then encoded by a motion vector (MV) encoder 309 to form a latent motion feature mt, and then the latent motion feature mt is quantized by a quantization module 311 to obtain a quantized motion feature {circumflex over (m)}t. The quantized motion feature {circumflex over (m)}t is then decoded by the MV decoder 313 to form predicted motion information {circumflex over (v)}t.
The predicted motion information {circumflex over (v)}t is then directed to a bi-directional motion compensation (MC) module 315 to form a predicted picture {tilde over (x)}t. The predicted picture {tilde over (x)}t is further directed to a subtractor 316, where the difference between the predicted picture {tilde over (x)}t and the current picture xt is calculated. The difference is defined as current residual information rt.
The current residual information rt is then encoded (by a residual encoder 317) to generate a latent residual feature yt. The latent residual feature yt is quantized (by a quantization module 319) to obtain a quantized residual feature ŷt. The quantized residual feature ŷt is then decoded (by a residual decoder 321) to obtain predicted residual information {circumflex over (r)}t. The predicted picture {tilde over (x)}t and the predicted residual information {circumflex over (r)}t are directed to an adder 322 to obtain a reconstructed picture {circumflex over (x)}t. The reconstructed picture {circumflex over (x)}t can be used as a future reference picture and stored in the reference buffer 325.
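For illustration only, the flow described above can be summarized by the following schematic sketch; the sub-network callables (motion_estimation, mv_encoder, quantize, and so on) are placeholders standing in for the learned modules 307-321 and are not actual implementations.

```python
def code_inter_picture(x_t, references, nets):
    """Schematic of coding one inter picture, following the flow above.

    `nets` is assumed to expose the learned sub-networks as callables;
    their internal structure is not specified here.
    """
    v_t = nets.motion_estimation(x_t, references)          # bi-directional ME
    m_t = nets.mv_encoder(v_t)                             # latent motion feature
    m_hat = nets.quantize(m_t)                             # quantized motion feature
    v_hat = nets.mv_decoder(m_hat)                         # predicted motion information
    x_tilde = nets.motion_compensation(v_hat, references)  # predicted picture
    r_t = x_t - x_tilde                                    # current residual information
    y_t = nets.residual_encoder(r_t)                       # latent residual feature
    y_hat = nets.quantize(y_t)                             # quantized residual feature
    r_hat = nets.residual_decoder(y_hat)                   # predicted residual information
    x_hat = x_tilde + r_hat                                # reconstructed picture
    return x_hat, m_hat, y_hat                             # m_hat, y_hat go to entropy coding
```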
In some embodiments, the reconstructed picture {circumflex over (x)}t can be directed to a quality enhancement (QE) module 323 for quality enhancement. The QE module 323 performs a convolutional process so as to enhance the image quality of the reconstructed pictures and obtain pictures xtC with higher quality (i.e., quality-enhanced pictures xtC). Embodiments of the QE module 323 are discussed in detail with reference to
The DOVC framework 300 can include a bit-rate estimation module 329 configured to determine a bit rate (R) for the MV encoder/decoder 309, 313 and the residual encoder/decoder 317, 321. The bit rate R can also be used to optimize the present DOVC framework 300 by providing it as an input to the loss function module 327. The loss function module 327 can improve the present DOVC framework 300 by optimizing a loss function.
The deformable convolution module 703 is configured to fuse temporal-spatial information to generate motion information vt. Output features (e.g., the offset values δ and the offset field Δ) from the offset prediction module 701 are fed into the deformable convolution module 703 as an input. The deformable convolution module 703 then performs a convolutional process by using multiple convolutional layers with different parameters (e.g., stride, kernel, etc.). As shown in
Advantages of the ME module 700 include that it takes consecutive pictures together as an input so as to jointly consider and predict all deformable offsets at once (as compared to conventional optical flow methods that only handle one reference-target picture pair at a time).
In addition, the ME module 700 uses pictures with a symmetric structure (e.g., the “U-shaped” structure, meaning that the ME module 700 performs down-sampling and then up-sampling). Since consecutive pictures are highly correlated, offset prediction for the current picture can benefit from the other adjacent pictures, which uses the temporal information of the frames more effectively than conventional “pair-based” methods. Also, joint prediction is more computationally efficient, at least because all deformable offsets can be obtained in a single process.
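As a hedged illustration of this idea (not the actual network of the ME module 700), the sketch below stacks several consecutive luma pictures along the channel dimension, predicts all deformable offsets jointly with a single convolution standing in for the U-shaped offset prediction network, and fuses the pictures with torchvision's DeformConv2d; all layer sizes are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class JointOffsetFusion(nn.Module):
    """Sketch: joint offset prediction over stacked pictures, then a
    deformable convolution that fuses temporal-spatial information."""

    def __init__(self, num_pictures=3, feat=64, kernel=3):
        super().__init__()
        # A single conv stands in for the U-shaped offset prediction network.
        self.offset_pred = nn.Conv2d(num_pictures, 2 * kernel * kernel, 3, padding=1)
        self.deform = DeformConv2d(num_pictures, feat, kernel, padding=1)

    def forward(self, pictures):                 # (N, num_pictures, H, W) luma stack
        offsets = self.offset_pred(pictures)     # all offsets predicted at once
        return self.deform(pictures, offsets)    # temporally fused feature map

fused = JointOffsetFusion()(torch.randn(1, 3, 64, 64))
print(fused.shape)  # torch.Size([1, 64, 64, 64])
```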
In some embodiments, during the convolutional process performed by the QE module 801, a regular convolutional layer can be set as “stride 1, zero padding” so as to retain feature size. Deconvolutional layers with “stride 2” can be used for down-sampling and up-sampling. Rectified Linear Unit (ReLU) can be adopted as an activation function for all layers except the last layer (which uses linear activation to regress the offset field Δ). In some embodiments, a normalization layer is not used.
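For illustration, a minimal sketch of this layer recipe is given below (stride-1, zero-padded convolutions that retain feature size, ReLU after every layer except a linear last layer); the channel widths, depth, and output size are assumptions, and the stride-2 down-/up-sampling path is omitted for brevity.

```python
import torch
import torch.nn as nn

def build_regression_head(in_ch=64, mid_ch=64, out_ch=18, num_layers=4):
    """Stride-1, zero-padded convolutions keep the spatial size; ReLU follows
    every layer except the last, which is linear so it can regress
    real-valued outputs such as an offset field."""
    layers, ch = [], in_ch
    for _ in range(num_layers - 1):
        layers += [nn.Conv2d(ch, mid_ch, kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
        ch = mid_ch
    layers.append(nn.Conv2d(ch, out_ch, kernel_size=3, stride=1, padding=1))  # linear output
    return nn.Sequential(*layers)

head = build_regression_head()
print(head(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 18, 32, 32])
```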
In some embodiments, the ME module, the MC module, and the QE module can be implemented in accordance with other standards, such as HEVC, VVC, etc., and are not limited to the DOVC framework disclosed herein.
At block 1105, the method 1100 continues by generating a first reference picture based on the first key picture xsI and generating a second reference picture based on the second key picture xeI. At block 1107, the method 1100 continues to determine bi-directional predictive pictures (B/P pictures) in the GOP based on the first reference picture and the second reference picture.
At block 1109, the method 1100 continues to perform a motion estimation (ME) process based on the current picture xt and the bi-directional predictive pictures so as to generate motion information vt of the current picture xt.
In some embodiments, the first reference picture can be a first reconstructed picture xsC based on the first key picture xsI processed by a better portable graphics (BPG) image compression tool. The second reference picture can be a second reconstructed picture xeC based on the second key picture xeI processed by the BPG image compression tool.
In some embodiments, the method 1100 can further comprise: (i) encoding the motion information vt by a motion vector (MV) encoder so as to form a latent motion feature mt of the current picture xt; (ii) quantizing the latent motion feature mt to form a quantized motion feature {circumflex over (m)}t; (iii) transmitting the quantized motion feature {circumflex over (m)}t in a bitstream; (iv) receiving the quantized motion feature {circumflex over (m)}t from the bitstream; (v) decoding the quantized motion feature {circumflex over (m)}t, by an MV decoder, to form predicted motion information {circumflex over (v)}t; and (vi) performing a motion compensation (MC) process based on the predicted motion information {circumflex over (v)}t and the bi-directional predictive pictures to form a predicted picture {tilde over (x)}t.
In some embodiments, the method 1100 can further comprise: (a) determining current residual information rt by comparing the predicted picture {tilde over (x)}t and the current picture xt; (b) encoding the current residual information rt by a residual encoder to form a latent residual feature yt; (c) quantizing the latent residual feature yt to form a quantized residual feature ŷt; (d) decoding the quantized residual feature ŷt by a residual decoder to form predicted residual information {circumflex over (r)}t; and (e) generating a reconstructed picture xtC based on the predicted picture {tilde over (x)}t and the predicted residual information {circumflex over (r)}t.
In some embodiments, the method 1100 can further comprise setting the reconstructed picture xtC as the first reference picture. In some embodiments, the method 1100 can further comprise determining, by a video discriminator module, whether the video includes an omnidirectional video sequence. The video discriminator module then sets a value of a flag, and the value of the flag can be used to indicate whether the video includes the omnidirectional video sequence. In response to an event that the video includes the omnidirectional video sequence, the method 1100 performs a sphere-to-plane projection on the omnidirectional video sequence.
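As a hedged example of a sphere-to-plane projection, the sketch below uses the common equirectangular (ERP) convention that maps each pixel of a W×H plane to a direction on the unit sphere; the disclosure does not limit the projection to this convention, and the function name is hypothetical.

```python
import math

def erp_pixel_to_sphere(i, j, width, height):
    """Map an ERP pixel centre (i, j) to a unit-sphere direction, assuming
    longitude spans [-pi, pi] across the width and latitude spans
    [-pi/2, pi/2] down the height."""
    lon = ((i + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (j + 0.5) / height) * math.pi
    x = math.cos(lat) * math.cos(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.sin(lon)
    return x, y, z

print(erp_pixel_to_sphere(0, 0, 1024, 512))  # direction for the top-left pixel
```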
In some embodiments, the ME process is performed by an offset prediction network, and the offset prediction network is configured to perform an offset prediction based on offset values (δ) of the current picture xt, the first reference picture, and the second reference picture. The offset values δ can be used to generate a feature map of the current picture xt by a spatiotemporal deformable convolution process. The feature map can be used to form a residual map of the current picture xt by a quality enhancement module, wherein the quality enhancement module includes multiple convolutional layers (L), and wherein the quality enhancement module performs a rectified linear unit (ReLU) activation process so as to form the residual map. The residual map is used to enhance a reconstructed picture {circumflex over (x)}t generated based on the current picture xt.
In
It may be understood that the memory in the implementations of this technology may be a volatile memory or a non-volatile memory, or may include both a volatile memory and a non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM) or a flash memory. The volatile memory may be a random-access memory (RAM) and is used as an external cache. For exemplary rather than limitative description, many forms of RAMs can be used, and are, for example, a static random-access memory (SRAM), a dynamic random-access memory (DRAM), a synchronous dynamic random-access memory (SDRAM), a double data rate synchronous dynamic random-access memory (DDR SDRAM), an enhanced synchronous dynamic random-access memory (ESDRAM), a synchronous link dynamic random-access memory (SLDRAM), and a direct Rambus random-access memory (DR RAM). It should be noted that the memories in the systems and methods described herein are intended to include, but are not limited to, these memories and memories of any other suitable type.
As shown in
The term “w(i, j)” is a weight factor of ERP or CMP.
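For context, in the commonly used ERP formulation of WS-PSNR the weight depends only on the picture row (latitude), with rows near the equator weighted more heavily than rows near the poles. The sketch below assumes that standard formulation; it is not asserted to be the exact weighting used in the disclosed tests.

```python
import math

def erp_weight(j, height):
    """Commonly used ERP weight for WS-PSNR: w(i, j) = cos((j + 0.5 - H/2) * pi / H)."""
    return math.cos((j + 0.5 - height / 2.0) * math.pi / height)

def ws_mse(ref, rec):
    """Weighted MSE over two same-sized 2-D arrays (lists of rows)."""
    height = len(ref)
    num = den = 0.0
    for j in range(height):
        w = erp_weight(j, height)
        for i in range(len(ref[j])):
            num += w * (ref[j][i] - rec[j][i]) ** 2
            den += w
    return num / den

# WS-PSNR = 10 * log10(MAX_I ** 2 / ws_mse(ref, rec)), e.g., MAX_I = 255 for 8-bit video.
```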
Comparing
Comparing
The pictures in the two channels are separately processed by the system 2000 and then transmitted. The pictures from the two channels can later be merged. More particularly, the luma pictures in the luma component Y and the chroma pictures in the chroma component UV are separately processed (as current frame or current picture xt) by a motion estimation (ME) module 2007, a motion encoder 2009, a quantization module 2011, a motion decoder 2013, and a motion compensation module 2015, so as to complete a bidirectional motion estimation and compensation process and form reconstructed frames or pictures, as discussed in detail below.
The ME module 2007 is configured to generate current motion information vt. The current motion information vt is then encoded by the motion encoder 2009 to form a latent motion feature mt, and then the latent motion feature mt is quantized by the quantization module 2011 to obtain a quantized motion feature {circumflex over (m)}t. The quantized motion feature {circumflex over (m)}t is then decoded by the motion decoder 2013 to form predicted motion information {circumflex over (v)}t.
The predicted motion information {circumflex over (v)}t is then directed to a motion compensation (MC) module 2015 to form a predicted picture {tilde over (x)}t. The predicted picture {tilde over (x)}t is further directed to a subtractor 2016, where the difference between the predicted picture {tilde over (x)}t and the current picture xt is calculated. The difference is defined as current residual information rt.
The current residual information rt is then encoded (by a residual encoder 2017) to generate a latent residual feature yt. The latent residual feature yt is quantized (by a quantization module 2019) to obtain a quantized residual feature ŷt. The quantized residual feature ŷt is then decoded (by a residual decoder 2021) to obtain predicted residual information {circumflex over (r)}t. The predicted picture {tilde over (x)}t and the predicted residual information {circumflex over (r)}t are directed to an adder 2022 to obtain a reconstructed picture {circumflex over (x)}t.
In some embodiments, the reconstructed picture {circumflex over (x)}t can be used as a future reference picture and stored in a reference buffer 2025. In some embodiments, the reconstructed picture {circumflex over (x)}t can be directed to a quality enhancement (QE) module 2023 for quality enhancement. The QE module 2023 performs a convolutional process so as to enhance the image quality of the reconstructed pictures and obtain a quality-enhanced picture xtC with higher quality. Embodiments of the QE module 2023 are discussed in detail with reference to
In some embodiments, an entropy coding module 2027 can be configured to receive the quantized motion feature {circumflex over (m)}t and the quantized residual feature ŷt to perform an entropy coding (EC) process (e.g., assigning probabilities to symbols and producing a bit sequence or stream from these probabilities) for further processes (e.g., transmission in bit stream).
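As a hedged illustration of the relationship between symbol probabilities and bit cost (not the actual entropy coder of the framework), an ideal coder spends about -log2(p) bits on a symbol of probability p, which is also the kind of estimate a bit-rate estimation module can use.

```python
import math

def ideal_code_length(symbols, probabilities):
    """Estimate the bit cost of entropy-coding `symbols`, assuming an ideal
    coder that spends -log2(p) bits on a symbol of probability p."""
    return sum(-math.log2(probabilities[s]) for s in symbols)

# Example: a near-deterministic symbol stream costs few bits.
probs = {0: 0.9, 1: 0.05, 2: 0.05}
print(ideal_code_length([0, 0, 0, 1, 0, 2], probs))  # ≈ 9.25 bits
```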
The DOVC methods that separate luma and chroma components as discussed in
As shown in
At the luma component 2105, the luma components of all the pictures are put in the same channel (i.e., DOVC-Y channel 2109). “Cin=1” represents that there is only one type of input, and “Cout=N” represents that “N” luma pictures are arranged in series. Similarly, at the chroma component 2107, the chroma components U, V of all the pictures are put in the same channel (i.e., DOVC-UV channel 2111; “Cin=2” represents that there are two types of input, and “Cout=N” represents that “N” chroma pictures are arranged in series).
The DOVC-Y channel 2109 and the DOVC-UV channel 2111 are then merged at block 2113. In some embodiments, the components in the two channels 2109, 2111 can be transmitted in separate bitstreams and then be merged. Once the components in the two channels are merged, reconstructed video sequence ({circumflex over (F)}t) can be formed at block 2115.
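For illustration only, the split into the DOVC-Y and DOVC-UV channels and the later merge might look like the sketch below; it assumes co-sited 4:4:4 planes for simplicity (with 4:2:0 video the chroma planes would be smaller), and the function names are hypothetical.

```python
import torch

def split_yuv(frames):
    """Split YUV frames (N, 3, H, W) into the DOVC-Y input (N, 1, H, W)
    and the DOVC-UV input (N, 2, H, W)."""
    y = frames[:, 0:1]   # luma channel, Cin = 1
    uv = frames[:, 1:3]  # chroma channels, Cin = 2
    return y, uv

def merge_yuv(y_rec, uv_rec):
    """Merge the separately reconstructed channels back into YUV frames."""
    return torch.cat([y_rec, uv_rec], dim=1)

frames = torch.rand(8, 3, 64, 64)  # N = 8 pictures
y, uv = split_yuv(frames)
print(y.shape, uv.shape)           # (8, 1, 64, 64) (8, 2, 64, 64)
```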
As also shown in
Generally speaking, coding the chroma component is easier (i.e., requires a much lower bitrate) than coding the luma component. For example, as shown in
Using an average end-to-end WS-PSNR (weighted-to-spherically-uniform peak signal-to-noise ratio) of the Y, U, and V components as the metric, the testing results in
In
In
At block 2803, the method 2800 continues by receiving a luma current picture of the luma component of the video. Specifically, the luma current picture refers to the luma components of all blocks in the current picture. At block 2805, the method 2800 continues by determining luma bi-directional predictive pictures (B/P pictures) in a group of pictures (GOP) associated with the luma current picture. At block 2807, the method 2800 continues by performing a motion estimation (ME) process based on the luma current picture and the luma bi-directional predictive pictures so as to generate luma motion information of the luma current picture. The processes discussed in blocks 2803, 2805, and 2807 are for the luma components of the video.
Blocks 2809, 2811, and 2813 are for the chroma components of the video. At block 2809, the method 2800 includes receiving a chroma current picture of the chroma component of the video. Specifically, the chroma current picture refers to chroma components of all blocks in the current picture. At block 2811, the method 2800 continues by determining chroma B/P pictures in a GOP associated with the chroma current picture. At block 2813, the ME process is performed based on the chroma current picture and the chroma bi-directional predictive pictures so as to generate chroma motion information of the chroma current picture.
In some embodiments, the processes for the luma components (blocks 2803, 2805 and 2807) and the processes for the chroma components (blocks 2809, 2811, and 2813) can be implemented in parallel. In some embodiments, the processes for the chroma components can be implemented prior to the processes for the luma components, or vice versa.
In some embodiments, the method 2800 can include merging the chroma component and the luma component to form a reconstructed video. In some embodiments, the GOP includes a first key picture and a second key picture. The first key picture is at a first time prior to the chroma or luma current picture, and the second key picture is at a second time later than the chroma or luma current picture.
In some embodiments, the method 2800 can include transmitting information of the first key picture and the second key picture in a bitstream. In some embodiments, method 2800 can further include (1) generating a first reference picture based on the first key picture (xsI); and (2) generating a second reference picture based on the second key picture (xeI). The first reference picture can be a first reconstructed picture (xsC) based on the first key picture (xsI) processed by a better portable graphics (BPG) image compression tool. The second reference picture can be a second reconstructed picture (xeC) based on the second key picture (xeI) processed by the BPG image compression tool.
In some embodiments, the method 2800 can include encoding the chroma and luma motion information by a motion vector (MV) encoder so as to form a latent motion feature of the current luma picture and the current chroma picture, respectively. A first compression parameter used by the MV encoder for the current luma picture can be different from a second compression parameter used by the MV encoder for the current chroma picture (see, e.g.,
In some embodiments, the method 2800 can include encoding residual information of the chroma and luma current pictures by a residual encoder so as to form a latent residual feature of the current luma picture and the current chroma picture, respectively. A first compression parameter used by the residual encoder for the current luma picture can be different from a second compression parameter used by the residual encoder for the current chroma picture (see, e.g.,
The above Detailed Description of examples of the disclosed technology is not intended to be exhaustive or to limit the disclosed technology to the precise form disclosed above. While specific examples for the disclosed technology are described above for illustrative purposes, various equivalent modifications are possible within the scope of the described technology, as those skilled in the relevant art will recognize. For example, while processes or blocks are presented in a given order, alternative implementations may perform routines having steps, or employ systems having blocks, in a different order, and some processes or blocks may be deleted, moved, added, subdivided, combined, and/or modified to provide alternative implementations or sub-combinations. Each of these processes or blocks may be implemented in a variety of different ways. Also, while processes or blocks are at times shown as being performed in series, these processes or blocks may instead be performed or implemented in parallel, or may be performed at different times. Further, any specific numbers noted herein are only examples; alternative implementations may employ differing values or ranges.
In the Detailed Description, numerous specific details are set forth to provide a thorough understanding of the presently described technology. In other implementations, the techniques introduced here can be practiced without these specific details. In other instances, well-known features, such as specific functions or routines, are not described in detail in order to avoid unnecessarily obscuring the present disclosure. References in this description to “an implementation/embodiment,” “one implementation/embodiment,” or the like mean that a particular feature, structure, material, or characteristic being described is included in at least one implementation of the described technology. Thus, the appearances of such phrases in this specification do not necessarily all refer to the same implementation/embodiment. On the other hand, such references are not necessarily mutually exclusive either. Furthermore, the particular features, structures, materials, or characteristics can be combined in any suitable manner in one or more implementations/embodiments. It is to be understood that the various implementations shown in the figures are merely illustrative representations and are not necessarily drawn to scale.
Several details describing structures or processes that are well-known and often associated with communications systems and subsystems, but that can unnecessarily obscure some significant aspects of the disclosed techniques, are not set forth herein for purposes of clarity. Moreover, although the following disclosure sets forth several implementations of different aspects of the present disclosure, several other implementations can have different configurations or different components than those described in this section. Accordingly, the disclosed techniques can have other implementations with additional elements or without several of the elements described below.
Many implementations or aspects of the technology described herein can take the form of computer- or processor-executable instructions, including routines executed by a programmable computer or processor. Those skilled in the relevant art will appreciate that the described techniques can be practiced on computer or processor systems other than those shown and described below. The techniques described herein can be implemented in a special-purpose computer or data processor that is specifically programmed, configured, or constructed to execute one or more of the computer-executable instructions described below. Accordingly, the terms “computer” and “processor” as generally used herein refer to any data processor. Information handled by these computers and processors can be presented at any suitable display medium. Instructions for executing computer- or processor-executable tasks can be stored in or on any suitable computer-readable medium, including hardware, firmware, or a combination of hardware and firmware. Instructions can be contained in any suitable memory device, including, for example, a flash drive and/or other suitable medium.
The term “and/or” in this specification merely describes an association relationship between the associated objects and indicates that three relationships may exist. For example, A and/or B may indicate the following three cases: only A exists, both A and B exist, and only B exists.
These and other changes can be made to the disclosed technology in light of the above Detailed Description. While the Detailed Description describes certain examples of the disclosed technology, as well as the best mode contemplated, the disclosed technology can be practiced in many ways, no matter how detailed the above description appears in text. Details of the system may vary considerably in its specific implementation, while still being encompassed by the technology disclosed herein. As noted above, particular terminology used when describing certain features or aspects of the disclosed technology should not be taken to imply that the terminology is being redefined herein to be restricted to any specific characteristics, features, or aspects of the disclosed technology with which that terminology is associated. Accordingly, the invention is not limited, except as by the appended claims. In general, the terms used in the following claims should not be construed to limit the disclosed technology to the specific examples disclosed in the specification, unless the above Detailed Description section explicitly defines such terms.
A person of ordinary skill in the art may be aware that, in combination with the examples described in the implementations disclosed in this specification, units and algorithm steps may be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on particular applications and design constraint conditions of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of this application.
Although certain aspects of the invention are presented below in certain claim forms, the applicant contemplates the various aspects of the invention in any number of claim forms. Accordingly, the applicant reserves the right to pursue additional claims after filing this application to pursue such additional claim forms, in either this application or in a continuing application.
Number | Date | Country | Kind |
---|---|---|---|
PCT/CN2021/121381 | Sep 2021 | WO | international |
This application is a continuation of International Application No. PCT/CN2021/138494, filed Dec. 15, 2021, which claims priority to International Application No. PCT/CN2021/121381, filed Sep. 28, 2021, the entire disclosures of which are incorporated herein by reference.
| Number | Date | Country
---|---|---|---
Parent | PCT/CN2021/138494 | Dec 2021 | WO
Child | 18620952 | | US