This application is the national stage entry under 35 U.S.C. § 371 of International Application PCT/US2018/054300, filed Oct. 4, 2018, which was published in accordance with PCT Article 21(2) on Apr. 11, 2019, in English, and which claims the benefit of European Patent Application No. 17306335, filed Oct. 5, 2017.
At least one of the present embodiments generally relates to, e.g., a method or an apparatus for video encoding or decoding, and more particularly, to a method or an apparatus for selecting a predictor candidate from a set of multiple predictor candidates for motion compensation in inter coding mode (merge mode or AMVP) based on a motion model such as, e.g., an affine model, for a video encoder or a video decoder.
To achieve high compression efficiency, image and video coding schemes usually employ prediction, including motion vector prediction, and transform to leverage spatial and temporal redundancy in the video content. Generally, intra or inter prediction is used to exploit the intra or inter frame correlation, then the differences between the original image and the predicted image, often denoted as prediction errors or prediction residuals, are transformed, quantized, and entropy coded. To reconstruct the video, the compressed data are decoded by inverse processes corresponding to the entropy coding, quantization, transform, and prediction.
A recent addition to high compression technology includes using a motion model based on affine modeling. In particular, affine modeling is used for motion compensation for encoding and decoding of video pictures. In general, affine modeling is a model using at least two parameters such as, e.g., two control point motion vectors (CPMVs) representing the motion at the respective corners of a block of a picture, that allows deriving a motion field for the whole block of a picture to simulate, e.g., rotation and homothety (zoom). However, the set of control point motion vectors (CPMVs) potentially used as predictors in Merge mode is limited. A method that would increase the overall compression performance of the considered high compression technology by improving the performance of the motion model used in Affine Merge and Advanced Motion Vector Prediction (AMVP) modes is therefore desirable.
The purpose of the invention is to overcome at least one of the disadvantages of the prior art. For this purpose, according to a general aspect of at least one embodiment, a method for video encoding is presented, comprising: determining, for a block being encoded in a picture, at least one spatial neighboring block; determining, for the block being encoded, a set of predictor candidates for inter coding mode based on the at least one spatial neighboring block, wherein a predictor candidate comprises one or more control point motion vectors and one reference picture; determining, for the block being encoded and for each predictor candidate, a motion field based on a motion model and on the one or more control point motion vectors of the predictor candidate, wherein the motion field identifies motion vectors used for prediction of sub-blocks of the block being encoded; selecting a predictor candidate from the set of predictor candidates based on a rate distortion determination between predictions responsive to the motion field determined for each predictor candidate; encoding the block based on the motion field for the selected predictor candidate; and encoding an index for the selected predictor candidate from the set of predictor candidates. The one or more control point motion vectors and the reference picture are used for prediction of the block being encoded based on motion information associated to the block.
According to another general aspect of at least one embodiment, a method for video decoding is presented, comprising: receiving, for a block being decoded in a picture, an index corresponding to a particular predictor candidate among a set of predictor candidates for inter coding mode; determining, for the block being decoded, at least one spatial neighboring block; determining, for the block being decoded, the set of predictor candidates for inter coding mode based on the at least one spatial neighboring block, wherein a predictor candidate comprises one or more control point motion vectors and one reference picture; determining, for the particular predictor candidate, one or more corresponding control point motion vectors for the block being decoded; determining for the particular predictor candidate, based on the one or more corresponding control point motion vectors, a corresponding motion field based on a motion model, wherein the corresponding motion field identifies motion vectors used for prediction of sub-blocks of the block being decoded; and decoding the block based on the corresponding motion field.
According to another general aspect of at least one embodiment, an apparatus for video encoding is presented, comprising: means for determining, for a block being encoded in a picture, at least one spatial neighboring block; means for determining, for a block being encoded, a set of predictor candidates for inter coding mode based on the at least one spatial neighboring block, wherein a predictor candidate comprises one or more control point motion vectors and one reference picture; means for selecting a predictor candidate from the set of predictor candidates; means for determining for the block being encoded and for each predictor candidate, a motion field based on a motion model and based on the one or more control point motion vectors of the predictor candidate, wherein the motion field identifies motion vectors used for prediction of sub-blocks of the block being encoded; means for selecting a predictor candidate from the set of predictor candidates based on a rate distortion determination between predictions responsive to the motion field determined for each predictor candidate; means for encoding the block based on the corresponding motion field for the selected predictor candidate from the set of predictor candidates; and means for encoding an index for the selected predictor candidate from the set of predictor candidates.
According to another general aspect of at least one embodiment, an apparatus for video decoding is presented, comprising: means for receiving, for a block being decoded in a picture, an index corresponding to a particular predictor candidate among a set of predictor candidates for inter coding mode; means for determining, for the block being decoded, at least one spatial neighboring block; means for determining, for the block being decoded, the set of predictor candidates for inter coding mode based on the at least one spatial neighboring block, wherein a predictor candidate comprises one or more control point motion vectors and one reference picture; means for determining, for the block being decoded, one or more corresponding control point motion vectors from the particular predictor candidate; means for determining for the block being decoded, a motion field based on a motion model and based on the one or more control point motion vectors for the block being decoded, wherein the motion field identifies motion vectors used for prediction of sub-blocks of the block being decoded; and means for decoding the block based on the corresponding motion field.
According to another general aspect of at least one embodiment, an apparatus for video encoding is provided, comprising: one or more processors and at least one memory, wherein the one or more processors are configured to: determine, for a block being encoded in a picture, at least one spatial neighboring block; determine, for the block being encoded, a set of predictor candidates for inter coding mode based on the at least one spatial neighboring block, wherein a predictor candidate comprises one or more control point motion vectors and one reference picture; determine, for the block being encoded and for each predictor candidate, a motion field based on a motion model and on the one or more control point motion vectors of the predictor candidate, wherein the motion field identifies motion vectors used for prediction of sub-blocks of the block being encoded; select a predictor candidate from the set of predictor candidates based on a rate distortion determination between predictions responsive to the motion field determined for each predictor candidate; encode the block based on the motion field for the selected predictor candidate; and encode an index for the selected predictor candidate from the set of predictor candidates. The at least one memory is for storing, at least temporarily, the encoded block and/or the encoded index.
According to another general aspect of at least one embodiment, an apparatus for video decoding is provided, comprising: one or more processors and at least one memory, wherein the one or more processors are configured to: receive, for a block being decoded in a picture, an index corresponding to a particular predictor candidate among a set of predictor candidates for inter coding mode; determine, for the block being decoded, at least one spatial neighboring block; determine, for the block being decoded, the set of predictor candidates for inter coding mode based on the at least one spatial neighboring block, wherein a predictor candidate comprises one or more control point motion vectors and one reference picture; determine, for the particular predictor candidate, one or more corresponding control point motion vectors for the block being decoded; determine, for the particular predictor candidate, based on the one or more corresponding control point motion vectors, a motion field based on a motion model, wherein the motion field identifies motion vectors used for prediction of sub-blocks of the block being decoded; and decode the block based on the motion field. The at least one memory is for storing, at least temporarily, the decoded block.
According to another general aspect of at least one embodiment, the at least one spatial neighboring block comprises a spatial neighboring block of the block being encoded or decoded among neighboring top-left corner blocks, neighboring top-right corner blocks, and neighboring bottom-left corner blocks.
According to another general aspect of at least one embodiment, motion information associated to at least one of the spatial neighboring blocks comprises non-affine motion information. A non-affine motion model is a translational motion model wherein only one motion vector representative of a translation is coded in the model.
According to another general aspect of at least one embodiment, motion information associated to all the at least one spatial neighboring blocks comprises affine motion information.
According to another general aspect of at least one embodiment, the set of predictor candidates comprises a unidirectional predictor candidate or a bidirectional predictor candidate.
According to another general aspect of at least one embodiment, a method may further comprise: determining a top left list of spatial neighboring blocks of the block being encoded or decoded among neighboring top-left corner blocks, a top right list of spatial neighboring blocks of the block being encoded or decoded among neighboring top-right corner blocks, and a bottom left list of spatial neighboring blocks of the block being encoded or decoded among neighboring bottom-left corner blocks; selecting at least one triplet of spatial neighboring blocks, wherein each spatial neighboring block of the triplet respectively belongs to said top left list, said top right list, and said bottom left list and wherein the reference picture being used for prediction of each spatial neighboring block of said triplet is the same; determining, for the block being encoded or decoded, one or more control point motion vectors for the top left corner, top right corner, and bottom left corner of the block based on motion information respectively associated to each spatial neighboring block of the selected triplet; wherein the predictor candidate comprises the determined one or more control point motion vectors and the reference picture.
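The triplet selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `Neighbor` record, its field names, and the functions `select_triplets` and `triplet_to_candidate` are hypothetical, each neighbor is assumed to carry a single motion vector and a reference picture index, and the neighbor's motion vector is used directly as the control point motion vector.

```python
from dataclasses import dataclass
from itertools import product
from typing import Tuple


@dataclass(frozen=True)
class Neighbor:
    """Hypothetical record for one spatial neighboring block."""
    mv: Tuple[int, int]  # motion vector (x, y)
    ref_idx: int         # index of the reference picture used by the block


def select_triplets(top_left, top_right, bottom_left):
    """Return every (top-left, top-right, bottom-left) triplet whose three
    blocks are predicted from the same reference picture."""
    return [
        (a, b, c)
        for a, b, c in product(top_left, top_right, bottom_left)
        if a.ref_idx == b.ref_idx == c.ref_idx
    ]


def triplet_to_candidate(triplet):
    """Build a predictor candidate: three CPMVs plus the shared reference.

    Assumption: each neighbor's motion vector is taken as-is for its corner.
    """
    a, b, c = triplet
    return {"cpmvs": (a.mv, b.mv, c.mv), "ref_idx": a.ref_idx}
```

Enumerating the full Cartesian product keeps the sketch short; the number of triplets stays small because each corner list holds only a few neighboring blocks.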
According to another general aspect of at least one embodiment, a method may further comprise: evaluating the at least one selected triplet of spatial neighboring blocks according to one or more criteria based on the one or more control point motion vectors determined for the block being encoded or decoded; and wherein the predictor candidates are sorted in the set of predictor candidates for inter coding mode based on the evaluating.
According to another general aspect of at least one embodiment, the one or more criteria comprises a validity check according to equation 3 and cost according to equation 4.
According to another general aspect of at least one embodiment, the cost of a bidirectional predictor candidate is the mean of its first reference picture list related cost and its second reference picture list related cost.
According to another general aspect of at least one embodiment, a method may further comprise: determining a top left list of spatial neighboring blocks of the block being encoded or decoded among neighboring top-left corner blocks, and a top right list of spatial neighboring blocks of the block being encoded or decoded among neighboring top-right corner blocks; selecting at least one pair of spatial neighboring blocks, wherein each spatial neighboring block of the pair respectively belongs to said top left list and said top right list and wherein the reference picture being used for prediction of each spatial neighboring block of said pair is the same; determining, for the block being encoded or decoded, a control point motion vector for the top-left corner of the block based on motion information associated to spatial neighboring blocks of the top left list, and a control point motion vector for the top-right corner of the block based on motion information associated to spatial neighboring blocks of the top right list; wherein the predictor candidate comprises said top-left and top-right control point motion vectors and the reference picture.
According to another general aspect of at least one embodiment, a bottom left list is used instead of the top right list, the bottom left list comprising spatial neighboring blocks of the block being encoded or decoded among neighboring bottom-left corner blocks, and wherein a bottom-left control point motion vector is determined.
According to another general aspect of at least one embodiment, the motion model is an affine model and the motion field for each position (x, y) inside the block being encoded or decoded is determined by:
Wherein (v0x,v0y) and (v2x,v2y) are the control point motion vectors used to generate the motion field, (v0x,v0y) corresponds to the control point motion vector of the top-left corner of the block being encoded or decoded, (v2x,v2y) corresponds to the control point motion vector of the bottom-left corner of the block being encoded or decoded and h is the height of the block being encoded or decoded.
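Written out under the assumption of a four-parameter (rotation and zoom) affine model, and with (vx, vy) denoting the motion vector at position (x, y), a form consistent with these symbol definitions is:

```latex
\begin{cases}
v_x = \dfrac{(v_{2y} - v_{0y})}{h}\,x + \dfrac{(v_{2x} - v_{0x})}{h}\,y + v_{0x} \\[2ex]
v_y = -\dfrac{(v_{2x} - v_{0x})}{h}\,x + \dfrac{(v_{2y} - v_{0y})}{h}\,y + v_{0y}
\end{cases}
```

Evaluating this form at (0, 0) gives (v0x, v0y) and at (0, h) gives (v2x, v2y), as the definitions of the two control points require.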
According to another general aspect of at least one embodiment, the method may further comprise encoding or retrieving an indication of the motion model used for the block being encoded or decoded, said motion model being based on control point motion vector of the top-left corner and the control point motion vector of the bottom-left corner or said motion model being based on control point motion vector of the top-left corner and the control point motion vector of the top-right corner.
According to another general aspect of at least one embodiment, the motion model used for the block being encoded or decoded is implicitly derived, said motion model being based on control point motion vector of the top-left corner and the control point motion vector of the bottom-left corner or said motion model being based on control point motion vector of the top-left corner and the control point motion vector of the top-right corner.
According to another general aspect of at least one embodiment, decoding or encoding the block based on the corresponding motion field comprises decoding or encoding, respectively, based on predictors for the sub-blocks, the predictors being indicated by the motion vectors.
According to another general aspect of at least one embodiment, the number of the spatial neighboring blocks is at least 5 or at least 7.
According to another general aspect of at least one embodiment, a non-transitory computer readable medium is presented containing data content generated according to the method or the apparatus of any of the preceding descriptions.
According to another general aspect of at least one embodiment, a signal is provided comprising video data generated according to the method or the apparatus of any of the preceding descriptions.
One or more of the present embodiments also provide a computer readable storage medium having stored thereon instructions for encoding or decoding video data according to any of the methods described above. The present embodiments also provide a computer readable storage medium having stored thereon a bitstream generated according to the methods described above. The present embodiments also provide a method and apparatus for transmitting the bitstream generated according to the methods described above. The present embodiments also provide a computer program product including instructions for performing any of the methods described.
It is to be understood that the figures and descriptions have been simplified to illustrate elements that are relevant for a clear understanding of the present principles, while eliminating, for purposes of clarity, many other elements found in typical encoding and/or decoding devices. It will be understood that, although the terms first and second may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
Various embodiments are described with respect to the HEVC standard. However, the present principles are not limited to HEVC, and can be applied to other standards, recommendations, and extensions thereof, including for example HEVC or HEVC extensions like Format Range (RExt), Scalability (SHVC), Multi-View (MV-HEVC) Extensions and H.266. The various embodiments are described with respect to the encoding/decoding of a slice. They may be applied to encode/decode a whole picture or a whole sequence of pictures.
Various methods are described above, and each of the methods comprises one or more steps or actions for achieving the described method. Unless a specific order of steps or actions is required for proper operation of the method, the order and/or use of specific steps and/or actions may be modified or combined.
In HEVC, to encode a video sequence with one or more pictures, a picture is partitioned into one or more slices where each slice can include one or more slice segments. A slice segment is organized into coding units, prediction units, and transform units.
In the present application, the terms “reconstructed” and “decoded” may be used interchangeably, the terms “encoded” or “coded” may be used interchangeably, and the terms “picture” and “frame” may be used interchangeably. Usually, but not necessarily, the term “reconstructed” is used at the encoder side while “decoded” is used at the decoder side.
The HEVC specification distinguishes between “blocks” and “units,” where a “block” addresses a specific area in a sample array (e.g., luma, Y), and the “unit” includes the collocated blocks of all encoded color components (Y, Cb, Cr, or monochrome), syntax elements, and prediction data that are associated with the blocks (e.g., motion vectors).
For coding, a picture is partitioned into coding tree blocks (CTB) of square shape with a configurable size, and a consecutive set of coding tree blocks is grouped into a slice. A Coding Tree Unit (CTU) contains the CTBs of the encoded color components. A CTB is the root of a quadtree partitioning into Coding Blocks (CB), and a Coding Block may be partitioned into one or more Prediction Blocks (PB) and forms the root of a quadtree partitioning into Transform Blocks (TBs). Corresponding to the Coding Block, Prediction Block, and Transform Block, a Coding Unit (CU) includes the Prediction Units (PUs) and the tree-structured set of Transform Units (TUs), a PU includes the prediction information for all color components, and a TU includes residual coding syntax structure for each color component. The size of a CB, PB, and TB of the luma component applies to the corresponding CU, PU, and TU. In the present application, the term “block” can be used to refer, for example, to any of CTU, CU, PU, TU, CB, PB, and TB. In addition, the “block” can also be used to refer to a macroblock and a partition as specified in H.264/AVC or other video coding standards, and more generally to refer to an array of data of various sizes.
In the exemplary encoder 100, a picture is encoded by the encoder elements as described below. The picture to be encoded is processed in units of CUs. Each CU is encoded using either an intra or inter mode. When a CU is encoded in an intra mode, it performs intra prediction (160). In an inter mode, motion estimation (175) and compensation (170) are performed. The encoder decides (105) which one of the intra mode or inter mode to use for encoding the CU, and indicates the intra/inter decision by a prediction mode flag. Prediction residuals are calculated by subtracting (110) the predicted block from the original image block.
CUs in intra mode are predicted from reconstructed neighboring samples within the same slice. A set of 35 intra prediction modes is available in HEVC, including a DC, a planar, and 33 angular prediction modes. The intra prediction reference is reconstructed from the row and column adjacent to the current block. The reference extends over two times the block size in the horizontal and vertical directions using available samples from previously reconstructed blocks. When an angular prediction mode is used for intra prediction, reference samples can be copied along the direction indicated by the angular prediction mode.
The applicable luma intra prediction mode for the current block can be coded using two different options. If the applicable mode is included in a constructed list of three most probable modes (MPM), the mode is signaled by an index in the MPM list. Otherwise, the mode is signaled by a fixed-length binarization of the mode index. The three most probable modes are derived from the intra prediction modes of the top and left neighboring blocks.
For an inter CU, the corresponding coding block is further partitioned into one or more prediction blocks. Inter prediction is performed on the PB level, and the corresponding PU contains the information about how inter prediction is performed. The motion information (i.e., motion vector and reference picture index) can be signaled in two methods, namely, “merge mode” and “advanced motion vector prediction (AMVP)”.
In the merge mode, a video encoder or decoder assembles a candidate list based on already coded blocks, and the video encoder signals an index for one of the candidates in the candidate list. At the decoder side, the motion vector (MV) and the reference picture index are reconstructed based on the signaled candidate.
The set of possible candidates in the merge mode consists of spatial neighbor candidates, a temporal candidate, and generated candidates.
The motion vector of the collocated location in a reference picture can be used for derivation of a temporal candidate. The applicable reference picture is selected on a slice basis and indicated in the slice header, and the reference index for the temporal candidate is set to iref=0. If the POC distance (td) between the picture of the collocated PU and the reference picture from which the collocated PU is predicted is the same as the distance (tb) between the current picture and the reference picture containing the collocated PU, the collocated motion vector mvcol can be directly used as the temporal candidate. Otherwise, a scaled motion vector, tb/td*mvcol, is used as the temporal candidate. Depending on where the current PU is located, the collocated PU is determined by the sample location at the bottom-right or at the center of the current PU.
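The tb/td scaling rule can be illustrated with a short sketch. It is a simplified illustration only: the function name `temporal_candidate` is hypothetical, and exact rational arithmetic with rounding to the nearest integer is an assumption made for readability (a real codec would use fixed-point arithmetic with clipping).

```python
from fractions import Fraction


def temporal_candidate(mv_col, tb, td):
    """Derive the temporal candidate from the collocated motion vector.

    mv_col: (x, y) collocated motion vector
    tb, td: POC distances as defined in the text (td must be non-zero)
    """
    if td == 0:
        raise ValueError("td must be non-zero")
    if tb == td:
        return mv_col  # equal distances: reuse the collocated vector directly
    scale = Fraction(tb, td)
    # Assumption: round each scaled component to the nearest integer.
    return (round(scale * mv_col[0]), round(scale * mv_col[1]))
```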
The maximum number of merge candidates, N, is specified in the slice header. If the number of merge candidates is larger than N, only the first N−1 spatial candidates and the temporal candidate are used. Otherwise, if the number of merge candidates is less than N, the set of candidates is filled up to the maximum number N with generated candidates as combinations of already present candidates, or null candidates. The candidates used in the merge mode may be referred to as “merge candidates” in the present application.
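The truncation and padding rule above can be sketched as follows. The function `build_merge_list` and the dictionary layout of a candidate are hypothetical; the sketch pads only with null (zero-motion) candidates and leaves out the combined candidates, and its handling of a missing temporal candidate is an assumption.

```python
def build_merge_list(spatial, temporal, max_cands):
    """Assemble a merge candidate list of exactly max_cands entries (N in the text).

    spatial:  list of spatial candidates, already in priority order
    temporal: the temporal candidate, or None if unavailable
    """
    cands = list(spatial) + ([temporal] if temporal is not None else [])
    if len(cands) > max_cands:
        if temporal is not None:
            # Keep the first N-1 spatial candidates plus the temporal candidate.
            cands = list(spatial[: max_cands - 1]) + [temporal]
        else:
            # Assumption: without a temporal candidate, keep the first N spatial ones.
            cands = list(spatial[:max_cands])
    # Pad with null candidates; combined candidates are omitted in this sketch.
    while len(cands) < max_cands:
        cands.append({"mv": (0, 0), "ref_idx": 0})
    return cands
```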
If a CU indicates a skip mode, the applicable index for the merge candidate is indicated only if the list of merge candidates is larger than 1, and no further information is coded for the CU. In the skip mode, the motion vector is applied without a residual update.
In AMVP, a video encoder or decoder assembles candidate lists based on motion vectors determined from already coded blocks. The video encoder then signals an index in the candidate list to identify a motion vector predictor (MVP) and signals a motion vector difference (MVD). At the decoder side, the motion vector (MV) is reconstructed as MVP+MVD. The applicable reference picture index is also explicitly coded in the PU syntax for AMVP.
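The MVP+MVD relationship can be shown as a small round trip. The function names `encode_mv` and `decode_mv` are hypothetical, and the encoder-side choice of the predictor with the smallest absolute difference is an assumption standing in for a real rate-based decision.

```python
def encode_mv(candidates, mv):
    """Encoder side: pick a predictor and return (mvp_idx, mvd).

    Assumption: the predictor with the smallest sum of absolute differences
    is chosen; a real encoder would weigh the actual bit cost.
    """
    def cost(i):
        return abs(mv[0] - candidates[i][0]) + abs(mv[1] - candidates[i][1])

    idx = min(range(len(candidates)), key=cost)
    mvp = candidates[idx]
    return idx, (mv[0] - mvp[0], mv[1] - mvp[1])


def decode_mv(candidates, mvp_idx, mvd):
    """Decoder side: reconstruct the motion vector as MVP + MVD."""
    mvp = candidates[mvp_idx]
    return (mvp[0] + mvd[0], mvp[1] + mvd[1])
```

Both sides must build the same candidate list for the signaled index to identify the same predictor.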
Only two spatial motion candidates are chosen in AMVP. The first spatial motion candidate is chosen from left positions {a0, a1} and the second one from the above positions {b1, b0, b2}, while keeping the searching order as indicated in the two sets. If the number of motion vector candidates is not equal to two, the temporal MV candidate can be included. If the set of candidates is still not fully filled, then zero motion vectors are used.
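A minimal sketch of this selection order is given below. The function `build_amvp_list` is hypothetical, unavailable positions are represented by `None`, and pruning or scaling details not covered by the paragraph are omitted.

```python
def build_amvp_list(left, above, temporal=None):
    """Build a two-entry AMVP candidate list.

    left:  motion vectors (or None) at positions a0, a1, in search order
    above: motion vectors (or None) at positions b1, b0, b2, in search order
    temporal: temporal motion vector candidate, or None if unavailable
    """
    def first_available(mvs):
        return next((mv for mv in mvs if mv is not None), None)

    picked = (first_available(left), first_available(above))
    cands = [mv for mv in picked if mv is not None]
    # Fewer than two spatial candidates: the temporal candidate may be added.
    if len(cands) < 2 and temporal is not None:
        cands.append(temporal)
    # Still not full: pad with zero motion vectors.
    while len(cands) < 2:
        cands.append((0, 0))
    return cands
```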
If the reference picture index of a spatial candidate corresponds to the reference picture index for the current PU (i.e., using the same reference picture index or both using long-term reference pictures, independently of the reference picture list), the spatial candidate motion vector is used directly. Otherwise, if both reference pictures are short-term ones, the candidate motion vector is scaled according to the distance (tb) between the current picture and the reference picture of the current PU and the distance (td) between the current picture and the reference picture of the spatial candidate. The candidates used in the AMVP mode may be referred to as “AMVP candidates” in the present application.
For ease of notation, a block tested with the “merge” mode at the encoder side or a block decoded with the “merge” mode at the decoder side is denoted as a “merge” block, and a block tested with the AMVP mode at the encoder side or a block decoded with the AMVP mode at the decoder side is denoted as an “AMVP” block.
Motion compensation prediction can be performed using one or two reference pictures for prediction. In P slices, only a single prediction reference can be used for Inter prediction, enabling uni-prediction for a prediction block. In B slices, two reference picture lists are available, and uni-prediction or bi-prediction can be used. In bi-prediction, one reference picture from each of the reference picture lists is used.
In HEVC, the precision of the motion information for motion compensation is one quarter-sample (also referred to as quarter-pel or ¼-pel) for the luma component and one eighth-sample (also referred to as ⅛-pel) for the chroma components for the 4:2:0 configuration. A 7-tap or 8-tap interpolation filter is used for interpolation of fractional-sample positions, i.e., ¼, ½ and ¾ of full sample locations in both horizontal and vertical directions can be addressed for luma.
The prediction residuals are then transformed (125) and quantized (130). The quantized transform coefficients, as well as motion vectors and other syntax elements, are entropy coded (145) to output a bitstream. The encoder may also skip the transform and apply quantization directly to the non-transformed residual signal on a 4×4 TU basis. The encoder may also bypass both transform and quantization, i.e., the residual is coded directly without the application of the transform or quantization process. In direct PCM coding, no prediction is applied and the coding unit samples are directly coded into the bitstream.
The encoder decodes an encoded block to provide a reference for further predictions. The quantized transform coefficients are de-quantized (140) and inverse transformed (150) to decode prediction residuals. Combining (155) the decoded prediction residuals and the predicted block, an image block is reconstructed. In-loop filters (165) are applied to the reconstructed picture, for example, to perform deblocking/SAO (Sample Adaptive Offset) filtering to reduce encoding artifacts. The filtered image is stored at a reference picture buffer (180).
In particular, the input of the decoder includes a video bitstream, which may be generated by video encoder 100. The bitstream is first entropy decoded (330) to obtain transform coefficients, motion vectors, and other coded information. The transform coefficients are de-quantized (340) and inverse transformed (350) to decode the prediction residuals. Combining (355) the decoded prediction residuals and the predicted block, an image block is reconstructed. The predicted block may be obtained (370) from intra prediction (360) or motion-compensated prediction (i.e., inter prediction) (375). As described above, AMVP and merge mode techniques may be used to derive motion vectors for motion compensation, which may use interpolation filters to calculate interpolated values for sub-integer samples of a reference block. In-loop filters (365) are applied to the reconstructed image. The filtered image is stored at a reference picture buffer (380).
As mentioned, in HEVC, motion compensated temporal prediction is employed to exploit the redundancy that exists between successive pictures of a video. To do that, a motion vector is associated with each prediction unit (PU). As explained above, each CTU is represented by a Coding Tree in the compressed domain. This is a quad-tree division of the CTU, where each leaf is called a Coding Unit (CU) and is also illustrated in
In HEVC, one motion vector is assigned to each PU. This motion vector is used for motion compensated temporal prediction of the considered PU. Therefore, in HEVC, the motion model that links a predicted block and its reference block simply consists of a translation or calculation based on the reference block and the corresponding motion vector.
To make improvements to HEVC, the reference software and/or documentation JEM (Joint Exploration Model) is being developed by the Joint Video Exploration Team (JVET). In one JEM version (e.g., “Algorithm Description of Joint Exploration Test Model 5”, Document JVET-E1001_v2, Joint Video Exploration Team of ISO/IEC JTC1/SC29/WG11, 5th meeting, 12-20 Jan. 2017, Geneva, CH), some further motion models are supported to improve temporal prediction. To do so, a PU can be spatially divided into sub-PUs and a model can be used to assign each sub-PU a dedicated motion vector.
In more recent versions of the JEM (e.g., “Algorithm Description of Joint Exploration Test Model 2”, Document JVET-B1001_v3, Joint Video Exploration Team of ISO/IEC JTC1/SC29/WG11, 2nd meeting, 20-26 Feb. 2016, San Diego, USA), a CU is no longer specified to be divided into PUs or TUs. Instead, more flexible CU sizes may be used, and some motion data are directly assigned to each CU. In this new codec design under the newer versions of JEM, a CU may be divided into sub-CUs and a motion vector may be computed for each sub-CU of the divided CU.
One of the new motion models introduced in the JEM is the use of an affine model as the motion model to represent the motion vectors in a CU. The motion model used is illustrated by
wherein {right arrow over (v0)}(v0x,v0y) and {right arrow over (v1)}(v1x,v1y) are the control point motion vectors (CPMVs) used to generate the corresponding motion field, (v0x,v0y) corresponds to the control point motion vector of the top-left corner of the block being encoded or decoded, (v1x,v1y) corresponds to the control point motion vector of the top-right corner of the block being encoded or decoded, and w is the width of the block being encoded or decoded.
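As a non-limiting illustration, the following sketch evaluates the motion vector at a sample position (x, y) of a block from its two CPMVs. Equation 1 is not reproduced in the present text, so the sketch assumes the usual four-parameter form (rotation plus zoom) of the JEM affine model; the function name `affine_mv` and the tuple representation of motion vectors are hypothetical and chosen for illustration only.

```python
from typing import Tuple

MV = Tuple[float, float]  # (horizontal, vertical) components of a motion vector


def affine_mv(v0: MV, v1: MV, w: float, x: float, y: float) -> MV:
    """Motion vector at position (x, y) inside a block of width w.

    Assumes the four-parameter affine form:
        vx = a*x - b*y + v0x
        vy = b*x + a*y + v0y
    with a = (v1x - v0x)/w and b = (v1y - v0y)/w.
    """
    a = (v1[0] - v0[0]) / w  # zoom-like term derived from the two CPMVs
    b = (v1[1] - v0[1]) / w  # rotation-like term derived from the two CPMVs
    return (a * x - b * y + v0[0], b * x + a * y + v0[1])


# The model reproduces the CPMVs at the corners they are attached to.
top_left = affine_mv((1.0, 2.0), (3.0, 4.0), 16, 0, 0)
top_right = affine_mv((1.0, 2.0), (3.0, 4.0), 16, 16, 0)
```

Under this assumption, two identical CPMVs yield a purely translational field, which corresponds to the HEVC motion model described above.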
To reduce complexity, a motion vector is computed for each 4×4 sub-block (sub-CU) of the considered CU 700, as illustrated in
Affine motion compensation may be used in 2 ways in the JEM: Affine AMVP (AF_AMVP) mode and Affine Merge mode. They are introduced in the following sections.
Affine AMVP mode: A CU in AMVP mode, whose size is larger than 8×8, may be predicted in Affine AMVP mode. This is signaled through a flag in the bit-stream. The generation of the Affine Motion Field for that AMVP CU includes determining control point motion vectors (CPMVs), which are obtained by the encoder or decoder through the addition of a motion vector differential and a control point motion vector prediction (CPMVP). The CPMVPs are a pair of motion vector candidates, respectively taken from the set (A, B, C) and (D, E) illustrated in
Affine Merge mode: In Affine Merge mode, a CU-level flag indicates if a merge CU employs affine motion compensation. If so, then the first available neighboring CU that has been coded in an Affine mode is selected among the ordered set of candidate positions A, B, C, D, E of
Once the first neighboring CU in Affine mode is obtained, then the 3 CPMVs {right arrow over (v2)}, {right arrow over (v3)}, and {right arrow over (v4)} from the top-left, top-right and bottom-left corners of the neighboring affine CU are retrieved or calculated. For example,
When the control point motion vectors {right arrow over (v0)} and {right arrow over (v1)} of the current CU are obtained, the motion field inside the current CU being encoded or decoded is computed on a 4×4 sub-CU basis, through the model of Equation 1 as described above in connection with
Accordingly, a general aspect of at least one embodiment aims to improve the performance of the Affine Merge mode in JEM so that the compression performance of a considered video codec may be improved. Therefore, in at least one embodiment, an improved affine motion compensation apparatus and method are presented for Coding/Decoding Units that are coded in Affine Merge mode. The proposed improved affine mode includes determining a set of predictor candidates in the Affine Merge mode regardless of whether the neighboring CU is coded in Affine mode or not.
As discussed before, in the current JEM, the first neighboring CU coded in Affine mode among the surrounding CUs is selected to predict the affine motion model associated with the current CU being encoded or decoded in Affine Merge mode. That is, the first neighboring CU candidate among the ordered set (A, B, C, D, E) of
Accordingly, at least one embodiment improves the Affine Merge prediction candidate, therefore providing the best coding efficiency when coding the current CU in Affine Merge mode, by creating new motion model candidates from motion vectors of neighbor blocks used as CPMVP. Corresponding new motion model candidates created from motion vectors of neighbor blocks used as CPMVP are also determined when decoding in merge to obtain the predictor from the signaled index. The improvements of this embodiment, at a general level, therefore comprise, for example:
Although described for an encoding/decoding method based on merge mode, the present principles also apply to AMVP (i.e., Affine_Inter) mode. Advantageously, the various embodiments for creation of predictor candidates are unambiguously derivable for Affine AMVP.
The present principles are advantageously implemented in an encoder in the motion estimation module 175 and the motion compensation module 170 of
Accordingly,
At 1024, an optional second selection among the candidate CPMVPs is applied based on one or more criteria. According to a first criterion, candidate CPMVPs are further checked for validity using Equation 3, for a block to encode of height H and width W and where X and Y are respectively the horizontal and vertical components of a motion vector:
According to this variant, at 1025, these valid CPMVPs are stored in the set of CPMVP candidates for merge mode until the set of predictor candidates is full. According to a second criterion, valid candidate CPMVPs are then sorted depending on the value of the bottom-left motion vector {right arrow over (v2)} (taken from position F or G). The closer {right arrow over (v2)} is to the vector given by the affine motion model for the 4×4 sub-block at the same position as {right arrow over (v2)}, the better the CPMVP. This constraint is for instance implemented with the following Equation 4:
Equation 4:

$$
\begin{aligned}
\overrightarrow{\Delta_{Hor}} &= \vec{v_1} - \vec{v_0} \\
\overrightarrow{\Delta_{Ver}} &= \vec{v_2} - \vec{v_0} \\
\text{cost} &= \left| \Delta_{Hor,X} \cdot H - \Delta_{Ver,Y} \cdot W \right| + \left| \Delta_{Hor,Y} \cdot H + \Delta_{Ver,X} \cdot W \right|
\end{aligned}
$$

Cost computed for each candidate CPMVP, where the subscripts X and Y denote the horizontal and vertical components of the corresponding vector.
In case of bi-directional prediction, a cost is computed for each CPMVP and each reference picture list L0 and L1 using Equation 4. To compare unidirectional predictor candidates with bidirectional predictor candidates, for each bidirectional predictor candidate the cost of the CPMVP is the mean of its list L0 related CPMVP cost and of its list L1 related CPMVP cost. According to this variant, at 1025, the ordered set of valid CPMVPs is stored in the set of CPMVP candidates for merge mode for each reference picture list L0 and L1.
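As a non-limiting illustration, the cost of Equation 4 and the averaging used for bidirectional candidates may be sketched as follows. The function names `cpmvp_cost` and `candidate_cost`, and the use of `None` to mark an unused reference picture list, are hypothetical and chosen for illustration only.

```python
from typing import Optional, Tuple

MV = Tuple[float, float]  # (X, Y) = (horizontal, vertical) components


def cpmvp_cost(v0: MV, v1: MV, v2: MV, width: int, height: int) -> float:
    """Cost of Equation 4 for one candidate and one reference picture list."""
    d_hor = (v1[0] - v0[0], v1[1] - v0[1])  # Delta_Hor = v1 - v0
    d_ver = (v2[0] - v0[0], v2[1] - v0[1])  # Delta_Ver = v2 - v0
    return (abs(d_hor[0] * height - d_ver[1] * width)
            + abs(d_hor[1] * height + d_ver[0] * width))


def candidate_cost(cost_l0: Optional[float], cost_l1: Optional[float]) -> float:
    """Cost of a candidate: the single list cost when unidirectional,
    the mean of the L0 and L1 costs when bidirectional."""
    costs = [c for c in (cost_l0, cost_l1) if c is not None]
    if not costs:
        raise ValueError("candidate uses neither list L0 nor list L1")
    return sum(costs) / len(costs)


# A lower cost means v2 agrees better with the model spanned by (v0, v1).
consistent = cpmvp_cost((0, 0), (4, 0), (0, 4), 16, 16)
inconsistent = cpmvp_cost((0, 0), (4, 0), (0, 0), 16, 16)
```

Sorting the valid candidates by increasing `candidate_cost` would then give the ordered set referred to above.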
According to a second aspect of at least one embodiment of determining a set of predictor candidates for merge mode 1020 of the encoding method 1000 as illustrated on
Besides, considering pairs among top-left list {A, B, C} and bottom-left list {F, G}, the 2 control point motion vectors {right arrow over (v0)} and {right arrow over (v2)} of the current CU are obtained based on motion information of the respective top-left corner and bottom-left corner neighboring blocks of the pair. Based on the obtained {right arrow over (v0)} and {right arrow over (v2)}, we can derive the following equation 6 to compute {right arrow over (v1)} from {right arrow over (v0)} and {right arrow over (v2)}
As the standard Affine motion model, as described with equation 1 and illustrated on
Besides, it should be noted that the second aspect based on pairs of neighboring blocks for determining 2 CPMVs is not compatible with the evaluating 1024 of the predictor candidates based on their respective CPMVs as the third CPMV (according to Equation 5 or Equation 6) is not determined independently of the first 2 CPMVs. Thus, the validity check of Equation 3 and the cost function of Equation 4 are skipped and the CPMVP determined at 1023 are added to the set of predictor candidates for affine merge mode without any sorting.
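As a non-limiting illustration of why the third CPMV is not independent of the first two, the following sketch derives it under the assumption that the block follows the four-parameter affine model (rotation plus zoom). Equations 5 and 6 are not reproduced in the present text, so the expressions below are an assumed derivation from that model rather than a transcription; the function names `derive_v2` and `derive_v1` are hypothetical.

```python
from typing import Tuple

MV = Tuple[float, float]  # (horizontal, vertical) components


def derive_v2(v0: MV, v1: MV, w: float, h: float) -> MV:
    """Bottom-left CPMV from top-left v0 and top-right v1 (assumed model)."""
    return (v0[0] - (v1[1] - v0[1]) * h / w,
            v0[1] + (v1[0] - v0[0]) * h / w)


def derive_v1(v0: MV, v2: MV, w: float, h: float) -> MV:
    """Top-right CPMV from top-left v0 and bottom-left v2 (assumed model)."""
    return (v0[0] + (v2[1] - v0[1]) * w / h,
            v0[1] - (v2[0] - v0[0]) * w / h)


# Round trip on a 16x8 block: deriving v2 and then v1 gives back the input v1.
v0, v1, w, h = (1.0, 2.0), (3.0, 6.0), 16, 8
v2 = derive_v2(v0, v1, w, h)
v1_back = derive_v1(v0, v2, w, h)
```

With a third CPMV derived in this way, both absolute-value terms of the Equation 4 cost vanish by construction, so that cost cannot discriminate between such candidates.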
Besides, according to a variant of the second aspect based on pairs of neighboring blocks for determining 2 CPMVs, the use of bi-directional affine merge candidates is favored over uni-directional affine merge candidates, by adding the bi-directional candidates first in the set of predictor candidates. This ensures having the maximum number of bidirectional candidates added in the set of predictor candidates for affine merge mode.
In a variation of the first and second aspects of the at least one embodiment of determining a set of predictor candidates for merge mode 1020, only affine neighbor blocks are used to create new affine motion candidates. In other words, determining top-left CPMV, top-right CPMV and bottom-left CPMV is based on motion information of the respective top-left neighboring blocks, top-right neighboring blocks and bottom-left neighboring blocks. In this case, motion information associated with the at least one spatial neighboring block comprises only affine motion information. This variation differs from the JEM affine merge mode in that the motion model of the selected affine neighbor block is expanded to the block to encode.
In another variation of the first and second aspects of the at least one embodiment of determining a set of predictor candidates for merge mode 1020, at 1030 a new affine motion model based on top-left CPMV {right arrow over (v0)} and bottom-left CPMV {right arrow over (v2)} is defined for determining the motion field instead of the standard Affine motion model based on top-left CPMV {right arrow over (v0)} and top-right CPMV {right arrow over (v1)}.
This variation is particularly well adapted to the embodiment based on pairs of top-left and bottom-left neighboring blocks, wherein top-left and bottom-left CPMVs {right arrow over (v0)} and {right arrow over (v2)} are determined for the predictor candidate. Advantageously, there is no need to compute the top-right CPMV {right arrow over (v1)}; the affine motion model directly uses {right arrow over (v2)} as a CPMV.
Besides, this variation provides additional CPMVPs for the set of predictor candidates in affine merge mode: both CPMVPs comprising CPMV ({right arrow over (v0)}, {right arrow over (v1)}) and CPMVPs comprising CPMV ({right arrow over (v0)}, {right arrow over (v2)}) are added to the set of predictor candidates. At 1040, the rate-distortion competition occurs between predictor candidates based on ({right arrow over (v0)}, {right arrow over (v1)}) CPMV and predictor candidates based on ({right arrow over (v0)}, {right arrow over (v2)}) CPMV. Unlike in some aspects of the described embodiment where the derivation of CPMVP candidates for Affine merge mode uses vector {right arrow over (v2)} only for computing the cost for the CPMVP, in some cases a better prediction may be obtained by using ({right arrow over (v0)}, {right arrow over (v2)}) as CPMV. Accordingly, such a motion model needs to define which CPMVs are used. In this case a flag is added in the bitstream to indicate whether ({right arrow over (v0)}, {right arrow over (v1)}) or ({right arrow over (v0)}, {right arrow over (v2)}) is used as Control Point Motion Vector.
control_point_horizontal_I0_flag[x0][y0] specifies the control point used for list L0 for the block being encoded, i.e., the current prediction block. The array indices x0, y0 specify the location (x0, y0) of the top-left luma sample of the considered prediction block relative to the top-left luma sample of the picture. If the flag is equal to 1, ({right arrow over (v0)}, {right arrow over (v1)}) are used as CPMV, else ({right arrow over (v0)}, {right arrow over (v2)}) are used.
In another variation of the first and second aspects of the at least one embodiment of determining a set of predictor candidates for merge mode 1020, at 1030 the new affine motion model based on top-left CPMV {right arrow over (v0)} and bottom-left CPMV {right arrow over (v2)} or the standard affine motion model based on top-left CPMV {right arrow over (v0)} and top-right CPMV {right arrow over (v1)} is implicitly derived from surrounding available information. For example, to have more precision for CPMV, we may use ({right arrow over (v0)}, {right arrow over (v1)}) if the width of the block is greater than the height and use ({right arrow over (v0)}, {right arrow over (v2)}) if the height of the block is greater than the width.
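As a non-limiting illustration, the implicit choice between the two CPMV pairs and the evaluation of a motion field from ({right arrow over (v0)}, {right arrow over (v2)}) may be sketched as follows. The form of the (v0, v2) model is an assumption, obtained by rewriting the four-parameter model in terms of the bottom-left CPMV and the block height; the behavior for square blocks is not specified above and is likewise an assumption. The function names are hypothetical.

```python
from typing import Tuple

MV = Tuple[float, float]  # (horizontal, vertical) components


def select_cpmv_pair(width: int, height: int) -> str:
    """Pick the CPMV pair from the block shape.

    Square blocks fall back to (v0, v1); this tie rule is an assumption.
    """
    return "v0_v2" if height > width else "v0_v1"


def affine_mv_from_v0_v2(v0: MV, v2: MV, h: float, x: float, y: float) -> MV:
    """Motion vector at (x, y) from top-left v0 and bottom-left v2.

    Assumed four-parameter form, with the parameters taken along the
    vertical edge: a = (v2y - v0y)/h and b = -(v2x - v0x)/h.
    """
    a = (v2[1] - v0[1]) / h
    b = -(v2[0] - v0[0]) / h
    return (a * x - b * y + v0[0], b * x + a * y + v0[1])


# For a tall 8x16 block the (v0, v2) pair is chosen, and the model
# reproduces v2 at the bottom-left corner (0, h).
pair = select_cpmv_pair(8, 16)
bottom_left = affine_mv_from_v0_v2((1.0, 2.0), (-1.0, 3.0), 8, 0, 8)
```

Choosing the control points along the longer edge spreads them further apart, which is one way to read the stated aim of having more precision for the CPMVs.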
In at least one implementation, a residual flag is used. At 1550, a flag is activated indicating that the coding is done with residual data. At 1560, the current CU is fully coded and reconstructed (with residual) giving the corresponding RD cost. Then the flag is deactivated indicating that the coding is done without residual data, and the process goes back to 1560 where the CU is coded (without residual) giving the corresponding RD cost. The lowest RD cost between the two previous ones indicates if residual must be coded or not (normal or skip). Then this best RD cost is put in competition with other coding modes. Rate distortion determination will be explained in more detail below.
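As a non-limiting illustration, the decision between coding with and without residual may be sketched as a comparison of the two RD costs obtained at 1560. The function name `choose_residual_mode`, the mode labels, and the preference for the normal mode when both costs are equal are assumptions made for illustration only.

```python
from typing import Tuple


def choose_residual_mode(rd_cost_with_residual: float,
                         rd_cost_without_residual: float) -> Tuple[str, float]:
    """Return the selected mode and its RD cost.

    "normal" means the residual is coded, "skip" means it is not.
    Ties are resolved in favor of "normal" (assumption).
    """
    if rd_cost_without_residual < rd_cost_with_residual:
        return ("skip", rd_cost_without_residual)
    return ("normal", rd_cost_with_residual)


# The returned cost is the one put in competition with other coding modes.
mode, best_cost = choose_residual_mode(120.0, 95.5)
```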
The present inventors have recognized that one aspect of the existing Affine Merge process described above is that it systematically employs one and only one motion vector predictor to propagate an affine motion field from a surrounding past and neighboring CU towards a current CU. In various situations, the present inventors have further recognized that this aspect can be disadvantageous because, for example, it does not select the optimal motion vector predictor. Moreover, the choice of this predictor consists only of the first past and neighboring CU coded in Affine mode, in the ordered set (A, B, C, D, E), as already noted before. In various situations, the present inventors have further recognized that this limited choice can be disadvantageous because, for example, a better predictor might be available. Therefore, the existing process in the current JEM does not consider the fact that several potential past and neighboring CUs around the current CU may also have used affine motion, and that a different CU other than the first one found to have used affine motion may be a better predictor for the current CU's motion information.
Therefore, the present inventors have recognized the potential advantages in several ways to improve the prediction of the current CU affine motion vectors that are not being exploited by the existing JEM codecs.
Various embodiments of the system 1800 include at least one processor 1810 configured to execute instructions loaded therein for implementing the various processes as discussed above. The processor 1810 may include embedded memory, input output interface, and various other circuitries as known in the art. The system 1800 may also include at least one memory 1820 (e.g., a volatile memory device, a non-volatile memory device). The system 1800 may additionally include a storage device 1840, which may include non-volatile memory, including, but not limited to, EEPROM, ROM, PROM, RAM, DRAM, SRAM, flash, magnetic disk drive, and/or optical disk drive. The storage device 1840 may comprise an internal storage device, an attached storage device, and/or a network accessible storage device, as non-limiting examples. The system 1800 may also include an encoder/decoder module 1830 configured to process data to provide encoded video and/or decoded video, and the encoder/decoder module 1830 may include its own processor and memory.
The encoder/decoder module 1830 represents the module(s) that may be included in a device to perform the encoding and/or decoding functions. As is known, such a device may include one or both of the encoding and decoding modules. Additionally, the encoder/decoder module 1830 may be implemented as a separate element of the system 1800 or may be incorporated within one or more processors 1810 as a combination of hardware and software as known to those skilled in the art.
Program code to be loaded onto one or more processors 1810 to perform the various processes described hereinabove may be stored in the storage device 1840 and subsequently loaded onto the memory 1820 for execution by the processors 1810. In accordance with the exemplary embodiments, one or more of the processor(s) 1810, the memory 1820, the storage device 1840, and the encoder/decoder module 1830 may store one or more of the various items during the performance of the processes discussed herein above, including, but not limited to the input video, the decoded video, the bitstream, equations, formulas, matrices, variables, operations, and operational logic.
The system 1800 may also include a communication interface 1850 that enables communication with other devices via a communication channel 1860. The communication interface 1850 may include, but is not limited to, a transceiver configured to transmit and receive data from the communication channel 1860. The communication interface 1850 may include, but is not limited to, a modem or network card and the communication channel 1860 may be implemented within a wired and/or wireless medium. The various components of the system 1800 may be connected or communicatively coupled together (not shown in
The exemplary embodiments may be carried out by computer software implemented by the processor 1810 or by hardware, or by a combination of hardware and software. As a non-limiting example, the exemplary embodiments may be implemented by one or more integrated circuits. The memory 1820 may be of any type appropriate to the technical environment and may be implemented using any appropriate data storage technology, such as optical memory devices, magnetic memory devices, semiconductor-based memory devices, fixed memory, and removable memory, as non-limiting examples. The processor 1810 may be of any type appropriate to the technical environment, and may encompass one or more of microprocessors, general purpose computers, special purpose computers, and processors based on a multi-core architecture, as non-limiting examples.
The implementations described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed may also be implemented in other forms (for example, an apparatus or a program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus such as, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants (“PDAs”), and other devices that facilitate communication of information between end-users.
Furthermore, one skilled in the art may readily appreciate that the exemplary HEVC encoder 100 shown in
Reference to “one embodiment” or “an embodiment” or “one implementation” or “an implementation”, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of the phrase “in one embodiment” or “in an embodiment” or “in one implementation” or “in an implementation”, as well as any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
Additionally, this application or its claims may refer to “determining” various pieces of information. Determining the information may include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.
Further, this application or its claims may refer to “accessing” various pieces of information. Accessing the information may include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
Additionally, this application or its claims may refer to “receiving” various pieces of information. Receiving is, as with “accessing”, intended to be a broad term. Receiving the information may include one or more of, for example, accessing the information, or retrieving the information (for example, from memory). Further, “receiving” is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
As will be evident to one of skill in the art, implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted. The information may include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal may be formatted to carry the bitstream of a described embodiment. Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links, as is known. The signal may be stored on a processor-readable medium.
Number | Date | Country | Kind |
---|---|---|---|
17306335 | Oct 2017 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2018/054300 | 10/4/2018 | WO |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2019/070933 | 4/11/2019 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
20180098063 | Chen | Apr 2018 | A1 |
20180192047 | Lv et al. | Jul 2018 | A1 |
20180205965 | Chen et al. | Jul 2018 | A1 |
Number | Date | Country |
---|---|---|
105163116 | Dec 2015 | CN |
106559669 | Apr 2017 | CN |
3264762 | Jan 2018 | EP |
3422720 | Jan 2019 | EP |
2016141609 | Sep 2016 | WO |
2017118409 | Jul 2017 | WO |
WO 2017148345 | Sep 2017 | WO |
2019002215 | Jan 2019 | WO |
Entry |
---|
Anonymous, “Affine transform prediction for next generation video coding”, Study Group 16—Contribution 1016, Huawei Technologies Co., Ltd., International Telecommunication Union Telecommunication Standardization Sector, Study Period 2013-2016, COM xxx-C1016-E, Oct. 2015, 11 pages. |
Chen et al., “Algorithm Description of Joint Exploration Test Model 5 (JEM 5)”, Joint Video Exploration Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, Document JVET-E1001-v2, 5th meeting, Geneva, Switzerland, Jan. 12, 2017, 41 pages. |
Chen et al., “Algorithm Description of Joint Exploration Test Model 2”, Joint Video Exploration Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, Document JVET-B1001-v3, 2nd Meeting, San Diego, California, USA, Feb. 20, 2016, 32 pages. |
Chen et al., “Algorithm Description of Joint Exploration Test Model 6 (JEM 6)”, Joint Video Exploration Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, Document JVET-F1001-v2, 6th Meeting, Hobart, Australia, Mar. 31, 2017, 49 pages. |
Huang et al., “Affine SKIP and DIRECT Modes for Efficient Video Coding”, 2012 Conference on Visual Communications and Image Processing, San Diego, California, USA, Nov. 27, 2012, 6 pages. |
Li et al., “An Affine Motion Compensation Framework for High Efficiency Video Coding”, 2015 IEEE International Symposium on Circuits and Systems (ISCAS), Lisbon, Portugal, May 24, 2015, pp. 525-528. |
Anonymous, “Reference software for ITU-T H.265 high efficiency video coding”, International Telecommunication Union, ITU-T Telecommunication Standardization Sector of ITU, Series H: Audiovisual and Multimedia Systems, Infrastructure of audiovisual services—Coding of moving video, Recommendation ITU-T H.265.2, Oct. 2014, pp. 1-12. |
Chen, et al., “Algorithm Description of Joint Exploration Test Model 7 (JEM 7)”, JVET-G1001-V1, Editors, Joint Video Exploration Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, 7th Meeting: Torino, IT, Jul. 13-21, 2017, 50 pages. |
Chen, et al., “Improved Affine Motion Vector Coding”, JVET-D0128, Qualcomm Inc., Joint Video Exploration Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, 4th Meeting: Chengdu, CN, Oct. 15-21, 2016, pp. 1-5. |
Huawei Technologies, “Affine transform prediction for next generation video coding”, ITU-T SG16 Meeting; Dec. 10, 2015-Oct. 23, 2015; Geneva, No. T13-SG16-C-1016, XP030100743, Sep. 29, 2015, pp. 1-11. |
Zou, et al., “EE4: Improved Affine Motion Prediction”, JVET-D0121, Qualcomm Incorporated, Joint Video Exploration Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, 4th Meeting: Chengdu, CN, Oct. 15-21, 2016, pp. 1-5. |
Number | Date | Country | |
---|---|---|---|
20200288163 A1 | Sep 2020 | US |