This patent document relates to generation, storage, and consumption of digital audio video media information in a file format.
Digital video accounts for the largest bandwidth used on the Internet and other digital communication networks. As the number of connected user devices capable of receiving and displaying video increases, the bandwidth demand for digital video usage is likely to continue to grow.
A first aspect relates to a method for processing video data comprising: determining to split a coding tree unit (CTU) into one or more coding units (CUs); determining to recursively split the CUs into prediction units (PUs), wherein one or more of the CUs are one or more prediction tree units (PTUs); and performing a conversion between a visual media data and a bitstream based on the PUs.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that at least one PTU is a leaf PU, and wherein a leaf PU is not further split.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that at least one PTU is further split into a plurality of PUs.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that at least one of the PUs is further split into a plurality of PUs.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that leaf PUs are not further split, and different leaf PUs from a common PTU have different prediction modes.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that residual from a plurality of the PUs is transform coded in a single transform unit (TU).
Optionally, in any of the preceding aspects, another implementation of the aspect provides that the PTUs are split into: four PUs by a quad tree (QT) split, two PUs by a vertical binary tree (BT) split, two PUs by a horizontal BT split, three PUs by a vertical ternary tree (TT) split, three PUs by a horizontal TT split, four PUs by a vertical unsymmetrical quad tree (UQT) split, four PUs by a horizontal UQT split, two PUs by a vertical unsymmetrical binary tree (UBT) split, two PUs by a horizontal UBT split, four PUs by a vertical extended quad tree (EQT) split, four PUs by a horizontal EQT split, or combinations thereof.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that one or more PUs are split into: four PUs by a quad tree (QT) split, two PUs by a vertical binary tree (BT) split, two PUs by a horizontal BT split, three PUs by a vertical ternary tree (TT) split, three PUs by a horizontal TT split, four PUs by a vertical unsymmetrical quad tree (UQT) split, four PUs by a horizontal UQT split, two PUs by a vertical unsymmetrical binary tree (UBT) split, two PUs by a horizontal UBT split, four PUs by a vertical extended quad tree (EQT) split, four PUs by a horizontal EQT split, or combinations thereof.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that the bitstream comprises syntax indicating splits applied to the PUs and PTUs.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that the bitstream comprises a syntax element indicating whether a corresponding PTU is further split into multiple PUs.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that the bitstream comprises a syntax element indicating whether a corresponding PU is further split into multiple PUs.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that the bitstream comprises syntax indicating a split pattern and split direction for a PTU or for a PU.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that syntax indicating a split pattern and split direction is only signaled for a PTU or for a PU when the PTU or PU is further split.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that a depth is calculated for a PU or a PTU.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that the depth is a QT depth indicating a number of times an ancestor video unit is split by a QT.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that the depth is a multiple-type-tree (MTT) depth indicating a number of times an ancestor video unit is split by any split type.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that the depth is initialized to a depth of a CU corresponding to the PTU or PU.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that a split of a current video unit is not included in the bitstream and is inferred by a decoder, and wherein the current video unit is a PU or a PTU.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that the split is inferred according to: a current video unit dimension, a current video unit depth, a current video unit position relative to a picture boundary, a current video unit position relative to a sub-picture boundary, whether the current video unit can be further split, a current video unit depth relative to a depth threshold, a current video unit height relative to a height threshold, a current video unit width relative to a width threshold, or combinations thereof.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that the split is disallowed by a comparison of: a current video unit height relative to a height threshold, a current video unit width relative to a width threshold, a current video unit height and a current video unit width relative to a size threshold, a current video unit depth relative to a depth threshold, a current video unit size relative to the size threshold, or combinations thereof.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that the conversion includes encoding the visual media data into the bitstream.
Optionally, in any of the preceding aspects, another implementation of the aspect provides that the conversion includes decoding the bitstream to obtain the visual media data.
A second aspect relates to an apparatus for processing video data comprising: a processor; and a non-transitory memory with instructions thereon, wherein the instructions, upon execution by the processor, cause the processor to perform the method of any of the preceding aspects.
A third aspect relates to a non-transitory computer readable medium comprising a computer program product for use by a video coding device, the computer program product comprising computer executable instructions stored on the non-transitory computer readable medium such that, when executed by a processor, the instructions cause the video coding device to perform the method of any of the preceding aspects.
A fourth aspect relates to a non-transitory computer-readable recording medium storing a bitstream of a video which is generated by a method performed by a video processing apparatus, wherein the method comprises: determining to recursively split one or more coding units (CUs) into prediction units (PUs); and generating the bitstream based on the determining.
A fifth aspect relates to a method for storing a bitstream of a video comprising: determining to recursively split one or more coding units (CUs) into prediction units (PUs); generating a bitstream based on the determining; and storing the bitstream in a non-transitory computer-readable recording medium.
For the purpose of clarity, any one of the foregoing embodiments may be combined with any one or more of the other foregoing embodiments to create a new embodiment within the scope of the present disclosure.
These and other features will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings and claims.
For a more complete understanding of this disclosure, reference is now made to the following brief description, taken in connection with the accompanying drawings and detailed description, wherein like reference numerals represent like parts.
It should be understood at the outset that although illustrative implementations of one or more embodiments are provided below, the disclosed systems and/or methods may be implemented using any number of techniques, whether currently known or yet to be developed. The disclosure should in no way be limited to the illustrative implementations, drawings, and techniques illustrated below, including the exemplary designs and implementations illustrated and described herein, but may be modified within the scope of the appended claims along with their full scope of equivalents.
This document is related to image/video coding, and more particularly to partitioning of a picture. The disclosed mechanisms may be applied to the video coding standards such as High Efficiency Video Coding (HEVC) and/or Versatile Video Coding (VVC). Such mechanisms may also be applicable to other video coding standards and/or video codecs.
Video coding standards have evolved primarily through the development of the International Telecommunication Union (ITU) Telecommunication Standardization Sector (ITU-T) and International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) standards. The ITU-T produced the H.261 and H.263 standards, ISO/IEC produced the Motion Picture Experts Group (MPEG) phase one (MPEG-1) and MPEG phase four (MPEG-4) Visual standards, and the two organizations jointly produced the H.262/MPEG phase two (MPEG-2) Video standard, the H.264/MPEG-4 Advanced Video Coding (AVC) standard, and the H.265/High Efficiency Video Coding (HEVC) standard. Since H.262, video coding standards have been based on a hybrid video coding structure that utilizes temporal prediction plus transform coding.
The video signal 101 is a captured video sequence that has been partitioned into blocks of pixels by a coding tree. A coding tree employs various split modes to subdivide a block of pixels into smaller blocks of pixels. These blocks can then be further subdivided into smaller blocks. The blocks may be referred to as nodes on the coding tree. Larger parent nodes are split into smaller child nodes. The number of times a node is subdivided is referred to as the depth of the node/coding tree. The divided blocks can be included in coding units (CUs) in some cases. For example, a CU can be a sub-portion of a CTU that contains a luma block, red difference chroma (Cr) block(s), and blue difference chroma (Cb) block(s), along with corresponding syntax instructions for the CU. The split modes may include a binary tree (BT), a ternary tree (TT), and a quad tree (QT) employed to partition a node into two, three, or four child nodes, respectively, of varying shapes depending on the split mode employed. The video signal 101 is forwarded to the general coder control component 111, the transform scaling and quantization component 113, the intra-picture estimation component 115, the filter control analysis component 127, and the motion estimation component 121 for compression.
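To illustrate the recursive coding tree described above, the following sketch partitions a block into two, three, or four child nodes; the Node structure, split names, and sizes are illustrative assumptions and not the data structures of codec 100.

    # Illustrative sketch of recursive coding-tree partitioning; not the codec 100
    # implementation. Each node is a block that may be split by QT, BT, or TT.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Node:
        x: int
        y: int
        w: int
        h: int
        depth: int = 0                      # number of times an ancestor was split
        split: Optional[str] = None         # None for a leaf node (a CU)
        children: List["Node"] = field(default_factory=list)

    def split_node(node: Node, mode: str) -> None:
        """Split a node into 4 (QT), 2 (vertical BT), or 3 (vertical TT) children."""
        x, y, w, h, d = node.x, node.y, node.w, node.h, node.depth + 1
        if mode == "QT":
            node.children = [Node(x, y, w // 2, h // 2, d),
                             Node(x + w // 2, y, w // 2, h // 2, d),
                             Node(x, y + h // 2, w // 2, h // 2, d),
                             Node(x + w // 2, y + h // 2, w // 2, h // 2, d)]
        elif mode == "BT_V":
            node.children = [Node(x, y, w // 2, h, d),
                             Node(x + w // 2, y, w // 2, h, d)]
        elif mode == "TT_V":
            node.children = [Node(x, y, w // 4, h, d),
                             Node(x + w // 4, y, w // 2, h, d),
                             Node(x + 3 * w // 4, y, w // 4, h, d)]
        node.split = mode

    # Example: split a 128x128 CTU by QT, then split its first child vertically by BT.
    ctu = Node(0, 0, 128, 128)
    split_node(ctu, "QT")
    split_node(ctu.children[0], "BT_V")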
The general coder control component 111 is configured to make decisions related to coding of the images of the video sequence into the bitstream according to application constraints. For example, the general coder control component 111 manages optimization of bitrate/bitstream size versus reconstruction quality. Such decisions may be made based on storage space/bandwidth availability and image resolution requests. The general coder control component 111 also manages buffer utilization in light of transmission speed to mitigate buffer underrun and overrun issues. To manage these issues, the general coder control component 111 manages partitioning, prediction, and filtering by the other components. For example, the general coder control component 111 may increase compression complexity to increase resolution and increase bandwidth usage or decrease compression complexity to decrease resolution and bandwidth usage. Hence, the general coder control component 111 controls the other components of codec 100 to balance video signal reconstruction quality with bit rate concerns. The general coder control component 111 creates control data, which controls the operation of the other components. The control data is also forwarded to the header formatting and CABAC component 131 to be encoded in the bitstream to signal parameters for decoding at the decoder.
The video signal 101 is also sent to the motion estimation component 121 and the motion compensation component 119 for inter prediction. A video unit (e.g., a picture, a slice, a CTU, etc.) of the video signal 101 may be divided into multiple blocks. Motion estimation component 121 and the motion compensation component 119 perform inter predictive coding of the received video block relative to one or more blocks in one or more reference pictures to provide temporal prediction. Codec 100 may perform multiple coding passes, e.g., to select an appropriate coding mode for each block of video data.
Motion estimation component 121 and motion compensation component 119 may be highly integrated, but are illustrated separately for conceptual purposes. Motion estimation, performed by motion estimation component 121, is the process of generating motion vectors, which estimate motion for video blocks. A motion vector, for example, may indicate the displacement of a coded object in a current block relative to a reference block. A reference block is a block that is found to closely match the block to be coded, in terms of pixel difference. Such pixel differences may be determined by sum of absolute difference (SAD), sum of square difference (SSD), or other difference metrics. HEVC employs several coded objects including a CTU, coding tree blocks (CTBs), and CUs. For example, a CTU can be divided into CTBs, which can then be divided into coding blocks (CBs) for inclusion in CUs. A CU can be encoded as a prediction unit (PU) containing prediction data and/or a transform unit (TU) containing transformed residual data for the CU. The motion estimation component 121 generates motion vectors, PUs, and TUs by using a rate-distortion analysis as part of a rate distortion optimization process. For example, the motion estimation component 121 may determine multiple reference blocks, multiple motion vectors, etc. for a current block/frame, and may select the reference blocks, motion vectors, etc. having the best rate-distortion characteristics. The best rate-distortion characteristics balance both quality of video reconstruction (e.g., amount of data loss by compression) with coding efficiency (e.g., size of the final encoding).
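As a rough illustration of the pixel-difference metrics and the rate-distortion style selection described above, the sketch below scores candidate reference blocks; the block shapes, the candidate representation, and the lambda value are assumptions made only for this example, not values used by the motion estimation component 121.

    # Sketch of block-matching metrics and a rate-distortion style candidate selection.
    # Shapes, the candidate set, and the lambda value are illustrative assumptions.
    import numpy as np

    def sad(block, ref):
        """Sum of absolute differences between two equally sized blocks."""
        return int(np.abs(block.astype(int) - ref.astype(int)).sum())

    def ssd(block, ref):
        """Sum of squared differences between two equally sized blocks."""
        return int(((block.astype(int) - ref.astype(int)) ** 2).sum())

    def best_candidate(block, candidates, lam=10.0):
        """Pick the (reference_block, bit_cost) pair minimizing D + lambda * R."""
        return min(candidates, key=lambda c: sad(block, c[0]) + lam * c[1])

    # Usage: two hypothetical reference blocks with different signaling costs.
    current = np.random.randint(0, 255, (8, 8), dtype=np.uint8)
    candidates = [(np.roll(current, 1, axis=1), 6), (np.roll(current, 2, axis=0), 3)]
    reference, bits = best_candidate(current, candidates)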
In some examples, codec 100 may calculate values for sub-integer pixel positions of reference pictures stored in decoded picture buffer component 123. For example, a video codec, such as codec 100, may interpolate values of one-quarter pixel positions, one-eighth pixel positions, or other fractional pixel positions of the reference picture. Therefore, motion estimation component 121 may perform a motion search relative to the full pixel positions and fractional pixel positions and output a motion vector with fractional pixel precision. The motion estimation component 121 calculates a motion vector for a PU of a video block in an inter-coded slice by comparing the position of the PU to the position of a reference block of a reference picture. Motion estimation component 121 outputs the calculated motion vector as motion data to header formatting and CABAC component 131 for encoding and to the motion compensation component 119.
Motion compensation, performed by motion compensation component 119, may involve fetching or generating a reference block based on the motion vector determined by motion estimation component 121. Motion estimation component 121 and motion compensation component 119 may be functionally integrated, in some examples. Upon receiving the motion vector for the PU of the current video block, motion compensation component 119 may locate the reference block to which the motion vector points. A residual video block is then formed by subtracting pixel values of the reference block from the pixel values of the current block being coded, forming pixel difference values. In general, motion estimation component 121 performs motion estimation relative to luma components, and motion compensation component 119 uses motion vectors calculated based on the luma components for both chroma components and luma components. The reference block and residual block are forwarded to transform scaling and quantization component 113.
The video signal 101 is also sent to intra-picture estimation component 115 and intra-picture prediction component 117. As with motion estimation component 121 and motion compensation component 119, intra-picture estimation component 115 and intra-picture prediction component 117 may be highly integrated, but are illustrated separately for conceptual purposes. The intra-picture estimation component 115 and intra-picture prediction component 117 intra-predict a current block relative to blocks in a current picture, as an alternative to the inter prediction performed by motion estimation component 121 and motion compensation component 119 between pictures, as described above. In particular, the intra-picture estimation component 115 determines an intra-prediction mode to use to encode a current block. In some examples, intra-picture estimation component 115 selects an appropriate intra-prediction mode to encode a current block from multiple tested intra-prediction modes. The selected intra-prediction modes are then forwarded to the header formatting and CABAC component 131 for encoding.
For example, the intra-picture estimation component 115 calculates rate-distortion values using a rate-distortion analysis for the various tested intra-prediction modes, and selects the intra-prediction mode having the best rate-distortion characteristics among the tested modes. Rate-distortion analysis generally determines an amount of distortion (or error) between an encoded block and an original unencoded block that was encoded to produce the encoded block, as well as a bitrate (e.g., a number of bits) used to produce the encoded block. The intra-picture estimation component 115 calculates ratios from the distortions and rates for the various encoded blocks to determine which intra-prediction mode exhibits the best rate-distortion value for the block. In addition, intra-picture estimation component 115 may be configured to code depth blocks of a depth map using a depth modeling mode (DMM) based on rate-distortion optimization (RDO).
The intra-picture prediction component 117 may generate a residual block from the reference block based on the selected intra-prediction modes determined by intra-picture estimation component 115 when implemented on an encoder or read the residual block from the bitstream when implemented on a decoder. The residual block includes the difference in values between the reference block and the original block, represented as a matrix. The residual block is then forwarded to the transform scaling and quantization component 113. The intra-picture estimation component 115 and the intra-picture prediction component 117 may operate on both luma and chroma components.
The transform scaling and quantization component 113 is configured to further compress the residual block. The transform scaling and quantization component 113 applies a transform, such as a discrete cosine transform (DCT), a discrete sine transform (DST), or a conceptually similar transform, to the residual block, producing a video block comprising residual transform coefficient values. Wavelet transforms, integer transforms, sub-band transforms or other types of transforms could also be used. The transform may convert the residual information from a pixel value domain to a transform domain, such as a frequency domain. The transform scaling and quantization component 113 is also configured to scale the transformed residual information, for example based on frequency. Such scaling involves applying a scale factor to the residual information so that different frequency information is quantized at different granularities, which may affect final visual quality of the reconstructed video. The transform scaling and quantization component 113 is also configured to quantize the transform coefficients to further reduce bit rate. The quantization process may reduce the bit depth associated with some or all of the coefficients. The degree of quantization may be modified by adjusting a quantization parameter. In some examples, the transform scaling and quantization component 113 may then perform a scan of the matrix including the quantized transform coefficients. The quantized transform coefficients are forwarded to the header formatting and CABAC component 131 to be encoded in the bitstream.
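The transform and quantization step can be sketched as follows; the block size, the orthonormal DCT construction, and the quantization step size are illustrative assumptions rather than the exact scaling and quantization performed by the transform scaling and quantization component 113.

    # Sketch of a 2-D DCT followed by uniform quantization; parameters are illustrative.
    import numpy as np

    def dct_matrix(n):
        """Orthonormal DCT-II basis matrix of size n x n."""
        k = np.arange(n)
        basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
        basis[0, :] *= 1 / np.sqrt(2)
        return basis * np.sqrt(2.0 / n)

    def transform_and_quantize(residual, qstep):
        d = dct_matrix(residual.shape[0])
        coefficients = d @ residual @ d.T                   # forward 2-D transform
        return np.round(coefficients / qstep).astype(int)   # uniform quantization

    # Example: transform and quantize a hypothetical 8x8 residual block.
    residual = np.random.randn(8, 8) * 10
    levels = transform_and_quantize(residual, qstep=4.0)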
The scaling and inverse transform component 129 applies a reverse operation of the transform scaling and quantization component 113 to support motion estimation. The scaling and inverse transform component 129 applies inverse scaling, transformation, and/or quantization to reconstruct the residual block in the pixel domain, e.g., for later use as a reference block for another current block. The motion estimation component 121 and/or motion compensation component 119 may calculate a further reference block by adding the residual block back to a previous reference block for use in motion estimation of a later block/frame. Filters are applied to the reconstructed reference blocks to mitigate artifacts created during scaling, quantization, and transform. Such artifacts could otherwise cause inaccurate prediction (and create additional artifacts) when subsequent blocks are predicted.
The filter control analysis component 127 and the in-loop filters component 125 apply the filters to the residual blocks and/or to reconstructed picture blocks. For example, the transformed residual block from the scaling and inverse transform component 129 may be combined with a corresponding reference block from intra-picture prediction component 117 and/or motion compensation component 119 to reconstruct the original image block. The filters may then be applied to the reconstructed image block. In some examples, the filters may instead be applied to the residual blocks. As with other components in codec 100, the filter control analysis component 127 and the in-loop filters component 125 may be highly integrated, but are illustrated separately for conceptual purposes.
When operating as an encoder, the filtered reconstructed image block, residual block, and/or prediction block are stored in the decoded picture buffer component 123 for later use in motion estimation as discussed above. When operating as a decoder, the decoded picture buffer component 123 stores and forwards the reconstructed and filtered blocks toward a display as part of an output video signal. The decoded picture buffer component 123 may be any memory device capable of storing prediction blocks, residual blocks, and/or reconstructed image blocks.
The header formatting and CABAC component 131 receives the data from the various components of codec 100 and encodes such data into a coded bitstream for transmission toward a decoder. Specifically, the header formatting and CABAC component 131 generates various headers to encode control data, such as general control data and filter control data. Further, prediction data, including intra-prediction and motion data, as well as residual data in the form of quantized transform coefficient data are all encoded in the bitstream. The final bitstream includes all information desired by the decoder to reconstruct the original partitioned video signal 101. Such information may also include intra-prediction mode index tables (also referred to as codeword mapping tables), definitions of encoding contexts for various blocks, indications of most probable intra-prediction modes, an indication of partition information, etc. Such data may be encoded by employing entropy coding. For example, the information may be encoded by employing context adaptive variable length coding (CAVLC), CABAC, syntax-based context-adaptive binary arithmetic coding (SBAC), probability interval partitioning entropy (PIPE) coding, or another entropy coding technique. Following the entropy coding, the coded bitstream may be transmitted to another device (e.g., a video decoder) or archived for later transmission or retrieval.
In order to encode and/or decode a picture as described above, the picture is first partitioned.
Various features involved in hybrid video coding using HEVC are highlighted as follows. HEVC includes the CTU, which is analogous to the macroblock in AVC. The CTU has a size selected by the encoder and can be larger than a macroblock. The CTU includes a luma coding tree block (CTB), corresponding chroma CTBs, and syntax elements. The size of a luma CTB, denoted as L×L, can be chosen as L=16, 32, or 64 samples with the larger sizes resulting in better compression. HEVC then supports a partitioning of the CTBs into smaller blocks using a tree structure and quadtree-like signaling.
The quadtree syntax of the CTU specifies the size and positions of corresponding luma and chroma CBs. The root of the quadtree is associated with the CTU. Hence, the size of the luma CTB is the largest supported size for a luma CB. The splitting of a CTU into luma and chroma CBs is signaled jointly. One luma CB and two chroma CBs, together with associated syntax, form a coding unit (CU). A CTB may contain only one CU or may be split to form multiple CUs. Each CU has an associated partitioning into prediction units (PUs) and a tree of transform units (TUs). The decision of whether to code a picture area using inter picture or intra picture prediction is made at the CU level. A PU partitioning structure has a root at the CU level. Depending on the basic prediction-type decision, the luma and chroma CBs can then be further split in size and predicted from luma and chroma prediction blocks (PBs) according to modes 300. HEVC supports variable PB sizes from 64×64 down to 4×4 samples. As shown, modes 300 can split a CB of size M pixels by M pixels into an M×M block, an M/2×M block, an M×M/2 block, an M/2×M/2 block, an M/4×M (left) block, an M/4×M (right) block, an M×M/4 (up) block, and/or an M×M/4 (down) block. It should be noted that the modes 300 for splitting CBs into PBs are subject to size constraints. Further, only M×M and M/2×M/2 are supported for intra picture predicted CBs.
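As an illustration of modes 300, the sketch below enumerates the PB sizes produced from an M×M CB; the HEVC partition mode names are used only as labels, and the helper itself is an assumption made for illustration.

    # Illustrative enumeration of the CB-to-PB partition shapes of modes 300 for an
    # M x M coding block; each entry lists the (width, height) of the resulting PBs.
    def pb_shapes(m):
        return {
            "PART_2Nx2N": [(m, m)],                              # M x M
            "PART_Nx2N":  [(m // 2, m)] * 2,                     # M/2 x M
            "PART_2NxN":  [(m, m // 2)] * 2,                     # M x M/2
            "PART_NxN":   [(m // 2, m // 2)] * 4,                # M/2 x M/2
            "PART_nLx2N": [(m // 4, m), (3 * m // 4, m)],        # M/4 x M (left)
            "PART_nRx2N": [(3 * m // 4, m), (m // 4, m)],        # M/4 x M (right)
            "PART_2NxnU": [(m, m // 4), (m, 3 * m // 4)],        # M x M/4 (up)
            "PART_2NxnD": [(m, 3 * m // 4), (m, m // 4)],        # M x M/4 (down)
        }

    # Example: the PB sizes available for a 64 x 64 CB.
    shapes_64 = pb_shapes(64)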
A quadtree plus binary tree block structure with larger CTUs in Joint Exploration Model (JEM) is discussed below. The Joint Video Exploration Team (JVET) was founded by the Video Coding Experts Group (VCEG) and MPEG to explore video coding technologies beyond HEVC. JVET has adopted many improvements and included such improvements into a reference software named Joint Exploration Model (JEM).
The following parameters are defined for the QTBT partitioning scheme. The CTU size is the root node size of a quadtree, which is the same concept as in HEVC. Minimum quad tree size (MinQTSize) is the minimum allowed quadtree leaf node size. Maximum binary tree size (MaxBTSize) is the maximum allowed binary tree root node size. Maximum binary tree depth (MaxBTDepth) is the maximum allowed binary tree depth. Minimum binary tree size (MinBTSize) is the minimum allowed binary tree leaf node size.
In one example of the QTBT structure 501, the CTU size is set as 128×128 luma samples with two corresponding 64×64 blocks of chroma samples, the MinQTSize is set as 16×16, the MaxBTSize is set as 64×64, the MinBTSize (for both width and height) is set as 4, and the MaxBTDepth is set as 4. The quadtree partitioning is applied to the CTU first to generate quadtree leaf nodes. The quadtree leaf nodes may have a size from 16×16 (the MinQTSize) to 128×128 (the CTU size). If the leaf quadtree node is 128×128, the node is not further split by the binary tree since the size exceeds the MaxBTSize (e.g., 64×64). Otherwise, the leaf quadtree node can be further partitioned by the binary tree. Therefore, the quadtree leaf node is also the root node for the binary tree and has a binary tree depth of 0. When the binary tree depth reaches MaxBTDepth (e.g., 4), no further splitting is considered. When the binary tree node has a width equal to MinBTSize (e.g., 4), no further horizontal splitting is considered. Similarly, when the binary tree node has a height equal to MinBTSize, no further vertical splitting is considered. The leaf nodes of the binary tree are further processed by prediction and transform processing without any further partitioning. In the JEM, the maximum CTU size is 256×256 luma samples.
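A small sketch of how the example thresholds above gate further splitting is given below; the function and constant names are assumptions, and the checks simply restate the rules in this paragraph.

    # Illustrative check of QTBT splitting constraints using the example parameters
    # above; the names and the simplified rules are assumptions for illustration.
    MIN_QT_SIZE, MAX_BT_SIZE, MAX_BT_DEPTH, MIN_BT_SIZE = 16, 64, 4, 4

    def allowed_splits(width, height, bt_depth, is_quadtree_node):
        splits = []
        if is_quadtree_node and min(width, height) > MIN_QT_SIZE:
            splits.append("QT")
        # A quadtree leaf becomes a binary tree root only if it does not exceed MaxBTSize.
        if max(width, height) <= MAX_BT_SIZE and bt_depth < MAX_BT_DEPTH:
            if width > MIN_BT_SIZE:
                splits.append("BT_HOR")  # horizontal splitting considered while width > MinBTSize
            if height > MIN_BT_SIZE:
                splits.append("BT_VER")  # vertical splitting considered while height > MinBTSize
        return splits

    # Example: a 128x128 quadtree leaf is not BT split (exceeds MaxBTSize); a 64x64 leaf can be.
    print(allowed_splits(128, 128, bt_depth=0, is_quadtree_node=True))
    print(allowed_splits(64, 64, bt_depth=0, is_quadtree_node=False))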
Method 500 illustrates an example of block partitioning by using the QTBT structure 501, and tree representation 503 illustrates the corresponding tree representation. The solid lines indicate quadtree splitting and dotted lines indicate binary tree splitting. In each splitting (e.g., non-leaf) node of the binary tree, one flag is signaled to indicate which splitting type (e.g., horizontal or vertical) is used, where 0 indicates horizontal splitting and 1 indicates vertical splitting. For the quadtree splitting, there is no need to indicate the splitting type since quadtree splitting always splits a block both horizontally and vertically to produce 4 sub-blocks with an equal size.
In addition, the QTBT scheme supports the ability for the luma and chroma to have a separate QTBT structure 501. For example, in P and B slices the luma and chroma CTBs in one CTU share the same QTBT structure 501. However, in I slices the luma CTB is partitioned into CUs by a QTBT structure 501, and the chroma CTBs are partitioned into chroma CUs by another QTBT structure 501. Accordingly, a CU in an I slice can include a coding block of the luma component or coding blocks of two chroma components. Further, a CU in a P or B slice includes coding blocks of all three color components. In HEVC, inter prediction for small blocks is restricted to reduce the memory access of motion compensation, such that bi-prediction is not supported for 4×8 and 8×4 blocks, and inter prediction is not supported for 4×4 blocks. In the QTBT of the JEM, these restrictions are removed.
Triple-tree partitioning for VVC is now discussed.
In an example implementation, VVC partitions a CTU into coding units by a QT. The QT leaf nodes may then be further partitioned by a BT or a TT. A leaf CU is a basic coding unit. The leaf CU may also be called a CU for convenience. In an example implementation, a leaf CU cannot be further split. Prediction and transform are both applied to the CU in the same way as in the JEM. The whole partition structure is named multiple-type-tree (MTT).
In one example, ETT splits a partition only in the vertical direction, for example where W1=a1*W, W2=a2*W, and W3=a3*W, where a1+a2+a3=1, and where H1=H2=H3=H. This kind of ETT is a vertical split and may be referred to as ETT-V. In one example, ETT-V split 701 can be used where W1=W/8, W2=3*W/4, W3=W/8, and H1=H2=H3=H. In one example, ETT splits a partition only in the horizontal direction, for example where H1=a1*H, H2=a2*H, and H3=a3*H, where a1+a2+a3=1, and where W1=W2=W3=W. This kind of ETT is a horizontal split and may be referred to as ETT-H. In one example, ETT-H split 703 can be used where H1=H/8, H2=3*H/4, H3=H/8, and W1=W2=W3=W.
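The ETT-V and ETT-H examples above can be summarized by the following sketch, which computes the three partition sizes from the 1/8, 3/4, 1/8 ratios; the function names are illustrative.

    # Sketch of the ETT partition sizes from the example ratios above (1/8, 3/4, 1/8).
    def ett_v(w, h):
        """Vertical ETT: partitions of widths W/8, 3*W/4, W/8, all of height H."""
        return [(w // 8, h), (3 * w // 4, h), (w // 8, h)]

    def ett_h(w, h):
        """Horizontal ETT: partitions of heights H/8, 3*H/4, H/8, all of width W."""
        return [(w, h // 8), (w, 3 * h // 4), (w, h // 8)]

    # Example for a 64x32 block.
    parts_v = ett_v(64, 32)   # [(8, 32), (48, 32), (8, 32)]
    parts_h = ett_h(64, 32)   # [(64, 4), (64, 24), (64, 4)]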
Inter prediction is now discussed, for example as used in HEVC. Inter prediction is the process of coding a block in current picture based on a reference block in a different picture called a reference picture. Inter prediction relies on the fact that the same objects tend to appear in multiple pictures in most video streams. Inter prediction matches a current block with a group of samples to a reference block in another picture with similar samples (e.g., generally depicting the same object at a different time in a video sequence). Instead of encoding each of the samples, the current block is encoded as a motion vector (MV) pointing to the reference block. Any difference between the current block and the reference block is encoded as residual. Accordingly, the current block is coded by reference to the reference block. At the decoder side, the current block can be decoded using only the MV and the residual so long as the reference block has already been decoded. Blocks coded according to inter prediction are significantly more compressed than blocks coded according to intra prediction. Inter prediction can be performed as unidirectional inter prediction or bidirectional inter prediction. Unidirectional inter prediction uses a MV pointing to a single block in a single reference picture and bidirectional inter prediction uses two MVs pointing to two different reference blocks in two different reference pictures. A slice of a picture coded according to unidirectional inter prediction is known as a P slice and a slice of a picture coded according to bidirectional inter prediction is known as a B slice. The portion of the current block that can be predicted from the reference block is known as a prediction unit (PU). Accordingly, a PU plus the corresponding residual results in the actual sample values in a CU of a coded block.
Each inter predicted PU has motion parameters for one or two reference picture lists. Motion parameters include a motion vector and a reference picture index. Usage of one of the two reference picture lists may also be signaled using an inter prediction identification (ID) code (inter_pred_idc). Motion vectors may be explicitly coded as deltas (differences) relative to predictors. The following describes various mechanisms for encoding the motion parameters.
When a CU is coded with skip mode, one PU is associated with the CU, there are no significant residual coefficients, and no coded motion vector delta or reference picture index is used. A merge mode can also be specified whereby the motion parameters for the current PU are obtained from neighboring PUs, including spatial and temporal candidates. The parameters can then be signaled by employing an index that corresponds to a selected candidate or candidates. Merge mode can be applied to any inter predicted PU, and is not limited to skip mode. The alternative to merge mode is the explicit transmission of motion parameters. In this case, a motion vector (coded as a motion vector difference compared to a motion vector predictor), a corresponding reference picture index for each reference picture list, and reference picture list usage are signaled explicitly for each PU. This signaling mode is referred to as advanced motion vector prediction (AMVP).
When signaling indicates that one of the two reference picture lists is to be used, the PU is produced from one block of samples. This is referred to as uni-prediction. Uni-prediction is available both for P-slices and B-slices. When signaling indicates that both of the reference picture lists are to be used, the PU is produced from two blocks of samples. This is referred to as ‘bi-prediction’. Bi-prediction is available for B-slices only.
The following text provides the details on the inter prediction modes in HEVC. Merge mode is now discussed. Merge mode generates a list of candidate MVs. The encoder selects a candidate MV as the MV for a block. The encoder then signals an index corresponding to the selected candidate. This allows the MV to be signaled as a single index value. The decoder generates the candidate list in the same manner as the encoder and uses the signaled index to determine the indicated MV.
For spatial merge candidate derivation, a maximum of four merge candidates are selected among candidates that are located in five different positions. For temporal merge candidate derivation, a maximum of one merge candidate is selected among two candidates. Since a constant number of candidates for each PU is assumed at the decoder, additional candidates are generated when the number of candidates obtained from the spatial and temporal derivation does not reach the maximum number of merge candidates (MaxNumMergeCand), which is signaled in the slice header. Since the number of candidates is constant, an index of the best merge candidate is encoded using truncated unary binarization (TU). If the size of the CU is equal to 8, all the PUs of the current CU share a single merge candidate list, which is identical to the merge candidate list of the 2N×2N prediction unit.
Zero motion candidates are inserted to fill the remaining entries in the merge candidate list and thereby reach the MaxNumMergeCand capacity. These candidates have zero spatial displacement and a reference picture index which starts from zero and increases every time a new zero motion candidate is added to the list. The number of reference frames used by these candidates is one and two for unidirectional and bidirectional prediction, respectively. Finally, no redundancy check is performed on these candidates.
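The padding rule above can be sketched as follows; the dictionary representation of a merge candidate and the capping of the reference index are assumptions made for this example.

    # Sketch of padding a merge candidate list with zero motion candidates up to
    # MaxNumMergeCand; the candidate representation is an illustrative assumption.
    def pad_with_zero_candidates(candidate_list, max_num_merge_cand, num_ref_pics):
        ref_idx = 0
        while len(candidate_list) < max_num_merge_cand:
            # Zero spatial displacement; the reference index starts at zero and
            # increases with every new zero motion candidate.
            candidate_list.append({"mv": (0, 0), "ref_idx": min(ref_idx, num_ref_pics - 1)})
            ref_idx += 1
        return candidate_list

    # Example: a list holding three candidates is padded up to MaxNumMergeCand = 5.
    merge_list = [{"mv": (3, -1), "ref_idx": 0},
                  {"mv": (2, 0), "ref_idx": 1},
                  {"mv": (0, 4), "ref_idx": 0}]
    merge_list = pad_with_zero_candidates(merge_list, 5, num_ref_pics=2)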
Motion estimation regions for parallel processing is now discussed. To speed up the encoding process, motion estimation can be performed in parallel whereby the motion vectors for all prediction units inside a specified region are derived simultaneously. The derivation of merge candidates from a spatial neighborhood may interfere with parallel processing. This is because one prediction unit cannot derive the motion parameters from an adjacent PU until the adjacent PU's associated motion estimation is completed. To mitigate the trade-off between coding efficiency and processing latency, HEVC defines the motion estimation region (MER) whose size is signaled in the picture parameter set using the log2_parallel_merge_level_minus2 syntax element. When a MER is defined, merge candidates falling in the same region are marked as unavailable and therefore not considered in the list construction.
In motion vector prediction, spatial motion vector candidates and temporal motion vector candidates are considered. For spatial motion vector candidate derivation, two motion vector candidates are eventually derived based on motion vectors of each PU located in five different positions.
Spatial motion vector candidates are now discussed. In the derivation of spatial motion vector candidates, a maximum of two candidates are considered among five potential candidates, which are derived from PUs located in five different positions.
The no-spatial-scaling cases are checked first followed by the spatial scaling. Spatial scaling is considered when the picture order count (POC) is different between the reference picture of the neighboring PU and that of the current PU regardless of reference picture list. If all PUs of left candidates are not available or are intra coded, scaling for the above motion vector is allowed to help parallel derivation of left and above MV candidates. Otherwise, spatial scaling is not allowed for the above motion vector.
Temporal motion vector candidates are now discussed. Apart from the reference picture index derivation, all processes for the derivation of temporal merge candidates are the same as for the derivation of spatial motion vector candidates.
Inter prediction methods beyond HEVC are now discussed. This includes sub-CU based motion vector prediction. In the JEM with QTBT, each CU can have at most one set of motion parameters for each prediction direction. Two sub-CU level motion vector prediction methods are considered in the encoder by splitting a large CU into sub-CUs and deriving motion information for all the sub-CUs of the large CU. An alternative temporal motion vector prediction (ATMVP) method allows each CU to fetch multiple sets of motion information from multiple blocks smaller than the current CU in the collocated reference picture. In a spatial-temporal motion vector prediction (STMVP) method, motion vectors of the sub-CUs are derived recursively by using the temporal motion vector predictor and a spatial neighboring motion vector. To preserve a more accurate motion field for sub-CU motion prediction, the motion compression for the reference frames is currently disabled.
In the first step, a reference picture and the corresponding block are determined by the motion information of the spatial neighboring blocks of the current CU. To avoid the repetitive scanning process of neighboring blocks, the first merge candidate in the merge candidate list of the current CU is used. The first available motion vector as well as the associated reference index are set to be the temporal vector and the index to the motion source picture. In this way, the corresponding block may be more accurately identified in ATMVP when compared with temporal motion vector prediction (TMVP). The corresponding block (sometimes called the collocated block) is in a bottom-right or center position relative to the current CU.
In the second step, a corresponding block of the sub-CU is identified by the temporal vector in the motion source picture by adding the coordinate of the current CU to the temporal vector. For each sub-CU, the motion information of a corresponding block (the smallest motion grid that covers the center sample) is used to derive the motion information for the sub-CU. After the motion information of a corresponding N×N block is identified, the motion information is converted to the motion vectors and reference indices of the current sub-CU in the same way as TMVP. Motion scaling and other procedures also apply. For example, the decoder checks whether the low-delay condition is fulfilled. This occurs when the POCs of all reference pictures of the current picture are smaller than the POC of the current picture. The decoder may also use motion vector MVx to predict motion vector MVy for each sub-CU. MVx is the motion vector corresponding to reference picture list X and MVy is the motion vector corresponding to reference picture list Y, with X being equal to 0 or 1 and Y being equal to 1−X.
Sub-CU motion prediction mode signaling is now discussed. The sub-CU modes are enabled as additional merge candidates and there is no additional syntax element used to signal the modes. Two additional merge candidates are added to the merge candidate list of each CU to represent the ATMVP mode and the STMVP mode. Up to seven merge candidates are used when the sequence parameter set indicates that ATMVP and STMVP are enabled. The encoding logic of the additional merge candidates is the same as for the merge candidates described above. Accordingly, for each CU in a P or B slice, two more RD checks are employed for the two additional merge candidates. In the JEM, all bins of the merge index are context coded by CABAC. In HEVC, only the first bin is context coded and the remaining bins are context bypass coded.
Adaptive motion vector difference resolution is now discussed. In HEVC, motion vector differences (MVDs) between the motion vector and predicted motion vector of a PU are signaled in units of quarter luma samples when use_integer_mv_flag is equal to 0 in the slice header. In the JEM, a locally adaptive motion vector resolution (LAMVR) is employed. In the JEM, MVD can be coded in units of quarter luma samples, integer luma samples, and/or four luma samples. The MVD resolution is controlled at the CU level, and MVD resolution flags are conditionally signaled for each CU that has at least one non-zero MVD component. For a CU that has at least one non-zero MVD component, a first flag is signaled to indicate whether quarter luma sample MV precision is used in the CU. When the first flag indicates that quarter luma sample MV precision is not used (e.g., first flag is equal to one), another flag is signaled to indicate whether integer luma sample MV precision or four luma sample MV precision is used. When the first MVD resolution flag of a CU is zero, or not coded for a CU (e.g., all MVDs in the CU are zero), the quarter luma sample MV resolution is used for the CU. When a CU uses integer-luma sample MV precision or four-luma-sample MV precision, the MVPs in the AMVP candidate list for the CU are rounded to the corresponding precision.
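A sketch of the two-flag signaling described in this paragraph follows; the exact values carried by the second flag for integer versus four luma sample precision, and the string labels, are assumptions made only for illustration.

    # Sketch of the two-flag MVD resolution signaling described above; the mapping of
    # the second flag to integer versus four sample precision is an assumption.
    def mvd_resolution_flags(resolution):
        """Map an MVD resolution to (first_flag, second_flag_or_None)."""
        if resolution == "quarter":
            return (0, None)      # quarter luma sample MV precision, no second flag
        if resolution == "integer":
            return (1, 0)         # not quarter -> integer luma sample MV precision
        if resolution == "four":
            return (1, 1)         # not quarter -> four luma sample MV precision
        raise ValueError(resolution)

    def parse_mvd_resolution(first_flag, second_flag=None):
        if first_flag == 0:
            return "quarter"
        return "integer" if second_flag == 0 else "four"

    # Example round trip for a CU signaled with integer luma sample MV precision.
    flags = mvd_resolution_flags("integer")
    assert parse_mvd_resolution(*flags) == "integer"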
In the encoder, CU-level rate distortion (RD) checks are used to determine which MVD resolution should be used for a CU. The CU-level RD check is performed three times for each MVD resolution. To accelerate encoder speed, the following encoding schemes are applied in the JEM. During the RD check of a CU with normal quarter luma sample MVD resolution, the motion information of the current CU (integer luma sample accuracy) is stored. The stored motion information (after rounding) is used as the starting point for further small range motion vector refinement during the RD check for the same CU with integer luma sample and 4 luma sample MVD resolution, so that the time-consuming motion estimation process is not duplicated three times. An RD check of a CU with 4 luma sample MVD resolution is conditionally invoked. For a CU, when the RD cost of integer luma sample MVD resolution is much larger than that of quarter luma sample MVD resolution, the RD check of 4 luma sample MVD resolution for the CU is skipped.
Higher motion vector storage accuracy is now discussed. In HEVC, motion vector accuracy is one-quarter pel (one-quarter luma sample and one-eighth chroma sample for 4:2:0 video). In the JEM, the accuracy for the internal motion vector storage and the merge candidate increases to 1/16 pel. The higher motion vector accuracy (1/16 pel) is used in motion compensation inter prediction for the CU coded with skip/merge mode. For the CU coded with normal AMVP mode, either the integer-pel or quarter-pel motion is used. Scalable HEVC (SHVC) upsampling interpolation filters, which have the same filter length and normalization factor as HEVC motion compensation interpolation filters, are used as motion compensation interpolation filters for the additional fractional pel positions. The chroma component motion vector accuracy is 1/32 sample in the JEM. The additional interpolation filters for the 1/32 pel fractional positions are derived by using the average of the filters of the two neighboring 1/16 pel fractional positions.
When overlapped block motion compensation (OBMC) applies to the current sub-block, motion vectors of up to four connected neighboring sub-blocks are used in addition to the current motion vectors to derive the prediction block for the current sub-block. The motion vectors of the four connected neighboring sub-blocks are used when available and when not identical to the current motion vector. The four connected neighboring sub-blocks are illustrated in CU 2001 by vertical hashing. These multiple prediction blocks based on multiple motion vectors are combined to generate the final prediction signal of the current sub-block.
A prediction block based on motion vectors of a neighboring sub-block is denoted as PN, with N indicating an index for the neighboring above, below, left, and/or right sub-block. In the example shown, the motion vector of the above neighboring sub-block is used in OBMC of PN1, the motion vector of the left neighboring sub-block is used in OBMC of PN2, and the motion vector of the above neighboring sub-block and the left neighboring sub-block are used in OBMC of PN3.
A prediction block based on motion vectors of the current sub-block is denoted as PC. When PN is based on the motion information of a neighboring sub-block that contains the same motion information as the current sub-block, the OBMC is not performed from PN. Otherwise, every sample of PN is added to the same sample in PC. For example, four rows/columns of PN are added to PC. The weighting factors {1/4, 1/8, 1/16, 1/32} are used for PN and the weighting factors {3/4, 7/8, 15/16, 31/32} are used for PC. The exception is small MC blocks, where the height or width of the coding block is equal to 4 or a CU is coded with sub-CU mode. In such a case, only two rows/columns of PN are added to PC. In this case, weighting factors {1/4, 1/8} are used for PN and weighting factors {3/4, 7/8} are used for PC. For PN generated based on motion vectors of a vertically (horizontally) neighboring sub-block, samples in the same row (column) of PN are added to PC with a same weighting factor. As shown in CU 2003, sub-block PN is adjacent to four neighboring sub-blocks, which are illustrated without hashing. The motion vectors of the four neighboring sub-blocks are used in OBMC for sub-block PN.
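The row-wise blending described above can be sketched as follows for the case of an above neighbor; the array layout and block size are assumptions, while the weights are the factors listed in this paragraph.

    # Sketch of OBMC blending of the neighbor-based prediction PN into the current
    # prediction PC for an above neighbor; layout and block size are assumptions.
    import numpy as np

    def obmc_blend_from_above(pc, pn, small_block=False):
        """Blend the first rows of PN into PC with weights {1/4, 1/8, 1/16, 1/32}."""
        blended = pc.astype(float).copy()
        weights = [1/4, 1/8] if small_block else [1/4, 1/8, 1/16, 1/32]
        for row, w in enumerate(weights):
            # PC keeps the complementary weight {3/4, 7/8, 15/16, 31/32}.
            blended[row, :] = (1 - w) * blended[row, :] + w * pn[row, :]
        return blended

    # Example on a hypothetical 8x8 sub-block prediction.
    pc = np.full((8, 8), 100.0)
    pn = np.full((8, 8), 120.0)
    result = obmc_blend_from_above(pc, pn)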
In the JEM, a CU level flag is signaled to indicate whether OBMC is applied or not for the current CU when the current CU has a size less than or equal to 256 luma samples. For the CUs with a size larger than 256 luma samples or not coded with AMVP mode, OBMC is applied by default. At the encoder, when OBMC is applied for a CU, the impact of OBMC is considered during the motion estimation stage. The prediction signal formed by OBMC using motion information of the top neighboring block and the left neighboring block is used to compensate the top and left boundaries of the original signal of the current CU. The normal motion estimation process is then applied.
When a CU is coded with merge mode, the local illumination compensation (LIC) flag is copied from neighboring blocks, in a manner similar to motion information copy in merge mode. Otherwise, an LIC flag is signaled for the CU to indicate whether LIC applies or not. When LIC is enabled for a picture, an additional CU level RD check is used to determine whether LIC is applied or not for a CU. When LIC is enabled for a CU, a mean-removed sum of absolute difference (MR-SAD) and a mean-removed sum of absolute Hadamard-transformed difference (MR-SATD) are used instead of SAD and sum of absolute transformed difference (SATD) for an integer pel motion search and fractional pel motion search, respectively. To reduce the encoding complexity, the following encoding scheme is applied in the JEM. LIC is disabled for the entire picture when there is no clear illumination change between a current picture and corresponding reference pictures. To identify this situation, histograms of a current picture and every reference picture of the current picture are calculated at the encoder. If the histogram difference between the current picture and every reference picture of the current picture is smaller than a specified threshold, LIC is disabled for the current picture. Otherwise, LIC is enabled for the current picture.
The motion vector field (MVF) of a block is described by the following equations for the 4-parameter affine model and the 6-parameter affine model, respectively:
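In their commonly used form (stated here as an assumption, and labeled (1) and (2) to match the references below), the 4-parameter and 6-parameter affine motion models are:

    mvh(x, y) = ((mvh1 - mvh0)/w)*x - ((mvv1 - mvv0)/w)*y + mvh0
    mvv(x, y) = ((mvv1 - mvv0)/w)*x + ((mvh1 - mvh0)/w)*y + mvv0          (1)

    mvh(x, y) = ((mvh1 - mvh0)/w)*x + ((mvh2 - mvh0)/h)*y + mvh0
    mvv(x, y) = ((mvv1 - mvv0)/w)*x + ((mvv2 - mvv0)/h)*y + mvv0          (2)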
where (mvh0, mvv0) is the motion vector of the top-left corner control point, (mvh1, mvv1) is the motion vector of the top-right corner control point, (mvh2, mvv2) is the motion vector of the bottom-left corner control point, and (x, y) represents the coordinate of a representative point relative to the top-left sample within a current block. The control point (CP) motion vectors may be signaled (like in the affine AMVP mode) or derived on-the-fly (like in the affine merge mode). w and h are the width and height of the current block. In practice, the division is implemented by right-shift with a rounding operation. In the VVC test model (VTM), the representative point is defined to be the center position of a sub-block. For example, when the coordinate of the left-top corner of a sub-block relative to the top-left sample within a current block is (xs, ys), the coordinate of the representative point is defined to be (xs+2, ys+2).
In a division-free design, (1) and (2) are implemented as
For the 4-parameter affine model shown in (1):
For the 6-parameter affine model shown in (2):
where S represents the calculation precision. In VVC, S=7. In VVC, the MV used in MC for a sub-block with the top-left sample at (xs, ys) is calculated by (6) with x=xs+2 and y=ys+2.
After the control point MVs (CPMVs) v0 and v1 of the current CU are derived, according to the simplified affine motion model in equation (1), the MVF of the current CU is generated. In order to identify whether the current CU is coded with AF_MERGE mode, an affine flag is signaled in the bitstream when there is at least one neighbor block coded in affine mode.
Pattern matched motion vector derivation (PMMVD) mode is a special merge mode based on Frame-Rate Up Conversion (FRUC) techniques. With this mode, motion information of a block is derived at decoder side and not signaled by the encoder. A FRUC flag is signaled for a CU when a merge flag for the CU is true. When the FRUC flag is false, a merge index is signaled and the regular merge mode is used. When the FRUC flag is true, an additional FRUC mode flag is signaled to indicate which method (bilateral matching or template matching) is to be used to derive motion information for the block.
At the encoder side, the decision on whether to use FRUC merge mode for a CU is based on RD cost selection, in a similar manner as for a normal merge candidate. The two matching modes (bilateral matching and template matching) are both checked for a CU by using RD cost selection. The one leading to the minimal cost is further compared to other CU modes. If a FRUC matching mode is the most efficient one, a FRUC flag is set to true for the CU and the related matching mode is used.
A motion derivation process in FRUC merge mode has two steps. A CU-level motion search is first performed, and then followed by a sub-CU level motion refinement. At the CU level, an initial motion vector is derived for the whole CU based on bilateral matching or template matching. A list of MV candidates is generated, and the candidate which leads to the minimum matching cost is selected as the starting point for further CU level refinement. Then a local search based on bilateral matching or template matching around the starting point is performed. The MV that results in the minimum matching cost is taken as the MV for the whole CU. Subsequently, the motion information is further refined at the sub-CU level with the derived CU motion vectors as the starting points.
For example, the following derivation process is performed for a width (W) times height (H) CU motion information derivation. At the first stage, the MV for the whole W×H CU is derived. At the second stage, the CU is further split into M×M sub-CUs. The value of M is calculated based on a predefined splitting depth D, which is set to 3 by default in the JEM. Then the MV for each sub-CU is derived.
A CU level MV candidate set is now discussed. The MV candidate set at the CU level comprises: original AMVP candidates when the current CU is in AMVP mode; all merge candidates; several MVs in the interpolated MV field; and top and left neighboring motion vectors. When using bilateral matching, each valid MV of a merge candidate is used as an input to generate a MV pair with the assumption of bilateral matching. For example, one valid MV of a merge candidate is (MVa, refa) at reference list A. Then the reference picture refb of a paired bilateral MV is found in the other reference list B so that refa and refb are temporally at different sides of the current picture. When such a refb is not available in reference list B, refb is determined as a reference picture which is different from refa and has a temporal distance from the current picture equal to the minimal temporal distance in list B. After refb is determined, MVb is derived by scaling MVa based on the temporal distance between the current picture and refa, refb. Four MVs from the interpolated MV field are also added to the CU level candidate list. More specifically, the interpolated MVs at the position (0, 0), (W/2, 0), (0, H/2) and (W/2, H/2) of the current CU are added. When FRUC is applied in AMVP mode, the original AMVP candidates are also added to CU level MV candidate set. At the CU level, up to 15 MVs for AMVP CUs and up to 13 MVs for merge CUs are added to the candidate list.
A Sub-CU level MV candidate set is now discussed. The MV candidate set at sub-CU level comprises: an MV determined from a CU-level search; top, left, top-left, and top-right neighboring MVs; scaled versions of collocated MVs from reference pictures; up to 4 ATMVP candidates, and up to 4 STMVP candidates. The scaled MVs from reference pictures are derived as follows. All the reference pictures in both lists are traversed. The MVs at a collocated position of the sub-CU in a reference picture are scaled to the reference of the starting CU-level MV. ATMVP and STMVP candidates are limited to the first four candidates derived by ATMVP and STMVP. At the sub-CU level, up to 17 MVs are added to the candidate list.
The motion field of each reference picture in both reference lists is traversed at a 4×4 block level. For each 4×4 block in a reference picture, when the motion associated with the reference block passes through a 4×4 current block in the current picture (as shown in diagram 2700) and when the reference block has not been assigned any interpolated motion, the motion of the reference block is scaled to the current picture according to the temporal distance TD0 and TD1 (the same way as that of MV scaling of TMVP). The scaled motion is assigned to the current block in the current frame. If no scaled MV is assigned to a 4×4 block, the block's motion is marked as unavailable in the interpolated motion field.
Interpolation and matching cost are now discussed. Motion compensated interpolation is employed when a motion vector points to a fractional sample position. To reduce complexity, bi-linear interpolation is used instead of regular 8-tap HEVC interpolation for both bilateral matching and template matching. The calculation of the matching cost differs slightly at the different steps. When selecting the candidate from the candidate set at the CU level, the matching cost is the sum of absolute difference (SAD) of bilateral matching or template matching. After the starting MV is determined, the matching cost C of bilateral matching at the sub-CU level search is calculated as follows:
C=SAD+w·(|MVx−MVxs|+|MVy−MVys|) (8)
where w is a weighting factor which is empirically set to 4, and MV and MVs indicate the current MV and the starting MV, respectively. SAD is used as the matching cost of template matching at the sub-CU level search. In FRUC mode, the MV is derived by using luma samples only. The derived motion is used for both luma and chroma for MC inter prediction. After the MV is decided, final motion compensation is performed using an 8-tap interpolation filter for luma and a 4-tap interpolation filter for chroma.
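A minimal sketch of the cost in Equation (8), assuming the current and reference blocks are available as flat lists of luma samples, is shown below; the function name is illustrative.

def bilateral_matching_cost(cur_samples, ref_samples, mv, mv_start, w=4):
    # C = SAD + w * (|MVx - MVxs| + |MVy - MVys|), with w empirically set to 4.
    sad = sum(abs(a - b) for a, b in zip(cur_samples, ref_samples))
    mv_cost = abs(mv[0] - mv_start[0]) + abs(mv[1] - mv_start[1])
    return sad + w * mv_cost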
MV refinement is now discussed. MV refinement is a pattern based MV search with the criterion of bilateral matching cost or template matching cost. An unrestricted center-biased diamond search (UCBDS) pattern and an adaptive cross search pattern for MV refinement at the CU level and sub-CU level are supported in the JEM. For both CU and sub-CU level MV refinement, the MV is directly searched at quarter luma sample MV accuracy. This is followed by one-eighth luma sample MV refinement. The search ranges of MV refinement for the CU and sub-CU steps are set equal to 8 luma samples.
The selection of prediction direction in template matching FRUC merge mode is now discussed. In the bilateral matching merge mode, bi-prediction is always applied. This is because the motion information of a CU is derived based on the closest match between two blocks along the motion trajectory of the current CU in two different reference pictures. There is no such limitation for the template matching merge mode. In the template matching merge mode, the encoder can choose among unidirectional inter prediction from list0, unidirectional inter prediction from list1, and bidirectional inter prediction for a CU. The selection is based on a template matching cost as follows:
If costBi <= factor·min(cost0, cost1), bi-prediction is used; otherwise, if cost0 <= cost1, uni-prediction from list0 is used; otherwise, uni-prediction from list1 is used. Here cost0 is the SAD of list0 template matching, cost1 is the SAD of list1 template matching, and costBi is the SAD of bi-prediction template matching. The value of factor is equal to 1.25, which biases the selection process toward bi-prediction. The inter prediction direction selection is only applied to the CU-level template matching process.
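A sketch of this selection, assuming the comparison order given above, is:

def select_prediction_direction(cost0, cost1, cost_bi, factor=1.25):
    # Bi-prediction is favored by the factor of 1.25; otherwise the list
    # with the smaller template matching SAD is chosen.
    if cost_bi <= factor * min(cost0, cost1):
        return "bi"
    return "list0" if cost0 <= cost1 else "list1"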
Generalized Bi-prediction Improvement (GBi) is employed in VTM version three (VTM-3.0) and in benchmark set version 2.1 (BMS2.1). GBi may apply unequal weights to predictors from L0 and L1 in bi-prediction mode. In inter prediction mode, multiple weight pairs including the equal weight pair (½, ½) are evaluated based on rate-distortion optimization (RDO). The GBi index of the selected weight pair is signaled to the decoder. In merge mode, the GBi index is inherited from a neighboring CU. In BMS2.1 GBi, the predictor generation in bi-prediction mode is shown in Equation (9).
PGBi=(w0*PL0+w1*PL1+RoundingOffsetGBi)>>shiftNumGBi, (9)
where PGBi is the final predictor of GBi. w0 and w1 are the selected GBi weight pair and are applied to the predictors of lists L0 and L1, respectively. RoundingOffsetGBi and shiftNumGBi are used to normalize the final predictor in GBi. The supported w1 weight set is {−¼, ⅜, ½, ⅝, 5/4}, in which the five weights correspond to one equal weight pair and four unequal weight pairs. The blending gain is the sum of w1 and w0, and is fixed to 1.0. Therefore, the corresponding w0 weight set is { 5/4, ⅝, ½, ⅜, −¼}. The weight pair selection is at the CU level.
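Equation (9) can be illustrated with a small fixed-point sketch; representing the weights in units of 1/8 (so that shiftNumGBi is 3) is an assumption consistent with the 4/8 weight mentioned below, and the function name is illustrative.

def gbi_predict(p_l0, p_l1, w1_eighths):
    # PGBi = (w0*PL0 + w1*PL1 + RoundingOffsetGBi) >> shiftNumGBi, with w0 + w1 = 1.0.
    w0 = 8 - w1_eighths
    shift_num = 3
    rounding_offset = 1 << (shift_num - 1)
    return [(w0 * a + w1_eighths * b + rounding_offset) >> shift_num
            for a, b in zip(p_l0, p_l1)]

# Example: the equal weight pair (1/2, 1/2) reduces to a rounded average.
print(gbi_predict([100, 60], [110, 52], w1_eighths=4))  # [105, 56]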
For non-low delay pictures, the weight set size is reduced from five to three, where the w1 weight set is {⅜, ½, ⅝} and the w0 weight set is {⅝, ½, ⅜}. The weight set size reduction for non-low delay pictures is applied to the BMS2.1 GBi and all the GBi tests in this disclosure.
An example GBi encoder bug fix is now described. To reduce the GBi encoding time, the encoder may store unidirectional inter prediction (uni-prediction) motion vectors estimated from a GBi weight equal to 4/8. The encoder can then reuse the motion vectors for a uni-prediction search of other GBi weights. This fast encoding method can be applied to both the translation motion model and the affine motion model. In VTM version 2 (VTM-2.0), a 6-parameter affine model and a 4-parameter affine model are employed. A BMS2.1 encoder may not differentiate the 4-parameter affine model and the 6-parameter affine model when the encoder stores the uni-prediction affine MVs and when the GBi weight is equal to 4/8. Consequently, 4-parameter affine MVs may be overwritten by 6-parameter affine MVs after the encoding with GBi weight 4/8. The stored 6-parameter affine MVs may be used for 4-parameter affine ME for other GBi weights, or the stored 4-parameter affine MVs may be used for 6-parameter affine ME. The GBi encoder bug fix is to separate the storage of the 4-parameter and 6-parameter affine MVs. The encoder stores those affine MVs based on affine model type when the GBi weight is equal to 4/8. The encoder then reuses the corresponding affine MVs based on the affine model type for other GBi weights.
GBi encoder speed-up mechanisms are now described. Five example encoder speed-up methods are proposed to reduce the encoding time when GBi is enabled. A first method includes conditionally skipping affine motion estimation for some GBi weights. In BMS2.1, an affine ME including a 4-parameter and a 6-parameter affine ME is performed for all GBi weights. In an example, affine ME can be conditionally skipped for unequal GBi weights (e.g., weights unequal to 4/8). For example, affine ME can be performed for other GBi weights if and only if the affine mode is selected as the current best mode and the mode is not affine merge mode after evaluating the GBi weight of 4/8. When the current picture is a non-low-delay picture, the bi-prediction ME for the translation model is skipped for unequal GBi weights when affine ME is performed. When the affine mode is not selected as the current best mode or when the affine merge is selected as the current best mode, affine ME is skipped for all other GBi weights.
A second method includes reducing the number of weights for RD cost checking for low-delay pictures in the encoding for 1-pel and 4-pel MVD precision. For low-delay pictures, there are five weights for RD cost checking for all MVD precisions including ¼-pel, 1-pel and 4-pel. The encoder checks the RD cost for ¼-pel MVD precision first. A portion of GBi weights can be skipped for RD cost checking for 1-pel and 4-pel MVD precisions. Unequal weights can be ordered according to their RD cost in ¼-pel MVD precision. Only the first two weights with the smallest RD costs, together with GBi weight 4/8, are evaluated during the encoding in 1-pel and 4-pel MVD precisions. Therefore, three weights at most are evaluated for 1-pel and 4-pel MVD precisions for low delay pictures.
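A sketch of this weight pruning, assuming the 1/4-pel RD costs are collected in a dictionary keyed by w1 in units of 1/8, is shown below; the names are illustrative.

def gbi_weights_for_coarser_mvd(qpel_rd_cost, equal_weight=4):
    # Keep the equal weight (4/8) plus the two unequal weights with the
    # smallest RD cost from the 1/4-pel pass, so at most three weights are
    # evaluated at 1-pel and 4-pel MVD precisions.
    unequal = sorted((w for w in qpel_rd_cost if w != equal_weight),
                     key=lambda w: qpel_rd_cost[w])
    return [equal_weight] + unequal[:2]

# Example with the five low-delay w1 weights {-2, 3, 4, 5, 10} in eighths.
costs = {-2: 130.0, 3: 101.0, 4: 100.0, 5: 99.0, 10: 140.0}
print(gbi_weights_for_coarser_mvd(costs))  # [4, 5, 3]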
A third method includes conditionally skipping a bi-prediction search when the L0 and L1 reference pictures are the same. For some pictures in random access (RA), the same picture may occur in both reference picture lists (L0 and L1). For example, for random access coding configuration in common test conditions (CTC), the reference picture structure for the first group of pictures (GOP) is listed as follows.
In this example, pictures 16, 8, 4, 2, 1, 12, 14, and 15 have the same reference picture(s) in both lists. For bi-prediction for these pictures, the L0 and L1 reference pictures may be the same. Accordingly, the encoder may skip bi-prediction ME for unequal GBi weights when two reference pictures in bi-prediction are the same, when the temporal layer is greater than 1, and when the MVD precision is ¼-pel. For affine bi-prediction ME, this fast skipping method is only applied to 4-parameter affine ME.
A fourth method includes skipping RD cost checking for unequal GBi weights based on the temporal layer and the POC distance between the reference picture and the current picture. The RD cost evaluations for those unequal GBi weights can be skipped when the temporal layer is equal to 4 (e.g., the highest temporal layer in RA) or when the POC distance between the reference picture (either L0 or L1) and the current picture is equal to 1 and the coding QP is greater than 32.
A fifth method includes changing the floating-point calculation to a fixed-point calculation for unequal GBi weights during ME. For a bi-prediction search, the encoder may fix the MV of one list and refine the MV in another list. The target is modified before ME to reduce the computation complexity. For example, if the MV of L1 is fixed and the encoder is to refine the MV of L0, the target for L0 MV refinement can be modified with Equation (10), where O is the original signal, P1 is the prediction signal of L1, and w is the GBi weight for L1.
T=((O<<3)−w*P1)*(1/(8−w)) (10)
The term (1/(8−w)) is stored in floating point precision, which increases computation complexity. The fifth method changes Equation 10 to a fixed-point value as in Equation 11.
T=(O*a1−P1*a2+round)>>N (11)
In Equation 11, a1 and a2 are scaling factors and they are calculated as:
γ=(1<<N)/(8−w);a1=γ<<3;a2=γ*w;round=1<<(N−1)
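The fixed-point form of Equation (11) can be sketched as follows; the precision N and the sample-wise application are assumptions made for illustration.

def gbi_me_target_fixed_point(o, p1, w, n=16):
    # T = (O*a1 - P1*a2 + round) >> N, where a1 and a2 absorb the
    # floating-point term 1/(8 - w) from Equation (10).
    gamma = (1 << n) // (8 - w)
    a1 = gamma << 3
    a2 = gamma * w
    rnd = 1 << (n - 1)
    return [(oi * a1 - pi * a2 + rnd) >> n for oi, pi in zip(o, p1)]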
CU size constraints for GBi are now discussed. In this example, GBi is disabled for small CUs. In inter prediction mode, if bi-prediction is used and the CU area is smaller than 128 luma samples, GBi is disabled without any signaling.
Bi-directional optical flow (BIO) is now discussed. I(k) may be the luma value from reference k (k=0, 1) after block motion compensation, and ∂I(k)/∂x, ∂I(k)/∂y are the horizontal and vertical components of the I(k) gradient, respectively. Assuming the optical flow is valid, the motion vector field (vx, vy) is given by
∂I(k)/∂t+vx∂I(k)/∂x+vy∂I(k)/∂y=0. (12)
Combining this optical flow equation with Hermite interpolation for the motion trajectory of each sample results in a unique third-order polynomial that matches both the function values I(k) and derivatives ∂I(k)/∂x, ∂I(k)/∂y at the ends. The value of this polynomial at t=0 is the BIO prediction:
predBIO=½·(I(0)+I(1)+vx/2·(τ1∂I(1)/∂x−τ0∂I(0)/∂x)+vy/2·(τ1∂I(1)/∂y−τ0∂I(0)/∂y)). (13)
Here, τ0 and τ1 denote the distances to the reference frames as shown in diagram 2800. Distances τ0 and τ1 are calculated based on the POC for Ref0 and Ref1: τ0=POC(current)−POC(Ref0), τ1=POC(Ref1)−POC(current). When both predictions come from the same time direction (either both from previous pictures or both from subsequent pictures) then the signs are different (τ0·τ1<0). In this case, BIO is applied only when the prediction is not from the same time moment (e.g., τ0≠τ1), when both referenced regions have non-zero motion (MVx0, MVy0, MVx1, MVy1≠0), and when the block motion vectors are proportional to the time distance (MVx0/MVx1=MVy0/MVy1=−τ0/τ1).
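Equation (13) can be illustrated per sample as follows; the argument layout and the use of floating point are assumptions made for readability.

def bio_predict_sample(i0, i1, vx, vy, gx0, gy0, gx1, gy1, tau0, tau1):
    # predBIO = 1/2 * (I(0) + I(1)
    #   + vx/2 * (tau1*dI(1)/dx - tau0*dI(0)/dx)
    #   + vy/2 * (tau1*dI(1)/dy - tau0*dI(0)/dy)), Equation (13).
    return 0.5 * (i0 + i1
                  + 0.5 * vx * (tau1 * gx1 - tau0 * gx0)
                  + 0.5 * vy * (tau1 * gy1 - tau0 * gy0))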
The motion vector field (vx, vy) is determined by minimizing the difference Δ between values in points A and B (intersection of motion trajectory and reference frame planes on diagram 2800). The model uses only the first linear term of a local Taylor expansion for Δ:
Δ=(I(0)−I(1)+vx(τ1∂I(1)/∂x+τ0∂I(0)/∂x)+vy(τ1∂I(1)/∂y+τ0∂I(0)/∂y)) (14)
All values in Equation (14) depend on the sample location (i′, j′), which was omitted from the notation so far. Assuming the motion is consistent in the local surrounding area, Δ is minimized inside the (2M+1)×(2M+1) square window Ω centered on the currently predicted point (i,j), where M is equal to 2:
(vx,vy)=argmin Σ[i′,j′]∈Ω Δ²[i′,j′] (15)
For this optimization problem, the JEM may use a simplified approach making first a minimization in the vertical direction and then in the horizontal direction. This results in
In order to avoid division by zero or a very small value, regularization parameters r and m are introduced in Equations (19) and (20).
r=500·4d−8 (19)
m=700·4d−8 (20)
Here d is bit depth of the video samples.
With BIO, a motion field can be refined for each sample. To reduce the computational complexity, a block-based design of BIO is used in the JEM. The motion refinement is calculated based on a 4×4 block. In the block-based BIO, the values of sn in Equation (18) of all samples in a 4×4 block are aggregated. Then the aggregated values of sn are used to derive the BIO motion vector offsets for the 4×4 block. More specifically, the following formula is used for block-based BIO derivation:
where bk denotes the set of samples in the k-th 4×4 block of the predicted block. sn in Equations (16) and (17) are replaced by ((sn,bk)>>4) to derive the associated motion vector offsets.
In some examples, the MV refinement of BIO might be unreliable due to noise or irregular motion. Therefore, in BIO, the magnitude of the MV refinement is clipped to a threshold value thBIO. The threshold value is determined based on whether the reference pictures of the current picture are all from one direction. If all the reference pictures of the current picture are from one direction, the value of the threshold is set to 12×2^(14−d); otherwise, the value of the threshold is set to 12×2^(13−d).
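The clipping of the motion refinement can be sketched directly from the threshold rule above; the function name is illustrative.

def clip_bio_refinement(v, bit_depth, refs_from_one_direction):
    # thBIO = 12 * 2^(14 - d) when all reference pictures come from one
    # direction, otherwise 12 * 2^(13 - d), where d is the sample bit depth.
    th_bio = 12 * (1 << ((14 if refs_from_one_direction else 13) - bit_depth))
    return max(-th_bio, min(th_bio, v))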
Gradients for BIO may be calculated at the same time as motion compensation interpolation using operations consistent with HEVC motion compensation process. This may include usage of a two-dimensional (2D) separable finite impulse response (FIR) filter. The input for this 2D separable FIR is the same reference frame sample as for motion compensation process with a fractional position (fracX, fracY) according to the fractional part of the block motion vector. In the case of a horizontal gradient ∂I/∂x, an interpolation BIO filter for prediction signal (BIOfilterS) is applied in a vertical direction corresponding to the fractional position fracY with a de-scaling shift d−8. Then a gradient BIO filter (BIOfilterG) is applied in a horizontal direction corresponding to the fractional position fracX with a de-scaling shift by 18−d. In case of vertical gradient ∂I/∂y a first gradient filter is applied vertically using BIOfilterG corresponding to the fractional position fracY with de-scaling shift d−8. Then a signal displacement is performed using BIOfilterS in a horizontal direction corresponding to the fractional position fracX with de-scaling shift by 18−d. The length of the interpolation filter for gradient calculation BIOfilterG and BIO signal displacement (BIOfilterF) is shorter (6-tap) in order to maintain reasonable complexity. Table 1 shows the filters used for a gradient calculation for different fractional positions of motion vectors for a block in BIO.
Table 2 shows the interpolation filters used for prediction signal generation in BIO.
In the JEM, BIO is applied to all bi-predicted blocks when the two predictions are from different reference pictures. BIO is disabled when LIC is enabled for a CU. In the JEM, OBMC is applied for a block after the MC process. To reduce the computational complexity, BIO is not applied during the OBMC process. This means that BIO is only applied in the MC process for a block when using the block's own MV and is not applied in the MC process when the MV of a neighboring block is used during the OBMC process.
In an example, BIO employs a first step that calculates the SAD between the two reference blocks R0 and R1: SAD=Σ(x,y)|R0(x,y)−R1(x,y)|
In an example, BIO employs a second step that includes data preparation. For a W×H block, (W+2)×(H+2) samples are interpolated. The inner W×H samples are interpolated with the 8-tap interpolation filter as in motion compensation. The four side outer lines of samples, illustrated as black circles in diagram 3000, are interpolated with the bi-linear filter. For each position, gradients are calculated on the two reference blocks (denoted as R0 and R1).
Gx0(x,y)=(R0(x+1,y)−R0(x−1,y))>>4
Gy0(x,y)=(R0(x,y+1)−R0(x,y−1))>>4
Gx1(x,y)=(R1(x+1,y)−R1(x−1,y))>>4
Gy1(x,y)=(R1(x,y+1)−R1(x,y−1))>>4
For each position, internal values are calculated as:
T1=(R0(x,y)>>6)−(R1(x,y)>>6),T2=(Gx0(x,y)+Gx1(x,y))>>3,T3=(Gy0(x,y)+Gy1(x,y))>>3
B1(x,y)=T2*T2, B2(x,y)=T2*T3, B3(x,y)=−T1*T2, B5(x,y)=T3*T3, B6(x,y)=−T1*T3
In an example, BIO employs a third step that includes calculating a prediction for each block. BIO is skipped for a 4×4 block if the SAD between the two 4×4 reference blocks is smaller than a threshold. Vx and Vy are calculated. The final prediction for each position in the 4×4 block is also calculated.
b(x,y)=(Vx(Gx0(x,y)−Gx1(x,y))+Vy(Gy0(x,y)−Gy1(x,y))+1)>>1
P(x,y)=(R0(x,y)+R1(x,y)+b(x,y)+offset)>>shift
b(x,y) is known as a correction term.
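The per-position gradient, correction, and prediction computations above can be gathered into one sketch; treating R0 and R1 as two-dimensional integer arrays indexed as R[y][x], and passing Vx and Vy in as already-derived values, are assumptions made for illustration.

def bdof_sample(r0, r1, x, y, vx, vy, offset, shift):
    # Gradients on the two reference blocks (right shift by 4, as above).
    gx0 = (r0[y][x + 1] - r0[y][x - 1]) >> 4
    gy0 = (r0[y + 1][x] - r0[y - 1][x]) >> 4
    gx1 = (r1[y][x + 1] - r1[y][x - 1]) >> 4
    gy1 = (r1[y + 1][x] - r1[y - 1][x]) >> 4
    # Correction term b(x, y) and final prediction P(x, y).
    b = (vx * (gx0 - gx1) + vy * (gy0 - gy1) + 1) >> 1
    return (r0[y][x] + r1[y][x] + b + offset) >> shift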
In VTM version four (VTM-4.0), BIO rounds the results of the BDOF calculations depending on bit depth. VTM-4.0 also removes the bi-linear filtering and instead fetches the nearest integer pixel of the reference block to pad the four side outer lines of samples (black circles in diagram 3000).
Decoder-side motion vector refinement (DMVR) is now discussed. In DMVR, a bilateral template is generated as the weighted combination (e.g., average) of the two prediction blocks, from the initial MV0 of list0 and MV1 of list1, respectively, as shown in diagram 3100. The template matching operation includes calculating cost measures between the generated template and the sample region around the initial prediction block in the reference picture. For each of the two reference pictures, the MV that yields the minimum template cost is considered as the updated MV of that list to replace the original MV. In the JEM, nine MV candidates are searched for each list. The nine MV candidates include the original MV and eight MVs with one luma sample offset from the original MV in the horizontal direction, the vertical direction, or both. The two new MVs, denoted as MV0′ and MV1′ as shown in diagram 3100, are used for generating the final bi-prediction results. A SAD is used as the cost measure. When calculating the cost of a prediction block generated by one surrounding MV, the rounded MV (to integer pel) is actually used to obtain the prediction block instead of the real MV.
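The nine-candidate refinement can be sketched as follows; template_cost is an assumed callback that returns the SAD between the bilateral template and the region addressed by a candidate MV, and the function name is illustrative.

def dmvr_refine(initial_mv, template_cost):
    # Evaluate the original MV and the eight MVs offset by one luma sample
    # horizontally, vertically, or both, and keep the minimum-cost candidate.
    offsets = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    candidates = [(initial_mv[0] + dx, initial_mv[1] + dy) for dx, dy in offsets]
    return min(candidates, key=template_cost)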
DMVR is applied for the merge mode of bi-prediction with one MV from a preceding reference picture and another from a subsequent reference picture without the transmission of additional syntax elements. In the JEM, DMVR is not applied when a LIC candidate, an affine motion candidate, a FRUC candidate, and/or a sub-CU merge candidate is enabled for a CU.
Template matching based adaptive merge candidate reorder is now discussed. To improve coding efficiency, the order of each merge candidate is adjusted according to the template matching cost after the merge candidate list is constructed. The merge candidates are arranged in the list in ascending order of template matching cost. Related operations are performed in the form of a sub-group.
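A sketch of the sub-group based reordering is shown below; the sub-group size and the per-candidate cost callback are assumptions made for illustration.

def reorder_merge_candidates(candidates, template_cost, subgroup_size=5):
    # Within each sub-group, arrange the candidates in ascending order of
    # template matching cost after the merge list is constructed.
    reordered = []
    for start in range(0, len(candidates), subgroup_size):
        group = candidates[start:start + subgroup_size]
        reordered.extend(sorted(group, key=template_cost))
    return reordered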
The following are example technical problems solved by disclosed technical solutions. In HEVC, a CU can be split into at most four PUs. However, the splitting of PUs may not be flexible enough for some applications.
Disclosed herein are mechanisms to address one or more of the problems listed above. For example, a CTU can be split into CUs. Each CU can include a prediction tree unit (PTU). The PTUs are then recursively split into PUs. In this way, a coding tree is applied to split a CTU, and a prediction tree is applied to each of the leaf nodes in the coding tree. This also allows the generation of PUs in a consistent recursive manner according to a set of split rules. Each PU can then contain different prediction information. For example, some PTUs may not be split, which results in a PU that is the same size as the CU. In some examples, the PTUs are split by one or more of a quad tree (QT) split, a vertical binary tree (BT) split, a horizontal BT split, a vertical ternary tree (TT) split, a horizontal TT split, a vertical unsymmetrical quad tree (UQT) split, a horizontal UQT split, a vertical unsymmetrical binary tree (UBT) split, a horizontal UBT split, a vertical extended quad tree (EQT) split, or a horizontal EQT split. TUs can still be applied at the CU level and hence a TU can be applied to residual from multiple PUs. The bitstream can contain syntax describing the split patterns and/or split depth used to partition the PTUs and/or PUs. In some examples, PTU and/or PU splits may be allowed or disallowed based on position, size, depth, and/or various threshold values. In such cases, the split information can be omitted from the bitstream and inferred by the decoder.
A CTU 3501 can be further subdivided into CUs. For example, a coding tree can be applied to partition the CTU 3501 into CUs. A coding tree is a hierarchical data structure that applies an ordered list of one or more split modes to a video unit. A coding tree can be visualized with a largest video unit as a root node with progressively smaller nodes created by splits in parent nodes. Nodes that can no longer be split are referred to as leaf nodes. The leaf nodes created by application of a coding tree to a CTU are the CUs. The CU contains both luma components and chroma components.
In the present example, each CU is also a PTU 3503. Hence the CU and PTU 3503 are collectively depicted in schematic diagram 3500 by dashed lines. A PTU 3503 is a structure containing both luma components and chroma components that can be subdivided into PUs 3505 by application of a prediction coding tree. A prediction coding tree is a hierarchical data structure that applies an ordered list of one or more split modes to create PUs 3505. A PU 3505 is a group of samples that are encoded by the same prediction mode. The PUs 3505 are depicted in schematic diagram 3500 by dotted lines. The application of a prediction coding tree to the PTU 3503 allows the PUs 3505 to be recursively generated based on different split patterns. For example, a PTU 3503 can be split by a QT split, a vertical BT split, a horizontal BT split, a vertical TT split, a horizontal TT split, a vertical UQT split, a horizontal UQT split, a vertical UBT split, a horizontal UBT split, a vertical EQT split, a horizontal EQT split, or combinations thereof.
Referring to
Such splits can be ordered in a split pattern according to the coding tree. This results in a highly customizable pattern of PUs 3505 of varying size. This also allows the encoder to generate PUs 3505 that match well with reference blocks, and hence can be predicted with less residual, which reduces the encoding size.
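For illustration, the recursive PTU/PU structure can be sketched with a small tree node; the class and function names, and expressing each split as a list of child rectangles, are assumptions and not part of the described design.

class PredictionNode:
    # One node of the prediction coding tree: the PTU at the root and PUs below.
    def __init__(self, x, y, width, height):
        self.x, self.y = x, y
        self.width, self.height = width, height
        self.children = []           # Empty for a leaf PU.
        self.prediction_mode = None  # Assigned only to leaf PUs.

    def split(self, pattern):
        # pattern(width, height) yields (dx, dy, w, h) tuples for the child
        # PUs, covering the parent exactly; QT, BT, TT, UQT, UBT, and EQT
        # splits can all be expressed this way.
        self.children = [PredictionNode(self.x + dx, self.y + dy, w, h)
                         for dx, dy, w, h in pattern(self.width, self.height)]
        return self.children

def vertical_bt(width, height):
    # Vertical binary tree split of a PTU or PU into two PUs.
    half = width // 2
    return [(0, 0, half, height), (half, 0, width - half, height)]

# Example: a 64x32 PTU split by a vertical BT yields two 32x32 PUs.
children = PredictionNode(0, 0, 64, 32).split(vertical_bt)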
The detailed embodiments below should be considered as examples to explain general concepts. These embodiments should not be interpreted in a narrow way. Furthermore, these embodiments can be combined in any manner. In the following discussion, QT, BT, TT, UQT, and ETT may refer to QT split, BT split, TT split, UQT split, and ETT split, respectively. In the following discussion, a block is a dyadic block if both the width and the height are dyadic numbers, which are in the form of 2^N with N being a positive integer. The term block represents a group of samples associated with one-color, two-color, or three-color components, such as a CU, PU, TU, CB, PB, or TB. In the following discussion, a block is a non-dyadic block if at least one of the width and the height is a non-dyadic number, which cannot be represented in the form of 2^N with N being a positive integer. In the following discussion, split and partitioning have the same meaning.
Example definitions supported by VVC are as follows. A coding block is an M×N block of samples for some values of M and N such that the division of a CTB into coding blocks is a partitioning. A coding tree block (CTB) is an N×N block of samples for some value of N such that the division of a component into CTBs is a partitioning. A coding tree unit (CTU) is a CTB of luma samples, two corresponding CTBs of chroma samples of a picture that has three sample arrays, or a CTB of samples of a monochrome picture, and syntax structures used to code the samples. A coding unit (CU) is a coding block of luma samples, two corresponding coding blocks of chroma samples of a picture that has three sample arrays in the single tree mode, or a coding block of luma samples of a picture that has three sample arrays in the dual tree mode, or two coding blocks of chroma samples of a picture that has three sample arrays in the dual tree mode, or a coding block of samples of a monochrome picture, and syntax structures used to code the samples.
By convention, in the following description, a CU may also refer to a CB and a CTU may also refer to a CTB. A CTU or a CU may be further split into CUs or leaf CUs. A leaf CU cannot be further split into CUs or leaf CUs, serving as a basic coding unit. A prediction tree unit (PTU) is associated with a leaf CU, covering the same region as the leaf CU. A PTU or prediction unit (PU) may be further split into PUs or leaf PUs. A leaf PU cannot be further split into PUs or leaf PUs, serving as a basic prediction unit.
In one example, a coding unit may be split into multiple PUs in a recursive way. For example, a CTU can be recursively split by a CU and PU split pattern as shown in
In one example, a PTU or a PU may be split into multiple PUs in different ways. For example, a PTU or a PU may be split into four PUs by a QT split. For example, a PTU or a PU may be split into two PUs by a vertical BT split. For example, a PTU or a PU may be split into two PUs by a horizontal BT split. For example, a PTU or a PU may be split into three PUs by a vertical TT split. For example, a PTU or a PU may be split into three PUs by a horizontal TT split. For example, a PTU or a PU may be split into four PUs by a vertical UQT split. For example, a PTU or a PU may be split into four PUs by a horizontal Unsymmetrical Quad Tree (UQT) split. For example, a PTU or a PU may be split into two PUs by a vertical UBT split. For example, a PTU or a PU may be split into two PUs by a horizontal UBT split. For example, a PTU or a PU may be split into four PUs by a vertical extended quad tree (EQT) split. For example, a PTU or a PU may be split into four PUs by a horizontal EQT split.
In one example, whether to and/or how to split a PTU or a PU may be signaled from an encoder to a decoder. In one example, a syntax element (such as a flag) may be signaled to indicate whether the PTU associated with a CU is further split into multiple PUs, or is not split and serves as a leaf PU. In one example, a syntax element (such as a flag) may be signaled to indicate whether a PU is further split into multiple PUs or is not split and serves as a leaf PU. In one example, one or multiple syntax element(s) may be signaled to indicate the splitting mechanism, which may comprise a splitting pattern (e.g., QT, BT, TT, UBT, UQT, and/or EQT) and/or the splitting direction (e.g., horizontal or vertical) for a PTU or a PU. In one example, the syntax element(s) indicating the splitting mechanism for a PTU or a PU may be conditionally signaled only when a decoder cannot infer whether the PTU or PU is further split. In one example, the syntax elements indicating whether to and/or how to split a PTU or a PU may be coded with context-based arithmetic coding. In one example, the syntax elements indicating whether to and/or how to split a PTU or a PU may be coded with bypass coding. In one example, all or part of the information indicating whether to and/or how to split a PTU or a PU may be signaled along with information indicating whether to and/or how to split a CTU or a CU.
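A decoding-side sketch of this signaling is shown below, assuming a bitstream reader with read_flag() and read_symbol() helpers; the reader interface and function names are assumptions, and the entropy coding itself (context-based or bypass, as noted above) is abstracted away.

def parse_pu_split(reader, can_be_split, allowed_modes):
    # Signal whether the PTU or PU is further split only when the decoder
    # cannot infer it; otherwise nothing is parsed.
    if not can_be_split:
        return None                      # Leaf PU, nothing signaled.
    if not reader.read_flag():           # Split flag.
        return None                      # Not split; serves as a leaf PU.
    index = reader.read_symbol(len(allowed_modes))
    return allowed_modes[index]          # For example ("BT", "vertical").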
In one example, a depth may be calculated for a PTU and/or PU. In one example, the depth may be a QT depth. The QT depth can be increased by K (e.g., K=1) for each ancestor PTU or PU of the current PTU and/or PU that is split by QT. In one example, the depth may be a MTT depth. The MTT depth can be increased by K (e.g. K=1) for each ancestor PTU or PU of the current PTU and/or PU that is split by any splitting method. In one example, the depth for a PTU may be initialized to be a fixed number such as zero. In one example, the depth for a PTU may be initialized to be a corresponding depth of the CU associated with the PTU.
In one example, whether to and/or how to split a PTU or a PU may be inferred by a decoder. In one example, the inference may depend on dimensions of the current CU, PTU, and/or PU. In one example, the inference may depend on the coding tree depth (such as QT depth or MTT depth) of the current CU, PTU, and/or PU. In one example, the inference may depend on the coding/prediction mode of the current CU, PTU, and/or PU. In one example, the inference may depend on whether or not the current CU, PTU, and/or PU is at the picture, sub-picture, and/or CTU boundary. In one example, if a decoder can infer that a PTU and/or PU cannot be further split, the syntax element indicating whether the PTU and/or PU should be split is not signaled. In one example, if a decoder can infer that a PTU and/or PU cannot be split with a specific splitting method, the syntax element(s) indicating the splitting method for the PTU and/or PU should be signaled accordingly, excluding the specific splitting method.
In one example, a PTU and/or PU is not allowed to be further split if a depth of the PTU and/or PU is larger or smaller than T, wherein T may be a fixed number, signaled from the encoder to the decoder, or derived at the decoder. In one example, a PTU and/or PU is not allowed to be further split if the size and/or area of the PTU and/or PU is larger or smaller than T, wherein T may be a fixed number, signaled from the encoder to the decoder, or derived at the decoder. In one example, a PTU and/or PU is not allowed to be further split if the width of the PTU or PU is larger or smaller than T1 and/or the height of the PTU and/or PU is larger or smaller than T2, wherein T1 or T2 may be fixed numbers, signaled from the encoder to the decoder, or derived at the decoder.
In one example, a PTU/PU is not allowed to be further split if the maximum, minimum, or average of the width of the PTU and/or PU and the height of the PTU and/or PU is larger or smaller than T, wherein T may be a fixed number or signaled from the encoder to the decoder or derived at the decoder. In one example, a specific splitting method is not allowed for a PTU and/or PU if a depth of the PTU and/or PU is larger or smaller than T, wherein T may be a fixed number or signaled from the encoder to the decoder or derived at the decoder. In one example, a specific splitting method is not allowed for a PTU and/or PU if the size and/or area of the PTU and/or PU is larger or smaller than T, wherein T may be a fixed number, signaled from the encoder to the decoder, or derived at the decoder. In one example, a specific splitting method is not allowed for a PTU and/or PU if the width of the PTU and/or PU is larger or smaller than T1 and/or the height of the PTU and/or PU is larger or smaller than T2, wherein T1 or T2 may be fixed numbers, signaled from the encoder to the decoder, or derived at the decoder. In one example, a specific splitting method is not allowed for a PTU and/or PU if the maximum, minimum, or average of the width of the PTU and/or PU and the height of the PTU and/or PU is larger or smaller than T, wherein T may be a fixed number, signaled from the encoder to the decoder, or derived at the decoder.
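The depth and size constraints above can be gathered into a small sketch; the particular thresholds are illustrative placeholders that could equally be fixed numbers, signaled in the bitstream, or derived at the decoder.

def pu_split_allowed(width, height, depth, max_depth=4, min_dim=4):
    # A PTU or PU is not further split once its depth exceeds a threshold
    # or either dimension would no longer exceed a minimum size.
    return depth <= max_depth and min(width, height) > min_dim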
The system 4000 may include a coding component 4004 that may implement the various coding or encoding methods described in the present document. The coding component 4004 may reduce the average bitrate of video from the input 4002 to the output of the coding component 4004 to produce a coded representation of the video. The coding techniques are therefore sometimes called video compression or video transcoding techniques. The output of the coding component 4004 may be either stored, or transmitted via a communication connection, as represented by the component 4006. The stored or communicated bitstream (or coded) representation of the video received at the input 4002 may be used by a component 4008 for generating pixel values or displayable video that is sent to a display interface 4010. The process of generating user-viewable video from the bitstream representation is sometimes called video decompression. Furthermore, while certain video processing operations are referred to as “coding” operations or tools, it will be appreciated that the coding tools or operations are used at an encoder and corresponding decoding tools or operations that reverse the results of the coding will be performed by a decoder.
Examples of a peripheral bus interface or a display interface may include universal serial bus (USB) or high definition multimedia interface (HDMI) or Displayport, and so on. Examples of storage interfaces include serial advanced technology attachment (SATA), peripheral component interconnect (PCI), integrated drive electronics (IDE) interface, and the like. The techniques described in the present document may be embodied in various electronic devices such as mobile phones, laptops, smartphones or other devices that are capable of performing digital data processing and/or video display.
The PTUs and/or the PUs can be recursively split into: four PUs by a QT split, two PUs by a vertical BT split, two PUs by a horizontal BT split, three PUs by a vertical TT split, three PUs by a horizontal TT split, four PUs by a vertical UQT split, four PUs by a horizontal UQT split, two PUs by a vertical UBT split, two PUs by a horizontal UBT split, four PUs by a vertical EQT split, four PUs by a horizontal EQT split, or combinations thereof. A PTU and/or a PU may collectively be referred to as a video unit in some cases for clarity of discussion.
In some examples, a depth of the prediction coding tree is calculated for a PU and/or PTU. The depth may be used to indicate the number of splits that occur in the prediction coding tree. The depth may be signaled and/or may be compared to one or more thresholds to determine when splitting is no longer allowed for a leaf node. In some examples, the depth is a QT depth indicating a number of times an ancestor video unit (e.g., the PTU) is split by a QT. In some examples, the depth is a multiple-type-tree (MTT) depth indicating a number of times an ancestor video unit (e.g., the PTU) is split by any split type. In some examples, the depth is initialized to a depth of a CU corresponding to the PTU or PU. This results in a total depth that is the sum of the coding tree depth as applied to the CTU and the prediction coding depth as applied to the current PTU and/or PU.
At step 4204, the video coding device performs a conversion between a visual media data and a bitstream based on the PUs. In some examples, the conversion includes encoding the visual media data into the bitstream. In some examples, the conversion includes decoding the bitstream to obtain the visual media data. The bitstream may include syntax indicating splits applied to the PUs and PTUs. For example, the bitstream may comprise a syntax element indicating whether a corresponding PTU and/or PU is further split into multiple PUs. In an example, the bitstream may comprise syntax indicating a split pattern (e.g., QT, BT, TT, etc.) and split direction (e.g., horizontal or vertical) for application to a PTU and/or a PU. The split pattern may include an ordered list of split types and split directions in some examples. In some examples, syntax indicating a split pattern and split direction is conditionally signaled, and hence is only signaled for a PTU and/or PU when the PTU and/or PU is further split. If no split is applied, the corresponding syntax can be omitted from the bitstream. In some examples, a split of a current video unit (PTU and/or PU) is not included in the bitstream and is inferred by a decoder.
In some examples, a split is inferred according to: a current video unit dimension (e.g., height, depth, and/or size), current video unit depth, current video unit position relative to picture boundary, current video unit position relative to a sub-picture boundary, whether the current video unit can be further split, a current video unit depth relative to a depth threshold, a current video unit height relative to a height threshold, a current video unit width relative to a width threshold, or combinations thereof. In some examples, a split is disallowed by a comparison of: a current video unit height relative to a height threshold (e.g., minimum and/or maximum height), a current video unit width relative to a width threshold (e.g., minimum and/or maximum width), a current video height and a current video width relative to a size threshold (e.g., minimum and/or maximum size), a current video unit depth relative to a depth threshold, a current video unit size (e.g., maximum and/or minimum width and/or height) relative to the size threshold, or combinations thereof.
It should be noted that the method 4200 can be implemented in an apparatus for processing video data comprising a processor and a non-transitory memory with instructions thereon, such as video encoder 4400, video decoder 4500, and/or encoder 4600. In such a case, the instructions upon execution by the processor, cause the processor to perform the method 4200. Further, the method 4200 can be performed by a non-transitory computer readable medium comprising a computer program product for use by a video coding device. The computer program product comprises computer executable instructions stored on the non-transitory computer readable medium such that when executed by a processor cause the video coding device to perform the method 4200.
Source device 4310 may include a video source 4312, a video encoder 4314, and an input/output (I/O) interface 4316. Video source 4312 may include a source such as a video capture device, an interface to receive video data from a video content provider, and/or a computer graphics system for generating video data, or a combination of such sources. The video data may comprise one or more pictures. Video encoder 4314 encodes the video data from video source 4312 to generate a bitstream. The bitstream may include a sequence of bits that form a coded representation of the video data. The bitstream may include coded pictures and associated data. The coded picture is a coded representation of a picture. The associated data may include sequence parameter sets, picture parameter sets, and other syntax structures. I/O interface 4316 may include a modulator/demodulator (modem) and/or a transmitter. The encoded video data may be transmitted directly to destination device 4320 via I/O interface 4316 through network 4330. The encoded video data may also be stored onto a storage medium/server 4340 for access by destination device 4320.
Destination device 4320 may include an I/O interface 4326, a video decoder 4324, and a display device 4322. I/O interface 4326 may include a receiver and/or a modem. I/O interface 4326 may acquire encoded video data from the source device 4310 or the storage medium/server 4340. Video decoder 4324 may decode the encoded video data. Display device 4322 may display the decoded video data to a user. Display device 4322 may be integrated with the destination device 4320, or may be external to destination device 4320, which can be configured to interface with an external display device.
Video encoder 4314 and video decoder 4324 may operate according to a video compression standard, such as the High Efficiency Video Coding (HEVC) standard, the Versatile Video Coding (VVC) standard, and other current and/or future standards.
The functional components of video encoder 4400 may include a partition unit 4401, a prediction unit 4402 which may include a mode select unit 4403, a motion estimation unit 4404, a motion compensation unit 4405, an intra prediction unit 4406, a residual generation unit 4407, a transform processing unit 4408, a quantization unit 4409, an inverse quantization unit 4410, an inverse transform unit 4411, a reconstruction unit 4412, a buffer 4413, and an entropy encoding unit 4414.
In other examples, video encoder 4400 may include more, fewer, or different functional components. In an example, prediction unit 4402 may include an intra block copy (IBC) unit. The IBC unit may perform prediction in an IBC mode in which at least one reference picture is a picture where the current video block is located.
Furthermore, some components, such as motion estimation unit 4404 and motion compensation unit 4405 may be highly integrated, but are represented in the example of video encoder 4400 separately for purposes of explanation.
Partition unit 4401 may partition a picture into one or more video blocks. Video encoder 4400 and video decoder 4500 may support various video block sizes.
Mode select unit 4403 may select one of the coding modes, intra or inter, e.g., based on error results, and provide the resulting intra or inter coded block to a residual generation unit 4407 to generate residual block data and to a reconstruction unit 4412 to reconstruct the encoded block for use as a reference picture. In some examples, mode select unit 4403 may select a combination of intra and inter prediction (CIIP) mode in which the prediction is based on an inter prediction signal and an intra prediction signal. Mode select unit 4403 may also select a resolution for a motion vector (e.g., a sub-pixel or integer pixel precision) for the block in the case of inter prediction.
To perform inter prediction on a current video block, motion estimation unit 4404 may generate motion information for the current video block by comparing one or more reference frames from buffer 4413 to the current video block. Motion compensation unit 4405 may determine a predicted video block for the current video block based on the motion information and decoded samples of pictures from buffer 4413 other than the picture associated with the current video block.
Motion estimation unit 4404 and motion compensation unit 4405 may perform different operations for a current video block, for example, depending on whether the current video block is in an I slice, a P slice, or a B slice.
In some examples, motion estimation unit 4404 may perform uni-directional prediction for the current video block, and motion estimation unit 4404 may search reference pictures of list 0 or list 1 for a reference video block for the current video block. Motion estimation unit 4404 may then generate a reference index that indicates the reference picture in list 0 or list 1 that contains the reference video block and a motion vector that indicates a spatial displacement between the current video block and the reference video block. Motion estimation unit 4404 may output the reference index, a prediction direction indicator, and the motion vector as the motion information of the current video block. Motion compensation unit 4405 may generate the predicted video block of the current block based on the reference video block indicated by the motion information of the current video block.
In other examples, motion estimation unit 4404 may perform bi-directional prediction for the current video block, motion estimation unit 4404 may search the reference pictures in list 0 for a reference video block for the current video block and may also search the reference pictures in list 1 for another reference video block for the current video block. Motion estimation unit 4404 may then generate reference indexes that indicate the reference pictures in list 0 and list 1 containing the reference video blocks and motion vectors that indicate spatial displacements between the reference video blocks and the current video block. Motion estimation unit 4404 may output the reference indexes and the motion vectors of the current video block as the motion information of the current video block. Motion compensation unit 4405 may generate the predicted video block of the current video block based on the reference video blocks indicated by the motion information of the current video block.
In some examples, motion estimation unit 4404 may output a full set of motion information for decoding processing of a decoder. In some examples, motion estimation unit 4404 may not output a full set of motion information for the current video. Rather, motion estimation unit 4404 may signal the motion information of the current video block with reference to the motion information of another video block. For example, motion estimation unit 4404 may determine that the motion information of the current video block is sufficiently similar to the motion information of a neighboring video block.
In one example, motion estimation unit 4404 may indicate, in a syntax structure associated with the current video block, a value that indicates to the video decoder 4500 that the current video block has the same motion information as another video block.
In another example, motion estimation unit 4404 may identify, in a syntax structure associated with the current video block, another video block and a motion vector difference (MVD). The motion vector difference indicates a difference between the motion vector of the current video block and the motion vector of the indicated video block. The video decoder 4500 may use the motion vector of the indicated video block and the motion vector difference to determine the motion vector of the current video block.
As discussed above, video encoder 4400 may predictively signal the motion vector. Two examples of predictive signaling techniques that may be implemented by video encoder 4400 include advanced motion vector prediction (AMVP) and merge mode signaling.
Intra prediction unit 4406 may perform intra prediction on the current video block. When intra prediction unit 4406 performs intra prediction on the current video block, intra prediction unit 4406 may generate prediction data for the current video block based on decoded samples of other video blocks in the same picture. The prediction data for the current video block may include a predicted video block and various syntax elements.
Residual generation unit 4407 may generate residual data for the current video block by subtracting the predicted video block(s) of the current video block from the current video block. The residual data of the current video block may include residual video blocks that correspond to different sample components of the samples in the current video block.
In other examples, there may be no residual data for the current video block, for example in a skip mode, and residual generation unit 4407 may not perform the subtracting operation.
Transform processing unit 4408 may generate one or more transform coefficient video blocks for the current video block by applying one or more transforms to a residual video block associated with the current video block.
After transform processing unit 4408 generates a transform coefficient video block associated with the current video block, quantization unit 4409 may quantize the transform coefficient video block associated with the current video block based on one or more quantization parameter (QP) values associated with the current video block.
Inverse quantization unit 4410 and inverse transform unit 4411 may apply inverse quantization and inverse transforms to the transform coefficient video block, respectively, to reconstruct a residual video block from the transform coefficient video block. Reconstruction unit 4412 may add the reconstructed residual video block to corresponding samples from one or more predicted video blocks generated by the prediction unit 4402 to produce a reconstructed video block associated with the current block for storage in the buffer 4413.
After reconstruction unit 4412 reconstructs the video block, the loop filtering operation may be performed to reduce video blocking artifacts in the video block.
Entropy encoding unit 4414 may receive data from other functional components of the video encoder 4400. When entropy encoding unit 4414 receives the data, entropy encoding unit 4414 may perform one or more entropy encoding operations to generate entropy encoded data and output a bitstream that includes the entropy encoded data.
In the example shown, video decoder 4500 includes an entropy decoding unit 4501, a motion compensation unit 4502, an intra prediction unit 4503, an inverse quantization unit 4504, an inverse transformation unit 4505, a reconstruction unit 4506, and a buffer 4507. Video decoder 4500 may, in some examples, perform a decoding pass generally reciprocal to the encoding pass described with respect to video encoder 4400.
Entropy decoding unit 4501 may retrieve an encoded bitstream. The encoded bitstream may include entropy coded video data (e.g., encoded blocks of video data). Entropy decoding unit 4501 may decode the entropy coded video data, and from the entropy decoded video data, motion compensation unit 4502 may determine motion information including motion vectors, motion vector precision, reference picture list indexes, and other motion information. Motion compensation unit 4502 may, for example, determine such information by performing the AMVP and merge mode.
Motion compensation unit 4502 may produce motion compensated blocks, possibly performing interpolation based on interpolation filters. Identifiers for interpolation filters to be used with sub-pixel precision may be included in the syntax elements.
Motion compensation unit 4502 may use interpolation filters as used by video encoder 4400 during encoding of the video block to calculate interpolated values for sub-integer pixels of a reference block. Motion compensation unit 4502 may determine the interpolation filters used by video encoder 4400 according to received syntax information and use the interpolation filters to produce predictive blocks.
Motion compensation unit 4502 may use some of the syntax information to determine sizes of blocks used to encode frame(s) and/or slice(s) of the encoded video sequence, partition information that describes how each macroblock of a picture of the encoded video sequence is partitioned, modes indicating how each partition is encoded, one or more reference frames (and reference frame lists) for each inter coded block, and other information to decode the encoded video sequence.
Intra prediction unit 4503 may use intra prediction modes for example received in the bitstream to form a prediction block from spatially adjacent blocks. Inverse quantization unit 4504 inverse quantizes, i.e., de-quantizes, the quantized video block coefficients provided in the bitstream and decoded by entropy decoding unit 4501. Inverse transform unit 4505 applies an inverse transform.
Reconstruction unit 4506 may sum the residual blocks with the corresponding prediction blocks generated by motion compensation unit 4502 or intra prediction unit 4503 to form decoded blocks. If desired, a deblocking filter may also be applied to filter the decoded blocks in order to remove blockiness artifacts. The decoded video blocks are then stored in buffer 4507, which provides reference blocks for subsequent motion compensation/intra prediction and also produces decoded video for presentation on a display device.
The encoder 4600 further includes an intra prediction component 4608 and a motion estimation/compensation (ME/MC) component 4610 configured to receive input video. The intra prediction component 4608 is configured to perform intra prediction, while the ME/MC component 4610 is configured to utilize reference pictures obtained from a reference picture buffer 4612 to perform inter prediction. Residual blocks from inter prediction or intra prediction are fed into a transform (T) component 4614 and a quantization (Q) component 4616 to generate quantized residual transform coefficients, which are fed into an entropy coding component 4618. The entropy coding component 4618 entropy codes the prediction results and the quantized transform coefficients and transmits the same toward a video decoder (not shown). Quantized components output from the quantization component 4616 may be fed into an inverse quantization (IQ) component 4620, an inverse transform component 4622, and a reconstruction (REC) component 4624. The REC component 4624 is able to output images to the DF 4602, the SAO 4604, and the ALF 4606 for filtering prior to those images being stored in the reference picture buffer 4612.
A listing of solutions preferred by some examples is provided next.
The following solutions show examples of techniques discussed herein.
In the solutions described herein, an encoder may conform to the format rule by producing a coded representation according to the format rule. In the solutions described herein, a decoder may use the format rule to parse syntax elements in the coded representation with the knowledge of presence and absence of syntax elements according to the format rule to produce decoded video.
In the present document, the term “video processing” may refer to video encoding, video decoding, video compression or video decompression. For example, video compression algorithms may be applied during conversion from pixel representation of a video to a corresponding bitstream representation or vice versa. The bitstream representation of a current video block may, for example, correspond to bits that are either co-located or spread in different places within the bitstream, as is defined by the syntax. For example, a macroblock may be encoded in terms of transformed and coded error residual values and also using bits in headers and other fields in the bitstream. Furthermore, during conversion, a decoder may parse a bitstream with the knowledge that some fields may be present, or absent, based on the determination, as is described in the above solutions. Similarly, an encoder may determine that certain syntax fields are or are not to be included and generate the coded representation accordingly by including or excluding the syntax fields from the coded representation.
The disclosed and other solutions, examples, embodiments, modules and the functional operations described in this document can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this document and their structural equivalents, or in combinations of one or more of them. The disclosed and other embodiments can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer readable medium for execution by, or to control the operation of, data processing apparatus. The computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus.
A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this document can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random-access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and compact disc read-only memory (CD-ROM) and digital versatile disc read-only memory (DVD-ROM) disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this patent document contains many specifics, these should not be construed as limitations on the scope of any subject matter or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular techniques. Certain features that are described in this patent document in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Moreover, the separation of various system components in the embodiments described in this patent document should not be understood as requiring such separation in all embodiments.
Only a few implementations and examples are described and other implementations, enhancements and variations can be made based on what is described and illustrated in this patent document.
A first component is directly coupled to a second component when there are no intervening components, except for a line, a trace, or another medium between the first component and the second component. The first component is indirectly coupled to the second component when there are intervening components other than a line, a trace, or another medium between the first component and the second component. The term “coupled” and its variants include both directly coupled and indirectly coupled. The use of the term “about” means a range including ±10% of the subsequent number unless otherwise stated.
While several embodiments have been provided in the present disclosure, it should be understood that the disclosed systems and methods might be embodied in many other specific forms without departing from the spirit or scope of the present disclosure. The present examples are to be considered as illustrative and not restrictive, and the intention is not to be limited to the details given herein. For example, the various elements or components may be combined or integrated in another system or certain features may be omitted, or not implemented.
In addition, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, modules, techniques, or methods without departing from the scope of the present disclosure. Other items shown or discussed as coupled may be directly connected or may be indirectly coupled or communicating through some interface, device, or intermediate component whether electrically, mechanically, or otherwise. Other examples of changes, substitutions, and alterations are ascertainable by one skilled in the art and could be made without departing from the spirit and scope disclosed herein.
Number | Date | Country | Kind |
---|---|---|---|
PCT/CN2021/103527 | Jun 2021 | WO | international |
This application is a continuation of International Application No. PCT/CN2022/102393, filed on Jun. 29, 2022, which claims the priority and benefit of International Application No. PCT/CN2021/103527, filed Jun. 30, 2021, by Beijing Bytedance Network Technology Co., Ltd. et al., and titled “Recursive Prediction Unit in Video Coding,” which is hereby incorporated by reference.
| Number | Date | Country
---|---|---|---
Parent | PCT/CN2022/102393 | Jun 2022 | US
Child | 18400326 | | US