The present invention relates to a hierarchical architecture in video encoders. In particular, the present invention relates to rate-distortion optimization for deciding a block partition structure and corresponding coding modes in video encoding.
The Versatile Video Coding (VVC) standard is the latest video coding standard developed by the Joint Video Experts Team (JVET), a group of video coding experts from the ITU-T Study Group and ISO/IEC MPEG. The VVC standard relies on a block-based coding structure which divides each picture into multiple Coding Tree Units (CTUs). A CTU consists of an N×N block of luminance (luma) samples together with one or more corresponding blocks of chrominance (chroma) samples. For example, each 4:2:0 chroma subsampling CTU consists of one 128×128 luma Coding Tree Block (CTB) and two 64×64 chroma CTBs. Each CTB in a CTU is further recursively divided into one or more Coding Blocks (CBs) in a Coding Unit (CU) for encoding or decoding to adapt to various local characteristics. Flexible CU structures such as the Quad-Tree-Binary-Tree (QTBT) structure may improve the coding performance compared to the Quad-Tree (QT) structure employed in the High-Efficiency Video Coding (HEVC) standard.
The prediction decision in video encoding or decoding is made at the CU level, where each CU is coded by one or a combination of selected coding modes. After obtaining a residual signal generated by the prediction process, the residual signal belonging to a CU is further transformed into transform coefficients for compact data representation, and these transform coefficients are quantized and conveyed to the decoder.
A conventional video encoder for encoding video pictures into a bitstream is illustrated in
Merge mode with MVD (MMVD) For a CU coded by the Merge mode, implicitly derived motion information is directly used for prediction sample generation. Merge mode with MVD (MMVD) introduced in the VVC standard further refines a selected Merge candidate by signaling Motion Vector Difference (MVD) information. An MMVD flag is signaled right after a regular Merge flag to specify whether the MMVD mode is used for a CU. MMVD information signaled in the bitstream includes an MMVD candidate flag, an index specifying the motion magnitude, and an index indicating the motion direction. In the MMVD mode, one of the first two candidates in the Merge list is selected to be used as the MV basis. An MMVD candidate flag is signaled to specify which one of the first two Merge candidates is used. A distance index specifies motion magnitude information and indicates a pre-defined offset from a starting point. The offset is added to either the horizontal or vertical component of the starting MV. The relation of the distance index and the pre-defined offset is specified in Table 1.
A direction index represents the direction of the MVD relative to the starting point. The direction index indicates one of the four directions along the horizontal and vertical axes. It is noted that the meaning of the MVD sign may vary according to the information of the starting MVs. For example, when the starting MV is a uni-prediction MV, or the starting MVs are bi-prediction MVs with both lists pointing to the same direction of the current picture, the sign shown in Table 2 specifies the sign of the MV offset added to the starting MV. Both lists point to the same direction of the current picture if the Picture Order Counts (POCs) of the two reference pictures are both larger than the POC of the current picture, or the POCs of the two reference pictures are both smaller than the POC of the current picture. In cases when the starting MVs are bi-prediction MVs with the two MVs pointing to different directions of the current picture and the difference of the POCs in list 0 is greater than the one in list 1, the sign in Table 2 specifies the sign of the MV offset added to the list 0 MV component of the starting MV and the sign for the list 1 MV has an opposite sign. Otherwise, when the difference of the POCs in list 1 is greater than the one in list 0, the sign in Table 2 specifies the sign of the MV offset added to the list 1 MV component of the starting MV and the sign for the list 0 MV has an opposite sign. The MVD is scaled according to the difference of POCs in each direction. If the differences of POCs in both lists are the same, no scaling is needed; otherwise, if the difference of POCs in list 0 is larger than that of list 1, the MVD for list 1 is scaled, by defining the POC difference of list 0 as td and the POC difference of list 1 as tb. If the POC difference of list 1 is greater than that of list 0, the MVD for list 0 is scaled in the same way. If the starting MV is uni-predicted, the MVD is added to the available MV.
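For illustration, the MVD derivation for the simple case described above (a uni-prediction starting MV, with the offset applied along one axis) can be sketched in Python. This is a hedged sketch: the offset values assume the default VVC distance table (1/4 to 32 luma samples, i.e., 1 to 128 in quarter-pel units) and a direction table of {+x, −x, +y, −y}; the names are illustrative.

```python
# Hypothetical sketch of MMVD MVD derivation (names are illustrative).
# Distance table (assumed Table 1 defaults): offsets of 1/4, 1/2, 1, 2,
# 4, 8, 16, 32 luma samples, expressed here in quarter-pel units.
MMVD_OFFSETS = [1, 2, 4, 8, 16, 32, 64, 128]

# Direction table (assumed Table 2 defaults): sign applied along x or y.
MMVD_DIRECTIONS = [(+1, 0), (-1, 0), (0, +1), (0, -1)]

def derive_mmvd_mvd(distance_idx, direction_idx):
    """Return the (x, y) MVD added to the starting MV, in quarter-pel units."""
    offset = MMVD_OFFSETS[distance_idx]
    sign_x, sign_y = MMVD_DIRECTIONS[direction_idx]
    return (sign_x * offset, sign_y * offset)
```

For example, distance index 2 with direction index 0 yields an offset of one luma sample in the positive horizontal direction.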
Bi-prediction with CU-level Weight (BCW) A bi-prediction signal is generated by averaging two prediction signals obtained from two different reference pictures and/or using two different motion vectors in the HEVC standard. In the VVC standard, the bi-prediction mode is extended beyond simple averaging to allow weighted averaging of the two prediction signals.
Pbi-pred=((8−w)*P0+w*P1+4)>>3
In the VVC standard, five weights w ∈ {−2, 3, 4, 5, 10} are allowed in the weighted averaging bi-prediction. For each bi-predicted CU, the weight w is determined in one of two ways: 1) for a non-Merge CU, the weight index is signaled after the motion vector difference; 2) for a Merge CU, the weight index is inferred from neighboring blocks based on the Merge candidate index. BCW is only applied to CUs with 256 or more luma samples, which implies the CU width times the CU height must be greater than or equal to 256. For low-delay pictures, all 5 weights are used. For non-low-delay pictures, only 3 weights w ∈ {3, 4, 5} are used.
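The weighted averaging above can be sketched as a per-sample integer operation (a minimal sketch; the function name is illustrative):

```python
def bcw_blend(p0, p1, w):
    """BCW weighted bi-prediction sample: ((8 - w) * P0 + w * P1 + 4) >> 3.

    w = 4 reduces to simple averaging; w from {-2, 3, 4, 5, 10} per VVC.
    """
    return ((8 - w) * p0 + w * p1 + 4) >> 3
```

With w = 4, the two predictors contribute equally, matching the HEVC-style average.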
Fast search algorithms are applied to find the weight index without significantly increasing the encoder complexity at the video encoders. When combined with Adaptive Motion Vector Resolution (AMVR), unequal weights are only conditionally checked for 1-pel and 4-pel motion vector precisions if the current picture is a low-delay picture. When BCW is combined with the affine mode, affine Motion Estimation (ME) is performed for unequal weights only if the affine mode is selected as the current best mode. Unequal weights are only conditionally checked when the two reference pictures in bi-prediction are the same. Unequal weights are not searched when certain conditions are met, depending on the POC distance between the current picture and its reference pictures, the coding QP, and the temporal level.
The BCW weight index is coded using one context coded bin followed by bypass coded bins. The first context coded bin indicates whether equal weight is used; if unequal weight is used, additional bins are signaled using bypass coding to indicate which unequal weight is used. Weighted Prediction (WP) is a coding tool supported by the H.264/AVC and HEVC standards to efficiently code video content with fading. Support for WP was also added into the VVC standard. WP allows weighting parameters (weight and offset) to be signaled for each reference picture in each of the reference picture lists L0 and L1. The weight(s) and offset(s) of the corresponding reference picture(s) are applied during motion compensation. WP and BCW are designed for different types of video content. In order to avoid interactions between WP and BCW, which would complicate the VVC decoder design, if a CU uses WP, then the BCW weight index is not signaled, and w is inferred to be 4, implying equal weight is applied. For a Merge CU, the weight index is inferred from neighboring blocks based on the Merge candidate index. This can be applied to both the normal Merge mode and the inherited affine Merge mode. For the constructed affine Merge mode, the affine motion information is constructed based on the motion information of up to 3 blocks. The BCW index for a CU using the constructed affine Merge mode is simply set equal to the BCW index of the first control point MV. In the VVC standard, Combined Inter and Intra Prediction (CIIP) and BCW cannot be jointly applied to a CU. When a CU is coded with the CIIP mode, the BCW index of the current CU is set to 4, implying equal weight is applied.
Multiple Transform Selection (MTS) for Core Transform In addition to the DCT-II transform employed in the HEVC standard, an MTS scheme is used for residual coding of both inter and intra coded blocks. It provides the flexibility to select a transform coding setting from multiple transforms such as DCT-II, DCT-VIII, and DST-VII. The newly introduced transform matrices are DST-VII and DCT-VIII. Table 3 shows the basis functions of the DST and DCT transforms.
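Assuming the standard basis function definitions of DST-VII and DCT-VIII as listed in the VVC test model description (an assumption here, since Table 3 is not reproduced above), the transform matrices can be generated and checked for orthonormality as follows; the function names are illustrative:

```python
import math

def dst7_basis(N):
    """DST-VII basis: T_i(j) = sqrt(4/(2N+1)) * sin(pi*(2i+1)*(j+1)/(2N+1))."""
    s = math.sqrt(4.0 / (2 * N + 1))
    return [[s * math.sin(math.pi * (2 * i + 1) * (j + 1) / (2 * N + 1))
             for j in range(N)] for i in range(N)]

def dct8_basis(N):
    """DCT-VIII basis: T_i(j) = sqrt(4/(2N+1)) * cos(pi*(2i+1)*(2j+1)/(4N+2))."""
    s = math.sqrt(4.0 / (2 * N + 1))
    return [[s * math.cos(math.pi * (2 * i + 1) * (2 * j + 1) / (4 * N + 2))
             for j in range(N)] for i in range(N)]
```

Both matrices are orthonormal in floating point; the fixed-point 8-bit cores used by the standard are scaled and rounded versions of these bases.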
In order to keep the orthogonality of the transform matrices, the transform matrices are quantized more accurately than the transform matrices in the HEVC standard. To keep the intermediate values of the transformed coefficients within the 16-bit range, after the horizontal and after the vertical transform, all the coefficients are 10-bit coefficients. In order to control the MTS scheme, separate enabling flags are specified at the Sequence Parameter Set (SPS) level for intra and inter prediction, respectively. When MTS is enabled at the SPS level, a CU-level flag is signaled to indicate whether MTS is applied or not. MTS is applied only to the luma component. The MTS signaling is skipped when one of the following conditions applies: the position of the last significant coefficient for the luma Transform Block (TB) is less than 1 (i.e., DC only), or the last significant coefficient of the luma TB is located inside the MTS zero-out region.
If the MTS CU flag is equal to zero, then DCT-II is applied in both directions. However, if the MTS CU flag is equal to one, then two other flags are additionally signaled to indicate the transform type for the horizontal and vertical directions, respectively. A transform and flag signaling mapping table is shown in Table 4. The transform selection for Intra Sub-Partition (ISP) and implicit MTS is unified by removing the intra-mode and block-shape dependencies. If a current block is coded in ISP mode, or if the current block is an intra block and both intra and inter explicit MTS are on, then only DST-VII is used for both horizontal and vertical transform cores. When it comes to transform matrix precision, 8-bit primary transform cores are used. Therefore, all the transform cores used in the HEVC standard are kept the same, including 4-point DCT-II and DST-VII, and 8-point, 16-point and 32-point DCT-II. Also, the other transform cores, including 64-point DCT-II, 4-point DCT-VIII, and 8-point, 16-point and 32-point DST-VII and DCT-VIII, use 8-bit primary transform cores.
To reduce the complexity of large-size DST-VII and DCT-VIII transforms, high-frequency transform coefficients are zeroed out for the DST-VII and DCT-VIII blocks with size (width or height, or both width and height) equal to 32. Only the coefficients within the 16×16 lower-frequency region are retained.
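A minimal sketch of this zero-out rule, assuming coefficients are stored row by row with the low-frequency coefficients in the top-left corner (the function name is illustrative):

```python
def mts_zero_out(coeffs, width, height):
    """Zero out DST-VII/DCT-VIII coefficients outside the 16x16
    low-frequency region for any dimension equal to 32.

    coeffs is a height x width list of rows, DC at coeffs[0][0].
    """
    zw = 16 if width == 32 else width    # retained width
    zh = 16 if height == 32 else height  # retained height
    return [[coeffs[y][x] if (x < zw and y < zh) else 0
             for x in range(width)] for y in range(height)]
```

For a 32×32 block, only 256 of the 1024 coefficients survive the zero-out.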
As in the HEVC standard, the residual of a block can be coded with the transform skip mode. To avoid redundancy of syntax coding, the transform skip flag is not signaled when the CU-level MTS CU flag is not equal to zero. Note that the implicit MTS transform is set to DCT-II when Low-Frequency Non-Separable Transform (LFNST) or Matrix-based Intra Prediction (MIP) is activated for the current CU. Also, implicit MTS can still be enabled when MTS is enabled for inter coded blocks.
Geometric Partitioning Mode (GPM) In the VVC standard, the GPM is supported for inter prediction. The GPM is signaled using a CU-level flag as one kind of Merge mode, with other Merge modes including the regular Merge mode, the MMVD mode, the CIIP mode, and the subblock Merge mode. In total, 64 partitions are supported by GPM for each possible CU size w×h = 2^m×2^n with m, n ∈ {3, ..., 6}, excluding 8×64 and 64×8. When this mode is used, a CU is split into two parts by a geometrically located straight line as shown in
If the geometric partitioning mode is used for the current CU, then a geometric partition index indicating the partition mode of the geometric partition (angle and offset) and two Merge indices (one for each partition) are further signaled. The maximum GPM candidate list size is signaled explicitly in the SPS and specifies the syntax binarization for GPM Merge indices. After predicting each part of the geometric partition, the sample values along the geometric partition edge are adjusted using a blending process with adaptive weights to acquire the prediction signal for the whole CU. The transform and quantization process is applied to the whole CU as in other prediction modes. Finally, the motion field of a CU predicted using the geometric partitioning mode is stored.
The uni-prediction candidate list is derived directly from the Merge candidate list constructed according to the extended Merge prediction process. Denote n as the index of the uni-prediction motion in the geometric uni-prediction candidate list. The LX motion vector of the n-th extended Merge candidate, with X equal to the parity of n, is used as the n-th uni-prediction motion vector for the geometric partitioning mode. For example, the uni-prediction motion vector for Merge index 0 is the L0 MV, the uni-prediction motion vector for Merge index 1 is the L1 MV, the uni-prediction motion vector for Merge index 2 is the L0 MV, and the uni-prediction motion vector for Merge index 3 is the L1 MV. In case a corresponding LX motion vector of the n-th extended Merge candidate does not exist, the L(1−X) motion vector of the same candidate is used instead as the uni-prediction motion vector for the geometric partitioning mode.
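The parity-based selection described above can be sketched as follows; the candidate representation (a dict with optional 'L0'/'L1' entries) is a hypothetical simplification of a Merge candidate:

```python
def gpm_uni_mv(merge_cands, n):
    """Pick the uni-prediction MV for GPM candidate index n.

    Use the LX MV with X equal to the parity of n; if that list's MV
    does not exist, fall back to the L(1-X) MV of the same candidate.
    """
    x = n & 1                    # parity of n selects the list
    cand = merge_cands[n]
    key, alt = 'L%d' % x, 'L%d' % (1 - x)
    return cand[key] if cand.get(key) is not None else cand[alt]
```

So even-indexed GPM candidates prefer L0 motion and odd-indexed candidates prefer L1 motion, balancing reference list usage across the candidate list.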
After predicting each part of a geometric partition using its own motion, blending is applied to the two prediction signals to derive samples around the geometric partition edge. The blending weight for each position of the CU is derived based on the distance between the individual position and the partition edge.
The distance for a position (x, y) to the partition edge is derived as:
d(x, y)=(2x+1−w)·cos(φi)+(2y+1−h)·sin(φi)−ρj
ρj=ρx,j·cos(φi)+ρy,j·sin(φi)
where i, j are the indices for the angle and offset of a geometric partition, which depend on the signaled geometric partition index, and φi is the angle corresponding to angle index i. The signs of ρx,j and ρy,j depend on the angle index i.
The weights for each part of a geometric partition are derived as follows:
wIdxL(x, y)=partIdx ? 32+d(x, y) : 32−d(x, y)
w0(x, y)=Clip3(0, 8, (wIdxL(x, y)+4)>>3)/8
w1(x, y)=1−w0(x, y)
where partIdx depends on the angle index i.
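A hedged sketch of the blending-weight derivation, assuming the VVC test model's distance and clipping definitions (the floating-point angle parameterization here is an illustrative simplification of the integer lookup tables used in practice):

```python
import math

def gpm_weights(w, h, phi, rho, part_idx):
    """Per-sample blending weight w0 for the first geometric partition.

    Assumes d(x, y) = (2x+1-w)*cos(phi) + (2y+1-h)*sin(phi) - rho,
    wIdx = 32 + d if partIdx else 32 - d, and
    w0 = Clip3(0, 8, (wIdx + 4) >> 3) / 8.
    """
    weights = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d = ((2 * x + 1 - w) * math.cos(phi)
                 + (2 * y + 1 - h) * math.sin(phi) - rho)
            w_idx = 32 + d if part_idx else 32 - d
            # Clip to [0, 8] in units of 1/8, then normalize
            weights[y][x] = min(max(int(w_idx + 4) >> 3, 0), 8) / 8.0
    return weights
```

For a vertical split (phi = 0), samples farther to the left of the edge receive a larger weight for the first partition, and the weights transition smoothly across the edge.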
Mv1 from the first part of the geometric partition, Mv2 from the second part of the geometric partition, and a combined motion vector of Mv1 and Mv2 are stored in the motion field of a geometric partitioning mode coded CU. The stored motion vector type for each individual position in the motion field is determined as:
sType=abs(motionIdx)<32?2:(motionIdx≤0?(1−partIdx):partIdx)
where motionIdx is equal to d(4x+2, 4y+2), which is recalculated from the above equation, and partIdx depends on the angle index i. If sType is equal to 0 or 1, Mv1 or Mv2 is respectively stored in the corresponding motion field; otherwise, if sType is equal to 2, a combined motion vector from Mv1 and Mv2 is stored. The combined motion vector is generated using the following process: if Mv1 and Mv2 are from different reference picture lists (one from L0 and the other from L1), then Mv1 and Mv2 are simply combined to form the bi-prediction motion vectors; otherwise, if Mv1 and Mv2 are from the same list, only the uni-prediction motion Mv2 is stored.
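The motion field storage rule above can be sketched as follows; the MV representation (a dict carrying the reference list) is a hypothetical simplification:

```python
def gpm_stored_mv(motion_idx, part_idx, mv1, mv2):
    """Decide which motion is stored at a 4x4 motion field position.

    sType = abs(motionIdx) < 32 ? 2 : (motionIdx <= 0 ? 1 - partIdx : partIdx)
    sType 0 -> Mv1, sType 1 -> Mv2, sType 2 -> combined (or Mv2 if same list).
    """
    if abs(motion_idx) < 32:
        s_type = 2
    else:
        s_type = (1 - part_idx) if motion_idx <= 0 else part_idx
    if s_type == 0:
        return mv1
    if s_type == 1:
        return mv2
    # sType == 2: combine into bi-prediction if from different lists
    if mv1['list'] != mv2['list']:
        return {'bi': (mv1, mv2)}
    return mv2
```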
Combined Inter and Intra Prediction (CIIP) In the VVC standard, when a CU is coded in Merge mode, if the CU contains at least 64 luma samples (that is, the CU width times the CU height is equal to or larger than 64), and if both the CU width and the CU height are less than 128 luma samples, an additional flag is signaled to indicate if the Combined Inter and Intra Prediction (CIIP) mode is applied to the current CU. As the name suggests, the CIIP mode combines an inter prediction signal with an intra prediction signal. The inter prediction signal in the CIIP mode Pinter is derived using the same inter prediction process applied to the regular Merge mode, and the intra prediction signal Pintra is derived following the regular intra prediction process with the planar mode. Then, the intra and inter prediction signals are combined using weighted averaging, where the weight value is calculated depending on the coding modes of the top and left neighboring blocks as follows. A variable isIntraTop is set to 1 if the top neighboring block is available and intra coded, otherwise isIntraTop is set to 0, and a variable isIntraLeft is set to 1 if the left neighboring block is available and intra coded, otherwise isIntraLeft is set to 0. The weight value wt is set to 3 if the sum of the two variables isIntraTop and isIntraLeft is equal to 2; otherwise the weight value wt is set to 2 if the sum of the two variables is equal to 1; otherwise the weight value wt is set to 1. The CIIP prediction is calculated as follows:
PCIIP=((4−wt)*Pinter+wt*Pintra+2)>>2
Embodiments of video encoding methods for a video encoding system perform Rate Distortion Optimization (RDO) by a hierarchical architecture. The embodiments of video encoding methods comprise receiving input data associated with a current block in a video picture, determining a block partitioning structure of the current block and determining a corresponding coding mode for each coding block in the current block by multiple Processing Element (PE) groups, splitting the current block into one or more coding blocks according to the block partitioning structure, and entropy encoding the coding blocks in the current block according to the corresponding coding modes determined by the PE groups. Each PE group has multiple parallel PEs performing RDO tasks. Each PE group is associated with a particular block size, and for each PE group, the current block is divided into one or more partitions each having the particular block size associated with the PE group and each partition is divided into sub-partitions according to one or more partitioning types. The parallel PEs of each PE group test multiple coding modes on each partition of the current block and corresponding sub-partitions split from each partition to derive rate-distortion costs associated with the coding modes on each partition and sub-partition. The block partitioning structure of the current block and the corresponding coding mode for each coding block in the current block are decided according to the rate-distortion costs.
In some embodiments of the hierarchical architecture, a buffer size required for each PE group is related to the particular block size associated with the PE group. For example, a smaller memory buffer is required for PE groups associated with smaller block sizes. The buffer size required for each PE group may be further reduced by setting a same block partitioning testing order for all PE threads in the PE group, and based on rate-distortion costs associated with at least two partitioning types, a set of reconstruction buffers initially storing reconstruction samples associated with one of the two partitioning types is released for storing reconstruction samples associated with another partitioning type. For example, the block partitioning testing order for all PE threads is horizontal binary-tree partitioning, vertical binary-tree partitioning, and then no-split. The partitioning types for dividing each partition in the current block into sub-partitions include one or a combination of horizontal binary-tree partitioning, vertical binary-tree partitioning, horizontal ternary-tree partitioning, and vertical ternary-tree partitioning according to some embodiments.
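The buffer-release behavior described above can be sketched as follows; the class and its two-buffer policy are a hypothetical simplification in which two reconstruction buffers serve three partitioning tests performed in a fixed order, releasing the buffer of the costlier stored result when a new result arrives:

```python
class SharedReconBuffers:
    """Sketch of reconstruction buffer reuse across partitioning tests.

    At most two results are kept; storing a third releases the buffer
    holding the worse (higher rate-distortion cost) of the stored two.
    """
    def __init__(self):
        self.buffers = {}  # partitioning type -> (rd_cost, recon samples)

    def store(self, ptype, rd_cost, recon):
        if len(self.buffers) == 2:
            # Release the buffer holding the costlier stored result
            worst = max(self.buffers, key=lambda t: self.buffers[t][0])
            del self.buffers[worst]
        self.buffers[ptype] = (rd_cost, recon)

    def best(self):
        """Partitioning type with the lowest stored rate-distortion cost."""
        return min(self.buffers, key=lambda t: self.buffers[t][0])
```

Testing in the fixed order (horizontal binary-tree, vertical binary-tree, no-split) then needs only two physical buffers instead of three.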
A PE in a PE group is used to test a coding mode or one or more candidates of a coding mode in one PE call, or a PE tests a coding mode or one or more candidates of a coding mode in multiple PE calls. A PE call is a time interval. A PE computes a low-complexity RDO operation followed by a high-complexity RDO operation in a PE call, or a PE computes either a low-complexity RDO operation or a high-complexity RDO operation in a PE call. In some embodiments, a first PE in a PE group computes a low-complexity RDO operation of a coding mode and a second PE in the same PE group computes a high-complexity RDO operation of the coding mode, and intermediate results can be passed from the first PE to the second PE. For example, the two PEs test a coding mode on first and second partitions, where the first PE computes the low-complexity RDO operation for the second partition while the second PE computes the high-complexity RDO operation for the first partition.
In some preferred embodiments, coding tools or coding modes with similar properties are combined in a same PE thread in each PE group. In some embodiments, one or more predefined conditions are checked for one or more PE groups, and the video encoding system adaptively selects coding modes for one or more PEs when the predefined conditions are satisfied. The predefined conditions may be associated with comparisons of information between the partition/sub-partition and one or more neighboring blocks of the partition/sub-partition, a current temporal identifier, historical Motion Vector (MV) list, or preprocessing results. The information between the partition/sub-partition and one or more neighboring blocks of the partition/sub-partition comprises a prediction mode, block size, block partition type, MVs, reconstruction samples, or residuals. In an embodiment, one or more PEs skip coding in one or more PE calls when the predefined conditions are satisfied. For example, one of the predefined conditions is satisfied when an accumulated rate-distortion cost of one PE is higher than each of accumulated rate-distortion costs of other PEs by a predefined threshold.
In some embodiments, one or more buffers are shared among the parallel PEs of a same PE group by unifying a data scanning order among the PEs. A current PE of a current PE group may share prediction samples from one or more PEs of the current PE group directly without temporarily storing the prediction samples in a buffer. In one embodiment, the current PE tests one or more GPM candidates on each partition or sub-partition by acquiring the prediction samples from the one or more PEs testing Merge candidates on the partition or sub-partition. GPM tasks originally assigned to the current PE may be adaptively skipped according to a rate-distortion cost associated with a prediction result of the current PE. In another embodiment, the current PE tests one or more CIIP candidates on each partition or sub-partition by acquiring the prediction samples from one or more PEs testing Merge candidates on the partition or sub-partition and one PE testing an intra Planar mode. CIIP tasks originally assigned to the current PE may be adaptively skipped according to a rate-distortion cost associated with a prediction result of the current PE. In yet another embodiment, the current PE tests one or more AMVP-BI candidates on each partition or sub-partition by acquiring the prediction samples from the one or more PEs testing AMVP-UNI candidates on the partition or sub-partition. In one embodiment, the current PE tests one or more BCW candidates on each partition or sub-partition by acquiring the prediction samples from the one or more PEs testing AMVP-UNI candidates on the partition or sub-partition.
According to an embodiment, a set of neighboring buffers storing neighboring reconstruction samples is shared between multiple PEs in one PE group. In one embodiment, a residual of each coding block is generated and shared between multiple PEs for transform processing according to different transform coding settings. In some embodiments of the present invention, Sum of Absolute Transform Difference (SATD) units are dynamically shared among the parallel PEs within one PE group.
Aspects of the disclosure further provide an apparatus for a video encoding system. The apparatus comprises one or more electronic circuits configured for receiving input data associated with a current block in a video picture, determining a block partitioning structure of the current block and determining a corresponding coding mode for each coding block in the current block by multiple PE groups, splitting the current block into one or more coding blocks according to the block partitioning structure, and entropy encoding the coding blocks in the current block according to the corresponding coding modes determined by the PE groups. Each PE group has multiple parallel PEs. Each PE group is associated with a particular block size, and for each PE group, the current block is divided into one or more partitions each having the particular block size and each partition is divided into sub-partitions according to one or more partitioning types. The parallel PEs of each PE group test multiple coding modes on each partition of the current block and corresponding sub-partitions split from each partition. The block partitioning structure of the current block and the corresponding coding mode of each coding block are decided according to rate-distortion costs associated with the coding modes tested by the PE groups.
Various embodiments of this disclosure that are proposed as examples will be described in detail with reference to the following figures, wherein like numerals reference like elements, and wherein:
It will be readily understood that the components of the present invention, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following more detailed description of the embodiments of the systems and methods of the present invention, as represented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention.
Reference throughout this specification to “an embodiment”, “some embodiments”, or similar language means that a particular feature, structure, or characteristic described in connection with the embodiments may be included in at least one embodiment of the present invention. Thus, appearances of the phrases “in an embodiment” or “in some embodiments” in various places throughout this specification are not necessarily all referring to the same embodiment; these embodiments can be implemented individually or in conjunction with one or more other embodiments. Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, etc. In other instances, well-known structures or operations are not shown or described in detail to avoid obscuring aspects of the invention.
High Throughput Video Encoder
Each PE tests a coding mode or one or more candidates of a coding mode in a PE call, or each PE tests a coding mode or one or more candidates of a coding mode in multiple PE calls. The PE call is a time interval. The required buffer size of PEs in each PE group may be further optimized according to the particular block size associated with the PE group. For each coding mode or each candidate of a coding mode, video data in a partition or sub-partition may be computed by a low-complexity Rate Distortion Optimization (RDO) operation followed by a high-complexity RDO operation. The low-complexity RDO operation and high-complexity RDO operation of a coding mode or a candidate of a coding mode may be computed by one PE or multiple PEs.
In various embodiments of the high throughput video encoder, since more than one parallel PE is employed in each PE group to shorten the original PE thread chain of the PE group, the encoder latency of the PE groups is reduced while maintaining superior rate-distortion performance. The high throughput video encoder of the present invention increases the encoder throughput to be capable of supporting Ultra High Definition (UHD) video encoding. The required buffer sizes of PEs in various embodiments of the hierarchical architecture can be optimized according to the particular block size of each PE group. Since each PE group is designed to process a particular block size, the required buffer size for each PE group is related to that block size. For example, a smaller buffer is used for PEs of a PE group processing smaller size blocks. In the embodiment as shown in
Method 1: Combine Coding Tools or Coding Modes with Similar Properties in a PE Thread Some embodiments of the present invention further reduce the necessary resources required while enhancing the encoding throughput by combining coding tools or coding modes with similar properties in the same PE thread. Table 5 shows the coding modes tested by six PEs in a PE group according to an embodiment of combining coding tools or coding modes with similar properties in the same PE thread. Call 0, Call 1, Call 2, and Call 3 represent four PE calls of a PE thread in a sequential order for processing a current partition or sub-partition within a CTB. Each PE thread is scheduled to test one or more dedicated coding tools, coding modes, or candidates in each PE call. In this embodiment, the first PE tests normal inter candidate modes to encode a current partition or sub-partition, where uni-prediction candidates are tested followed by bi-prediction candidates. The second PE encodes the current partition or sub-partition by intra angular candidate modes. The third PE encodes the current partition or sub-partition by Affine candidate modes, and the fourth PE encodes the current partition or sub-partition by MMVD candidate modes. The fifth PE applies GEO candidate modes and the sixth PE applies inter Merge candidate modes to encode the current partition or sub-partition. As shown in Table 5, coding tools or coding modes with similar properties are combined together in the same PE thread; for example, the evaluation of inter Merge modes could be put in PE thread 6 and the evaluation of Affine modes could be put in PE thread 3. If coding tools or coding modes with similar properties are not put in the same PE thread, each PE needs more hardware circuits to support a variety of coding tools.
For example, if some of the MMVD candidate modes are tested by PE 1 while other MMVD candidate modes are tested by PE 4, two sets of MMVD hardware circuits are required in the hardware implementation, one for PE 1 and another for PE 4. Only one set of MMVD hardware circuits is required, for PE 4, if all MMVD candidate modes are tested by PE 4 as shown in Table 5. According to the embodiment shown in Table 5, coding tools or coding modes with similar properties are arranged to be executed by the same PE thread; for example, Affine-related coding tools are all put in PE thread 3, MMVD-related coding tools are all put in PE thread 4, and GEO-related coding tools are all put in PE thread 5.
Method 2: Adaptive Coding Modes for PE Thread In some embodiments of the hierarchical architecture, coding modes associated with one or more PE threads in a PE group are adaptively selected according to one or more predefined conditions. In some embodiments, the predefined conditions are associated with comparisons of information between the current partition/sub-partition and one or more neighboring blocks of the current partition/sub-partition, the current temporal layer ID, a historical MV list, or preprocessing results. For example, the preprocessing results may correspond to the search result of the IME stage. In some embodiments, a predefined condition relates to the comparisons between coding modes, block sizes, block partition types, motion vectors, reconstruction samples, residuals, or coefficients of the current partition/sub-partition and one or more neighboring blocks. For example, a predefined condition is satisfied when the number of neighboring blocks coded in an intra mode is greater than or equal to a threshold TH1. In another example, a predefined condition is satisfied when the current temporal identifier is less than or equal to a threshold TH2. According to Method 2, one or more predefined conditions are checked to adaptively select coding modes for PEs in a PE group. Pre-specified coding modes are evaluated by the PEs when the one or more predefined conditions are satisfied; otherwise, default coding modes are evaluated by the PEs. In one embodiment of adaptively selecting coding modes for a current partition, a predefined condition is satisfied when any neighboring block of the current partition is coded in an intra mode; a PE table having more intra modes is tested on the current partition when the condition is satisfied, otherwise a PE table having fewer or no intra modes is tested on the current partition.
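As an illustrative sketch of Method 2, the following chooses between two hypothetical PE mode tables based on the number of intra-coded neighboring blocks; the table contents and the threshold value TH1 are assumptions, not values taken from this disclosure:

```python
# Hypothetical threshold and mode tables (illustrative only).
TH1 = 2  # minimum number of intra-coded neighbors

INTRA_HEAVY_TABLE = ['intra_angular', 'intra_planar', 'merge', 'affine']
DEFAULT_TABLE = ['merge', 'affine', 'mmvd', 'intra_planar']

def select_pe_table(neighbor_modes, th1=TH1):
    """Pick the mode table tested by a PE thread based on how many
    neighboring blocks are intra coded (the predefined condition)."""
    n_intra = sum(1 for m in neighbor_modes if m == 'intra')
    return INTRA_HEAVY_TABLE if n_intra >= th1 else DEFAULT_TABLE
```

When the neighborhood is dominated by intra-coded blocks, the PE spends its calls on intra candidates; otherwise it falls back to the default inter-oriented table.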
Method 3: Buffers Shared Among PEs of Same PE Group In some embodiments of the hierarchical architecture, certain buffers may be shared among PEs inside the same PE group by unifying a data scanning order among PE threads. For example, the sharing buffers are one or a combination of the source sample buffer, neighboring reconstruction samples buffer, neighboring motion vectors buffer, and neighboring side information buffer. By unifying the source samples loading method among PE threads with a particular scanning order, only one set of source sample buffer is required to be shared with all PEs in the same PE group. After finishing coding of each PE in a current PE group, each PE outputs final coding results to a reconstruction buffer, coefficient buffer, side information buffer, and updated neighboring buffer, and the video encoder compares the rate-distortion costs to decide the best coding result for the current PE group.
Hardware Sharing in Parallel PEs for GPM A current coding block coded in GPM is split into two parts by a geometrically located straight line, and each part of the geometric partition in the current coding block is inter-predicted using its own motion. The candidate list for GPM is derived directly from the Merge candidate list, for example, six GPM candidates are derived from Merge candidates 0 and 1, Merge candidates 1 and 2, Merge candidates 0 and 2, Merge candidates 3 and 4, Merge candidates 4 and 5, and Merge candidates 3 and 5 respectively. After obtaining corresponding Merge prediction samples for each part of the geometric partition according to two Merge candidates, the Merge prediction samples around the geometric partition edge are blended to derive GPM prediction samples. In the conventional hardware design for computing GPM prediction samples, addition buffer resources are required to store Merge prediction samples. With the parallel PE thread design, an embodiment of a GPM PE shares the Merge prediction samples from two or more Merge PEs directly without temporary storing the Merge prediction samples in a buffer. A benefit of this parallel PE design with hardware sharing is to save the bandwidth, this benefit is achieved because GPM PEs directly use the Merge prediction samples from Merge PEs to do GPM arithmetic calculation instead of fetching reference samples from the buffer. Some other benefits of directly passing predictors from Merge PEs to GPM PEs include reducing the circuits in GPM PEs and saving the Motion Compensation (MC) buffers for GPM PEs.
With the parallel PE design, an embodiment adaptively skips the tasks assigned to one or more remaining GPM candidates according to the rate-distortion cost of a current GPM candidate when two or more GPM candidates are tested. The PE call originally assigned for the remaining GPM candidates may be reassigned to do some other tasks or may be idle. The order of the Merge candidates is first sorted by the bits required by the Motion Vector Difference (MVD) from best to worse (i.e. from the least MVD bits to the most MVD bits). For examples, one or more GPM candidates combining the Merge candidates associating with fewer MVD bits are tested in the first PE call. If the rate-distortion cost computed in the first PE call is greater than a current best rate-distortion cost of another coding tool, then GPM tasks of the remaining GPM candidates are skipped. It is based on the assumption that the GPM candidate combining the Merge candidates associated with the least MVD bits is the best GPM candidate among all GPM candidates. If this best GPM candidate cannot generate a better predictor compared to the predictor generated by another coding tool, other GPM candidates are not worth to try. In the example as shown in
Hardware Sharing in Parallel PEs for CIP A current block coded in CIIP is predicted by combining inter prediction samples and intra prediction samples. The inter prediction samples are derived based on the inter prediction process using a Merge candidate and the intra prediction samples are derived based on the intra prediction process with the Planar mode. The intra and inter prediction samples are combined using weighted averaging, where the weight value is calculated depending on the coding modes of the top and left neighbouring blocks. With the parallel PE thread design according to an embodiment as shown in
With the parallel PE design, the tasks in one or more PE computing CIIP candidates can adaptively skip some CIIP candidates according to the rate-distortion performance of the prediction result generated by a previous CIIP candidate in the same PE thread. In one embodiment, if there are two or more CIIP candidates tested in a PE thread, by sorting the Merge candidates in order from the best (e.g. least MVD bits, lowest SATD, or lowest SAD) to the worse (e.g. most MVD bits, highest SATD, or highest SAD), original assigned tasks for the subsequent CIIP candidates are skipped when the rate-distortion cost associated with a current CIIP candidate is greater than the current best cost. For example, the first Merge candidate (Merge0) has a lower SAD than the second Merge candidate (Merge1), if the rate-distortion performance of the first CIIP candidate (CIIP0) is worse than the current best rate-distortion performance of another coding tool, then the second CIIP candidate (CIIP1) is skipped. It is because there is a high probability that the rate-distortion performance of the second CIIP candidate is worse than that of the first CIIP candidate if the Merge candidates is correctly sorted.
Hardware Sharing in Parallel PEs for AMVP-BI A current block coded in Bi-directional Advance Motion Vector Prediction (AMVP-BI) is predicted by combining uni-directional prediction samples from AMVP List 0 (L0) and List 1 (L1). With the parallel PE design according to an embodiment as shown in
Hardware Sharing in Parallel PEs for BCW A predictor of a current block coded in BCW is generated by weighted averaging of two uni-directional prediction signals obtained from two different reference lists L0 and L1. With the parallel PE design according to an embodiment as shown in
Neighboring Sharing in Parallel PEs With the parallel PE design, the buffer of neighboring reconstruction samples can be shared between different PEs according to an embodiment of the present invention. For example, only one set of neighbor buffer is needed as intra PEs and Matrix-based Intra Prediction (MIP) PEs can both acquire neighboring reconstruction samples from this shared buffer. As shown in
On-the-Fly Terminate Processing of Other PEs In some embodiments of the multiple PE design, the remaining processing of at least one other PE thread is early terminated according to accumulated rate-distortion costs of the parallel PEs. For example, if a current accumulated rate-distortion cost of a PE thread is much better than other PE threads (i.e. the current accumulated rate-distortion cost is much lower than each of the accumulated rate-distortion costs of other PE threads), the remaining processing of other PE threads is early terminated for power saving.
MTS Sharing for Parallel PE Architecture A Multiple Transform Selection (MTS) scheme processes residual with multiple selected transforms. For example, the different transforms include DCT-II, DCT-VIII, and DST-VII.
Low Complexity SATD on-the-fly Re-allocation With the parallel PE design, SATD units could be shared among parallel PEs.
Representative Flowchart for High Throughput Video Encoding
Exemplary Video Encoder Implementing Present Invention Embodiments of the present invention may be implemented in video encoders. For example, the disclosed methods may be implemented in one or a combination of an entropy encoding module, an Inter, Intra, or prediction module, and a transform module of a video encoder. Alternatively, any of the disclosed methods may be implemented as a circuit coupled to the entropy encoding module, the Inter, Intra, or prediction module, and the transform module of the video encoder, so as to provide the information needed by any of the modules.
Various components of the Video Encoder 1800 in
Embodiments of high throughput video encoding processing methods may be implemented in a circuit integrated into a video compression chip or program code integrated into video compression software to perform the processing described above. For examples, encoding coding blocks may be realized in program code to be executed on a computer processor, a Digital Signal Processor (DSP), a microprocessor, or field programmable gate array (FPGA). These processors can be configured to perform particular tasks according to the invention, by executing machine-readable software code or firmware code that defines the particular methods embodied by the invention.
The invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described examples are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
The present invention claims priority to U.S. Provisional Patent Application Ser. No. 63/251,066, filed on Oct. 1, 2021, entitled “PE-group structure, PE-parallel processing, and scalable mode removal”. The U.S. Provisional Patent application is hereby incorporated by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
63251066 | Oct 2021 | US |