The present invention relates to high-throughput video encoding or decoding methods. In particular, the present invention relates to high-throughput video encoding methods implemented in a rate distortion optimization stage of video encoding systems.
The Versatile Video Coding (VVC) standard is the latest video coding standard developed by the Joint Video Experts Team (JVET), a partnership of video coding experts from ITU-T VCEG and ISO/IEC MPEG. The VVC standard relies on a block-based coding structure which divides each picture into multiple Coding Tree Units (CTUs). A CTU consists of an N×N block of luminance (luma) samples together with one or more corresponding blocks of chrominance (chroma) samples. For example, with 4:2:0 chroma subsampling, each CTU consists of one 128×128 luma Coding Tree Block (CTB) and two 64×64 chroma CTBs. Each CTB in a CTU is further recursively divided into one or more Coding Blocks (CBs) in a Coding Unit (CU) for encoding or decoding to adapt to various local characteristics. Flexible CU structures such as the Quad-Tree-Binary-Tree (QTBT) structure improve the coding performance compared to the Quad-Tree (QT) structure employed in the High-Efficiency Video Coding (HEVC) standard.
The prediction decision in video encoding or decoding is made at the CU level, where each CU is coded by one or a combination of coding modes selected in a Rate Distortion Optimization (RDO) stage. After obtaining a residual signal generated by the prediction process, the residual signal belonging to a CU is further transformed into transform coefficients for compact data representation, and these transform coefficients are quantized and conveyed to the decoder. Several coding tools or coding modes introduced in the VVC standard are briefly described in the following.
Merge mode with MVD (MMVD) For a CU coded by the Merge mode, implicitly derived motion information is directly used for prediction sample generation. Merge mode with Motion Vector Difference (MMVD), introduced in the VVC standard, further refines a selected Merge candidate by signaling Motion Vector Difference (MVD) information. An MMVD flag is signaled right after the regular Merge flag to specify whether the MMVD mode is used for a CU. MMVD information signaled in the bitstream includes an MMVD candidate flag, an index to specify the motion magnitude, and an index for indication of the motion direction. In the MMVD mode, one of the first two candidates in the Merge list is selected to be used as the MV basis. An MMVD candidate flag is signaled to specify which one of the first two Merge candidates is used. A distance index specifies motion magnitude information and indicates a pre-defined offset from a starting point. The offset is added to either the horizontal or vertical component of the starting MV. The relation of the distance index and the pre-defined offset is specified in Table 1.
A direction index represents the direction of the MVD relative to the starting point, indicating one of four directions along the horizontal and vertical axes. It is noted that the meaning of the MVD sign may vary according to the information of the starting MVs. For example, when the starting MV is a uni-prediction MV, or the starting MVs are bi-prediction MVs with both lists pointing to the same direction of the current picture, the sign shown in Table 2 specifies the sign of the MV offset added to the starting MV. Both lists point to the same direction of the current picture when the Picture Order Counts (POCs) of the two reference pictures are both larger than the POC of the current picture, or both smaller than the POC of the current picture. When the starting MVs are bi-prediction MVs with the two MVs pointing to different directions of the current picture and the difference of the POCs in list 0 is greater than the one in list 1, the sign in Table 2 specifies the sign of the MV offset added to the list 0 MV component of the starting MV, and the sign for the list 1 MV is the opposite. Otherwise, when the difference of the POCs in list 1 is greater than the one in list 0, the sign in Table 2 specifies the sign of the MV offset added to the list 1 MV component of the starting MV, and the sign for the list 0 MV is the opposite. The MVD is scaled according to the difference of POCs in each direction. If the differences of POCs in both lists are the same, no scaling is needed; otherwise, if the difference of POCs in list 0 is larger than that of list 1, the MVD for list 1 is scaled, by defining the POC difference of list 0 as td and the POC difference of list 1 as tb. If the POC difference of list 1 is greater than that of list 0, the MVD for list 0 is scaled in the same way. If the starting MV is uni-predicted, the MVD is added to the available MV.
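For illustration, the distance-index and direction-index handling described above can be sketched as follows for the simple uni-prediction case. This is a non-normative sketch; the table values correspond to the pre-defined offsets 1/4 to 32 luma samples expressed in quarter-pel units, and the helper names are hypothetical.

```python
# Distance index -> pre-defined offset (quarter-pel units, i.e. 1/4..32 luma samples).
MMVD_OFFSETS = [1, 2, 4, 8, 16, 32, 64, 128]
# Direction index -> (sign_x, sign_y): one of four axis-aligned directions.
MMVD_DIRECTIONS = [(+1, 0), (-1, 0), (0, +1), (0, -1)]

def mmvd_refine(start_mv, distance_idx, direction_idx):
    """Add the signaled MVD offset to the starting MV (uni-prediction case;
    the bi-prediction sign-mirroring and POC scaling of the text are omitted)."""
    sign_x, sign_y = MMVD_DIRECTIONS[direction_idx]
    offset = MMVD_OFFSETS[distance_idx]
    return (start_mv[0] + sign_x * offset, start_mv[1] + sign_y * offset)
```

For example, distance index 2 with direction index 0 moves the starting MV by +4 quarter-pel horizontally.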
Bi-prediction with CU-level Weight (BCW) A bi-prediction signal is generated by averaging two prediction signals obtained from two different reference pictures and/or using two different motion vectors in the HEVC standard. In the VVC standard, the bi-prediction mode is extended beyond simple averaging to allow weighted averaging of the two prediction signals.
In the VVC standard, five weights w ∈ {-2, 3, 4, 5, 10} are allowed in the weighted averaging bi-prediction. In each bi-predicted CU, the weight w is determined in one of two ways: 1) for a non-Merge CU, the weight index is signaled after the motion vector difference; 2) for a Merge CU, the weight index is inferred from neighboring blocks based on the Merge candidate index. BCW is only applied to CUs with 256 or more luma samples, which implies the CU width times the CU height must be greater than or equal to 256. For low-delay pictures, all 5 weights are used. For non-low-delay pictures, only 3 weights w∈{3,4,5} are used.
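The weighted averaging described above follows the standard BCW combination, in which the weight w is expressed in units of 1/8. A minimal sketch (clipping to the sample bit depth is omitted):

```python
BCW_WEIGHTS = [-2, 3, 4, 5, 10]  # allowed values of w, in units of 1/8

def bcw_blend(p0, p1, w):
    """Weighted bi-prediction: P = ((8 - w) * P0 + w * P1 + 4) >> 3.
    w = 4 gives the equal-weight average used by regular bi-prediction."""
    return ((8 - w) * p0 + w * p1 + 4) >> 3
```

With w = 4 the result reduces to the simple average (P0 + P1 + 1) >> 1 of the HEVC standard.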
Fast search algorithms are applied at the video encoders to find the weight index without significantly increasing the encoder complexity. When BCW is combined with Adaptive Motion Vector Resolution (AMVR), unequal weights are only conditionally checked for 1-pel and 4-pel motion vector precisions if the current picture is a low-delay picture. When BCW is combined with the affine mode, affine Motion Estimation (ME) is performed for unequal weights only if the affine mode is selected as the current best mode. Unequal weights are only conditionally checked when the two reference pictures in bi-prediction are the same. Unequal weights are not searched when certain conditions are met, depending on the POC distance between the current picture and its reference pictures, the coding QP, and the temporal level.
The BCW weight index is coded using one context coded bin followed by bypass coded bins. The first context coded bin indicates if equal weight is used; if unequal weight is used, additional bins are signaled using bypass coding to indicate which unequal weight is used. Weighted Prediction (WP) is a coding tool supported by the H.264/AVC and HEVC standards to efficiently code video content with fading. Support for WP was also added into the VVC standard. WP allows weighting parameters (weight and offset) to be signaled for each reference picture in each of the reference picture lists L0 and L1. The weight(s) and offset(s) of the corresponding reference picture(s) are applied during motion compensation. WP and BCW are designed for different types of video content. In order to avoid interactions between WP and BCW, which would complicate the VVC decoder design, if a CU uses WP, then the BCW weight index is not signaled, and w is inferred to be 4, implying equal weight is applied. For a Merge CU, the weight index is inferred from neighboring blocks based on the Merge candidate index. This can be applied to both the normal Merge mode and the inherited affine Merge mode. For the constructed affine Merge mode, the affine motion information is constructed based on the motion information of up to 3 blocks. The BCW index for a CU using the constructed affine Merge mode is simply set equal to the BCW index of the first control point MV. In the VVC standard, Combined Inter and Intra Prediction (CIIP) and BCW cannot be jointly applied for a CU. When a CU is coded with the CIIP mode, the BCW index of the current CU is set to 4, implying equal weight is applied.
Geometric Partitioning Mode (GPM) In the VVC standard, GPM is supported for inter prediction. The use of GPM is signaled using a CU-level flag as one kind of Merge mode, with other Merge modes including the regular Merge mode, MMVD mode, CIIP mode, and subblock Merge mode. In total, 64 partitions are supported by GPM for each possible CU size w × h = 2^m × 2^n with m, n ∈ {3, …, 6}, excluding 8×64 and 64×8. When this mode is used, a CU is split into two parts by a geometrically located straight line as shown in
If geometric partitioning mode is used for the current CU, then a geometric partition index indicating the partition mode of the geometric partition (angle and offset), and two Merge indices (one for each partition), are further signaled. The maximum GPM candidate list size is signaled explicitly in the Sequence Parameter Set (SPS) and specifies the syntax binarization for GPM merge indices. The sample values are adjusted using a blending process with adaptive weights to acquire the prediction signal for the whole CU. The transform and quantization processes are applied to the whole CU as in other prediction modes. Finally, the motion field of a CU predicted using the geometric partition mode is stored.
The uni-prediction candidate list is derived directly from the Merge candidate list constructed according to the extended Merge prediction process. Denote n as the index of the uni-prediction motion in the geometric uni-prediction candidate list. The LX motion vector of the n-th extended Merge candidate, with X equal to the parity of n, is used as the n-th uni-prediction motion vector for geometric partitioning mode. For example, the uni-prediction motion vector for Merge index 0 is L0 MV, the uni-prediction motion vector for Merge index 1 is L1 MV, the uni-prediction motion vector for Merge index 2 is L0 MV, and the uni-prediction motion vector for Merge index 3 is L1 MV. In case a corresponding LX motion vector of the n-th extended Merge candidate does not exist, the L(1 - X) motion vector of the same candidate is used instead of the uni-prediction motion vector for geometric partitioning mode.
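The parity rule described above can be sketched as follows. This is a non-normative sketch in which a Merge candidate is represented as a dictionary with optional 'L0'/'L1' motion vectors; the helper name is hypothetical.

```python
def gpm_uni_mv(merge_cand, n):
    """Select the uni-prediction MV for the n-th GPM candidate: prefer the LX
    motion of the n-th extended Merge candidate with X = parity of n, and fall
    back to L(1 - X) when the preferred list motion does not exist."""
    x = n & 1  # parity of the candidate index
    preferred, fallback = ('L1', 'L0') if x else ('L0', 'L1')
    return merge_cand.get(preferred) or merge_cand.get(fallback)
```

For example, Merge index 0 yields the L0 MV, index 1 the L1 MV, and an even index with no L0 motion falls back to the L1 MV of the same candidate.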
After predicting each part of a geometric partition using its own motion information, blending is applied to the two prediction signals to derive the samples of the current CU. The blending weight for each position of the CU is derived based on the position of each sample and information about the partition mode of the geometric partition (for example, angle and offset) of the current CU.
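The position-dependent blending can be sketched as follows. This is an illustrative, non-normative model: the actual VVC derivation uses integer displacement lookup tables rather than trigonometry, and the helper names are hypothetical. The weight ramps from 0 to 8 with the sample's distance to the partition line.

```python
import math

def gpm_blend_weight(x, y, angle_deg, offset):
    """Illustrative weight: signed distance of sample (x, y) to the partition
    line (given by angle and offset) mapped onto a 0..8 ramp."""
    d = x * math.cos(math.radians(angle_deg)) + y * math.sin(math.radians(angle_deg)) - offset
    return max(0, min(8, round(d + 4)))

def gpm_blend(p0, p1, w):
    """Blend the two uni-prediction samples with weight w in [0, 8]:
    P = ((8 - w) * P0 + w * P1 + 4) >> 3."""
    return ((8 - w) * p0 + w * p1 + 4) >> 3
```

Samples far on one side of the line take w = 0 or w = 8 (pure uni-prediction), while samples near the edge receive intermediate blended values.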
A CU coded by GPM can include three parts, where a first part is inter-predicted based on a first set of predictors, a second part is inter-predicted based on a second set of predictors, and a third part between the first and second parts is inter-predicted based on a third set of predictors. The third set of predictors are derived by blending based on the first set of predictors and the second set of predictors. Mv1 from the first part of the geometric partition, Mv2 from the second part of the geometric partition and a combined motion vector of Mv1 and Mv2 are stored in the motion field of a geometric partitioning mode coded CU. The stored motion vector type for each individual position in the motion field is determined as:
where motionIdx is equal to d(4x + 2, 4y + 2), which is recalculated from the above equation. The partIdx depends on the angle index i. If sType is equal to 0 or 1, Mv1 or Mv2 is stored in the corresponding motion field; otherwise, if sType is equal to 2, a combined motion vector from Mv1 and Mv2 is stored. The combined motion vector is generated using the following process: if Mv1 and Mv2 are from different reference picture lists (one from L0 and the other from L1), then Mv1 and Mv2 are simply combined to form a bi-prediction motion vector; otherwise, if Mv1 and Mv2 are from the same list, only the uni-prediction motion Mv2 is stored.
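The stored-motion decision described above can be sketched as follows, using the Mv1/Mv2 naming of the geometric partition parts. This is a non-normative sketch; the helper names are hypothetical, and a motion field entry is represented as a dictionary mapping a reference list to an MV.

```python
def gpm_stored_mv(s_type, mv1, mv2, list1, list2):
    """Motion stored per position: sType 0 -> Mv1, sType 1 -> Mv2,
    sType 2 -> bi-prediction combination when Mv1 and Mv2 come from
    different reference lists, otherwise only the uni-prediction Mv2."""
    if s_type == 0:
        return {list1: mv1}
    if s_type == 1:
        return {list2: mv2}
    if list1 != list2:                   # one MV from L0, the other from L1
        return {list1: mv1, list2: mv2}  # combine into bi-prediction motion
    return {list2: mv2}                  # same list: keep only Mv2
```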
Combined Inter and Intra Prediction (CIIP) In the VVC standard, when a CU is coded in Merge mode, if the CU contains at least 64 luma samples (that is, the CU width times the CU height is equal to or larger than 64), and if both the CU width and the CU height are less than 128 luma samples, an additional flag is signaled to indicate whether the Combined Inter and Intra Prediction (CIIP) mode is applied to the current CU. As the name suggests, the CIIP mode combines an inter prediction signal with an intra prediction signal. The inter prediction signal in CIIP mode Pinter is derived using the same inter prediction process applied to the regular Merge mode, and the intra prediction signal Pintra is derived following the regular intra prediction process with the Planar mode. Then, the intra and inter prediction signals are combined using weighted averaging, where the weight value is calculated depending on the coding modes of the top and left neighboring blocks as follows. A variable isIntraTop is set to 1 if the top neighboring block is available and intra coded, otherwise isIntraTop is set to 0; a variable isIntraLeft is set to 1 if the left neighboring block is available and intra coded, otherwise isIntraLeft is set to 0. The weight value wt is set to 3 if the sum of the two variables isIntraTop and isIntraLeft is equal to 2; otherwise, the weight value wt is set to 2 if the sum of the two variables is equal to 1; otherwise, the weight value wt is set to 1. The CIIP prediction is calculated as follows:
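The equation referenced above is not reproduced in the text; a non-normative sketch consistent with the weighted-averaging description, assuming the standard VVC combination P_CIIP = ((4 − wt)·P_inter + wt·P_intra + 2) >> 2, is:

```python
def ciip_weight(is_intra_top, is_intra_left):
    """wt = 3, 2, or 1 depending on how many of the top/left neighbors
    are available and intra coded (isIntraTop + isIntraLeft)."""
    s = int(is_intra_top) + int(is_intra_left)
    return 3 if s == 2 else (2 if s == 1 else 1)

def ciip_predict(p_inter, p_intra, wt):
    """P_CIIP = ((4 - wt) * P_inter + wt * P_intra + 2) >> 2."""
    return ((4 - wt) * p_inter + wt * p_intra + 2) >> 2
```

A larger wt shifts the combination toward the intra signal when more neighbors are intra coded.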
Embodiments of video coding methods for a video encoding system or video decoding system comprise receiving input data associated with a current block, comparing a size of the current block with a threshold size, determining a coding mode for the current block by disabling GPM when the size of the current block is greater than or equal to the threshold size, and encoding or decoding the current block by the determined coding mode. The current block includes a first part, a second part, and a third part when the coding mode is GPM, the first part of the current block is inter-predicted based on a first set of predictors while the second part of the current block is inter-predicted based on a second set of predictors, and the third part is inter-predicted based on a third set of predictors. The third set of predictors are derived by blending based on the first set of predictors and the second set of predictors. The current block in these embodiments is a Coding Block (CB) or a Coding Unit splitting from a Coding Tree Block (CTB) or Coding Tree Unit (CTU).
In some embodiments of the video encoding or decoding method, the threshold size is 2048 samples, and GPM is disabled for the current block when the size of the current block is 64×64, 64×32, or 32×64 samples. In some embodiments, GPM is enabled for large size blocks when a number of candidates in a Merge candidate list is small. For example, the video encoding or decoding system determines a number of candidates in a Merge candidate list of the current block, compares the number of candidates with a threshold number, and disables GPM for the current block when the number of candidates is larger than the threshold number. In this case, GPM is enabled for the current block when the size of the current block is smaller than the threshold size, or when the size of the current block is larger than or equal to the threshold size and the number of candidates in the Merge candidate list is less than or equal to the threshold number. An example of the threshold number is 3.
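The gating logic described above can be sketched as follows. This is a non-normative sketch; the threshold values are the example values from the text, and the function name is hypothetical.

```python
def gpm_enabled(width, height, num_merge_cands,
                threshold_size=2048, threshold_cands=3):
    """GPM is allowed unconditionally for blocks below the size threshold;
    for blocks at or above it, GPM is allowed only when the Merge candidate
    list is short (example thresholds: 2048 samples, 3 candidates)."""
    if width * height < threshold_size:
        return True
    return num_merge_cands <= threshold_cands
```

Under these example thresholds, a 64×64 block with a full candidate list is GPM-disabled, while a 64×32 block with 3 candidates remains GPM-enabled.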
Embodiments of video encoding methods determine a block partitioning structure and coding modes using parallel Processing Elements (PEs). The video encoding methods comprise receiving input data associated with a current block, processing the input data by the parallel PEs to determine the block partitioning structure of the current block and a corresponding coding mode for each coding block in the current block, and encoding each coding block in the current block according to the corresponding coding mode. Each PE performs tasks for a Rate Distortion Optimization (RDO) operation in each PE run. The PEs access a Search Range Memory (SRM) to fetch search range reference samples for the PEs. Two or more PEs receive search range reference samples in a broadcasting form. The PEs test a number of coding modes on possible partitions and sub-partitions of the current block, and based on rate-distortion costs associated with the coding modes tested by the PEs, a block partitioning structure for splitting the current block into one or more coding blocks and a corresponding coding mode for each coding block are decided. The current block in these embodiments is a CTB or CTU, the coding blocks in the CTB are CBs, and the coding blocks in the CTU are CUs.
In some embodiments of the present invention, the SRM is a 3-layer SRM structure including a layer 3 SRM, multiple layer 2 SRMs, and at least one broadcast SRM. The search range reference samples are output from the layer 3 SRM to the layer 2 SRM by time interleaving reading for distributing the search range reference samples to corresponding PEs. At least one layer 2 SRM outputs the search range reference samples to one broadcast SRM, and each broadcast SRM broadcasts the search range reference samples to two or more PEs at the same time. In one embodiment of the 3-layer SRM structure, a layer 3 cache port is shared by two or more layer 2 SRMs. A scanning order of each broadcast SRM is the same as a scanning order of the corresponding PEs in some preferred embodiments, so the broadcast SRM is a plug-in design.
The search range reference samples for a regular Merge candidate are broadcasted to PEs testing the regular Merge candidate, a GPM candidate, or a CIIP candidate according to embodiments of the present invention. Similarly, the search range reference samples for an Advanced Motion Vector Prediction (AMVP) candidate may be broadcasted to PEs testing the AMVP candidate or a Symmetric Motion Vector Difference (SMVD) candidate, or the search range reference samples for an Adaptive Motion Vector Resolution (AMVR) candidate may be broadcasted to PEs testing the AMVR candidate, a SMVD candidate, or a Bi-prediction with CU-level Weight (BCW) candidate. A scan order for the two or more PEs receiving the broadcasting search range reference samples is the same according to some embodiments of the present invention, thus the search range reference samples read out from the SRM are directly used by these PEs without buffering.
In some embodiments of the video encoding methods, the bandwidth between the SRM and the PEs may be further reduced by preloading search range reference samples of pre-loadable candidates. For example, the search range reference samples of pre-loadable candidates needed in a subsequent run are preloaded in a current run. Some examples of the pre-loadable candidates are AMVP candidates, AMVR candidates, and affine inter based candidates.
The coding modes tested by some of the PEs are reordered according to an embodiment so that high-bandwidth modes are processed in parallel with low-bandwidth modes. For example, a Merge mode with Motion Vector Difference (MMVD) candidate tested by one PE is reordered to be executed in parallel with an intra mode tested by another PE.
In one embodiment, at least one PE processing small coding blocks loads the search range reference samples of candidates from the SRM at the same time when the search range reference samples are in a same window or when a rotated-index is in a same window. In another embodiment, a bilinear filter is used in a Low Complexity (LC) operation for testing one or more MMVD candidates in order to reduce the reference region of the search range reference samples needed.
Aspects of the disclosure further provide an apparatus for a video encoding or decoding system. The apparatus comprises one or more electronic circuits configured for receiving input data associated with a current block, checking if a size of the current coding block is greater than or equal to a threshold size, determining a coding mode for the current coding block by disabling GPM if the size of the current coding block is greater than or equal to the threshold size, and encoding or decoding the current block by the determined coding mode.
Various embodiments of this disclosure that are proposed as examples will be described in detail with reference to the following figures, wherein like numerals reference like elements, and wherein:
It will be readily understood that the components of the present invention, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following more detailed description of the embodiments of the systems and methods of the present invention, as represented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention.
Reference throughout this specification to “an embodiment”, “some embodiments”, or similar language means that a particular feature, structure, or characteristic described in connection with the embodiments may be included in at least one embodiment of the present invention. Thus, appearances of the phrases “in an embodiment” or “in some embodiments” in various places throughout this specification are not necessarily all referring to the same embodiment, these embodiments can be implemented individually or in conjunction with one or more other embodiments. Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, etc. In other instances, well-known structures, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
High Throughput Video Encoder A high throughput video encoder 300 for encoding video pictures into a video bitstream is illustrated in
Novel SRM Design for High-throughput Video Encoder The terms PE run or port run are used to count the number of time intervals required by a PE to test one or more coding modes. For example, a PE encodes a predetermined partition by an intra mode in one PE run. A bottleneck arising in Search Range Memory (SRM) access for high-throughput encoders is the high bandwidth required by the parallel PEs. The more parallel PEs employed in the video encoder, the better the throughput; however, this implies an enormous number of parallel PEs accessing the SRM simultaneously. Two possible solutions for reducing the high bandwidth requirement between the SRM and the PEs are installing N copies of the SRM and time-interleaving reading. The major drawback of having N copies of the SRM is that the encoder cost increases greatly. Time-interleaving reading for a large number of PEs results in enormous reading-path buffers and long idle times. For example, when there are 64 PEs accessing the SRM through 64 ports at the same time, 720 SRM banks per resolution are required, resulting in a total of 2880 separate SRAMs for testing the four resolutions in Adaptive Motion Vector Resolution (AMVR). For PEs processing fewer calculations, there will be a long idle time due to the time-interleaving reading delay. Moreover, to preserve around 60 cycles of data for time-interleaving reading, the cost of reading buffers is extremely high. For example, the size of the reading buffer needed for one PE is 60 cycles * 12 * 4 pixels * 10 bits/pixel = 28800 bits, which is very costly.
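The buffer and bank counts quoted above can be verified with a few lines of arithmetic; the figures below are taken directly from the example in the text.

```python
# Reading-buffer cost per PE for time-interleaving reading:
# 60 cycles of data, 12 * 4 pixels per cycle, 10 bits per pixel.
cycles, pixels_per_cycle, bits_per_pixel = 60, 12 * 4, 10
buffer_bits_per_pe = cycles * pixels_per_cycle * bits_per_pixel  # 28800 bits

# SRAM bank count for 64 PEs / 64 ports over the four AMVR resolutions:
# 720 banks per resolution, 4 resolutions.
banks_per_resolution, num_resolutions = 720, 4
total_srams = banks_per_resolution * num_resolutions  # 2880 separate SRAMs
```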
For blocks to be encoded in the affine Inter mode, the number of port runs can be reduced from 8 to 2 by forcing all pre-calls of affine Inter candidates to use four shared reference regions. For PEs processing blocks to be encoded by the MMVD mode, the number of port runs can be reduced from 8 to 2 because 2 MMVD candidates reuse the reference regions of 2 Merge mode candidates with an enlarged range. The MMVD coding modes share some of the PEs that originally perform tasks for the Merge mode. The number of port runs is greatly reduced by implementing the SRM architecture with local caches and storing commonly used candidates in a candidate-wise manner.
In some embodiments, the broadcasting based SRM architecture of the present invention is employed together with hardware sharing in parallel PEs for certain coding tools. In one embodiment, the candidate list for GPM is derived directly from the Merge candidate list, for example, six GPM candidates are derived from Merge candidates 0 and 1, Merge candidates 1 and 2, Merge candidates 0 and 2, Merge candidates 3 and 4, Merge candidates 4 and 5, and Merge candidates 3 and 5, respectively. After obtaining corresponding Merge prediction samples for each part of the geometric partition according to two Merge candidates, the Merge prediction samples around the geometric partition edge are blended to derive GPM prediction samples. With the hardware sharing in parallel PE design, an embodiment of a GPM PE shares the Merge prediction samples from one or more Merge PEs directly without temporarily storing the Merge prediction samples in a buffer. A benefit of this parallel PE design with hardware sharing is bandwidth saving, achieved because GPM PEs directly use the Merge prediction samples from Merge PEs for GPM arithmetic calculations instead of fetching reference samples from the buffer. By combining the broadcasting based SRM architecture and the hardware sharing in parallel PEs, the search range reference samples of Merge candidates for GPM candidates are retrieved from the broadcasting SRM, and the predictors of Merge candidates for GPM candidates are shared directly from the Merge PEs.
In some preferred embodiments, the scan order for all PEs sharing search range reference samples is the same; in that case, the search range reference samples read out from the SRM are directly used by the PEs without buffering. The access patterns of all PEs related to the same broadcast candidates are identical, so the broadcasting based SRM architecture directly broadcasts the SRAM readout data by hard-wiring to all PEs without any arbitration. The temporary buffers between PEs and cache may be minimized. In an embodiment of the present invention, the broadcasting SRMs for the PE groups can be a plug-in design when the same scanning order is employed in the broadcast caches and the PE engines. In other words, the broadcasting based SRM architecture is a transparent accessing design for the PE engines to read search range reference samples from the SRM or write reference samples into the SRM when the scanning order of the broadcast caches is equal to the PE access order. All or some of the PEs can fetch reference samples in a broadcasting form as soon as these reference samples are available in the level 1 SRMs (i.e. the broadcast SRAMs as shown in
Further Improvement on SRM Design by Preloading Some embodiments of the present invention further improve the broadcasting based SRM design by more evenly distributing the loading time in each PE run.
In the conventional design, N PEs access the SRM in a time-interleaving manner, where each PE needs to wait for N cycles to access the SRAM; thus a large internal buffer (regFile) is needed for each PE to buffer N cycles of data. In comparison to the conventional design, the benefits of employing broadcasting based caches for all or some of the PEs include eliminating or reducing the waiting time of PEs for accessing the SRM, eliminating or reducing the idle time of PEs processing short tasks, and eliminating the need for a large internal buffer for all or some of the PEs. By further using the partial-preloading technique, the worst-case port number can be reduced, so as to minimize the reference cache (refCache) worst-case bandwidth.
Representative Flowchart for SRM Accessing in High-throughput Video Encoder
Adaptively Disable GPM In some embodiments of the present invention, the encoder and decoder adaptively disable the GPM coding tool according to a block size. The encoder or decoder of some embodiments turns off the GPM coding tool for any partition or sub-partition having a size greater than or equal to a threshold size. The partition or sub-partition is a block partitioned from a CTB or a CTU to be tested by various coding modes in the RDO stage, and is referred to as a Coding Block (CB) or a Coding Unit (CU) when the partition or sub-partition is selected in the RDO stage. For example, the threshold size is 2048 samples, so in the RDO stage, the PE group processing the 64×64 partition, 64×32 sub-partitions, and 32×64 sub-partitions skips evaluating the GPM coding tool on the 64×64 partitions, 64×32 sub-partitions, and 32×64 sub-partitions. In some embodiments, the encoder or decoder disables GPM for any block having a size greater than N×N samples, for example, N is 32 or 16. In some other embodiments, the GPM coding tool is adaptively disabled according to a Merge candidate list. Specifically, the encoder turns off the GPM coding tool for any block with a number of Merge candidates in the Merge candidate list larger than a threshold number according to some embodiments of the present invention. For example, the GPM coding tool is only enabled for blocks having a Merge candidate list with only 2 or 3 candidates, while the GPM coding tool is disabled for blocks having a Merge candidate list with 4 or more candidates. The corresponding video decoder also disallows decoding a block having more Merge candidates than the threshold number using a GPM mode. In one embodiment, the encoder or decoder adaptively disables GPM according to both a block size and a number of candidates in a Merge candidate list.
For example, the encoder or decoder enables GPM for large blocks only if there are only a few candidates in the Merge candidate list; otherwise, the encoder or decoder disables GPM for large blocks having many candidates. To encode or decode a current block by GPM, the current block includes a first part, a second part, and a third part. The first part of the current block is inter-predicted by a first set of predictors, and the second part of the current block is inter-predicted by a second set of predictors. The first or second set of predictors is derived using its own motion information, such as the motion vector and reference index. The third part of the current block is inter-predicted based on a third set of predictors, where the third set of predictors is derived by blending based on the first set of predictors and the second set of predictors.
Reordered PE Modes for Minimizing SRAM Bandwidth In some embodiments of the present invention, the coding tools or coding modes are reordered to minimize the SRAM bandwidth; properly reordering the processing modes further reduces the bandwidth required by the parallel PEs.
In some other embodiments of PE mode reordering, high-bandwidth modes are reordered to be processed together with low-bandwidth modes in order to balance the bandwidth required for accessing search range reference samples from the SRM. PEs used to compute low-bandwidth modes such as intra modes do not need to access motion compensation reference samples.
Spatial Shattered SRAM Access for High-Depth BT/TT Splitting PE groups processing high-depth Binary-Tree (BT) or Ternary-Tree (TT) splitting nodes are the bottleneck of search range memory access, as the numbers of parallel PEs in these PE groups are larger than in other PE groups. For example, the PE group TPE-B in
Reduce MMVD Bandwidth for LC The usage range of MMVD for various MMVD distance indices is shown in Table 2. In some embodiments of the present invention, the MMVD Low Complexity (LC) bandwidth can be greatly reduced by applying a bilinear filter in the LC operation.
Exemplary Video Encoder and Video Decoder Implementing Present Invention Embodiments of the present invention may be implemented in video encoders. For example, one or a combination of the disclosed methods may be implemented in an entropy encoding module, an Inter, Intra, or prediction module, and/or a transform module in a video encoder. Alternatively, any of the disclosed methods may be implemented as a circuit coupled to the entropy encoding module, the Inter, Intra, or prediction module, and the transform module of the video encoder, so as to provide the information needed by any of the modules.
A corresponding Video Decoder 1300 for decoding the video bitstream generated by the Video Encoder 1200 of
Various components of the Video Encoder 1200 and Video Decoder 1300 in
Embodiments of the video coding methods may be implemented in a circuit integrated into a video compression chip or program code integrated into video compression software to perform the processing described above. For examples, encoding or decoding coding blocks may be realized in program code to be executed on a computer processor, a Digital Signal Processor (DSP), a microprocessor, or Field Programmable Gate Array (FPGA). These processors can be configured to perform particular tasks according to the invention, by executing machine-readable software code or firmware code that defines the particular methods embodied by the invention.
The invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described examples are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
The present invention claims priority to U.S. Provisional Pat. Application Serial No. 63/280,178, filed on Nov. 17, 2021, entitled “New Memory Bandwidth Reduction Method/Architecture in Hardware Encoder”. The U.S. Provisional Pat. Application is hereby incorporated by reference in its entirety.