The present disclosure relates to encoding and decoding of image and video data.
A conventional video codec can compress image and video data for transmission and storage. Some examples of standardized coding specifications include H.264 (AVC), H.265 (HEVC), H.266 (VVC), and AV1. New video encoding and decoding software, called the AOM Video Model (AVM), is currently under development by AOMedia with the intent that the resulting specification will become the successor to the AV1 specification. Conventional video codecs are block-based: they first partition a video frame or field picture into smaller image regions called coding blocks. This partitioning is a multi-stage process in which a frame is first split into smaller coding-tree units (CTUs) or super-blocks (SBs). A CTU or SB can further be divided into smaller coding blocks (CBs). In
After the partitioning stage, a video encoder can predict pixel samples of a current block from neighboring blocks by using intra prediction. Alternatively or additionally, a codec may also use pixel information and blocks from previously coded frames/pictures by using inter prediction techniques. Some of the commonly used inter prediction techniques include weighted or non-weighted single or multi-hypothesis motion compensated prediction, temporally interpolated prediction, or hybrid modes that can use both inter and intra prediction. Prediction may involve simple motion models, e.g., translation only, or more complex motion models such as an affine model. The prediction stage aims to reduce the spatial and/or temporally redundant information in coding blocks from neighboring samples or frames/pictures. The resulting block, after subtracting the predicted values (e.g. with intra or inter prediction) from the block of interest, is usually called the residual block. The encoder may further apply a transformation on the residual block using variants of the discrete cosine transform (DCT), discrete sine transform (DST), or other possible transforms, including for example wavelet transforms. The block on which a transform is applied is usually referred to as a transform unit (TU).
The transform stage provides energy compaction in the residual block by mapping the residual values from the pixel domain to some alternative vector or Euclidean space. This stage helps reduce the number of bits required to transmit the energy-compacted coefficients. It is also possible for an image or video codec to skip the transform stage. Usually, this is done if performing a transform on the residual block is found not to be beneficial, for example in cases when the residual samples after prediction are found to be already compact enough. In such a case a DCT-like transform might not provide additional compression benefits.
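The energy compaction described above can be illustrated with a minimal sketch. It assumes an orthonormal 1D DCT-II and a hypothetical smooth residual row; the function name `dct_1d` and the sample values are illustrative and are not taken from any codec.

```python
import math

def dct_1d(samples):
    """Orthonormal DCT-II of a 1D list of residual samples."""
    n = len(samples)
    out = []
    for k in range(n):
        # Orthonormal scaling keeps the total energy unchanged.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i, x in enumerate(samples))
        out.append(scale * acc)
    return out

# Hypothetical smooth residual row (assumption, for illustration only).
residual = [10, 11, 12, 13, 13, 12, 11, 10]
coeffs = dct_1d(residual)
energy = sum(c * c for c in coeffs)
# Share of the total energy carried by the first (DC) coefficient.
dc_share = coeffs[0] ** 2 / energy
```

For a residual this smooth, nearly all of the energy ends up in the first coefficient, which is why the remaining coefficients are cheap to signal. A residual that is already sparse would show no such gain, which is the case in which the transform may be skipped.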
After the transform stage, the resultant coefficients are passed through a quantizer, which reduces the number of bits required to represent the transform coefficients while at the same time introducing some distortion into the signal. Optionally, optimization techniques such as trellis-based quantization, adaptive rounding, or dropout optimization/coefficient thresholding can be employed to tune the quantized coefficients based on rate-distortion and/or other (e.g., complexity) criteria. In dropout optimization, coefficients that are less significant, or that are deemed too costly to encode relative to their potentially minimal benefit to the subjective quality of a partition, are discarded. The quantization stage can cause significant loss of information, especially under low bitrate constraints. In such cases, quantization may lead to visible distortion or loss of information in images/video. The tradeoff between the rate (number of bits sent over a time period) and distortion is often controlled with a quantization parameter (QP). In the entropy coding stage, the quantized transform coefficients, which usually make up the bulk of the final output bitstream, are signaled to the decoder using lossless entropy coding methods such as the multi-symbol arithmetic coding (MS-AC) in AV1/AVM, the context-adaptive binary arithmetic coding (CABAC) in HEVC and VVC, or other entropy coding methods.
In addition to the quantized coefficients, certain encoder decisions are signaled to the decoder as side information. Some of this information may include partitioning types, intra and inter prediction modes (e.g. weighted intra prediction, multi-reference line modes, etc.), the transform type applied to transform blocks, and/or other flags/indices pertaining to tools such as a secondary transform. This side information usually accounts for a smaller portion of the final bitstream as compared to quantized transform coefficients. The decoder uses all of the information above to perform an inverse transformation on the de-quantized coefficients and reconstruct the pixel samples. Additional tools, including restoration, de-blocking, and loop-filters, may also be applied on the reconstructed pixel samples to enhance the quality of reconstructed images.
In the AVM reference software, several transform candidates are possible. These options consist of a combination of 1) the discrete cosine transform (DCT), 2) the asymmetric discrete sine transform (ADST), 3) the flipped ADST, and 4) the Identity transform (IDTX). These transforms can be applied either in 1 dimension (1D), e.g., horizontally or vertically, or alternatively can be applied both horizontally and vertically with 2D transforms as summarized in Table 1 below. Except for the IDTX transform, all transform types in Table 1 apply a transform kernel along either the vertical or horizontal direction. In the AVM, a secondary transform called the “intra secondary transform” (IST) is currently under consideration. This secondary transform is applied as a non-separable transform kernel on top of the primary transform coefficients based on a mode decision.
Regardless of the transform type selected by an encoder the resulting coefficients from the transform stage or prediction residuals (if IDTX is used) need to be signaled to the decoder. In the AVM, coefficient coding can be summarized in 4 main parts: 1) scan order selection, 2) coding of the last coefficient position, 3) level and sign derivation, and 4) context-based coefficient coding.
AV1 currently implements 5 default scans: an up-right diagonal scan, a bottom-left diagonal scan, a Zig-zag scan, a column scan, and a row scan. These scans determine the order in which the coefficients are signaled to the decoder. Examples of the zig-zag, row, and column scans are illustrated in
Before coding the coefficients, AV1 and AVM first determine the last position of the most significant coefficient in a transform block, or the coefficient location end-of-block (EOB). If the EOB value is 0, the transform unit does not have any significant coefficients and nothing needs to be coded for the current transform unit. In this case, only a skip flag (all zero syntax element) is signaled that indicates whether the EOB is 0 or 1.
If the EOB value is non-zero, then a transform type is coded only for luma blocks. Additionally an intra secondary transform (IST) flag may be signaled based on the luma transform type. The latter two syntax elements let the decoder know all the necessary details to compute an inverse transform.
Then the last coefficient position is explicitly coded. This last position determines which coefficient indices to skip during the scan order coding. To provide an example, if EOB=4 for a left-most transform block in
If a coefficient needs to be coded, a transform coefficient is first converted into a ‘level’ value by taking its absolute value. For square blocks with 2D transforms a reverse Zig-zag scan is used to encode the level information. This scan starts from the bottom right side of the transform unit in a coding loop (e.g., starting from the EOB index until the scan index hits 0) as in the second column of
After the level value is coded in reverse scan order, the sign information is coded separately using a forward scan pass over the significant coefficients. The sign flag is bypass coded with 1 bit per coefficient without using probability models. The motivation of bypass coding here is to simplify entropy coding since DCT coefficients usually have random signs.
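The EOB, level, and sign steps described above can be sketched as follows. This is a simplified illustration, not the AV1 or AVM implementation: it assumes the quantized coefficients are already listed in scan order and that the EOB is the count of coefficients up to and including the last non-zero one. The function name `split_coefficients` is hypothetical.

```python
def split_coefficients(coeffs_in_scan_order):
    """Return (eob, levels, sign_bits) for coefficients listed in scan order.

    eob is assumed to be the number of coefficients up to and including the
    last non-zero one (0 when the block is all zero).
    """
    eob = 0
    for idx, c in enumerate(coeffs_in_scan_order):
        if c != 0:
            eob = idx + 1
    # Levels are visited in reverse scan order, from the EOB position to index 0.
    levels = [abs(coeffs_in_scan_order[i]) for i in range(eob - 1, -1, -1)]
    # Signs are visited in a forward pass, one bypass bit per non-zero coefficient.
    sign_bits = [1 if c < 0 else 0 for c in coeffs_in_scan_order[:eob] if c != 0]
    return eob, levels, sign_bits

# Hypothetical coefficients in scan order (assumption, for illustration only).
eob, levels, sign_bits = split_coefficients([7, -3, 0, 1, 0, 0])
```

In this example the two trailing zeros fall after the EOB and are never visited, the levels are produced from the last significant position backwards, and one sign bit is produced per non-zero coefficient.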
In AV1, the level information is encoded with a proper selection of contexts or probability models using multi-symbol arithmetic encoding. These contexts are selected based on various parameters such as transform size, plane (luma or chroma) information, and the sum of previously coded level values in a spatial neighborhood.
The AVM further implements a forward skip coding (FSC) mode that improves coding of prediction residuals when the IDTX (identity) transform is used. This technique moves the signaling of the IDTX transform flag from the TU level to the CB level and uses an alternative residual coding method [3, 4]. For FSC coded blocks, an explicit trigonometric transform is skipped both for columns and rows of a transform unit. As shown in
In this disclosure, a flexible coefficient coding (FCC) approach is presented. FCC has three main aspects. In the first aspect, an encoder may dynamically define spatial sub-regions over a transform unit (TU) or a prediction unit (PU). These sub-regions may organize the coefficient samples residing inside a TU or a PU into variable coefficient groups (VCGs). Each VCG may correspond to a sub-region inside a larger TU or PU. The shape of VCGs or the boundaries between different VCGs may be determined based on the relative distance of coefficient samples with respect to each other. Alternatively, the VCG regions may be defined according to scan ordering within a TU. VCGs may differ from traditional coefficient groups due to the variations in entropy coding techniques associated within each VCG and specialized operations each VCG may perform.
In the second aspect, an encoder may allocate to each VCG: 1) a different number of symbols for a given syntax element, and/or 2) a different number of syntax elements within the same TU or PU. The decision whether to allocate more symbols or more syntax elements may depend on the type of arithmetic coding engine used in a particular coding specification. For multi-symbol arithmetic coding (MS-AC), a VCG may allocate a different number of symbols for a syntax element. For example, to encode absolute coefficient values inside a TU after performing a transform such as the discrete cosine transform (DCT), a VCG region may be defined around lower-frequency transform coefficients, and for that VCG M symbols can be used to encode the absolute coefficient values. Another VCG region can be defined around the higher-frequency transform coefficients to encode K symbols, where K may be different from M. For binary arithmetic coders (BACs), FCC allows for coding a variable number of syntax elements in different VCGs. In this case, one VCG in a TU may code M syntax elements associated with signaling the absolute coefficient value, where each one of the M syntax elements may have 2 symbols.
The third aspect described in this disclosure allows for applying, by an encoder, specialized and different probability models and context derivation rules associated with each VCG in a given TU or PU. Since each VCG may code a different number of symbols or syntax elements in different spatial locations of a TU or PU, different context models may be used for each VCG to provide better granularity for entropy modeling for arithmetic coding. Furthermore, different VCGs may also use different entropy coders including combinations of arithmetic coding, Golomb-Rice coding, Huffman coding. The FCC proposed here can be used in new image and video coding specifications and their implementations such as extensions of HEVC (H.265) and VVC (H.266) from MPEG/ITU-T, or of AV1 by the Alliance for Open Media (AOM) such as its successor development model AVM (AOM Video Model).
In the present disclosure, a new flexible coefficient coding method is described for video encoders and decoders. FCC introduces the concept of variable coefficient groups (VCGs). A single VCG may be defined as a spatial sub-region of a TU or PU with variable shapes. Each VCG may be associated with different entropy coding approaches. For instance, a VCG may code a variable number M of symbols or syntax elements attached specifically to itself to encode information. A VCG may also be associated with different entropy modeling and context derivation rules specific to itself to improve the arithmetic coding performance. An example of information that VCGs can code is level information, which is defined as the absolute value of transform coefficient samples or prediction residual samples. In this regard, application of VCGs may differ from traditional coefficient coding, since standard codecs use the same number of symbols or syntax elements, and mostly the same entropy and context derivation rules, within the same TU. For instance, in both the AV1 and AVM reference software, the level value for a single coefficient is split into different level ranges and then coded in multiple coding passes using MS-AC. These ranges or coding passes are defined as the base range (BR), low range (LR), and high range (HR) in the Background. However, both BR and LR passes in the AVM use 4 symbols for a given TU or PU regardless of the coefficient location, and context derivation rules do not vary across different sub-regions within the same TU. FCC modifies the entropy coding process depending on the defined VCGs.
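As a rough sketch of how a level value might be split into BR, LR, and HR parts when the alphabet sizes vary per VCG, consider the following. The function `split_level`, the escape-symbol convention, and the `VCG_CONFIG` values are assumptions made for illustration; they simplify the range coding described above and are not the AVM implementation.

```python
def split_level(level, br_symbols, lr_symbols, lr_passes):
    """Split a level into (br_symbol, lr_symbols_list, hr_remainder).

    The largest symbol of each alphabet is assumed to act as an escape that
    means "the level continues in the next pass".
    """
    escape = br_symbols - 1
    br = min(level, escape)
    lr, hr = [], None
    if br == escape:
        remaining = level - escape
        lr_escape = lr_symbols - 1
        for _ in range(lr_passes):
            sym = min(remaining, lr_escape)
            lr.append(sym)
            remaining -= sym
            if sym < lr_escape:
                break
        else:
            # Every LR pass escaped, so the rest goes to the high range.
            hr = remaining
    return br, lr, hr

# Hypothetical per-VCG settings: VCG0 uses a larger BR alphabet than VCG1.
VCG_CONFIG = {
    0: dict(br_symbols=6, lr_symbols=4, lr_passes=4),
    1: dict(br_symbols=4, lr_symbols=4, lr_passes=4),
}

parts_vcg0 = split_level(9, **VCG_CONFIG[0])
parts_vcg1 = split_level(9, **VCG_CONFIG[1])
```

With these assumed settings, the same level of 9 needs two LR symbols in VCG0 but three in VCG1, which illustrates why a larger alphabet may suit regions where large levels are concentrated.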
In AV1, three such passes are used, namely BR, LR, and HR coding. In FCC, Pi may have different coefficient coding passes and different symbols/syntax elements associated with each pass. Alternatively, different VCGs may utilize different entropy coding methods. For instance, VCGi may use a combination of arithmetic coding and Golomb-Rice coding, whereas VCGj may use Huffman coding or different entropy coding methods. VCGs may have different quantizers, quantization matrices, quantization step sizes, derived or signaled delta QPs, and encoder optimizations such as trellis optimization. Moreover, VCGs may use different predictive coding techniques such as DPCM.
The FCC technique may find application in several embodiments, which are described below.
In the proposed design, the number of symbols (N) encoded to represent a level value in a given coding range (e.g. BR, LR in AVM) or a coefficient coding pass can depend on the relative spatial sub-region (VCG) in a TU or a PU. To provide an example, in
In one example, an additional LR coding pass can be performed after the 6-symbol BR pass on VCG0 in
In one example, the LR loop can code a maximum of M symbols for each sub-range loop. M can be different in VCG0 in
In a preferred embodiment, the shape of VCG0 can be determined by the encoder and decoder based on the underlying coefficient's relative location inside a given TU or PU. For instance, in
In a preferred embodiment, if a 1D transform is applied along a single direction, such as a vertical DCT (V_DCT) or horizontal ADST (H_ADST) as shown in Table 1, then the bulk of coefficients may reside in different regions of a TU as compared to the 2D transform example. In this case, the VCG regions can be determined differently from the 2D transform case. This is shown in
In one embodiment, variations to the spatial regions defined by VCGs and the variations to the number of symbols associated with each VCG are allowed. For instance, in
In another embodiment, different color components (for example luma and chroma) are allowed to use different spatial regions defined by VCGs, different number of symbols associated with each VCG and a different number of passes. For example, in
In a further embodiment, the VCG regions can be determined by the scan indices of the underlying coefficients. An example for this case is shown in
Some potential variations to VCG regions are shown in
The ideas described above can apply to arbitrary block sizes of N×M. An example is shown in
As another example, VCG regions of non-square blocks of 2D transforms may be defined differently than those of square blocks. For example, in
In one example, a larger TU or PU can be split into equal sized coefficient groups (CGs) as done in HEVC or VVC. The concept of VCGs extends and generalizes CGs in the sense that with VCGs each CG can have a variable number of symbols coded to represent particular information and may have a different entropy coding process with some examples detailed in
In an alternative embodiment, the coefficients in a 2D TU, PU can be vectorized to form a 1 dimensional array. In this case, the variable symbol coding can work on coefficient indices directly rather than dividing the 2D TU, PU, or CG based on row or column indexes.
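A minimal sketch of index-based VCG assignment on a vectorized block is shown below. The row-major vectorization and the boundary values are assumptions chosen for illustration; the names `vectorize` and `vcg_of_index` are hypothetical.

```python
from bisect import bisect_right

def vectorize(block):
    """Flatten a 2D block into a 1D list (row-major order assumed here)."""
    return [c for row in block for c in row]

def vcg_of_index(coeff_idx, boundaries):
    """Map a coefficient index to a VCG index.

    boundaries holds the first index of VCG1, VCG2, ... in increasing order.
    """
    return bisect_right(boundaries, coeff_idx)

# Hypothetical 4x4 block and boundaries: indices 0-5 form VCG0, 6-11 VCG1, rest VCG2.
block = [[9, 4, 1, 0], [3, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
flat = vectorize(block)
vcg_ids = [vcg_of_index(i, [6, 12]) for i in range(len(flat))]
```

Because the mapping depends only on the coefficient index, a decoder could derive the same VCG assignment without any additional signaling, provided it uses the same boundaries.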
In another embodiment, different VCGs may share the same entropy coding methods but may have different context derivation rules.
Instead of coding a variable number of symbols in each VCG, a different number of syntax elements can be coded in each different VCG. For instance, in both HEVC (H.265) and VVC (H.266) a context-adaptive binary arithmetic coding (CABAC) engine is used. CABAC can only transmit a maximum of 2 context coded symbols per syntax element in the bitstream; therefore, it is not possible to modify the number of symbols per syntax element across different VCGs. However, a variable number of syntax elements to transmit particular information, such as coefficient level values, may be coded in each VCG.
Table 2 summarizes syntax elements signaled for level coding in VVC. In this table a quantized coefficient value at a scan index k is defined as qk and its absolute value is defined as |qk|. The first row shows the actual quantized level values to be encoded. First, in VVC's coefficient coding a significance flag (sig) is coded, which indicates if the level value is non-zero at coefficient index k. This sig syntax is a 2-symbol element. If sig=0 then no other syntax element is coded for the level value at coefficient index k. If sig is non-zero, then another syntax element, greater than 1 (gt1), is coded that indicates whether a level value is larger than 1. Subsequently, a parity flag (par) and a greater than 3 (gt3) flag may be coded conditioned on the values of the previous syntax elements. Lastly, a remainder (rem) term is coded in bypass coding mode to represent larger level values. Note that the transform skip residual coding in VVC may code a different number of syntax elements.
In the proposed FCC design, different VCGs may code a different number of syntax elements, where each syntax element may have 2-symbols.
In general, each VCG may have an arbitrary number of coded syntax elements that may differ from VCG to VCG. The number of added syntax elements associated with each VCG may be selected based on where the coefficient level values are concentrated. Likewise, some VCGs may code fewer syntax elements if the coefficient level values in the respective VCG regions are sparse (or smaller in magnitude).
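The idea of coding a different number of binary syntax elements per VCG can be sketched with a deliberately simplified binarization. The scheme below uses only "greater than k" flags followed by a remainder; it is an assumption for illustration and is not the VVC binarization (it has no parity flag). The name `binarize_level` is hypothetical.

```python
def binarize_level(level, num_flags):
    """Binarize a level into up to num_flags binary flags plus a remainder.

    Flag k answers "is the level greater than k?". Coding stops at the first
    zero flag. If all flags are one, the remainder level - num_flags follows.
    """
    flags = []
    for k in range(num_flags):
        flag = 1 if level > k else 0
        flags.append(flag)
        if flag == 0:
            return flags, None
    return flags, level - num_flags

# Hypothetical allocation: three flags in a dense VCG, one flag in a sparse VCG.
dense = binarize_level(5, num_flags=3)
sparse = binarize_level(5, num_flags=1)
```

In this sketch a decoder would recover the level as the sum of the flags plus the remainder, when a remainder is present. A VCG with more flags spends more context-coded bins and a smaller bypass-coded remainder, which may pay off where levels are large.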
Once the optimal symbol counts or syntax elements are chosen for each VCG, the third aspect proposed in this disclosure is to assign different probability models and entropy modeling for arithmetic coding to each region.
For instance in the AVM, context derivation neighborhoods were previously shown in
In the proposed design different context derivation neighborhoods may be used for different VCGs:
In one example, different VCGs can use the same context derivation neighborhood as shown in
In one example, different VCGs may have different context derivation neighborhoods. One example is provided in
In one example, VCGs of non-square blocks may have different context derivation neighborhoods than the square blocks. For instance, an asymmetric neighborhood may be used so that more (less) neighboring samples along the longer (shorter) dimension of a non-square transform block are included in the neighborhood. In
In one example, VCGs of different color components (e.g. luma and chroma) may use different neighborhoods and/or rules for context derivation. For example, chroma may use fewer neighboring coefficients to calculate Mag. As another example, VCGs usually are divided further into two or more frequency bands such that the coefficients belonging to a frequency band share the same probability model. A VCG may use different frequency bands for luma and chroma. For example, in
In one example, different VCGs may use the same or different context derivation neighborhoods as explained above. However, regardless of the context derivation neighborhood associated with each VCG, FCC may use the same context index derivation rule (e.g. min((Mag+1)>>1, 4)) and may use the same offsets.
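The context index rule min((Mag+1)>>1, 4) can be sketched as below, where Mag is taken to be the sum of previously coded level values in a neighborhood. The neighborhood offsets, the handling of positions outside the block, and the function name `context_index` are assumptions for illustration and are not taken from the AVM.

```python
def context_index(levels, row, col, neighborhood, offset=0):
    """Derive a context index from neighboring level values.

    neighborhood is a list of (d_row, d_col) offsets; positions that fall
    outside the block are assumed to contribute zero.
    """
    rows, cols = len(levels), len(levels[0])
    mag = 0
    for d_row, d_col in neighborhood:
        r, c = row + d_row, col + d_col
        if 0 <= r < rows and 0 <= c < cols:
            mag += levels[r][c]
    return min((mag + 1) >> 1, 4) + offset

# Hypothetical neighborhood of already-coded positions to the right and below.
NEIGHBORHOOD = [(0, 1), (1, 0), (1, 1), (0, 2), (2, 0)]
levels = [[5, 2, 1, 0], [3, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
ctx_dc = context_index(levels, 0, 0, NEIGHBORHOOD)
ctx_mid = context_index(levels, 0, 1, NEIGHBORHOOD)
```

A different VCG could pass a different `neighborhood` list or a different `offset` while keeping the same rule, which is one way to read the shared-rule option described above.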
In one example, different VCGs may use the same or different context derivation neighborhoods as explained above. Moreover, different VCGs may have different context derivation rules. For instance, VCG0 in
In one example, different VCGs may have different data hiding properties. For instance, for VCG0 in
In one example, the number of non-zero coefficient samples in the VCG0 in
In one example, an encoder may perform parity hiding (PH) when encoding level values only for a subset of VCGs. For instance, the encoder may allow PH only for VCG0 and disallow for other VCGs inside a TU. In another case, in
In one example, the sign bit may be context coded only for certain VCGs. For instance, in
In one example, different VCGs may use different quantization step sizes or delta QP values. In one case, a single VCG0 may use a delta quantization such that coefficient values residing inside this VCG are quantized according to step size (Q+deltaQ0), while another VCG1 may use (Q+deltaQ1), where deltaQ0 and deltaQ1 may have different values. This allows for variable quantization based on where the VCG regions are defined. These quantization-related values may be inferred based on the VCG indices and determined by the decoder based on the location of the underlying coefficients. Alternatively, such information can be indicated/signaled at the beginning of each VCG, or at higher levels such as the sequence/picture/tile/CTU levels, to indicate that the underlying VCGs may share the same quantizers or that only a subset of VCGs, as defined by their spatial location or index inside a unit, will have a pre-defined
In one example, different VCGs may use different quantization matrices, where a quantization matrix belonging to an individual VCG may use variable spatial quantization rules such as different quantization step sizes with respect to coefficient index etc. These quantization matrices may be implicitly defined for individual VCG indices. For instance the VCG0 in
In one embodiment, the regions of VCGs may be determined based on a data-driven algorithm, such as k-means clustering with each of the k-VCG regions defined over regions with coefficient magnitudes closer to each other. Such a partitioning can be done offline using pre-defined data using fixed transforms such as the DCT.
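A minimal sketch of such an offline clustering is given below. It runs a one-dimensional k-means over hypothetical average coefficient magnitudes, one per coefficient position. The input data, the deterministic centroid initialization, and the name `kmeans_vcg_map` are assumptions for illustration only.

```python
def kmeans_vcg_map(avg_mag, k, iters=20):
    """Cluster coefficient positions by average magnitude into k VCG labels.

    avg_mag is a 2D list of average magnitudes gathered offline. Label 0 is
    initialized at the largest magnitude, so it tends to cover that region.
    """
    values = [v for row in avg_mag for v in row]
    lo, hi = min(values), max(values)
    # Deterministic initial centroids, spread from the largest to the smallest value.
    centroids = [hi - (hi - lo) * i / max(k - 1, 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: nearest centroid for every position.
        labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    cols = len(avg_mag[0])
    return [labels[r * cols:(r + 1) * cols] for r in range(len(avg_mag))]

# Hypothetical average magnitudes for a 4x4 TU after a 2D DCT.
avg_mag = [[40, 20, 4, 1], [18, 6, 1, 0], [5, 1, 0, 0], [1, 0, 0, 0]]
vcg_map = kmeans_vcg_map(avg_mag, k=2)
```

Since the clustering is performed offline, the resulting map could be stored as a fixed table shared by the encoder and decoder rather than being computed or signaled per block.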
In one example, each VCG inside a TU may have different encoder operations. In one example, trellis quantization or coefficient optimization may be disabled for certain VCGs in order to reduce encoding complexity. For instance, in
In another example, a neural network (NN) algorithm may be provided with sufficient training data that includes the transform type, coefficient samples, level information, the associated signaling costs for each coefficient index or location, and the number of allowed VCGs. After sufficient offline training, the NN may determine the optimal symbol counts or syntax element counts across different spatial locations inside a TU for a transform type, as well as the spatial regions of each VCG. In another example, optimal values of deltaQ0, deltaQ1, ..., deltaQk can be determined using an NN algorithm to perform modified quantization in different VCG regions.
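However the delta values are obtained, quantization with a per-VCG step size of the form Q+deltaQk can be sketched as follows. The round-to-nearest quantizer, the sample values, and the name `quantize_with_vcg` are assumptions for illustration; an actual codec would typically use integer arithmetic and QP-to-step tables.

```python
def quantize_with_vcg(coeffs, vcg_ids, base_step, delta_steps):
    """Quantize each coefficient with the step size of its VCG.

    The step for VCG k is base_step + delta_steps[k].
    """
    out = []
    for c, g in zip(coeffs, vcg_ids):
        step = base_step + delta_steps[g]
        q = int(abs(c) / step + 0.5)  # round to nearest, sign handled separately
        out.append(-q if c < 0 else q)
    return out

# Hypothetical values: VCG1 uses a coarser step than VCG0.
quantized = quantize_with_vcg(
    coeffs=[100, -37, 12, 5],
    vcg_ids=[0, 0, 1, 1],
    base_step=8,
    delta_steps=[0, 4],
)
```

Here the two coefficients in VCG1 are quantized with a step of 12 instead of 8, so the smallest one is quantized to zero.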
In one embodiment, different VCGs may use different transform types within the FCC framework. For instance, VCGi may use a 2D DCT primary transform and VCGj may use an IDTX transform. Moreover, a secondary transform in a given TU may only apply to certain VCGs. For instance, a secondary transform may only apply to the primary coefficient samples located in VCG0, while the other VCGs may disable a secondary transform.
As previously explained, VCGs may be defined according to where the underlying coefficients are located inside a given TU e.g. based on scan ordering and scan indices of the underlying coefficients, or based on the row and column indices of underlying coefficients. Some examples were illustrated from
In one embodiment, different VCGs may use different predictive coding techniques. For instance, for VCGi differential pulse-code modulation (DPCM) approaches may be used in horizontal mode, where prior to coding quantized or un-quantized coefficient samples a prediction is performed across columns first by taking differences across consecutive column vectors. Likewise, for VCGj a vertical DPCM may be used where prediction can be performed across rows. In general, predictive coding may be performed for each VCG differently and can be done either in the residual domain (where a transform is skipped) or in the transform (frequency) domain. Since VCGs are defined according to predefined spatial regions over a PU or a TU in preferred embodiments, a mode decision can be associated with a VCG and inferred from the coefficient location as per the previous discussion. In an alternative design, mode decisions may be signaled to indicate which mode decisions each VCG should use, and such signaling can be done either: 1) at the beginning of each VCG, 2) at the beginning of a subset of VCGs if the VCGk index k matches a specific value such as k=0 in
In one embodiment, VCGs may use different modes, such as different transform types, predictive coding schemes, quantization matrices/approaches, RDOQ/TCQ rules, and quantization states, and these modes can be signaled for each VCG. Signaling may depend on whether the VCG is active, meaning that a coefficient exists in the given VCG region; if so, a mode/flag/index or indicator related to a transform type, secondary transform type, zero-out, or predictive coding scheme can be signaled for that VCG. In one example, if a TU is split into two VCG regions, then VCG0 can signal a transform type or a DPCM direction mode, while VCG1 can signal another transform type or another DPCM direction mode. In general, arbitrary mode decisions may be signaled for individual VCGs.
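The horizontal and vertical DPCM directions mentioned above can be sketched as a forward and inverse pair. The sketch assumes the first column (or row) is left unpredicted and that the samples passed in are already restricted to one VCG; the function names are hypothetical.

```python
def dpcm_forward(block, direction):
    """Replace each sample by its difference from the previous column or row."""
    out = [row[:] for row in block]
    for r in range(len(block)):
        for c in range(len(block[0])):
            if direction == "horizontal" and c > 0:
                out[r][c] = block[r][c] - block[r][c - 1]
            elif direction == "vertical" and r > 0:
                out[r][c] = block[r][c] - block[r - 1][c]
    return out

def dpcm_inverse(diff, direction):
    """Undo dpcm_forward by accumulating along the same direction."""
    out = [row[:] for row in diff]
    for r in range(len(diff)):
        for c in range(len(diff[0])):
            if direction == "horizontal" and c > 0:
                out[r][c] = diff[r][c] + out[r][c - 1]
            elif direction == "vertical" and r > 0:
                out[r][c] = diff[r][c] + out[r - 1][c]
    return out

# Hypothetical samples belonging to one VCG.
samples = [[3, 4, 6], [3, 5, 5]]
horizontal = dpcm_forward(samples, "horizontal")
vertical = dpcm_forward(samples, "vertical")
```

A per-VCG DPCM direction mode, whether inferred or signaled, would select which of the two directions is applied to the samples of that VCG.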
In another case, a higher level flag in a sequence parameter set (SPS), picture parameter set (PPS), or a tile level flag or a CTU level flag can be signaled to indicate whether a PU or TU is split into VCGs. In this case, a high-level flag value of 0 may indicate that there are no multiple VCGs in a given TU/PU and the entire TU uses the same coefficient coding method. Alternatively, a non-zero high-level flag value may indicate that there are VCG splits in the given frame/tile/CTU, where each TU may have a fixed number N of VCGs. Alternatively, a higher level index may also be signaled to specify how many (N) VCGs the present frame/tile/CTU contains per PU/TU.
In one embodiment, a higher level unit such as a CTU, a CU or a PU may be split into multiple VCGs. In this case, differently from the embodiments discussed above a VCG is not defined within a TU. Instead, a VCG may contain multiple TUs inside it. For instance in
In one embodiment, if information pertaining to VCG modes is signaled, such signaling can be performed at various units:
If a mode/flag/index is signaled for VCGs at CU/PU/CTU levels then this information can be either coded with contexts using an arithmetic coder or can be coded in bypass mode. If context coding is used, the mode decisions from previous CU/PU/CTU levels can be used to select different entropy models to more efficiently encode the VCG mode/index for the current units.
The decoding system 1600 may include an entropy decoder 1610, an inverse quantizer 1620, an inverse transform 1630, an adder 1640, a loop filter 1650, a prediction unit 1660, and a reference picture buffer 1670 all operating under control of a controller 1680. The entropy decoder 1610, inverse quantizer 1620, inverse transform 1630, and adder 1640 each perform operations that invert operations applied by their counterparts in the encoder shown in
The prediction unit 1660 may perform prediction according to a prediction mode and prediction references supplied in the coded video data. The prediction unit 1660 may generate prediction coding from reference picture data stored in a reference picture buffer 1670 and may supply a prediction block to the adder 1640. The adder 1640 may add recovered data output from the inverse transform unit 1630 to prediction content from the prediction unit 1660 on a pixelwise basis.
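The pixelwise addition performed by the adder can be sketched as follows. Clipping the result to the valid sample range for a given bit depth is an assumption added for illustration, as is the function name `reconstruct_block`.

```python
def reconstruct_block(prediction, residual, bit_depth=8):
    """Add a recovered residual block to a prediction block, sample by sample.

    Results are clipped to [0, 2**bit_depth - 1] (assumed here).
    """
    max_val = (1 << bit_depth) - 1
    return [
        [min(max(p + r, 0), max_val) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(prediction, residual)
    ]

# Hypothetical 2x2 prediction and residual blocks.
recon = reconstruct_block([[250, 10], [128, 0]], [[10, -20], [5, 3]])
```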
The in-loop filter 1650 may perform certain filtering operations on recovered pixel blocks output by the adder 1640. For example, the in-loop filter 1650 may include a deblocking filter, a sample adaptive offset (“SAO”) filter, and/or other types of in-loop filters. The in-loop filter may operate on frames assembled from multiple pixel blocks generated by the decoding system 1600. Reassembled reference frames may be stored in the reference picture buffer 1670 for use in decoding of later-received video.
The foregoing discussion has described operation of the aspects of the present disclosure in the context of video coders and decoders. Commonly, these components are provided as electronic devices. Video decoders and/or controllers can be embodied in integrated circuits, such as application specific integrated circuits, field programmable gate arrays, and/or digital signal processors. Alternatively, they can be embodied in computer programs that execute on camera devices, personal computers, notebook computers, tablet computers, smartphones, or computer servers. Such computer programs typically are stored in physical storage media such as electronic-, magnetic-, and/or optically-based storage devices, where they are read to a processor and executed. Decoders commonly are packaged in consumer electronics devices, such as smartphones, tablet computers, gaming systems, DVD players, portable media players and the like; and they also can be packaged in consumer software applications such as video games, media players, media editors, and the like. And, of course, these components may be provided as hybrid systems that distribute functionality across dedicated hardware components and programmed general-purpose processors, as desired.
Several embodiments of the present invention are specifically illustrated and described herein. However, it will be appreciated that modifications and variations of the present invention are covered by the above teachings and within the purview of the appended claims without departing from the spirit and intended scope of the invention.
This application benefits from priority of U.S. application Ser. No. 63/392,941, entitled “Flexible Coefficient Coding in Video Compression,” filed Jul. 28, 2022, the disclosure of which is incorporated herein in its entirety.
Number | Date | Country
---|---|---
63392941 | Jul 2022 | US