This disclosure relates to methods and apparatus for improving the coding efficiency of both luma and chroma components. More specifically, a loop filter called Cross-Component Sample Adaptive Offset (CCSAO) is proposed to exploit the cross-component relationship between luma and chroma components.
Various video coding techniques may be used to compress video data. Video coding is performed according to one or more video coding standards. For example, video coding standards include versatile video coding (VVC), joint exploration test model (JEM), high-efficiency video coding (H.265/HEVC), advanced video coding (H.264/AVC), moving picture expert group (MPEG) coding, or the like. Video coding generally utilizes prediction methods (e.g., inter-prediction, intra-prediction, or the like) that take advantage of redundancy present in video images or sequences. An important goal of video coding techniques is to compress video data into a form that uses a lower bit rate, while avoiding or minimizing degradations to video quality.
Embodiments of the present disclosure provide a method and an apparatus for a Cross-Component Sample Adaptive Offset (CCSAO) process.
In a first aspect of the present disclosure, a method for a Cross-Component Sample Adaptive Offset (CCSAO) process is provided. The method comprises: determining a first rate-distortion (RD) cost for a block using a first classifier, wherein the first classifier has a first category number in a first value range, and wherein the first RD cost is the least among RD costs associated with classifiers having a category number in the first value range; determining a second RD cost for the block using a second classifier, wherein the second classifier has a second category number in a second value range, and wherein the second RD cost is the least among RD costs associated with classifiers having a category number in the second value range; and applying the first classifier as a classifier for the CCSAO process in response to determining that the first RD cost is less than the second RD cost.
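For illustration only, the decision of the first aspect can be sketched as follows. The cost dictionary, ranges, and function names are hypothetical and not part of the disclosure; the sketch only shows picking, within each category-number range, the classifier with the least RD cost, and then keeping the cheaper of the two range winners.

```python
# Hypothetical sketch of the first-aspect classifier selection.
# rd_costs maps a classifier's category number to its measured RD cost.

def best_in_range(rd_costs, lo, hi):
    """Return (category_number, rd_cost) with the least RD cost among
    classifiers whose category number lies in [lo, hi]."""
    candidates = {n: c for n, c in rd_costs.items() if lo <= n <= hi}
    n_best = min(candidates, key=candidates.get)
    return n_best, candidates[n_best]

def select_classifier(rd_costs, range1, range2):
    """Compare the two per-range winners and keep the cheaper classifier."""
    n1, cost1 = best_in_range(rd_costs, *range1)
    n2, cost2 = best_in_range(rd_costs, *range2)
    return n1 if cost1 < cost2 else n2

# Example: category numbers 1..4 form the first range, 5..8 the second.
costs = {1: 9.0, 2: 7.5, 3: 8.1, 4: 9.9, 5: 7.9, 6: 8.8, 7: 10.2, 8: 9.4}
print(select_classifier(costs, (1, 4), (5, 8)))  # 2 (cost 7.5 < 7.9)
```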
In a second aspect of the present disclosure, a method for a Cross-Component Sample Adaptive Offset (CCSAO) process is provided. The method comprises: applying a classifier for the blocks in a frame; deriving a first set of offset values for the categories of the classifier; estimating, for each of the blocks, a first RD cost associated with the first set of offset values and a second RD cost associated with a second set of offset values, wherein the second set of offset values are smaller than the first set of offset values in terms of absolute values; and assigning the first set of offset values as the offset values for a block in response to determining that the first RD cost is less than the second RD cost for the block.
In a third aspect of the present disclosure, a method for Cross-Component Sample Adaptive Offset (CCSAO) process is provided. The method comprises: applying a first classifier for the blocks in a frame; sorting the blocks with CCSAO enabled in ascending or descending order according to the associated distortion or RD cost; excluding a portion of sorted blocks for which the associated distortion or RD cost is higher than the associated distortion or RD cost for the other portion of blocks; training the excluded blocks with a second classifier; and applying the second classifier for the excluded blocks.
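The sort-and-exclude step of the third aspect can be sketched in the following way; the block identifiers, the exclusion fraction, and the function name are illustrative assumptions, not signaled or normative values.

```python
# Illustrative sketch of the third-aspect block split: sort CCSAO-enabled
# blocks by RD cost and split off the worst-performing fraction, which
# would then be trained with a second classifier.

def split_worst_blocks(block_costs, fraction=0.25):
    """block_costs: {block_id: rd_cost}. Returns (kept, excluded) id lists,
    where `excluded` holds the highest-cost `fraction` of the blocks."""
    order = sorted(block_costs, key=block_costs.get)   # ascending RD cost
    n_excl = int(len(order) * fraction)
    split = len(order) - n_excl
    return order[:split], order[split:]

costs = {0: 1.0, 1: 5.0, 2: 2.0, 3: 9.0}
kept, excluded = split_worst_blocks(costs, fraction=0.5)
# the excluded blocks (ids 1 and 3) would be re-trained with a second classifier
print(kept, excluded)
```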
In a fourth aspect of the present disclosure, an apparatus for a Cross-Component Sample Adaptive Offset (CCSAO) process is provided. The apparatus comprises: a memory; and at least one processor coupled to the memory and configured to perform the method according to any of the methods of the present disclosure.
In a fifth aspect of the present disclosure, a computer readable storage medium is provided. The computer readable storage medium has stored therein a bitstream comprising encoded video information generated by the methods of the present disclosure.
It is to be understood that both the foregoing general description and the following detailed description are examples only and are not restrictive of the present disclosure.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate examples consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Reference will now be made in detail to specific implementations, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.
It should be noted that the terms “first,” “second,” and the like used in the description, claims of the present disclosure, and the accompanying drawings are used to distinguish objects, and not used to describe any specific order or sequence. It should be understood that the data used in this way may be interchanged under an appropriate condition, such that the embodiments of the present disclosure described herein may be implemented in orders besides those shown in the accompanying drawings or described in the present disclosure.
The first version of the HEVC standard was finalized in October 2013, offering approximately 50% bit-rate saving, at equivalent perceptual quality, compared to the prior-generation video coding standard H.264/MPEG AVC. Although the HEVC standard provides significant coding improvements over its predecessor, there is evidence that superior coding efficiency can be achieved with additional coding tools beyond HEVC. On that basis, both VCEG and MPEG started exploration work on new coding technologies for future video coding standardization. One Joint Video Exploration Team (JVET) was formed in October 2015 by ITU-T VCEG and ISO/IEC MPEG to begin significant study of advanced technologies that could enable substantial enhancement of coding efficiency. One reference software package called the joint exploration model (JEM) was maintained by the JVET by integrating several additional coding tools on top of the HEVC test model (HM).
In October 2017, the joint call for proposals (CfP) on video compression with capability beyond HEVC was issued by ITU-T and ISO/IEC. In April 2018, 23 CfP responses were received and evaluated at the 10th JVET meeting, demonstrating a compression efficiency gain of around 40% over HEVC. Based on such evaluation results, the JVET launched a new project to develop the new-generation video coding standard, named Versatile Video Coding (VVC). In the same month, one reference software codebase, called the VVC test model (VTM), was established for demonstrating a reference implementation of the VVC standard.
Like HEVC, the VVC is built upon the block-based hybrid video coding framework.
In general, the basic intra prediction scheme applied in the VVC is kept the same as that of the HEVC, except that several modules are further extended and/or improved, e.g., intra sub-partition (ISP) coding mode, extended intra prediction with wide-angle intra directions, position-dependent intra prediction combination (PDPC), matrix-based intra prediction, and 4-tap intra interpolation.
SAO is a process that modifies the decoded samples by conditionally adding an offset value to each sample after the application of the deblocking filter, based on values in look-up tables transmitted by the encoder. SAO filtering is performed on a region basis, based on a filtering type selected per CTB by a syntax element sao-type-idx. A value of 0 for sao-type-idx indicates that the SAO filter is not applied to the CTB, and the values 1 and 2 signal the use of the band offset and edge offset filtering types, respectively. In the band offset mode specified by sao-type-idx equal to 1, the selected offset value directly depends on the sample amplitude. In this mode, the full sample amplitude range is uniformly split into 32 segments called bands, and the sample values belonging to four of these bands (which are consecutive within the 32 bands) are modified by adding transmitted values denoted as band offsets, which can be positive or negative. The main reason for using four consecutive bands is that in the smooth areas where banding artifacts can appear, the sample amplitudes in a CTB tend to be concentrated in only few of the bands. In addition, the design choice of using four offsets is unified with the edge offset mode of operation which also uses four offset values. In the edge offset mode specified by sao-type-idx equal to 2, a syntax element sao-eo-class with values from 0 to 3 signals whether a horizontal, vertical or one of two diagonal gradient directions is used for the edge offset classification in the CTB.
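The band offset mode described above can be sketched as follows. This is an illustration consistent with the description (32 equal bands, four consecutive bands receiving transmitted offsets), not the normative HEVC process; the starting band position and offsets are hypothetical encoder choices.

```python
# Sketch of SAO band offset filtering: the amplitude range is split into
# 32 bands, and only the four consecutive bands starting at `band_position`
# receive transmitted (signed) offsets.

def band_offset(sample, band_position, offsets, bit_depth=8):
    """Apply a band offset to one decoded sample."""
    band = sample >> (bit_depth - 5)              # which of the 32 bands
    if band_position <= band < band_position + 4:
        sample += offsets[band - band_position]   # transmitted band offset
    # clip back to the legal sample range
    return min(max(sample, 0), (1 << bit_depth) - 1)

# 8-bit example: each band is 8 values wide; band_position 12 covers 96..127.
print(band_offset(100, 12, [3, -2, 1, 0]))   # band 12 -> +3 -> 103
print(band_offset(200, 12, [3, -2, 1, 0]))   # band 25 -> unchanged, 200
```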
Thus, for SAO types 1 and 2, a total of four amplitude offset values are transmitted to the decoder for each CTB. For type 1, the sign is also encoded. The offset values and related syntax elements such as sao-type-idx and sao-eo-class are determined by the encoder—typically using criteria that optimize rate-distortion performance. The SAO parameters can be indicated to be inherited from the left or above CTB using a merge flag to make the signaling efficient. In summary, SAO is a nonlinear filtering operation which allows additional refinement of the reconstructed signal, and it can enhance the signal representation in both smooth areas and around edges.
Near the finalization stage of VVC, Pre-SAO was proposed in JVET-Q0434. Though it was not adopted in VVC, its coding performance with low complexity is still promising in the future video coding standard development. Note in JVET-Q0434, Pre-SAO is only applied on luma component samples using luma samples for classification.
The contribution proposes a tool called Pre-Sample Adaptive Offset (Pre-SAO). Pre-SAO operates by applying two SAO-like filtering operations, called SAOV and SAOH, jointly with the deblocking filter (DBF) before applying the existing (legacy) SAO, as illustrated in
where T is a predetermined positive constant and d1 and d2 are offset coefficients associated with two classes based on the sample-wise difference between Y1(i) and Y2(i) given by
The first class for d1 takes all sample locations i such that f(i) > T, while the second class for d2 is given by f(i) < −T. The offset coefficients d1 and d2 are calculated at the encoder so that the mean square error between the output picture Y3 of SAOV and the original picture X is minimized, in the same way as in the existing SAO process. After SAOV is applied, the second SAO-like filter SAOH operates by applying SAO to Y4, with a classification based on the sample-wise difference between Y3(i) and Y4(i), where Y4 is the output picture of the deblocking filter for the horizontal edges (DBFH)—see
Note that both SAOV and SAOH operate only on the picture samples affected by the respective deblocking (DBFV or DBFH). Hence, unlike with the existing SAO process, only a subset of all samples in the given spatial region (picture, or CTU in case of legacy SAO) are being processed by the Pre-SAO, which keeps the resulting increase in decoder-side mean operations per picture sample low (two or three comparisons and two additions per sample in the worst-case scenario according to preliminary estimates). It should also be noted that Pre-SAO only needs samples used by the deblocking filter without storing additional samples at the decoder.
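The SAOV classification described above can be sketched as follows. Variable names and the exact sample the offset is added to are assumptions drawn from the description (f(i) = Y1(i) − Y2(i), offset d1 for f(i) > T, offset d2 for f(i) < −T), not from the normative proposal text.

```python
# Sketch of SAOV: samples where f(i) = Y1(i) - Y2(i) exceeds T get offset
# d1, samples where f(i) < -T get offset d2, and all others pass through.

def saov_apply(y1, y2, T, d1, d2):
    """y1: samples before vertical deblocking, y2: samples after it
    (per the text's notation). Returns the SAOV output Y3."""
    y3 = []
    for a, b in zip(y1, y2):
        f = a - b
        if f > T:
            y3.append(b + d1)      # first class
        elif f < -T:
            y3.append(b + d2)      # second class
        else:
            y3.append(b)           # sample not filtered
    return y3

print(saov_apply([110, 100, 98], [100, 100, 105], T=4, d1=2, d2=-1))
```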
After VVC version 1 was finalized, the bilateral filter was proposed for compression efficiency exploration beyond VVC, and it is being studied whether it has the potential to be part of the next-generation standard. The proposed filter is carried out in the sample adaptive offset (SAO) loop-filter stage, as shown in
In detail, the output sample IOUT is obtained as
where IC is the input sample from deblocking, ΔIBIF is the offset from the bilateral filter and ΔISAO is the offset from SAO.
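The combination above can be sketched directly; the clipping of the result to the valid sample range is an assumption consistent with normal loop-filter behavior.

```python
# Sketch of the combined bilateral-filter/SAO output:
# IOUT = clip(IC + dIBIF + dISAO)

def bif_sao_output(i_c, delta_bif, delta_sao, bit_depth=10):
    """i_c: input sample from deblocking; delta_bif / delta_sao: the
    bilateral and SAO offsets."""
    out = i_c + delta_bif + delta_sao
    return min(max(out, 0), (1 << bit_depth) - 1)

print(bif_sao_output(512, 3, -1))   # 514
print(bif_sao_output(1023, 5, 0))   # clipped to 1023
```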
The proposed implementation provides the possibility for the encoder to enable or disable filtering at the CTU and slice level. The encoder takes a decision by evaluating the RDO cost.
The following syntax elements are introduced in the PPS:
The semantic is as follows:
pps_bilateral_filter_enabled_flag equal to 0 specifies that the bilateral loop filter is disabled for slices referring to the PPS. pps_bilateral_filter_enabled_flag equal to 1 specifies that the bilateral loop filter is enabled for slices referring to the PPS.
bilateral_filter_strength specifies a bilateral loop filter strength value used in the bilateral transform block filter process. The value of bilateral_filter_strength shall be in the range of 0 to 2, inclusive.
bilateral_filter_qp_offset specifies an offset used in the derivation of the bilateral filter look-up table, LUT(x), for slices referring to the PPS. bilateral_filter_qp_offset shall be in the range of −12 to +12, inclusive.
The following syntax elements, adapted from JVET-P0078, are introduced:
The semantic is as follows:
slice_bilateral_filter_all_ctb_enabled_flag equal to 1 specifies that the bilateral filter is enabled and is applied to all CTBs in the current slice. When slice_bilateral_filter_all_ctb_enabled_flag is not present, it is inferred to be equal to 0.
slice_bilateral_filter_enabled_flag equal to 1 specifies that the bilateral filter is enabled and may be applied to CTBs of the current slice. When slice_bilateral_filter_enabled_flag is not present, it is inferred to be equal to slice_bilateral_filter_all_ctb_enabled_flag.
bilateral_filter_ctb_flag[xCtb>>CtbLog2SizeY][yCtb>>CtbLog2SizeY] equal to 1 specifies that the bilateral filter is applied to the luma coding tree block of the coding tree unit at luma location (xCtb, yCtb). bilateral_filter_ctb_flag[cIdx][xCtb>>CtbLog2SizeY][yCtb>>CtbLog2SizeY] equal to 0 specifies that the bilateral filter is not applied to the luma coding tree block of the coding tree unit at luma location (xCtb, yCtb). When bilateral_filter_ctb_flag is not present, it is inferred to be equal to (slice_bilateral_filter_all_ctb_enabled_flag & slice_bilateral_filter_enabled_flag).
For CTUs that are filtered, the filtering process proceeds as follows.
At the picture border, where samples are unavailable, the bilateral filter uses extension (sample repetition) to fill in unavailable samples. For virtual boundaries, the behavior is the same as for SAO, i.e., no filtering occurs. When crossing horizontal CTU borders, the bilateral filter can access the same samples as SAO is accessing. As an example, if the center sample IC (see
The samples surrounding the center sample IC are denoted according to
Each surrounding sample IA, IR, etc. will contribute a corresponding modifier value μΔI
where |·| denotes absolute value. For data that is not 10-bit, ΔIR = (|IR − IC| + 2^(n−6)) >> (n−7) is used instead, where n = 8 for 8-bit data, etc. The resulting value is now clipped so that it is smaller than 16:
The modifier value is now calculated as
where LUTROW[ ] is an array of 16 values determined by the value of qpb=clip(0, 25, QP+bilateral_filter_qp_offset−17):
This is different from JVET-P0073 where 5 such tables were used, and the same table was reused for several qp-values.
As described in JVET-N0493 section 3.1.3, these values can be stored using six bits per entry resulting in 26*16*6/8=312 bytes or 300 bytes if excluding the first row which is all zeros.
The modifier values for μΔI
and the other diagonal samples and two-steps-away samples are calculated likewise. The modifier values are summed together as
Note that μΔI
The msum value is now multiplied either by c=1, 2 or 3, which can be done using a single adder and logical AND gates in the following way:
where & denotes logical AND and k1 is the most significant bit of the multiplier c and k2 is the least significant bit. The value to multiply with is obtained using the minimum block dimension D=min(width, height) as shown in Table 1:
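The single-adder multiplication described above can be sketched as follows, with k1 and k2 as defined in the text; constructing the AND masks by arithmetic negation of the bits is an implementation assumption for the sketch.

```python
# Sketch of multiplying msum by c (c = 1, 2, or 3) with one adder and AND
# masks: msum*c = (msum << 1 if the MSB k1 of c is set) + (msum if the
# LSB k2 of c is set).

def mul_by_c(msum, c):
    k1 = (c >> 1) & 1                 # most significant bit of c
    k2 = c & 1                        # least significant bit of c
    # -k1 / -k2 are all-ones masks when the bit is set, zero otherwise
    return ((msum << 1) & -k1) + (msum & -k2)

print(mul_by_c(7, 1), mul_by_c(7, 2), mul_by_c(7, 3))   # 7 14 21
```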
Finally, the bilateral filter offset ΔIBIF is calculated. For full strength filtering, ΔIBIF is calculated as
whereas for half-strength filtering, ΔIBIF is instead calculated as
A general formula for n-bit data is to use
where bilateral_filter_strength can be 0 or 1 and is signalled in the pps.
In VVC, an Adaptive Loop Filter (ALF) with block-based filter adaption is applied. For the luma component, one among 25 filters is selected for each 4×4 block, based on the direction and activity of local gradients.
Two diamond filter shapes (as shown in
For luma component, each 4×4 block is categorized into one out of 25 classes. The classification index C is derived based on its directionality D and a quantized value of activity Â, as follows:
To calculate D and Â, gradients of the horizontal, vertical and two diagonal directions are first calculated using 1-D Laplacian:
where indices i and j refer to the coordinates of the upper left sample within the 4×4 block and R(i, j) indicates a reconstructed sample at coordinate (i, j).
To reduce the complexity of block classification, the subsampled 1-D Laplacian calculation is applied. As shown in
Then, the maximum and minimum values of the gradients of the horizontal and vertical directions are set as:
The maximum and minimum values of the gradient of two diagonal directions are set as:
To derive the value of the directionality D, these values are compared against each other and with two thresholds t1 and t2 in the following steps:
The activity value A is calculated as:
A is further quantized to the range of 0 to 4, inclusive, and the quantized value is denoted as Â. For chroma components in a picture, no classification method is applied.
Before filtering each 4×4 luma block, geometric transformations such as rotation or diagonal and vertical flipping are applied to the filter coefficients f(k, l) and to the corresponding filter clipping values c(k, l) depending on gradient values calculated for that block. This is equivalent to applying these transformations to the samples in the filter support region. The idea is to make different blocks to which ALF is applied more similar by aligning their directionality.
Three geometric transformations, including diagonal, vertical flip and rotation are introduced:
where K is the size of the filter and 0≤k, l≤K−1 are coefficient coordinates, such that location (0,0) is at the upper left corner and location (K−1, K−1) is at the lower right corner. The transformations are applied to the filter coefficients f(k, l) and to the clipping values c(k, l) depending on gradient values calculated for that block. The relationship between the transformation and the four gradients of the four directions is summarized in the following Table 2.
At decoder side, when ALF is enabled for a CTB, each sample R(i, j) within the CU is filtered, resulting in sample value R′(i, j) as shown below,
where f(k,l) denotes the decoded filter coefficients, K(x,y) is the clipping function and c(k,l) denotes the decoded clipping parameters. The variables k and l vary between −L/2 and L/2, where L denotes the filter length. The clipping function K(x,y)=min(y, max(−y,x)) corresponds to the function Clip3 (−y, y, x). The clipping operation introduces non-linearity to make ALF more efficient by reducing the impact of neighbor sample values that are too different from the current sample value.
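A small sketch of this filtering equation is given below with the clipping function K(x, y) = min(y, max(−y, x)). The coefficient normalization to 128 (a right shift by 7 with rounding offset 64) follows the fixed-point convention mentioned later in the text; the tiny two-tap example is purely illustrative, not an actual ALF filter shape.

```python
# Sketch of per-sample ALF filtering with clipping.

def K(x, y):
    """Clipping function: Clip3(-y, y, x)."""
    return min(y, max(-y, x))

def alf_sample(center, neighbors, coeffs, clips):
    """neighbors, coeffs, clips are parallel lists over the filter taps
    (excluding the center, whose coefficient is implicit). Coefficients
    are assumed normalized to 128, hence the (+64) >> 7."""
    acc = sum(f * K(r - center, c)
              for r, f, c in zip(neighbors, coeffs, clips))
    return center + ((acc + 64) >> 7)

# One center sample with two neighbor taps (illustrative values):
print(alf_sample(100, [110, 98], [32, 16], [8, 8]))
```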
CC-ALF uses luma sample values to refine each chroma component by applying an adaptive, linear filter to the luma channel and then using the output of this filtering operation for chroma refinement.
Filtering in CC-ALF is accomplished by applying a linear, diamond shaped filter (
where (x, y) is the location of the chroma component i sample being refined, (xY, yY) is the luma location derived based on (x, y), Si is the filter support area in the luma component, and ci(x0, y0) represents the filter coefficients.
As shown in
In the VVC reference software, CC-ALF filter coefficients are computed by minimizing the mean square error of each chroma channel with respect to the original chroma content. To achieve this, the VTM algorithm uses a coefficient derivation process similar to the one used for chroma ALF. Specifically, a correlation matrix is derived, and the coefficients are computed using a Cholesky decomposition solver in an attempt to minimize a mean square error metric. In designing the filters, a maximum of 8 CC-ALF filters can be designed and transmitted per picture. The resulting filters are then indicated for each of the two chroma channels on a CTU basis.
Additional characteristics of CC-ALF include:
As an additional feature, the reference encoder can be configured to enable some basic subjective tuning through the configuration file. When enabled, the VTM attenuates the application of CC-ALF in regions that are coded with high QP and are either near mid-grey or contain a large amount of luma high frequencies. Algorithmically, this is accomplished by disabling the application of CC-ALF in CTUs where any of the following conditions are true:
The motivation for this functionality is to provide some assurance that CC-ALF does not amplify artifacts introduced earlier in the decoding path (this is largely due to the fact that the VTM currently does not explicitly optimize for chroma subjective quality). It is anticipated that alternative encoder implementations would either not use this functionality or incorporate alternative strategies suitable for their encoding characteristics.
ALF filter parameters are signalled in the Adaptation Parameter Set (APS). In one APS, up to 25 sets of luma filter coefficients and clipping value indexes, and up to eight sets of chroma filter coefficients and clipping value indexes, can be signalled. To reduce bit overhead, filter coefficients of different classifications for the luma component can be merged. In the slice header, the indices of the APSs used for the current slice are signaled.
Clipping value indexes, which are decoded from the APS, allow determining clipping values using a table of clipping values for both luma and chroma components. These clipping values are dependent on the internal bitdepth. More precisely, the clipping values are obtained by the following formula:
where B is equal to the internal bitdepth, a is a pre-defined constant value equal to 2.35, and N, equal to 4, is the number of allowed clipping values in VVC. The AlfClip is then rounded to the nearest value in the format of a power of 2.
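The derivation above can be sketched literally as follows. Interpreting “rounded to the nearest value with the format of a power of 2” as rounding the exponent B − a·n to the nearest integer is an assumption about the elided formula, so the printed values are illustrative only.

```python
# Sketch of clipping-value derivation: AlfClip[n] ~ 2^(B - a*n), snapped
# to a power of two (exponent rounding is an interpretation assumption).

def alf_clips(bit_depth, a=2.35, n_values=4):
    clips = []
    for n in range(n_values):
        exponent = bit_depth - a * n          # B - a*n
        clips.append(1 << round(exponent))    # snap to a power of two
    return clips

print(alf_clips(10))   # four descending powers of two, starting at 2^10
```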
In slice header, up to 7 APS indices can be signaled to specify the luma filter sets that are used for the current slice. The filtering process can be further controlled at CTB level. A flag is always signalled to indicate whether ALF is applied to a luma CTB. A luma CTB can choose a filter set among 16 fixed filter sets and the filter sets from APSs. A filter set index is signaled for a luma CTB to indicate which filter set is applied. The 16 fixed filter sets are pre-defined and hard-coded in both the encoder and the decoder.
For chroma component, an APS index is signaled in slice header to indicate the chroma filter sets being used for the current slice. At CTB level, a filter index is signaled for each chroma CTB if there is more than one chroma filter set in the APS.
The filter coefficients are quantized with a norm equal to 128. In order to restrict the multiplication complexity, a bitstream conformance constraint is applied so that the coefficient value of a non-central position shall be in the range of −2^7 to 2^7−1, inclusive. The central position coefficient is not signalled in the bitstream and is considered equal to 128.
In VVC, to reduce the line buffer requirement of ALF, modified block classification and filtering are employed for the samples near horizontal CTU boundaries. For this purpose, a virtual boundary is defined as a line by shifting the horizontal CTU boundary with “N” samples as shown in
Modified block classification is applied for the luma component as depicted in
For the filtering process, a symmetric padding operation at the virtual boundaries is used for both luma and chroma components. As shown in
Different from the symmetric padding method used at horizontal CTU boundaries, a simple padding process is applied for slice, tile and subpicture boundaries when filtering across the boundaries is disabled. The simple padding process is also applied at the picture boundary. The padded samples are used for both the classification and filtering processes. To compensate for the extreme padding when filtering samples just above or below the virtual boundary, the filter strength is reduced for those cases, for both luma and chroma, by increasing the right shift in equation 24 by 3.
For the existing SAO design in the HEVC, VVC, AVS2 and AVS3 standards, the luma Y, chroma Cb and chroma Cr sample offset values are decided independently. That is, for example, the current chroma sample offset is decided by only the current and neighboring chroma sample values, without taking collocated or neighboring luma samples into consideration. However, luma samples preserve more original picture detail information than chroma samples, and they can benefit the decision of the current chroma sample offset. Furthermore, since chroma samples usually lose high-frequency details after color conversion from RGB to YCbCr, or after quantization and the deblocking filter, introducing luma samples with high-frequency details preserved for the chroma offset decision can benefit chroma sample reconstruction. Hence, further gain can be expected by exploring cross-component correlation. Note that the correlation here not only includes cross-component sample values but also includes picture/coding information such as prediction/residual coding modes, transform types, and quantization/deblocking/SAO/ALF parameters from cross-components.
Another example concerns SAO: the luma sample offsets are decided only by luma samples; however, a luma sample with the same BO classification can be further classified by its collocated and neighboring chroma samples, which may lead to a more effective classification. SAO classification can be taken as a shortcut to compensate the sample difference between the original picture and the reconstructed picture, so an effective classification is desired.
One focus of the disclosure is to improve the coding efficiency of luma and chroma components with a design spirit similar to SAO, but introducing cross-component information. SAO is used in the HEVC, VVC, AVS2 and AVS3 standards. To facilitate the description of the disclosure, the existing SAO technology in the abovementioned standards is briefly reviewed. Then, the proposed methods with examples are provided.
Please note that though the existing SAO design in the HEVC, VVC, AVS2, and AVS3 standards is used as the basic SAO method in the following description, to a person skilled in the art of video coding, the proposed cross-component method described in the disclosure can also be applied to other loop filter designs or other coding tools with similar design spirits. For example, in the AVS3 standard, SAO is replaced by a coding tool called Enhanced Sample Adaptive Offset (ESAO), however, the proposed CCSAO can also be applied in parallel with ESAO. Another example where CCSAO can be applied in parallel is Constrained Directional Enhancement Filter (CDEF) in the AV1 standard.
Note if the video is RGB format, the proposed CCSAO can also be applied by simply mapping YUV notation to GBR in the below paragraphs, for example.
Note the figures in this disclosure can be combined with all examples mentioned in this disclosure.
A classifier example (C0) is using the collocated luma or chroma sample value (Y0 in
The classification can take rounding into account:
Some band_num and bit_depth examples are listed as below in Table 3.
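The C0 band classifier described above can be sketched as follows. The exact classification equation is elided in the text, so both the plain quantization (y0 · band_num) >> bit_depth and the rounded variant below are assumptions consistent with the surrounding description.

```python
# Sketch of the C0 band classifier: the collocated sample value y0 is
# quantized into one of band_num equal bands.

def c0_class(y0, band_num, bit_depth=10, rounding=False):
    if rounding:
        # rounded variant (assumed form of the elided "rounding" equation)
        cls = (y0 * band_num + (1 << (bit_depth - 1))) >> bit_depth
        return min(cls, band_num - 1)    # keep the class index in range
    return (y0 * band_num) >> bit_depth

print(c0_class(700, 8))   # 700*8 >> 10 = 5
```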
Using different luma (or chroma) sample position for C0 classification can be another classifier. For example, using the neighboring Y7 but not Y0 for C0 classification, as shown in
Different color format can have different classifiers “constraints”. For example, YUV 420 format uses luma/chroma candidates selection in
The C0 position and C0 band_num can be combined and switched in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels. Different combinations can correspond to different classifiers, as shown in the following Table 5:
The collocated luma sample value (Y0) can be replaced by a value (Yp) by weighting collocated and neighboring luma samples.
Another classifier example (C1) is the comparison score [−8, 8] of the collocated luma samples (Y0) and neighboring 8 luma samples, which yields 17 classes in total:
The C1 example is equal to the following function where threshold “th” is 0:
where f(x,y)=1, if x−y>th; f(x,y)=0, if x−y=th; f(x,y)=−1, if x−y<th
Similar to the C4 classifier, one or plural thresholds can be predefined (e.g., kept in a LUT) or signaled in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels to help classify (quantize) the difference.
A variation (C1′) is only counting comparison score [0, 8], and this yields 8 classes. (C1, C1′) is a classifier group and a PH/SH level flag can be signaled to switch between C1 and C1′:
Initial Class (C1′) = 0; loop over the 8 neighboring luma samples (Yi, i = 1 to 8)   (Eq. 30)
A variation (C1s) is selectively using neighboring N out of M neighboring samples to count the comparison score. An M-bit bitmask can be signaled at SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels to indicate which neighboring samples are selected to count the comparison score. Using
Similar to C1s, a variation (C1's) only counts a comparison score in [0, +N]; the previous bitmask 01111110 example gives a comparison score in [0, 6], which yields 7 offsets.
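The C1 family above can be sketched in one function; the neighbor ordering for the bitmask bits is an assumption for illustration, and the `signed` switch models the C1 versus C1′ (and C1s versus C1's) distinction.

```python
# Sketch of the C1 comparison-score classifiers: the collocated luma
# sample y0 is compared against its 8 neighbors. Bit i of the 8-bit
# bitmask enables neighbor i (C1s variant); signed=False counts only
# positive comparisons (C1' / C1's variants).

def comparison_score(y0, neighbors, th=0, bitmask=0b11111111, signed=True):
    score = 0
    for i, yi in enumerate(neighbors):
        if not (bitmask >> i) & 1:
            continue                  # neighbor not selected (C1s)
        if y0 - yi > th:
            score += 1
        elif signed and y0 - yi < -th:
            score -= 1                # C1'/C1's skip this branch
    return score

nbrs = [100, 102, 98, 97, 103, 100, 99, 101]
print(comparison_score(100, nbrs))                  # C1: score in [-8, 8]
print(comparison_score(100, nbrs, signed=False))    # C1': score in [0, 8]
```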
Different classifiers can be combined to yield a general classifier. For example, the following Table 6 shows examples for different pictures:
Another classifier example (C2) is using the difference (Yn) of collocated and neighboring luma samples.
C0 and C2 can be combined to yield a general classifier, as shown in the following Table 7 for example:
Another classifier example (C3) is using a bitmask for classification. A 10-bit bitmask is signaled in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels to indicate the classifier. For example, bitmask 11 1100 0000 means that, for a given 10-bit luma sample value, only the MSB 4 bits are used for classification, and this yields 16 classes in total. Another example bitmask 10 0100 0001 means only 3 bits are used for classification, and this yields 8 classes in total. The bitmask length (N) can be fixed or switched in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels. For example, for a 10-bit sequence with a 4-bit bitmask 1110 signaled in the PH of a picture, the MSB 3 bits b9, b8, b7 are used for classification. Another example is a 4-bit bitmask 0011 on the LSB side, where b0 and b1 are used for classification. The bitmask classifier can apply to luma or chroma classification. Whether to use the MSB or LSB side for the N-bit bitmask can be fixed or switched in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels.
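The C3 bitmask classification can be sketched as follows. That the selected bits are compacted MSB-first into the class index is an assumption about the elided details; the examples use the bitmasks quoted above.

```python
# Sketch of the C3 bitmask classifier: only the sample bits selected by
# the mask participate, and the class index is formed by compacting those
# bits (MSB-first compaction is an assumption).

def c3_class(sample, bitmask, bit_depth=10):
    cls = 0
    for b in range(bit_depth - 1, -1, -1):        # walk bits MSB -> LSB
        if (bitmask >> b) & 1:
            cls = (cls << 1) | ((sample >> b) & 1)
    return cls

# bitmask 11 1100 0000: only the 4 MSBs classify -> 16 classes in total
print(c3_class(0b1011011001, 0b1111000000))   # MSB 4 bits 1011 -> 11
# bitmask 10 0100 0001: 3 selected bits -> 8 classes in total
print(c3_class(0b1011011001, 0b1001000001))   # bits b9,b6,b0 -> 0b111 -> 7
```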
The luma position and C3 bitmask can be combined and switched in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels. Different combinations can be different classifiers.
A “max number of 1s” restriction on the bitmask can be applied to restrict the corresponding number of offsets. For example, restricting the “max number of 1s” of the bitmask to 4 in the SPS yields at most 16 offsets in the sequence. The bitmask in different POCs can be different, but the “max number of 1s” shall not exceed 4 (the total number of classes shall not exceed 16). The “max number of 1s” value can be signaled and switched in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels, as shown in the following Table 8:
As shown in
All abovementioned classifiers (C0, C1, C1′, C2, C3) can be combined, as shown in the following Table 10 for example:
Another classifier example (C4) is using the difference between the CCSAO input and the to-be-compensated sample value for classification. For example, if CCSAO is applied in the ALF stage, the difference of the current component pre-ALF and post-ALF sample values is used for classification. One or plural thresholds can be predefined (e.g., kept in a LUT) or signaled in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels to help classify (quantize) the difference. The C4 classifier can be combined with the C0 Y/U/V bandNum to form a joint classifier (POC1 example), as shown in the following Table 11:
Another classifier example (C5) is using “coding information” to help subblock classification, since different coding modes may introduce different distortion statistics in the reconstructed image. A CCSAO sample is classified by its previous coding information, and the combination can form a classifier, as shown in the following Table 12 for example:
Another classifier example (C6) is using the YUV color-transformed value for classification. For example, to classify the current Y component, 1/1/1 collocated or neighboring Y/U/V samples are selected and color transformed to RGB, and the R value is quantized by the C3 bandNum to serve as the current Y component classifier.
Another classifier example (C7) can be taken as a generalized version of C0/C3 and C6. To derive the current component C0/C3 bandNum classification, all 3 color components' collocated and neighboring samples are used. For example, to classify the current U sample, collocated and neighboring Y/V, current and neighboring U samples are used as shown in
where S is the intermediate sample ready to be used for C0/C3 bandNum classification, Rij is the i-th component j-th collocated/neighboring sample, and cij is the weighting coefficient which can be predefined or signaled in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels. Note that one special subset case of C7 is using only 1/1/1 collocated or neighboring Y/U/V samples to derive the intermediate sample S, which can also be taken as a special case of C6 (color transform using 3 components). The S can be further fed into the C0/C3 bandNum classifier:
Note that, the same as the C0/C3 bandNum classifier, C7 can also be combined with other classifiers to form a joint classifier. Note that C7 is not the same as the later example which jointly uses collocated and neighboring Y/U/V samples for classification (3-component joint bandNum classification for each Y/U/V component).
One constraint can be applied: the sum of cij equals 1, to reduce the cij signaling overhead and limit the value of S within the bitdepth range. For example, force c00=(1−sum of other cij). Which cij (c00 in this example) is forced (derived from the other coefficients) can be predefined or signaled in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels.
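A minimal sketch of deriving the C7 intermediate sample S under the sum-of-weights constraint; `c7_intermediate_sample` and the floating-point weight representation are illustrative assumptions (a real codec would use fixed-point coefficients):

```python
def c7_intermediate_sample(samples, coeffs):
    """samples: collocated/neighboring Y/U/V values R_ij (flattened list);
    coeffs: weights c_ij for all samples except the first; c_00 is derived
    as 1 - sum(other c_ij) so the weights sum to 1, which keeps S within
    the bitdepth range when the inputs are in range."""
    c00 = 1.0 - sum(coeffs)          # forced coefficient, not signaled
    weights = [c00] + list(coeffs)
    S = sum(c * r for c, r in zip(weights, samples))
    return int(S)                    # S then feeds the C0/C3 bandNum classifier
```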
Another classifier example (C8) is using cross-component/current component spatial activity information as a classifier. Similar to the ALF block activity classifier, one sample located at (k,l) can get sample activity by
For example, using 2 direction laplacian gradient to get A and a predefined map {Qn} to get Â:
where (BD-6), or denoted as B, is a predefined normalization term associated with bitdepth.
A is then further mapped to the range of [0, 4]:
Note the B, Qn can be predefined or signaled in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels.
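The C8 activity derivation can be sketched as below. The function name and the Qn table values are assumptions (the disclosure only states that B and {Qn} can be predefined or signaled); the 2-direction Laplacian follows the description above:

```python
def c8_activity_class(rec, k, l, bitdepth):
    """2-direction Laplacian activity at (k, l), normalized by the
    bitdepth-dependent term B = bitdepth - 6, then mapped into [0, 4]
    by a predefined table {Qn} (table values here are assumptions)."""
    B = bitdepth - 6
    c = rec[k][l]
    vert = abs(2 * c - rec[k - 1][l] - rec[k + 1][l])
    horz = abs(2 * c - rec[k][l - 1] - rec[k][l + 1])
    A = (vert + horz) >> B
    Qn = [0, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4]  # assumed map
    return Qn[min(A, len(Qn) - 1)]
```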
Another classifier example (C9) is using cross-component/current component spatial gradient information as a classifier. Similar to the ALF block gradient classifier mentioned above, one sample located at (k,l) can get a sample gradient class by
For example, as the ALF block classifier but apply at sample level for sample classification,
Please note that, as in ALF, C8 and C9 can be combined to form a joint classifier.
Another classifier example (C10) is using cross/current component edge information for current component classification. By extending the original SAO classifier, C10 can extract the cross/current component edge information more effectively:
For example, as shown in
The direction patterns can be 0, 45, 90, 135 degrees (45 deg. per direction), or extending to 22.5 deg. per direction, or a predefined direction set, or signaled in SPS/APS/PPS/PH/SH/Region(Set)/CTU/CU/Subblock/Sample levels.
The edge strength can also be defined as (b−a), which simplifies the calculation but sacrifices precision.
The M−1 thresholds can be predefined or signaled in SPS/APS/PPS/PH/SH/Region(Set)/CTU/CU/Subblock/Sample levels.
The M−1 thresholds can be different sets for edge strength calculation, for example, different sets for (c−a), (c−b). If different sets are used, the total classes may be different. For example, [−T, 0, T] for calculating (c−a) but [−T, T] for (c−b), and total classes are 4*3.
The M−1 thresholds can use a “symmetric” property to reduce signaling overhead. For example, using a predefined pattern [−T, 0, T] rather than [T0, T1, T2], which requires signaling 3 threshold values. Another example is [−T, T].
The threshold values can contain only power-of-2 values, which not only effectively captures the edge strength distribution but also reduces comparison complexity (only the MSB N bits need to be compared).
The position of (a, b) can be indicated by signaling 2 syntaxes: (1) edgeDir indicates the selected direction, and (2) edgeStep indicates the sample distance used to calculate the edge strength, as shown in
The edgeDir/edgeStep can be predefined or signaled in SPS/APS/PPS/PH/SH/Region(Set)/CTU/CU/Subblock/Sample levels.
The edgeDir/edgeStep can be coded with FLC/TU/EGk/SVLC/UVLC codes.
Please note that C10 can be combined with bandNumY/U/V or other classifiers to form a joint classifier. For example, combining 16 edge strengths with max 4 bandNum Y bands yields 64 classes.
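A hedged sketch of the joint C10-edge plus bandNum classification; `quantize_strength`, the default T=16, and band_num_y=4 are illustrative assumptions chosen to match the 16-edge-strengths times 4-band (64-class) example above:

```python
import bisect

def quantize_strength(diff, thresholds):
    """Map a difference into one of len(thresholds)+1 intervals."""
    return bisect.bisect_right(thresholds, diff)

def c10_joint_class(c, a, b, y, T=16, band_num_y=4, bitdepth=10):
    """Edge class from (c-a) and (c-b) with symmetric thresholds [-T, 0, T]
    (4 * 4 = 16 edge strengths), jointly combined with a C0 bandNum Y band
    to yield 64 classes. T and band_num_y are assumptions."""
    thresholds = [-T, 0, T]
    edge_cls = (quantize_strength(c - a, thresholds) * 4
                + quantize_strength(c - b, thresholds))
    band_y = (y * band_num_y) >> bitdepth
    return band_y * 16 + edge_cls
```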
Other disclosed classifier examples which use only current component information for current component classification can be used as cross-component classification. For example, as in
Plural classifiers can be used in the same POC. The current frame is divided into several regions, and each region uses the same classifier. For example, 3 different classifiers are used in POC0, and which classifier (0, 1, or 2) is used is signaled at the CTU level, as shown in the following Table 13:
The max number of plural classifiers (plural classifiers can also be called alternative offset sets) can be fixed or signaled in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels. For example, the fixed (pre-defined) max number of plural classifiers is 4. 4 different classifiers are used in POC0, and which classifier (0, 1, 2, or 3) is used is signaled at the CTU level. A truncated-unary code can be used to indicate the classifier used for each luma or chroma CTB (0: CCSAO is not applied, 10: applied set 0, 110: applied set 1, 1110: applied set 2, 1111: applied set 3). Fixed-length code, golomb-rice code, and exponential-golomb code can also be used to indicate the classifier (offset set index) for a CTB. 3 different classifiers are used in POC1, as shown in the following Table 14:
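The truncated-unary binarization of the per-CTB offset set index described above can be sketched as follows (`tu_bits_for_ctb` is a hypothetical name; `choice=-1` encodes “CCSAO not applied”):

```python
def tu_bits_for_ctb(choice, max_sets=4):
    """Truncated-unary binarization of the per-CTB CCSAO decision:
    0 -> off, 10 -> set 0, 110 -> set 1, 1110 -> set 2, 1111 -> set 3.
    choice: -1 for off, otherwise the offset set index."""
    if choice < 0:
        return "0"
    if choice == max_sets - 1:       # last set needs no terminating 0
        return "1" * max_sets
    return "1" * (choice + 1) + "0"
```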
An example of Cb and Cr CTB offset set indices is given for a 1280×720 sequence POC0 (the number of CTUs in a frame is 10×6 if the CTU size is 128×128). POC0 Cb uses 4 offset sets and Cr uses 1 offset set (0: CCSAO is not applied, 1: applied set 0, 2: applied set 1, 3: applied set 2, 4: applied set 3). The “type” means the position of the chosen collocated luma sample (Yi). Different offset sets can have different types, band_num and corresponding offsets:
An example of jointly using collocated and neighboring Y/U/V samples for classification is listed (3-component joint bandNum classification for each Y/U/V component). In POC0, {2,4,1} offset sets are used for {Y,U,V}, respectively. Each offset set can be adaptively switched in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels. Different offset sets can have different classifiers. For example, as the candidate position (candPos) indicated in
Note the classIdx derivation of a joint classifier can be represented in an “or-shift” form, which simplifies the derivation process. For example,
Another example is in POC1 component V set1 classification, candPos={neighboring Y8, neighboring U3, neighboring V0} with bandNum={4,1,2} are used, and this yields 8 classes, as shown in the following Table 15:
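A sketch of the joint bandNum classIdx and its “or-shift” simplification for power-of-2 bandNums; the default bandNum={4,1,2} follows the POC1 V set1 example above, while the function name and shift bookkeeping are assumptions:

```python
def joint_band_class(y, u, v, band_num=(4, 1, 2), bitdepth=10):
    """Joint Y/U/V bandNum classIdx. With power-of-2 bandNums, the multiply
    form (bandY * NU + bandU) * NV + bandV reduces to an or-shift form."""
    ny, nu, nv = band_num
    band_y = (y * ny) >> bitdepth
    band_u = (u * nu) >> bitdepth
    band_v = (v * nv) >> bitdepth
    # or-shift form (valid because nu and nv are powers of 2)
    class_idx = (band_y << ((nu.bit_length() - 1) + (nv.bit_length() - 1))) \
                | (band_u << (nv.bit_length() - 1)) | band_v
    # sanity check against the general multiply form
    assert class_idx == (band_y * nu + band_u) * nv + band_v
    return class_idx
```

With bandNum={4,1,2} this yields 4·1·2 = 8 classes, matching the example.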
An example of jointly using collocated and neighboring Y/U/V samples for current Y/U/V sample classification is listed (3-component joint edgeNum (C1s) and bandNum classification for each Y/U/V component). edge candPos is the centered position used for the C1s classifier, edge bitMask is the C1s neighboring samples activation indicator, and edgeNum is the corresponding number of C1s classes. Note that in this example, C1s is only applied on the Y classifier (so edgeNum equals edgeNumY), with edge candPos always being Y4 (the current/collocated sample position). However, C1s can be applied on the Y/U/V classifiers, with edge candPos being a neighboring sample position.
With “diff” denoting Y C1s's comparison score, the classIdx derivation can be:
Please note that, as mentioned before, for a single component, plural C0 classifiers can be combined (different positions or weight combinations, bandNum) to form a joint classifier, and this joint classifier can be combined with other components to form another joint classifier. For example, using 2 Y samples (candY/candX and bandNumY/bandNumX), 1 U sample (candU and bandNumU), and 1 V sample (candV and bandNumV) to classify one U sample (Y/V can follow the same concept).
Some decoder normative or encoder conformance constraints can be applied if using plural C0 classifiers for one single component: (1) the selected C0 candidates must be mutually different (for example, candX != candY); (2) the newly added bandNum must not exceed the other bandNum (for example, bandNumX <= bandNumY). By applying intuitive constraints within one single component (Y), redundant cases can be removed to save bit cost and complexity.
The max band_num (bandNumY, bandNumU, or bandNumV) can be fixed or signaled in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels. For example, fixing max band_num=16 in the decoder, and for each frame, 4 bits are signaled to indicate the C0 band_num in this frame. Some other max band_num examples are listed in the following Table 17:
The max number of classes or offsets (combinations of jointly using multiple classifiers, for example, C1s edgeNum*C1 bandNum Y*bandNumU*bandNumV) for each set (or all set added) can be fixed or signaled in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels. For example, fixing max all sets added class_num=256*4, and an encoder conformance check or a decoder normative check can be used to check the constraint.
A restriction can be applied on the C0 classification: restrict band_num (bandNumY, bandNumU, or bandNumV) to only power-of-2 values. Instead of explicitly signaling band_num, a syntax band_num_shift is signaled. The decoder can use a shift operation to avoid multiplication. Different band_num_shift can be used for different components:
Another operation example is taking rounding into account to reduce error:
Class (C0)=((Y0<<band_num_shift)+(1<<(bit_depth−1)))>>bit_depth (Eq. 56)
For example, if band_num_max (Y, U, or V) is 16, the possible band_num_shift candidates are 0, 1, 2, 3, 4, corresponding to band_num=1, 2, 4, 8, 16, as shown in the following Tables:
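The equivalence between the multiplication form and the band_num_shift form can be illustrated as follows (hypothetical helper names):

```python
def c0_class_mult(y, band_num, bit_depth=10):
    """General C0 band classification: Class = (Y * band_num) >> bit_depth."""
    return (y * band_num) >> bit_depth

def c0_class_shift(y, band_num_shift, bit_depth=10):
    """Shift-only variant: band_num = 1 << band_num_shift, so the decoder
    replaces the multiplication with a left shift."""
    return (y << band_num_shift) >> bit_depth
```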
The classifier of Cb and Cr can be different. The Cb and Cr offsets for all classes can be signaled separately, as shown in the following Table 20 for example:
The max offset value can be fixed or signaled in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels. For example, the max offset is between [−15, 15]. Different components can have different max offset values.
The offset signaling can use DPCM. For example, offsets {3, 3, 2, 1, −1} can be signaled as {3, 0, −1, −1, −2}.
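The DPCM offset signaling example above, {3, 3, 2, 1, −1} signaled as {3, 0, −1, −1, −2}, can be sketched as (hypothetical helper names):

```python
def dpcm_encode(offsets):
    """Signal the first offset directly, then the delta to the previous one."""
    prev, out = 0, []
    for o in offsets:
        out.append(o - prev)
        prev = o
    return out

def dpcm_decode(deltas):
    """Inverse: accumulate the deltas back into absolute offsets."""
    prev, out = 0, []
    for d in deltas:
        prev += d
        out.append(prev)
    return out
```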
The classifier of Cb and Cr can be the same. The Cb and Cr offsets for all classes can be signaled jointly, as shown in the following Table 21 for example:
The classifier of Cb and Cr can be the same. The Cb and Cr offsets for all classes can be signaled jointly, with a sign flag difference, as shown in the following Table 22 for example:
The sign flag can be signaled for each class, as shown in the following Table 23 for example:
The classifier of Cb and Cr can be the same. The Cb and Cr offsets for all classes can be signaled jointly, with a weight difference. The weight (w) can be selected in a limited table, for example, +−¼, +−½, 0, +−1, +−2, +−4 . . . etc., where |w| only includes the power-of-2 values, as shown in the following Table 24:
The weights can be signaled for each class, as shown in the following Table 25 for example:
If plural classifiers are used in the same POC, different offset sets are signaled separately or jointly.
The previously decoded offsets can be stored for use by future frames. An index can be signaled to indicate which previously decoded offset set is used for the current frame, to reduce the offset signaling overhead. For example, POC0 offsets can be reused by POC2 by signaling offset set idx=0, as shown in the following Table 26:
The reuse offsets set idx for Cb and Cr can be different, as shown in the following Table 27:
The offset signaling can use additional syntax elements, start and length, to reduce signaling overhead. For example, band_num=256 but only the offsets of band_idx=37˜44 are signaled. In this example, the start and length syntax elements are both 8-bit fixed-length coded (which should match the band_num bits), as shown in the following Table 28:
If CCSAO is applied to all YUV 3 components, collocated and neighboring YUV samples can be jointly used for classification, and all the abovementioned offset signaling methods for Cb/Cr can be extended to Y/Cb/Cr. Note that different component offset sets can be stored and used separately (each component has its own stored sets) or jointly (each component shares the same stored set). A separate set example is shown in the following Table 29:
If a sequence bitdepth is higher than 10 (or a certain bitdepth), the offset can be quantized before signaling. On the decoder side, the decoded offset is dequantized before applying it; for example, for a 12-bit sequence, the decoded offsets are left shifted (dequantized) by 2:
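A minimal sketch of the offset dequantization for higher-bitdepth sequences; `base_bitdepth=10` reflects the example above and the function name is an assumption:

```python
def dequantize_offset(decoded_offset, bitdepth, base_bitdepth=10):
    """For sequences above the base bitdepth, the signaled offset is
    quantized; the decoder left-shifts it back before applying it,
    e.g. by 2 for a 12-bit sequence."""
    shift = max(0, bitdepth - base_bitdepth)
    return decoded_offset << shift
```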
Filter strength concept: the classifier offsets can be further weighted before applying to samples. The weight (w) can be selected in a limited table, for example, +−¼, +−½, 0, +−1, +−2, +−4 . . . etc., where |w| only includes the power-of-2 values. The weight index can be signaled at SPS/APS/PPS/PH/SH/Region(Set)/CTU/CU/Subblock/Sample levels. The quantized offset signaling can be taken as a subset of this weight application. If recursive CCSAO is applied as in
Weighting for different classifiers: plural classifiers' offsets can be applied to the same sample with a weight combination. Similar weight index mechanism can be signaled as mentioned above. The following equations show an example:
Instead of directly signaling CCSAO params in PH/SH, the previously used params/offsets can be stored in APS or a memory buffer for the next pictures/slices reuse. An index can be signaled in PH/SH to indicate which stored previous frame offsets are used for the current picture/slice. A new APS ID can be created to maintain the CCSAO history offsets. The following table shows one example using
where aps_adaptation_parameter_set_id provides an identifier for the APS for reference by other syntax elements.
When aps_params_type is equal to CCSAO_APS, the value of aps_adaptation_parameter_set_id shall be in the range of 0 to 7, inclusive (for example).
ph_sao_cc_y_aps_id specifies the aps_adaptation_parameter_set_id of the CCSAO APS that the Y colour component of the slices in the current picture refers to.
When ph_sao_cc_y_aps_id is present, the following applies:
APS update mechanism: a max number of APS offset sets can be predefined or signaled in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels. Different components can have different max number limitations. If the APS offset sets are full, the newly added offset set can replace one existing stored offset set with a FIFO, LIFO, or LRU mechanism, or by receiving an index value indicating which APS offset set should be replaced. Note that if the chosen classifier consists of candPos/edge info/coding info . . . etc., all the classifier information can be taken as part of the APS offset set and can also be stored in it with its offset values. The update mechanisms mentioned above can be predefined or signaled in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels.
A constraint can be applied (pruning): the newly received classifier info and offsets cannot be the same as any stored APS offset set (of the same component, or across different components).
For example, if C0 candPos/bandNum classifier is used, the max number of APS offset sets is 4 per Y/U/V, and FIFO update is used for Y/V, idx indicating updating is used for U, as shown in the following Table 31:
The pruning criterion can be relaxed to give a more flexible way for encoder trade-off:
The 2 criteria can be applied at the same time or individually. Whether to apply each criterion can be predefined or switched in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels.
N/thr can be predefined or switched in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels.
The FIFO update can be (1) updating from the previously left set idx circularly (if all are updated, starting from set 0 again) as in the above example, or (2) always updating from set 0. Note the update can be in PH (as in the example), or in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels, when receiving a new offset set.
For LRU update, the decoder maintains a count table which counts the “total offset set used count”, which can be refreshed in SPS/APS/per GOP structure/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels. The newly received offset set replaces the least-recently-used offset set in the APS. If 2 stored offset sets have the same count, FIFO/LIFO can be used, as shown for component Y in the following Table 32 for example:
Different components can have different update mechanisms.
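The LRU replacement described above can be sketched as follows; this simplified version resolves count ties by taking the lowest (oldest) slot, i.e., FIFO, and the function name is an assumption:

```python
def lru_replace(stored_sets, use_counts, new_set):
    """Replace the stored APS offset set with the smallest usage count.
    stored_sets: list of offset sets; use_counts: parallel list of
    'total offset set used count' values. Ties fall back to the first
    (oldest) slot, i.e. FIFO. Returns the replaced index."""
    min_count = min(use_counts)
    idx = use_counts.index(min_count)  # first slot on a tie (FIFO)
    stored_sets[idx] = new_set
    use_counts[idx] = 0                # fresh set starts unused
    return idx
```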
As described above, different components (for example, U/V) can share the same classifier (same candPos/edge info/coding info/offsets, and can additionally have a weight w modifier).
Since offset sets used by different pictures/slices may only have slight offset value differences, a “patch” concept can be used in the offset replacement mechanism. For example, when signaling a new offset set (OffsetNew), the offset values can be built on top of an existing APS stored offset set (OffsetOld). The encoder only signals delta values to update the old offset set (DPCM: OffsetNew=OffsetOld+delta). Note that in the following example, other choices than FIFO update (LRU, LIFO, or signaling an index indicating which set to be updated) can also be used. YUV components can use the same or different updating mechanisms. Also note that though the classifier candPos/bandNum does not change in this example, one can indicate to overwrite the set classifier by signaling an additional flag (0: only update set offsets, 1: update both set classifier and set offsets), as shown in the following Table 33:
The DPCM delta offset values can be signaled in FLC/TU/EGk (order=0,1, . . . ) codes. One flag can be signaled for each offset set indicating whether to enable DPCM signaling. The DPCM delta offset values, or the new added offset values (directly signaled without DPCM, when enable APS DPCM=0) (ccsao_offset_abs), can be dequantized/mapped before applying to the target offsets (CcSaoOffsetVal). The offset quantization step can be predefined or signaled in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels. The following equations show an example:
One constraint can be applied to reduce the direct offset signaling overhead: the updated offset values must have the same sign as the old offset values. With such an inferred offset sign, the newly updated offset does not need to transmit the sign flag again (ccsao_offset_sign_flag is inferred to be the same as the old offset in (1)).
Let R(x, y) be the input luma or chroma sample value before CCSAO, R′(x, y) be the output luma or chroma sample value after CCSAO:
When CCSAO is operated with other loop filters, the clip operation can be as follows.
Different clipping combinations give different trade-offs between correction precision and hardware temporary buffer size (register or SRAM bitwidth).
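A minimal sketch of applying a CCSAO offset with the Clip1 operation (clipping to the bitdepth range [0, 2^BD − 1]); the function name is an assumption:

```python
def apply_ccsao(r, offset, bitdepth=10):
    """R'(x, y) = Clip1(R(x, y) + offset), where R is the input luma or
    chroma sample before CCSAO and R' is the output sample after CCSAO."""
    return min(max(r + offset, 0), (1 << bitdepth) - 1)
```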
If any of the collocated and neighboring luma (chroma) samples used for classification is outside the current picture, CCSAO is not applied on the current chroma (luma) sample. For example, if
A variation is, if any of the collocated and neighboring luma or chroma samples used for classification is outside the current picture, repetitive or mirror padding is used to create the missing samples for classification, and CCSAO can be applied on the current luma or chroma samples, as shown in
Using luma samples for CCSAO classification may increase luma line buffer and hence increase decoder hardware implementation cost.
Solution 1 is to disable CCSAO for a chroma sample if any of its luma candidates crosses the VB (is outside the current chroma sample's VB).
Solution 2 is using repetitive padding from luma line −4 for “cross VB” luma candidates (repetitive padding from the nearest luma neighbor below the VB for “cross VB” chroma candidates).
Solution 3 is using mirror padding from below luma VB for “cross VB” luma candidates.
Solution 4 is “double sided symmetric padding” (similar to VVC ALF VB padding method).
The padding methods allow more luma or chroma samples to apply CCSAO, so more coding gain can be achieved.
Note that at the bottom picture (or slice, tile, brick) boundary CTU row, the samples below VB are processed at the current CTU row, so the special handling (Solution 1, 2, 3) is not applied at the bottom picture (or slice, tile, brick) boundary CTU row.
A restriction can be applied to reduce the CCSAO required line buffer, and to simplify boundary processing condition check.
The CCSAO applied region unit can be CTB based. That is, the on/off control, CCSAO parameters (offsets, luma candidate positions, band_num, bitmask . . . etc. used for classification, offset set index) are the same in one CTB.
The applied region can be “not aligned to CTB boundary”. For example, not aligned to chroma CTB boundary but top-left shift (4, 4) samples. The syntaxes (on/off control, CCSAO parameters) are still signaled for each CTB, but the truly applied region is not aligned to the CTB boundary.
The CCSAO applied region unit (mask size) can be variant (larger or smaller than CTB size). The mask size can be different for different components. The mask size can be switched in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels. For example, in PH, a series of mask on/off flags and offset set indices are signaled to indicate each CCSAO region information, as shown in the following Table 34:
The CCSAO applied region frame partition can be fixed. For example, partition the frame into N regions.
Each region can have its own region on/off control flag and CCSAO parameters. Also, if the region size is larger than CTB size, it can have both CTB on/off control flags and region on/off control flag.
Different CCSAO applied region can share the same region on/off control and CCSAO parameters. For example, in
If plural classifiers are used in one frame, the method regarding how to apply the classifier set index can be switched in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels. For example, in
For the default region case, a region level flag can be signaled if the CTBs in this region do not use the default set index but use another classifier set in this frame. For example, the following Table 36 shows a square partition into 4 regions:
The CCSAO applied region unit can be quad-tree/binary-tree/ternary-tree split from picture/slice/CTB level. Similar to CTB split, a series of split flags are signaled to indicate the CCSAO applied region partition.
The CCSAO applied region can be a specific area according to coding information (sample position, sample coded modes, loop filter parameters . . . etc.) inside a block. For example, 1) applied only on samples that are skip mode coded, or 2) only on N samples along CTU boundaries, or 3) only on 8×8 grid samples in the frame, or 4) only on DBF-filtered samples, or 5) only on the top M and left N rows in a CU, or (6) only on intra coded samples, or (7) only on samples in cbf=0 blocks, or (8) only on blocks with block QP in [N, M], where (N, M) can be predefined or signaled in SPS/APS/PPS/PH/SH/Region/CTU/CU/Subblock/Sample levels. The cross-component coding information may also be taken into account: (9) applied on chroma samples whose collocated luma samples are in cbf=0 blocks.
Whether to introduce the coding information applied region restriction can be predefined, or one control flag can be signaled in SPS/APS/PPS/PH/SH/Region(per alternative set)/CTU/CU/Subblock/Sample levels to indicate whether a specified piece of coding information is included/excluded in the CCSAO application. The decoder skips CCSAO processing for those areas according to the predefined conditions or the control flags. For example, the YUV components use different predefined/flag-controlled conditions, switched at the region (set) level. The CCSAO application judgement can be at the CU/TU/PU or sample level, as shown in the following Table 37:
Another example is reusing all or part of bilateral enabling constraint (predefined), as shown by the following code:
Excluding some specific areas may benefit CCSAO statistics collection. The offset derivation may be more precise or suitable for those areas that truly need to be corrected. For example, blocks with cbf=0 usually mean the block is perfectly predicted and may not need further correction. Excluding those blocks may benefit other areas' offset derivation.
Different applied regions can use different classifiers. For example, in a CTU, skip mode coded samples use C1, samples at the CU center use C2, and samples that are both skip mode coded and at the CU center use C3.
The predefined or flag control “coding information excluding area” mechanism can be used in DBF/Pre-SAO/SAO/BIF/CCSAO/ALF/CCALF/NNLF, or other loop filters.
The following table shows an example of CCSAO syntax. Note the binarization of each syntax element can be changed.
Note in AVS3, the term patch is similar with slice, and patch header is similar with slice header.
If a higher-level flag is off, the lower level flags can be inferred from it and no need to be signaled. For example, if ph_cc_sao_cb_flag is false in this picture, ph_cc_sao_cb_band_num_minus1, ph_cc_sao_cb_luma_type, cc_sao_cb_offset_sign_flag, cc_sao_cb_offset_abs, ctb_cc_sao_cb_flag, cc_sao_cb_merge_left_flag, cc_sao_cb_merge_up_flag are not present and inferred to be false.
The SPS ccsao_enabled_flag can be conditioned on the SPS SAO enabled flag.
ph_cc_sao_cb_ctb_control_flag, ph_cc_sao_cr_ctb_control_flag indicate whether to enable Cb/Cr CTB on/off control granularity. If enabled, ctb_cc_sao_cb_flag and ctb_cc_sao_cr_flag can be further signaled. Otherwise, whether CCSAO is applied in the current picture depends on ph_cc_sao_cb_flag, ph_cc_sao_cr_flag, without further signaling ctb_cc_sao_cb_flag and ctb_cc_sao_cr_flag at CTB level.
For ph_cc_sao_cb_type and ph_cc_sao_cr_type, a flag can be further signaled to distinguish if the center collocated luma position is used (Y0 position in
The following table shows an example in AVS in which a single (set_num=1) or plural (set_num>1) classifiers are used in the frame. Note the syntax notation can be mapped to the notation used above.
If combined with
For high level syntax, similar to SAO, pps_ccsao_info_in_ph_flag and gci_no_sao_constraint_flag can be added.
pps_ccsao_info_in_ph_flag equal to 1 specifies that ccsao filter information could be present in the PH syntax structure and not present in slice headers referring to the PPS that do not contain a PH syntax structure. pps_ccsao_info_in_ph_flag equal to 0 specifies that ccsao filter information is not present in the PH syntax structure and could be present in slice headers referring to the PPS. When not present, the value of pps_ccsao_info_in_ph_flag is inferred to be equal to 0.
gci_no_ccsao_constraint_flag equal to 1 specifies that sps_ccsao_enabled_flag for all pictures in OlsInScope shall be equal to 0. gci_no_ccsao_constraint_flag equal to 0 does not impose such a constraint.
The SAO classification methods included in this disclosure (including cross-component sample/coding info classification) can serve as a post prediction filter, wherein the prediction can be intra, inter, or other prediction tools such as Intra Block Copy.
The refined prediction samples (Ypred′, Upred′, Vpred′) are updated by adding the corresponding class offset and are used for intra, inter, or other prediction thereafter:
For chroma U and V components, besides the current chroma component, the cross-component (Y) can be used for further offset classification. The additional cross-component offset (h′_U, h′_V) can be added on the current component offset (h_U, h_V). The following Table 40 shows an example:
The refined prediction samples (Upred″, Vpred″) are updated by adding the corresponding class offset and are used for intra, inter, or other prediction thereafter. The following equations show an example:
The intra and inter prediction can use different SAO filter offsets.
The SAO/CCSAO classification methods included in this disclosure (including cross-component sample/coding info classification) can serve as a filter applied on reconstructed samples of a TU. As in
To efficiently decide the best CCSAO parameters in one picture, a hierarchical rate-distortion (RD) optimization algorithm is designed, including 1) a progressive scheme for searching the best single classifier; 2) a training process for refining the offset values for one classifier; 3) a robust algorithm to effectively allocate suitable classifiers for different local regions. A typical CCSAO classifier in ECM-2.0 is as follows:
where {Ycol, Ucol, Vcol} are the three collocated samples that are used to classify the current sample; {NY, NU, NV} are the numbers of bands that are applied to the Y, U and V components, respectively; BD is the coding bitdepth; Crec and Crec′ are the reconstructed samples before and after the CCSAO is applied; σCCSAO[i] is the value of the CCSAO offset that is applied to the i-th category; Clip1(·) is the clipping function that clips the input to the range of the bitdepth, i.e., [0, 2BD−1]; >> represents the right-shift operation. In the proposed CCSAO, the collocated luma sample can be chosen from 9 candidate positions while the collocated chroma sample is fixed. The blocks of the present disclosure can be any type of coding blocks, such as a coding tree block (CTB) for example.
Specifically, for searching the best classifier which consists of N categories (NY·NU·NV), the multi-stage early termination method of
In some embodiments, the method further comprises determining an initial RD cost for the block; and determining that the first RD cost is less than the initial RD cost. The initial RD cost represents the RD cost for the block when no CCSAO filter is applied.
In some embodiments, the method further comprises repeating the operation of determining the RD cost for the block using an additional classifier having a category number in the next value range in response to determining the RD cost is decreasing, until a threshold value range is reached; and applying the last used classifier as the classifier for the CCSAO process. For example, if the first RD cost is more than the second RD cost, the method may continue to find the best classifier in the next value ranges (as in the example above, the method may continue to find the best classifier(s) having the category number between 33 and 48, between 49 and 64, etc.) as long as the newly found best classifier has the least RD cost. This iterative process may stop when the threshold value range is reached. For example, the method may specify that the searching stops when the value range of (113 to 128) is reached. In some embodiments, the iterative process may stop when the number of repetitions reaches a certain value. For example, the process may stop when the operation of determining the RD cost for the block using an additional classifier having a category number in the next value range has been repeated 5 times. When the process stops, the method applies the last used classifier, which has the least RD cost, as the classifier for the CCSAO process. In this way, when classifiers with more categories no longer improve the RD cost, these classifiers are skipped. Also, multiple breakpoints can be set for N-category early termination based on different configurations. For example, for the coding configuration of AI, the breakpoint is 4 categories (NY·NU·NV<4, 8, 12 . . . ); for the coding configuration of RA/LB, the breakpoint is 16 categories (NY·NU·NV<16, 32, 48, 64 . . . ).
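The progressive early-termination search described above can be sketched as follows, assuming the candidate classifiers are pre-grouped by category-number range; all names are illustrative:

```python
def search_best_classifier(candidates_by_range, rd_cost, stop_range=8):
    """Progressive search sketch: for each successive category-number range,
    take the classifier with the least RD cost in that range; stop when the
    next range no longer improves the cost, or when the threshold range is
    reached. candidates_by_range: list of lists of classifiers (one list per
    value range, in increasing order); rd_cost: callable returning the RD
    cost of a classifier."""
    best_cls, best_cost = None, float("inf")
    for i, candidates in enumerate(candidates_by_range):
        range_best = min(candidates, key=rd_cost)
        range_cost = rd_cost(range_best)
        if range_cost >= best_cost:      # RD cost stopped decreasing
            break
        best_cls, best_cost = range_best, range_cost
        if i + 1 >= stop_range:          # threshold value range reached
            break
    return best_cls, best_cost
```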
In some embodiments, the value ranges are adjacent ranges. For example, the adjacent ranges are (1 to 16), (17 to 32), etc. However, the value ranges can also be overlapping intervals, such as (1 to 16), (9 to 24), etc.
In some embodiments, the classifiers use EO (Edge Offset) or BO (Band Offset) or a combination thereof for classification. In some embodiments, a classifier is skipped in response to determining that the classifier comprises a BO classifier in which the number of bands for the Y component is less than the number of bands for the U component or the number of bands for the V component. For example, a classifier with band numbers (2·4·1) can be skipped because the number of bands for the Y component (i.e., 2) is less than the number of bands for the U component (i.e., 4).
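The pruning rule above reduces the search space with a simple comparison of band counts. A sketch, with the function name chosen here for illustration:

```python
def skip_bo_classifier(n_y, n_u, n_v):
    """Return True when a BO classifier should be skipped because the
    Y band count is smaller than the U or V band count."""
    return n_y < n_u or n_y < n_v
```

With band numbers (2·4·1) this returns True (skip), while (4·2·1) returns False and the classifier is evaluated normally.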
The progressive scheme of
Specifically, for a given classifier, the reconstructed samples in the frame are first classified according to Eq. 70, Eq. 71 and Eq. 72. The SAO fast distortion estimation is used to derive the first set of offset values for the categories of the given classifier. In some embodiments, the method further comprises determining an initial RD cost for the block without applying any CCSAO filter. The first set of offset values is assigned to the block based on determining that a first RD cost associated with the first set of offset values is less than the initial RD cost. Otherwise, the method may disable CCSAO for the block based on determining that the initial RD cost is less than the first RD cost. The method of
For one category k, s(k) and x(k) are the original samples and the samples before CCSAO at the sample positions of that category; E is the sum of differences between s(k) and x(k); N is the sample count; ΔD is the estimated delta distortion from applying offset h; ΔJ is the RD cost; λ is the Lagrange multiplier; and R is the bit cost.
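These quantities admit a closed-form fast distortion estimate: adding offset h to every sample of a category changes the squared-error distortion by N·h² − 2·h·E, so only the accumulated statistics E and N are needed, not the samples themselves. A sketch under that standard SAO-style estimate; the offset range and the `bit_cost` model below are illustrative assumptions, not taken from the source:

```python
def fast_delta_distortion(E, N, h):
    """Estimated delta distortion dD from applying offset h to a category
    with sum-of-differences E and sample count N: dD = N*h^2 - 2*h*E."""
    return N * h * h - 2 * h * E

def best_offset(E, N, lam, bit_cost, max_abs=7):
    """Pick the offset minimizing dJ = dD + lambda * R.

    lam is the Lagrange multiplier; bit_cost(h) -> R is an assumed rate
    model; max_abs bounds the offset range (both illustrative choices).
    """
    best_h, best_j = 0, 0.0  # h = 0 means no offset is applied (dJ = 0)
    for h in range(-max_abs, max_abs + 1):
        j = fast_delta_distortion(E, N, h) + lam * bit_cost(h)
        if j < best_j:
            best_h, best_j = h, j
    return best_h, best_j
```

For example, with E = 20 and N = 10 (an average difference of 2 per sample), the distortion-optimal offset is h = 2, giving ΔD = 10·4 − 2·2·20 = −40; the rate term λ·R then nudges the RD-optimal choice.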
The original samples can be true original samples (raw image samples without pre-processing) or Motion Compensated Temporal Filter (MCTF, a classical encoding algorithm that pre-processes the original samples before encoding) original samples.
λ can be the same as that of SAO/ALF, or weighted by a factor (according to configuration/resolution).
The encoder optimizes CCSAO by trading off the total RD cost across all categories.
The statistic data E and N for each category are stored for each block for further determination of plural region classifiers.
Specifically, to investigate whether a second classifier benefits the whole frame quality, the blocks with CCSAO enabled are first sorted in ascending or descending order according to distortion (or according to RD cost, including bit cost). Next, a portion of the sorted blocks (i.e., a predefined/dependent ratio of the blocks, e.g., (setNum−1)/setNum−1) is excluded. In some embodiments, excluding a portion of the sorted blocks includes excluding half of the sorted blocks. For example, the half of the blocks with smaller distortion keep the same classifier, while the other half of the blocks are trained with a new second classifier. In some embodiments, the excluded blocks may be trained with different classifiers. For example, during the block on-off offset refinement, each block may select its best classifier, so a good classifier may propagate to more blocks. In the spirit of shuffling and diffusion, this strategy provides both randomness and robustness for the parameter decision. If the current number of classifiers does not further improve the RD cost, additional classifiers are skipped.
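The sort-and-split step can be sketched as below for the half-and-half case. The block representation (dicts with `enabled` and `distortion` keys) and the function name are illustrative assumptions; the exclusion ratio here is hard-coded to the "exclude half" embodiment.

```python
def split_for_second_classifier(blocks, set_num=2):
    """Sort CCSAO-enabled blocks by ascending distortion and hand the
    higher-distortion portion to a new classifier for training.

    For set_num=2 roughly half the blocks keep the current classifier;
    the split ratio interpretation is an assumption for this sketch.
    """
    enabled = sorted((b for b in blocks if b["enabled"]),
                     key=lambda b: b["distortion"])
    keep = len(enabled) // set_num  # lower-distortion blocks keep the classifier
    return enabled[:keep], enabled[keep:]  # (keep current, retrain with new)
```

The blocks returned in the second list are the candidates for the new second classifier; during the subsequent block on-off refinement each block may still switch to whichever trained classifier gives it the least RD cost.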
It should be noted that the flowcharts as illustrated herein provide examples of sequences of various operations. Although shown in a particular sequence or order, unless otherwise specified, the order of the operations can be modified. Thus, the illustrated embodiments should be understood as an example, and the operations can be performed in a different order, and some operations can be performed in parallel or in sequence. Additionally, one or more operations can be omitted in various embodiments; thus, not all operations are required in every embodiment.
The processor 4420 typically controls overall operations of the computing environment 4410, such as the operations associated with the display, data acquisition, data communications, and image processing. The processor 4420 may include one or more processors to execute instructions to perform all or some of the steps in the above-described methods. Moreover, the processor 4420 may include one or more modules that facilitate the interaction between the processor 4420 and other components. The processor may be a Central Processing Unit (CPU), a microprocessor, a single chip machine, a GPU, or the like.
The memory 4440 is configured to store various types of data to support the operation of the computing environment 4410. The memory 4440 may include predetermined software 4442. Examples of such data comprise instructions for any applications or methods operated on the computing environment 4410, video datasets, image data, etc. The memory 4440 may be implemented by using any type of volatile or non-volatile memory devices, or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic or optical disk.
The I/O interface 4450 provides an interface between the processor 4420 and peripheral interface modules, such as a keyboard, a click wheel, buttons, and the like. The buttons may include but are not limited to, a home button, a start scan button, and a stop scan button. The I/O interface 4450 can be coupled with an encoder and decoder.
In some embodiments, there is also provided a non-transitory computer-readable storage medium comprising a plurality of programs, such as comprised in the memory 4440, executable by the processor 4420 in the computing environment 4410, for performing the above-described methods. For example, the non-transitory computer-readable storage medium may be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disc, an optical data storage device or the like.
The non-transitory computer-readable storage medium has stored therein a plurality of programs for execution by a computing device having one or more processors, where the plurality of programs when executed by the one or more processors, cause the computing device to perform the above-described methods.
In some embodiments, the computing environment 4410 may be implemented with one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), graphical processing units (GPUs), controllers, micro-controllers, microprocessors, or other electronic components, for performing the above methods.
The present disclosure describes a hardware implementation for an apparatus according to one or more aspects of the present disclosure. The apparatus for encoding video data or decoding video data may include a memory and at least one processor. The processor may be coupled to the memory and configured to perform the above mentioned processes described above with reference to
The various operations, methods, and systems described in connection with the disclosure herein may be implemented in hardware, software executed by a processor, firmware, or any combination thereof. According to one or more aspects of the present disclosure, a computer program product for encoding video data or decoding video data may include processor executable computer code for performing the above mentioned processes described above with reference to
The preceding description is provided to enable any person skilled in the art to make or use various embodiments according to one or more aspects of the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the scope of the various embodiments. Thus, the claims are not intended to be limited to the embodiments shown herein but are to be accorded the widest scope consistent with the following claims and the principles and novel features disclosed herein.
This application is a continuation application of PCT Application No. PCT/US2022/049269, filed on Nov. 8, 2022, which is based upon and claims priority to Provisional Application No. 63/277,110 filed on Nov. 8, 2021, the entire contents of which are incorporated herein by reference for all purposes.
Number | Date | Country
---|---|---
63277110 | Nov 2021 | US
 | Number | Date | Country
---|---|---|---
Parent | PCT/US2022/049269 | Nov 2022 | WO
Child | 18658178 | | US