VIDEO CODING METHOD AND STORAGE MEDIUM

Abstract
A video coding method and a storage medium are provided. When performing NNLF on a reconstructed picture, an encoding end can select an optimal mode from a chroma fusion mode and other modes to perform NNLF and set a corresponding flag, where an NNLF model used in the chroma fusion mode is trained using training data obtained by adjusting chroma information.
Description
TECHNICAL FIELD

Embodiments of the disclosure relate to, but are not limited to, video technology, and in particular to a video coding method and a storage medium.


BACKGROUND

Digital video compression technology mainly compresses a huge amount of digital image and video data to facilitate transmission, storage, and the like. A picture of an original video sequence contains a luma component and a chroma component. In digital video coding, an encoder reads a black-and-white picture or a color picture and partitions each picture into largest coding units (LCUs) of equal size (e.g., 128×128, 64×64, etc.). Each LCU may be partitioned into rectangular coding units (CUs) according to a rule, and a CU may be further partitioned into prediction units (PUs), transform units (TUs), etc. A hybrid coding framework includes modules such as prediction, transform, quantization, entropy coding, and in-loop filter. The prediction module may use intra prediction and inter prediction. In intra prediction, sample information of a current block is predicted based on information of the same picture, thereby eliminating spatial redundancy. Inter prediction can refer to information of different pictures and use motion estimation to search for motion vector information that best matches the current block, so as to eliminate temporal redundancy. In the transform, the prediction residual is transformed into the frequency domain to redistribute its energy, and combined with quantization, information to which human eyes are not sensitive is removed, thereby eliminating visual redundancy. Entropy coding can eliminate statistical redundancy according to a current context model and probability information of a binary bitstream, to generate a bitstream.


With the sharp increase in Internet video and increasingly high requirements on video clarity, although existing digital video compression standards can already save a large amount of video data, it is still necessary to pursue better digital video compression technologies to reduce the bandwidth and traffic pressure of digital video transmission.


SUMMARY

A video decoding method is provided in an embodiment of the disclosure. The method is performed by a video decoding apparatus and includes the following. A first flag of a reconstructed picture is decoded, where the first flag contains information of a neural network based loop filtering (NNLF) mode to be used when performing NNLF on the reconstructed picture. The NNLF mode to be used when performing NNLF on the reconstructed picture is determined according to the first flag, and NNLF is performed on the reconstructed picture according to the determined NNLF mode. The NNLF mode includes a first mode and a second mode. The second mode includes a chroma fusion mode. Training data for training a model used in the chroma fusion mode includes augmented data obtained after performing specified adjustment on chroma information of the reconstructed picture in original data or includes the original data and the augmented data.


A video encoding method is further provided in an embodiment of the disclosure. The method is performed by a video encoding apparatus and includes the following. A rate-distortion cost for performing NNLF on an input reconstructed picture by using a first mode is calculated, and a rate-distortion cost for performing NNLF on the reconstructed picture by using a second mode is calculated. It is determined to perform NNLF on the reconstructed picture by using a mode with a minimum rate-distortion cost between the first mode and the second mode. The first mode and the second mode each are a set NNLF mode. The second mode includes a chroma fusion mode. Training data for training a model used in the chroma fusion mode includes augmented data obtained after performing specified adjustment on chroma information of the reconstructed picture in original data or includes the original data and the augmented data.


A non-transitory computer-readable storage medium is further provided in an embodiment of the disclosure. The non-transitory computer-readable storage medium stores a bitstream. The bitstream is generated according to the video encoding method of any embodiment of the disclosure.





BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are used to provide an understanding of embodiments of the disclosure, constitute a part of the specification, serve to explain technical solutions of the disclosure together with embodiments of the disclosure, and do not limit the technical solutions of the disclosure.



FIG. 1A is a schematic diagram of a coding system according to an embodiment.



FIG. 1B is a framework diagram of an encoding end in FIG. 1A.



FIG. 1C is a framework diagram of a decoding end in FIG. 1A.



FIG. 2 is a block diagram of a filter unit according to an embodiment.



FIG. 3A is a structural diagram of a network of a neural network based loop filtering (NNLF) filter according to an embodiment.



FIG. 3B is a structural diagram of a residual block in FIG. 3A.



FIG. 3C is a schematic diagram illustrating an input to and output of the NNLF filter in FIG. 3A.



FIG. 4A is a structural diagram of a backbone of an NNLF filter according to another embodiment.



FIG. 4B is a structural diagram of a residual block in FIG. 4A.



FIG. 4C is a schematic diagram illustrating an input to and output of the NNLF filter in FIG. 4A.



FIG. 5 is a schematic diagram illustrating arrangement of feature maps of input information.



FIG. 6 is a flowchart of an NNLF method applied to an encoding end according to an embodiment of the disclosure.



FIG. 7 is a block diagram of a filter unit according to an embodiment of the disclosure.



FIG. 8 is a flowchart of a video encoding method according to an embodiment of the disclosure.



FIG. 9 is a flowchart of an NNLF method applied to a decoding end according to an embodiment of the disclosure.



FIG. 10 is a flowchart of a video decoding method according to an embodiment of the disclosure.



FIG. 11 is a schematic diagram of a hardware structure of an NNLF filter according to an embodiment of the disclosure.



FIG. 12A is a schematic diagram illustrating an input to and output of an NNLF filter without adjusting orders of chroma information according to an embodiment of the disclosure.



FIG. 12B is a schematic diagram illustrating an input to and output of an NNLF filter with adjusting orders of chroma information according to an embodiment of the disclosure.



FIG. 13 is a flowchart of an NNLF method applied to a decoding end according to an embodiment of the disclosure.



FIG. 14 is a flowchart of a video decoding method according to an embodiment of the disclosure.



FIG. 15 is a flowchart of an NNLF method applied to an encoding end according to an embodiment of the disclosure.



FIG. 16 is a flowchart of a video encoding method according to an embodiment of the disclosure.



FIG. 17 is a flowchart of a training method for an NNLF model according to an embodiment of the disclosure.





DETAILED DESCRIPTION

Multiple embodiments are described in the disclosure, but the description is exemplary rather than limiting, and it will be apparent to those of ordinary skill in the art that more embodiments and implementations may be included within the scope of the embodiments described in the disclosure.


In the description of the disclosure, the words “exemplary” or “for example” are used as examples, illustrations, or explanations. Any embodiment described in the disclosure as “exemplary” or “for example” should not be construed as being more preferred or advantageous than other embodiments. Herein, “and/or” describes the association relationship of associated objects and indicates that three relationships may exist; for example, “A and/or B” may mean: only A, both A and B, or only B. “Multiple” means two or more. In addition, to clearly describe the technical solutions of embodiments of the disclosure, the words “first”, “second”, and the like are used to distinguish between identical or similar items with substantially the same functions and effects. Those skilled in the art will appreciate that the words “first”, “second”, and the like do not limit the quantity or order of execution, and do not necessarily indicate a difference.


In describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not depend on the particular order of the steps described herein, the method or process should not be limited to that particular order of steps. As will be understood by those of ordinary skill in the art, other sequences of steps are also possible. Accordingly, the particular sequence of steps set forth in the specification should not be construed as limiting the claims. Furthermore, the claims for the method and/or process should not be limited to performing their steps in the written order; those skilled in the art can readily understand that the sequences may vary and still remain within the spirit and scope of embodiments of the disclosure.


A neural network based loop filtering (NNLF) method, a video encoding method, and a video decoding method in embodiments of the disclosure can be applied to various video coding standards, such as H.264/Advanced Video Coding (AVC), H.265/High Efficiency Video Coding (HEVC), H.266/Versatile Video Coding (VVC), Audio Video Coding Standard (AVS), and other standards developed by Moving Picture Experts Group (MPEG), Alliance for Open Media (AOM), Joint Video Experts Team (JVET) and extensions of these standards, or any other customized standards.



FIG. 1A is a block diagram of a video coding system that can be used in embodiments of the disclosure. As illustrated in the figure, the system is divided into an encoding end apparatus 1 and a decoding end apparatus 2. The encoding end apparatus 1 is configured to generate a bitstream. The decoding end apparatus 2 can decode the bitstream. The decoding end apparatus 2 can receive the bitstream from the encoding end apparatus 1 via a link 3. The link 3 includes one or more media or devices that can transfer the bitstream from the encoding end apparatus 1 to the decoding end apparatus 2. In one example, the link 3 includes one or more communication media that enable the encoding end apparatus 1 to send the bitstream directly to the decoding end apparatus 2. The encoding end apparatus 1 modulates the bitstream according to a communication standard and sends the modulated bitstream to the decoding end apparatus 2. The one or more communication media may include wireless and/or wired communication media, and may form part of a packet network. In another example, the bitstream can also be output from an output interface 15 to a storage device, and the decoding end apparatus 2 can read stored data from the storage device via streaming or downloading.


As illustrated in the figure, the encoding end apparatus 1 includes a data source 11, a video encoding apparatus 13, and the output interface 15. The data source 11 includes a video capturing apparatus (such as a camera), an archive containing previously captured data, a feed interface for receiving data from a content provider, a computer graphics system for generating data, or a combination of these sources. The video encoding apparatus 13, also called a video encoder, is configured to encode data from the data source 11 and outputs encoded data to the output interface 15. The output interface 15 may include at least one of a modulator, a modem, or a transmitter. The decoding end apparatus 2 includes an input interface 21, a video decoding apparatus 23, and a display apparatus 25. The input interface 21 includes at least one of a receiver or a modem. The input interface 21 can receive a bitstream via the link 3 or from the storage device. The video decoding apparatus 23, also called a video decoder, is configured to decode the received bitstream. The display apparatus 25 is configured to display decoded data. The display apparatus 25 can be integrated with other devices of the decoding end apparatus 2 or arranged separately. The display apparatus 25 is optional for the decoding end. In other examples, the decoding end may include other devices or equipment for the decoded data.



FIG. 1B is a block diagram of an exemplary video encoding apparatus that can be used in embodiments of the disclosure. As illustrated in the figure, the video encoding apparatus 10 includes a partition unit 101, a prediction unit 100, an inter prediction unit 121, an intra prediction unit 126, a residual generation unit 102, a transform unit 104, a quantization unit 106, an inverse quantization unit 108, an inverse transform unit 110, a reconstruction unit 112, a filter unit 113, a decoded picture buffer 114, and an entropy coding unit 115.


The partition unit 101 is configured to cooperate with the prediction unit 100 to partition received video data into slices, coding tree units (CTUs), or other relatively large units. The received video data may be a video sequence including video pictures such as I-pictures, P-pictures, or B-pictures.


The prediction unit 100 is configured to partition the CTU into coding units (CUs), and perform intra prediction coding or inter prediction coding on the CU. For performing intra prediction and inter prediction on the CU, the CU can be partitioned into one or more prediction units (PUs).


The prediction unit 100 includes an inter prediction unit 121 and an intra prediction unit 126.


The inter prediction unit 121 is configured to perform inter prediction on the PU and generate prediction data for the PU. The prediction data includes a prediction block for the PU, motion information for the PU, and various syntax elements for the PU. The inter prediction unit 121 may include a motion estimation (ME) unit and a motion compensation (MC) unit. The motion estimation unit can be configured for motion estimation to generate a motion vector. The motion compensation unit can be configured to obtain or generate a prediction block according to the motion vector.


The intra prediction unit 126 is configured to perform intra prediction on the PU and generate prediction data for the PU. The prediction data for the PU may include a prediction block and various syntax elements for the PU.


The residual generation unit 102 (indicated by a circle with a plus sign after the partition unit 101 in the figure) is configured to generate a residual block for the CU by subtracting, from the original block for the CU, the prediction blocks for the PUs into which the CU is partitioned.


The transform unit 104 is configured to partition the CU into one or more transform units (TUs), and the partition for the prediction unit and the partition for the transform unit may be different. A residual block associated with the TU is a sub-block obtained by partitioning the residual block of the CU. A coefficient block associated with the TU is generated by applying one or more transforms to the residual block associated with the TU.


The quantization unit 106 is configured to quantize a coefficient(s) in a coefficient block based on a quantization parameter(s) (QP(s)). The quantization degree of the coefficient block can be changed by adjusting the QP(s).


The inverse quantization unit 108 and the inverse transform unit 110 are configured to apply inverse quantization and inverse transform to the coefficient block, respectively, to obtain a reconstructed residual block associated with the TU.


The reconstruction unit 112 (indicated by a circle with a plus sign after the inverse transform unit 110 in the figure) is configured to add the reconstructed residual block and the prediction block generated by the prediction unit 100 to generate a reconstructed picture.


The filter unit 113 is configured to perform loop filtering on the reconstructed picture.


The decoded picture buffer 114 is configured to store the loop-filtered reconstructed picture. The intra prediction unit 126 can obtain, from the decoded picture buffer 114, reconstructed reference information adjacent to the current block to perform intra prediction. The inter prediction unit 121 can use a previous reference picture buffered in the decoded picture buffer 114 to perform inter prediction on the PU of the current picture.


The entropy coding unit 115 is configured to perform entropy coding on received data (e.g., syntax elements, quantized coefficient blocks, motion information, etc.), so as to generate a video bitstream.


In other examples, the video encoding apparatus 10 may include more, fewer, or different functional components than this example; for example, the transform unit 104, the inverse transform unit 110, etc. may be omitted in some cases.



FIG. 1C is a block diagram of an exemplary video decoding apparatus that can be used in embodiments of the disclosure. As illustrated in the figure, the video decoding apparatus 15 includes an entropy decoding unit 150, a prediction unit 152, an inverse quantization unit 154, an inverse transform unit 155, a reconstruction unit 158, a filter unit 159, and a decoded picture buffer 160.


The entropy decoding unit 150 is configured to perform entropy decoding on a received encoded video bitstream, and extract syntax elements, quantized coefficient blocks, motion information for the PU, etc. The prediction unit 152, the inverse quantization unit 154, the inverse transform unit 155, the reconstruction unit 158, and the filter unit 159 can all perform corresponding operations based on the syntax elements extracted from the bitstream.


The inverse quantization unit 154 is configured to perform inverse quantization on a coefficient block associated with a quantized TU.


The inverse transform unit 155 is configured to apply one or more inverse transforms to an inverse quantized coefficient block to generate a reconstructed residual block for the TU.


The prediction unit 152 includes an inter prediction unit 162 and an intra prediction unit 164. If a current block is encoded using intra prediction, the intra prediction unit 164 can determine an intra prediction mode for the PU based on syntax elements decoded from the bitstream, and perform intra prediction according to reconstructed reference information adjacent to the current block obtained from the decoded picture buffer 160. If the current block is encoded using inter prediction, the inter prediction unit 162 can determine a reference block for the current block based on motion information for the current block and corresponding syntax elements for the current block, and perform inter prediction based on the reference block obtained from the decoded picture buffer 160.


The reconstruction unit 158 (indicated by a circle with a plus sign after the inverse transform unit 155 in the figure) is configured to obtain a reconstructed picture based on the reconstructed residual block associated with the TU and a prediction block for the current block generated by the prediction unit 152 through intra prediction or inter prediction.


The filter unit 159 is configured to perform loop filtering on the reconstructed picture.


The decoded picture buffer 160 is configured to store the loop-filtered reconstructed picture which is used as a reference picture for subsequent motion compensation, intra prediction, inter prediction, etc., and can also output the filtered reconstructed picture as decoded video data for presentation on the display apparatus.


In other embodiments, the video decoding apparatus 15 may include more, fewer, or different functional components; for example, the inverse transform unit 155 may be omitted in some cases.


Based on the above-mentioned video encoding apparatus and video decoding apparatus, the following basic encoding and decoding processes can be performed. At the encoding end, a picture is partitioned into blocks, intra prediction, inter prediction, or another algorithm is performed on the current block to generate a prediction block for the current block, the prediction block is subtracted from the original block for the current block to obtain a residual block, the residual block is transformed and quantized to obtain quantization coefficients, and entropy coding is performed on the quantization coefficients to generate a bitstream. At the decoding end, intra prediction or inter prediction is performed on the current block to generate a prediction block for the current block; on the other hand, the quantization coefficients obtained by decoding the bitstream are inversely quantized and inversely transformed to obtain the residual block, the prediction block and the residual block are added to obtain a reconstructed block, reconstructed blocks constitute a reconstructed picture, and loop filtering is performed on the reconstructed picture on a picture or block basis to obtain a decoded picture. Through operations similar to those of the decoding end, the encoding end also obtains a decoded picture, which may also be referred to as a loop-filtered reconstructed picture. The loop-filtered reconstructed picture can be used as a reference picture for inter prediction of subsequent pictures. Information determined by the encoding end, such as block partition information, mode information, and parameter information for prediction, transform, quantization, entropy coding, in-loop filter, etc., can be signalled into the bitstream. By decoding the bitstream or analyzing preset information, the decoding end determines the block partition information, mode information, and parameter information for prediction, transform, quantization, entropy coding, in-loop filter, etc. used by the encoding end, thereby ensuring that the decoded picture obtained by the encoding end is the same as the decoded picture obtained by the decoding end.


Although the above is an example of a block-based hybrid coding framework, embodiments of the disclosure are not limited thereto. With the development of technology, one or more modules in the framework and one or more steps in the process may be replaced or optimized.


Embodiments of the disclosure relate to, but are not limited to, the above filter units (the filter units may also be referred to as loop filtering modules) in the encoding end and decoding end and corresponding loop filtering methods.


In an embodiment, the filter units in the encoding end and decoding end each include tools such as a deblocking filter (DBF) 20, a sample adaptive offset (SAO) 22, and an adaptive loop filter (ALF) 26, and further include a neural network based loop filter 24 between the SAO and the ALF, as illustrated in FIG. 2. The filter unit performs loop filtering on the reconstructed picture, which can compensate for distortion information and provide a better reference for subsequent sample encoding.


In an exemplary embodiment, an NNLF solution is provided, and the model used (also referred to as a network model) adopts a filtering network as illustrated in FIG. 3A. Herein, this NNLF is referred to as NNLF1, and a filter for performing NNLF1 is referred to as an NNLF1 filter. As illustrated in the figure, a backbone of the filtering network includes multiple residual blocks (ResBlock) connected in sequence, and further includes a convolutional layer (indicated by Conv in the figure), an activation function layer (ReLU in the figure), a concat layer (indicated by Cat in the figure), and a pixel shuffle layer (indicated by PixelShuffle in the figure). A structure of each residual block is illustrated in FIG. 3B, which includes a convolutional layer with a convolutional-kernel size of 1×1, a ReLU layer, a convolutional layer with a convolutional-kernel size of 1×1, and a convolutional layer with a convolutional-kernel size of 3×3 that are connected in sequence.


As illustrated in FIG. 3A, an input to the NNLF1 filter includes luma information (i.e., a Y component) and chroma information (i.e., a U component and a V component) of a reconstructed picture (rec_YUV) and multiple pieces of auxiliary information, such as luma information and chroma information of a prediction picture (pred_YUV), QP information, and picture-type information. The QP information includes a base quantization parameter (BaseQP) set by default in an encoding configuration and a slice quantization parameter (SliceQP) of a current slice, and the picture-type information includes a slice type (SliceType), i.e., a type of a picture to which the current slice belongs. An output of the model is a filtered picture (output_YUV) produced by the NNLF1 filter, and the filtered picture output by the NNLF1 filter can also be used as a reconstructed picture input to a subsequent filter.


In NNLF1, one model is used to filter YUV components of the reconstructed picture (rec_YUV) and output YUV components of the filtered picture (out_YUV), as illustrated in FIG. 3C, where auxiliary input information such as YUV components of the prediction picture is omitted in the figure. A filtering network of this model has a skip-connection branch between the input reconstructed picture and the output filtered picture, as illustrated in FIG. 3A.


In another exemplary embodiment, another NNLF solution is provided, which is referred to as NNLF2. In NNLF2, two models need to be trained separately. One model is used to filter a luma component of a reconstructed picture, and the other model is used to filter two chroma components of the reconstructed picture. These two models can use the same filtering network, which also has a skip-connection branch between a reconstructed picture input to an NNLF2 filter and a filtered picture output by the NNLF2 filter. As illustrated in FIG. 4A, a backbone of the filtering network includes multiple residual blocks with attention mechanisms (AttRes Block) connected in sequence, a convolutional layer (Conv 3×3) for feature mapping, and a shuffle layer (Shuffle). A structure of each of the residual blocks with attention mechanisms is illustrated in FIG. 4B, which includes a convolutional layer (Conv 3×3), an activation layer (PReLU), a convolutional layer (Conv 3×3), and an attention layer (Attention) that are connected in sequence, where M represents the number of feature maps, and N represents the number of samples in one dimension.


In NNLF2, model 1 for filtering the luma component of the reconstructed picture is illustrated in FIG. 4C, input information of model 1 includes the luma component of the reconstructed picture (rec_Y), and an output of model 1 is a luma component of the filtered picture (out_Y), where auxiliary input information such as a luma component of a prediction picture is omitted in the figure. In NNLF2, model 2 for filtering the two chroma components of the reconstructed picture is illustrated in FIG. 4C, input information of model 2 includes the two chroma components of the reconstructed picture (rec_UV) and the luma component of the reconstructed picture (rec_Y) serving as the auxiliary input information, and an output of model 2 is two chroma components of the filtered picture (out_UV). Model 1 and model 2 may further contain other auxiliary input information such as QP information, block partition pictures, deblocking filtering boundary-strength information, etc.


The above NNLF1 solution and NNLF2 solution can be implemented by neural network based common software (NCS) in neural network based video coding (NNVC), and each serve as a baseline tool in the NNVC reference software test platform, i.e., a baseline NNLF.


In video coding, reference to picture information of a previous picture is allowed for a current picture in an inter prediction technology, thereby improving the coding performance, and coding effects of the previous picture may also affect coding effects of a subsequent picture. In NNLF1 and NNLF2 solutions, to adapt a filtering network to the influence of the inter prediction technology, the model training process thereof includes an initial training stage and an iterative training stage, and a multi-round training mode is used. In the initial training stage, a model to-be-trained has not yet been deployed in an encoder, and a first-round trained model is obtained by performing first-round training on the model based on collected sample data of the reconstructed picture. In the iterative training stage, each model is deployed in the encoder. Specifically, first, the first-round trained model is deployed in the encoder, and a second-round trained model is obtained through re-collecting of sample data of the reconstructed picture and second-round training of the first-round trained model. Then, the second-round trained model is deployed in the encoder, and a third-round trained model is obtained through re-collecting of sample data of the reconstructed picture and third-round training of the second-round trained model, and such iterative training is repeated. Finally, for a model obtained after each round of training, coding testing is performed on a validation set, so as to find a model with the best coding performance for actual deployment.


As mentioned above, for the luma component and chroma component of the reconstructed picture that are input to the neural network, NNLF1 uses a joint-input mode, as illustrated in FIG. 3C, and only one network model needs to be trained; in NNLF2, the luma component and chroma component of the reconstructed picture are input separately, as illustrated in FIG. 4C, and two models need to be trained. For the two chroma components, i.e., the U component and the V component, both NNLF1 and NNLF2 use the joint-input mode, which means that the two components are bound together. As illustrated in FIG. 5, when stacked feature maps are input to a model for prediction and output, the three components of each picture are arranged in the order of Y, U, and V, with the U component at the front and the V component at the back. Currently, there is a lack of solutions that study the influence of adjusting the input orders of the U component and V component on the neural network.


A method for adjusting chroma information is proposed in an embodiment of the disclosure. By adjusting chroma information input to an NNLF filter, for example, swapping orders of the U component and V component, the coding performance of the NNLF filter can be further optimized.


An NNLF method is provided in an embodiment of the disclosure. The method is applied to a video encoding apparatus. As illustrated in FIG. 6, the method includes the following.


At S110, a rate-distortion cost for performing NNLF on an input reconstructed picture by using a first mode is calculated, and a rate-distortion cost for performing NNLF on the reconstructed picture by using a second mode is calculated.


At S120, it is determined to perform NNLF on the reconstructed picture by using a mode with a minimum rate-distortion cost between the first mode and the second mode.


The first mode and the second mode each are a set NNLF mode. The second mode includes a chroma adjustment mode. Compared with the first mode, specified adjustment of input chroma information before filtering is added in the chroma adjustment mode.


Tests have shown that adjustment of the input chroma information can affect the coding performance. In this embodiment, an optimal mode can be selected according to a rate-distortion cost from the second mode in which the chroma information is adjusted and the first mode in which the chroma information is not adjusted, thereby improving the coding performance.


In an exemplary embodiment of the disclosure, specified adjustment is performed on the chroma information in any one or more of the following adjustment manners. Orders of two chroma components of the reconstructed picture are swapped, for example, a sort order in which the U component is at the front and the V component is at the back is adjusted to a sort order in which the V component is at the front and the U component is at the back. A weighted average value and a square error value of the two chroma components of the reconstructed picture are calculated, and the weighted average value and the square error value are determined as chroma information to be input in the NNLF mode.


In an exemplary embodiment of the disclosure, the reconstructed picture is a reconstructed picture of a current picture or a current slice or a current block, but may also be a reconstructed picture of another coding unit. Herein, a reconstructed picture processed by an NNLF filter may belong to coding units of different levels, such as the picture level (including a picture and a slice) and the block level.


In an exemplary embodiment of the disclosure, a network structure of a model used in the first mode is the same as or different from a network structure of a model used in the second mode.


In an exemplary embodiment of the disclosure, the second mode further includes a chroma fusion mode. Training data for training a model used in the chroma fusion mode includes augmented data obtained after performing specified adjustment on chroma information of the reconstructed picture in original data or includes the original data and the augmented data. A network model used in the first mode is trained using the original data.


In an exemplary embodiment of the disclosure, the second mode further includes a chroma adjustment and fusion mode. Compared with the first mode, specified adjustment of input chroma information before filtering is added in the chroma adjustment and fusion mode, and training data for training a model used in the chroma adjustment and fusion mode includes augmented data obtained after performing specified adjustment on chroma information of the reconstructed picture in original data or includes the original data and the augmented data. A network model used in the first mode is trained using the original data.


In an exemplary embodiment of the disclosure, the first mode includes (or is implemented as) one first mode, and the second mode includes (or is implemented as) one or more second modes. The rate-distortion cost cost1 for performing NNLF on the input reconstructed picture by using the first mode is calculated as follows. A first filtered picture output after performing NNLF on the reconstructed picture by using the first mode is obtained, and cost1 is calculated according to a difference between the first filtered picture and a corresponding original picture. The rate-distortion cost cost2 for performing NNLF on the reconstructed picture by using the second mode is calculated as follows. For each of the one or more second modes, a second filtered picture output after performing NNLF on the reconstructed picture by using the second mode is obtained, and cost2 of the second mode is calculated according to a difference between the second filtered picture and the corresponding original picture.


In an example of this embodiment, the second mode includes multiple second modes. A mode with a minimum rate-distortion cost between the first mode and the second mode is a mode corresponding to a minimum value among cost1 and multiple cost2 calculated.


In an example of this embodiment, the difference is represented by a sum of squared differences (SSD). In other examples, the difference may also be represented by a mean squared error (MSE), a mean absolute error (MAE), or another indicator, which is not limited in the disclosure. The same applies to other embodiments of the disclosure.


In an exemplary embodiment of the disclosure, the method further includes the following. A filtered picture obtained by performing NNLF on the reconstructed picture by using the mode with the minimum rate-distortion cost is determined as a filtered picture output after performing NNLF on the reconstructed picture.


In an exemplary embodiment of the disclosure, an NNLF filter for performing NNLF at an encoding end and/or a decoding end is arranged after a deblocking filter or a sample adaptive offset and before an adaptive loop filter. In an example of this embodiment, a structure of a filter unit (or referred to as a loop filtering module, as illustrated in FIG. 1B and FIG. 1C) is illustrated in FIG. 7, where the DBF represents the deblocking filter, the SAO represents the sample adaptive offset, and the ALF represents the adaptive loop filter. NNLF-A represents an NNLF filter using the first mode, and NNLF-B represents an NNLF filter using the second mode such as the chroma adjustment mode, which may also be referred to as a chroma adjustment (CA) module. The NNLF-A filter may be the same as an NNLF filter without chroma adjustment (or called chroma information adjustment), such as the NNLF1 filter and NNLF2 filter.


Although the NNLF-A filter and the NNLF-B filter are illustrated in FIG. 7, a single model can be used in actual implementation, where the NNLF-B filter can be regarded as the NNLF-A filter with chroma adjustment added. Based on this model, loop filtering is performed once without adjusting the input chroma information (e.g., without swapping the UV orders), i.e., NNLF is performed on the reconstructed picture by using the first mode (or in other words, by using the NNLF-A filter), and the rate-distortion cost cost1 is calculated according to the difference, such as an SSD, between the first filtered picture output and the original picture. Then, with the input chroma information adjusted, loop filtering is performed once by using the model, i.e., NNLF is performed on the reconstructed picture by using the second mode (or in other words, by using the NNLF-B filter), and the rate-distortion cost cost2 is calculated according to the difference between the second filtered picture output and the original picture, so that the NNLF mode to be used can be determined according to the magnitudes of cost1 and cost2. Information of the NNLF mode to be used (i.e., the selected NNLF mode) is indicated by a first flag and is signalled into a bitstream for a decoder to read. At the decoding end, the first flag is decoded to determine the NNLF mode actually used by the encoding end, and NNLF is performed on the input reconstructed picture by using the determined NNLF mode.


For example, if cost1<cost2, NNLF is performed on the reconstructed picture by using the first mode. If cost2<cost1, NNLF is performed on the reconstructed picture by using the second mode. If cost1=cost2, NNLF is performed on the reconstructed picture by using the first mode or the second mode.


During loop filtering, part or all of the DBF, the SAO, and the ALF may be disabled. In addition, a deployment position of the NNLF filter is not limited to the position illustrated in the figure. It can be easily understood that the implementation of the NNLF method in the disclosure is not limited by the deployment position. Furthermore, filters in the filter unit are not limited to the filters illustrated in FIG. 7; more or fewer filters can be used, or other types of filters can be used.


A video encoding method is further provided in an embodiment of the disclosure. The method is applied to a video encoding apparatus and includes the following. When performing NNLF on a reconstructed picture, as illustrated in FIG. 8, the following operations are executed.


At S210, when chroma adjustment is enabled for NNLF, NNLF is performed on the reconstructed picture according to the NNLF method of any embodiment of the disclosure, where the NNLF method is applied to an encoding end and uses a chroma adjustment mode.


At S220, a first flag of the reconstructed picture is encoded, where the first flag contains information of an NNLF mode to be used when performing NNLF on the reconstructed picture.


The first mode and the second mode each are a set NNLF mode. The second mode includes a chroma adjustment mode. Compared with the first mode, specified adjustment of input chroma information before filtering is added in the chroma adjustment mode.


In this embodiment, an optimal mode can be selected according to a rate-distortion cost from the second mode in which the chroma information is adjusted and the first mode in which the chroma information is not adjusted, and the selected mode is signalled into a bitstream, thereby improving the coding performance.


In an exemplary embodiment of the disclosure, the first flag is a picture-level syntax element or a block-level syntax element.


In an exemplary embodiment of the disclosure, when one or more of the following conditions are satisfied, it is determined that chroma adjustment is enabled for NNLF. A sequence-level chroma adjustment enable-flag is encoded, and it is determined according to the sequence-level chroma adjustment enable-flag that chroma adjustment is enabled for NNLF. A picture-level chroma adjustment enable-flag is encoded, and it is determined according to the picture-level chroma adjustment enable-flag that chroma adjustment is enabled for NNLF.


In an exemplary embodiment of the disclosure, the method further includes the following. When it is determined that chroma adjustment is not enabled for NNLF, NNLF is performed on the reconstructed picture by using the first mode, and encoding of a chroma adjustment enable-flag is skipped.


In an exemplary embodiment of the disclosure, the second mode is the chroma adjustment mode, and the first flag indicates whether chroma adjustment is to be performed when performing NNLF on the reconstructed picture. The first flag of the reconstructed picture is encoded as follows. When it is determined to perform NNLF on the reconstructed picture by using the first mode, the first flag is set to a value indicating that chroma adjustment is not to be performed. When it is determined to perform NNLF on the reconstructed picture by using the second mode, the first flag is set to a value indicating that chroma adjustment is to be performed.


In an exemplary embodiment of the disclosure, the second mode includes (or is implemented as) multiple second modes, and the first flag indicates whether the second mode is to be used when performing NNLF on the reconstructed picture. The first flag of the reconstructed picture is encoded as follows. When it is determined to perform NNLF on the reconstructed picture by using the first mode, the first flag is set to a value indicating that the second mode is not to be used. When it is determined to perform NNLF on the reconstructed picture by using the second mode, the first flag is set to a value indicating that the second mode is to be used, and a second flag is further encoded, where the second flag contains index information of one second mode with a minimum rate-distortion cost among the multiple second modes.


A video encoding method applied to an encoding end is provided in an embodiment of the disclosure, which mainly involves NNLF. In this embodiment, the second mode is a chroma adjustment mode.


During loop filtering of a reconstructed picture, if a filter unit includes multiple filters, the reconstructed picture is processed in a specified order of the filters. When the data input to an NNLF filter, such as the reconstructed picture (which may be a filtered picture output by another filter), is obtained, the following operations are executed.


Step a), whether chroma adjustment is enabled for NNLF in a current sequence is determined according to a sequence-level chroma adjustment enable-flag ca_enable_flag. If ca_enable_flag is “1”, chroma adjustment is attempted for the current sequence, and the process proceeds to step b). If ca_enable_flag is “0”, chroma adjustment is not enabled for NNLF in the current sequence; in this case, NNLF is performed on the reconstructed picture by using the first mode, encoding of a first flag is skipped, and the process ends.


Step b), for a reconstructed picture of a current picture of the current sequence, NNLF is first performed by using the first mode, that is, original input information is input to an NNLF model for filtering, and a first filtered picture is obtained from an output of the model.


Step c), then, NNLF is performed by using the second mode (the chroma adjustment mode in this embodiment), that is, the orders of the U component and the V component of the input reconstructed picture are swapped, the reconstructed picture is then input to the NNLF model for filtering, and a second filtered picture is obtained from an output of the model.


Step d), a rate-distortion cost C_NNLF is calculated according to a difference between the first filtered picture and an original picture, and a rate-distortion cost C_CA is calculated according to a difference between the second filtered picture and the original picture. The two rate-distortion costs are compared. If C_CA < C_NNLF, it is determined to perform NNLF on the reconstructed picture by using the second mode, and the second filtered picture is determined as the filtered picture which is finally output after performing NNLF on the reconstructed picture. If C_CA ≥ C_NNLF, it is determined to perform NNLF on the reconstructed picture by using the first mode, and the first filtered picture is determined as the filtered picture which is finally output after performing NNLF on the reconstructed picture.


A formula for calculating a rate-distortion cost in this embodiment is as follows.

cost = W_Y × SSD(Y) + W_U × SSD(U) + W_V × SSD(V)

    • In the above, SSD(*) indicates calculating an SSD for a certain color component. W_Y, W_U, and W_V indicate the weights of the SSDs for the Y component, the U component, and the V component, respectively, for example, in a ratio of 10:1:1 or 8:1:1.


A formula for calculating an SSD is as follows.

SSD = Σ_{x=0}^{M} Σ_{y=0}^{N} |rec(x, y) − org(x, y)|²

    • In the above, M indicates the length of the reconstructed picture of the current picture, N indicates the width of the reconstructed picture of the current picture, rec(x, y) indicates the pixel value of the reconstructed picture at a pixel (x, y), and org(x, y) indicates the pixel value of the original picture at the pixel (x, y).





Step e), the first flag is encoded according to the mode (i.e., the selected mode) used for the current picture, so as to indicate whether to perform chroma adjustment. In this case, the first flag may also be referred to as a chroma adjustment enable-flag picture_ca_enable_flag, which indicates whether chroma adjustment needs to be performed when performing NNLF on the reconstructed picture.


After the current picture has been processed, a reconstructed picture of a next picture is loaded and processed in the same way.


In this embodiment, NNLF is performed in a unit of the reconstructed picture of the current picture. In other embodiments, NNLF may also be performed based on other coding units such as a block (for example, a CTU) and a slice in the current picture.


As illustrated in FIG. 12A, for a certain NNLF filter, when the first mode is used, input information of the filter may be arranged in the order of {recY, recU, recV, predY, predU, predV, baseQP, sliceQP, slicetype, . . . }, and output information of the filter may be arranged in the order of {cnnY, cnnU, cnnV}, where rec indicates the reconstructed picture, pred indicates a prediction picture, and cnn indicates an output filtered picture.


When the second mode in which the chroma information is adjusted is used, the arrangement order of the input information of the filter is adjusted to {recY, recV, recU, predY, predV, predU, baseQP, sliceQP, slicetype, . . . }, and the arrangement order of the output information of the filter is adjusted to {cnnY, cnnV, cnnU}, as illustrated in FIG. 12B.


This embodiment explores the generalization of inputs of NNLF by adjusting input orders of the U component and V component in the chroma information, which can further improve the filtering performance of a neural network. In addition, only a flag with few bits needs to be encoded as a control switch, which basically has no influence on the decoding complexity.


In this embodiment, for the three YUV components of a picture, a joint rate-distortion cost is used to make the determination. In other embodiments, a more refined operation may also be executed, where rate-distortion costs for each component in different modes are calculated respectively, and a mode with a minimum rate-distortion cost is selected for each component.


An NNLF method is further provided in an embodiment of the disclosure. The method is applied to a video decoding apparatus. As illustrated in FIG. 9, the method includes the following.


At S310, a first flag of a reconstructed picture is decoded, where the first flag contains information of an NNLF mode to be used when performing NNLF on the reconstructed picture.


At S320, the NNLF mode to be used when performing NNLF on the reconstructed picture is determined according to the first flag, and NNLF is performed on the reconstructed picture according to the determined NNLF mode.


The NNLF mode includes a first mode and a second mode. The second mode includes a chroma adjustment mode. Compared with the first mode, specified adjustment of input chroma information before filtering is added in the chroma adjustment mode.


In the NNLF method of this embodiment, a better mode is selected according to the first flag from a mode in which the chroma information is adjusted and a mode in which the chroma information is not adjusted, which can enhance the filtering effect of NNLF and improve the quality of a decoded picture.


In an exemplary embodiment of the disclosure, specified adjustment is performed on the chroma information in any one or more of the following adjustment manners. Orders of two chroma components of the reconstructed picture are swapped. A weighted average value and a square error value of the two chroma components of the reconstructed picture are calculated, and the weighted average value and the square error value are determined as chroma information to be input in the NNLF mode.


In an exemplary embodiment of the disclosure, the reconstructed picture is a reconstructed picture of a current picture or a current slice or a current block.


In an exemplary embodiment of the disclosure, the first flag is a picture-level syntax element or a block-level syntax element.


In an exemplary embodiment of the disclosure, the second mode further includes a chroma fusion mode. Training data for training a model used in the chroma fusion mode includes augmented data obtained after performing specified adjustment on chroma information of the reconstructed picture in original data or includes the original data and the augmented data. A model used in the first mode is trained using the original data.


In an exemplary embodiment of the disclosure, the second mode further includes a chroma adjustment and fusion mode. Compared with the first mode, specified adjustment of input chroma information before filtering is added in the chroma adjustment and fusion mode, and training data for training a model used in the chroma adjustment and fusion mode includes augmented data obtained after performing specified adjustment on chroma information of the reconstructed picture in original data or includes the original data and the augmented data. A model used in the first mode is trained using the original data.


In an exemplary embodiment of the disclosure, the first mode includes one first mode, and the second mode includes one or more second modes. A network structure of a model used in the first mode is the same as or different from a network structure of a model used in the second mode.


In an exemplary embodiment of the disclosure, the second mode is the chroma adjustment mode, and the first flag indicates whether chroma adjustment is to be performed when performing NNLF on the reconstructed picture. The NNLF mode to be used when performing NNLF on the reconstructed picture is determined according to the first flag as follows. When the first flag indicates that chroma adjustment is to be performed, it is determined that the second mode is to be used when performing NNLF on the reconstructed picture. When the first flag indicates that chroma adjustment is not to be performed, it is determined that the first mode is to be used when performing NNLF on the reconstructed picture.


In an example of this embodiment, a picture header is defined as follows.















picture_header( ) {                                  Descriptor
    ... ...
    if (ca_enable_flag) {
        picture_ca_enable_flag                       u(1)
    }
    ... ...
}

    • In the table, ca_enable_flag is a sequence-level chroma adjustment enable-flag. When ca_enable_flag is 1, the following semantics may be defined: picture_ca_enable_flag indicates a picture-level chroma adjustment enable-flag, i.e., the above first flag. When picture_ca_enable_flag is 1, it indicates that chroma adjustment is to be performed when performing NNLF on the reconstructed picture, that is, the second mode (the chroma adjustment mode in this embodiment) is to be used. When picture_ca_enable_flag is 0, it indicates that chroma adjustment is not to be performed when performing NNLF on the reconstructed picture, that is, the first mode is to be used. When ca_enable_flag is 0, decoding and encoding of picture_ca_enable_flag are skipped.





In an exemplary embodiment of the disclosure, the second mode includes multiple second modes, and the first flag indicates whether the second mode is to be used when performing NNLF on the reconstructed picture. The NNLF mode to be used when performing NNLF on the reconstructed picture is determined according to the first flag as follows. When the first flag indicates that the second mode is not to be used, it is determined that the first mode is to be used when performing NNLF on the reconstructed picture. When the first flag indicates that the second mode is to be used, a second flag is further decoded, where the second flag contains index information of one second mode to be used among the multiple second modes. It is determined according to the second flag that the one second mode among the multiple second modes is to be used when performing NNLF on the reconstructed picture.


A video decoding method is further provided in an embodiment of the disclosure. The method is applied to a video decoding apparatus and includes the following. When performing NNLF on a reconstructed picture, as illustrated in FIG. 10, the following operations are executed.


At S410, whether chroma adjustment is enabled for NNLF is determined.


At S420, when chroma adjustment is enabled for NNLF, NNLF is performed on the reconstructed picture according to the NNLF method of any embodiment of the disclosure, where the NNLF method is applied to a decoding end and can use a chroma adjustment mode.


In the video decoding method of this embodiment, a better mode is selected according to the first flag from a mode in which the chroma information is adjusted and a mode in which the chroma information is not adjusted, which can enhance the filtering effect of NNLF and improve the quality of a decoded picture.


In an exemplary embodiment of the disclosure, when one or more of the following conditions are satisfied, it is determined that chroma adjustment is enabled for NNLF. A sequence-level chroma adjustment enable-flag is decoded, and it is determined according to the sequence-level chroma adjustment enable-flag that chroma adjustment is enabled for NNLF. A picture-level chroma adjustment enable-flag is decoded, and it is determined according to the picture-level chroma adjustment enable-flag that chroma adjustment is enabled for NNLF.


In an example of using the sequence-level chroma adjustment enable-flag, a sequence header of a video sequence is illustrated in the following table.















sequence_header( ) {                                 Descriptor
    ... ...
    ca_enable_flag                                   u(1)
    ... ...
}












    • In the table, ca_enable_flag is a sequence-level chroma adjustment enable-flag.





In an exemplary embodiment of the disclosure, the method further includes the following. When chroma adjustment is not enabled for NNLF, decoding of the first flag is skipped, and NNLF is performed on the reconstructed picture by using the first mode.


In this embodiment, the baseline tool NNLF1 is selected for comparison. On the basis of NNLF1, a mode is selected for an inter coding picture (i.e., a non-I-picture), where the chroma adjustment mode is among the candidates. Under the common test conditions of the random access and low delay B configurations, tests are conducted on the common sequences specified by the JVET, with NNLF1 as the anchor for comparison. The results are illustrated in Table 1 and Table 2.









TABLE 1

performance of this embodiment compared with the baseline NNLF1 under the random access configuration

Random access Main10
Over AHG11 reference software (NnlfOption = 1)

              Y         U         V         EncT    DecT
  Class A1    0.05%    −2.42%    −0.70%     107%    100%
  Class A2   −0.01%    −1.18%    −0.31%     108%    100%
  Class B    −0.03%    −1.21%    −0.58%     108%    101%
  Class C    −0.02%    −0.63%    −0.73%     106%    101%
  Class E      —         —         —          —       —
  Overall    −0.01%    −1.29%    −0.59%     107%    101%
  Class D    −0.03%    −0.19%    −1.52%     107%    101%
  Class F    −0.03%    −0.31%    −0.69%     110%    108%
















TABLE 2

performance of this embodiment compared with the baseline NNLF1 under the low delay B configuration

Low delay B Main10
Over AHG11 reference software (NnlfOption = 1)

              Y         U         V         EncT    DecT
  Class A1     —         —         —          —       —
  Class A2     —         —         —          —       —
  Class B    −0.13%    −2.67%    −0.72%     107%    103%
  Class C    −0.11%    −2.31%    −1.81%     105%    102%
  Class E    −0.05%    −1.10%    −1.83%     111%    105%
  Overall    −0.10%    −2.51%    −1.20%     107%    103%
  Class D    −0.10%    −1.95%    −4.61%     107%    102%
  Class F    −0.11%    −1.52%    −2.46%     110%    106%

(Class A1 and Class A2 are not tested under the low delay B configuration, hence the empty entries.)









The meanings of the parameters in the table are as follows.


EncT: encoding time. 10X% means that, with the technology of this embodiment integrated, the encoding time is 10X% of the encoding time before the integration, i.e., the encoding time increases by X%.


DecT: decoding time. 10X% means that, with the technology of this embodiment integrated, the decoding time is 10X% of the decoding time before the integration, i.e., the decoding time increases by X%.


Class A1 and Class A2 are test video sequences with a resolution of 3840×2160, Class B is a test sequence with a resolution of 1920×1080, Class C is a test sequence with a resolution of 832×480, Class D is a test sequence with a resolution of 416×240, and Class E is a test sequence with a resolution of 1280×720. Class F represents several screen content sequences with different resolutions.


Y, U, and V represent three color components. The columns where Y, U, and V are located represent Bjøntegaard-Delta rate (BD-rate) indicators of the test results on Y, U, and V. The smaller the value, the better the coding performance.


As can be seen from the data in the two tables, by introducing the optimization method of adjusting chroma information, the coding performance can be further improved on the basis of NNLF1, especially on the chroma components. The chroma adjustment in this embodiment has little influence on the decoding complexity.


For an intra coding picture (an I-picture), an NNLF mode can also be selected according to the method of this embodiment.


A solution of adjusting chroma information when performing NNLF on the reconstructed picture is proposed in the foregoing embodiments. In a codec, the orders of the U component and the V component of the chroma information input to an NNLF filter are swapped, thereby further improving the coding performance of NNLF. This also shows that the feature information of the U component is correlated to some extent with the feature information of the V component. However, there has been a lack of study on how adjusting the chroma information during training, such as swapping the input orders of U and V, influences an NNLF network model. After study, some examples of a training method for chroma fusion (also called chroma information fusion) are provided in embodiments of the disclosure. During model training, a chroma adjustment strategy such as UV swapping is introduced to augment the training data, which can further improve the performance of an NNLF model. In some embodiments of the disclosure, NNLF is also performed on the reconstructed picture by using an NNLF model trained based on the training method for chroma fusion, i.e., NNLF is performed by using the chroma fusion mode in the foregoing embodiments, so as to further optimize the performance of NNLF.


An NNLF filter using a model trained based on the training method for chroma fusion (also called a chroma fusion module herein, referred to as a CF module) can be implemented based on a baseline NNLF (such as NNLF1 and NNLF2) or based on a new NNLF. When the CF module is implemented based on the baseline NNLF, the model of the baseline NNLF is re-trained. When the CF module is implemented based on a new NNLF, it can be trained based on a model with a new network structure to obtain a new model.


For the position of the chroma fusion (CF) module in the codec, reference can be made to FIG. 7; the CF module can replace the NNLF-B filter in the figure. The use of the CF module does not depend on the switches of the DB, SAO, and ALF, i.e., it can be used regardless of whether the DB, SAO, and ALF are enabled.


At the encoding end, the rate-distortion cost for performing NNLF on the reconstructed picture by the NNLF-A filter is compared with the rate-distortion cost for performing NNLF on the reconstructed picture by the (re-trained/new) NNLF-B filter trained with chroma fusion, so as to determine which NNLF filter (each corresponding to a different NNLF mode) to use, and information of the selected NNLF mode is signalled into the bitstream for the decoder to read. At the decoding end, after the NNLF mode actually used by the encoding end is parsed out, NNLF is performed on the reconstructed picture by using that NNLF mode.


An NNLF method is further provided in an embodiment of the disclosure. The method is applied to a video decoding apparatus. As illustrated in FIG. 13, the method includes the following.


At S510, a first flag of a reconstructed picture is decoded, where the first flag contains information of an NNLF mode to be used when performing NNLF on the reconstructed picture.


At S520, the NNLF mode to be used when performing NNLF on the reconstructed picture is determined according to the first flag, and NNLF is performed on the reconstructed picture according to the determined NNLF mode.


The NNLF mode includes a first mode and a second mode. The second mode includes a chroma fusion mode. Training data for training a model used in the chroma fusion mode includes augmented data obtained after performing specified adjustment on chroma information of the reconstructed picture in original data or includes the original data and the augmented data.


Tests have shown that the use of a model trained with the augmented data obtained by adjusting the chroma information (i.e., the model used in the chroma fusion mode) can affect the filtering effect of NNLF and thus affect the coding performance. In this embodiment, an optimal mode can be selected from the chroma fusion mode and other modes according to a rate-distortion cost, thereby improving the coding performance.


In an exemplary embodiment of the disclosure, specified adjustment is performed on the chroma information of the reconstructed picture in the original data in any one or more of the following adjustment manners. Orders of two chroma components of the reconstructed picture in the original data are swapped. A weighted average value and a square error value of the two chroma components of the reconstructed picture in the original data are calculated, and the weighted average value and the square error value are determined as chroma information to be input in the NNLF mode.
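As a concrete illustration of these two adjustment manners, the following Python sketch applies them to chroma planes stored as NumPy arrays; the equal 0.5/0.5 weighting and the squared-difference form are assumptions made for illustration, since the exact weights are not fixed here:

    import numpy as np

    def swap_uv(rec_u, rec_v):
        # Adjustment manner 1: swap the orders of the two chroma components.
        return rec_v, rec_u

    def fuse_uv(rec_u, rec_v, w=0.5):
        # Adjustment manner 2: replace (U, V) with a weighted average value and
        # a square error value (assumed here as w*U + (1-w)*V and (U-V)^2).
        u = rec_u.astype(np.float64)
        v = rec_v.astype(np.float64)
        avg = w * u + (1.0 - w) * v
        sq_err = (u - v) ** 2
        return avg, sq_err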


In an exemplary embodiment of the disclosure, a model used in the first mode is trained using the original data (i.e., the supplementary data obtained by chroma adjustment is not used as its training data).


In an exemplary embodiment of the disclosure, the reconstructed picture is a reconstructed picture of a current picture or a current slice or a current block. The first flag is a picture-level syntax element or a block-level syntax element.


In an exemplary embodiment of the disclosure, the first mode includes one first mode, and the second mode includes one or more second modes. A network structure of a model used in the first mode is the same as or different from a network structure of a model used in the second mode.


In an exemplary embodiment of the disclosure, each second mode is the chroma fusion mode, and the first flag indicates whether the chroma fusion mode is to be used when performing NNLF on the reconstructed picture. The NNLF mode to be used when performing NNLF on the reconstructed picture is determined according to the first flag as follows. When the first flag indicates that the chroma fusion mode is to be used, it is determined that the second mode is to be used when performing NNLF on the reconstructed picture. When the first flag indicates that the chroma fusion mode is not to be used, it is determined that the first mode is to be used when performing NNLF on the reconstructed picture.


In an example of this embodiment, the second mode includes multiple second modes. The method further includes the following. When the first flag indicates that the chroma fusion mode is to be used, a second flag is further decoded, where the second flag contains index information of one second mode to be used among the multiple second modes. It is determined according to the second flag that the one second mode among the multiple second modes is to be used when performing NNLF on the reconstructed picture.


A video decoding method is further provided in an embodiment of the disclosure. The method is applied to a video decoding apparatus and includes the following. When performing NNLF on a reconstructed picture, as illustrated in FIG. 14, the following operations are executed.


At S610, whether chroma fusion is enabled for NNLF is determined.


At S620, when chroma fusion is enabled for NNLF, NNLF is performed on the reconstructed picture according to the method of any embodiment of the disclosure, where the method is applied to a decoding end and can use a chroma fusion mode.


In this embodiment, an optimal mode can be selected from the chroma fusion mode and other modes according to a rate-distortion cost, thereby improving the coding performance.


In an exemplary embodiment of the disclosure, when one or more of the following conditions are satisfied, it is determined that chroma fusion is enabled for NNLF. A sequence-level chroma fusion enable-flag is decoded, and it is determined according to the sequence-level chroma fusion enable-flag that chroma fusion is enabled for NNLF. A picture-level chroma fusion enable-flag is decoded, and it is determined according to the picture-level chroma fusion enable-flag that chroma fusion is enabled for NNLF.


In an example of this embodiment, a sequence header in the following table can be used.















                                                          Descriptor
 sequence_header( ) {
   ... ...
   cf_enable_flag                                         u(1)
   ... ...
 }












    • In the table, cf_enable_flag indicates a sequence-level chroma fusion enable-flag.





In an exemplary embodiment of the disclosure, when chroma fusion is not enabled for NNLF, decoding of a chroma fusion enable-flag is skipped, and NNLF is performed on the reconstructed picture by using the first mode.


In an example of this embodiment, a picture header illustrated in the following table can be used.















                                                          Descriptor
 picture_header( ) {
   ... ...
   if(cf_enable_flag) {
     picture_cf_enable_flag                               u(1)
     if(picture_cf_enable_flag) {
       picture_cf_index                                   u(1)
     }
   }
   ... ...
 }












    • In the table, cf_enable_flag is a sequence-level chroma fusion enable-flag. When cf_enable_flag is 1, the following semantics may be defined: picture_cf_enable_flag, indicating a picture-level chroma fusion enable-flag, i.e., the above first flag. When picture_cf_enable_flag is 1, it indicates that the chroma fusion mode is to be used when performing NNLF on the reconstructed picture. In the case where the chroma fusion mode includes multiple chroma fusion modes (for example, uses multiple models trained with the supplementary data), which chroma fusion mode to use is indicated by an index picture_cf_index. When picture_cf_enable_flag is 0, it indicates that the chroma fusion mode is not to be used when performing NNLF on the reconstructed picture, that is, the first mode is to be used. When cf_enable_flag is 0, decoding and encoding of picture_cf_enable_flag and picture_cf_index are skipped.





An NNLF method is further provided in an embodiment of the disclosure. The method is applied to a video encoding apparatus. As illustrated in FIG. 15, the method includes the following.


At S710, a rate-distortion cost for performing NNLF on a reconstructed picture input by using a first mode is calculated, and a rate-distortion cost for performing NNLF on the reconstructed picture by using a second mode is calculated.


At S720, NNLF is performed on the reconstructed picture by using a mode with a minimum rate-distortion cost between the first mode and the second mode.


The first mode and the second mode each are a set NNLF mode. The second mode includes a chroma fusion mode. Training data for training a model used in the chroma fusion mode includes augmented data obtained after performing specified adjustment on chroma information of the reconstructed picture in original data or includes the original data and the augmented data.


In this embodiment, an optimal mode can be selected from the chroma fusion mode and other modes according to a rate-distortion cost, thereby improving the coding performance.


In an exemplary embodiment of the disclosure, specified adjustment is performed on the chroma information in any one or more of the following adjustment manners. Orders of two chroma components of the reconstructed picture are swapped. A weighted average value and a square error value of the two chroma components of the reconstructed picture are calculated, and the weighted average value and the square error value are determined as chroma information to be input in the NNLF mode.


In an exemplary embodiment of the disclosure, a model used in the first mode is trained using the original data. The reconstructed picture is a reconstructed picture of a current picture or a current slice or a current block. A network structure of the model used in the first mode is the same as or different from a network structure of a model used in the second mode.


In an exemplary embodiment of the disclosure, the first mode includes one first mode, and the second mode includes one or more second modes. The rate-distortion cost cost1 for performing NNLF on the reconstructed picture input by using the first mode is calculated as follows. A first filtered picture output after performing NNLF on the reconstructed picture by using the first mode is obtained, and cost1 is calculated according to a difference between the first filtered picture and a corresponding original picture. The rate-distortion cost cost2 for performing NNLF on the reconstructed picture by using the second mode is calculated as follows. For each of the one or more second modes, a second filtered picture output after performing NNLF on the reconstructed picture by using the second mode is obtained, and cost2 of the second mode is calculated according to a difference between the second filtered picture and the corresponding original picture. When the second mode includes multiple second modes, cost2 of each of the multiple second modes is compared with cost1 of the first mode to determine the minimum rate-distortion cost.
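A minimal encoder-side sketch of this cost comparison is given below in Python; run_nnlf (applying a mode's model to the reconstructed picture) and rd_cost (measuring the distortion of a filtered picture against the original picture, plus any signalling overhead) are hypothetical helpers assumed for illustration:

    def select_nnlf_mode(rec, org, first_mode, second_modes, run_nnlf, rd_cost):
        # Return the mode with the minimum rate-distortion cost and its output.
        best_mode = first_mode
        best_filtered = run_nnlf(first_mode, rec)
        best_cost = rd_cost(best_filtered, org, first_mode)     # cost1
        for mode in second_modes:                               # cost2 of each second mode
            filtered = run_nnlf(mode, rec)
            cost = rd_cost(filtered, org, mode)
            if cost < best_cost:
                best_mode, best_filtered, best_cost = mode, filtered, cost
        return best_mode, best_filtered

The filtered picture returned for the winning mode then serves as the filtered picture output after performing NNLF on the reconstructed picture, as described next.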


In an exemplary embodiment of the disclosure, the method further includes the following. A filtered picture obtained by performing NNLF on the reconstructed picture by using the mode with the minimum rate-distortion cost is determined as a filtered picture output after performing NNLF on the reconstructed picture.


A video encoding method is further provided in an embodiment of the disclosure. The method is applied to a video encoding apparatus and includes the following. When performing NNLF on a reconstructed picture, as illustrated in FIG. 16, the following operations are executed.


At S810, when chroma fusion is enabled for NNLF, NNLF is performed on the reconstructed picture according to the method of any embodiment of the disclosure, where the method is applied to an encoding end and can use a chroma fusion mode.


At S820, a first flag of the reconstructed picture is encoded, where the first flag contains information of an NNLF mode to be used when performing NNLF on the reconstructed picture.


In this embodiment, an optimal mode can be selected from the chroma fusion mode and other modes according to a rate-distortion cost, thereby improving the coding performance.


In an exemplary embodiment of the disclosure, the first flag is a picture-level syntax element or a block-level syntax element.


In an exemplary embodiment of the disclosure, when one or more of the following conditions are satisfied, it is determined that chroma fusion is enabled for NNLF. A sequence-level chroma fusion enable-flag is decoded, and it is determined according to the sequence-level chroma fusion enable-flag that chroma fusion is enabled for NNLF. A picture-level chroma fusion enable-flag is decoded, and it is determined according to the picture-level chroma fusion enable-flag that chroma fusion is enabled for NNLF.


In an exemplary embodiment of the disclosure, the method further includes the following. When it is determined that chroma fusion is not enabled for NNLF, NNLF is performed on the reconstructed picture by using the first mode, and encoding of a chroma fusion enable-flag is skipped.


A video encoding method is further provided in an embodiment of the disclosure, which can be implemented in an NNLF filter at an encoding end. When the encoding end proceeds to the NNLF process, the encoding end executes the following operations.


Step a), whether a chroma fusion mode is enabled for NNLF in a current sequence is determined according to a sequence-level chroma fusion enable-flag cf_enable_flag. If cf_enable_flag is "1", an attempt is made to perform NNLF on the current sequence by using the chroma fusion mode, and the process proceeds to step b). If cf_enable_flag is "0", NNLF is performed by using the first mode, encoding of a first flag is skipped, and the process ends.


Step b), for a reconstructed picture of a current picture of the current sequence, NNLF is performed on the reconstructed picture by using the first mode, that is, the input information is fed to a model of a baseline NNLF filter to obtain a first filtered picture output by the filter.


Step c), for the reconstructed picture of the current picture of the current sequence, NNLF is performed on the reconstructed picture by using the chroma fusion mode, that is, encoded information is input to a (re-trained/new) NNLF model in the chroma fusion mode to obtain a second filtered picture output by the filter.


Step d), a rate-distortion cost C_NNLF is calculated according to a difference between the first filtered picture and an original picture, and a rate-distortion cost C_CF is calculated according to a difference between the second filtered picture and the original picture. The two rate-distortion costs are compared. If C_CF < C_NNLF, it is determined to perform NNLF on the reconstructed picture by using the chroma fusion mode, i.e., the second mode, and the second filtered picture is determined as the filtered picture which is finally output after performing NNLF on the reconstructed picture. If C_CF ≥ C_NNLF, it is determined to perform NNLF on the reconstructed picture by using the first mode, and the first filtered picture is determined as the filtered picture which is finally output after performing NNLF on the reconstructed picture.


If there are multiple chroma fusion modes (the above chroma adjustment and fusion mode is also a chroma fusion mode), multiple C_CF values will be compared.


Step e), a first flag picture_cf_enable_flag of the reconstructed picture of the current picture is signalled into a bitstream. If there are multiple chroma fusion modes, an index picture_cf_index indicating a chroma fusion mode to be used is further signalled into the bitstream.


If the current picture has been processed, a next picture is loaded for processing. Similarly, a current slice of the current picture or a current block in the current picture may also be processed as described above.


In some embodiments of the disclosure, the same UV-swapping adjustment strategy is also introduced during training of an NNLF model, i.e., the NNLF model is trained with the input orders of U and V swapped. As mentioned above, this can be implemented based on a baseline NNLF or based on a new NNLF. The main difference lies in the NNLF network structure, which has no influence on model training, so NNLF_VAR is uniformly used herein for description. For an NNLF_VAR model, improvements have been made in both training and coding testing in embodiments of the disclosure.


A training method for an NNLF model is provided in an embodiment of the disclosure. As illustrated in FIG. 17, the method includes the following.


At S910, specified adjustment is performed on chroma components of a reconstructed picture in original data for training, to obtain augmented data.


At S920, the NNLF model is trained by taking the augmented data or the augmented data and the original data as training data.


In an exemplary embodiment of the disclosure, specified adjustment is performed on chroma information of the reconstructed picture in the original data in any one or more of the following adjustment manners. Orders of two chroma components of the reconstructed picture in the original data are swapped. A weighted average value and a square error value of the two chroma components of the reconstructed picture in the original data are calculated, and the weighted average value and the square error value are determined as chroma information to be input in an NNLF mode.


In an exemplary embodiment of the disclosure, for a certain NNLF_VAR, an arrangement order of information input to the network may be {recY, recU, recV, predY, predU, predV, baseQP, sliceQP, slicetype, . . . }. Therefore, the above data need to be collected through coding testing and aligned with the label data of the original picture. The data are sequentially organized and packaged into a data-pair patchA{train, label} as training data for the network. For example, the data form of patchA, i.e., the original data, is illustrated in the following pseudo code.

















patchA:
{
  train: {recY, recU, recV, predY, predU, predV, baseQP, sliceQP, slicetype, ...}
  label: {orgY, orgU, orgV}
}










By introducing the concept of chroma fusion, the input orders of U and V can be swapped. Therefore, in this embodiment, the training data are augmented (patchB, i.e., the supplementary data, is added on the basis of the original patchA), and the augmented data-pairs are illustrated in the following pseudo code.

















patchA:
{
  train: {recY, recU, recV, predY, predU, predV, baseQP, sliceQP, slicetype, ...}
  label: {orgY, orgU, orgV}
}

patchB:
{
  train: {recY, recV, recU, predY, predV, predU, baseQP, sliceQP, slicetype, ...}
  label: {orgY, orgV, orgU}
}










As can be seen, for the augmented patchB, input positions of data of U and V are swapped, and corresponding positions of the label data also need to be swapped.
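The augmentation itself is only a reordering of the packaged tensors. A Python sketch of building the two data-pairs is given below; the dict-based packaging and the plane names are illustrative assumptions that follow the pseudo code above:

    def build_patches(rec, pred, org, base_qp, slice_qp, slice_type):
        # rec/pred/org: dicts holding the 'Y', 'U', 'V' planes of the
        # reconstructed, predicted, and original pictures, respectively.
        patch_a = {
            "train": [rec["Y"], rec["U"], rec["V"], pred["Y"], pred["U"], pred["V"],
                      base_qp, slice_qp, slice_type],
            "label": [org["Y"], org["U"], org["V"]],
        }
        patch_b = {
            # U and V are swapped in both the input and the label data
            "train": [rec["Y"], rec["V"], rec["U"], pred["Y"], pred["V"], pred["U"],
                      base_qp, slice_qp, slice_type],
            "label": [org["Y"], org["V"], org["U"]],
        }
        return patch_a, patch_b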


In this embodiment, both the original data and the supplementary data are used as the training data. In other embodiments, only the supplementary data may be used as the training data for training an NNLF model used in the chroma fusion mode.


When performing NNLF on the reconstructed picture by using a model of the chroma fusion mode (i.e., the NNLF model trained with chroma fusion), the input chroma information may be adjusted, for example, the orders of U and V are swapped (in this case, the above chroma adjustment and fusion mode is used), with the input and output illustrated in FIG. 12B. Alternatively, the input chroma information may not be adjusted (in this case, a non-chroma-adjustment mode is used), with the input and output illustrated in FIG. 12A.
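The two inference paths can be sketched as follows in Python, assuming a hypothetical model callable that maps (Y, U, V) planes to filtered planes; the swap-back of the outputs is an assumption consistent with the swapped label data in patchB:

    def nnlf_infer(model, rec_y, rec_u, rec_v, swap_uv=False):
        # swap_uv=False corresponds to FIG. 12A, swap_uv=True to FIG. 12B.
        if not swap_uv:
            return model(rec_y, rec_u, rec_v)
        out_y, out_v, out_u = model(rec_y, rec_v, rec_u)   # feed V before U
        return out_y, out_u, out_v                         # swap the outputs back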


In the training stage, NNLF_VAR is trained according to the augmented training data generated above. The specific training process is similar to the training process of the existing NNLF and uses a multi-round training mode, which will not be repeated herein.


After training of the NNLF model is completed, the NNLF model is deployed in a codec for coding testing. As mentioned above, for a network model obtained through chroma fusion training, the input and output information in the chroma fusion mode with UV swapping has a different form from that in the mode without UV swapping. During coding testing, these two modes can be used separately, and their respective rate-distortion costs can be calculated. By comparing these two rate-distortion costs with the rate-distortion cost of the first mode (i.e., of the baseline NNLF), whether to use the chroma fusion mode, and which chroma fusion mode to use when it is used, can be determined.


A training method for an NNLF mode with chroma fusion (where training data at least includes supplementary data) is proposed in an embodiment of the disclosure. For a U component and a V component of chroma information input to an NNLF filter, a swapping adjustment strategy is used for training and coding testing of NNLF, thereby further optimizing the performance of NNLF.


In the above embodiments, a joint rate-distortion cost over the Y, U, and V components is used to make the determination, and thus the three components share one first flag (or one first flag and index). In other embodiments, a more refined control may be applied, where different first flags (or first flags and indexes) are set for different components. For example, the syntax elements of this method are illustrated in the following table.















                                                          Descriptor
 picture_header( ) {
   ... ...
   if(cf_enable_flag) {
     for (compIdx=0; compIdx<3; compIdx++) {
       picture_cf_enable_flag[compIdx]                    u(1)
       if(picture_cf_enable_flag[compIdx]) {
         picture_cf_index[compIdx]                        ue(v)
       }
     }
   }
   ... ...
 }










When a sequence-level chroma fusion enable-flag cf_enable_flag is 1, the following semantics may be defined: a picture-level chroma fusion enable-flag picture_cf_enable_flag[compIdx].


When the picture-level chroma fusion enable-flag picture_cf_enable_flag[compIdx] is 1, the following semantics may be defined: an index of a picture-level chroma fusion method, picture_cf_index[compIdx].
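A decoder-side sketch of parsing this per-component syntax is given below in Python; the reader interface is a hypothetical name, with read_ue denoting parsing of a ue(v)-coded value:

    def parse_per_component_cf_flags(reader, cf_enable_flag):
        # Parse picture_cf_enable_flag[compIdx] and picture_cf_index[compIdx]
        # for the three components, per the table above (0: Y, 1: U, 2: V).
        enable = [0, 0, 0]
        index = [0, 0, 0]
        if cf_enable_flag:
            for comp_idx in range(3):
                enable[comp_idx] = reader.read_bit()       # u(1)
                if enable[comp_idx]:
                    index[comp_idx] = reader.read_ue()     # ue(v)
        return enable, index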


A bitstream is further provided in an embodiment of the disclosure. The bitstream is generated according to the video encoding method of any embodiment of the disclosure.


A neural network based loop filter is further provided in an embodiment of the disclosure. As illustrated in FIG. 11, the neural network based loop filter includes a processor and a memory storing a computer program. When executing the computer program, the processor can implement the neural network based loop filtering method of any embodiment of the disclosure. As illustrated in the figure, the processor and the memory are connected via a system bus, and the loop filter may further include other components such as an internal storage and a network interface.


A video decoding apparatus is further provided in an embodiment of the disclosure. As illustrated in FIG. 11, the video decoding apparatus includes a processor and a memory storing a computer program. When executing the computer program, the processor implements the video decoding method of any embodiment of the disclosure.


A video encoding apparatus is further provided in an embodiment of the disclosure. As illustrated in FIG. 11, the video encoding apparatus includes a processor and a memory storing a computer program. When executing the computer program, the processor can implement the video encoding method of any embodiment of the disclosure.


The processor of the above embodiments of the disclosure may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP for short), a microprocessor, etc., or other conventional processors, etc. The processor may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), discrete logic or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or other equivalent integrated or discrete logic circuits, or a combination of the above devices. That is, the processor of the above embodiments may be any processing device or device combination that can implement various methods, steps, and logic block diagrams disclosed in the embodiments of the disclosure. If the embodiments of the disclosure are partially implemented in software, the instructions for the software may be stored in a suitable non-volatile computer-readable storage medium, and one or more processors in hardware may be used to execute the instructions to implement the methods of the embodiments of the disclosure. The term “processor” used herein may refer to the above structure or any other structure suitable for implementing the technology described herein.


A video coding system is further provided in an embodiment of the disclosure. As illustrated in FIG. 1A, the video coding system includes the video encoding apparatus of any embodiment of the disclosure and the video decoding apparatus of any embodiment of the disclosure.


A non-transitory computer-readable storage medium is further provided in an embodiment of the disclosure. The computer-readable storage medium stores a computer program. The computer program, when executed by a processor, can implement the neural network based loop filtering method of any embodiment of the disclosure, the video decoding method of any embodiment of the disclosure, or the video encoding method of any embodiment of the disclosure.


In one or more of the above exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions as one or more instructions or codes may be stored in a computer-readable medium or transmitted via a computer-readable medium, and executed by a hardware-based processing unit. The computer-readable medium may include a tangible medium such as a data storage medium, or any communication medium that facilitates a computer program being transmitted from one place to another according to a communication protocol for example. In this way, the computer-readable medium may be generally a non-transitory tangible computer-readable storage medium or a communication medium such as a signal or carrier wave. A data storage medium may be any available medium that can be accessed by one or more computers or one or more processors to retrieve instructions, codes, and/or data structures for implementing the technology described in the disclosure. A computer program product may include a computer-readable medium.


By way of example and not limitation, such computer-readable storage media may include a random access memory (RAM), a read only memory (ROM), an electrically erasable programmable read only memory (EEPROM), a compact disk ROM (CD-ROM) or other optical disk storage devices, magnetic disk storage devices or other magnetic storage devices, a flash memory, or any other medium that can be used to store desired program codes in the form of instructions or data structures and can be accessed by a computer. Moreover, any connection may also be referred to as a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, a fiber optic cable, a twisted pair, a digital subscriber line (DSL), or wireless technology such as infrared, radio, and microwaves, then the coaxial cable, the fiber optic cable, the twisted pair, the DSL, or the wireless technology such as infrared, radio, and microwaves are included in the definition of medium. However, it may be understood that the computer-readable storage medium and data storage medium do not include connections, carriers, signals, or other transitory media, but are directed to non-transitory tangible storage media. As used herein, disk and disc include a compact disc (CD), a laser disc, an optical disc, a digital versatile disc (DVD), a floppy disk, and a Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above shall also be included within the scope of the computer-readable medium.

Claims
  • 1. A video decoding method, performed by a video decoding apparatus and comprising: decoding a first flag of a reconstructed picture, wherein the first flag contains information of a neural network based loop filtering (NNLF) mode to be used when performing NNLF on the reconstructed picture; and determining, according to the first flag, the NNLF mode to be used when performing NNLF on the reconstructed picture; and performing NNLF on the reconstructed picture according to the determined NNLF mode; wherein the NNLF mode comprises a first mode and a second mode, the second mode comprises a chroma fusion mode, and training data for training a model used in the chroma fusion mode comprises augmented data obtained after performing specified adjustment on chroma information of the reconstructed picture in original data or comprises the original data and the augmented data.
  • 2. The method of claim 1, wherein performing specified adjustment on the chroma information of the reconstructed picture in the original data comprises any one or more of: swapping orders of two chroma components of the reconstructed picture in the original data; or calculating a weighted average value and a square error value of the two chroma components of the reconstructed picture in the original data, and determining the weighted average value and the square error value as chroma information to be input in the NNLF mode.
  • 3. The method of claim 1, wherein: a model used in the first mode is trained using the original data; and the reconstructed picture is a reconstructed picture of a current picture or a current slice or a current block; and the first flag is a picture-level syntax element or a block-level syntax element.
  • 4. The method of claim 1, wherein: the first mode comprises one first mode, and the second mode comprises one or more second modes; and a network structure of a model used in the first mode is the same as or different from a network structure of a model used in the second mode.
  • 5. The method of claim 1, wherein: each second mode is the chroma fusion mode; and the first flag indicates whether the chroma fusion mode is to be used when performing NNLF on the reconstructed picture; and determining, according to the first flag, the NNLF mode to be used when performing NNLF on the reconstructed picture comprises: in response to the first flag indicating that the chroma fusion mode is to be used, determining that the second mode is to be used when performing NNLF on the reconstructed picture; and in response to the first flag indicating that the chroma fusion mode is not to be used, determining that the first mode is to be used when performing NNLF on the reconstructed picture.
  • 6. The method of claim 5, wherein: the second mode comprises a plurality of second modes; and the method further comprises: in response to the first flag indicating that the chroma fusion mode is to be used, decoding a second flag, wherein the second flag contains index information of one second mode to be used among the plurality of second modes; and determining, according to the second flag, that the one second mode among the plurality of second modes is to be used when performing NNLF on the reconstructed picture.
  • 7. The method of claim 1, wherein NNLF is performed on the reconstructed picture in response to chroma fusion being enabled for NNLF.
  • 8. The method of claim 7, further comprising: determining that chroma fusion is enabled for NNLF, in response to one or more of the following conditions being satisfied: decoding a sequence-level chroma fusion enable-flag, and determining according to the sequence-level chroma fusion enable-flag that chroma fusion is enabled for NNLF; or decoding a picture-level chroma fusion enable-flag, and determining according to the picture-level chroma fusion enable-flag that chroma fusion is enabled for NNLF.
  • 9. The method of claim 7, further comprising: in response to chroma fusion being not enabled for NNLF, skipping decoding of a chroma fusion enable-flag, and performing NNLF on the reconstructed picture by using the first mode.
  • 10. A video encoding method, performed by a video encoding apparatus and comprising: calculating a rate-distortion cost for performing neural network based loop filtering (NNLF) on a reconstructed picture input by using a first mode, and calculating a rate-distortion cost for performing NNLF on the reconstructed picture by using a second mode; and determining to perform NNLF on the reconstructed picture by using a mode with a minimum rate-distortion cost between the first mode and the second mode, wherein the first mode and the second mode each are a set NNLF mode, the second mode comprises a chroma fusion mode, and training data for training a model used in the chroma fusion mode comprises augmented data obtained after performing specified adjustment on chroma information of the reconstructed picture in original data or comprises the original data and the augmented data.
  • 11. The method of claim 10, wherein performing specified adjustment on the chroma information comprises any one or more of: swapping orders of two chroma components of the reconstructed picture; or calculating a weighted average value and a square error value of the two chroma components of the reconstructed picture, and determining the weighted average value and the square error value as chroma information to be input in the set NNLF mode.
  • 12. The method of claim 10, wherein: a model used in the first mode is trained using the original data; and the reconstructed picture is a reconstructed picture of a current picture or a current slice or a current block; and a network structure of the model used in the first mode is the same as or different from a network structure of a model used in the second mode.
  • 13. The method of claim 10, wherein: the first mode comprises one first mode, and the second mode comprises one or more second modes; calculating the rate-distortion cost cost1 for performing NNLF on the reconstructed picture input by using the first mode comprises: obtaining a first filtered picture output after performing NNLF on the reconstructed picture by using the first mode, and calculating cost1 according to a difference between the first filtered picture and a corresponding original picture; and calculating the rate-distortion cost cost2 for performing NNLF on the reconstructed picture by using the second mode comprises: for each of the one or more second modes, obtaining a second filtered picture output after performing NNLF on the reconstructed picture by using the second mode, and calculating cost2 of the second mode according to a difference between the second filtered picture and the corresponding original picture.
  • 14. The method of claim 10, further comprising: encoding a first flag of the reconstructed picture, wherein the first flag contains information of an NNLF mode to be used when performing NNLF on the reconstructed picture.
  • 15. The method of claim 14, wherein the first flag is a picture-level syntax element or a block-level syntax element.
  • 16. The method of claim 14, further comprising: determining that chroma fusion is enabled for NNLF, in response to one or more of the following conditions being satisfied: decoding a sequence-level chroma fusion enable-flag, and determining according to the sequence-level chroma fusion enable-flag that chroma fusion is enabled for NNLF; or decoding a picture-level chroma fusion enable-flag, and determining according to the picture-level chroma fusion enable-flag that chroma fusion is enabled for NNLF.
  • 17. The method of claim 14, further comprising: in response to determining that chroma fusion is not enabled for NNLF, performing NNLF on the reconstructed picture by using the first mode, and skipping encoding of a chroma fusion enable-flag.
  • 18. The method of claim 14, wherein: each second mode is the chroma fusion mode; and the first flag indicates whether the chroma fusion mode is to be used when performing NNLF on the reconstructed picture; and encoding the first flag of the reconstructed picture comprises: in response to performing NNLF on the reconstructed picture by using the first mode, setting the first flag to a value indicating that the chroma fusion mode is not to be used; and in response to performing NNLF on the reconstructed picture by using the second mode, setting the first flag to a value indicating that the chroma fusion mode is to be used.
  • 19. The method of claim 18, wherein: the second mode comprises a plurality of second modes; and encoding the first flag of the reconstructed picture further comprises: after the first flag is set to the value indicating that the chroma fusion mode is to be used, encoding a second flag, wherein the second flag contains index information of one second mode with a minimum rate-distortion cost among the plurality of second modes.
  • 20. A non-transitory computer-readable storage medium storing a bitstream, the bitstream being generated according to the following: calculating a rate-distortion cost for performing neural network based loop filtering (NNLF) on a reconstructed picture input by using a first mode, and calculating a rate-distortion cost for performing NNLF on the reconstructed picture by using a second mode; and determining to perform NNLF on the reconstructed picture by using a mode with a minimum rate-distortion cost between the first mode and the second mode, wherein the first mode and the second mode each are a set NNLF mode, the second mode comprises a chroma fusion mode, and training data for training a model used in the chroma fusion mode comprises augmented data obtained after performing specified adjustment on chroma information of the reconstructed picture in original data or comprises the original data and the augmented data.
CROSS-REFERENCE TO RELATED APPLICATION(S)

This application is a continuation of International Application No. PCT/CN2022/125230, filed Oct. 13, 2022, the entire disclosure of which is incorporated herein by reference.

Continuations (1)

Number                      Date        Country
Parent PCT/CN2022/125230    Oct 2022    WO
Child 19172597                          US