The present document relates generally to image and video coding. More particularly, an embodiment of the present invention relates to the messaging of parameters related to neural-network post filtering in image and video coding.
In 2020, the MPEG group in the International Organization for Standardization (ISO), jointly with the International Telecommunication Union (ITU), released the first version of the Versatile Video Coding standard (VVC), also known as H.266 (Ref. [3]). More recently, the same group has been working on the development of a next-generation coding standard that provides improved coding performance over existing video coding technologies. As part of this investigation, coding techniques based on artificial intelligence and deep learning are also being examined. As used herein, the term "deep learning" refers to neural networks (NNs) having at least three layers, and preferably more than three layers.
Neural-network post filtering (NNPF) and neural-network loop filtering (NNLF) have been shown to improve coding efficiency in image and video coding. While MPEG-7, Part 17 (ISO/IEC 15938-17) (Ref. [11]) describes a method for the compression of the representation of neural networks, it is rather inefficient under the bit-rate constraints of image and video coding. As appreciated by the inventors here, improved techniques for the carriage of neural network topology and parameters as related to NNPF in image and video coding are desired, and they are described herein.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, issues identified with respect to one or more approaches should not be assumed to have been recognized in any prior art on the basis of this section, unless otherwise indicated.
An embodiment of the present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements, and in which:
Example embodiments that relate to the carriage of neural network topology and parameters as related to NNPF in image and video coding are described herein. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the various embodiments of the present invention. It will be apparent, however, that the various embodiments of the present invention may be practiced without these specific details. In other instances, well-known structures and devices are not described in exhaustive detail, in order to avoid unnecessarily occluding, obscuring, or obfuscating embodiments of the present invention.
Example embodiments described herein relate to the carriage of neural network topology and parameters as related to NNPF in image and video coding. In an embodiment, a processor receives a decoded image and NNPF metadata related to processing the decoded image with NNPF. The processor:
In another embodiment, a processor receives an image or a video sequence comprising pictures. The processor:
Metadata signaling, e.g., via SEI messaging, for NNPF has been proposed in the past in several JVET meetings (Refs. [4-10]). Previous proposals focused mainly on how to signal the NN topology and NN parameters, either by carrying an NNR (Neural Network Compression and Representation) bitstream (Refs. [11-13]) or via an external link (Ref. [4]), such as a given Uniform Resource Identifier (URI), with syntax and semantics as specified in IETF Internet Standard 66. Some of the proposals also addressed issues related to the NN input or output interfaces and the NN complexity (Refs. [7-9]).
Despite using compression, an NNR bitstream may still be quite large, thus affecting bandwidth utilization. Furthermore, when using NNR, a decoder needs to comply with, and be able to decode, yet another standard. As appreciated by the inventors, NNPF metadata must be lightweight, but still provide the necessary information for a decoder to check whether it can apply NNPF and, if it can, access the parameters required to perform NNPF processing (100) as described earlier.
While neural networks may also be applied to loop filtering and other applications, embodiments described herein focus, without limitation, on NNPF for two main reasons: 1) NNPF is decoupled from decompression, so the implementation has more freedom and can be used with any image or video codec; and 2) it is outside the coding loop (which typically includes transform processing, quantization, and loop filtering (deblocking)), so it does not require a fixed-point implementation to avoid drift issues. Thus, a floating-point implementation, as generally used in NNs, can be applied.
Since the NNPF is performed outside the decoding loop, it does not have the potential drift issue of NNLF (loop-filter) processing. For NNLF, a bad filtering result for one frame or one block, which is possible since the NN may not be robust enough for all frame data, degrades the quality of the currently decoded frame, which may be used as a reference frame for later frames. The errors and artifacts can therefore accumulate and propagate to other frames as a drift phenomenon. In another example, most NNs are implemented using floating point, which can produce different results on different machines, platforms, or operating systems. This can cause an encoder-decoder mismatch for one frame, and the error can cause drift issues for the following decoded frames if the mismatched frame is used as a reference.
Two levels of NNPF-related messaging are proposed: 1) at the CLVS (Coded Layer Video Sequence) layer (where NNPF operations persist until the end of the video sequence), and 2) at the Picture layer (where NNPF operations persist only until the end of the current picture). This allows picture-wise NNPF messaging and filtering without repeating certain filter characteristics that apply to the whole video sequence. While the proposed messaging is described using notation and syntax commonly used to describe MPEG's SEI messaging (Ref. [1-3]), the proposed metadata messaging may be carried using a variety of other suitable messaging formats, for example, as used in AV1 and other proprietary or standards-based coding formats. The proposed messaging can also be applied to other MPEG-based standards, such as AVC and HEVC. The proposed SEI message helps NNPF utilize the coding characteristics by providing information that is not available for standalone post filters, thus further improving the post filter performance.
In example embodiments, the proposed CLVS NNPF SEI aims to provide information to assist in the efficient implementation of an NNPF pipeline, such as initialization, pre-processing, model loading/unloading and post-processing. The picture layer NNPF SEI aims to allow picture-level adaptation, to further improve NNPF coding efficiency.
The scope of the CLVS-layer NNPF SEI is the entire coded video sequence. The intent is to signal it with the first picture of the CLVS, and it should not change throughout the CLVS. It should assist decoders in getting ready to apply the NNPF to the decoded picture after bitstream decoding. More specifically, when an NNPF SEI message is present for any picture of a CLVS of a particular layer, an NNPF SEI message shall be present for the first picture of the CLVS. The NNPF SEI message persists for the current layer in decoding order from the current picture until the end of the CLVS. All NNPF SEI messages that apply to the same CLVS shall have the same content. In an example embodiment, the CLVS NNPF SEI includes the following information.
For an NNPF SEI message, it is desirable that the SEI message carry only the necessary information, so that the size of the SEI message does not become too large; otherwise, an encoder could simply reduce the quantization parameter (QP) value, at the expense of a higher bit rate, to improve the quality of the coded sequence. The size of a detailed network topology (for example, a graph describing the topology) and its corresponding parameter values (weights and biases, in the case of a convolutional neural network (CNN)) can be relatively large, for example in the range of kilobytes, megabytes, or even gigabytes. It is not realistic to carry all of this information in the SEI bitstream. Compression can be applied to the models (such as NNR in Ref. [11]), but the size is still not negligible. One way to signal the detailed NN model information is to use an explicit link or some external means, such as a cross-reference to a URI (IETF Internet Standard 66), as discussed in Ref. [4]. Another way is to have a fixed, standardized model, or an external reference link for a base model, with the bitstream carrying only the incremental information (Ref. [14]), such as updated biases or weights, either for the full NN or for a small subset of the NN.
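As an illustration of the incremental-update option, the following minimal sketch (a hypothetical helper, not the syntax of Ref. [14]; PyTorch is assumed only because it is one of the storage formats discussed later) applies in-band weight/bias deltas to a base model obtained via an external link:

    # Minimal sketch (assumption): the base model is fetched from the signaled
    # external link, and only parameter deltas are carried in the bitstream.
    import torch

    def apply_incremental_update(base_state, deltas):
        # Return a copy of the base state dict with the listed tensors refined.
        updated = dict(base_state)
        for name, delta in deltas.items():
            updated[name] = base_state[name] + delta  # same shape as the base tensor
        return updated

    # Usage (hypothetical layer name "conv1.bias"):
    # model.load_state_dict(apply_incremental_update(
    #     model.state_dict(), {"conv1.bias": torch.tensor([0.01, -0.02, 0.00])}))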
In addition to the topology and model parameters, it is important to let the decoder know the following information as well, as it can help a decoder achieve fast initialization or quickly decide whether it can implement the NNPF or bypass it.
As an example, Table 1 depicts example syntax parameters for NNPF topology and model parameter information for a single model. The syntax includes the NN topology and parameters via an explicit link (if it exists) or updated parameters, the NN storage and exchange format, and NN complexity indications. Multiple models should loop over this syntax. It is noted that multiple models most likely use the same storage and exchange format, so an alternative solution is to move this information out and signal it only once in the core NNPF SEI message.
nnpf_model_exter_link_flag equal to 1 indicates that the NNPF model is stored at an external link. nnpf_model_exter_link_flag equal to 0 indicates that the NNPF model is not stored at an external link.
nnpf_exter_uri[i] contains the i-th byte of a NULL-terminated UTF-8 character string that indicates a URI (IETF Internet Standard 66), which specifies the neural network to be used as the post-processing filter.
nnpf_model_upd_param_present_flag equal to 1 indicates that the model parameters are updated. nnpf_model_upd_param_present_flag equal to 0 indicates that the model parameters are not updated.
Note: See Ref. [4] for additional updated parameters syntax and semantics.
nnpf_model_storage_form_idc indicates the storage and exchange format of the NNPF model as specified in Table 2. The values 0 to 3 correspond to ONNX, NNEF, TensorFlow, and PyTorch, respectively. Values 4 to 7 are reserved for future extensions.
nnpf_model_complexity_ind_present_flag equal to 1 indicates that the model complexity indicators are present in the SEI message. nnpf_model_complexity_ind_present_flag equal to 0 indicates that the model complexity indicators are not present in the SEI message. The inferred value for all of the following syntax elements should be 0 unless otherwise specified; in this context, "0" can be interpreted as "NULL" (does not exist) or "can be ignored."
nnpf_param_prec_idc indicates the NNPF model parameters precision as specified in Table 3. When not present, the syntax value of nnpf_param_prec_idc is inferred to be 5.
nnpf_num_param_frac is the fractional number used to represent the total number of model parameters.
log2_prec_denom is the base-2 logarithm of the denominator of the fractional number used to represent the total number of model parameters.
log2_nnpf_num_param_minus11 plus 11 is the base 2 logarithm to represent the total number of model parameters.
The variable tot_num_params is derived as follows:
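The derivation itself is not reproduced in this text. A reconstruction that is consistent with the semantics of the three syntax elements above and with the worked example given later (where 1.65625 × 2^17 yields about 217k parameters) is sketched below; it is an assumption, not normative text.

    # Assumed derivation of tot_num_params (mantissa-plus-exponent form):
    def derive_tot_num_params(log2_nnpf_num_param_minus11,
                              nnpf_num_param_frac,
                              log2_prec_denom):
        mantissa = 1.0 + nnpf_num_param_frac / (1 << log2_prec_denom)
        return int(mantissa * (1 << (log2_nnpf_num_param_minus11 + 11)))

    # Example consistent with the worked CLVS SEI example later in this text:
    # derive_tot_num_params(6, 21, 5) == 217088  (about 217k)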
The NNPF model's total number of parameters should be no larger than the value of tot_num_params.
When the above three syntax elements are not present, the value of tot_num_params is inferred to be 0 for “NULL.”
nnpf_num_ops times 1,000 specifies the maximum number of MAC (multiply-accumulate) operations per pixel for NNPF.
Note: a more precise definition of this parameter can use the 1.a*2^b form, as for tot_num_params.
nnpf_latency_idc specifies the latency indication of the NNPF model as specified in Table 4. It indicates, assuming a baseline GPU (for example, defined as an Nvidia RTX 1080Ti) is available, the combination of resolution and frame rate that can be supported by the NNPF model to ensure real-time operation without delay, consistent with the decoder.
It is noted that the NN storage/exchange format or a complexity indication can also be generated by downloading the model and using a standalone analyzer. Therefore, a "present flag," such as nnpf_model_complexity_ind_present_flag, is used to make the in-bitstream complexity indication optional.
The data input to the NN might be different from the decoded format. To correctly apply NNPF, the following information may be included in the bitstream.
An example of SEI messaging data information is shown in Table 5.
input_chroma_format_idc has the same semantics as specified for the syntax sps_chroma_format_idc.
output_chroma_format_idc has the same semantics as specified for the syntax sps_chroma_format_idc.
vui_matrix_coeffs has the same semantics as specified for the syntax vui_matrix_coeffs.
packing_format_idc indicates the packing format for the luma channel as specified in Table 6. The purpose is to allow all input channels to have the same dimensions.
chroma_luma_dependency_flag equal to 1 specifies that, for the chroma NNPF model, the chroma channels depend on the luma channel for the input of the NNPF. chroma_luma_dependency_flag equal to 0 specifies that the chroma channels are independent of the luma channel for the input of the NNPF.
In an alternative example, one can support more cases.
luma_chroma_dependency_idc specifies the luma and chroma dependency for the input of the luma model and the chroma model, as specified in Table 7.
precision_format_idc has the same semantics as the syntax nnpf_param_prec_idc.
tensor_format_idc indicates the tensor format of the input and output tensor as specified in Table 8.
In Table 8, the variables N, C, H, and W denote the batch size, the number of channels, the height, and the width of the tensor, respectively.
log2_patch_size_minus6 plus 6 specifies the base 2 logarithm of the luma patch size. The value of log2_patch_size_minus6 shall be in the range 0 to 6 inclusive.
The variable PatchSize is defined as follows:
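The definition itself is not reproduced in this text. Based on the semantics of log2_patch_size_minus6, a consistent reconstruction, written in the shift notation used elsewhere in this document, would be:

    PatchSize = 1 << (log2_patch_size_minus6 + 6)

so that, for example, log2_patch_size_minus6 equal to 1 corresponds to a 128×128 luma patch, matching the worked example given later in this text.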
Note: PatchSize indicates both the height and the width of a patch. In another embodiment, one can specify the patch width and the patch height separately.
picture_padding_type indicates the picture padding type as specified in Table 9.
When the picture width and height are not multiples of PatchSize, padding is required based on picture_padding_type. The padding is applied at the bottom and/or the right of the picture. The decoded output picture width and height in units of luma samples are denoted by PicWidthInLumaSamples and PicHeightInLumaSamples, respectively, and the filtered picture width and height in units of luma samples are denoted by FilterPicWidthInLumaSamples and FilterPicHeightInLumaSamples, respectively. The derivation is as follows:
FilterPicWidthInLumaSamples = PicWidthInLumaSamples + PatchSize − (PicWidthInLumaSamples % PatchSize)
FilterPicHeightInLumaSamples = PicHeightInLumaSamples + PatchSize − (PicHeightInLumaSamples % PatchSize)
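A minimal Python sketch of the padded-size derivation above, applying the adjustment only when the decoded dimension is not already a multiple of PatchSize, per the condition stated above; the 3840×2160 interpretation of "4K" in the comment is an assumption:

    # Hedged sketch of the filtered-picture size derivation described above.
    def filtered_size(pic_size_in_luma_samples, patch_size):
        remainder = pic_size_in_luma_samples % patch_size
        if remainder == 0:
            return pic_size_in_luma_samples  # already aligned, no padding needed
        return pic_size_in_luma_samples + patch_size - remainder

    # Example: a 3840x2160 picture with PatchSize 128 pads only the height:
    # filtered_size(3840, 128) == 3840, filtered_size(2160, 128) == 2176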
patch_boundary_overlap_flag equal to 1 specifies that the patches overlap at the boundary. patch_boundary_overlap_flag equal to 0 specifies that the patches do not overlap at the boundary.
log2_boundary_overlap_minus3 plus 3 specifies the base-2 logarithm of the boundary overlap between horizontally and vertically neighboring patches. The value of the boundary overlap in units of luma samples is derived to be equal to (1 << (log2_boundary_overlap_minus3 + 3)). The value of log2_boundary_overlap_minus3 shall be in the range 0 to 2, inclusive.
It is noted the final input patch size to the NNPF is set equal to
One of the advantages of using NNPF SEI messaging over pure (standalone) NNPF is that NNPF SEI messaging is generated during encoding. This allows information related to bitstream characteristics to be included in the SEI, such as QP information, picture/slice type information, partition information, inter/intra map information, classification information, and temporal neighboring pictures, as input to the NNPF. To allow the device to get ready for such auxiliary input, one can indicate an auxiliary input information hint message in the CLVS-layer NNPF SEI and carry the more detailed information in the picture-layer SEI. An example of auxiliary input hint information is shown in Table 10.
nnpf_auxi_input_id contains an identifier number that may be used to identify the possible existence of NNPF auxiliary input information. nnpf_auxi_input_id equal to 0 indicates that no auxiliary input is used for NNPF in the CLVS. The value of nnpf_auxi_input_id is interpreted as follows:
nnpf_purpose indicates the purpose of the post-processing filter as specified in Table 12. The value of nnpf_purpose shall be in the range of 0 to 2^32 − 2, inclusive. Values of nnpf_purpose that do not appear in Table 12 are reserved for future specification by ITU-T|ISO/IEC and shall not be present in bitstreams conforming to this version of this Specification. Decoders conforming to this version of this Specification shall ignore SEI messages that contain reserved values of nnpf_purpose (Ref. [4]).
The nnpf_purpose syntax and semantics are taken from Ref. [4]. The allowed range is arguably larger than needed for a post-filter purpose indication.
nnpf_model_info_present_flag equal to 1 specifies that the nnpf model information is present in the SEI message. nnpf_model_info_present_flag equal to 0 specifies that the nnpf model information is not present in the SEI message.
It is noted that, when counting the number of models, in one embodiment one can count one model for both the luma and chroma components. That is, even if luma and chroma use separate models, because a picture can only be completed with both luma and chroma components by using both models, they are counted as one model. Therefore, num_nnpf_models=(nnpf_num_pic_type_minus1+1)*(nnpf_num_device_type_minus1+1). In another embodiment, luma-component and chroma-component models are counted individually, so that if luma and chroma use separate models, they are counted as two models. Therefore, num_nnpf_models=(nnpf_joint_model_flag==0?2:1)*(nnpf_num_pic_type_minus1+1)*(nnpf_num_device_type_minus1+1). In Table 11, the latter method is used.
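The two counting conventions can be summarized with the following sketch, which simply mirrors the formulas given above (the function names are illustrative):

    # Convention 1: a luma model and its chroma counterpart count as one model.
    def num_models_joint_count(num_pic_type_minus1, num_device_type_minus1):
        return (num_pic_type_minus1 + 1) * (num_device_type_minus1 + 1)

    # Convention 2 (used in Table 11): luma and chroma models are counted
    # separately when nnpf_joint_model_flag is 0.
    def num_models_separate_count(joint_model_flag,
                                  num_pic_type_minus1, num_device_type_minus1):
        per_component = 1 if joint_model_flag else 2
        return (per_component
                * (num_pic_type_minus1 + 1) * (num_device_type_minus1 + 1))

    # Example matching the worked example later in this text:
    # num_models_separate_count(0, 1, 0) == 4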
nnpf_num_pic_type_minus1 plus 1 indicates the number of picture types supported by the picture-type-based NNPF model. When not present, the value of nnpf_num_pic_type_minus1 is inferred to be equal to 0. The value shall be in the range of 0 to 3, inclusive.
nnpf_num_device_type_minus1 plus 1 indicates the number of device types supported by the device-type-based NNPF model. When not present, the value of nnpf_num_device_type_minus1 is inferred to be equal to 0. The value shall be in the range of 0 to 15, inclusive.
nnpf_model_id[i] contains an identifier number that may be used to identify the i-th NNPF model. When not present, the value of nnpf_model_id[i] is inferred to be equal to 0. The value of nnpf_model_id[i] shall be in the range of 0 to 255, inclusive. The value of nnpf_model_id[i] is interpreted as follows:
In another example, one can also add a QualityType indication, so that different decoded qualities can use different models. The quality can be determined by the picture-level QP.
num_of_ckpts_minus1[nnpf_model_id[i]] plus 1 specifies the number of checkpoints for nnpf_model_id[i]. The index of each checkpoint is in increasing order from 0 to num_of_ckpts_minus1[nnpf_model_id[i]], inclusive.
In the NN literature, a checkpoint (ckpt) is used to save model parameters, such as the weights and biases of a CNN. In this application, a ckpt implies that the same model topology is used; the difference between ckpts is the values of the model parameters.
nnpf_data_info_present_flag equal to 1 indicates that nnpf_data_info( ) is present in the SEI message. nnpf_data_info_present_flag equal to 0 indicates that the nnpf_data_info( ) is not present in the SEI message.
In alternative examples, one can associate nnpf_data_info( ) and nnpf_auxiliary_input_info( ) with nnpf_model_id to have higher flexibility.
It is noted that, in another embodiment, one can simply specify the number of NNPF models using a syntax element num_nnpf_models_minus1 and assign index i to nnpf_model_id[i]. The drawback of this method is that nnpf_model_id[i] has no specific meaning, and the decoder uses the NNPF model blindly. The advantage is that the bitstream can carry as many models as it prefers. In addition, one does not need to strictly differentiate checkpoints from models. For example, the bitstream can carry two different checkpoints for the same picture type, even though for any given picture only one checkpoint is used.
num_nnpf_models_minus1 plus 1 specifies the number of NNPF models.
The index of the models is in increasing order from 0 to num_nnpf_models_minus1, inclusive.
One benefit of using a picture-layer NNPF SEI (denoted as nnpf_pic_adapt_SEI( )) instead of standalone NNPF is that the SEI can carry adaptation information for each picture. The information can include such parameters as: picture-layer, luma/chroma-component, and CTU-layer NNPF on/off flags; picture/slice type; picture/slice QP; block-level QP; picture/slice/block-level classification; picture/slice-level inter/intra map; and the like.
To save bit overhead, nnpf_pic_adapt_SEI( ) can refer to the CLVS-level nnpf_sei( ) for high-level control.
The persistence scope of the nnpf_pic_adapt_SEI( ) is for the current picture.
As for signaling nnpf_pic_model_id, several methods can be used for Table 11: 1) nnpf_pic_model_id from nnpf_sei( ) can be signaled explicitly in nnpf_pic_adapt_SEI( ) at the cost of ue(v) bits. This explicit model is the base model. Bit 0 should always be 0 to indicate that nnpf_pic_model_id represents a luma model. The base model can convey PicType, DeviceType, or QualityType. If the model has a deviceType option, the user can select another model based on displayType and complexityType. 2) nnpf_pic_model_id is inferred from the other syntax in nnpf_pic_adapt_SEI( ) accordingly. If the model has a deviceType option, the user can select the right model based on displayType and complexityType. If the implicit model is used, one needs to signal nnpf_pic_type to select the model from the pool.
An additional nnpf_pic_model_id_chroma for the chroma components can be decided based on nnpf_joint_model_flag, derived as follows.
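The derivation itself is not reproduced in this text. One consistent reading, given that bit 0 of nnpf_pic_model_id identifies a luma model and that luma and chroma model IDs are paired (0/1 and 2/3) in the worked example given later, is sketched below; it is an assumption, not normative text:

    # Assumed derivation of the chroma model ID from the signaled luma model ID.
    def derive_pic_model_id_chroma(nnpf_joint_model_flag, nnpf_pic_model_id):
        if nnpf_joint_model_flag:
            return nnpf_pic_model_id    # joint model: chroma reuses the luma model
        return nnpf_pic_model_id + 1    # separate models: paired chroma model ID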
For region-related information, the region size can be implied to be the same as PatchSize in nnpf_sei( ), or explicitly signaled if the size differs from PatchSize. The region size in general should be no smaller than PatchSize and is preferably a multiple of PatchSize. For the QP map, classification map, or partition map inside the region, which are used to generate the auxiliary input, a smaller unit can be used, but one needs to consider the trade-off between accuracy and bit overhead.
The auxiliary input information should be generated from either picture-level or region-level information. For example, the QP map can be generated using picture-level QP or region-based QP information. The classification map can be generated using region-based inter/intra information. The partition map can be generated using region-based partition information.
Table 18 shows an example of nnpf_pic_adapt_SEI( ). In this example, for simplicity, the corresponding nnpf_model_id is sent directly. It allows switching NNPF on/off at the picture level and the CTU level. The region size is inferred to be the same as the PatchSize defined in nnpf_sei( ).
nnpf_pic_enabled_flag equal to 1 specifies nnpf is applied to the current picture. nnpf_pic_enabled_flag equal to 0 specifies nnpf is not applied to the current picture. When not present, the value of nnpf_pic_enabled_flag is inferred to be equal to 0.
nnpf_pic_luma_enabled_flag equal to 1 specifies nnpf is applied to the luma components of the current picture. nnpf_pic_luma_enabled_flag equal to 0 specifies nnpf is not applied to the luma components of the current picture. When not present, the value of nnpf_pic_luma_enabled_flag is inferred to be equal to 0.
nnpf_pic_chroma_enabled_flag equal to 1 specifies nnpf is applied to the chroma components of the current picture. nnpf_pic_chroma_enabled_flag equal to 0 specifies nnpf is not applied to the chroma components of the current picture. When not present, the value of nnpf_pic_chroma_enabled_flag is inferred to be equal to 0.
nnpf_pic_model_id specifies the nnpf_model_id used for the current picture.
nnpf_pic_ckpt_idx specifies the checkpoint index used for nnpf_pic_model_id. The value of nnpf_pic_ckpt_idx is in the range of 0 to num_of_ckpts_minus1[nnpf_pic_model_id], inclusive.
nnpf_qp_info_present_flag equal to 1 specifies that the current SEI contains QP information. nnpf_qp_info_present_flag equal to 0 specifies that the current SEI does not contain QP information. When not present, the value of nnpf_qp_info_present_flag is inferred to be equal to 0.
nnpf_region_info_present_flag equal to 1 specifies that the current SEI contains region information. nnpf_region_info_present_flag equal to 0 specifies that the current SEI does not contain region information. When not present, the value of nnpf_region_info_present_flag is inferred to be equal to 0.
nnpf_region_qp_present_flag equal to 1 specifies that the current SEI contains region based QP information. nnpf_region_qp_present_flag equal to 0 specifies that the current SEI does not contain region based QP information. When not present, the value of nnpf_region_qp_present_flag is inferred to be equal to 0.
nnpf_region_ptt_present_flag equal to 1 specifies that the current SEI contains region-based partition information. nnpf_region_ptt_present_flag equal to 0 specifies that the current SEI does not contain region-based partition information. When not present, the value of nnpf_region_ptt_present_flag is inferred to be equal to 0.
nnpf_region_clfc_present_flag equal to 1 specifies that the current SEI contains region-based classification information. nnpf_region_clfc_present_flag equal to 0 specifies that the current SEI does not contain region-based classification information. When not present, the value of nnpf_region_clfc_present_flag is inferred to be equal to 0.
Note: nnpf_region_qp/ptt/clfc_present_flag could also be implicitly inferred from nnpf_pic_model_id; for example, the region-level information may be needed only when picType is Intra.
nnpf_region_enabled_flag[i] equal to 1 specifies that the nnpf is enabled for the i-th region. nnpf_region_enabled_flag[i] equal to 0 specifies that the nnpf is not enabled for the i-th region. When not present, the value of nnpf_region_enabled_flag[i] is inferred to be equal to 0.
qp_delta_abs_map[i] has the same semantics as specified for cu_qp_delta_abs.
qp_delta_sign_map_flag[i] has the same semantics as specified for cu_qp_delta_sign_flag.
ptt_map[i] specifies the partition map for the i-th region. The partition map is represented using the same interpretation as MaxMttDepthY. The value is in the range of 0 to log2(PatchSize) − 3, inclusive.
clfc_map[i] specifies the classification map for the i-th region.
In one example, the classification map only indicates intra or inter.
The CLVS-layer NNPF SEI messaging of Table 11 (which may load data as defined in Tables 1-15) may require metadata information that is deemed too large or unnecessary in some applications. To reduce the payload size, an example of an alternative and simplified CLVS NNPF SEI message is illustrated in Table 19. To generate the syntax of Table 19, some of the earlier defined parameters were deleted as explained below.
The parameter nnpf_num_device_type_minus1 is skipped because of the lack of experimental support for NNPF across multiple devices. The parameter nnpf_model_upd_param_present_flag is skipped because it comes from Ref. [4] and there is no demonstrated need for it. The parameter nnpf_latency_idc is skipped because it requires tests under too many different resolution and frame-rate configurations; even if the results were available, they could only be based on a baseline GPU, and in practice devices may use a variety of GPU architectures, making this indicator less accurate or useful. The parameters input_chroma_format_idc and output_chroma_format_idc have been merged into one, nnpf_chroma_format_idc, since it is considered unlikely that in practice the input and output of the NNPF will have different chroma formats. The parameter precision_format_idc is skipped because its function of indicating precision may be considered a duplicate of the previously defined nnpf_param_prec_idc value. The parameter tensor_format_idc is skipped because it is highly correlated with the previously defined nnpf_model_storage_form_idc value; a storage format, such as ONNX, usually specifies the tensor format as well. The parameter patch_boundary_overlap_flag is skipped because a deblocking filter is generally applied in the bitstream, so for NNPF an overlap is most likely not needed.
Given the above syntax, an example of how to apply the NNPF SEI message in Table 17 is illustrated as follows. Suppose NNPF is used to improve visual quality; then nnpf_purpose is set to 0. Given the need to signal NNPF-model-related information, nnpf_model_info_present_flag is set to 1. Since the luma and chroma use different models, nnpf_joint_model_flag is set to 0. Different models are applied to intra and inter pictures; hence, nnpf_num_pic_type_minus1 is set to 1 and num_of_nnpf_models is set to 4 (luma/chroma and intra/inter). Given these four models, the value of nnpf_model_id[0] is set to 0, which is used for the luma component and intra pictures; the value of nnpf_model_id[1] is set to 1, which is used for the chroma component and intra pictures; the value of nnpf_model_id[2] is set to 2, which is used for the luma component and inter pictures; and the value of nnpf_model_id[3] is set to 3, which is used for the chroma component and inter pictures. The number of checkpoints provided for each model is 1, so num_of_ckpts_minus1[0]/[1]/[2]/[3] are all set to 0. An external web link can be provided for the two model IDs, so nnpf_model_exter_link_flag[0]/[1] is set to 1; the web link is coded using IETF Internet Standard 66. For all models, PyTorch is used, so nnpf_model_storage_form_idc[0]/[1] is set to 3. To indicate the model complexity, nnpf_model_complexity_ind_present_flag[0]/[1] is set to 1. The model uses single-precision floating-point format, so the value of nnpf_param_prec_idc[0]/[1] is set to 4. The number of model parameters for each ID is 214k = 1.6327*2^17, so the value of log2_nnpf_num_param_minus11[0]/[1] is set to 6, log2_prec_denom[0]/[1] is set to 5, and nnpf_num_param_frac[0]/[1] is set to 21; the maximum number of parameters is thus set equal to about 217k. The number of operations is 33.6 kMAC per pixel, so the value of nnpf_num_ops[0]/[1] is set to 34.
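The parameter-count figures in this example can be cross-checked with the tot_num_params reconstruction sketched earlier (still under the assumption stated there):

    # About 214k actual parameters versus a signaled cap of about 217k.
    actual_params = int(1.6327 * (1 << 17))                    # about 214k
    signaled_cap = int((1 + 21 / (1 << 5)) * (1 << (6 + 11)))  # 1.65625 * 2^17 = 217088
    assert actual_params <= signaled_cap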
Continuing with the signaled data format information, nnpf_data_info_present_flag is set to 1. The input and output of the NNPF are YUV 4:2:0, so nnpf_chroma_format_idc is set to 1 (4:2:0 format) and vui_matrix_coeffs is set to 1 or 9 (YUV). Since separate models are used for the luma and chroma components, nnpf_joint_model_flag is 0; hence, there is no need to signal packing_format_idc. The chroma model also uses luma information; hence, chroma_luma_dependency_flag is set to 1. The patch size is 128, so the value of log2_patch_size_minus6 is set to 1. Suppose the picture size is 4K; then padding needs to be added. For replicate padding, the value of picture_padding_type is set to 1. Since deblocking is used in the bitstream, no overlap between patches is used. For auxiliary input information, a QP map is used, and the value of nnpf_auxi_input_id is set to 1.
The picture-level NNPF SEI messaging of Table 18 may require region-level metadata, which may be too large or of little use in many applications. To reduce the overall payload size and focus on QP-mapping SEI information, an example of an alternative and simplified picture-level NNPF SEI message is illustrated in Table 20. To generate the syntax, some of the earlier defined parameters are deleted, as explained below.
where:
nnpf_pic_model_id_chroma specifies the index of the model used for the chroma components of the current picture. The value of nnpf_pic_model_id_chroma shall be in the range of 0 to nnpfc_max_num_models, inclusive, for this version of this Specification. When not present, the value of nnpf_pic_model_id_chroma is inferred to be equal to nnpf_pic_model_id.
nnpf_pic_ckpt_idx_chroma specifies the index of the checkpoint to be used with the model for the chroma components of the current picture. The value of nnpf_pic_ckpt_idx_chroma shall be in the range of 0 to nnpfc_max_num_ckpts_minus1[nnpf_pic_model_id_chroma], inclusive. When not present, the value of nnpf_pic_ckpt_idx_chroma is inferred to be equal to nnpf_pic_ckpt_idx.
Parameters related to region level messaging are all removed, and redundancies created by said parameters are also eliminated. More specifically, nnpf_region_info_present_flag is deemed unnecessary and redundant due to the use of nnpf_qp_info_present_flag. Similarly, nnpf_region_ptt_present_flag, ptt_map, and clfc_map are not needed if region-level partitioning is not available.
In Ref. [21], auxiliary input data can be present in the neural-network input tensor only when the value of nnpfc_inp_order_idc is equal to 3, i.e., when the input tensor is configured as four interleaved luma channels and two chroma channels. Currently, auxiliary input data cannot be present in the input tensor for luma-only, chroma-only, and 3-channel luma and chroma configurations, i.e., nnpfc_inp_order_idc equal to 0, 1, and 2, respectively. It is asserted that auxiliary input data can be beneficial for all input tensor configurations.
As suggested earlier (e.g., see Table 10 and the syntax parameter nnpf_auxi_input_id), it is proposed to add a syntax element nnpfc_auxiliary_input_idc and corresponding semantics to the NNPF CLVS SEI message, which in Ref. [21] is denoted as the NNPFC SEI message, so that the auxiliary data can be present in the input tensor for every allowed configuration of the input tensor, i.e., for every value of nnpfc_inp_order_idc. As in the current draft of the VSEI amendment (Ref. [21]), it is proposed that auxiliary input data be limited to a signal derived from the luma quantization parameter, SliceQpY. The parameter nnpfc_auxiliary_input_idc was also previously proposed in Ref. [22].
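As an illustration only, the following sketch forms an auxiliary input channel from SliceQpY by filling a constant plane that is stacked with the luma and/or chroma matrices. The normalization shown is an assumption made for the sketch; the normative derivation is the one given in Table 23 of Ref. [21].

    import numpy as np

    def make_qp_auxiliary_channel(height, width, slice_qp_y):
        # Fill one extra input-tensor channel with a constant derived from
        # SliceQpY; the scaling below is illustrative only.
        qp_norm = slice_qp_y / 63.0   # QP range assumed to be 0..63
        return np.full((height, width), qp_norm, dtype=np.float32)

    # The resulting plane is appended as one additional channel, increasing the
    # channel count by one for every value of nnpfc_inp_order_idc, as proposed.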
Colour description information for neural-network tensors cannot be signaled using the current text of Ref. [21]. It is asserted that colour description information for neural-network tensors can be beneficial. For example, ICTCp may be preferred when applying a neural-network post filter to an HDR WCG signal.
It is proposed to add syntax elements nnpfc_separate_colour_description_present_flag, nnpfc_colour_primaries, nnpfc_transfer_characteristics, and nnpfc_matrix_coeffs and corresponding semantics to the NNPFC SEI message. It is proposed that the syntax and semantics be modelled on those for the film grain characteristics SEI message.
Additionally, the following constraints are proposed for nnpfc_purpose, nnpfc_inp_order_idc, and nnpfc_out_order_idc when nnpfc_matrix_coeffs is equal to 0, which is typically used for GBR (RGB) and YZX 4:4:4 chroma format:
It is asserted that it can be beneficial to apply neural-network post-filters in a specific sequence when more than one neural-network post-filter is activated for the current picture. For example, an output tensor of a luma-only neural-network post-filter can be used to derive an input tensor of a luma-chroma neural-network post-filter. As another example, an output tensor of a neural-network post-filter that increases the width or height of a decoded picture (nnpfc_purpose equal to 2, 3, or 4) can be used to derive the input tensor of a neural-network post-filter that improves video quality (nnpfc_purpose equal to 1).
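A minimal sketch of the cascading described above; the filter callables and their names are hypothetical placeholders, not NNPFA syntax:

    def apply_filter_chain(decoded_picture, filters):
        # Apply the activated post-filters in the signaled order: each filter's
        # output tensor is used to derive the next filter's input tensor.
        x = decoded_picture
        for post_filter in filters:
            x = post_filter(x)
        return x

    # Example order: a resolution-changing filter (nnpfc_purpose 2, 3, or 4)
    # followed by a quality-improvement filter (nnpfc_purpose 1):
    # output = apply_filter_chain(decoded, [upscale_filter, quality_filter])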
It is proposed to add three syntax elements and corresponding semantics to the NNPFA SEI message as follows:
Given these proposed new syntax elements, the following table represents a revised NNPF CLVS (NNPFC) SEI message. Changes over Ref. [21] are denoted using an italic font.
(Table not reproduced in full; the syntax elements added relative to Ref. [21] are nnpfc_auxiliary_input_idc, coded as ue(v); nnpfc_separate_colour_description_present_flag, coded as u(1); and nnpfc_colour_primaries, nnpfc_transfer_characteristics, and nnpfc_matrix_coeffs, each coded as u(8).)
Compared to the original text and semantics for NNPFC, the following amendments are proposed.
This SEI message specifies a neural network that may be used as a post-processing filter. The use of specified post-processing filters for specific pictures is indicated with neural-network post-filter activation SEI messages.
Use of this SEI message requires the definition of the following variables:
When this SEI message specifies a neural network that may be used as a post-processing filter, the semantics specify the derivation of the luma sample array FilteredYPic[y][x] and chroma sample arrays FilteredCbPic[y][x] and FilteredCrPic[y][x], as indicated by the value of nnpfc_out_order_idc, that contain the output of the post-processing filter.
nnpfc_auxiliary_input_idc not equal to 0 specifies that auxiliary input data is present in the input tensor of the neural-network post-filter. nnpfc_auxiliary_input_idc equal to 0 indicates that auxiliary input data is not present in the input tensor. nnpfc_auxiliary_input_idc equal to 1 specifies that the auxiliary input data is derived as specified in Table 23. Values of nnpfc_auxiliary_input_idc greater than 1 are reserved for future specification by ITU-T|ISO/IEC and shall not be present in bitstreams conforming to this version of this Specification. Decoders conforming to this version of this Specification shall ignore SEI messages that contain reserved values of nnpfc_auxiliary_input_idc.
nnpfc_separate_colour_description_present_flag equal to 1 indicates that a distinct combination of colour primaries, transfer characteristics, and matrix coefficients for the neural-network post-filter characteristics specified in the SEI message is present in the neural-network post-filter characteristics SEI message syntax. nnpfc_separate_colour_description_present_flag equal to 0 indicates that the combination of colour primaries, transfer characteristics, and matrix coefficients for the neural-network post-filter characteristics specified in the SEI message is the same as indicated in the VUI parameters for the CLVS.
nnpfc_colour_primaries has the same semantics as specified in clause 7.3 of Ref. [3] for the vui_colour_primaries syntax element, except as follows:
For the semantics of nnpfc_inp_order_idc, the number of channels in the input tensor depends on nnpfc_auxiliary_input_idc as follows:
For nnpfc_inp_order_idc equal to 0: when nnpfc_auxiliary_input_idc is equal to 0, one luma matrix is present in the input tensor, thus the number of channels is 1; otherwise, nnpfc_auxiliary_input_idc is not equal to 0 and one luma matrix and one auxiliary input matrix are present, thus the number of channels is 2.
For nnpfc_inp_order_idc equal to 1: when nnpfc_auxiliary_input_idc is equal to 0, two chroma matrices are present in the input tensor, thus the number of channels is 2; otherwise, nnpfc_auxiliary_input_idc is not equal to 0 and two chroma matrices and one auxiliary input matrix are present, thus the number of channels is 3.
For nnpfc_inp_order_idc equal to 2: when nnpfc_auxiliary_input_idc is equal to 0, one luma and two chroma matrices are present in the input tensor, thus the number of channels is 3; otherwise, nnpfc_auxiliary_input_idc is not equal to 0 and one luma matrix, two chroma matrices, and one auxiliary input matrix are present, thus the number of channels is 4.
For nnpfc_inp_order_idc equal to 3: when nnpfc_auxiliary_input_idc is equal to 0, four luma matrices and two chroma matrices are present in the input tensor, thus the number of channels is 6; otherwise, nnpfc_auxiliary_input_idc is not equal to 0 and four luma matrices, two chroma matrices, and one auxiliary input matrix are present, thus the number of channels is 7.
Because of the proposed new syntax, Table 23 in Ref. [21] may be updated as follows.
(Updated Table 23 not reproduced; its entries are conditioned on nnpfc_component_last_flag = = 0.)
In Ref. [21], the picture-layer NNPF message is denoted as the NNPFA SEI message. Proposed amendments to the existing syntax are denoted in Table 24 in italics.
(Table 24 not reproduced in full; the proposed additions comprise nnpfa_independent_flag, coded as u(1), and, when !nnpfa_independent_flag holds, two further syntax elements, each coded as ue(v).)
This SEI message specifies the neural-network post-processing filter that may be used for post-processing filtering for the current picture and conveys information on dependencies, if any, on other neural-network post-filters that may be present for the current picture.
The neural-network post-processing filter activation SEI message persists only for the current picture.
As discussed in Ref. [19], in certain applications it may be necessary to define the priority order in which multiple SEI messages may be executed. As examples, priority is important when considering SEI messages for FGC (Film Grain Characteristics) and CTI (Colour Transform Information). In HEVC and AVC, the post-filter hint, tone mapping information, and chroma resampling filter hint SEI messages are additional examples of SEI messages that need to be considered when defining their processing order. The processing order of NNPF SEI messaging should also be considered. The specific order needs to be decided by the use case and can be transmitted, as suggested in the proposed processing-order SEI (Ref. [19]), along with the bitstream. As an example, suppose the bitstream carries SDR (standard dynamic range) video and FGC, CTI, and NNPF SEI messaging, where the CTI SEI is used to convert the SDR video to HDR video, and the NNPF SEI is used for quality improvement on the decoded SDR video. In an embodiment, the proposed order may be: first, NNPF SEI (to improve the decoded video quality); next, CTI SEI (to convert SDR to HDR); and finally, FGC SEI (to add the film grain effect for the final display). For example, if the film grain were added earlier, the added film grain noise might be amplified during the SDR-to-HDR conversion.
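A sketch of the processing order discussed in this example; the function names are placeholders for the respective SEI-driven operations and are not defined by any standard:

    def post_decode_pipeline(decoded_sdr_picture,
                             apply_nnpf, apply_cti_sdr_to_hdr, apply_fgc):
        # 1) NNPF SEI: improve the decoded SDR picture quality.
        improved = apply_nnpf(decoded_sdr_picture)
        # 2) CTI SEI: convert the improved SDR picture to HDR.
        hdr = apply_cti_sdr_to_hdr(improved)
        # 3) FGC SEI: add film grain last, so that the grain is not amplified
        #    by the SDR-to-HDR conversion.
        return apply_fgc(hdr)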
Each one of the references listed herein is incorporated by reference in its entirety. The term JVET refers to the Joint Video Experts Team of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29.
Embodiments of the present invention may be implemented with a computer system, systems configured in electronic circuitry and components, an integrated circuit (IC) device such as a microcontroller, a field programmable gate array (FPGA), or another configurable or programmable logic device (PLD), a discrete time or digital signal processor (DSP), an application specific IC (ASIC), and/or apparatus that includes one or more of such systems, devices or components. The computer and/or IC may perform, control, or execute instructions relating to the carriage of neural network topology and parameters as related to NNPF in image and video coding, such as those described herein. The computer and/or IC may compute any of a variety of parameters or values that relate to the carriage of neural network topology and parameters as related to NNPF in image and video coding described herein. The image and video embodiments may be implemented in hardware, software, firmware and various combinations thereof.
Certain implementations of the invention comprise computer processors which execute software instructions which cause the processors to perform a method of the invention. For example, one or more processors in a display, an encoder, a set top box, a transcoder, or the like may implement methods related to the carriage of neural network topology and parameters as related to NNPF in image and video coding as described above by executing software instructions in a program memory accessible to the processors. Embodiments of the invention may also be provided in the form of a program product. The program product may comprise any non-transitory and tangible medium which carries a set of computer-readable signals comprising instructions which, when executed by a data processor, cause the data processor to execute a method of the invention. Program products according to the invention may be in any of a wide variety of non-transitory and tangible forms. The program product may comprise, for example, physical media such as magnetic data storage media including floppy diskettes, hard disk drives, optical data storage media including CD ROMs, DVDs, electronic data storage media including ROMs, flash RAM, or the like. The computer-readable signals on the program product may optionally be compressed or encrypted.
Where a component (e.g. a software module, processor, assembly, device, circuit, etc.) is referred to above, unless otherwise indicated, reference to that component (including a reference to a “means”) should be interpreted as including as equivalents of that component any component which performs the function of the described component (e.g., that is functionally equivalent), including components which are not structurally equivalent to the disclosed structure which performs the function in the illustrated example embodiments of the invention.
Example embodiments that relate to the carriage of neural network topology and parameters as related to NNPF in image and video coding are thus described. In the foregoing specification, embodiments of the present invention have been described with reference to numerous specific details that may vary from implementation to implementation. Thus, the sole and exclusive indicator of what is the invention, and what is intended by the applicants to be the invention, is the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction. Any definitions expressly set forth herein for terms contained in such claims shall govern the meaning of such terms as used in the claims. Hence, no limitation, element, property, feature, advantage or attribute that is not expressly recited in a claim should limit the scope of such claim in any way. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
This application claims priority of U.S. Provisional Patent Application No. 63/328,131 filed Apr. 6, 2022 and U.S. Provisional Patent Application No. 63/354,549 filed Jun. 22, 2022, each of which is incorporated by reference in its entirety.
Filing Document: PCT/US2023/017252 | Filing Date: 4/3/2023 | Country: WO