Embodiments of the present invention relate to a 3D data encoding apparatus and a 3D data decoding apparatus.
A 3D data encoding apparatus that converts 3D data into a two-dimensional image and encodes it using a video encoding scheme to generate encoded data, and a 3D data decoding apparatus that decodes the two-dimensional image from the encoded data and reconstructs it to generate 3D data, are provided to efficiently transmit or record 3D data. There is also a technique for performing filter processing on a two-dimensional image by using supplemental enhancement information of a deep learning post-filter.
Specific 3D data encoding schemes include, for example, MPEG-I ISO/IEC 23090-5 Visual Volumetric Video-based Coding (V3C) and Video-based Point Cloud Compression (V-PCC). V3C is used to encode and decode a point cloud including point positions and attribute information. V3C is also used to encode and decode multi-view videos and mesh videos through ISO/IEC 23090-12 (MPEG Immersive Video (MIV)) and ISO/IEC 23090-29 (Video-based Dynamic Mesh Coding (V-DMC)) that is currently being standardized. According to Supplemental Enhancement Information (SEI) of a neural network post-filter in NPL 1, neural network model information is transmitted using characteristics SEI to call a frame specified by activation SEI, whereby adaptive filter processing can be performed on point cloud data.
In NPL 1, filter processing is performed only on attributes, so there is a problem that filter processing using the relationship between occupancies, geometries, and attributes cannot be performed. Namely, refinement using information shared between images is not possible in a method of decoding 3D data using multiple images/videos.
It is an object of the present invention to solve the above problem in 3D data encoding and/or decoding using a video encoding/decoding scheme, to further reduce encoding distortion using auxiliary information of refinement, and to encode and/or decode 3D data with high quality.
To accomplish the object, an aspect of the present invention provides a 3D data decoding apparatus including a geometry decoder configured to decode a geometry frame from encoded data, and an attribute decoder configured to decode an attribute frame from the encoded data, wherein a refinement information decoder is comprised, the refinement information decoder being configured to decode refinement characteristics information and refinement activation information from the encoded data, a syntax element indicating the number of refinements is decoded from the activation information, and an index indicating the characteristics information for the decoded number of refinements is decoded, and a refinement processing unit is comprised, the refinement processing unit being configured to perform refinement processing on the attribute frame or the geometry frame according to the characteristics information.
A 3D data encoding apparatus is provided that includes a geometry encoder configured to encode a geometry frame and an attribute encoder configured to encode an attribute frame, wherein a refinement information encoder is comprised, the refinement information encoder being configured to encode refinement characteristics information and refinement activation information into the encoded data, a syntax element indicating the number of refinements is encoded in the activation information and an index indicating the characteristics information for the encoded number of refinements is encoded, and a refinement processing unit is comprised, the refinement processing unit being configured to perform refinement processing on the attribute frame or the geometry frame according to the characteristics information.
According to an aspect of the present invention, it is possible to reduce distortion caused by encoding a color image and to encode and/or decode 3D data with high quality.
Embodiments of the present invention will be described below with reference to the drawings.
The 3D data transmission system 1 is a system that transmits an encoding stream obtained by encoding 3D data to be encoded, decodes the transmitted encoding stream, and displays 3D data. The 3D data transmission system 1 includes a 3D data encoding apparatus 11, a network 21, a 3D data decoding apparatus 31, and a 3D data display apparatus 41.
3D data T is input to the 3D data encoding apparatus 11.
The network 21 transmits an encoding stream Te generated by the 3D data encoding apparatus 11 to the 3D data decoding apparatus 31. The network 21 is the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or a combination thereof. The network 21 is not limited to a bidirectional communication network and may be a unidirectional communication network that transmits broadcast waves for terrestrial digital broadcasting, satellite broadcasting, or the like. The network 21 may be replaced by a storage medium on which the encoding stream Te is recorded, such as a Digital Versatile Disc (DVD) (trade name) or a Blu-ray Disc (BD) (trade name).
The 3D data decoding apparatus 31 decodes each encoding stream Te transmitted by the network 21 and generates one or more pieces of decoded 3D data Td.
The 3D data display apparatus 41 displays all or some of one or more pieces of decoded 3D data Td generated by the 3D data decoding apparatus 31. The 3D data display apparatus 41 includes a display apparatus such as, for example, a liquid crystal display or an organic electro-luminescence (EL) display. Examples of display types include stationary, mobile, and HMD. The 3D data display apparatus 41 displays a high quality image in a case that the 3D data decoding apparatus 31 has high processing capacity and displays an image that does not require high processing or display capacity in a case that it has only lower processing capacity.
Operators used in the present specification will be described below.
A data structure of the encoding stream Te generated by the 3D data encoding apparatus 11 and decoded by the 3D data decoding apparatus 31 will be described.
Each V3C unit includes a V3C unit header and a V3C unit payload. The V3C unit header contains a Unit Type, which is an ID indicating the type of the V3C unit and takes a value indicated by a label such as V3C_VPS, V3C_AD, V3C_AVD, V3C_GVD, or V3C_OVD.
In a case that the Unit Type is a V3C_VPS (Video Parameter Set), the V3C unit payload includes a V3C parameter set.
In a case that the Unit Type is V3C_AD (Atlas Data), the V3C unit payload includes a VPS ID, an atlasID, a sample stream NAL header, and multiple NAL units. ID is an abbreviation for identification and has an integer value of 0 or more. This atlasID may be used as an element of activation SEI.
Each NAL unit includes a NALUnitType, a layerID, a TemporalID, and a Raw Byte Sequence Payload (RBSP).
A NAL unit is identified by NALUnitType and includes an Atlas Sequence Parameter Set (ASPS), an Atlas Adaptation Parameter Set (AAPS), an Atlas Tile Layer (ATL), Supplemental Enhancement Information (SEI), and the like.
The ATL includes an ATL header and an ATL data unit and the ATL data unit includes information on positions and sizes of patches or the like such as patch information data.
The SEI includes a payloadType indicating the type of the SEI, a payloadSize indicating the size (number of bytes) of the SEI, and an sei_payload which is data of the SEI.
In a case that the UnitType is V3C_AVD (Attribute Video Data, attribute data), the V3C unit payload includes a VPS ID, an atlasID, an attrIdx which is an attribute frame ID (whose syntax name is a vuh_attribute_index), a partIdx which is a partition ID (vuh_attribute_partition_index), a mapIdx which is a map ID (vuh_map_index), a flag auxFlag (vuh_auxiliary_video_flag) indicating whether the data is auxiliary data, and a video stream. The video stream indicates data such as HEVC and VVC.
In a case that the UnitType is V3C_GVD (Geometry Video Data, geometry data), the V3C unit payload includes a VPS ID, an atlasID, a mapIdx, an auxFlag, and a video stream.
In a case that the UnitType is V3C_OVD (Occupancy Video Data, occupancy data), the V3C unit payload includes a VPS ID, an atlasID, and a video stream.
The attribute data and the geometry data have multiple maps distinguished from one another by mapIdx, and the attribute data has multiple attributes distinguished from one another by attrIdx.
Three-dimensional stereoscopic information (3D data) in the present specification is a set of position information (x, y, z) and attribute information in a three-dimensional space. For example, 3D data is expressed in the format of a point cloud that is a group of points with position information and attribute information in a three-dimensional space or a mesh having triangle (or polygon) vertices and faces.
Each of the occupancy frames, geometry frames, attribute frames, and atlas information may be an image obtained by mapping (packing) partial images (patches) from different projection planes onto a certain two-dimensional image. The atlas information includes information on the number of patches and the projection planes corresponding to the patches. The 3D data decoding apparatus 31 reconstructs the coordinates and attribute information of a point cloud or a mesh from the atlas information, the occupancy frame, the geometry frame, and the attribute frame. Here, points are points of a point cloud or vertices of a mesh. Instead of the occupancy frame and the geometry frame, mesh information (position information) indicating the vertices of the mesh may be encoded, decoded, and transmitted. Mesh information may also be encoded, decoded, and transmitted after being divided into a base mesh that forms a basic mesh that is a subset of the mesh and a mesh displacement. The mesh displacement indicates a displacement from the base mesh to indicate a mesh part other than the basic mesh.
The 3D data decoding apparatus 31 includes a V3C unit decoder 301, an atlas decoder 302 (a refinement information decoder), an occupancy decoder 303, a geometry decoder 304, an attribute decoder 305, a post-decoding converter 308, a pre-reconstructor 310, a reconstructor 311, and a post-reconstructor 312.
The V3C unit decoder 301 receives encoded data (a bitstream) such as that of a byte stream format or an ISO Base Media File Format (ISOBMFF) and decodes a V3C unit header and a V3C VPS. The V3C unit decoder 301 selects the atlas decoder 302, the occupancy decoder 303, the geometry decoder 304, or the attribute decoder 305 according to the UnitType of the V3C unit header. The V3C unit decoder 301 uses the atlas decoder 302 in a case that the UnitType is V3C_AD and uses the occupancy decoder 303, the geometry decoder 304, or the attribute decoder 305 to decode an occupancy frame, a geometry frame, or an attribute frame in a case that the UnitType is V3C_OVD, V3C_GVD, or V3C_AVD.
The atlas decoder 302 receives atlas data and decodes atlas information.
The atlas decoder 302 (a refinement information decoder) decodes characteristics SEI indicating characteristics of refinement processing from encoded data. The refinement information decoder decodes information on a target to which refinement is to be applied (refinement target information). Further, the atlas decoder 302 decodes activation SEI from the encoded data.
The atlas decoder 302 decodes an identifier atlasID indicating target atlas information indicating a refinement target from a V3C unit including the activation SEI.
The occupancy decoder 303 decodes occupancy data encoded using VVC, HEVC, or the like and outputs an occupancy frame DecOccFrames[frameIdx][compIdx][y][x]. Here, DecOccFrames, frameIdx, compIdx, y, and x respectively indicate a decoded occupancy frame, a frame ID, a component ID, a row index, and a column index. In DecOccFrames, compIdx=0 may be set.
The geometry decoder 304 decodes geometry data encoded using AVC, VVC, HEVC, or the like and outputs a geometry frame DecGeoFrames[mapIdx][frameIdx][compIdx][y][x]. Here, DecGeoFrames, frameIdx, mapIdx, compIdx, y, and x respectively indicate a decoded geometry frame, a frame ID, a map ID, a component ID, a row index, and a column index. DecGeoBitDepth, DecGeoHeight, DecGeoWidth, and DecGeoChromaFormat refer to the bit-depth of the geometry frame, the height of the geometry frame, the width of the geometry frame, and the chroma format of the geometry frame. The decoded geometry frame may include multiple geometry maps (geometry frames with projections of different depths) and mapIdx is used to distinguish between the maps. In DecGeoFrames, compIdx=0 may be set.
The attribute decoder 305 decodes attribute data encoded using VVC, HEVC, or the like and outputs an attribute frame DecAttrFrames[attrIdx][partIdx][mapIdx][frameIdx][compIdx][y][x]. Here, DecAttrFrames, frameIdx, attrIdx, partIdx, mapIdx, compIdx, y, and x respectively indicate a decoded attribute frame, a frame ID, an attribute ID, a partition ID, a map ID, a component ID, a row index, and a column index. DecAttrBitDepth, DecAttrHeight, DecAttrWidth, and DecAttrChromaFormat respectively indicate the bit-depth, height, width, and chroma format of the attribute frame. The decoded attribute frame may include multiple attribute maps (attribute frames with projections of different depths) and mapIdx is used to distinguish between the maps. The decoded attribute frame includes multiple attributes such as color (R, G, B), reflection, alpha, and normal direction. Multiple attributes can be transmitted through multiple pieces of attribute data and attrIdx is used to distinguish them. For example, {R, G, B} is attribute data 0 (attrIdx=0), {reflection} is attribute data 1 (attrIdx=1), and {alpha} is attribute data 2 (attrIdx=2). An attribute can be divided into and transmitted in multiple bitstreams and partIdx is used to distinguish between them. mapIdx is as described above.
The post-decoding converter 308 receives the decoded atlas information, the decoded occupancy frame DecOccFrames, the decoded geometry frame DecGeoFrames, and the decoded attribute frame DecAttrFrames and converts them into nominal formats. The post-decoding converter 308 outputs OccFramesNF[frameIdx][CompTimeIdx][y][x], GeoFramesNF[mapIdx][CompTimeIdx][frameIdx][y][x], AttrFramesNF[attrIdx][mapIdx][CompTimeIdx][compIdx][y][x] which are the nominal formats of the occupancy frame, the geometry frame, and the attribute frame. Here, frameIdx, CompTimeIdx, y, x, mapIdx, attrIdx, and compIdx respectively indicate a frame ID, a composition time index, a row index, a column index, a map ID, an attribute ID, and a component ID.
A nominal format refers collectively to a nominal bit-depth, resolution, chroma format, and composition time index into which a decoded video is to be converted.
Each video sub-bitstream and each region of a packed video sub-bitstream are associated with a nominal bit-depth. This is the expected target bit-depth for all reconstruction operations.

The nominal occupancy bit-depth OccBitDepthNF is set to oi_occupancy_2d_bit_depth_minus1[ConvAtlasID]+1 or pin_occupancy_2d_bit_depth_minus1[ConvAtlasID]+1. oi_occupancy_2d_bit_depth_minus1[j]+1 indicates the nominal 2D bit-depth into which an occupancy frame of the atlas with atlasID=j is to be converted. pin_occupancy_2d_bit_depth_minus1[j]+1 indicates the nominal 2D bit-depth into which a decoded region including occupancy data of the atlas with atlasID=j is to be converted. In a case that pin_occupancy_present_flag[j] is equal to 0, it indicates that the packed video frame of the atlas with atlasID=j does not include a region having occupancy data. In a case that pin_occupancy_present_flag[j] is equal to 1, it indicates that the packed video frame of the atlas with atlasID=j includes a region having occupancy data. In a case that pin_occupancy_present_flag[j] is not present, its value is inferred to be equal to 0.

In a case that pin_geometry_present_flag[ConvAtlasID] is equal to 1, the nominal bit-depth GeoBitDepthNF of each geometry frame is set to gi_geometry_2d_bit_depth_minus1[ConvAtlasID]+1 or pin_geometry_2d_bit_depth_minus1[ConvAtlasID]+1. gi_geometry_2d_bit_depth_minus1[j]+1 indicates the nominal 2D bit-depth into which all geometry frames of the atlas with atlasID=j are to be converted. pin_geometry_2d_bit_depth_minus1[j]+1 indicates the nominal 2D bit-depth into which a decoded region including geometry data of the atlas with atlasID=j is to be converted. pin_geometry_present_flag[j]=0 indicates that the packed video frame of the atlas with atlasID=j does not include a region having geometry data. pin_geometry_present_flag[j]=1 indicates that the packed video frame of the atlas with atlasID=j includes a region having geometry data. In a case that pin_geometry_present_flag[j] is not present, its value is inferred to be equal to 0.

Finally, in a case that pin_attribute_present_flag[ConvAtlasID]=1, the nominal bit-depth AttrBitDepthNF[attrIdx] of each attribute frame with an attrIdx is set to ai_attribute_2d_bit_depth_minus1[ConvAtlasID][attrIdx]+1 or pin_attribute_2d_bit_depth_minus1[ConvAtlasID][attrIdx]+1. ai_attribute_2d_bit_depth_minus1[j][i] plus 1 indicates the nominal two-dimensional bit-depth into which all attribute frames with attrIdx=i are to be converted for the atlas with atlasID=j. pin_attribute_2d_bit_depth_minus1[j][i] plus 1 indicates the nominal two-dimensional bit-depth into which a region including an attribute with attrIdx=i is to be converted for the atlas with atlasID=j. pin_attribute_present_flag[j]=0 indicates that the packed video frame of the atlas with atlasID=j does not include a region of attribute data. pin_attribute_present_flag[j]=1 indicates that the packed video frame of the atlas with atlasID=j includes a region of attribute data.
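For illustration, this bit-depth selection can be sketched as follows (a minimal Python sketch; the function names and the dictionary layout of the decoded syntax elements are assumptions of this sketch, and keying the choice between the oi_/gi_/ai_ and pin_ variants on the corresponding present flag is a reading of the text above):

    def occ_bit_depth_nf(oi, pin, atlas_id):
        # Use the pin_ variant when the packed video frame carries occupancy.
        if pin.get("occupancy_present_flag", {}).get(atlas_id, 0) == 1:
            return pin["occupancy_2d_bit_depth_minus1"][atlas_id] + 1
        return oi["occupancy_2d_bit_depth_minus1"][atlas_id] + 1

    def geo_bit_depth_nf(gi, pin, atlas_id):
        if pin.get("geometry_present_flag", {}).get(atlas_id, 0) == 1:
            return pin["geometry_2d_bit_depth_minus1"][atlas_id] + 1
        return gi["geometry_2d_bit_depth_minus1"][atlas_id] + 1

    def attr_bit_depth_nf(ai, pin, atlas_id, attr_idx):
        if pin.get("attribute_present_flag", {}).get(atlas_id, 0) == 1:
            return pin["attribute_2d_bit_depth_minus1"][(atlas_id, attr_idx)] + 1
        return ai["attribute_2d_bit_depth_minus1"][(atlas_id, attr_idx)] + 1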
The ConvAtlasID is set equal to a vuh_atlas_id or is determined by external means in a case that no V3C unit header is available. The vuh_atlas_id is indicated by a V3C unit header such as V3C_AD, V3C_OVD, V3C_GVD, or V3C_AVD and specifies the ID of an atlas corresponding to the current V3C unit.
An asps_frame_width represents the frame width of the atlas as an integer number of samples which correspond to luma samples of a video component. It is a requirement for V3C bitstream conformance that asps_frame_width be equal to the value of vps_frame_width[j] (where j is the current atlasID). An asps_frame_height indicates the frame height of the atlas as an integer number of samples which correspond to luma samples of a video component. It is a requirement for V3C bitstream conformance that the value of asps_frame_height be equal to the value of vps_frame_height[j], where j indicates the current atlasID. The nominal frame resolution of an auxiliary video component is defined by the nominal width and height specified respectively by variables AuxVideoWidthNF and AuxVideoHeightNF. AuxVideoWidthNF and AuxVideoHeightNF are obtained from an auxiliary video sub-bitstream relating to the atlas.
The nominal chroma format is defined as 4:4:4.
The functions of the post-decoding converter 308 include bit-depth conversion, resolution conversion, output order conversion, atlas composition alignment, atlas dimension alignment, chroma upsampling, geometry map synthesis, and attribute map synthesis. Video frames provided by the V3C decoder 309 may require additional processing steps before being input to the reconstruction process. Such processing steps may include converting a decoded frame into a nominal format (e.g., a nominal resolution, bit-depth, or chroma format). Nominal format information is signaled in a V3C VPS.
The pre-reconstructor 310 may receive the decoded atlas information, the decoded occupancy frame, the decoded geometry frame, and the decoded attribute frame and refine/modify them. Specifically, in a case that an occupancy synthesis flag os_method_type[k] is equal to 1 indicating patch border filtering, the pre-reconstructor 310 starts occupancy synthesis with an OccFramesNF[compTimeIdx][0] and a GeoFramesNF[0][compTimeIdx][0] as inputs and a corrected array OccFramesNF[compTimeIdx][0] as an output. OccFramesNF indicates an occupancy frame decoded in a nominal format and GeoFramesNF indicates a geometry frame decoded in a nominal format. compTimeIdx is a composition time index.
The refiner 306 refines an occupancy, a geometry, and an attribute in units of frames in the pre-reconstructor 310. The refinement may be filtering that receives any of an occupancy, a geometry, and an attribute and outputs any of an occupancy, a geometry, and an attribute. The refiner 306 may perform refinement processing using a two-dimensional occupancy, geometry, and attribute at the same time.
The refiner 306 may further include an NN filter 611 or an ALF unit 610 and perform filter processing based on received neural network parameters or linear parameters.
The reconstructor 311 receives atlas information, an occupancy, and a geometry and reconstructs the positions and attributes of a point cloud or the vertices of a mesh in 3D space. The reconstructor 311 reconstructs mesh or point cloud data of 3D data based on the reconstructed geometry information (for example, recPcGeo) and attribute information (for example, recPcAttr). Specifically, the reconstructor 311 receives OccFramesNF[compTimeIdx][0][y][x], GeoFramesNF[mapIdx][compTimeIdx][0][y][x], and AttrFramesNF[attrIdx][mapIdx][compTimeIdx][ch][y][x], which are the nominal video frames derived by the pre-reconstructor 310, and reconstructs mesh or point cloud data of 3D data. AttrFramesNF indicates an attribute frame decoded in a nominal format. The reconstructor 311 derives a variable pointCnt as the number of points in a reconstructed point cloud frame, a one-dimensional array pointToPatch[pointCnt] as a patch index corresponding to each reconstructed point, and a two-dimensional array pointToPixel[pointCnt][dimIdx] as atlas coordinates corresponding to each reconstructed point. The reconstructor 311 also derives a 2D array recPcGeo[pointCnt][dimIdx] as a list of coordinates corresponding to each reconstructed point and a 3D array recPcAttr[pointCnt][attrIdx][compIdx] as attributes relating to the points in the reconstructed point cloud frame. Here, pointCnt, attrIdx, and compIdx correspond respectively to the number of reconstructed points, the attribute frame ID, and the component ID. dimIdx indexes the coordinate components of each reconstructed point.
Specifically, the reconstructor 311 derives recPcGeo and recPcAttr as follows.
Here, pIdx is the index of a patch. compTimeIdx represents a composition time index, rawPos1D represents a one-dimensional position, and gFrame and aFrame represent a geometry frame and an attribute frame, respectively. TilePatch3dOffsetU is a tile patch parameter relating to the patch. ai_attribute_count[j] indicates the number of attributes relating to the atlas with atlasID=j (the number of attributes encoded and decoded using the VPS). AtlasPatch3dOffsetU, AtlasPatch3dOffsetV, and AtlasPatch3dOffsetD are parameters indicating the position of the 3D bounding box of the patch.
Here, AtlasPatchRawPoints, AtlasPatch2dPosX, AtlasPatch2dPosY, and the like are variables relating to the patch that are derived from the atlas information.
Here, ai_attribute_dimension_minus1 is encoded and decoded using encoded data in the V3C VPS, and ai_attribute_dimension_minus1+1 indicates the total number of dimensions (i.e., the number of channels or components) of attributes decoded by the V3C unit decoder 301. asps_map_count_minus1 is encoded and decoded using encoded data in the V3C VPS, and asps_map_count_minus1+1 indicates the number of maps for geometry data and attribute data of the current atlas. RecAtlasID indicates a decoded atlas ID.
The post-reconstructor 312 changes (updates) the mesh or point cloud data of 3D data that has been processed by the reconstructor 311. The post-reconstructor 312 receives pointCnt as the number of reconstructed points of the current point cloud frame relating to the current atlas, a one-dimensional array attrBitDepth[ ] as a nominal bit-depth, oFrame[ ][ ], recPcGeo[ ][ ], and recPcAttr[ ][ ][ ], applies geometric smoothing to them, and outputs the changed recPcGeo[ ][ ] and recPcAttr[ ][ ][ ].
The refiner 306 (306DEC) applies refinement to DecOccFrames, DecGeoFrames, and DecAttrFrames that have been decoded by the occupancy decoder 303, the geometry decoder 304, and the attribute decoder 305 and have not been converted by the post-decoding converter 308 yet. Then, the refined DecOccFrames, DecGeoFrames, and DecAttrFrames are output to the post-decoding converter 308.
The refiner 306 receives DecOccFrames[0][frameIdx][y][x], DecGeoFrames[mapIdx][frameIdx][0][y][x] for each mapIdx and frameIdx, and/or DecAttrFrames[attrIdx][partIdx][mapIdx][frameIdx][compIdx][y][x] for each attrIdx, partIdx, mapIdx, frameIdx, and compIdx according to the refinement target information and outputs the corrected DecOccFrames, DecGeoFrames, and/or DecAttrFrames. The refiner 306 stores the output in DecOccFrames[0][frameIdx][y][x], DecGeoFrames[mapIdx][frameIdx][0][y][x], and/or DecAttrFrames[attrIdx][partIdx][mapIdx][frameIdx][compIdx][y][x] according to the refinement target information.
The refiner 306 may perform refinement processing on a two-dimensional array oFrame of asps_frame_height×asps_frame_width, a two-dimensional array gFrame of asps_frame_height×asps_frame_width, and/or a three-dimensional array aFrame of (ai_attribute_dimension_minus1[RecAtlasID][attrIdx]+1)×asps_frame_height×asps_frame_width. At this time, the refiner 306 may also receive oFrame=DecOccFrames[0][frameIdx], gFrame=DecGeoFrames[mapIdx][frameIdx][0], and/or aFrame=DecAttrFrames[attrIdx][partIdx][mapIdx][frameIdx]. Here, the attrIdx and mapIdx may be the values nnrc_attribute_index and nnrc_map_index obtained by decoding the NNRC SEI. That is, attrIdx=nnrc_attribute_index and mapIdx=nnrc_map_index. The frameIdx may be a compTimeIdx which is a composition time index. The attrIdx and mapIdx may be the values nnra_target_attribute_index and nnra_target_map_index obtained by decoding the NNRA_SEI. That is, attrIdx=nnra_target_attribute_index and mapIdx=nnra_target_map_index. The same applies below.
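As an illustration of this input selection, the following minimal sketch (in Python; the function name is hypothetical and the decoded arrays are assumed to be nested lists or arrays indexed as described above) picks out the frames passed to the refiner 306:

    def select_refiner_inputs(dec_occ, dec_geo, dec_attr,
                              attr_idx, part_idx, map_idx, frame_idx):
        # Indexing follows the text: DecOccFrames[0][frameIdx],
        # DecGeoFrames[mapIdx][frameIdx][0], and
        # DecAttrFrames[attrIdx][partIdx][mapIdx][frameIdx].
        o_frame = dec_occ[0][frame_idx]                              # [y][x]
        g_frame = dec_geo[map_idx][frame_idx][0]                     # [y][x]
        a_frame = dec_attr[attr_idx][part_idx][map_idx][frame_idx]   # [comp][y][x]
        return o_frame, g_frame, a_frame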
According to the above configuration, a network model to be applied (filter characteristics) is specified by characteristics SEI and any of an occupancy, a geometry, and an attribute is specified using refinement target information specified in the characteristics SEI, and refinement processing is applied to multiple inputs at the same time, thereby achieving the advantage of improving image quality.
The refiner 306 may perform refinement processing on a two-dimensional array oFrame of asps_frame_height×asps_frame_width, a two-dimensional array gFrame of asps_frame_height×asps_frame_width, and/or a three-dimensional array aFrame of (ai_attribute_dimension_minus1[RecAtlasID][attrIdx]+1)×asps_frame_height×asps_frame_width. At this time, the refiner 306 may also receive oFrame=OccFramesNF[compTimeIdx][0], gFrame=GeoFramesNF[mapIdx][compTimeIdx][frameIdx], and/or aFrame=AttrFramesNF[attrIdx][mapIdx][compTimeIdx].
In a case that the refinement target information indicates a geometry (S3064I), a geometry is added to the input tensor (S3065I). For example, the following (Eq. IN-GEO) may be used.
In a case that the refinement target information indicates an attribute (S3066I), an attribute is added to the input tensor (S3067I). For example, the following (Eq. IN-ATTR) may be used.
According to the above configuration, a network model to be applied (filter characteristics) is specified by characteristics SEI. Any of an occupancy, a geometry, and an attribute is specified using refinement target information specified in the characteristics SEI and refinement processing is applied to multiple inputs at the same time, thereby achieving the advantage of improving image quality.
The NN filter 611 performs filter processing using a neural network. The neural network is expressed by a neural network model and may include a convolution (Conv).
Here, a neural network model (hereinafter referred to as an NN model) means elements and connection relationships (a topology) of a neural network and parameters (weights and biases) of the neural network. The NN filter 611 may fix the topology and switch only the parameters depending on an image to be filtered. A neural network may include a convolution defined by a kernel size, the number of input channels, and the number of output channels.
Let DecFrame be an input to the refiner 306. The refiner 306 derives an input inputTensor to the NN filter 611 from the input image DecFrame, and the NN filter 611 performs filter processing based on the neural network model using the inputTensor to derive an outputTensor. The neural network model used is the model corresponding to an nnra_target_id. The input image may be an image for each component or may be an image having multiple components as channels.
The NN filter 611 may repeatedly apply the following process.
The NN filter 611 performs a convolution operation (conv or convolution) on the inputTensor with a kernel k[m][n][kh][kw] for the number of layers to generate an output image outputTensor to which a bias has been added.

Here, m is the number of channels of inputTensor, n is the number of channels of outputTensor, kh is the height of kernel k, and kw is the width of kernel k.

Each layer generates an outputTensor from an inputTensor, for example as follows:

outputTensor[nn][yy][xx]=Σ(k[mm][nn][j][i]*inputTensor[mm][yy+j−of][xx+i−of])+bias[nn]

Here, nn=0 . . . n−1, mm=0 . . . m−1, yy=0 . . . height−1, xx=0 . . . width−1, j=0 . . . kh−1, and i=0 . . . kw−1. "width" is the width of inputTensor and outputTensor and "height" is the height of inputTensor and outputTensor. Σ is the sum over mm=0 . . . m−1, j=0 . . . kh−1, and i=0 . . . kw−1. "of" is the width or height of the margin required around the inputTensor to generate the outputTensor.

In a case of 1×1 Conv, Σ represents the sum for each of mm=0 . . . m−1, j=0, and i=0. In this case, of=0 is set. In a case of 3×3 Conv, Σ represents the sum for each of mm=0 . . . m−1, j=0 . . . 2, and i=0 . . . 2. In this case, of=1 is set.

In a case that the value of yy+j−of is less than 0 or equal to or greater than "height", or in a case that the value of xx+i−of is less than 0 or equal to or greater than "width", the value of inputTensor[mm][yy+j−of][xx+i−of] may be 0. Alternatively, the value of inputTensor[mm][yy+j−of][xx+i−of] may be inputTensor[mm][yclip][xclip]. Here, yclip is max(0, min(yy+j−of, height−1)) and xclip is max(0, min(xx+i−of, width−1)).
In the next layer, the obtained outputTensor is used as a new inputTensor and the same process is repeated for the number of layers. An activation layer may be provided between layers. Pooling layers or skip connections may be used. A FilteredFrame is derived from the outputTensor finally obtained.
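The per-layer convolution described above can be sketched directly (an unoptimized Python/numpy sketch; it implements the edge-clipping variant of the boundary handling, and the layer loop and activations are omitted):

    import numpy as np

    def conv_layer(input_tensor, k, bias, of):
        # input_tensor: (m, height, width); k: (m, n, kh, kw); bias: (n,)
        m, h, w = input_tensor.shape
        _, n, kh, kw = k.shape
        out = np.empty((n, h, w))
        for nn in range(n):
            acc = np.full((h, w), float(bias[nn]))
            for mm in range(m):
                for j in range(kh):                       # vertical kernel index
                    ys = np.clip(np.arange(h) + j - of, 0, h - 1)
                    for i in range(kw):                   # horizontal kernel index
                        xs = np.clip(np.arange(w) + i - of, 0, w - 1)
                        # inputTensor[mm][yy+j-of][xx+i-of] with edge clipping
                        acc += k[mm, nn, j, i] * input_tensor[mm][np.ix_(ys, xs)]
            out[nn] = acc
        return out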
A process called Depth-wise Conv may also be performed using a kernel k′[n][kh][kw], represented, for example, by the following equation:

outputTensor[nn][yy][xx]=Σ(k′[nn][j][i]*inputTensor[nn][yy+j−of][xx+i−of])+bias[nn]

Here, nn=0 . . . n−1, xx=0 . . . width−1, and yy=0 . . . height−1, and Σ is the sum over j=0 . . . kh−1 and i=0 . . . kw−1.
Non-linear processing referred to as Activate, such as ReLU, may be used.

ReLU(x)=x>=0?x:0

Alternatively, leakyReLU shown in the following formula may be used:

leakyReLU(x)=x>=0?x:a*x
Here, a is a predetermined value less than 1, for example 0.1 or 0.125. To perform integer operations, all values of k, bias, and a described above may be set to integers and a right shift may be performed after conv to generate an outputTensor.
ReLU always outputs 0 for values less than 0 and outputs input values as they are for values equal to or greater than 0. In contrast, leakyReLU performs linear processing with a gradient a for values less than 0. In ReLU, the gradient for values less than 0 disappears, and learning may not advance steadily. leakyReLU leaves a gradient for values less than 0, making such a problem less likely to occur. Instead of the above leakyReLU(x), PReLU, which uses a parameterized value of a, may be used.
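The activation functions above may be sketched as follows (Python/numpy; the leak value a follows the text):

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def leaky_relu(x, a=0.125):
        # a is the predetermined leak value less than 1 (e.g., 0.1 or 0.125);
        # PReLU would instead treat a as a learned parameter.
        return np.where(x >= 0.0, x, a * x)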
attrIndex is set to nnrc_target_attribute_index and mapIndex is set to nnra_target_map_index[i], and the following is applied. With the input including OccFramesNF[compTimeIdx][0], GeoFramesNF[mapIdx][compTimeIdx][0], AttrFramesNF[attrIdx][mapIdx][compTimeIdx], and nnra_strength[i] and with the output including the changed GeoFramesNF[mapIdx][compTimeIdx][0] and AttrFramesNF[attrIdx][mapIdx][compTimeIdx], the following process is invoked.
Here, ai_attribute_dimension_minus1 indicates the number of elements minus 1. An element is an element of an attribute encoded and decoded using the VPS; the number of elements is normally 3 in the case of YUV or RGB.
In a configuration corresponding to simultaneous refinement of multiple attributes, mapIndex may be set to nnra_target_map_index[i] and the following may be applied. With the input including OccFramesNF[compTimeIdx][0], GeoFramesNF[mapIdx][compTimeIdx][0], AttrFramesNF[attrIdx][mapIdx][compTimeIdx], and nnra_strength[i] and with the output including the changed AttrFramesNF[attrIdx][mapIdx][compTimeIdx] for each target attrIdx, the following process is invoked.
The NN filter 611 performs NN filter processing and derives an outputTensor from the inputTensor. Refinement processing (filter processing) indicated by RefineFilter( ) may be performed in units of blocks (blockWidth×blockHeight) as described below.

Here, DeriveInputTensors( ) is a function indicating input data setting and StoreOutputTensors( ) is a function indicating output data storage. FrameWidth and FrameHeight are the sizes of input data and may be asps_frame_width and asps_frame_height. blockWidth and blockHeight are the width and height of each block.
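The block-wise driver may be sketched as follows (Python; argument lists are abbreviated, and DeriveInputTensors( )/StoreOutputTensors( ) are sketched separately below):

    def refine_frame(o_frame, g_frame, a_frame,
                     frame_width, frame_height,
                     block_width, block_height, nn_filter):
        # Apply RefineFilter() block by block over the frame.
        for y0 in range(0, frame_height, block_height):
            for x0 in range(0, frame_width, block_width):
                inp = derive_input_tensors(o_frame, g_frame, a_frame, x0, y0)
                out = nn_filter(inp)
                store_output_tensors(out, o_frame, g_frame, a_frame, x0, y0)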
In DeriveInputTensors( ), the NN filter 611 derives input data inputTensor[ ][ ][ ] to the NN filter 611 based on one or more of an occupancy frame oFrame, a geometry frame gFrame, and an attribute frame aFrame, in accordance with the refinement target information. The following is an example configuration, and a configuration that will be described later may also be used.
Here, OccupancyUsedFlag, GeometryUsedFlag, and AttributeUsedFlag are variables indicating the refinement target information, and syntax elements described later and obtained by decoding the characteristics SEI may be used as follows.
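One possible DeriveInputTensors( ) is sketched below (Python/numpy; for brevity the frames are assumed to be numpy arrays pre-padded by overlapSize pixels on every side, which is an assumption of this sketch):

    import numpy as np

    def derive_input_tensors(o_frame, g_frame, a_frame, x0, y0,
                             block_w, block_h, overlap,
                             occ_used, geo_used, attr_used):
        # occ_used/geo_used/attr_used correspond to OccupancyUsedFlag,
        # GeometryUsedFlag, and AttributeUsedFlag; frames are assumed
        # pre-padded by `overlap` pixels so the slices stay in range.
        ys = slice(y0, y0 + block_h + 2 * overlap)
        xs = slice(x0, x0 + block_w + 2 * overlap)
        chans = []
        if occ_used:
            chans.append(o_frame[ys, xs])
        if geo_used:
            chans.append(g_frame[ys, xs])
        if attr_used:
            for comp in a_frame:          # all channels of the selected attribute
                chans.append(comp[ys, xs])
        return np.stack(chans)            # inputTensor[ch][y][x]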
Further, another configuration will be described.
An example of this configuration is illustrated in the drawings.
The refiner 306 may derive an inputTensor according to the refinement target information. Hereinafter, ch indicates the index (position) of a channel to be set. ch++ is an abbreviation for ch=ch+1 and indicates that the index of the channel to be set is increased by 1.
In a case that the refinement target information indicates an occupancy (S3062I), an occupancy is added to the input tensor (S3063I). For example, the following (Equation IN-OCC) may be used.
In a case that the refinement target information indicates a geometry (S3064I), a geometry is added to the input tensor (S3065I). For example, the following (Equation IN-GEO) may be used.
In a case that the refinement target information indicates an attribute (S3066I), an attribute is added to the input tensor (S3067I). For example, as indicated in the following (Equation IN-ATTR-DIM), all channels of the attribute designated by the index attrIdx are input to inputTensor.
Note the following.
As another configuration, in a case that the refinement target information indicates an attribute, all channels of the attributes designated by the attrIdx values may be input to inputTensor. The attrIdx values start from nnrc_attribute_index, and the number of attributes is nnrc_attribute_num_minus1+1.
In a case that the refinement target information indicates a geometry (S3064O), a specific component of the output tensor is set to a geometry (S3065O). For example, the following (Equation OUT-GEO) may be used.
In a case that the refinement target information indicates an attribute (S3066O), a specific component of the output tensor is set to an attribute (S3067O). For example, as indicated by the following (Equation OUT-ATTR-DIM), all channels of the attribute designated by the index attrIdx may be set from the values of outputTensor.
As another configuration, in a case that the refinement target information indicates an attribute, all channels of the attributes designated by the attrIdx values may be set from the outputTensor. The attrIdx values start from nnrc_attribute_index, and the number of attributes is nnrc_attribute_num_minus1+1.
The refinement information decoder further decodes the number strengthNum of pieces of strength information from the characteristics information, and the refinement processing unit inputs at least one of the occupancy frame, the geometry frame, or the attribute frame to the input tensor of the neural network, and further inputs each of strengthNum values derived from the decoded strength information into a respective one of strengthNum channels of the input tensor of the neural network.
In StoreOutputTensors( ) below, the NN filter 611 sets the content of outputTensor to the changed occupancy frame oFrame, geometry frame gFrame and attribute frame aFrame according to the refinement target information.
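One possible StoreOutputTensors( ) is sketched below (Python/numpy; discarding the overlap margin on write-back is an assumption of this sketch, and the channel order mirrors the DeriveInputTensors( ) sketch above):

    def store_output_tensors(out, o_frame, g_frame, a_frame, x0, y0,
                             block_w, block_h, overlap,
                             occ_used, geo_used, attr_used):
        # Write back only the inner blockWidth x blockHeight region.
        dst = (slice(y0, y0 + block_h), slice(x0, x0 + block_w))
        src = (slice(overlap, overlap + block_h), slice(overlap, overlap + block_w))
        ch = 0
        if occ_used:
            o_frame[dst] = out[ch][src]
            ch += 1
        if geo_used:
            g_frame[dst] = out[ch][src]
            ch += 1
        if attr_used:
            for comp in a_frame:
                comp[dst] = out[ch][src]
                ch += 1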
Refinement specified by persistence information of NNRA_SEI and an identifier of characteristics SEI may be performed using a filter (a Wiener Filter) that uses a linear model. The filter processing with the linear filter may be performed in the ALF unit 610. Specifically, a filter target image DecFrame is divided into small regions (x=xSb . . . xSb+bSW−1, y=ySb . . . ySb+bSH−1) of a constant size (for example, 4×4 or 1×1). (xSb, ySb) are the coordinates of the upper left corner of the small region and bSW and bSH are the width and height of the small region. Then, filter processing is performed in units of small regions. A refined image outFrame is derived from the DecFrame using a selected filter coefficient coeff[ ].
Here, ofx and ofy are offsets of a reference position determined according to a filter position i. offset=1<<(shift−1). shift is a constant such as 6, 7, or 8 that corresponds to the precision of the filter coefficient.
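For illustration, one filtered sample may be computed as in the following sketch (Python; the coefficient layout and boundary handling are simplified assumptions of this sketch):

    def wiener_filter_sample(dec, y, x, coeff, ofy, ofx, shift):
        # coeff is the coefficient set selected for the small region
        # containing (y, x); ofy[i]/ofx[i] are the reference offsets for
        # filter position i, and offset = 1 << (shift - 1) rounds the sum.
        acc = 0
        for i in range(len(coeff)):
            acc += coeff[i] * dec[y + ofy[i]][x + ofx[i]]
        return (acc + (1 << (shift - 1))) >> shift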
The image may be classified into small regions and filter processing may be performed by selecting a filter coefficient coeff[classId][ ] according to a classId (=classId[y][x]) of each derived small region.
Here, DecFrame[ch]=inputTensor[ch][yP+overlapSize][xP+overlapSize]. That is, the following may be used.
The classId may be derived using an activity level or directionality of a block (one pixel in the case of a 1×1 block). For example, the classId may be derived using the activity level Act derived from a sum of absolute differences or the like as follows.
Here, i=0 . . . 7 and xi indicates a pixel adjacent to a target pixel x. The directions toward the adjacent pixels may be the eight directions toward the top, bottom, left, and right and the four diagonal directions at 45 degrees, for i=0 . . . 7. The following formula may be used.
Here, l, r, a, and b are abbreviations for left, right, above, and below, respectively, and indicate the pixels on the left, right, top, and bottom of x. Act may be quantized by a shift value and clipped to NumA−1 as follows, where NumA indicates the number of activity indices.
Act=Min(NumA−1,Act>>shift)
Further, the classId may be derived based on the following formula using the directionality D.
Here, for example, Act=0 . . . NumA−1 and D=0 . . . 4.
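A sketch of the class derivation follows (Python; since the combining formula is not reproduced in the text, classId=D*NumA+Act is an assumption of this sketch):

    def derive_class_id(act_raw, d, num_a, shift):
        # act_raw: activity before quantization (e.g., a sum of absolute
        # differences around the target pixel); d: directionality D = 0..4.
        act = min(num_a - 1, act_raw >> shift)   # Act = Min(NumA-1, Act >> shift)
        return d * num_a + act                   # combining formula assumed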
The ALF unit 610 directly outputs the outFrame as the FilteredFrame.
A persistence scope to which the characteristics SEI is applied is a CAS (coded atlas sequence). Namely, the characteristics SEI is SEI applied in units of CASs.
A syntax element nnrc_id indicates the ID of the characteristics SEI. In the activation SEI, the value of an nnrc_id of characteristics SEI indicating refinement characteristics to be applied is transmitted as a target ID (an nnra_target_id) to specify refinement processing to be applied.
In a case that an nnrc_mode_idc is 0, it indicates that this SEI message contains an ISO/IEC 15938-17 bitstream and specifies a base NNRE (a neural-network refinement) or specifies an update from a base NNRE having the same nnrc_id value.
In a case that the nnrc_mode_idc is equal to 1 and an NNRC SEI message is the first NNRC SEI message having a specific nnrc_id value within the current CAS in decoding order, a base NNRE for that nnrc_id value is identified as a neural network identified by the URI indicated by nnrc_uri, in a format identified by the tag URI nnrc_tag_uri. In a case that an NNRC SEI message is neither the first NNRC SEI message having a specific nnrc_id value within the current CAS in decoding order nor a repeat of the first NNRC SEI message in decoding order and the nnrc_mode_idc is 1, it indicates that an update to the base NNRE having the same nnrc_id value is defined by the URI indicated by nnrc_uri. The nnrc_uri is in a format identified by the tag URI nnrc_tag_uri.
The value of nnrc_mode_idc shall be in a range of 0 to 1 (inclusive) in bitstreams complying with this version of the specification.
In a case that this SEI message is the first NNRC SEI message having a specific nnrc_id value within the current CAS in decoding order, the same RefineFilter( ) as that of the base NNRE is assigned.
In a case that this SEI message is neither the first NNRC SEI message having a specific nnrc_id value within the current CAS in decoding order nor a repeat of the first NNRC SEI message in decoding order, an update defined in this SEI message is applied to the base NNRE to obtain a RefineFilter( ).
Updates are not cumulative and each update is applied to the base NNRE. The base NNRE is an NNRE specified in the first NNRC SEI message (in decoding order) having a specific nnrc_id value within the current CAS.
An nnrc_reserved_zero_bit_a shall be equal to 0 in bitstreams complying with this version of the specification. The decoder shall ignore NNRC SEI messages in which the nnrc_reserved_zero_bit_a is not 0.
An nnrc_tag_uri includes a tag URI having syntax and semantics specified in IETF RFC 4151 that identifies the format and related information of a neural network that is used as a base NNRE having the same nnrc_id value specified in an nnrc_uri or as an update to the base NNRE.
The nnrc_tag_uri being "tag:iso.org,2023:15938-17" indicates that neural network data identified by the nnrc_uri complies with ISO/IEC 15938-17.
An nnrc_uri includes a URI having syntax and semantics specified in IETF Internet Standard 66 that identifies a neural network used as a base NNRE or identifies an update to a base NNRE having the same nnrc_id value.
In a case that an nnrc_property_present_flag is equal to 1, it indicates that syntax elements relating to an input format, an output format, and complexity are present. In a case that the nnrc_property_present_flag is 0, it indicates that no syntax elements relating to an input format, an output format, and complexity are present.
In a case that this SEI message is the first NNRC SEI message having a specific nnrc_id value within the current CAS in decoding order, the nnrc_property_present_flag needs to be equal to 1.
In a case that the nnrc_property_present_flag is equal to 0, it is inferred that the values of all syntax elements that can be present only in a case that the nnrc_property_present_flag is equal to 1, and for which no inferred values are specified, are the same as those of the corresponding syntax elements in the NNRC SEI message including the base NNRE that is to be updated by this SEI message.
The nnrc_base_flag being equal to 1 indicates that the SEI message specifies a base NNRE. The nnrc_base_flag being equal to 0 indicates that the SEI message specifies an update relating to a base NNRE. In a case that no nnrc_base_flag is present, it is inferred that the value of the nnrc_base_flag is equal to 0.
The value of the nnrc_base_flag is subject to the following constraints:
The syntax elements nnrc_occupancy_flag, nnrc_geometry_flag, and nnrc_attribute_flag indicate targets of the refinement processing (refinement target information). Each of the syntax elements indicates whether an occupancy, a geometry, or an attribute is included in the refinement input or output.
The syntax element nnrc_attribute_index is refinement target information and indicates the index (attrIdx) of the attribute to be refined. For nnrc_attribute_flag!=0, nnrc_attribute_index is encoded and decoded. Although not shown, a syntax element nnra_all_attribute_flag may be decoded and encoded. nnra_all_attribute_flag=1 indicates that refinement processing is to be performed on all attributes. In other words, the attrIdx to which refinement is applied includes all attrIdx values 0 . . . ai_attribute_count[RecAtlasID]−1. For nnra_all_attribute_flag=1, nnrc_attribute_index and nnrc_attribute_num_minus1 are not encoded or decoded, but are inferred as nnrc_attribute_index=0 and nnrc_attribute_num_minus1=ai_attribute_count[RecAtlasID]−1. A list of nnrc_attribute_index may be inferred as 0 . . . ai_attribute_count[RecAtlasID]−1. For nnra_all_attribute_flag=0, nnrc_attribute_index and nnrc_attribute_num_minus1 are encoded and decoded.
The syntax element nnrc_attribute_num_minus1 is refinement target information and indicates the number of attributes to be refined. For nnrc_attribute_flag!=0, nnrc_attribute_num_minus1 is encoded and decoded.
The syntax element nnrc_strength_num is refinement input additional information and indicates the number strengthNum of the strength information nnra_strength input to the refinement processing (inputTensor).
A syntax element nnrc_inp_out_format_idc (input/output tensor format information) indicates a method of converting pixel values of the decoded image into input/output values for the refinement processing. In a case that the value of the nnrc_inp_out_format_idc is 0, input values to the refinement processing (especially, the input tensor) are real numbers (floating point values) specified in IEEE754 and a function Inp is specified as follows. The value range of the input tensor is 0 . . . 1.
In a case that the value of the nnrc_inp_out_format_idc is 1, the input and output values to the refinement processing are unsigned integers and the function Inp is specified as follows. The value range of the input tensor is 0 . . . (1<<inpTensorBitDepth)−1.
A value obtained by adding 8 to the value of the syntax element nnrc_inp_tensor_bitdepth_minus8 indicates the pixel bit-depth of the luma pixel value of the integer input tensor. The value of the variable inpTensorBitDepth is derived from the syntax element nnrc_inp_tensor_bitdepth_minus8 as follows.
inpTensorBitDepth=nnrc_inp_tensor_bitdepth_minus8+8
A block is an array of pixels. The refinement processing is performed in units of fixed blocks. A block may also be called a patch.
A syntax element nnrc_block_size_idc indicates the block size. The block size may be a multiple of 64 such as 64, 128, or 192, for example as follows:

blockSize=(nnrc_block_size_idc+1)<<6

The block size may also be defined from the nnrc_block_size_idc excluding 0, for example as follows:

blockSize=nnrc_block_size_idc<<6 (nnrc_block_size_idc>0)
A syntax element nnrc_overlap_size_idc specifies the number of horizontal and vertical pixels over which adjacent input tensors overlap. The value of nnrc_overlap_size_idc may be a multiple of 4 as follows.
overlapSize=nnrc_overlap_size_idc<<2
A function Out that converts each of a luma pixel value and a chroma pixel value output by post-processing into an integer value of a pixel bit-depth is specified as follows using the pixel bit-depth BitDepth.
Out(x)=Clip3(0,(1<<BitDepth)−1,Round(x*((1<<BitDepth)−1)))
The function Out is specified as follows.
The value of the syntax element nnrc_out_tensor_bitdepth_minus8 plus 8 specifies the pixel bit-depth of pixel values of an integer output tensor. The value of outTensorBitDepth is derived from the syntax element nnrc_out_tensor_bitdepth_minus8 as follows.
outTensorBitDepth=nnrc_out_tensor_bitdepth_minus8+8
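The conversions Inp and Out may be sketched as follows (Python; the plain bit-depth shift used for the integer Inp is an assumption of this sketch):

    def inp_real(x, bit_depth):
        # nnrc_inp_out_format_idc == 0: real-valued input tensor in 0..1
        return x / float((1 << bit_depth) - 1)

    def inp_int(x, bit_depth, inp_tensor_bit_depth):
        # nnrc_inp_out_format_idc == 1: unsigned integers in
        # 0..(1 << inpTensorBitDepth) - 1; the plain shift is assumed.
        d = inp_tensor_bit_depth - bit_depth
        return x << d if d >= 0 else x >> -d

    def out_real(x, bit_depth):
        # Out(x) = Clip3(0, (1 << BitDepth) - 1, Round(x * ((1 << BitDepth) - 1)))
        v = int(round(x * ((1 << bit_depth) - 1)))
        return max(0, min((1 << bit_depth) - 1, v))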
A syntax element nnrc_reserved_zero_bit_b shall be equal to 0.
A syntax element nnrc_payload_byte[i] contains the i-th byte of an ISO/IEC 15938-17 compliant bitstream. The concatenation of all nnrc_payload_byte[i] bytes shall form an ISO/IEC 15938-17 compliant bitstream.
For this activation SEI, the atlas decoder 302 (a refinement information decoder) and an atlas encoder 102 (a refinement information encoder) decode and encode the following syntax elements.
An nnra_target_id indicates the ID of the characteristics SEI to be applied (an identifier or identification information of refinement characteristics information). Refinement processing specified by the characteristics SEI having the same nnrc_id as the nnra_target_id is applied to an image.
An nnra_cancel_flag is a cancel flag. The nnra_cancel_flag being 1 indicates that persistence of refinement set for the image in already decoded NNRA_SEI is to be canceled. The nnra_cancel_flag being 0 indicates that a subsequent syntax element (nnra_persistence_flag) is to be transmitted, encoded, and decoded.
The nnra_persistence_flag indicates persistence information of a target refinement. In a case that the nnra_persistence_flag is 0, it indicates that the target refinement is applied only to pictures indicated by an atlasID. In a case that the nnra_persistence_flag is 1, it indicates that the target refinement indicated by the nnra_target_id is applied to the current picture and all subsequent pictures until one of the following conditions is met:
As illustrated in the drawings, the activation SEI includes a syntax element nnra_count_minus1 indicating the number of refinements. In the illustrated syntax structure, an nnra_target_id[i] indicating the characteristics information to be applied is encoded and decoded for each i of the refinements.
Furthermore, for each i, the second refinement target information includes nnra_target_map_index[i].
Furthermore, for each i, a syntax element indicating nnra_all_map_flag[i] may be decoded and encoded as the second refinement target information. Here, nnra_all_map_flag[i]=1 indicates that refinement processing is performed on all maps. In other words, the mapIdx to which refinement is applied includes all mapIdx values. For nnra_all_map_flag[i]=1, nnra_target_map_index[i] is not encoded or decoded, and the maps to which refinement is applied are inferred as mapIdx=0 . . . asps_map_count_minus1. For nnra_all_map_flag[i]=0, nnra_target_map_index[i] is encoded and decoded. Note that nnra_all_map_flag[i] and nnra_target_map_index[i] may be encoded and decoded by the characteristics SEI instead of the activation SEI. In this case, in the characteristics SEI, nnrc_all_map_flag and nnrc_target_map_index are encoded and decoded as corresponding syntax elements.
Furthermore, although not illustrated, nnra_all_partition_flag[i] may be included as second refinement target information for each i.
Here, nnra_all_partition_flag[i]=1 indicates that refinement processing is performed on all partitions. In other words, this indicates that the partIdx to which refinement is applied includes all partIdx values. For nnra_all_partition_flag[i]=1, nnra_target_partition_index[i] is not encoded or decoded, but is inferred to cover all partIdx values. For nnra_all_partition_flag[i]=0, nnra_target_partition_index[i] is encoded and decoded.
In a case that nnra_cancel_flag[i] indicates that cancellation is not performed, for example, in a case that nnra_cancel_flag[i]==0 (!nnra_cancel_flag[i]), the atlas decoder 302 decodes the persistence information nnra_persistence_flag[i], and further decodes nnra_strength_num[i] indicating the number of pieces of strength information to be applied and nnra_strength[i][j] indicating the strength information to be applied.
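A sketch of this decoding flow follows (Python; read_ue( )/read_flag( ) are assumed bitstream-reader helpers, and the exact descriptor of each syntax element is not asserted here):

    def decode_nnra(bs):
        count = bs.read_ue() + 1                          # nnra_count_minus1 + 1
        refinements = []
        for i in range(count):
            r = {"target_id": bs.read_ue(),               # nnra_target_id[i]
                 "map_index": bs.read_ue(),               # nnra_target_map_index[i]
                 "cancel": bs.read_flag()}                # nnra_cancel_flag[i]
            if not r["cancel"]:
                r["persistence"] = bs.read_flag()         # nnra_persistence_flag[i]
                num = bs.read_ue()                        # nnra_strength_num[i]
                r["strength"] = [bs.read_ue() for _ in range(num)]  # nnra_strength[i][j]
            refinements.append(r)
        return refinements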
For each i of i=0 . . . nnra_count_minus1, the refiner 306 provides mapIdx=nnra_target_map_index[i] and attrIdx=nnra_target_attribute_index as input data. The refiner 306 identifies oFrame=OccFramesNF[compTimeIdx][0], gFrame=GeoFramesNF[mapIdx][compTimeIdx][frameIdx][0], and aFrame=AttrFramesNF[attrIdx][mapIdx][compTimeIdx]. Then, the following processing is performed using the neural network model of the characteristics SEI (the NNRC SEI whose nnrc_id has the same value as nnra_target_id[i]) designated and activated by the activation information.
In a case that the refinement target information indicates an attribute, the attribute frame of the map indicated by the second refinement target information nnra_target_map_index[i] is added to the input tensor. For example, mapIdx=nnra_target_map_index[i] may be set in (Eq. IN-ATTR-DIM) and (Eq. IN-ATTR-POS).
In a case that the refinement target information indicates an occupancy, a specific component of the output tensor may be set to an occupancy. For example, (Eq. OUT-OCC) may be used.
In a case that the refinement target information indicates a geometry, a specific component of the output tensor is set to a geometry frame with the mapIdx indicated by the second refinement target information nnra_target_map_index[i]. For example, mapIdx=nnra_map_index[i] may be set in (Eq. OUT-GEO).
In a case that the refinement target information indicates an attribute, a specific component of the output tensor is set to an attribute indicated by the second refinement target information nnra_target_map_index. For example, mapIdx=nnra_target_map_index[i] may be set in (Eq. OUT-ATTR), (Eq. OUT-ATTR-DIM), and (Eq. OUT-ATTR-POS).
Further, in a case that an attribute is included as refinement target information, a syntax element nnrc_target_attribute_index[i] indicating an attrIdx to be refined may be included.
In a case that the refinement target information indicates an attribute, an attribute indicated by the second refinement target information nnrc_target_attribute_index and nnra_target_map_index is added to the input tensor. For example, mapIdx=nnra_target_map_index[i] (or nnrc_target_map_index) and attrIdx=nnrc_target_attribute_index (or nnra_target_attribute_index[i]) may be set in (Eq. IN-ATTR-POS2) and (Eq. IN-ATTR-MAP).
In a case that the refinement target information indicates an attribute, a specific component of the output tensor is set equal to an attribute indicated by the second refinement target information nnra_target_map_index[i] and nnra_target_attribute_index[i]. For example, mapIdx=nnra_target_map_index[i] (or nnrc_target_map_index) and attrIdx=nnrc_target_attribute_index (or nnra_target_attribute_index[i]) may be set in (Eq. OUT-ATTR), (Eq. OUT-ATTR-POS), and (Eq. OUT-ATTR-POS2). nnra_attribute_partition_index and nnra_auxiliary_video_flag may be further included.
In the above, the atlas decoder 302 decodes the syntax elements nnra_target_map_index from the activation SEI to derive the mapIdx to be refined and decodes the syntax element nnra_target_attribute_index[i] from the activation SEI to derive the attrIdx of the attribute to be refined. An attribute specified by the attrIdx is selected to perform refinement. This achieves the advantage that it is possible to apply refinement optimized to the specific mapIdx and attrIdx. Further, refinement specified by the same characteristics SEI can be applied to attributes indicated by an attrIdx having the same value. Refinements specified by different pieces of characteristics SEI can be applied to attributes indicated by different attrIdxs at the same time.
In summary, a 3D data decoding apparatus is provided that includes a geometry decoder configured to decode a geometry frame from encoded data and an attribute decoder configured to decode an attribute frame from the encoded data, wherein a refinement information decoder is comprised, the refinement information decoder being configured to decode refinement characteristics information and refinement activation information from the encoded data, a syntax element indicating the number of refinements is decoded from the activation information and an index indicating the characteristics information for the decoded number of refinements is decoded, and a refinement processing unit is comprised, the refinement processing unit being configured to perform refinement processing on the attribute frame or the geometry frame according to the characteristics information.
The refinement information decoder further decodes the number strengthNum of pieces of strength information from the characteristics information and decodes the strengthNum pieces of strength information from the characteristics information, and the refinement processing unit inputs at least one of an occupancy frame, a geometry frame, or an attribute frame to an input tensor of a neural network, and further inputs each of strengthNum values to a respective one of strengthNum channels of the input tensor of the neural network.
The refinement information decoder decodes a flag indicating whether to perform the refinement processing on all maps, and decodes an index indicating which map the refinement processing is to be performed on in a case that the flag indicates that the refinement processing is not to be performed on all the maps.
The refinement information decoder decodes an index of an attribute frame to be refined and the number of attribute frames from encoded data of the characteristics information or the activation information, inputs, to an input tensor of the neural network, attribute frames for the decoded number of attribute frames, with the decoded index of the attribute frame being a start point, and performs refinement processing.
The refinement information decoder decodes, from encoded data of the activation information, refinement characteristics information and information of a map to be refined, decodes a cancel flag indicating whether to cancel refinement processing related to the refinement characteristics information and the map to be refined, and further decodes persistence information in a case that the cancel flag indicates that the refinement processing is not to be canceled.
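The parsing order described above may be sketched as follows; reader is a hypothetical bitstream reader with ue(v)/u(1)-style accessors, and the field names follow the nnra_ syntax elements used in this description.

```python
def decode_refinement_activation(reader):
    """Sketch of the activation-SEI parsing order: target characteristics,
    target map, cancel flag, then persistence only if not canceled."""
    sei = {}
    sei['nnra_target_id'] = reader.read_ue()         # characteristics to apply
    sei['nnra_target_map_index'] = reader.read_ue()  # map to be refined
    sei['nnra_cancel_flag'] = reader.read_u1()
    if not sei['nnra_cancel_flag']:
        # Persistence information is present only when the refinement
        # processing is not being canceled.
        sei['nnra_persistence_flag'] = reader.read_u1()
    return sei
```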
A 3D data encoding apparatus is provided that includes a geometry encoder configured to encode a geometry frame and an attribute encoder configured to encode an attribute frame, wherein a refinement information encoder is comprised, the refinement information encoder being configured to encode refinement characteristics information and refinement activation information into the encoded data, a syntax element indicating the number of refinements is encoded into the activation information and an index indicating the characteristics information for the encoded number of refinements is encoded, and a refinement processing unit is comprised, the refinement processing unit being configured to perform refinement processing on the attribute frame or the geometry frame according to the characteristics information.
The refinement information encoder further encodes the number strengthNum of pieces of strength information into the characteristics information and encodes the strengthNum pieces of strength information into the characteristics information, and the refinement processing unit inputs at least one of an occupancy frame, a geometry frame, or an attribute frame to an input tensor of a neural network, and further inputs each of strengthNum values derived from the encoded strength information to a respective one of strengthNum channels of the input tensor of the neural network.
The refinement information encoder encodes a flag indicating whether to perform refinement processing on all maps, and encodes an index indicating which map the refinement processing is to be performed on in a case that the flag indicates that the refinement processing is not to be performed on all the maps.
The refinement information encoder encodes an index of an attribute frame to be refined and the number of attribute frames into encoded data of the characteristics information or the activation information, inputs, to an input tensor of the neural network, attribute frames for the encoded number of attribute frames, with the encoded index of the attribute frame being a start point, and performs refinement processing.
The refinement information encoder encodes, into encoded data of the activation information, refinement characteristics information and information of a map to be refined, encodes a cancel flag indicating whether to cancel refinement processing related to the refinement characteristics information and the map to be refined, and further encodes persistence information in a case that the cancel flag indicates that the refinement processing is not to be canceled.
The 3D data encoding apparatus 11 includes a patch generator 101, an atlas encoder 102, an occupancy generator 103, an occupancy encoder 104, a geometry generator 105, a geometry encoder 106, an attribute generator 108, an attribute encoder 109, a refinement parameter deriver 110, and a multiplexer 111. The 3D data encoding apparatus 11 receives a point cloud or a mesh as 3D data and outputs encoded data.
The patch generator 101 receives 3D data and generates and outputs a set of patches (here, rectangular images). Specifically, 3D data is divided into multiple regions and each region is projected onto one plane of a 3D bounding box set in 3D space to generate multiple patches. The patch generator 101 outputs information regarding the 3D bounding box (such as coordinates and sizes) and information regarding mapping to the projection planes (such as the projection planes, coordinates, sizes, and presence or absence of rotation of each patch) as atlas information.
The atlas encoder 102 encodes the atlas information output from the patch generator 101 and outputs atlas data.
The occupancy generator 103 receives the set of patches output from the patch generator 101 and generates an occupancy that represents valid areas of patches (areas where 3D data exists) as a 2D binary image (e.g., with 1 for a valid area and 0 for an invalid area). Here, other values such as 255 and 0 may be used for a valid area and an invalid area.
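The occupancy rasterization described above may be sketched as follows; the patch dictionary fields (mask, u0, v0) are hypothetical names for the patch's binary valid-area mask and its top-left placement in the 2D image.

```python
import numpy as np

def generate_occupancy(patches, height, width):
    """Sketch: rasterize each patch's valid-area mask into a 2D binary
    occupancy image (1 = 3D data exists, 0 = invalid area)."""
    occ = np.zeros((height, width), dtype=np.uint8)
    for p in patches:
        # p['mask'] is a uint8 binary array of the patch's valid pixels;
        # (p['u0'], p['v0']) is its top-left placement in the image.
        h, w = p['mask'].shape
        occ[p['v0']:p['v0'] + h, p['u0']:p['u0'] + w] |= p['mask']
    return occ
```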
The occupancy encoder 104 receives the occupancy output from the occupancy generator 103 and outputs an occupancy and occupancy data. VVC, HEVC, or the like is used as an encoding scheme.
The geometry generator 105 generates a geometry frame that stores depth values for the projection planes of patches based on the 3D data, the occupancy, the occupancy data, and the atlas information. The geometry generator 105 derives, as p_min(x, y, z), the point with the smallest depth from the projection plane among the points projected onto pixel g(x, y). The geometry generator 105 also derives, as p_max(x, y, z), the point with the largest depth among the points projected onto pixel g(x, y) and located within a predetermined distance d of p_min(x, y, z). The geometry frame obtained by projecting p_min(x, y, z) for all pixels onto the projection plane is set as the geometry frame of the Near layer. The geometry frame obtained by projecting p_max(x, y, z) for all pixels onto the projection plane is set as the geometry frame of the Far layer.
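The Near/Far derivation may be sketched as follows, assuming a per-pixel list of projected depths; the container layout and function name are illustrative only.

```python
import numpy as np

def near_far_layers(points_per_pixel, d):
    """Sketch: for each pixel g(x, y), Near stores the smallest depth among
    the points projected there; Far stores the largest depth among points
    within distance d of the Near point."""
    h = len(points_per_pixel)
    w = len(points_per_pixel[0])
    near = np.zeros((h, w), dtype=np.float32)
    far = np.zeros((h, w), dtype=np.float32)
    for y in range(h):
        for x in range(w):
            depths = points_per_pixel[y][x]  # list of projected depths
            if not depths:
                continue  # invalid pixel (no 3D data projected here)
            dmin = min(depths)
            near[y, x] = dmin
            far[y, x] = max(z for z in depths if z - dmin <= d)
    return near, far
```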
The geometry encoder 106 receives a geometry frame and outputs a geometry frame and geometry data. VVC, HEVC, or the like is used as an encoding scheme.
The attribute generator 108 generates an attribute frame that stores color information (e.g., YUV values and RGB values) for the projection plane of each patch based on the 3D data, the occupancy, the geometry frame, and the atlas information. The attribute generator 108 obtains a value of an attribute corresponding to the point p_min(x, y, z) with the minimum depth calculated by the geometry generator 105 and sets an attribute frame onto which the value is projected as an attribute frame of the Near layer. An attribute frame similarly obtained for p_max(x, y, z) is set as an attribute frame of the Far layer.
The attribute encoder 109 receives an attribute frame and outputs an attribute frame and attribute data. VVC, HEVC, or the like is used as an encoding scheme.
The refinement parameter deriver 110 receives the attribute frame and the original attribute frame, or the geometry frame and the original geometry frame, selects or derives optimal filter parameters for the NN filter processing, and outputs the optimal filter parameters. The refinement parameter deriver 110 sets values such as an nnra_target_id, an nnra_cancel_flag, and an nnra_persistence_flag in the SEI.
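One way the parameter selection might work is an exhaustive search that minimizes distortion against the original frame; the following is a sketch under that assumption (the candidate set, the refine callable, and the use of mean squared error are illustrative choices, not mandated by this description).

```python
import numpy as np

def derive_refinement_parameters(decoded, original, candidates, refine):
    """Sketch: pick the filter parameter set whose refined output is closest
    to the original frame (here, by mean squared error)."""
    best, best_mse = None, float('inf')
    for params in candidates:
        refined = refine(decoded, params)
        mse = float(np.mean((refined.astype(np.float64)
                             - original.astype(np.float64)) ** 2))
        if mse < best_mse:
            best, best_mse = params, mse
    return best
```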
The multiplexer 111 receives the filter parameters output from the refinement parameter deriver 110 and outputs them in a predetermined format. Examples of the predetermined format include SEI which is supplemental enhancement information of video data, an ASPS and an AFPS which are data structure specification information in the V3C standard, and an ISOBMFF which is a media file format. The multiplexer 111 multiplexes the atlas data, the occupancy data, the geometry data, the attribute data, and the filter parameters and outputs the multiplexed data as encoded data. A byte stream format, the ISOBMFF, or the like is used as a multiplexing method.
The 3D data encoding apparatus 11 includes a video encoder and an SEI encoder. In this configuration, the video encoder corresponds to the occupancy encoder 104, the geometry encoder 106, and the attribute encoder 109.
The 3D data decoding apparatus 31 includes a video decoder, an SEI decoder, a switch, and a refiner. The video decoder corresponds to the occupancy decoder 303, the geometry decoder 304, and the attribute decoder 305 in the configuration described above.
The video encoder encodes an occupancy, a geometry frame, and an attribute frame generated from 3D data. The video decoder decodes the encoded data to reconstruct a decoded image. The SEI encoder generates characteristics SEI and activation SEI from the 3D data, and the SEI decoder decodes these SEI messages. The activation SEI is input to a switch to specify an image on which the refinement processing is to be performed, and only that image is input to the refiner. The characteristics SEI is input to the refiner to specify the refinement to be applied to the decoded image. The image on which the refinement processing has been performed, or the decoded image, is displayed on the 3D data display apparatus 41.
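The decoder-side flow described above may be sketched as follows; the video_decoder, sei_decoder, and refiner objects and their method names are hypothetical placeholders for the components named in this description.

```python
def decode_and_refine(video_decoder, sei_decoder, refiner, bitstream):
    """Sketch: activation SEI acts as a switch selecting which decoded images
    are routed to the refiner; characteristics SEI selects the refinement
    applied there."""
    frames = video_decoder.decode(bitstream)  # occupancy/geometry/attribute images
    characteristics, activation = sei_decoder.decode(bitstream)
    out = []
    for idx, frame in enumerate(frames):
        if activation.targets(idx):           # switch: refine this image?
            frame = refiner.apply(frame, characteristics)
        out.append(frame)
    return out  # images sent to the 3D data display apparatus 41
```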
Although embodiments of the present invention have been described above in detail with reference to the drawings, the specific configurations thereof are not limited to those described above and various design changes or the like can be made without departing from the spirit of the invention.
An embodiment of the present invention is not limited to the embodiments described above and various changes can be made within the scope indicated by the claims. That is, embodiments obtained by combining technical means appropriately modified within the scope indicated by the claims are also included in the technical scope of the present invention.
Embodiments of the present invention are suitably applicable to a 3D data decoding apparatus that decodes encoded data into which 3D data has been encoded and a 3D data encoding apparatus that generates encoded data into which 3D data has been encoded. Embodiments of the present invention are also suitably applicable to a data structure for encoded data generated by a 3D data encoding apparatus and referenced by a 3D data decoding apparatus.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2023-159371 | Sep 2023 | JP | national |