Embodiments of the present invention relate to a 3D data encoding apparatus and a 3D data decoding apparatus.
A 3D data encoding apparatus that converts 3D data into a two-dimensional image and encodes it using a video encoding scheme to generate encoded data, and a 3D data decoding apparatus that decodes the two-dimensional image from the encoded data and reconstructs it to generate 3D data, are provided to efficiently transmit or record 3D data. There is also a technique for performing filter processing on a two-dimensional image by using supplemental enhancement information of a deep learning post-filter.
Specific 3D data encoding schemes include, for example, MPEG-I ISO/IEC 23090-5 Visual Volumetric Video-based Coding (V3C) and Video-based Point Cloud Compression (V-PCC). V3C is used to encode and decode a point cloud including point positions and attribute information. V3C is also used to encode and decode multi-view videos and mesh videos through ISO/IEC 23090-12 (MPEG Immersive Video (MIV)) and ISO/IEC 23090-29 (Video-based Dynamic Mesh Coding (V-DMC)) that is currently being standardized. According to Supplemental Enhancement Information (SEI) of a neural network post-filter in NPL 1, neural network model information is transmitted using characteristics SEI to call a frame specified by activation SEI, whereby adaptive filter processing can be performed on point cloud data.
In NPL 1, filter processing is performed only on attributes, so there is a problem that filter processing using the relationship between occupancies, geometries, and attributes cannot be performed. Namely, refinement using information shared between images is not possible in a method of decoding 3D data using multiple images/videos.
It is an object of the present invention to solve the above problem in 3D data encoding and/or decoding using a video encoding/decoding scheme, to further reduce encoding distortion using auxiliary information of refinement, and to encode and/or decode 3D data with high quality.
To accomplish the object, an aspect of the present invention provides a 3D data decoding apparatus including a geometry decoder configured to decode a geometry frame from encoded data, and an attribute decoder configured to decode an attribute frame from the encoded data, wherein a refinement information decoder is comprised, the refinement information decoder being configured to decode refinement characteristics information and refinement activation information from the encoded data, a syntax element indicating the number of refinements is decoded from the activation information, and an index indicating the characteristics information for the decoded number of refinements is decoded, and a refinement processing unit is comprised, the refinement processing unit being configured to perform refinement processing on the attribute frame or the geometry frame according to the characteristics information.
A 3D data encoding apparatus is provided that includes a geometry encoder configured to encode a geometry frame and an attribute encoder configured to encode an attribute frame, wherein a refinement information encoder is comprised, the refinement information encoder being configured to encode refinement characteristics information and refinement activation information into the encoded data, a syntax element indicating the number of refinements is encoded in the activation information and an index indicating the characteristics information for the encoded number of refinements is encoded, and a refinement processing unit is comprised, the refinement processing unit being configured to perform refinement processing on the attribute frame or the geometry frame according to the characteristics information.
According to an aspect of the present invention, it is possible to reduce distortion caused by encoding a color image and to encode and/or decode 3D data with high quality.
Embodiments of the present invention will be described below with reference to the drawings.
The 3D data transmission system 1 is a system that transmits an encoding stream obtained by encoding 3D data to be encoded, decodes the transmitted encoding stream, and displays 3D data. The 3D data transmission system 1 includes a 3D data encoding apparatus 11, a network 21, a 3D data decoding apparatus 31, and a 3D data display apparatus 41.
3D data T is input to the 3D data encoding apparatus 11.
The network 21 transmits an encoding stream Te generated by the 3D data encoding apparatus 11 to the 3D data decoding apparatus 31. The network 21 is the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or a combination thereof. The network 21 is not limited to a bidirectional communication network and may be a unidirectional communication network that transmits broadcast waves for terrestrial digital broadcasting, satellite broadcasting, or the like. The network 21 may be replaced by a storage medium on which the encoding stream Te is recorded, such as a Digital Versatile Disc (DVD) (trade name) or a Blu-ray Disc (BD) (trade name).
The 3D data decoding apparatus 31 decodes each encoding stream Te transmitted by the network 21 and generates one or more pieces of decoded 3D data Td.
The 3D data display apparatus 41 displays all or some of one or more pieces of decoded 3D data Td generated by the 3D data decoding apparatus 31. The 3D data display apparatus 41 includes a display apparatus such as, for example, a liquid crystal display or an organic electro-luminescence (EL) display. Examples of display types include stationary, mobile, and HMD. The 3D data display apparatus 41 displays a high quality image in a case that the 3D data decoding apparatus 31 has high processing capacity and displays an image that does not require high processing or display capacity in a case that it has only lower processing capacity.
Operators used in the present specification will be described below.
A data structure of the encoding stream Te generated by the 3D data encoding apparatus 11 and decoded by the 3D data decoding apparatus 31 will be described.
Each V3C unit includes a V3C unit header and a V3C unit payload. The V3C unit header contains a Unit Type, which is an ID indicating the type of the V3C unit and takes a value indicated by a label such as V3C_VPS, V3C_AD, V3C_AVD, V3C_GVD, or V3C_OVD.
In a case that the Unit Type is a V3C_VPS (Video Parameter Set), the V3C unit payload includes a V3C parameter set.
In a case that the Unit Type is V3C_AD (Atlas Data), the V3C unit payload includes a VPS ID, an atlasID, a sample stream NAL header, and multiple NAL units. ID is an abbreviation for identification and has an integer value of 0 or more. This atlasID may be used as an element of activation SEI.
Each NAL unit includes a NALUnitType, a layerID, a TemporalID, and a Raw Byte Sequence Payload (RBSP).
A NAL unit is identified by NALUnitType and includes an Atlas Sequence Parameter Set (ASPS), an Atlas Adaptation Parameter Set (AAPS), an Atlas Tile Layer (ATL), Supplemental Enhancement Information (SEI), and the like.
The ATL includes an ATL header and an ATL data unit and the ATL data unit includes information on positions and sizes of patches or the like such as patch information data.
The SEI includes a payloadType indicating the type of the SEI, a payloadSize indicating the size (number of bytes) of the SEI, and an sei_payload which is data of the SEI.
In a case that the UnitType is V3C_AVD (Attribute Video Data, attribute data), the V3C unit payload includes a VPS ID, an atlasID, an attrIdx which is an attribute frame ID (whose syntax name is a vuh_attribute_index), a partIdx which is a partition ID (vuh_attribute_partition_index), a mapIdx which is a map ID (vuh_map_index), a flag auxFlag (vuh_auxiliary_video_flag) indicating whether the data is auxiliary data, and a video stream. The video stream indicates data such as HEVC and VVC.
In a case that the UnitType is V3C_GVD (Geometry Video Data, geometry data), the V3C unit payload includes a VPS ID, an atlasID, a mapIdx, an auxFlag, and a video stream.
In a case that the UnitType is V3C_OVD (Occupancy Video Data, occupancy data), the V3C unit payload includes a VPS ID, an atlasID, and a video stream.
The attribute data and the geometry data have multiple maps distinguished from one another by mapIdx, and the attribute data has multiple attributes distinguished from one another by attrIdx.
Three-dimensional stereoscopic information (3D data) in the present specification is a set of position information (x, y, z) and attribute information in a three-dimensional space. For example, 3D data is expressed in the format of a point cloud that is a group of points with position information and attribute information in a three-dimensional space or a mesh having triangle (or polygon) vertices and faces.
Each of the occupancy frames, geometry frames, attribute frames, and atlas information may be an image obtained by mapping (packing) partial images (patches) from different projection planes onto a certain two-dimensional image. The atlas information includes information on the number of patches and the projection planes corresponding to the patches. The 3D data decoding apparatus 31 reconstructs the coordinates and attribute information of a point cloud or a mesh from the atlas information, the occupancy frame, the geometry frame, and the attribute frame. Here, points are points of a point cloud or vertices of a mesh. Instead of the occupancy frame and the geometry frame, mesh information (position information) indicating the vertices of the mesh may be encoded, decoded, and transmitted. Mesh information may also be encoded, decoded, and transmitted after being divided into a base mesh that forms a basic mesh that is a subset of the mesh and a mesh displacement. The mesh displacement indicates a displacement from the base mesh to indicate a mesh part other than the basic mesh.
The 3D data decoding apparatus 31 includes a V3C unit decoder 301, an atlas decoder 302 (a refinement information decoder), an occupancy decoder 303, a geometry decoder 304, an attribute decoder 305, a post-decoding converter 308, a pre-reconstructor 310, a reconstructor 311, and a post-reconstructor 312.
The V3C unit decoder 301 receives encoded data (a bitstream) such as that of a byte stream format or an ISO Base Media File Format (ISOBMFF) and decodes a V3C unit header and a V3C VPS. The V3C unit decoder 301 selects the atlas decoder 302, the occupancy decoder 303, the geometry decoder 304, or the attribute decoder 305 according to the UnitType of the V3C unit header. The V3C unit decoder 301 uses the atlas decoder 302 in a case that the UnitType is V3C_AD and uses the occupancy decoder 303, the geometry decoder 304, or the attribute decoder 305 to decode an occupancy frame, a geometry frame, or an attribute frame in a case that the UnitType is V3C_OVD, V3C_GVD, or V3C_AVD.
The atlas decoder 302 receives atlas data and decodes atlas information.
The atlas decoder 302 (a refinement information decoder) decodes characteristics SEI indicating characteristics of refinement processing from encoded data. The refinement information decoder decodes information on a target to which refinement is to be applied (refinement target information). Further, the atlas decoder 302 decodes activation SEI from the encoded data.
The atlas decoder 302 decodes an identifier atlasID indicating target atlas information indicating a refinement target from a V3C unit including the activation SEI.
The occupancy decoder 303 decodes occupancy data encoded using VVC, HEVC, or the like and outputs an occupancy frame DecOccFrames[frameIdx][compIdx][y][x]. Here, DecOccFrames, frameIdx, compIdx, y, and x respectively indicate a decoded occupancy frame, a frame ID, a component ID, a row index, and a column index. In DecOccFrames, compIdx=0 may be set.
The geometry decoder 304 decodes geometry data encoded using AVC, VVC, HEVC, or the like and outputs a geometry frame DecGeoFrames[mapIdx][frameIdx][compIdx][y][x]. Here, DecGeoFrames, frameIdx, mapIdx, compIdx, y, and x respectively indicate a decoded geometry frame, a frame ID, a map ID, a component ID, a row index, and a column index. DecGeoBitDepth, DecGeoHeight, DecGeoWidth, and DecGeoChromaFormat refer to the bit-depth of the geometry frame, the height of the geometry frame, the width of the geometry frame, and the chroma format of the geometry frame. The decoded geometry frame may include multiple geometry maps (geometry frames with projections of different depths) and mapIdx is used to distinguish between the maps. In DecGeoFrames, compIdx=0 may be set.
The attribute decoder 305 decodes attribute data encoded using VVC, HEVC, or the like and outputs an attribute frame DecAttrFrames[attrIdx][partIdx][mapIdx][frameIdx][compIdx][y][x]. Here, DecAttrFrames, frameIdx, attrIdx, partIdx, mapIdx, compIdx, y, and x respectively indicate a decoded attribute frame, a frame ID, an attribute ID, a partition ID, a map ID, a component ID, a row index, and a column index. DecAttrBitDepth, DecAttrHeight, DecAttrWidth, and DecAttrChromaFormat respectively indicate the bit-depth, height, width, and chroma format of the attribute frame. The decoded attribute frame may include multiple attribute maps (attribute frames with projections of different depths) and mapIdx is used to distinguish between the maps. The decoded attribute frame includes multiple attributes such as color (R, G, B), reflection, alpha, and normal direction. Multiple attributes can be transmitted through multiple pieces of attribute data and attrIdx is used to distinguish them. For example, {R, G, B} is attribute data 0 (attrIdx=0), {reflection} is attribute data 1 (attrIdx=1), and {alpha} is attribute data 2 (attrIdx=2). An attribute can be divided into and transmitted in multiple bitstreams and partIdx is used to distinguish between them. mapIdx is as described above.
The post-decoding converter 308 receives the decoded atlas information, the decoded occupancy frame DecOccFrames, the decoded geometry frame DecGeoFrames, and the decoded attribute frame DecAttrFrames and converts them into nominal formats. The post-decoding converter 308 outputs OccFramesNF[frameIdx][CompTimeIdx][y][x], GeoFramesNF[mapIdx][CompTimeIdx][frameIdx][y][x], AttrFramesNF[attrIdx][mapIdx][CompTimeIdx][compIdx][y][x] which are the nominal formats of the occupancy frame, the geometry frame, and the attribute frame. Here, frameIdx, CompTimeIdx, y, x, mapIdx, attrIdx, and compIdx respectively indicate a frame ID, a composition time index, a row index, a column index, a map ID, an attribute ID, and a component ID.
A nominal format refers collectively to a nominal bit-depth, resolution, chroma format, and composition time index into which a decoded video is to be converted.
Each video sub-bitstream and each region of a packed video sub-bitstream are associated with a nominal bit-depth. This is the expected target bit-depth for all reconstruction operations.

The nominal occupancy bit-depth OccBitDepthNF is set to oi_occupancy_2d_bit_depth_minus1[ConvAtlasID]+1 or pin_occupancy_2d_bit_depth_minus1[ConvAtlasID]+1. oi_occupancy_2d_bit_depth_minus1[j]+1 indicates the nominal 2D bit-depth into which an occupancy frame of the atlas with atlasID=j is to be converted. pin_occupancy_2d_bit_depth_minus1[j]+1 indicates the nominal 2D bit-depth into which a decoded region including occupancy data of the atlas with atlasID=j is to be converted. In a case that pin_occupancy_present_flag[j] is equal to 0, it indicates that the packed video frame of the atlas with atlasID=j does not include a region having occupancy data. In a case that pin_occupancy_present_flag[j] is equal to 1, it indicates that the packed video frame of the atlas with atlasID=j includes a region having occupancy data. In a case that pin_occupancy_present_flag[j] is not present, its value is inferred to be equal to 0.

In a case that pin_geometry_present_flag[ConvAtlasID] is equal to 1, the nominal bit-depth GeoBitDepthNF of each geometry frame is set to gi_geometry_2d_bit_depth_minus1[ConvAtlasID]+1 or pin_geometry_2d_bit_depth_minus1[ConvAtlasID]+1. gi_geometry_2d_bit_depth_minus1[j]+1 indicates the nominal 2D bit-depth into which all geometry frames of the atlas with atlasID=j are to be converted. pin_geometry_2d_bit_depth_minus1[j]+1 indicates the nominal 2D bit-depth into which a decoded region including geometry data of the atlas with atlasID=j is to be converted. pin_geometry_present_flag[j]=0 indicates that the packed video frame of the atlas with atlasID=j does not include a region having geometry data. pin_geometry_present_flag[j]=1 indicates that the packed video frame of the atlas with atlasID=j includes a region having geometry data. In a case that pin_geometry_present_flag[j] is not present, its value is inferred to be equal to 0.

Finally, in a case that pin_attribute_present_flag[ConvAtlasID]=1, the nominal bit-depth AttrBitDepthNF[attrIdx] of each attribute frame with an attrIdx is set to ai_attribute_2d_bit_depth_minus1[ConvAtlasID][attrIdx]+1 or pin_attribute_2d_bit_depth_minus1[ConvAtlasID][attrIdx]+1. ai_attribute_2d_bit_depth_minus1[j][i] plus 1 indicates the nominal two-dimensional bit-depth into which all attribute frames with attrIdx=i are to be converted for the atlas with atlasID=j. pin_attribute_2d_bit_depth_minus1[j][i] plus 1 indicates the nominal two-dimensional bit-depth into which a region including an attribute with attrIdx=i is to be converted for the atlas with atlasID=j. pin_attribute_present_flag[j]=0 indicates that the packed video frame of the atlas with atlasID=j does not include a region of attribute data. pin_attribute_present_flag[j]=1 indicates that the packed video frame of the atlas with atlasID=j includes a region of attribute data.
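For illustration, this bit-depth selection can be sketched as follows (a minimal Python sketch; the function names and the dictionary layout of the decoded syntax elements are assumptions of this sketch, and keying the choice between the oi_/gi_/ai_ and pin_ variants on the corresponding present flag is a reading of the text above):

    def occ_bit_depth_nf(oi, pin, atlas_id):
        # Use the pin_ variant when the packed video frame carries occupancy.
        if pin.get("occupancy_present_flag", {}).get(atlas_id, 0) == 1:
            return pin["occupancy_2d_bit_depth_minus1"][atlas_id] + 1
        return oi["occupancy_2d_bit_depth_minus1"][atlas_id] + 1

    def geo_bit_depth_nf(gi, pin, atlas_id):
        if pin.get("geometry_present_flag", {}).get(atlas_id, 0) == 1:
            return pin["geometry_2d_bit_depth_minus1"][atlas_id] + 1
        return gi["geometry_2d_bit_depth_minus1"][atlas_id] + 1

    def attr_bit_depth_nf(ai, pin, atlas_id, attr_idx):
        if pin.get("attribute_present_flag", {}).get(atlas_id, 0) == 1:
            return pin["attribute_2d_bit_depth_minus1"][(atlas_id, attr_idx)] + 1
        return ai["attribute_2d_bit_depth_minus1"][(atlas_id, attr_idx)] + 1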
The ConvAtlasID is set equal to a vuh_atlas_id or is determined by external means in a case that no V3C unit header is available. The vuh_atlas_id is indicated by a V3C unit header such as V3C_AD, V3C_OVD, V3C_GVD, or V3C_AVD and specifies the ID of an atlas corresponding to the current V3C unit.
An asps_frame_width represents the frame width of the atlas as an integer number of samples which correspond to luma samples of a video component. It is a requirement for V3C bitstream conformance that asps_frame_width be equal to the value of vps_frame_width[j] (where j is the current atlasID). An asps_frame_height indicates the frame height of the atlas as an integer number of samples which correspond to luma samples of a video component. It is a requirement for V3C bitstream conformance that the value of asps_frame_height be equal to the value of vps_frame_height[j], where j indicates the current atlasID. The nominal frame resolution of an auxiliary video component is defined by the nominal width and height specified respectively by variables AuxVideoWidthNF and AuxVideoHeightNF. AuxVideoWidthNF and AuxVideoHeightNF are obtained from an auxiliary video sub-bitstream relating to the atlas.
The nominal chroma format is defined as 4:4:4.
The functions of the post-decoding converter 308 include bit-depth conversion, resolution conversion, output order conversion, atlas composition alignment, atlas dimension alignment, chroma upsampling, geometry map synthesis, and attribute map synthesis. Video frames provided by the V3C decoder 309 may require additional processing steps before being input to the reconstruction process. Such processing steps may include converting a decoded frame into a nominal format (e.g., a nominal resolution, bit-depth, or chroma format). Nominal format information is signaled in a V3C VPS.
The pre-reconstructor 310 may receive the decoded atlas information, the decoded occupancy frame, the decoded geometry frame, and the decoded attribute frame and refine/modify them. Specifically, in a case that an occupancy synthesis flag os_method_type[k] is equal to 1 indicating patch border filtering, the pre-reconstructor 310 starts occupancy synthesis with an OccFramesNF[compTimeIdx][0] and a GeoFramesNF[0][compTimeIdx][0] as inputs and a corrected array OccFramesNF[compTimeIdx][0] as an output. OccFramesNF indicates an occupancy frame decoded in a nominal format and GeoFramesNF indicates a geometry frame decoded in a nominal format. compTimeIdx is a composition time index.
The refiner 306 refines an occupancy, a geometry, and an attribute in units of frames in the pre-reconstructor 310. The refinement may be filtering that receives any of an occupancy, a geometry, and an attribute and outputs any of an occupancy, a geometry, and an attribute. The refiner 306 may perform refinement processing using a two-dimensional occupancy, geometry, and attribute at the same time.
The refiner 306 may further include an NN filter 611 or an ALF unit 610 and perform filter processing based on received neural network parameters or linear parameters.
The reconstructor 311 receives atlas information, an occupancy, and a geometry and reconstructs the positions and attributes of a point cloud or the vertices of a mesh in 3D space. The reconstructor 311 reconstructs mesh or point cloud data of 3D data based on the reconstructed geometry information (for example, recPcGeo) and attribute information (for example, recPcAttr). Specifically, the reconstructor 311 receives OccFramesNF[compTimeIdx][0][y][x], GeoFramesNF[mapIdx][compTimeIdx][0][y][x], and AttrFramesNF[attrIdx][mapIdx][compTimeIdx][ch][y][x], which are the nominal video frames derived by the pre-reconstructor 310, and reconstructs mesh or point cloud data of 3D data. AttrFramesNF indicates an attribute frame decoded in a nominal format. The reconstructor 311 derives a variable pointCnt as the number of points in a reconstructed point cloud frame, a one-dimensional array pointToPatch[pointCnt] as a patch index corresponding to each reconstructed point, and a two-dimensional array pointToPixel[pointCnt][dimIdx] as atlas coordinates corresponding to each reconstructed point. The reconstructor 311 also derives a 2D array recPcGeo[pointCnt][dimIdx] as a list of coordinates corresponding to each reconstructed point and a 3D array recPcAttr[pointCnt][attrIdx][compIdx] as attributes relating to the points in the reconstructed point cloud frame. Here, pointCnt, attrIdx, and compIdx correspond respectively to the number of reconstructed points, the attribute frame ID, and the component ID. dimIdx indexes the coordinate components of each reconstructed point.
Specifically, the reconstructor 311 derives recPcGeo and recPcAttr as follows.
Here, pIdx is the index of a patch. compTimeIdx represents a composition time index, rawPos1D represents a one-dimensional position, and gFrame and aFrame represent a geometry frame and an attribute frame, respectively. TilePatch3dOffsetU is a tile patch parameter relating to the patch. ai_attribute_count[j] indicates the number of attributes relating to the atlas with atlasID=j (the number of attributes encoded and decoded using the VPS). AtlasPatch3dOffsetU, AtlasPatch3dOffsetV, and AtlasPatch3dOffsetD are parameters indicating the position of the 3D bounding box of the patch.
Here, AtlasPatchRawPoints, AtlasPatch2dPosX, AtlasPatch2dPosY, and the like are variables relating to the patch that are derived from the atlas information.
Here, ai_attribute_dimension_minus1 is encoded and decoded using encoded data in the V3C VPS, and ai_attribute_dimension_minus1+1 indicates the total number of dimensions (i.e., the number of channels or components) of attributes decoded by the V3C unit decoder 301. asps_map_count_minus1 is encoded and decoded using encoded data in the V3C VPS, and asps_map_count_minus1+1 indicates the number of maps for geometry data and attribute data of the current atlas. RecAtlasID indicates a decoded atlas ID.
The post-reconstructor 312 changes (updates) the mesh or point cloud data of 3D data that has been processed by the reconstructor 311. The post-reconstructor 312 receives pointCnt as the number of reconstructed points of the current point cloud frame relating to the current atlas, a one-dimensional array attrBitDepth[ ] as a nominal bit-depth, oFrame[ ][ ], recPcGeo[ ][ ], and recPcAttr[ ][ ][ ], applies geometric smoothing to them, and outputs the changed recPcGeo[ ][ ] and recPcAttr[ ][ ][ ].
The refiner 306 (306DEC) applies refinement to DecOccFrames, DecGeoFrames, and DecAttrFrames that have been decoded by the occupancy decoder 303, the geometry decoder 304, and the attribute decoder 305 and have not been converted by the post-decoding converter 308 yet. Then, the refined DecOccFrames, DecGeoFrames, and DecAttrFrames are output to the post-decoding converter 308.
The refiner 306 receives DecOccFrames[0][frameIdx][y][x], DecGeoFrames[mapIdx][frameIdx][0][y][x] for each mapIdx and frameIdx, and/or DecAttrFrames[attrIdx][partIdx][mapIdx][frameIdx][compIdx][y][x] for each attrIdx, partIdx, mapIdx, frameIdx, and compIdx according to the refinement target information and outputs the corrected DecOccFrames, DecGeoFrames, and/or DecAttrFrames. The refiner 306 stores the output in DecOccFrames[0][frameIdx][y][x], DecGeoFrames[mapIdx][frameIdx][0][y][x], and/or DecAttrFrames[attrIdx][partIdx][mapIdx][frameIdx][compIdx][y][x] according to the refinement target information.
The refiner 306 may perform refinement processing on a two-dimensional array oFrame of asps_frame_height×asps_frame_width, a two-dimensional array gFrame of asps_frame_height×asps_frame_width, and/or a three-dimensional array aFrame of (ai_attribute_dimension_minus1[RecAtlasID][attrIdx]+1)×asps_frame_height×asps_frame_width. At this time, the refiner 306 may also receive oFrame=DecOccFrames[0][frameIdx], gFrame=DecGeoFrames[mapIdx][frameIdx][0], and/or aFrame=DecAttrFrames[attrIdx][partIdx][mapIdx][frameIdx]. Here, the attrIdx and mapIdx may be the values nnrc_attribute_index and nnrc_map_index obtained by decoding the NNRC SEI. That is, attrIdx=nnrc_attribute_index and mapIdx=nnrc_map_index. The frameIdx may be a compTimeIdx which is a composition time index. The attrIdx and mapIdx may be the values nnra_target_attribute_index and nnra_target_map_index obtained by decoding the NNRA_SEI. That is, attrIdx=nnra_target_attribute_index and mapIdx=nnra_target_map_index. The same applies below.
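As an illustration of this input selection, the following minimal sketch (in Python; the function name is hypothetical and the decoded arrays are assumed to be nested lists or arrays indexed as described above) picks out the frames passed to the refiner 306:

    def select_refiner_inputs(dec_occ, dec_geo, dec_attr,
                              attr_idx, part_idx, map_idx, frame_idx):
        # Indexing follows the text: DecOccFrames[0][frameIdx],
        # DecGeoFrames[mapIdx][frameIdx][0], and
        # DecAttrFrames[attrIdx][partIdx][mapIdx][frameIdx].
        o_frame = dec_occ[0][frame_idx]                              # [y][x]
        g_frame = dec_geo[map_idx][frame_idx][0]                     # [y][x]
        a_frame = dec_attr[attr_idx][part_idx][map_idx][frame_idx]   # [comp][y][x]
        return o_frame, g_frame, a_frame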
According to the above configuration, a network model to be applied (filter characteristics) is specified by characteristics SEI and any of an occupancy, a geometry, and an attribute is specified using refinement target information specified in the characteristics SEI, and refinement processing is applied to multiple inputs at the same time, thereby achieving the advantage of improving image quality.
The refiner 306 may perform refinement processing on a two-dimensional array oFrame of asps_frame_height×asps_frame_width, a two-dimensional array gFrame of asps_frame_height×asps_frame_width, and/or a three-dimensional array aFrame of (ai_attribute_dimension_minus1[RecAtlasID][attrIdx]+1)×asps_frame_height×asps_frame_width. At this time, the refiner 306 may also receive oFrame=OccFramesNF[compTimeIdx][0], gFrame=GeoFramesNF[mapIdx][compTimeIdx][frameIdx], and/or aFrame=AttrFramesNF[attrIdx][mapIdx][compTimeIdx].
In a case that the refinement target information indicates a geometry (S3064I), a geometry is added to the input tensor (S3065I). For example, the following (Eq. IN-GEO) may be used.
In a case that the refinement target information indicates an attribute (S3066I), an attribute is added to the input tensor (S3067I). For example, the following (Eq. IN-ATTR) may be used.
According to the above configuration, a network model to be applied (filter characteristics) is specified by characteristics SEI. Any of an occupancy, a geometry, and an attribute is specified using refinement target information specified in the characteristics SEI and refinement processing is applied to multiple inputs at the same time, thereby achieving the advantage of improving image quality.
The NN filter 611 performs filter processing using a neural network. The neural network is expressed by a neural network model and may include a convolution (Conv).
Here, a neural network model (hereinafter referred to as an NN model) means elements and connection relationships (a topology) of a neural network and parameters (weights and biases) of the neural network. The NN filter 611 may fix the topology and switch only the parameters depending on an image to be filtered. A neural network may include a convolution defined by a kernel size, the number of input channels, and the number of output channels.
Let DecFrame be an input to the refiner 306. The refiner 306 derives an input inputTensor to the NN filter 611 from the input image DecFrame, and the NN filter 611 performs filter processing based on the neural network model using the inputTensor to derive an outputTensor. The neural network model used is the model corresponding to an nnra_target_id. The input image may be an image for each component or may be an image having multiple components as channels.
The NN filter 611 may repeatedly apply the following process.
The NN filter 611 performs a convolution operation (conv or convolution) on the inputTensor with a kernel k[m][n][kh][kw] for the number of layers to generate an output image outputTensor to which a bias has been added.

Here, m is the number of channels of inputTensor, n is the number of channels of outputTensor, kh is the height of kernel k, and kw is the width of kernel k.

Each layer generates an outputTensor from an inputTensor, for example as follows:

outputTensor[nn][yy][xx]=Σ(k[mm][nn][j][i]*inputTensor[mm][yy+j−of][xx+i−of])+bias[nn]

Here, nn=0 . . . n−1, mm=0 . . . m−1, yy=0 . . . height−1, xx=0 . . . width−1, j=0 . . . kh−1, and i=0 . . . kw−1. "width" is the width of inputTensor and outputTensor and "height" is the height of inputTensor and outputTensor. Σ is the sum over mm=0 . . . m−1, j=0 . . . kh−1, and i=0 . . . kw−1. "of" is the width or height of the margin required around the inputTensor to generate the outputTensor.

In a case of 1×1 Conv, Σ represents the sum for each of mm=0 . . . m−1, j=0, and i=0. In this case, of=0 is set. In a case of 3×3 Conv, Σ represents the sum for each of mm=0 . . . m−1, j=0 . . . 2, and i=0 . . . 2. In this case, of=1 is set.

In a case that the value of yy+j−of is less than 0 or equal to or greater than "height", or in a case that the value of xx+i−of is less than 0 or equal to or greater than "width", the value of inputTensor[mm][yy+j−of][xx+i−of] may be 0. Alternatively, the value of inputTensor[mm][yy+j−of][xx+i−of] may be inputTensor[mm][yclip][xclip]. Here, yclip is max(0, min(yy+j−of, height−1)) and xclip is max(0, min(xx+i−of, width−1)).
In the next layer, the obtained outputTensor is used as a new inputTensor and the same process is repeated for the number of layers. An activation layer may be provided between layers. Pooling layers or skip connections may be used. A FilteredFrame is derived from the outputTensor finally obtained.
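The per-layer convolution described above can be sketched directly (an unoptimized Python/numpy sketch; it implements the edge-clipping variant of the boundary handling, and the layer loop and activations are omitted):

    import numpy as np

    def conv_layer(input_tensor, k, bias, of):
        # input_tensor: (m, height, width); k: (m, n, kh, kw); bias: (n,)
        m, h, w = input_tensor.shape
        _, n, kh, kw = k.shape
        out = np.empty((n, h, w))
        for nn in range(n):
            acc = np.full((h, w), float(bias[nn]))
            for mm in range(m):
                for j in range(kh):                       # vertical kernel index
                    ys = np.clip(np.arange(h) + j - of, 0, h - 1)
                    for i in range(kw):                   # horizontal kernel index
                        xs = np.clip(np.arange(w) + i - of, 0, w - 1)
                        # inputTensor[mm][yy+j-of][xx+i-of] with edge clipping
                        acc += k[mm, nn, j, i] * input_tensor[mm][np.ix_(ys, xs)]
            out[nn] = acc
        return out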
A process called Depth-wise Conv may also be performed using a kernel k′[n][kh][kw], represented, for example, by the following equation:

outputTensor[nn][yy][xx]=Σ(k′[nn][j][i]*inputTensor[nn][yy+j−of][xx+i−of])+bias[nn]

Here, nn=0 . . . n−1, xx=0 . . . width−1, and yy=0 . . . height−1, and Σ is the sum over j=0 . . . kh−1 and i=0 . . . kw−1.
Non-linear processing referred to as Activate, such as ReLU, may be used.

ReLU(x)=x>=0?x:0

Alternatively, leakyReLU shown in the following formula may be used:

leakyReLU(x)=x>=0?x:a*x
Here, a is a predetermined value less than 1, for example 0.1 or 0.125. To perform integer operations, all values of k, bias, and a described above may be set to integers and a right shift may be performed after conv to generate an outputTensor.
ReLU always outputs 0 for values less than 0 and outputs input values as they are for values equal to or greater than 0. In contrast, leakyReLU performs linear processing with a gradient a for values less than 0. In ReLU, the gradient for values less than 0 disappears, and learning may not advance steadily. leakyReLU leaves a gradient for values less than 0, making such a problem less likely to occur. Instead of the above leakyReLU(x), PReLU, which uses a parameterized value of a, may be used.
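The activation functions above may be sketched as follows (Python/numpy; the leak value a follows the text):

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def leaky_relu(x, a=0.125):
        # a is the predetermined leak value less than 1 (e.g., 0.1 or 0.125);
        # PReLU would instead treat a as a learned parameter.
        return np.where(x >= 0.0, x, a * x)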
attrIndex is set to nnrc_target_attribute_index and mapIndex is set to nnra_target_map_index[i], and the following is applied. With the input including OccFramesNF[compTimeIdx][0], GeoFramesNF[mapIdx][compTimeIdx][0], AttrFramesNF[attrIdx][mapIdx][compTimeIdx], and nnra_strength[i] and with the output including the changed GeoFramesNF[mapIdx][compTimeIdx][0] and AttrFramesNF[attrIdx][mapIdx][compTimeIdx], the following process is invoked.
Here, ai_attribute_dimension_minus1 indicates the number of elements minus 1. An element is an element of an attribute encoded and decoded using the VPS; the number of elements is normally 3 in the case of YUV or RGB.
In a configuration corresponding to simultaneous refinement of multiple attributes, mapIndex may be set to nnra_target_map_index[i] and the following may be applied. With the input including OccFramesNF[compTimeIdx][0], GeoFramesNF[mapIdx][compTimeIdx][0], AttrFramesNF[attrIdx][mapIdx][compTimeIdx], and nnra_strength[i] and with the output including the changed AttrFramesNF[attrIdx][mapIdx][compTimeIdx] for each target attrIdx, the following process is invoked.
The NN filter 611 performs NN filter processing and derives an outputTensor from the inputTensor. Refinement processing (filter processing) indicated by RefineFilter( ) may be performed in units of blocks (blockWidth×blockHeight) as described below.

Here, DeriveInputTensors( ) is a function indicating input data setting and StoreOutputTensors( ) is a function indicating output data storage. FrameWidth and FrameHeight are the sizes of input data and may be asps_frame_width and asps_frame_height. blockWidth and blockHeight are the width and height of each block.
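The block-wise driver may be sketched as follows (Python; argument lists are abbreviated, and DeriveInputTensors( )/StoreOutputTensors( ) are sketched separately below):

    def refine_frame(o_frame, g_frame, a_frame,
                     frame_width, frame_height,
                     block_width, block_height, nn_filter):
        # Apply RefineFilter() block by block over the frame.
        for y0 in range(0, frame_height, block_height):
            for x0 in range(0, frame_width, block_width):
                inp = derive_input_tensors(o_frame, g_frame, a_frame, x0, y0)
                out = nn_filter(inp)
                store_output_tensors(out, o_frame, g_frame, a_frame, x0, y0)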
In DeriveInputTensors( ), the NN filter 611 derives input data inputTensor[ ][ ][ ] to the NN filter 611 based on one or more of an occupancy frame oFrame, a geometry frame gFrame, and an attribute frame aFrame, in accordance with the refinement target information. The following is an example configuration, and a configuration that will be described later may also be used.
Here, OccupancyUsedFlag, GeometryUsedFlag, and AttributeUsedFlag are variables indicating the refinement target information, and syntax elements described later and obtained by decoding the characteristics SEI may be used as follows.
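One possible DeriveInputTensors( ) is sketched below (Python/numpy; for brevity the frames are assumed to be numpy arrays pre-padded by overlapSize pixels on every side, which is an assumption of this sketch):

    import numpy as np

    def derive_input_tensors(o_frame, g_frame, a_frame, x0, y0,
                             block_w, block_h, overlap,
                             occ_used, geo_used, attr_used):
        # occ_used/geo_used/attr_used correspond to OccupancyUsedFlag,
        # GeometryUsedFlag, and AttributeUsedFlag; frames are assumed
        # pre-padded by `overlap` pixels so the slices stay in range.
        ys = slice(y0, y0 + block_h + 2 * overlap)
        xs = slice(x0, x0 + block_w + 2 * overlap)
        chans = []
        if occ_used:
            chans.append(o_frame[ys, xs])
        if geo_used:
            chans.append(g_frame[ys, xs])
        if attr_used:
            for comp in a_frame:          # all channels of the selected attribute
                chans.append(comp[ys, xs])
        return np.stack(chans)            # inputTensor[ch][y][x]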
Further, another configuration will be described.
An example of this configuration is illustrated in the drawings.
The refiner 306 may derive an inputTensor according to the refinement target information. Hereinafter, ch indicates the index (position) of a channel to be set. ch++ is an abbreviation for ch=ch+1 and indicates that the index of the channel to be set is increased by 1.
In a case that the refinement target information indicates an occupancy (S3062I), an occupancy is added to the input tensor (S3063I). For example, the following (Equation IN-OCC) may be used.
In a case that the refinement target information indicates a geometry (S3064I), a geometry is added to the input tensor (S3065I). For example, the following (Equation IN-GEO) may be used.
In a case that the refinement target information indicates an attribute (S3066I), an attribute is added to the input tensor (S3067I). For example, as indicated in the following (Equation IN-ATTR-DIM), all channels of the attribute designated by the index attrIdx are input to inputTensor.
Note the following.
As another configuration, in a case that the refinement target information indicates an attribute, all channels of the attributes designated by the attrIdx values may be input to inputTensor. The attrIdx values start from nnrc_attribute_index, and the number of attributes is nnrc_attribute_num_minus1+1.
In a case that the refinement target information indicates a geometry (S3064O), a specific component of the output tensor is set to a geometry (S3065O). For example, the following (Equation OUT-GEO) may be used.
In a case that the refinement target information indicates an attribute (S3066O), a specific component of the output tensor is set to an attribute (S3067O). For example, as indicated by the following (Equation OUT-ATTR-DIM), all channels of the attribute designated by the index attrIdx may be set from the values of outputTensor.
As another configuration, in a case that the refinement target information indicates an attribute, all channels of the attributes designated by the attrIdx values may be set from the outputTensor. The attrIdx values start from nnrc_attribute_index, and the number of attributes is nnrc_attribute_num_minus1+1.
The refinement information decoder further decodes the number strengthNum of pieces of strength information from the characteristics information, and the refinement processing unit inputs at least one of the occupancy frame, the geometry frame, or the attribute frame to the input tensor of the neural network, and further inputs each of strengthNum values derived from the decoded strength information into a respective one of strengthNum channels of the input tensor of the neural network.
In StoreOutputTensors( ) below, the NN filter 611 sets the content of outputTensor to the changed occupancy frame oFrame, geometry frame gFrame and attribute frame aFrame according to the refinement target information.
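One possible StoreOutputTensors( ) is sketched below (Python/numpy; discarding the overlap margin on write-back is an assumption of this sketch, and the channel order mirrors the DeriveInputTensors( ) sketch above):

    def store_output_tensors(out, o_frame, g_frame, a_frame, x0, y0,
                             block_w, block_h, overlap,
                             occ_used, geo_used, attr_used):
        # Write back only the inner blockWidth x blockHeight region.
        dst = (slice(y0, y0 + block_h), slice(x0, x0 + block_w))
        src = (slice(overlap, overlap + block_h), slice(overlap, overlap + block_w))
        ch = 0
        if occ_used:
            o_frame[dst] = out[ch][src]
            ch += 1
        if geo_used:
            g_frame[dst] = out[ch][src]
            ch += 1
        if attr_used:
            for comp in a_frame:
                comp[dst] = out[ch][src]
                ch += 1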
Refinement specified by persistence information of NNRA_SEI and an identifier of characteristics SEI may be performed using a filter (a Wiener Filter) that uses a linear model. The filter processing with the linear filter may be performed in the ALF unit 610. Specifically, a filter target image DecFrame is divided into small regions (x=xSb . . . xSb+bSW−1, y=ySb . . . ySb+bSH−1) of a constant size (for example, 4×4 or 1×1). (xSb, ySb) are the coordinates of the upper left corner of the small region and bSW and bSH are the width and height of the small region. Then, filter processing is performed in units of small regions. A refined image outFrame is derived from the DecFrame using a selected filter coefficient coeff[ ].
Here, ofx and ofy are offsets of a reference position determined according to a filter position i. offset=1<<(shift−1). shift is a constant such as 6, 7, or 8 that corresponds to the precision of the filter coefficient.
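For illustration, one filtered sample may be computed as in the following sketch (Python; the coefficient layout and boundary handling are simplified assumptions of this sketch):

    def wiener_filter_sample(dec, y, x, coeff, ofy, ofx, shift):
        # coeff is the coefficient set selected for the small region
        # containing (y, x); ofy[i]/ofx[i] are the reference offsets for
        # filter position i, and offset = 1 << (shift - 1) rounds the sum.
        acc = 0
        for i in range(len(coeff)):
            acc += coeff[i] * dec[y + ofy[i]][x + ofx[i]]
        return (acc + (1 << (shift - 1))) >> shift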
The image may be classified into small regions and filter processing may be performed by selecting a filter coefficient coeff[classId][ ] according to a classId (=classId[y][x]) of each derived small region.
Here, DecFrame[ch]=inputTensor[ch][yP+overlapSize][xP+overlapSize]. That is, the following may be used.
The classId may be derived using an activity level or directionality of a block (one pixel in the case of a 1×1 block). For example, the classId may be derived using the activity level Act derived from a sum of absolute differences or the like as follows.
Here, i=0 . . . 7 and xi indicates a pixel adjacent to a target pixel x. The directions toward the adjacent pixels may be the eight directions toward the top, bottom, left, and right and the four diagonal directions at 45 degrees, for i=0 . . . 7. The following formula may be used.
Here, l, r, a, and b are abbreviations for left, right, above, and below, respectively, and indicate the pixels on the left, right, top, and bottom of x. Act may be quantized by a shift value and clipped to NumA−1 as follows, where NumA indicates the number of activity indices.
Act=Min(NumA−1,Act>>shift)
Further, the classId may be derived based on the following formula using the directionality D.
Here, for example, Act=0 . . . NumA−1 and D=0 . . . 4.
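A sketch of the class derivation follows (Python; since the combining formula is not reproduced in the text, classId=D*NumA+Act is an assumption of this sketch):

    def derive_class_id(act_raw, d, num_a, shift):
        # act_raw: activity before quantization (e.g., a sum of absolute
        # differences around the target pixel); d: directionality D = 0..4.
        act = min(num_a - 1, act_raw >> shift)   # Act = Min(NumA-1, Act >> shift)
        return d * num_a + act                   # combining formula assumed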
The ALF unit 610 directly outputs the outFrame as the FilteredFrame.
A persistence scope to which the characteristics SEI is applied is a CAS (coded atlas sequence). Namely, the characteristics SEI is SEI applied in units of CASs.
A syntax element nnrc_id indicates the ID of the characteristics SEI. In the activation SEI, the value of an nnrc_id of characteristics SEI indicating refinement characteristics to be applied is transmitted as a target ID (an nnra_target_id) to specify refinement processing to be applied.
In a case that an nnrc_mode_idc is 0, it indicates that this SEI message contains an ISO/IEC 15938-17 bitstream and specifies a base NNRE (a neural-network refinement) or specifies an update from a base NNRE having the same nnrc_id value.
In a case that the nnrc_mode_idc is equal to 1 and an NNRC SEI message is the first NNRC SEI message having a specific nnrc_id value within the current CAS in decoding order, a base NNRE for that nnrc_id value is identified as a neural network identified by the URI indicated by nnrc_uri, in a format identified by the tag URI nnrc_tag_uri. In a case that an NNRC SEI message is neither the first NNRC SEI message having a specific nnrc_id value within the current CAS in decoding order nor a repeat of the first NNRC SEI message in decoding order and the nnrc_mode_idc is 1, it indicates that an update to the base NNRE having the same nnrc_id value is defined by the URI indicated by nnrc_uri. The nnrc_uri is in a format identified by the tag URI nnrc_tag_uri.
The value of nnrc_mode_idc shall be in a range of 0 to 1 (inclusive) in bitstreams complying with this version of the specification.
In a case that this SEI message is the first NNRC SEI message having a specific nnrc_id value within the current CAS in decoding order, the same RefineFilter( ) as that of the base NNRE is assigned.
In a case that this SEI message is neither the first NNRC SEI message having a specific nnrc_id value within the current CAS in decoding order nor a repeat of the first NNRC SEI message in decoding order, an update defined in this SEI message is applied to the base NNRE to obtain a RefineFilter( ).
Updates are not cumulative and each update is applied to the base NNRE. The base NNRE is an NNRE specified in the first NNRC SEI message (in decoding order) having a specific nnrc_id value within the current CAS.
An nnrc_reserved_zero_bit_a shall be equal to 0 in bitstreams complying with this version of the specification. The decoder shall ignore NNRC SEI messages in which the nnrc_reserved_zero_bit_a is not 0.
An nnrc_tag_uri includes a tag URI having syntax and semantics specified in IETF RFC 4151 that identifies the format and related information of a neural network that is used as a base NNRE having the same nnrc_id value specified in an nnrc_uri or as an update to the base NNRE.
The nnrc_tag_uri being "tag:iso.org,2023:15938-17" indicates that neural network data identified by the nnrc_uri complies with ISO/IEC 15938-17.
An nnrc_uri includes a URI having syntax and semantics specified in IETF Internet Standard 66 that identifies a neural network used as a base NNRE or identifies an update to a base NNRE having the same nnrc_id value.
In a case that an nnrc_property_present_flag is equal to 1, it indicates that syntax elements relating to an input format, an output format, and complexity are present. In a case that the nnrc_property_present_flag is 0, it indicates that no syntax elements relating to an input format, an output format, and complexity are present.
In a case that this SEI message is the first NNRC SEI message having a specific nnrc_id value within the current CAS in decoding order, the nnrc_property_present_flag needs to be equal to 1.
In a case that the nnrc_property_present_flag is equal to 0, it is inferred that the values of all syntax elements that can be present only in a case that the nnrc_property_present_flag is equal to 1, and for which no inferred values are specified, are the same as those of the corresponding syntax elements in the NNRC SEI message including the base NNRE that is to be updated by this SEI message.
The nnrc_base_flag being equal to 1 indicates that the SEI message specifies a base NNRE. The nnrc_base_flag being equal to 0 indicates that the SEI message specifies an update relating to a base NNRE. In a case that no nnrc_base_flag is present, it is inferred that the value of the nnrc_base_flag is equal to 0.
The value of the nnrc_base_flag is subject to the following constraints:
The syntax elements nnrc_occupancy_flag, nnrc_geometry_flag, and nnrc_attribute_flag indicate targets of the refinement processing (refinement target information). Each of the syntax elements indicates whether an occupancy, a geometry, or an attribute is included in the refinement input or output.
The syntax element nnrc_attribute_index is refinement target information and indicates the index (attrIdx) of the attribute to be refined. For nnrc_attribute_flag!=0, nnrc_attribute_index is encoded and decoded. Although not shown, a syntax element nnra_all_attribute_flag may be decoded and encoded. nnra_all_attribute_flag=1 indicates that refinement processing is to be performed on all attributes. In other words, the attrIdx to which refinement is applied includes all attrIdx values 0 . . . ai_attribute_count[RecAtlasID]−1. For nnra_all_attribute_flag=1, nnrc_attribute_index and nnrc_attribute_num_minus1 are not encoded or decoded, but are inferred as nnrc_attribute_index=0 and nnrc_attribute_num_minus1=ai_attribute_count[RecAtlasID]−1. A list of nnrc_attribute_index may be inferred as 0 . . . ai_attribute_count[RecAtlasID]−1. For nnra_all_attribute_flag=0, nnrc_attribute_index and nnrc_attribute_num_minus1 are encoded and decoded.
The syntax element nnrc_attribute_num_minus1 is refinement target information and indicates the number of attributes to be refined. For nnrc_attribute_flag!=0, nnrc_attribute_num_minus1 is encoded and decoded.
The syntax element nnrc_strength_num is refinement input additional information and indicates the number strengthNum of the strength information nnra_strength input to the refinement processing (inputTensor).
A syntax element nnrc_inp_out_format_idc (input/output tensor format information) indicates a method of converting pixel values of the decoded image into input/output values for the refinement processing. In a case that the value of the nnrc_inp_out_format_idc is 0, input values to the refinement processing (especially, the input tensor) are real numbers (floating point values) specified in IEEE754 and a function Inp is specified as follows. The value range of the input tensor is 0 . . . 1.
In a case that the value of the nnrc_inp_out_format_idc is 1, the input and output values to the refinement processing are unsigned integers and the function Inp is specified as follows. The value range of the input tensor is 0 . . . (1<<inpTensorBitDepth)−1.
A value obtained by adding 8 to the value of the syntax element nnrc_inp_tensor_bitdepth_minus8 indicates the pixel bit-depth of the luma pixel value of the integer input tensor. The value of the variable inpTensorBitDepth is derived from the syntax element nnrc_inp_tensor_bitdepth_minus8 as follows.
inpTensorBitDepth=nnrc_inp_tensor_bitdepth_minus8+8
A block is an array of pixels. The refinement processing is performed in units of fixed blocks. A block may also be called a patch.
A syntax element nnrc_block_size_idc indicates the block size. The block size may be a multiple of 64 such as 64, 128, or 192, for example as follows:

blockSize=(nnrc_block_size_idc+1)<<6

The block size may also be defined from the nnrc_block_size_idc excluding 0, for example as follows:

blockSize=nnrc_block_size_idc<<6 (nnrc_block_size_idc>0)
A syntax element nnrc_overlap_size_idc specifies the number of horizontal and vertical pixels over which adjacent input tensors overlap. The value of nnrc_overlap_size_idc may be a multiple of 4 as follows.
overlapSize=nnrc_overlap_size_idc<<2
A function Out that converts each of a luma pixel value and a chroma pixel value output by post-processing into an integer value of a pixel bit-depth is specified as follows using the pixel bit-depth BitDepth.
Out(x)=Clip3(0,(1<<BitDepth)−1,Round(x*((1<<BitDepth)−1)))
The function Out is specified as follows.
The value of the syntax element nnrc_out_tensor_bitdepth_minus8 plus 8 specifies the pixel bit-depth of pixel values of an integer output tensor. The value of outTensorBitDepth is derived from the syntax element nnrc_out_tensor_bitdepth_minus8 as follows.
outTensorBitDepth=nnrc_out_tensor_bitdepth_minus8+8
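The conversions Inp and Out may be sketched as follows (Python; the plain bit-depth shift used for the integer Inp is an assumption of this sketch):

    def inp_real(x, bit_depth):
        # nnrc_inp_out_format_idc == 0: real-valued input tensor in 0..1
        return x / float((1 << bit_depth) - 1)

    def inp_int(x, bit_depth, inp_tensor_bit_depth):
        # nnrc_inp_out_format_idc == 1: unsigned integers in
        # 0..(1 << inpTensorBitDepth) - 1; the plain shift is assumed.
        d = inp_tensor_bit_depth - bit_depth
        return x << d if d >= 0 else x >> -d

    def out_real(x, bit_depth):
        # Out(x) = Clip3(0, (1 << BitDepth) - 1, Round(x * ((1 << BitDepth) - 1)))
        v = int(round(x * ((1 << bit_depth) - 1)))
        return max(0, min((1 << bit_depth) - 1, v))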
A syntax element nnrc_reserved_zero_bit_b shall be equal to 0.
A syntax element nnrc_payload_byte[i] contains the i-th byte of an ISO/IEC 15938-17 compliant bitstream. The concatenation of all nnrc_payload_byte[i] bytes shall form an ISO/IEC 15938-17 compliant bitstream.
For this activation SEI, the atlas decoder 302 (a refinement information decoder) and an atlas encoder 102 (a refinement information encoder) decode and encode the following syntax elements.
An nnra_target_id indicates the ID of the characteristics SEI to be applied (an identifier or identification information of refinement characteristics information). Refinement processing specified by the characteristics SEI having the same nnrc_id as the nnra_target_id is applied to an image.
An nnra_cancel_flag is a cancel flag. The nnra_cancel_flag being 1 indicates that persistence of refinement set for the image in already decoded NNRA_SEI is to be canceled. The nnra_cancel_flag being 0 indicates that a subsequent syntax element (nnra_persistence_flag) is to be transmitted, encoded, and decoded.
The nnra_persistence_flag indicates persistence information of a target refinement. In a case that the nnra_persistence_flag is 0, it indicates that the target refinement is applied only to pictures indicated by an atlasID. In a case that the nnra_persistence_flag is 1, it indicates that the target refinement indicated by the nnra_target_id is applied to the current picture and all subsequent pictures until one of the following conditions is met:
As illustrated in the drawings, the activation SEI includes a syntax element nnra_count_minus1 indicating the number of refinements. In the illustrated syntax structure, an nnra_target_id[i] indicating the characteristics information to be applied is encoded and decoded for each i of the refinements.
Furthermore, for each i, the second refinement target information includes nnra_target_map_index[i].
Furthermore, for each i, a syntax element indicating nnra_all_map_flag[i] may be decoded and encoded as the second refinement target information. Here, nnra_all_map_flag[i]=1 indicates that refinement processing is performed on all maps. In other words, the mapIdx to which refinement is applied includes all mapIdx values. For nnra_all_map_flag[i]=1, nnra_target_map_index[i] is not encoded or decoded, and the maps to which refinement is applied are inferred as mapIdx=0 . . . asps_map_count_minus1. For nnra_all_map_flag[i]=0, nnra_target_map_index[i] is encoded and decoded. Note that nnra_all_map_flag[i] and nnra_target_map_index[i] may be encoded and decoded by the characteristics SEI instead of the activation SEI. In this case, in the characteristics SEI, nnrc_all_map_flag and nnrc_target_map_index are encoded and decoded as corresponding syntax elements.
Furthermore, although not illustrated, nnra_all_partition_flag[i] may be included as second refinement target information for each i.
Here, nnra_all_partition_flag[i]=1 indicates that refinement processing is performed on all partitions. In other words, this indicates that the partIdx to which refinement is applied includes all partIdx values. For nnra_all_partition_flag[i]=1, nnra_target_partition_index[i] is not encoded or decoded, but is inferred to cover all partIdx values. For nnra_all_partition_flag[i]=0, nnra_target_partition_index[i] is encoded and decoded.
In a case that nnra_cancel_flag[i] indicates that cancellation is not performed, for example, in a case that nnra_cancel_flag[i]==0 (!nnra_cancel_flag[i]), the atlas decoder 302 decodes the persistence information nnra_persistence_flag[i], and further decodes nnra_strength_num[i] indicating the number of pieces of strength information to be applied and nnra_strength[i][j] indicating the strength information to be applied.
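A sketch of this decoding flow follows (Python; read_ue( )/read_flag( ) are assumed bitstream-reader helpers, and the exact descriptor of each syntax element is not asserted here):

    def decode_nnra(bs):
        count = bs.read_ue() + 1                          # nnra_count_minus1 + 1
        refinements = []
        for i in range(count):
            r = {"target_id": bs.read_ue(),               # nnra_target_id[i]
                 "map_index": bs.read_ue(),               # nnra_target_map_index[i]
                 "cancel": bs.read_flag()}                # nnra_cancel_flag[i]
            if not r["cancel"]:
                r["persistence"] = bs.read_flag()         # nnra_persistence_flag[i]
                num = bs.read_ue()                        # nnra_strength_num[i]
                r["strength"] = [bs.read_ue() for _ in range(num)]  # nnra_strength[i][j]
            refinements.append(r)
        return refinements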
For each i of i=0 . . . nnra_count_minus1, the refiner 306 provides mapIdx=nnra_target_map_index[i] and attrIdx=nnra_target_attribute_index as input data. The refiner 306 identifies oFrame=OccFramesNF[compTimeIdx][0], gFrame=GeoFramesNF[mapIdx][compTimeIdx][frameIdx][0], and aFrame=AttrFramesNF[attrIdx][mapIdx][compTimeIdx]. Then, the following processing is performed using the neural network model of the characteristics SEI (the NNRC SEI whose nnrc_id has the same value as nnra_target_id[i]) designated and activated by the activation information.
In a case that the refinement target information indicates an attribute, the attribute frame of the map indicated by the second refinement target information nnra_target_map_index[i] is added to the input tensor. For example, mapIdx=nnra_target_map_index[i] may be set in (Eq. IN-ATTR-DIM) and (Eq. IN-ATTR-POS).
In a case that the refinement target information indicates an occupancy, a specific component of the output tensor may be set to an occupancy. For example, (Eq. OUT-OCC) may be used.
In a case that the refinement target information indicates a geometry, a specific component of the output tensor is set to a geometry frame with the mapIdx indicated by the second refinement target information nnra_target_map_index[i]. For example, mapIdx=nnra_map_index[i] may be set in (Eq. OUT-GEO).
In a case that the refinement target information indicates an attribute, a specific component of the output tensor is set to an attribute indicated by the second refinement target information nnra_target_map_index. For example, mapIdx=nnra_target_map_index[i] may be set in (Eq. OUT-ATTR), (Eq. OUT-ATTR-DIM), and (Eq. OUT-ATTR-POS).
Further, in a case that an attribute is included as refinement target information, a syntax element nnrc_target_attribute_index[i] indicating an attrIdx to be refined may be included.
In a case that the refinement target information indicates an attribute, an attribute indicated by the second refinement target information nnrc_target_attribute_index and nnra_target_map_index is added to the input tensor. For example, mapIdx=nnra_target_map_index[i] (or nnrc_target_map_index) and attrIdx=nnrc_target_attribute_index (or nnra_target_attribute_index[i]) may be set in (Eq. IN-ATTR-POS2) and (Eq. IN-ATTR-MAP).
In a case that the refinement target information indicates an attribute, a specific component of the output tensor is set equal to an attribute indicated by the second refinement target information nnra_target_map_index[i] and nnra_target_attribute_index[i]. For example, mapIdx=nnra_target_map_index[i] (or nnrc_target_map_index) and attrIdx=nnrc_target_attribute_index (or nnra_target_attribute_index[i]) may be set in (Eq. OUT-ATTR), (Eq. OUT-ATTR-POS), and (Eq. OUT-ATTR-POS2). nnra_attribute_partition_index and nnra_auxiliary_video_flag may be further included.
In the above, the atlas decoder 302 decodes the syntax elements nnra_target_map_index from the activation SEI to derive the mapIdx to be refined and decodes the syntax element nnra_target_attribute_index[i] from the activation SEI to derive the attrIdx of the attribute to be refined. An attribute specified by the attrIdx is selected to perform refinement. This achieves the advantage that it is possible to apply refinement optimized to the specific mapIdx and attrIdx. Further, refinement specified by the same characteristics SEI can be applied to attributes indicated by an attrIdx having the same value. Refinements specified by different pieces of characteristics SEI can be applied to attributes indicated by different attrIdxs at the same time.
In summary, a 3D data decoding apparatus is provided that includes a geometry decoder configured to decode a geometry frame from encoded data and an attribute decoder configured to decode an attribute frame from the encoded data, wherein a refinement information decoder is comprised, the refinement information decoder being configured to decode refinement characteristics information and refinement activation information from the encoded data, a syntax element indicating the number of refinements is decoded from the activation information and an index indicating the characteristics information for the decoded number of refinements is decoded, and a refinement processing unit is comprised, the refinement processing unit being configured to perform refinement processing on the attribute frame or the geometry frame according to the characteristics information.
The refinement information decoder further decodes the number strengthNum of pieces of strength information from the characteristics information and decodes the strengthNum pieces of strength information from the characteristics information, and the refinement processing unit inputs at least one of an occupancy frame, a geometry frame, or an attribute frame to an input tensor of a neural network, and further inputs each of strengthNum values to a respective one of strengthNum channels of the input tensor of the neural network.
The refinement information decoder decodes a flag indicating whether to perform the refinement processing on all maps, and decodes an index indicating which map the refinement processing is to be performed on in a case that the flag indicates that the refinement processing is not to be performed on all the maps.
The refinement information decoder decodes an index of an attribute frame to be refined and the number of attribute frames from encoded data of the characteristics information or the activation information, inputs, to an input tensor of the neural network, attribute frames for the decoded number of attribute frames, with the decoded index of the attribute frame being a start point, and performs refinement processing.
The refinement information decoder decodes, from encoded data of the activation information, refinement characteristics information and information of a map to be refined, decodes a cancel flag indicating whether to cancel refinement processing related to the refinement characteristics information and the map to be refined, and further decodes persistence information in a case that the cancel flag indicates that the refinement processing is not to be canceled.
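The parsing order described above may be sketched as follows; reader is a hypothetical bitstream reader with ue(v)/u(1)-style accessors, and the field names follow the nnra_ syntax elements used in this description.

```python
def decode_refinement_activation(reader):
    """Sketch of the activation-SEI parsing order: target characteristics,
    target map, cancel flag, then persistence only if not canceled."""
    sei = {}
    sei['nnra_target_id'] = reader.read_ue()         # characteristics to apply
    sei['nnra_target_map_index'] = reader.read_ue()  # map to be refined
    sei['nnra_cancel_flag'] = reader.read_u1()
    if not sei['nnra_cancel_flag']:
        # Persistence information is present only when the refinement
        # processing is not being canceled.
        sei['nnra_persistence_flag'] = reader.read_u1()
    return sei
```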
A 3D data encoding apparatus is provided that includes a geometry encoder configured to encode a geometry frame and an attribute encoder configured to encode an attribute frame, wherein a refinement information encoder is comprised, the refinement information encoder being configured to encode refinement characteristics information and refinement activation information into the encoded data, a syntax element indicating the number of refinements is encoded into the activation information and an index indicating the characteristics information for the encoded number of refinements is encoded, and a refinement processing unit is comprised, the refinement processing unit being configured to perform refinement processing on the attribute frame or the geometry frame according to the characteristics information.
The refinement information encoder further encodes the number strengthNum of pieces of strength information into the characteristics information and encodes the strengthNum pieces of strength information into the characteristics information, and the refinement processing unit inputs at least one of an occupancy frame, a geometry frame, or an attribute frame to an input tensor of a neural network, and further inputs each of strengthNum values derived from the encoded strength information to a respective one of strengthNum channels of the input tensor of the neural network.
The refinement information encoder encodes a flag indicating whether to perform refinement processing on all maps, and encodes an index indicating which map the refinement processing is to be performed on in a case that the flag indicates that the refinement processing is not to be performed on all the maps.
The refinement information encoder encodes an index of an attribute frame to be refined and the number of attribute frames into encoded data of the characteristics information or the activation information, inputs, to an input tensor of the neural network, attribute frames for the encoded number of attribute frames, with the encoded index of the attribute frame being a start point, and performs refinement processing.
The refinement information encoder encodes, into encoded data of the activation information, refinement characteristics information and information of a map to be refined, encodes a cancel flag indicating whether to cancel refinement processing related to the refinement characteristics information and the map to be refined, and further encodes persistence information in a case that the cancel flag indicates that the refinement processing is not to be canceled.
The 3D data encoding apparatus 11 includes a patch generator 101, an atlas encoder 102, an occupancy generator 103, an occupancy encoder 104, a geometry generator 105, a geometry encoder 106, an attribute generator 108, an attribute encoder 109, a refinement parameter deriver 110, and a multiplexer 111. The 3D data encoding apparatus 11 receives a point cloud or a mesh as 3D data and outputs encoded data.
The patch generator 101 receives 3D data and generates and outputs a set of patches (here, rectangular images). Specifically, 3D data is divided into multiple regions and each region is projected onto one plane of a 3D bounding box set in 3D space to generate multiple patches. The patch generator 101 outputs information regarding the 3D bounding box (such as coordinates and sizes) and information regarding mapping to the projection planes (such as the projection planes, coordinates, sizes, and presence or absence of rotation of each patch) as atlas information.
The atlas encoder 102 encodes the atlas information output from the patch generator 101 and outputs atlas data.
The occupancy generator 103 receives the set of patches output from the patch generator 101 and generates an occupancy that represents valid areas of patches (areas where 3D data exists) as a 2D binary image (e.g., with 1 for a valid area and 0 for an invalid area). Here, other values such as 255 and 0 may be used for a valid area and an invalid area.
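The occupancy rasterization described above may be sketched as follows; the patch dictionary fields (mask, u0, v0) are hypothetical names for the patch's binary valid-area mask and its top-left placement in the 2D image.

```python
import numpy as np

def generate_occupancy(patches, height, width):
    """Sketch: rasterize each patch's valid-area mask into a 2D binary
    occupancy image (1 = 3D data exists, 0 = invalid area)."""
    occ = np.zeros((height, width), dtype=np.uint8)
    for p in patches:
        # p['mask'] is a uint8 binary array of the patch's valid pixels;
        # (p['u0'], p['v0']) is its top-left placement in the image.
        h, w = p['mask'].shape
        occ[p['v0']:p['v0'] + h, p['u0']:p['u0'] + w] |= p['mask']
    return occ
```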
The occupancy encoder 104 receives the occupancy output from the occupancy generator 103 and outputs an occupancy and occupancy data. VVC, HEVC, or the like is used as an encoding scheme.
The geometry generator 105 generates a geometry frame that stores depth values for the projection planes of patches based on the 3D data, the occupancy, the occupancy data, and the atlas information. The geometry generator 105 derives, as p_min(x, y, z), the point with the smallest depth from the projection plane among the points projected onto pixel g(x, y). The geometry generator 105 also derives, as p_max(x, y, z), the point with the largest depth among the points projected onto pixel g(x, y) and located within a predetermined distance d of p_min(x, y, z). The geometry frame obtained by projecting p_min(x, y, z) for all pixels onto the projection plane is set as the geometry frame of the Near layer. The geometry frame obtained by projecting p_max(x, y, z) for all pixels onto the projection plane is set as the geometry frame of the Far layer.
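The Near/Far derivation may be sketched as follows, assuming a per-pixel list of projected depths; the container layout and function name are illustrative only.

```python
import numpy as np

def near_far_layers(points_per_pixel, d):
    """Sketch: for each pixel g(x, y), Near stores the smallest depth among
    the points projected there; Far stores the largest depth among points
    within distance d of the Near point."""
    h = len(points_per_pixel)
    w = len(points_per_pixel[0])
    near = np.zeros((h, w), dtype=np.float32)
    far = np.zeros((h, w), dtype=np.float32)
    for y in range(h):
        for x in range(w):
            depths = points_per_pixel[y][x]  # list of projected depths
            if not depths:
                continue  # invalid pixel (no 3D data projected here)
            dmin = min(depths)
            near[y, x] = dmin
            far[y, x] = max(z for z in depths if z - dmin <= d)
    return near, far
```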
The geometry encoder 106 receives a geometry frame and outputs a geometry frame and geometry data. VVC, HEVC, or the like is used as an encoding scheme.
The attribute generator 108 generates an attribute frame that stores color information (e.g., YUV values and RGB values) for the projection plane of each patch based on the 3D data, the occupancy, the geometry frame, and the atlas information. The attribute generator 108 obtains a value of an attribute corresponding to the point p_min(x, y, z) with the minimum depth calculated by the geometry generator 105 and sets an attribute frame onto which the value is projected as an attribute frame of the Near layer. An attribute frame similarly obtained for p_max(x, y, z) is set as an attribute frame of the Far layer.
The attribute encoder 109 receives an attribute frame and outputs an attribute frame and attribute data. VVC, HEVC, or the like is used as an encoding scheme.
The refinement parameter deriver 110 receives the attribute frame and the original attribute frame, or the geometry frame and the original geometry frame, selects or derives optimal filter parameters for the NN filter processing, and outputs the optimal filter parameters. The refinement parameter deriver 110 sets values such as an nnra_target_id, an nnra_cancel_flag, and an nnra_persistence_flag in the SEI.
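One way the parameter selection might work is an exhaustive search that minimizes distortion against the original frame; the following is a sketch under that assumption (the candidate set, the refine callable, and the use of mean squared error are illustrative choices, not mandated by this description).

```python
import numpy as np

def derive_refinement_parameters(decoded, original, candidates, refine):
    """Sketch: pick the filter parameter set whose refined output is closest
    to the original frame (here, by mean squared error)."""
    best, best_mse = None, float('inf')
    for params in candidates:
        refined = refine(decoded, params)
        mse = float(np.mean((refined.astype(np.float64)
                             - original.astype(np.float64)) ** 2))
        if mse < best_mse:
            best, best_mse = params, mse
    return best
```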
The multiplexer 111 receives the filter parameters output from the refinement parameter deriver 110 and outputs them in a predetermined format. Examples of the predetermined format include SEI which is supplemental enhancement information of video data, an ASPS and an AFPS which are data structure specification information in the V3C standard, and an ISOBMFF which is a media file format. The multiplexer 111 multiplexes the atlas data, the occupancy data, the geometry data, the attribute data, and the filter parameters and outputs the multiplexed data as encoded data. A byte stream format, the ISOBMFF, or the like is used as a multiplexing method.
The 3D data encoding apparatus 11 includes a video encoder and an SEI encoder. In this configuration, the video encoder corresponds to the occupancy encoder 104, the geometry encoder 106, and the attribute encoder 109.
The 3D data decoding apparatus 31 includes a video decoder, an SEI decoder, a switch, and a refiner. The video decoder corresponds to the occupancy decoder 303, the geometry decoder 304, and the attribute decoder 305 in the configuration described above.
The video encoder encodes an occupancy, a geometry frame, and an attribute frame generated from 3D data. The video decoder decodes the encoded data to reconstruct a decoded image. The SEI encoder generates characteristics SEI and activation SEI from the 3D data, and the SEI decoder decodes these SEI messages. The activation SEI is input to a switch to specify an image on which the refinement processing is to be performed, and only that image is input to the refiner. The characteristics SEI is input to the refiner to specify the refinement to be applied to the decoded image. The image on which the refinement processing has been performed, or the decoded image, is displayed on the 3D data display apparatus 41.
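The decoder-side flow described above may be sketched as follows; the video_decoder, sei_decoder, and refiner objects and their method names are hypothetical placeholders for the components named in this description.

```python
def decode_and_refine(video_decoder, sei_decoder, refiner, bitstream):
    """Sketch: activation SEI acts as a switch selecting which decoded images
    are routed to the refiner; characteristics SEI selects the refinement
    applied there."""
    frames = video_decoder.decode(bitstream)  # occupancy/geometry/attribute images
    characteristics, activation = sei_decoder.decode(bitstream)
    out = []
    for idx, frame in enumerate(frames):
        if activation.targets(idx):           # switch: refine this image?
            frame = refiner.apply(frame, characteristics)
        out.append(frame)
    return out  # images sent to the 3D data display apparatus 41
```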
Although embodiments of the present invention have been described above in detail with reference to the drawings, the specific configurations thereof are not limited to those described above and various design changes or the like can be made without departing from the spirit of the invention.
An embodiment of the present invention is not limited to the embodiments described above and various changes can be made within the scope indicated by the claims. That is, embodiments obtained by combining technical means appropriately modified within the scope indicated by the claims are also included in the technical scope of the present invention.
Embodiments of the present invention are suitably applicable to a 3D data decoding apparatus that decodes encoded data into which 3D data has been encoded and a 3D data encoding apparatus that generates encoded data into which 3D data has been encoded. Embodiments of the present invention are also suitably applicable to a data structure for encoded data generated by a 3D data encoding apparatus and referenced by a 3D data decoding apparatus.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2023-159371 | Sep 2023 | JP | national |