The present disclosure generally relates to volumetric video capture and more particularly to volumetric video captures that use multiplane image formats.
Volumetric video capture is a technique that captures moving images, often of real scenes, in a way that allows them to be viewed later from any angle. This differs from conventional camera capture, which records people and objects from a single, fixed angle. In addition, volumetric video capture records scenes in a three-dimensional (3D) space. The acquired data can then be used to build immersive experiences that are real or generated by a computer. With the growing popularity of virtual, augmented and mixed reality environments, volumetric video capture techniques are also growing in popularity. This is because the technique combines the visual quality of photography with the immersion and interactivity of spatialized content. The technique is complex and combines many of the recent advancements in the fields of computer graphics, optics, and data processing.
The resulting immersive experiences appear extremely realistic but have the drawback of requiring a large amount of data. The management and storage of this data, even on a temporary basis, is both expensive and challenging. Consequently, it is desirable to provide solutions that reduce the amount of data to be managed and stored without affecting the speed and quality of the final product.
In one embodiment, apparatus and associated methods are provided. In one embodiment, the method comprises obtaining a multi-plane image (MPI) representation of a three-dimensional (3D) scene. The MPI representation includes a plurality of slices of content from the 3D scene, each slice corresponding to a different depth relative to a position of a first virtual camera. Each slice is decomposed into regular tiles, and the orientation of each tile is determined.
In a different embodiment, a device and associated method are provided to render a view of a 3D scene. The method comprises obtaining an encoded MPI representation of the 3D scene, the encoded MPI representation comprising one of a bitstream or an atlas. The encoded MPI representation is then decoded to obtain a plurality of tiles, orientation information for each tile of the plurality, and information associating each tile to a slice of the MPI representation and a position within the slice, wherein each slice corresponds to a different depth relative to a position of a first virtual camera. A stacked representation of the slices is then constructed, each slice comprising the tiles associated to the slice, with each tile oriented according to the orientation information of the tile. Finally, the content from the stacked representation of the slices is projected to a merged image, the merged image representing a view of the 3D scene from a position of a second virtual camera.
The teachings of the present disclosure can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
An MPI is a camera-centric, layered 3D representation of images used to create final renderings that are extremely detailed and can be used in a number of immersive technologies. An MPI is often an intermediate data object used to compute a synthetic image. It consists of a collection of planes (or slices) which defines a cube of data. The planes are perpendicular to the optical axis of the virtual camera for which a synthetic image is computed from the real cameras. MPIs are used to provide very detail-oriented images of both real and computer-generated views.
In the example of
MPIs can be computed with deep learning applications. Often MPIs are used to provide view synthesis that can then be used in a variety of applications, including deep learning applications. Image view synthesis describes an algorithm which permits an image to be computed of the scene observed from a position which has not been captured by the matrix of cameras. The extrinsic and intrinsic parameters of the virtual camera can be freely defined. The quality of a synthesized image is good when the virtual camera shares the same intrinsic parameters as the real camera(s).
View synthesis aims at creating a final rendering starting from a number of pictures taken from given points of view. There are several problems associated with the current state of the art, however, that need to be addressed. One problem has to do with the challenge of building a synthetic image from a number of given camera settings and orientations that may be real or virtual. The final rendering of this synthetic image is desirably taken from a virtual camera placed at a given location and with given settings.
Another challenge has to do with the fact that MPI planes are defined perpendicular to the optical axis of the virtual camera, and that existing MPI techniques restrict the content of each slice to lie on the flat plane associated with the slice. The MPI is encoded by keeping tiles made from images or slices of images to a reasonable volume that can be stored. The size of the tiles for an MPI depends on the number of slices defined in the MPI. Interpolated views are then generated from the tiled MPI, which also depends on the number of slices provided. Having a large number of tiles, however, requires a large amount of data to be stored, which becomes problematic as mentioned. Therefore, one way to address these challenges, as will be discussed in one embodiment, is to add to each tile some information that ultimately helps reduce the number of slices and the number of tiles to be stored in a tiled MPI. In this embodiment, the global size of the tiled MPI is reduced for a given picture quality of view synthesis.
In a different embodiment, the virtual views (see equation (12)) can be used to approximate a complete algorithm that can then be used in conjunction with view renderings. In this embodiment, the tiled MPIs are computed for projective cameras, which allows for faster rendering generation. One challenge in such cases is to avoid visible defects that occur when the projective camera is disposed outside of the boundaries defined by the real cameras. In such a case the embodiment uses view extrapolations, which allow the projection planes from the tiled MPI and the planes of the slices to become visible.
Volumetric video capture techniques have been made possible through the growing advancements in the fields of computer graphics, optics, and data processing, among which are evolutions in the development of cameras that capture images in a particular manner. One such camera is the light-field camera, which can be used in the generation of MPIs because it provides multiple views of the same scene simultaneously.
Light-field cameras allow real content to be captured from various points of view. The two major families of light-field cameras are the matrix of cameras and the plenoptic camera. A matrix of cameras can be replaced by a single camera which is used to perform many acquisitions from various points of view; the light-field being captured is then limited to a static scene. With plenoptic cameras, micro-lenses are located between the main-lens and the sensor. The micro-lenses produce micro-images which correspond to various points of view. The matrix of micro-images collected by the sensor can be transformed into so-called sub-aperture images which are equivalent to the acquisition obtained with a matrix of cameras. Embodiments are described considering a matrix of cameras, but would apply equally well to the set of sub-aperture images extracted from a plenoptic camera.
Camera calibration is important and involves a set of algorithms and special images which are acquired in order to estimate the so-called extrinsic and intrinsic parameters. The extrinsic parameters describe the position of the cameras in a real World Coordinate System: 3 translations to characterize the position of the centre of the main-lens pupil, and 3 rotation angles to characterize the orientation of the main optical axis of the cameras. The intrinsic parameters describe the internal properties of each camera such as the focal length, the principal point and the pixel size. They might also include the geometric distortion produced by the main-lens, which distorts the captured images compared to an ideal thin lens. Many calibration procedures rely on a checkerboard which is observed many times from various points of view.
Cameras can be calibrated geometrically. In a scenario where there are N cameras, the N cameras are calibrated using, for instance, a black and white checkerboard which is simultaneously observed by all cameras. Several pictures are taken with the checkerboard placed at different positions relative to the cameras. On each picture, the 2D coordinates of the corners delimited by 2 black and 2 white squares of the checkerboard are extracted. From one image, the 2D coordinates of the corners are associated with the 2D coordinates of the same corners observed by the other cameras.
With the N 2D coordinates of corners observed by the N cameras and also for the different exposures, it is possible to estimate the position of the cameras according to a World Coordinate System (WCS). In this system, the centre of the pupil of the main-lens of camera i∈[1,N] is positioned in space by a translation vector Ti=(X,Y,Z)t, and the orientation of the optical axis is defined by a 3D rotation matrix Ri. The pose matrix of camera i is defined by Pi=(Ri Ti)∈ℝ3×4. The extrinsic matrix of camera i is defined by Qi=(Ri−1 −Ri−1·Ti)∈ℝ3×4. The intrinsic camera parameters (focal length, principal point, pixel size, geometrical distortion) are estimated simultaneously with the extrinsic camera parameters.
With camera calibration it is possible to convert a 2D pixel coordinate (x,y) from one camera i into a 3D WCS coordinate (X,Y,Z)t for any distance z between the camera i and the object visible at pixel (x,y). It is also possible, from any point in space (X,Y,Z)t, to compute its coordinate observed at pixel (x,y) from camera i.
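By way of illustration, the following is a minimal numpy sketch of this projection/de-projection, assuming an ideal pinhole model with no lens distortion; the function names, and the use of a 3×3 intrinsic matrix K together with the pose (Ri, Ti), are assumptions for illustration rather than the exact formulation of the disclosure.

```python
import numpy as np

def pixel_to_wcs(x, y, z, K, R, T):
    """De-project pixel (x, y) of a camera, at distance z along the optical
    axis, into a 3D point in the World Coordinate System (WCS).
    Assumes a simple pinhole model (no lens distortion)."""
    # Back-project to a point in the camera coordinate system.
    p_cam = z * np.linalg.inv(K) @ np.array([x, y, 1.0])
    # Apply the pose (rotation R, translation T) to reach the WCS.
    return R @ p_cam + T

def wcs_to_pixel(P_wcs, K, R, T):
    """Project a WCS point onto the sensor of the camera and return (x, y, z),
    where z is the distance of the point along the camera's optical axis."""
    # Extrinsic transform: WCS -> camera coordinate system.
    p_cam = R.T @ (np.asarray(P_wcs, dtype=float) - T)
    z = p_cam[2]
    # Intrinsic projection onto the image plane.
    uvw = K @ p_cam
    return uvw[0] / uvw[2], uvw[1] / uvw[2], z
```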
Point clouds are one or more sets of 3D points in the WCS. Each 3D point is associated with an RGB color. Point clouds can be easily obtained from a Multi-View plus Depth (MVD) by projecting each RGB pixel into the WCS, knowing the camera calibration parameters and the corresponding depth.
Another important concept is that of Depth-map estimation. With a light-field camera, a given object in a scene is observed many times with a varying parallax. It is therefore possible to estimate the distance of that object from all cameras. One deduces a so-called depth-map where each pixel quantifies the distance of objects which are visible in the corresponding image acquired by a given camera.
An MVD designates the set of images obtained by the matrix of cameras, plus a corresponding set of depth-map images. One depth-map is associated with one image; it shares the same spatial resolution and the same viewing position.
In one embodiment as will be presently discussed, additional data is to be added to each tile in order to orient the tiles such as to decrease the space visible between the tiles when MPI is observed from an extrapolated position. Each slice of the MPI thus no longer restricts its associated content to a flat plane, but rather may be thought of as a collection of oriented (or tilted) tiles which may extend outside the plane according to the orientation of each tile.
Such a representation may permit faster processing at the rendering side, allowing real-time immersion.
Traditionally, the quality of the synthetic views from the MPI depends on the number of slices. In a tiled version of the MPI, the amount of data is reduced but the quality of the synthetic views still depends on the initial number of slices. In one embodiment, it is possible to add information related to orientation to each of the tiles in order to reduce (globally) the number of slices for a given expected view synthesis quality. In this embodiment, with these oriented tiles, it is not necessary to split the object space into many slices (e.g. oriented tiles may allow an MPI with fewer slices to produce the same view synthesis/rendering quality as an MPI with more slices but which lacks the oriented tile information). If the original MPI content has 500 slices, for instance, 100 slices may be sufficient once the orientation information of the tiles is available. Through this information, the orientation of the tiles can be determined, and depth accuracy can be maintained. The reduction of tiles also allows for a reduction of the total amount of data to be stored.
In one embodiment, the orientation of each tile can be managed in the following manner:
In another embodiment, a view synthesis algorithm can be used which allows the computation of specific volumetric data. From the acquisition of the raw images by the matrix of cameras to the computation of the synthetic image as seen from a virtual camera position, several steps are performed.
The following examples and explanations will help with an understanding of the estimation of depth-maps. For illustration, two cameras will be used in the following explanation but other numbers of cameras can be used in alternate embodiments, as can be appreciated by those skilled in the art.
In one embodiment, a method to estimate the depth associated with a pixel uses the epipolar line as follows:
In different embodiments, similarities are computed using various estimators. For ease of understanding, two common similarity estimators are listed below; however, as known to those skilled in the art, other estimators can be used in alternate embodiments.
The first estimator is the L1 norm between 2 pixels. Let the observed pixel p be a color pixel defined by the 3 scalars corresponding to the 3 color components Red, Green and Blue (pR,pG,pB). The L1 norm between 2 pixels pref(x,y) and psec(xz,yz) is defined by sL1(pref,psec)=|pref,R−psec,R|+|pref,G−psec,G|+|pref,B−psec,B|.
The second estimator is the squared L2 norm between 2 pixels. This is similar to the L1 norm previously described, except that the similarity measure for the squared L2 norm is defined by sL2(pref,psec)=√(|pref,R−psec,R|2+|pref,G−psec,G|2+|pref,B−psec,B|2).
Under one scenario, if the similarity is estimated only with the color components of one pixel, the depth estimation is very sensitive to noise. To overcome this limitation, the similarity between 2 pixels is computed using a patch which includes a few surrounding pixels. This technique is referred to as cross-patch depth-estimation. Obviously, it requires much more computation, since it requires P2 times more computation for a patch of P×P pixels compared to the similarity between 2 pixels. This is a critical point for real-time estimation, especially when embedded into mobile devices. The similarity operator described above can be used for patches surrounding a pixel:
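As a rough illustration of such a cross-patch similarity (a hypothetical sketch, not the exact operator referenced above), the following accumulates the L1 color distance over a P×P patch; the function name, the array layout and the absence of border handling are assumptions.

```python
import numpy as np

def patch_similarity_L1(img_ref, img_sec, x_ref, y_ref, x_sec, y_sec, P=5):
    """Cross-patch similarity: sum of per-pixel L1 color distances over a
    P x P patch centred on the two candidate pixels. Lower is more similar.
    Images are (H, W, 3) arrays; patches falling outside the image are not
    handled in this sketch."""
    h = P // 2
    patch_ref = img_ref[y_ref - h:y_ref + h + 1, x_ref - h:x_ref + h + 1, :3]
    patch_sec = img_sec[y_sec - h:y_sec + h + 1, x_sec - h:x_sec + h + 1, :3]
    # |R - R'| + |G - G'| + |B - B'| accumulated over the whole patch.
    return np.abs(patch_ref.astype(np.float32)
                  - patch_sec.astype(np.float32)).sum()
```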
In one embodiment, the depth map is computed between a reference camera and another camera. In the case of a matrix made of N cameras, N−1 depth-maps are estimated for a given camera. These depth-maps can be merged into a single one (by averaging, taking the closest data, etc.) in order to estimate one depth-map per camera. At the end of this procedure, the N images obtained by the N cameras are associated with N depth-maps. This data is called Multi-View plus Depth (MVD).
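A minimal sketch of this merging step might look as follows, assuming the N−1 depth-maps are given as equally sized arrays and ignoring invalid or missing depth values; the function name and the two merge modes are illustrative assumptions.

```python
import numpy as np

def merge_depth_maps(depth_maps, mode="closest"):
    """Merge the N-1 depth-maps estimated for one reference camera into a
    single depth-map, either by averaging or by keeping the closest depth."""
    stack = np.stack(depth_maps, axis=0)          # shape (N-1, H, W)
    if mode == "average":
        return stack.mean(axis=0)
    return stack.min(axis=0)                      # closest object wins
```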
In one embodiment, view synthesis denotes the computation of an image from a virtual camera which is located close to the matrix of cameras from which the MVD has been observed/computed. The view synthesis algorithm can, in one example, be provided through the following steps:
The above steps will be expanded with additional details in the description to follow. Note that the first three steps of the above step list provide one way to generate an MPI representation of the 3D scene. The resulting MPI is denoted as ‘virtual colorcube’ in the above step list. The fourth step describes how the MPI representation is used to efficiently generate or synthesize a new view of the 3D scene. As noted previously, there are many known techniques for generating the MPI representation, and steps 1, 2 and 3 of the above list are provided as a concrete illustration. However, the current invention is not limited to the MPI generation technique characterized by steps 1, 2 and 3 above. Rather the current invention may utilize any known technique for generating the MPI representation of the 3D scene. For example, various deep learning approaches for generating the MPI could be employed to generate the MPI representation.
Referring back to
To compute the consensus of camera i, a ray is cast from that camera and passes through a pixel (x,y) (i=2 in
Also the Heaviside H(a,b) function is defined as follows:
The value of the consensus at pixel (x,y) for the camera i at the slice s is equal to:
Where M is the set of cameras which are used to compute the consensus of camera i. For a precise computation M is chosen equal to all cameras. da(Cv,Jk) is the algebraic measure between the virtual camera Cv and point Jk. da(Jk,Ps) is the algebraic measure between point Jk and the plane Ps. These distances are computed using Qv the intrinsic matrix of the virtual camera:
da(Cv,Jk)=[0 0 1]·Qv·[Jk 1]T
da(Jk,Ps)=[0 0 1]·Qv·[Jk 1]T−z(s) (5)
Δz is the thickness of a slice, with Δz=z(s+½)−z(s−½). Projection and de-projection are computed with the intrinsic and extrinsic camera parameters. The consensus is defined as the ratio between the number of depth-maps (e.g. the number of cameras) which agree that an object is within a slice and the total number of depth-maps (e.g. the total number of cameras) which can still see this slice and beyond. da(Jk,Ps) are illustrated in
The computation of the consensus Ci is noisy, especially when most of the images are occluded beyond a certain distance. In this case, the denominator of equation (4) tends to zero. One option is to set a minimum value for the denominator. This minimum value is experimentally set to M/4. The consensus Ci at slice s can be smoothed in order to improve its signal-to-noise ratio. Denoising is performed slice by slice by so-called guided denoising algorithms. A local smoothing kernel is computed with surrounding pixels around Ci(x,y,s) from the consensus at slice s and around pixels from the observed image Ii(x,y).
Soft Visibility is computed for a given image Ii by integrating its consensus Ci through slices according to the following equation:
The visibility is equal to 1 for the first slice and decreases towards 0. When the visibility decreases toward 0, this means that beyond a given slice, the image Ii is occluded by an object visible at pixel Ii(x,y). The max( ) in equation (6) prevents the visibility from decreasing below 0. This occurs frequently because the consensus is the agreement between all cameras, which can see beyond objects that are occluded from view i. Potentially, Σs′=1s′=sCi(x,y,z(s′)) can be equal to M, the number of cameras used to compute Ci.
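The integration of the consensus into a soft visibility might be sketched as follows; this is a hypothetical approximation of equation (6), assuming the consensus cube is stored as an (H, W, S) array and that the cumulative sum includes the current slice.

```python
import numpy as np

def soft_visibility(consensus):
    """Integrate the consensus cube C_i(x, y, s) along the slice axis to
    obtain a soft visibility: the visibility starts at 1 on the first slice,
    decreases as consensus accumulates, and is clamped so that it never goes
    below 0. `consensus` has shape (H, W, S), slice index s growing with depth."""
    # Cumulative consensus accumulated up to (and including) each slice.
    accumulated = np.cumsum(consensus, axis=2)
    return np.maximum(1.0 - accumulated, 0.0)
```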
To estimate a virtual image seen from a virtual camera position, a virtual colorcube also called MPI Colorsynth(x,y,z(s)) is computed as a preliminary step. The colorcube is in the coordinate system of the virtual camera which is characterized with intrinsic and extrinsic camera parameters. Each slice of this virtual cube is computed as an average of the M′ images weighted by the corresponding soft visibility.
In (7), (xk′,yk′,zk′) denotes the re-projected coordinate (x,y,z(s)) from the virtual camera to the real camera k. The great advantage of this approach is that the integer coordinates (x,y,z(s)) of the virtual color cube are computed with a backward warping approach, which is made possible thanks to the sampling of z(s) by the cube. The virtual color cube is like a focal stack where only objects lying at the given slice are visible; the foreground objects have been removed.
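A simplified per-texel sketch of this weighted average, in the spirit of equation (7) but not its exact formulation, is shown below; the warp_to_camera helper is hypothetical, nearest-neighbour sampling stands in for the interpolation that would normally be applied, and each visibility map is assumed to be the soft visibility of camera k already selected for slice s.

```python
import numpy as np

def colorcube_texel(images, visibilities, warp_to_camera, x, y, z_s):
    """One texel of the virtual colorcube: average of the M' real images,
    each sampled at its re-projected coordinate and weighted by its soft
    visibility at that coordinate.
    `images[k]` is an (H, W, 3) array, `visibilities[k]` an (H, W) array,
    and `warp_to_camera(k, x, y, z_s)` a hypothetical helper returning the
    re-projected (x_k', y_k') in real camera k."""
    num = np.zeros(3, dtype=np.float32)
    den = 0.0
    for k, (img, vis) in enumerate(zip(images, visibilities)):
        xk, yk = warp_to_camera(k, x, y, z_s)
        xi, yi = int(round(xk)), int(round(yk))        # nearest-neighbour sample
        w = vis[yi, xi]
        num += w * img[yi, xi, :3]
        den += w
    return num / den if den > 0 else num
```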
In one embodiment, a virtual color cube can also be created. In this embodiment, the MPI is merged to form a unique virtual color image. In this embodiment, it may be helpful to first compute the consensus cube Consensussynth(x,y,z(s)) and the visibility cube SoftVissynth(x,y,z(s)) associated with the color virtual images. Similarly to equation (7), the computation is done by averaging the M′ initial consensus or visibility cubes:
Where (x,y,z(s)) is a voxel coordinate of the virtual consensus cube. Consensussynth(x′,y′,z′) is computed by deprojecting voxel (x,y,z(s)) into the WCS (X,Y,Z) and then projecting it into the coordinates (xk′,yk′,zk′), with zk′ being the distance from point (X,Y,Z) to camera ck.
Both cubes defined above are combined into CC(x,y,z(s))
CC(x,y,z(s))=min(Consensussynth(x,y,z(s)),SoftVissynth(x,y,z(s))) (10)
The CC is a kind of probability which varies between 0 and 1. The typical values are:
The color slices are then weighted by consensus and accumulated until ray visibility reaches zero:
In one embodiment, the virtual colorcube (that is, the MPI representation of the 3D scene) is saved with pixels made of 4 values: Red, Green, Blue and α (RGBα). The RGB encodes the colors computed by equation (7). The α encodes the CC(x,y,z(s)) component which was computed with equation (10).
In the embodiment discussed, as a final step of the view synthesis algorithm, the virtual colorcube is merged into a single virtual image according to some weights.
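This merging step can be sketched as a front-to-back accumulation of the RGBα slices. The following is a hypothetical approximation of the merging equation (not reproduced above): it assumes the CC value stored in α is subtracted from the remaining visibility, consistent with the soft-visibility definition above, and that slice 0 is the closest to the virtual camera.

```python
import numpy as np

def merge_colorcube(colorcube):
    """Merge the virtual colorcube (MPI) into a single image by accumulating
    RGB slices front to back, weighted by the alpha (CC) channel, until the
    remaining ray visibility reaches zero. `colorcube` has shape (S, H, W, 4)
    with channels R, G, B, alpha."""
    S, H, W, _ = colorcube.shape
    out = np.zeros((H, W, 3), dtype=np.float32)
    visibility = np.ones((H, W, 1), dtype=np.float32)      # remaining ray visibility
    for s in range(S):
        rgb = colorcube[s, :, :, :3].astype(np.float32)
        alpha = colorcube[s, :, :, 3:4].astype(np.float32)
        out += visibility * alpha * rgb                     # weight by CC and visibility
        visibility = np.maximum(visibility - alpha, 0.0)    # stop once visibility hits 0
    return out
```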
Once the MPI is defined for a given virtual camera position, in one embodiment, other virtual views are approximated, and the virtual color cube is provided with perspective projections (
The equation (12) is modified by the projection of the 3D coordinate (x,y,z) with the 4×4 projection matrix P:
Where [xp,yp,zp,1]=P×[x,y,z(s),1]. The projected coordinate (xp,yp,zp) being non-integer, the values Colorsynth(xp,yp,zp) are extracted with interpolation. Merging the virtual colorcube with a slanted projection produces a virtual image with slightly lower quality than the complete algorithm computed for the first virtual camera. Nevertheless, this approach permits splitting the computation of the first 3 steps of the algorithm, including the computation of the virtual colorcube, from the stacking of that cube into the virtual image. Real-time rendering is therefore possible with recorded content and some precomputation up to the virtual color cube.
Choosing all tiles with the same size makes the splitting of the virtual colorcube easier.
An Atlas is made of:
In order to allow the orientation of each tile, the following operations are done:
Equation (4) describes the computation of the consensus cube for a given camera and a given slice. The proposed algorithm defines the average z consensus CZi(x,y,z) defined by:
While the virtual colorcube (MPI) is being computed by projecting the raw images into the virtual slices, it is important to project also the average z consensus CZi(x,y,z) into a cube Zsynth(x,y,z(s)) having the same size as the MPI. This cube keeps track of the average z of the objects which are more accurate than slice thicknesses. Zsynth(x,y,z(s)) is computed by the following equation:
As for equation (7), (xk′,yk′,zk′) denotes the re-projected coordinate (x,y,z(s)) from the virtual camera to the real camera k.
da(Cv,Jk) is the distance between the virtual camera Cv and point Jk. This distance contributes to CZi(x,y,z(s)) if point Jk belongs to slice s. CZi(x,y,s) represents the average distance to the virtual camera of points Jk that belong to slice s. This average distance is very accurate for depth-maps computed on textured areas; for these areas, the thickness of the slice is too large compared to the accuracy of the depth-maps. The CZi(x,y,s) permits keeping track of this accuracy. On texture-less areas the points Jk are spread over several slices.
In this way, the MPI is computed, and the cube Zsynth(x,y,z(s)) defines for each pixel in the MPI the distance to the camera with an accuracy greater than the corresponding slice thickness. The MPI is converted into a tiled MPI in order to save space. Tiles are extracted from the MPI, and the tiles are oriented by using Zsynth(x,y,z(s)).
To compute the four corners of the tiles, first the average distance of the tile is computed. (xt,yt) is the left bottom pixel coordinate of the tile of size (Tx,Ty):
The slopes of z are estimated with the x and y derivatives of Zsynth:
From the previous equation, one derives the 4 distances of the 4 corners of the tile t.
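A hypothetical sketch of this corner computation is given below; the slope equations referenced above are not reproduced, so the average depth and the slopes are simply estimated with finite differences over the pixels of the tile, and the corner ordering matches the triangle listing used later (Z1 bottom-left, Z2 bottom-right, Z3 top-left, Z4 top-right).

```python
import numpy as np

def tile_corner_depths(Z_slice, xt, yt, Tx, Ty):
    """Estimate the 4 corner depths (Z1, Z2, Z3, Z4) of a tile from the
    Zsynth slice: average depth over the tile plus linear slopes in x and y."""
    tile = Z_slice[yt:yt + Ty, xt:xt + Tx].astype(np.float32)
    z_mean = tile.mean()
    # Average per-pixel slopes of z inside the tile (np.gradient returns d/dy, d/dx).
    slope_y, slope_x = (g.mean() for g in np.gradient(tile))
    dzx, dzy = slope_x * Tx / 2.0, slope_y * Ty / 2.0
    Z1 = z_mean - dzx - dzy      # bottom-left  (xt,      yt)
    Z2 = z_mean + dzx - dzy      # bottom-right (xt + Tx, yt)
    Z3 = z_mean - dzx + dzy      # top-left     (xt,      yt + Ty)
    Z4 = z_mean + dzx + dzy      # top-right    (xt + Tx, yt + Ty)
    return Z1, Z2, Z3, Z4
```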
The Atlas is extended to comprise the orientation information for the tiles. An atlas is made of nx*ny tiles, each tile having the size Tx*Ty. Each pixel of a tile is defined by 4 components RGBα, where α is the CC as defined in equation (10). In the general case, each tile has an (x,y,z) coordinate corresponding to its location in the scene. The oriented tile will have another set of 4 coordinates corresponding to the depths of the 4 corners (Z1, Z2, Z3, Z4).
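One possible in-memory layout of such an oriented tile, shown only as an illustrative assumption of how the extended atlas entries could be organized, is sketched below.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class OrientedTile:
    """One entry of the extended atlas (hypothetical layout): the tile
    texture plus the information needed to place and orient it in the MPI."""
    rgba: np.ndarray            # (Ty, Tx, 4) texture, alpha = CC of equation (10)
    slice_index: int            # slice s of the MPI the tile belongs to
    xt: int                     # left-bottom pixel coordinate within the slice
    yt: int
    zt: float                   # depth of the slice, z(s)
    corners_z: tuple            # (Z1, Z2, Z3, Z4): depths of the 4 corners
```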
In one embodiment, an example can be provided where the extended Atlas is made of:
The atlas (e.g. the extended atlas comprising the tile orientation information) is used to reconstruct a stacked tile representation comprising the oriented tiles, and this is projected into a 2D image according to the projection matrix P. The atlas image Az gives the 4 distances z of the tile corners.
Where [xp, yp, zp, 1]=P×[xt+i,yt+j,zt(i,j),1] with zt(i,j)=Z1+i(Z2−Z1)+j(Z3−Z1). A graphics API such as OpenGL is commonly used for real-time projection of the MPIs. With OpenGL, it is sufficient to give the coordinates of 2 triangles to plot a tile. In the virtual camera system, the first triangle has the following coordinates: [(xt,yt,Z1), (xt,yt+Ty,Z3), (xt+Tx,yt,Z2)]. The second triangle has the following coordinates: [(xt,yt+Ty,Z3), (xt+Tx,yt,Z2), (xt+Tx,yt+Ty,Z4)]. The 2 triangles are associated with the textures (e.g. RGB values) given by the [Tx, Ty] pixels recorded for the tile in the atlas. OpenGL performs the projection of the 2 triangles and the rasterization according to the projection matrix P. The size of AZ is negligible compared to the size of the atlas image A which stores the tile textures. Also, the computation time taking into consideration the z distances of the tile corners has no impact compared to projecting the tiles at a given z(st).
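The decomposition of an oriented tile into the two triangles described above can be sketched as follows; the texture-coordinate mapping is an assumption added for illustration, and the vertices would then be handed to the graphics API together with the projection matrix P.

```python
def tile_triangles(xt, yt, Tx, Ty, Z1, Z2, Z3, Z4):
    """Vertices (in the virtual camera system) of the two triangles needed to
    draw one oriented tile, plus illustrative texture coordinates over the
    [Tx, Ty] tile texture stored in the atlas."""
    v_bl = (xt,      yt,      Z1)   # bottom-left
    v_br = (xt + Tx, yt,      Z2)   # bottom-right
    v_tl = (xt,      yt + Ty, Z3)   # top-left
    v_tr = (xt + Tx, yt + Ty, Z4)   # top-right
    tri1 = (v_bl, v_tl, v_br)       # first triangle, as listed in the text
    tri2 = (v_tl, v_br, v_tr)       # second triangle
    uv = {v_bl: (0.0, 0.0), v_br: (1.0, 0.0),
          v_tl: (0.0, 1.0), v_tr: (1.0, 1.0)}   # assumed texture mapping
    return tri1, tri2, uv
```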
In an alternate embodiment, the atlas (e.g. extended atlas comprising the tile orientation information) may be stored. The atlas may be saved to a file, or written to a bitstream. The file or bitstream may be stored for later use, or it may be conveyed over a network to another device so that the other device may use the atlas information to render views of the 3D scene (e.g. in real time). At the decoder side, the atlas information is decoded to obtain the RGBα of each pixel, (xt,yt,zt), and the 4 depth values (Z1, Z2, Z3, Z4) of each tile. The (xt,yt,zt) and the 4 depths are used to recalculate the coordinates of each pixel for each tile [xp,yp,zp,1]. The coordinates of a given pixel belonging to a given tile are computed using the projection matrix P and the z coordinate calculated from the depth values of the 4 corners following the equation:
[xp,yp,zp,1]=P×[xt+i,yt+j,zt(i,j),1] with zt(i,j)=Z1+i(Z2−Z1)+j(Z3−Z1)
OpenGL may be used by the decoder device to achieve real-time projection of the MPIs. With OpenGL, it is sufficient to give the coordinates of 2 triangles to plot a tile. In the virtual camera system, the first triangle has the following coordinates: [(xt,yt,Z1), (xt,yt+Ty,Z3), (xt+Tx,yt,Z2)]. The second triangle has the following coordinates: [(xt,yt+Ty,Z3), (xt+Tx,yt,Z2), (xt+Tx,yt+Ty,Z4)]. The 2 triangles are associated with the textures (RGB) given by the [Tx,Ty] pixels recorded in the atlas. OpenGL performs the projection of the 2 triangles and the rasterization according to the projection matrix P. Each pixel is synthesized following the equation:
Computing an MPI with many slices has in the past been a necessity in order to have accurate synthesized views (both from virtual camera positions between the real cameras, and also from extrapolated positions). But having many slices produces many significant tiles, and thus the tiled MPI becomes larger. In one example, with a given scene, for an MPI with 500 slices, 280000 tiles can be extracted, requiring 55 Mbytes of data. Using oriented tiles, in one embodiment, comparable rendering performance can be achieved with a reduced number of slices (e.g. 100 slices instead of 500). With only 100 slices, the number of extracted tiles is reduced to 130000, which reduces the size to 26 Mbytes of data. The additional cost of encoding the oriented tiles is equal to 8% of the extracted tiles; therefore the tile orientation information incurs an additional cost of 2 Mbytes, which is small with regard to the total size, and much smaller than the total size in the case of 500 slices. In one embodiment, simulations illustrate that an MPI does not need to have too many slices if the tiles are oriented accordingly. Having fewer slices permits reducing the size of the tiled MPI despite the additional encoding cost of the orientation of the tiles.
As discussed in one of the previous embodiments, the tile orientations can be characterized with the 4 corner depth values Z1, Z2, Z3 and Z4 recorded into the 2D image AZ; the values can be computed from the slopes estimated on the pixels belonging to that tile, as given by equations (16) and (17). In another embodiment, the image AZ could be defined slightly differently by keeping only 3 components, for instance Zcenter, ZslopeX and ZslopeY, representing respectively the depth of a center point of the tile, the slope of depth with respect to the x dimension, and the slope of depth with respect to the y direction. The essence of AZ is to model the variation of the average z consensus CZi(x,y,z) for the pixels belonging to the extracted tile for a given slice. In one embodiment, a linear model (e.g. corresponding to a flat oriented tile) may be used with orientations characterized with slopes. For this model, there are many different ways that the spatial position and orientation of such tiles may be specified. For example, depth values could be specified for four corners of the tile, depth values could be specified for three corners of the tile, or depth values could be specified for a center point of the tile as well as two of the tile's corners. Depth values could be provided for points located at the center of one or more of the edge boundaries of the tiles. Alternately, a single depth value may be provided, along with horizontal and vertical slope parameters to define the tile orientation. Instead of slope parameters, two-component angular values may be used to specify the orientation. Such angular values may indicate the angle at which the tile is oriented relative to the slice plane, for example in the horizontal and vertical directions. Alternately, the angular values may indicate the angle of a surface normal of the tile relative to the slice plane. In one embodiment, any parameters may be utilized which specify a position (e.g. depth) and an orientation for the tiles. Moreover, such parameters may be stored in an atlas image AZ as previously specified; however, other techniques for storing and providing access to the tile orientation parameters are also possible. Moreover, models other than the linear (e.g. flat tile) model could be used, for instance a second-order model which allows the tile to take on a non-planar shape. Any model able to describe a surface in 3D space could be used.
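A small sketch of the conversion between the 4-corner parameterization and the 3-component (Zcenter, ZslopeX, ZslopeY) alternative is given below; it assumes a planar tile and the corner ordering used in the triangle listing above (Z1 bottom-left, Z2 bottom-right, Z3 top-left, Z4 top-right).

```python
def corners_to_center_slopes(Z1, Z2, Z3, Z4, Tx, Ty):
    """Convert the 4-corner parameterization into the 3-component alternative
    (Zcenter, ZslopeX, ZslopeY); exact only for a planar (flat) tile."""
    Zcenter = (Z1 + Z2 + Z3 + Z4) / 4.0
    ZslopeX = ((Z2 + Z4) - (Z1 + Z3)) / (2.0 * Tx)   # depth change per pixel in x
    ZslopeY = ((Z3 + Z4) - (Z1 + Z2)) / (2.0 * Ty)   # depth change per pixel in y
    return Zcenter, ZslopeX, ZslopeY

def center_slopes_to_corners(Zcenter, ZslopeX, ZslopeY, Tx, Ty):
    """Inverse conversion: recover the 4 corner depths of the planar tile."""
    dzx, dzy = ZslopeX * Tx / 2.0, ZslopeY * Ty / 2.0
    return (Zcenter - dzx - dzy,    # Z1, bottom-left
            Zcenter + dzx - dzy,    # Z2, bottom-right
            Zcenter - dzx + dzy,    # Z3, top-left
            Zcenter + dzx + dzy)    # Z4, top-right
```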
In one embodiment, the decoding device 1170 can be used to obtain an image that includes at least one color component, the at least one color component including interpolated data and non-interpolated data, and to obtain metadata indicating one or more locations in the at least one color component that have the non-interpolated data.
In one embodiment, a method or device can be implemented that can generate an enhanced multi-plane image (MPI) representation of a 3D scene. In this embodiment, the device can have a processor that can obtain an MPI representation of the scene. The MPI representation comprises a plurality of slices of content from the 3D scene, where each slice corresponds to a different depth relative to a position of a first virtual camera. Each slice is then decomposed into regular tiles and the orientation information for each of the tiles is determined. The tiles can then be stored, including their orientation information, and also information associating each tile to a slice of the MPI representation, and a tile position within the slice.
In another embodiment, a similar method and device can be used to render a view of a 3D scene. In this embodiment, the MPI representation is obtained in a similar manner, the slices are then decomposed, and each tile orientation is likewise determined. However, a stacked representation of the slices is then constructed. In this embodiment, each slice comprises the tiles decomposed from the slice, and each tile is oriented according to the orientation information of the tile. The content is then projected from the stacked representation of the slices to a merged image, the merged image representing a view of the 3D scene from a position of a second virtual camera.
A number of enhancements can also be implemented in either embodiment above. For example obtaining the MPI representation can comprise generating the MPI representation from a multi-view plus depth (MVD) capture of the 3D scene. Obtaining the MPI representation can also comprise computing the MPI representation from captured scene information using a deep learning algorithm.
The different depth of each slice corresponds to at least one of a minimum depth, a maximum depth, or an average depth of the slice. For each tile, depth values are determined for the 3D scene content of the tile, and the orientation information of the tile is determined based on these depth values.
In one embodiment, each slice is decomposed into regular tiles. It is then determined which of the regular tiles contain significant content; those that have significant content are retained and the others discarded.
In one embodiment, the orientation information for each tile, and information associating each tile to a slice of the MPI and a tile position within the slice are stored in an atlas file. In another embodiment, the orientation information for each tile, and information associating each tile to a slice of the MPI and a tile position within the slice are written to a bitstream.
Furthermore, in one embodiment, a stacked representation of the slices is constructed. Each slice comprises the tiles decomposed from the slice, and each tile is oriented according to the orientation information of the tile. The content is projected from the stacked representation of the slices to a merged image, the merged image representing a view of the 3D scene from a position of a second virtual camera.
The orientation information, in one embodiment, can include one or more of: depth values for corners of the tile; a depth value for the center point of the tile; a horizontal slope value; a vertical slope value; or angular values of a surface normal of the tile.
In another embodiment, a method is introduced to render a view of the 3D scene. In this embodiment, an encoded MPI representation is obtained. The encoded MPI representation may comprise one of a bitstream or an atlas file. The encoded MPI is then decoded to obtain the tiles, as well as orientation information for each tile and information associating each tile to a slice of the MPI representation and a tile position within the slice. Each slice corresponds to a different depth relative to a position of a first virtual camera. A stacked representation of the slices is then constructed; each slice comprises the tiles associated to the slice, and each tile is oriented according to the orientation information. The content is then projected from the stacked representation of the slices to a merged image, the merged image representing a view of the 3D scene from a position of a second virtual camera.
The encoded MPI representation can be obtained by receiving the encoded MPI representation via a communication network. It can also be obtained by reading the encoded MPI representation from one of a file system or memory. Furthermore, the projecting of the content can include decomposing each oriented tile into a pair of triangles and determining vertex positions of each triangle of the pair of triangles so that the vertex positions can be sent to a graphics processing unit (GPU) via an application programming interface (API) such as OpenGL.
A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made. For example, elements of different implementations may be combined, supplemented, modified, or removed to produce other implementations. Additionally, one of ordinary skill will understand that other structures and processes may be substituted for those disclosed and the resulting implementations will perform at least substantially the same function(s), in at least substantially the same way(s), to achieve at least substantially the same result(s) as the implementations disclosed. Accordingly, these and other implementations are contemplated by this application.
It should be understood that while the specification provides and discusses steps for generating an MPI representation of a 3D scene, those steps are presented by way of example to aid understanding. One skilled in the art will recognize that there are various known techniques for generating an MPI representation of a 3D scene, and so the invention may be used with any known MPI generation technique.