Embodiments relate to capturing and rendering three-dimensional (3D) video. Embodiments further relate to training a neural network model for use in re-rendering an image for display.
The rise of augmented reality (AR) and virtual reality (VR) has created a demand for high quality display of 3D content (e.g., humans, characters, actors, animals, and/or the like) using performance capture rigs (e.g., camera and video rigs). Recently, real-time performance capture systems have enabled new use cases for telepresence, augmented videos and live performance broadcasting (in addition to offline multi-view performance capture systems). Existing performance capture systems can suffer from one or more technical problems, including some combination of distorted geometry, poor texturing, and inaccurate lighting, and therefore can make it difficult to reach the level of quality required in AR and VR applications. These technical problems can result in a less than desirable final user experience.
In at least one aspect, the present disclosure generally describes a method for re-rendering an image rendered using a volumetric reconstruction to improve its quality. The method includes receiving the image rendered using the volumetric reconstruction, the image having imperfections. The method further includes defining a synthesizing function and a segmentation mask to generate an enhanced image from the image, the enhanced image having fewer imperfections than the image. The method further includes computing the synthesizing function and the segmentation mask using a neural network trained based on minimizing a loss function between a predicted image generated by the neural network and a ground truth image captured by a ground truth camera during training. Accordingly, rendering can mean to generate a photorealistic or non-photorealistic image from a 3D model.
In one possible implementation, the method may be performed by a computing device based on the execution of program code by a processor, the program code contained on a non-transitory computer readable storage medium.
In another possible implementation of the method, the loss function includes one or more of a reconstruction loss, a mask loss, a head loss, a temporal loss, and a stereo loss.
In another possible implementation of the method, the imperfections include artifacts in the image such as holes, noise, poor lighting, color artifacts, and/or low resolution.
In another possible implementation of the method, the method further includes capturing a 3D model using a volumetric capture system and rendering the image using the volumetric reconstruction prior to receiving the image.
In another possible implementation of the method, the ground truth camera and the volumetric capture system are both directed to a view during training, the ground truth camera producing higher quality images than the volumetric capture system.
In another possible implementation of the method, the loss function includes a reconstruction loss based on a reconstruction difference between a segmented ground truth image mapped to activations of layers in a neural network and a segmented predicted image mapped to activations of layers in a neural network, the segmented ground truth image segmented by a ground truth segmentation mask to remove background pixels and the segmented predicted image segmented by a predicted segmentation mask to remove background pixels. Further, the reconstruction difference may be saliency re-weighted to down-weight reconstruction differences for pixels above a maximum error or below a minimum error.
In another possible implementation of the method, the loss function includes a head reconstruction loss based on a reconstruction difference between a cropped ground truth image mapped to activations of layers in a neural network and a cropped predicted image mapped to activations of layers in a neural network, the cropped ground truth image cropped to a head of a person identified in a ground truth segmentation mask and the cropped predicted image cropped to the head of the person identified in a predicted segmentation mask. Further, the reconstruction difference may be saliency re-weighted to down-weight reconstruction differences for pixels above a maximum error or below a minimum error.
In another possible implementation of the method, the loss function includes a mask loss based on a mask difference between a ground truth segmentation mask and a predicted segmentation mask. Further, the mask difference may be saliency re-weighted to down-weight reconstruction differences for pixels above a maximum error or below a minimum error.
In another possible implementation of the method, the predicted image is one of a series of consecutive frames of a predicted sequence and the ground truth image is one of a series of consecutive frames of a ground truth sequence. Further, the loss function includes a temporal loss based on a gradient difference between a temporal gradient of the predicted sequence and a temporal gradient of the ground truth sequence.
In another possible implementation of the method, the predicted image is one of a predicted stereo pair of images and the loss function includes a stereo loss based on a stereo difference between the predicted stereo pair of images.
In another possible implementation of the method, the neural network is based on a fully convolutional model.
In another possible implementation of the method, computing the synthesizing function and segmentation mask using a neural network includes computing the synthesizing function and segmentation mask for a left eye viewpoint, and computing the synthesizing function and segmentation mask for a right eye viewpoint.
In another possible implementation of the method, computing the synthesizing function and segmentation mask using a neural network is performed in real time.
In at least one other aspect, the present disclosure generally describes a performance capture system. The performance capture system includes a volumetric capture system that is configured to render at least one image reconstructed from at least one viewpoint of a captured 3D model, the at least one image including imperfections. The performance capture system further includes a rendering system that is configured to receive the at least one image from the volumetric capture system and to generate, e.g., in real time, at least one enhanced image in which the imperfections of the at least one image are reduced. The rendering system includes a neural network that is configured to generate the at least one enhanced image by training prior to use. The training includes minimizing a loss function between predicted images generated by the neural network during training and corresponding ground truth images captured by at least one ground truth camera coordinated with the volumetric capture system during training.
In one possible implementation of the performance capture system, the at least one ground truth camera is included in the performance capture system during training and otherwise not included in the performance capture system.
In another possible implementation of the performance capture system, the volumetric capture system includes a plurality of active stereo cameras directed to multiple views and, during training, includes a plurality of ground truth cameras directed to the multiple views.
In another possible implementation of the performance capture system, a stereo display is included and configured to display one of the at least one enhanced image as a left eye view and one of the at least one enhanced image as a right eye view. For example, the performance capture system may be a virtual reality (VR) headset.
Example embodiments will become more fully understood from the detailed description given herein below and the accompanying drawings, wherein like elements are represented by like reference numerals, which are given by way of illustration only and thus are not limiting of the example embodiments.
It should be noted that these Figures are intended to illustrate the general characteristics of methods, structure and/or materials utilized in certain example embodiments and to supplement the written description provided below. These drawings are not, however, to scale and may not precisely reflect the precise structural or performance characteristics of any given embodiment and should not be interpreted as defining or limiting the range of values or properties encompassed by example embodiments. For example, the relative thicknesses and positioning of layers, regions and/or structural elements may be reduced or exaggerated for clarity. The use of similar or identical reference numbers in the various drawings is intended to indicate the presence of a similar or identical element or feature.
While example embodiments may include various modifications and alternative forms, embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that there is no intent to limit example embodiments to the particular forms disclosed, but on the contrary, example embodiments are to cover all modifications, equivalents, and alternatives falling within the scope of the claims. Like numbers refer to like elements throughout the description of the figures.
A performance capture rig (i.e., performance capture system) may be used to capture a subject (e.g., person) and their movements in three dimensions (3D). The performance capture rig can include a volumetric capture system configured to capture data necessary to generate a 3D model and (in some cases) to render a 3D volumetric reconstruction (i.e., an image) using volumetric reconstruction of a view. A variety of volumetric capture systems can be implemented, including (but not limited to) active stereo cameras, time of flight (TOF) systems, lidar systems, passive stereo cameras and the like. Further, in some implementations a single volumetric capture system is utilized, while in others a plurality of volumetric capture systems may be used (e.g., in a coordinated capture).
The volumetric reconstruction may render a video stream of images (e.g., in real time) and may render separate images corresponding to a left-eye viewpoint and a right-eye viewpoint. The left-eye viewpoint and right-eye viewpoint 2D images may be displayed on a stereo display. The stereo display may be a fixed viewpoint stereo display (e.g., 3D movie) or a head-tracked stereo display. A variety of stereo displays may be implemented, including (but not limited to) augmented reality (AR) glasses displays, virtual reality (VR) headset displays, and auto-stereo displays (e.g., head-tracked auto-stereo displays).
Imperfections (i.e., artifacts) may exist in the rendered 2D image(s) and/or in their presentation on the stereo display. The artifacts may include graphic artifacts such as intensity noise, low resolution textures, and off colors. The artifacts may also include time artifacts such as flicker in a video stream. The artifacts may further include stereo artifacts such as inconsistent left/right views. The artifacts may be due to limitations/problems associated with the performance capture rig. For example, due to complexity or cost constraints the performance capture rig may be limited in the data collected. Additionally, the artifacts may be due to limitations associated with transferring data over a network (e.g., bandwidth). The disclosure describes systems and methods to reduce or eliminate the artifacts regardless of their source. Accordingly, the disclosed systems and methods are not limited to any particular performance capture system or stereo display.
In one possible implementation, technical problems associated with existing performance capture systems can result in the 3D volumetric reconstructed images containing holes, noise, low resolution textures, and color artifacts. These technical problems can result in a less than desirable user experience in VR and AR applications.
Technical solutions to the above-mentioned technical problems can implement machine learning to enhance volumetric videos in real time. Geometric non-rigid reconstruction pipelines can be combined with deep learning to produce higher quality images. The disclosed system can focus on visually salient regions (e.g., human faces), discarding non-relevant information, such as the background. The described solution can produce temporally stable renderings for implementation in VR and AR applications, where left and right views should be consistent for an optimal user experience.
The technical solutions can include real-time performance capture (i.e., image and/or video capture) to obtain approximate geometry and texture in real time. The final 2D rendered output of such systems can be low quality due to geometric artifacts, poor texturing, and inaccurate lighting. Therefore, example implementations can use deep learning to enhance the final rendering to achieve higher quality results in real-time. For example, a deep learning architecture that takes, as input, a deferred shading deep buffer and/or the final 2D rendered image from a single or multiview performance capture system, and learns to enhance such imagery in real-time, producing a final high-quality re-rendering (see
Described herein is a neural re-rendering technique. Technical advantages of using the neural re-rendering technique include learning to enhance low-quality output from performance capture systems in real-time, where images contain holes, noise, low resolution textures, and color artifacts. Some examples of low-quality images are shown in
Technical advantages of using the neural re-rendering technique also include a specialized loss function that can use semantic information to produce high quality results on faces. To reduce the effect of outliers, a saliency re-weighting scheme that focuses the loss on the most relevant regions can be used. The loss function is designed for VR and AR headsets, where the goal is to predict two consistent views of the same object. Technical advantages of using the neural re-rendering technique also include temporally stable re-rendering by enforcing consistency between consecutive reconstructed frames.
The encoder 120 can be configured to compress the 3D video captured by the first set of cameras. The encoder 120 can be configured to receive video data 5 and generate compressed video data 10 using a standard compression technique. The decoder 130 can be configured to receive compressed video data 10 and generate reconstructed video data 15 using the inverse of the standard compression technique. The dashed/dotted line shown in
The rendering module 140 is configured to generate a left eye view 20 and a right eye view 25 based on the reconstructed video data 15 (or the video data 5). The left eye view 20 can be an image for display on a left eye display of a head-mounted display (HMD). The right eye view 25 can be an image for display on a right eye display of a HMD. Rendering can include processing a scene (e.g., a 3D model) associated with the reconstructed video data 15 (or the video data 5) to generate a digital image. The 3D model can include, for example, shading information, lighting information, texture information, geometric information and the like. Rendering can include implementing a rendering algorithm by a graphical processing unit (GPU). Therefore, rendering can include passing the 3D model to the GPU.
The learning module 150 can be configured to train a neural network or model to generate a high-quality image based on a low-quality image. In an example implementation, an image is iteratively predicted based on the left eye view 20 (or the right eye view 25) using the neural network or model. Then each iteration of the predicted image is compared to a corresponding image selected from the ground truth image data 30 using a loss function until the loss function is minimized (or below a threshold value). The learning module 150 is described in more detail below.
The neural re-rendering module 210 is configured to generate a re-rendered left eye view 35 based on the left eye view 20 and to generate a re-rendered right eye view 40 based on the right eye view 25. The neural re-rendering module 210 is configured to use the neural network or model trained by the learning module 150 to generate the re-rendered left eye view 35 as a higher quality representation of the left eye view 20. The neural re-rendering module 210 is configured to use the neural network or model trained by the learning module 150 to generate the re-rendered right eye view 40 as a higher quality representation of the right eye view 25. The neural re-rendering module 210 is described in more detail below.
The capture system 100 shown in
As shown in
In step S310 at least one two-dimensional (2D) ground truth image is captured for each of the plurality of frames of the first 3D video using the at least one witness camera. For example, the at least one 2D ground truth image can be a high-quality image captured by the at least one witness camera. The at least one 2D ground truth image can be captured at substantially the same moment in time as a corresponding one of the plurality of frames of the first 3D video.
In step S315 at least one of the plurality of frames of the first 3D video is compressed. For example, the at least one of the plurality of frames of the first 3D video is compressed using a standard compression technique. In step S320 the at least one frame of the plurality of frames of the first 3D video is decompressed. For example, the at least one of the plurality of frames of the first 3D video is decompressed using a standard decompression technique corresponding to the standard compression technique.
In step S325 at least one first 2D left eye view image is rendered based on the decompressed frame and at least one first 2D right eye view image is rendered based on the decompressed frame. For example, a 3D model of a scene corresponding to a frame of the decompressed first 3D video (e.g., reconstructed video data 15) is communicated to a GPU. The GPU can generate digital images (e.g., left eye view 20 and right eye view 25) based on the 3D model of a scene and return the digital images as the first 2D left eye view and the first 2D right eye view.
In step S330 a model for a left eye view of a head-mounted display (HMD) is trained based on the rendered first 2D left eye view image and the corresponding 2D ground truth image, and a model for a right eye view of the HMD is trained based on the rendered first 2D right eye view image and the corresponding 2D ground truth image. For example, an image is iteratively predicted based on the first 2D left eye view using a neural network or model. Then each iteration of the predicted image is compared to the corresponding 2D ground truth image using a loss function until the loss function is minimized (or below a threshold value). In addition, an image is iteratively predicted based on the first 2D right eye view using a neural network or model. Then each iteration of the predicted image is compared to the corresponding 2D ground truth image using a loss function until the loss function is minimized (or below a threshold value).
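The iterate-until-threshold training of step S330 can be illustrated with a minimal sketch. The two-parameter gain/bias "model", the mean squared error loss, the learning rate, and the name `train_until_threshold` are assumptions made only for this illustration; they stand in for the neural network and loss function of the example implementations.

```python
import numpy as np

def train_until_threshold(rendered, ground_truth, lr=0.1,
                          threshold=1e-4, max_iters=10000):
    """Iteratively predict an image and stop once the loss is below a threshold."""
    gain, bias = 1.0, 0.0  # toy stand-in for the trainable model parameters
    loss = float("inf")
    for _ in range(max_iters):
        predicted = gain * rendered + bias          # predict from the rendered view
        error = predicted - ground_truth            # compare with the ground truth
        loss = float(np.mean(error ** 2))           # toy loss (mean squared error)
        if loss < threshold:
            break
        # Gradient descent step on the two toy parameters.
        gain -= lr * float(np.mean(2.0 * error * rendered))
        bias -= lr * float(np.mean(2.0 * error))
    return gain, bias, loss

# Synthetic "rendered left eye view" and a brighter "ground truth" image.
rendered_left = np.linspace(0.0, 1.0, 64).reshape(8, 8)
ground_truth = 1.5 * rendered_left + 0.1
gain, bias, final_loss = train_until_threshold(rendered_left, ground_truth)
```

The same loop could be run a second time with the right eye view as input to obtain the second model.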
As shown in
In step S340 the video data corresponding to the second 3D video is decompressed. For example, the second 3D video (e.g., compressed video data 10) is decompressed using a standard decompression technique corresponding to the standard compression technique used by the remote device.
In step S345 a frame of the second 3D video is selected. For example, a next frame of the decompressed second 3D video can be selected for display on a HMD playing back the second 3D video. Alternatively, or in addition to, playing back the second 3D video can utilize a buffer or queue of video frames. Therefore, selecting a frame of the second 3D video can include selecting a frame from the queue based on a buffering or queueing technique (e.g., FIFO, LIFO, and the like).
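As a minimal sketch of selecting a frame from a queue in step S345, the following uses a double-ended queue. The function name `select_next_frame` and the string frame identifiers are hypothetical and serve only to illustrate FIFO and LIFO selection.

```python
from collections import deque

def select_next_frame(queue, policy="fifo"):
    """Remove and return the next frame from the queue, or None if it is empty."""
    if not queue:
        return None
    if policy == "fifo":
        return queue.popleft()   # oldest buffered frame first
    if policy == "lifo":
        return queue.pop()       # most recently buffered frame first
    raise ValueError("unknown policy: " + policy)

frames = deque(["frame0", "frame1", "frame2"])
first = select_next_frame(frames, "fifo")
last = select_next_frame(frames, "lifo")
```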
In step S350 a second 2D left eye view image is rendered based on the selected frame and a second 2D right eye view image is rendered based on the selected frame. For example, a 3D model of a scene corresponding to a frame of the decompressed second 3D video (e.g., reconstructed video data 15) is communicated to a GPU. The GPU can generate digital images (e.g., left eye view 20 and right eye view 25) based on the 3D model of a scene and return the digital images as the second 2D left eye view and the second 2D right eye view.
In step S355 the second 2D left eye view image is re-rendered using a convolutional neural network architecture and the trained model for the left eye view of the HMD, and the second 2D right eye view image is re-rendered using the convolutional neural network architecture and the trained model for the right eye view of the HMD. For example, the neural network or model trained in phase 1 can be used to generate the re-rendered second 2D left eye view (e.g., re-rendered left eye view 35) as a higher quality representation of the second 2D left eye view (e.g., left eye view 20). The neural network or model trained in phase 1 can be used to generate the re-rendered second 2D right eye view (e.g., re-rendered right eye view 40) as a higher quality representation of the second 2D right eye view (e.g., right eye view 25). Then, in step S360, the re-rendered second 2D left eye view image and the re-rendered second 2D right eye view image are displayed on at least one display of the HMD.
As shown in
The at least one memory 410 may be configured to store data and/or information associated with the learning module 150. For example, the at least one memory 410 may be configured to store model(s) 420, a plurality of coefficients 425 and a plurality of loss functions 430. The at least one memory 410 further includes a metrics module 435 and an enumeration module 450. The metrics module 435 includes a plurality of error definitions 440 and an error calculator 445.
In an example implementation, the at least one memory 410 may be configured to store code segments that when executed by the at least one processor 405 cause the at least one processor 405 to select and communicate one or more of the plurality of coefficients 425. Further, the at least one memory 410 may be configured to store code segments that when executed by the at least one processor 405 cause the at least one processor 405 to receive information used by the learning module 150 to generate new coefficients 425 and/or update existing coefficients 425. The at least one memory 410 may be configured to store code segments that when executed by the at least one processor 405 cause the at least one processor 405 to receive information used by the learning module 150 to generate a new model 420 and/or update an existing model 420.
The model(s) 420 represent at least one neural network model. A neural network model can define the operations of a neural network, the flow of the operations and/or the interconnections between the operations. For example, the operations can include normalization, padding, convolutions, rounding and/or the like. The model can also define an operation. For example, a convolution can be defined by a number of filters C, a spatial extent (or filter size) K×K, and a stride S. A convolution does not have to be square. For example, the spatial extent can be K×L. In a convolutional neural network context (see
A convolutional neural network can have layers with differing numbers of neurons. The K×K spatial extent (or filter size) can include K columns and K (or L) rows. The K×K spatial extent can be 2×2, 3×3, 4×4, 5×5, (K×L) 2×4 and so forth. Convolution includes centering the K×K spatial extent on a pixel and convolving all of the pixels in the spatial extent and generating a new value for the pixel based on all (e.g., the sum of) the convolution of all of the pixels in the spatial extent. The spatial extent is then moved to a new pixel based on the stride and the convolution is repeated for the new pixel. The stride can be, for example, one (1) or two (2) where a stride of one moves to the next pixel and a stride of two skips a pixel.
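The sliding of a K×L spatial extent with a stride S can be sketched as follows. This is an illustration under simplifying assumptions: a single-channel image, no padding (so the output shrinks), and no kernel flip, as is common in convolutional neural networks. The name `conv2d` and the box filter are hypothetical.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide a K x L kernel over a 2D image and sum the element-wise products."""
    k_rows, k_cols = kernel.shape
    out_rows = (image.shape[0] - k_rows) // stride + 1
    out_cols = (image.shape[1] - k_cols) // stride + 1
    out = np.zeros((out_rows, out_cols), dtype=float)
    for r in range(out_rows):
        for c in range(out_cols):
            top, left = r * stride, c * stride  # the stride sets the step size
            window = image[top:top + k_rows, left:left + k_cols]
            out[r, c] = np.sum(window * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
box_2x4 = np.ones((2, 4)) / 8.0              # non-square (K x L) averaging filter
out_s1 = conv2d(image, box_2x4, stride=1)    # visits every position
out_s2 = conv2d(image, box_2x4, stride=2)    # skips every other position
```

With a stride of two, the output has roughly half as many rows and columns as with a stride of one, which is one way a layer can reduce spatial resolution.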
The coefficients 425 represent variable values that can be used in one or more of the model(s) 420 and/or the loss function(s) 430 for using and/or training a neural network. A unique combination of a model of the model(s) 420, coefficients 425 and loss function(s) 430 can define a neural network and how to train the unique neural network. For example, a model of the model(s) 420 can be defined to include two convolution operations and an interconnection between the two. The coefficients 425 can include a corresponding entry defining the spatial extent (e.g., 2×4, 2×2, and/or the like) and a stride (e.g., 1, 2, and/or the like) for each convolution. In addition, the loss function(s) 430 can include a corresponding entry defining a loss function to train the model and a threshold value (e.g., min, max, min change, max change, and/or the like) for the loss.
The metrics module 435 includes the plurality of error definitions 440 and the error calculator 445. Error definitions can include, for example, functions or algorithms used to calculate an error and a threshold value (e.g., min, max, min change, max change, and/or the like) for an error. The error calculator 445 can be configured to calculate an error between two images based on a pixel-by-pixel difference between the two images using the algorithm. Types of errors can include photometric error, peak signal-to-noise ratio (PSNR), structural similarity (SSIM), multiscale SSIM (MS-SSIM), mean squared error, perceptual error, and/or the like. The enumeration module 450 can be configured to iterate one or more of the coefficients 425.
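Two of the listed error types, mean squared error and PSNR, can be computed from a pixel-by-pixel difference as sketched below. The function names and the assumed 8-bit peak value of 255 are illustrative choices, not details taken from the error calculator 445.

```python
import numpy as np

def mean_squared_error(a, b):
    """Average squared pixel-by-pixel difference between two images."""
    diff = a.astype(np.float64) - b.astype(np.float64)  # avoid uint8 wrap-around
    return float(np.mean(diff ** 2))

def psnr(a, b, max_value=255.0):
    """Peak signal-to-noise ratio in decibels (infinite for identical images)."""
    mse = mean_squared_error(a, b)
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

reference = np.zeros((4, 4), dtype=np.uint8)
degraded = np.full((4, 4), 10, dtype=np.uint8)
mse_value = mean_squared_error(reference, degraded)
psnr_value = psnr(reference, degraded)
```

A threshold from the error definitions could then be compared against `mse_value` or `psnr_value` to decide whether the error is acceptable.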
In an example implementation, one of the coefficients is changed for a model of the model(s) 420 by the enumeration module 450 while holding the remainder of the coefficients constant. During each iteration (e.g., an iteration to train the left eye view), the processor 405 predicts an image using the model with the view (e.g., left eye view 20) as input and calculates the loss (possibly using the ground truth image data 30) until the loss function is minimized and/or a change in loss is minimized. Then the error calculator 445 calculates an error between the predicted image and the corresponding image of the ground truth image data 30. If the error is unacceptable (e.g., greater than a threshold value or greater than a threshold change compared to a previous iteration) another of the coefficients is changed by the enumeration module 450. In an example implementation, two or more loss functions can be optimized. In this implementation, the enumeration module 450 can be configured to select between the two or more loss functions.
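The one-coefficient-at-a-time enumeration can be sketched as a simple search loop. Everything named here is hypothetical: `enumerate_coefficients`, the coefficient names, and in particular `toy_error`, which stands in for training the model and measuring the error against the ground truth image data.

```python
def enumerate_coefficients(base, candidates, evaluate, max_error):
    """Change one coefficient at a time, holding the others, until the error is acceptable."""
    best = dict(base)
    best_error = evaluate(best)
    for name, values in candidates.items():
        for value in values:
            if best_error <= max_error:
                return best, best_error     # acceptable error: stop enumerating
            trial = dict(best)
            trial[name] = value             # change only this coefficient
            error = evaluate(trial)
            if error < best_error:
                best, best_error = trial, error
    return best, best_error

def toy_error(coeffs):
    # Stand-in for "train, predict, and compute the error"; minimum at (3, 1).
    return abs(coeffs["filter_size"] - 3) + abs(coeffs["stride"] - 1)

chosen, err = enumerate_coefficients(
    {"filter_size": 5, "stride": 2},
    {"filter_size": [2, 3, 4], "stride": [1, 2]},
    toy_error,
    max_error=0,
)
```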
According to an example implementation, given an image I (e.g., left eye view 20 and right eye view 25) rendered from a volumetric reconstruction (e.g., reconstructed video data 15), an enhanced version of I, denoted as $I_e$, can be generated or computed. The transformation function between I and $I_e$ should target VR and AR applications. Therefore, the following principles should be considered: a) the user typically focuses more on salient features, like faces, and artifacts in those areas should be highly penalized, b) when viewed in stereo, the outputs of the network have to be consistent between left and right pairs to prevent user discomfort, and c) in VR applications, the renderings are composited into the virtual world, requiring accurate segmentation masks. Further, enhanced images should be temporally consistent. A synthesis function F(I) can be defined to generate a predicted image $I_{pred}$ and a segmentation mask $M_{pred}$ that indicates foreground pixels, such that $I_e = I_{pred} \odot M_{pred}$, where $\odot$ is the element-wise product, so that background pixels in $I_e$ are set to zero.
At training time, a body part semantic segmentation algorithm can be used to generate $I_{seg}$, the semantic segmentation of the ground-truth image $I_{gt}$ captured by the witness camera, as illustrated in
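The masking of the predicted image by the predicted segmentation mask can be sketched in a few lines. The array shapes (an H×W×3 color image and an H×W mask) and the name `composite` are assumptions for illustration.

```python
import numpy as np

def composite(i_pred, m_pred):
    """Element-wise product of image and mask; the mask is broadcast over color channels."""
    return i_pred * m_pred[..., np.newaxis]

i_pred = np.full((2, 2, 3), 0.8)                 # predicted color image
m_pred = np.array([[1.0, 0.0], [0.0, 1.0]])      # 1 = foreground, 0 = background
i_e = composite(i_pred, m_pred)                  # background pixels become zero
```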
The training of a neural network that computes F(I) can include training a neural network to optimize the loss function:
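$$\mathcal{L} = w_1 \mathcal{L}_{rec} + w_2 \mathcal{L}_{mask} + w_3 \mathcal{L}_{head} + w_4 \mathcal{L}_{temporal} + w_5 \mathcal{L}_{stereo} \quad (1)$$
where the weights $w_i$ are empirically chosen such that all the losses can provide a similar contribution.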
Instead of using standard $\ell_2$ or $\ell_1$ losses in the image domain, the $\ell_1$ loss can be computed in the feature space of a 16-layer network (e.g., VGG16) trained on an image database (e.g., ImageNet). The loss can be computed as the $\ell_1$ distance of the activations of the conv1 through conv5 layers. This gives results comparable to using a generative adversarial network (GAN) loss, without the overhead of employing a GAN architecture during training. The reconstruction loss $\mathcal{L}_{rec}$ can be computed as:
$$\mathcal{L}_{rec} = \sum_{i=1}^{5} \left\| \mathrm{VGG}_i(M_{gt} \odot I_{gt}) - \mathrm{VGG}_i(M_{pred} \odot I_{pred}) \right\|_{*} \quad (2)$$
where $M_{gt} = (I_{seg} \neq \text{background})$ is a binary segmentation mask that turns off background pixels (see
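The structure of a feature-space reconstruction loss can be sketched as follows. To keep the example self-contained, the five VGG16 activation maps are replaced by hypothetical stand-in "layers" (average pooling at five scales), and the saliency re-weighted norm is replaced by a plain $\ell_1$ sum. The names `downsample`, `FEATURE_LAYERS`, and `reconstruction_loss` are assumptions; a real implementation would substitute pretrained network activations.

```python
import numpy as np

def downsample(x, factor):
    """Average-pool an H x W x C array by an integer factor (stand-in feature map)."""
    h, w, c = x.shape
    h2, w2 = h // factor, w // factor
    x = x[:h2 * factor, :w2 * factor]
    return x.reshape(h2, factor, w2, factor, c).mean(axis=(1, 3))

# Hypothetical stand-ins for the activations of five network layers.
FEATURE_LAYERS = [lambda x, f=f: downsample(x, f) for f in (1, 2, 4, 8, 16)]

def reconstruction_loss(i_gt, m_gt, i_pred, m_pred, layers=FEATURE_LAYERS):
    """Sum over layers of the L1 distance between features of the masked images."""
    seg_gt = i_gt * m_gt[..., np.newaxis]        # remove ground truth background
    seg_pred = i_pred * m_pred[..., np.newaxis]  # remove predicted background
    return sum(float(np.abs(layer(seg_gt) - layer(seg_pred)).sum())
               for layer in layers)

rng = np.random.default_rng(0)
i_gt = rng.random((16, 16, 3))
mask = np.ones((16, 16))
mask[:, :4] = 0.0                                # left columns are background
loss_same = reconstruction_loss(i_gt, mask, i_gt, mask)
loss_diff = reconstruction_loss(i_gt, mask, 0.5 * i_gt, mask)
```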
The mask loss $\mathcal{L}_{mask}$ can cause the model to predict an accurate foreground mask $M_{pred}$. This can be seen as a binary classification task. For foreground pixels the value $y^{+} = 1$ is assigned, whereas for background pixels $y^{-} = 0$ is used. The final loss can be defined as:
$$\mathcal{L}_{mask} = \left\| M_{gt} - M_{pred} \right\|_{*} \quad (3)$$
where $\|\cdot\|_{*}$ is the saliency re-weighted $\ell_1$ loss. Other classification losses such as a logistic loss can be considered. However, they can produce very similar results. An example of the mask loss is shown in
The head loss $\mathcal{L}_{head}$ can focus the neural network on the head to improve the overall sharpness of the face. Similar to the body loss, a 16-layer network (e.g., VGG16) can be used to compute the loss in the feature space. In particular, the crop $I^{C}$ can be defined for an image I as a patch cropped around the head pixels as given by the segmentation labels of $I_{seg}$ and resized to 512×512 pixels. The loss can be computed as:
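$$\mathcal{L}_{head} = \sum_{i=1}^{5} \left\| \mathrm{VGG}_i(M_{gt}^{C} \odot I_{gt}^{C}) - \mathrm{VGG}_i(M_{pred}^{C} \odot I_{pred}^{C}) \right\|_{*} \quad (4)$$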
An example of the head loss is shown in
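The head crop used by the head loss can be sketched as a bounding-box crop followed by a resize. The label value `HEAD_LABEL`, the nearest-neighbor resize, and the name `crop_head` are assumptions for illustration; the disclosure does not specify the label encoding or the interpolation method.

```python
import numpy as np

HEAD_LABEL = 1  # hypothetical label id for head pixels in the segmentation

def crop_head(image, seg, size=512, head_label=HEAD_LABEL):
    """Crop the bounding box of the head pixels and resize it to size x size."""
    rows, cols = np.nonzero(seg == head_label)
    if rows.size == 0:
        return None                               # no head visible in this view
    top, bottom = rows.min(), rows.max() + 1
    left, right = cols.min(), cols.max() + 1
    patch = image[top:bottom, left:right]
    # Nearest-neighbor resize: map each output index back to a source index.
    r_idx = np.arange(size) * patch.shape[0] // size
    c_idx = np.arange(size) * patch.shape[1] // size
    return patch[r_idx][:, c_idx]

image = np.arange(8 * 8 * 3).reshape(8, 8, 3)
seg = np.zeros((8, 8), dtype=int)
seg[2:4, 3:6] = HEAD_LABEL                        # a 2 x 3 head region
crop = crop_head(image, seg, size=4)              # small size keeps the demo readable
```

The same crop window could be applied to the ground truth image, the predicted image, and both masks before computing the feature-space loss.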
The temporal loss $\mathcal{L}_{temporal}$ can be used to minimize the amount of flickering between two consecutive frames. The temporal loss between a frame $I^{t}$ and $I^{t-1}$ can be used. However, minimizing the difference between $I^{t}$ and $I^{t-1}$ directly would produce temporally blurred results. Therefore, a loss that tries to match the temporal gradient of the predicted sequence, i.e., $I_{pred}^{t} - I_{pred}^{t-1}$, with the temporal gradient of the ground truth sequence, i.e., $I_{gt}^{t} - I_{gt}^{t-1}$, can be used. The loss can be computed as:
$$\mathcal{L}_{temporal} = \left\| (I_{pred}^{t} - I_{pred}^{t-1}) - (I_{gt}^{t} - I_{gt}^{t-1}) \right\|_{1} \quad (5)$$
An example of the computed temporal loss is shown in
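A minimal sketch of the temporal loss shows why matching temporal gradients penalizes flicker without penalizing a constant offset. The function name and the synthetic frames are assumptions for illustration.

```python
import numpy as np

def temporal_loss(pred_t, pred_prev, gt_t, gt_prev):
    """L1 distance between the predicted and ground truth temporal gradients."""
    return float(np.abs((pred_t - pred_prev) - (gt_t - gt_prev)).sum())

gt_prev = np.zeros((4, 4))
gt_t = np.full((4, 4), 0.5)

# A prediction with a constant brightness offset follows the same temporal change.
steady = temporal_loss(gt_t + 0.2, gt_prev + 0.2, gt_t, gt_prev)

# A prediction whose offset flips sign between frames flickers.
flicker = temporal_loss(gt_t + 0.2, gt_prev - 0.2, gt_t, gt_prev)
```

Here `steady` is (numerically) zero while `flicker` grows with the frame-to-frame inconsistency, so the constant offset is left to the other loss terms.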
The stereo loss $\mathcal{L}_{stereo}$ can be designed for VR and AR applications, where the neural network is applied to the left and right eye views. In this case, inconsistencies between both eyes may limit depth perception and result in discomfort for the user. Therefore, a loss that ensures self-supervised consistency in the output stereo images can be used. A stereo pair of the volumetric reconstruction can be rendered and each eye's image can be used as input to the neural network, where the left image $I^{L}$ matches the ground-truth camera viewpoint and the right image $I^{R}$ is rendered at an offset distance (e.g., 65 mm) along the x-coordinate. The right prediction $I_{pred}^{R}$ is then warped to the left viewpoint using the (known) geometry of the mesh and compared to the left prediction $I_{pred}^{L}$. A warp operator $I_{warp}$ can be defined using a Spatial Transformer Network (STN), which uses a bi-linear interpolation of 4 pixels and fixed warp coordinates. The loss can be computed as:
$$\mathcal{L}_{stereo} = \left\| I_{pred}^{L} - I_{warp}(I_{pred}^{R}) \right\|_{1} \quad (6)$$
An example of the stereo loss is shown in
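The warp-and-compare step can be sketched with a bilinear sampler over fixed warp coordinates. This is an illustration under assumptions: single-channel images, a constant two-pixel disparity in place of coordinates derived from the mesh geometry, and an optional `valid` mask for pixels whose source falls outside the right image. The names `bilinear_warp` and `stereo_loss` are hypothetical.

```python
import numpy as np

def bilinear_warp(image, xs, ys):
    """Sample a 2D image at float coordinates by interpolating the 4 nearest pixels."""
    h, w = image.shape
    xs = np.clip(xs, 0, w - 1)
    ys = np.clip(ys, 0, h - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx, fy = xs - x0, ys - y0
    top = image[y0, x0] * (1 - fx) + image[y0, x1] * fx
    bottom = image[y1, x0] * (1 - fx) + image[y1, x1] * fx
    return top * (1 - fy) + bottom * fy

def stereo_loss(pred_left, pred_right, xs, ys, valid=None):
    """L1 distance between the left prediction and the warped right prediction."""
    diff = np.abs(pred_left - bilinear_warp(pred_right, xs, ys))
    if valid is not None:
        diff = diff * valid
    return float(diff.sum())

h, w = 4, 8
grid_y, grid_x = np.mgrid[0:h, 0:w].astype(float)
pred_left = grid_x ** 2 + grid_y
pred_right = np.roll(pred_left, -2, axis=1)   # right view: content shifted 2 px
xs, ys = grid_x - 2.0, grid_y                 # fixed warp coordinates (toy disparity)
valid = (grid_x >= 2).astype(float)           # columns with a source in the right image
loss_consistent = stereo_loss(pred_left, pred_right, xs, ys, valid)
loss_inconsistent = stereo_loss(pred_left, pred_right + 1.0, xs, ys, valid)
```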
The above losses receive a contribution from every pixel in the image (with the exception of the masked pixels). However, imperfections in the segmentation mask, may bias the network towards unimportant areas. Pixels with the highest loss can be outliers (e.g., next to the boundary of the segmentation mask). These outlier pixels can dominate the overall loss (see
where $\Gamma(i, y)$ extracts the i-th percentile across the set of values in $y$, and $p_{min}$, $p_{max}$, and $\alpha_i$ are empirically chosen and depend on the task at hand.
This saliency can be applied as a weight on each pixel of the residual $y$ computed for $\mathcal{L}_{rec}$ and $\mathcal{L}_{head}$, and can be defined as:
$$\|y\|_{*} = \|\gamma(y) \odot y\|_1 \quad (8)$$
where ⊙ is the element-wise product.
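The re-weighted norm of equation (8) can be sketched as follows. This is only an illustration under stated assumptions: the saliency weight is taken to be a binary indicator that keeps pixels whose residual lies between the $p_{min}$ and $p_{max}$ percentiles, and the per-pixel residual is taken as the sum of absolute channel differences. Neither choice is confirmed by this passage, and the function names are hypothetical.

```python
import numpy as np

def saliency_weights(y, p_min, p_max):
    """Assumed binary weight: 1 inside the [p_min, p_max] percentile band, 0 outside."""
    low = np.percentile(y, p_min)
    high = np.percentile(y, p_max)
    return ((y >= low) & (y <= high)).astype(np.float64)

def reweighted_l1(residual, p_min=50.0, p_max=98.0):
    """Compute ||gamma(y) * y||_1 for a residual image of shape (H, W, C)."""
    y = np.abs(residual).sum(axis=-1)  # assumed per-pixel residual magnitude
    gamma = saliency_weights(y, p_min, p_max)
    # gamma is treated as a constant; no gradient would flow through it.
    return float((gamma * y).sum())

# Toy residual with magnitudes 1..100 (hypothetical data).
residual = np.arange(1, 101, dtype=np.float64).reshape(10, 10, 1)
plain_l1 = float(np.abs(residual).sum())
salient_l1 = reweighted_l1(residual, p_min=50.0, p_max=98.0)
kept_pixels = int(saliency_weights(residual[..., 0], 50.0, 98.0).sum())
```

With these toy values the two largest residuals and the lower half of the residuals are dropped, so the loss concentrates on hard pixels that are not extreme outliers.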
A continuous formulation of γp (y) defined by the product of a sigmoid and an inverted sigmoid can also be used. Gradients with respect to the re-weighing function are not computed. Therefore, the re-weighing function does not need to be continuous for SGD to work. The effect of saliency reweighing is shown in
As shown in
The at least one memory 510 may be configured to store data and/or information associated with the neural re-rendering module 210. For example, the at least one memory 510 may be configured to store model(s) 420, a plurality of coefficients 425, and a neural network 520. In an example implementation, the at least one memory 510 may be configured to store code segments that when executed by the at least one processor 505 cause the at least one processor 505 to select one of the models 420 and/or one or more of the plurality of coefficients 425.
The neural network 520 can include a plurality of operations (e.g., convolution 530-1 to 530-9). The plurality of operations, interconnections and the data flow between the plurality of operations can be a model selected from the model(s) 420. The model (as operations, interconnects and data flow) illustrated in the neural network is an example implementation. Therefore, other models can be used to enhance images as described herein.
In the example implementation shown in
The super-resolution 550 can include upscaling the resultant image (e.g., ×2, ×4, ×6, and the like) and applying a neural network as a filter to the upscaled image to generate a high-quality image from the relatively lower quality upscaled image. In an example implementation, the filter applied to each pixel is selected from a plurality of trained filters.
In the example implementation shown in
As is shown, the neural network 520 architecture includes 18 layers. Nine (9) layers are used for encoding/compressing/contracting/downsampling and nine (9) layers are used for decoding/decompressing/expanding/upsampling. For example, convolutions 530-1, 530-2, 530-3, 530-4, 530-5, 530-6, 530-7, 530-8 and 530-9 are used for encoding and convolutions 540-1, 540-2, 540-3, 540-4, 540-5, 540-6, 540-7, 540-8 and 540-9 are used for decoding. Convolution 535 can be used as a bottleneck. A bottleneck can be a 1×1 convolution layer configured to decrease the number of input channels for K×K filters. The neural network 520 architecture can include skip connections between the encoder and decoder blocks. For example, skip connections are shown between convolution 530-1 and convolution 540-9, convolution 530-3 and convolution 540-7, convolution 530-5 and convolution 540-5, and convolution 530-7 and convolution 540-3.
In the example implementation, the encoder begins with convolution 530-1, configured as a 3×3 convolution with $N_{init}$ filters, followed by a sequence of downsampling blocks including convolutions 530-2, 530-3, 530-4, and 530-5. The downsampling blocks (convolutions 530-2, 530-3, 530-4, 530-5, 530-6, and 530-7), indexed by $i \in \{1, 2, 3, 4\}$, can each include two convolutional layers, each with $N_i$ filters. The first layer (530-2, 530-4, and 530-6) can have a filter size of 4×4, stride 2 and padding 1, whereas the second layer (530-3, 530-5, and 530-7) can have a filter size of 3×3 and stride 1. Thus, each downsampling block can reduce the size of the input by a factor of 2 due to the strided convolution. Finally, two dimensionality-preserving convolutions, 530-8 and 530-9, are performed. The outputs of the convolutions can pass through a ReLU activation function. In an example implementation, $N_{init} = 32$ and $N_i = G^{i} \cdot N_{init}$, where $G$ is the filter size growth factor after each downsampling block.
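The filter counts and spatial sizes implied by this encoder description can be worked through with a short sketch. The growth factor value `growth=2` and the 512-pixel input size are assumptions chosen only for illustration (no value of G is stated here), and the 3×3 layers are assumed to preserve spatial size.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard output-size formula for a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_plan(input_size, n_init=32, growth=2, num_blocks=4):
    """List (block index, filters N_i, spatial size) for each downsampling block."""
    plan = []
    size = input_size
    for i in range(1, num_blocks + 1):
        # First layer of the block: 4x4 kernel, stride 2, padding 1 halves the size.
        size = conv_output_size(size, kernel=4, stride=2, padding=1)
        # N_i = G^i * N_init filters for both layers of block i.
        plan.append((i, n_init * growth ** i, size))
    return plan

plan = encoder_plan(512)
for block, filters, size in plan:
    print(f"block {block}: {filters} filters, {size}x{size} feature map")
```

The 4×4 kernel with stride 2 and padding 1 maps an even input size exactly to half its value, which is why each block reduces the resolution by a factor of 2.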
The decoder includes upsampling blocks 540-3, 540-4, 540-5, 540-6, 540-7, 540-8 and 540-9 that mirror the downsampling blocks but in reverse. Each such block i ∈ {4, 3, 2, 1} consists of two convolutional layers. The first layer 540-3, 540-5, and 540-7 bilinearly upsamples its input, performs a convolution with Ni filters, and leverages a skip connection to concatenate the output with that of its mirrored encoding layer. The second layer 540-4, 540-6 and 540-8 performs a convolution using 2Ni filters of size 3×3. The final network output is produced by a final convolution 540-9 with 4 filters, whose output is passed through a ReLU activation function to produce the reconstructed image and a single channel binary mask of the foreground subject. To produce stereo images for VR and AR headsets, both left and right views are enhanced using the same neural network (with shared weights). The final output is an improved stereo output pair. Data (e.g., filter size, stride, weights, Ninit, Ni, Gi and/or the like) associated with neural network 520 can be stored in model(s) 420 and coefficients 425.
Returning to
Random crops of images were used for training, ranging from 512×512 to 960×896. These images can be crops from the original resolution of the input and output pairs. In particular, the random crop can contain the head pixels in 75% of the samples, for which the head loss is computed. Otherwise, the head loss may be disabled because the network might not see the head completely in the input patch. This can result in high quality results for the face, while not ignoring other parts of the body. Using random crops along with standard ℓ2 regularization on the weights of the network may be sufficient to prevent over-fitting. When high resolution witness cameras are employed, the output can be twice the input size.
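One way such a cropping policy could be sketched is shown below. Several details are assumptions made for illustration: crop sizes are read as height × width, the head region is given as a hypothetical bounding box `(top, left, bottom, right)`, and rejection sampling is used to obtain head-containing crops for roughly 75% of the samples. The source does not specify how the 75% ratio is achieved.

```python
import numpy as np

def sample_crop(img_h, img_w, rng, min_hw=(512, 512), max_hw=(960, 896)):
    """Draw a random crop (top, left, height, width); image must be >= min_hw."""
    crop_h = int(rng.integers(min_hw[0], min(max_hw[0], img_h) + 1))
    crop_w = int(rng.integers(min_hw[1], min(max_hw[1], img_w) + 1))
    top = int(rng.integers(0, img_h - crop_h + 1))
    left = int(rng.integers(0, img_w - crop_w + 1))
    return top, left, crop_h, crop_w

def contains_box(crop, box):
    """True if the crop fully contains the box (top, left, bottom, right)."""
    top, left, crop_h, crop_w = crop
    b_top, b_left, b_bottom, b_right = box
    return (top <= b_top and left <= b_left
            and b_bottom <= top + crop_h and b_right <= left + crop_w)

def sample_training_crop(img_h, img_w, head_box, rng, head_fraction=0.75, max_tries=1000):
    """Return a crop and a flag saying whether the head loss should be enabled."""
    want_head = rng.random() < head_fraction
    crop = sample_crop(img_h, img_w, rng)
    if want_head:
        # Resample until the head is fully visible (or give up after max_tries).
        for _ in range(max_tries):
            if contains_box(crop, head_box):
                break
            crop = sample_crop(img_h, img_w, rng)
    use_head_loss = contains_box(crop, head_box)
    return crop, use_head_loss

rng = np.random.default_rng(0)
head_box = (300, 700, 500, 900)  # hypothetical head location in a 1200x1600 image
samples = [sample_training_crop(1200, 1600, head_box, rng) for _ in range(200)]
head_ratio = sum(flag for _, flag in samples) / len(samples)
```

The head loss flag is derived from the final crop rather than from the sampling decision, so the head loss is only enabled when the head is actually fully visible in the patch.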
The percentile ranges for the saliency re-weighing can be empirically set to remove the contribution of the imperfect mask boundary and other outliers without otherwise affecting the result. When $p_{max} = 98$, $p_{min}$ values in the range [25, 75] can be acceptable. In particular, $p_{min} = 50$ for the reconstruction loss, $p_{min} = 25$ for the head loss, and $\alpha_1 = \alpha_2 = 1.1$ may be set.
The system was evaluated on two different datasets: one for single camera (upper body reconstruction) and one for multiview, full body capture. The single camera dataset includes 42 participants, of which 32 are used for training. For each participant, four 10-second sequences were captured, in which they a) dictate a short text, with and without eyeglasses, b) look in all directions, and c) gesticulate extremely.
For the full body capture data, a diverse set of 20 participants was recorded. Each performer was free to perform any arbitrary movement in the capture space (e.g., walking, jogging, dancing, etc.) while simultaneously performing facial movements and expressions.
For each subject, 10 sequences of 500 frames were recorded. Five (5) subjects were left out of the training datasets to assess the performance of the algorithm on unseen people. Moreover, for some participants in the training set, 1 sequence (i.e., 500 or 600 frames) was left out for testing purposes.
A core component of the framework is a volumetric capture system that can generate approximate textured geometry and render the result from any arbitrary viewpoint in real-time. For upper bodies, a high-quality implementation of a standard rigid-fusion pipeline was used. For full bodies, a non-rigid fusion setup was used in which multiple cameras provide full 360° coverage of the performer.
Upper Body Capture (Single View). The upper body capture setting uses a single 1500×1100 active stereo camera paired with a 1600×1200 RGB view. To generate high quality geometry, a method that extends PatchMatch Stereo to spacetime matching and produces depth images at 60 Hz was used. Meshes were computed by applying volumetric fusion, and the mesh was texture mapped with the color image, as shown in
In the upper body capture scenario, a single camera, of the same resolution as the capture camera, was mounted at a 25° angle to the side of where the subject is looking. See
In the full body capture rig, 8 high resolution (4096×2048) witness cameras were mounted (see
The performance of the system was tested by analyzing the importance of each component. A first analysis can be qualitative, seeking to assess viewpoint robustness and generalization to different people, sequences and clothing. A second analysis can be a quantitative evaluation of the architectures. Multiple perceptual measurements, such as PSNR, Multi-Scale SSIM, photometric error (e.g., ℓ1 loss), and perceptual loss, were used. The experimental evaluation supports each design choice of the system and also shows the trade-offs between quality and model complexity.
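Two of the listed measurements can be sketched directly. The snippet below assumes images are floating point arrays scaled to [0, 1] and that the photometric error is averaged per element; both are assumptions for illustration, and the function names are hypothetical. PSNR follows its standard definition, 10·log10(MAX² / MSE).

```python
import numpy as np

def photometric_l1(pred, target):
    """Mean absolute error between prediction and target (assumed averaging)."""
    return float(np.mean(np.abs(pred - target)))

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with the given peak value."""
    mse = float(np.mean((pred - target) ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Toy images: the prediction is off by 0.1 everywhere (hypothetical data).
target = np.zeros((8, 8, 3))
pred = target + 0.1
l1_value = photometric_l1(pred, target)
psnr_value = psnr(pred, target)
```

A uniform error of 0.1 on a unit-range image corresponds to a mean squared error of 0.01, so the PSNR in this toy case is approximately 20 dB.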
Qualitative results were determined for different test sequences and under different conditions.
Upper Body Results (Single View). In the single camera case, the network mostly has to learn to in-paint missing areas and fix missing fine geometry details such as eyeglass frames. Some results are shown in
Full Body Results (Multi View). The multi view case carries the additional complexity of blending together different images that may have different lighting conditions or have small calibration imprecisions. This affects the final rendering results as shown in
Although the ground truth viewpoints are limited to a sparse set of cameras, the system can be shown to be robust to unseen camera poses. Viewpoint robustness can be demonstrated by simulating a camera trajectory around the subject. Results are shown in
Generalization across different subjects (e.g., people, clothing) is shown in
The behavior of the system was assessed with different clothes or accessories. Examples shown in
The main quantitative results are summarized in Table 1, where multiple statistics were calculated for the proposed model and all its variants. Table 1 reports quantitative evaluations on test sequences of subjects seen in training and subjects unseen in training. Photometric error is measured as the ℓ1-norm, and perceptual is the same loss based on VGG16 used for training. The architecture was fixed, and the proposed loss function was compared with the same loss minus a specific loss term indicated in each column. On seen subjects all the models perform similarly, whereas on new subjects the proposed loss has better generalization performance. Notice how the output of the volumetric reconstruction, i.e., the input to the network, is outperformed by all variants of the neural network.
The following summarizes the findings. The segmentation mask plays an important role in in-painting missing parts, discarding the background and preserving input regions. As shown in
Stable results across multiple viewpoints have already been shown in
The importance of the model size was assessed. Three different network models were trained, starting with Ninit=16, 32, 64 filters respectively. In
Real-Time Free Viewpoint Neural Re-Rendering
A real-time demonstration of the system was implemented as shown in
The run-time of the system was assessed using a single NVIDIA Titan V. The model with $N_{init} = 32$ filters was implemented, where input and output are generated at the same resolution (512×1024). Using the standard TensorFlow graph export tool, the average running time to produce a stereo pair with neural re-rendering is around 92 ms, which may not be sufficient for real-time applications. Therefore, NVIDIA TensorRT, which performs inference optimization for a given deep architecture, was used. A standard export with 32-bit floating point weights brought the computation time down to 47 ms. Finally, the optimizations implemented on the NVIDIA Titan V were used, and the network weights were quantized to 16-bit floating point. This resulted in a final run-time of 29 ms per stereo pair, with no loss in accuracy, meeting the real-time requirements.
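The reported timings can be converted to throughput and relative speedup with simple arithmetic. The sketch below uses only the three run-times stated above; the dictionary labels are shorthand introduced here for illustration.

```python
# Reported average time (ms) to produce one stereo pair in each configuration.
timings_ms = {
    "TensorFlow graph export": 92.0,
    "TensorRT, 32-bit float": 47.0,
    "TensorRT, 16-bit float": 29.0,
}

def stereo_pairs_per_second(time_ms):
    """Convert a per-pair latency in milliseconds to stereo pairs per second."""
    return 1000.0 / time_ms

rates = {name: stereo_pairs_per_second(ms) for name, ms in timings_ms.items()}
baseline_ms = timings_ms["TensorFlow graph export"]
speedups = {name: baseline_ms / ms for name, ms in timings_ms.items()}

for name in timings_ms:
    print(f"{name}: {rates[name]:.1f} pairs/s, {speedups[name]:.2f}x vs. baseline")
```

The 16-bit configuration is therefore roughly three times faster than the initial export, moving from about 11 to about 34 stereo pairs per second.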
Each block of the network was profiled to determine potential bottlenecks. The analysis is shown in
A small qualitative user study was performed on the output of the system. Ten (10) subjects were recruited and 12 short video sequences were prepared showing the renderings of the capture system, the predicted results, and the target witness views masked with the semantic segmentation as described above. The order of the videos was randomized, and sequences were selected that included both seen subjects and unseen subjects.
The participants were asked whether they preferred the renders of the performance capture system (e.g., the input to the enhancement algorithm), the re-rendered versions using neural re-rendering, or the masked ground truth image (e.g., $M_{gt} \odot I_{gt}$). A vast majority (most if not all) of the users agreed that the output of the neural re-rendering was better than the renderings from the volumetric capture system. Also, the users did not seem to notice substantial differences between seen and unseen subjects. Unexpectedly, most (greater than 50%) of the subjects preferred the output of the system even compared to the ground truth. The participants found the masks predicted by the network to be more stable than the ground truth masks used for training, which suffer from more inconsistent predictions between consecutive frames. However, a vast majority (most if not all) of the subjects agreed that the ground truth is still sharper, indicating a higher resolution than the neural re-rendering output, and that more must be done in this direction to improve the overall quality.
Where neural networks are to be scaled up to work on inputs with a relatively high number of dimensions, it can become computationally complex for all neurons 620 in each layer 605, 610, 615 to be networked to all neurons 620 in the one or more neighboring layers 605, 610, 615. An initial sparsity condition can be used to lower the computational complexity of the neural network, for example when the neural network is functioning as an optimization process, by limiting the number of connections between neurons and/or layers, thus enabling a neural network approach to work with high dimensional data such as images.
An example of a neural network is shown in
Alternatively, in some embodiments neural networks can be used that are fully connected, or not fully connected but in specific configurations different from that described in relation to
Further, in some embodiments, convolutional neural networks are used, which are neural networks that are not fully connected and therefore have less complexity than fully connected neural networks. Convolutional neural networks can also make use of pooling or max-pooling to reduce the dimensionality (and hence complexity) of the data that flows through the neural network and thus this can reduce the level of computation required.
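The dimensionality reduction provided by max-pooling can be illustrated with a small NumPy sketch. The 2×2 window with stride 2 is an assumption chosen for illustration, since no pooling configuration is specified here, and the function name is hypothetical.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max-pooling with stride 2 on a single-channel (H, W) feature map."""
    h, w = feature_map.shape
    # Drop a trailing row or column if the size is odd.
    trimmed = feature_map[: h - h % 2, : w - w % 2]
    # Group pixels into 2x2 windows and take the maximum of each window.
    windows = trimmed.reshape(h // 2, 2, w // 2, 2)
    return windows.max(axis=(1, 3))

feature_map = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(feature_map)
```

Each pooling step of this kind reduces the number of values passed to the next layer by a factor of four, which is the source of the reduced computation noted above.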
Computing device 2400 includes a processor 2402, memory 2404, a storage device 2406, a high-speed interface 2408 connecting to memory 2404 and high-speed expansion ports 2410, and a low speed interface 2412 connecting to low speed bus 2414 and storage device 2406. Each of the components 2402, 2404, 2406, 2408, 2410, and 2412, are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 2402 can process instructions for execution within the computing device 2400, including instructions stored in the memory 2404 or on the storage device 2406 to display graphical information for a GUI on an external input/output device, such as display 2416 coupled to high speed interface 2408. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 2400 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
The memory 2404 stores information within the computing device 2400. In one implementation, the memory 2404 is a volatile memory unit or units. In another implementation, the memory 2404 is a non-volatile memory unit or units. The memory 2404 may also be another form of computer-readable medium, such as a magnetic or optical disk.
The storage device 2406 is capable of providing mass storage for the computing device 2400. In one implementation, the storage device 2406 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid-state memory device, or an array of devices, including devices in a storage area network or other configurations. A computer program product can be tangibly embodied in an information carrier. The computer program product may also contain instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 2404, the storage device 2406, or memory on processor 2402.
The high-speed controller 2408 manages bandwidth-intensive operations for the computing device 2400, while the low speed controller 2412 manages lower bandwidth-intensive operations. Such allocation of functions is exemplary only. In one implementation, the high-speed controller 2408 is coupled to memory 2404, display 2416 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 2410, which may accept various expansion cards (not shown). In the implementation, low-speed controller 2412 is coupled to storage device 2406 and low-speed expansion port 2414. The low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet) may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
The computing device 2400 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 2420, or multiple times in a group of such servers. It may also be implemented as part of a rack server system 2424. In addition, it may be implemented in a personal computer such as a laptop computer 2422. Alternatively, components from computing device 2400 may be combined with other components in a mobile device (not shown), such as device 2450. Each of such devices may contain one or more of computing device 2400, 2450, and an entire system may be made up of multiple computing devices 2400, 2450 communicating with each other.
Computing device 2450 includes a processor 2452, memory 2464, an input/output device such as a display 2454, a communication interface 2466, and a transceiver 2468, among other components. The device 2450 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage. Each of the components 2450, 2452, 2464, 2454, 2466, and 2468, are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
The processor 2452 can execute instructions within the computing device 2450, including instructions stored in the memory 2464. The processor may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor may provide, for example, for coordination of the other components of the device 2450, such as control of user interfaces, applications run by device 2450, and wireless communication by device 2450.
Processor 2452 may communicate with a user through control interface 2458 and display interface 2456 coupled to a display 2454. The display 2454 may be, for example, a TFT LCD (Thin-Film-Transistor Liquid Crystal Display) or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 2456 may comprise appropriate circuitry for driving the display 2454 to present graphical and other information to a user. The control interface 2458 may receive commands from a user and convert them for submission to the processor 2452. In addition, an external interface 2462 may be provided in communication with processor 2452, to enable near area communication of device 2450 with other devices. External interface 2462 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.
The memory 2464 stores information within the computing device 2450. The memory 2464 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. Expansion memory 2474 may also be provided and connected to device 2450 through expansion interface 2472, which may include, for example, a SIMM (Single In Line Memory Module) card interface. Such expansion memory 2474 may provide extra storage space for device 2450 or may also store applications or other information for device 2450. Specifically, expansion memory 2474 may include instructions to carry out or supplement the processes described above and may include secure information also. Thus, for example, expansion memory 2474 may be provided as a security module for device 2450 and may be programmed with instructions that permit secure use of device 2450. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
The memory may include, for example, flash memory and/or NVRAM memory, as discussed below. In one implementation, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 2464, expansion memory 2474, or memory on processor 2452, that may be received, for example, over transceiver 2468 or external interface 2462.
Device 2450 may communicate wirelessly through communication interface 2466, which may include digital signal processing circuitry where necessary. Communication interface 2466 may provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through radio-frequency transceiver 2468. In addition, short-range communication may occur, such as using a Bluetooth, Wi-Fi, or other such transceiver (not shown). In addition, GPS (Global Positioning System) receiver module 2470 may provide additional navigation- and location-related wireless data to device 2450, which may be used as appropriate by applications running on device 2450.
Device 2450 may also communicate audibly using audio codec 2460, which may receive spoken information from a user and convert it to usable digital information. Audio codec 2460 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 2450. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on device 2450.
The computing device 2450 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 2480. It may also be implemented as part of a smart phone 2482, personal digital assistant, or other similar mobile device.
Although the above description describes experiencing traditional three-dimensional (3D) content including accessing a head-mounted display (HMD) device to properly view and interact with such content, the described techniques can also be used for rendering to 2D displays (e.g., a left view and/or right view displayed on one or more 2D displays), mobile AR, and 3D TVs. Further, HMD devices can be cumbersome for a user to wear continually. Accordingly, the user may utilize autostereoscopic displays to access user experiences with 3D perception without requiring the use of the HMD device (e.g., eyewear or headgear). The autostereoscopic displays employ optical components to achieve a 3D effect for a variety of different images on the same plane and provide such images from a number of points of view to produce the illusion of 3D space.
Autostereoscopic displays can provide imagery that approximates the three-dimensional (3D) optical characteristics of physical objects in the real world without requiring the use of a head-mounted display (HMD) device. In general, autostereoscopic displays include flat panel displays, lenticular lenses (e.g., microlens arrays), and/or parallax barriers to redirect images to a number of different viewing regions associated with the display.
In some example autostereoscopic displays, there may be a single location that provides a 3D view of image content provided by such displays. A user may be seated in the single location to experience proper parallax, little distortion, and realistic 3D images. If the user moves to a different physical location (or changes a head position or eye gaze position), the image content may begin to appear less realistic, 2D, and/or distorted. The systems and methods described herein may reconfigure the image content projected from the display to ensure that the user can move around, but still experience proper parallax, low rates of distortion, and realistic 3D images in real time. Thus, the systems and methods described herein provide the advantage of maintaining and providing 3D image content to a user regardless of user movement that occurs while the user is viewing the display.
A mask may be calculated and generated for each of a left and right eye. The masks 2500 may be different for each eye. For example, a mask 2500A may be calculated for the left eye while a mask 2500B may be calculated for the right eye. In some implementations, the mask 2500A may be a shifted version of the mask 2500B. Consistent with implementations described herein, the autostereoscopic display assembly 2502 may be a glasses-free, lenticular, three-dimensional display that includes a plurality of microlenses. In some implementations, an array 2506 may include microlenses in a microlens array. In some implementations, 3D imagery can be produced by projecting a portion (e.g., a first set of pixels) of a first image in a first direction through at least one microlens (e.g., to a left eye of a user) and projecting a portion (e.g., a second set of pixels) of a second image in a second direction through at least one other microlens (e.g., to a right eye of the user). The second image may be similar to the first image, but the second image may be shifted from the first image to simulate parallax, thereby simulating a 3D stereoscopic image for the user viewing the autostereoscopic display assembly 2502.
Each of the persons 2602 and 2604 can have a corresponding 3D pod. Here, the person 2602 has a pod 2606 and the person 2604 has a pod 2608. The pods 2606 and 2608 can provide functionality relating to 3D content, including, but not limited to: capturing images for 3D display, processing and presenting image information, and processing and presenting audio information. The pod 2606 and/or 2608 can constitute a processor and a collection of sensing devices integrated as one unit.
The 3D content system 2600 can include one or more 3D displays. Here, a 3D display 2610 is provided for the pod 2606, and a 3D display 2612 is provided for the pod 2608. The 3D display 2610 and/or 2612 can use any of multiple types of 3D display technology to provide a stereoscopic view for the respective viewer (here, the person 2602 or 2604, for example). In some implementations, the 3D display 2610 and/or 2612 can include a standalone unit (e.g., self-supported or suspended on a wall). In some implementations, the 3D display 2610 and/or 2612 can include wearable technology (e.g., a head-mounted display). In some implementations, the 3D display 2610 and/or 2612 can include an autostereoscopic display assembly such as autostereoscopic display assembly 2502 described above.
The 3D content system 2600 can be connected to one or more networks. Here, a network 2614 is connected to the pod 2606 and to the pod 2608. The network 2614 can be a publicly available network (e.g., the internet), or a private network, to name just two examples.
The network 2614 can be wired, or wireless, or a combination of the two. The network 2614 can include, or make use of, one or more other devices or systems, including, but not limited to, one or more servers (not shown).
The pod 2606 and/or 2608 can include multiple components relating to the capture, processing, transmission or reception of 3D information, and/or to the presentation of 3D content. The pods 2606 and 2608 can include one or more cameras for capturing image content for images to be included in a 3D presentation. Here, the pod 2606 includes cameras 2616 and 2618. For example, the camera 2616 and/or 2618 can be disposed essentially within a housing of the pod 2606, so that an objective or lens of the respective camera 2616 and/or 2618 captures image content by way of one or more openings in the housing. In some implementations, the camera 2616 and/or 2618 can be separate from the housing, such as in the form of a standalone device (e.g., with a wired and/or wireless connection to the pod 2606). The cameras 2616 and 2618 can be positioned and/or oriented so as to capture a sufficiently representative view of (here) the person 2602. While the cameras 2616 and 2618 should preferably not obscure the view of the 3D display 2610 for the person 2602, the placement of the cameras 2616 and 2618 can generally be arbitrarily selected. For example, one of the cameras 2616 and 2618 can be positioned somewhere above the face of the person 2602 and the other can be positioned somewhere below the face. For example, one of the cameras 2616 and 2618 can be positioned somewhere to the right of the face of the person 2602 and the other can be positioned somewhere to the left of the face. The pod 2608 can in an analogous way include cameras 2620 and 2622, for example.
The pod 2606 and/or 2608 can include one or more depth sensors to capture depth data to be used in a 3D presentation. Such depth sensors can be considered part of a depth capturing component in the 3D content system 2600 to be used for characterizing the scenes captured by the pods 2606 and/or 2608 in order to correctly represent them on a 3D display. Also, the system can track the position and orientation of the viewer's head, so that the 3D presentation can be rendered with the appearance corresponding to the viewer's current point of view. Here, the pod 2606 includes a depth sensor 2624. In an analogous way, the pod 2608 can include a depth sensor 2626. Any of multiple types of depth sensing or depth capture can be used for generating depth data. In some implementations, an assisted-stereo depth capture is performed. The scene can be illuminated using dots of light, and stereo matching can be performed between two respective cameras. This illumination can be done using waves of a selected wavelength or range of wavelengths. For example, infrared (IR) light can be used. Here, the depth sensor 2624 operates, by way of illustration, using beams 2628A and 2628B. The beams 2628A and 2628B can travel from the pod 2606 toward structures or other objects (e.g., the person 2602) in the scene that is being 3D captured, and/or from such structures/objects to the corresponding detector in the pod 2606, as the case may be. The detected signal(s) can be processed to generate depth data corresponding to some or all of the scene. As such, the beams 2628A-B can be considered as relating to the signals on which the 3D content system 2600 relies in order to characterize the scene(s) for purposes of 3D representation. For example, the beams 2628A-B can include IR signals. Analogously, the pod 2608 can operate, by way of illustration, using beams 2630A-B.
Depth data can include or be based on any information regarding a scene that reflects the distance between a depth sensor (e.g., the depth sensor 2624) and an object in the scene. The depth data reflects, for content in an image corresponding to an object in the scene, the distance (or depth) to the object. For example, the spatial relationship between the camera(s) and the depth sensor can be known, and can be used for correlating the images from the camera(s) with signals from the depth sensor to generate depth data for the images.
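As a minimal sketch of how a known spatial relationship could be used to correlate depth sensor signals with camera images, a depth pixel can be lifted to a 3D point, moved into the camera's coordinate frame by a rigid transform, and projected into the camera image. The pinhole intrinsics (fx, fy, cx, cy), the rotation and translation values, and all function names below are hypothetical and serve only to illustrate the idea.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift a depth-sensor pixel (u, v) with metric depth to a 3D point."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def transform(point, rotation, translation):
    """Apply a rigid transform given as 3x3 rotation rows and a 3-vector."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def project(point, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def depth_pixel_to_camera_pixel(u, v, depth, depth_k, cam_k, rotation, translation):
    """Map a depth-sensor pixel to the color camera image.

    depth_k and cam_k are (fx, fy, cx, cy) tuples. Returns the camera pixel
    and the depth of the point as seen from the camera.
    """
    p_depth = backproject(u, v, depth, *depth_k)
    p_cam = transform(p_depth, rotation, translation)
    return project(p_cam, *cam_k), p_cam[2]
```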
In some implementations, depth capturing can include an approach that is based on structured light or coded light. A striped pattern of light can be distributed onto the scene at a relatively high frame rate. For example, the frame rate can be considered high when the light signals are temporally sufficiently close to each other that the scene is not expected to change in a significant way in between consecutive signals, even if people or objects are in motion. The resulting pattern(s) can be used for determining what row of the projector is implicated by the respective structures. The camera(s) can then pick up the resulting pattern and triangulation can be performed to determine the geometry of the scene in one or more regards.
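The two steps described above, identifying the projector row and then triangulating, can be sketched as follows. This sketch assumes that the stripes are Gray-coded binary patterns (one common form of coded light, not necessarily the one used here) and that each projector row corresponds to a known plane of light expressed in the camera's coordinate frame. The function names and the plane parameterization n·x = offset are assumptions made for the example.

```python
def gray_to_binary(gray):
    """Convert a Gray-coded integer to its plain binary index."""
    value = gray
    shift = gray >> 1
    while shift:
        value ^= shift
        shift >>= 1
    return value


def decode_stripe_index(bits):
    """Recover the projector row seen at one camera pixel.

    bits holds the on/off observation at that pixel for each projected
    pattern, most significant pattern first.
    """
    gray = 0
    for b in bits:
        gray = (gray << 1) | (1 if b else 0)
    return gray_to_binary(gray)


def intersect_ray_plane(ray_dir, plane_normal, plane_offset):
    """Triangulate by intersecting a camera ray with a projector light plane.

    The ray starts at the camera origin; the plane satisfies n . x = offset.
    Returns the 3D point, or None if the ray is parallel to the plane.
    """
    denom = sum(n * d for n, d in zip(plane_normal, ray_dir))
    if abs(denom) < 1e-12:
        return None
    t = plane_offset / denom
    return tuple(t * d for d in ray_dir)
```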
The images captured by the 3D content system 2600 can be processed and thereafter displayed as a 3D presentation. Here, 3D image 2604′ is presented on the 3D display 2610. As such, the person 2602 can perceive the 3D image 2604′ as a 3D representation of the person 2604, who may be remotely located from the person 2602. 3D image 2602′ is presented on the 3D display 2612. As such, the person 2604 can perceive the 3D image 2602′ as a 3D representation of the person 2602. Examples of 3D information processing are described below.
The 3D content system 2600 can allow participants (e.g., the persons 2602 and 2604) to engage in audio communication with each other and/or others. In some implementations, the pod 2606 includes a speaker and microphone (not shown). For example, the pod 2608 can similarly include a speaker and a microphone. As such, the 3D content system 2600 can allow the persons 2602 and 2604 to engage in a 3D telepresence session with each other and/or others.
Generating high quality output from textured 3D models is the ultimate goal of many performance capture systems. Below, related methods are briefly reviewed, including image-based approaches, full 3D reconstruction systems, and finally learning-based solutions.
Image-based Rendering (IBR). IBR techniques warp a series of input color images to novel viewpoints of a scene using geometry as a proxy. These methods can be expanded to video inputs, where a performance is captured with multiple RGB cameras and proxy depth maps are estimated for every frame in the sequence. Such work can be limited to a small 30° coverage, and its quality strongly degrades when the interpolated view is far from the original cameras.
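A minimal sketch of warping with geometry as a proxy is given below for a single image row. It assumes the novel camera is translated sideways by a known baseline, so that each pixel shifts by f·B/Z according to its proxy depth Z, and it resolves collisions with a simple depth buffer. The function name and parameters are hypothetical; practical IBR systems warp full images and blend several inputs.

```python
def warp_scanline(colors, depths, focal_px, baseline_m):
    """Forward-warp one image row to a camera translated along +x.

    Each source pixel moves left by focal_px * baseline_m / depth columns.
    When two pixels land on the same column the nearer one wins, and
    columns that receive no pixel stay None (disocclusion holes).
    """
    width = len(colors)
    out = [None] * width
    zbuf = [float("inf")] * width
    for x, (color, z) in enumerate(zip(colors, depths)):
        shift = focal_px * baseline_m / z
        xn = int(round(x - shift))
        if 0 <= xn < width and z < zbuf[xn]:
            out[xn] = color
            zbuf[xn] = z
    return out
```

The holes left in the output illustrate why quality can degrade as the novel viewpoint moves away from the original cameras: larger shifts expose more regions that no input pixel covers.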
Recent works introduced optical flow methods to IBR; however, their accuracy is usually limited by the optical flow quality. Moreover, these algorithms are restricted to off-line applications. Another limitation of IBR techniques is their use of all input images in the rendering stage, making them ill-suited for real-time VR or AR applications as they require transferring all camera streams, together with the proxy geometry. However, IBR techniques have been successfully applied to constrained applications like 360° stereo video, which produces two separate video panoramas, one for each eye, but is constrained to a single viewpoint.
Volumetric capture systems can use more than 100 cameras to generate high quality offline volumetric performance capture. A controlled environment with green screen and carefully adjusted lighting conditions can be used to produce high quality renderings. Methods can produce a rough point cloud via multi-view stereo, which is then converted into a mesh using Poisson Surface Reconstruction. Based on the current topology of the mesh, a keyframe is selected which is tracked over time to mitigate inconsistencies between frames. The overall processing time is approximately 28 minutes per frame. Some examples can be extended to support texture tracking. These frameworks then deliver high quality volumetric captures at the cost of sacrificing real-time capability.
Methods can use single RGB-D sensors to track either a template mesh or a reference volume. However, these systems require careful motions and none support high quality texture reconstruction. Some systems can use fast correspondence tracking to extend the single view non-rigid tracking pipeline to handle topology changes robustly. This method, however, can suffer from both geometric and texture inconsistency.
Even the latest state-of-the-art reconstructions can suffer from geometric holes, noise, and low quality textures. A real-time texturing method that can be applied on top of the volumetric reconstruction may improve quality. This method is based on a simple Poisson blending scheme, as opposed to offline systems that use a Conditional Random Field (CRF) model. The final results are still coarse in terms of texture. Moreover, these algorithms require streaming all of the raw input images, which means they do not scale with high resolution input images.
Learning-based solutions to generate high quality renderings have shown promising results. However, some approaches model only a few explicit object classes, and the final results do not necessarily resemble high-quality real objects. Follow-up work can use end-to-end encoder-decoder networks to generate novel views of an image starting from a single viewpoint. However, due to the large variability, the results are usually low resolution. Some systems employ some notion of 3D geometry in the end-to-end process to deal with the 2D-3D object mapping. For instance, an explicit flow that maps pixels from the input image to the output novel view can be used. In Deep View Morphing, two input images and an explicit rectification stage that roughly aligns the inputs are used to generate intermediate views. Another trend explicitly employs multiview stereo in an end-to-end fashion to generate intermediate views of city landscapes.
3D shape completion methods can use 3D filters to volumetrically complete 3D shapes. But given the cost of such filters both at training and at test time, these have shown low resolution reconstructions and performance far from real-time. PointProNets show results for denoising point clouds but again are computationally demanding, and do not consider the problem of texture reconstruction.
The problem considered herein can be related to the image-to-image translation task where the goal is to start from input images from a certain domain and “translate” them into another domain, e.g. from semantic segmentation labels to realistic images. The scenario described herein is similar, as we transform low quality 3D renderings into higher quality images. Despite the huge amount of work on the topic, it is still challenging to generate high quality renderings of people in real-time for performance capture. Contrary to previous work, we leverage recent advances in real-time volumetric capture and use these systems as input for our learning based framework to generate high quality, real-time renderings of people performing arbitrary actions.
In one aspect, the disclosure describes a system comprising a camera rig including at least one first camera configured to capture three dimensional (3D) video at a first quality, and at least one second camera configured to capture a two dimensional (2D) image at a second quality, the second quality being a higher quality than the first quality; and a processor configured to perform steps including: rendering a first digital image based on the captured 3D video, rendering a second digital image based on the captured 3D video, training a neural network to generate a third digital image based on the first digital image and the 2D image, the third digital image having a third quality, the third quality being a higher quality than the first quality, and training the neural network to generate a fourth digital image based on the second digital image and the 2D image, the fourth digital image having the third quality.
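The training step can be illustrated, in heavily simplified form, by gradient descent on a masked reconstruction loss between a prediction made from a lower-quality rendering and a higher-quality ground truth image. In the sketch below, a two-parameter affine model (gain and bias) stands in for the neural network, images are flat lists of intensities, and the segmentation mask simply excludes pixels from the loss. None of these simplifications, names, or hyperparameters come from the disclosure.

```python
def masked_l2_loss(pred, target, mask):
    """Mean squared error over the pixels selected by the mask."""
    n = sum(mask)
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask)) / n


def train_affine_rerenderer(pairs, steps=2000, lr=0.1):
    """Fit gain and bias so that gain * rendered + bias matches ground truth.

    pairs is a list of (rendered, ground_truth, mask) flattened images.
    The affine model is only a stand-in for a neural network.
    """
    gain, bias = 1.0, 0.0
    for _ in range(steps):
        g_gain = g_bias = 0.0
        count = 0
        for rendered, truth, mask in pairs:
            for r, t, m in zip(rendered, truth, mask):
                if m:  # masked-out pixels contribute no gradient
                    err = gain * r + bias - t
                    g_gain += 2 * err * r
                    g_bias += 2 * err
                    count += 1
        gain -= lr * g_gain / count
        bias -= lr * g_bias / count
    return gain, bias
```

Here each entry in pairs plays the role of one rendered image (for example, the first or the second digital image) together with the 2D image captured at the higher quality.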
In another aspect, the disclosure describes a non-transitory computer-readable storage medium having stored thereon computer executable program code which, when executed on a computer system, causes the computer system to perform steps comprising: receiving a file including compressed three dimensional (3D) video data, the 3D video data including a plurality of frames of a 3D video; selecting a frame from the plurality of frames of the 3D video; decompressing the frame; rendering a first digital image based on the decompressed frame, the first digital image having a first quality; rendering a second digital image based on the decompressed frame, the second digital image having the first quality; generating a third digital image by re-rendering the first digital image using a trained neural network, the third digital image having a second quality, the second quality being a higher quality than the first quality; and generating a fourth digital image by re-rendering the second digital image using the trained neural network, the fourth digital image having the second quality.
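The sequence of steps recited above (select a frame, decompress it, render two images, re-render each with a trained model) can be sketched as follows. Everything concrete in the sketch is an assumption made for illustration: frames are stored as zlib-compressed JSON, the renderer is a placeholder that returns pixels stored under the hypothetical view names "left" and "right", and the trained model is represented by a (gain, bias) pair rather than a neural network.

```python
import json
import zlib


def pack_frame(frame):
    """Compress one frame (a JSON-serializable dict) for the example file."""
    return zlib.compress(json.dumps(frame).encode("utf-8"))


def render_view(frame, view):
    """Placeholder renderer: return the stored low-quality pixels for a view."""
    return frame["views"][view]


def rerender(image, model):
    """Stand-in for re-rendering with a trained network."""
    gain, bias = model
    return [gain * p + bias for p in image]


def enhance_frame(compressed_frames, index, model):
    """Select, decompress, render two views, and re-render both."""
    raw = zlib.decompress(compressed_frames[index])
    frame = json.loads(raw.decode("utf-8"))
    first = render_view(frame, "left")
    second = render_view(frame, "right")
    return rerender(first, model), rerender(second, model)
```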
In another aspect, the disclosure describes a method comprising a first phase and a second phase. In the first phase: capturing a three dimensional (3D) video at a first quality; capturing a two dimensional (2D) image at a second quality, the second quality being a higher quality than the first quality, a frame of the 3D video and the 2D image being captured at substantially the same moment in time; rendering a first digital image based on the captured 3D video; rendering a second digital image based on the captured 3D video; training a neural network to generate a third digital image based on the first digital image and the 2D image, the third digital image having a third quality, the third quality being a higher quality than the first quality; and training the neural network to generate a fourth digital image based on the second digital image and the 2D image, the fourth digital image having the third quality. In the second phase: receiving a file including compressed three dimensional (3D) video data, the 3D video data including a plurality of frames of a received 3D video; selecting a frame from the plurality of frames of the received 3D video; decompressing the frame; rendering a fifth digital image based on the decompressed frame, the fifth digital image having the first quality; rendering a sixth digital image based on the decompressed frame, the sixth digital image having the first quality; generating a seventh digital image by re-rendering the fifth digital image using the trained neural network, the seventh digital image having the third quality; and generating an eighth digital image by re-rendering the sixth digital image using the trained neural network, the eighth digital image having the third quality.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. Various implementations of the systems and techniques described here can be realized as and/or generally be referred to herein as a circuit, a module, a block, or a system that can combine software and hardware aspects. For example, a module may include the functions/acts/computer program instructions executing on a processor (e.g., a processor formed on a silicon substrate, a GaAs substrate, and the like) or some other programmable data processing apparatus.
Some of the above example embodiments are described as processes or methods depicted as flowcharts. Although the flowcharts describe the operations as sequential processes, many of the operations may be performed in parallel, concurrently or simultaneously. In addition, the order of operations may be re-arranged. The processes may be terminated when their operations are completed, but may also have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, subprograms, etc.
Methods discussed above, some of which are illustrated by the flow charts, may be implemented by hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine or computer readable medium such as a storage medium. A processor(s) may perform the necessary tasks.
Specific structural and functional details disclosed herein are merely representative for purposes of describing example embodiments. Example embodiments, however, may be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of example embodiments. As used herein, the term and/or includes any and all combinations of one or more of the associated listed items.
It will be understood that when an element is referred to as being connected or coupled to another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being directly connected or directly coupled to another element, there are no intervening elements present. Other words used to describe the relationship between elements should be interpreted in a like fashion (e.g., between versus directly between, adjacent versus directly adjacent, etc.).
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms a, an and the are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms comprises, comprising, includes and/or including, when used herein, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.
It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may in fact be executed concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which example embodiments belong. It will be further understood that terms, e.g., those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Portions of the above example embodiments and corresponding detailed description are presented in terms of software, or algorithms and symbolic representations of operation on data bits within a computer memory. These descriptions and representations are the ones by which those of ordinary skill in the art effectively convey the substance of their work to others of ordinary skill in the art. An algorithm, as the term is used here, and as it is used generally, is conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of optical, electrical, or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
In the above illustrative embodiments, reference to acts and symbolic representations of operations (e.g., in the form of flowcharts) that may be implemented as program modules or functional processes includes routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types and may be described and/or implemented using existing hardware at existing structural elements. Such existing hardware may include one or more Central Processing Units (CPUs), digital signal processors (DSPs), application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), computers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, or as is apparent from the discussion, terms such as processing or computing or calculating or determining or displaying or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical, electronic quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Note also that the software implemented aspects of the example embodiments are typically encoded on some form of non-transitory program storage medium or implemented over some type of transmission medium. The program storage medium may be magnetic (e.g., a floppy disk or a hard drive) or optical (e.g., a compact disk read only memory, or CD ROM), and may be read only or random access. Similarly, the transmission medium may be twisted wire pairs, coaxial cable, optical fiber, or some other suitable transmission medium known to the art. The example embodiments are not limited by these aspects of any given implementation.
Lastly, it should also be noted that whilst the accompanying claims set out particular combinations of features described herein, the scope of the present disclosure is not limited to the particular combinations hereafter claimed, but instead extends to encompass any combination of features or embodiments herein disclosed irrespective of whether or not that particular combination has been specifically enumerated in the accompanying claims at this time.
This application claims priority to U.S. Provisional Patent Application Ser. No. 62/774,662, filed on Dec. 3, 2018, entitled “ENHANCING PERFORMANCE CAPTURE WITH REAL-TIME NEURAL RENDERING”, the disclosure of which is incorporated by reference herein in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2019/063969 | 12/2/2019 | WO | 00 |
Number | Date | Country |
---|---|---|
62774662 | Dec 2018 | US |