A portion of the disclosure of this patent document contains material which is subject to copyright protection. This patent document may show and/or describe matter which is or may become trade dress of the owner. The copyright and trade dress owner has no objection to the facsimile reproduction by anyone of the patent disclosure as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright and trade dress rights whatsoever.
1. Field
This disclosure relates generally to methods for computer graphics rendering. More specifically, it relates to removing noise from and improving Monte Carlo rendered images.
2. Description of the Related Art
Producing photorealistic images from a scene model requires computing a complex multidimensional integral of the scene function at every pixel of the image. For example, generating effects like depth of field and motion blur requires integrating over domains such as lens position and time. Monte Carlo (MC) rendering systems approximate this integral by tracing light rays (samples) in the multidimensional space to evaluate the scene function. Although an approximation to this integral can be quickly evaluated with just a few samples, the inaccuracy of this estimate relative to the true value appears as unacceptable noise in the resulting image. Since the variance of the MC estimator decreases linearly with the number of samples, many samples are required to get a reliable estimate of the integral. The high cost of computing additional rays results in lengthy render times that negatively affect the applicability of MC renderers in modern film production.
One way to mitigate this problem is to quickly render a noisy image with a few samples and then filter it as a post-process to generate an acceptable, noise-free result. This approach has been the subject of extensive research in recent years. The more successful methods typically use feature-based filters (e.g., cross-bilateral or cross non-local means filters) to leverage additional scene features, such as world position, that help guide the filtering process. Since these features are highly correlated with scene detail, using them in the filtering process greatly improves the quality of the results.
Some approaches have used this information to handle specific distributed effects such as global illumination and depth of field. However, a major challenge is how to exploit this additional information to denoise distributed effects, which requires setting the filter weights for all features (called “filter parameters” hereafter) so that noise is removed while scene detail is preserved. To do this, some have proposed to use the functional dependencies between scene features and random parameters calculated using mutual information, a process that removes noise but is slow. Several other algorithms build upon this by using error estimation metrics to select the best filter parameters from a discrete set. The main drawback of these methods is that their error metrics are usually noisy at low sampling rates, reducing the accuracy of filter selection. Furthermore, they choose the filter parameters from a preselected, discrete set that may not contain the optimum. As a result, these methods produce images with over-blurred or under-blurred regions.
Since the introduction of distributed ray tracing by Cook et al., researchers have proposed a variety of algorithms to address the noise in Monte Carlo (MC) rendering. Some of these include variance reduction techniques, low-discrepancy sampling patterns, new Monte Carlo formulations with faster convergence, and methods that position or reuse samples based on the shape of the multidimensional integrand.
Filtering approaches render a noisy image with a few samples and then denoise it through a filtering process. Some methods adaptively sample as well, further improving the results. Some previous work on MC filtering uses only sample color during filtering, while other work uses additional scene information.
Color-Based Filter Methods
These methods are inspired by traditional image denoising techniques and use only pixel color information from the rendering system to remove MC noise. Early work by Lee and Redner used nonlinear filters (median and alpha-trimmed mean filters) to remove spikes while preserving edges. Rushmeier and Ward proposed to spread the energy of input samples through variable width filter kernels. To reduce the noise in path-traced images, Jensen and Christensen separated illumination into direct and indirect components, filtered the indirect portion, and then added the components back together. Bala et al. exploited an edge image to facilitate the filtering process, while Xu and Pattanaik used a bilateral filter to remove MC noise. Egan et al. used frequency analysis to shear a filter for specific distributed effects such as motion blur and occlusion/shadowing, while Mehta et al. used related analysis to derive simple formulas that set the variance of a screen-space Gaussian filter to target noise from specific effects. Most of these approaches use the analysis to adaptively position samples as well.
For denoising general distributed effects, Overbeck et al. adapted wavelet shrinkage to MC noise reduction, while Rousselle et al. selected an appropriate scale for a Gaussian filter at every pixel to minimize the reconstruction error. Rousselle later improved this using a non-local means filter. Using the median absolute deviation to estimate the noise at every pixel, Kalantari and Sen were able to apply arbitrary image denoising techniques to MC rendering. Finally, Delbracio et al. proposed a method based on non-local means filtering which computes the distance between two patches using their color histograms. Although these color-based methods are general and work on a variety of distributed effects, they need many samples to produce reasonable results. At low sampling rates, they generate unsatisfactory results on challenging scenes.
Filters That Use Additional Information
The approaches in this category leverage additional scene features (e.g., world positions, shading normals, texture values, etc.) which are computed by the MC renderer. Thus, they tend to generate higher-quality results compared to the color-based approaches described above.
For example, McCool removed MC noise by using depths and normals to create a coherence map for an anisotropic diffusion filter. To efficiently render scenes with global illumination, Segovia et al. and Laine et al. used a geometry buffer. Meanwhile, to reduce global illumination noise, Dammertz et al. incorporated wavelet information into the bilateral filter and Bauszat et al. used guided image filtering. Shirley et al. used a depth buffer to handle depth of field and motion blur effects, while Chen et al. combined a depth map with sample variance to filter the noise from depth of field. These methods are directed to a fixed set of distributed effects and are not general.
Hachisuka et al. performed adaptive sampling and reconstruction based on discontinuities in the multidimensional space. Although this method handles general distributed effects, it suffers from the curse of dimensionality.
To handle general MC noise using additional scene features, Sen and Darabi observed the need to vary the filter's feature weights across the image. Specifically, they proposed to compute these weights using mutual information to approximate the functional dependencies between scene features and the random parameters. Li et al. used Stein's unbiased risk estimator (SURE) to estimate the appropriate spatial filter parameter in a cross-bilateral filter, while hard coding the weights of the remaining cross terms. Rousselle et al. significantly improved upon this by using the SURE metric to select between three candidate cross non-local means filters that each weight color and features differently. Moon et al. compute a weighted local regression on a reduced feature space and evaluate the error for a discrete set of filter parameters to select the best one.
The main problem with the aforementioned approaches, which constitute the state of the art, is that they weight each filter term through heuristic rules and/or an error metric that is quite noisy at low sampling rates. Thus, they are not able to robustly estimate the appropriate filter weights in challenging cases.
Neural Networks in Graphics/Denoising
Neural networks have been used in computer graphics processing. Grzeszczuk et al. used neural networks to create physically realistic animation. Nowrouzezahrai et al. used neural networks to predict per vertex visibility. Dachsbacher classified different visibility configurations using neural networks. Ren et al. used a neural network to model the radiance regression function to render indirect illumination of a fixed scene in real time. Neural networks have also been used in image denoising where they have been directly trained on a set of noisy and clean patches.
In addition, Jakob et al. have a method that, while not utilizing neural networks, performs learning through expectation maximization to find the appropriate parameters of a Gaussian mixture model to denoise photon maps, a different but related problem.
Monte Carlo rendering allows for the creation of realistic and creative images. However, the resulting images may be full of noise and artifacts. As such, the images are considered noisy. The term “noise” when used alone herein refers to Monte Carlo or MC noise that reduces image quality and not desirable noise.
A machine learning approach to reduce noise in Monte Carlo (MC) rendered images is described herein. To model the complex relationship between ideal filter parameters and a set of features extracted from the input noisy images, machine learning is used. In one embodiment, a multilayer perceptron (MLP) neural network as a nonlinear regression model is used for the machine learning. To effectively train the neural network, the MLP neural network is combined with a filter. In this arrangement, the MLP evaluates a set of features extracted from a local neighborhood at each pixel and outputs a set of filter parameters. The filter parameters and the noisy samples are provided as inputs to the filter to generate a filtered pixel that is compared to the ground truth pixel during training. The neural network is trained on a set of images with a variety of distributed effects and then applied to different images containing various distributed effects or characteristics such as, for example, motion blur, depth of field, area lighting, glossy reflections, and global illumination. The machine learning approach includes training an MLP neural network with a filter to provide denoised or noise-free images.
There is a complex relationship between the input noisy image and the optimal filter parameters needed to create an accurate image. These filter parameters can be effectively estimated using different factors (e.g., feature variances and noise in local regions), but each individual factor by itself cannot accurately predict them. Based on these observations, a supervised learning method is described herein. The supervised learning method learns the complex relationship between these factors and the optimal filter parameters. In this way, the methods avoid the problems of previous approaches. According to one version of the method, a nonlinear regression model is trained on a set of noisy MC rendered images and their corresponding ground truth images, using a multilayer perceptron (MLP) coupled with a matching filter during training and refinement.
During the training stage, the method renders both noisy images at low sampling rates as well as their corresponding ground truth images for a set of scenes with a variety of distributed effects. The method then processes the noisy images and extracts a set of useful features in square regions around every pixel. The method is trained based on the extracted features to drive the filter to produce images that resemble the ground truth. This is done according to a specific error metric.
After the neural network has been trained, in an application stage the method filters new noisy renderings with general distributed effects. The method is fast (and may take a few seconds or less) and produces better results than existing methods for a wide range of distributed effects including depth of field, motion blur, area lighting, glossy reflections, and global illumination. Further, unlike earlier approaches, in one embodiment, no adaptive sampling is performed. In another embodiment of the method, adaptive sampling may be included. The method described herein is a post-process step that effectively removes MC noise.
The method includes: reducing general MC noise using machine learning including supervised learning for MC noise reduction; and training a neural network in combination with a filter to generate results that are close to ground truth images. In other implementations, the machine learning may be support vector machines, random forests, and other kinds of machine learning. As such, the methods are not limited to neural networks.
Description of Apparatus
The methods described herein may be implemented on a computing device such as a computer workstation or personal computer. An example computing device 100 is shown in
In one version, the method was implemented and run on a computing device having an INTEL quad-core 3.7 GHz CPU with 24 GB of RAM and a GeForce GTX TITAN GPU from NVIDIA Corporation. Many other computing device configurations may be used; this is merely provided as an example.
To implement one version of the methods described herein, a learning-based filtering (LBF) component was written in the C++ programming language and integrated into the PBRT2 platform (Physically Based Rendering, Second Edition, see pbrt.org). In one implementation, the neural network and filter were written in CUDA (a parallel computing platform for graphics processing available from NVIDIA Corporation) to take advantage of GPU acceleration.
Description of Processes
The goal of the method described herein is to take a noisy input image rendered with only a few samples and generate a noise-free image that is similar to the ground truth image rendered with many samples. Referring now to
Examples of the results of the application of the method are shown in
Returning now to the discussion of the method, the filtered image ĉ={ĉr, ĉg, ĉb} at pixel i is computed as a weighted average of all of the pixels in a square neighborhood N(i) (for example, 55×55) centered around pixel i:
where di,j is the weight between pixel i and its neighbor j as defined by the filter and
where
The filtering process may be written as:
Here,
This energy function is used to compute the filter parameters that will generate a filtered image close to ground truth.
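By way of illustration only, the weighted-average filtering described above can be sketched for a one-dimensional row of grayscale pixels. The function name filter_pixel, the single scalar feature per pixel, and the product of a spatial Gaussian and a feature Gaussian as the weight d between pixel i and neighbor j are assumptions made for this sketch; the two bandwidths play the role of filter parameters.

```python
import math

def filter_pixel(colors, features, i, radius, sigma_space, sigma_feature):
    """Weighted average of the noisy colors in a window centered on pixel i.

    Hypothetical 1-D stand-in for the square neighborhood N(i).
    """
    numerator = 0.0
    weight_sum = 0.0
    lo = max(0, i - radius)
    hi = min(len(colors) - 1, i + radius)
    for j in range(lo, hi + 1):
        # Spatial term: nearby pixels receive more weight.
        w_space = math.exp(-((i - j) ** 2) / (2.0 * sigma_space ** 2))
        # Feature term: neighbors whose feature resembles pixel i's receive more weight.
        diff = features[i] - features[j]
        w_feature = math.exp(-(diff ** 2) / (2.0 * sigma_feature ** 2))
        d_ij = w_space * w_feature
        numerator += d_ij * colors[j]
        weight_sum += d_ij
    # weight_sum is at least 1 because the pixel itself always has weight 1.
    return numerator / weight_sum

# A color edge that coincides with a feature edge is preserved.
colors = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
features = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]
filtered = filter_pixel(colors, features, 2, 3, 2.0, 0.5)
```

A small feature bandwidth keeps the filter from averaging across the feature edge, while a large one lets it blur freely, which is why the choice of filter parameters at each pixel matters.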
To avoid problems in computing the energy function and the filter parameters, the method uses a learning system that directly minimizes errors. The method uses a nonlinear regression model based on a neural network and directly combines the neural network with a matching filter during training and later application. Ground truth images are used during training to directly compute the error between the filtered and ground truth images without the need for error estimation. During an application stage, the trained machine learning model (resulting from iterations that minimize the error computed by the energy function) is applied to additional or secondary features from new scenes to compute filter parameters that produce results close to the ground truth.
We now describe how to train a neural network in combination with a filter by minimizing the energy function to create filter parameters. Referring now to
Neural Network.
In one embodiment, the neural network includes three elements: (1) a model for representing the energy function, (2) an appropriate error metric to measure the distance between the filtered and ground truth images, and (3) an optimization strategy to minimize the energy function.
The Energy Function
In one embodiment, the machine learning model is represented as a neural network in the form of a multilayer perceptron (MLP). An MLP is used as the regression model because it is a simple and powerful system for discovering complex nonlinear relationships between inputs and outputs. Moreover, MLPs are inherently parallel, can be efficiently implemented on a GPU, and are very fast once trained, which is important for rendering. The method described herein differs from standard MLPs in that a filter is incorporated into the training process. With the filter incorporated, the error is backpropagated through the filter to update the weights of the neural network during training. To be used in this way, the filter must be differentiable with respect to the filter parameters. Filters such as Gaussian, cross-bilateral, and cross non-local means filters are all differentiable and may be incorporated in the method. Other appropriate filters may also be used.
As shown in
where n^(l−1) is the number of nodes in layer l−1, w_{t,s}^l is the weight associated with the connection between node t in layer l−1 and node s in layer l, w_{0,s}^l is the bias for this node, and ƒ^l is the activation function for layer l. In one implementation, nonlinear activation functions are used in all layers. Multiple kinds of nonlinear activation functions may be used, such as the sigmoid function ƒ^l(x) = 1/(1 + e^(−x)). In various implementations, combinations of linear and nonlinear activation functions may be used.
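As a minimal sketch of the feed-forward evaluation, assuming each node applies the activation function to its bias plus the weighted sum of the previous layer's outputs, the computation might look as follows. The names sigmoid, layer_forward, and mlp_forward and the nested-list weight layout are hypothetical and chosen for this example only.

```python
import math

def sigmoid(x):
    """Sigmoid activation, 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """Evaluate one layer.

    weights[t][s] connects node t of the previous layer to node s of this
    layer; biases[s] is the bias of node s.
    """
    outputs = []
    for s in range(len(biases)):
        total = biases[s]
        for t in range(len(inputs)):
            total += weights[t][s] * inputs[t]
        outputs.append(sigmoid(total))
    return outputs

def mlp_forward(inputs, layers):
    """Feed the inputs through every (weights, biases) pair in order."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Two inputs, a hidden layer of two nodes, and one output node.
layers = [
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),
    ([[1.0], [1.0]], [-1.0]),
]
output = mlp_forward([3.0, -2.0], layers)
```

In the method, the inputs would be the secondary features of a pixel and the outputs would be the filter parameters for that pixel.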
The Error Metric
The error metric used in the method to measure the error between the filtered and ground truth pixel values is a modified relative mean squared error (RelMSE) metric:
\[ E_i = \frac{n}{2} \sum_{q \in \{r,g,b\}} \frac{(\hat{c}_{i,q} - c_{i,q})^2}{c_{i,q}^2 + \varepsilon} \]
where n is the number of samples per pixel, ĉ_{i,q} and c_{i,q} are the qth color channel of the filtered and ground truth pixels at pixel i, respectively, and ε is a small number (0.01 in one implementation) to avoid division by zero. In this equation, division by c_{i,q}^2 is incorporated to account for human visual sensitivity to color variations in darker regions of the image by giving higher weight to the regions where the ground truth image is darker. Further, by multiplying the squared error by n, the inverse relationship between the error and the sampling rate is removed, so that all of the images have an equal contribution to the error regardless of sampling rate. In addition, division by 2 is included to produce a simpler derivative.
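A minimal sketch of such a metric is given below, assuming that the squared error of each color channel is divided by the squared ground truth value plus the small constant and that the sum over channels is scaled by n/2. The function name rel_mse and its argument order are illustrative assumptions.

```python
def rel_mse(filtered, truth, n, eps=0.01):
    """Modified relative MSE for one pixel.

    filtered and truth hold the color channels of the filtered and ground
    truth pixel; n is the number of samples per pixel.
    """
    total = 0.0
    for c_hat, c in zip(filtered, truth):
        # Dividing by the squared ground truth value weights dark regions more.
        total += (c_hat - c) ** 2 / (c ** 2 + eps)
    # The factor n removes the dependence on sampling rate; the factor 1/2
    # simplifies the derivative.
    return 0.5 * n * total

# The same absolute error costs more in a dark pixel than in a bright one.
dark_error = rel_mse([0.2], [0.1], 4)
bright_error = rel_mse([1.1], [1.0], 4)
```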
Optimization Strategy
The optimization starts with a large set of noisy images and the corresponding ground truth images, which can be generated prior to training. For each noisy image, a set of secondary features is extracted at each pixel. The secondary features are used to train the neural network through an iterative, three-step process called “backpropagation”. The goal of backpropagation is to determine the optimal weights w_{t,s}^l for all nodes in the neural network which minimize the error between the computed and desired outputs (i.e., the ground truth values) for all pixels in the training images, E = Σ_{i ∈ all pixels} E_i.
Before starting the backpropagation process, the weights are randomly initialized to small values around zero (for example, between −0.5 and 0.5). Then in the first step, known as the feed-forward pass, the output of the neural network is computed using all inputs. This can be implemented efficiently using a series of matrix multiplications and activation functions applied to the input data to evaluate a_s^l using the equation above. In the second step, the error between the computed and desired outputs is used to determine the effect of each weight on the output error. To do this, the derivative of the error is taken with respect to each weight, ∂E/∂w_{t,s}^l. Thus, the activation functions (and the filter as well) need to be differentiable. These two steps are performed for all of the data in the training set, and the error gradient of each weight is accumulated. In the third step, all the weights are updated according to their error gradient and the actual error computed by the Error Metric above. This completes a single iteration of training, known as an epoch. Epochs are performed until a converged set of weights is obtained.
Next, a chain rule is used to express the derivative of the energy function.
where M is the number of filter parameters. The first term is the derivative of the error with respect to the filtered pixels ĉi,q. This first term can be calculated as:
In addition, is the output of the MLP network (shown in
The derivative of the energy function is computed for each weight within the neural network, and the weights are updated after every epoch. The process iterates until convergence is achieved.
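The training procedure can be illustrated with a deliberately small example. The sketch below assumes a single weight w, a single input feature x, and a toy differentiable filter that blends the noisy pixel with the mean of its neighbors; this blend filter is a stand-in chosen for brevity and is not the cross-bilateral filter. All names and the training values are hypothetical. The gradient is the product of three factors: the derivative of the error with respect to the filtered pixel, of the filtered pixel with respect to the filter parameter, and of the filter parameter with respect to the weight.

```python
import math

EPS = 0.01  # small constant in the error metric

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def blend_filter(noisy, neighbor_mean, theta):
    """Toy differentiable filter: theta in (0, 1) is the amount of smoothing."""
    return theta * neighbor_mean + (1.0 - theta) * noisy

def pixel_error(c_hat, truth, n):
    return 0.5 * n * (c_hat - truth) ** 2 / (truth ** 2 + EPS)

def total_error(w, data):
    err = 0.0
    for x, noisy, neighbor_mean, truth, n in data:
        theta = sigmoid(w * x)
        err += pixel_error(blend_filter(noisy, neighbor_mean, theta), truth, n)
    return err

def error_gradient(w, data):
    """Accumulate dE/dw over the training set using the chain rule."""
    grad = 0.0
    for x, noisy, neighbor_mean, truth, n in data:
        theta = sigmoid(w * x)                      # feed-forward pass
        c_hat = blend_filter(noisy, neighbor_mean, theta)
        dE_dc = n * (c_hat - truth) / (truth ** 2 + EPS)
        dc_dtheta = neighbor_mean - noisy           # derivative of the filter
        dtheta_dw = theta * (1.0 - theta) * x       # derivative of the sigmoid
        grad += dE_dc * dc_dtheta * dtheta_dw
    return grad

def train(data, epochs=200, rate=0.1, w=0.1):
    """One weight update per epoch, after the gradient has been accumulated."""
    for _ in range(epochs):
        w -= rate * error_gradient(w, data)
    return w

# Hypothetical samples: (feature, noisy color, neighbor mean, ground truth, n).
TRAINING_DATA = [
    (1.0, 0.9, 0.52, 0.5, 4),
    (1.0, 0.2, 0.49, 0.5, 4),
    (1.0, 0.7, 0.50, 0.5, 4),
]
trained_w = train(TRAINING_DATA)
```

Comparing error_gradient against a finite-difference estimate of total_error is a common way to check that a filter's derivative has been implemented correctly before it is used in training.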
Primary Features
Primary features are those directly output by the rendering system. In one version of the method, seven primary features (M=7) are used in the cross-bilateral filter. The primary features are: screen position, color, and five additional features (K=5): world position, shading normal, texture values for the first and second intersections, and direct illumination visibility.
During rendering, the rendering system outputs the following values for each sample: screen position in x, y coordinates, color in RGB format, world position in Cartesian coordinates (x, y, z), shading normal (i, j, k), texture values for the first and second intersections in RGB format, and a single binary value for the direct illumination visibility, for a total of 18 floating point values. These values are averaged over all samples in a pixel to produce the mean primary features for every pixel in the image. At this point, the average direct illumination visibility represents the fraction of shadow rays that see the light and is no longer a binary value. Moreover, the additional features are prefiltered using a non-local means filter in an 11×11 window with patch size 7×7.
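The per-sample layout and the per-pixel averaging can be sketched as follows. The names PRIMARY_LAYOUT, floats_per_sample, and mean_primary_features and the flat-list representation of a sample are assumptions for illustration; the component counts are those listed above.

```python
# (name, number of floating point values) for each per-sample quantity.
PRIMARY_LAYOUT = [
    ("screen_position", 2),    # x, y
    ("color", 3),              # RGB
    ("world_position", 3),     # x, y, z
    ("shading_normal", 3),     # i, j, k
    ("texture_first", 3),      # RGB at the first intersection
    ("texture_second", 3),     # RGB at the second intersection
    ("direct_visibility", 1),  # binary per sample
]

def floats_per_sample():
    return sum(width for _, width in PRIMARY_LAYOUT)

def mean_primary_features(samples):
    """Average the per-sample vectors of one pixel, component by component."""
    count = len(samples)
    width = len(samples[0])
    return [sum(s[k] for s in samples) / count for k in range(width)]

# Four samples whose visibility flags are 1, 0, 1, 1 (all other values zero).
samples = [[0.0] * 17 + [v] for v in (1.0, 0.0, 1.0, 1.0)]
pixel_mean = mean_primary_features(samples)
```

After averaging, the last component is 0.75, the fraction of shadow rays that see the light.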
The distances of the color and additional features are normalized by their variances. The following function is used for the color term:
where ψi and ψj are the standard deviations of the color samples at pixels i and j, respectively, and ζ is a small number (such as, for example, 10^−10) to avoid division by zero. The distance for the additional features is expressed by the following function:
where ψk,i is the standard deviation of the kth feature at pixel i and δ is a small number (such as, for example, 10^−4) to avoid division by zero. The method smooths the noisy standard deviations for the additional features ψk,i by filtering them using the same weights computed by the non-local means filter when filtering the primary features.
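One plausible reading of these variance-normalized distances is sketched below. The exact expressions are assumptions: the color distance is taken to be the squared difference divided by the sum of the two color variances plus ζ, and the feature distance the squared difference divided by the feature variance at pixel i plus δ. The function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def color_distance(c_i, c_j, std_i, std_j, zeta=1e-10):
    """Color distance normalized by the color variances at both pixels (assumed form)."""
    return squared_distance(c_i, c_j) / (std_i ** 2 + std_j ** 2 + zeta)

def feature_distance(f_i, f_j, std_i, delta=1e-4):
    """Feature distance normalized by the feature variance at pixel i (assumed form)."""
    return squared_distance(f_i, f_j) / (std_i ** 2 + delta)

# The same color difference counts for less where the samples are noisier.
a, b = [0.2, 0.4, 0.6], [0.3, 0.5, 0.7]
noisy_region = color_distance(a, b, 1.0, 1.0)
clean_region = color_distance(a, b, 0.1, 0.1)
```

Under this reading, a difference between two pixels is discounted where the samples themselves vary widely, since part of that difference is likely to be noise.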
Secondary Features
At every pixel, the method computes a set of secondary features from the neighboring noisy samples to serve as inputs to the neural network.
Feature statistics: the mean and standard deviation for the K=5 additional features are computed for all samples in the pixel. To capture more global statistics, the method also calculates the mean and standard deviation of the pixel-averaged features in a 7×7 block around each pixel. The method computes the statistics for each component (e.g., i, j, k for shading normal) separately and averages them together to create a single value per feature. Thus, according to the method, there are 20 total values for each pixel and the block around it.
Gradients: The gradients of features may be used to decrease the weight of a feature in regions with sharp edges. The method calculates the gradient magnitude (scalar) of the K additional features using a Sobel operator (5 values total).
Mean deviation: This term is the average of the absolute difference between each individual pixel in a block and the block mean. This feature can help identify regions with large errors. In response, the neural network can adjust the filter parameters. For each of the K additional features, the method computes the mean deviation of all the pixel-averaged features in a 3×3 block around each pixel. This feature is computed on each component separately and then averaged to obtain a single value for each additional feature (5 values total).
Median Absolute Deviation (MAD): The method uses the MAD to estimate the amount of noise in each pixel, which is directly related to the size of the filter. The method computes the MAD for each of the K additional features (5 values total).
Sampling rate: The method uses the inverse of the sampling rate as a secondary feature. The variance of MC noise decreases linearly with the number of samples and, therefore, the filter parameters should reflect this. Since the method includes training a single neural network, the neural network is capable of handling different sampling rates and adjusting the filter size accordingly.
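Several of the statistics listed above can be sketched on small blocks of pixel-averaged values. The function names are hypothetical, the Sobel kernel shown is the standard 3×3 operator, and which values are passed to each function (for example, which neighborhood is used for the MAD) is left open here.

```python
import math

def mean_deviation(values):
    """Average absolute difference between each value and the block mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return 0.5 * (ordered[mid - 1] + ordered[mid])

def median_absolute_deviation(values):
    """Median of the absolute differences from the median."""
    med = median(values)
    return median([abs(v - med) for v in values])

def sobel_magnitude(block):
    """Gradient magnitude at the center of a 3x3 block given as three rows."""
    gx = (block[0][2] + 2.0 * block[1][2] + block[2][2]) \
        - (block[0][0] + 2.0 * block[1][0] + block[2][0])
    gy = (block[2][0] + 2.0 * block[2][1] + block[2][2]) \
        - (block[0][0] + 2.0 * block[0][1] + block[0][2])
    return math.sqrt(gx * gx + gy * gy)

def inverse_sampling_rate(samples_per_pixel):
    return 1.0 / samples_per_pixel

# A 3x3 block with a single outlier, flattened row by row.
block_values = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 10.0]
deviation = mean_deviation(block_values)
mad = median_absolute_deviation(block_values)
edge = sobel_magnitude([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
```

The single outlier raises the mean deviation but leaves the MAD at zero, which illustrates why the two statistics carry different information.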
In one version of the system, the method computes a total of N=36 secondary features for each pixel. These secondary features are used as input to the neural network. The neural network outputs the parameters to be used by the filter to generate the final filtered pixel. The method does this for all the pixels to produce a final result.
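The total of 36 follows from the counts given for each group of secondary features, as the short tally below shows. The dictionary and function names are illustrative only.

```python
K = 5  # number of additional features

SECONDARY_COUNTS = {
    # mean and standard deviation, for the pixel and for the 7x7 block
    "feature_statistics": K * 2 * 2,
    "gradients": K,
    "mean_deviation": K,
    "median_absolute_deviation": K,
    "inverse_sampling_rate": 1,
}

def total_secondary_features():
    return sum(SECONDARY_COUNTS.values())
```

Here 20 + 5 + 5 + 5 + 1 = 36, matching N.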
Video Application
Although described herein regarding scene images, the method may be applied to frames of video. To handle video sequences, the existing neural network described herein may be used without retraining and the cross-bilateral filter may be extended to operate on 3-D spatio-temporal volumes. This modification to the filter is incorporated to reduce the flickering that might appear if each frame is independently filtered. In one version of the method, only three neighboring frames on each side of a current frame (7 frames total) were used for spatio-temporal filtering. The method generates high-quality, temporally-coherent videos from noisy input sequences with low sampling rates.
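The choice of neighboring frames can be sketched as follows. The function name temporal_window is hypothetical, and clamping the window at the start and end of the sequence is an assumption; the handling of sequence boundaries is not specified above.

```python
def temporal_window(frame_index, frame_count, radius=3):
    """Indices of the frames used when filtering frame_index.

    Up to `radius` frames on each side are included (7 frames for radius 3);
    the window is clamped at the sequence boundaries (an assumption).
    """
    first = max(0, frame_index - radius)
    last = min(frame_count - 1, frame_index + radius)
    return list(range(first, last + 1))

middle = temporal_window(10, 100)   # frames 7 through 13
start = temporal_window(0, 100)     # frames 0 through 3
```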
Closing Comments
Throughout this description, the embodiments and examples shown should be considered as exemplars, rather than limitations on the apparatus and procedures disclosed or claimed. Although many of the examples presented herein involve specific combinations of method acts or system elements, it should be understood that those acts and those elements may be combined in other ways to accomplish the same objectives. With regard to flowcharts, additional and fewer steps may be taken, and the steps as shown may be combined or further refined to achieve the methods described herein. Acts, elements and features discussed only in connection with one embodiment are not intended to be excluded from a similar role in other embodiments.
As used herein, “plurality” means two or more. As used herein, a “set” of items may include one or more of such items. As used herein, whether in the written description or the claims, the terms “comprising”, “including”, “carrying”, “having”, “containing”, “involving”, and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases “consisting of” and “consisting essentially of”, respectively, are closed or semi-closed transitional phrases with respect to claims. Use of ordinal terms such as “first”, “second”, “third”, etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements. As used herein, “and/or” means that the listed items are alternatives, but the alternatives also include any combination of the listed items.
This patent claims priority from Provisional Patent Application No. 62/155,104, filed Apr. 30, 2015, titled A LEARNING-BASED APPROACH FOR FILTERING MONTE CARLO NOISE which is included by reference in its entirety.
This invention was made with Government support under Grant (or Contract) Nos. IIS-1321168 and IIS-1342931 awarded by the National Science Foundation. The Government has certain rights in the invention.
Number | Date | Country
---|---|---
62155104 | Apr 2015 | US