Film grain and digital camera sensor noise are important characteristics of analog film and digital cameras. In modern digital movie production pipelines, noise often has to be removed for visual effects (VFX) compositing, as well as for optimized data compression. Then, in order to better portray the cinematographic aspect of a film, several post-processing operations are commonly applied to the digital content to add the noise back. A good noise modeling algorithm is necessary to retain the characteristics of the original noise distribution and to synthesize noise of the quality the artists intend.
However, existing methods are capable of synthesizing only limited types of noise. For example, the existing method in the AV1 codec produces repetitive noise patterns and lacks randomness in the color channels, which makes the synthesized noisy frame unrealistic and distracting. Furthermore, the existing methods cannot accurately model the distribution of the targeted noise source. In other words, the renoising process cannot reproduce the artistic intent of the video production. Also, the existing methods cannot provide artistic controls, such as synthesis of specific types of camera sensor noise, or noise synthesis for different camera settings (including ISO levels, shutter speed, and color temperature).
The following description contains specific information pertaining to implementations in the present disclosure. One skilled in the art will recognize that the present disclosure may be implemented in a manner different from that specifically discussed herein. The drawings in the present application and their accompanying detailed description are directed to merely exemplary implementations. Unless noted otherwise, like or corresponding elements among the figures may be indicated by like or corresponding reference numerals. Moreover, the drawings and illustrations in the present application are generally not to scale, and are not intended to correspond to actual relative dimensions.
As noted above, film grain noise and digital camera sensor noise (hereinafter “digital camera noise”) are important characteristics of analog film and digital cameras. As known in the art, digital camera noise manifests as random variations of brightness or color information in images produced by the image sensor and circuitry of a scanner or digital camera. Film grain noise, by contrast, arises from the random optical texture of processed photographic film due to the presence of small particles of metallic silver, or dye clouds, and results in an optical effect that can be noticeable. It is noted that not only are the respective sources of digital camera noise and film grain noise different, but their respective statistical distributions are typically different as well.
In modern digital movie production pipelines, noise often has to be removed for visual effects (VFX) compositing, as well as for optimized data compression. Then, in order to better portray the cinematographic aspect of a film or other video content, several post-processing operations are commonly applied to the digital content to add the noise back. A good noise modeling algorithm is necessary to retain the characteristics of the original noise distribution and to synthesize noise of the quality the artists intend.
As further noted above, existing methods are capable of synthesizing only limited types of noise. For example, the existing method in the AV1 codec produces repetitive noise patterns and lacks randomness in the color channels, which makes the synthesized noisy frame unrealistic and distracting. Furthermore, the existing methods cannot accurately model the distribution of the targeted noise source. In other words, the renoising process cannot reproduce the artistic intent of the film production. Also, the existing methods cannot provide artistic controls, such as synthesis of specific types of camera sensor noise, or noise synthesis for different camera settings (including ISO levels, shutter speed, and color temperature).
The present disclosure provides a deep learning method for digital camera noise modeling and synthesis in the raw Red-Green-Blue (raw-RGB) and standard RGB (sRGB) color spaces, as well as film grain image noise modeling and synthesis. With respect to the distinction between raw-RGB images and sRGB images, it is noted that a raw-RGB image represents a scene-referenced digital camera sensor image having RGB values that are specific to the color sensitivities of the color filter array of the sensor, while an sRGB image represents a display-referenced image that has been rendered through the image signal processor (ISP) of the digital camera. The image noise modeling and synthesis solution disclosed in the present application includes at least three key features that advance the existing art: One key feature of the present method is the use of a deep generative model with noise injection into the network for sensor noise/film grain modeling in raw-RGB and sRGB color spaces. Another key feature is an artistic control mechanism designed for generating different types of sensor noise under different camera settings. A further key feature is a generic network design that can be used for multi-camera sensor noise modeling and film grain synthesis.
Moreover, experimental results indicate that, without requiring extensive manual adjustment, the synthesized noisy frames produced by the present image noise modeling and synthesis solution contain visually realistic camera noise or film grain noise without the aforementioned issues present in existing tools. Quantitatively, the present solution achieves superior synthesis results, as measured by the Kullback-Leibler (KL) divergence between the synthesized noise and the ground truth noise.
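The KL-divergence evaluation can be illustrated with a short sketch. The following Python snippet is merely illustrative and not part of the disclosed system; it estimates KL(real || synthesized) from empirical histograms of per-pixel noise values, where the bin count and epsilon are assumed values:

    import numpy as np

    def noise_kl_divergence(real_noise, synth_noise, bins=256, eps=1e-10):
        """Approximate KL(real || synthesized) from per-pixel noise samples."""
        lo = min(real_noise.min(), synth_noise.min())
        hi = max(real_noise.max(), synth_noise.max())
        p, _ = np.histogram(real_noise, bins=bins, range=(lo, hi))
        q, _ = np.histogram(synth_noise, bins=bins, range=(lo, hi))
        p = p / p.sum() + eps   # normalize to probability mass; avoid log(0)
        q = q / q.sum() + eps
        return float(np.sum(p * np.log(p / q)))

A lower divergence indicates that the synthesized noise distribution more closely matches the ground truth noise distribution.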
The present image noise modeling and synthesis solution may automatically synthesize different types of digital camera noise or film grain noise with specific artistic intents, which gives artists more freedom during the image processing passes in a movie production pipeline. As used in the present application, the terms “automation,” “automated,” “automating,” and “automatically” refer to systems and processes that do not require the participation of a human system operator. Although, in some implementations, a system operator or administrator may review or even adjust the performance of the automated systems operating according to the automated methods described herein, that human involvement is optional. Thus, the methods described in the present application may be performed under the control of hardware processing components of the disclosed automated systems.
The present image noise modeling and synthesis solution solves the problems encountered in existing approaches, such as the requirement of prior knowledge of what camera noise or film grain noise looks like for the purpose of manually tuning parameters, the tedious manual process of renoising each shot, a limited number of noise types, unrealistic or repetitive noise patterns, and a lack of randomness in the temporal domain, while enhancing artistic control.
In practice, a dataset with clean-noisy image pairs is used. The machine learning model implemented as part of the present image noise modeling and synthesis solution predicts the residual image (the noisy image minus its clean counterpart), which is denoted herein as the “noise map.” The separation of the noise signal and the clean image signal makes it much easier for the discriminator model to judge whether a sample belongs to the real distribution. In addition to the clean image, noise settings in the case of digital camera noise, e.g., camera brand, ISO sensitivity, shutter speed, and the like, or film grain settings in the case of film grain noise, e.g., grain size, grain amount, and the like, can be fed to the generator as control information. The present image noise modeling and synthesis solution introduces the concept of injecting noise into the generator to imitate the stochastic property of real noise.
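By way of illustration, the data preparation described above can be sketched in a few lines of PyTorch. This is a minimal sketch under assumed conventions (image tensors in [0, 1], a hypothetical ISO normalization constant), not the disclosed implementation:

    import torch

    def make_training_example(clean, noisy, iso, shutter_speed):
        """clean, noisy: (3, H, W) tensors in [0, 1]."""
        noise_map = noisy - clean                    # residual the generator must model
        _, h, w = clean.shape
        control = torch.stack([                      # per-pixel control channels ("CM")
            torch.full((h, w), iso / 12800.0),       # hypothetical ISO normalization
            torch.full((h, w), float(shutter_speed)),
        ])
        return clean, control, noise_map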
As defined in the present application, the expression “machine learning model” (hereinafter “ML model”) may refer to a mathematical model for making future predictions based on patterns learned from samples of data or “training data.” For example, ML models may be trained to perform image processing, natural language understanding (NLU), and other inferential data processing tasks. Various learning algorithms can be used to map correlations between input data and output data. These correlations form the mathematical model that can be used to make future predictions on new input data. Such a predictive model may include one or more logistic regression models, Bayesian models, or artificial neural networks (NNs). A “deep neural network,” in the context of deep learning, may refer to a NN that utilizes multiple hidden layers between input and output layers, which may allow for learning based on features not explicitly defined in raw data. As used in the present application, a feature identified as a NN refers to a deep neural network.
As further shown in FIG. 1, system 100 includes computing platform 102 having hardware processor 104 and system memory 106 storing software code 110 and ML model(s) 120, as well as user system 128 including display 129, which may interact with computing platform 102 via communication network 108.
Although the present application refers to software code 110 and ML model(s) 120 as being stored in system memory 106 for conceptual clarity, more generally, system memory 106 may take the form of any computer-readable non-transitory storage medium. The expression “computer-readable non-transitory storage medium,” as used in the present application, refers to any medium, excluding a carrier wave or other transitory signal that provides instructions to hardware processor 104 of computing platform 102. Thus, a computer-readable non-transitory storage medium may correspond to various types of media, such as volatile media and non-volatile media, for example. Volatile media may include dynamic memory, such as dynamic random access memory (dynamic RAM), while non-volatile memory may include optical, magnetic, or electrostatic storage devices. Common forms of computer-readable non-transitory storage media include, for example, optical discs such as DVDs, RAM, programmable read-only memory (PROM), erasable PROM (EPROM), and FLASH memory.
Moreover, in some implementations, system 100 may utilize a decentralized secure digital ledger in addition to, or in place of, system memory 106. Examples of such decentralized secure digital ledgers may include a blockchain, hashgraph, directed acyclic graph (DAG), and Holochain® ledger, to name a few. In use cases in which the decentralized secure digital ledger is a blockchain ledger, it may be advantageous or desirable for the decentralized secure digital ledger to utilize a consensus mechanism having a proof-of-stake (PoS) protocol, rather than the more energy intensive proof-of-work (PoW) protocol.
Although FIG. 1 depicts software code 110 and ML model(s) 120 as being stored together in system memory 106, that representation is merely provided as an aid to conceptual clarity.
Hardware processor 104 may include multiple hardware processing units, such as one or more central processing units, one or more graphics processing units, one or more tensor processing units, one or more field-programmable gate arrays (FPGAs), custom hardware for machine-learning training or inferencing, and an application programming interface (API) server, for example. By way of definition, as used in the present application, the terms “central processing unit” (CPU), “graphics processing unit” (GPU), and “tensor processing unit” (TPU) have their customary meaning in the art. That is to say, a CPU includes an Arithmetic Logic Unit (ALU) for carrying out the arithmetic and logical operations of computing platform 102, as well as a Control Unit (CU) for retrieving programs, such as software code 110, from system memory 106, while a GPU may be implemented to reduce the processing overhead of the CPU by performing computationally intensive graphics or other processing tasks. A TPU is an application-specific integrated circuit (ASIC) configured specifically for artificial intelligence (AI) processes such as machine learning.
In some implementations, computing platform 102 may correspond to one or more web servers accessible over a packet-switched network such as the Internet, for example. Alternatively, computing platform 102 may correspond to one or more computer servers supporting a wide area network (WAN), a local area network (LAN), or included in another type of private or limited distribution network. In addition, or alternatively, in some implementations, system 100 may utilize a local area broadcast method, such as User Datagram Protocol (UDP) or Bluetooth, for instance. Furthermore, in some implementations, system 100 may be implemented virtually, such as in a data center. For example, in some implementations, system 100 may be implemented in software, or as virtual machines. Moreover, in some implementations, communication network 108 may be a high-speed network suitable for high performance computing (HPC), for example a 10 GigE network or an Infiniband network.
It is further noted that, although user system 128 is shown as a desktop computer in FIG. 1, that representation is provided merely as an example. In other implementations, user system 128 may take the form of a smartphone, laptop computer, or tablet computer, for example.
It is also noted that display 129 of user system 128 may take the form of a liquid crystal display (LCD), a light-emitting diode (LED) display, an organic light-emitting diode (OLED) display, a quantum dot (QD) display, or any other suitable display screen that performs a physical transformation of signals to light. Furthermore, display 129 may be physically integrated with user system 128 or may be communicatively coupled to but physically separate from user system 128. For example, where user system 128 is implemented as a smartphone, laptop computer, or tablet computer, display 129 will typically be integrated with user system 128. By contrast, where user system 128 is implemented as a desktop computer, display 129 may take the form of a monitor separate from user system 128 in the form of a computer tower.
At least one m-channel Gaussian noise map 238 is then concatenated to the latent bottleneck features (the output of encoder block 237) as input to decoder block 240 of ML model 230, and acts as the source of the synthesized noise. The noise values are sampled independently at each three-dimensional (3D) position, with a standard deviation of one. Next, decoder blocks 242, with noise injections inside each decoder block 242 provided by skip connections (as further shown in FIG. 3), decode those features into synthesized noise map 224.
It is noted that synthesized noise map 224, in FIG. 2, corresponds in general to synthesized noise map 124, in FIG. 1, and those corresponding features may share any of the characteristics attributed to either feature by the present disclosure.
Thus, as shown by FIG. 2, ML model 230 generates synthesized noise map 224 using clean image 212, noise setting(s) 216, and injected Gaussian noise.
Each encoder block 236 and 237 used in ML model 230 may be implemented as a sequence of simplified Nonlinear Activation Free (NAF) blocks (hereinafter “SNAF blocks”).
Each decoder block 240 and 242 used in ML model 230 may be implemented as a sequence of SNAF blocks receiving noise injections 344, as further shown in FIG. 3.
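Although the exact SNAF block composition is defined by FIG. 3, the general pattern of an activation-free gated convolution combined with a learned noise injection can be sketched as follows. The layer sizes, gating mechanism, and noise scaling shown here are illustrative assumptions, not the disclosed architecture:

    import torch
    import torch.nn as nn

    class SimpleGate(nn.Module):
        def forward(self, x):
            a, b = x.chunk(2, dim=1)       # split channels and multiply: no ReLU/GELU
            return a * b

    class DecoderBlockWithNoise(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv = nn.Conv2d(channels, channels * 2, 3, padding=1)
            self.gate = SimpleGate()
            self.proj = nn.Conv2d(channels, channels, 1)
            self.noise_scale = nn.Parameter(torch.zeros(1, channels, 1, 1))

        def forward(self, x):
            y = self.proj(self.gate(self.conv(x)))
            noise = torch.randn_like(y)    # unit-std noise, independent per 3D position
            return x + y + self.noise_scale * noise   # residual path plus noise injection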
Referring to FIG. 4, FIG. 4 shows exemplary conditional discriminator 450 suitable for use in training ML model 230, according to one implementation.
In addition to conditional discriminator 450, FIG. 4 shows clean image 412, real noise map 422, synthesized noise map 424, and noise setting(s) 416, as well as critic score 452 provided as an output by conditional discriminator 450.
The functionality of exemplary conditional discriminator 450 depicted in FIG. 4 may be expressed as:
p = D(n*, I, CM),  (Equation 1)
where n* represents either real (ñ) noise map 122/422 or synthesized (n̂) noise map 124/224/424, I is clean image 112/212/412, D is conditional discriminator 450, CM is the control information provided by noise setting(s) 116/216/416 (hereinafter control map “CM”), and p is critic score 452. Real noise map 122/422 or synthesized noise map 124/224/424, along with clean image 112/212/412 and noise setting(s) 116/216/416, are fed to conditional discriminator 450 symmetrically, so that conditional discriminator 450 only needs to differentiate the conditional distribution.
Hardware processor 104 of system 100 may execute software code 110 to train ML model 230 based on critic scores 452 output by conditional discriminator 450. For example, hardware processor 104 may execute software code 110 to provide a test input to conditional discriminator 450, the test input being one of: (i) clean image 112/212/412, synthesized noise map 124/224/424 generated by ML model 230, and noise setting(s) 116/216/416, or (ii) clean image 112/212/412, real noise map 122/422 derived from noisy image 114, and noise setting(s) 116/216/416. Hardware processor 104 can then execute software code 110 to determine, using conditional discriminator 450, whether the test input includes real noise map 122/422 or synthesized noise map 124/224/424, and train ML model 230 on that basis until conditional discriminator 450 is consistently fooled by the synthesized noise generated by ML model 230.
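A minimal PyTorch-style sketch of this alternation is given below. The module interfaces (a generator taking a clean image, control map, and Gaussian noise, and a discriminator taking a noise map, clean image, and control map) follow Equations 1 and 2 below, while all other details are assumptions:

    import torch

    def critic_scores(D, G, clean, real_noise_map, control):
        z = torch.randn_like(real_noise_map)         # unit-std Gaussian source
        synth_noise_map = G(clean, control, z)       # test input (i): synthesized noise
        p_fake = D(synth_noise_map, clean, control)  # critic score, per Equation 1
        p_real = D(real_noise_map, clean, control)   # test input (ii): real noise
        return p_real, p_fake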
By way of overview of an image noise modeling and synthesizing process, the generation of one image noise sample can be formulated as:
n̂ = G(I, CM, n),  (Equation 2)

where n̂ is synthesized noise map 124/224/424, G is ML model 230, and n is the Gaussian noise with a standard deviation of one.
The present image noise modeling and synthesis solution uses the standard conditional generative adversarial network (GAN) loss as the discriminator loss and adversarial loss, which can be described by Equation 3 as:

min_G max_D E_(ñ, I, CM)~Pdata(ñ, I, CM)[log D(ñ, I, CM)] + E_n~Pg(n)[log(1 − D(G(I, CM, n), I, CM))],  (Equation 3)

where Pdata(ñ, I, CM) represents sampling one clean-noisy image pair and its corresponding settings, and Pg(n) represents sampling the Gaussian noise maps at the transition point and in all decoder blocks. More specifically, Squared Error is used as the loss on critic scores 452, which can be described as:
ℒ_disc = [(1 − D(ñ))² + D(n̂)²] / 2,  (Equation 4)

ℒ_adv = (1 − D(n̂))².  (Equation 5)
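Written directly in code, these squared-error losses take the following form, where p_real = D(ñ, I, CM) and p_fake = D(n̂, I, CM) are the critic scores returned for real and synthesized noise maps. The snippet is a sketch in PyTorch for illustration only:

    import torch

    def disc_loss(p_real, p_fake):
        return 0.5 * (((1 - p_real) ** 2) + (p_fake ** 2)).mean()   # Equation 4

    def adv_loss(p_fake):
        return ((1 - p_fake) ** 2).mean()                           # Equation 5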
In addition to the use of standard conditional GAN loss, the present novel and inventive image noise modeling and synthesis solution introduces the use of style loss in noise modeling. Not only is the training process for ML model 230 stabilized by the addition of style loss, but the pattern of synthesized noise is advantageously closer to the pattern of the real noise without sacrificing the randomness.
Style loss has been used in Style Transfer tasks to measure the distance of two images in style space. Because style loss computes the distance between the Gram Matrices of two features, rather than the naive Euclidean distance between features, such supervision will not reduce the variation of synthesized noise. It is noted that minimizing the distance between Gram Matrices can be interpreted as a Maximum Mean Discrepancy (MMD) minimization process with the second-order polynomial kernel, which measures the distance between two distributions through the samples:
MMD²(n̂, ñ) = (1/M²)[Σ_i,j k(n̂_i, n̂_j) + Σ_i,j k(ñ_i, ñ_j) − 2 Σ_i,j k(n̂_i, ñ_j)],  (Equation 6)

where k(x, y) = (xᵀy)² is the second-order polynomial kernel, M is the number of pixels in one patch, n̂ represents the synthesized noise, and ñ represents the real noise. As such, the style loss effectively reasons about statistical properties, which makes it a promising candidate to supervise the noise generation process.
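The equivalence between the Gram Matrix distance and the polynomial-kernel MMD can be verified numerically. The following self-contained check, with illustrative feature shapes, confirms that the squared Frobenius distance between Gram Matrices equals the unnormalized (M²-scaled) MMD² under k(x, y) = (xᵀy)²:

    import numpy as np

    rng = np.random.default_rng(0)
    M, C = 64, 8                        # M samples (pixels), C feature channels
    F_hat = rng.normal(size=(M, C))     # stand-in "synthesized" features
    F_tld = rng.normal(size=(M, C))     # stand-in "real" features

    gram_dist = np.linalg.norm(F_hat.T @ F_hat - F_tld.T @ F_tld, "fro") ** 2

    k = lambda A, B: (A @ B.T) ** 2     # second-order polynomial kernel
    mmd2_scaled = (k(F_hat, F_hat).sum() + k(F_tld, F_tld).sum()
                   - 2 * k(F_hat, F_tld).sum())

    assert np.isclose(gram_dist, mmd2_scaled)   # equal up to floating-point error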
A pre-trained Visual Geometry Group Network (VGG-Network) can be used to extract the perceptual features of real noise map 122/422 and synthesized noise map 124/224/424 and compute the average style loss with uniform weights over different layers, which can be formulated as:
ℒ_style = (1/L) Σ_(l=1..L) MSE(Gram_l(n̂), Gram_l(ñ)),  (Equation 7)

where L is the number of layers, MSE(·) is the Mean Squared Error loss, and Gram_l is the Gram Matrix of the feature in layer l. Thus, ML model 230 may be trained using style loss to predict image noise.
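A hedged sketch of this computation follows, assuming torchvision's pre-trained VGG-19 as the feature extractor and an arbitrary choice of layers; the disclosure specifies only a pre-trained VGG-Network, so both assumptions are illustrative:

    import torch
    import torch.nn.functional as F
    from torchvision.models import vgg19

    VGG = vgg19(weights="DEFAULT").features.eval()
    for p in VGG.parameters():
        p.requires_grad_(False)          # the extractor is frozen during training
    LAYERS = [3, 8, 17, 26]              # assumed ReLU activation indices

    def gram(feat):
        b, c, h, w = feat.shape
        f = feat.reshape(b, c, h * w)
        return f @ f.transpose(1, 2) / (h * w)

    def style_loss(synth_map, real_map):
        loss, x, y = 0.0, synth_map, real_map
        for i, layer in enumerate(VGG):
            x, y = layer(x), layer(y)
            if i in LAYERS:              # Equation 7: average MSE of Gram Matrices
                loss = loss + F.mse_loss(gram(x), gram(y))
            if i == LAYERS[-1]:
                break
        return loss / len(LAYERS)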
The functionality of system 100 including software code 110 and ML model 230, shown in FIGS. 1 and 2, will be further described below by reference to FIG. 5, which shows flowchart 560 presenting an exemplary method for performing ML model-based image noise synthesis.
Referring to FIG. 5 in combination with FIGS. 1 and 2, flowchart 560 begins with receiving clean image 112/212 (action 561).
Continuing to refer to FIG. 5 in combination with FIGS. 1 and 2, flowchart 560 further includes receiving noise setting(s) 116/216 (action 562).
Continuing to refer to FIG. 5 in combination with FIGS. 1 and 2, flowchart 560 further includes generating synthesized noise map 124/224, using ML model 230, based on clean image 112/212 and noise setting(s) 116/216 (action 563).
As noted above, in various use cases, the noise included in noisy image 114 may include digital camera noise or film grain noise. In use cases in which the noise included in noisy image 114 is digital camera noise, as well as those in which it is film grain noise, synthesized noise map 124/224 may be generated in one of raw-RGB color space or sRGB color space, depending on whether the real noise was generated prior to processing of noisy image 114 by an ISP or after processing by the ISP, respectively. It is noted that the ISP typically introduces non-linear transformations and spatial mixing to the raw data. Therefore, the noise distributions in raw-RGB color space and sRGB color space deviate from each other. Although most existing approaches to noise modeling focus on one or the other of raw-RGB color space or sRGB color space, the present image noise modeling and synthesis solution can advantageously generalize to both. Generation of synthesized noise map 124/224 using ML model 230, in action 563, may be performed by software code 110, executed by hardware processor 104 of system 100.
Continuing to refer to FIG. 5 in combination with FIGS. 1 and 2, in some implementations flowchart 560 may conclude with applying synthesized noise map 124/224 to clean image 112/212 to produce a synthesized noisy image (action 564).
With respect to the actions described in flowchart 560, it is noted that actions 561, 562, and 563, or actions 561, 562, 563, and optional action 564, may be performed in a substantially automated process from which human involvement can be omitted.
The present approach to noise modeling and synthesis can also be adapted to provide a neural network model that can synthesize different types of noise with controllable parameters (hereinafter “noise parameter conditioning”). According to this approach, signal or content dependent parameters (hereinafter “content dependent parameters”) are distinguished from signal or content independent parameters (hereinafter “content independent parameters”), each of which can be described in generic terms. As an example, it is possible to describe those parameters using the weights of the heteroscedastic noise model. It is noted that content dependency in this use case means that the neural network model needs to condition the noise synthesis on the image. For example, the intensity of the noise might be inversely proportional to the color intensity values. By contrast, content independent parameters are unrelated to characteristics of the content and can be used to synthesize noise values independently of the content.
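As an illustration of such a parameterization, the classic heteroscedastic model makes the noise variance an affine function of the clean signal intensity, so that the signal-proportional weight is content dependent while the constant floor is content independent. The weight values in this sketch are hypothetical:

    import torch

    def heteroscedastic_noise(clean, a, b):
        """Sample noise with per-pixel variance a * intensity + b."""
        sigma = torch.sqrt(torch.clamp(a * clean + b, min=0.0))
        return torch.randn_like(clean) * sigma    # content dependent through sigma(clean)

    clean = torch.rand(3, 64, 64)                 # stand-in clean image in [0, 1]
    noisy = clean + heteroscedastic_noise(clean, a=0.01, b=1e-4)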
Referring to FIG. 6, for the content dependent parameters, original image 684 is processed, together with those parameters, by a neural network to produce content dependent branch output 681.
For the content independent parameters, random values are sampled following a normal distribution. The values are then processed by neural network 682. Finally, the respective content dependent branch output 681 and the content independent branch output 683 are summed and added to original image 684. To train the model, GAN training may be utilized, where the objective is to learn a noise generator that produces image noise that matches the distribution of noise in ground truth noisy images 686. Depending on the exact training setting and dataset, discriminator 650 may be conditioned on the noise parameters. It is noted that discriminator 650 may correspond in general to conditional discriminator 450, in FIG. 4, and may share any of the characteristics attributed to that feature by the present disclosure.
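The two-branch structure described above can be sketched as follows; the convolutional shapes and the way parameters are broadcast into the content dependent branch are illustrative assumptions rather than the architecture of FIG. 6:

    import torch
    import torch.nn as nn

    class TwoBranchNoiseGenerator(nn.Module):
        def __init__(self, n_params):
            super().__init__()
            self.dep = nn.Conv2d(3 + n_params, 3, 3, padding=1)   # content dependent branch
            self.indep = nn.Conv2d(3, 3, 3, padding=1)            # content independent branch

        def forward(self, image, params):
            b, _, h, w = image.shape
            p = params.view(b, -1, 1, 1).expand(b, params.shape[1], h, w)
            dep_out = self.dep(torch.cat([image, p], dim=1))      # cf. branch output 681
            z = torch.randn(b, 3, h, w, device=image.device)      # normally distributed samples
            indep_out = self.indep(z)                             # cf. branch output 683
            return image + dep_out + indep_out                    # summed, added to the image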
As yet another variant application, focused on synthesizing film grain noise, a Style-based Wavelet-driven GAN (SWAGAN) model for which the generative process is conditioned on a clean image may be utilized. In the case of the generator, the wavelet transform of the downscaled clean image may be concatenated at points A, B, and C of the SWAGAN model, as shown by diagram 700 in FIG. 7.
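The conditioning signal itself can be computed with a single-level Haar wavelet transform, sketched below. The downscale factor and the plain Haar decomposition are assumptions standing in for the wavelet transform actually used by the SWAGAN model:

    import torch
    import torch.nn.functional as F

    def haar_subbands(clean, scale=2):
        """clean: (B, 3, H, W) -> (B, 12, H/(2*scale), W/(2*scale)) sub-band stack."""
        x = F.avg_pool2d(clean, scale)            # downscale the clean image
        a = x[..., 0::2, 0::2]; b = x[..., 0::2, 1::2]
        c = x[..., 1::2, 0::2]; d = x[..., 1::2, 1::2]
        ll = (a + b + c + d) / 2                  # low-pass approximation
        lh = (a - b + c - d) / 2                  # column-difference detail
        hl = (a + b - c - d) / 2                  # row-difference detail
        hh = (a - b - c + d) / 2                  # diagonal detail
        return torch.cat([ll, lh, hl, hh], dim=1) # ready to concatenate at A, B, or C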
Thus, the present application discloses systems and methods for performing ML model-based image noise synthesis that address and overcome the deficiencies in the conventional art. The present image noise synthesis solution advances the state-of-the-art in several ways, including providing a generic network architecture and solution for digital camera noise modeling and synthesis in raw-RGB color space or sRGB color space, film grain noise modeling and synthesis, and accurate noise modeling that achieves quantitatively and qualitatively superior performance for different sensors and noise settings. In addition, the present image noise modeling and synthesis solution advantageously enhances artistic control by providing control maps that give artists a way to control the synthesized digital camera noise and film grain noise in accord with their creative intent. Moreover, the present solution may be made available in multiple forms for movie production, including command-line interfaces running natively or in proprietary containers, or as plugins.
From the above description it is manifest that various techniques can be used for implementing the concepts described in the present application without departing from the scope of those concepts. Moreover, while the concepts have been described with specific reference to certain implementations, a person of ordinary skill in the art would recognize that changes can be made in form and detail without departing from the scope of those concepts. As such, the described implementations are to be considered in all respects as illustrative and not restrictive. It should also be understood that the present application is not limited to the particular implementations described herein, but many rearrangements, modifications, and substitutions are possible without departing from the scope of the present disclosure.
The present application claims the benefit of and priority to pending U.S. Provisional Patent Application Ser. No. 63/424,368, filed on Nov. 10, 2022, and titled “Machine Learning Techniques For Camera Sensor Noise and Film Grain Modeling,” which is hereby incorporated fully by reference into the present application.