This disclosure relates to methods and devices for providing coordinate-based single-camera self-supervision and/or multi-camera supervision for burst demosaicing and denoising.
Demosaicing and denoising may be used to obtain high quality color images from related art cameras. Multi-frame approaches may leverage image bursts to provide improved quality over single-frame methods. Training models in a supervised manner may be challenging because paired data may be difficult to obtain. Self-supervised methods can directly leverage bursts to avoid the difficulty of capturing paired data by formulating a reconstruction task with a held-out target frame. However, this may rely on standard interpolation methods, which are bandlimited, to align grid-based model outputs with the target frame, and may cause undesired artifacts in the output. In supervised multi-camera setups, a mobile phone camera is paired with a high quality DSLR/mirrorless camera capturing the same object from a slightly different viewpoint. This high-quality image serves as a target for enhancing the mobile camera image, but first needs to be aligned spatially and in color with the mobile phone image.
Due to the small sensor size of mobile phone cameras, many mobile photography modes use burst photography. In burst photography, many images are captured in quick succession (e.g., a burst) and are automatically combined into a single enhanced image. This technology may be used for low-light photography, digital zoom, and other applications. Additionally, deep learning solutions may be used for image enhancement, including for burst photography. However, it may be difficult to obtain paired training data for training burst photography models. The data may need to contain realistic handheld bursts and an aligned target image.
One or more embodiments of the present application address the above and/or other aspects of demosaicing and denoising by modifying a burst model to output a coordinate-based image representation using single-camera self-supervision and/or multi-camera supervision. This enables the model to learn an improved interpolation of the target image during burst processing network training (i.e., implicit interpolation), resulting in improved network output and a reduction in the artifacts that arise from explicit interpolation.
According to one or more embodiments, a method performed by at least one processor of an electronic device includes: obtaining a plurality of images using an image sensor of the electronic device; obtaining synthetic training data for pre-training a burst processing network; performing implicit interpolation on the obtained plurality of images based on the pre-trained burst processing network; combining the plurality of images into a target image based on the implicit interpolation; and outputting the target image to a display of the electronic device.
According to one or more embodiments, an electronic device including: an image sensor; a display; a memory configured to store instructions; and at least one processor configured to execute the instructions to cause the electronic device to: obtain a plurality of images using the image sensor of the electronic device; obtain synthetic training data for pre-training a burst processing network; perform implicit interpolation on the obtained plurality of images based on the pre-trained burst processing network; combine the plurality of images into a target image based on the implicit interpolation; and output the target image to the display of the electronic device.
According to one or more embodiments, a non-transitory computer readable medium having instructions stored therein, which when executed by a processor cause the processor to execute a method including: obtaining a plurality of images using an image sensor of an electronic device; obtaining synthetic training data for pre-training a burst processing network; performing implicit interpolation on the obtained plurality of images based on the pre-trained burst processing network; combining the plurality of images into a target image based on the implicit interpolation; and outputting the target image to a display of the electronic device.
Further features, the nature, and various aspects of the disclosed subject matter will be more apparent from the following detailed description and the accompanying drawings in which:
The following detailed description of example embodiments refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements.
The foregoing disclosure provides illustration and description, but is not intended to be exhaustive or to limit the implementations to the precise form disclosed. Modifications and variations are possible in light of the above disclosure or may be acquired from practice of the implementations. Further, one or more features or components of one embodiment may be incorporated into or combined with another embodiment (or one or more features of another embodiment). Additionally, in the flowcharts and descriptions of operations provided below, it is understood that one or more operations may be omitted, one or more operations may be added, one or more operations may be performed simultaneously (at least in part), and the order of one or more operations may be switched.
It will be apparent that systems and/or methods, described herein, may be implemented in different forms of hardware or firmware. The actual specialized control hardware used to implement these systems and/or methods is not limiting of the implementations.
Even though particular combinations of features are recited in the claims and/or disclosed in the specification, these combinations are not intended to limit the disclosure of possible implementations. In fact, many of these features may be combined in ways not specifically recited in the claims and/or disclosed in the specification. Although each dependent claim listed below may directly depend on only one claim, the disclosure of possible implementations includes each dependent claim in combination with every other claim in the claim set.
No element, act, or instruction used herein should be construed as critical or essential unless explicitly described as such. Also, as used herein, the articles “a” and “an” are intended to include one or more items, and may be used interchangeably with “one or more.” Where only one item is intended, the term “one” or similar language is used. Also, as used herein, the terms “has,” “have,” “having,” “include,” “including,” or the like are intended to be open-ended terms. Further, the phrase “based on” is intended to mean “based, at least in part, on” unless explicitly stated otherwise. Furthermore, expressions such as “at least one of [A] and [B]” or “at least one of [A] or [B]” are to be understood as including only A, only B, or both A and B.
Reference throughout this specification to “one embodiment,” “an embodiment,” or similar language means that a particular feature, structure, or characteristic described in connection with the indicated embodiment is included in at least one embodiment of the present solution. Thus, the phrases “in one embodiment”, “in an embodiment,” and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
Furthermore, the described features, aspects, and characteristics of the present disclosure may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize, in light of the description herein, that the present disclosure may be practiced without one or more of the specific features or aspects of a particular embodiment. In other instances, additional features and aspects may be recognized in certain embodiments that may not be present in all embodiments of the present disclosure.
Demosaicing may refer to a process in digital photography and digital image processing in which the colors of a digital image are reconstructed from image sensor data. According to one or more embodiments, demosaicing refers to the task of reconstructing full color images from sensors that use color filter arrays (CFAs), such as a Bayer filter. The image sensors may spatially subsample multiple color channels. A Bayer pattern may be used, which spatially tiles a 2×2 red-green-green-blue (RGGB) pattern over the sensor. To obtain a full RGB image, demosaicing methods may be used to interpolate the remaining two thirds of the image from the available observations. Related art methods have relied on learned image priors to address the problem of demosaicing from a single image. Multi-frame methods may leverage bursts of frames to reconstruct images with higher quality than is possible with single-frame methods. Bursts may be captured using handheld devices, and the frames may differ by non-integer pixel displacements caused by the user's hand motion. These misalignments allow each frame to provide color samples in regions that are missed by the other frames in the burst. Provided that the alignments between the frames can be estimated, bursts can provide an additional signal for achieving accurate reconstructions of a full color image.
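As a non-limiting illustration of the single-frame case described above, the following sketch mosaics an image with a 2×2 RGGB Bayer pattern and fills in the missing two thirds of the samples with a simple bilinear (normalized-convolution) interpolation. The function names, kernel weights, and array shapes are illustrative assumptions rather than part of any embodiment.

```python
import numpy as np
from scipy.ndimage import convolve

def bayer_mosaic(rgb):
    """Subsample an H x W x 3 image with a 2x2 RGGB Bayer pattern."""
    h, w, _ = rgb.shape
    mosaic = np.zeros((h, w), dtype=rgb.dtype)
    mask = np.zeros((h, w, 3), dtype=bool)
    mask[0::2, 0::2, 0] = True   # R at top-left of each 2x2 tile
    mask[0::2, 1::2, 1] = True   # G
    mask[1::2, 0::2, 1] = True   # G
    mask[1::2, 1::2, 2] = True   # B at bottom-right
    for c in range(3):
        mosaic[mask[..., c]] = rgb[..., c][mask[..., c]]
    return mosaic, mask

def bilinear_demosaic(mosaic, mask):
    """Fill missing samples per channel by normalized convolution; keep observed samples."""
    kernel = np.array([[0.25, 0.5, 0.25],
                       [0.5,  1.0, 0.5],
                       [0.25, 0.5, 0.25]])
    out = np.zeros(mask.shape)
    for c in range(3):
        samples = np.where(mask[..., c], mosaic, 0.0)
        weights = convolve(mask[..., c].astype(float), kernel, mode="mirror")
        filled = convolve(samples, kernel, mode="mirror") / np.maximum(weights, 1e-8)
        out[..., c] = np.where(mask[..., c], mosaic, filled)
    return out

rgb = np.random.rand(8, 8, 3)        # stand-in for a full-color frame
mosaic, mask = bayer_mosaic(rgb)
recovered = bilinear_demosaic(mosaic, mask)
```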
According to one or more embodiments, learning-based methods for demosaicing can be trained with full supervision, but this may rely on access to paired data with demosaiced ground truth images. Synthetic data may limit the generalization of models to real images. Related art camera sensors rely on CFAs and thus may not directly provide demosaiced images. Multi-camera datasets are simpler to obtain, but supervised multi-camera training may need to address spatial misalignments and color space mismatches between the cameras. These differences between the ground truth and the burst should be accounted for during training using spatial and color alignment methods, but such methods may leak information about the camera used to capture the ground truth into the model.
According to one or more embodiments, a learned model may otherwise learn to produce images that are misaligned with the reference frame and that have different colors than the sensor used to capture the burst. Self-supervised approaches for demosaicing and denoising may rely only on noisy raw bursts, which are easy to capture. Additionally, the model produces images in the same color space as the inputs, and can jointly perform denoising under certain statistical assumptions. This may be formulated as a reconstruction task of a held-out target frame from the burst, which may require alignment. Because the aligned pixel positions of the model output often do not lie directly on the pixel grid of the target frame, interpolation methods are employed to obtain sub-pixel color values, according to one or more embodiments. However, interpolation methods may introduce artifacts into the reconstructions due to the bandlimited nature of grid-based images.
A description of example embodiments is provided on the following pages. The text and figures are provided solely as examples to aid the reader in understanding the invention. They are not intended and are not to be construed as limiting the scope of this invention in any manner. Although embodiments and examples have been provided, it will be apparent to those skilled in the art based on the disclosures herein that changes in the embodiments and examples shown may be made without departing from the scope of this invention.
The device 100 may be a smart phone, tablet, laptop, personal computer, etc. For example, embodiments may include a smart phone capable of capturing images. The device 100 may also be smart glasses and/or an augmented and virtual reality (AR/VR) headset (having a camera).
The bus 110 includes a component that permits communication among the components of the device 100. The processor 120 is implemented in hardware, firmware, or a combination of hardware and software. The processor 120 is a central processing unit (CPU), a graphics processing unit (GPU), an accelerated processing unit (APU), a microprocessor, a microcontroller, a digital signal processor (DSP), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), or another type of processing component. In some implementations, the processor 120 includes one or more processors capable of being programmed to perform a function. The memory 130 includes a random access memory (RAM), a read only memory (ROM), and/or another type of dynamic or static storage device (e.g., a flash memory, a magnetic memory, and/or an optical memory) that stores information and/or instructions for use by the processor 120.
The storage component 140 stores information and/or software related to the operation and use of the device 100. For example, the storage component 140 may include a hard disk (e.g., a magnetic disk, an optical disk, a magneto-optic disk, and/or a solid state disk), a compact disc (CD), a digital versatile disc (DVD), a floppy disk, a cartridge, a magnetic tape, and/or another type of non-transitory computer-readable medium, along with a corresponding drive.
The input component 150 includes a component that permits the device 100 to receive information, such as via user input (e.g., a touch screen display, a keyboard, a keypad, a mouse, a button, a switch, and/or a microphone). Additionally, or alternatively, the input component 150 may include a sensor for sensing information (e.g., a global positioning system (GPS) component, an accelerometer, a gyroscope, and/or an actuator). The output component 160 includes a component that provides output information from the device 100 (e.g., a display, a speaker, and/or one or more light-emitting diodes (LEDs)).
The communication interface 170 includes a transceiver-like component (e.g., a transceiver and/or a separate receiver and transmitter) that enables the device 100 to communicate with other devices, such as via a wired connection, a wireless connection, or a combination of wired and wireless connections. The communication interface 170 may permit the device 100 to receive information from another device and/or provide information to another device. For example, the communication interface 170 may include an Ethernet interface, an optical interface, a coaxial interface, an infrared interface, a radio frequency (RF) interface, a universal serial bus (USB) interface, a Wi-Fi interface, a cellular network interface, or the like.
The device 100 may perform one or more processes described herein. The device 100 may perform these processes in response to the processor 120 executing software instructions stored by a non-transitory computer-readable medium, such as the memory 130 and/or the storage component 140. A computer-readable medium is defined herein as a non-transitory memory device. A memory device includes memory space within a single physical storage device or memory space spread across multiple physical storage devices.
Software instructions may be read into the memory 130 and/or the storage component 140 from another computer-readable medium or from another device via the communication interface 170. When executed, software instructions stored in the memory 130 and/or the storage component 140 may cause the processor 120 to perform one or more processes described herein. Additionally, or alternatively, hardwired circuitry may be used in place of or in combination with software instructions to perform one or more processes described herein. Thus, implementations described herein are not limited to any specific combination of hardware circuitry and software.
The number and arrangement of components shown in
As illustrated in
As illustrated in
As illustrated in
Another method is to use a pair of cameras (e.g., a multi-camera supervised training method), with one camera capturing the handheld burst and the other capturing the high quality target. Although it may be easier to put a single camera on a tripod, using a single camera may yield an unrealistic burst because the camera does not move as it does in a handheld burst. The multi-camera supervised method may include training the burst photography model using such a multi-camera setup.
As illustrated in
At operations 205a and 205b, the training may be provided in a supervised manner with multiple cameras (e.g., 205a) or a self-supervised manner with a single camera (e.g., 205b), according to one or more embodiments.
In a multi-camera supervised setting of operation 205a, one camera will capture a burst of images, and a second camera will capture a single high-quality image. Under this setting, the first camera and the second camera will be capturing the scene from slightly different angles. To produce a high quality final image, the image generated from the burst should be aligned to the high-quality image captured by the second camera.
According to another embodiment, a self-supervised setting of operation 205b may be used. A self-supervised setting may be used for denoising. In self-supervised training, a burst is taken by a single camera, and one frame is held out from the burst of images for supervision. If a burst includes N frames, then N−1 frames may be input to a burst processing network 202 and one frame is held out for supervision. Although the held-out frame is noisy, it is still a useful supervision target because the noise is random.
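A minimal sketch of how one frame might be held out from a burst for self-supervision is shown below; the tensor shapes, the choice to keep frame 0 as the reference, and the random selection of the held-out frame are assumptions for illustration only.

```python
import torch

def split_burst(burst, target_index=None):
    """Split a burst (N, C, H, W) into N-1 input frames and one held-out target frame."""
    n = burst.shape[0]
    if target_index is None:
        # Keep frame 0 as the reference (an assumption) and hold out a random other frame.
        target_index = torch.randint(1, n, (1,)).item()
    inputs = torch.cat([burst[:target_index], burst[target_index + 1:]], dim=0)
    target = burst[target_index]
    return inputs, target

burst = torch.randn(8, 1, 64, 64)     # eight noisy RAW frames (illustrative shapes)
inputs, target = split_burst(burst)   # seven input frames, one supervision target
```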
According to one or more embodiments, the self-supervised method and the multi-camera method may require aligning the image generated by the network to the target image (e.g., the ground truth from another camera in the multi-camera scenario, or a held-out frame from the burst in the self-supervised scenario).
At operation 301, the method includes creating synthetic training data using a set of high-quality images. At operation 303, the method includes providing synthetic data (e.g., 302) to the burst processing network to pretrain the burst processing network. At operation 304, the method includes acquiring a paired dataset from a multi-camera setup: a low-quality burst from the target camera and a high-quality image from another camera. At operation 305, the method includes providing a real multi-camera dataset to the burst processing network. At operation 306, the method includes performing supervised training and fine-tuning of the burst processing network using real data and artifact-free interpolation (e.g., implicit interpolation).
In a supervised multi-camera training method, although spatial and color alignment (SCA) should account for the differences between the cameras, it may leak alignment information from the ground truth into the model during fine-tuning. In an embodiment, a lightweight two-layer residual adapter network may be fine-tuned on top of the self-supervised base demosaicing model, while freezing the parameters of the base model. This enables the adapter to learn the color space transformation from the burst camera to the ground truth camera, while its limited capacity discourages it from learning undesirable spatial transformations.
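One possible, non-limiting realization of such an adapter is sketched below in PyTorch; the layer sizes, the 1×1 convolutions, and the placeholder base model are assumptions, and only the overall structure (frozen base model, small residual adapter with its own optimizer) follows the description above.

```python
import torch
import torch.nn as nn

class ResidualColorAdapter(nn.Module):
    """Two-layer 1x1-convolution adapter learning a per-pixel color transform."""
    def __init__(self, channels=3, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x):
        return x + self.net(x)   # residual: identity plus a small color correction

# Placeholder standing in for the pretrained self-supervised demosaicing model.
base_model = nn.Conv2d(3, 3, kernel_size=3, padding=1)
for p in base_model.parameters():
    p.requires_grad = False      # freeze the base during multi-camera fine-tuning

adapter = ResidualColorAdapter()
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-4)  # only the adapter is trained

demosaiced = base_model(torch.randn(1, 3, 64, 64))
aligned_prediction = adapter(demosaiced)   # compared against the ground truth camera image
```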
Although fully supervised training may be performed with access to noise-free, demosaiced ground truth images that are aligned with a reference image, such images may be difficult to obtain due to hardware limitations, as many sensors have CFAs.
According to an embodiment, at operation 401, the method includes creating synthetic training data using a set of high-quality images. At operation 403, the method includes providing synthetic data (e.g., 402) to the burst processing network to pretrain the burst processing network. At operation 404, the method includes acquiring a dataset of low-quality bursts from a target camera. At operation 405, the method includes providing a real burst dataset to the burst processing network. At operation 406, the method includes performing self-supervised fine-tuning of the burst processing network on real data, using artifact-free interpolation (e.g., implicit interpolation).
According to an embodiment, frames from a burst may differ by small motions which can provide color samples at locations that are missing in the reference frame, and these samples can be leveraged if the alignments are estimated accurately. To leverage bursts for self-supervised training, one frame may be held out from the burst as a target frame T, which is not used as an input to the model. A self-supervised method may minimize a reconstruction error between the observed color samples in the target frame and an aligned demosaiced output from the model. For a model with a grid-based output, the loss may be expressed as ℒ_grid = ∥M(T − W(D(ē), F_b))∥, where M masks the color samples observed in the target frame T, D(ē) is the decoded grid-based output, and W warps that output toward T using the estimated flow F_b with explicit interpolation.
Using a coordinate-based decoder, the self-supervised loss may be modified to ℒ_coord = ∥M(T − F(ē, F_b))∥, in which the coordinate-based decoder F is queried directly at the flow-displaced coordinates, so that no explicit interpolation of a grid-based output is required.
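The sketch below shows the shared structure of the two losses under the assumption of an L1 norm and illustrative tensor shapes; the prediction tensor is a placeholder for either the explicitly warped grid output W(D(ē), F_b) or the coordinate-based decoder query F(ē, F_b).

```python
import torch

def self_supervised_loss(target, mask, prediction):
    """Masked reconstruction error against the held-out frame T (an L1 norm is assumed).

    For the grid-based model, `prediction` stands for W(D(e_bar), F_b): the decoded grid
    warped onto T with explicit (e.g., bilinear) interpolation. For the coordinate-based
    model, it stands for F(e_bar, F_b): the decoder queried directly at the flow-displaced
    coordinates, so no explicit interpolation of a grid output is involved.
    """
    return (mask * (target - prediction)).abs().mean()

# Illustrative shapes only; T, M, and the prediction are placeholders.
T = torch.rand(1, 3, 32, 32)                      # held-out target frame
M = (torch.rand(1, 3, 32, 32) > 0.66).float()     # color samples observed in T
pred = torch.rand(1, 3, 32, 32, requires_grad=True)
loss = self_supervised_loss(T, M, pred)
loss.backward()                                   # gradients flow back into the model output
```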
Implicit interpolation refers to directly passing a non-integer coordinate (x, y) to a neural network to obtain a color at (x, y), and placing the obtained color (e.g., the color obtained in operation 2 or 3) at the integer pixel location (x′, y′) in the warped image.
When performing warping, the warped position is unlikely to land exactly on integer coordinates. When it lands on non-integer coordinates, an interpolation is performed to identify the color. Pixel 520 illustrates an example of bilinear interpolation within area 510. Pixel 520 is closer to the top left, but lies in between the four pixels 521, 522, 523, and 524. Thus, the color of pixel 520 is a linear blend of the colors of those four pixels. Performing this interpolation may result in more blur. Thus, implicit interpolation may be used to reduce blur.
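For reference, a bilinear blend of the kind described for pixel 520 may look like the following sketch; the neighbor colors and the sub-pixel offset are illustrative.

```python
def bilinear(colors, dx, dy):
    """Blend four neighboring pixel colors; (dx, dy) in [0, 1] is the offset of the
    query point from the top-left neighbor."""
    top_left, top_right, bottom_left, bottom_right = colors
    top = [(1 - dx) * a + dx * b for a, b in zip(top_left, top_right)]
    bottom = [(1 - dx) * a + dx * b for a, b in zip(bottom_left, bottom_right)]
    return [(1 - dy) * t + dy * bm for t, bm in zip(top, bottom)]

# A query point close to the top-left neighbor, analogous to pixel 520 lying between
# pixels 521-524 (the RGB values below are illustrative).
neighbors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 0.0)]
print(bilinear(neighbors, dx=0.25, dy=0.25))
```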
Using a burst of N noisy RAW images {b_i}_{i=1}^N, the model encodes each frame independently using a convolutional encoder E({b_i}_{i=1}^N) to obtain embeddings {e_i}_{i=1}^N. The first frame is considered the reference frame, and the final demosaiced image will be aligned with it. Frame alignments are estimated by computing the optical flow between each frame and the reference frame. Aligned embeddings {ê_i}_{i=1}^N are obtained by warping the embeddings {e_i}_{i=1}^N using the optical flow and bilinear interpolation. Every frame in the burst may be merged using an attention-based fusion module Z({ê_i}_{i=1}^N) to produce the merged embeddings ē. The merged embeddings are decoded into a grid-based image using a convolutional decoder ŷ = D(ē).
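A structural, non-limiting sketch of this pipeline is shown below; the layer sizes, the single-convolution stand-ins for the encoder E, fusion module Z, and decoder D, and the zero flow fields are assumptions intended only to show how the stages compose.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class BurstPipeline(nn.Module):
    def __init__(self, in_ch=1, emb_ch=32, out_ch=3):
        super().__init__()
        self.encoder = nn.Conv2d(in_ch, emb_ch, 3, padding=1)   # stand-in for E
        self.score = nn.Conv2d(emb_ch, 1, 1)                    # attention logits for Z
        self.decoder = nn.Conv2d(emb_ch, out_ch, 3, padding=1)  # stand-in for D

    def warp(self, emb, flow):
        """Align per-frame embeddings to the reference frame by bilinear warping."""
        n, _, h, w = emb.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack((xs, ys), 0).float().expand(n, -1, -1, -1)
        grid = base + flow
        grid = torch.stack((2 * grid[:, 0] / (w - 1) - 1,
                            2 * grid[:, 1] / (h - 1) - 1), dim=-1)
        return F_nn.grid_sample(emb, grid, mode="bilinear", align_corners=True)

    def forward(self, burst, flows):
        """burst: (N, C, H, W) noisy RAW frames; flows: (N, 2, H, W) toward frame 0."""
        emb = self.encoder(burst)                # per-frame embeddings e_i
        aligned = self.warp(emb, flows)          # aligned embeddings e_hat_i
        weights = torch.softmax(self.score(aligned), dim=0)
        merged = (weights * aligned).sum(dim=0, keepdim=True)   # fusion -> e_bar
        return self.decoder(merged)              # D(e_bar): demosaiced image on the grid

burst = torch.randn(8, 1, 64, 64)
flows = torch.zeros(8, 2, 64, 64)                # flow to the reference frame (frame 0)
image = BurstPipeline()(burst, flows)            # (1, 3, 64, 64)
```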
According to an embodiment, a coordinate-based decoder ŷ(c) = F(ē, c) may be used, which outputs the color values for a set of image coordinates c. The coordinate-based decoder may enable the model to learn a suitable interpolator for providing image intensities at pixel positions that are not aligned with the image grid of the reference frame. To obtain a demosaiced image that is aligned with the reference frame, a zero flow field can be used as the coordinate query. The architecture and examples of the decoder are illustrated in
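A minimal sketch of one way such a coordinate-based decoder might be realized is shown below; the MLP architecture, channel counts, and the use of raw coordinate offsets as the query are assumptions, and querying with a zero flow field yields an image aligned with the reference grid as described above.

```python
import torch
import torch.nn as nn

class CoordinateDecoder(nn.Module):
    """Maps a per-pixel embedding plus a continuous coordinate offset to an RGB value."""
    def __init__(self, emb_ch=32, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(emb_ch + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, merged_emb, coords):
        """merged_emb: (1, C, H, W) fused embeddings; coords: (1, 2, H, W) query offsets."""
        feats = torch.cat([merged_emb, coords], dim=1)    # condition on the query offset
        feats = feats.permute(0, 2, 3, 1)                 # (1, H, W, C + 2)
        return self.mlp(feats).permute(0, 3, 1, 2)        # (1, 3, H, W)

decoder = CoordinateDecoder()
merged = torch.randn(1, 32, 64, 64)                       # e_bar from the fusion module
zero_flow = torch.zeros(1, 2, 64, 64)                     # query aligned with the reference grid
aligned_image = decoder(merged, zero_flow)
subpixel_image = decoder(merged, 0.3 * torch.ones(1, 2, 64, 64))  # non-integer query offsets
```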
As illustrated in FIG. 6C, the coordinate-based decoder directly conditions the decoder to produce aligned images. According to an embodiment illustrated in
According to an embodiment,
The embodiments have been described above and illustrated in terms of blocks, as shown in the drawings, which carry out the described function or functions. These blocks may be physically implemented by analog and/or digital circuits including one or more of a logic gate, an integrated circuit, a microprocessor, a microcontroller, a memory circuit, a passive electronic component, an active electronic component, an optical component, and the like, and may also be implemented by or driven by software and/or firmware (configured to perform the functions or operations described herein). The circuits may, for example, be embodied in one or more semiconductor chips, or on substrate supports such as printed circuit boards and the like. Circuits included in a block may be implemented by dedicated hardware, or by a processor (e.g., one or more programmed microprocessors and associated circuitry), or by a combination of dedicated hardware to perform some functions of the block and a processor to perform other functions of the block. Each block of the embodiments may be physically separated into two or more interacting and discrete blocks. Likewise, the blocks of the embodiments may be physically combined into more complex blocks.
While this disclosure has described several non-limiting embodiments, there are alterations, permutations, and various substitute equivalents, which fall within the scope of the disclosure. It will thus be appreciated that those skilled in the art will be able to devise numerous systems and methods which, although not explicitly shown or described herein, embody the principles of the disclosure and are thus within the spirit and scope thereof.
The above disclosure also encompasses the embodiments listed below:
(1) A method, performed by at least one processor of an electronic device, the method comprising: obtaining a plurality of images using an image sensor of the electronic device; obtaining synthetic training data for pre-training a burst processing network; performing implicit interpolation on the obtained plurality of images based on the pre-trained burst processing network; combining the plurality of images into a target image based on the implicit interpolation; and outputting the target image to a display of the electronic device.
(2) The method according to feature (1), in which the performing implicit interpolation comprises passing a non-integer coordinate to obtain a color.
(3) The method according to feature (1), in which the performing implicit interpolation comprises, for each of the plurality of images, decoding image values for each two-dimensional coordinate.
(4) The method according to feature (1), in which the obtaining the plurality of images comprises obtaining the plurality of images using a plurality of cameras.
(5) The method according to feature (1), in which the obtaining the plurality of images comprises capturing the plurality of images using a single camera.
(6) The method according to feature (1), in which the performing implicit interpolation comprises fine tuning the burst processing network using self-supervised loss computation.
(7) The method according to feature (1), in which the performing implicit interpolation comprises fine tuning the burst processing network using supervised loss computation.
(8) An electronic device comprising: an image sensor; a display; a memory configured to store instructions; and at least one processor configured to execute the instructions to cause the electronic device to: obtain a plurality of images using the image sensor of the electronic device; obtain synthetic training data for pre-training a burst processing network; perform implicit interpolation on the obtained plurality of images based on the pre-trained burst processing network; combine the plurality of images into a target image based on the implicit interpolation; and output the target image to the display of the electronic device.
(9) The electronic device according to feature (8), in which the at least one processor is further configured to perform the implicit interpolation by passing a non-integer coordinate to obtain a color.
(10) The electronic device according to feature (8), in which the at least one processor is further configured to, for each of the plurality of images, decode image values for each two-dimensional coordinate.
(11) The electronic device according to feature (8), in which the at least one processor is further configured to obtain the plurality of images using a plurality of cameras.
(12) The electronic device according to feature (8), in which the at least one processor is further configured to obtain the plurality of images using a single camera.
(13) The electronic device according to feature (8), in which the at least one processor is further configured to perform the implicit interpolation by fine tuning the burst processing network using self-supervised loss computation.
(14) The electronic device according to feature (8), in which the at least one processor is further configured to perform the implicit interpolation by fine tuning the burst processing network using supervised loss computation.
(15) A non-transitory computer readable medium having instructions stored therein, which when executed by a processor cause the processor to execute a method comprising: obtaining a plurality of images using an image sensor of an electronic device; obtaining synthetic training data for pre-training a burst processing network; performing implicit interpolation on the obtained plurality of images based on the pre-trained burst processing network; combining the plurality of images into a target image based on the implicit interpolation; and outputting the target image to a display of the electronic device.
(16) The non-transitory computer readable medium according to feature (15), in which the performing implicit interpolation comprises passing a non-integer coordinate to obtain a color.
(17) The non-transitory computer readable medium according to feature (15), in which the performing implicit interpolation comprises, for each of the plurality of images, decoding image values for each two-dimensional coordinate.
(18) The non-transitory computer readable medium according to feature (15), in which the obtaining the plurality of images comprises obtaining the plurality of images using a plurality of cameras.
(19) The non-transitory computer readable medium according to feature (15), in which the obtaining the plurality of images comprises capturing the plurality of images using a single camera.
(20) The non-transitory computer readable medium according to feature (15), in which the performing implicit interpolation comprises fine tuning the burst processing network using self-supervised loss computation.
This application claims priority to U.S. provisional application No. 63/536,833 filed on Sep. 6, 2023, the entire contents of which are incorporated herein by reference.