This application claims the benefit under 35 U.S.C. § 119(a)-(d) of United Kingdom Patent Application No. 2303244.4, filed on Mar. 6, 2023 and titled “JOINT REAL-WORLD IMAGE DENOISING AND SUPER-RESOLUTION”. The above-cited patent application is incorporated herein by reference in its entirety.
The present disclosure relates to a method of training a neural network to extract a degradation map from a degraded image and a super resolution imaging method using the extracted degradation map.
Image Super-Resolution (SR) aims to enhance the resolution and details of Low-Resolution (LR) images to generate High-Resolution (HR) images. Most recent SR methods accomplish this task by learning a mapping from LR images, generated synthetically by bicubic downsampling, to the corresponding HR images. However, Deep Neural Network (DNN)-based SR methods often suffer from overfitting to the training data distribution, which consequently leads to decreased performance when applied to images with different degradations (e.g. noise).
Recent attempts to overcome this issue include elaborate degradation models [Jingyun Liang et al: Image restoration using swin transformer, In IEEE International Conference on Computer Vision Workshops, 2021; Chong Mou et al. Metric learning based interactive modulation for real-world super-resolution, In European Conference on Computer Vision, pages 723-740, Springer,2022; Kai Zhang et al, Designing a practical degradation model for deep blind image super-resolution, IEEE International Conference on Computer Vision, pages 4791-4800, 2021] and network conditioning based on degradation estimation [Jie Liang et al, Efficient and degradation-adaptive network for real-world image super-resolution, European Conference on Computer Vision, 2022].
US2022/0148130 discloses an image restoration method that combines image super resolution and denoising, including a step of generating a noise map and inputting this into a CNN (SR subnetwork). The noise map is itself generated by a noise estimator CNN (NE subnetwork), which is trained using pairs of LR and HR images generated by creating each noisy LR image from a corresponding clean HR image by downsampling and adding white Gaussian noise.
Therefore, while many methods have been proposed to address the Super-Resolution (SR) of Low-Resolution (LR) images with complex unknown degradations, their performance still drops significantly when evaluated on images with challenging real-world degradations.
The present disclosure provides a method of training a neural network to extract a degradation map from a degraded image.
The present disclosure also provides a computer implemented super resolution imaging method for generating a higher resolution image with reduced noise from a degraded lower resolution image that includes noise, in which a pixel level degradation map obtained by the method above, and a feature map of the degraded lower resolution image are input to a second trained neural network to perform a pixel-wise feature modulation to generate the higher resolution image with reduced noise.
One often overlooked factor contributing to the poor performance of prior art methods on real-world degradations, is the presence of spatially varying degradations in real LR images. To address this issue, the present disclosure provides a degradation pipeline capable of generating paired LR/HR images with spatially varying noise, a key contributor to reduced image quality.
Prior art methods assume uniformly distributed degradations and thereby ignore the phenomenon of spatially varying noise present in real images acquired in photon-limited situations. This key factor compromising the image quality is not contingent upon specific image sensors, but is a result of the physics involved in the imaging process, such as random photon arrival and non-ideal sensor characteristics, which may lead to a higher Signal-to-Noise Ratio (SNR) in brighter pixels (low noise) and a lower SNR in darker pixels (high noise). Since the SNR is ultimately controlled by the quantum nature of light, the noise stemming from this phenomenon is an inherent characteristic of any realizable imaging device operating under natural settings.
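As a minimal numerical illustration of this effect (not part of the disclosure), the following Python snippet simulates Poisson photon arrival and shows that the SNR grows with the square root of the signal level, so brighter pixels are inherently less noisy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Photon counting: the number of photons collected by a pixel is
# Poisson-distributed with mean equal to the true signal level.
for mean_photons in (10, 100, 1000):                 # dark -> bright pixel
    samples = rng.poisson(mean_photons, size=100_000)
    snr = samples.mean() / samples.std()             # SNR = mean / std
    print(f"signal={mean_photons:4d}  SNR={snr:5.1f}")  # grows as sqrt(signal)
```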
In addition to the extraction of a degradation map, the present disclosure provides a Super-Resolution model capable of adapting the reconstruction process based on the degradation map.
Embodiments of the present disclosure will now be described, with reference to the accompanying drawings, in which:
The super resolution network 20 comprises Restormer Transformer Blocks (RTBs) 2 and Spatial Feature Transformation Blocks (SFTBs) 3. The Restormer Transformer is described in Syed Waqas Zamir et al, “Restormer: Efficient transformer for high-resolution image restoration”, CoRR, abs/2111.09881, 2021. The RTBs 2 are organized in a U-Net shaped architecture together with the Spatial Feature Transformation Blocks (SFTBs) 3.
The above architecture provides a super-resolution model that is conditioned on pixel-wise degradation features provided by the degradation map extraction block 100 for improved refinement of location specific degradations.
The components, training and operation of the super-resolution network will now be described.
A necessity for a Deep Neural Network (DNN) based super resolution model to perform well is prior training on equivalent training data. A known degradation pipeline for creating realistic LR/HR training image pairs involves convolution with a blur kernel k on the HR image y, followed by downsampling with scale factor s, and lastly, degradation by additive noise to produce the degraded LR image xd. The pipeline is described as:

xd=(y⊗k)↓s+n

where ⊗ denotes convolution, ↓s denotes downsampling by scale factor s, and n is the additive noise.
However, a limitation of this method is the use of spatially uniform degradations, which limits the generalization performance on real images.
In the present disclosure, the degradation map extraction block 100 is trained using pairs of training images y, xd generated by a degradation model that comprises synthesizing degraded low resolution source images xd from clean source images y by downsampling and addition of spatially variant noise. Thus, the present disclosure uses a method of generating the LR/HR training image pairs such that the noise strength varies spatially across the image, which better resembles the noise distribution of real images, where the noise varies naturally as a result of different Signal-to-Noise Ratio (SNR) levels. More specifically, the degraded source images xd are synthesized from the clean source images y with spatially varying noise by the concept of mask blending. First, a clean source image y is downsampled to a LR image x. A mask m of the same spatial size as the LR image x is generated, which contains either a randomly oriented and randomly shaped gradient mask, or a mask based on the image brightness level. Next, a noisy image xn is generated by adding spatially invariant Gaussian or Poisson noise to x. Then x and xn are blended according to the varying intensity levels defined in the mask, to form the degraded image xd with spatially varying noise:

xd=m⊙xn+(1−m)⊙x

where ⊙ denotes element-wise multiplication.
The noise addition could be carried out before the downsampling, but for computational efficiency the downsampling preferably occurs first. Also, adding noise before the downsampling would change the noise distribution.
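A minimal NumPy sketch of this mask-blending step is given below. The function name is illustrative, simple pixel striding stands in for the bicubic downsampling, and a random linear gradient is used for the mask m (the brightness-based mask would be derived from x instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def degrade_with_spatially_variant_noise(y, scale=4, sigma=25.0):
    """Synthesize a degraded LR image xd from a clean HR image y
    (values in [0, 255]) via xd = m * xn + (1 - m) * x."""
    # 1. Downsample the clean HR image y to the LR image x.
    #    Simple pixel striding stands in for bicubic downsampling here.
    x = y[::scale, ::scale].astype(np.float64)

    # 2. Generate a mask m of the same spatial size as x: a randomly
    #    oriented linear gradient.
    h, w = x.shape[:2]
    theta = rng.uniform(0.0, 2.0 * np.pi)
    yy, xx = np.mgrid[0:h, 0:w]
    m = np.cos(theta) * xx / w + np.sin(theta) * yy / h
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)   # normalize to [0, 1]
    if x.ndim == 3:
        m = m[..., None]                             # broadcast over channels

    # 3. Generate a noisy image xn by adding spatially invariant
    #    Gaussian noise to x (Poisson noise is the alternative).
    xn = x + rng.normal(0.0, sigma, size=x.shape)

    # 4. Blend x and xn according to the intensity levels in the mask.
    xd = m * xn + (1.0 - m) * x
    return np.clip(xd, 0.0, 255.0)

y = rng.uniform(0.0, 255.0, size=(256, 256, 3))      # stand-in clean image
xd = degrade_with_spatially_variant_noise(y)
```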
Examples of different masks and the resulting noisy images are shown in the accompanying drawings.
During training, the degradation feature extraction network 10 estimates the degradation of an input image, which is a degraded low resolution source image xd, at the pixel level, learning to extract the degradations directly from the degraded source image xd to generate an image space degradation map {circumflex over (d)}, as shown in the accompanying drawings.
The degradation feature extraction network 10 is trained by inputting each degraded source image xd of each training image pair and extracting a degradation map d such that, when the degradation map d is applied to the corresponding clean source image y of the training image pair, the loss between the degraded source image xd and its corresponding clean source image after the degradation map is applied is minimised, as shown in the accompanying drawings.
The design of each of the DFEBs 1 is shown in the accompanying drawings.
The degradation feature extraction network 10 is optimized by the loss between the re-degraded image Rd and xd. To encourage images with similar frequency distributions, a combination of SSIM and focal frequency loss was used. The whole degradation feature extraction model has a moderate receptive field of 51×51 pixels and 4.6M parameters. During inference, degradation features are extracted as a degradation feature map from the 9th DFEB to condition the super resolution block 200.
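A minimal PyTorch sketch of such a combined objective follows. The single-window SSIM, the frequency-weight normalization, and the loss weights w_ssim and w_ffl are illustrative assumptions; the disclosure only specifies that SSIM and focal frequency loss are combined:

```python
import torch

def global_ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified single-window SSIM over the whole image (inputs in [0, 1])."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def focal_frequency_loss(pred, target, alpha=1.0):
    """Frequency-domain error, re-weighted so that frequencies with larger
    errors (hard frequencies) contribute more to the loss."""
    f_pred = torch.fft.fft2(pred, norm="ortho")
    f_target = torch.fft.fft2(target, norm="ortho")
    dist = (f_pred - f_target).abs()
    weight = (dist ** alpha).detach()
    weight = weight / (weight.max() + 1e-8)          # normalize to [0, 1]
    return (weight * dist ** 2).mean()

def degradation_loss(r_d, x_d, w_ssim=1.0, w_ffl=1.0):
    """Combined objective between the re-degraded image Rd and xd."""
    return (w_ssim * (1.0 - global_ssim(r_d, x_d))
            + w_ffl * focal_frequency_loss(r_d, x_d))

r_d = torch.rand(1, 3, 64, 64)
x_d = torch.rand(1, 3, 64, 64)
print(degradation_loss(r_d, x_d))
```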
To condition the super resolution network 20 on the pixel-wise degradation map d estimated by the degradation feature extraction network 10, the super resolution network 20 is designed to transform its deep spatial features adaptively and individually for each pixel.
The super resolution network 20 follows the Restormer architecture, and includes an initial 3×3 convolution layer 15 that extracts shallow features f. These are then modulated by the first SFTB.
In each SFT layer 8a, 8b, feature maps f are conditioned on the degradation map d by a scaling and shifting operation:

SFT(f|d)=γ⊙f+β

where the scale γ and shift β are modulation parameters predicted from the degradation map d, and ⊙ denotes element-wise multiplication.
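A minimal PyTorch sketch of one such SFT layer is shown below. The channel sizes and the 1×1 convolutions used to predict γ and β are assumptions based on common Spatial Feature Transform implementations, not details fixed by the disclosure:

```python
import torch
import torch.nn as nn

class SFTLayer(nn.Module):
    """Spatial Feature Transform: modulates feature maps f pixel-wise with a
    scale gamma and a shift beta predicted from the degradation map d."""
    def __init__(self, feat_ch=48, cond_ch=64, hidden_ch=32):
        super().__init__()
        self.scale = nn.Sequential(
            nn.Conv2d(cond_ch, hidden_ch, 1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(hidden_ch, feat_ch, 1))
        self.shift = nn.Sequential(
            nn.Conv2d(cond_ch, hidden_ch, 1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(hidden_ch, feat_ch, 1))

    def forward(self, f, d):
        gamma = self.scale(d)      # pixel-wise scale predicted from d
        beta = self.shift(d)       # pixel-wise shift predicted from d
        return gamma * f + beta    # scaling-and-shifting modulation

f = torch.randn(1, 48, 64, 64)     # deep features of the LR image
d = torch.randn(1, 64, 64, 64)     # degradation feature map, same H x W
print(SFTLayer()(f, d).shape)      # torch.Size([1, 48, 64, 64])
```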
A Spatially Variant Super-Resolution (SVSR) benchmarking dataset was used to compare the performance of the method of the present disclosure with known methods. The SVSR dataset can be found at https://zenodo.org/records/10044260, and the data collection method for this dataset along with an analysis of its characteristics will be described below.
The purpose of the SVSR dataset is to enable evaluation of Super Resolution methods on real LR images with challenging and spatially variant degradations. To achieve this, the dataset should include high-quality HR reference images and corresponding low-quality LR images.
To capture static scenes with diverse content, both indoors and outdoors, three different Canon® digital single-lens reflex (DSLR) cameras, two different zoom lenses, and three different aperture values were used. This ensured diverse degradations, as the noise characteristics and point-spread-function vary between the different cameras, lenses, and aperture settings. A scale difference was obtained by changing the focal length of the zoom lenses, to collect image pairs of both ×2 and ×4 scale difference. To obtain varying degrees of noise, multiple images of the same static scene were captured using aperture priority and changing the camera's ISO setting. At low ISO settings (low signal gain) the camera produces the most noise-free images, while at higher ISO settings, and correspondingly shorter exposure times, the images contain more noise due to the lower signal-to-noise ratio. Hence, the clean images were captured at the camera's native ISO setting (ISO100), while the noisy images were captured at incrementally higher ISO levels up to the maximum setting for each camera. ISO1600 was established as the threshold to distinguish noisy images, as this is the point at which all three cameras introduced visible noise.
The captured dataset comprises a total of 978 images, with 141 noise-free images for each scale level, and 555 images as the noisy LR counterparts.
A breakdown of the dataset is set out below.
Note that due to different technologies, images captured at the same ISO setting by different cameras do not necessarily contain similar noise levels and types.
Even though the image collection was done with the camera mounted on a tripod and using a remote trigger, misalignment between the LR and HR image pairs can still occur, as the different focal lengths distort the image differently. To mitigate this, a pre-processing pipeline was applied.
First, the lens distortion was removed using Adobe Lightroom®, followed by centre cropping to keep only the sharpest part of the images. Next, pixel-wise registration of LR and HR images was obtained using a luminance-aware iterative algorithm, which was found to be more accurate for the highly noisy images, compared to keypoint-based algorithms. To maintain the scale difference between the LR and HR images, the alignment was performed in LR space. Finally, all image pairs were examined, and ones with misalignment, out-of-focus regions, or other unwanted defects were discarded. The resulting image pairs have a resolution of 640×640, 1280×1280, and 2560×2560 px for the ×1, ×2, and ×4 scale factors, respectively.
To quantify the effect of varying ISO levels on the image, the table below presents the average standard deviation of the noise, and the average change in image fidelity and perceptual quality as the ISO increases. As seen, high ISO settings result in higher noise contributions, which translates into accordingly lower image quality; e.g. the Peak Signal-to-Noise Ratio (PSNR) for the highest ISO setting is 12.53 dB lower than for ISO1600.
In the embodiment of the present disclosure used for testing, the DIV2K [Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017] and Flickr2K [Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136-144, 2017] datasets were used for training.
The training images were degraded using the spatially variant noise degradation model described above, together with the degradation pipeline described in “Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In IEEE International Conference on Computer Vision, pages 4791-4800, 2021” (Zhang et al.), by replacing the uniform Gaussian noise degradation in Zhang et al. with spatially variant Gaussian and Poisson noise. The noise standard deviation was set to [1, 50] for Gaussian noise, and the scale to [2, 4] for Poisson noise. The remaining steps in the degradation pipeline include Gaussian blur, downsampling and JPEG compression noise, with the same hyperparameters as defined in Zhang et al. for comparability. As such, any performance improvements related to the degradation modelling are solely due to the introduction of spatially variant noise.
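A sketch of how this replaced noise step might sample its parameters is given below. Reading the Poisson “scale” as the exponent of the photon-count level follows common blind-SR pipelines and is an assumption, as is the equal choice between the two noise types:

```python
import numpy as np

rng = np.random.default_rng()

def sample_invariant_noise(x):
    """Apply spatially invariant Gaussian or Poisson noise to the LR image x
    (values in [0, 1]); the result is subsequently mask-blended with x."""
    if rng.random() < 0.5:                        # Gaussian branch
        sigma = rng.uniform(1.0, 50.0) / 255.0    # sigma drawn from [1, 50]
        return x + rng.normal(0.0, sigma, x.shape)
    scale = rng.uniform(2.0, 4.0)                 # Poisson scale from [2, 4]
    vals = 10.0 ** scale                          # photon-count level
    return rng.poisson(np.clip(x, 0.0, 1.0) * vals) / vals

x = rng.uniform(0.0, 1.0, size=(64, 64, 3))
xn = sample_invariant_noise(x)
```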
We perform experiments on Restormer [Syed Waqas Zamir, Aditya Arora, Salman H. Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. CoRR, abs/2111.09881, 2021], a Transformer based image reconstruction model, where we add SFTBs for each encoder level, and before the final refinement block. We use average pooling of the degradation maps to match the spatial dimensions of the feature maps at the different encoder levels. ×4 upsampling is performed as the final step by nearest-neighbour interpolation followed by convolutional layers. Otherwise, the architecture follows the original implementation. The degradation feature extraction network 10 and the super resolution network 20 were jointly trained for 1M iterations with a batch size of 16 using the ADAM optimizer [Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014], a learning rate of 2×10−4, LR patch sizes of 64×64, and L1-loss.
The performance of the embodiment and the comparative examples was tested on the SVSR dataset discussed above, for evaluation on real-world degraded LR images. For evaluation on images with synthetic degradations, Set14 [Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In Curves and Surfaces: 7th International Conference, Avignon, France, Jun. 24-30, 2010, Revised Selected Papers 7, pages 711-730. Springer, 2012], BSD100 [David R. Martin, Charless C. Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2, pages 416-423, 2001] and Urban100 [Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5197-5206, 2015], degraded by additive Gaussian noise with zero mean and standard deviation σ=15, 25, 50, respectively, were used.
As metrics, the reconstruction performance was evaluated using two hand-crafted (PSNR, SSIM) and two DNN-based (LPIPS, DISTS) Full-Reference Image Quality Assessment (FR-IQA) metrics. PSNR (Peak Signal-to-Noise Ratio) reports the image fidelity as a measure of the peak pixel-wise error between the prediction and target, while SSIM (Structural Similarity Index), LPIPS (Learned Perceptual Image Patch Similarity), and DISTS (Deep Image Structure and Texture Similarity) are more focused on the perceived image quality.
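For reference, the PSNR metric reported throughout reduces to a few lines, shown here for 8-bit images; the other metrics require windowed statistics or pretrained networks and are omitted:

```python
import numpy as np

def psnr(pred, target, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between prediction and target."""
    err = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(err ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

a = np.full((8, 8), 100.0)
print(psnr(a, a + 10.0))   # uniform error of 10 levels -> ~28.13 dB
```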
The performance of the embodiment was compared with a number of prior art real-world super resolution methods. These included one codebook-based method (FeMaSR), two degradation estimation and adaptation-based methods (DASR and DAN), five methods relying on elaborate degradation modelling (Real-ESRNet, MM-RealSRNet, BSRNet, PDM-SR) and Transformers (SwinIR), and, for completeness, one method trained on bicubically downsampled images (RRDBNet). For reference, a filter-based method, i.e. a 3×3 median filter followed by bicubic upsampling, was also used. For all DNN-based methods, the pre-trained weights provided by the authors for enhancement of real images were used, optimized for PSNR rather than perceptual quality, since the goal is to restore the original image with the highest possible fidelity.
As mentioned above, for evaluation on images with synthetic degradations the Set14, BSD100 and Urban100 datasets were used, degraded by additive Gaussian noise with zero mean and standard deviation σ=15, 25, 50, respectively. The following table shows the results. In this experiment, where the degradations are uniformly distributed, the embodiment of the present disclosure (PDA-RWSR) outperforms all the competing methods on both noise levels, except for σ=15 on Urban100, where the present disclosure performs comparably with Real-ESRNet.
The performance of the embodiment (PDA-RWSR) and the comparative examples was tested on the SVSR dataset disclosed above, for evaluation on real-world degraded LR images with complex degradations. The table below shows the results.
Contrary to the experiments on synthetic data, the SVSR dataset poses a more challenging reconstruction task, where the assumption of spatially invariant Gaussian noise employed by most of the comparative example methods does not hold. As such, the global degradation estimation-based methods (DASR, DAN) cannot handle such real-world scenarios, resulting in low performance on all Image Quality Assessment (IQA) metrics. Furthermore, while the methods based on elaborate degradation models (Real-ESRNet, MM-RealSRNet, BSRNet, PDM-SR, SwinIR) are trained on more complex degradations, their reconstruction quality is very inconsistent on images with spatially variant noise from the SVSR dataset. This can be seen visually in the accompanying drawings.
To demonstrate the contribution of the degradation model of the present disclosure, a comparison of different models was carried out. In the table below, DM and FM denote the degradation model, and whether the model uses feature modulation, respectively. Giga Multiply-Accumulates per Second (GMACS) are computed for an input image of 64×64 pixels.
It can be seen that using the degradation model of the present disclosure with spatially variant noise (C) results in a 0.09 dB higher PSNR compared to using the degradation model from BSRNet (B). The complementary effect between the spatially variant degradation model and the per-pixel degradation feature extraction and adaptation method (D) results in the best performance, albeit at the cost of additional computation.
In conclusion, the present disclosure makes significant progress towards SR of real images with complex and spatially varying degradations. This is enabled by a method of training a neural network to extract a degradation map using training data comprising pairs of images, each pair of images comprising a clean source image and a degraded source image. The pairs of images are generated by, for each clean source image, generating a corresponding noisy image by adding spatially invariant noise to the clean source image, and blending the noisy image with the clean source image according to varying intensity levels defined by a spatially variant mask to obtain the degraded image.
The degradation map obtained by the above method can then be used in a computer implemented super resolution imaging method for generating a higher resolution image with reduced noise from a degraded lower resolution image that includes noise. A feature map of the degraded lower resolution image is obtained, then the degradation map and the feature map are used by a second trained neural network to perform a pixel-wise feature modulation to generate the higher resolution image with reduced noise.
Some embodiments of the present disclosure may be implemented as a recording medium including a computer-readable instruction such as a computer-executable program module. The computer-readable recording medium may be an arbitrary available medium accessible by a computer, and examples thereof include all volatile and non-volatile media and separable and non-separable media. Further, examples of the computer-readable recording medium may include a computer storage medium and a communication medium. Examples of the computer storage medium include all volatile and non-volatile media and separable and non-separable media, which have been implemented by an arbitrary method or technology, for storing information such as computer-readable instructions, data structures, program modules, and other data. The communication medium generally includes a computer-readable instruction, a data structure, a program module, other data of a modulated data signal, or another transmission mechanism, and an example thereof includes an arbitrary information transmission medium.
While the disclosure has been particularly shown and described with reference to embodiments thereof, it will be understood by one of ordinary skill in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as defined by the following claims. Hence, it will be understood that the embodiments described above are not limiting of the scope of the disclosure.
The scope of the disclosure is indicated by the claims rather than by the detailed description of the disclosure, and it should be understood that the claims and all modifications or modified forms drawn from the concept of the claims are included in the scope of the disclosure.
Number | Date | Country | Kind |
---|---|---|---
2303244.4 | Mar 2023 | GB | national |