This application claims the benefit under 35 U.S.C. § 119(a)-(d) of United Kingdom Patent Application No. 2205153.6, filed on Apr. 7, 2022 and titled “Image processing method, apparatus, computer program and computer-readable data carrier”. The above cited patent application is incorporated herein by reference in its entirety.
The present invention relates to image processing using machine learning. More precisely, the present invention relates to a method, an apparatus, and a non-transitory computer-readable medium storing a program for enhancing both the spatial resolution (dimensions) of an image and its lightness (a prediction of how an observer or a computer will perceive the quantity of light in the image, or the luminances of the objects in the image).
The invention finds particular applications in the fields of computer vision and video surveillance, where there is a need to enhance the visibility, quality, (spatial) resolution and details of natural Low-Light Low-Resolution (LLLR) images prior to carrying out further operations on them such as object detection and/or recognition.
Single-Image Super-Resolution (SISR) aims at increasing the spatial resolution and producing High-Resolution (HR) details given a Low-Resolution (LR) input image.
Due to the many practical applications of enhancing details in images, Super-Resolution (SR) has been an active research field for decades. However, current State-of-the-Art (SoTA) SR methods are trained on well-illuminated images and they are therefore not suitable for reconstruction of real LR images captured in poor lighting conditions, e.g., by surveillance or remote sensing cameras.
The conventional strategy is therefore to correct the exposure level with dedicated Low-Light Enhancement (LLE) algorithms before super-resolving the image. However, this sequential processing scheme leads to poor reconstruction accuracy mainly due to error accumulation and the fact that both LLE and SR are highly ill-posed and ill-conditioned inverse problems. In contrast, it has been shown that joint SR and denoising, demosaicing, and deblurring leads to superior performance in all cases, compared to sequential processing.
Current SoTA SR methods are based on Convolutional Neural Networks (CNNs) which are typically trained on LR patches with a dimension of 64×64 pixels and their corresponding HR patches, typically at a ×2, ×3, or ×4 larger scale. As reconstruction of HR details is mostly a local problem, i.e. distant neighbouring pixels provide little information regarding the reconstruction of the local pixel, SR models do not benefit much from using larger training patches.
Early attempts at LLE relied on histogram equalization, illumination map estimation, and Retinex theory to correct the image illumination. However, as these methods fail to consider the inherent noise in the Low-Light (LL) images, the reconstruction results are often unsatisfactory. Recently, deep learning has been utilized to learn an end-to-end mapping between LL and Normal-Light (NL) images. The Retinex theory was further explored in combination with deep learning, where CNNs were used to learn decomposition and illumination enhancement, and most recently, a self-reinforced Retinex projection model was proposed. Furthermore, Generative Adversarial Networks (GANs) have also been applied to the image enhancement problem.
Nonetheless, LLE methods do not increase the spatial resolution of the images, but mainly aim at correcting the brightness level. As such, these methods only recover limited additional details in the image.
Moreover, for the problem of LLE, the inventors have found that the use of more global contextual information can provide valuable cues about the light enhancement level of specific pixels.
Part of the reason for this limited performance could be the ineffective long-range dependency modelling capabilities of CNNs, which limit their ability to benefit from more global contextual information.
Like LLE, image super-resolution is one of the fundamental low-level computer vision problems. Since the first CNN-based SR network, researchers have improved the reconstruction performance of SR models by extending the network depth, utilizing residual learning, and applying dense connections and attention mechanisms. Research has also been focusing on improving the perceptual quality, and not only the reconstruction accuracy, by the use of feature losses and GANs. However, most approaches assume that the LR images are created by an ideal bicubic downsampling kernel, which is an oversimplification of the real-world situation.
Furthermore, real-world images are often degraded by additional factors besides just downsampling, e.g. blur, low-contrast, color-distortion, noise, and low-light to name a few. To remedy this, a research direction focused on SR methods that can handle more diverse degradations has emerged. These methods often improve upon classical SR methods by extending the degradation model to include more diverse degradations e.g. Gaussian noise, blur, and compression artifacts in the LR training images. Yet only very few works in the literature consider LR images degraded by low-light. Some of the most closely related works to the goal of SR of real natural LLLR RGB images address the problem within different image specific domains. For instance, a GAN-based method for reconstruction of synthetic LLLR face images has been presented. In addition, a dedicated method for SR of LL Near-Infrared (NIR) images has been presented, while a method for LL images captured by intensified charge-coupled devices has also been presented.
Therefore, as discussed above, no existing SR model has been developed for reconstructing real LLLR RGB images.
The present invention addresses at least some of the above-mentioned issues by using a novel transformer-based multi-scale hierarchical encoder-decoder network (hereinafter called RELIEF for Resolution and Light Enhancement Transformer), for joint LLE and SR.
The present invention uses Transformers to effectively utilise additional global contextual information for reconstruction of Low-Light Low-Resolution (LLLR) images, as Transformers can show impressive performance on both high- and low-level vision tasks due to their high capability in modelling long-range dependencies.
Aspects of the present invention are set out by the independent claims and preferred features of the invention are set out in the dependent claims.
According to a first aspect there is provided an image processing method comprising: acquiring a first image whose spatial resolution and lightness are to be enhanced; generating a residual image from the first image using a multi-scale hierarchical neural network for joint learning of low-light enhancement and super-resolution, the network comprising an encoder stage and a decoder stage forming a plurality of symmetrical encoder-decoder levels, each encoder and decoder in each level comprising a vision transformer block; and generating a reconstructed image based on the first and residual images.
Optionally, the network is a residual neural network comprising skip-connections.
Optionally, the network has a U-shaped architecture, the encoder stage reducing the spatial resolution of the first image while increasing the number of feature channels of the first image at every level, and the decoder stage increasing the said spatial resolution while reducing the said number of feature channels at every level, and the spatial resolution of the generated residual image is identical to the spatial resolution of the first acquired image.
Optionally, each vision transformer block uses a Cross-Shaped Window multi-headed self-attention mechanism.
Optionally, the self-attention mechanism comprises horizontal and vertical stripes in parallel that form a cross-shaped window, and the widths of the stripes are gradually increased throughout the depth of the network.
Optionally, each vision transformer block is an Enhanced Cross-Shaped Window transformer block obtained by combining a Cross-Shaped Window self-attention mechanism with a Locally-enhanced Feed-Forward module and a Locally-Enhanced Positional Encoding module.
Optionally, the reconstructed image ÎNLHR is generated based on the following equation:
ÎNLHR = (ILLLR + IR)↑s
wherein ILLLR is the first image, IR is the residual image and s is a scaling factor for the upsampling and the symbol + means element-wise addition.
Optionally, upsampling the combination of the acquired first image and generated residual image comprises performing pixel-shuffling and convolutional operations.
Optionally, the method further comprises extracting a low-level feature map F0 ∈ ℝ^(H×W×C) from the first image, wherein W and H are a width and a height of the first image and C is a number of feature channels, and inputting the low-level feature map F0 to the first encoder level.
Optionally, extracting a low-level feature map F0 comprises performing convolutional operations.
Optionally, generating the residual image comprises extracting deep-level features Fd from the low-level features F0 in the plurality of symmetrical encoder-decoder levels.
Optionally, generating the residual image comprises, after each encoder level, reshaping the features output by that encoder to 2D feature maps and downsampling the features output by that encoder.
Optionally, generating the residual image comprises, after each decoder level, upsampling the features output by the decoder in that decoder level.
Optionally, upsampling the features output by the decoder comprises at least one transposed convolutional operation.
Optionally, the network comprises a bottleneck stage between the last encoder level and the first decoder level.
Optionally, an output of the bottleneck stage is processed to upsample the size of a latent feature map output at the last encoder level and to reduce the number of feature channels input to the first decoder level.
Optionally, the network comprises a skip-connection which concatenates the output of the last encoder level with the output of the bottleneck, so as to input a concatenated feature map in the first decoder level.
Optionally, the network comprises further skip-connections which, at each level, concatenate a feature map from the encoder of that level with a feature map from the decoder of the preceding decoder level, so that the feature map input to the decoder of that level has twice the number of feature channels of the encoder of that level.
Optionally, the neural network is trained beforehand with low-resolution patch images and corresponding high-resolution patch images, wherein the low-resolution patch images are larger than 64×64 pixels, and wherein the corresponding high-resolution patch images are at least 2 to 4 times larger.
According to a second aspect there is provided a non-transitory computer-readable medium storing a program that, when run on a computer, causes the computer to carry out a method, the method comprising: acquiring a first image whose spatial resolution and lightness are to be enhanced; generating a residual image from the first image using a multi-scale hierarchical neural network for joint learning of low-light enhancement and super-resolution, the network comprising an encoder stage and a decoder stage forming a plurality of symmetrical encoder-decoder levels, each encoder and decoder in each level comprising a vision transformer block; and generating a reconstructed image based on the first and residual images.
According to a third aspect there is provided an image processing apparatus comprising: acquisition means configured to acquire a first image whose spatial resolution and lightness are to be enhanced; first generation means configured to generate a residual image from the first image using a multi-scale hierarchical neural network for joint learning of low-light enhancement and super-resolution, the network comprising an encoder stage and a decoder stage forming a plurality of symmetrical encoder-decoder levels, each encoder and decoder in each level comprising a vision transformer block; and second generation means configured to generate a reconstructed image based on the first and residual images.
Optionally, the network is a residual neural network comprising skip-connections.
Optionally, each vision transformer block uses a Cross-Shaped Window multi-headed self-attention mechanism, wherein the self-attention mechanism comprises horizontal and vertical stripes in parallel that form a cross-shaped window, and wherein the widths of the stripes are gradually increased throughout the depth of the network.
Optionally, each vision transformer block is an Enhanced Cross-Shaped Window transformer block combining a Cross-Shaped Window self-attention mechanism with a Locally-enhanced Feed-Forward module and a Locally-Enhanced Positional Encoding module.
Optionally, the network comprises a bottleneck stage between the last encoder level and the first decoder level.
Additional features of the present invention will become apparent from the following description of embodiments with reference to the attached drawings.
Embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings in which:
Given an LLLR image ILLLR ∈ ℝ^(H×W×3), where W and H are the width and height, respectively, the goal is to restore its Normal-Light High-Resolution (NLHR) version INLHR. To accomplish this, RELIEF first extracts low-level features F0 ∈ ℝ^(H×W×C), where C is the number of channels, from ILLLR. F0 is preferably obtained by a 3×3 convolutional layer with LeakyReLU. Next, deep features Fd are extracted from the low-level features F0 in K symmetrical encoder-decoder levels. Each level contains multiple ECSWin Transformer blocks. The blocks preferably have large attention areas to capture long-range dependencies.
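Purely by way of illustration, the shallow feature extraction step may be sketched as follows (a minimal PyTorch sketch; the module name, default channel count, and LeakyReLU slope are assumptions made for the example, not a definitive implementation):

```python
import torch
import torch.nn as nn

class ShallowFeatureExtractor(nn.Module):
    """Illustrative sketch: extracts the low-level feature map F0 from the LLLR input
    using a 3x3 convolution followed by LeakyReLU, as described above.
    The module name, channel count, and negative slope are illustrative assumptions."""
    def __init__(self, in_channels: int = 3, embed_channels: int = 48):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(negative_slope=0.2, inplace=True)

    def forward(self, i_lllr: torch.Tensor) -> torch.Tensor:
        # i_lllr: (B, 3, H, W) low-light low-resolution input image
        # returns F0: (B, C, H, W) low-level feature map
        return self.act(self.proj(i_lllr))
```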
After each encoder level, the features are preferably reshaped to 2D feature maps and downsampled, while the number of channels is increased. We preferably perform this operation using a 4×4 convolutional operation with stride 2. We preferably use K=4 encoder levels, and as such the latent feature output at the last encoder stage is of size H/16 × W/16 × 16C, given an F0 ∈ ℝ^(H×W×C) input feature map.
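A corresponding sketch of the inter-level downsampling (a 4×4 strided convolution that halves the spatial size while doubling the channel count) is given below; the module name is an illustrative assumption, and the token-to-feature-map reshaping is assumed to have been performed beforehand:

```python
import torch.nn as nn

class Downsample(nn.Module):
    """Illustrative sketch: halves H and W and doubles the channel count between
    encoder levels using a 4x4 convolution with stride 2, as described above."""
    def __init__(self, channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels * 2, kernel_size=4, stride=2, padding=1)

    def forward(self, x):
        # x: (B, C, H, W) 2D feature maps -> (B, 2C, H/2, W/2)
        return self.reduce(x)
```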
Next, to capture even longer dependencies, we preferably incorporate a bottleneck stage between the encoder and decoder at the lowest level. The output from the bottleneck stage is preferably processed by a 2×2 transposed convolution operation with stride 2 to upsample the size of the latent features and reduce the channel number before entering the first decoder level. To improve the reconstruction process, skip-connections (SC) preferably concatenate the feature maps from each encoder level with the feature maps from the decoder of the preceding decoder level, so that the feature map input to each decoder has twice the number of feature channels of the corresponding encoder.
Finally, the deep features are mapped to a residual image IR ∈ ℝ^(H×W×3), and the reconstructed HR and light-enhanced image is preferably obtained as ÎNLHR = (ILLLR + IR)↑s, where s is the scaling factor of the upsampling operation. The upsampling is preferably performed with pixel-shuffle and 3×3 convolutional operations. We optimize RELIEF with an L1 pixel loss.
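The overall reconstruction flow described above may be summarised by the following hedged sketch; the hierarchical encoder-decoder body is abstracted away as a submodule, and the 3-channel residual projection, the scale factor s=4, and all names are assumptions for the purpose of the example:

```python
import torch
import torch.nn as nn

class RELIEFSketch(nn.Module):
    """Simplified sketch of the end-to-end flow: shallow features -> hierarchical
    encoder-decoder -> residual image I_R -> (I_LLLR + I_R) upsampled by s.
    The encoder/decoder/bottleneck internals are assumed to be provided by `body`."""
    def __init__(self, body: nn.Module, channels: int = 48, scale: int = 4):
        super().__init__()
        self.shallow = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1),
                                     nn.LeakyReLU(0.2, inplace=True))
        self.body = body                               # encoder-decoder with ECSWin blocks (assumed)
        self.to_residual = nn.Conv2d(channels, 3, 3, padding=1)  # projection to I_R (assumption)
        # Upsampling with pixel-shuffle and 3x3 convolutions, as described above.
        self.upsample = nn.Sequential(
            nn.Conv2d(3, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(3, 3, 3, padding=1),
        )

    def forward(self, i_lllr: torch.Tensor) -> torch.Tensor:
        f0 = self.shallow(i_lllr)            # F0
        fd = self.body(f0)                   # deep features Fd
        i_r = self.to_residual(fd)           # residual image I_R, same size as the input
        return self.upsample(i_lllr + i_r)   # Î_NLHR = (I_LLLR + I_R) ↑ s

# Training uses an L1 pixel loss between the reconstruction and the NLHR target:
# loss = nn.L1Loss()(model(i_lllr), i_nlhr)
```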
The computational complexity of the original full self-attention mechanism grows quadratically with the input size and is therefore not feasible to use in combination with large training image patches. Several works have tried to reduce the computational complexity by using shifted, halo, and focal windows to perform self-attention. However, for most methods, the effective receptive field grows slowly, which hinders the long-range modelling capability. To reduce the computational burden, while maintaining strong long-range modelling capability, we use a Cross-Shaped Window (CSWin) attention mechanism. With CSWin, self-attention is calculated in horizontal and vertical stripes by splitting the multi-heads into parallel groups to achieve efficient global self-attention. We preferably gradually increase the widths of the stripes throughout the depth of the network to further enlarge the attention area while limiting the computational cost. To further enhance the reconstruction performance, we preferably combine the CSWin self-attention mechanism with a Locally-enhanced Feed-Forward (LeFF) module and Locally-Enhanced Positional Encoding (LePE) to form our ECSWin Transformer block. The different components will be described in detail in the following sections.
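Purely to illustrate the stripe-splitting idea (and not the full CSWin-Attention implementation), the following simplified sketch computes single-head attention within horizontal stripes for one half of the channels and within vertical stripes for the other half; the function name, the default stripe width, and the omission of multi-head grouping and positional encoding are simplifying assumptions:

```python
import math
import torch
import torch.nn.functional as F

def cross_shaped_window_attention(x, qkv_weight, sw=2):
    """Conceptual sketch of cross-shaped window self-attention (single head per group,
    no positional encoding), only to illustrate the horizontal/vertical stripe split.

    x: (B, H, W, C) feature map, with C even and H, W divisible by the stripe width sw.
    qkv_weight: (3C, C) joint query/key/value projection weights. Names are illustrative."""
    B, H, W, C = x.shape
    q, k, v = (x @ qkv_weight.t()).chunk(3, dim=-1)            # each (B, H, W, C)

    def stripe_attention(q, k, v):
        # Scaled dot-product attention computed independently inside each stripe.
        d = q.shape[-1]
        scores = q @ k.transpose(-2, -1) / math.sqrt(d)
        return F.softmax(scores, dim=-1) @ v

    # First half of the channels: horizontal stripes of height sw spanning the full width.
    qh, kh, vh = (t[..., : C // 2].reshape(B, H // sw, sw * W, C // 2)
                  for t in (q, k, v))
    out_h = stripe_attention(qh, kh, vh).reshape(B, H, W, C // 2)

    # Second half: vertical stripes of width sw spanning the full height.
    def to_vertical(t):
        return (t[..., C // 2:].permute(0, 2, 1, 3)            # (B, W, H, C/2)
                 .reshape(B, W // sw, sw * H, C // 2))
    out_v = stripe_attention(to_vertical(q), to_vertical(k), to_vertical(v))
    out_v = out_v.reshape(B, W, H, C // 2).permute(0, 2, 1, 3)

    return torch.cat([out_h, out_v], dim=-1)                   # (B, H, W, C)
```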
As illustrated in the drawings, each ECSWin Transformer block preferably comprises a CSWin self-attention module and a LeFF module, each preceded by layer normalization and equipped with a residual connection, and is formulated as:
X̂l = CSWin-Attention(LN(Xl−1)) + Xl−1,
Xl = LeFF(LN(X̂l)) + X̂l,
where LN represents the layer normalization, and X̂l and Xl are the outputs of the CSWin and LeFF modules, respectively. We design our RELIEF architecture to contain multiple ECSWin Transformer blocks at each encoder-decoder level. Next, we describe the locally-enhanced feed-forward network and positional encoding in ECSWin.
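The two residual sub-layers above may also be expressed directly in code; the following is a minimal sketch, assuming that the CSWin attention and LeFF modules (described next) are provided by the caller:

```python
import torch.nn as nn

class ECSWinBlock(nn.Module):
    """Sketch of one ECSWin Transformer block: pre-norm CSWin attention and a
    locally-enhanced feed-forward network, each wrapped in a residual connection.
    The attention and LeFF modules are assumed to be defined elsewhere."""
    def __init__(self, dim: int, attention: nn.Module, leff: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = attention   # CSWin-Attention module (with LePE), assumed
        self.norm2 = nn.LayerNorm(dim)
        self.leff = leff        # Locally-enhanced Feed-Forward module, assumed

    def forward(self, x):
        # x: (B, N, C) token sequence for one encoder/decoder level
        x = self.attn(self.norm1(x)) + x   # X̂_l = CSWin-Attention(LN(X_{l-1})) + X_{l-1}
        x = self.leff(self.norm2(x)) + x   # X_l  = LeFF(LN(X̂_l)) + X̂_l
        return x
```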
To better utilize local context, which is essential in image restoration, we replace the Multi-Layer Perceptron (MLP) based feed-forward network used in the vanilla Transformer block with a LeFF layer. In the LeFF layer, the feature dimension of the tokens is preferably increased with a linear projection layer and thereafter reshaped to 2D feature maps. Next, a 3×3 depth-wise convolutional operation is preferably applied to the reshaped feature maps. Lastly, the feature maps are preferably flattened to tokens, and the channels are reduced with a linear layer such that the dimension of the enhanced tokens matches the dimension of the input. A GELU activation function is preferably used after each linear and convolutional layer.
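A minimal sketch of such a LeFF layer is given below; the expansion factor of 4 is an assumption, and the placement of GELU after every linear and convolutional layer follows the description above:

```python
import torch.nn as nn

class LeFF(nn.Module):
    """Sketch of the Locally-enhanced Feed-Forward layer: linear expansion, reshape to
    2D feature maps, 3x3 depth-wise convolution, flatten back, and linear reduction,
    with GELU after each linear and convolutional layer (per the description above)."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())
        self.dwconv = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),
            nn.GELU(),
        )
        self.reduce = nn.Sequential(nn.Linear(hidden, dim), nn.GELU())

    def forward(self, x, h: int, w: int):
        # x: (B, N, C) tokens with N = h * w
        b, n, _ = x.shape
        x = self.expand(x)                              # (B, N, hidden)
        x = x.transpose(1, 2).reshape(b, -1, h, w)      # tokens -> 2D feature maps
        x = self.dwconv(x)                              # local context via depth-wise conv
        x = x.reshape(b, -1, n).transpose(1, 2)         # 2D feature maps -> tokens
        return self.reduce(x)                           # match the input token dimension
```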
As the self-attention mechanism inherently ignores positional information in the 2D image space, we preferably use positional encoding to add such information back. Different from the typical encoding mechanisms Absolute Positional Encoding (APE), Relative Positional Encoding (RPE), and Conditional Positional Encoding (CPE), which add positional information into the input tokens before the Transformer blocks, we preferably use LePE, implemented with a depth-wise convolution operator, to incorporate positional information within each Transformer block. LePE operates directly on the values V within the attention computation. As such, the self-attention computation is preferably formulated as:
Attention(Q, K, V) = SoftMax(QKᵀ/√d)V + DWC(V),
where DWC is the depth-wise convolution operator and d is the dimension of the queries and keys.
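This formula may be illustrated by the following simplified single-head sketch, in which the stripe partitioning is omitted and the module name is an assumption:

```python
import math
import torch.nn as nn
import torch.nn.functional as F

class AttentionWithLePE(nn.Module):
    """Sketch of self-attention with Locally-Enhanced Positional Encoding:
    Attention(Q, K, V) = SoftMax(QK^T / sqrt(d)) V + DWC(V), where DWC is a
    depth-wise convolution applied to V reshaped to a 2D feature map.
    Single head and no stripe partitioning, for illustration only."""
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.lepe = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, h: int, w: int):
        # x: (B, N, C) tokens with N = h * w
        b, n, c = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(c), dim=-1) @ v
        # Positional information injected inside the block, directly on the values V.
        v_map = v.transpose(1, 2).reshape(b, c, h, w)
        lepe = self.lepe(v_map).reshape(b, c, n).transpose(1, 2)
        return attn + lepe
```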
Datasets
The recent RELLISUR dataset is the only publicly available dataset of real degraded LLLR images and their high-quality NLHR counterparts. The RELLISUR dataset contains 850 distinct sequences of LLLR images, with five different degrees of under-exposure in each sequence, paired with NLHR images of three different scale levels. In our work, we experiment with ×4 upscaling, which is the most challenging scale factor in the dataset.
We follow a known pre-defined split, and as such the numbers of train, val, and test images are 3610, 215, and 425, respectively.
SICE is a dataset of 589 various scenes captured at different exposure levels, ranging from under-exposed to over-exposed, and including a correctly exposed Ground-Truth (GT) image. We follow a known train/test split, resulting in 58 test and 531 train images. We preferably use the GT normal-light images as is, but preferably use only the darkest exposure of each scene as the LL image during both training and testing. We synthetically create degraded LR versions of the LL images to obtain paired degraded LLLR and NLHR images. We degrade the LL images by first convolving the images with an 11×11 Gaussian blur kernel with a standard deviation of 1.5 before downsampling by a factor of ×4. Next, we model sensor noise by adding Gaussian noise with zero mean and a standard deviation of 8. Finally, we save the images in JPEG format with a quality setting of 70 to add compression artifacts. We discard a total of 8 images from the training set whose resolution is less than 256×256 pixels after the downsampling. Evaluation is performed on 256×256 center crops.
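A sketch of this synthetic degradation pipeline is shown below (using OpenCV); the use of bicubic interpolation for the ×4 downsampling is an assumption, as the interpolation method is not specified above:

```python
import cv2
import numpy as np

def degrade_low_light_image(img_bgr: np.ndarray, out_path: str, scale: int = 4) -> None:
    """Sketch of the synthetic degradation applied to the SICE low-light images:
    11x11 Gaussian blur (sigma 1.5), x4 downsampling, additive Gaussian noise
    (sigma 8), and JPEG compression at quality 70. Bicubic downsampling is an assumption."""
    blurred = cv2.GaussianBlur(img_bgr, (11, 11), sigmaX=1.5)
    h, w = blurred.shape[:2]
    lr = cv2.resize(blurred, (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)
    noisy = lr.astype(np.float32) + np.random.normal(0.0, 8.0, lr.shape)   # sensor noise model
    noisy = np.clip(noisy, 0, 255).astype(np.uint8)
    cv2.imwrite(out_path, noisy, [cv2.IMWRITE_JPEG_QUALITY, 70])           # compression artifacts
```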
Evaluation Metrics
We adopt two hand-crafted (PSNR, SSIM) and one learning-based (DISTS) Full-Reference Image Quality Assessment (FRIQA) metrics for our quantitative comparisons. PSNR is a measure of the peak error between the reconstructed image and the GT, while SSIM is more focused on visible structural differences. However, neither of these hand-crafted metrics correlates well with the perceived image quality. To this end, we preferably also use DISTS, which better captures the perceptual image quality as judged by human observers. For all metrics, we report scores computed on the RGB channels.
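For illustration, the two hand-crafted metrics may be computed on the RGB channels as sketched below; DISTS relies on a pretrained learning-based model and is therefore not reproduced here:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def fr_iqa_scores(pred_rgb: np.ndarray, gt_rgb: np.ndarray) -> dict:
    """Sketch of the hand-crafted FRIQA metrics, computed on the RGB channels.
    pred_rgb and gt_rgb are uint8 arrays of shape (H, W, 3)."""
    return {
        "PSNR": peak_signal_noise_ratio(gt_rgb, pred_rgb, data_range=255),
        "SSIM": structural_similarity(gt_rgb, pred_rgb, channel_axis=-1, data_range=255),
    }
```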
Implementation Details
We train our model from scratch for 5×10⁵ iterations with a batch size of 16. We preferably use the ADAM optimizer with a learning rate of 2e-4, which we decrease by a factor of 0.5 at 2×10⁵, 4×10⁵ and 4.5×10⁵ iterations. For data augmentation, we perform rotation and horizontal and vertical flips. We preferably use 4 encoder-decoder levels in our RELIEF implementation, with two ECSWin Transformer blocks at each level, and one in the bottleneck. The number of attention heads and the stripe widths in the encoder are preferably set to [4,8,16,32] and [1,2,8,8], respectively, which are mirrored in the decoder. In the bottleneck, 32 heads and a stripe width of 8 are preferably used. We preferably use a channel dimension C=48 for the first encoder level in all experiments. As such, the resulting number of feature channels from level-1 to level-5 becomes [48,96,192,384,768].
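An illustrative sketch of this training configuration is given below; the model, the data loader (batch size 16, with rotation and flip augmentation applied inside it), and the paired (LLLR, NLHR) batches are assumptions for the purpose of the example:

```python
import torch
import torch.nn as nn

def cycle(loader):
    # Endlessly iterate over the training DataLoader.
    while True:
        for batch in loader:
            yield batch

# `model` and `train_loader` are assumed to be defined elsewhere.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[200_000, 400_000, 450_000], gamma=0.5)
criterion = nn.L1Loss()

for iteration, (lllr, nlhr) in enumerate(cycle(train_loader), start=1):
    optimizer.zero_grad()
    loss = criterion(model(lllr), nlhr)   # L1 pixel loss against the NLHR target
    loss.backward()
    optimizer.step()
    scheduler.step()
    if iteration == 500_000:              # 5x10^5 iterations in total
        break
```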
Comparison with Existing Methods
To the best of our knowledge, no existing method in the literature can handle reconstruction of real LLLR RGB images. We therefore compare our proposed method against dedicated methods for LLE, SR, and general image restoration. MIRNet and ESRGAN are SoTA methods for LLE and SR, respectively. To enable upsampling together with LLE, we append a Pixel-shuffle layer to MIRNet. As the VGG-discriminator in ESRGAN is not compatible with large training patches, we preferably use a known patch discriminator instead. SwinIR is a SoTA Transformer-based method for general image restoration, e.g. SR, JPEG compression artifact reduction, and denoising. We preferably use the real-world SR configuration and Pixel-shuffle upsampling for SwinIR. We preferably use a LR training patch size of 256×256 pixels, and re-train all competing methods using the same training hyper-parameters as used for our RELIEF for a fair comparison. MIRNet and SwinIR are optimized with L1 loss, while ESRGAN is optimized with a combination of L1, perceptual, and adversarial losses as originally proposed for that method. We emphasize that none of the above-mentioned existing methods are designed for joint LLE and SR, but once trained on such data they can still serve as baselines against our proposed method.
Results
Quantitative results. As seen in Table 2, RELIEF significantly outperforms the other methods on all metrics. Our method obtains gains in PSNR of 0.28 and 0.78 dB on the RELLISUR and SICE datasets, respectively. Similarly, our RELIEF also achieves the best perceptual quality, according to the DISTS metric, even though our method is not optimized with perceptual losses like ESRGAN.
As seen in Table 1, our RELIEF has the highest number of parameters, but a significantly lower computational burden than any of the compared methods, e.g. 5.7 vs. 47.2 GMACs for SwinIR. However, as shown empirically in the ablation study on model parameters below, comparable performance can be obtained with a RELIEF variant with less than half the number of parameters.
Qualitative results. We show visual comparisons of the different methods on both the RELLISUR and SICE datasets in the accompanying drawings.
Ablation Studies
In this section, we investigate the effectiveness and necessity of the components in RELIEF. All evaluations are conducted on RELLISUR using a LR training patch size of 64×64 and a channel dimension C=48, unless otherwise stated.
Impact of skip-connections and bottleneck layer. Table 3 shows three variants of our network: no skip-connections, no bottleneck layer, and the proposed RELIEF network. From the table it can be seen that the skip-connections and bottleneck layer are both important as the PSNR drops by 0.64 and 0.59 dB by removal of these network components, respectively.
Model parameters. We experiment with different numbers of model parameters to find a trade-off between accuracy and complexity by varying the channel number C. As shown in Table 4, we design three variants of RELIEF: RELIEF-S, RELIEF-M, and RELIEF-L. We observe that the PSNR is correlated with the number of parameters up to a certain point, but also that the number of parameters and the GMACs grow quadratically. We choose a channel number of 48 to balance performance and model size.
Training patch size.
Attention and locality. We compare different multi-headed self-attention mechanisms, feed-forward networks, and positional-encoding mechanisms for the Transformer blocks in RELIEF to show the effect on the reconstruction performance. As seen in Table 6, the best performing configuration with cross-shaped window attention, and enhanced locality in the feed-forward network and positional-embedding yields 0.97 dB improvement over the configuration with shifted-window attention, MLP feed-forward network and relative-positional encoding without locality enhancement. Compared to CSWin, our ECSWin block with locality enhanced feed-forward network results in 0.15 dB PSNR gain.
The invention introduces RELIEF, a novel U-shaped multi-scale hierarchical Transformer network, particularly applicable to the reconstruction of real LLLR images. With its efficient ECSWin Transformer blocks, capable of capturing long-range dependencies and local context, RELIEF can utilize large training patch sizes, which leads to better reconstruction performance and makes it capable of revealing previously hidden details in real low-visibility images. Experimental results on two benchmark datasets show that the method according to the invention outperforms state-of-the-art methods in terms of reconstruction accuracy and visual quality.
The invention also provides a non-transitory computer-readable medium storing a program that, when run on a computer, causes the computer to carry out a method, the method comprising: acquiring a first image whose spatial resolution and lightness are to be enhanced; generating a residual image from the first image using a multi-scale hierarchical neural network for joint learning of low-light enhancement and super-resolution, the network comprising an encoder stage and a decoder stage forming a plurality of symmetrical encoder-decoder levels, each encoder and decoder in each level comprising a vision transformer block; and generating a reconstructed image based on the first and residual images.
The method may be carried out according to any one of the previous embodiments and features.
The invention also provides an image processing apparatus comprising: acquisition means configured to acquire a first image whose spatial resolution and lightness are to be enhanced; first generation means configured to generate a residual image from the first image using a multi-scale hierarchical neural network for joint learning of low-light enhancement and super-resolution, the network comprising an encoder stage and a decoder stage forming a plurality of symmetrical encoder-decoder levels, each encoder and decoder in each level comprising a vision transformer block; and second generation means configured to generate a reconstructed image based on the first and residual images.
The image processing apparatus according to the invention may be configured to perform some or all of the steps or operations described in connection with the image processing method of the invention. That is to say, the features described in connection with the image processing method can also be part of or be performed by the apparatus. The apparatus may for instance be configured to run the above-mentioned computer program, preferably from the above-mentioned non-transitory computer-readable medium.
The present invention also provides a video surveillance system comprising one or more video cameras and the aforementioned image processing apparatus, which also preferably runs a video management system (VMS) (which can be in the form of a software, hardware, or a combination of both) receiving one or more video streams and/or metadata from the said one or more video cameras. For instance, XProtect® is a VMS developed and distributed by the Applicant that can be used to retrieve and play live and recorded video surveillance data from one or more video cameras and optionally from one or more recording servers in the video surveillance system.
In such a video surveillance system, the image processing apparatus is configured to process at least some of the frames included in the received video surveillance data, each first image to be processed corresponding to at least a part of a frame of the received surveillance data. Preferably, the image processing apparatus may process several or all frames of the received video surveillance data, each respective first image to be processed corresponding to at least a part of a respective frame of the received surveillance data. In other words, the image processing apparatus may process the images of the one or more video surveillance cameras on a continuous basis or for at least a period of time, in a real-time or delayed manner as need be, for the frames received from the one or more video cameras. That is to say, the image processing apparatus may process one or more video streams, the first image(s) to be processed being acquired from the one or more video streams.
Within the context of the present invention, the term “first image” should be construed as being a full frame or at least a part of such a frame, and corresponds for instance to the LLLR image described above. Preferably, the “first image” corresponds to a part of a frame (as captured by a video surveillance camera or otherwise), but several “first images” may also correspond to different parts of the same frame. This makes it possible to limit the computational burden to only those parts of the frame(s) which need to be subjected to LLE and SR.
As an example embodiment, assume a video camera overlooking a parking lot, where a part of a captured image is in the sun and another part is in the shade or is otherwise a low-light part of the image. The video camera may adapt its exposure settings to the bright part of the picture, i.e. the part in the sun. The operator of the VMS will then have a hard time seeing details in the low-light part of the image. To solve this issue, the operator may run the aforementioned method on one or more parts of the image (or preferably video) where there is low light. Alternatively, the image processing apparatus may be configured to automatically run the aforementioned method without intervention of the operator.
As another example embodiment, the video surveillance system may be installed in a casino having at least one video camera overlooking at least one gambling table. Light conditions are good on the surface of the table, but the operator may also want to examine what the players are doing with their hands next to the table, which would be the low light part of the image(s). Thus, the image processing apparatus may be configured to select, as the first image(s), the one or more parts of the captured video surveillance data where there is low light.
While the present invention has been described with reference to examples and embodiments, it is to be understood that the invention is not limited to the disclosed examples and embodiments. The present invention can be implemented in various forms without departing from the principal features of the present invention as defined by the claims.