Embodiments of the invention relate to neural network operations for image quality enhancement.
Deep Convolutional Neural Networks (CNNs) have been widely adopted for image processing tasks such as image refinement and super-resolution. CNNs have been used to restore images degraded by blur, noise, low resolution, and the like. CNNs have also been shown to be effective in solving single image super-resolution (SISR) problems, in which a high-resolution (HR) image is reconstructed from a low-resolution (LR) image.
Some CNN-based methods assume that a degraded image is subject to one fixed combination of degrading effects, e.g., blurring and bicubic down-sampling. These methods have limited capability in handling images in which the degrading effects vary from one image to another. These methods also cannot handle an image that has one combination of degrading effects in one region and another combination of degrading effects in another region of the same image.
Another approach is to train an individual network for each combination of degrading effects. For example, if images may be degraded by any of three combinations of degrading effects (bicubic down-sampling; bicubic down-sampling and noise; direct down-sampling and blurring), three separate networks are trained, one for each combination.
Therefore, there is a need for improving the existing methods for refining an image that is subject to variational degradation effects.
In one embodiment, a method is provided for image refinement. The method includes the steps of: receiving an input including a degraded image concatenated with a degradation estimation of the degraded image; performing feature extraction operations to apply pre-trained weights to the input to generate feature maps; and performing operations of a refinement network that includes a sequence of dynamic blocks. One or more of the dynamic blocks dynamically generates per-grid kernels to be applied to corresponding grids of an intermediate image output from a prior dynamic block in the sequence. Each per-grid kernel is generated based on the intermediate image and the feature maps.
In another embodiment, a system includes memory to store parameters of a feature extraction network and a refinement network. The system further includes processing hardware coupled to the memory. The processing hardware is operative to: receive an input including a degraded image concatenated with a degradation estimation of the degraded image; perform operations of the feature extraction network to apply pre-trained weights to the input to generate feature maps; and perform operations of the refinement network that includes a sequence of dynamic blocks. One or more of the dynamic blocks dynamically generates per-grid kernels to be applied to corresponding grids of an intermediate image output from a prior dynamic block in the sequence. Each per-grid kernel is generated based on the intermediate image and the feature maps.
Other aspects and features will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments in conjunction with the accompanying figures.
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that different references to “an” or “one” embodiment in this disclosure are not necessarily to the same embodiment, and such references mean at least one. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
In the following description, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail in order not to obscure the understanding of this description. It will be appreciated, however, by one skilled in the art, that the invention may be practiced without such specific details. Those of ordinary skill in the art, with the included descriptions, will be able to implement appropriate functionality without undue experimentation.
Embodiments of the invention provide a framework of a Unified Dynamic Convolutional Network for Variational Degradation (UDVD). The UDVD performs single image super-resolution (SISR) operations for a wide range of variational degradation. Furthermore, the UDVD can restore image quality from blurring and noise degradation. The variational degradation can occur inter-image and/or intra-image. Inter-image variational degradation is also known as cross-image variational degradation. For example, a first image may be low resolution and blurred, and a second image may be noisy. Intra-image variational degradation is degradation with spatial variations in an image. For example, one region in an image may be blurred and another region in the same image may be noisy. The UDVD can be trained to enhance the quality of images that suffer from inter-image and/or intra-image variational degradation. The UDVD incorporates dynamic convolution, which provides more flexibility in handling different degradation variations than standard convolution. In SISR with a non-blind setting, the UDVD has demonstrated its effectiveness on both synthetic and real images.
Dynamic convolutions have been an active area of neural network research. Brabandere et al., “Dynamic filter networks,” in Proc. Conf. Neural Information Processing Systems (NIPS) 2016, describes a dynamic filter network that dynamically generates filters conditioned on an input. Dynamic filter networks are adaptive to input content and therefore offer increased flexibility.
The UDVD generates dynamic kernels based on the concept of dynamic filter networks, with modifications. The dynamic kernels disclosed herein adapt not only to image content but also to diverse variations of degrading effects. The dynamic kernels are effective in handling inter-image and intra-image variational degradation.
The standard convolution uses kernels that are learned from training. Each kernel is applied to all pixel locations. In contrast, the dynamic convolution disclosed herein uses per-grid kernels that are generated by a parameter-generating network. Moreover, the kernels of standard convolution are content-agnostic and are fixed after training is completed. In contrast, the dynamic convolution kernels are content-adaptive and can adapt to different inputs during inference. Due to these properties, the dynamic convolution is a better alternative to the standard convolution in handling variational degradation.
In the following description, two types of dynamic convolutions are disclosed. Moreover, multistage losses are integrated to gradually refine images throughout consecutive dynamic convolutions. Extensive experiments show that the UDVD achieves favorable or comparable performance on both synthetic and real images.
In a practical use case, degrading effects such as blurring, noise, and down-sampling can simultaneously occur. The degradation process is formulated as:
I_LR = (I_HR ⊗ k) ↓_s + n,    (1)
where I_HR and I_LR represent the high resolution (HR) and low resolution (LR) images, respectively, k represents a blur kernel, and n represents additive noise. Equation (1) indicates that the LR image is equal to the HR image convolved with a blur kernel, downsampled by a scale factor s, with noise added. An example of the blur kernel is the isotropic Gaussian blur kernel. An example of additive noise is additive white Gaussian noise (AWGN) with noise level σ. An example of downsampling is the bicubic downsampler. Other degradation operators may also be used to synthesize realistic degradations for SISR training. For real images, a search on degradation parameters is performed area by area to obtain visually satisfying results. In this disclosure, a non-blind setting is adopted. Any degradation estimation method can be prepended to extend the disclosed method to a blind setting.
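As a concrete illustration of Equation (1), the following is a minimal Python sketch that synthesizes a degraded LR image from an HR image, assuming an isotropic Gaussian blur, direct (stride) down-sampling, and AWGN; the function name and default parameter values are illustrative only, and a bicubic downsampler may be substituted for the stride subsampling.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(hr_image: np.ndarray, blur_sigma: float = 1.6,
            scale: int = 4, noise_sigma: float = 0.02) -> np.ndarray:
    """Synthesize an LR image from an HR image (H x W array, values in [0, 1])
    following Equation (1): I_LR = (I_HR ⊗ k) ↓_s + n."""
    blurred = gaussian_filter(hr_image, sigma=blur_sigma)          # I_HR ⊗ k (isotropic Gaussian blur)
    downsampled = blurred[::scale, ::scale]                        # ↓_s (direct down-sampling)
    noise = np.random.normal(0.0, noise_sigma, downsampled.shape)  # n (AWGN with noise level σ)
    return np.clip(downsampled + noise, 0.0, 1.0)
```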
The degraded image (denoted as I0) is concatenated with a degradation map (D). The degradation map D, also referred to as a degradation estimation, may be generated based on known degradation parameters of the degraded image; e.g., a known blur kernel and a known noise level σ. For example, the blur kernel may be projected to a t-dimensional vector by using the principal component analysis (PCA) technique. An extra dimension for the noise level σ is concatenated to the t-dimensional vector to obtain a (1+t) vector. The (1+t) vector is then stretched to obtain a degradation map D of size (1+t)×H×W.
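The construction of the degradation map D described above may be sketched as follows; the PCA basis is assumed to have been precomputed from a set of training blur kernels, and the names are illustrative.

```python
import numpy as np

def make_degradation_map(blur_kernel: np.ndarray, noise_sigma: float,
                         pca_basis: np.ndarray, height: int, width: int) -> np.ndarray:
    """Project a k x k blur kernel to t dimensions, append the noise level,
    and stretch the (1 + t) vector to a (1 + t) x H x W degradation map D.
    pca_basis is a t x (k*k) projection matrix learned offline with PCA."""
    t_vec = pca_basis @ blur_kernel.flatten()                  # t-dimensional projection
    d_vec = np.concatenate(([noise_sigma], t_vec))             # (1 + t) vector
    return np.tile(d_vec[:, None, None], (1, height, width))   # stretch to (1 + t) x H x W
```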
The feature extraction network 110 includes an input convolution 111 and N residual blocks 112. The input convolution 111 is performed on the degraded image (I0) concatenated with the degradation map (D). The convolution result is sent to the N residual blocks 112, and is added to the output of the N residual blocks 112 to generate feature maps (F).
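One possible realization of the feature extraction network 110 is shown below as a PyTorch sketch: an input convolution followed by N residual blocks, with the convolution result added to the residual blocks' output. The channel count, kernel sizes, and activation are assumptions rather than required choices.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class FeatureExtraction(nn.Module):
    def __init__(self, in_channels: int, channels: int = 64, num_blocks: int = 8):
        super().__init__()
        self.input_conv = nn.Conv2d(in_channels, channels, 3, padding=1)   # input convolution 111
        self.res_blocks = nn.Sequential(
            *[ResidualBlock(channels) for _ in range(num_blocks)])          # N residual blocks 112

    def forward(self, x):
        # x is the degraded image I0 concatenated with the degradation map D.
        head = self.input_conv(x)
        return head + self.res_blocks(head)   # feature maps F
```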
The refinement network 120 includes a sequence of M dynamic blocks 123 to perform feature transformation. Each dynamic block 123 receives the feature maps (F) as one input. In one embodiment, the dynamic block 123 is extended to perform upsampling with an upsampling rate r. Each dynamic block 123 can learn to upsample and reconstruct the variationally degraded image.
Each dynamic block 123 includes a first path and a second path. The first path predicts dynamic kernels 350 and then performs dynamic convolution by applying the dynamic kernels 350 to the image I_{m−1}. The dynamic convolution can be regular or upsampling; examples of these two types of dynamic convolutions are described below.
The second path contains two 3×3 convolution layers (shown as CONV*2 330) with 16 and 3 channels, respectively, to generate a residual image R_m for enhancing high-frequency details. The residual image R_m is then added to the output of the dynamic convolution, O_m, to generate an image I_m. A sub-pixel convolution layer may be used to align the resolutions between the two paths.
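As one way to visualize the two paths, the following PyTorch sketch combines a kernel-prediction convolution (first path) with the two 3×3 convolution layers of 16 and 3 channels (second path) for a regular, non-upsampling dynamic block. The kernel size, the structure of the kernel-prediction layer, and the omission of the sub-pixel alignment layer are illustrative assumptions, not the exact architecture of the dynamic block 123.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicBlock(nn.Module):
    """One dynamic block with a kernel-prediction path and a residual path."""
    def __init__(self, feat_channels: int = 64, img_channels: int = 3, kernel_size: int = 5):
        super().__init__()
        self.k = kernel_size
        in_ch = feat_channels + img_channels
        # First path: predict one k x k kernel per pixel location (weights shared across channels).
        self.kernel_net = nn.Conv2d(in_ch, kernel_size * kernel_size, 3, padding=1)
        # Second path: two 3x3 convolutions with 16 and 3 channels, producing the residual image R_m.
        self.residual_net = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, img_channels, 3, padding=1),
        )

    def forward(self, feats: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # feats: feature maps F (B x feat_channels x H x W); image: I_{m-1} (B x 3 x H x W).
        x = torch.cat([feats, image], dim=1)
        kernels = self.kernel_net(x)                       # per-grid kernels, B x k*k x H x W
        b, c, h, w = image.shape
        patches = F.unfold(image, self.k, padding=self.k // 2).view(b, c, self.k * self.k, h, w)
        o_m = (patches * kernels.unsqueeze(1)).sum(dim=2)  # dynamic convolution output O_m
        r_m = self.residual_net(x)                         # residual image R_m
        return o_m + r_m                                   # I_m = O_m + R_m
```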
In a regular dynamic convolution, convolutions are conducted by using dynamic kernels K of kernel size k×k. Such an operation can be expressed as:
I_out(i, j) = Σ_{u=−Δ}^{Δ} Σ_{v=−Δ}^{Δ} K_{i,j}(u, v) · I_in(i − u, j − v),    (2)
where I_in and I_out represent the input and output images, respectively, i and j are the pixel coordinates in the image, and u and v are the coordinates within each kernel K_{i,j}, with Δ = floor(k/2). Applying these dynamic kernels is equivalent to computing a weighted sum over nearby pixels to enhance the image quality; different kernels are applied to different grids of the image. In a default setting, there are H×W kernels, and the corresponding weights are shared across channels. By introducing an additional channel dimension C into Equation (2), the dynamic convolution can be extended to use independent weights across channels.
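A minimal PyTorch sketch of Equation (2) follows: the k×k neighborhood of every pixel is gathered, and the per-grid kernel is applied as a weighted sum with weights shared across channels. The tensor layout (kernels packed into a k·k channel dimension) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def dynamic_conv(image: torch.Tensor, kernels: torch.Tensor, kernel_size: int) -> torch.Tensor:
    """Regular dynamic convolution per Equation (2).
    image:   B x C x H x W input I_in
    kernels: B x (k*k) x H x W, one k x k kernel K_{i,j} per pixel location."""
    b, c, h, w = image.shape
    pad = kernel_size // 2                                   # Δ = floor(k / 2)
    # Gather the k x k neighborhood of every pixel: B x C x (k*k) x H x W.
    patches = F.unfold(image, kernel_size, padding=pad).view(b, c, kernel_size * kernel_size, h, w)
    # Weighted sum over the neighborhood with the per-grid kernel, shared across channels.
    return (patches * kernels.unsqueeze(1)).sum(dim=2)       # I_out, B x C x H x W
```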
In a dynamic convolution with upsampling, r×r convolutions are performed on the same corresponding patch to create r×r new pixels, where the patch is the area to which the dynamic kernel is applied. The mathematical form of such an operation is:
I_out(i·r + x, j·r + y) = Σ_{u=−Δ}^{Δ} Σ_{v=−Δ}^{Δ} K_{i,j,x,y}(u, v) · I_in(i − u, j − v),    (3)
where x and y are the coordinates within each r×r output block (0 ≤ x, y ≤ r−1). Here, the resolution of I_out is r times the resolution of I_in. A total of r²HW kernels are used to generate the rH×rW pixels of I_out. When performing the dynamic convolution with upsampling, the weights may be shared across channels to avoid excessively high dimensionality.
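A corresponding sketch of Equation (3) is shown below: r·r kernels are predicted per input pixel, each producing one of the r×r output sub-pixels, and the results are rearranged into an rH×rW image with a pixel-shuffle operation. The tensor layout is again an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def dynamic_upsample_conv(image: torch.Tensor, kernels: torch.Tensor,
                          kernel_size: int, scale: int) -> torch.Tensor:
    """Dynamic convolution with upsampling per Equation (3).
    image:   B x C x H x W input I_in
    kernels: B x (r*r*k*k) x H x W, i.e., r*r kernels per pixel location."""
    b, c, h, w = image.shape
    k2, r2 = kernel_size * kernel_size, scale * scale
    pad = kernel_size // 2                                   # Δ = floor(k / 2)
    patches = F.unfold(image, kernel_size, padding=pad).reshape(b, c, 1, k2, h, w)
    weights = kernels.reshape(b, 1, r2, k2, h, w)
    out = (patches * weights).sum(dim=3)                     # B x C x (r*r) x H x W
    out = out.reshape(b, c * r2, h, w)
    return F.pixel_shuffle(out, scale)                       # I_out, B x C x rH x rW
```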
To apply multi-stage losses, the overall loss is defined as:
Loss = Σ_{m=1}^{M} F(I_m, I_HR),    (4)
where M is the number of dynamic blocks 123 and F is a loss function such as the L2 loss or a perceptual loss. To obtain a high-quality resultant image, the sum of the losses from all dynamic blocks 123 is minimized. The sum of losses is used to update the convolution weights in each dynamic block 123.
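A minimal sketch of the multi-stage loss of Equation (4) is shown below, assuming the L2 loss and that the output image I_m of each of the M dynamic blocks 123 has been collected during the forward pass.

```python
import torch.nn.functional as F

def multistage_loss(stage_outputs, hr_image):
    """stage_outputs: list of M tensors [I_1, ..., I_M], one per dynamic block.
    hr_image: the ground-truth HR image I_HR."""
    # Sum the per-stage losses; L2 (MSE) is used here, but a perceptual loss may be substituted.
    return sum(F.mse_loss(i_m, hr_image) for i_m in stage_outputs)
```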
The processing hardware 710 is coupled to a memory 720, which may include memory devices such as dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, and other non-transitory machine-readable storage media; e.g., volatile or non-volatile memory devices. To simplify the illustration, the memory 720 is represented as one block; however, it is understood that the memory 720 may represent a hierarchy of memory components such as cache memory, system memory, solid-state or magnetic storage devices, etc. The processing hardware 710 executes instructions stored in the memory 720 to perform operating system functionalities and run user applications. For example, the memory 720 may store framework parameters 725, which are the trained parameters of the framework 100.
In some embodiments, the memory 720 may store instructions which, when executed by the processing hardware 710, cause the processing hardware 710 to perform image refinement operations according to the method 600 described above.
The operations of the flow diagram of the image refinement method have been described with reference to the exemplary embodiments. However, it should be understood that these operations can be performed by embodiments of the invention other than those discussed, and that the embodiments discussed can perform operations different from those of the flow diagram.
While the invention has been described in terms of several embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described, and can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative instead of limiting.