This invention relates to a method and an apparatus for performing single-image super-resolution.
First efforts in Super-Resolution (SR) focused on classical multi-image reconstruction-based techniques [1,2]. In this approach, different observations of the same scene captured with sub-pixel displacements are combined to generate a super-resolved image. Since the captured images need to be registered, the applicability of this approach is constrained to very simple types of motion between them, which makes it typically unsuitable for up-scaling frames in most video sequences. Its performance also degrades quickly whenever the magnification factor is large [3,4] or the number of available images is insufficient.
The SR research community has overcome some of these limitations by exploring so-called Single-Image Super-Resolution (SISR). This alternative provides many possible solutions to the ill-posed problem of estimating a high-resolution (HR) version of a single input low-resolution (LR) image by introducing different kinds of prior information.
One common approach in SISR is based on machine learning techniques, which aim to learn the relation between LR and HR images, usually at a patch level, using a training set of HR images from which the LR versions are computed [5,6,7]. Thus, performance is closely related to the content of the training information. To increase the generalization capability, the training set needs to be enlarged, resulting in a growing computational cost. When considering all possible image scenarios (ranging e.g. from animals to circuitry), finding a generalizable training set can then be infeasible. Current research on sparse representation [8] tackles this problem by representing image patches as a sparse linear combination of base patches from an optimal over-complete dictionary. Even though sparse representation drastically reduces the dictionary size, and with it the querying times, the execution time of the whole method is still lengthy. In addition, the cost of finding the sparse representation is still conditioned by the size of the training dataset. Thus, there might still be generalization issues.
There also exist methods with internal learning (i.e. the patch correspondences/examples are obtained from the input image itself) which exploit the cross-scale self-similarity property [9,10].
The present invention follows this strategy, aiming at a better trade-off between execution time and quality. In principle, performing super-resolution of a single image comprises generating a high-resolution version of an observed image by exploiting cross-scale self-similarity. According to the invention, a low-frequency band of the super-resolved image is interpolated, and the missing high-frequency band is estimated by combining high-frequency examples extracted from the input image; it is then added to the interpolated low-frequency band. Further according to the invention, adaptively selected up-scaling and analysis filters are used, e.g. for local error measurement. In particular, the up-scaling and analysis filters provide a range of parametric kernels with different levels of selectivity, among which the most suitable ones are adaptively selected. More selective filters provide a good texture reconstruction in the super-resolved image, whereas filters with small selectivity avoid ringing but tend to miss texture details.
In one embodiment, the invention uses internal learning, followed by adaptive filter selection, which leads to better generalization to the non-stationary statistics of real-world images.
Advantages of the invention are apparent from quantitative results (PSNR, SSIM and execution time), as well as from qualitative evidence obtained with different datasets, which support the validity of the proposed approach in comparison with two well-known state-of-the-art SISR methods. These results show that the proposed method is orders of magnitude faster than the known comparison SISR methods [8,11], while the visual quality of the super-resolved images is comparable to that of the internal learning SISR method [11] and slightly superior to that of the dictionary-based SISR method [8]. The latter is affected by the limited generalization capability problem.
Exemplary embodiments of the invention are described with reference to the accompanying drawings.
The present invention relates to a new method for estimating a high-resolution version of an observed image by exploiting cross-scale self-similarity. The inventors extend prior work [14] on single-image super-resolution by introducing an adaptive selection of the best fitting up-scaling and analysis filters for example learning. This selection is based on local error measurements obtained by using each filter with every image patch, and contrasts with the common approach of a constant metric in both dictionary-based and internal learning super-resolution.
The invention is interesting for interactive applications, offering a low computational load and a parallelizable design that allows e.g. straightforward GPU implementations. The invention can be applied to digital input data structures of various dimensions (i.e. 1D, 2D or 3D), including digital 2D images. Experimental results show how the disclosed method and apparatus of the invention generalize better to different datasets than dictionary-based up-scaling, and comparably to internal learning with adaptive post-processing.
In principle, the method for generating a super-resolution version of a single low-resolution digital input data structure S0 according to the present invention works as follows.
When using interpolation-based up-scaling methods, the resulting HR image presents a frequency spectrum with shrunk support. Interpolation does not provide any mechanism to fill in the missing high-frequency band up to the wider Nyquist limit for the up-scaled image. In the method and apparatus according to the invention, the missing high-frequency band is estimated by combining high-frequency examples extracted from the input image and added to the interpolated low-frequency band, based on a similar mechanism to the one introduced in [12]. As known from [9], most images present the cross-scale self-similarity property. This basically results in a high probability of finding very similar patches across different scales of the same image. Let xl=hs*(y↑s) be an up-scaled version of the input image y, with hs a linear interpolation kernel and s the up-scaling factor. The subscript l refers to the fact that this up-scaled image only contains the low-frequency band of the spectrum (with normalized bandwidth 1/s). For now, it will just be assumed that hs has a low-pass filter behavior. More details about the filter will be given below.
The input image y can be analyzed in two separate bands by using the same interpolation kernel used for up-scaling. The low-frequency yl=hs*y and high-frequency yh=y−yl bands can be computed. By doing so, pairs of low-frequency references (in yl) and their corresponding high-frequency examples (in yh) are generated. yl has the same normalized bandwidth as xl and, most importantly, the cross-scale self-similarity property is also present between these two images.
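For illustration, the band-splitting step can be sketched as follows. This is a minimal NumPy/SciPy sketch, not the patented implementation; the kernel h is a generic windowed-sinc stand-in for hs (the raised-cosine kernels discussed further below would be plugged in the same way), and all function and variable names are illustrative.

```python
import numpy as np
from scipy.signal import convolve2d

def upsample_zero_insert(y, s):
    """Insert s-1 zeros between samples along both axes (the y↑s operator, integer s)."""
    up = np.zeros((y.shape[0] * s, y.shape[1] * s), dtype=float)
    up[::s, ::s] = y
    return up

def separable_filter(img, h):
    """Apply a 1-D kernel h along rows and then columns (hs is assumed separable)."""
    h = np.asarray(h, dtype=float).reshape(1, -1)
    tmp = convolve2d(img, h, mode='same', boundary='symm')
    return convolve2d(tmp, h.T, mode='same', boundary='symm')

def split_bands(y, h, s):
    """Return x_l (interpolated low band) and the analysis bands y_l, y_h of the input."""
    # x_l = hs * (y↑s): zero insertion followed by low-pass interpolation.
    # The 1-D kernel gain is scaled by s (per axis) to compensate for the inserted zeros.
    x_l = separable_filter(upsample_zero_insert(y, s), s * h)
    y_l = separable_filter(y, h)      # y_l = hs * y (same kernel used as analysis filter)
    y_h = y - y_l                     # y_h = y - y_l
    return x_l, y_l, y_h

# Stand-in low-pass kernel of bandwidth 1/s (windowed sinc), unit DC gain for analysis.
s = 2
t = np.arange(-6, 7)
h = np.sinc(t / s) * np.hanning(t.size)
h /= h.sum()
y = np.random.rand(64, 64)            # stand-in for the observed low-resolution image
x_l, y_l, y_h = split_bands(y, h, s)
```

The pairs of values in y_l and y_h then provide the low-frequency references and high-frequency examples used in the search described next.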
Let xl,i be a patch of Np×Np pixels with its central pixel at a location λ(xl,i)=(ri, ci) within xl. We look for the best matching patch yl,j in the low-resolution low-frequency band yl, i.e. yl,j=argminj∥xl,i−yl,j∥1, with the search restricted to a local window around the position in yl corresponding to λ(xl,i).
The local estimate of the high-frequency band corresponding to a patch is just xh,i=yh,j. However, in order to ensure continuity and also to reduce the contribution of inconsistent high-frequency examples, the patch selection is done with a sliding window, which means up to Np×Np high-frequency estimates are available for each pixel location λi. Let ei be a vector with these n≤Np×Np high-frequency examples and 1 an all-ones vector. We can find the estimated high-frequency pixel as xi=argminx∥ei−x1∥2, whose closed-form solution is simply the average of the n available high-frequency examples.
Once the procedure above is applied for each pixel in the up-scaled image, the resulting high-frequency band xh might contain low-frequency spectral components, since (1) the filters are not ideal and (2) the operations leading to xh are nonlinear. Thus, in order to improve the spectral compatibility between xl and xh, the low-frequency spectral component is subtracted from xh before adding it to the low-frequency band, i.e. x:=xl+xh−hs*xh, to generate the reconstructed image.
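The high-frequency synthesis just described can be sketched as below, reusing split_bands and separable_filter from the previous sketch. It assumes an exhaustive L1 (MAD) search in a local window of side Nw around the co-located position, a one-pixel sliding window over the up-scaled image, and uniform averaging of the overlapping examples; the helper names are illustrative.

```python
import numpy as np

def best_match(x_l, y_l, center, s, Np=3, Nw=15):
    """Exhaustive L1 search in y_l for the patch most similar to the x_l patch centred at `center`."""
    r = Np // 2
    i, j = center
    patch = x_l[i - r:i + r + 1, j - r:j + r + 1]
    ci, cj = int(round(i / s)), int(round(j / s))      # co-located position in y_l
    best, best_cost = (r, r), np.inf
    for u in range(max(r, ci - Nw // 2), min(y_l.shape[0] - r, ci + Nw // 2 + 1)):
        for v in range(max(r, cj - Nw // 2), min(y_l.shape[1] - r, cj + Nw // 2 + 1)):
            cost = np.abs(patch - y_l[u - r:u + r + 1, v - r:v + r + 1]).sum()
            if cost < best_cost:
                best_cost, best = cost, (u, v)
    return best

def synthesize_high_band(y, h, s, Np=3, Nw=15):
    """Estimate the missing high-frequency band from cross-scale examples and add it to x_l."""
    x_l, y_l, y_h = split_bands(y, h, s)               # from the previous sketch
    acc = np.zeros_like(x_l)                           # accumulated high-frequency examples
    cnt = np.zeros_like(x_l)                           # contributions per pixel
    r = Np // 2
    for i in range(r, x_l.shape[0] - r):               # sliding window, one-pixel steps
        for j in range(r, x_l.shape[1] - r):
            u, v = best_match(x_l, y_l, (i, j), s, Np, Nw)
            acc[i - r:i + r + 1, j - r:j + r + 1] += y_h[u - r:u + r + 1, v - r:v + r + 1]
            cnt[i - r:i + r + 1, j - r:j + r + 1] += 1
    x_h = acc / np.maximum(cnt, 1)                     # average of the overlapping examples
    x_h -= separable_filter(x_h, h)                    # remove residual low frequencies
    return x_l + x_h                                   # x := x_l + x_h - hs * x_h
```

Calling synthesize_high_band(y, h, s) on the arrays from the previous sketch returns the reconstructed image before any post-processing.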
In one embodiment, a raised cosine filter [13] is chosen to provide a range of parametric kernels with different levels of selectivity. The analytic expression of a one-dimensional raised cosine filter is

hs,β(t)=sinc(t/s)·cos(πβt/s)/(1−(2βt/s)²)  (1)

where s is the up-scaling factor (the bandwidth of the filter is 1/s) and β is the roll-off factor (which measures the excess bandwidth of the filter). Since all the up-scaling and low-pass filtering operations are separable, this expression is applied along the vertical and horizontal axes consecutively. The value of β is enforced to lie in the range [0, s−1], so that the excess bandwidth never exceeds the Nyquist frequency. With β=0, the most selective filter (with a large amount of ringing) is obtained, and with β=s−1 the least selective one.
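A sampled version of such a kernel can be generated as follows. The sketch assumes the textbook raised-cosine impulse response with bandwidth 1/s and roll-off β (handling the removable singularity at 2βt/s=±1 explicitly) and, in its last lines, builds a bank of e.g. five evenly spaced roll-off factors matching the selection described next; the names and the truncation radius are illustrative choices.

```python
import numpy as np

def raised_cosine_kernel(s, beta, radius=6):
    """Sampled 1-D raised-cosine kernel with bandwidth 1/s and roll-off beta (cf. eq. (1))."""
    t = np.arange(-radius, radius + 1, dtype=float)
    den = 1.0 - (2.0 * beta * t / s) ** 2
    singular = np.isclose(den, 0.0)
    ratio = np.where(singular, np.pi / 4.0,            # limit of cos(.)/den at 2*beta*t/s = +/-1
                     np.cos(np.pi * beta * t / s) / np.where(singular, 1.0, den))
    h = np.sinc(t / s) * ratio
    return h / h.sum()                                 # unit DC gain

# Bank of kernels with increasing roll-off (decreasing selectivity), e.g. for s = 2:
s = 2
bank = [raised_cosine_kernel(s, beta) for beta in np.linspace(0.0, s - 1, 5)]
```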
In order to adaptively select the most suitable filter from a bank of five filters with roll-off factors β∈{0, (s−1)/4, (s−1)/2, 3(s−1)/4, s−1}, we look for the one providing minimal matching cost for each overlapping patch, as introduced below.
In an exemplary map of the adaptively selected filters, pixels are shown at grey levels ranging from light for β=0 to dark for β=s−1. That is, for pixels shown at the lightest grey level, a filter with a roll-off factor β=0 and high selectivity was adaptively selected; for pixels shown at the next darker grey level, a filter with a higher roll-off factor β=(s−1)/4 and lower selectivity was adaptively selected, etc.
The used nomenclature is: xβ,l,i, xβ,h,i, yβ,l,j and yβ,h,j denote, respectively, a low-frequency patch, the corresponding reconstructed high-frequency patch, the best matching low-resolution reference patch and its corresponding high-frequency example patch, all of which have been obtained by using the interpolation kernel and analysis filter hs,β. Then, the local kernel cost is measured as
kβ,i=α∥xβ,l,i−yβ,l,j∥1+(1−α)∥xβ,h,i−yβ,h,j∥1  (2)
The parameter α allows tuning the filter selection by weighting the low-frequency matching error against the high-frequency consistency error.
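The per-patch selection of eq. (2) then reduces to evaluating the cost once per roll-off factor and keeping the minimizer, as in the sketch below. The function names and the data layout are illustrative, and α=0.5 is only a placeholder default since the description leaves α as a tuning parameter.

```python
import numpy as np

def kernel_cost(x_l_i, y_l_j, x_h_i, y_h_j, alpha=0.5):
    """Local kernel cost of eq. (2): weighted L1 distances of the low- and high-frequency patch pairs."""
    return (alpha * np.abs(x_l_i - y_l_j).sum()
            + (1.0 - alpha) * np.abs(x_h_i - y_h_j).sum())

def select_roll_off(per_beta_patches, alpha=0.5):
    """per_beta_patches: one (x_l_i, y_l_j, x_h_i, y_h_j) tuple per roll-off factor in the bank.
    Returns the index of the kernel with minimal local cost for this patch."""
    costs = [kernel_cost(*patches, alpha=alpha) for patches in per_beta_patches]
    return int(np.argmin(costs))
```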
The proposed method has been implemented in MATLAB, with the costlier sections (example search, composition stages, filtering) implemented in OpenCL, without special emphasis on optimization. The patch size is set to Np=3 and the search window size to Nw=15. The algorithm is applied iteratively with smaller up-scaling steps (s=s1·s2·…); e.g. an up-scaling with s=2 is implemented as an initial up-scaling with s1=4/3 and a second one with s2=3/2.
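The iterative application with fractional steps is essentially a loop over per-step up-scalings. The sketch below only illustrates this control flow: a bicubic resize (SciPy's zoom) stands in for one pass of the actual method, since the zero-insertion helper above assumes integer factors; names are illustrative.

```python
from scipy.ndimage import zoom

def upscale_once(img, s):
    """Stand-in for one pass of the super-resolution method (here: plain bicubic-like resampling)."""
    return zoom(img, s, order=3)

def upscale_iteratively(img, steps=(4.0 / 3.0, 3.0 / 2.0)):
    """s = s1*s2*...; e.g. 4/3 * 3/2 = 2 reproduces the two-step schedule described above."""
    for s in steps:
        img = upscale_once(img, s)
    return img
```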
Even though the proposed method can also compute the magnification in a single step, the wider bandwidth available for matching with smaller magnification factors results in a better selection of high-frequency examples, at the cost of somewhat increased computation. As a post-processing stage, we apply Iterative Back-Projection (IBP) [1] to ensure that the information of the input image is completely contained in the super-resolved one:
x(n+1):=x(n)+hu*((y−(x(n)*hd)↓s)↑s)  (3)
The algorithm typically converges after 4 or 5 iterations. The up-scaling (hu) and down-scaling (hd) kernels are the ones used for bi-cubic resizing.
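Eq. (3) can be sketched as below, with SciPy's bicubic resampling (zoom, order=3) standing in for the bi-cubic hu and hd kernels; the cropping only guards against off-by-one output sizes and the names are illustrative.

```python
import numpy as np
from scipy.ndimage import zoom

def iterative_back_projection(x, y, s, n_iter=5):
    """Iteratively enforce that downscaling the super-resolved x reproduces the input y (eq. (3))."""
    x = np.asarray(x, dtype=float).copy()
    for _ in range(n_iter):
        down = zoom(x, 1.0 / s, order=3)                   # (x * hd) ↓ s, bicubic stand-in
        hh, ww = min(down.shape[0], y.shape[0]), min(down.shape[1], y.shape[1])
        err = y[:hh, :ww] - down[:hh, :ww]                 # low-resolution reconstruction error
        up = zoom(err, s, order=3)                         # hu * (err ↑ s), bicubic stand-in
        hh, ww = min(up.shape[0], x.shape[0]), min(up.shape[1], x.shape[1])
        x[:hh, :ww] += up[:hh, :ww]                        # x(n+1) := x(n) + correction
    return x
```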
calculating in an adder/subtractor 180 a difference between the input data structure S0 and the low-frequency input data structure L0, whereby a high-frequency input data structure H0 is generated,
upscaling 120 the input data structure S0, and filtering 130 the upscaled input data structure by a second low-pass filter Fl,1, wherein a low-frequency upscaled data structure L1 is obtained,
determining in the low-frequency upscaled data structure L1 a first patch Pn,L1 at a first position,
searching 152,154 in the low-frequency input data structure L0 a first block Bn,L0 that matches the first patch Pn,L1 best, and determining the position of said first block Bn,L0 within the low-frequency input data structure L0,
selecting 155 a second block Bn,H0 in the high-frequency input data structure H0 at the determined position,
accumulating 157 pixel data of the selected second block Bn,H0 to a second patch Pn,H1, the second patch being a patch in a high-frequency upscaled data structure H1,acc at the first position,
repeating 150 the steps of determining a new patch Pn,L1 in the low-frequency upscaled data structure L1, searching 152,154 in the low-frequency input data structure L0 a block Bn,L0 that matches the selected patch Pn,L1 best, selecting 155 a corresponding block Bn,H0 in the high-frequency input data structure H0 and
accumulating 157 pixel data of the selected corresponding block Bn,H0 to a patch Pn,H1 in the high-frequency upscaled data structure H1,acc at the position of said new patch Pn,L1, and
normalizing 190 the accumulated pixel values in the high-frequency upscaled data structure H1,acc, whereby a normalized high-frequency upscaled data structure H1 is obtained.

Finally, a super-resolved data structure S1 is obtained by adding the normalized high-frequency upscaled data structure H1 to the low-frequency upscaled data structure L1. The filters that are adaptively selected according to the present invention are the low-pass filters 130,170, i.e. the first low-pass filter Fl,0 and the second low-pass filter Fl,1. For these filters, one out of two or more raised cosine filters according to eq. (1) is selected in an adaptive selection step 135 (with the same parameter β for both filters), as controlled by a cost measuring step 145. The cost measuring step can be tuned by a parameter α, as described above. In implementations, different parameterized variants of these filters (with different β) can be available simultaneously, or as a single variable filter.
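The enumerated steps map onto code roughly as follows. This is a schematic sketch, not the claimed apparatus: it reuses the illustrative helpers from the earlier sketches (raised_cosine_kernel, separable_filter, upsample_zero_insert, best_match), uses a single fixed roll-off factor β instead of the adaptive selection for brevity, and advances the patch by one pixel.

```python
import numpy as np

def super_resolve_step(S0, s=2, beta=0.5, Np=3, Nw=15):
    """One up-scaling step following the enumerated structure (fixed beta for brevity)."""
    h = raised_cosine_kernel(s, beta)
    L0 = separable_filter(S0, h)                                  # filtering 170 with Fl,0
    H0 = S0 - L0                                                  # adder/subtractor 180
    L1 = separable_filter(upsample_zero_insert(S0, s), s * h)     # upscaling 120 + filtering 130 with Fl,1
    H1_acc = np.zeros_like(L1)                                    # initialization 160
    count = np.zeros_like(L1)
    r = Np // 2
    for i in range(r, L1.shape[0] - r):                           # patch loop 150
        for j in range(r, L1.shape[1] - r):
            u, v = best_match(L1, L0, (i, j), s, Np, Nw)          # searching 152,154
            H1_acc[i - r:i + r + 1, j - r:j + r + 1] += H0[u - r:u + r + 1, v - r:v + r + 1]  # 155 + 157
            count[i - r:i + r + 1, j - r:j + r + 1] += 1
    H1 = H1_acc / np.maximum(count, 1)                            # normalizing 190
    return L1 + H1                                                # S1 = L1 + H1
```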
In some embodiments, the upscaled input data structure after filtering 130 by the second low-pass filter Fl,1 is downscaled 140 by a downscaling factor d, with n>d (n being the upscaling factor used in the upscaling 120). Thus, a total non-integer upscaling factor n/d is obtained for the low-frequency upscaled data structure L1. The high-frequency upscaled data structure H1,init (or H1 respectively) has the same size as the low-frequency upscaled data structure L1. The size of H1 may be pre-defined, or derived from L1. H1 is initialized in an initialization step 160 to an empty data structure H1,init of this size.
The low-frequency band of the high-resolution image L1 is first divided into small patches Pn,L1 (e.g. 5×5 or 3×3 pixels) with a certain overlap. The choice of the amount of overlap trades off robustness to high-frequency artifacts (in the case of more overlap) against computation speed (in the case of less overlap). In one embodiment, an overlap of 20-30% in each direction is selected, i.e. for adjacent patches with e.g. 5 values, 2 values overlap, and for adjacent patches with 3 values, 1 or 2 values overlap. In other embodiments, the overlap is higher, e.g. 30-40%, 40-50% or around 50% (e.g. 45-55%). For an overlap below 20% of the patch size, the below-described effect of the invention is usually lower.
The final high-frequency band H1 is obtained after normalizing by the number of patches contributing to each pixel, thus resulting in an average value. It is clear that the larger the overlap between patches, the better the suppression of high-frequency artifacts resulting from the high-frequency extrapolation process.
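The normalization can be made explicit with a count map: the patch advance (patch size minus overlap) determines how many patches contribute to each pixel, and the accumulated band is divided by that count. A small illustrative sketch:

```python
import numpy as np

def contribution_count(shape, Np=3, overlap=1):
    """Number of patches contributing to each pixel for a given patch size and overlap."""
    advance = Np - overlap                 # patch advance = patch size - overlap
    cnt = np.zeros(shape)
    for i in range(0, shape[0] - Np + 1, advance):
        for j in range(0, shape[1] - Np + 1, advance):
            cnt[i:i + Np, j:j + Np] += 1
    return cnt

# The accumulated high-frequency band would then be normalized as, e.g.,
#   H1 = H1_acc / np.maximum(contribution_count(H1_acc.shape, Np, overlap), 1)
```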
Then, for each low-frequency high-resolution patch Pn,L1, a best match in terms of mean absolute difference (MAD) is obtained after an exhaustive search in a local search window (e.g. 11×11 pixels) over the low-frequency band L0 of the low-resolution image. The best match is a block Pn,L0 from the low-frequency low-resolution image L0 that has the same size as the low-frequency high-resolution patch Pn,L1 (e.g. 3×3 or 5×5 pixels). More details about the search window are described below.
For understanding the next step, it is important to note that the low-resolution low-frequency data structure L0 has the same dimensions as the low-resolution high-frequency data structure H0, and the high-resolution low-frequency data structure L1 has the same dimensions as the high-resolution high-frequency data structure H1.
As a result, each value in the resulting (preliminary) high-frequency band of the high-resolution data structure H1 is a sum of values from a plurality of contributing patches. Due to the patch overlap in L1 (and consequently also in H1 since both have the same dimension), values from at least two patches contribute to many or all values in H1. Therefore, the resulting (preliminary) high-frequency band of the high-resolution data structure H1 is normalized 190. For this purpose, the number of contributing values from H0 for each value in the high-frequency high resolution data structure H1 is counted during the synthesis process, and each accumulated value in H1 is divided by the number of contributions.
In the example, a second patch P12,L1 is selected at a position that is shifted horizontally by a given patch advance. The patch advance is the difference between patch size and overlap. Patch advances in different dimensions (e.g. horizontal and vertical for 2D data structures) may differ, which may lead to different effects or qualities along the dimensions of the high-resolution output data structure, but they are usually equal. A new search window W12 is determined according to the new patch position. In principle, the search windows advance in the same direction as the patch, but more slowly. Thus, a current search window may be at the same position as a previous search window, as is the case here. However, since another patch P12,L1 is searched in the search window, the position of the best matching block P12,L0 will usually be different. The high-frequency block from H0 corresponding to the best matching block P12,L0 is then accumulated to the high-resolution high-frequency data structure H1 at the position of the low-frequency high-resolution patch P12,L1, as described above. Subsequent patches P13,L1, P14,L1 are determined and searched in the same way.
The above description is sufficient at least for 1-dimensional (1D) data structures. For 2D data structures, the position of a further subsequent patch is found by a vertical patch advance (which may or may not be combined with a horizontal patch advance). The vertical patch advance also includes an overlap, as mentioned above.
The position of the search window is determined according to the position of the current patch.
In one embodiment (not shown in
In general, the larger the search window, the more likely it is to find a very similar patch. However, in practice little difference in accuracy is to be expected from largely increasing the search window, since in general natural images a similar local patch structure is typically found only within a small neighbourhood. Moreover, a larger search window requires more processing during the search.
The second patch P2,L1 is selected according to the employed patch advance.
As mentioned above, the search window usually advances only after a plurality of patches have been processed.
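One plausible way to implement this behaviour, assuming the window is centred at the co-located position in L0 (the patch position divided by the magnification factor) and clipped at the image borders, is sketched below with illustrative names; successive patch positions then map to slowly advancing, and sometimes identical, windows.

```python
import numpy as np

def search_window_origin(patch_pos, s, lr_shape, Nw=15):
    """Top-left corner of the Nw x Nw search window in L0 for a patch centred at patch_pos in L1."""
    ci = int(round(patch_pos[0] / s))                   # co-located row in L0
    cj = int(round(patch_pos[1] / s))                   # co-located column in L0
    top = int(np.clip(ci - Nw // 2, 0, max(lr_shape[0] - Nw, 0)))
    left = int(np.clip(cj - Nw // 2, 0, max(lr_shape[1] - Nw, 0)))
    return top, left

for col in (10, 12, 14, 16):                            # horizontally advancing patches
    print(search_window_origin((20, col), s=2, lr_shape=(64, 64)))
# -> (3, 0) (3, 0) (3, 0) (3, 1): the window stays put for several patches, then advances.
```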
The method was tested using two different datasets. The first one, called “Kodak”, contains 24 images of 768×512 pixels and the second one, called “Berkeley”, contains 20 images of 481×321 pixels that are commonly found in SISR publications. The results were compared to a baseline method (bi-cubic resizing) and to two state-of-the-art methods falling in the subcategories of dictionary-based SR ([8], referred to as “sparse”) and kernel ridge regression ([11], referred to as “ridge”) with a powerful post-processing stage based on the natural image prior. For “sparse”, a dictionary created offline with the default training dataset and parameters supplied by the authors was used. The comparison consists in taking each image from the two datasets, downscaling it by a factor of ½ and up-scaling it by a factor of s=2 with each method, and measuring the SSIM, Y-PSNR and execution time.
All SR methods perform better than the baseline bi-cubic interpolation, as expected, with “ridge” and the method of the present invention also surpassing the dictionary-based method. This reflects the fact that dictionary-based methods do not generalize well in comparison to internal learning. In terms of execution time, the method of the present invention is clearly faster than the other tested sophisticated SR methods, whereas the simple bi-cubic up-scaling algorithm takes a much shorter computing time.
The above-described single-image super-resolution method is suitable for interactive applications. An advantage is that the execution time is orders of magnitude smaller than that of the compared state-of-the-art methods, with similar Y-PSNR and SSIM measurements to those of the best performing one [11]. The method's execution time is stable with respect to the reconstruction accuracy, whereas that of [11] increases for the more demanding images. Key aspects of the proposed method include (1) an efficient cross-scale strategy for searching high-frequency examples based on local windows (internal learning) and (2) adaptively selecting the most suitable up-scaling and analysis filters based on matching scores.
In one embodiment, the invention relates to an apparatus for performing super-resolution of a single image, wherein a high-resolution version of an observed image is generated by exploiting cross-scale self-similarity. The apparatus comprises at least up-scaling and analysis filters, and an adaptive selection unit for adaptively selecting the up-scaling and analysis filters.
In one embodiment, the adaptive selection unit is adapted for selecting among a plurality of filters with different levels of selectivity.
In one embodiment, the up-scaling and analysis filters are raised cosine filters.
In one embodiment, the up-scaling and analysis filters have parametric kernels, and said adaptive selection unit is adapted for selecting among a plurality of filters with different levels of selectivity.
In one embodiment, the apparatus further comprises a cost measuring unit for measuring a local kernel cost, wherein the adaptive selection unit is adapted for adaptively selecting a filter from among a plurality of filters with different roll-off factors, wherein the adaptively selected filter is the one that provides minimal matching cost for each overlapping patch.
The apparatus further comprises an adaptive selection unit 935 for selecting or adapting said adaptive upscaling and analysis filter, and a cost measuring unit 945 that, in one embodiment, operates according to eq. (2) and provides control input to the adaptive selection unit 935.
It will be understood that the present invention has been described purely by way of example, and modifications of detail can be made without departing from the scope of the invention.
Each feature disclosed in the description and (where appropriate) the claims and drawings may be provided independently or in any appropriate combination. Features may, where appropriate, be implemented in hardware, software, or a combination of the two. Connections may, where applicable, be implemented as wireless connections or wired, not necessarily direct or dedicated, connections. Reference numerals appearing in the claims are by way of illustration only and shall have no limiting effect on the scope of the claims.