The present invention relates to the field of image processing. More specifically, the present invention relates to local image similarity measurement.
Estimation of local image similarity is an important problem in image processing. Conceptually, image similarity can be categorized into three classes, as described by Greg Shakhnarovich in "Learning Task-Specific Similarity," PhD Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2005, which is herein incorporated by reference: 1) Low-level similarity, in which patches are considered to be similar if some distance measure (e.g. p-norm, Earth Mover's, Mahalanobis) is within some threshold; 2) Mid-level similarity, in which patches share some simple semantic property; and 3) High-level similarity, in which similarity is primarily defined by semantics. Properties that make two patches similar are not visual, but they can be inferred from visual information such as a gesture.
In most single-sensor color imaging systems, only one color per pixel is measured. The remaining components have to be estimated to complete the color information at each location. This process is known as demosaicking. Several configurations of the color filter array (CFA) can be used. The most popular CFA is the Bayer pattern as described by B. E. Bayer in "Color Imaging Array", U.S. Pat. No. 3,971,065, Jul. 20, 1976, which is herein incorporated by reference, which consists of three colors (25% red, 50% green, and 25% blue pixels). Recently, to obtain better color accuracy and/or higher image fidelity, other CFAs have been proposed. For instance, a four-color CFA improves color reproduction accuracy as described by T. Mizukura et al. in "Image pick-up device and image pick-up method adapted with image pick-up sensitivity", U.S. Pat. No. 7,489,346, Feb. 10, 2009, which is herein incorporated by reference. Arranging the Bayer colors in a zigzag arrangement instead of a rectangular array improves fill factor and pixel sensitivity as described by Yoshihara et al. in "A 1/1.8-inch 6.4 MPixel 60 frames/s CMOS Image Sensor With Seamless Mode Change", IEEE J. Solid-State Circuits, Vol. 41, No. 12, December 2006, pp. 2998-3006, which is herein incorporated by reference. A machine learning approach as described by F. Baqai in "Identifying optimal colors for calibration and color filter array design", US Patent Application 20070230774, Oct. 4, 2007, which is herein incorporated by reference, estimates statistically optimal CFA colors. Demosaicking algorithms are predicated on the observation that the high-frequency information in the color channels is highly correlated. Since green pixels in the CFA typically far outnumber the other colors, demosaicking algorithms copy high-frequency information from the green channel to the color channels that are unknown at a given pixel location.
To do this effectively, demosaicking algorithms need to infer local image structure by identifying a set of pixels or regions that share similar local geometry.
Similar to demosaicking, denoising is also an estimation problem. The objective is to estimate a noise-free pixel value from degraded observations. To get a good estimate, a set of pixels that share similar local structure need to be found within the degraded image. The denoised value is typically a weighted average of the pixels in the similar pixel set. The weights are able to be determined in many ways such as proximity, similarity, noise level or a combination thereof. For example, see F. Baqai, “System and method for denoising using signal dependent adaptive weights”, U.S. patent application Ser. No. 12/284,055, filed on Sep. 18, 2008, which is incorporated herein by reference in its entirety.
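The weighted-average estimate described above can be sketched as follows. This is a minimal illustration only; the function name and the Gaussian-style weighting scheme are assumptions for the example, not the specific weights of the cited methods:

```python
import numpy as np

def denoise_pixel(center, neighbors, threshold, sigma):
    """Estimate a noise-free value as a similarity-weighted average of
    neighboring pixels whose L1 distance to the center is below a threshold.
    The Gaussian-style weighting is an illustrative assumption."""
    diffs = np.abs(neighbors - center)        # L1 distance to each neighbor
    similar = neighbors[diffs < threshold]    # keep only the similar-pixel set
    if similar.size == 0:
        return float(center)                  # no similar pixels: keep as-is
    # Closer values contribute more to the estimate
    w = np.exp(-(similar - center) ** 2 / (2.0 * sigma ** 2))
    return float(np.sum(w * similar) / np.sum(w))
```

For example, with neighbors `[98, 102, 150]` around a center of `100` and a threshold of `10`, the outlier `150` is excluded and the symmetric pair averages back to `100`.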
It is interesting to note that for both demosaicking and denoising, the estimated pixel value is a weighted combination of the similar pixel set. The weights serve different purposes: in denoising, the weights are chosen to smooth out unwanted oscillations, while in demosaicking the weights are chosen such that high-frequency information in the unknown pixel estimate is preserved. Some methods aim to do joint demosaicking and denoising by first estimating the basic structure and then iteratively fine-tuning the result, as described by A. Buades et al. in "Self-similarity driven color demosaicking", IEEE TIP, Vol. 18, No. 6, June 2009, pp. 1192-1202 and K. Hirakawa and T. Parks in "Joint demosaicing and denoising", IEEE TIP, Vol. 15, No. 8, August 2006, pp. 2146-2157, both of which are incorporated by reference. For all these situations, a common problem is to find similar image structures in the presence of degradations such as blur, distortions, and noise.
In the literature, low-level image similarity has many manifestations. For instance, similarity based on Euclidean distance (L2 norm) between pixels is quite popular as described by C. Tomasi and R. Manduchi in “Bilateral Filtering for Gray and Color Images,” Proc. of IEEE International Conference on Computer Vision, pp. 841-846, 1998, which is herein incorporated by reference. This measure is very sensitive to lighting conditions and noise. It does not compare local image structure. To make this measure more robust and amenable to estimating local geometry, patch-based Euclidean distances have been proposed in “Self-similarity driven color demosaicking,” cited above.
A critical part of the similarity measure is the threshold at which a pixel or an image patch is considered to be similar. The threshold is application dependent. It needs to be adjusted based on an estimate of the degree of degradation in the image, similarity criterion, and distance measure (L1, L2, and others). If the threshold is incorrectly chosen, the similarity measure will either include pixels that are not similar or will not yield a statistically significant number of similar pixels. This poses several challenges. For instance if the estimate of the local geometry is incorrect, several artifacts such as zipper effect, blur, and false colors may appear in the demosaicked image. Similarly, denoising may not adequately remove noise (under smooth), or it may blur edges and texture (over smooth).
Another important point to note is that the computational complexity is directly proportional to the number of pixels in the patch. For instance, the computational overhead of computing similar pixels for a 3×3 patch is 9 times, and for a 5×5 patch 25 times, the computational complexity of a 1×1 patch. Clearly, as patch size increases, the computational overhead rapidly goes up. So it is desirable to employ the smallest patch size that achieves the desired structural similarity.
A method of measuring low-level local image similarity using a relation between patch-based similarity measures of various patch sizes is described. The relation between similarity measures of various patch sizes is established using the probability distribution of L1 distances for arbitrary patch sizes. Patch size depends on the application and/or image conditions such as lighting, illuminant, aperture, focus, exposure, and camera gain. For instance, if an image is highly degraded, a bigger patch size may be needed to effectively measure local image similarity. In some situations where there is very little degradation, a patch size of 1×1 (just one pixel) may be sufficient. Similarly, for segmentation and object detection, a bigger patch may be warranted.
In one aspect, a method implemented on a device measuring local similarity in an image comprises obtaining imaging conditions, determining an appropriate patch size, choosing a threshold and measuring local image similarity. The imaging conditions are selected from the group consisting of lighting, illumination, exposure time, aperture, scene category and camera gain. The appropriate patch size is determined based on the imaging conditions. The threshold is from a set of thresholds stored in a lookup table. Choosing the threshold is based on at least one of desired similarity rate, imaging conditions and seamlessness of transition between patch size implementations. The method further comprises adaptively switching between patch sizes. The switching is automatic. The patch size is selected from the group consisting of a 1×1, 3×3, 5×5, 7×7, 9×9, 11×11, 13×13, 15×15 and 17×17 patch size. The device is selected from the group consisting of a personal computer, a laptop computer, a computer workstation, a server, a mainframe computer, a handheld computer, a personal digital assistant, a cellular/mobile telephone, a smart appliance, a gaming console, a digital camera, a digital camcorder, a camera phone, an iPod®, a video player, a DVD writer/player, a television and a home entertainment system.
In another aspect, a system implemented on a device for measuring local similarity in an image comprises a first module configured for utilizing a 1×1 patch size, a second module operatively coupled to this module configured for utilizing larger patch sizes and a switching module operatively coupled to the first module and the second module, the switching module configured for switching between the first module and the second module to measure local similarity of various patch sizes. The switching includes maintaining a same similarity rate irrespective of patch size. The switching is automatic. The larger patch sizes are selected from the group consisting of a 3×3, 5×5, 7×7, 9×9, 11×11, 13×13, 15×15 and 17×17 patch. The device is selected from the group consisting of a personal computer, a laptop computer, a computer workstation, a server, a mainframe computer, a handheld computer, a personal digital assistant, a cellular/mobile telephone, a smart appliance, a gaming console, a digital camera, a digital camcorder, a camera phone, an iPod®, a video player, a DVD writer/player, a television and a home entertainment system.
In another aspect, a device comprises a memory for storing an application, the application configured for determining an appropriate patch size for the application and/or imaging conditions, utilizing smaller patch sizes if image degradation is below a threshold and progressively increasing the patch size as degradation level increases and a processing component coupled to the memory, the processing component configured for processing the application. The device further comprises adaptively switching the patch size. Switching the patch size includes maintaining a same similarity rate irrespective of the patch size. The switching is automatic. The patch is selected from the group consisting of a 1×1, 3×3, 5×5, 7×7, 9×9, 11×11, 13×13, 15×15 and 17×17 patch. The device is selected from the group consisting of a personal computer, a laptop computer, a computer workstation, a server, a mainframe computer, a handheld computer, a personal digital assistant, a cellular/mobile telephone, a smart appliance, a gaming console, a digital camera, a digital camcorder, a camera phone, an iPod®, a video player, a DVD writer/player, a television and a home entertainment system.
The similarity measure used herein is based on the L1 distance as opposed to the popular L2 distance. There are several reasons for this choice. First, natural images have heavy-tailed distributions, and the noise characteristics corrupting the image can be non-Gaussian. The L1 distance is more appropriate for such data since it is not as affected by outliers as the L2 distance, as discussed in the context of fractional distance measures by P. Howarth and S. Ruger in "Fractional distance measures for content-based image retrieval," Lecture Notes in Computer Science, ISSN 0302-9743, Volume 3408, 2005, pp. 447-456, which is herein incorporated by reference. The L1 distance also gives all components the same weighting. Second, it is computationally much simpler to compute the absolute difference (L1 distance) than the L2 distance (which, even if the square root is discounted, still requires the sum of the squared differences).
In many image processing applications there is a need for measuring local image similarity. These applications include but are not limited to image restoration, classification, segmentation, and detection. Two restoration problems are addressed: demosaicking and denoising as a means of describing our invention; other applications are certainly possible. In demosaicking and denoising, similar pixels in the neighborhood of the pixel under consideration are used to estimate the missing or the degraded pixel value. The resulting image quality is a direct function of the degree of structural similarity of the pixels in the similar-pixel set to the pixel or image region under consideration. An appropriately chosen set of similar pixels results in an image that has significantly better appearance with little or no artifacts.
Depending on image conditions such as brightness, illuminant, aperture, focus, exposure, and camera gain, a different patch size may be necessary to measure similar local geometry. If the degradations are small, a smaller patch size may be used. However, if the image is highly degraded, as obtained in low light conditions via a consumer cell phone camera, a small patch size does not yield satisfactory results. For these situations, a bigger patch size may be required. The size of the patch (1×1, 3×3, 5×5, and others) depends on the degradation level, computational resources, and application. The challenge is to ensure seamless transition between various patch sizes while maintaining similar performance. In addition, the method should be fast and accurate.
The challenges are met by the method and system described herein for estimating local image similarity based on the L1 distance measure. An adaptive method is presented that automatically estimates the threshold at any degradation level for similarity measures of arbitrary size based on L1 distances. A smaller patch size is employed when image degradations are small, with a progressive transition to bigger patch sizes as image degradations become larger. This is done while maintaining similar performance by keeping a constant similarity rate while moving back and forth between patch sizes. To this end, a new relationship is derived between similarity measures of various patch sizes based on the L1 distance. For a patch size of 1×1, the L1 distance has a relatively unknown distribution referred to as the folded normal distribution (also known as a half-normal distribution), as described by Leone et al. in "The folded-normal distribution", Technometrics, 3(4), November 1961, pp. 543-550, which is herein incorporated by reference in its entirety; while for bigger patch sizes (3×3 and up), the L1 distance has a normal distribution. Using the characteristics of these two distributions, a relationship between L1 similarity measures for arbitrary patch sizes is derived. Via this relationship, a seamless transition back and forth between various patch sizes is achieved while maintaining similar performance.
When trying to estimate or restore a degraded pixel in an image, a region around the pixel is utilized. Similar pixels in the region are used to determine an estimate for the missing or degraded pixel. Using pixels that are not similar would introduce unwanted artifacts such as the zipper effect, false colors and edges, and a smoothed appearance, which would degrade the image. Measures available for determining similar pixels (e.g. Euclidean (L2), Mahalanobis, fractional, and others) are computationally expensive. Moreover, there is no clear mechanism for automatically determining thresholds for various patch sizes. The Sum of Absolute Differences (SAD), also known as the L1 distance, is used herein for determining distances for similar pixels or regions.
If image degradations are small a 1×1 patch size may work well. In this case, similarity is estimated by determining an absolute difference and then comparing the absolute difference with a threshold. If the absolute difference is below the threshold, then the pixels are similar, and if the absolute difference is equal to or above the threshold, the pixels are not similar. Then, the similar pixels are able to be used to find an estimate of the missing or degraded pixel.
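The 1×1 test above can be sketched as follows; the function name and the window-based usage are illustrative assumptions:

```python
import numpy as np

def similar_pixels_1x1(window, center_value, threshold):
    """Collect the pixels in a neighborhood whose absolute difference
    (the 1x1 L1 distance) from the center value is below the threshold."""
    diffs = np.abs(window.astype(float) - center_value)
    return window[diffs < threshold]   # the similar-pixel set
```

For example, `similar_pixels_1x1(np.array([10, 12, 30, 11]), 10, 5)` keeps `10`, `12`, and `11` and rejects `30`; the retained pixels can then be used to estimate the missing or degraded value.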
If image imperfections are larger, pixels are degraded significantly including the neighboring pixels around the pixel under question. Here a 1×1 patch does not work very well. Instead a bigger patch size is needed to effectively compare local geometry. In this situation, instead of comparing individual pixels, a patch of pixels (e.g. 3×3 patch) is compared. To perform patch to patch comparisons, the SAD is used. This is more robust for comparing structural similarity in the presence of severe degradations in the image. After the SAD is obtained, it is compared with a threshold to determine if the patches are sufficiently similar. If the SAD is below the threshold, then the pixels are similar, and if the SAD is equal to or above the threshold, the pixels are not similar. Depending on the patch size, there are able to be different thresholds. Again, the similar pixels are able to be used to find an estimate of the missing or degraded pixel.
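The patch-to-patch comparison can be sketched as follows. Note that, consistent with the random variable Q defined later in the text, the sum is normalized by the patch size before thresholding; the function names are illustrative assumptions:

```python
import numpy as np

def sad(patch_a, patch_b):
    """Sum of Absolute Differences (the L1 distance) between two
    equal-size patches."""
    return float(np.sum(np.abs(patch_a.astype(float) - patch_b.astype(float))))

def patches_similar(patch_a, patch_b, threshold):
    """Patches are similar iff the mean absolute difference is below
    the threshold; equal to or above means not similar."""
    return sad(patch_a, patch_b) / patch_a.size < threshold
```

For two 3×3 patches differing by exactly 1 at every pixel, the SAD is 9 and the mean absolute difference is 1, so a threshold of 1.5 declares them similar while a threshold of 0.5 does not.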
As described above, when image degradation is low, any patch size would work well. However, due to the higher complexity of bigger patches, a smaller patch size is favored. For higher degradations, bigger patch sizes provide a better comparison and thus better image quality. To ensure the image looks similar when switching between patch sizes, thresholds are set properly. Specifically, it is desired that the image appearance remain similar when adaptively switching between patch sizes. In some embodiments, that means, the number of similar pixels should be the same for arbitrary patch sizes. A way of ensuring that the number of pixels is the same is by determining a relationship between thresholds of various patch sizes.
Considering the signal model to be locally constant, pixels in the similar pixel set should have the same mean μ but different noise levels derived from a probability distribution with standard deviation σ. The threshold value controls the degree of similarity and, as pointed out earlier, depends on the degradation level at the pixel in question and the similarity measure. The degradations can include blur introduced by camera optics, color cast due to the illuminant, exposure compensation for high dynamic range images, noise from the signal and circuitry, gain applied to compensate for low light, and artifacts introduced in the camera pipeline. Generally, the noise is regarded to be normally distributed with a non-linear, signal-dependent variance computed via a noise model. The noise variance is not constant for every pixel; it depends on the signal value, so every pixel is able to have a different noise level.
Assuming similar pixels X, Y to be random variables from a normal distribution with mean μ and standard deviation σ: X, Y ~ N(μ, σ). The threshold for a 1×1 patch is based on the random variable Z = |X−Y|. The threshold for bigger patches (3×3 and up) is based on the random variable Q = mean(Zi), where i = 1, . . . , ω and ω is the number of pixels in the patch. To understand the relationship between the threshold for a 1×1 patch size and the thresholds for bigger patch sizes, the distributions of Z and Q are analyzed.
The difference X−Y has a normal distribution:

X−Y ~ N(0, √2·σ)

Z = |X−Y| has a folded normal distribution Nf with mean:

μz = √2·σ·√(2/π) = (2/√π)·σ ≈ 1.1284σ

Since E{Z²} = 2σ², the variance is σz² = E{Z²} − μz² = 2σ²(1 − 2/π), thus:

σz = √(2 − 4/π)·σ ≈ 0.8525σ
The random variable Q is able to be written as:

Q = (1/ω)·(Z1 + Z2 + . . . + Zω)
It is reasonable to assume the Zi to be independent and identically distributed (Zi ~ Nf(μz, σz)). The central limit theorem in statistics states that a sum of independent and identically distributed random variables (Z1, . . . , Zω) approaches a normal distribution:
Z1 + . . . + Zω ~ N(ω·μz, √ω·σz)
Therefore, Q is able to be considered normally distributed:

Q ~ N(μz, σz/√ω)
A patch size of 1×1 has a folded-normal (also referred to as half-normal) distribution, which is not symmetric. For patch sizes >1×1 (3×3, and up), distances are obtained from a sum of several 1×1 distances; for example, with a 3×3 patch there are nine (9) absolute differences summed, and with a 5×5 patch there are twenty-five (25) absolute differences summed. Based on the central limit theorem, if random variables have independent, identical distributions, the distribution of their sum is Gaussian. Since distances for patch sizes greater than 1×1 involve summing random variables that have identical folded-normal distributions, their distribution is Gaussian. This is also able to be seen in Table 1 below. The distance for a 1×1 patch size has a folded normal distribution, which is asymmetric, while distances for patch sizes 3×3 and bigger have a normal distribution, which is symmetric.
Z has a folded-normal distribution with mean 1.1284σ and standard deviation 0.8525σ. Q has a normal distribution with mean 1.1284σ and standard deviation 0.8525σ/√ω, where ω is the number of pixels in the patch (e.g. 9 for a 3×3 patch and 25 for a 5×5 patch).
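These constants can be checked numerically. The following is a minimal Monte Carlo sketch (assuming NumPy; the sample size and seed are arbitrary choices) that draws pixel pairs, forms the 1×1 distances Z, and averages them in groups of ω = 9 to form Q:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, omega = 1.0, 1_000_000, 9

# Pairs of "similar" pixels drawn from the same normal distribution
x = rng.normal(0.0, sigma, n)
y = rng.normal(0.0, sigma, n)
z = np.abs(x - y)                       # 1x1 L1 distances: folded-normal

# Mean over omega-pixel groups: Q should be ~N(1.1284*sigma, 0.8525*sigma/sqrt(omega))
q = z[: n - n % omega].reshape(-1, omega).mean(axis=1)

print(z.mean(), z.std())                # ~1.1284, ~0.8525
print(q.mean(), q.std())                # ~1.1284, ~0.2842 (= 0.8525/3)
```

The empirical means of Z and Q agree, while the standard deviation of Q shrinks by the factor √ω, as derived above.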
Although different patch sizes have different distributions, their means are the same and their standard deviations are interrelated. Distances that are normally distributed (patch sizes >1×1) have a mean equal to the median. If the threshold is chosen to be at the mean (1.1284σ), a pixel-similarity rate of 0.5 is yielded. However, the L1 distance when the patch size is 1×1 has a folded-normal distribution, which is asymmetric. Consequently, the median is not equal to the mean. To get a pixel-similarity rate of 0.5, the threshold should be at the median, which is 0.9539σ. Therefore, for patch size 1×1, a threshold of 0.9539σ corresponds to a threshold of 1.1284σ for a 3×3 patch. Both cases yield a similarity rate of 0.5. In other words, to obtain the same similarity rate, the threshold should be chosen such that the lower tail probability of the distance measure is the same regardless of patch size.
In the following, a relationship between thresholds is derived for patch sizes >1×1. Since the distribution of the similarity measure for patch sizes >1×1 (Q) is Gaussian (Q ~ N(1.1284σ, 0.8525σ/√ω)), the threshold T is able to be written in terms of its mean μq = 1.1284σ and standard deviation σq = 0.8525σ/√ω:

T = μq + α·σq
Note that α = 0 yields a pixel similarity of 0.5, α < 0 makes the similarity rate <0.5, and α > 0 implies a similarity rate >0.5. Without loss of generality, a relationship between thresholds for patch sizes >1×1 is derived for a desired similarity rate ≥0.5. Rates <0.5 are able to be handled in a similar manner. Therefore,

T = 1.1284σ + α·0.8525σ/√ω
Rearranging terms, α is able to be written as:

α = (T − μq)/σq = √ω·(T − 1.1284σ)/(0.8525σ)
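The threshold selection described above can be sketched numerically. This is an illustrative implementation only: the function name is an assumption, the >1×1 case uses the Gaussian approximation Q ~ N(1.1284σ, 0.8525σ/√ω) from the text, and the 1×1 case uses the quantile of the folded-normal distribution of |X−Y| (which reproduces the 0.9539σ median quoted above):

```python
from statistics import NormalDist
import math

PHI = NormalDist()  # standard normal, for quantiles

def threshold_for_rate(rate, sigma, omega):
    """Threshold on the (mean) L1 distance that yields the desired
    similarity rate for a patch of omega pixels."""
    if omega == 1:
        # Folded-normal quantile of Z = |X - Y| with X, Y ~ N(mu, sigma):
        # P(Z < t) = 2*Phi(t / (sqrt(2)*sigma)) - 1
        return math.sqrt(2.0) * sigma * PHI.inv_cdf((1.0 + rate) / 2.0)
    mu_q = 1.1284 * sigma                        # mean of Q
    sigma_q = 0.8525 * sigma / math.sqrt(omega)  # std of Q
    return mu_q + sigma_q * PHI.inv_cdf(rate)    # T = mu_q + alpha*sigma_q

print(threshold_for_rate(0.5, 1.0, 1))   # ≈ 0.9539 (median of folded-normal)
print(threshold_for_rate(0.5, 1.0, 9))   # ≈ 1.1284 (mean = median of Q)
```

Holding the rate fixed while varying ω keeps the similarity rate constant across patch sizes, which is the seamless-transition property derived above.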
As long as α remains the same, a constant similarity rate irrespective of patch size will be achieved.
Therefore, with the relationship described above, switching between patch sizes is able to be implemented. For a single image, aspects of the image that are more degraded than others are dealt with using a bigger patch size, and less degraded aspects are handled using smaller patch sizes. Similarly, different patch sizes can be used depending on region characteristics such as smoothness, texture, and structure.
95% of the area under a Gaussian distribution is within two standard deviations of the mean. This range is considered to determine the upper and lower threshold limits. The 95% region for the folded-normal distribution yields a threshold range of 0.0089σ to 2.772σ for patch size 1×1. The threshold range for larger patch sizes (>1×1) is μq ± 2σq. As the patch size is increased, the number of pixels in the patch ω increases, hence the standard deviation decreases (recall σq = 0.8525σ/√ω). Consequently, the threshold range becomes narrower, as is shown in Table 3.
Since complexity increases with patch-size, in some embodiments, it is preferred to use the smallest patch-size that achieves the desired quality.
At each pixel or image region it is important to find an estimate of the degradations that degrade the image. These include blur introduced by camera optics, color cast due to the illuminant, exposure compensation for high dynamic range images, noise from the signal and circuitry, gain applied to compensate for low light, and artifacts introduced in the camera pipeline by operations such as demosaicking. In some embodiments, this information is stored in a lookup table.
In some embodiments, the local similarity estimation application(s) 530 include several applications and/or modules. In some embodiments, the local similarity estimation application(s) 530 include a module 532 configured for estimating similarity via a 1×1 patch, a module 534 configured for similarity measurement by using bigger patch sizes (>1×1) and a switching module 536 configured for switching between patch sizes.
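The decision logic of a switching module such as module 536 might be sketched as follows. The breakpoints and the use of a noise estimate as the degradation measure are illustrative assumptions, not values from the source:

```python
def choose_patch_size(degradation_sigma):
    """Hypothetical switching rule: use the smallest patch size adequate
    for the estimated degradation level (breakpoints are illustrative)."""
    if degradation_sigma < 2.0:
        return 1   # 1x1 module: degradation is small
    if degradation_sigma < 8.0:
        return 3   # 3x3 module: moderate degradation
    return 5       # 5x5 module: heavy degradation
```

In practice the selected size would be passed, together with the matching threshold from the lookup table, to the corresponding similarity module.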
Examples of suitable computing devices include a personal computer, a laptop computer, a computer workstation, a server, a mainframe computer, a handheld computer, a personal digital assistant, a cellular/mobile telephone, a smart appliance, a gaming console, a digital camera, a digital camcorder, a camera phone, an iPod®, a video player, a DVD writer/player, a television, a home entertainment system or any other suitable computing device.
To utilize the method of and system for measuring local image similarity, an image is acquired. Depending on the broad application being performed on the image, similar regions or pixels are identified at the appropriate time according to that application's scheme. For example, in a restoration scheme, the local similarity measurement method identifies similar regions to remove degradations, thus improving the image quality. In some embodiments, the restoration occurs automatically on a system, and in some embodiments a user is able to initiate the restoration by selecting an input such as pushing a button, touching a screen or any other input mechanism.
In operation, the method of and system for estimating local image similarity based on the L1 distance measure determines the degree of degradations in the image. In some embodiments, the degradations are determined pixel by pixel and in other embodiments, larger portions of the image are used to determine degradations. The distance measure patch size depends on the application. In some embodiments, a constant similarity rate is maintained by appropriately choosing thresholds for different patch size implementations.
Although image processing has been the main focus of the description, the method and system described herein is able to be applied to other types of processing such as speech or video processing.
The method and system described herein is able to be applied to computer vision, machine learning, and image restoration applications such as super-resolution, in-painting, texture synthesis, segmentation, and object/scene/texture categorization, and other implementations.
Exemplary Implementations
The present invention has been described in terms of specific embodiments incorporating details to facilitate the understanding of principles of construction and operation of the invention. Such reference herein to specific embodiments and details thereof is not intended to limit the scope of the claims appended hereto. It will be readily apparent to one skilled in the art that other various modifications may be made in the embodiment chosen for illustration without departing from the spirit and scope of the invention as defined by the claims.