The present invention relates generally to template matching for an image.
Referring to
One of the template matching techniques includes feature point based template matching, which achieves good matching accuracy. Feature point based template matching extracts discriminative interest points and features from the model and the input images. Those features are then matched between the model image and the input image with a K-nearest neighbor search or some feature point classification technique. Next, a homography transformation is estimated from the matched feature points, which may be further refined.
Feature point based template matching works well when objects contain a sufficient number of interesting feature points. It typically fails to produce a valid homography when the target object in the input or model image contains few or no interesting points (e.g., corners), when the target object is very simple (e.g., the target object consists of only edges, like a paper clip) or symmetric, or when the target object contains repetitive patterns (e.g., a machine screw). In these situations, too many ambiguous matches prevent generating a valid homography. To reduce the likelihood of such failure, global information of the object such as edges, contours, or shape may be utilized instead of merely relying on local features.
Another category of template matching is to search for the target object by sliding a window of the reference template in a pixel-by-pixel manner, and computing the degree of similarity between them, where the similarity metric is commonly given by correlation or normalized cross correlation. Pixel-by-pixel template matching is very time-consuming and computationally expensive. For an input image of size N×N and a model image of size W×W, the computational complexity is O(W²×N²), given that the object orientation in both the input and model image is coincident. When searching for an object with arbitrary orientation, one technique is to do template matching with the model image rotated in every possible orientation, which makes the matching scheme far more computationally expensive. To reduce the computation time, coarse-to-fine, multi-resolution template matching may be used.
What is desired therefore is a computationally efficient edge based matching technique.
The foregoing and other objectives, features, and advantages of the invention may be more readily understood upon consideration of the following detailed description of the invention, taken in conjunction with the accompanying drawings.
Referring to
In the off-line model analysis, two principal types of information are gathered which are used for online template matching. The first type of information is for finding the object position, which may include wavelet decomposition of the model image into multiple layers and a generally rotation-invariant feature of the target object in the model image, thus providing a model vector M. Any suitable technique may be used for decomposing the image into multiple layers, which improves computational efficiency due to the reduced image size. The preferred technique for a generally rotation invariant feature is a ring projection transform, although any suitable technique may be used. The generally rotation invariant feature facilitates object identification without having to check a significant number of different angles (or any different angles) of the model image. The model vector M is thus determined for the model image based upon the ring projection transform. Other characteristics may likewise be determined that are characteristic of the object and that include a generally rotation invariant feature based upon a lower resolution image.
The second type of information is for determining the model object orientation, which may include edge detection and orientation estimation, thus providing the model object orientation. The position estimation determines candidate positions for the object using the generally rotation invariant characteristic on the lower resolution image. In this manner, potential candidate locations can be determined in an efficient manner without having to check multiple angular rotations of the object.
In the online template matching, the input image is first decomposed into lower resolutions with a wavelet decomposition, where the number of decomposition layers is determined by the off-line model analysis based on the model image size. Other decomposition techniques may likewise be used, as desired. Then candidate positions and a span of the target object are identified by measuring the similarity between the model rotation invariant feature and a rotation invariant feature centered at each high energy pixel in the lowest resolution input wavelet composite subimage, and selecting the positions where the similarity is higher than a predefined threshold. The system can then verify candidate positions by computing the correlation coefficients in the vicinity of their corresponding positions in the original highest resolution image. Candidates with the highest correlation coefficients are kept as the final target object position. In this manner, the initial matching is done in a generally rotation invariant manner (or rotation invariant manner) for computational efficiency. After an object position is determined, the system may then estimate the object orientation in the input by detecting edges in the span of the object and computing image moments using the edge map. Thus, after a likely determination is made of a candidate position, the system can then account for image rotation, which is a computationally efficient technique. Then any ambiguous orientation is resolved, and the model image is aligned to the input by translating and rotating the model image to the estimated input position and orientation.
Referring to
Motivated by a desire for an efficient template matching scheme so that an arbitrarily oriented object can be detected in an input image, the system may use multi-resolution template matching with the wavelet decomposition. The wavelet decomposition reduces an image to small subimages at multiple low-resolution levels 120. It also transforms the image into a representation where both spatial and frequency information is present. In addition, by using wavelet coefficients as features, the matching is not very sensitive to photometric changes (such as background and/or foreground intensity changes, or illumination changes). By highlighting local feature points with high energy in the decomposed subimages, significant computation savings result in the matching process.
The wavelet transform of a 2D image f(x, y) may be defined as the correlation between the image and a family of wavelet functions {φs,t(x,y)}:
Wf(s,t;x,y)=f(x,y)*φs,t(x,y)  (1)
The pyramid-structured wavelet decomposition operation produces four subimages fLL(x,y), fLH(x,y), fHL(x,y) and fHH(x,y) in one level of decomposition. fLL(x,y) is a smooth subimage, which represents the coarse approximation of the image. fLH(x,y), fHL(x,y) and fHH(x,y) are detailed subimages, which represent the horizontal, vertical and diagonal directions of the image, respectively. The 2D decomposition can iterate on the smooth subimage fLL(x,y) to obtain four coefficient matrices in the next decomposition level.
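By way of a non-limiting illustration, the following is a minimal sketch of one level of the pyramid-structured wavelet decomposition using the PyWavelets library; the function name and the mapping of the returned detail coefficients to the fLH/fHL/fHH naming above are assumptions for illustration.

```python
import numpy as np
import pywt

def decompose(image: np.ndarray, wavelet: str = "haar"):
    """One level of 2D DWT: returns the smooth subimage and the three
    detail subimages (horizontal, vertical, diagonal)."""
    fLL, (fLH, fHL, fHH) = pywt.dwt2(image.astype(np.float64), wavelet)
    return fLL, fLH, fHL, fHH

# Iterating on the smooth subimage fLL yields the next decomposition level.
```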
Various types of wavelet bases, such as Haar and Daubechies, may be used in the wavelet decomposition. Empirical results indicate that, due to the boundary effect of the limited model image size, wavelet bases with shorter support, such as the Haar wavelet with a support of 2 or a 4-tap Daubechies wavelet, are the preferred choice. Otherwise, the type of wavelet basis has limited effect on the matching results.
The matching process can be performed either on the decomposed smooth subimage or on the decomposed detail subimages at a lower multi-resolution level. Preferably the system uses the detail subimages for the matching with normalized correlation so that only pixels with high-energy values in the detail subimages are used as the matching candidates. This alleviates pixel-by-pixel matching in the smooth subimage. Three detail subimages containing, separately, horizontal, vertical and diagonal edge information of object patterns are obtained in one resolution level. The system may combine these three detail subimages into a single composite detail subimage that simultaneously displays horizontal, vertical and diagonal edge information. The composite subimage may be given by,
fd(J)(x,y)=|fLH(J)(x,y)|+|fHL(J)(x,y)|+|fHH(J)(x,y)|  (2)
where fLH(J)(x,y), fHL(J)(x,y) and fHH(J)(x,y) are the horizontal, vertical and diagonal detail subimages at resolution level J, respectively. The system may use the L1 norm as the energy function for each pixel in the composite detail subimage fd(J)(x,y) for its computational simplicity.
The online template matching may be carried out on the composite detail subimage. Since the energy values of most pixels in the detail subimage are approximately zero, only the pixels with high energy values are considered for further matching. The threshold for selecting high energy-valued pixels can be manually predetermined, fixed, or adaptively determined.
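By way of a non-limiting illustration, the following is a minimal sketch of computing the composite detail subimage of equation (2) and selecting high-energy pixels; the mean-plus-k-standard-deviations rule is an assumed adaptive threshold, not one mandated by the description above.

```python
import numpy as np

def composite_detail(fLH, fHL, fHH):
    """Equation (2): L1 combination of the three detail subimages."""
    return np.abs(fLH) + np.abs(fHL) + np.abs(fHH)

def high_energy_pixels(fd, k=2.0):
    """Keep only pixels whose energy is well above the mean; most
    detail coefficients are approximately zero, so few survive."""
    thresh = fd.mean() + k * fd.std()  # assumed adaptive rule
    ys, xs = np.nonzero(fd > thresh)
    return list(zip(xs, ys))
```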
In order to reduce the computational burden in the matching process and make the matching invariant to rotation, a ring-projection transformation may be used 130. Overall, any generally rotation invariant technique may be used to characterize the image. It transforms a 2D gray-level image into a rotation-invariant representation in the 1D ring-projection space. Let the pattern of interest be contained in a circular window of radius W. The radius chosen for the window depends on the size of the reference template. The ring-projection of the composite detail subimage fd(J)(x,y) is given as follows. First, fd(J)(x,y) in the Cartesian coordinates is transformed into the polar coordinates:
x=r cos θ y=r sin θ (3)
The ring-projection of image fd(J)(x,y) at radius r, denoted by p(r), is defined as the mean value of fd(J)(r cos θ, r sin θ) at the specific radius r. That is,

p(r)=(1/nr) Σθ fd(J)(r cos θ, r sin θ)  (4)

where nr is the total number of pixels falling on the circle of radius r, r=0, 1, 2, . . . , W. Since the projection is constructed along circular rings of increasing radii, the derived 1D ring-projection pattern is invariant to rotation of its corresponding 2D image pattern. The pattern may be denoted as a model RPT vector, M 140. Other patterns or characterizations that are generally rotation invariant based upon a reduced resolution image may likewise be used.
It is noted that, in computing the RPT of an object, to avoid including other unwanted background pixels, the system preferably only adds together the wavelet coefficients of the high energy pixels in equation (4). The maximum radius W is determined based on the model image size such that the rings cover the whole object. In addition, computing sin θ and cos θ at every pixel in the circle is time-consuming. To reduce the computation time, the system may compute the distance transform about the center pixel of the ring, so that all the pixels at radius r can be directly extracted from the distance map produced by the distance transform.
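By way of a non-limiting illustration, the following is a minimal sketch of the ring-projection transform of equation (4). The precomputed radius map plays the role of the distance transform described above, so sin θ and cos θ need not be evaluated at every pixel; the energy threshold excludes background pixels as suggested.

```python
import numpy as np

def ring_projection(fd, cx, cy, W, energy_thresh=0.0):
    """Return the RPT vector [p(0), ..., p(W)] centered at (cx, cy)."""
    ys, xs = np.indices(fd.shape)
    radii = np.rint(np.hypot(xs - cx, ys - cy)).astype(int)  # radius map
    p = np.zeros(W + 1)
    for r in range(W + 1):
        ring = fd[(radii == r) & (fd > energy_thresh)]  # high-energy pixels only
        p[r] = ring.mean() if ring.size else 0.0
    return p
```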
The other part of the off-line model analysis relates to determining object orientation 150. The object orientation 150 may be computed as the principal axis of the object edge contour with moment analysis of the edge contour. This may include two steps: (1) extract object edges, and (2) compute moment of edges to obtain object orientation.
Referring also to
After object edges are determined, the object orientation may be determined by moment analysis of the edge map 190. The object orientation may be determined using any technique, as desired. The central moment of a digital image f(x,y) is defined as,

μpq=Σx Σy (x−x̄)^p (y−ȳ)^q f(x,y)  (5)

where (x̄, ȳ) is the centroid of the image.
Information about the image orientation can be derived by first using the second order central moments to construct a covariance matrix,

cov[f(x,y)] = [ μ′20  μ′11 ; μ′11  μ′02 ]  (6)

where μ′pq=μpq/μ00 are the normalized second order central moments.
The eigenvectors of this matrix correspond to the major and minor axes of the edge pixels, so the orientation can be extracted from the angle of the eigenvector associated with the largest eigenvalue. It can be shown that this angle is given by

θ = (1/2) arctan( 2μ′11 / (μ′20 − μ′02) )  (7)
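By way of a non-limiting illustration, the following is a minimal sketch of equations (5)-(7) applied to a binary edge map; the closed form of equation (7) is used instead of an explicit eigen-decomposition.

```python
import numpy as np

def edge_orientation(edges: np.ndarray) -> float:
    """Principal-axis angle (radians) of the nonzero pixels of an edge map."""
    ys, xs = np.nonzero(edges)
    xbar, ybar = xs.mean(), ys.mean()
    mu20 = ((xs - xbar) ** 2).sum()   # second order central moments
    mu02 = ((ys - ybar) ** 2).sum()
    mu11 = ((xs - xbar) * (ys - ybar)).sum()
    # Equation (7); arctan2 also handles the mu20 == mu02 case.
    return 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
```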
The model object orientation 150 is thus determined by the off-line model analysis. In this manner, the system can determine not only the angular rotation of the object, but also its orientation.
During online template matching, the system receives the input image 200 and other parameters determined by offline model image analysis including number of wavelet decomposition layers P 210, model image RPT vector 220, gradient threshold T 230, and model object orientation θM 240.
The online template matching should likewise decompose the image to the P-th layer using a wavelet transformation so that processing is carried out on the composite detail subimage 250. Since the energy values of most pixels in the detail subimage are approximately zero, only the pixels with sufficiently high energy values are preferably considered for further matching. The threshold for selecting high energy-valued pixels can be manually predetermined, fixed, or adaptively determined. Any suitable image decomposition technique may be used.
Also, referring to
In the matching process, the measure of similarity is given by the normalized correlation. Let
PM=[p(0), p(1), . . . , p(W)]  (8)
and
PI=[p̂(0), p̂(1), . . . , p̂(W)]  (9)
represent the ring-projection vectors of the reference template and the scene subimage, respectively. The normalized correlation between the ring projection vectors PM and PI is defined as:

ρp = Σr [p(r)−μM][p̂(r)−μI] / ( {Σr [p(r)−μM]^2}^(1/2) {Σr [p̂(r)−μI]^2}^(1/2) )

where the sums are over r=0, 1, . . . , W, and μM and μI denote the mean values of PM and PI, respectively.
The correlation coefficient ρp is scaled in the range between −1 and +1. The computation of correlation coefficient is only carried out for those high energy-valued pixels in the composite detail subimage. Note that the dimensional length of the ring projection vector is W+1, where W is the radius of the circular window. This significantly reduces the computational complexity for the correlation coefficient ρp.
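By way of a non-limiting illustration, the following is a minimal sketch of the normalized correlation between two ring-projection vectors; it is the Pearson correlation coefficient, which is scaled between −1 and +1 as noted above.

```python
import numpy as np

def rpt_correlation(pm: np.ndarray, pi: np.ndarray) -> float:
    """rho_p between RPT vectors of dimensional length W + 1."""
    pm0, pi0 = pm - pm.mean(), pi - pi.mean()
    denom = np.sqrt((pm0 ** 2).sum() * (pi0 ** 2).sum())
    return float((pm0 * pi0).sum() / denom) if denom else 0.0
```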
Once candidate positions of the target object are identified in the wavelet composite subimage at a lower resolution level, the system then verifies those candidates by computing the correlation coefficients in the vicinity of their corresponding positions in the original highest resolution image. Candidates with the highest correlation coefficients are kept as the final target object position. Let (x*, y*) be the detected coordinates in the level J detail subimage. Then the corresponding coordinates of (x*, y*) in the level 0 image are given by (2^J x*, 2^J y*). If the localization error in one axis is Δt in the level J subimage, the search region in the original image should be (2^J x*±2^J Δt)×(2^J y*±2^J Δt) for fine tuning. Experiments show that the detected pixel with the largest correlation coefficient is typically within approximately 3 pixels of the true location for resolution level J≤2 and window radius W≥64.
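By way of a non-limiting illustration, the following is a minimal sketch of mapping a level-J detection back to level-0 coordinates together with the fine-tuning search window; the names are illustrative only.

```python
def refine_window(x_star, y_star, J, delta_t):
    """Map (x*, y*) from the level-J subimage to level 0 and return the
    per-axis search ranges (2^J x* ± 2^J Δt) and (2^J y* ± 2^J Δt)."""
    scale = 2 ** J
    x0, y0 = scale * x_star, scale * y_star
    half = scale * delta_t
    return (x0 - half, x0 + half), (y0 - half, y0 + half)
```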
The input object orientation, θI, 290 can be determined using generally the same technique as that of the off-line moment model analysis. Preferably the object area identified during the online template matching is downsampled 270, such as by a factor of 2, and is then used for edge detection and orientation estimation.
Since moment analysis of the edge map only yields the angle of the principal axis of the object edges, there is an ambiguity about the orientation: the orientation angle θ can correspond to two different directions, and these two directions are flipped with respect to each other. This further means that the angle difference between the model object and the input object could be θ1=θI−θM or θ2=θI−θM+π, where θI denotes the input object orientation and θM denotes the model object orientation 280. To resolve this ambiguity when aligning the model to the input, the system may rotate 300 the model image by θ1 and θ2 respectively, then compare the NCC matching score between the input object region and the model image rotated by θ1 with the NCC score between the input object region and the model image rotated by θ2. The angle difference which yields the highest NCC matching score 310 is selected 320.
To decrease the computational complexity of resolving the orientation ambiguity, the NCC matching may be computed in the down-sampled image (the original image down-sampled by a factor of 2), which is provided by the off-line model analysis 330. To further reduce the computational complexity of the NCC matching when resolving the orientation ambiguity, the NCC matching is only performed with the edge pixels, not the entire intensity image. The model may then be rotated and translated to align with the input 340.
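By way of a non-limiting illustration, the following is a minimal sketch of resolving the orientation ambiguity by comparing the NCC scores of the model rotated by θ1 and by θ2; it assumes the (downsampled) model and input edge regions have the same size, and all names are illustrative.

```python
import numpy as np
from scipy.ndimage import rotate

def ncc(a, b):
    a0, b0 = a - a.mean(), b - b.mean()
    d = np.sqrt((a0 ** 2).sum() * (b0 ** 2).sum())
    return (a0 * b0).sum() / d if d else 0.0

def resolve_orientation(model_edges, input_edges, theta1_deg, theta2_deg):
    """Return whichever of the two candidate angles best matches the input."""
    best_angle, best_score = None, -2.0
    for ang in (theta1_deg, theta2_deg):
        rotated = rotate(model_edges.astype(float), ang, reshape=False)
        score = ncc(rotated, input_edges.astype(float))  # edge pixels only
        if score > best_score:
            best_angle, best_score = ang, score
    return best_angle
```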
One type of template matching technique is to search for the target object by sliding a window of the reference template on a pixel-by-pixel basis (or other basis), and computing the degree of similarity between them, where the similarity metric is commonly given by correlation. The preferred correlation is a Normalized Cross Correlation (NCC). The matching feature may be, for example, an edge map of the model (or input) image, or the full intensity image of the model (or input) image. If the matching feature is the edge map of the object, the matching uses global object shape information, and hence provides advantages over feature point based matching.
For example, some advantages of NCC based template matching are that (1) it is robust to object shape, whether a complex or a simple shape, and (2) it is robust to photometric changes (brightness/contrast changes, blur, noise, etc.). However, one drawback of traditional NCC matching is that it is not robust to rotation, so the computation cost is high when the input object orientation is different from the model object orientation because many template images at different orientations are matched with the input image. Accordingly, a modified NCC matching technique should handle both rotation and translation of the image, preferably in a computationally efficient manner.
To enable NCC to match an object with different rotations, a set of rotated template images may be determined and then matched to the input image to find the optimal input orientation and position. To increase the matching efficiency, the preferred system may employ acceleration techniques such as a Fourier Transform, a coarse-to-fine multi-resolution search, and an integral image. In addition to using such acceleration techniques, if desired, the orientation search may be further modified for computational efficiency with coarse-to-fine hierarchical angle search. The coarse-to-fine angle search preferably occurs in the orientation domain (i.e. different angles). In addition, the technique may likewise be suitable for multi-object matching.
Referring to
By way of background for traditional NCC matching, when a (2h+1)×(2w+1) template y is matched with an input image x, template matching is performed by scanning the whole image and computing the similarity between the template and the local image patch at every input pixel. Various similarity metrics can be used, such as normalized Euclidean distance, Sum of Squared Differences (SSD), or Normalized Cross Correlation (NCC). If NCC is used as the similarity measure, the template matching is NCC-based. A conventionally used NCC-based template matching takes the form

NCC(u,v) = Σi,j [x(u+i,v+j)−x̄(u,v)][y(i,j)−ȳ] / ( {Σi,j [x(u+i,v+j)−x̄(u,v)]^2}^(1/2) {Σi,j [y(i,j)−ȳ]^2}^(1/2) ),  (u*,v*) = arg max(u,v) NCC(u,v)  (10)

where the sums are over i=−h, . . . , h and j=−w, . . . , w, x̄(u,v) is the mean of the input patch centered at (u,v), and ȳ is the mean of the template.
NCC(u, v) gives the matching NCC score at position (u, v). The higher the NCC score at (u, v), the more similar the template pattern is to the local input pattern in the neighborhood of (u, v). The input position that yields the highest NCC score across the whole input image is selected as the final matched position, as shown by equation (10).
If there are multiple objects in the input and their orientations are all the same as the template object orientation, the system may keep the top K peaks in equation (10), where K corresponds to the number of objects in the input. While this technique is suitable to handle translation of an object, it is not suitable to handle rotation of an object.
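By way of a non-limiting illustration, the following is a minimal sketch of NCC-based matching with OpenCV, keeping the top K peaks for the multi-object case; the neighborhood-suppression radius is an assumed parameter.

```python
import cv2
import numpy as np

def match_top_k(image, template, k=1, suppress=10):
    """Return the K best (x, y, score) matches; image and template must be
    uint8 or float32 single-channel arrays."""
    scores = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
    peaks = []
    for _ in range(k):
        _, max_val, _, max_loc = cv2.minMaxLoc(scores)
        x, y = max_loc
        peaks.append((x, y, max_val))
        # Suppress a neighborhood so the next peak is a distinct object.
        scores[max(0, y - suppress):y + suppress,
               max(0, x - suppress):x + suppress] = -1.0
    return peaks
```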
Before entering the matching stage, the system should first compute the feature used for matching. For NCC-based matching, one of two types of features is preferably employed. One candidate matching feature is the object edge. The object edge encodes the global shape information about the object. It is more robust to object shape variations, and it can potentially handle cases including simple-shape objects, symmetric objects, and objects with repetitive patterns. However, using the object edge requires that the object edge can be extracted from both the input and the template image, which implies that it is not especially robust to low contrast, noise, and blur in the input image because it is problematic to extract clean and clear object edges from such images.
The other candidate matching feature is the gray-scale image, that is, using the raw gray-scale image for NCC matching. The use of such a gray-scale image is typically more robust to low contrast, noise, and blur. However, using the raw gray-scale image is not especially robust to brightness/illumination changes in the input image. It also typically fails to obtain a valid matching result if the input intensity sufficiently deviates from the model intensity.
One automatic technique for determining a matching feature is to determine the input image blur and noise level based on frequency domain analysis. One particular transform that may be used in image capture is a discrete cosine transform (DCT). The DCT coefficients may form 64 (8×8) histograms, with one histogram for each of the DCT coefficients. To further reduce the data, these histograms for the 2-D 8×8 coefficients are mapped to 1-D histograms using a mapping function as illustrated in
With the 1D histograms, various statistics may be derived from these coefficient histograms, e.g., the second order statistic variance, the fourth order statistic kurtosis, the maximum, and the minimum, etc. Many of these can be used to predict blur and noise. Preferably, the standard deviation (square root of variance) of the DCT coefficients and the absolute maximum DCT coefficients are used for blur detection, and the high frequency components are used to predict noise.
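By way of a non-limiting illustration, the following is a minimal sketch of gathering per-coefficient statistics over 8×8 block DCTs; the 2-D to 1-D histogram mapping described above is omitted, and treating the bottom-right coefficient region as the high-frequency cue is an assumption for illustration.

```python
import numpy as np
from scipy.fft import dctn

def block_dct_stats(gray: np.ndarray):
    """Per-coefficient std/max over all 8x8 blocks (blur cues) and mean
    high-frequency magnitude (noise cue)."""
    h, w = (gray.shape[0] // 8) * 8, (gray.shape[1] // 8) * 8
    blocks = [dctn(gray[y:y + 8, x:x + 8].astype(float), norm="ortho")
              for y in range(0, h, 8) for x in range(0, w, 8)]
    c = np.stack(blocks)                     # (num_blocks, 8, 8)
    std_map = c.std(axis=0)                  # spread of each DCT coefficient
    max_map = np.abs(c).max(axis=0)          # absolute maxima per coefficient
    hf_energy = np.abs(c[:, 4:, 4:]).mean()  # high-frequency content
    return std_map, max_map, hf_energy
```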
The matching feature extraction for template image matching illustrated in
Feature extraction for input image is shown in the right path of
Referring again to
The preferred coarse matching procedure pseudo code is illustrated in
The difference between single object and multi-object matching is that, for the single object coarse angle search, the system only keeps the single peak position which yields the highest NCC score; then the top K [angle, position] triplets among all coarse angles are kept for the later fine orientation search. In the case of multi-object matching, since there could be multiple objects with the same orientation in an input image, the system may keep the top K positions for each coarse angle.
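By way of a non-limiting illustration, the following is a minimal sketch of the coarse angle search: each coarsely rotated template is matched against the input and the best [angle, position] candidates are kept for the fine orientation search; the angle step and K are assumed parameters.

```python
import cv2
import numpy as np
from scipy.ndimage import rotate

def coarse_angle_search(image, template, step_deg=10, k=5):
    """Return the top-K (score, angle, x, y) candidates over coarse angles."""
    img32 = image.astype(np.float32)
    candidates = []
    for ang in range(0, 360, step_deg):
        t = rotate(template.astype(np.float32), ang, reshape=False)
        scores = cv2.matchTemplate(img32, t, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(scores)
        candidates.append((max_val, ang, max_loc[0], max_loc[1]))
    candidates.sort(reverse=True)  # highest NCC score first
    return candidates[:k]
```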
Referring again to
Referring again to
The terms and expressions which have been employed in the foregoing specification are used therein as terms of description and not of limitation, and there is no intention, in the use of such terms and expressions, of excluding equivalents of the features shown and described or portions thereof, it being recognized that the scope of the invention is defined and limited only by the claims which follow.