The invention relates to a method for generating improved image descriptors for media content and a system to perform efficient image querying to enable low complexity searches.
Computer-supported image search in general, for example finding all images of the same place in an image collection, is a common requirement for large databases. Well-known image search systems include Google's image-based image search and tineye.com. Some systems rely on retrieving metadata, i.e. descriptive text information for a picture such as a movie poster, a cover, a wine label, or descriptions of works of art and monuments. However, sparse representations known as image descriptors are a more important tool for low-complexity searches, for instance for object and scene retrieval of all occurrences of a user-outlined object in a video by means of an inverted file index. An example is “Video Google: A text retrieval approach to object matching in videos” as disclosed by J. Sivic and A. Zisserman in Proceedings of the Ninth IEEE International Conference on Computer Vision, ICCV 2003. There, a vector quantization of local descriptors is used to achieve sparsity while discarding the geometric information conveyed by the position and scale of key-points. This quantization procedure, while enabling low-complexity matching, significantly degrades the descriptive power of the local descriptor vectors. That is, the reduced complexity obtained by means of very sparse vectors comes at the expense of degraded vector descriptiveness. One standard method for correcting the weakened descriptiveness caused by vector quantization is geometric post-verification applied to a short-list of query responses. This method is costly and restrictive, as it requires estimating the homography between each potentially matching pair of images and further assumes that this homography is constant over a large portion of the scene.
A weak form of geometric verification that does not require estimating the homography and incurs only marginal added complexity is also known; however, this approach is complementary to a full geometric post-verification process.
It is an aspect of the invention to provide an improved image description method that exploits both the photometrical information of key-points and the geometrical layout i.e. the relative position of the key-points and nevertheless performs efficient image querying for a precise and fast search.
Problems in view of said aspects are solved by features disclosed in independent claims. Dependent claims disclose preferred embodiments.
A method for generating image descriptors for media content of images represented by a set of key-points is recommended which determines for each key-point of the image, designated as a central key-point, a neighbourhood of other key-points whose features are expressed relative to those of the central key-point.
Each key-point is associated to a region centred on the key-point and to a descriptor describing pixels inside the region. A region detection system is applied to the image media content for generating key-points as the centre of a region with a predetermined geometry. That means that image descriptors are generated by generating key-point regions, generating descriptors for each key-point region, determining geometric neighbourhoods for each key-point region, a quantisation of the descriptors by using a first visual vocabulary, expressing a neighbour of a neighbourhood region relative to the key-point region and quantizing this relative region using a shape codebook and a quantization of descriptors of neighbours of the neighbourhood region by using a second visual vocabulary for generating a photo-geometric descriptor being a representation of the geometry and intensity content of a feature and its neighbourhood. The photo-geometric descriptor is a vector for each key-point defined in the quantized photo-geometric space. The inverted file index of the sparse photo-geometric descriptors is stored in a program storage device readable by machine to enable low complexity searches.
It is a further aspect of the invention to provide a system for providing descriptors for media content of images represented by a set of key-points, which comprises a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for generating descriptors for image media content. Said method comprises the steps of applying a key-point and region generation to the image media content to provide a number of key-points each with a vector specifying the geometry of the corresponding region,
generating a descriptor for the pixels inside the region,
a quantisation of the descriptors by using a first visual vocabulary,
determining, for each key-point, neighbouring key-points with similar regions,
normalising the neighbouring regions relative to the region and quantising them using a shape codebook, and
quantising the neighbourhood descriptors in each of the neighbourhood regions by using a second visual vocabulary, thereby providing a sparse photo-geometric descriptor (abbreviated as SPGD) of each key-point in the image as a representation of the geometry and intensity content of a feature and its neighbourhood. The sparsity of the descriptor means that an inverted file index of the photo-geometric descriptor can be stored in a program storage device readable by machine to enable fast, low-complexity searches.
The geometric neighbourhood of a region is determined by applying thresholds that select the vectors lying within a four-dimensional parallelogram centred at the position of the region.
The method is unlike known approaches which, for large scale search, first completely discard the geometrical information and subsequently take advantage of it in a costly short-list post-verification based on exhaustive point matching.
According to the invention, a local key-point descriptor is recommended that incorporates, for each key-point, both the geometry of the surrounding key-points and their photometric information by means of local descriptors. That means that, for each key-point, a neighbourhood of other key-points is determined whose relative geometry and descriptors are encoded in a sparse vector using visual vocabularies and a geometrical codebook. The sparsity of the descriptor means that it can be stored in an inverted file structure to enable low complexity searches. The proposed descriptor, despite its sparsity, achieves performance comparable to or better than that of the scale-invariant feature transform, abbreviated as SIFT.
A local key-point descriptor that incorporates, for each key-point, both the geometry of the surrounding key-points as well as their photometric information through local descriptors is determined by a quantized photo-geometric subset as the Cartesian product of a first visual codebook for the central key-point descriptor, a geometrical codebook to quantize the relative positions of neighbors and a visual codebook for the descriptors of the neighbors.
That means that a Sparse Photo-Geometric Descriptor, in the following abbreviated SPGD, is provided that is a binary-valued sparse vector of a dimension equal to the cardinality of this subset and having non-zero values only at those positions corresponding to the geometric and photometric information of the neighboring key-points.
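As a minimal sketch of this representation, the snippet below maps each triplet (central visual word, quantized relative geometry, neighbour visual word) to a flat position in the Cartesian product and stores only the non-zero positions of the binary vector. The function names `triplet_index` and `spgd_support`, the row-major ordering of the product set, and the codebook sizes are illustrative assumptions and are not taken from the described method.

```python
def triplet_index(v, s, c, n_shape, n2):
    """Flat position of triplet (v, s, c) in the product V1 x G x V2.

    Row-major ordering is an assumption made for this sketch.
    """
    return (v * n_shape + s) * n2 + c


def spgd_support(v, neighbours, n_shape, n2):
    """Set of non-zero positions of the binary-valued SPGD vector.

    neighbours is a list of (s, c) pairs: the quantized relative geometry
    and the quantized descriptor of each neighbouring key-point.
    Duplicate pairs collapse because the vector is binary-valued.
    """
    return {triplet_index(v, s, c, n_shape, n2) for s, c in neighbours}


# Hypothetical codebook sizes, chosen only for illustration.
N1, N_SHAPE, N2 = 4, 10, 50
dimension = N1 * N_SHAPE * N2  # cardinality of the product set
support = spgd_support(v=2, neighbours=[(3, 7), (9, 49), (3, 7)],
                       n_shape=N_SHAPE, n2=N2)
```

Storing only the support keeps memory proportional to the number of neighbours rather than to the dimension of the product set.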
The proposed SPGD ensures that it is possible to obtain a sparse representation of local descriptors without sacrificing descriptive power. In fact, the proposed SPGD can outperform non-sparse SIFT descriptors built for several image pairs in an image registration application, and geometrical constraints for image registration can be used to reduce the local descriptor search complexity. This is contrary to known approaches wherein geometrical constraints are applied as an unavoidable and costly short-list post-verification process.
The use of relative key-point geometry is similar to a known shape context description scheme as disclosed, for example, by Mori G., Belongie S., Malik J., “Efficient Shape Matching Using Shape Contexts”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1832-1837, November 2005, with the difference that the SPGD recommended according to the present invention is based on key-points instead of contours and considers not only a pixel position but also key-point orientation and scale. Furthermore, SPGDs are very sparse vectors, which is not the case for shape context vectors. The quantized photo-geometric space relies, for example, on a product quantizer wherein sub-quantizers are applied to the relative geometries and descriptors rather than to sub-components of the descriptor vector. In this sense, the recommended SPGD is tailored specifically to image search based on local descriptors, and although the SPGD exploits both the photometrical information of key-points and their geometrical layout, it achieves a performance comparable to or even better than that of the SIFT descriptor.
A Sparse Photo-Geometric Descriptor (SPGD) is recommended that jointly represents the geometrical layout and photometric information through classic descriptors of a given local key-point and its neighboring key-points. The approach demonstrates that incorporating geometrical constraints in image registration applications does not need to be a computationally demanding operation carried out to refine a query response short-list, as is the case in existing approaches. Rather, geometrical layout information can itself be used to reduce the complexity of the key-point matching process. It is also established that the complexity reduction related to a sparse representation of local descriptors need not come at the expense of performance. Instead, it can even result in improved performance relative to non-sparse descriptor representations.
Exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in:
$$ f_n = [\log_2(\sigma_n),\ \Delta x_n,\ \Delta y_n,\ \theta_n]^T \qquad (1) $$
which, according to formula (1), consist, respectively, of a scale or size σn, a central position given by the coordinates xn and yn of an area Δ, and an orientation parameter given as an angle of orientation θn. For convenience, it is assumed that the scale parameter σn is expressed in terms of its logarithm.
The descriptor vectors dn are built using the known SIFT algorithm.
The SPGD representation of a local key-point includes in such way geometric information of all key-points in a geometric neighborhood. To define a geometric neighborhood, a neighbourhood region fm is used and geometry is expressed in terms of reference geometry:
The geometrical neighborhood of a region fn is defined as all those vectors and neighbourhood descriptors dm respectively of a neighbourhood region fm that are within a 4-dimensional parallelogram centered at the key-point of region fn and with sides of half-lengths log2 (Tσ), TΔ, TΔ and Tθ.
Letting T=diag(log2(Tσ), TΔ, TΔ, Tθ), the indices of those shapes in the neighborhood of region fn can be expressed as follows:
$$ M_n \triangleq \{\, m \in I : m \neq n \,\wedge\, \forall k,\ |(T^{-1}(f_m - f_n))[k]| \le 1 \,\} \qquad (3) $$
wherein v[k] denotes the k-th entry of a vector v and Mn represents a neighbourhood.
For convenience it is assumed that the entries of neighbourhood Mn are ordered, e.g. in increasing order, with the l-th entry denoted m_l^n. When possible, we will drop the superscript n and simply use m_l for notational convenience.
The Sparse Photo-Geometric Descriptor consists of representing each key-point (fn,dn) along with the features (fm
$$ Q(v; C) = \arg\min_l |c_l - v| \qquad (4) $$
and Ln the number of neighbours of a key-point.
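A nearest-codeword quantizer of this kind can be sketched in a few lines. The sketch assumes that the norm is the Euclidean distance and that a codebook is a plain list of codeword vectors; the function name `quantize` and the toy codebook are illustrative only.

```python
import math


def quantize(v, codebook):
    """Index l of the codeword c_l closest to v (Euclidean distance assumed)."""
    return min(range(len(codebook)), key=lambda l: math.dist(codebook[l], v))


# Toy two-dimensional codebook, for illustration only.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
index_a = quantize([0.9, 1.2], codebook)
index_b = quantize([3.0, 3.0], codebook)
```

For large vocabularies an exhaustive scan like this would normally be replaced by an approximate nearest-neighbour structure; the exhaustive form is used here only because it is the most direct reading of the definition.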
The SPGD construction process consists of three consecutive quantization steps. In the first step, the key-point descriptor dn is quantized using a visual vocabulary V1, as is done in a large number of approaches:
$$ v_n = Q(d_n; V_1) \qquad (5) $$
In the second step, vectors fm
s_l^n = Q(f_m
In the third step, the neighborhood descriptors dm are quantized using a visual vocabulary V2.
$$ c_l^n = Q(d_{m_l}; V_2) $$
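The three quantization steps can be put together as in the sketch below. It assumes Euclidean nearest-codeword quantization and assumes that the relative geometry of a neighbour is the component-wise difference of the geometry vectors; the names `build_spgd`, `V1`, `G` and `V2` as Python variables, and all sample data, are hypothetical.

```python
import math


def quantize(v, codebook):
    """Index of the codeword closest to v (Euclidean distance assumed)."""
    return min(range(len(codebook)), key=lambda l: math.dist(codebook[l], v))


def build_spgd(n, neighbours, geometries, descriptors, V1, G, V2):
    """Return (v_n, [(s_l, c_l), ...]) for key-point n.

    Step 1: quantize the central descriptor with the first vocabulary V1.
    Step 2: quantize each neighbour's relative geometry with the shape
            codebook G (plain difference is an assumption of this sketch).
    Step 3: quantize each neighbour's descriptor with the vocabulary V2.
    """
    v_n = quantize(descriptors[n], V1)
    pairs = []
    for m in neighbours:
        relative = [a - b for a, b in zip(geometries[m], geometries[n])]
        pairs.append((quantize(relative, G), quantize(descriptors[m], V2)))
    return v_n, pairs


# Toy data with 2-D descriptors, for illustration only.
geometries = [[0, 100, 100, 0], [0, 108, 101, 0], [0, 91, 100, 0]]
descriptors = [[9, 9], [4, 6], [0, 1]]
V1 = [[0, 0], [10, 10]]
G = [[0, -10, 0, 0], [0, 10, 0, 0]]
V2 = [[0, 0], [5, 5], [10, 10]]
spgd = build_spgd(0, [1, 2], geometries, descriptors, V1, G, V2)
```

The result is a small tuple of integers per key-point, which is what makes storage in an inverted file practical.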
The resulting SPGD (vn,{(sln,cln)}l=1L
That means as shown in
For comparing SPGDs:
The distance we propose to compare two SPGDs (Vn,{(sln,cln)}l=1L
Using the above matching function, the following similarity Φ measure between two SPGDs is recommended, where δkl is the so-called Kronecker delta function:
That means that SPGD descriptors are represented as sparse vectors defined in a high-dimensional space. Accordingly, the similarity Φ can be expressed as an inner product between these sparse vectors. To illustrate this, we first define a sparse photo-geometric subset as the Cartesian product A = V1 × G × V2 of the three SPGD codebooks. We next consider the vector Xn ∈ R^|A|, initialized to zero and having one entry per member triplet of A. An SPGD (vn,{(sln,cln)}l=1L
The similarity measure in equation (9) e.g. can be computed efficiently by storing the database SPGDs using a four-level nested list structure. An SPGD (vm,{(skm,ckm)}k=1L
Hence the similarity measure in equation (9) can be computed efficiently by aggregating over all those lists related to the neighborhood of the query SPGD:
That means that the query SPGD allows a low-complexity and efficient search.
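One possible reading of the nested list structure is sketched below: the first three levels are keyed by the central visual word, the quantized relative geometry and the neighbour visual word, and the fourth level is the list of database key-point identifiers. The sketch assumes that the similarity is the inner product of the binary sparse vectors, i.e. the number of shared triplets; the names `build_index` and `query` and the sample database are hypothetical.

```python
from collections import defaultdict


def build_index(database):
    """Nested structure v -> s -> c -> [key-point ids].

    database maps a key-point id to its SPGD (v, [(s, c), ...]).
    """
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for key_id, (v, pairs) in database.items():
        for s, c in set(pairs):  # binary vector: each triplet counted once
            index[v][s][c].append(key_id)
    return index


def query(index, spgd):
    """Score every database key-point sharing at least one triplet with spgd."""
    v, pairs = spgd
    scores = defaultdict(int)
    by_shape = index.get(v, {})
    for s, c in set(pairs):
        # Only the lists addressed by the query's own neighbours are visited.
        for key_id in by_shape.get(s, {}).get(c, []):
            scores[key_id] += 1
    return dict(scores)


# Hypothetical database of three SPGDs, for illustration only.
database = {
    "a": (1, [(1, 1), (0, 0)]),
    "b": (1, [(1, 1), (2, 2)]),
    "c": (0, [(1, 1), (0, 0)]),
}
index = build_index(database)
scores = query(index, (1, [(1, 1), (0, 0)]))
```

Key-point "c" is never touched by the query because its central visual word differs, which illustrates how the first level prunes the search.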
A preliminary evaluation of the proposed SPGD descriptor is carried out by using image registration experiments disclosed by K. Mikolajczyk and C. Schmid. “A performance evaluation of local descriptors” IEEE Trans. Pattern Anal. Mach. Intell., 27(10):1615-1630, 2005.
Accordingly, key-points and their descriptors are first computed on a pair of images corresponding to different views of the same scene. Each key-point of the reference image is then matched to the key-point in the transformed image yielding the smallest descriptor distance or inverse similarity measure, and the match correctness is established using the homography matrix for the image pair, allowing for a small registration error of e.g. 5 pixels. We then measure recall R and precision P, where
R=(# correct matches)/(# ground truth), (12)
P=(# correct matches)/(# total matches). (13)
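The evaluation step can be illustrated with the following sketch, which checks each match against a homography with a pixel tolerance and then evaluates the two ratios. The 3x3 homography is represented as nested lists, and the function names, the sample homography and the sample matches are assumptions made only for this illustration.

```python
import math


def project(H, point):
    """Apply a 3x3 homography (nested lists) to a 2-D point."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def recall_precision(matches, H, n_ground_truth, tol=5.0):
    """Recall and precision for a list of (reference point, matched point)."""
    n_correct = sum(
        1 for p_ref, p_trans in matches
        if math.dist(project(H, p_ref), p_trans) <= tol
    )
    recall = n_correct / n_ground_truth
    precision = n_correct / len(matches)
    return recall, precision


# Hypothetical homography: a pure translation by 10 pixels along x.
H = [[1.0, 0.0, 10.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matches = [((0, 0), (10, 0)), ((5, 5), (18, 5)), ((1, 1), (30, 1))]
recall, precision = recall_precision(matches, H, n_ground_truth=4)
```

In this toy case two of the three matches fall within the 5-pixel tolerance, out of four ground-truth correspondences.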
The total correct and wrong number of matches considered can be pruned by applying a maximum threshold on the absolute descriptor distance of matches. A second pruning strategy instead applies a maximum threshold to the ratio of distances to first and second nearest neighbours. We vary the threshold used to draw R, 1−P curves, using the labels abs. and ratio as shown in
Note that the ratio-based pruning approach requires that the exact first and second nearest neighbours (NN) be found. In large-scale applications where, as a result of the curse of dimensionality, approximate search methods are mandatory, this ratio-based match verification approach is not possible. Pruning based on the absolute distance order is more representative of approximate schemes where the exact first and second NN are very likely to be found in the short-list returned by the system. Indeed, for the proposed SPGD descriptor we only consider the exact first and second NN matching, whereas for the reference SIFT descriptor we will consider both matching strategies, as using an absolute threshold greatly improves SIFT's R, 1−P curve. The image pairs used to measure recall R and precision P are those of the Leuven-INRIA dataset as disclosed by the above-mentioned K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors”. The image pairs cover eight scenes, namely boat, bark, trees, graf, bikes, leuven, ubc and wall, with six images per scene labeled 1 to 6. Image 1 from each scene is taken as a reference image, and images 2 through 6 are transformed versions of increasing baseline. The transformation per scene is indicated in
The publicly available Flickr-60K visual vocabularies are used according to H. Jégou, M. Douze, and C. Schmid, “Hamming embedding and weak geometric consistency for large scale image search”, ECCV, volume I, pages 304-317, 2008. These visual vocabularies have sizes between 100 and 200,000 codewords and are trained on SIFT descriptors extracted from 60,000 images downloaded from the Flickr website. We also build smaller vocabularies of size 10 and 50 by applying K-means to the vocabulary of size 20,000. For consistency of presentation, we also consider a trivial, size 1 vocabulary as shown in
N1=1; N2=2,000; log2(Tσ)=1; log2(Rσ)=0.59; TΔ=30; RΔ=6; Rθ=0.79; Min. neighs.=1.
The following parameters need to be specified to define an SPGD description system:
While the values log2(Tσ) and TΔ determine SPGD invariance to image scale and cropping, Tθ only serves to control the effective size of the geometrical neighborhoods and hence the matching complexity.
In this embodiment it is assumed that Tθ = π, meaning that the relative angle is not used to constrain the geometrical neighborhoods and hence only 5 parameters are required to define the geometrical quantizer.
The sizes N1 and N2 of the visual codebooks V1 and V2 have to be determined. Furthermore, a minimum neighborhood size is selected, discarding local descriptors that have too few geometrical neighbors, resulting in a total of 8 parameters to be selected.
To select these 8 parameters, we maximize the area under the R, 1−P curve (AUC) for image 3 relative to image 1 of the leuven dataset.
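How the area is computed is not specified here; as one plausible sketch, the area can be approximated with the trapezoidal rule over sampled (1−P, R) points. The function name `area_under_curve` and the sample points are hypothetical.

```python
def area_under_curve(points):
    """Trapezoidal area under a curve sampled as (one_minus_precision, recall).

    The trapezoidal rule is an assumption of this sketch, not a stated
    part of the evaluation protocol.
    """
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical curve samples, for illustration only.
curve = [(0.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
auc = area_under_curve(curve)
```

Sorting the samples first makes the function independent of the order in which thresholds were swept.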
We use an iterative, coordinate-wise, exhaustive search over a coordinate-dependent set of discrete values to find a local maximum of the AUC curve. The AUC values versus the discrete parameter sets for the last iteration of the maximization are displayed in
The parameter selection approach described above maximizes only performance. A better approach would maximize performance subject to a constraint on query complexity as measured, for example, by the cumulative length of the lists from the inverted file visited during the query process. This measure of complexity will decrease with increasing resolution of the various quantizers. One would expect low complexity and high performance to imply opposing parameter requirements. Yet in only one case, that of the size N1 of the visual codebook V1, does the large quantizer bin size of highest complexity result in maximum performance. This suggests that other methods should be tried for using the information of the central key-point descriptor when designing an SPGD. A simple approach consists of a multiple assignment strategy where the query central descriptor is assigned to K>1 visual words from the first visual vocabulary V1.
Another approach consists of discarding quantization over the first visual vocabulary V1 altogether and instead subtracting the central key-point descriptor from those of neighboring key-points, accordingly training the second visual vocabulary V2 on a set of such re-centered neighbouring key-point descriptors dm.
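Both alternatives can be sketched briefly. The first function returns the K nearest visual words for a multiple assignment of the central descriptor, and the second re-centres neighbour descriptors by subtracting the central descriptor. Euclidean distance is assumed, and the function names and toy values are illustrative only.

```python
import math


def k_nearest_words(d, vocabulary, k):
    """Indices of the k codewords closest to descriptor d (Euclidean assumed)."""
    order = sorted(range(len(vocabulary)),
                   key=lambda l: math.dist(vocabulary[l], d))
    return order[:k]


def recentre(d_centre, neighbour_descriptors):
    """Subtract the central descriptor from each neighbour descriptor."""
    return [[a - b for a, b in zip(d_m, d_centre)]
            for d_m in neighbour_descriptors]


# Toy vocabulary and descriptors, for illustration only.
vocabulary = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
words = k_nearest_words([0.9, 1.2], vocabulary, k=2)
recentred = recentre([1, 2], [[3, 5], [1, 2]])
```

With multiple assignment, a query would visit the inverted-file entries of every returned word, which trades extra list traversals for robustness to quantization errors on the central descriptor.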
The performance of the SPGD is illustrated in
its scale is greater than σn/Tσ and smaller than σn*Tσ, its offset relative to key-point n is less than TΔ and its angle difference relative to key-point n is less than Tθ. Only key-points m1 and m2 satisfy these constraints.
| Filing Document | Filing Date | Country | Kind | 371c Date |
|---|---|---|---|---|
| PCT/EP2012/060779 | 6/7/2012 | WO | 00 | 12/5/2014 |