This invention relates generally to image processing, and more particularly to extracting descriptors from images and videos that can be used for querying, classification and object detection.
The advent of inexpensive cameras and inexpensive storage has made it practical to collect images and video for storage in very large databases. For example, it is estimated that one popular social media provider stores about 80 billion images, and processes 600,000 images per second.
The commercial viability of such databases depends in large part on the availability of search and retrieval applications. Thus, a great deal of effort has been devoted to search and retrieval mechanisms for images. In general, such mechanisms rely on identifying points of interest in an image, often referred to as keypoints, and then extracting features from these points that remain accurate when subject to variations in translation, rotation, scaling and illumination.
Examples of such features include scale-invariant feature transform (SIFT), speeded-up robust features (SURF), binary robust invariant scalable keypoints (BRISK), fast retina keypoint (FREAK), histogram of oriented gradients (HoG), circular Fourier-HOG (CHOG), and others.
To reduce the bandwidth and complexity of such applications, while preserving matching accuracy and speed, the features are often aggregated and summarized to more compact descriptors. Approaches for compacting the feature spaces include principal component analysis (PCA), linear discriminant analysis (LDA), boosting, spectral hashing, and the popular Bag-of-Features approach. The latter converts features to compact descriptors (codewords) using cluster centers produced by k-means clustering.
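As a rough illustration of the Bag-of-Features idea, the sketch below runs a plain Lloyd's k-means on the columns of a feature matrix and returns the cluster centers (codewords) together with the codeword index assigned to each feature. The function name `kmeans_codebook`, the random initialization, and the fixed iteration count are assumptions made for this example only.

```python
import numpy as np

def kmeans_codebook(X, k, n_iter=20, seed=0):
    """Lloyd's k-means on the columns of X (m x N).

    Returns the centres (m x k) and, for each column, the index of its nearest centre.
    """
    rng = np.random.default_rng(seed)
    # Initialise the centres with k distinct columns of X (an arbitrary choice).
    centres = X[:, rng.choice(X.shape[1], size=k, replace=False)].astype(float)

    def assign(c):
        # Squared Euclidean distance from every column to every centre: shape (k, N).
        d2 = ((X[:, None, :] - c[:, :, None]) ** 2).sum(axis=0)
        return d2.argmin(axis=0)

    for _ in range(n_iter):
        labels = assign(centres)
        for j in range(k):
            members = X[:, labels == j]
            if members.shape[1] > 0:          # leave empty clusters unchanged
                centres[:, j] = members.mean(axis=1)
    return centres, assign(centres)

# Toy data standing in for real image features.
features = np.random.default_rng(1).random((8, 40))
codewords, codes = kmeans_codebook(features, k=4)
```

Each feature is then represented by a single integer in `codes` plus the shared `codewords` table, which is what makes the representation compact.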
The compact descriptors extracted from a query image or video can be compared to descriptors extracted from images in the database to determine similar images. There has, however, been much less work in developing efficient feature matching mechanisms for video queries.
To extend conventional image descriptors to derive video descriptors is not straightforward. One naïve method extracts image descriptors from each image in the video sequence, treating each image separately. That method fails to exploit the fact that features extracted from successive video images tend to be very similar, and describe similar keypoints, resulting in a very redundant representation. Furthermore, that method does not remove features that are not persistent from image to image, and probably does not describe the video sequence very well. Thus, simply collecting individual image descriptors is bandwidth-inefficient and significantly increases matching complexity.
A more efficient approach is to compress the descriptors derived from each video image, exploiting the motion of those descriptors through the video sequence. Those methods exploit powerful paradigms from video compression, such as motion-compensated prediction and rate-distortion optimization, to reduce the bit-rate of the transmitted descriptors. However, those methods do not address the problem of discovering a small set of descriptors that can represent a visually salient object.
The embodiments of the invention provide a method for extracting low-rank descriptors of a video acquired of a scene, wherein the video includes a sequence of images.
Therefore, it is an object of this invention to generate a low-rank descriptor that reduces the amount of information that is required to store representative descriptors of a video scene, while maintaining the discriminability relative to descriptors generated from different video scenes. Another object of this invention is to utilize the low-rank descriptors for querying and retrieving videos from a large database, and for object detection.
In one embodiment of this invention, the low-rank descriptors are generated by extracting visual descriptors from a group of pictures (GoP) in a video, determining a low-rank descriptor representation of the video scene descriptors, and determining a selection matrix that associates every extracted descriptor with a corresponding column in the low-rank descriptor.
Another embodiment of the invention extracts a low-rank descriptor from a large collection of video descriptors using non-negative matrix factorization (NMF), comprising a sequence of steps where a low-rank factor is first determined by non-negative least squares minimization, next a selection factor is determined by minimizing a proximal point least squares problem, and then keeping a largest entry in every column of the selection matrix and setting all other entries to zero. The sequence of steps is repeated until the low-rank factor and the selection matrix do not change.
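The alternating steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a definitive implementation: the non-negative least squares step is approximated by a proximal (ridge-regularized) least squares solve followed by clipping at zero rather than by a dedicated non-negative solver, the columns of the low-rank factor are rescaled to unit length, and the function name `low_rank_descriptor`, the default value of `rho`, the initialization from random columns of X, and the stopping tolerance are all hypothetical choices.

```python
import numpy as np

def low_rank_descriptor(X, r, rho=1e-2, max_iter=100, seed=0):
    """Alternate between a low-rank factor L (m x r) and a selection factor R (r x N)."""
    rng = np.random.default_rng(seed)
    m, N = X.shape
    # Assumed initialisation: r distinct columns of X, scaled to unit length.
    L = X[:, rng.choice(N, size=r, replace=False)].astype(float)
    L = L / (np.linalg.norm(L, axis=0, keepdims=True) + 1e-12)
    R = np.zeros((r, N))
    I_r = np.eye(r)
    cols = np.arange(N)
    for _ in range(max_iter):
        L_old, R_old = L, R
        # Low-rank factor: proximal least squares in L, clipped at zero
        # (a stand-in for a true non-negative least squares solve).
        L = np.linalg.solve(R_old @ R_old.T + rho * I_r,
                            R_old @ X.T + rho * L_old.T).T
        L = np.maximum(L, 0.0)
        L = L / (np.linalg.norm(L, axis=0, keepdims=True) + 1e-12)
        # Selection factor: proximal-point least squares in R, clipped at zero.
        R = np.linalg.solve(L.T @ L + rho * I_r, L.T @ X + rho * R_old)
        R = np.maximum(R, 0.0)
        # Keep the largest entry in every column of R and zero all the others.
        keep = np.argmax(R, axis=0)
        selected = np.zeros_like(R)
        selected[keep, cols] = R[keep, cols]
        R = selected
        # Stop once neither factor changes.
        if np.allclose(L, L_old, atol=1e-8) and np.allclose(R, R_old, atol=1e-8):
            break
    return L, R

# Toy example: 60 non-negative descriptors of length 16 drawn around 3 centres.
rng = np.random.default_rng(1)
centres = rng.random((16, 3))
X_demo = np.hstack([centres[:, [k]] + 0.01 * rng.random((16, 20)) for k in range(3)])
L_demo, R_demo = low_rank_descriptor(X_demo, r=3)
```

Because the solution depends on the initialization, different seeds can yield different factors for the same input.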
Another embodiment of the invention classifies video scenes by using the low-rank descriptors, comprising determining a low-rank descriptor of a query video, determining a low-rank descriptor of each of many videos available in a database, determining the correlation coefficient between the low-rank descriptor of the query video and the low-rank descriptor of each of the database videos, and assigning the query video to the database video with a low-rank descriptor that has a largest correlation coefficient with the low-rank descriptor of the query video.
Another embodiment of the invention detects objects in a video, comprising acquiring a video of an object, subtracting the background pixels from the video to keep only pixels representing the object, extracting visual descriptors from every image containing only the object, determining a low-rank descriptor from the visual descriptors extracted from the background subtracted video, determining the correlation coefficient between the low-rank descriptor and visual descriptors belonging to several videos available in a database, and assigning the object to the video in the database that has a visual descriptor with a highest correlation coefficient relative to the low-rank descriptor of the query object.
The embodiments consider the problem of extracting descriptors that represent visually salient portions of a video sequence. Most state-of-the-art schemes generate video descriptors by extracting features, e.g., SIFT or SURF or other keypoint-based features, from individual video images. Those approaches are wasteful in scenarios that impose constraints on storage, communication overhead and on the allowable computational complexity for video querying. More important, the descriptors obtained by that approach generally do not provide semantic clues about the video content.
Therefore, the embodiments provide novel feature-agnostic approaches for efficient retrieval of similar video content. The efficiency and accuracy of retrieval is evaluated relative to applying k-means clustering to image features extracted from video images. The embodiments also propose a novel approach in which the extraction of low-rank video descriptors is cast as a non-negative matrix factorization (NMF) problem.
The embodiments of our invention provide a method for extracting low-rank descriptors of a video acquired of a scene, wherein the video includes a sequence of images. The low-rank descriptors of visual scenes allow us to reduce the amount of metadata that is compressed and stored with the video bitstream, while maintaining a discriminative representation of the scene content. Our framework assumes that local scene descriptors, such as SIFT or HoG features, are extracted from every video image in a group of pictures (GoP). The descriptors are stacked to form a matrix X of size m×N where m is a length of the feature vector and N is a total number of descriptors extracted from the GoP. In many situations, the number of descriptors can reach several hundred features per image.
For the purpose of this description, the rank of an individual descriptor is 1. When descriptors are aggregated into the matrix X, the rank is at most the minimum of 128 (the SIFT descriptor length) and the number of columns in the matrix X. Therefore, any compact descriptor with a rank less than 128 is considered to be low-rank.
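The stacking of per-image descriptors into X, and the resulting rank bound, can be illustrated with the short sketch below. Random non-negative values stand in for real SIFT features, and the per-image descriptor counts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 128                        # SIFT descriptor length
per_image = [200, 180, 220]    # hypothetical descriptor counts for a 3-image GoP

# One m x n_i block per image; random values stand in for extracted features.
blocks = [rng.random((m, n)) for n in per_image]
X = np.hstack(blocks)          # descriptor matrix X of size m x N
N = X.shape[1]

rank_X = np.linalg.matrix_rank(X)
print(X.shape, rank_X, min(m, N))
```

For generic data such as this the rank reaches the bound min(m, N); a low-rank descriptor keeps far fewer than 128 columns.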
As shown in
A set of descriptors 111 is extracted 110 for each image in the video. The sets of descriptors are aggregated 120 to form a descriptor matrix 121. A low-rank descriptor matrix 131 representing the scene is determined 130. Then, a selection matrix 141 that associates every extracted descriptor with a corresponding column in the low-rank descriptor matrix is also determined 140. Steps 130 and 140 are iterated until convergence, at which point the low-rank descriptor matrix is output. The steps of the method can be performed in a processor 100 connected to memory and input/output interfaces by buses as known in the art.
Matrix factorization is a technique used for determining low dimensional representations for high dimensional data. An m×N matrix X is factored into two components L and R such that their product closely approximates the original matrix
X≈LR. (1)
In the special case where the matrix and its factors have non-negative entries, the problem is known as non-negative matrix factorization (NMF). NMF has gained popularity in machine learning and data mining, for example for searching videos stored in a very large database.
Several NMF formulations exist, with variations on the approximation cost matrix, the structure imposed on the non-negative factors, applications, and the computational methods to achieve the factorization, among others.
Of interest to the invention are NMF formulations used for clustering. Specifically, we consider sparse NMF and orthogonal NMF formulations. The orthogonal NMF problem is defined as
where T is a vector transpose operator, and I is an identity matrix. This formulation is equivalent to k-means clustering.
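For reference, the orthogonal NMF problem is commonly written in the form sketched below. The one-half scaling and the Frobenius-norm cost are assumptions of this sketch; the superscript T denotes the transpose and I the identity matrix.

```latex
\min_{L \ge 0,\; R \ge 0} \; \tfrac{1}{2}\,\lVert X - LR \rVert_F^{2}
\quad \text{subject to} \quad R R^{T} = I
```

With non-negative entries, the orthogonality constraint on the rows of R forces each column of R to have at most one non-zero entry, which is why the factorization behaves like an assignment of each column of X to one column of L.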
Alternatively, the sparse NMF problem relaxes the orthogonality constraint on R replacing R with an Ll
where α and β are problem-specific regularization parameters.
Note that NMF problems are non-convex. Procedures that solve these problems generally do not have global optimality guarantees. Therefore, different procedures that solve the same problem can arrive at different solutions. In what follows, we develop a procedure that addresses the orthogonal NMF problem, and demonstrate that the solutions produced by our procedure have better classification properties compared to k-means and sparse NMF.
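One common way to write a sparse NMF objective with two regularization parameters is sketched below. Which factor each of α and β weighs, and the specific norms used, are assumptions made here for illustration.

```latex
\min_{L \ge 0,\; R \ge 0} \; \tfrac{1}{2}\,\lVert X - LR \rVert_F^{2}
\;+\; \alpha\,\lVert R \rVert_{1} \;+\; \beta\,\lVert L \rVert_F^{2}
```

In this form, the first penalty encourages sparse columns in R in place of strict orthogonality, while the second keeps the entries of L from growing to compensate.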
Low-rank descriptors of visual scenes enable us to reduce the amount of metadata that is compressed and stored with a video bitstream, while maintaining a discriminative representation of the scene content. Our framework assumes that local scene descriptors, such as SIFT or HoG features, are extracted from every video image in a group of pictures (GoP). The descriptors are stacked to form the descriptor matrix X 121 of size m×N, where m is a length of the feature vector and N is a total number of descriptors extracted from the GoP.
In many situations, the number of descriptors N can reach several hundred features per image. Therefore, it is imperative that these descriptors be encoded in a compact manner. In this section, we develop a framework for extracting a low-rank descriptor that represents the salient visual information in a video scene.
We observe that visually salient objects in a scene maintain a nearly stationary descriptor representation throughout the GoP. Therefore, we formulate the problem of determining a low-rank descriptor of a video scene as that of determining a low dimensional representation of the matrix X. Ideally, the set of feature vectors that represent the salient objects in a GoP can be encoded using a matrix L ∈ ℝ^(m×r), where r ≪ N represents the number of descriptors that distinctly represent the salient object.
where L_i and R_j are the columns of the matrices L and R indexed by i and j, respectively, and ℝ₊ is the positive orthant.
The NMF formulation in equation (4) functions similarly to a k-means classifier and ensures that, for a large enough r, the columns of L̂ contain the cluster centers of dominant features in the matrix X, while the selection matrix R selects the cluster centers in L̂ that best match the data.
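A formulation consistent with this description is sketched below. The unit-norm constraint on the columns of L, the single non-zero entry per column of R, and the one-half scaling are assumptions inferred from the surrounding text rather than a verbatim statement of equation (4).

```latex
(\hat{L}, \hat{R}) \;=\; \arg\min_{L,\,R} \; \tfrac{1}{2}\,\lVert X - LR \rVert_F^{2}
\quad \text{subject to} \quad
\lVert L_i \rVert_2 = 1 \;\; \forall i, \qquad
\lVert R_j \rVert_0 = 1 \;\; \forall j, \qquad
L \in \mathbb{R}_{+}^{m \times r}, \;\; R \in \mathbb{R}_{+}^{r \times N}
```

Under these constraints each column of X is approximated by a scaled copy of exactly one column of L, which is the k-means-like behavior noted above.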
As shown in
where ρ is a parameter that controls smoothness of the problem.
The columns of L̂ are then projected onto the non-negative Ll
As shown in
Suppose that the query video as well as the database videos are partitioned into GoPs of size n video images. Let L̂_Q denote the low-rank descriptor of the query GoP, and let L̂_D(g) denote the low-rank class descriptors of the GoPs in the database, indexed by g. A database GoP indexed by ĝ matches the query GoP if it has a largest correlation coefficient relative to L̂_Q, i.e.,
where an infinity norm ‖·‖∞ is applied after vectorizing the matrix product L̂_Q^T L̂_D(g). Consequently, the matching GoP in the database is the one whose low-rank descriptor correlates best with the query descriptor, and the class of the matching GoP can be assigned to the query GoP.
The classification method described above can also be used for video retrieval. In this case, the retrieval method obtains videos from the database with correlation coefficients larger than a predetermined threshold.
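The matching and retrieval rules can be sketched as follows, taking the correlation coefficient to be the largest absolute entry of the product of the transposed query descriptor and a database descriptor (the infinity norm of the vectorized product). The function names and the toy descriptors built from unit vectors are hypothetical.

```python
import numpy as np

def correlation_score(L_query, L_db):
    """Infinity norm of the vectorised product L_query^T L_db."""
    return float(np.max(np.abs(L_query.T @ L_db)))

def match_gop(L_query, db_descriptors):
    """Index of the database GoP whose descriptor correlates best with the query."""
    scores = [correlation_score(L_query, L_g) for L_g in db_descriptors]
    return int(np.argmax(scores)), scores

def retrieve(L_query, db_descriptors, threshold):
    """Indices of all database GoPs whose score exceeds the threshold."""
    return [g for g, L_g in enumerate(db_descriptors)
            if correlation_score(L_query, L_g) > threshold]

# Toy database: three rank-2 descriptors with unit-norm, mutually orthogonal columns.
eye = np.eye(6)
database = [eye[:, 0:2], eye[:, 2:4], eye[:, 4:6]]
query = eye[:, 2:4]

best, scores = match_gop(query, database)
hits = retrieve(query, database, threshold=0.5)
```

With unit-norm columns the score lies between 0 and 1, so a single threshold can be shared across GoPs.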
We can also use the low-rank descriptor to detect an object in a video. This process is similar to what is shown in
Thus, when the scene includes a specific object, background pixels are subtracted from each image in the video to obtain the foreground video. A low-rank object descriptor is determined from the foreground video. A low-rank object class descriptor of each video in a database is also determined, wherein each video in the database is associated with an object class. The object class of the video in the database with a largest correlation coefficient is assigned to the foreground video.
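The background subtraction step is not tied to a particular technique in this description. As one simple possibility, the sketch below assumes a static camera, estimates the background as the per-pixel temporal median, and keeps only pixels that depart from it by more than a threshold. The function name, the median model, and the threshold value are all assumptions.

```python
import numpy as np

def subtract_background(frames, threshold=0.5):
    """frames: array of shape (T, H, W) holding grey-level images.

    Returns the foreground video (background pixels set to zero) and the boolean mask.
    """
    frames = np.asarray(frames, dtype=float)
    background = np.median(frames, axis=0)            # static background estimate
    mask = np.abs(frames - background) > threshold    # True where a pixel departs from it
    return frames * mask, mask

# Toy video: uniform background with one bright pixel moving along the diagonal.
T = 5
video = np.full((T, 8, 8), 0.2)
for t in range(T):
    video[t, t, t] = 1.0

foreground, mask = subtract_background(video)
```

Descriptors would then be extracted from `foreground` only, so that the low-rank descriptor reflects the object rather than the scene around it.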
Our experimental data demonstrate that low dimensional clustering of visual features according to embodiments of the invention can significantly reduce the memory requirements for representing visually salient objects in a video scene.
A rank-30 descriptor achieves storage reductions that exceed 97% and average 99%. Moreover, the low-rank descriptors maintain their discriminability with well over 90% matching accuracy despite the significant compression.
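To give a feel for the scale of such reductions, the sketch below assumes that storage is proportional to the number of stored descriptor columns and ignores the selection matrix, so that keeping r columns in place of N saves a fraction 1 − r/N. Only the rank of 30 comes from the text; the values of N are hypothetical and are not the experimental settings.

```python
def storage_reduction(rank, num_descriptors):
    """Fraction of descriptor columns saved by keeping `rank` of `num_descriptors`."""
    return 1.0 - rank / num_descriptors

# Hypothetical numbers of descriptors per GoP.
for num_descriptors in (1000, 3000):
    saved = storage_reduction(30, num_descriptors)
    print(f"N = {num_descriptors}: {100 * saved:.1f}% fewer descriptor columns")
```

Under this simplified accounting, the saving grows with the number of descriptors extracted from the GoP, since the stored rank stays fixed.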
Procedurally, we demonstrate that our proposed orthogonal NMF (ONMF) method for determining low dimensional clusters is more discriminative than both k-means clustering and sparse NMF. Our approach is also more robust to variations in the number of clusters than k-means.
One striking observation is that while sparse NMF outperforms k-means for very low-rank representations, it quickly becomes unstable as the number of clusters, i.e., the rank of the factors, increases. We also note that because all of the above mentioned clustering problems are non-convex, the solutions to these problems depend on the initialization.
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.