1. Field of the Invention
The present application relates generally to digital video processing and more particularly to automated recognition and classification of image objects in digital video streams.
2. Description of the Background Art
Video has become ubiquitous on the Web. Millions of people watch video clips every day. The content varies from short amateur clips, about 20 to 30 seconds in length, to premium content that can be as long as several hours. With broadband infrastructure becoming well established, video viewing over the Internet will continue to increase.
Video watching on the Internet is, today, a passive activity. Viewers typically watch video streams from beginning to end much like they do with television. In contrast, with static Web pages, users often search for text of interest to them and then go directly to that portion of the Web page.
Applicants believe that it would be highly desirable, given an image or a set of images of an object, for users to be able to search for the object, or type of object, in a single video stream or a collection of video streams. However, for such a capability to be reliably achieved, a robust technique for object recognition and classification is required.
A number of classifiers have now been developed that allow an object under examination to be compared with an object of interest or a class of interest. Some examples of classifier/matcher algorithms are Support Vector Machines (SVM), nearest-neighbor (NN), Bayesian networks, and neural networks. The classifier algorithms are applied to the subject image.
In previous techniques, the classifiers operate by comparing a set of properties extracted from the subject image with the set of properties similarly computed on the object(s) of interest stored in a database. These properties are commonly referred to as local feature descriptors. Some examples of local feature descriptors are scale invariant feature transforms (SIFT), gradient location and orientation histograms (GLOH), and shape contexts. A large number of local feature descriptors are available and known in the art.
The local feature descriptors may be computed on each object separately in the image under consideration. For example, SIFT local feature descriptors may be computed on the subject image and the object of interest. If the properties are close in some metric, then the classifier produces a match. To compute the similarity measure, the SVM matcher algorithm may be applied to the set of local descriptor feature vectors, for example.
The classifier is trained on a series of images containing the object of interest (the training set). For the most robust matching, the series contains the object viewed under many different conditions, such as viewing angle, ambient lighting, and camera type.
However, even though multiple views and conditions are used in the training set, previous classifiers still often fail to produce a match. Failure to produce a match typically occurs when the object of interest in the subject frame does not appear in precisely or almost the same viewing conditions as in at least one of the images in the training set. If the properties extracted from the object of interest in the subject frame vary too much from the properties extracted from the object in the training set, then the classifier fails to produce a match.
The present application discloses a technique to more robustly perform object identification and/or classification. Improvement comes from the capability to go beyond applying the classifier to an object in a single subject frame. Instead, a capability is provided to apply the classifier to the object of interest moving through a sequence of frames and to statistically combine the results from the different frames in a useful manner.
Given that the object of interest is tracked through multiple frames, the object appears in multiple views, each one somewhat different from the others. Since the matching confidence level (similarity measure) obtained by the classifier depends heavily on the difference between the viewed image and the training set, having different views of the same object in different frames results in varying matching quality based on different features being available for a match. A statistical averaging of the matching results may therefore be produced by combining the results from the different subject frames. Advantageously, this significantly improves the chance of correct classification (or identification) by increasing the signal-to-noise ratio.
The object tracking module 122 identifies the pixels belonging to each object in each frame. An example video sequence is shown in the accompanying figures.
The object tracking module 122 may be configured to output an object pixel mask per object per frame 104. An object pixel mask identifies the pixels in a frame that belong to an object. The object pixel masks may be input into a local feature descriptor module 124.
The local feature descriptor module 124 may be configured to apply a local feature descriptor algorithm, for example, one of those mentioned above (i.e., scale invariant feature transforms (SIFT), gradient location and orientation histograms (GLOH), or shape contexts). For instance, a set of SIFT feature vectors may be computed from the pixels belonging to a given object. In general, a set of feature vectors will contain both local and global information about the object. In a preferred embodiment, features may be selected at random positions and size scales. For each randomly selected point, a local descriptor may be computed and stored as a feature vector. Such local descriptors are known in the art. The set of local descriptors calculated over the selected features in the object are used together for matching. An example extracted image with feature points is shown in the accompanying figures.
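For illustration, the following is a minimal sketch of such random-position, random-scale descriptor extraction, assuming OpenCV's SIFT implementation; the function name, the sampling count, and the 4-to-32 pixel scale range are hypothetical choices, not prescribed by the method.

```python
# Hypothetical sketch of the local feature descriptor module 124: SIFT
# descriptors sampled at random positions and size scales inside an
# object's pixel mask. Assumes OpenCV (opencv-python >= 4.4); the
# sampling count and 4-32 pixel scale range are illustrative choices.
import numpy as np
import cv2

def object_descriptors(gray_frame, object_mask, n_points=100, rng=None):
    """gray_frame: 8-bit grayscale image; object_mask: non-empty boolean mask."""
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(object_mask)           # pixels belonging to the object
    idx = rng.choice(len(xs), size=min(n_points, len(xs)), replace=False)
    keypoints = [cv2.KeyPoint(float(xs[i]), float(ys[i]),
                              float(rng.uniform(4, 32)))  # random size scale
                 for i in idx]
    sift = cv2.SIFT_create()
    _, descriptors = sift.compute(gray_frame, keypoints)
    return descriptors    # one 128-dimensional feature vector per point
```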
The set of local feature vectors for an object, obtained for each frame, may then be fed into a classifier module 126. The classifier module 126 may be configured to apply a classifier and/or matcher algorithm.
For example, the set of local feature vectors per object 106 from the local feature descriptor module 124 may be input by the classifier module 126 into a Support Vector Machine (SVM) engine or other matching engine. The engine may produce a score or value for matching with classes of interest in a classification database 127. The classification database 127 is previously trained with various object classes. The matching engine is used to match the set of feature vectors to the classification database 127. For example, in order to identify a "van" object, the matching engine may return a similarity measure x_i for each candidate object i in an image (frame) relative to the "van" class. The similarity measure may be a value ranging from 0 to 1, with 0 being not at all similar and 1 being an exact match. For each value of x_i, there is a corresponding value of p_i, which is the estimated probability that the given object i is a van.
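By way of illustration, the following sketch shows how such per-frame similarity scoring might be realized with an off-the-shelf SVM; scikit-learn is assumed, the function names are hypothetical, and pooling a bag of local descriptors into one vector by averaging is a simplification of the matching engines named above.

```python
# Illustrative per-frame scoring for the classifier module 126, assuming
# scikit-learn. Pooling local descriptors by averaging is a simplification.
import numpy as np
from sklearn.svm import SVC

def train_class_model(pos_vectors, neg_vectors):
    """Train an SVM on pooled descriptors of positive/negative examples."""
    X = np.vstack([pos_vectors, neg_vectors])
    y = np.array([1] * len(pos_vectors) + [0] * len(neg_vectors))
    return SVC(probability=True).fit(X, y)

def frame_similarity(model, object_descriptors):
    """Return a similarity/probability estimate in [0, 1] for one frame."""
    pooled = object_descriptors.mean(axis=0, keepdims=True)  # crude pooling
    return float(model.predict_proba(pooled)[0, 1])
```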
The per-frame similarity measures obtained for a given object may then be input into a classification score aggregator 128, which combines the scores from the multiple frames in which the object appears.
For example, given a sequence of image frames containing a candidate object, Table 1 lists the similarity score obtained for the object in each frame, together with the corresponding probability that the object is a van.
In accordance with a first embodiment, a highest score achieved on any of the frames may be used. For the particular example given in Table 1, the score from frame 40 would be used. In that case, the probability of the given object being a van would be determined to be 73%. This determined probability may then be compared against a threshold probability. If the determined probability is above (or is equal to or above) the threshold probability, then the classification score aggregator 128 may identify or classify the given object as a van and that classification for the given object 110 may be output.
In accordance with a second embodiment, the average of scores from all the frames with the given object may be used. For the particular example given in Table 1, the average similarity score is 0.61, which corresponds to a probability of 64%. If this determined probability is above (or is equal to or above) the threshold probability, then the classification score aggregator 128 may identify or classify the given object as a van and that classification for the given object 110 may be output.
In accordance with a third embodiment, a median score of the scores from all the frames with the given object may be used. For the particular example given in Table 1, the median similarity score is 0.61, which corresponds to a probability of 64%. If this determined probability is above (or is equal to or above) the threshold probability, then the classification score aggregator 128 may identify or classify the given object as a van and that classification for the given object 110 may be output.
In accordance with a fourth and preferred embodiment, a Bayesian inference may be used to get a better estimate of the probability that the object is a member of the class of interest. The Bayesian inference is used to combine or fuse the data from the multiple frames, where the data from each frame is viewed as an independent measurement of the same property.
Using Bayesian statistics, if we have two measurements of a same property with probabilities p_1 and p_2, then the combined probability is p_12 = p_1 p_2 / [p_1 p_2 + (1 − p_1)(1 − p_2)]. Similarly, if we have n measurements of a same property with probabilities p_1, p_2, p_3, . . . , p_n, then the combined probability is p_1..n = p_1 p_2 p_3 . . . p_n / [p_1 p_2 p_3 . . . p_n + (1 − p_1)(1 − p_2)(1 − p_3) . . . (1 − p_n)]. If this combined probability is above (or is equal to or above) the threshold probability, then the classification score aggregator 128 may identify or classify the given object as a van, and that classification for the given object 110 may be output.
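The four aggregation embodiments above may be summarized in a short sketch; the per-frame probabilities used below are hypothetical and do not reproduce the values of Table 1.

```python
# Sketches of the four aggregation embodiments: highest score, average,
# median, and Bayesian fusion. The per-frame probabilities are hypothetical.
import statistics
from math import prod

def bayes_combine(probs):
    """p_1..n = (p_1 ... p_n) / [(p_1 ... p_n) + (1-p_1) ... (1-p_n)]."""
    num = prod(probs)
    return num / (num + prod(1.0 - p for p in probs))

frame_probs = [0.73, 0.64, 0.58]            # hypothetical per-frame values p_i
print(max(frame_probs))                      # first embodiment: highest score
print(statistics.mean(frame_probs))          # second embodiment: average
print(statistics.median(frame_probs))        # third embodiment: median
print(bayes_combine(frame_probs))            # fourth embodiment: Bayesian fusion
```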
For the particular example given in Table 1, the probability that the object under consideration is a van is determined, using Bayesian statistics, to be 96.1%. This probability is higher under Bayesian statistics because the measurements from the multiple frames reinforce one another, yielding very high confidence that the object is a van. Thus, if the threshold for recognition is, for example, 95%, which is not reached by analyzing the data in any individual frame, the threshold would still be passed in this example due to the higher confidence from the multiple-frame analysis using Bayesian inference.
Advantageously, the capability to use multiple instances of a same object to statistically average out the noise may result in significantly improved performance for an image object classifier or identifier. The embodiments described above provide example techniques for combining the information from multiple frames. In the preferred embodiment, a substantial advantage is obtainable when the results from a classifier are combined from multiple frames.
The object tracking and segmentation performed by the object tracking module 122 is now described in further detail by way of example.
In a first phase, shown in block 602, a static-image segmentation is performed on each frame of the video sequence.
Per block 704, given a segmentation of a static image, the motion vectors for each segment are computed. The motion vectors are computed with respect to displacement in one or more future or past frames. The displacement is computed by minimizing an error metric with respect to the displacement of the current frame segment onto the target frame. One example of an error metric is the sum of absolute differences. Thus, one example of computing a motion vector for a segment would be to minimize the sum of absolute differences of each pixel of the segment with respect to pixels of the target frame as a function of the segment displacement.
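A minimal sketch of this per-segment motion search follows, assuming 8-bit grayscale frames as NumPy arrays and a boolean mask per segment; the exhaustive search range is an assumed parameter.

```python
# Sketch of per-segment motion estimation by exhaustive SAD minimization
# per block 704. The +/-8 pixel search range is an assumed parameter.
import numpy as np

def segment_motion_vector(cur, tgt, mask, search=8):
    """Return the (dy, dx) displacement minimizing the segment's SAD."""
    ys, xs = np.nonzero(mask)
    h, w = tgt.shape
    best_sad, best_dv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ny, nx = ys + dy, xs + dx
            # Skip displacements that push the segment off the frame.
            if ny.min() < 0 or nx.min() < 0 or ny.max() >= h or nx.max() >= w:
                continue
            diff = cur[ys, xs].astype(int) - tgt[ny, nx].astype(int)
            sad = np.abs(diff).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_dv = sad, (dy, dx)
    return best_dv
```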
Per block 706, segment correspondence is performed. In other words, links between segments in two frames are created. For instance, a segment (A) in frame 1 is linked to a segment (B) in frame 2 if segment A, when motion compensated by its motion vector, overlaps with segment B. The strength of the link is preferably given by some combination of properties of Segment A and Segment B. For instance, the amount of overlap between motion-compensated Segment A and Segment B may be used to determine the strength of the link, where the motion-compensated Segment A refers to Segment A as translated by a motion vector to compensate for motion from frame 1 to frame 2. Alternatively, the overlap of the motion-compensated Segment B and Segment A may be used to determine the strength of the link, where the motion-compensated Segment B refers to Segment B as translated by a motion vector to compensate for motion from frame 2 to frame 1. Or a combination (for example, an average or other mathematical combination) of these two may be used to determine the strength of the link.
Finally, per block 708, a graph data structure is populated so as to construct a temporal graph for N frames. In the temporal graph, each segment forms a node in the temporal graph, and each link determined per block 706 forms a weighted edge between the corresponding nodes.
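The following sketch illustrates one way blocks 706 and 708 might be realized, with the overlap of the motion-compensated segment used as the link strength; networkx is assumed for the graph data structure, and the data layout (per-frame lists of (segment_id, mask, motion_vector) tuples) is hypothetical.

```python
# Illustrative realization of blocks 706 and 708: overlap of the
# motion-compensated segment supplies the link strength, and each segment
# becomes a node in the temporal graph. Assumes networkx.
import numpy as np
import networkx as nx

def link_strength(mask_a, mv_a, mask_b):
    """Overlap of motion-compensated Segment A with Segment B.
    np.roll wraps at frame borders, an acceptable simplification here."""
    shifted = np.roll(np.roll(mask_a, mv_a[0], axis=0), mv_a[1], axis=1)
    return int(np.logical_and(shifted, mask_b).sum())

def build_temporal_graph(frames):
    g = nx.Graph()
    for f, segments in enumerate(frames):
        for sid, _, _ in segments:
            g.add_node((f, sid))                 # each segment is a node
    for f in range(len(frames) - 1):
        for sid_a, mask_a, mv_a in frames[f]:
            for sid_b, mask_b, _ in frames[f + 1]:
                w = link_strength(mask_a, mv_a, mask_b)
                if w > 0:                        # overlap => weighted edge
                    g.add_edge((f, sid_a), (f + 1, sid_b), weight=w)
    return g
```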
Once the temporal graph is constructed as discussed above, the graph may be partitioned as discussed below. The number of frames used to construct the temporal graph may vary from as few as two frames to hundreds of frames. The choice of the number of frames used preferably depends on the specific demands of the application.
In a preferred embodiment, the partitioning may use a procedure that minimizes a connectivity metric. A connectivity metric of a graph may be defined as the sum of the weights of all edges in the graph. A number of methods are available for minimizing a connectivity metric on a graph for partitioning, such as the "min cut" method.
After partitioning the original temporal graph, the partitioning may be applied to each sub-graph of the temporal graph. The process may be repeated until each sub-graph meets some predefined minimal connectivity criterion or satisfies some other statically-defined criterion. When the criterion (or criteria) is met, then the process stops.
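A sketch of this recursive partitioning follows, using the Stoer-Wagner global minimum cut available in networkx as the "min cut" step; the connectivity threshold standing in for the predefined criterion is an assumed parameter.

```python
# Sketch of recursive temporal-graph partitioning via Stoer-Wagner min cut.
# The min_cut_threshold stands in for the predefined connectivity criterion.
import networkx as nx

def partition_objects(g, min_cut_threshold=5.0):
    """Recursively split a weighted graph until each piece is coherent."""
    if g.number_of_nodes() < 2 or g.number_of_edges() == 0:
        return [set(g.nodes)]
    if not nx.is_connected(g):               # split disconnected pieces first
        return [p for c in nx.connected_components(g)
                for p in partition_objects(g.subgraph(c).copy(),
                                           min_cut_threshold)]
    cut_value, (side_a, side_b) = nx.stoer_wagner(g, weight="weight")
    if cut_value >= min_cut_threshold:
        return [set(g.nodes)]                 # connected enough: one object
    return (partition_objects(g.subgraph(side_a).copy(), min_cut_threshold) +
            partition_objects(g.subgraph(side_b).copy(), min_cut_threshold))
```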
In the illustrative procedure depicted in the accompanying figure, the full temporal graph is initially designated as a single partition, and an optimum cut may be performed on each designated partition per block 804.
Per block 806, a determination may be made as to whether any of the sub-partitions (sub-graphs) have multiple objects and so require further partitioning. In other words, a determination may be made as to whether the sub-partitions do not yet meet the statically-defined criterion. If further partitioning is required (statically-defined criterion not yet met), then each such sub-partition is designated as a partition per block 810, and the process loops back to block 804 so as to perform optimum cuts on these partitions. If further partitioning is not required (statically-defined criterion met), then a partition designated object has been created per block 808.
At the conclusion of this method, each sub-graph results in a collection of segments on each frame corresponding to a coherently moving object. Such a collection of segments forms, on each frame, outlines of coherently moving objects that may be advantageously utilized to create hyperlinks or to perform further operations with the defined objects, such as recognition and/or classification. Due to this novel technique, each object as defined will be well separated from the background and from the other objects around it, even if the objects overlap substantially and the scene contains many moving objects.
As shown in block 906, two candidate nodes may then be swapped. Thereafter, the energy is re-computed per block 908. Per block 910, a determination may then be made as to whether the energy increased (or decreased) as a result of the swap.
If the energy decreased as a result of the swap, then the swap improved the partitioning, so the new sub-partitions are accepted per block 912. Thereafter, the method may loop back to block 904.
On the other hand, if the energy increased as a result of the swap, then the swap did not improve the partitioning, so the candidate nodes are swapped back (i.e. the swap is reversed) per block 914. Then, per block 916, a determination may be made as to whether there is another pair of candidate nodes. If there is another pair of candidate nodes, then the method may loop back to block 906 where these two nodes are swapped. If there is no other pair of candidate nodes, then this method may end with the optimum or near optimum cut having been determined.
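The swap procedure of blocks 906 through 916 may be sketched as a greedy, Kernighan-Lin-style refinement, where the "energy" is assumed to be the total weight of edges crossing the cut, consistent with the connectivity metric defined above.

```python
# Greedy, Kernighan-Lin-style sketch of the swap refinement of blocks 906
# through 916. "Energy" is assumed to be the total weight of cut edges.
def cut_energy(g, side_a):
    """Total weight of edges with exactly one endpoint in side_a."""
    return sum(d.get("weight", 1.0)
               for u, v, d in g.edges(data=True)
               if (u in side_a) != (v in side_a))

def refine_cut(g, side_a, side_b):
    side_a, side_b = set(side_a), set(side_b)
    energy = cut_energy(g, side_a)
    improved = True
    while improved:
        improved = False
        for a in list(side_a):
            for b in list(side_b):
                # Trial swap of a candidate pair (block 906).
                side_a.remove(a); side_a.add(b)
                side_b.remove(b); side_b.add(a)
                new_energy = cut_energy(g, side_a)        # block 908
                if new_energy < energy:                   # blocks 910, 912
                    energy, improved = new_energy, True
                    break          # keep the swap; rescan the new partition
                side_a.remove(b); side_a.add(a)           # block 914: reverse
                side_b.remove(a); side_b.add(b)
            if improved:
                break
    return side_a, side_b, energy
```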
In block 1002, selection is made of a partition designated as an object. Then, for each frame, segments associated with nodes of the partition are collected per block 1004. Per block 1006, pixels from all of the collected segments are then assigned to the object. Per block 1008, this is performed for each frame until there are no more frames.
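A short sketch of this mask assembly follows, reusing the hypothetical (frame_index, segment_id) node layout and mask representation of the earlier sketches.

```python
# Sketch of the object-mask assembly of blocks 1002 through 1008: for each
# frame, the pixels of the partition's segments are assigned to the object.
import numpy as np

def object_masks(partition_nodes, segment_masks, n_frames, frame_shape):
    """segment_masks: dict mapping (frame, segment_id) -> boolean mask."""
    masks = [np.zeros(frame_shape, dtype=bool) for _ in range(n_frames)]
    for frame, seg_id in partition_nodes:
        masks[frame] |= segment_masks[(frame, seg_id)]   # assign pixels
    return masks          # one object pixel mask per frame
```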
The methods disclosed herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. In addition, the methods disclosed herein are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
The apparatus to perform the methods disclosed herein may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories, random access memories, EPROMs, EEPROMs, magnetic or optical cards, or any other type of media suitable for storing electronic instructions, each coupled to a computer system bus or other data communications system.
In the above description, numerous specific details are given to provide a thorough understanding of embodiments of the invention. However, the above description of illustrated embodiments of the invention is not intended to be exhaustive or to limit the invention to the precise forms disclosed. One skilled in the relevant art will recognize that the invention can be practiced without one or more of the specific details, or with other methods, components, etc. In other instances, well-known structures or operations are not shown or described in detail to avoid obscuring aspects of the invention. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications can be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification and the claims. Rather, the scope of the invention is to be determined by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.
The present application claims the benefit of U.S. Provisional Patent Application No. 60/864,284, entitled “Apparatus and Method For Robust Object Recognition and Classification Using Multiple Temporal Views”, filed Nov. 3, 2006, by inventors Edward Ratner and Schuyler A. Cullen, the disclosure of which is hereby incorporated by reference.