The present invention relates to image and video analysis, and in particular to determining the similarity between sequences of images or video and to detecting periodic motion in sequences of images or video.
The present invention consists of a computational method for identifying similar digital image sequences such as those comprising all or part of a video. The current invention can be used, for instance, to identify repeating portions of an image sequence that shows a scene undergoing partial or full periodic motion. This includes automatically identifying the video frame at which a person or object makes one complete 360-degree revolution as they rotate in front of a camera at either a fixed or variable speed of rotation.
A number of prior methods attempt to detect cyclic motion in the case of a non-stationary (moving) observer. This relaxes the assumption that the repetitive motion produces a repeating sequence of images. This includes the method proposed by Allmen and Dyer, Cyclic Motion Detection Using Spatiotemporal Surfaces and Curves (International Conference on Pattern Recognition 1990), as well as the method of Seitz and Dyer, View-Invariant Analysis of Cyclic Motion (International Journal of Computer Vision 1997). Common to both of these methods is that they must track the 2D image locations of 3D features on the moving object. In contrast, our method assumes a stationary observer and thus can rely on the fact that the motion will produce a repeating sequence of images. This simplifying assumption avoids the difficult and error-prone step of isolating and tracking 3D features.
Xu and Aliaga, Efficient Multi-viewpoint Acquisition of 3D Objects Undergoing Repetitive Motions (ACM Symposium on Interactive 3D Graphics 2007), introduced a method for estimating the 3D surface geometry of an object from a pair of image sequences recorded while the scene undergoes “repetitive” motion (their definition of “repetitive” is included in the definition of “semi-periodic motion” used in this document). A cornerstone of their technique is locating loop points in the captured sequences; however, this process relies on compensating for motion of the camera with respect to the scene (i.e., tracking features like the methods described in the preceding paragraph), and it only considers single-frame pairwise comparisons. The current invention is an improvement that compares a longer subsequence of frames and increases the reliability of determining the periodic motion in the input.
Schödl et al., Video Textures (Proc. SIGGRAPH 2000), provide a way of extending a finite video of a repetitive motion (e.g., flickering flame, running water, etc.) to an infinite sequence by replaying the frames out of their original order. The basic idea is to identify pairs of frames that give the appearance of a smooth transition and to choose among these alternative paths according to some schedule of probabilities. Although this method considers the pairwise distance between subsequences of video frames, it does not attempt to reduce the computational expense of this operation by focusing only on a subset of image pixels. The current invention is an improvement that increases efficiency and robustness by sub-sampling the original image sequence.
The present disclosure provides a novel framework for determining the similarity of two image sequences and the application of this framework to identifying the temporal location or locations of periodic motion in a longer image sequence or video.
A key component of the present invention is establishing a robust and discriminating distance function that assigns a value to a pair of image sequences based on the likelihood that the two sequences show the same scene. The two input image sequences are assumed to be of the same length; alternatively, the sequences can be scaled in time and re-sampled to ensure a 1-to-1 mapping between images in the two sequences.
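By way of illustration, the temporal re-sampling mentioned above can be sketched as follows (Python with NumPy; the function name and the nearest-neighbor sampling strategy are illustrative assumptions, not a required embodiment):

```python
import numpy as np

def resample_indices(n_frames, target_len):
    """Frame indices that temporally rescale a sequence of n_frames
    frames to target_len frames via nearest-neighbor sampling, so that
    two sequences of different lengths obtain a 1-to-1 frame mapping."""
    return np.linspace(0, n_frames - 1, target_len).round().astype(int)

# A sequence of 10 frames re-sampled to align with one of 5 frames:
idx = resample_indices(10, 5)
```

The re-sampled sequence is then simply `[frames[i] for i in idx]`, after which the two sequences can be compared frame-for-frame.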
In broad terms, a degree of similarity between two image sequences can be determined by computing a set of statistics for each image sequence (e.g., the mean pixel intensity in each frame), organizing these statistics into a list called a feature vector for each sequence using a consistent and predetermined process, and comparing the distances between these lists using a standard vector-valued distance function (e.g., Euclidean norm) to determine the measure of similarity.
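The broad comparison scheme above can be sketched as follows (the invention covers any consistent set of per-frame statistics; the mean pixel intensity and Euclidean norm used here are only one example, and the function names are illustrative):

```python
import numpy as np

def feature_vector(frames):
    """Build a feature vector for an image sequence by computing one
    statistic per frame (here, the mean pixel intensity) and collecting
    the statistics into a list in a consistent, predetermined order."""
    return np.array([frame.mean() for frame in frames])

def sequence_distance(seq_a, seq_b):
    """Standard vector-valued distance (Euclidean norm) between the
    feature vectors of two equal-length image sequences; a smaller
    value indicates a greater degree of similarity."""
    return float(np.linalg.norm(feature_vector(seq_a) - feature_vector(seq_b)))
```

For example, two identical sequences yield a distance of zero, while sequences of differing brightness yield a positive distance.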
For a more complete understanding of the invention, reference is made to the following description and accompanying drawings, in which:
An illustrative embodiment of the disclosed invention is shown in
The current invention includes methods that use any linear or non-linear combination of the pixel values in the frames composing each sequence to create the representative vectors [4] described above, but here we discuss a particular method for computing the feature vectors, favored for its efficiency and robustness.
Given two or more image sequences, the first step is to compute a representative vector from each sequence as depicted in
In the preferred embodiment, each image sequence is first denoised using a standard approach, such as convolving the color channels with a small Gaussian kernel, and the resulting pixels are then serialized directly into a representative vector. We note that denoising significantly increases robustness by reducing the effect of camera noise and of small transient image features irrelevant to the broader image sequence similarity. The distance between the resulting vectors is computed using the normalized cross correlation (NCC) function: a value close to one indicates a high degree of positive correlation, from which one would conclude that the two sequences are similar, while a value close to zero or negative one indicates that the two image sequences are dissimilar.
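A minimal sketch of this preferred embodiment follows (pure NumPy; the small separable 3-tap kernel stands in for the Gaussian denoising step, and all function names are illustrative):

```python
import numpy as np

def denoise(frame):
    """Approximate a small Gaussian blur with a separable 3-tap kernel,
    applied first along rows and then along columns."""
    k = np.array([0.25, 0.5, 0.25])
    smooth_rows = np.apply_along_axis(
        lambda r: np.convolve(r, k, mode="same"), 1, frame.astype(float))
    return np.apply_along_axis(
        lambda c: np.convolve(c, k, mode="same"), 0, smooth_rows)

def representative_vector(frames):
    """Denoise each frame, then serialize all pixels of the sequence
    directly into a single one-dimensional representative vector."""
    return np.concatenate([denoise(f).ravel() for f in frames])

def ncc(u, v):
    """Normalized cross correlation of two vectors, in [-1, 1]:
    near one means similar sequences; near zero or -1, dissimilar."""
    u = u - u.mean()
    v = v - v.mean()
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

A sequence compared against itself yields an NCC of one; against its photographic negative, negative one.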
A typical 30-second 1,920×1,080 video at 30 frames per second contains over 1.8 billion individual pixels, and performing computations directly on every pixel would be computationally prohibitive. Instead, in the preferred embodiment we compute the representative vector based on only a subset of the pixels in the input image sequences. Selection of the pixel subset is another contribution of the present invention.
One approach is to use a fixed pattern of pixel locations as shown in
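One such fixed pattern, a regular grid of pixel locations, can be sketched as follows (the step size and function name are illustrative assumptions; any fixed pattern of locations falls within the approach described above):

```python
import numpy as np

def subsample_vector(frames, step=8):
    """Build a representative vector from a fixed pattern of pixel
    locations: every `step`-th row and column of each frame, rather
    than all pixels, greatly reducing the computation required."""
    return np.concatenate([f[::step, ::step].ravel() for f in frames])
```

With `step=8`, the vector length shrinks by a factor of 64 relative to serializing every pixel.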
One use of the present invention also claimed in this application is to extend the prior invention described by U.S. Provisional Patent Application No. 61/609,313. This embodiment is illustrated in
The process involves the following steps:
Note that the period computed by the preceding method can be converted into seconds if the frame rate, measured in frames per second, of the video is known.
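This conversion is simply the period in frames divided by the frame rate; as a hypothetical example (the function name and the figures of 45 frames at 30 frames per second are illustrative only):

```python
def period_in_seconds(period_frames, fps):
    """Convert a period measured in frames to seconds,
    given the video's frame rate in frames per second."""
    return period_frames / fps

# A rotation period of 45 frames in a 30 fps video lasts 1.5 seconds.
period = period_in_seconds(45, 30)
```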
This application claims the benefit of U.S. Provisional Application Ser. No. 61/664,325, “Method for Computing the Similarity of Two Image Sequences,” filed in June 2012.
This invention was made with government support under SBIR IIP-1142829 awarded by the National Science Foundation. The government has certain rights in the invention.