The present invention relates generally to vision systems, and more particularly to a method and system for matching a video clip to one of a plurality of stored videos using a “fingerprint” derived from the video clip.
There are many applications where it would be desirable to identify a full length video that is resident in a large database or distributed over a network such as the Internet using only a short video clip. One such application involves the identification and removal of innumerable illegal copies of copyrighted video content that reside in popular video-sharing websites and peer-to-peer (P2P) networks on the Internet. It would be desirable to have a robust content-identification system that detects and removes copyright infringing, perceptually identical video content from the databases of such websites and prevents any future uploads made by users of these websites.
A computer vision technique that meets the goals of such an application is called video fingerprinting. Video fingerprinting offers a solution to query and identify short video segments from a large multimedia repository using a set of discriminative features. What is meant by the term fingerprint is a signature having the following properties:
Practical video fingerprinting techniques need to meet accuracy and speed requirements. With regard to accuracy, it is desirable for a querying video clip to be able to identify content in the presence of common distortions. Such distortions include blurring, resizing, changes in source frame rates and bit rates, changes in video formats, resolution, illumination settings, color schemes, letterboxing, and frame cropping. With regard to speed, a video fingerprinting technique should determine a content-match with a small turn-around time, which is crucial for real-time applications. A common denominator of many fingerprinting techniques is their ability to capture and represent perceptually relevant multimedia content in the form of short robust hashes for fast retrieval.
In some existing content-based techniques known in the prior art, video signatures are computed employing features such as mean-luminance, centroid of gradient, rank-ordered image intensity distribution, and centroid of gradient orientations, over fixed-sized partitions of video frames. The limitation of employing such features is that they encode complete frame information and therefore fail to identify videos when presented with queries having partially cropped or scaled data. This motivates the use of a local fingerprinting approach.
In Sivic, J., and Zisserman, A., “Video google: A text retrieval approach to object matching in videos,” ICCV 2, 1-8 (2003) (hereinafter “Sivic and Zisserman”), a text-retrieval approach for object recognition is described using two-dimensional maximally stable extremal regions (MSERs), first proposed in Matas, J., Chum, O., Martin, U., Pajdla, T., “Robust wide baseline stereo from maximally stable extremal regions,” BMVC 1, 384-393 (2002), as representations of each video frame. In summary, MSERs are image regions which are covariant to affine transformations of image intensities.
Since the method of Sivic and Zisserman clusters semantically similar content together in its visual vocabulary, it is expected to offer poor discrimination, for example, between different seasons of the same TV program having similar scene settings, camera capture positions and actors. A video fingerprinting system is expected to provide good discrimination between such videos.
In an approach similar to that of Sivic and Zisserman, Nister, D., and Stewenius, H., “Scalable recognition with a vocabulary tree,” CVPR 2, 2161-2168 (2006) (hereinafter “Nister and Stewenius”), propose an object recognition algorithm that extracts and stores MSERs from a group of images of an object captured under different viewpoint, orientation, scale, and lighting conditions. During retrieval, a database image is scored depending on the number of MSER correspondences it shares with the given query image. Only the top scoring hits are then scanned further. Hence, fewer MSER correspondences reduce the likelihood that a database hit figures among the top-ranked images.
Since a fingerprinting system needs to identify videos even when queried with short distorted clips, the methods of both Sivic and Zisserman and Nister and Stewenius are unsuitable, because strong degradations such as blurring, cropping, and frame letterboxing result in fewer suitable MSERs being found in a distorted image than in its original. Such degradations directly affect the algorithm's performance because they change the representation of a frame.
In Massoudi, A., Lefebvre, F., Demarty, C.-H., Oisel, L., and Chupeau, B., “A video fingerprint based on visual digest and local fingerprints,” ICIP, 2297-2300 (2006) (hereinafter “Massoudi et al.”), Massoudi et al. propose an algorithm that first slices a query video into shots, extracts key-frames, and then performs local fingerprinting. A major drawback of this approach is that even the most common forms of video processing, such as blurring and scaling, disturb the key-frame and introduce misalignment between the query and database frames.
Accordingly, what would be desirable, but has not yet been provided, is a method and system for effectively and automatically matching a video clip to one of a plurality of stored videos using a fingerprint technique derived from the video clip that is fast and immune to common distortions.
The above-described problems are addressed and a technical solution is achieved in the art by providing a computer implemented method for deriving a fingerprint from video data, comprising the steps of receiving a plurality of frames from the video data; selecting at least one key frame from the plurality of frames, the at least one key frame being selected from two consecutive frames of the plurality of frames that exhibit a maximal cumulative difference in at least one spatial feature of the two consecutive frames; detecting at least one 3D spatio-temporal feature within the at least one key frame; and encoding a spatio-temporal fingerprint based on mean luminance of the at least one 3D spatio-temporal feature. The at least one spatial feature can be intensity. The at least one 3D spatio-temporal feature can be at least one Maximally Stable Volume (MSV). The at least one MSV is based on two dimensions of length and width of the key frame and the third dimension is resolution or time. The MSV is a volume that exhibits about a zero change in intensity for an incremental change in volume. The encoding step further comprises projecting the at least one MSV onto a circle whose center is selected as a reference center of the key frame.
The method can further comprise the step of storing at least the spatio-temporal fingerprint in a lookup table (LUT). The LUT associates with the at least the spatio-temporal fingerprint at least one MSV represented as an ellipse to achieve an affine invariant representation.
Also disclosed is a method for matching video data to a database containing a plurality of video fingerprints of the type described above, comprising the steps of calculating at least one fingerprint representing at least one query frame from the video data; indexing into the database using the at least one calculated fingerprint to find a set of candidate fingerprints; applying a score to each of the candidate fingerprints; selecting a subset of candidate fingerprints as proposed frames by rank ordering the candidate fingerprints; and attempting to match at least one fingerprint of at least one proposed frame based on a comparison of gradient-based descriptors associated with the at least one query frame and the at least one proposed frame. The at least one fingerprint representing at least one query frame and the plurality of video fingerprints are based on at least one Maximally Stable Volume (MSV) determined from at least one of the at least one query frame and the proposed frames and the mean luminance of the at least one MSV. The score is inversely proportional to the number of frames in the database having a matching fingerprint and directly proportional to the area of a frame represented by the fingerprint.
The step of merging the candidate fingerprints into a plurality of bins further comprises the step of placing candidate fingerprints into divisions of volumes of a 3D space constructed from the length and width of an area covered by the proposed frames, the third dimension of the 3D space being the frame number in a sequence of the proposed frames.
The step of selecting a subset of candidate fingerprints further comprises the steps of inverse transforming the transformed three points to the frame of reference of the proposed frame for each of the matching candidate fingerprints; computing the average inverse transformation of the bins that have the highest N accumulated scores; and rotating and translating a predetermined number of query frames (siftnum) to produce a series of frames that are aligned to the top ranked proposed frames that polled to the bins that have the highest N accumulated scores. The step of attempting to match at least one fingerprint further comprises the steps of calculating the Bhattacharyya distance between gradient-based descriptors of the aligned query frames and the top ranked proposed frames; and declaring a match to a proposed frame p if the Bhattacharyya distance is less than an empirically chosen predetermined threshold T, otherwise, declaring that no match is found.
The video associated with a matched proposed frame can be retrieved from one of the database containing a plurality of video fingerprints and a remote database. The remote database can be distributed over the Internet.
The gradient-based descriptors can be based on a scale invariant feature transformation (SIFT).
The present invention will be more readily understood from the detailed description of exemplary embodiments presented below considered in conjunction with the attached drawings, of which:
It is to be understood that the attached drawings are for purposes of illustrating the concepts of the invention and may not be to scale.
Referring now to
The system 10 of the present invention can combat video piracy by recognizing illegal copies of videos even when distorted in different ways, as enumerated above. Such video identification enables a user to recognize the presence of a pirated copy on the Internet 20 and consequently take steps to remove the illegal copy. In operation, the web-crawler 18 maintains a day-to-day list of videos published on the Internet 20. The query video sent by the video fingerprinting system 14 to the video fingerprinting database 16 is matched amongst the videos published on the Internet 20 for their possible candidature as illegal copies of the query. If a match is found, immediate steps to remove the clip from its location may be taken by the video fingerprinting system 14 as a consequence.
Referring now to
Referring now to
Referring now to
For preprocessing step 46, the transformations to the plurality of frames can include changing the source frame rate to a predefined resampling rate (e.g., 10 fps), followed by converting each video frame to grayscale and finally, resizing all frames to a fixed width (w) and height (h) (e.g., w=160 and h=120 pixels). The retrieval speed of the algorithm for large videos is most sensitive to the choice of w and h.
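The three transformations of preprocessing step 46 can be sketched as follows. This is only an illustration: it assumes frames are nested Python lists of (r, g, b) tuples, and the function names, the ITU-R BT.601 luminance weights, and the nearest-neighbor sampling are assumptions, since the description does not state how grayscale conversion or resizing is carried out.

```python
def resample_frames(frames, src_fps, dst_fps=10.0):
    """Pick the source frame at or just before each output time step."""
    if not frames:
        return []
    n_out = max(1, int(len(frames) / src_fps * dst_fps))
    last = len(frames) - 1
    return [frames[min(last, int(k * src_fps / dst_fps))] for k in range(n_out)]


def to_gray(frame):
    """Convert rows of (r, g, b) tuples to luminance (BT.601 weights, assumed)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]


def resize_nearest(gray, w=160, h=120):
    """Resize a grayscale frame to w x h by nearest-neighbor sampling."""
    src_h, src_w = len(gray), len(gray[0])
    return [[gray[y * src_h // h][x * src_w // w] for x in range(w)]
            for y in range(h)]


def preprocess(frames, src_fps, dst_fps=10.0, w=160, h=120):
    """Resample the frame rate, convert to grayscale, then resize."""
    return [resize_nearest(to_gray(f), w, h)
            for f in resample_frames(frames, src_fps, dst_fps)]
```

Smaller values of w and h reduce the work done by every later stage, which is consistent with the sensitivity noted above.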
Since videos contain a large amount of data, the present method minimizes storage requirements by selecting only a predetermined minimum number of key frames in step 48. Most frames contain redundant information, since only a few portions of each frame change between consecutive frames. Therefore, the selection of key frames from the plurality of frames can be based on detecting large changes in motion. A key frame can be selected from two consecutive frames of a plurality of frames that exhibit a maximal cumulative difference in at least one spatial feature of the two consecutive frames. The at least one spatial feature can be intensity. The frame sequence is examined to extract key-points that correspond to local peaks of maximum change in intensity. Since maximum change in intensity can reduce the stability of the regions detected in key-frames, a few neighboring frames on either side of the key-frames are also stored to maintain minimal database redundancy.
For further storage efficiency, instead of storing entire key frames, only small portions of the key frames are stored. The small portions are those that would be the most stable, i.e., those portions that would change the least when subjected to the aforementioned distortions. The selection of regions in key frames that are least sensitive to distortions is based on the concept of Maximally Stable Volumes (MSVs), proposed in Donoser, M., Bischof, H., “3D segmentation by maximally stable volumes (MSVs),” ICPR, 63-66 (2006) for 3D segmentation. In any given frame, a region is represented in the two dimensions of length and width, which can be extended to a third dimension of resolution or time based on building Gaussian pyramids. Thus, each video frame is represented as a set of distinguished regions that are maximally stable to intensity perturbation over different scales.
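Key frame selection by local peaks of cumulative intensity change might look like the sketch below. It assumes grayscale frames as nested lists of numbers; the function names, the choice of the later frame of each peak pair as the key frame, and the `neighbors` parameter are hypothetical and are not specified in the description.

```python
def frame_difference(a, b):
    """Cumulative absolute intensity difference between two grayscale frames."""
    return sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def select_key_frames(frames, neighbors=1):
    """Return sorted indices of key frames plus `neighbors` frames on each side."""
    diffs = [frame_difference(frames[k], frames[k + 1])
             for k in range(len(frames) - 1)]
    keep = set()
    for k in range(1, len(diffs) - 1):
        # A local peak: larger than the previous difference, not smaller
        # than the next one, and non-zero.
        if diffs[k] > diffs[k - 1] and diffs[k] >= diffs[k + 1] and diffs[k] > 0:
            key = k + 1  # assumption: keep the later frame of the pair
            lo = max(0, key - neighbors)
            hi = min(len(frames) - 1, key + neighbors)
            keep.update(range(lo, hi + 1))
    return sorted(keep)
```

Only the frames at the returned indices would then be passed on to region detection.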
The process of extracting MSVs is given formally as follows:
Image Sequence:
For a video frame F, determine a set of multi-resolution images $F'_1, F'_2, \ldots, F'_i, \ldots, F'_s$, where $F'_i$ is the video frame F downsampled by a factor of $2^{i-1}$ and subsequently upsampled to the same size as that of F.
Volume:
For all pixel intensity levels i, connected volumes $V_j^i$ are defined as the jth volume such that all 3D points belonging to it have intensities less than (or greater than) i, $\forall (x,y,z) \in V_j^i$ iff $F'_z(x,y) \le i$ (or $F'_z(x,y) \ge i$). Thus, a 3D point (x,y,z) in this space corresponds to pixel (x,y) of frame F at resolution z or, equivalently, $F'_z(x,y)$.
Connectivity:
Volume $V_j^i$ is said to be contiguous if, for all points $p, q \in V_j^i$, there exists a sequence $p, a_1, a_2, \ldots, a_n, q$ and $pAa_1, a_1Aa_2, \ldots, a_iAa_{i+1}, \ldots, a_nAq$. Here A is an adjacency relation defined such that two pixels $p, q \in V_j^i$ are adjacent ($pAq$) iff $\sum_{i=1}^{3} |p_i - q_i| \le 1$.
Partial Relationship:
Any two volumes $V_k^i$ and $V_l^j$ are nested, i.e., $V_k^i \subseteq V_l^j$ if $i \le j$ (or $i \ge j$).
Maximally Stable Volumes:
Let $V_1, V_2, \ldots, V_{i-1}, V_i, \ldots$ be a sequence of a partially ordered set of volumes, such that $V_i \subseteq V_{i+1}$. Extremal volume $V_{i'}$ is said to be maximally stable (i.e., an MSV) iff $v(i) = |V_{i+\Delta} \setminus V_{i-\Delta}| / |V_i|$ has a local minimum at $i'$, i.e., for changes in intensity of magnitude less than $\Delta$, the corresponding change in region volume is zero.
Thus, each video frame is represented as a set of distinguished regions that are maximally stable to intensity perturbation over different scales. The reason for stability of MSVs over MSERs in most cases of image degradations is that additional volumetric information enables selection of regions with near-identical characteristics across different image resolutions. The more volatile regions (the ones which split or merge) are eliminated from consideration. Thus, detecting MSVs implies a reduction in the number of measurement regions per video frame, an important aspect for both decreased database storage as well as lower query retrieval times.
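The definitions above can be illustrated with a small sketch that builds the multi-resolution stack, grows the connected volume around a seed point for a given intensity level, and evaluates the stability measure v(i). It is an assumption-laden illustration, not the detector of the cited MSV work: box averaging stands in for Gaussian pyramid filtering, volumes are grown from a single hypothetical seed rather than enumerated exhaustively, and all function names are invented here.

```python
from collections import deque


def multires_stack(frame, levels=3):
    """Level z holds the frame box-averaged over 2**z blocks and expanded back
    to the original size (z = 0 is the frame itself)."""
    h, w = len(frame), len(frame[0])
    stack = []
    for z in range(levels):
        f = 2 ** z
        level = [[0.0] * w for _ in range(h)]
        for by in range(0, h, f):
            for bx in range(0, w, f):
                ys = range(by, min(by + f, h))
                xs = range(bx, min(bx + f, w))
                mean = sum(frame[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
                for y in ys:
                    for x in xs:
                        level[y][x] = mean
        stack.append(level)
    return stack


def volume_size(stack, seed, level):
    """Size of the connected volume containing seed = (z, y, x) whose points
    all have intensity <= level, using the 6-neighbor adjacency relation."""
    d, h, w = len(stack), len(stack[0]), len(stack[0][0])
    if stack[seed[0]][seed[1]][seed[2]] > level:
        return 0
    seen = {seed}
    queue = deque([seed])
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    while queue:
        z, y, x = queue.popleft()
        for dz, dy, dx in steps:
            n = (z + dz, y + dy, x + dx)
            if (0 <= n[0] < d and 0 <= n[1] < h and 0 <= n[2] < w
                    and n not in seen and stack[n[0]][n[1]][n[2]] <= level):
                seen.add(n)
                queue.append(n)
    return len(seen)


def most_stable_levels(sizes, delta=1):
    """Indices i where v(i) = (|V_{i+delta}| - |V_{i-delta}|) / |V_i| has a
    local minimum. For nested volumes the size of the set difference equals
    the difference of the sizes."""
    v = {}
    for i in range(delta, len(sizes) - delta):
        if sizes[i] > 0:
            v[i] = (sizes[i + delta] - sizes[i - delta]) / sizes[i]
    return [i for i in v
            if i - 1 in v and i + 1 in v and v[i] <= v[i - 1] and v[i] <= v[i + 1]]
```

Here `sizes[i]` would be obtained by calling `volume_size` for each intensity level i in increasing order, which yields the nested sequence required by the definition.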
Referring now to
For the purpose of unique video characterization, an appropriate “fingerprint” needs to capture or encode both the spatial properties of each frame as well as the amount of change in successive frames along the entire length of the video clip. There are two constraints that need to be kept in mind before choosing the appropriate features for the task of video identification:
The present invention fulfills these criteria by expressing each local measurement region of a frame associated with an MSV in the form of a spatio-temporal fingerprint. Referring to
$Q_i^p(r,c) = \left(L_i^{p+\mathrm{step}}(r,c+1) - L_i^{p+\mathrm{step}}(r,c)\right) - \alpha\left(L_i^p(r,c+1) - L_i^p(r,c)\right)$ (1)
Encoding mean luminance makes the fingerprint of the present invention invariant to photometric distortions.
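A possible reading of Equation 1 is sketched below. Several points are assumptions rather than statements of the description: that L denotes the mean luminance of cell (r, c) of a grid laid over the region in frame p, that each bit is obtained by thresholding Q at zero, that a 4 by 9 grid is used to obtain 32 bits, and that α defaults to 0.95. The function names are hypothetical.

```python
def block_means(region, rows, cols):
    """Mean luminance of each cell of a rows x cols grid laid over `region`.
    The region must be at least rows x cols pixels so no cell is empty."""
    h, w = len(region), len(region[0])
    means = []
    for r in range(rows):
        y0, y1 = r * h // rows, (r + 1) * h // rows
        row = []
        for c in range(cols):
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cells = [region[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cells) / len(cells))
        means.append(row)
    return means


def fingerprint(region_p, region_p_step, rows=4, cols=9, alpha=0.95):
    """Pack the signs of Q(r, c) from Equation 1 into one integer,
    giving rows * (cols - 1) bits (32 with the assumed defaults)."""
    cur = block_means(region_p, rows, cols)
    nxt = block_means(region_p_step, rows, cols)
    bits = 0
    for r in range(rows):
        for c in range(cols - 1):
            q = (nxt[r][c + 1] - nxt[r][c]) - alpha * (cur[r][c + 1] - cur[r][c])
            bits = (bits << 1) | (1 if q >= 0 else 0)  # sign threshold is assumed
    return bits
```

Because only differences of mean luminance enter Q, adding a constant brightness offset to both frames leaves every bit unchanged, which illustrates the photometric invariance noted above.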
In a preferred embodiment, localized content of each video frame is stored inside a database look-up table (LUT) using preferably 32-bit fingerprint signatures, as computed in Equation 1. In our database implementation, the LUT consists of $2^{32}$ entries of all possible binary fingerprints. Each such LUT entry in turn stores pointers to all video clips with regions having the same fingerprint value. In order to save an affine invariant representation of the video frame which is independent of different query distortions, the geometric and shape information of ellipse $e_i$ corresponding to region $MSV_i$ is also stored, along with the fingerprint inside the database. Each ellipse corresponding to a fingerprint undergoes three transformations as depicted in
1) inverse rotation by α;
2) translation to a frame center; and
3) warping/scaling down to a unit circle.
These steps are effected by transforming coordinates of the original frame center, denoted by $(cx, cy)$, onto a new reference axis, denoted by $(\hat{X}, \hat{Y})$. The new axis has the property of projecting ellipse $e_i^p$ onto a circle 67, with the ellipse center being the origin of the new reference axis and ellipse major and minor axes aligned with $(\hat{X}, \hat{Y})$ respectively. The coordinates of the original frame center with respect to the new reference axis are denoted by $(\widehat{cx}_i^p, \widehat{cy}_i^p)$. Thus, during insertion, the coordinates of the image center $(cx, cy)$ of the frame Fp are transformed into coordinates $(\widehat{cx}_i^p, \widehat{cy}_i^p)$ in the reference frame of each of the ellipses associated with the maximally stable volumes of the frame Fp. The transformation between $(cx, cy)$ and $(\widehat{cx}_i^p, \widehat{cy}_i^p)$ is given by:
$\widehat{cx}_i^p = \dfrac{(cx - x_i^p)\cos(-\alpha_i^p) - (cy - y_i^p)\sin(-\alpha_i^p)}{lx_i^p \times s_i^p}$ (2)
$\widehat{cy}_i^p = \dfrac{(cx - x_i^p)\sin(-\alpha_i^p) - (cy - y_i^p)\cos(-\alpha_i^p)}{lx_i^p \times s_i^p}$ (3)
Referring again to
The representation of all the fingerprints together in the database is expressed as $\bigcup_p \left( \bigcup_i (B_i^p, \widehat{cx}_i^p, \widehat{cy}_i^p, e_i^p), f^p \right)$. Thus, each MSV entry inside the database includes fields for its binary fingerprint $B_i^p$, ellipse parameters $e_i^p$, the coordinates of the frame center with respect to the reference axis $(\hat{X}, \hat{Y})$, the coordinates of “transformed prefixed square corner points,” and the gradient-based descriptor of SQ given by $f^p$. In a database retrieval, for a query video frame Eq, the ellipses and fingerprints of the frame corresponding to the frame's MSVs are generated using Equation 1. Thus, the query frame can be expressed as $\bigcup_j \{B_j^q, e_j^q\}$. Each of the fingerprints of MSVs belonging to the query frame is used to probe the database for potential candidates. That is, the database is queried to get the candidate set given by $\bigcup_p \left( \bigcup_i (B_i^p, \widehat{cx}_i^p, \widehat{cy}_i^p, e_i^p), f^p \right)$. Every entry in the candidate set may be the correct database match. Hence, a hypothesis is proposed for the query frame Eq, such that the query frame Eq is the same as original frame Fp stored inside the database. This can happen when ellipses $e_j^q$ and $e_i^p$ denote similar regions in their respective frames. For every candidate hit produced from the database, potential matching frames in the database are those whose transformed image centers $(\widehat{cx}_i^p, \widehat{cy}_i^p)$ can be inverse transformed to coordinates which closely match the coordinates of the query frame's center. The inverse transformation from a transformed image center $(\widehat{cx}_i^p, \widehat{cy}_i^p)$ to the query frame's center is computed by using:
$\widehat{cx}_{i,j}^{p,q} = (\widehat{cx}_i^p \times s_j^q \times lx_j^{p,q})\cos(\alpha_j^q) - (\widehat{cy}_i^p \times s_j^q \times ly_j^q)\sin(\alpha_j^q)$ (4)
$\widehat{cy}_{i,j}^{p,q} = (\widehat{cx}_i^p \times s_j^q \times lx_j^{p,q})\sin(\alpha_j^q) - (\widehat{cy}_i^p \times s_j^q \times ly_j^q)\cos(\alpha_j^q)$ (5)
A score $sc_{i,j,p,q}$ is associated between the MSV of each candidate database frame, represented as an ellipse, and the query frame, defined as:
$sc_{i,j,p,q} = fac \times \dfrac{lx_i^p \times ly_i^p \times s_i^p \times s_i^p}{w \times h} + (1 - fac) \times \log\dfrac{N}{N_j^q}$ (6)
where N is the total number of entries present in the database and $N_j^q$ is the number of database hits generated for the query fingerprint $B_j^q$. The first term of Equation 6, $(lx_i^p \times ly_i^p \times s_i^p \times s_i^p) \div (w \times h)$, signifies that the greater the area represented by the fingerprint of the database image, the higher is the score. Thus, the scoring gives more weight to candidate MSVs of larger size, since these regions encode more information than smaller regions. The second term, $\log(N \div N_j^q)$, assigns higher scores to unique fingerprints $B_j^q$ that produce fewer database hits. Regions with fewer database hits are hence more discriminative. The factor $fac \in [0,1]$ is used for assigning the appropriate weight to each of the two terms in Equation 6. In the preferred embodiment, fac=0.5.
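The look-up and scoring of Equation 6 can be sketched with a dictionary standing in for the LUT. The entry layout, the function names, and the use of the natural logarithm are assumptions made for illustration; the description specifies only the terms of the score.

```python
import math
from collections import defaultdict


def insert(lut, fingerprint, video_id, frame_no, lx, ly, s):
    """Store one MSV entry under its binary fingerprint."""
    lut[fingerprint].append(
        {"video": video_id, "frame": frame_no, "lx": lx, "ly": ly, "s": s})


def score(entry, n_total, n_hits, w=160, h=120, fac=0.5):
    """Equation 6: weighted sum of an area term and a rarity term."""
    area_term = entry["lx"] * entry["ly"] * entry["s"] * entry["s"] / (w * h)
    rarity_term = math.log(n_total / n_hits)  # natural log is an assumption
    return fac * area_term + (1 - fac) * rarity_term


def query(lut, fingerprint, n_total, w=160, h=120, fac=0.5):
    """Return (score, entry) pairs for every database hit of a fingerprint."""
    hits = lut.get(fingerprint, [])
    return [(score(e, n_total, len(hits), w, h, fac), e) for e in hits]
```

A table would be created with `lut = defaultdict(list)`, and `n_total` would be the total number of stored entries, for example `sum(len(v) for v in lut.values())`.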
An important requirement of a video fingerprinting system is speed. In cases where a large number of candidate clips are produced as hits from the database, performing an exhaustive check on each one cannot be performed in real time. To meet a real time goal, adopting a strategy to rank-order the local database results in terms of their potential of leading to a correct hit is desirable. For this purpose, an additional stage for scoring each database frame “hit” is employed, followed by a poll to collate all local information and arrive at a final decision.
In an ideal situation, all MSVs within the matching candidate frame will have transformed frame centers $(\widehat{cx}_i^p, \widehat{cy}_i^p)$ and “transformed prefixed square corner points” that map back to the same frame center and “prefix square corner points,” which match the query frame center and “prefix square corner points,” respectively. In a more realistic scenario with frames subject to distortions, additional processing is necessary via binning. Consider a video as a 3D space with its third dimension given by its frame number. This space is divided into bins (in one preferred embodiment, of size 5×5×10), where each bin is described by a three tuple $b \equiv (b_1, b_2, b_3)$. Thus, the frames and frame information of database hits are merged when they (1) have their hypothetical image centers close to each other, and (2) belong to neighboring frames of the same video, considering that the movement of the region across them is appreciably small.
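The binning and polling described above might be sketched as follows. The tuple layout of a hit, the function name `poll`, and the inclusion of the video identifier in the bin key are assumptions; only the 5×5×10 bin size comes from the description.

```python
from collections import defaultdict


def poll(hits, bin_size=(5, 5, 10), top_n=1):
    """Accumulate scores of hits that fall into the same (x, y, frame) bin of
    the same video and return the top_n bins by accumulated score.

    hits: iterable of (video_id, cx, cy, frame_no, score), where (cx, cy) is
    the hypothesized frame center after inverse transformation."""
    acc = defaultdict(float)
    for video_id, cx, cy, frame_no, sc in hits:
        b = (int(cx // bin_size[0]),
             int(cy // bin_size[1]),
             int(frame_no // bin_size[2]))
        acc[(video_id, b)] += sc
    ranked = sorted(acc.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```

Hits that agree on the frame center and lie in nearby frames reinforce each other, so a bin supported by several moderately scored regions can outrank a single high-scoring outlier.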
Referring to
Once polling and scoring have completed, from the top n candidates obtained in step 78, the correct database hit is found using a gradient-based descriptor, e.g., a 128-dimension scale invariant feature transformation (SIFT)-based descriptor, in the verification process. Let esiftbq be the SIFT descriptor of the square $\hat{S}\hat{Q}$ (e.g., of size 100 by 100 pixels) centered at (cx, cy) in the query frame Ebq with its sides aligned to the (X,Y) axis of frame Ebq. The verification process, as shown in
is the Bhattacharyya coefficient.
In the equation, p and q are the 128-dimension SIFT descriptors that each describe the region of the square in query and database frames. Substituting esiftbq and siftp for p and q in Equation (9), we have
At step 82, if the Bhattacharyya distance is less than an empirically chosen predetermined threshold T (e.g., 0.2), a match to the database frame p is declared at step 86; otherwise, at step 86, no match is declared to be found.
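The threshold test can be illustrated with the sketch below. The exact distance used by the verification process is not reproduced in this text, so the form d = sqrt(1 − BC), with BC the Bhattacharyya coefficient of the sum-normalized descriptors, is an assumption, as are the function names. Descriptors are taken to be non-negative with a positive sum.

```python
import math


def bhattacharyya_coefficient(p, q):
    """Sum of sqrt(p_k * q_k) after normalizing each descriptor to sum to 1."""
    sp, sq = sum(p), sum(q)
    return sum(math.sqrt((a / sp) * (b / sq)) for a, b in zip(p, q))


def bhattacharyya_distance(p, q):
    """Assumed form of the distance: sqrt(1 - BC), clamped at zero."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))


def verify(query_desc, candidates, threshold=0.2):
    """Return the id of the closest candidate frame if its distance is below
    the threshold, otherwise None.

    candidates: list of (frame_id, descriptor) pairs."""
    best_id, best_d = None, None
    for frame_id, desc in candidates:
        d = bhattacharyya_distance(query_desc, desc)
        if best_d is None or d < best_d:
            best_id, best_d = frame_id, d
    if best_d is not None and best_d < threshold:
        return best_id
    return None
```

In practice the descriptors would be the 128-dimension SIFT descriptors of the aligned squares in the query and database frames.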
It is to be understood that the exemplary embodiments are merely illustrative of the invention and that many variations of the above-described embodiments may be devised by one skilled in the art without departing from the scope of the invention. It is therefore intended that all such variations be included within the scope of the following claims and their equivalents.
This is a divisional application which claims the benefit of pending U.S. non-provisional patent application Ser. No. 12/262,463 filed on Oct. 31, 2008 and provisional patent application No. 61/013,888 filed Aug. 20, 2008, the disclosures of which are incorporated herein by reference in their entirety.
Number | Date | Country
---|---|---
61090251 | Aug 2008 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 12262463 | Oct 2008 | US
Child | 13897032 | | US