This application claims foreign priority from UK Patent Application Serial No. 1200322.4, filed 10 Jan. 2012.
Devices with digital video recording capability are readily available and easily affordable. In fact, multimedia technologies have advanced to the point where video recording capabilities are commonly included in electronic devices such as digital cameras, cell phones and personal digital assistants. Alongside the popularity and ease of capturing video content, there are now many options for sharing and consuming captured content. For example, it is straightforward to distribute video content from a computer onto social media websites, and more and more people are uploading and sharing video content in this and similar ways.
This situation creates issues relating to data management, such as database optimization for example. It is inefficient to store multiple copies of the same video in a database as it creates needless infrastructure expenses and complicates search and retrieval algorithms. Another issue relates to copyright infringement. There are ways to copy commercial content and redistribute it over the Internet for example. This can result in loss of revenue for a business, and it is typically not feasible to manually sift through hours of videos to determine if an illegal copy has been made or distributed.
Detecting whether a video already exists in a database can allow more effective use of storage. In addition, automated video copy detection techniques can be used to detect copyright violations as well as to monitor usage. According to an example, there is provided a content-based video copy detection method, which generates signatures that capture spatial and temporal features of videos. Typically, a computed signature is compact, and can be created from individual video frames. For example, each video frame can be divided into multiple regions and discriminating or salient visual features for each region can be determined. In an example, a count of the visual features in each region can be used as a spatial signature. To determine a temporal signature, the counts in each region can be sorted along a time line and an ordinal value assigned to each count based on its temporal rank, for example. Temporal and spatial signatures can be combined, and the resultant signature can be compared against signatures of different videos.
According to an example, there is provided a computer-implemented method for detecting a copy of a reference video, comprising segmenting respective ones of multiple frames of the reference video into multiple regions, determining sets of image features appearing in respective ones of the regions, determining a measure for the relative number of image features for a given region across the multiple frames, generating a spatio-temporal signature for the reference video using the determined measures, and comparing the signature for the reference video against a spatio-temporal signature of a query video to determine a likelihood of a match. A set of static pixels for the multiple frames with an intensity variance below a predetermined threshold intensity value is determined and disregarded or otherwise excluded or ignored from further processing. Such static pixels can relate to objects which have been inserted into a video, such as text, borders and the like. In an example, segmenting includes specifying a number of horizontal and vertical partitions to define a number of regions for the multiple frames. A set of static pixels for the multiple frames can be determined with a colour variance below a predetermined threshold value, and disregarded as above.
In an example, determining sets of image features can include using a scale- and rotation-invariant interest point detection method to determine an interest point in a region of a frame having a plurality of pixels, the interest point having a location in the region and an orientation. A spatial element of the spatio-temporal signature can represent the number of interest points for regions of the multiple frames, and a temporal element can represent the number of interest points for regions of the multiple frames after the number is sorted in each region along a time line of the multiple frames. The spatio-temporal signature can include spatial information from segmenting the frames into regions and temporal information from ranking each region along a time line for the multiple frames. In an example, comparing the signature includes generating a spatio-temporal signature for the query video, and computing the distance between the spatio-temporal signature for the reference video and the spatio-temporal signature for the query video.
According to an example, there is provided apparatus for extracting a spatio-temporal signature from video data to detect a copy, comprising a segmentation module to segment respective ones of multiple frames of a reference video into multiple regions, a feature extraction engine to generate feature data representing sets of image features appearing in respective ones of the regions, a signature generation module to determine a measure for the relative number of image features for a given region across the multiple frames and to generate a spatio-temporal signature for the reference video using the determined measures, and a comparison engine to compare the signature for the reference video against a spatio-temporal signature of a query video to determine a likelihood of a match.
In an example, the comparison engine can generate a measure representing the similarity of the query video to the reference video. A reference video signature database to store the signature for the reference video can be provided. A static pixel identification engine to determine a set of static pixels for the multiple frames with an intensity or colour variance below a predetermined threshold value can be provided. In an example, the segmentation module can receive partition data representing a number of horizontal and vertical partitions to define a number of regions for the multiple frames. The feature extraction engine can determine sets of image features using a scale- and rotation-invariant interest point detection system configured to determine an interest point in a region of a frame having a plurality of pixels, the interest point having a location in the region and an orientation.
According to an example, there is provided a computer program embedded on a non-transitory tangible computer readable storage medium, the computer program including machine readable instructions that, when executed by a processor, implement a method for detecting a copy of a reference video, comprising segmenting respective ones of multiple frames of the reference video into multiple regions, determining sets of image features appearing in respective ones of the regions, determining a measure for the relative number of image features for a given region across the multiple frames, generating a spatio-temporal signature for the reference video using the determined measures, and comparing the signature for the reference video against a spatio-temporal signature of a query video to determine a likelihood of a match.
An embodiment of the invention will now be described, by way of example only, and with reference to the accompanying drawings, in which:
In an example, the spatial part of a signature for a video is generated by dividing multiple video frames into regions. Local features are detected and counted in each region. These feature counts in the regions of a frame represent the spatial part of the signature for that frame.
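The per-region counting described above can be sketched as follows. This is a minimal illustration only: the function name, the uniform grid and the assumption that interest points arrive as (x, y) pixel coordinates from some external detector are all hypothetical and are not taken from the described system.

```python
def count_features_per_region(points, width, height, cols, rows):
    """Count interest points falling in each cell of a rows x cols grid.

    points: iterable of (x, y) interest-point locations in pixel coordinates
            (assumed to come from an external interest point detector).
    Returns a flat list of rows * cols counts in row-major order.
    """
    counts = [0] * (rows * cols)
    for x, y in points:
        if not (0 <= x < width and 0 <= y < height):
            continue  # ignore points that lie outside the frame
        col = min(int(x * cols / width), cols - 1)
        row = min(int(y * rows / height), rows - 1)
        counts[row * cols + col] += 1
    return counts


# Hypothetical 4x4-pixel frame split into a 2x2 grid of regions.
example_points = [(0.5, 0.5), (3.0, 0.2), (3.5, 3.5), (2.1, 2.9)]
example_counts = count_features_per_region(example_points, 4, 4, 2, 2)
print(example_counts)
```

The list of counts returned for one frame corresponds to the spatial part of the signature for that frame.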
With reference to
The matrix S shows the spatial part of the signature. That is, feature counts from regions of the multiple frames are transferred into a matrix S. Rows represent the feature counts for a region across frames. Accordingly, a column represents feature counts for a frame. In order to add temporal information to the signature, the number of counts in each region is sorted along the time line of the frames, and assigned an ordinal value based on its temporal rank. In the example of
Therefore, according to an example, the final combined signature of a video clip (matrix λ) includes two parts: spatial information generated from partitioning frames into regions, and temporal information generated by ranking each region along the time line of the frames. In order to detect copies, signatures can be compared using any similarity metric. In an example, an L1 (1-norm) distance is used in order to determine the similarity between videos.
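By way of illustration, the ranking step can be sketched as below. The function names are hypothetical, and two details are assumptions rather than statements from the description: ranks are assigned in ascending order of count (1 for the smallest), and ties are broken by frame order.

```python
def temporal_ranks(counts):
    """Rank one region's per-frame feature counts along the time line.

    Assumption: rank 1 goes to the smallest count, and equal counts are
    ranked in frame order.
    """
    order = sorted(range(len(counts)), key=lambda k: (counts[k], k))
    ranks = [0] * len(counts)
    for rank, frame in enumerate(order, start=1):
        ranks[frame] = rank
    return ranks


def ranking_matrix(S):
    """Turn an L x M matrix of counts (rows = regions) into a matrix of ranks."""
    return [temporal_ranks(row) for row in S]


# Hypothetical counts for 2 regions over 4 frames.
S_example = [
    [27, 32, 25, 30],
    [5, 5, 9, 1],
]
lambda_example = ranking_matrix(S_example)
print(lambda_example)
```

Each row of the result is a permutation of 1 to M, so two signatures of equal length can be compared region by region with an L1 distance.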
In block 307 the percentage of static pixels in each region is calculated. The presence of static pixels within the cropped image can indicate the presence of an image, some text, a logo, or a background pattern superimposed onto the video. If a significant number of pixels within a region are masked then too much of the area may be occluded to obtain useful information. If this is the case, the region in question can be turned off or disregarded, and the process can proceed based on the remaining regions.
That is, according to an example, objects such as static borders, letter-box and pillar-box effects for example can be removed or otherwise disregarded. This can form a pre-processing step which is performed by determining how pixels change throughout the video clip under consideration. In an example, pixels with a variance 308 below a threshold value are likely to be edit effects which have been added to the video. These effects can include borders, logos and pattern insertions as well as letter-box and pillar-box effects from resizing. The variance can be calculated on each pixel as follows: a gray-scale value of pixel x in frame i is denoted xi. Two quantities, Mk and Qk are defined as follows:
For the nth frame, and once Qn is calculated the variance is
If all pixels in a row (or column) on the outside border of a frame have a variance below the predetermined threshold value, they are removed or otherwise disregarded from the image. The process is repeated until a row (or column) is encountered where at least one pixel shows variance above the threshold. The result is an image which is cropped of borders in which the pixels do not vary. The sub-image corresponding to the size of the cropped mask can be used for further processing. This will remove any pillar-box effects, letter-box effects, and borders from cropping or shifting. In an example, the step of determining static pixels can be performed before or after a frame is segmented.
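The pre-processing step above can be sketched as follows. The running recurrence used here is the standard one-pass mean and sum-of-squares update, which is assumed to correspond to the quantities Mk and Qk; dividing the final sum of squares by the number of frames (a population variance) is likewise an assumption. Function names and the demonstration frames are hypothetical.

```python
def pixel_variances(frames):
    """Per-pixel variance over a list of equally sized gray-scale frames.

    Uses a one-pass running update: for the k-th frame,
    mean += (x - mean) / k and q += (k - 1) * (x - mean_prev)**2 / k.
    Assumption: the variance is q / n (population variance).
    """
    h, w = len(frames[0]), len(frames[0][0])
    mean = [[0.0] * w for _ in range(h)]
    q = [[0.0] * w for _ in range(h)]
    for k, frame in enumerate(frames, start=1):
        for y in range(h):
            for x in range(w):
                delta = frame[y][x] - mean[y][x]
                mean[y][x] += delta / k
                q[y][x] += (k - 1) * delta * delta / k
    n = len(frames)
    return [[q[y][x] / n for x in range(w)] for y in range(h)]


def crop_static_border(var, threshold):
    """Peel off outer rows/columns whose pixels all have variance below threshold.

    Returns (top, bottom, left, right) with bottom and right exclusive.
    """
    top, bottom, left, right = 0, len(var), 0, len(var[0])
    changed = True
    while changed and top < bottom and left < right:
        changed = False
        if all(v < threshold for v in var[top][left:right]):
            top += 1
            changed = True
        elif all(v < threshold for v in var[bottom - 1][left:right]):
            bottom -= 1
            changed = True
        elif all(var[y][left] < threshold for y in range(top, bottom)):
            left += 1
            changed = True
        elif all(var[y][right - 1] < threshold for y in range(top, bottom)):
            right -= 1
            changed = True
    return top, bottom, left, right


def make_frame(inner):
    """Hypothetical 4x4 frame: constant border of 0, inner 2x2 block set to `inner`."""
    return [[inner if 1 <= y <= 2 and 1 <= x <= 2 else 0 for x in range(4)]
            for y in range(4)]


demo_frames = [make_frame(10), make_frame(20), make_frame(30)]
variances = pixel_variances(demo_frames)
print(crop_static_border(variances, 1.0))
```

In this sketch the one-pixel static border is removed and only the varying inner block remains for further processing.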
In block 309 the number of interest points in each region is determined. That is, the features for the frame are extracted, and a count for a region is incremented if an interest point resides in the region. As described above with reference to
More formally, for a video consisting of M frames and L regions, each region $t_i$ results in an M-dimensional vector $s_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,M})$, where $f_{i,k}$ is the number of features counted in region i of frame k. The matrix $S = (s_1, s_2, \ldots, s_L)$ is used to produce the ranking matrix $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_L)$. Each $\lambda_i = (r_i^1, r_i^2, \ldots, r_i^M)$, where $r_i^k$ is the rank of the ith region of frame k.
For a video with M frames and L regions, the signature for the video consists of an L×M matrix. In order to calculate the distance between two signatures for a reference video and a query video, the L1 distance between them is used in an example. If the number of frames in the reference video is N and the number of frames in the query video is M, where N ≥ M, each video is divided into L regions. A sliding window approach can then be used, in which the distance between the query video and the first M frames of the reference video is calculated. The window of M frames is then slid over one frame and the distance between the query video and M frames in the reference video starting at the second frame is determined. The minimum distance and the frame offset, p, for which this occurred are recorded as sliding proceeds. At the end of the reference video, the best match occurs at the minimum distance.
If $\lambda_i$ is the ranking vector of the ith region, the distance between a query video $V_q$ and a reference video $V_r$ is calculated in an example as:
$$D(V_q, V_r) = \min_{p} D(V_q, V_r^{p})$$
where p is the frame offset in the reference video which achieved this minimum and represents the location of the best match between the query video and the reference video, and $D(V_q, V_r^{p})$ is given by:
Here, C(M) is a normalizing factor which is a function of the size of the query. It represents the maximum possible distance between the reference video and the query video. This maximum distance occurs when the ranking of the reference video is exactly opposite to that of the query. There are two cases based on whether M is even or odd. In the case when M is even, C(M) is twice the sum of the first $M/2$ odd integers. Similarly, when M is odd, C(M) is twice the sum of the first $(M-1)/2$ even integers. Each of these sequences can be computed directly as follows:
$$C(M) = \begin{cases} M^2/2, & M \text{ even} \\ (M^2-1)/2, & M \text{ odd} \end{cases}$$
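Since the maximum distance arises when the two rankings are exactly reversed, the normalizing factor can be checked numerically. In the sketch below, the closed form floor(M²/2) is derived from that description (the sum of the first n odd integers is n², and of the first n even integers is n(n+1)); the function names are hypothetical.

```python
def normalizing_factor(m):
    """Closed form for C(M): M*M/2 for even M, (M*M - 1)/2 for odd M."""
    return (m * m) // 2


def reversed_rank_distance(m):
    """L1 distance between the ranking 1..M and its exact reverse."""
    ranks = list(range(1, m + 1))
    return sum(abs(a - b) for a, b in zip(ranks, reversed(ranks)))


# The closed form agrees with the brute-force maximum for small query sizes.
for m in range(1, 11):
    assert normalizing_factor(m) == reversed_rank_distance(m)
print([normalizing_factor(m) for m in range(1, 7)])
```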
In block 315 a query video 314 whose signature 316 has been determined as described above can be processed using comparison engine 111 in order to determine if it is a copy of the reference video. If the minimum distance between the signature 316 of the query video 314 and that of the reference video at offset p is below a threshold, then it is likely that the query video 314 is a copy of the reference video. In this case, an output such as similarity measure 113 can be provided indicating that a copy has been located starting in frame p of the reference video.
Accordingly, a method of an example proceeds by determining the distance between the set of frames from a query video and the first M frames in a reference video. It then shifts the comparison window and finds the distance between the query set and the reference set starting at the second frame of the reference set. This continues for a total of N−M+1 calculations. For every calculation, the M frames in the query are compared with M frames of the reference video, for a total of (N−M+1)M comparisons.
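The sliding-window comparison can be sketched as below, under stated assumptions that are not confirmed by the description: the reference counts are re-ranked inside each window so that ranks run from 1 to M, the per-region L1 distances are normalized by C(M) = floor(M²/2), and the region distances are averaged over the L regions. All function names are hypothetical.

```python
def temporal_ranks(counts):
    """Rank per-frame counts in ascending order (ties broken by frame order)."""
    order = sorted(range(len(counts)), key=lambda k: (counts[k], k))
    ranks = [0] * len(counts)
    for rank, frame in enumerate(order, start=1):
        ranks[frame] = rank
    return ranks


def window_distance(query_counts, ref_counts, p):
    """Normalized L1 rank distance between the query and the M reference
    frames starting at offset p (0-based), averaged over regions."""
    m = len(query_counts[0])
    if m < 2:
        raise ValueError("query must contain at least two frames")
    c = (m * m) // 2  # assumed normalizing factor C(M)
    total = 0.0
    for q_row, r_row in zip(query_counts, ref_counts):
        q_ranks = temporal_ranks(q_row)
        r_ranks = temporal_ranks(r_row[p:p + m])  # assumption: re-rank per window
        total += sum(abs(a - b) for a, b in zip(q_ranks, r_ranks)) / c
    return total / len(query_counts)


def best_match(query_counts, ref_counts):
    """Slide the query over all N - M + 1 offsets; return (offset, distance)."""
    m, n = len(query_counts[0]), len(ref_counts[0])
    offsets = range(n - m + 1)
    p = min(offsets, key=lambda k: window_distance(query_counts, ref_counts, k))
    return p, window_distance(query_counts, ref_counts, p)


# Hypothetical counts: 2 regions, 6 reference frames; the query is frames 2..4.
ref_example = [[3, 9, 4, 7, 1, 8], [5, 2, 6, 1, 9, 3]]
query_example = [row[2:5] for row in ref_example]
match = best_match(query_example, ref_example)
print(match)
```

A returned distance below a chosen threshold would then indicate a likely copy starting at the returned offset; the threshold itself is application dependent.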
A user can interface with the system 400 with one or more input devices 411, such as a keyboard, a mouse, a stylus, and the like in order to provide user input data. The display adaptor 415 interfaces with the communication bus 399 and the display 417 and receives display data from the processor 401 and converts the display data into display commands for the display 417. A network interface 419 is provided for communicating with other systems and devices via a network (not shown). The system can include a wireless interface 421 for communicating with wireless devices in the wireless community.
It will be apparent to one of ordinary skill in the art that one or more of the components of the system 400 may not be included and/or other components may be added as is known in the art. The apparatus 400 shown in
According to an example, a feature extraction engine 105, signature generation module 106 and comparison engine 111 can reside in memory 402 and operate on data representing a reference video 101 or query video 103 to provide signature data 107 for comparison and for storage in database 109, for example. A database 109 can be provided on an HDD such as 405, or can be provided as a removable storage unit 409. The database 109 can be remote from the apparatus 400 and can be connected thereto via the network interface 419.
Number | Date | Country | Kind |
---|---|---|---|
1200322.4 | Jan 2012 | GB | national |