In various applications it may be desirable to find similarities between a new or sample media file and one or more other media files. For example, such a comparison may be useful in identifying copyrighted audio or video files, such as in the context of a website that allows users to upload such media files, to identify potential infringement. To do so, some techniques use hashes that characterize the media files or portions of the media files, for example at or near a given time point within the media file. Such techniques may be limited as the number of known media files grows, or as media files become larger. For example, as the number of media files grows, a high number of matching hashes may be retrieved for a given portion of a media file. However, many of the retrieved hashes may not represent true matches to the given media file segment. It may be difficult to identify desirable hashes to use for the comparison. For example, some media files may include segments that result in hashes that are often matched to new media files, but that do not necessarily indicate a good match between the content in those files.
Embodiments of the disclosed subject matter include techniques, systems, and computer-readable media configured to obtain a plurality of media file feature definitions, each definition associated with a hash of media file content. A plurality of linear combinations of the plurality of media file feature definitions may be generated, where each combination is associated with a plurality of feature coefficients. Each of the linear combinations may be applied with respect to a sample media file and a reference media file to generate a correlation coefficient for the linear combination. The correlation coefficient serves as a measure of the benefit of a linear combination; the linear combinations themselves are the result of interest. Preferred linear combinations may be determined based upon the correlation coefficients. Media file features corresponding to the preferred linear combinations may be identified (i.e., hash functions are based on these new features, which are linear combinations of the original raw features), and used to determine whether the sample media file contains media substantially equivalent to media contained in the first reference media file. The linear combinations may exclude known false positive matches between the media file features apparent in the sample media file and in the reference media file. Embodiments also may exclude linear combinations which have a variance below a defined threshold. The sample media file may be, for example, a video file provided by an end user. The first reference media file may be, for example, a copyrighted media file, and the sample media file may be a media file to be tested for the presence of content that is substantially similar to content in the first reference media file.
Additional features, advantages, and embodiments of the disclosed subject matter may be set forth or apparent from consideration of the following detailed description, drawings, and claims. Moreover, it is to be understood that both the foregoing summary and the following detailed description are exemplary and are intended to provide further explanation without limiting the scope of the claims.
The accompanying drawings, which are included to provide a further understanding of the disclosed subject matter, are incorporated in and constitute a part of this specification. The drawings also illustrate embodiments of the disclosed subject matter and together with the detailed description serve to explain the principles of embodiments of the disclosed subject matter. No attempt is made to show structural details in more detail than may be necessary for a fundamental understanding of the disclosed subject matter and various ways in which it may be practiced.
As a collection of media files grows, it may become more difficult to effectively find candidate features within the media files to compare to new media files, and thereby identify similar or matching media files. Even if a particular set of “features” is given, it may not be apparent how the features used in a hash function should be selected or combined to obtain useful or optimal performance in matching media files. A technique to do so may use labeled data, such as pairs of media files that human observers have identified as “matching” or “non-matching”. According to embodiments of the disclosed subject matter, such labeled data or other feature sets may be used efficiently to generate effective feature combinations.
“Features” within a media file may define one or more hashes that characterize a portion of the media file at or around a given time point within the media file. In an embodiment, a “feature” may be designated or otherwise associated with a real number, whereas a “hash” may be designated by or otherwise associated with an integer. Thus, it may be possible for two features to be similar (approximately the same), whereas two hashes may be either exactly the same or not, i.e., hashes may not be referred to as having various degrees of similarity as may be appropriate to features. Several such hash functions may be combined to define a larger hash. For example, a set of eight single-hex-digit hashes may be combined to form a four-byte hash function. Other types and sizes of hashes may be used. Generally, the quality or usefulness of the resulting hashes may depend on how well the features are correlated for “matching” pairs of features between a sample (new) media file and one or more reference (known) media files, compared to a similar correlation in random pairs.
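The combination of several small hashes into one larger hash can be illustrated with a minimal sketch. The function name `combine_hex_hashes` and the big-endian packing order are assumptions chosen for illustration only; the description above does not prescribe a particular packing.

```python
def combine_hex_hashes(digits):
    """Pack eight single-hex-digit hashes (each 0..15) into one 32-bit hash.

    Hypothetical helper: the first digit is placed in the most significant
    nibble. Any fixed order would serve equally well.
    """
    if len(digits) != 8:
        raise ValueError("expected exactly eight single-hex-digit hashes")
    combined = 0
    for d in digits:
        if not 0 <= d <= 15:
            raise ValueError("each hash must fit in one hex digit")
        combined = (combined << 4) | d  # shift in four more bits
    return combined


sample = combine_hex_hashes([0xA, 0x1, 0xF, 0x0, 0x3, 0x7, 0xC, 0x2])
reference = combine_hex_hashes([0xA, 0x1, 0xF, 0x0, 0x3, 0x7, 0xC, 0x2])
near_miss = combine_hex_hashes([0xA, 0x1, 0xF, 0x0, 0x3, 0x7, 0xC, 0x3])

# Hashes either match exactly or not at all; a one-digit difference
# yields a different four-byte hash with no notion of "closeness".
print(hex(sample), sample == reference, sample == near_miss)
```

The sketch also shows why the distinction between features and hashes matters: two nearly identical feature vectors can still produce different hashes if a single component crosses a quantization boundary.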
An embodiment of the disclosed subject matter constructs linear combinations of a set of defined features for each pair of sample and known media files. The linear combination of those features that gives the optimal (i.e., highest) correlation coefficient is then found. The highest correlation coefficients then indicate the “best” or preferred features to use in detecting a match of the sample file. More generally, a higher correlation coefficient indicates a feature that is more likely to reliably indicate whether a portion of a sample media file matches a portion of a known media file. Techniques as disclosed herein may be viewed as contrasting the probability that a set of hashes for a sample media file matches a known media file, with the probability that it matches random values. Embodiments of the disclosed subject matter may scale linearly with respect to the number of media files and time point pairs, and thus may scale to very large sets of data. Embodiments of the disclosed subject matter also may use regularization and/or “must not match pairs” that evaluate a first version of the hashes to improve the reliability of the technique based on observed errors.
At 330, one or more preferred correlation coefficients may be determined based upon the statistics generated at step 320. A “preferred” coefficient may be one that is closest to 1 out of the potential coefficients for the linear combinations. More generally, a “preferred” coefficient may be a coefficient that is closer to 1 than at least one other coefficient for the linear combinations. More generally still, an embodiment may consider vector spaces of linear combinations and find an optimal or preferred d-dimensional vector space. That is, for each dimension d, an embodiment may find a d-dimensional vector space of linear combinations such that the minimum of the correlation coefficients over the linear combinations in that space is maximal. Based upon the optimal correlation coefficient identified at 330, preferred linear combinations may be identified among those generated at step 320. The preferred linear combinations are those that result in the optimal correlation coefficient. These combinations then indicate, at 340, the appropriate set of features to be used in determining whether the sample media file contains content that is substantially similar to content contained in the reference media file. Once the features have been identified, at 350 they may be used to determine whether the sample media file contains media substantially equivalent to media contained in the reference media file. For example, if the same features appear in the sample and reference media files, it may be determined that a portion of the sample media file is substantially similar to a portion of the reference media file. Thus, the linear combinations of features generated at 320 may be used to efficiently select a set of features to use in determining whether a sample media file, or a portion of the sample media file, is substantially similar to a reference media file or a portion of the reference media file.
As a specific, non-limiting example, an initial set of features may include any number of features, such as 1000 features. A set of preferred linear combinations of those features may be limited to the highest-rated, most preferred, or otherwise “best” combinations. As disclosed herein, in an example configuration 64 linear combinations may be used. Each of the 64 linear combinations may be specified by 1000 coefficients, one coefficient for each feature, allowing for a set of 64,000 values that can be used in conjunction with feature values to consider similarities between media files.
Embodiments of the disclosed subject matter may provide additional benefit because the linear combinations of features may be static or invariant with respect to the specific sample and reference media files being used. Thus, the only variation may be in the values associated with the features for each pair of media files. In the specific example given above, the 64,000 coefficients (64 linear combinations, each with 1000 coefficients) may be constant regardless of the specific media files being considered. For each time point in each media file, the features may take different values and, therefore, the evaluated linear combinations may ultimately have different values despite the invariance of the coefficients.
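The invariance of the coefficients can be illustrated with a short sketch. It assumes NumPy is available and uses random numbers in place of real coefficients and feature values; the names `coefficients` and `evaluate_combinations` are hypothetical and are not taken from the disclosure.

```python
import numpy as np

NUM_FEATURES = 1000      # size of the initial feature set in the example
NUM_COMBINATIONS = 64    # number of preferred linear combinations

rng = np.random.default_rng(0)

# Placeholder for the 64 x 1000 = 64,000 fixed coefficients. In practice
# these would come from the training procedure, not from a random draw.
coefficients = rng.standard_normal((NUM_COMBINATIONS, NUM_FEATURES))


def evaluate_combinations(coefficients, features):
    """Evaluate every linear combination at every time point.

    features: array of shape (num_time_points, NUM_FEATURES)
    returns:  array of shape (num_time_points, NUM_COMBINATIONS)
    """
    return features @ coefficients.T


# Two different media files (random stand-ins) share the same coefficients;
# only the feature values, and therefore the evaluated combinations, differ.
features_a = rng.standard_normal((5, NUM_FEATURES))
features_b = rng.standard_normal((7, NUM_FEATURES))
values_a = evaluate_combinations(coefficients, features_a)
values_b = evaluate_combinations(coefficients, features_b)
print(coefficients.size, values_a.shape, values_b.shape)
```

Because the coefficient matrix is constant, it can be computed once and reused for every sample and reference file.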
To aid in understanding the operation of techniques disclosed herein, a mathematical derivation of the linear combination and correlation coefficient operations will be described. Other techniques and operations may be used without departing from the scope of the disclosed subject matter.
Initially, it may be presumed that there will be a relatively high correlation between a sample media file and a reference media file for which there should be a match, i.e., which contain substantially or exactly similar content, when compared to a correlation between the sample media file and arbitrary data. Thus, the technique generates linear combinations X and Y for the sample and reference media files, respectively, for which the correlation coefficient r should be as close to 1 as possible:

$$r = \frac{\mathrm{CoVar}(X,Y)}{\sqrt{\mathrm{Var}(X)\cdot\mathrm{Var}(Y)}}$$

where the covariance and variance have the conventional definitions of:

$$\mathrm{CoVar}(X,Y) = \frac{1}{N}\sum_{n=1}^{N}\bigl(X(n)-\hat{X}\bigr)\cdot\bigl(Y(n)-\hat{Y}\bigr), \qquad \mathrm{Var}(X) = \mathrm{CoVar}(X,X)$$

Here $\hat{X}$ and $\hat{Y}$ denote the mean values of X and Y. The sum is taken over a set of N values X(n), Y(n), with n being an index of media file offset pairs that should match as described above. The variances and covariances are independent of adding a constant to X or Y. Because different features will be compared, they may be adjusted to have a mean of 0, so that for features $x_i$ of the sample, features $y_i$ of the reference, and coefficients $a_i$:

$$X = \sum_i a_i x_i, \qquad Y = \sum_i a_i y_i, \qquad \hat{x}_i = \hat{y}_i = 0$$

By assuming that the samples and references have approximately the same distribution, r then becomes

$$r = \frac{\sum_{i,j} a_i a_j\,\mathrm{CoVar}(x_i,y_j)}{\sum_{i,j} a_i a_j\,\mathrm{CoVar}(x_i,x_j)}$$

where $\mathrm{CoVar}(x_i,x_i)$ is the same as $\mathrm{Var}(x_i)$. This can be viewed as a calculation to determine the best $a_i$ given the measured $\mathrm{CoVar}(x_i,x_j)$ and $\mathrm{CoVar}(x_i,y_j)$. In an illustrative example, the former may contain about 1000 values, whereas the latter may include 2 million values, independent of the number of matching media file segments (e.g., video frames) being considered.

Let $\vec{a}$ be the vector of the unknown coefficients $a_i$, C the matrix of the numbers $\mathrm{CoVar}(x_i,y_j)$, and V the matrix of $\mathrm{CoVar}(x_i,x_j)$. The desired value is then the value of $\vec{a}$ that maximizes r for known matrices C and V:

$$r = \frac{\vec{a}^T C\,\vec{a}}{\vec{a}^T V\,\vec{a}} \qquad \text{(Equation 1)}$$
By construction, V is a positive definite symmetric matrix. C also may be expected to be symmetric, by assuming that “x matches y” is equivalent to “y matches x”. This constraint may be enforced by using each pair both as (x, y) and (y, x) or, equivalently, by using each pair only once but replacing C by the average of C and $C^T$. It also may be assumed that C is positive definite. However, this may not be guaranteed, and numerically there may be a few eigenvalues which are negative and very small (e.g., with an absolute value less than $10^{-16}$). This also may be caused by rounding errors in the computation. To ensure that V is also positive definite, a very small constant amount (e.g., $10^{-16}$) may be added in the diagonal.
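The construction of V and C from labeled matching pairs may be sketched as follows. This is a minimal illustration assuming NumPy; the function names, the pooled estimate of V over both sides of each pair, and the synthetic data are assumptions made for the example rather than details taken from the disclosure.

```python
import numpy as np


def covariance_matrices(x, y, jitter=1e-16):
    """Estimate V (within-file) and C (cross-file) covariance matrices.

    x, y: arrays of shape (num_pairs, num_features); row n of x is expected
    to match row n of y. The jitter default mirrors the illustrative value
    in the text; it only has a numerical effect when it is not negligible
    relative to the diagonal entries of V.
    """
    x = x - x.mean(axis=0)          # adjust every feature to mean 0
    y = y - y.mean(axis=0)
    n = x.shape[0]
    # Assumption: sample and reference share one distribution, so the
    # variance is pooled over both sides of each pair.
    v = (x.T @ x + y.T @ y) / (2 * n)
    c = (x.T @ y) / n
    c = (c + c.T) / 2               # enforce "x matches y" == "y matches x"
    v = v + jitter * np.eye(v.shape[0])
    return v, c


def correlation(a, v, c):
    """Ratio a^T C a / a^T V a for a coefficient vector a."""
    return float(a @ c @ a) / float(a @ v @ a)


rng = np.random.default_rng(0)
x = rng.standard_normal((200, 4))
y = x + 0.1 * rng.standard_normal((200, 4))   # noisy copy of x
v, c = covariance_matrices(x, y)
r = correlation(np.array([1.0, 0.0, 0.0, 0.0]), v, c)
print(r)
```

With the pooled estimate of V, the ratio cannot exceed 1 for any coefficient vector, which is consistent with its interpretation as a correlation coefficient.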
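For positive definite symmetric matrices V and C there are spectral decompositions

$$V = Q_1\cdot T_1\cdot Q_1^{-1}, \qquad C = Q_2\cdot T_2\cdot Q_2^{-1}$$

with $Q_i$ orthogonal matrices (so $Q_i^{-1} = Q_i^T$) and $T_i$ a diagonal matrix with positive entries. The diagonal matrix $D_i$ may be set such that $D_i^2 = T_i$, and the following may be set:

$$B_i := D_i\cdot Q_i^{-1} = D_i\cdot Q_i^T$$

such that $B_1^T\cdot B_1 = V$ and $B_2^T\cdot B_2 = C$. Then for

$$\vec{b} := B_1\vec{a},$$

the correlation coefficient becomes

$$r = \frac{\vec{a}^T C\,\vec{a}}{\vec{a}^T V\,\vec{a}} = \frac{\|B_2\vec{a}\|^2}{\|B_1\vec{a}\|^2} = \frac{\|B_2 B_1^{-1}\vec{b}\|^2}{\|\vec{b}\|^2}$$

Since $\vec{a}\neq 0$ was selected arbitrarily, $\vec{b}$ is also arbitrary, and the maximal r is given by the largest eigenvalue of $M^T M$, where $M = B_2 B_1^{-1}$.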
The eigenvectors $\vec{v}$, $\vec{v}'$ for different eigenvalues are orthogonal. For $\vec{a} = B_1^{-1}\vec{v}$ and $\vec{a}' = B_1^{-1}\vec{v}'$:

$$\vec{a}^T V\vec{a}' = \vec{a}^T B_1^T B_1\vec{a}' = \langle\vec{v},\vec{v}'\rangle = 0$$

This indicates that the corresponding combination features

$$X = \sum_i a_i x_i, \qquad X' = \sum_i a'_i x_i$$

are uncorrelated:

$$\mathrm{CoVar}(X, X') = \sum_{i,j} a_i a'_j\,\mathrm{CoVar}(x_i, x_j) = \vec{a}^T V\vec{a}' = 0$$
Thus, a reasonable entropy may be expected for embodiments of the disclosed subject matter, with normalized combination features from the eigenvalues described above. This may be expected because the low entropy results from a correlation between features.
A subset of all possible eigenvalue solutions may be used to identify preferred features to use when determining whether media files match reference files. For example, in an embodiment the top 64 eigenvalues may be used. It has been found that in some circumstances, using eigenvalues below the top 64 provides little or no additional benefit. It will be understood that the 64 value limit is provided only as an illustrative example, and in general any number of values may be used.
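The selection of the top combinations may be sketched numerically as follows. The sketch assumes NumPy, uses synthetic data with three features of decreasing reliability, and keeps the top 2 rather than the top 64 combinations; the function name `top_combinations` and the data-generating choices are hypothetical.

```python
import numpy as np


def top_combinations(v, c, k):
    """Return the k largest values of r and the matching coefficient vectors.

    Follows the derivation: with V = Q T Q^T and B1 = D Q^T (D^2 = T), the
    eigenvectors of B1^{-T} C B1^{-1} = M^T M are mapped back via a = B1^{-1} v.
    """
    t, q = np.linalg.eigh(v)             # V = Q diag(t) Q^T
    b1_inv = q / np.sqrt(t)              # B1^{-1} = Q D^{-1}
    s = b1_inv.T @ c @ b1_inv            # equals M^T M
    s = (s + s.T) / 2                    # remove rounding asymmetry
    eigvals, eigvecs = np.linalg.eigh(s)
    order = np.argsort(eigvals)[::-1][:k]
    return eigvals[order], (b1_inv @ eigvecs[:, order]).T


# Synthetic matching pairs: each feature is shared content plus noise, and
# the noise level grows from feature 0 to feature 2.
rng = np.random.default_rng(1)
shared = rng.standard_normal((5000, 3))
noise_scale = np.array([0.1, 1.0, 3.0])
x = shared + noise_scale * rng.standard_normal((5000, 3))
y = shared + noise_scale * rng.standard_normal((5000, 3))
x -= x.mean(axis=0)
y -= y.mean(axis=0)

v = (x.T @ x + y.T @ y) / (2 * len(x))   # pooled variance (assumption)
c = (x.T @ y) / len(x)
c = (c + c.T) / 2

r_values, combos = top_combinations(v, c, k=2)
print(r_values)
print(combos)
```

In this synthetic setting the best combination is dominated by the least noisy feature, and the returned combinations have unit variance and are mutually uncorrelated with respect to V.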
In an embodiment, “must-not-match” feature pairs may be used to further refine the selection of preferred features to use when determining whether a sample media file matches a reference media file. These features may be, for example, features that have been determined to be likely to lead to false-positive matches.
If X is used to denote a combined feature and (X, Y) a pair of values obtained for matching media files, the prior derivation maximizes the average of $(X-\hat{X})\cdot(Y-\hat{Y})$ compared to the average of $(X-\hat{X})^2$ (which is equivalent to the average of $(Y-\hat{Y})^2$). In the best case, Y is almost the same as X, so the averages are relatively close together. However, if X and Y are allowed to run independently over randomly sampled media files or media file segments, the average of (X−
The previously-described techniques may be used to find matches, which then may be further evaluated such as by an automated process or by human inspection. False positives, i.e., media file pairs that are identified as matching but do not in fact match, may be tracked or assembled to provide another useful data set. The covariance matrix $C_{fp}$ of the false positive set may be used to modify the technique. Specifically, if (X, Y) run over pairs in the false positives list, there may be some combinations that have a positive correlation (because the pairs were found based on matching combinations), although the pairs did not really match. The denominator of Equation 1 may then be set to $\vec{a}^T C_{fp}\,\vec{a}$, which then gives the ratio of the correlation coefficients on “true” positives and “false” positives instead of r. This may then be optimized so that the combinations “are more similar for matching pairs than for non-matching pairs”. However, since the false positives are a relatively small and biased sample of all media files that do not match, $C_{fp}$ should be used as a “correction” to V, i.e., the following should be optimized for some constant η>0 instead of Equation 1:

$$\frac{\vec{a}^T C\,\vec{a}}{\vec{a}^T (V + \eta\cdot C_{fp})\,\vec{a}}$$
As previously described, it may be presumed that “if media files a and b do not match, then b and a also don't match”, and Cfp may be enforced to be symmetric by replacing it with the average of Cfp and its transpose.
In an embodiment, a correction may be made to avoid false indications that may arise, for example, in combinations that happen to have a very low variance. The correction may remove unreliable features from those used to compare a sample media file to a reference media file or provide other analytical benefits. For example, if the measured variance of a linear combination is several orders of magnitude below the variance of one term in this linear combination, a slight measuring error in this term will give relatively very large error in the linear combination when the features are normalized to have a variance of 1. In some cases, these features may not be reliable, so it may be useful to exclude features with extremely low variance. An example technique for doing so may be to add a small constant to the diagonal elements of the denominator, i.e., take as a matrix in the denominator
$$V + \eta\cdot C_{fp} + \mu\cdot I, \qquad \eta>0,\ \mu>0$$

where I is the identity matrix. The value μ may be chosen such that the resulting eigenvalues are relatively significantly positive; this also enforces the constraint that the matrix is positive definite. This results in a new positive definite symmetric matrix, so that now

$$\tilde{B}_1^T\tilde{B}_1 = V + \eta\cdot C_{fp} + \mu\cdot I$$
In an embodiment, η should be, for example, relatively small, but large enough to remove at least some of the false positives. As another example, η may be chosen such that the highest absolute value of an eigenvalue of $\eta\cdot C_{fp}$ is about 1/10 of the highest eigenvalue of V.
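The corrected denominator and the example choice of η may be sketched as follows, assuming NumPy. The function name `regularized_denominator`, the default value of `mu`, and the small 2×2 matrices are hypothetical and serve only to make the arithmetic easy to follow.

```python
import numpy as np


def regularized_denominator(v, c_fp, eta_ratio=0.1, mu=1e-6):
    """Build V + eta * C_fp + mu * I.

    eta is scaled so that the largest absolute eigenvalue of eta * C_fp is
    eta_ratio (here 1/10) of the largest eigenvalue of V. The value of mu
    is an illustrative assumption; it should be chosen so that the result
    is safely positive definite.
    """
    c_fp = (c_fp + c_fp.T) / 2                      # enforce symmetry
    top_v = np.linalg.eigvalsh(v).max()
    top_fp = np.abs(np.linalg.eigvalsh(c_fp)).max()
    eta = eta_ratio * top_v / top_fp if top_fp > 0 else 0.0
    return v + eta * c_fp + mu * np.eye(v.shape[0]), eta


v = np.array([[4.0, 0.0], [0.0, 1.0]])       # largest eigenvalue 4
c_fp = np.array([[0.0, 1.0], [1.0, 0.0]])    # eigenvalues -1 and +1
denominator, eta = regularized_denominator(v, c_fp)

# The result should be checked for positive definiteness; if the smallest
# eigenvalue is not clearly positive, mu may need to be increased.
smallest = np.linalg.eigvalsh(denominator).min()
print(eta, smallest)
```

The regularized matrix can then take the place of V when computing the spectral decomposition, which is the role played by the matrix with the tilde in the text.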
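The matrix $\tilde{B}_1$ may be used to get the orthogonal eigenvectors as previously described. With $\tilde{M} := B_2\tilde{B}_1^{-1}$ defined analogously to M, the eigenvectors $\vec{v}_1,\ldots,\vec{v}_n$ of $\tilde{M}^T\tilde{M}$ to the n largest eigenvalues

$$\lambda_1 \geq \ldots \geq \lambda_n > 0$$

span a vector space $V_n$ of dimension n in which

$$\|\tilde{M}\vec{v}\|^2 \geq \lambda_n\cdot\|\vec{v}\|^2 \quad \text{for all } \vec{v}\in V_n$$

This has the consequence that for all $\vec{a}\in\tilde{B}_1^{-1}V_n$, for $\vec{v} := \tilde{B}_1\vec{a}\in V_n$,

$$\frac{\vec{a}^T C\,\vec{a}}{\vec{a}^T (V+\eta\cdot C_{fp}+\mu\cdot I)\,\vec{a}} = \frac{\|\tilde{M}\vec{v}\|^2}{\|\vec{v}\|^2} \geq \lambda_n$$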
The fact that different eigenvectors $\vec{v}$ of $\tilde{M}^T\tilde{M}$ are orthogonal no longer means that the combinations corresponding to $\vec{a} = \tilde{B}_1^{-1}\vec{v}$ are uncorrelated, because $\tilde{B}_1^T\tilde{B}_1$ is no longer equivalent to V. Because the vector space $\tilde{B}_1^{-1}V_n$ contains “good” combinations, an orthonormal basis may be computed with respect to the quadratic form given by the variance $V = B_1^T B_1$:

For a basis $\vec{a}_i := \tilde{B}_1^{-1}\vec{v}_i$, an orthonormal basis $\vec{b}_1',\ldots,\vec{b}_n'$ of $B_1\tilde{B}_1^{-1}V_n$ may be found. Then

$$\vec{a}_i' := B_1^{-1}\vec{b}_i'$$

are a basis of $\tilde{B}_1^{-1}V_n$ such that for corresponding feature combinations, there exists the covariance

$$\vec{a}_i'^{\,T} V\,\vec{a}_j' = \vec{a}_i'^{\,T} B_1^T B_1\,\vec{a}_j' = \langle B_1\vec{a}_i', B_1\vec{a}_j'\rangle = \langle\vec{b}_i',\vec{b}_j'\rangle = \delta_{ij}$$
Again, the corresponding feature combinations are uncorrelated, and have a variance of 1.
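The re-orthonormalization with respect to V may be sketched as follows, assuming NumPy. A QR factorization is used here as one convenient way to obtain an orthonormal basis; the text does not specify which orthonormalization procedure is used, and the function name and random test matrices are hypothetical.

```python
import numpy as np


def orthonormalize_wrt_variance(combos, v):
    """Replace the rows of combos by a basis of the same space whose
    feature combinations are uncorrelated with unit variance under V.

    combos: array of shape (n, num_features), rows are coefficient vectors.
    v:      positive definite symmetric variance matrix.
    """
    t, q = np.linalg.eigh(v)              # V = Q diag(t) Q^T
    b1 = np.sqrt(t)[:, None] * q.T        # B1 = D Q^T, so B1^T B1 = V
    b1_inv = q / np.sqrt(t)               # B1^{-1} = Q D^{-1}
    mapped = b1 @ combos.T                # columns are B1 a_i
    basis, _ = np.linalg.qr(mapped)       # orthonormal columns b'_i
    return (b1_inv @ basis).T             # rows are a'_i = B1^{-1} b'_i


rng = np.random.default_rng(2)
m = rng.standard_normal((4, 4))
v = m @ m.T + 4.0 * np.eye(4)             # a positive definite test matrix
combos = rng.standard_normal((2, 4))      # two arbitrary combinations

new_combos = orthonormalize_wrt_variance(combos, v)
print(new_combos @ v @ new_combos.T)      # close to the 2 x 2 identity
```

The new rows span the same space as the original rows, so the “good” combinations are preserved while the resulting features become uncorrelated.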
To test the techniques disclosed herein, a set of coefficients was trained on 60% of a set of must-match pairs and 60% of must-not-match pairs, and tested on the other 40% of each, for a test set of video media files. In the sample data set, this included 141 false positive video files and 407 correct matches. The results were compared against randomly-chosen features. Because the randomly-chosen features have lower entropy than the orthogonal new feature combinations, six groups of new features were compared against eight groups of old features.
In each group of features, the index of the largest feature was computed. It was then determined whether in all groups the index was the same for corresponding video frames in a pair. The comparison considered three criteria: the number of video segments with zero matches; the number of segments with more, the same, or fewer matches; and the number of “interesting” clips with “significantly more or fewer” matches. For purposes of the test, “interesting” clips were considered those where at least one version had less than 10% matches, and “significant” to indicate a difference of at least 2, and more than 5%.
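The per-group comparison described above may be sketched in plain Python. The function names, the group size of 3, and the small feature lists are hypothetical and chosen only to keep the example readable.

```python
def group_argmax_hash(features, group_size):
    """For each consecutive group of features, record the index of the
    largest feature within that group."""
    if len(features) % group_size != 0:
        raise ValueError("feature count must be a multiple of group_size")
    parts = []
    for start in range(0, len(features), group_size):
        group = features[start:start + group_size]
        parts.append(max(range(group_size), key=group.__getitem__))
    return tuple(parts)


def count_matching_frames(sample_frames, reference_frames, group_size):
    """Count corresponding frames whose index is the same in all groups."""
    return sum(
        group_argmax_hash(s, group_size) == group_argmax_hash(r, group_size)
        for s, r in zip(sample_frames, reference_frames)
    )


# Two frames per clip, two groups of three features per frame.
sample_frames = [
    [0.1, 0.9, 0.3, 0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1, 0.0, 0.2, 0.9],
]
reference_frames = [
    [0.2, 0.8, 0.1, 0.6, 0.3, 0.2],   # same largest index in both groups
    [0.1, 0.4, 0.5, 0.0, 0.2, 0.9],   # first group differs
]

first_hash = group_argmax_hash(sample_frames[0], 3)
matches = count_matching_frames(sample_frames, reference_frames, 3)
print(first_hash, matches)
```

A frame pair counts as a match only if every group agrees, so using more groups makes matches rarer and more selective, which is why the number of groups differs between the old and new features in the test.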
The results of the test were as follows:
This test was performed comparing raw features, which were selected to have a certain distance to avoid low entropy resulting from the correlation between nearby features, against new combination features. This test may be considered relatively simple, because normally feature wavelet coefficients are used instead of the raw features. The difference is relatively small because, for example, for the false positives there are already 0 matches in all but 12 of the 141 video clips.
Another measurement was performed using fewer groups to have a larger number of interesting cases, and using Haar wavelet coefficients, normalized by dividing by the square root of the number of used coefficients. In a comparison of the “old” (Haar square root), with 4 groups of 15 features, to “new” (“optimal”) combination of features, with 3 groups of 12 features, the following results were obtained:
Similar tests which replaced “False Positives” with “random pairs” were performed, and provided similar results.
Embodiments of the presently disclosed subject matter may be implemented in and used with a variety of component and network architectures.
The bus 21 allows data communication between the central processor 24 and the memory 27, which may include read-only memory (ROM) or flash memory (neither shown), and random access memory (RAM) (not shown), as previously noted. The RAM is generally the main memory into which the operating system and application programs are loaded. The ROM or flash memory can contain, among other code, the Basic Input-Output system (BIOS) which controls basic hardware operation such as the interaction with peripheral components. Applications resident with the computer 20 are generally stored on and accessed via a computer readable medium, such as a hard disk drive (e.g., fixed storage 23), an optical drive, floppy disk, or other storage medium 25.
The fixed storage 23 may be integral with the computer 20 or may be separate and accessed through other interfaces. A network interface 29 may provide a direct connection to a remote server via a telephone link, to the Internet via an internet service provider (ISP), or a direct connection to a remote server via a direct network link to the Internet via a POP (point of presence) or other technique. The network interface 29 may provide such connection using wireless techniques, including digital cellular telephone connection, Cellular Digital Packet Data (CDPD) connection, digital satellite data connection or the like. For example, the network interface 29 may allow the computer to communicate with other computers via one or more local, wide-area, or other networks, as shown in
Many other devices or components (not shown) may be connected in a similar manner (e.g., document scanners, digital cameras and so on). Conversely, all of the components shown in
More generally, various embodiments of the presently disclosed subject matter may include or be embodied in the form of computer-implemented processes and apparatuses for practicing those processes. Embodiments also may be embodied in the form of a computer program product having computer program code containing instructions embodied in non-transitory and/or tangible media, such as floppy diskettes, CD-ROMs, hard drives, USB (universal serial bus) drives, or any other machine readable storage medium, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing embodiments of the disclosed subject matter. Embodiments also may be embodied in the form of computer program code, for example, whether stored in a storage medium, loaded into and/or executed by a computer, or transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing embodiments of the disclosed subject matter. When implemented on a general-purpose microprocessor, the computer program code segments configure the microprocessor to create specific logic circuits. In some configurations, a set of computer-readable instructions stored on a computer-readable storage medium may be implemented by a general-purpose processor, which may transform the general-purpose processor or a device containing the general-purpose processor into a special-purpose device configured to implement or carry out the instructions. Embodiments may be implemented using hardware that may include a processor, such as a general purpose microprocessor and/or an Application Specific Integrated Circuit (ASIC) that embodies all or part of the techniques according to embodiments of the disclosed subject matter in hardware and/or firmware. 
The processor may be coupled to memory, such as RAM, ROM, flash memory, a hard disk or any other device capable of storing electronic information. The memory may store instructions adapted to be executed by the processor to perform the techniques according to embodiments of the disclosed subject matter.
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit embodiments of the disclosed subject matter to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to explain the principles of embodiments of the disclosed subject matter and their practical applications, to thereby enable others skilled in the art to utilize those embodiments as well as various embodiments with various modifications as may be suited to the particular use contemplated.