The present invention relates to techniques for determining if audio data that may be broadcast, transmitted in a communication channel or played is a copy of an audio piece within a repository. Those techniques can be used to perform copy detection for copyright infringement purposes or for advertisement monitoring purposes.
There are many applications of content-based audio copy detection. It can be used to monitor peer-to-peer copying of music or any copyrighted audio over the internet. Global digital music trade revenues exceeded $3.7 billion in 2008 (“IFPI digital music report 2009” www.ifpi.org/cpmtemt/library/dmr2009.pdf) and grew rapidly in 2009 to reach an estimated $4.2 billion (“IFPI digital music report 2010” www.ifpi.org/cpmtemt/library/dmr2010.pdf). The digital track sales in the US increased from $0.844 billion in 2007 to $1.07 billion in 2008, and to $1.16 billion in 2009. These figures do not include peer-to-peer download of music and songs that may or may not be legal. The music industry believes that peer-to-peer file sharing has led to billions in lost sales. Fast and effective copy detection will allow ISPs to monitor such activity at a reasonable cost.
Content-based audio copy detection can also be used to monitor advertisement campaigns over TV and radio broadcasts. Many companies that advertise not only monitor their advertisements, but follow the ad campaigns of their competitors for business intelligence purposes. Worldwide, the TV and radio advertising market amounted to over $214 billion dollars in 2008. In the US alone, TV and radio advertisements amounted to over $82 billion dollars in 2008.
Currently, monitoring of ad campaigns is being offered as a service by many companies worldwide. Some companies offer watermarking for automated monitoring of ads. In watermarking audio, a unique code is embedded in the audio before it is broadcast. This code can then be retrieved by watermark monitoring equipment. However, watermarking every commercial and then monitoring by specialized equipment is expensive. Furthermore, watermarking only allows companies to monitor their own ads that have been watermarked. They cannot follow the campaigns of their competitors for business intelligence. Content-based audio copy detection would alleviate many such constraints imposed by watermarking.
Published papers from the audio copy detection and advertisement detection fields show that the two fields have evolved differently. In audio copy detection (J. Haitsma, T. Kalker, “A highly robust audio fingerprinting system”, [online] ismir2002.ismir.net/proeeedings/02-FP04-2.pdf and Y. Ke, D. Hoiem, and R. Sukthankar, “Computer vision for music identification”, Proc. Compo Vision Pattern Recog., 2005), the emphasis is on speed, since the alleged copy is compared with a large repository of copyrighted audio pieces. A small percentage of misses will not make a big difference so long as most of the copies are captured. The system has to be robust under various coding schemes and distortions that audio may go through over the Internet. Fast audio copy detection uses audio fingerprints. The audio fingerprints proposed by Haitsma and Kalker (J. Haitsma, T. Kalker, “A highly robust audio fingerprinting system”, [online] ismir2002.ismir.net/proeeedings/02-FP04-2.pdf) have been found to be quite robust to various distortions of the audio signals. These fingerprints have been used for music search (N. Hurley, F. Balado, E. McCarthy, G. Silvestre, “Performance of Phillips Audio Fingerprinting under Desynchronisation,”, [online] ismir2007.ismir.net/proceedings/ISMIR2007_p133_hurley.Pdf). These fingerprints have also been proposed for controlling peer-to-peer music sharing over the Internet (P. Shrestha, T. Kalker, “Audio Fingerprinting In Peer-to-peer Networks,”, [online] ismir2004.ismir.net/proceedings/p062-page-341-paper91.pdf), and for measuring sound quality (P. Docts, R. Lagendijk, “Extracting Quality Parameters for Compressed Audio from Fingerprints,”, [online] ismir2005.ismir.net/proceedings/1063.pdf). These audio fingerprints use energy differences in consecutive bands to generate a feature expressed in 32 bits. The audio search using these fingerprints is speeded up by looking for exact match of these 32 bits in the stored repository. A more complete search is only performed around the frames corresponding to these matching fingerprints. This complete search involves computing bit matches and a threshold in order to find matching segments. This search is expensive because of the computing involved in the bit matching.
In contrast, within the advertisement detection field, the emphasis is focused more on finding all the ads broadcast in the campaign (M. Covell, S. Baluja, and M. Fink, “Advertisement Detection and Replacement using Acoustic and Visual Repetition”, IEEE Workshop multimedia sig. proc., October 2006, pp. 461-466) (P. Duygulu, M. Chen, and A. Hauptmann, “Comparison and combination of two novel commercial detection methods”, Proc. ICME, 2004, pp. 1267-1270) (V. Gupta, G. Boulianne, P. Kenny, and P. Dumouchel, “Advertisement Detection in French Broadcast News using Acoustic repetition and Gaussian Mixture Models”, Proc. InterSpeeeh 2008, Brisbane, Australia). This type of search is generally exhaustive. The process is speeded up by first using a fast search strategy that overgenerates the possible advertisement matches. These matches are then compared using a detailed match. In many instances, the detailed match includes comparing video features, although in some instances, the same audio may be played even though the video frames may be different.
Accordingly, there exists in the industry a need to provide improved solutions for content-based copy detection.
As embodied and broadly described herein the invention provides a method for performing audio copy detection, comprising, providing a query audio data, the query audio data having a succession of frames and also providing a plurality of test audio data units, each test audio data unit including a succession of frames. For each test audio data unit the method generates a test fingerprint set. The generation of the test fingerprint set including computing similarity measurements between at least one frame of the test audio data and a plurality of frames of the query audio data. A test audio data unit is then selected as a match for the query audio data at least in part on the basis of the fingerprint sets.
As embodied and broadly described herein, the invention also provides a method for performing audio copy detection, comprising providing a query audio data, the query audio data having a succession of frames and also providing a plurality of test audio data units, each test audio data unit including a succession of frames. The query audio data and each test audio data unit are processed with software to derive a plurality of fingerprint sets, each fingerprint set being associated with the query audio data and a respective test audio data unit combination. The plurality of fingerprint sets and the query audio data are further processed to identify a fingerprint set that best matches the query audio.
As embodied and broadly described herein, the invention further provides a method for generating a group of fingerprint sets for performing audio copy detection. The method includes providing query audio data having a succession of frames and also providing a plurality of test audio data units, each test audio data units having a succession of frames. A group of fingerprint sets is computed, for each fingerprint set the computing including mapping frames of a test audio data unit to corresponding frames of the query audio data on the basis of similarity measurement between the frames.
As embodied and broadly described herein, the invention also provides a method for performing audio copy detection, comprising providing a query audio data having a succession of frames and deriving a set of query audio fingerprints from audio information conveyed by the query audio data, frames in the succession being associated with respective fingerprints in the set. The method further provides a group of test audio fingerprint sets, each fingerprint set uniquely representing a test audio data unit and also providing a map linking fingerprints in the query audio fingerprint set to frame positions in the succession, wherein the map establishes a relationship between each fingerprint in the query audio fingerprint set and the positions of one or more frames in the succession associated with the fingerprint. For each test audio fingerprint set the method identifies via the map the fingerprints in the test audio fingerprint set matching the fingerprints in the query audio set and selecting on the basis of the identifying the test audio data unit that corresponds to the query audio data.
As embodied and broadly described herein, the invention provides an apparatus for performing audio copy detection, comprising an input for receiving query audio data, the query audio data having a succession of frames and a machine readable storage holding a plurality of test audio data units, each test audio data unit including a succession of frames. The machine readable storage being encoded with software for execution by a CPU for computing similarity measurements between a frame of every test audio data unit and a plurality of frames of the query audio data, to generate a test fingerprint set for each test audio data unit, software selecting at least in part on the basis of the fingerprints sets a test audio data unit as a match for the query audio data. The apparatus also comprising an output for releasing information conveying the selected test audio data unit.
As embodied and broadly described herein, the invention provides an apparatus for performing audio copy detection, comprising an input for receiving query audio data, the query audio data having a succession of frames, and a machine readable storage for holding a plurality of test audio data units, each test audio data unit including a succession of frames. The machine readable storage being encoded with software for execution by a CPU, the software processing the query audio data and each test audio data unit to derive a plurality of fingerprint sets, each fingerprint set being associated with the query audio data and a respective test audio data unit combination. The software processing the plurality of fingerprint sets and the query audio data to identify a fingerprint set and a corresponding test audio data unit that matches the query audio.
As embodied and broadly described herein, the invention also provides an apparatus for generating a group of fingerprint sets for performing audio copy detection, the apparatus comprising an input for receiving query audio data having a succession of frames and a machine readable storage holding a plurality of test audio data units, each test audio data units having a succession of frames. The machine readable storage is encoded with software for execution by a CPU for computing the group of fingerprint sets, for each fingerprint set the software mapping frames of a test audio data unit to corresponding frames of the query audio data on the basis of similarity measurement between frames.
As embodied and broadly described herein, the invention also includes an apparatus for performing audio copy detection, comprising an input for receiving query audio data having a succession of frames and a machine readable storage encoded with software for execution by a CPU for deriving a set of query audio fingerprints from audio information conveyed by the query audio data, frames in the succession being associated with respective fingerprints in the set. The machine readable storage holding a group of test audio fingerprint sets, each fingerprint set uniquely representing a test audio data unit. The machine readable storage also holding a map linking fingerprints in the query audio fingerprint set to frame positions in the succession, wherein the map establishes a relationship between each fingerprint in the query audio fingerprint set and the positions of one or more frames in the succession associated with the fingerprint. For each test audio fingerprint set the software identifying via the map the fingerprints in the test audio fingerprint set matching the fingerprints in the query audio set and selecting on the basis of the identifying the test audio data unit that corresponds to the query audio data.
A detailed description of examples of implementation of the present invention is provided hereinbelow with reference to the following drawings, in which:
In the drawings, embodiments of the invention are illustrated by way of example. It is to be expressly understood that the description and drawings are only for purposes of illustration and as an aid to understanding, and are not intended to be a definition of the limits of the invention.
The overall system 10 shown in
The fingerprints that are computed at step 14 are processed at step 16 that tries to determine if the fingerprints match a set of fingerprints in a repository 18. The repository 18, which can be implemented as a component or segment of the machine readable storage of the computer, contains a multitude of fingerprint sets associated with respective audio pieces, where each audio piece can be a song, the soundtrack of a movie or an audio book, among others. In this specification, the fingerprints in the repository 18 are referred to as “test fingerprints”. In the context of copy detection, the repository 18 holds fingerprints of copyrighted audio material that is to be detected in a stream of query audio. In a different application, when monitoring advertisements, the repository 18 holds fingerprints of ads that are being monitored. The query audio in this case corresponds to the broadcast program segment being searched for ads.
Audio fingerprints allow for quickly matching segments of the query audio by counting the fingerprints that match exactly in the corresponding test segments. Two different audio fingerprints can be considered. One fingerprinting method is based on energy differences in consecutive sub-bands and results in a very fast search. The other fingerprints are based on classification of each frame of the test to the nearest frame (nearest neighbor) of the query. These fingerprints provide even better performance. While the nearest neighbor fingerprints are slower to compute, the computation can be speeded up by parallel processing on a Graphical Processing Unit (GPU).
The fingerprints are used to find test segments that may be copies of the queries. Fingerprint matching is done by moving the query over the test and counting the total fingerprint matches for each alignment of the query with the test. In other words, the search is done by moving the query audio (n frames) over the test (m frames) and counting the number of fingerprint matches for each possible alignment of the query audio and the test.
An example of one such alignment is shown in
Both the counts and counts/sec values can used to determine if a match exists. Since the same query is matched against all the test segments in the repository 18, the total count can be a better measure of match between the query and the test segment in certain cases. The reason for this is that counts/sec can vary even though the count (or the number of frame matches) is the same. Therefore, counts can be used when the system is searching for the best matching test segment for a given query. However, when comparing matches for different queries, counts/sec is a more consistent measure, since the queries can vary in duration. For example, queries having respective lengths of 3 seconds and 3 minutes will have very different counts, but similar counts/sec. This is the case when the scores across queries are compared to reject query matches that may be false alarms.
During the search, segments that match the query can overlap with one another. In this case, the overlaps that are found to be synchronized are combined and overlaps with low counts are removed. Overlaps are synchronized if the start of the query (when the query is overlaid on test) differs by less than 2 frames. In such a case the two counts are added and only the segment with the higher count is retained. In all other cases of overlap, the overlap is removed with the lower count. This is an optional enhancement and it only has a small influence on copy detection accuracy. The algorithms work as follows:
Extension
Consider two alignments a1 and a2. Both alignments are synchronized if
|(pStart[a1]−pStart[a2])−(aStart[a1]−aStart[a2])|≦2
where pStar[a] and aStart[a] are respectively the first matching frame in the audio segment and the first matching frame in the advertisement for the alignment a.
If two alignments are synchronized, the one with the lower count is eliminated and its count is added to the remaining one.
Overlap
Two alignments a1 and a2 overlap if the following conditions are met:
pStart[a2]<pEnd[a1] and pEnd[a2]>pStart[a1]
where pStar[a] and pEnd[a] are respectively the first and last matching frame in the audio segment for the alignment a. When two alignments overlap, the one with lower count is eliminated.
Two different audio fingerprints can be used. The first fingerprint is based on the energy difference in consecutive sub-bands of the audio signal (energy-difference fingerprint) and it is best suited for music search and other copy detection tasks. This energy-difference fingerprint has 15 bits/frame and is extracted by using the process illustrated in
F(n,m)=1, if EB(n,m)−EB(n,m+1)>0,
Otherwise, F(n,m)=0.
While it is known to generate audio fingerprints based on energy differences that are expressed as 32 bits values, those fingerprints are less than optimal in the context of a fast search. The problem with 32 bits is that the likelihood of all the bits matching is low. As a result, fingerprints in very few frames match, even in matching segments. In order to get a good measure of match between the two segments, the total number of matching bits needs to be counted. This is likely to be computationally expensive and will cause the search to slow down. Using less than 32 bits leads to frequent matches of the fingerprints, and then just the counts of matching fingerprints can be used as a measure of closeness between two segments. This count goes down with the severity of the query transformations. However, the count remains high enough that it can be relied upon as a measure of match.
In a specific example of implementation, energy-difference fingerprints of 15 bits have been found satisfactory, however this should not be considered as limiting. Applications are possible where energy-difference fingerprints of more or less than 15 bits can be used.
To search for a test segment that matches a query a map is provided in the machine readable storage linking fingerprints in the query audio fingerprint set to frame positions. A map can be implemented by a hash function. For example, if the fingerprint for frame k of the query is ƒp, then hash(ƒp)=k. In other words, for every fingerprint value (fingerprint ƒp can have 215 different values according to the above example), the hash function will return the frame number of the query with that hash value. If there is no query frame with that hash value, then the hash function will return a value of −1.
This hash function is beneficial to performing a fast search of best test segment that matches the query. For each frame j of the test, a count c(j) of total query frame matches is kept, when the first frame of the query starts at frame j of the test. If the test frame t has a fingerprint ƒp1, then the count c(t−hash(ƒp1)) is incremented when hash(ƒp1) is not −1. At the same time, the first and the last matching test frames of the query are updated, when the query starts at test frame
t−hash(ƒp1). Since more than one query frame can have the fingerprint ƒp1, hash(ƒp1) can have multiple values, and therefore all the counts c(t−hash(ƒp1)) are updated. The maximum count c(t1) for some test frame t1 and the corresponding start and end test frames provides the best matching test segment. Accordingly, there are only three operations involved per test frame.
A specific search example is illustrated at
Note that the process searches for a segment in the test that matches the query. Since the query is fixed, the count of the number of fingerprint matches in a segment is a good measure of the match. However, when a threshold is applied across many queries, then a better measure is the count/sec. The reason for this is simple, as query duration may vary from 3 sec to several minutes. Therefore, the distribution of matching fingerprint counts for test segments will be very different when the query lengths differ. Using counts/sec across queries helps to normalize the counts and leads to fewer false alarms and higher recall rate. The threshold for rejection/acceptance is based on counts/sec. For example, for the TRECVID 2008/2009 audio copy detection evaluation, this threshold was set at 1.88 counts/sec to avoid any false alarms. This threshold will vary depending on the search requirements.
The second audio fingerprint that can be used maps each frame of the audio segment to the closest frame of the query audio. This approach is more accurate than the energy-difference fingerprints, but is more computationally expensive. For computing this measure of closeness, 12 cepstral coefficients and normalized energy and its delta coefficients are computed. The distance between the query audio frame and the test audio frame is defined as the sum of the absolute difference between the corresponding cepstral parameters. If a1 . . . an are the cepstral parameters for a query audio frame and p1 . . . pn are the cepstral parameters for an audio test frame, then this distance is computed as
To each test audio segment frame is associated the closest query audio frame. This process is depicted by Algorithm 1, below in which “result” refers to the closest test audio frame and “n” is the nth cepstral coefficient:
Computing the closest test audio frame for each query audio frame is computationally intensive. However, one may note that the search for the nearest test audio frame for each query audio frame can be done independently. Consequently, an alternate processor that is specialized in parallel computations may be used to outperform the speed offered by a modern CPU.
Modern graphic cards incorporate a specialized processor called Graphics Processing Unit (GPU). A GPU is mainly a Single Instruction, Multiple Data (SIMD) parallel processor that is computationally powerful, while being quite affordable.
One possible approach to implement the nearest neighbor computation is to use CUDA, a development framework for NVidia graphic cards (CUDA, “[online] http://www.nvidia.com/object/cuda_home.html.”). The CUDA framework models the graphic card as a parallel coprocessor for the CPU. The development language is C with some extensions.
A program in the GPU is called a kernel and several programs can be concurrently launched. A kernel is made up of configurable amounts of blocks, each of which has a configurable amount of threads.
At execution time, each block is assigned to a multiprocessor. More than one block can be assigned to a given multiprocessor. Blocks are divided in groups of 32 threads called warps. In a given multiprocessor, 16 threads (half-warp) are executed at the same time. A time slicing-based scheduler switches between warps to maximize the use of available resources.
There are two kinds of memory. The first is the global memory which is accessible by all multiprocessors. Since this memory is not cached, it is beneficial to ensure that the read/write memory accesses by a half-warp are coalesced in order to improve the performance. The texture memory is a component of the global memory which is cached. The texture memory can be efficient when there is locality in data.
The second kind of memory is the shared memory which is internal to multiprocessors and is shared within a block. This memory, which is considerably faster than the global memory, can be seen as user-managed cache. This memory is divided into banks in such a way that successive 32-bit words are in successive banks. To be efficient, it is important to avoid conflicting accesses between threads. Conflicts are resolved by serializing accesses; this incurs a performance drop proportional to the number of serialized accesses.
As a first step, the audio segment frames are divided into sets of 128 frames. Each set is associated with a multiprocessor running 128 threads. Thus, each thread computes the closest query frame for its associated test frame.
Each thread in the multiprocessor downloads one test audio frame from global memory. At this time, each thread can compute the distance between its audio segment frame and all of the 128 advertisement frames now in shared memory. This operation corresponds to lines 4 to 11 of Algorithm 1. Once all threads are finished, the next 128 advertisement frames are downloaded and the process is repeated.
To increase performance, it is possible to concurrently process several test audio segments and/or queries. A search algorithm that can be used is described in detail in (M. Héritier, V. Gupta, L. Gagnon, G. Boulianne, S. Foucher, P. Cardinal, “CRIM's content-based copy detection system for TRECVID”, Proc. TRECVID-2009, Gaithersburg, Md., USA.) and in V. Gupta, G. Boulianne, and P. Cardinal, “Content-based audio copy detection using nearest-neighbor mapping,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010.
The search using the nearest-neighbor fingerprints is explained below. However, even with a GPU, the processing time is too long when a large set of data is considered. Another approach is to combine both fingerprints.
An example of a search for the test segment that matches the query is illustrated in
In this example, the frames of the query are naturally labeled sequentially. Each frame of the test is labeled as the frame of query that best matches this frame. In the example, test frame zero matches frame four of the query. Once this labeling is complete, appropriate counts are incremented to find the frame with the highest count. In the given example, frame 3 of the test has the highest matching count.
The nearest-neighbor fingerprints are more accurate than the energy-difference fingerprints. However, even with a GPU, the processing time is too long when a large set of data is processed. In order to reduce this time, a two phase search is used. In the first phase, the search uses the energy-difference fingerprints, and then the second phase of the search rescores the matches found using the nearest-neighbor fingerprints. This reduces the computation time significantly while maintaining the search accuracy of nearest-neighbor fingerprints.
The process for performing this search is illustrated at
Tests have been performed with copy detection systems according to the invention in order to assess their performance. The test data used for the performance assessment for copy detection comes from NIST-sponsored TRECVID 2008 and 2009 evaluations (“Final CBCD Evaluation Plan TRECVID 2008”, Jun. 3, 2008, [online] www.nlpir.nist.gov/projects/tv2008/Evaluation-cbcd_vl.3.htm) (W. Kraaij, G. Awad, and P. Over, “TRECVID-2008 Content-based Copy Detection”, [online]. www-nlpir.nist.gov/projects/tvpubs/tv8.slides/CBCD.slides.pdf) (A. Smeaton, P. Over, and W. Kraaij, “Evaluation campaigns and TRECVid”. In Proc. 8th ACM International Workshop Multimedia Information Retrieval (Santa Barbara, Calif.). MIR '06. ACM Press, New York. (http://doi.acm.org/10.1145/1178677.1178722). Most of this data was provided by the Netherlands Institute for Sound and Vision and contains news magazine, science news, news reports, documentaries, educational programming, and archival video encoded in MPEG-1. Other data comes from BBC archives containing five dramatic series. All together, there are 385 hours of video and audio. Both the 2008 and 2009 audio queries contain 201 original queries. The queries for the 2009 submission are different from the 2008 queries. Each audio query goes through seven different transformations for a total of 1,407 audio-only queries. The seven audio transformations for 2008 and 2009 are shown in table 1 below.
In the specific context of detection of copyrighted material (such as songs or movies), the system was developed using audio queries from TRECVID 2008. These are 1,407 queries (201 queries*7 transforms). Since query 166 occurred twice in the test, it was removed from the development set. The duration statistics for the 2008 and 2009 queries are shown in table 2 below.
Evaluation Criteria
The submissions were evaluated using the minimal normalized detection cost rate (NDCR) computed as:
NDCR=PMiss+β·RFA
The PMiss is the probability of a miss error, and RFA is the false alarm rate. β is a constant depending on the test conditions. For example, for no false alarm (no FA) case, β was set to 2000. In this case, even at a low false alarm rate, the value of NDCR will go up dramatically. So in the no FA case, the optimal threshold always corresponded to a threshold where there were no false alarms. In other words, in the no FA case, the minimal NDCR value corresponded to PMiss at the minimal threshold where there were no false alarms. This optimal threshold is computed separately for each transform, so the optimal threshold could be different depending on the transform. There were two different evaluations: optimal and actual. In the actual case, an a priori threshold was provided based on 2008 queries. In the actual case where a priori threshold was provided, this threshold was used for all the transforms to compute the NDCR. If there are any false alarms at that threshold, then the NDCR will be very high, leading to poor results.
For the balanced case, β was set to 2. In computing results, it was found that even for the balanced case, the optimal result turned out to be at the threshold where there were no false alarms. In other words, optimal no FA and balanced results in the case were the same. For detailed evaluation criteria, please see [17].
All the results for the 2008 queries were computed using the software provided by NIST. This software computes the optimal minimal NDCR value for each transform and outputs the results. For the 2009 queries, all the results were computed by NIST.
Results—Energy Difference Fingerprint
The query audio detection using energy difference fingerprints was run on 1,400 queries from 2008 and 385 hours of test audio from TRECVID. The results were compiled for the no FA case. The no FA results were established separately for each transform. Results are also provided when one threshold is used for all the transforms. This corresponds to the real life situation where the transformation that the query has gone through is not known.
For no FA case, results for each transform are given in table 3 below, where the decision threshold for each transform is computed separately.
The first four transforms do not have any extraneous speech added, while the last three add extraneous speech to the query. For the first two transforms, the number of missed test segments is less than 1%. Even for transforms with extraneous speech added, the worst result is 6% missed segments. In no FA case, the minimal normalized detection cost rate (NDCR) corresponds to a threshold with no false alarms: all the errors are due to missed test segments corresponding to the queries. The table below shows minimal NDCR when there is one threshold for all the transforms. In this case the minimal NDCR value more than doubles for the last three transforms.
In order to explain this increase in min NDCR, it is worth considering the distribution of counts for the matching test segments. The table below shows the total number of test segments that match the queries with a given count.
Over 350,000 test segments have a matching count of 35. The counts for matching segments vary between 32 and 2,300. It is worth noting that the counts are consistent: the correct segment has a higher count than the incorrect segments. However, one-third of the queries have no matching segment in the test. This implies that some of these queries could have high counts/sec. that could be higher than other queries with correct matching segments in the test. It so happens that counts/sec. for the first four transforms is higher because they do not have any added speech. Queries that correspond to the first four transforms that have no matching test segments could lead to high rejection threshold that affects the performance of queries that have undergone one of the last three transforms, which is actually the case. The highest count/sec. for a query that is a false alarm is 1.88/sec. for a query with transform 4. Many correct segments for the last three transforms have counts/sec. that are less than 1.88. The number of missed queries with counts/sec. below 1.88 can be calculated by dividing the min NDCR in Table 4 by 0.007.
The average query processing time for the energy difference fingerprints is 15 seconds on an Intel Core 2 quad 2.66 GHz processor (a single processor). For searching through 385 hours of audio, this search speed is very fast.
Results—Nearest Neighbor (NN) Fingerprint
The copy detection using NN-based fingerprints was run on the same 2008 queries and 385 hours of test data. The results in Table 6 for one optimized threshold per transform are better than those in Table 3 for the energy difference fingerprints.
Results for one threshold across all transforms are shown in the first row of Table 7.
These results are nearly the same as those for one threshold per transform, except for a small increase in the minimal NDCR value for transforms 3 and 4. One surprising result is that no segments for transform 6 are missed even though extraneous speech has been added to the queries with this transformation.
The computing expense required for finding the query frame closest to the test frame is significantly higher than that for the energy difference fingerprint. To reduce this expense, the process was implemented on a GPU with 240 processors and 1 Gbyte of memory as discussed earlier. The nearest neighbor computation lends itself easily to parallelization. The resulting average compute time per query is 360 seconds when the fingerprint uses 22 features (12 cepstral features+normalized energy+9 delta cepstra). Even though these parameters are very accurate, they are slower to compute than the energy difference parameters. As we reduce the number of features used to compute the nearest query frame, the results get worse. Table 8 gives the minimal NDCR value for 13 features (12 cepstral features+normalized energy).
The computing time can be reduced by rescoring the results from energy difference parameters with the NN-based features. Rescoring lowers average compute time/query to 20 sec. (15 sec. on CPU+5 sec. on GPU). Even for rescoring using NN-based features, the NN features are computed using a GPU. Minimal NDCR is shown in the second row of Table 7. Compared to energy difference feature (see Table 4), minimal NDCR has dropped significantly.
Table 9 illustrates why NN-fingerprints give such good results. This table shows the total number of test segments that match one of the 2008 audio queries and have a given count.
It should be noted that the number of test segment matches with a given count drops dramatically with increasing counts. The count threshold for no false alarms (no FA) is 23. This implies that none of the queries that are imposters have a matching segment with a count higher than 23. For 2009 queries also, this highest count for false alarms turned out to be 23. When the energy-difference parameter is rescored with NN fingerprints, this highest imposter segment count goes down to 14 (i.e., some of the high scoring imposter queries are filtered out by energy difference parameter). For 2009 queries, it turns out that this highest count was 11, showing the robustness of this feature. Using counts/sec instead of counts increased the minimal NDCR. Counts itself is a good measure of copy detection for nearest-neighbor fingerprint, even across queries of different lengths. Therefore, counts have been used as a confidence measure for the nearest-neighbor fingerprints. (Note all the previous results with the NN-fingerprints use counts). The total number of missed queries with counts below 23 for each transform can be computed by dividing the minimal NDCR in table 7 by 0.007. So the NN-based fingerprints generate false alarms with low counts, and the boundary between false alarms and correct detection is well marked.
Since rescoring energy-difference fingerprints with NN-based fingerprints results in very fast compute times (20 sec./query) and low NDCR, one run was submitted for no FA and one for the balanced case using this rescoring for TRECVID 2009 copy detection evaluation. The only difference between the two submissions was the threshold: for no FA, the threshold corresponds to the count for correct detection just above the highest count for any false alarm (for 2008 queries). For a balanced case, the threshold corresponds to the highest count for any false alarm (for 2008 queries). Table 10 shows the results for 2009 queries.
The results show optimal NDCR and actual NDCR using the thresholds from 2008 queries. The threshold set for computing the actual NDCR is a count of 17 as shown in the last row. First, the optimal results for no FA and for balanced cases are exactly the same. Second, the optimal and actual min NDCR are the same, except for a small difference for transforms three and six. This means that the count of 17 is very close to the optimal threshold for all the transforms. Also, the mean processing time is 20.5 sec. (15.5 sec. on CPU and 5 sec. on GPU). It turns out that these results are the best results for both computing speed and for minimal NDCR. For 2009 queries, the highest score for false alarms turns out to be 11, which is even lower than the score of 14 for the 2008 queries.
Since the results for NN-based feature search are the best and most reliable, one no FA submission was submitted using NN-based features computed using 22 cepstral features. Table 11 shows results for this case.
Compared to the submission that rescores using NN-based features, these results are slightly better for many transforms. However, the overall computing expense has risen from 20.5 sec./query to 376 sec./query. The last row shows the count of 25 that was set as a threshold to use for the actual case. Here also, the actual and optimal min NDCR values are very close, showing that the count of 25 is very close to the optimal threshold for each transform.
Fusion of Energy Difference and NN-Based Fingerprints Results
The two results were fused by combining the counts/sec. from energy-difference fingerprints with counts from NN-based fingerprints. The counts/sec. are multiplied by 15 to achieve a proper balance. Each fingerprint generates 0 or 1 matching segments per query. For segments common in the two fingerprints (same query, overlapping test segment), the weighted scores is added and then the segment corresponding to the NN-based fingerprints is output. For segments not in common, the segment with the best weighted score is output. The results for no FA case for 2008 queries are shown in Table 12.
The results for no FA with just one threshold across all transformations is shown in Table 13.
When Tables 7 and 13 are compared, one can appreciate the significant reduction in minimal NDCR due to fusion. If one averages across all transformations, the minimal NDCR value decreases from 0.016 to O.OOS. Table 14 compares this averaged minimal NDCR for energy difference fingerprints versus NN-based fingerprints versus the fused results for 2008 queries.
Note that rescoring results from energy-difference features with NN-based features results in only a small increase in computing while reducing minimal NDCR from 0.077 to 0.017.
In addition, a further submission was made using this fusion for the balanced case for 2009 queries. The results are shown in Table 15.
The results are good except for the actual minimal NDCR results for transform seven. The threshold given for the actual case was 28.6 as shown in the table and the compute time per query is 390 sec.
Table 16 summarizes the results for the four submissions for 2009 audio queries.
For optimal minimal and actual minimal NDCR, average the NDCR is averaged across all transformations in order to see the relative advantage of each algorithm. The optimal minimal NDCR value keeps decreasing with the improved algorithms. However, the actual minimal NDCR value goes up for the fused results due to transform 7. This was due to false alarms that were above the given threshold. This was brought about by the energy-difference parameter. The variability of imposter counts for energy difference fingerprints was the primary reason for not submitting any runs with energy-difference parameter alone, even though they are the fastest to compute. Note also that the average processing time per query is 20.5 sec. (row 1), while the average query duration is 81.4 sec. So the copy detection algorithms are four times faster than real-time. In other words, one processor can process four queries simultaneously.
Although various embodiments have been illustrated, this was for the purpose of describing, but not limiting, the invention. Various modifications will become apparent to those skilled in the art and are within the scope of this invention, which is defined more particularly by the attached claims.
This application claims the benefit of and priority to U.S. provisional application No. 61/247,728 filed Oct. 1, 2009, the context of which is herein incorporated by reference.
Number | Name | Date | Kind |
---|---|---|---|
5606655 | Arman et al. | Feb 1997 | A |
7359889 | Wang et al. | Apr 2008 | B2 |
20040258397 | Kim | Dec 2004 | A1 |
20050091275 | Burges et al. | Apr 2005 | A1 |
20050243103 | Rudolph | Nov 2005 | A1 |
20060182368 | Kim | Aug 2006 | A1 |
20080317278 | Lefebvre et al. | Dec 2008 | A1 |
20090154806 | Chang et al. | Jun 2009 | A1 |
20100085481 | Winter et al. | Apr 2010 | A1 |
20100247073 | Nam et al. | Sep 2010 | A1 |
20100329547 | Cavet | Dec 2010 | A1 |
Entry |
---|
Cardinal et al., “Content-Based Advertisement Detection”, Interspeech 2010, Makuhari, Chiba, Japan, Sep. 26-30, 2010, pp. 2214-2217. |
Gupta et al., “CRIM's Content-based Audio Copy Detection System for TRECVID 2009”, Multimedia Tools and Applications, 2010, Springer Netherlands, DOI: 10.1007/s11042-010-0608-x, Sep. 30, 2010, 16 pages. |
Kraaij et al., “TRECVID 2009 Content-based Copy Detection task Overview”, 2009, www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.html#2009, Dec. 7, 2009, 76 pages. |
Kraaij et al., “TRECVID 2010 Content based Copy Detection task overview”, 2010, www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.html, Mar. 1, 2011, 26 pages. |
Kraaij et al., “TRECVID-2008 Content-based Copy Detection task Overview”, www-nlpir.nist.gov/projects/tvpubs/tv8.slides/CBCD.slides.pdf, Dec. 17, 2008, 33 pages. |
Li et al., “PKU—IDM @ TRECVid 2010: Copy Detection with Visual-Audio Feature Fusion and Sequential Pyramid Matching”, www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.html, Mar. 1, 2011, 6 pages. |
Liu et al., “AT&T Research at TRECVID 2009 Content-based Copy Detection”, 2009, www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.html#2009, Mar. 1, 2010, 8 pages. |
Over et al., “Guidelines for the TRECVID 2009 Evaluation”, www-nlpir.nist.gov/projects/tv2009/, NIST—National Institute of Standards and Technology, Jan. 26, 2010, 15 pages. |
Mukai et al., “NTT Communications Science Laboratories at TRECVID 2010 Content-Based Copy Detection”, Proc. TRECVID 2010, Gaitersburg, MD, USA, Mar. 1, 2011, 10 pages. |
Office Action for U.S. Appl. No. 13/310,092 mailed Mar. 14, 2013. 24 pages. |
Number | Date | Country | |
---|---|---|---|
20110082877 A1 | Apr 2011 | US |
Number | Date | Country | |
---|---|---|---|
61247728 | Oct 2009 | US |