The invention relates to a method and arrangement for storing fingerprints identifying audio-visual media signals in a database. The invention also relates to a method and arrangement for identifying an unknown audio-visual media signal.
A fingerprint (in literature also referred to as signature or hash) is a digital summary of an information signal. In cryptography, hashes have been used for a long time to verify correct reception of large files. Recently, the concept of hashing has been introduced to identify multi-media content. Unknown content such as an audio or video clip is recognized by comparing a fingerprint extracted from said clip with a collection of fingerprints stored in a database. In contrast with a cryptographic hash, which is extremely fragile (flipping a single bit in the large file will result in a completely different hash), a fingerprint extracted from audio-visual content must be robust. To a large extent, it must be invariant to processing such as compression or decompression, A/D or D/A conversion.
A prior-art fingerprinting system is disclosed in Haitsma et al.: Robust Hashing for Content Identification, published at the Content-Based Multimedia Indexing (CBMI) conference in Brescia (Italy), 2001. As described in this article, the fingerprint is derived from a perceptually essential property of the content, viz. the distribution of energy in bands of the audio frequency spectrum. For video signals, the distribution of luminance levels in video images has been proposed to constitute the basis for a robust fingerprint.
A fingerprint is created by dividing the signal into a series of (possibly overlapping) frames, and extracting a hash word representing the perceptually essential property of the signal within each frame to obtain a respective series of hash words. In order to identify an unknown clip, the database receives the series of hash words concerned, and searches for the most similar stored series of hash words. Similarity is measured by determining how many bits of the series match a series of hash words in the database. If the BER (Bit Error Rate, the percentage of the non-matching bits) is below a certain threshold, the clip is identified as the song or movie from which the most similar series of hash words in the database originates.
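The similarity measure described above can be sketched as follows. This is a minimal illustration, assuming 32-bit hash words held as Python integers and a BER expressed as a fraction between 0 and 1; the function name and the sample values are hypothetical.

```python
def bit_error_rate(block_a, block_b, bits_per_word=32):
    """Fraction of differing bits between two equally long series of hash words."""
    if len(block_a) != len(block_b) or not block_a:
        raise ValueError("series must be non-empty and of equal length")
    # XOR leaves a 1 wherever the two words disagree; count those bits.
    differing = sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))
    return differing / (len(block_a) * bits_per_word)


# Hypothetical sample data: two bits differ out of 3 x 32 = 96.
stored = [0xFFFF0000, 0x12345678, 0x00000000]
query = [0xFFFF0001, 0x12345678, 0x80000000]
print(bit_error_rate(stored, query))
```

A clip would be accepted when this value falls below the chosen threshold.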
A problem of the prior-art fingerprinting method is the size of the database. In the Haitsma et al. article, the audio signal is divided into frames of approximately 0.4 seconds with an overlap of 31/32. This results in a new frame every 11.6 ms (≈0.4/32). For every frame, a 32-bit hash word is extracted. Accordingly, a 5-minute song needs approximately 100 kbytes, viz. 5 (minutes)×60 (seconds)×4 (bytes per hash word)/0.0116 (seconds per hash word). Needless to say, the database must have a huge capacity to allow recognition of a large repertoire of songs. Similar considerations apply to video fingerprinting systems.
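The storage estimate above can be reproduced with a few lines of arithmetic. The sketch below uses the figures quoted in the text (11.6 ms per hash word, 4 bytes per hash word); the function name is hypothetical.

```python
def fingerprint_size_bytes(duration_s, hop_s=0.0116, bytes_per_word=4):
    """Approximate storage for the full series of hash words of one signal."""
    n_words = duration_s / hop_s  # one hash word per frame
    return n_words * bytes_per_word


full = fingerprint_size_bytes(5 * 60)  # 5-minute song
print(round(full))  # roughly 100 kbytes
```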
It is an object of the invention to provide a method and system for storing fingerprints in a database, which alleviates the above-mentioned problem. It is also an object of the invention to provide a method and system for identifying an unknown audio-visual signal in such a database.
To this end, the invention provides a method for storing fingerprints in a database as defined in independent claim 1. The method differs from the prior art in that only a sub-sampled sequence of hash words (i.e. one out of every M hash words) is stored in the database. The word “sequence” is used in this claim to refer to a full-length signal (song or movie). A storage reduction by a factor M is achieved.
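Sub-sampling by a factor M amounts to keeping every M-th hash word. A minimal sketch, assuming the sequence is held as a Python list; the function name and the `offset` parameter are hypothetical and only serve to show that M different sub-sampled sequences are possible.

```python
def subsample(hash_words, m, offset=0):
    """Keep one out of every m hash words, starting at position `offset`."""
    return hash_words[offset::m]


sequence = list(range(12))      # stand-in for the hash words of a full-length signal
stored = subsample(sequence, 4)
print(stored)                   # [0, 4, 8]
```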
A method of identifying an unknown audio-visual signal in such a database is defined in independent claim 4. As there is uncertainty as to which of M possible sub-sampled sequences of hash words is stored in the database, a full (i.e. not sub-sampled) series of hash words is extracted from the unknown clip in accordance with this method. The word “series” is used here to refer to a possibly short segment or clip of the unknown signal. Interleaved sub-series of hash words are now successively applied to the database for matching with the sub-sampled sequences stored therein. If at least one of the applied sub-series has a BER below a certain threshold, the signal is identified.
It is achieved with the invention that the storage requirements are reduced (by a factor M), while the robustness and the reliability of the prior-art identification method are maintained.
Further advantageous embodiments of the methods are defined in the dependent claims.
The invention will be described for audio signals.
The first operational mode (storage) of the arrangement will be described first. In this mode, the arrangement receives a full-length music song K(t). The signal is divided, in a framing circuit 11, into time intervals or frames F(n) having a length of approximately 0.4 seconds and weighted by a Hanning window with an overlap of 31/32. The overlap is used to introduce a large correlation between subsequent frames. For audio signals, this is a prerequisite because the framing applied to unknown signals to be recognized may be different.
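The framing step can be illustrated in software. The sketch below is an assumption-laden stand-in for the framing circuit: it works on a list of samples, uses the symmetric form of the Hann window, and takes the hop size as the frame length divided by 32 (an overlap of 31/32). All names are hypothetical.

```python
import math


def hann_window(length):
    """Hanning (Hann) weights for a frame of `length` samples (symmetric form)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (length - 1)) for i in range(length)]


def frames(samples, frame_len, overlap_den=32):
    """Yield windowed frames; consecutive frames start frame_len/overlap_den samples apart."""
    hop = frame_len // overlap_den
    window = hann_window(frame_len)
    for start in range(0, len(samples) - frame_len + 1, hop):
        yield [s * w for s, w in zip(samples[start:start + frame_len], window)]


# Toy signal: 128 samples, frames of 64 samples, hence a hop of 2 samples.
framed = list(frames([1.0] * 128, frame_len=64))
print(len(framed))  # 33
```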
The framing circuit 11 generates a new frame every 11.6 ms (≈0.4/32). A hash extracting circuit 12 generates a 32-bit hash word H(n) for every frame. A practical embodiment of such a hash extracting circuit is described in the Haitsma et al. article referred to in the chapter Background of the Invention. Briefly summarized, the circuit divides the frequency spectrum of each audio signal frame into frequency bands and produces for each band a hash bit indicating whether the energy in said band is above or below a given threshold.
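The brief summary above can be sketched as follows. This follows only the one-sentence description given here, not the details of the cited article; the band energies are assumed to be already computed, and the function name, the bit ordering and the sample values are hypothetical.

```python
def hash_word(band_energies, threshold):
    """Pack one bit per band: 1 if the band energy exceeds the threshold, else 0."""
    word = 0
    for energy in band_energies:
        word = (word << 1) | (1 if energy > threshold else 0)
    return word


# Four bands for brevity; 32 bands would give a 32-bit hash word.
word = hash_word([0.9, 0.1, 0.5, 0.7], threshold=0.4)
print(format(word, "04b"))  # 1011
```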
In accordance with the invention, the sequence of hash words is sub-sampled by a factor M by a sub-sampler 13, which produces a sub-sequence H′(n). The sub-sequence of hash words, along with identification data such as title of the song, name of the artist, etc., constitutes a fingerprint of the known music song. Such a fingerprint is shown in
The second operational mode (identification) of the arrangement will now be described. In this mode, the arrangement receives a part (say, 3 seconds) of an unknown song, i.e. an audio clip U(t). The clip is processed by a similar (or the same) framing circuit 11 and hash extracting circuit 12 as described above. The hash extracting circuit 12 extracts a full hash block (no sub-sampling) of the clip. For a 3-second clip, this operation yields a series of approximately 256 hash words H(n). Such a series of hash words representing the unknown audio clip is also referred to as a hash block. In an alternative embodiment, the hash block has been extracted by a remote station and is merely received by the arrangement.
The hash block is applied to an interleaving circuit 16, which divides it into M interleaved sub-series or sub-blocks H0(n), H1(n), . . . HM−1(n), where M is the same integer as used in the sub-sampler 13 described above.
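The interleaving can be sketched in a few lines, assuming the hash block is a Python list: sub-block k collects the hash words at positions k, k+M, k+2M, and so on. The function name is hypothetical.

```python
def interleave(hash_block, m):
    """Split a hash block into m interleaved sub-blocks H0 .. H(m-1)."""
    return [hash_block[k::m] for k in range(m)]


block = list(range(8))          # stand-in for the hash words of a clip
sub_blocks = interleave(block, 4)
print(sub_blocks)               # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Whatever the start position of the clip within the original signal, exactly one of the M sub-blocks falls on the same sampling grid as the stored sub-sampled sequence.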
The sub-blocks are applied to respective inputs of a selection circuit 17. Under the control of the computer 15, the sub-blocks H0(n), H1(n), . . . HM−1(n) are successively applied to the database 14 for identification. If a series of hash words is found in the database, for which the bit error rate BER (i.e. the percentage of non-matching bits between said series and the applied sub-block) is below a certain threshold, the fingerprint comprising said series of hash words identifies the unknown audio clip.
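The identification loop can be illustrated with a toy example. The sketch below assumes a brute-force scan of an in-memory dictionary, which is not necessarily how the database 14 is searched; the names, the threshold value and the random test data are all hypothetical. Because the toy hash words are random rather than correlated in time, only the sub-block that lines up with the stored sampling grid matches.

```python
import random


def ber(a, b, bits=32):
    """Fraction of differing bits between two equally long series of hash words."""
    diff = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return diff / (len(a) * bits)


def identify(hash_block, database, m, threshold=0.35):
    """Apply each interleaved sub-block in turn; return (title, BER, sub-block index) or None."""
    best = None
    for k in range(m):
        sub_block = hash_block[k::m]
        for title, stored in database.items():
            # Slide the sub-block along the stored sub-sampled sequence.
            for pos in range(len(stored) - len(sub_block) + 1):
                rate = ber(sub_block, stored[pos:pos + len(sub_block)])
                if rate < threshold and (best is None or rate < best[1]):
                    best = (title, rate, k)
    return best


rng = random.Random(0)
m = 4
song = [rng.getrandbits(32) for _ in range(400)]
other = [rng.getrandbits(32) for _ in range(400)]
database = {"known song": song[::m], "other song": other[::m]}

clip = song[102:166]            # unknown clip starting off the stored sampling grid
result = identify(clip, database, m)
print(result)                   # ('known song', 0.0, 2)
```

The clip starts at position 102, so only sub-block 2 (positions 104, 108, ...) coincides with hash words that were actually stored.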
If the BER is below the threshold, the audio clip has been identified. The title and performer of the song as stored in the database (23 in
It is achieved with the invention that the database capacity is reduced by a factor M. It should be noted that the same reduction can effectively be achieved by choosing a different frame overlap, viz. ⅞ in the present example (M=4). This is true as far as the first operational mode (storage) is concerned. However, if the same overlap of ⅞ without interleaving were chosen in the identification process, the robustness and reliability of the identification would be seriously affected. The invention resides in the concept of interleaving in the second operational mode (identification). It is achieved thereby that at least one of the interleaved sub-blocks is derived from a series of frames that substantially matches (in time) the series of frames from which the stored hash words have been derived. The identification process in accordance with the invention yields substantially the same robustness and reliability as the prior-art (non-interleaving) method with an overlap of 31/32. A mathematical background thereof will now be given.
When sub-sampling by a factor M is applied and the bits in a hash block are random and i.i.d. (independent and identically distributed), the standard deviation of the BER increases by a factor √M. This implies that the robustness, the reliability, or both are affected. If the threshold on the BER is kept the same, then the robustness is unaffected but the reliability decreases. If, on the other hand, the threshold is decreased by an appropriate amount, then the reliability stays the same but the robustness decreases.
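The factor √M follows from the variance of an average of independent bits. As a short worked derivation, let N denote the number of bits in the full hash block and p the probability that a single bit does not match (both symbols are introduced here for illustration only):

```latex
\sigma_{\mathrm{BER}} = \sqrt{\frac{p\,(1-p)}{N}},
\qquad
\sigma_{\mathrm{BER}}^{(M)} = \sqrt{\frac{p\,(1-p)}{N/M}} = \sqrt{M}\;\sigma_{\mathrm{BER}}
```

Sub-sampling leaves only N/M bits for the comparison, so under the i.i.d. assumption the spread of the measured BER grows by √M.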
However, the bits in a hash block of an audio-visual media signal have a large correlation along the time axis, which is introduced by the large overlap of the framing and inherent correlation in music. Therefore, the standard deviation s does not increase by the factor √M when sub-sampling with the factor M is applied. Experiments have shown that, for small values of M, the standard deviation does not even increase significantly at all. In a practical system without sub-sampling, the threshold on BER is set to 0.35. If sub-sampling by a factor M=4 is applied, then the threshold has only to be lowered to 0.342. Therefore, the decrease of robustness is insignificant, whilst the needed storage in the database has been decreased by a factor of 4. Furthermore, the time needed to search a hash database will decrease simply because there are 4 times fewer hash values in the database.
The search speed can be increased even further by refraining from applying a further sub-block to the database if one of the sub-blocks (generally the first) appears to have a BER which is larger than a further threshold (which is substantially larger than the identification threshold T on the BER). Because of the large correlation between sub-blocks (due to the frame overlap and inherent correlation in music), it is unlikely that another sub-block will have a significantly lower BER.
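The early-termination rule can be sketched as follows. The database search is abstracted into a hypothetical callback `best_ber_for` that returns the lowest BER found for a sub-block; the function name, the threshold values and the BER figures are illustrative assumptions.

```python
def search_with_early_exit(sub_blocks, best_ber_for, threshold, give_up_threshold):
    """Try sub-blocks in turn; stop early on a match or on a hopelessly high BER."""
    for index, sub_block in enumerate(sub_blocks):
        rate = best_ber_for(sub_block)
        if rate < threshold:
            return index, rate      # identified
        if rate > give_up_threshold:
            return None             # remaining sub-blocks are unlikely to do better
    return None


calls = []


def lookup(sub_block):
    """Hypothetical stand-in for a database search returning the best BER."""
    calls.append(sub_block)
    return {"s0": 0.40, "s1": 0.30, "s2": 0.33, "bad": 0.49}[sub_block]


found = search_with_early_exit(["s0", "s1", "s2"], lookup, threshold=0.35, give_up_threshold=0.45)
print(found, calls)   # (1, 0.3) ['s0', 's1']
```

In the example, the first sub-block neither matches nor exceeds the further threshold, so the second one is tried and matches; the third is never applied to the database.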
A robust fingerprinting system is disclosed. Such a system can recognize unknown multimedia content (U(t)) by extracting a fingerprint (a series of hash words) from said content, and searching for a resembling fingerprint in a database in which fingerprints of a plurality of known contents (K(t)) are stored. In order to more efficiently store the fingerprints in the database and to speed up the search, the hash words (H(n)) of known signals (K(t)) are sub-sampled (13) by a factor M prior to storage in the database (14). The hash words (H(n)) of unknown signals (U(t)) are divided (16) into M interleaved sub-series (H0(n) . . . HM−1(n)). The interleaved sub-series are selectively (17) applied to the database (14) under the control of a computer (15). If at least one of the sub-series sufficiently matches a stored fingerprint, the signal is identified.
Number | Date | Country | Kind |
---|---|---|---|
02075498 | Feb 2002 | EP | regional |

Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/IB03/00217 | 1/21/2003 | WO | 00 | 8/2/2004 |

Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO03/067466 | 8/14/2003 | WO | A |

Number | Date | Country |
---|---|---|
20050141707 A1 | Jun 2005 | US |