1. Field of the Invention
The present invention relates to signal analysis and particularly to signal analysis for the purpose of identification of signal content.
2. Description of the Related Art
In order to archive the ever-increasing stock of audio and video material, to establish databases that are easy to search, or to distribute such material via various distribution channels, automatic information recognition systems are necessary that help to identify audio and video material or, more generally, information material unambiguously based on its content.
One application for this is the so-called “broadcast monitoring”. With the help of such an audio-video monitoring system, it is for example intended to ensure that only legal contents are distributed or that the respective royalties for the right holders of the audio and video material are paid correctly.
A further application is, for example, the recognition of audio material that is to be exchanged between partners via peer-to-peer networks.
A further application is the monitoring possibility for the advertising industry to monitor a television or radio station as to whether the booked advertising times have really been broadcast, or whether only parts of the booked advertising share have been broadcast, or whether parts of the commercials have been disturbed during transmission, which may, for example, be the responsibility of the television or radio station. At this point, it is to be noted that particularly the costs for television commercials in popular programs at good broadcasting times are so high that the advertising industry, particularly in view of these high costs, has a vital interest in a monitoring possibility, so that they do not merely have to trust the word of the broadcasting stations. Currently, the monitoring possibility is based on paid “test hearers” or “test viewers”, who continuously watch a certain television program and record, for example, the exact times at which a commercial is transmitted, and who further monitor whether, during the transmission, there has been no disturbance, or whether the whole commercial has been transmitted correctly, i.e. whether there has been no picture distortion, etc.
The disadvantages of this concept are evident. On the one hand, the costs are significant and, on the other hand, the reliability or evidential strength of statements by test hearers and/or test viewers is problematic, particularly if considerable repayment demands are made whose provability depends solely on such test hearers and/or test viewers.
Various known systems may be used for automated broadcast monitoring. For example, WO 02/11123 A2 or the specialist publication: “Invited Talk: An Industrial-Strength Audio Search Algorithm”, Avery Wang, ISMIR 2003, Baltimore, October 2003, disclose systems and methods for recognizing audio and music signals in an environment of strong noise and high distortions. A first step examines whether there is a match between hash values of a reference audio object and the currently determined hash value of the still unidentified audio object. If this is the case, the associated time offset, i.e. the relative distance from the beginning of the audio object, of the hash value in the still unidentified audio object and the time offset of the hash value in the reference audio object are stored under the respective identification of the reference audio object. When all input hash values have been processed, a so-called scanning phase starts. During this phase, it is examined how many time offset pairs per reference audio object are continuous in time. If a certain number is detected, an identification of the corresponding reference audio object is assumed. The time offset pairs are considered to be continuous in time, i.e. temporally associated with each other, when they form a straight line in a two-dimensional scatter plot with one time offset as the x-coordinate and the other one as the y-coordinate.
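By way of illustration only, the scanning phase described above may be sketched as follows. This is a minimal sketch and not the implementation of the cited systems: the function name `scan_offsets`, the tuple layout, the bin width and the threshold `min_matches` are assumptions made for this example. It further assumes that the straight line in the scatter plot has unit slope, so that time-continuous pairs share the same difference between the two time offsets.

```python
from collections import Counter, defaultdict


def scan_offsets(matches, min_matches=3, bin_width=0.1):
    """Return the IDs of reference objects with enough time-continuous pairs.

    matches: iterable of (reference_id, t_unknown, t_reference) tuples, one
    per matching hash value (hypothetical layout chosen for this sketch).
    """
    bins = defaultdict(Counter)
    for ref_id, t_unknown, t_reference in matches:
        # Pairs on a straight line of slope 1 share one offset difference,
        # so they fall into the same bin.
        delta = round((t_reference - t_unknown) / bin_width)
        bins[ref_id][delta] += 1
    return sorted(
        ref_id
        for ref_id, counts in bins.items()
        if max(counts.values()) >= min_matches
    )
```

A reference object whose matches are scattered over many different offset differences does not reach the threshold, whereas one with a consistent offset difference does.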
In the specialist publication “Robust Audio Hashing for Content Identification” by J. Haitsma, T. Kalker, J. Oostveen, in Proceedings of the Content-Based Multimedia Indexing, 2001, url:citeseer.ist.psu.edu/haitsma01robust.html, a system for robust audio hashing for content identification is presented. For content-based music recognition, a hash function is used that associates a bit sequence with a portion of an audio signal, namely such that audio signals that sound similar to a human listener also generate a similar bit sequence. For the calculation of a hash value, the audio signal is first windowed and subjected to a transform, and the transform result is then divided into frequency bands with logarithmic bandwidth. For these frequency bands, the signs of the differences in the time and frequency directions are determined. The bit sequence resulting from the signs constitutes the hash value. One hash value is always calculated for an audio signal length of 3 seconds. If the Hamming distance between a reference hash value and a test hash value to be examined for such a portion is below a threshold s, a match is assumed and the test portion is associated with the reference element.
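By way of illustration only, the sign-based bit derivation and the Hamming distance comparison described above may be sketched as follows. This is one plausible reading of "signs of the differences in the time and frequency directions" and not the code of the cited publication; the function names and the list-based layout of the band energies are assumptions.

```python
def hash_bits(prev_bands, curr_bands):
    """Derive one bit per pair of adjacent bands.

    The band energies are differenced along frequency (band m minus band
    m + 1) and then along time (current frame minus previous frame); the
    sign of the result gives the bit.
    """
    bits = []
    for m in range(len(curr_bands) - 1):
        diff = (curr_bands[m] - curr_bands[m + 1]) - (
            prev_bands[m] - prev_bands[m + 1]
        )
        bits.append(1 if diff > 0 else 0)
    return bits


def hamming_distance(a, b):
    """Count the positions in which two equally long bit sequences differ."""
    if len(a) != len(b):
        raise ValueError("bit sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_match(test_bits, ref_bits, s):
    """A match is assumed when the Hamming distance is below the threshold s."""
    return hamming_distance(test_bits, ref_bits) < s
```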
In order to perform a recognition of audio material, the audio signal is typically split into small units of length Δt. These individual units are each analyzed individually to have at least a certain time resolution.
This causes several problems.
The recognition results of the small analyzed time periods of the audio signal have to be put together so that an unambiguous correct statement on the recognized audio signal can be made for a longer time period.
For the analysis of a continuous audio data stream, transitions from one audio element to another, i.e. a transition from a piece of music A to a piece of music B, should be detected correctly.
There is further the situation in which there are several versions of a piece of music, which, for example, have the same beginning and only start to differ after a certain time. Just think of, for example, short versions or maxi versions of a song. Alternatively, there are also situations in which pieces of music that are based on the same song differ, for example, at the beginning, have an identical middle part and again differ from each other towards the end of at least one of the two pieces of music. For the payment of royalties to copyright holders, it may be important, whether, for example, the maxi version of a song may be played for a higher charge, whether only a normal version may be played for a medium charge, or whether, for a low charge, there may already be played the short version of a song. In this case, it should be possible to reliably distinguish several versions of a song.
The above prior art is unsatisfactory in that it results in detection errors when the results of the individual recognitions are simply put together. In particular, no information is given as to whether and how a continuous audio data stream from several different audio objects may be analyzed, and how corresponding transitions between various audio objects may be detected. In addition, although particularly in the latter prior art the ambiguity of reference hash values is mentioned, no explicit solution for the problem of the determination of an unambiguous candidate is given. If an audio object is considered to be identified for a hash value, for the directly subsequent hash value there is only an examination whether it fits the identified audio object. If this is not the case, there is a new search including all reference audio objects.
Particularly for distinguishing different versions of one and the same song, no solution is known in prior art.
It is the object of the present invention to provide a reliable concept for analyzing an information signal.
In accordance with a first aspect, the present invention provides a device for analyzing an information signal having a sequence of blocks of information units, wherein a plurality of consecutive blocks of the sequence of blocks represents an information entity, using a sequence of fingerprints for the sequence of blocks so that the sequence of blocks is represented by the sequence of fingerprints, having a unit for providing identification results for consecutive fingerprints, wherein an identification result represents an association of a block of information units with a predetermined information entity, and wherein there is a reliability measure for each identification result, wherein the unit for providing is designed to generate a first identification result for a first fingerprint, and to generate a second identification result differing from the first identification result for a following block; a unit for forming at least two hypotheses from the identification results for the consecutive fingerprints, wherein a first hypothesis is an assumption for the association of the sequence of blocks with a first information entity, and wherein a second hypothesis is an assumption for the association of the sequence of blocks with a second information entity, wherein the unit for forming is designed to start the first hypothesis or continue the already existing first hypothesis in response to the first identification result and to start the second hypothesis or to continue the already existing second hypothesis in response to the second identification result; a unit for examining the at least two hypotheses by combining the reliability measures of the hypotheses to obtain an examination result; and a unit for making a statement on the information signal based on the examination result.
In accordance with a second aspect, the present invention provides a method for analyzing an information signal having a sequence of blocks of information units, wherein a plurality of consecutive blocks of the sequence of blocks represents an information entity, using a sequence of fingerprints for the sequence of blocks so that the sequence of blocks is represented by the sequence of fingerprints, having the steps of providing identification results for consecutive fingerprints, wherein an identification result represents an association of a block of information units with a predetermined information entity, and wherein there is a reliability measure for each identification result, wherein, in the step of providing, a first identification result is generated for a first fingerprint and a second identification result differing from the first identification result is generated for a following block; forming at least two hypotheses from the identification results for the consecutive fingerprints, wherein a first hypothesis is an assumption for the association of the sequence of blocks with a first information entity, and wherein the second hypothesis is an assumption for an association of the sequence of blocks with a second information entity, wherein the step of forming includes starting the first hypothesis or continuing the already existing first hypothesis in response to the first identification result, and starting the second hypothesis or continuing the already existing second hypothesis in response to the second identification result; examining the at least two hypotheses by combining the reliability measures of the hypotheses to obtain an examination result; and making a statement on the information signal based on the examination result.
In accordance with a third aspect, the present invention provides a computer program having a program code for performing the above-mentioned method, when the program runs on a computer.
The present invention is based on the finding that a reliable content identification is achieved by not only considering individual recognition results by themselves, but over a certain period of time. For example, there is considerable information usable for recognition in the sequence of individual recognition results for a sequence of fingerprints. According to the invention, a formation of at least two different hypotheses is performed based on a sequence of fingerprints representing a sequence of blocks of an information signal, wherein a first hypothesis is an assumption for the association of the sequence of blocks with a first information entity, and wherein the second hypothesis is an assumption for the association of the sequence of blocks with the second information entity. The at least two hypotheses are now examined and subjected to an evaluation so that a statement on the information signal is made based on an examination result. The statement could, for example, consist in determining that the sequence of blocks represents an information entity having a hypothesis that is most likely. The statement could alternatively or additionally be that an information unit ends with the fingerprint that contributes to the most likely hypothesis as temporally last fingerprint of the sequence of fingerprints.
Preferably, the hypotheses are examined so that there are at least two different identification results for fingerprints, and that there is a reliability measure for each of the two different identification results, wherein this reliability measure may consist in a concrete number. This reliability measure, however, may also be given implicitly so that only by the fact that, for example, two identification results are provided, a reliability of, for example, ½ is signaled, and that this number is not given explicitly.
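By way of illustration only, an implicitly given reliability measure may be sketched as follows. The description above only mentions the case of two identification results signaling a reliability of ½; the generalization to equal shares for any number of results, as well as the function name, are assumptions of this sketch.

```python
def implicit_reliabilities(entity_ids):
    """Assign each identification result an equal, implicit reliability.

    Two results yield 1/2 each; the extension to other counts is an
    assumption made for this sketch.
    """
    if not entity_ids:
        return {}
    share = 1.0 / len(entity_ids)
    return {entity: share for entity in entity_ids}
```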
For the assessment whether a hypothesis is more likely than the other hypothesis, reliability measures of the individual recognitions for the respective number of blocks consecutive in time are advantageously combined, wherein this combination preferably consists in an addition. Then the hypothesis providing the highest combined reliability measure is evaluated to be the most likely hypothesis.
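By way of illustration only, the combination of reliability measures by addition may be sketched as follows. The dictionary layout and the function name are assumptions; in the case of equal sums this sketch simply returns the hypothesis that was listed first.

```python
def most_likely_hypothesis(hypotheses):
    """Pick the hypothesis with the highest combined reliability measure.

    hypotheses: dict mapping an entity ID to the list of reliability
    measures of the individual recognitions belonging to that hypothesis.
    Returns the winning entity ID and the combined measures of all
    hypotheses.
    """
    combined = {entity: sum(values) for entity, values in hypotheses.items()}
    best = max(combined, key=combined.get)
    return best, combined
```

For example, a hypothesis with the measures 0.5, 0.75 and 1.0 (sum 2.25) is preferred over a hypothesis with the measures 0.5 and 0.25 (sum 0.75).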
In a preferred embodiment of the present invention, a fingerprint database in which a number of reference fingerprints is respectively filed in association with an identification result is used as means for providing consecutive identification results. Then a database search is made with the fingerprint generated from a block of the information signal to be analyzed to look for a reference fingerprint providing a match with the test fingerprint within the database. Depending on the design of the database, only the best hit, i.e. the hit with a minimum distance measure, is output as search result by the database as identification result. Also, databases are preferred that provide a hit result not only qualitatively, but also provide a quantitative hit result, so that a number of possible hits with an associated reliability measure is output, so that, for example, all hits with a reliability measure larger than or equal to a certain threshold, such as 20%, are output by the database.
In the preferred embodiment of the present invention, a new hypothesis is started when a new identification result appears for which there is no hypothesis yet. This procedure is performed for a certain number of blocks to then examine directed into the past whether a certain hypothesis that has been found reliable has already ended, to then identify this hypothesis as the most likely hypothesis.
An advantage of the present invention is that the concept works reliably and is nevertheless error-tolerant particularly regarding transmission errors. For example, no attempt is made to make a decision based on a single block, but a sequence of consecutive blocks is, as it were, considered and evaluated together by hypothesis formation, so that short-term transmission disturbances and/or generally occurring noise do not make the whole recognition process useless.
In addition, the inventive concept automatically provides recording of the transmission quality from the beginning to the end, for example of a commercial. Even if a hypothesis has been identified as the most likely hypothesis, i.e. if a certain commercial is determined to have been there, quality variations within the commercial are still traceable based on the reliability measures. Furthermore, in that way particularly the complete time continuity of a commercial as an example of an information entity is traceable and recordable, particularly with respect to the aspect that they did not continuously repeat a part of the commercial, but that the whole commercial was transmitted from the beginning of the commercial to the end of the commercial in a continuous way.
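By way of illustration only, such a record of completeness and transmission quality may be sketched as follows. The sketch assumes that each identification result carries the index of the matched block within the reference commercial together with a reliability measure; the function name, the tuple layout and the threshold `min_reliability` are assumptions.

```python
def audit_commercial(observations, expected_blocks, min_reliability=0.5):
    """Check continuity and quality of one identified commercial.

    observations: list of (reference_block_index, reliability) tuples in
    the order in which the blocks were received.
    Returns (complete, degraded): whether the blocks 0..expected_blocks-1
    were received exactly once and in order, and the indices of blocks
    whose reliability fell below min_reliability.
    """
    indices = [index for index, _ in observations]
    complete = indices == list(range(expected_blocks))
    degraded = [index for index, rel in observations if rel < min_reliability]
    return complete, degraded
```

A commercial in which a part was repeated instead of being transmitted from beginning to end is reported as not complete, and blocks with a low reliability measure indicate where the transmission quality dropped.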
The present invention is further advantageous in that, by hypothesis formation, the end of an information entity and the beginning of an information entity are automatically detected. This is due to the fact that an association with an information entity will generally be unambiguous. This means that several information entities are not played simultaneously at a certain point in time, but that, at least for the vast majority of program contents, only one information entity is contained in the information signal at one point in time. The hypothesis examination and the evaluation of the hypotheses based on the hypothesis examination automatically provide a point in time at which a previous information entity ends and at which a new information entity starts. This is due to the block association maintained in the hypotheses. Thus a sequence of fingerprints still corresponds to a sequence of blocks and, in turn, a sequence of identification results corresponds to a sequence of fingerprints, so that a hypothesis is unambiguously associated with the original information signal with respect to time.
The inventive concept is further advantageous in that there are no “draw” situations between two hypotheses, even if information entities partially have identical audio material, such as short versions or long versions of one and the same song.
Preferred embodiments of the present invention will be explained in detail below with respect to the accompanying drawings, in which:
FIGS. 4a-4c show an exemplary scenario for subsequent examples of application;
FIGS. 5a-5d show a schematic representation of various wrong evaluations;
FIGS. 7a-7c show a representation of the functionality of the inventive concept for the output scenario illustrated in
The device shown in
In any case, the device for analyzing the information signal operates using a sequence of fingerprints for the sequence of blocks, so that the sequence of blocks 802 is represented by the sequence of fingerprints FA1, FA2, FA3, FA4, . . . , FAi. The sequence of fingerprints is fed into a fingerprint input in means 12 for providing identification results for consecutive fingerprints. The means 12 for providing consecutive identification results is operative to provide consecutive identification results for the consecutive fingerprints, wherein an identification result represents an association of a block of information units with a predetermined information entity. Assuming, for example, that a song has a time length corresponding to about six blocks, the six blocks provide different fingerprints, but in the means 12 for providing all these six blocks are signaled to be part of the predetermined information entity, i.e. the mentioned song.
Depending on the implementation, the means 12 for providing will provide one or more identification results for a fingerprint. The one or more identification results are supplied to means 14 for forming at least two hypotheses from the identification results for the consecutive fingerprints. Specifically, a first hypothesis represents an assumption for the association of the sequence of blocks with a first information entity, and the second hypothesis is an assumption for the association of the sequence of blocks with the second information entity. The various hypotheses H1, H2, . . . are supplied to means 16 for examining the hypotheses, wherein the means 16 is designed to operate according to an adjustable examination algorithm to finally provide an examination result at an examination result output 18.
This examination result on line 18 is then provided to means 20 for making a statement on the information signal. The means 20 for making a statement on the information signal is designed to output information on the information signal based on the examination result, and may have various settings.
All settings have in common that the statement on the information signal is made on the basis of the examination result 18. Examples of various statements on the information signal consist in determining that the sequence of blocks represents an information entity having a hypothesis that is most likely. Alternative statements are that an information entity ends with the fingerprint that contributes to the most likely hypothesis as the timewise last fingerprint. An alternative statement that may be made by the means 20 consists in determining that an information entity per se is present in the information signal or not.
The inventive post-processing particularly provided by the means 14, 16 and 20, i.e. forming at least two hypotheses, examining the hypotheses and making a statement on the basis of an examination result, thus not only allows the identification of a piece in an information signal that is unknown, i.e. to be analyzed, but—apart from the identification of a piece itself—also allows the detection of the end of a first piece, i.e. a first information entity, and the detection of the beginning of a second information entity following the first information entity.
Regarding commercial monitoring, the inventive post-processing concept, however, also provides the possibility to detect whether a certain piece was present in the information signal or not. The fingerprints acquired from the information signal would here only be compared to one set of fingerprints, namely the set of fingerprints representing the predetermined information entity, i.e. a certain commercial. This statement is thus not primarily to be considered in the context of identifying an information entity or detecting the end of an information entity and the beginning of a following information entity, but consists in detecting whether a certain information entity is present in an unknown information signal to be analyzed or not.
Depending on the implementation, the whole result table 28 may be supplied to the means 14 for forming at least two hypotheses of
It can be seen from the database 22 in
Furthermore, reference is already made to the last two rows, based on
As already discussed, the database 22, i.e. this implementation of the means 12 for providing identification results for consecutive fingerprints, may be designed such that it always supplies only the most likely identification result. Alternatively, however, the database 22 could also be defined to always supply, for example, only the identification results whose probability is higher than a minimum threshold, such as a threshold of 5%. This would have the result that the number of rows of the table varies from fingerprint to fingerprint. Again alternatively, the database 22 could, however, also be implemented to supply, for each input fingerprint FAi, a certain number of most likely candidates, such as the “top ten”, i.e. the ten most likely candidates, to the means 14 for forming at least two hypotheses.
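By way of illustration only, the three output alternatives of the database described above may be sketched as follows. The function name, the mode names and the tuple layout are assumptions; the default values of 5% and ten candidates are the examples given in the description.

```python
def select_results(candidates, mode="best", threshold=0.05, top_n=10):
    """Select the identification results to be passed on for one fingerprint.

    candidates: list of (entity_id, probability) tuples for one fingerprint.
    mode "best" returns only the most likely result, "threshold" returns
    all results whose probability is higher than the threshold, and
    "top_n" returns the top_n most likely results.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if mode == "best":
        return ranked[:1]
    if mode == "threshold":
        return [c for c in ranked if c[1] > threshold]
    if mode == "top_n":
        return ranked[:top_n]
    raise ValueError(f"unknown mode: {mode}")
```

With the threshold alternative, the number of results, i.e. the number of rows of the table, varies from fingerprint to fingerprint, whereas the other two alternatives return a fixed maximum number of results.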
Subsequently, an implementation of the database 22 will be illustrated based on
The means 14 for forming at least two hypotheses is thus operative to see for each new fingerprint whether there will be a new identification result, to start a new hypothesis, and to continue a hypothesis already started earlier when, for a time period Δti, an element is included in the “top three” or “top x” for the hypothesis already started earlier that, although with less probability, provides an identification result for a hypothesis just started. This procedure is continued for a certain time. Then, for example at predetermined times or triggered by a user, etc., the means 16 for examining the hypotheses will examine the hypotheses formed for the past and, for the case shown in
In the case shown in
The above scenarios thus show that the inventive concept, which works with hypotheses on the basis of post-processing and, on the one hand, considers the sequence and, on the other hand, the reliability measures of the individual fingerprint identification processes, is extraordinarily robust with respect to transmission errors and also with respect to problematic functionalities in the database or also with respect to fingerprints that may not differ as much as would be desirable for some information entities, such as pieces of music, video images, texts, etc.
In a preferred embodiment, a hypothesis is a stored protocol (
At the end of
Next, there is first a more general discussion of database systems based on
In order to identify a piece of music—or also any other audio signal—, a compact and unique data set is extracted therefrom, also referred to as fingerprint or signature. This extraction is done in a block feature extraction 900. In the training or learning phase, such fingerprints are generated from a set of known audio objects and stored in a fingerprint database 902. Preferably, the feature extraction means 900 is designed to use the SFM feature as feature, wherein SFM means “spectral flatness measure”. Of course, other fingerprint generation systems and/or feature extraction results may also be used. However, it has been found that tonality-related features and particularly the SFM feature have a particularly good distinctiveness on the one hand and a particularly good compactness on the other hand. For this purpose, each block is first subjected to a time/frequency conversion, to then calculate an SFM for a block with the values generated from the time/frequency conversion according to the following equation.
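Consistent with the description in the following paragraph, in which the SFM is the quotient of the geometric mean and the arithmetic mean of the spectral components X(n), the equation may be written as follows. The index range n = 0, ..., N-1 is an assumption; the description only states that N is the total number of spectral coefficients.

```latex
\mathrm{SFM} = \frac{\left( \prod_{n=0}^{N-1} X(n) \right)^{1/N}}{\frac{1}{N} \sum_{n=0}^{N-1} X(n)}
```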
In this equation, X(n) represents the square of an absolute value of a spectral component with the index n, wherein N is the total number of spectral coefficients of a spectrum. It may be seen from the equation that the SFM measure is equal to the quotient of the geometric mean of the spectral components and the arithmetic mean of the spectral components. It is known that the geometric mean is always less than or maximally equal to the arithmetic mean, so that the SFM has a value range between 0 and 1. In this context, a value close to 0 indicates a tonal signal, and a value close to 1 indicates a rather noise-like signal with a flat spectral curve. It is to be noted that the arithmetic mean and the geometric mean are only equal if all X(n) are identical, which corresponds to a completely atonal, i.e. noise-like or pulse-like signal. However, if in an extreme case only one spectral component has a very high value, while other spectral components X(n) have very small values, the SFM measure will have a value close to 0, indicating a very tonal signal.
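By way of illustration only, the SFM calculation for one block may be sketched as follows. The function name is an assumption, as is the handling of blocks containing a zero-valued spectral component, which the description does not address; the geometric mean is computed via the mean of the logarithms, which is a numerically safer but mathematically equivalent form of the N-th root of the product.

```python
import math


def spectral_flatness(power_spectrum):
    """Quotient of geometric and arithmetic mean of the values X(n).

    power_spectrum: non-negative values X(n), each the squared absolute
    value of one spectral component of the block.
    """
    n = len(power_spectrum)
    if n == 0:
        raise ValueError("power_spectrum must not be empty")
    if min(power_spectrum) <= 0.0:
        # A zero component forces the geometric mean to 0. Returning 0 also
        # for an all-zero block is an assumption of this sketch.
        return 0.0
    arithmetic = sum(power_spectrum) / n
    geometric = math.exp(sum(math.log(x) for x in power_spectrum) / n)
    return geometric / arithmetic
```

A flat spectrum such as eight identical values yields an SFM of 1, while a spectrum dominated by a single strong component yields a value close to 0.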
The SFM concept as well as other feature extraction concepts to generate fingerprints are, for example, discussed in WO 03/007185.
In the identification phase, illustrated in
According to the invention, now an unknown audio object at the input is not only associated with exactly one reference audio object in the reference database, namely only for a time Δt, but there is a continuous operation without interruption of the data stream at the input. According to the invention, an association of various portions from audio objects with the correct audio objects from the reference database is performed. Thus an unbroken sequence, i.e. a protocol, of the identified audio objects at the input is obtained.
Next, a particular difficulty of the continuous analysis of a continuous audio data stream is represented based on
a represents a long version of a piece of music XY, which is also represented by a long fingerprint illustrated in
Subsequently, there will be an illustration based on
Subsequently,
In addition, further wrong recognition protocols are conceivable, which are generated by the ambiguity of the individual recognitions for a portion of the audio data stream in the time period Δtx.
According to the invention, the general concept illustrated in
In the post-processing stage, the probability for the transition from an identified reference audio object for the time period Δtx to any other reference audio objects for the time period Δtx+1 is assumed to be equal. From this assumption, various hypotheses, which are first considered in parallel, are formed for contiguous audio portions from the individual recognitions. It is to be noted that individual recognitions are combined to form a hypothesis when they are related to one and the same reference audio signal and are time-continuously connected. The recognition protocol results from a combination of the respective most likely hypotheses considering the progress in time. Subsequently, a preferred algorithm is illustrated in detail.
At first, various hypotheses for contiguous audio portions are formed from the individual recognitions for the time periods Δtx (wherein x=N, N+1, N+2, . . . ; wherein tN is the starting time for the respective hypothesis) for each recognized reference audio object.
Individual recognitions are combined to form a hypothesis, if the individual recognitions are consecutive in time in a continuous way.
The time continuity is a further element that serves to determine whether an already existing hypothesis is continued or whether a new hypothesis is started. Consider, for example, the scenario in which a certain guitar solo, for example, in a piece is situated rather at the beginning of the piece in the short version of the piece and is situated rather in the middle of the piece in a long version of the piece.
In a preferred embodiment, the database, i.e. the means for providing identification results, not only outputs a fingerprint identification, but also a time value which results from the identification fingerprint in the database having a length and the input (short) fingerprint only matching part of the (long) fingerprint in the database.
In the scenario described above, the database would perhaps provide two ID results for the guitar solo (short version and long version), but with two different time indices. The time index for the ID result for the short version is smaller than the time index for the long version. On the basis of the time index, the means for forming the hypotheses is now capable of continuing hypotheses (if there is time continuity between the time index and the last time index in the hypothesis) or starting new hypotheses, if there is no continuity in the currently obtained time index and a last time index of a hypothesis.
Each time discontinuity with respect to a reference audio object generates a new hypothesis, if the following element has a larger distance in time than a time distance Ta to be set, or if the following element is temporally before the previous one.
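By way of illustration only, the time continuity rule described above may be sketched as follows for the individual recognitions belonging to one reference audio object. The function names are assumptions, and the parameter `max_gap` plays the role of the time distance Ta; an element with the same time index as the previous one is treated as continuous, which is also an assumption.

```python
def continues_hypothesis(last_time_index, new_time_index, max_gap):
    """Return True if the new element continues the existing hypothesis."""
    if new_time_index < last_time_index:
        # The following element lies temporally before the previous one.
        return False
    # A distance larger than max_gap (Ta) is a time discontinuity.
    return (new_time_index - last_time_index) <= max_gap


def split_into_hypotheses(time_indices, max_gap):
    """Group time indices into hypotheses, starting a new one at each
    time discontinuity."""
    hypotheses = []
    for t in time_indices:
        if hypotheses and continues_hypothesis(hypotheses[-1][-1], t, max_gap):
            hypotheses[-1].append(t)
        else:
            hypotheses.append([t])
    return hypotheses
```

For the time indices 0, 1, 2, 10, 11, 5, 6 and a maximum distance of 2, three hypotheses result: the jump from 2 to 10 exceeds the maximum distance, and the step from 11 back to 5 lies before the previous element.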
For the hypothesis examination, an addition of the confidence measures, i.e. the reliability values and/or the measures for the plausibility, of the individual recognitions is made for each hypothesis.
Starting with the time period Δt0, the hypothesis with the highest confidence measure is then evaluated to be true and adopted into the recognition protocol. For the next time period following the first hypothesis, the hypothesis with the highest confidence measure is again evaluated to be true and adopted into the recognition protocol, etc.
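By way of illustration only, the adoption of hypotheses into the recognition protocol may be sketched as follows. This is a simplified reading of the procedure described above: at each step, the earliest remaining hypothesis and all hypotheses existing in parallel to it are compared, the one with the highest confidence measure is adopted, and the procedure continues with the time period following the adopted hypothesis. The function name, the tuple layout and the treatment of equal confidence measures are assumptions.

```python
def build_protocol(hypotheses):
    """Build a recognition protocol from parallel hypotheses.

    hypotheses: list of (entity_id, first_block, last_block, confidence)
    tuples, where confidence is the sum of the confidence measures of the
    individual recognitions of the hypothesis.
    """
    remaining = sorted(hypotheses, key=lambda h: h[1])
    protocol = []
    while remaining:
        seed = remaining[0]
        # Hypotheses that exist in parallel to the earliest remaining one.
        parallel = [h for h in remaining if h[1] <= seed[2]]
        chosen = max(parallel, key=lambda h: h[3])
        protocol.append(chosen)
        # Continue with the time period following the chosen hypothesis;
        # hypotheses that started before its end are discarded.
        remaining = [h for h in remaining if h[1] > chosen[2]]
    return protocol
```

In a scenario with a hypothesis for ID108 over blocks 0 to 7 and a weaker parallel hypothesis for ID109 over blocks 1 to 5, only ID108 is adopted for that time span, and the next entry of the protocol is taken from the hypotheses starting after block 7.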
For the above example, the result is thus a process illustrated based on
The means 14 (
Some time after time t7, the hypothesis situation shown in
Assuming that, between t1 and t5, the identification results ID108 and ID109 occur with the same probability, only the first hypothesis H1 will win in the embodiment shown in
Starting at t0, the hypothesis H1 is thus chosen, because until t7 there is no hypothesis with a higher confidence measure. The hypothesis H2 is discarded, wherein, in principle, all hypotheses can be discarded that exist in parallel to another hypothesis that has been chosen as the most likely one.
According to the invention, there is thus recorded exactly the sequence, in this example an element, namely ID108, that was really played at the audio input.
It is to be noted that there are various possibilities for the determination of the end of a hypothesis. For example—independent of the hypothesis situation—an information entity end may be determined, for example, from the audio signal itself, for example if there is a pause with a certain minimum length. Since, however, this criterion does not work if there is fading between two information entities or if two pieces follow each other so quickly that no noticeable pause can be found, it is preferred to determine an information entity end based on the hypotheses considered in the past. This may be done, for example, such that a hypothesis is considered to have ended when, for example, two or more blocks that have no longer any identification result with a reliability value above a certain minimum threshold are provided to the means 14 for forming hypotheses. Alternatively, for example for the case shown in
The above discussion shows that the end of a hypothesis does not necessarily have to be determined actively, but that this end may automatically result from the analysis of the past, i.e. the started hypotheses. Preferably, a new hypothesis is started whenever a new identification result with a reliability measure above a significance threshold appears, wherein then the past is examined at some time to see which hypothesis survives for a certain time period, wherein it is not necessary to explicitly determine an end of a hypothesis for this purpose, because it is an automatic result.
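By way of illustration only, the starting of hypotheses and the automatically resulting hypothesis ends may be sketched as follows. The function name, the data layout and the default values are assumptions. As a simplification, an already started hypothesis is continued by any identification result for its information entity, and it is considered to have ended after `max_missing` consecutive blocks without such a result.

```python
def form_hypotheses(blocks, significance=0.2, max_missing=2):
    """Form hypotheses from the identification results of consecutive blocks.

    blocks: list with one dict {entity_id: reliability} per block.
    Returns a list of dicts with the keys id, first, last and confidence.
    """
    open_hyps = {}  # entity_id -> hypothesis still being continued
    missing = {}    # entity_id -> consecutive blocks without a result
    finished = []
    for index, results in enumerate(blocks):
        for entity, reliability in results.items():
            if entity in open_hyps:
                hyp = open_hyps[entity]
                hyp["last"] = index
                hyp["confidence"] += reliability  # combination by addition
                missing[entity] = 0
            elif reliability >= significance:
                # New identification result above the significance threshold.
                open_hyps[entity] = {
                    "id": entity,
                    "first": index,
                    "last": index,
                    "confidence": reliability,
                }
                missing[entity] = 0
        for entity in list(open_hyps):
            if entity not in results:
                missing[entity] += 1
                if missing[entity] >= max_missing:
                    finished.append(open_hyps.pop(entity))
                    del missing[entity]
    finished.extend(open_hyps.values())
    return finished
```

The end of each hypothesis is thus not determined actively; it results from the last block for which an identification result was recorded in the hypothesis.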
Depending on the circumstances, the inventive method may be implemented in hardware or in software. The implementation may be done on a digital storage medium, particularly a floppy disk or CD with control signals that may be read out electronically, which may cooperate with a programmable computer system so that the method is performed. In general, the invention thus also consists in a computer program product with a program code stored on a machine-readable carrier for performing the inventive method when the computer program product runs on a computer. In other words, the invention may thus be realized as a computer program with a program code for performing the method when the computer program runs on a computer.
While this invention has been described in terms of several preferred embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
10 2004 023 436 | May 2004 | DE | national |
This application is a continuation of copending International Application No. PCT/EP2005/005004, filed on May 9, 2005, which designated the United States and was not published in English.
Number | Name | Date | Kind |
---|---|---|---|
6597802 | Bolle et al. | Jul 2003 | B1 |
7460994 | Herre et al. | Dec 2008 | B2 |
7574313 | Disch et al. | Aug 2009 | B2 |
7580832 | Allamanche et al. | Aug 2009 | B2 |
7676336 | Herre et al. | Mar 2010 | B2 |
Number | Date | Country |
---|---|---|
101 29 635 | Jun 2001 | DE |
WO 0104870 | Jan 2001 | WO |
WO 0211123 | Feb 2002 | WO |
Number | Date | Country |
---|---|---|
20070127717 A1 | Jun 2007 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/EP2005/005004 | May 2005 | US |
Child | 11557023 | | US |