The invention relates to a method for characterizing, according to specific parameters, a sound signal developing over time in different frequency bands.
The field of the invention is that of sound signal recognition applied in particular to the identification of musical works used without authorization.
In fact, the development of digitizing methods and of multimedia has caused a considerable increase in such fraudulent uses. The result is a new problem for those agencies charged with collecting royalties, since there must be some way to identify these uses, especially on interactive digital networks such as the Internet, in order to satisfactorily assess and distribute the compensation due to the authors of these musical works.
Consequently, in order not to be limited to musical works, a sound signal is more generally considered.
The object of the present invention is then to create a database of sound signals, each sound signal being characterized by one fingerprint such that, given an unknown sound signal that is characterized in this same fashion, a search can be executed and a rapid comparison made of the fingerprint of said unknown signal with the universe of fingerprints in the database.
The fingerprint is constituted of specific parameters determined in the following fashion.
In a first step, the sound signal is broken down in that its amplitude x(t) varies with time t, according to different frequency bands k: x(k, t) is the amplitude of the sound signal filtered into the frequency band k and represented in
As represented in
These values E(k, t) constitute the specific parameters of an extract of 2N seconds of the sound signal x(k, t) in the frequency band k.
Other parameters can be obtained by calculating the energy of E(k, t) for the different frequency bands j by using a window h′(t) represented in
Thus standardized, these values constitute specific parameters of an extract of 2N′ seconds of the sound signal x(k, t) in the k band of frequencies.
One can also calculate the phase of E(k, t) for different bands of frequencies j: one obtains P(j, k, t). The P(j, k, t) values are standardized with respect to a reference value P(1, j, t) and one then obtains other specific parameters of an extract of 2N′ seconds of sound signal.
Other parameters can be added such as the mean value of the E(k, t) energy.
The object of the invention is a method for characterizing in accordance with specific parameters a sound signal x(t) evolving according to the time t over a duration D in different bands of frequencies k and then written x(k, t), principally characterized in that it consists of storing the signal x(t), calculating the energy E(k, t) of said signal x(k, t) for each of said bands of frequencies k, k varying from 1 to K and according to a temporal window h(t) of a duration of 2N, storing the values of the energy E(k, t) obtained, these values constituting the specific parameters of an extract of a duration of 2N of the sound signal x(t) and reiterating this calculation at regular intervals, in order to obtain the universe of specific parameters for the duration D of the sound signal x(t).
In addition, it consists of calculating and storing the energy F(j, k, t) of E(k, t) for the bands of frequencies j, j varying from 1 to J, according to a temporal window h′(t) of a duration of 2N′, the J×K values of the energy F(j, k, t) obtained constituting the specific parameters of an extract of a duration of 2N′ of the sound signal x(t), and of reiterating this calculation at regular intervals, in order to obtain the universe of specific parameters for the duration D of the sound signal x(t).
It may consist of calculating the phase P(j, k, t) of the energy E(k, t) for the bands of frequencies j, j varying from 1 to J with j being different from k, and including the values of the phase P(j, k, t) obtained among the specific parameters of the sound signal x(t).
It can also consist of calculating the mean value of the energy E(k, t) over 2N′ seconds for each frequency band j, reiterating this calculation at regular intervals in order to obtain the universe of specific parameters for the duration D of the sound signal x(t), and including the mean values so obtained among the specific parameters of the sound signal x(t).
According to one feature, it consists of taking into account the specific parameters of a sound signal x(t) as the components of a vector representing x(t), of positioning the vectors in a space of as many dimensions as there are parameters, of defining classes including the most proximate vectors and of recording said classes.
The classes having inter-class distances and intra-class distances, the method advantageously consists of selecting from among the specific parameters those parameters making it possible to obtain relatively large inter-class distances with respect to the intra-class distances, and of recording the selected parameters.
The invention relates also to a device for identifying a sound signal, characterized in that it comprises a database service comprising means for implementing the method for characterizing a sound signal according to specific parameters as described hereinbefore and the means for executing a search for said signal in the database.
Preferably, the search means comprise means for directly recognizing the class to which said sound signal belongs and means for executing a search for the class by comparison of the specific parameters of the unknown sound signal with those of the database, the class being chosen, for example, using the nearest-neighbor method.
Other characteristics and advantages of the invention will become more apparent when reading the description provided by way of example and non-limitingly and with reference to the appended drawings, wherein:
FIGS. 1a, 1b and 1c represent, respectively, the diagrammatic plottings of the variation of a sound signal x(ki, t) filtered into a band of frequencies ki, a Hamming window h(t) and the short-term energy E(ki, t) of the signal x(ki, t);
FIGS. 2a, 2b and 2c represent, respectively, the diagrammatic plottings of the variation of energy E(ki, t) for the frequency band ki, a Hamming window h′(t) and the energy F(jm, ki, t) of E(ki, t) for the band of frequencies jm.
The sound signals that are processed according to this method of characterization are recorded sound signals, particularly on compact disks.
In the following, it will be considered that the sound signal x(t) is a digital signal sampled at a sampling frequency of fe, for example 11,025 Hz corresponding to one quarter of the current sampling frequency for compact disks, which is 44,100 Hz.
An analog sound signal can of course also be characterized: it must first be converted into a digital signal by means of an analog-digital converter.
The sound signal x(k, t) represented in
The short-term energy E(k, t) represented in
E(k, t) is the square of the modulus of a transformation of the sampled sound signal x(t) in the time-frequency plane or in the time-scale plane. Among the transformations that can be utilized are the Fourier transformation, the cosine transformation, the Hartley transformation and the wavelet transformation. A bank of band-pass filters also performs this type of transformation. The short-term Fourier transformation makes possible a time-frequency representation adapted to musical signal analysis. Accordingly, the energy E(k, t) is written:
The window is slid over the sound signal every S seconds, for example every 10 ms. E(k, t) is thus sampled every 10 ms: E(k, t0), E(k, t1) with t1 = t0 + 10 ms, etc. are obtained.
Thus, every S seconds, the sound signal x(t) is coded by a vector having K components E(k, t), each of these components coding the energy of 23 ms of the sound signal x(t) in one of the K bands of frequencies.
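By way of illustration only, the short-term energy calculation described above can be sketched as follows, assuming a short-term Fourier transformation with a Hamming window of 256 samples (about 23 ms at 11,025 Hz) and a step of 10 ms. The function name `short_term_energy`, its parameter names, and the choice of discarding the DC and Nyquist bins to leave 127 bands are assumptions made for this sketch and are not stated in the description.

```python
import numpy as np

def short_term_energy(x, fe=11025, win_len=256, hop_s=0.010):
    """Return E[k, t]: squared STFT modulus per band k and frame t."""
    hop = int(round(hop_s * fe))        # 10 ms -> 110 samples at 11,025 Hz
    h = np.hamming(win_len)             # temporal window h(t), about 23 ms
    n_frames = 1 + (len(x) - win_len) // hop
    # Cut the signal into overlapping frames and apply the window to each.
    frames = np.stack([x[i * hop : i * hop + win_len] * h
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)   # shape (n_frames, win_len // 2 + 1)
    energy = np.abs(spectrum) ** 2           # square of the modulus
    # Assumption: drop the DC and Nyquist bins, leaving 127 bands.
    return energy[:, 1:-1].T                 # shape (K, n_frames)
```

With these assumed settings the bands are spaced about 43 Hz apart, and each column of the returned array is one vector of K energies.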
Other parameters are obtained by reproducing in the same fashion the aforementioned calculations and applying them each time to E(k, t) as represented in
The energy E(k, t) is filtered into J different bands of frequencies: E(j, k, t) is the energy E(k, t) filtered into the band of frequencies j, j varying from 1 to J with, for example, J=51.
Then F(j, k, t) is calculated, represented in
In our example, every second (S′=1), the sound signal x(t) is coded by 127×51 parameters F(j, k, t), each real F(j, k, t) representing the energy of ten seconds (2N′=10) of the energy signal E(k, t) in the frequency band j.
In order to make F(j, k, t) independent of the amplitude of the signal, which can be more or less strong, these values are put in relation to a reference value; in the present case, the maximum value FM(j, k, t) of F(j, k, t) for all of the k and j taken into account. Thus K×J parameters F(j, k, t)/FM(j, k, t) are obtained.
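The calculation and standardization of F(j, k, t) can be sketched as below, under the assumption that the filtering into bands j is done by a Fourier transformation of each E(k, t) over a Hamming window of 2N′ = 10 s, repeated every S′ = 1 s. The function name `modulation_energy`, the nominal rate of 100 energy frames per second, and the choice of the first J non-DC bins as the bands j are assumptions of this sketch.

```python
import numpy as np

def modulation_energy(E, frame_rate=100, win_s=10.0, hop_s=1.0, J=51):
    """Return standardized F[j, k, t] computed from E[k, t]."""
    win = int(round(win_s * frame_rate))   # 2N' = 10 s -> 1000 frames
    hop = int(round(hop_s * frame_rate))   # S' = 1 s -> 100 frames
    h2 = np.hamming(win)                   # temporal window h'(t)
    n_steps = 1 + (E.shape[1] - win) // hop
    F = np.empty((J, E.shape[0], n_steps))
    for t in range(n_steps):
        seg = E[:, t * hop : t * hop + win] * h2
        spec = np.fft.rfft(seg, axis=1)    # spectrum of each band's energy
        # Assumption: bands j = 1..J are the first J non-DC bins.
        F[:, :, t] = (np.abs(spec[:, 1 : J + 1]) ** 2).T
    # Reference value: maximum over all j and k, taken at each time step.
    FM = F.max(axis=(0, 1), keepdims=True)
    return F / np.where(FM > 0, FM, 1.0)
```

After this division the largest parameter of each time step equals 1, so the values no longer depend on the overall level of the recording.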
In addition, the phase of the energy E(k, t) in each of the bands of frequency j is calculated every 2N′ seconds: P(j, k, t).
To do this, the argument of the Fourier transformation of E(k, t) in each of the frequency bands j is calculated:
As above, these values are put in relation to a reference value; in the present case, the value of P(j, k, t) for the first band of frequencies (j=1) considered, because the temporal reference of the sample is unknown: the origin of the time is unknown.
To do this, the phases yielded φ(j, k, t) are calculated using the following formulae
Thus, K×J parameters corresponding to the values of the yielded phase φ(j, k, t) are obtained.
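As an illustration of why a reference phase removes the dependence on the unknown time origin, the following sketch assumes that the standardized phase is the simple difference between P(j, k, t) and the phase of a reference band, wrapped to (−π, π]. This form is an assumption, as are the function name `relative_phase` and its parameters; no window is applied here, for brevity.

```python
import numpy as np

def relative_phase(segment, J=51, ref_band=0):
    """segment: E[k, n] over one analysis window. Returns phi[j, k]."""
    # Argument of the Fourier transformation of E(k, t) in each band j
    # (assumption: bands j are the first J non-DC bins).
    spec = np.fft.rfft(segment, axis=1)[:, 1 : J + 1]   # shape (K, J)
    P = np.angle(spec).T                                 # P[j, k]
    # Subtract the phase of the reference band (assumed form).
    phi = P - P[:, [ref_band]]
    return np.angle(np.exp(1j * phi))                    # wrap to (-pi, pi]
```

A time shift adds the same phase offset to every band k at a given j, so the difference is unchanged when the extract starts earlier or later.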
Other parameters can also be taken into account; in particular, the mean values of the energy E(k, t) over 2N′ seconds and this for each band of frequencies j: E(j, k, t).
The universe of these standardized parameters, for example F(j, k, t)/FM and P(j, k, t)−P(j, 1, t), defines every S′ seconds a fingerprint that can be considered as a vector V(x(t)) having 2×K×J dimensions (2×127×51, or about 13,000, in our example), one dimension per parameter, each vector characterizing an extract of 2N′ seconds of the sound signal x(t), 10 seconds in our example.
This characterization is reiterated every S′ seconds, every second for example (S′=1).
As represented in
For a sound signal lasting 10 min or 600 s, 600 vectors are obtained; that is, 600×2×J×K parameters.
These vectors are stored in the storage zone 10 of a database housed on a server or on a compact disk.
It is desirable to reduce the number of components of these vectors; in other words, the number of parameters in order to obtain a vector or a fingerprint of a more reduced size in view of its storage in the database. Furthermore, when it is a question of comparing the fingerprint of an unknown sound signal to those in the database, it is desirable that the number of parameters to be compared be reduced in order that said search can be executed quickly.
Now, these parameters do not all contain the same quantity of information; certain ones can be redundant or useless. That is why the most meaningful parameters are selected from among all parameters, using a mutual information calculation presented in the publication H. Yang, S. van Vuuren, H. Hermansky, “Relevancy of Time-Frequency Features for Phonetic Classification Measured by Mutual Information”, Proc. ICASSP '99, Phoenix, Ariz., USA, March 1999. Thus, K is limited to K1 and J to J1.
A method for selecting these parameters will now be presented.
Each of the fingerprints of these sound signals, that is, each of these vectors, is placed in a space R^N of N dimensions, N being the number of components of the vectors. For the sake of simplicity, an example of classification for vectors having 2 dimensions P1 and P2 is represented in
The classes C(m) are defined by grouping the vectors by proximity, m varying from 1 to M. For example, one can decide that one class corresponds to one musical work: in this case M is the number of musical works stored in the database.
The result of the mutual information calculation between these classes C(m) and the parameters is that the relevance of the parameters is linked to the inter and intra class distances: relevant parameters assuring relatively large inter-class distances d compared to the intra-class distances D.
By keeping only the relevant parameters, K1 and J1 are thus defined.
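The link between relevance and class distances can be illustrated with a simple per-parameter score. The sketch below is not the mutual information calculation of the cited publication; it substitutes a Fisher-style ratio of between-class spread to within-class spread, chosen only because it expresses the same idea of large inter-class distances relative to intra-class distances. The function name `rank_parameters` and its inputs are hypothetical.

```python
import numpy as np

def rank_parameters(vectors, labels):
    """Rank parameters by between-class spread over within-class spread."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    overall_mean = vectors.mean(axis=0)
    between = np.zeros(vectors.shape[1])
    within = np.zeros(vectors.shape[1])
    for c in np.unique(labels):
        members = vectors[labels == c]          # vectors of class C(m)
        class_mean = members.mean(axis=0)
        between += len(members) * (class_mean - overall_mean) ** 2
        within += ((members - class_mean) ** 2).sum(axis=0)
    score = between / (within + 1e-12)          # guard against division by zero
    order = np.argsort(score)[::-1]             # most relevant parameter first
    return order, score
```

Keeping only the first few entries of `order` corresponds to reducing the number of components of each vector.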
For example, one can consider five (K1=5) bands of frequencies centered on 344 Hz, 430 Hz, 516 Hz, 608 Hz and 689 Hz, respectively.
Tests have been done by taking J1=3.
The classes C(m) are thus constituted using the vectors Vq(x) not comprising more than 2×K1×J1 components.
As an example, for K1=5 and J1=3, the memory size of a database containing 1,000 hours of music is estimated as follows, taking E(k, t) and F(j, k, t) as parameters, each of these parameters being coded using 4 bytes.
The E(k, t) parameters calculated every 10 ms occupy 1,000×3,600×100×5×4 bytes or approximately 7 gigabytes.
The parameters F(j, k, t) calculated every second occupy 1,000×3,600×3×5×4 bytes or approximately 200 megabytes.
These parameters are associated with sound signal references: if one considers that the references contain 100 characters each coded on one byte, these references occupy 1,000×10×100 bytes or approximately 1 megabyte.
Such a database would ultimately occupy approximately 7 gigabytes.
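The arithmetic behind these figures can be checked with a few lines, assuming that the E(k, t) count includes the K1 = 5 retained bands and that the factor of 10 in the reference estimate stands for about ten works per hour of music (an interpretation, not stated in the description). The variable names are illustrative.

```python
hours = 1_000
seconds = hours * 3_600
K1, J1 = 5, 3
bytes_per_value = 4

# E(k, t): 100 values per second (one every 10 ms) in each of K1 bands.
e_bytes = seconds * 100 * K1 * bytes_per_value
# F(j, k, t): J1 x K1 values per second.
f_bytes = seconds * J1 * K1 * bytes_per_value
# References: assumed 10 works per hour, 100 one-byte characters each.
ref_bytes = hours * 10 * 100

total = e_bytes + f_bytes + ref_bytes
print(e_bytes, f_bytes, ref_bytes, total)
```

The E(k, t) term dominates, which is why the total stays close to 7 gigabytes.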
When one wishes to identify an unknown sound signal, one first of all establishes the fingerprint, referenced V(xinc) in
The search for the class of this fingerprint in the database thus consists, according to a classical method illustrated in
A database server 1 is diagrammatically represented in
When new sound signals are entered into the database 1, the interface 12 receives the signal x(t) accompanied by its references; if the signal is only an unknown signal to be identified, the interface 12 receives only the unknown signal x(t).
Upon output, the interface 13 provides a response to the search for an unknown signal. This response is negative if the unknown signal does not exist in the storage zone 10; if the signal has been identified, the response includes the references of the identified signal.
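The search and the two possible responses can be sketched as a nearest-neighbor lookup. The Euclidean distance and the rejection threshold `max_distance` used to produce a negative response are assumptions of this sketch, as is the function name `identify`; the description only states that the nearest-neighbor method may be used and that the response is negative when the signal is not in the storage zone.

```python
import numpy as np

def identify(v_unknown, db_vectors, db_refs, max_distance):
    """Return the references of the nearest stored fingerprint, or None."""
    db_vectors = np.asarray(db_vectors, dtype=float)
    v_unknown = np.asarray(v_unknown, dtype=float)
    # Distance from the unknown fingerprint to every stored fingerprint.
    distances = np.linalg.norm(db_vectors - v_unknown, axis=1)
    best = int(np.argmin(distances))
    if distances[best] > max_distance:
        return None              # negative response: nothing close enough
    return db_refs[best]         # references of the identified signal
```

For instance, with stored vectors `[[0, 0], [10, 10]]` and references `["work A", "work B"]`, an unknown vector `[9.5, 10.2]` with a threshold of 1.0 is attributed to "work B", whereas `[5, 5]` yields a negative response.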
Number | Date | Country | Kind |
---|---|---|---|
0116949 | Dec 2001 | FR | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/FR02/04549 | 12/24/2002 | WO |