The present disclosure relates to digital audio technology, and in particular to a method and an apparatus for generating a fingerprint of an audio signal.
This section is intended to provide a background to the various embodiments of the technology described in this disclosure. The description in this section may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and/or claims of this disclosure and is not admitted to be prior art by the mere inclusion in this section.
Audio fingerprinting techniques can match distorted, unlabeled audio snippets to corresponding labeled data. They have a wide range of applications in digital audio technologies, such as audio classification, audio retrieval and content synchronization. As an example, a reference written by A. Wang, “An industrial-strength audio search algorithm”, Proc. ISMIR 2003 (hereinafter referred to as reference 1) discusses an audio retrieval system, by which a person who is listening to music (live, on the radio, etc.) and wants to know more about the singer, the name of the song or the album can simply record a short audio signal and use it as a query to retrieve metadata information. Another example, for content synchronization, is described in a reference written by N. Q. K Duong, C Howson, and Y Legallais, “Fast second screen TV synchronization combining audio fingerprint technique and generalized cross correlation,” IEEE International Conference in Consumer Electronics-Berlin (ICCE-Berlin), 2012 (hereinafter referred to as reference 2), where an audio fingerprint can assure fast and accurate synchronization of media components streamed over different networks to different rendering devices for the implementation of emerging second screen TV applications.
There are some known solutions for generating fingerprints in the art. In a reference written by Pedro Cano et al, “A review of audio fingerprinting”, Journal of VLSI Signal Processing 41, 271-284, 2005 (hereinafter referred to as reference 3), several fingerprinting technologies were introduced. According to the reference 3, an audio signal is basically subjected to preprocessing, framing and overlap, a transform, feature extraction and post-processing by a front end block, and the output is then passed to a fingerprint modeling block to generate a fingerprint of the audio signal.
The above mentioned reference 1 also discusses the generation of an audio fingerprint. In the approach of the reference 1, locations of pairs of energy peaks in the audio spectrogram (i.e. the time-frequency representation of an audio signal) are encoded as a fingerprint. In a reference written by J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system”, Proc. ISMIR 2002 (hereinafter referred to as reference 4), energy differences between neighboring time-frequency points in the spectrogram are bit-quantized to generate a signature.
Some known fingerprint approaches consider the spectrogram as an image and apply computer vision techniques to this spectral image to design a fingerprint. For example, a reference written by S. Baluja and M. Covell, “Waveprint: Efficient wavelet-based audio fingerprinting,” Pattern Recognition, 2008 (hereinafter referred to as reference 5) proposes to apply a wavelet transform to the spectral images and designs a Min-Hash signature based on the sign of the top wavelet coefficients. In the algorithm provided by a reference written by K. Behun, “Image features in music style recognition”, Proc. CESCG 2013 (hereinafter referred to as reference 6), the image-based feature SIFT is computed and the histogram of SIFT (a.k.a. the bag-of-word (BoW) feature) is taken as the signature. A reference written by M. Riley et al., “A text retrieval approach to content-based audio”, Proc. ISMIR 2008 (hereinafter referred to as reference 7) provides an algorithm that uses Bag-of-Audio-Word (BoA) for content-based audio retrieval. A reference written by S. Pancoast and M. Akbacak, “Bag-of-Audio-Words Approach for Multimedia Event Classification,” Proc. Interspeech 2012 (hereinafter referred to as reference 8) proposes to use BoA for audio event classification.
However, although they are robust against noise and distortions (such as A/D conversion and compression), most of the above known fingerprint solutions are not able to deal with large time stretching (which happens, for example, when the speed or duration of an audio signal is changed to fit the time in a TV or radio program) or pitch variation (which happens, for example, in live concerts and cover songs). Thus, the known solutions are not robust enough for some more challenging applications, such as recognizing songs in a live concert, where the recorded audio query is not exactly a distorted version of the original signal in the database (there is too much variation in either the time or the frequency scale).
Therefore, there is a need for a method and an apparatus for generating a fingerprint of an audio signal, which is robust to time stretching and pitch variation in audio applications.
The present disclosure is provided to solve at least one problem of the prior art. The present disclosure will be described in detail with reference to exemplary embodiments. However, the present disclosure is not limited to the exemplary embodiments.
According to a first aspect of the present disclosure, there is provided a method for generating a fingerprint of an audio signal. The method comprises detecting peaks in a representation of a temporal spectrum of frequencies of the audio signal, a peak being defined as a point in the representation which has a higher energy than its neighboring points; and generating the fingerprint of the audio signal as a function of a distribution of positions of the detected peaks along a frequency axis and a distribution of positions of the detected peaks along a time axis.
In an embodiment, the obtaining of the time-frequency representation of the audio signal comprises segmenting the audio signal into overlapping time frames; and transforming the segmented audio signal from a time domain to a time-frequency domain to generate a spectrogram of the audio signal comprising linearly-spaced frequencies.
In an embodiment, it further comprises mapping the linearly-spaced frequencies of the spectrogram into P bands of an auditory-motivated frequency scale.
In an embodiment, the distribution of positions of the detected peaks along the frequency axis is represented by a vector of integer numbers Vf=[Vf1, . . . , VfF]T as a function of the number of peaks appearing at each frequency bin, wherein a parameter F is the number of frequency bins and T denotes vector transpose; and the distribution of positions of the detected peaks along the time axis is represented by a vector of integer numbers Vt=[Vt1, . . . , VtN]T as a function of the number of peaks appearing at each time frame bin, where a parameter N is the number of time frame bins.
In an embodiment, the function is a concatenation of the vector Vf=[Vf1, . . . , VfF]T and the vector Vt=[Vt1, . . . , VtN]T according to the equation below:
V=[a*Vf; b*Vt],
wherein a and b are constants.
In an embodiment, it further comprises adapting the parameters F and N according to a requirement on compactness and robustness of the fingerprint.
In an embodiment, it further comprises adapting the constants a and b according to a requirement on robustness to either frequency shifting or time scale shifting of the fingerprint.
In an embodiment, the segmented audio signal is transformed by a Fourier transform.
According to a second aspect of the present disclosure, there is provided an apparatus for generating a fingerprint of an audio signal. The apparatus comprises a time-frequency representing unit for obtaining a representation of the temporal spectrum of frequencies in the audio signal; a peak detecting unit for detecting peaks in the representation of the audio signal, a peak being defined as a point in the representation which has a higher energy than its neighboring points; a first calculating unit for obtaining a distribution of the positions of the detected peaks along a frequency axis; a second calculating unit for obtaining a distribution of positions of the detected peaks along a time axis; and a combining unit for combining the distribution of positions from the first calculating unit and the second calculating unit to generate the fingerprint of the audio signal.
In an embodiment, the time-frequency representing unit is adapted to segment the audio signal into overlapping time frames; and transform the segmented audio signal from time domain to time-frequency domain to generate a spectrogram of the audio signal comprising linearly-spaced frequencies.
In an embodiment, the time-frequency representing unit is further adapted to map the linearly-spaced frequencies of the spectrogram into P bands of an auditory-motivated frequency scale.
In an embodiment, the first calculating unit generates a vector of integer numbers Vf=[Vf1, . . . , VfF]T representing the distribution of positions of the detected peaks along the frequency axis as a function of the number of peaks appearing at each frequency bin, wherein a parameter F is the number of frequency bins and T denotes vector transpose; and the second calculating unit generates a vector of integer numbers Vt=[Vt1, . . . , VtN]T to represent the distribution of positions of the detected peaks along the time axis as a function of the number of peaks appearing at each time frame bin, where a parameter N is the number of time frame bins.
In an embodiment, the combining unit combines the distribution of positions by a concatenation of the vector Vf=[Vf1, . . . , VfF]T and the vector Vt=[Vt1, . . . , VtN]T according to the equation below:
V=[a*Vf; b*Vt],
wherein a and b are constants.
According to a third aspect of the present disclosure, there is provided a computer program product downloadable from a communication network and/or recorded on a medium readable by computer and/or executable by a processor, comprising program code instructions for implementing the steps of a method according to the first aspect of the disclosure.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer-readable medium comprising a computer program product recorded thereon and capable of being run by a processor, including program code instructions for implementing the steps of a method according to the first aspect of the disclosure.
The above and other objects, features, and advantages of the present disclosure will become apparent from the following descriptions on embodiments of the present disclosure with reference to the drawings, in which:
Hereinafter, the present disclosure is described with reference to embodiments shown in the attached drawings. However, it is to be understood that those descriptions are just provided for illustrative purpose, rather than limiting the present disclosure. Further, in the following, descriptions of known structures and techniques are omitted so as not to unnecessarily obscure the concept of the present disclosure.
At step S101, a representation of a temporal spectrum of frequencies in the audio signal is obtained.
It can be appreciated that the representation can be called the spectrogram of the audio signal, which is a visual representation of the spectrum of frequencies in the audio signal varying with time. The spectrogram is actually the time-frequency representation of the audio signal which is normally viewed as a 2D image. In this case, normally the horizontal axis of the spectrogram represents time, and the vertical axis is frequency. There are known ways in the art to obtain the spectrogram of the audio signal, which can be used in the step S101. Hereinafter, a process for obtaining a spectrogram of the audio signal will be described with reference to
As shown in
At step S202, the segmented audio signal is transformed from the time domain to the time-frequency domain to obtain a spectrogram of the audio signal.
The above steps S201 and S202 are for transforming the time domain audio signal into time-frequency domain representation known as spectrogram. In the step S202, a Fourier transform can be used for the transform. In this case, the steps S201 and S202 can be called a short time Fourier transform (STFT). The spectrogram obtained by the STFT comprises linearly-spaced frequencies varying with time. That is, the horizontal axis of the spectrogram is time, and the vertical axis represents linearly-spaced frequencies of the audio signal. The STFT is well-known in the art. No further details will be given in this respect.
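As an illustration only, the steps S201 and S202 can be sketched as below. This is a minimal sketch using only the Python standard library; the function name `stft_energy`, the Hann window, the frame length of 64 samples and the hop of 32 samples are assumptions made for the example and are not taken from the disclosure. A practical implementation would normally use an FFT routine instead of the naive transform shown here.

```python
import cmath
import math


def stft_energy(signal, frame_len=64, hop=32):
    """Return a spectrogram as a list of frames; each frame holds the energy
    of frame_len // 2 + 1 linearly spaced frequency bins."""
    # Hann window (an assumed, common choice) to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    spectrogram = []
    # Step S201: segment the signal into overlapping time frames.
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        # Step S202: naive discrete Fourier transform of the frame, O(n^2).
        energies = []
        for k in range(n_bins):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            energies.append(abs(acc) ** 2)
        spectrogram.append(energies)
    return spectrogram


# Usage: a pure tone that falls exactly on frequency bin 8 of a 64-point frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = stft_energy(tone)
strongest_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

In this usage example, `spec[t][k]` is the energy at time frame `t` and linear frequency bin `k`, and `strongest_bin` identifies the bin carrying the tone.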
As shown in
The auditory-motivated frequency scales mentioned in the step S203 are well-known in the art. No further details will be given in this respect.
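As an illustration only, the mapping of linearly-spaced frequencies into P bands can be sketched with the mel scale, which is one well-known auditory-motivated scale. The disclosure does not name a specific scale, so the choice of the mel scale, the rectangular (non-overlapping) bands and the names `hz_to_mel` and `map_to_bands` are assumptions made for the example.

```python
import math


def hz_to_mel(f_hz):
    # Common mel-scale formula; the choice of the mel scale is an assumption.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def map_to_bands(frame_energies, sample_rate, n_bands):
    """Sum the energies of linearly spaced bins into n_bands (P) bands that
    are equally wide on the mel scale."""
    n_bins = len(frame_energies)
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    bands = [0.0] * n_bands
    for k, energy in enumerate(frame_energies):
        f_hz = k * nyquist / (n_bins - 1)
        # Band index grows linearly in mel; clamp the Nyquist bin into the last band.
        p = min(int(hz_to_mel(f_hz) / max_mel * n_bands), n_bands - 1)
        bands[p] += energy
    return bands


# Usage: 33 linear bins with unit energy, 16 kHz sampling rate, P = 8 bands.
bands = map_to_bands([1.0] * 33, sample_rate=16000, n_bands=8)
```

Because the bands are equally wide in mel, the high bands cover more linear bins than the low bands, so in this usage example the last band collects more energy than the first one.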
Back to
As an example, peaks can be detected in the spectrogram, a peak being a point having higher energy than its neighboring points. Please note that the detection of peaks in a spectrogram of an audio signal is known in the art. For example, the reference 1 describes a detection method, which can be used for the step S102. No further details will be given in this respect.
At step S103, a fingerprint of the audio signal is generated as a function of the distribution of positions of the detected peaks along the frequency axis and the distribution of positions of the detected peaks along the time axis.
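As an illustration only, the definition of a peak given above can be sketched as a local-maximum test. This is not the detection method of the reference 1; the 3x3 neighborhood, the strict comparison and the name `detect_peaks` are assumptions made for the example.

```python
def detect_peaks(spectrogram):
    """Return (time_index, freq_index) pairs whose energy is strictly higher
    than that of every existing neighbor in a 3x3 neighborhood."""
    n_frames = len(spectrogram)
    n_bins = len(spectrogram[0])
    peaks = []
    for t in range(n_frames):
        for f in range(n_bins):
            value = spectrogram[t][f]
            is_peak = True
            for dt in (-1, 0, 1):
                for df in (-1, 0, 1):
                    tt, ff = t + dt, f + df
                    # Skip the point itself and positions outside the spectrogram.
                    if (dt or df) and 0 <= tt < n_frames and 0 <= ff < n_bins:
                        if spectrogram[tt][ff] >= value:
                            is_peak = False
            if is_peak:
                peaks.append((t, f))
    return peaks


# Usage: a small spectrogram indexed as spectrogram[time][frequency].
toy = [[0, 1, 0, 0],
       [1, 5, 1, 0],
       [0, 1, 0, 3]]
peaks = detect_peaks(toy)
```

Only the positions of the peaks are kept; their energy values are not needed for the later steps.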
In an example, the above-mentioned distribution can be represented by a histogram which is a graphical representation of the distribution of the peaks along two axes, each axis being divided into bins.
A detailed description of the generation of a histogram will be provided below.
The histogram of the positions of the detected peaks along the frequency axis can be obtained by counting the number of peaks appearing at each frequency bin f (denoted by Vf). This histogram feature can be denoted by an F-dimensional vector of integer numbers Vf=[Vf1, . . . , VfF]T, where F is the number of frequency bins and T denotes vector transpose. It provides robustness to time scale modification because, intrinsically, when time is stretched, the number of peaks in each frequency bin is not changed.
The histogram of the positions of the detected peaks along the time axis can be obtained by counting the number of peaks appearing at each time frame bin. This feature can be denoted by an N-dimensional vector of integer numbers Vt=[Vt1, . . . , VtN]T, where N is the number of time frame bins. It provides robustness to the frequency shifting effect because, intrinsically, when pitch is shifted, the number of peaks in each time frame bin is not changed.
Note that the number N depends on both the signal length and the number of frequency bins F. Given a fixed signal length, N will be higher if F is smaller and vice versa. Thus, in a variant dealing mostly with frequency shifting, Vt is advantageously used as a robust fingerprint instead of Vf, and the smaller the value of N, the more compact the fingerprint is. In another variant dealing mostly with time-scale distortion, Vf is advantageously used instead of Vt, and the smaller the value of F, the more compact the fingerprint is. Thus the fingerprint of the audio signal can be generated as a function of the histogram along the frequency axis and the histogram along the time axis of the positions of the detected peaks. For example, the combination of both histograms can be built as below
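As an illustration only, the two histograms can be sketched as below. The name `peak_histograms`, the uniform grouping of indices into bins and the idealized time stretch (every peak keeps its frequency while its time index doubles) are assumptions made for the example.

```python
def peak_histograms(peaks, n_frames, n_freqs, F, N):
    """Count peaks per frequency bin (Vf, length F) and per time frame bin
    (Vt, length N). Peaks are (time_index, freq_index) pairs."""
    Vf = [0] * F
    Vt = [0] * N
    for t, f in peaks:
        Vf[f * F // n_freqs] += 1   # uniform grouping of frequency indices
        Vt[t * N // n_frames] += 1  # uniform grouping of time indices
    return Vf, Vt


peaks = [(0, 1), (1, 5), (2, 5), (3, 7)]
Vf, Vt = peak_histograms(peaks, n_frames=4, n_freqs=8, F=4, N=2)

# Idealized 2x time stretch: same frequencies, doubled time indices.
stretched = [(2 * t, f) for t, f in peaks]
Vf_stretched, _ = peak_histograms(stretched, n_frames=8, n_freqs=8, F=4, N=2)
```

In this idealized case `Vf_stretched` equals `Vf`, which is the robustness to time scale modification described above. Real time stretching may also add or remove peaks, so the equality is only approximate in practice.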
V=[a*Vf; b*Vt]  (1)
In this example, the generated fingerprint is the concatenation of Vf and Vt, which results in an (F+N)-dimensional vector of integers. Note that the constants a and b in the equation (1) allow tuning the contribution (weight) of each of the two histograms in the final fingerprint signature. In applications where there is no time scale shifting or the time scale shifting is very small, a can be set to 0 so as to reduce the fingerprint size, make the signature very robust to pitch variation, and speed up the matching process. Similarly, in applications where frequency shifting is not a concern, b can be set to 0 so that the signature is very robust to time stretching.
In an embodiment of the disclosure, a weighting scheme can be built for different peak locations, for example, based on prior knowledge about the important regions. In the general case, one can set a=b=1, the number of frequency bins F can be in the order of 128 (auditory-motivated scale), and the number of time frames N can be in the order of 100. Another way to balance the contribution of Vf and Vt is to set a=N/(F+N) and b=F/(N+F). For example, the query length can be 4 seconds and the frame shift in the short time Fourier transform (STFT) can be 20 ms.
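As an illustration only, the equation (1) can be sketched as below. The name `combine` is hypothetical, and dropping a histogram entirely when its weight is 0 is an assumption about how the fingerprint size is reduced in that case.

```python
def combine(Vf, Vt, a=1, b=1):
    """Build V = [a*Vf; b*Vt]. A histogram whose weight is 0 is dropped
    (an assumption), so the fingerprint becomes shorter."""
    fingerprint = []
    if a != 0:
        fingerprint.extend(a * v for v in Vf)
    if b != 0:
        fingerprint.extend(b * v for v in Vt)
    return fingerprint


Vf = [1, 0, 2, 1]   # F = 4 frequency bins
Vt = [2, 2]         # N = 2 time frame bins

full = combine(Vf, Vt)             # (F+N)-dimensional fingerprint
time_only = combine(Vf, Vt, a=0)   # only Vt is kept
freq_only = combine(Vf, Vt, b=0)   # only Vf is kept
```

With integer constants a and b the result stays a vector of integers.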
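As an illustration only, the balancing weights a=N/(F+N) and b=F/(N+F) can be computed as below for F=128 and N=100. The name `balanced_weights` is hypothetical, and exact fractions are used only to make the arithmetic easy to check.

```python
from fractions import Fraction


def balanced_weights(F, N):
    """Return (a, b) with a = N/(F+N) and b = F/(N+F)."""
    a = Fraction(N, F + N)
    b = Fraction(F, F + N)
    return a, b


a, b = balanced_weights(F=128, N=100)
# a = 25/57 (about 0.44) and b = 32/57 (about 0.56)
```

With this choice a+b=1 and a*F=b*N, so the longer of the two histograms receives the smaller weight.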
As shown in
The apparatus 400 comprises a time-frequency representing unit 401 for obtaining a representation of the spectrum of frequencies in the audio signal varying with time. A spectrogram of the audio signal can be obtained according to the process described above.
The apparatus 400 further comprises a peak detecting unit 402 for detecting peaks in the representation of the audio signal.
The apparatus 400 further comprises a first calculating unit 403 for obtaining the distribution of the positions of the detected peaks along the frequency axis. As described above, the distribution can be represented by a histogram, which can be obtained by counting the number of peaks appearing at each frequency bin.
The apparatus 400 further comprises a second calculating unit 404 for obtaining the distribution of positions of the detected peaks along the time axis. As described above, the distribution can be represented by a histogram, which can be obtained by counting the number of peaks appearing at each time frame bin.
The apparatus 400 further comprises a combining unit 405 for combining the histograms from the first calculating unit 403 and the second calculating unit 404 to generate the fingerprint of the audio signal. The combination can be the concatenation of both histograms, which results in a vector of integers as the fingerprint of the audio signal.
The output of the apparatus 400 is a fingerprint of the audio signal. As described above, in an embodiment, it is a vector of integers.
According to the embodiments of the present disclosure, the peak locations, which are coordinates of peaks in time and frequency axes of the spectral image representation, are very robust to background noise due to the fact that background noise can only change the energy level in most cases, instead of the position of the local maximum energy point.
The fingerprint generated according to the embodiments of the disclosure is a vector of integer numbers. It can be used for similarity search, either exhaustive search or Approximate Nearest Neighbor (ANN) search such as LSH, Hamming embedding or product quantization (PQ) codes.
The fingerprint according to the embodiments of the disclosure is not only robust to many types of noise, but also robust against time scale modification and frequency shifting. The fingerprint is compact and therefore suitable for large-scale search. It can therefore enable a wide range of applications in both audio retrieval and content synchronization.
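As an illustration only, an exhaustive search over stored fingerprints can be sketched as below. The L1 distance, the function names and the two database entries are assumptions made for the example; the disclosure does not specify a distance measure.

```python
def l1_distance(u, v):
    # Sum of absolute differences between two equal-length integer vectors.
    return sum(abs(x - y) for x, y in zip(u, v))


def exhaustive_search(query, database):
    """Return the label whose stored fingerprint is closest to the query."""
    return min(database, key=lambda label: l1_distance(query, database[label]))


# Hypothetical database mapping labels to (F+N)-dimensional fingerprints.
database = {
    "song_a": [1, 0, 2, 1, 2, 2],
    "song_b": [0, 3, 0, 1, 1, 3],
}
best = exhaustive_search([1, 0, 2, 0, 2, 2], database)
```

In this usage example the query differs from the fingerprint of "song_a" in a single entry, so `best` is "song_a". For a large database, an ANN index would replace the linear scan.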
As shown in
As described above, the fingerprint generated according to the embodiment of the disclosure is robust to time stretching and pitch variation in audio applications. In the known art introduced in the background part, the features used in the fingerprint of the reference 1 are robust to background noise, while the resulting fingerprint is not able to deal with large time stretching and pitch variation. The bag-of-word (BoW) feature used in the fingerprints of the references 6 and 7 can bring some benefits against major distortions such as time scale modification and/or pitch shifting. The audio fingerprint according to the embodiment of the disclosure is proposed considering the features discussed in the reference 1 and in the references 6 and 7. Therefore, the proposed fingerprint can be used in more challenging applications, such as recognizing songs in a live concert, where the recorded audio query is not exactly a distorted version of the original signal in the database (there is too much variation in either the time or the frequency scale). In addition, since the fingerprint is a vector of integer numbers, it is very easily integrated into any well-established search engine.
It is to be understood that the present disclosure may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage device. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (CPU), a random access memory (RAM), and input/output (I/O) interface(s). The computer platform also includes an operating system and microinstruction code. The various processes and functions described herein may either be part of the microinstruction code or part of the application program (or a combination thereof), which is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.
The present disclosure is described above with reference to the embodiments thereof. However, those embodiments are provided just for illustrative purposes, rather than limiting the present disclosure. The scope of the disclosure is defined by the attached claims as well as equivalents thereof. Those skilled in the art can make various alterations and modifications without departing from the scope of the disclosure, all of which fall within the scope of the disclosure.
Number | Date | Country | Kind |
---|---|---|---|
14306854.2 | Nov 2014 | EP | regional |