The present technology relates to a tonal component detection method, a tonal component detection apparatus, and a program.
Components constituting a one-dimensional time signal such as voice or music are broadly classified into three types of representations: (1) a tonal component, (2) a stationary noise component, and (3) a transient noise component. The tonal component corresponds to a component caused by the stationary and periodic vibration of a sound source. The stationary noise component corresponds to a component caused by a stationary but non-periodic phenomenon such as friction or turbulence. The transient noise component corresponds to a component caused by a non-stationary phenomenon such as a blow or a sudden change in a sound condition. Among them, the tonal component is a component that faithfully represents the intrinsic properties of a sound source itself, and thus it is particularly important when analyzing the sound.
The tonal component obtainable from an actual sound may often be a plurality of sinusoidal components which are gradually changed over time. The tonal component may be represented, for example, as a horizontal stripe-shaped pattern on a spectrogram representing amplitudes of the short-time Fourier transform with a time series, as shown in
Detection of tonal components has long been attempted. A typical technique obtains an amplitude spectrum for each short time frame, detects local peaks of the amplitude spectrum, and regards all of the detected peaks as tonal components. One disadvantage of this method is that it produces a large number of erroneous detections, because not every local peak is necessarily a tonal component.
Incidentally, local peaks occurring in the amplitude spectrum include (1) a peak due to the tonal component, (2) a side lobe peak, (3) a noise peak, and (4) an interference peak.
Approaches for improving the detection accuracy of the method described above include, for example, (A) a method of setting a threshold for the height of each local peak and discarding local peaks smaller than the threshold, and (B) a method of connecting local peaks across multiple frames in the time direction according to a local neighbor rule and then excluding components that are not connected more than a certain number of times.
Method (A) assumes that the magnitude of tonal components is greater than that of noise components at all times. However, this assumption is unreasonable and does not hold in many cases, so the performance improvement is limited. Actually, the magnitude of the peak erroneously detected in the vicinity of 2 kHz on the frequency axis of
Method (B) is disclosed in, for example, R. J. McAulay and T. F. Quatieri, “Speech Analysis/Synthesis Based on a Sinusoidal Representation,” IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 34, No. 4, pp. 744-754 (August 1986), and J. O. Smith III and X. Serra, “PARSHL: An Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation,” Proceedings of the International Computer Music Conference (1987). This method exploits the property that tonal components have temporal continuity (e.g., in music, a tonal component often continues for more than 100 ms). However, because peaks of components other than tonal components may also continue over time, and because a tonal component segmented into short pieces is not detected, sufficient accuracy is not necessarily achieved in many applications.
According to an embodiment of the present technology, it is possible to accurately detect a tonal component from time signals such as voice or music.
According to an embodiment of the present technology, there is provided a tonal component detection method including performing a time-frequency transformation on an input time signal to obtain a time-frequency distribution, detecting a peak in a frequency direction at a time frame of the time-frequency distribution, fitting a tone model in a neighboring region of the detected peak, and obtaining a score indicating tonal component likeness of the detected peak based on a result obtained by the fitting.
According to the embodiments of the present technology described above, in the step of performing the time-frequency transformation, the time-frequency distribution (spectrogram) can be obtained by performing the time-frequency transformation on the input time signal. In this case, for example, the time-frequency transformation of the input time signal may be performed using a short-time Fourier transform. In addition, the time-frequency transformation of the input time signal may be performed using other transformation techniques such as a wavelet transform.
In the step of detecting the peak, a peak in the frequency direction is detected at each of the time frames in the time-frequency distribution. In the step of fitting, the tone model is fitted in a neighboring region of each of the detected peaks. In this case, for example, a quadratic polynomial function of time and frequency may be used as the tone model. Alternatively, a cubic or higher-order polynomial function may be used. Further, the fitting may be performed, for example, based on a least square error criterion between the tone model and the time-frequency distribution in the vicinity of each of the detected peaks. Alternatively, the fitting may be performed based on a minimum fourth-power error criterion, a minimum entropy criterion, and so on.
A score indicating tonal component likeness of the detected peak may be obtained based on a result obtained by the fitting. In this case, in the step of obtaining the score, for example, the score indicating the tonal component likeness of the detected peak may be obtained using at least a fitting error extracted based on the result obtained by the fitting. Further, in this case, in the step of obtaining the score, for example, the score indicating the tonal component likeness of the detected peak may be obtained using at least a peak curvature in a frequency direction extracted based on the result obtained by the fitting.
Further, in this case, in the step of obtaining the score, for example, the score indicating the tonal component likeness of the detected peak may be obtained by extracting a predetermined number of features and by combining the predetermined number of extracted features, based on the result obtained by the fitting. In this case, in the step of obtaining the score, when the predetermined number of extracted features are combined, a non-linear function may be applied to the predetermined number of extracted features to obtain a weighted sum. The predetermined number of features may be at least one of a fitting error, a peak curvature in a frequency direction, a frequency of a peak, an amplitude value in a peak position, a rate of a change in a frequency, or a rate of a change in amplitude that are obtained by the tone model on which the fitting is performed.
According to the embodiments of the present technology as described above, the tone model can be fitted in a neighboring region of each peak in the frequency direction detected from the time-frequency distribution (spectrogram), and the score indicating the tonal component likeness of each of the detected peaks can be obtained based on results obtained by the fitting. Therefore, it is possible to accurately detect tonal components.
According to embodiments of the present technology, it is possible to accurately detect a tonal component from time signals such as voice or music.
Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the appended drawings. Note that, in this specification and the appended drawings, structural elements that have substantially the same function and structure are denoted with the same reference numerals, and repeated explanation of these structural elements is omitted.
The description will be made in the following order.
1. Embodiment
2. Modification
[Tonal Component Detection Apparatus]
The time-frequency transformation unit 101 transforms an input time signal f(t) such as voice or music into a time-frequency representation to obtain a time-frequency signal F(n,k). Here, t is the discrete time, n is the time frame number, and k is the discrete frequency. The transformation is performed, for example, using a short-time Fourier transform, as given in the following Equation (1).
In the above Equation (1), W(t) is the window function, M is the size of the window function, and R is the frame time interval (hop size). The time-frequency signal F(n,k) indicates a logarithmic amplitude value of the frequency component in the time frame n and frequency k, i.e., it is a spectrogram (time-frequency distribution).
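The transformation described above can be sketched as follows. Equation (1) is not reproduced in this text, so the formula in the code is an assumption: the logarithm of the magnitude of a windowed discrete Fourier transform, using the symbols W(t), M, and R defined above. The Hann window, the values M=64 and R=16, the small constant `eps`, and the function name `log_spectrogram` are all hypothetical choices made for illustration.

```python
import cmath
import math


def log_spectrogram(f, M=64, R=16, eps=1e-12):
    """Return F[n][k], a log-amplitude spectrogram of the sample list f.

    Assumed form: F(n,k) = log|sum_t f(t + n*R) * W(t) * exp(-j*2*pi*k*t/M)|.
    """
    # Hann window W(t); the window shape is an assumption of this sketch.
    W = [0.5 - 0.5 * math.cos(2.0 * math.pi * t / M) for t in range(M)]
    n_frames = 1 + (len(f) - M) // R if len(f) >= M else 0
    F = []
    for n in range(n_frames):
        frame = [f[n * R + t] * W[t] for t in range(M)]
        row = []
        for k in range(M // 2 + 1):  # non-negative frequencies only
            s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / M)
                    for t in range(M))
            row.append(math.log(abs(s) + eps))  # eps avoids log(0)
        F.append(row)
    return F


# A pure sinusoid centred on frequency bin 8 of a 64-point frame.
signal = [math.sin(2.0 * math.pi * 8 * t / 64) for t in range(256)]
F = log_spectrogram(signal)
```

The direct summation is written for readability; a practical implementation would normally use a fast Fourier transform instead.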
The peak detection unit 102 detects peaks in the frequency direction at each time frame of the spectrogram obtained by the time-frequency transformation unit 101. Specifically, the peak detection unit 102 detects whether peaks (maximum values) are found in the frequency direction for all of the frames and all of the frequencies on the spectrogram.
Whether F(n,k) is a peak is determined by checking whether the following Equation (2) is satisfied. Although a three-point peak detection method is illustrated here, a five-point method may be used instead.
F(n,k−1)<F(n,k) and F(n,k)>F(n,k+1) (2)
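As a minimal sketch, the three-point test of Equation (2) can be applied to every frame and frequency as shown below. The function name `detect_peaks` and the decision to skip the first and last frequency bins (which lack one neighbor) are assumptions of this example, not details given in the text.

```python
def detect_peaks(F):
    """Return the set of (n, k) positions that satisfy Equation (2)."""
    peaks = set()
    for n, row in enumerate(F):
        # Edge bins have only one neighbour; skipping them is a choice
        # made for this sketch.
        for k in range(1, len(row) - 1):
            if row[k - 1] < row[k] and row[k] > row[k + 1]:
                peaks.add((n, k))
    return peaks


F = [[0.0, 1.0, 3.0, 1.0, 0.5, 2.0, 0.1],
     [0.0, 0.0, 2.5, 2.5, 0.0, 1.0, 0.0]]
peaks = detect_peaks(F)
```

Because both inequalities are strict, the flat top at bins 2 and 3 of the second frame is not reported as a peak.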
The fitting unit 103 fits a tone model in a neighboring region of each of the peaks detected by the peak detection unit 102, as described below. The fitting unit 103 first performs a coordinate transformation into coordinates having the target peak as the origin, and then sets up a neighboring time-frequency region as given in the following Equation (3). In Equation (3), ΔN is the extent of the neighboring region in the time direction (e.g., three points), and ΔK is the extent of the neighboring region in the frequency direction (e.g., two points).
Γ=[−ΔN≦n≦ΔN]×[−ΔK≦k≦ΔK] (3)
Subsequently, the fitting unit 103 fits, for example, a tone model of the quadratic polynomial function as given in the following Equation (4), with respect to the time-frequency signal within the neighboring region. In this case, the fitting unit 103 performs the fitting, for example, based on the least square error criterion of the tone model and the time-frequency distribution in the vicinity of the peak.
$Y(k,n) = ak^2 + bk + ckn + dn^2 + en + g$ (4)
In other words, the fitting unit 103 performs the fitting by obtaining coefficients that minimize the square error as given in the following Equation (5), in the neighboring region of the time-frequency signal and the polynomial function. The coefficients are determined as given in the following Equation (6).
This quadratic polynomial function has the property that it is well fitted in the vicinity of the tonal spectral peak (smaller margin of error) but it is not well fitted in the vicinity of the noise spectral peak (larger margin of error). This property of the function is schematically shown in
b) shows how the quadratic function f0(k) which is given in the following Equation (7) is fitted to the spectrum shown in
$f_0(k) = a(k - k_0)^2 + g_0$ (7)
a) schematically shows the change of the tonal peaks in the time direction. The amplitude and frequency of a tonal peak change while its overall shape is maintained across the previous and subsequent time frames. The obtained spectrum is actually formed by discrete points, but it is drawn with a curved line in the figure for descriptive purposes. Specifically, the dashed line represents the previous frame, the solid line represents the current frame, and the dotted line represents the subsequent frame.
In many cases, the tonal components have a certain extent of time continuity and involve some changes in frequency and time, but they can be represented by the shift of a quadratic function of substantially the same form. This change Y(k,n) is given by the following Equation (8). The spectrum is represented by logarithmic amplitudes, and thus a change in amplitude shifts the spectrum up or down. This is the reason why the term f1(n) indicating the change in amplitude is added. In the following Equation (8), β is the rate of change in frequency, and f1(n) is the time function indicating the change in amplitude at the peak position.
$Y(k,n) = f_0(k - \beta n) + f_1(n)$ (8)
If f1(n) is approximated by a quadratic function in the time direction, the change Y(k,n) is given by the following Equation (9). In Equation (9), a, k0, β, d1, e1, and g0 are constants, and thus Equation (9) becomes equivalent to Equation (4) by converting them to appropriate variables.
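The correspondence with the quadratic polynomial of Equation (4) can be checked by expansion. Equation (9) is not reproduced in this text, so the form below is an assumption: f1(n) is taken to be $d_1 n^2 + e_1 n$, with its constant term absorbed into $g_0$.

$$
\begin{aligned}
Y(k,n) &= a(k - k_0 - \beta n)^2 + g_0 + d_1 n^2 + e_1 n \\
&= a k^2 - 2 a k_0 k - 2 a \beta k n + (a \beta^2 + d_1) n^2 + (2 a \beta k_0 + e_1) n + (a k_0^2 + g_0)
\end{aligned}
$$

Comparing term by term with Equation (4) gives $b = -2 a k_0$, $c = -2 a \beta$, $d = a \beta^2 + d_1$, $e = 2 a \beta k_0 + e_1$, and $g = a k_0^2 + g_0$. Conversely, for $a \neq 0$, the peak position and the rate of change in frequency follow from the fitted coefficients as $k_0 = -b/(2a)$ and $\beta = -c/(2a)$.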
b) schematically shows a fitting performed in the small region Γ on the spectrogram. Equation (4) tends to be well fitted for the tonal component, because tonal peaks with a similar shape change gradually over time. However, the shape and frequency of noise peaks vary from frame to frame, so Equation (4) is not well fitted in their vicinity. In other words, even when the fitting is performed optimally, the error becomes large.
Furthermore, Equation (6) shows the calculation in which the fitting is performed for all of the coefficients a, b, c, d, e, and g. However, some of the coefficients may be fixed in advance to constant values and the fitting may be performed on the remaining coefficients. In addition, the fitting may be performed using a cubic or higher-order polynomial function.
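The least-squares fit of Equation (4) over the region Γ of Equation (3) can be sketched as follows. Equations (5) and (6) are not reproduced in this text, so this example does not claim to match the closed form of Equation (6); it solves the ordinary normal equations by Gauss-Jordan elimination, which is one standard way to minimize a squared error. The names `fit_tone_model`, `dN`, and `dK` are hypothetical, and the caller is assumed to keep the region inside the spectrogram.

```python
def fit_tone_model(F, n0, k0, dN=3, dK=2):
    """Least-squares fit of Equation (4) around the peak at (n0, k0).

    Returns ([a, b, c, d, e, g], sum of squared residuals).
    """
    rows, targets = [], []
    for n in range(-dN, dN + 1):
        for k in range(-dK, dK + 1):
            # Basis terms of Equation (4) in peak-centred coordinates.
            rows.append([k * k, k, k * n, n * n, n, 1.0])
            targets.append(F[n0 + n][k0 + k])
    m = 6
    # Normal equations (A^T A) x = A^T y, stored as an augmented matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(m)]
           + [sum(r[i] * y for r, y in zip(rows, targets))]
           for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    coeffs = [aug[i][m] / aug[i][i] for i in range(m)]
    error = sum((sum(c * v for c, v in zip(coeffs, r)) - y) ** 2
                for r, y in zip(rows, targets))
    return coeffs, error


def model(k, n):
    # Synthetic surface that is exactly of the form of Equation (4).
    return -0.5 * k * k + 0.1 * k + 0.05 * k * n - 0.01 * n * n + 0.02 * n + 3.0


F = [[model(k - 4, n - 4) for k in range(9)] for n in range(9)]
coeffs, error = fit_tone_model(F, 4, 4)
```

On a surface that is exactly quadratic, the fit recovers the generating coefficients and the residual is close to zero. A noise-like neighborhood would leave a larger residual, which is the property the text relies on.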
Referring back to
The scoring unit 105 obtains scores indicating the tonal component likeness of each peak using the features extracted by the feature extraction unit 104 for each peak in order to quantify the tonal component likeness of each peak. The scoring unit 105 obtains the score S(n,k) as given in the following Equation (11) using one or a plurality of features (x0, x1, x2, x3, x4, x5). In this case, at least the fitting normalization error x5 or the peak curvature x0 in the frequency direction is used.
In Equation (11), Sigm(x) is the sigmoid function, wi is a predetermined weighting factor, and Hi(xi) is a predetermined non-linear function applied to the i-th feature xi. For example, the function given in the following Equation (12) can be used as the non-linear function Hi(xi). In Equation (12), ui and vi are predetermined weighting factors. The factors wi, ui, and vi may be set in advance to any suitable constants, or alternatively, they may be determined automatically by a steepest descent learning procedure or the like using a large amount of data.
$H_i(x_i) = \mathrm{Sigm}(u_i x_i + v_i)$ (12)
As described above, the scoring unit 105 obtains the score S(n,k), which indicates the tonal component likeness of each peak, by using Equation (11). The scoring unit 105 sets the score S(n,k) to zero at any position (n,k) having no peak. In this way, the scoring unit 105 obtains the score S(n,k) indicating the tonal component likeness at each time and frequency of the time-frequency signal F(n,k). The score S(n,k) takes a value between 0 and 1. The scoring unit 105 then outputs the obtained score S(n,k) as the tonal component detection result.
Moreover, in the case where it is necessary to make a binary determination as to whether it is a tonal component or not, the determination can be made using an appropriate threshold SThsd as given in the following Equation (13).
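The scoring and the binary determination can be sketched as follows. Equations (11) and (13) are not reproduced in this text, so two assumptions are made: the score is taken to be the sigmoid of the weighted sum of Hi(xi), as the surrounding description suggests, and a peak is taken to be tonal when its score strictly exceeds the threshold. The weights, the threshold value of 0.5, and the function names are hypothetical placeholders, not values from the source.

```python
import math


def sigm(x):
    """Sigmoid function Sigm(x)."""
    return 1.0 / (1.0 + math.exp(-x))


def score(features, w, u, v):
    """Assumed form of Equation (11): Sigm(sum_i w_i * H_i(x_i)).

    H_i(x_i) = Sigm(u_i * x_i + v_i) follows Equation (12).
    """
    total = sum(wi * sigm(ui * xi + vi)
                for xi, wi, ui, vi in zip(features, w, u, v))
    return sigm(total)


def is_tonal(s, threshold=0.5):
    """Assumed form of Equation (13): tonal when the score exceeds the threshold."""
    return s > threshold


# Two features only: peak curvature x0 and fitting normalization error x5.
# All weights below are illustrative placeholders.
w, u, v = [2.0, -4.0], [-4.0, 50.0], [0.0, -2.0]
s_good_fit = score([-0.5, 0.01], w, u, v)  # sharp peak, small fitting error
s_poor_fit = score([-0.5, 1.0], w, u, v)   # same curvature, large fitting error
```

With these placeholder weights, a larger fitting error lowers the score, so the second peak falls below the threshold while the first stays above it.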
The operation of the tonal component detection apparatus 100 shown in
The peak detection unit 102 detects whether peaks are found in the frequency direction at all of the frames and all of the frequencies on the spectrogram. The peak detection results are supplied to the fitting unit 103. The fitting unit 103 fits a tone model in a neighboring region of the peak for each of the peaks. This fitting allows the coefficients of the quadratic polynomial function constituting the tone model (see Equation (4)) to be obtained so that the square error may be minimized. The results obtained by the fitting are supplied to the feature extraction unit 104.
The feature extraction unit 104 extracts various types of features based on the results (see Equation (6)) obtained by fitting each of the peaks in the fitting unit 103 (see Equation (10)). For example, features such as the curvature of the peak, the frequency of the peak, the logarithmic amplitude value of the peak, the rate of change in amplitude, and the fitting normalization error are extracted. The extracted features are supplied to the scoring unit 105.
The scoring unit 105 obtains the score S(n,k) indicating the tonal component likeness of each of the peaks using the features (see Equation (11)). The score S(n,k) takes a value between 0 and 1. Then, the scoring unit 105 outputs the obtained score S(n,k) as tonal component detection results. In addition, the scoring unit 105 sets the score S(n,k) in the position (n,k) at which there is no peak to zero.
Furthermore, the tonal component detection apparatus 100 shown in
The computer equipment 200 includes a CPU (Central Processing Unit) 181, a ROM (Read Only Memory) 182, a RAM (Random Access Memory) 183, a data input/output unit (data I/O) 184, and an HDD (Hard Disk Drive) 185. The ROM 182 stores the processing programs to be executed by the CPU 181. The RAM 183 serves as a work area for the CPU 181. The CPU 181 reads out the processing programs stored in the ROM 182 as necessary and transfers them to the RAM 183, so that the processing programs are loaded in the RAM 183. Thereafter, the CPU 181 reads out the loaded programs to execute the tonal component detection process.
The computer equipment 200 receives an input time signal f(t) through the data I/O 184 and accumulates it in the HDD 185. The CPU 181 performs the tonal component detection process on the input time signal f(t) accumulated in the HDD 185. The tonal component detection result S(n,k) is output to the outside through the data I/O 184.
The flowchart of
Subsequently, in step ST3, the CPU 181 sets the number n of the frame (time frame) to zero. Then, in step ST4, the CPU 181 determines whether n<N. In addition, frames in the spectrogram (time-frequency distribution) are assumed to be between 0 and N−1. If it is determined that n is greater than or equal to N (n≧N), then the CPU 181 determines that processes for all of the frames are completed, and terminates the process at step ST5.
If it is determined that n is less than N (n<N), then the CPU 181, in step ST6, sets the discrete frequency k to zero. In step ST7, the CPU 181 determines whether k<K. The discrete frequency k of the spectrogram (time-frequency distribution) is assumed to be between 0 and K−1. If it is determined that k is greater than or equal to K (k≧K), then the CPU 181 determines that the processes for all of the discrete frequencies are completed, and, in step ST8, increments n by 1. Subsequently, the flow returns to step ST4, and the process for the next frame is performed.
If it is determined that k is less than K (k<K), then the CPU 181, in step ST9, determines whether F(n,k) is a peak. If F(n,k) is not a peak, then the CPU 181, in step ST10, sets the score S(n,k) to zero, and then, in step ST11, increments k by 1. Subsequently, the flow returns to step ST7, and the process for the next discrete frequency is performed.
In step ST9, if it is determined that F(n,k) is a peak, then the CPU 181 proceeds to step ST12. In step ST12, the CPU 181 fits a tone model in a neighboring region of the peak. The CPU 181, in step ST13, extracts various types of features (x0, x1, x2, x3, x4, x5) based on the results obtained by the fitting.
Subsequently, the CPU 181, in step ST14, obtains the score S(n,k) indicating the tonal component likeness of the peak using the features extracted in step ST13. The score S(n,k) takes a value between 0 and 1. After step ST14 is completed, the CPU 181 increments k by 1 in step ST11. Then, the flow returns to step ST7, and the process for the next discrete frequency is performed.
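The loop structure of steps ST3 to ST14 can be sketched as nested loops over frames and discrete frequencies. This is an illustrative sketch, not the program of the embodiment: the function name `detect_tonal_components` and the three callables `fit`, `extract_features`, and `score` are hypothetical stand-ins for the processing of steps ST12 to ST14. The handling of peaks whose neighboring region would extend beyond the spectrogram is not described in the text; here they are simply left with a score of zero.

```python
def detect_tonal_components(F, fit, extract_features, score, dN=3, dK=2):
    """Steps ST3 to ST14: return S[n][k] for a spectrogram F[n][k]."""
    N, K = len(F), len(F[0])
    S = [[0.0] * K for _ in range(N)]  # ST10: positions without a peak stay 0
    for n in range(N):                 # ST3, ST4, ST8: loop over frames
        for k in range(K):             # ST6, ST7, ST11: loop over frequencies
            # ST9: three-point peak test of Equation (2).
            if not (0 < k < K - 1 and F[n][k - 1] < F[n][k] > F[n][k + 1]):
                continue
            # Assumption of this sketch: skip peaks whose neighbouring
            # region would leave the spectrogram.
            if not (dN <= n < N - dN and dK <= k < K - dK):
                continue
            result = fit(F, n, k, dN, dK)              # ST12
            S[n][k] = score(extract_features(result))  # ST13, ST14
    return S


# Tiny demonstration with stand-in callables.
F = [[0.0, 1.0, 0.0], [0.0, 2.0, 0.0], [0.0, 1.0, 0.0]]
S = detect_tonal_components(
    F,
    fit=lambda F, n, k, dN, dK: F[n][k],  # stand-in "fit result"
    extract_features=lambda r: [r],
    score=lambda x: 1.0 if x[0] > 1.5 else 0.25,
    dN=1, dK=1)
```

Passing the three stages as callables keeps the control flow of the flowchart separate from the choice of tone model, features, and scoring function.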
As described above, the tonal component detection apparatus 100 shown in
Moreover, the tonal component detection apparatus 100 shown in
Although the above embodiments describe the time-frequency transformation as being performed using the short-time Fourier transform, the input time signal may instead be transformed into a time-frequency representation using other transformation techniques such as the wavelet transform. In addition, although the above embodiments describe the fitting as being performed using the least square error criterion between the tone model and the time-frequency distribution in the vicinity of each of the detected peaks, the fitting may instead be performed using the minimum fourth-power error criterion, the minimum entropy criterion, and so on.
It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and alterations may occur depending on design requirements and other factors insofar as they are within the scope of the appended claims or the equivalents thereof.
Additionally, the present technology may also be configured as below.
performing a time-frequency transformation on an input time signal to obtain a time-frequency distribution;
detecting a peak in a frequency direction at a time frame of the time-frequency distribution;
fitting a tone model in a neighboring region of the detected peak; and
obtaining a score indicating tonal component likeness of the detected peak based on a result obtained by the fitting.
a time-frequency transformation unit configured to perform a time-frequency transformation on an input time signal to obtain a time-frequency distribution;
a peak detection unit configured to detect a peak in a frequency direction at a time frame of the time-frequency distribution;
a fitting unit configured to perform fitting on a tone model in a neighboring region of the detected peak; and
a scoring unit configured to obtain a score indicating tonal component likeness of the detected peak based on a result obtained by the fitting.
means for performing a time-frequency transformation on an input time signal to obtain a time-frequency distribution;
means for detecting a peak in a frequency direction at a time frame of the time-frequency distribution;
means for fitting a tone model in a neighboring region of the detected peak; and
means for obtaining a score indicating tonal component likeness of the detected peak based on a result obtained by the fitting.
The present disclosure contains subject matter related to that disclosed in Japanese Priority Patent Application JP 2012-078320 filed in the Japan Patent Office on Mar. 29, 2012, the entire content of which is hereby incorporated by reference.
Number | Date | Country | Kind |
---|---|---|---|
2012-078320 | Mar 2012 | JP | national |
Entry |
---|
McAulay, R.J., et al. “Speech Analysis/Synthesis Based on a Sinusoidal Representation” IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 34, No. 4, Aug. 1986, pp. 744-754. |
Smith, Julius O., et al. “PARSHL: An Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation” Center for Computer Research in Music and Acoustics, Department of Music, Stanford University, Stanford, CA, 1987, pp. 1-23. |
Number | Date | Country | |
---|---|---|---|
20130255473 A1 | Oct 2013 | US |