1. Technical Field of the Invention
The present invention relates to a technology for estimating a pitch (fundamental frequency) of music sounds.
2. Description of the Related Art
A technology for estimating the fundamental frequency of a desired sound (tone) included in music sounds (which will be referred to as a target sound) is described in Japanese Patent Registration No. 3413634. In this technology, an amplitude spectrum or power spectrum of a target sound is modeled as a mixed distribution of a plurality of tone models, each of which is a probability density function modeling a harmonic structure, and a distribution of respective weights of the plurality of tone models is interpreted as a fundamental frequency probability density function, and a salient peak prominent in the probability density function is estimated as the pitch of the target sound.
However, a number of peaks appear in the fundamental frequency probability density function at fundamental frequencies other than the fundamental frequency of the desired sound. For example, peaks in an amplitude spectrum of a sound whose fundamental frequency is 100 Hz overlap at the harmonic frequencies (200 Hz, 400 Hz, 600 Hz, 800 Hz, . . . ) with peaks of another amplitude spectrum of another sound whose fundamental frequency is 200 Hz. Thus, when a sound whose fundamental frequency is 200 Hz is included in a target sound, a salient peak appears not only at 200 Hz but also at 100 Hz in its fundamental frequency probability density function even though no sound of a fundamental frequency of 100 Hz is actually included in the target sound. In addition, when the target sound is a mixture of a number of sounds, prominent peaks corresponding to fundamental frequency and harmonic components of the sounds appear in the fundamental frequency probability density function. It is difficult to accurately extract only the fundamental frequency of a desired sound from such a probability density function which includes a number of salient peaks.
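By way of illustration only, the overlap between the two harmonic structures described above may be checked with the following short Python sketch. The function name `harmonics` and the 800 Hz upper limit are assumptions introduced for this example and do not appear in the cited reference.

```python
def harmonics(f0_hz, max_hz):
    """Return f0_hz and its integer multiples up to max_hz (inclusive)."""
    return [f0_hz * n for n in range(1, max_hz // f0_hz + 1)]

h100 = harmonics(100, 800)   # candidate that is NOT actually sounding
h200 = harmonics(200, 800)   # sound that IS present in the target sound

# Frequencies at which both harmonic structures have a peak.
shared = sorted(set(h100) & set(h200))
print(shared)                   # [200, 400, 600, 800]
print(set(h200) <= set(h100))   # True: every 200 Hz harmonic is also a 100 Hz harmonic
```

Because every peak of the 200 Hz sound coincides with a peak expected for a 100 Hz sound, a model of the 100 Hz harmonic structure is partially supported by the spectrum even though no 100 Hz sound is present.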
The present invention has been made in view of the above circumstances, and it is an object of the present invention to accurately estimate the fundamental frequency of an audio signal, particularly one containing a mixture of a plurality of sounds.
In order to achieve the object, the present invention provides a pitch estimation apparatus for estimating a fundamental frequency of an audio signal from a fundamental frequency probability density function by modeling the audio signal as a weighted mixture of a plurality of tone models corresponding respectively to harmonic structures of individual fundamental frequencies, so that the fundamental frequency probability density function of the audio signal is given as a distribution of respective weights of the plurality of the tone models. The pitch estimation apparatus comprises: a function estimation part that estimates the fundamental frequency probability density function by repeating a weight calculation process and an estimated shape specification process, wherein the weight calculation process calculates a weight of each tone model of each fundamental frequency based on an estimated shape of each tone model of each fundamental frequency, the estimated shape indicating a degree of dominancy of a corresponding tone model in a total harmonic structure of the audio signal, and the estimated shape specification process specifies each estimated shape of each tone model of each fundamental frequency based on an amplitude spectrum of the audio signal, the harmonic structure of each tone model of each fundamental frequency, and the weight of each tone model of each fundamental frequency; a similarity analysis part that calculates a similarity index value indicating a degree of similarity between each tone model of each fundamental frequency and each estimated shape specified from the corresponding tone model in the estimated shape specification process; and a weight correction part that reduces a weight of at least one tone model of a certain fundamental frequency having the similarity index value indicating that the one tone model and the corresponding estimated shape are not similar to each other, among the weights of the plurality of the tone models calculated in the weight calculation process.
This configuration suppresses a weight of a fundamental frequency, whose tone model and corresponding estimated shape are not similar, among the plurality of weights calculated in the weight calculation process, thereby reducing the possibility that a ghost peak will occur in the fundamental frequency probability density function due to a tone model that deviates from the total harmonic structure of the audio signal. This makes it possible to accurately extract fundamental frequencies of an audio signal (i.e., pitches of target sounds).
In a preferred embodiment of the present invention, the weight correction part changes the weight of the one tone model of the certain fundamental frequency to zero, the one tone model of the certain fundamental frequency having the similarity index value indicating that the one tone model and the corresponding estimated shape are not similar to each other. This embodiment changes, to zero, a weight of a fundamental frequency whose tone model and corresponding estimated shape are not similar, thereby completely suppressing a peak in the fundamental frequency probability density function caused by a tone model that deviates from the total harmonic structure of the target sound. This makes it possible to more accurately extract fundamental frequencies of the audio signal.
In the configuration illustrated above, the weight correction part reduces a weight of a fundamental frequency, whose similarity index value indicates that a tone model and an estimated shape corresponding to the fundamental frequency are not similar. However, the present invention may also provide a configuration in which the weight correction part increases a weight of a fundamental frequency, whose similarity index value calculated by the similarity analysis part indicates that a tone model and an estimated shape corresponding to the fundamental frequency are similar, among a plurality of weights calculated in the weight calculation process.
In a preferred embodiment of the present invention, the function estimation part executes the estimated shape specification process to generate the estimated shape of the corresponding tone model of the respective fundamental frequency based on a product of the amplitude spectrum of the audio signal, the harmonic structure of the corresponding tone model, and the weight calculated for the corresponding tone model of the respective fundamental frequency. This embodiment has advantages in that the estimated shape is generated through a simple calculation, and the similarity between the total harmonic structure of the audio signal and the harmonic structure of the tone model is remarkably reflected in the estimated shape.
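The product described above may be sketched, purely as an illustration, in Python as follows. The coarse 100 Hz frequency grid, the equal-mass tone model returned by `tone_model`, and the weight value 0.5 are all assumptions made for this example; the specification does not prescribe them.

```python
FREQS = [100, 200, 300, 400, 500, 600, 700, 800]   # Hz, coarse illustrative grid

def tone_model(f0):
    """Hypothetical tone model: equal mass on each harmonic of f0, summing to 1."""
    raw = [1.0 if x % f0 == 0 else 0.0 for x in FREQS]
    total = sum(raw)
    return [v / total for v in raw]

def estimated_shape(spectrum, model, weight):
    """Bin-by-bin product of amplitude spectrum, tone model, and weight."""
    return [s * m * weight for s, m in zip(spectrum, model)]

# Amplitude spectrum of a single 200 Hz sound: peaks only at 200, 400, 600, 800 Hz.
S = [1.0 if x % 200 == 0 else 0.0 for x in FREQS]

C100 = estimated_shape(S, tone_model(100), weight=0.5)
C200 = estimated_shape(S, tone_model(200), weight=0.5)

# Bins where each estimated shape has a peak.
peaks_100 = [x for x, c in zip(FREQS, C100) if c > 0]
peaks_200 = [x for x, c in zip(FREQS, C200) if c > 0]
print(peaks_100)   # [200, 400, 600, 800]
print(peaks_200)   # [200, 400, 600, 800]
```

In this sketch the estimated shape for 200 Hz has peaks at every frequency where its tone model has peaks, whereas the estimated shape for 100 Hz lacks the peaks at 100, 300, 500, and 700 Hz that its tone model expects, which is the dissimilarity the similarity analysis part is intended to detect.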
When an audio signal including a plurality of sounds is processed, a fundamental frequency of a desired sound can be estimated, for example by searching for a salient peak with the highest weight in the fundamental frequency probability density function, even if two or more peaks are present in the probability density function at ghost fundamental frequencies that are not actually included in the audio signal. However, in the case where fundamental frequencies of a plurality of sounds are estimated from an audio signal, such a highest weight search method cannot be used, so that it is difficult to accurately determine whether or not peaks in the fundamental frequency probability density function correspond to fundamental frequencies that are actually included in the audio signal. According to the present invention, peaks at fundamental frequencies, which are not actually included in the audio signal, are suppressed in the fundamental frequency probability density function so that it is possible to accurately estimate fundamental frequencies of a plurality of sounds from the fundamental frequency probability density function. That is, the present invention is desirably applied to a pitch estimation apparatus that includes a pitch specifying part for specifying, as pitches, a plurality of fundamental frequencies corresponding to peaks in the fundamental frequency probability density function estimated by the function estimation part.
The present invention is also specified as a method for estimating a fundamental frequency of an audio signal. Thus, the present invention provides a pitch estimation method of estimating a fundamental frequency of an audio signal from a fundamental frequency probability density function by modeling the audio signal as a weighted mixture of a plurality of tone models corresponding respectively to harmonic structures of individual fundamental frequencies, so that the fundamental frequency probability density function of the audio signal is given as a distribution of respective weights of the plurality of the tone models. The pitch estimation method comprises: estimating the fundamental frequency probability density function by repeating a weight calculation process (for example, a process of a weight calculator 23 in
The pitch estimation apparatus according to the present invention may be implemented by hardware (electronic circuitry) such as a Digital Signal Processor (DSP) dedicated to each process, or through cooperation between a program and a general-purpose processing unit such as a Central Processing Unit (CPU). In order to estimate a fundamental frequency of an audio signal from a fundamental frequency probability density function that is a distribution of respective weights of a plurality of tone models corresponding respectively to harmonic structures of individual fundamental frequencies when the audio signal is modeled as a mixed distribution of the plurality of tone models, a program according to the present invention causes a computer to perform a function estimation process that estimates the fundamental frequency probability density function by repeating a weight calculation process and an estimated shape specification process, wherein the weight calculation process calculates a weight of each fundamental frequency based on an estimated shape of a tone model of the fundamental frequency, the estimated shape representing an extent to which the tone model of the individual fundamental frequency supports or contributes to a total harmonic structure of the audio signal, and the estimated shape specification process specifies an estimated shape of each fundamental frequency based on an amplitude spectrum of the audio signal, a tone model of the fundamental frequency, and a weight of the fundamental frequency; a similarity analysis process that calculates a similarity index value of each fundamental frequency indicating whether or not a tone model of the fundamental frequency and an estimated shape specified from the tone model in the estimated shape specification process are similar; and a weight correction process that reduces a weight of a fundamental frequency, whose similarity index value calculated in the similarity analysis process indicates that a tone model and an estimated shape corresponding to the fundamental frequency are not similar, among a plurality of weights calculated in the weight calculation process. The program of the present invention has the same operations and advantages as those of the pitch estimation apparatus according to the present invention. The program of the present invention may be provided to a user in a form stored in a machine readable medium or portable recording medium such as a CD-ROM and then installed on the computer, or may be provided from a server apparatus in a distributed manner over a network and then installed on the computer.
An audio signal V representing a time waveform of the target sound is input to the frequency analyzer 12. The target sound represented by the audio signal V of this embodiment is a mixture of a plurality of sounds of different pitches or sound sources. The frequency analyzer 12 specifies an amplitude spectrum of the target sound by dividing the audio signal V into a number of frames using a specific window function and then performing frequency analysis including a Fast Fourier Transform (FFT) process on each frame of the audio signal V. The frames are set so as to overlap each other on the time axis.
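The framing and frequency analysis described above may be sketched in Python as follows. This is an illustrative assumption, not the implementation of the frequency analyzer 12: the Hann window, the 1600 Hz sampling rate, the 64-sample frame length, the 50% overlap, and the use of a direct discrete Fourier transform in place of an FFT are all choices made only to keep the example short and dependency-free.

```python
import cmath
import math

def hann(n):
    """Symmetric Hann window of length n (one possible window function)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frames(signal, frame_len, hop):
    """Split the signal into frames; hop < frame_len makes the frames overlap."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def amplitude_spectrum(frame):
    """Amplitude spectrum of one windowed frame via a direct DFT."""
    n = len(frame)
    x = [a * w for a, w in zip(frame, hann(n))]
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

SR = 1600          # assumed sampling rate in Hz
FRAME_LEN = 64     # bin spacing is SR / FRAME_LEN = 25 Hz
signal = [math.sin(2 * math.pi * 200 * t / SR) for t in range(256)]  # 200 Hz test tone

spectra = [amplitude_spectrum(f) for f in frames(signal, FRAME_LEN, hop=32)]
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
peak_hz = peak_bin * SR / FRAME_LEN
print(len(spectra), peak_hz)   # 7 200.0
```

A practical implementation would normally use an FFT routine rather than the quadratic-time transform shown here; the result per frame is the same amplitude spectrum.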
The BPF 14 selectively passes components included in a specific frequency band in the amplitude spectrum specified by the frequency analyzer 12. The passband of the BPF 14 is previously selected statistically or empirically such that the BPF passes most of the fundamental frequency and harmonic components of sounds, whose pitches are to be estimated, among the plurality of sounds included in the target sound and blocks frequency bands in which fundamental frequency and harmonic components of other sounds are predominant over those of the desired sounds. An amplitude spectrum S that has passed through the BPF 14 is output to the function estimator 20.
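As a minimal sketch of the band selection described above, the following Python fragment applies a rectangular passband to an amplitude spectrum. The rectangular shape, the function name `band_pass`, and the band edges are assumptions for illustration; the specification states only that the passband is selected statistically or empirically.

```python
def band_pass(spectrum, bin_hz, low_hz, high_hz):
    """Zero every bin whose centre frequency lies outside [low_hz, high_hz]."""
    return [a if low_hz <= k * bin_hz <= high_hz else 0.0
            for k, a in enumerate(spectrum)]

spectrum = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7]   # bins centred at 0, 100, ..., 500 Hz
S = band_pass(spectrum, bin_hz=100, low_hz=100, high_hz=400)
print(S)   # [0.0, 0.9, 0.4, 0.8, 0.2, 0.0]
```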
The function estimator 20 of
The storage 30 is means for storing, as templates, the plurality of tone models M[F] used in the function estimator 20, examples of which include a magnetic storage device and a semiconductor storage device. As shown in
As shown in
First, a peak appears in the estimated shape C[F] at each frequency at which a peak appears in both the tone model M[F] and the amplitude spectrum S. For example, peaks appear in both the amplitude spectrum S of
On the other hand, no peak appears in the estimated shape C[F] at a frequency x corresponding to a peak in the tone model M[F] if the amplitude spectrum S has no peak at the frequency x. For example, while peaks appear in the tone model M[100] of
The weight calculator 23 is means for calculating a weight ω[F] of each fundamental frequency F from each estimated shape C[F] calculated by the estimated shape specifier 21. As shown in
The process selector 25 of
As shown in
A unit process including the process for specifying the estimated shape C[F] at the estimated shape specifier 21 (hereinafter referred to as an “estimated shape specification process”) and the process for specifying the weight ω[F] at the weight calculator 23 (hereinafter referred to as a “weight calculation process”) is repeated a plurality of times (EM algorithm). Each unit process makes the weights ω[F] closer to respective weights of a plurality of tone models M[F] when the amplitude spectrum S is modeled as a mixed distribution of the plurality of tone models M[F].
Immediately after processing of one frame of the audio signal V starts, the weight calculator 23 has not yet calculated the weight ω[F], and thus the estimated shape specifier 21 calculates an estimated shape C[F] by multiplying the amplitude spectrum S by the tone model M[F] (i.e., by the spectral distribution ratio Q[F]). The process selector 25 outputs the weight ω[F] initially calculated for one frame to the ghost suppressor 27 while outputting subsequently calculated weights ω[F] to the estimated shape specifier 21. Accordingly, in the first estimated shape specification process for one frame of the audio signal V, the estimated shape C[F] is calculated by multiplying the amplitude spectrum S by the tone model M[F] and, in the second estimated shape specification process, the estimated shape C[F] is calculated by multiplying the amplitude spectrum S by the spectral distribution ratio Q[F] generated from both the tone model M[F] and a weight ω[F] that has been processed by the ghost suppressor 27. In the third and subsequent estimated shape specification processes, the estimated shape C[F] is calculated by multiplying the amplitude spectrum S by the spectral distribution ratio Q[F] generated from both the tone model M[F] and a weight ω[F] calculated by the weight calculator 23 (i.e., a weight ω[F] that has not been processed by the ghost suppressor 27). When the number of repetitions of the unit process has reached a predetermined number, the weight calculator 23 outputs the distribution of the calculated weights ω[F], as a fundamental frequency probability density function P, to the pitch specifier 40.
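The schedule described above (first weights routed once through the ghost suppressor, later weights fed straight back) may be sketched in Python as follows. Several details are assumptions made only for this illustration and are not taken from the specification: the spectral distribution ratio is assumed to be Q[F] = ω[F]M[F] / Σ ω[G]M[G], the weight is assumed to be the area of C[F] divided by the total area, the similarity index is an RMS error with a threshold of 0.03, and the tone models place equal mass on each harmonic of a coarse 100 Hz grid.

```python
import math

FREQS = list(range(100, 1700, 100))     # coarse grid: 100, 200, ..., 1600 Hz
CANDIDATES = [100, 200, 300, 400]       # candidate fundamental frequencies F

def tone_model(f0):
    """Hypothetical M[F]: equal mass on every harmonic of f0, summing to 1."""
    raw = [1.0 if x % f0 == 0 else 0.0 for x in FREQS]
    total = sum(raw)
    return [v / total for v in raw]

MODELS = {F: tone_model(F) for F in CANDIDATES}

def unit_process(S, weights):
    """One estimated shape specification process, then one weight calculation process."""
    shapes = {F: [] for F in MODELS}
    for i, s in enumerate(S):
        if weights is None:             # no weight yet: C[F] = S * M[F]
            ratios = {F: MODELS[F][i] for F in MODELS}
        else:                           # assumed Q[F] = w[F]M[F] / sum_G w[G]M[G]
            denom = sum(weights[G] * MODELS[G][i] for G in MODELS)
            ratios = {F: (weights[F] * MODELS[F][i] / denom if denom > 0 else 0.0)
                      for F in MODELS}
        for F in MODELS:
            shapes[F].append(s * ratios[F])
    total = sum(sum(c) for c in shapes.values())
    return shapes, {F: sum(c) / total for F, c in shapes.items()}

def rms_index(model, shape):
    """RMS error between M[F] and the area-normalized C[F]; near zero means similar."""
    area = sum(shape)
    if area == 0:
        return float("inf")
    return math.sqrt(sum((m - c / area) ** 2 for m, c in zip(model, shape)) / len(model))

def suppress_ghosts(shapes, weights, threshold=0.03):
    """Zero the weights of dissimilar candidates, then renormalize to sum to 1."""
    kept = {F: (w if rms_index(MODELS[F], shapes[F]) <= threshold else 0.0)
            for F, w in weights.items()}
    total = sum(kept.values())
    return {F: w / total for F, w in kept.items()}

def estimate_pdf(S, n_iter, use_suppressor):
    weights = None
    for it in range(n_iter):
        shapes, weights = unit_process(S, weights)
        if it == 0 and use_suppressor:  # only the first weights are corrected
            weights = suppress_ghosts(shapes, weights)
    return weights

S = [1.0 if x % 200 == 0 else 0.0 for x in FREQS]   # spectrum of one 200 Hz sound
with_suppressor = estimate_pdf(S, n_iter=5, use_suppressor=True)
comparison = estimate_pdf(S, n_iter=5, use_suppressor=False)
print(with_suppressor)
print(comparison)
```

In this toy case the weights for 100 Hz and 300 Hz are exactly zero with the suppressor and remain positive without it. The 400 Hz candidate is not removed by the similarity test, because every peak of its tone model is present in the spectrum; its weight is instead reduced gradually by the subsequent unit processes.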
However, when the fundamental frequency F of the amplitude spectrum S is 200 Hz as shown in
It is difficult to accurately remove only the ghost from a plurality of peaks in the fundamental frequency probability density function P. Another problem is that a weight ω[F] of the fundamental frequency F of a sound that is actually included in the target sound is limited (i.e., an increase in the weight ω[F] is restricted) by as much as the amplitude of the ghost since the weight ω[F] is determined such that the integral of the weight ω[F] over all fundamental frequencies F is 1. The ghost causes a reduction in the accuracy of pitch specification as described above. Thus, in this embodiment, the ghost suppressor 27 suppresses the ghost by correcting the weight ω[F] calculated by the weight calculator 23.
An estimated shape C[F] specified based on the product of the amplitude spectrum S and a spectral distribution ratio Q[F] generated from a tone model M[F], which dominantly supports (or contributes to) the harmonic structure of the amplitude spectrum S, includes peaks at the same frequencies x as those of the tone model M[F] since the tone model M[F] includes peaks at the same frequencies x as those of the amplitude spectrum S. Accordingly, aspects (such as frequencies or amplitudes of peaks) of the tone model M[F] are similar to those of the estimated shape C[F], as can be seen from the tone model M[200] of
As shown in
The weight corrector 273 forcibly changes a weight ω[F] of a fundamental frequency F, whose tone model M[F] and estimated shape C[F] are not similar (i.e., have low similarity), to zero regardless of its value calculated by the weight calculator 23. More specifically, the weight corrector 273 of this embodiment maintains the weight ω[F] calculated by the weight calculator 23 when the similarity index value R[F] is less than a threshold TH and changes, to zero, the weight ω[F] when the similarity index value R[F] is greater than the threshold TH.
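The threshold rule described above may be written as the following Python sketch. The function name `correct_weights`, the dictionary representation, the numeric values, and the treatment of the boundary case R[F] = TH (here kept) are assumptions for illustration; the specification addresses only the cases of less than and greater than the threshold.

```python
def correct_weights(weights, similarity_index, threshold):
    """Keep w[F] when R[F] <= TH; force it to zero when R[F] > TH."""
    return {F: (w if similarity_index[F] <= threshold else 0.0)
            for F, w in weights.items()}

weights = {100: 0.25, 200: 0.5, 400: 0.25}   # hypothetical weights per F in Hz
R = {100: 0.9, 200: 0.05, 400: 0.1}          # hypothetical index values (small = similar)

corrected = correct_weights(weights, R, threshold=0.5)
print(corrected)                 # {100: 0.0, 200: 0.5, 400: 0.25}
print(sum(corrected.values()))   # 0.75
```

The corrected weights in this example no longer sum to 1.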
If the weights ω[F] are corrected as described above, the sum of the weights ω[F] of all fundamental frequencies F may not be 1. Thus, the normalizer 275 of
The pitch specifier 40 of
As described above, in this embodiment, an estimated shape C[F] corresponding to a fundamental frequency F of a sound, which is not actually included in the target sound, and a weight ω[F] and a value k[F] generated based on the estimated shape C[F] are effectively reduced, compared to a configuration without the ghost suppressor 27 (which will be referred to as a “comparison example”), since the weight ω[F] corrected by the ghost suppressor 27 is used to specify the estimated shape C[F].
When only one fundamental frequency F0 is estimated from a fundamental frequency probability density function P as described in Japanese Patent Registration No. 3413634, it is likely that the fundamental frequency F0 of the desired sound can be estimated by searching for the most prominent peak in the probability density function P even in the case of the comparison example where ghosts are present in the weight ω[F]. However, using the most prominent peak search method, it is difficult to accurately extract the fundamental frequencies F0 of a plurality of sounds from a probability density function P having ghosts G and peaks corresponding to the desired fundamental frequencies F0. This embodiment suppresses weights ω[F] corresponding to ghosts G to selectively make only the peaks of sounds, which are actually included in the target sound, apparent in the probability density function P. Thus, it is possible to accurately and easily specify the fundamental frequencies F0 of a plurality of sounds by selecting a predetermined number of peaks (agents), for example in order of decreasing weight ω[F].
The above embodiments may be modified in various ways. The following illustrates specific modified embodiments. Appropriate combinations of the following embodiments are also possible.
Although the weight ω[F] initially calculated for one frame is corrected at the weight corrector 273 in the configurations illustrated in the above embodiments, the timing at which the weight ω[F] is corrected is optional. For example, it is also possible to provide configurations in which the weight ω[F] is corrected after a unit process is performed a predetermined number of times (one or more times). However, the configurations in which the weight ω[F] is corrected at an initial stage, as in the above embodiments, have an advantage of reducing the time (or the number of repetitions of the unit process) required to optimize the weight ω[F]. The number of times the correction of the weight ω[F] is performed on one frame is also optional. For example, configurations in which the weight ω[F] is corrected each time the unit process is performed a predetermined number of times (one or more times) may also be employed.
Although the similarity index value R[F] is compared with the threshold TH in the configurations illustrated in the above embodiments, the method of determining whether or not to correct the weight ω[F] may be changed as appropriate. For example, the weights ω[F] of a predetermined number of fundamental frequencies F selected in order of increasing similarity between the tone model M[F] and the estimated shape C[F] (in order of decreasing similarity index value R[F]) may be corrected to zero.
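The rank-based alternative described above may be sketched as follows. The function name `zero_least_similar` and the numeric values are hypothetical and are used only to illustrate the selection order.

```python
def zero_least_similar(weights, similarity_index, count):
    """Zero the weights of the `count` candidates with the largest R[F]."""
    worst = sorted(weights, key=lambda F: similarity_index[F], reverse=True)[:count]
    return {F: (0.0 if F in worst else w) for F, w in weights.items()}

weights = {100: 0.2, 200: 0.5, 300: 0.1, 400: 0.2}
R = {100: 0.9, 200: 0.01, 300: 1.4, 400: 0.05}   # larger value = less similar

print(zero_least_similar(weights, R, count=2))
# {100: 0.0, 200: 0.5, 300: 0.0, 400: 0.2}
```

Unlike the threshold comparison, this rule always corrects a fixed number of candidates regardless of how large or small the index values are.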
In addition, although weights ω[F] corresponding to ghosts are changed to zero in the configurations illustrated in the above embodiments, the method of correcting the weights ω[F] is not limited to this. That is, weights corresponding to ghosts, among weights ω[F] output from the ghost suppressor 27 to the estimated shape specifier 21, only need to be reduced to values less than the weights ω[F] calculated by the weight calculator 23. Accordingly, in addition to the means for replacing weights ω[F] corresponding to ghosts with zero, means for multiplying weights ω[F] corresponding to ghosts by a value less than 1 or means for subtracting a predetermined value from the weights ω[F] may also be employed as the weight corrector 273.
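The three correction rules mentioned above may be compared with the following Python sketch. The function names, the factor 0.5, the subtracted amount 0.125, and the clamping of the subtraction result at zero are assumptions introduced for this example.

```python
def to_zero(w):
    """Replace the weight with zero."""
    return 0.0

def scale(w, factor=0.5):
    """Multiply the weight by a value less than 1."""
    return w * factor

def subtract(w, amount=0.125):
    """Subtract a predetermined value, clamped so the weight stays non-negative."""
    return max(0.0, w - amount)

def correct(weights, ghosts, rule):
    """Apply `rule` only to the weights of fundamental frequencies judged to be ghosts."""
    return {F: (rule(w) if F in ghosts else w) for F, w in weights.items()}

weights = {100: 0.25, 200: 0.75}
ghosts = {100}

print(correct(weights, ghosts, to_zero))    # {100: 0.0, 200: 0.75}
print(correct(weights, ghosts, scale))      # {100: 0.125, 200: 0.75}
print(correct(weights, ghosts, subtract))   # {100: 0.125, 200: 0.75}
```

In every case the ghost weight ends up smaller than the value produced by the weight calculation, which is the only property the passage above requires.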
Further, although weights ω[F] corresponding to ghosts are suppressed in the configurations illustrated in the above embodiments, a configuration in which weights ω[F] of fundamental frequencies F at which no ghost occurs are increased to values greater than the weights ω[F] calculated by the weight calculator 23 may also be employed. For example, the weight corrector 273 maintains weights ω[F] of fundamental frequencies F, whose similarity index value R[F] is greater than the threshold TH, at the weights ω[F] calculated by the weight calculator 23 and corrects weights ω[F] of fundamental frequencies F, whose similarity index value R[F] is less than the threshold TH (i.e., whose tone model M[F] and estimated shape C[F] are similar), to values greater than the weights ω[F] calculated by the weight calculator 23 and outputs the values as the corrected weights ω[F] of the fundamental frequencies F. Means for multiplying weights ω[F] of fundamental frequencies F at which no ghost occurs by a predetermined value greater than 1 or means for adding a predetermined value to those weights ω[F] may also be employed as the weight corrector 273 in this configuration.
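Assuming that the corrected weights are subsequently normalized so that they sum to 1, increasing the weights at which no ghost occurs by a gain g has the same relative effect as multiplying the ghost weights by 1/g. The following Python sketch checks this for one hypothetical pair of weights; the names and values are illustrative only.

```python
def normalize(weights):
    """Rescale the weights so that they sum to 1."""
    total = sum(weights.values())
    return {F: w / total for F, w in weights.items()}

weights = {100: 0.25, 200: 0.75}
ghosts = {100}
gain = 2.0

# Variant A: increase the weights at which no ghost occurs.
boosted = normalize({F: (w if F in ghosts else w * gain) for F, w in weights.items()})
# Variant B: reduce the weights corresponding to ghosts.
reduced = normalize({F: (w / gain if F in ghosts else w) for F, w in weights.items()})

print(boosted)
print(reduced)
```

Both variants yield weights of 1/7 and 6/7 here, so the choice between them matters mainly when an additive correction is used or when normalization is not applied.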
The KL information quantity is just an example of the similarity index value R[F]. For example, a Root Mean Square (RMS) error between the tone model M[F] and the estimated shape C[F] may also be calculated as the similarity index value R[F]. In addition, although the similarity index value R[F] approaches zero as the similarity between the tone model M[F] and the estimated shape C[F] increases in the cases illustrated above, the similarity index value R[F] may be calculated such that the similarity index value R[F] approaches zero as the similarity between the tone model M[F] and the estimated shape C[F] decreases. That is, in the present invention, the method of calculating the similarity index value R[F] is optional and any configuration suffices if it reduces weights ω[F] of fundamental frequencies F whose tone model M[F] and estimated shape C[F] have low similarity.
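Both index values mentioned above may be sketched in Python as follows. The direction of the KL information (tone model relative to estimated shape), the normalization of both inputs to unit area, and the small constant `eps` that avoids a logarithm of zero are assumptions made for this example; the sketch also assumes that the estimated shape has non-zero total area.

```python
import math

def normalize(values):
    """Scale a non-negative sequence so that it sums to 1."""
    total = sum(values)
    return [v / total for v in values]

def kl_index(model, shape, eps=1e-12):
    """KL information of normalized M[F] relative to normalized C[F]; 0 when identical."""
    p, q = normalize(model), normalize(shape)
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def rms_index(model, shape):
    """Root mean square error between normalized M[F] and normalized C[F]."""
    p, q = normalize(model), normalize(shape)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)) / len(p))

M = [0.25, 0.25, 0.25, 0.25]       # hypothetical tone model with four peaks
C_similar = [0.5, 0.5, 0.5, 0.5]   # same shape, different scale
C_ghost = [0.0, 1.0, 0.0, 1.0]     # two of the four expected peaks are missing

print(kl_index(M, C_similar), rms_index(M, C_similar))   # 0.0 0.0
print(kl_index(M, C_ghost) > 1.0, rms_index(M, C_ghost)) # True 0.25
```

Both indices approach zero as the two shapes become more alike, so either can be compared against a threshold in the manner described for R[F].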
Although a predetermined number of peaks selected in order of decreasing weight ω[F] in the fundamental frequency probability density function P are extracted as fundamental frequencies F0 in the configurations illustrated in the above embodiments, configurations, in which peaks higher than a predetermined threshold among a plurality of peaks of the probability density function P are extracted as fundamental frequencies F0, may also be employed. In addition, although a plurality of fundamental frequencies F0 are estimated in the configurations illustrated in the above embodiments, the above embodiments may of course be applied when one fundamental frequency F0 is estimated.
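The two peak selection rules described above may be sketched in Python as follows. The local-maximum test on a discrete frequency grid, the function names, and the example probability density function are assumptions made for illustration.

```python
def local_peaks(pdf):
    """Return (frequency, weight) pairs that exceed both neighbours on the grid."""
    freqs = sorted(pdf)
    peaks = []
    for i, f in enumerate(freqs):
        left = pdf[freqs[i - 1]] if i > 0 else 0.0
        right = pdf[freqs[i + 1]] if i + 1 < len(freqs) else 0.0
        if pdf[f] > left and pdf[f] > right:
            peaks.append((f, pdf[f]))
    return peaks

def top_n(pdf, n):
    """Select the n peaks with the largest weights, in order of decreasing weight."""
    ranked = sorted(local_peaks(pdf), key=lambda fw: fw[1], reverse=True)
    return [f for f, _ in ranked[:n]]

def above_threshold(pdf, threshold):
    """Select every peak whose weight exceeds the threshold."""
    return [f for f, w in local_peaks(pdf) if w > threshold]

# Hypothetical probability density function P sampled every 50 Hz.
P = {100: 0.02, 150: 0.30, 200: 0.05, 250: 0.03, 300: 0.45, 350: 0.05, 400: 0.10}

print(top_n(P, 2))               # [300, 150]
print(above_threshold(P, 0.2))   # [150, 300]
```

The fixed-count rule suits cases where the number of simultaneous sounds is known in advance, whereas the threshold rule lets the number of reported fundamental frequencies vary from frame to frame.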
Although a set of tone models M[F] is used in the configurations illustrated in the above embodiments, a plurality of sets of tone models M[F] may also be used as shown in
An amplitude spectrum S output from a BPF 14 is divided into n sets, which are then provided respectively to the function estimators 20. Each function estimator 20 performs, in parallel with each other, the same unit process (including an estimated shape specification process and a weight calculation process) as that of the above embodiment based on the amplitude spectrum S and a tone model Mi[F], corresponding to the function estimator 20, stored in the storage 30. As shown in
In configurations in which a weight ω[F] is separately calculated for each frame of an audio signal V as in the above embodiments, an estimated shape C[F] is calculated, for example by multiplying the amplitude spectrum S by the tone model M[F] (or the spectral distribution ratio Q[F]), when the first estimated shape specification process is performed on one frame. However, a weight ω[F] of each frame may also be calculated using, as an initial value, a weight ω[F] finally determined for an immediately previous frame (i.e., a function value of a probability density function P estimated for the immediately previous frame). For example, when the first estimated shape specification process is performed on one frame, an estimated shape C[F] may also be calculated by multiplying the amplitude spectrum S by a spectral distribution ratio Q[F] generated from both a tone model M[F] and a weight ω[F] finally calculated for an immediately previous frame.
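The two ways of performing the first estimated shape specification process for a frame may be contrasted with the following Python sketch. The form of the spectral distribution ratio, Q[F] = ω[F]M[F] / Σ ω[G]M[G], is an assumption for this example, as are the function name and the four-bin toy models.

```python
def first_estimated_shapes(S, models, previous_weights=None):
    """First C[F] of a frame, with or without weights carried over from the previous frame."""
    shapes = {}
    for F, M in models.items():
        if previous_weights is None:
            # Separate calculation per frame: C[F] = S * M[F].
            shapes[F] = [s * m for s, m in zip(S, M)]
        else:
            # Carried-over weights: C[F] = S * Q[F] (assumed form of Q[F]).
            shape = []
            for i, s in enumerate(S):
                denom = sum(previous_weights[G] * models[G][i] for G in models)
                q = previous_weights[F] * M[i] / denom if denom > 0 else 0.0
                shape.append(s * q)
            shapes[F] = shape
    return shapes

# Bins at 100, 200, 300, 400 Hz; hypothetical tone models for 100 Hz and 200 Hz.
models = {100: [0.25, 0.25, 0.25, 0.25], 200: [0.0, 0.5, 0.0, 0.5]}
S = [0.0, 1.0, 0.0, 1.0]

cold = first_estimated_shapes(S, models)
warm = first_estimated_shapes(S, models, previous_weights={100: 0.0, 200: 1.0})
print(cold)   # {100: [0.0, 0.25, 0.0, 0.25], 200: [0.0, 0.5, 0.0, 0.5]}
print(warm)   # {100: [0.0, 0.0, 0.0, 0.0], 200: [0.0, 1.0, 0.0, 1.0]}
```

When the previous frame already assigned zero weight to a candidate, that candidate contributes nothing to the first estimated shape of the new frame, so a sound that newly begins at that fundamental frequency could not be recovered from these initial values alone in this sketch.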
A pitch estimation program is installed and executed on a personal computer that has audio signal acquisition functions such as a communication function to acquire musical audio signals from a network through COM I/O. Alternatively, the personal computer may be equipped with a sound collection function to obtain input audio signals from the natural environment, or a player function to reproduce musical audio signals from a recording medium such as an HDD or a CD. The computer, which executes the pitch estimation program according to this embodiment, functions as a pitch estimation apparatus according to the invention.
A machine readable medium such as HDD or ROM is provided for use in a computer for estimating a fundamental frequency of an audio signal from a fundamental frequency probability density function by modeling the audio signal as a weighted mixture of a plurality of tone models corresponding respectively to harmonic structures of individual fundamental frequencies, so that the fundamental frequency probability density function of the audio signal is given as a distribution of respective weights of the plurality of the tone models. The machine readable medium contains program instructions executable by the computer for performing: a function estimation process of estimating the fundamental frequency probability density function by repeating a weight calculation process and an estimated shape specification process, wherein the weight calculation process calculates a weight of each tone model of each fundamental frequency based on an estimated shape of each tone model of each fundamental frequency, the estimated shape indicating a degree of dominancy of a corresponding tone model in a total harmonic structure of the audio signal, and the estimated shape specification process specifies each estimated shape of each tone model of each fundamental frequency based on an amplitude spectrum of the audio signal, the harmonic structure of each tone model of each fundamental frequency, and the weight of each tone model of each fundamental frequency; a similarity analysis process of calculating a similarity index value indicating a degree of similarity between each tone model of each fundamental frequency and each estimated shape specified from the corresponding tone model in the estimated shape specification process; and a weight correction process of reducing a weight of at least one tone model of a certain fundamental frequency having the similarity index value indicating that the one tone model and the corresponding estimated shape are not similar to each other, among the weights of the plurality of the tone models calculated in the weight calculation process.
Number | Date | Country | Kind |
---|---|---|---|
2006-238778 | Sep 2006 | JP | national |