The following description will explain the present invention in detail, based on the drawings illustrating some embodiments thereof.
A computer program 11a of the present invention is recorded in the recording means 11, and a computer operates as the sound signal processing apparatus 1 of the present invention by storing various kinds of processing steps contained in the recorded computer program 11a into the storing means 12 and executing them under the control of the control means 10.
A part of the recording area of the recording means 11 is used as various kinds of databases, such as an acoustic model database (acoustic model DB) 11b recording acoustic models for voice recognition, and a language dictionary 11c recording recognizable vocabulary described by phonemic or syllabic definitions corresponding to the acoustic models, and grammar.
A part of the storing means 12 is used as a sound data buffer 12a for storing digitized sound data obtained by sampling sound that is an analog signal acquired by the sound acquiring means 13 at a predetermined period, and a frame buffer 12b for storing frames obtained by dividing the sound data into a predetermined time length.
The navigation means 16 includes a position detecting mechanism such as a GPS (Global Positioning System), and a recording medium, such as a DVD or a hard disk, on which map information is recorded. The navigation means 16 executes navigation processing such as searching for a route from the current location to a destination and indicating the route, displays a map and the route on the display means 15, and outputs a voice guide from the sound output means 14.
The structural example shown in
The following description will explain the processing performed by the sound signal processing apparatus 1 according to Embodiment 1 of the present invention.
The sound signal processing apparatus 1 generates frames of a predetermined length from the sound data stored in the sound data buffer 12a, under the control of the control means 10 (step S3). In step S3, the sound data is divided into frames of a predetermined length of 20 ms to 30 ms, for example, with adjacent frames overlapping each other by 10 ms to 15 ms. For each of the frames, frame processing common in the field of voice recognition is performed, such as applying a window function (for example, a Hamming window or a Hanning window) and filtering with a high-pass filter. The following processing is performed on each of the frames thus created.
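The framing in step S3 can be sketched as follows. This is an illustrative Python sketch, not part of the specification; the function name `make_frames` and the default frame length and shift are assumptions based on the example values given above (20 ms to 30 ms frames with 10 ms to 15 ms of overlap):

```python
import numpy as np

def make_frames(sound_data, sample_rate, frame_ms=25.0, shift_ms=12.5):
    """Divide sound data into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. a 20-30 ms frame
    shift_len = int(sample_rate * shift_ms / 1000)   # leaves 10-15 ms of overlap
    window = np.hamming(frame_len)                   # window function of step S3
    frames = []
    for start in range(0, len(sound_data) - frame_len + 1, shift_len):
        frames.append(sound_data[start:start + frame_len] * window)
    return np.array(frames)
```

A high-pass filter could additionally be applied to each frame, as noted above.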
Under the control of the control means 10, the sound signal processing apparatus 1 converts a sound signal based on the sound data of each frame into a spectrum by performing FFT processing (step S4). In step S4, the sound signal processing apparatus 1 finds a power spectrum by squaring the amplitude spectrum |X(ω)| obtained by performing the FFT processing on the sound signal, and calculates a logarithmic power spectrum 20 log10|X(ω)| as the logarithm of the found power spectrum. In this manner, the sound signal is converted into a logarithmic power spectrum. Note that, in step S4, it is also possible to calculate a logarithmic amplitude spectrum 10 log10|X(ω)| as the logarithm of the amplitude spectrum |X(ω)| obtained by the FFT processing, and to use the calculated logarithmic amplitude spectrum as the spectrum after conversion.
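The conversion in step S4 can be sketched as follows (an illustrative sketch; the small constant added before taking the logarithm is an assumption made to avoid the logarithm of zero and is not part of the specification):

```python
import numpy as np

def log_power_spectrum(frame, n_fft=512):
    """Convert one windowed frame into a logarithmic power spectrum.

    Since the power spectrum is |X(w)|^2, its logarithm 10*log10|X(w)|^2
    equals 20*log10|X(w)|, as stated in the text.
    """
    X = np.fft.rfft(frame, n=n_fft)        # FFT of the sound signal
    amplitude = np.abs(X)                  # amplitude spectrum |X(w)|
    return 20.0 * np.log10(amplitude + 1e-10)
```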
Under the control of the control means 10, the sound signal processing apparatus 1 converts the spectrum based on the Fourier transform of the sound signal into a cepstrum, and calculates a spectral envelope by performing inverse FFT processing on a lower-order component than a predetermined order of the converted cepstrum (step S5).
The processing in step S5 will be explained. The amplitude spectrum |X(ω)| obtained by performing FFT processing on the sound signal is expressed by Equation 1 below, using G(ω) and H(ω), which are the spectra corresponding to the higher-order cepstral component and the lower-order cepstral component, respectively.
X(ω) = G(ω)H(ω)   (Equation 1)
The logarithm of Equation 1 can be expressed by Equation 2 below.
log10|X(ω)| = log10|G(ω)| + log10|H(ω)|   (Equation 2)
A cepstrum c(τ) is obtained by performing the inverse FFT of Equation 2 with the frequency ω as the variable. The first term on the right side of Equation 2 represents the fine structure, which is the higher-order component of the spectrum, and the second term on the right side represents the spectral envelope, which is the lower-order component of the spectrum. In other words, in step S5, a spectral envelope is calculated by performing the inverse FFT on components lower than a predetermined order, such as components lower than the 10th order or 20th order, of the FFT cepstrum calculated from the FFT spectrum. Note that although there is also a method of obtaining a spectral envelope using an LPC (Linear Predictive Coding) cepstrum, that method gives an envelope with enhanced peaks, and therefore the FFT cepstrum is preferable.
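The envelope calculation of step S5 can be sketched as follows (an illustrative sketch; `cutoff_order` corresponds to the predetermined order, such as the 10th or 20th order, and the symmetric treatment of the cepstrum is an implementation detail assumed here):

```python
import numpy as np

def spectral_envelope(log_spectrum, cutoff_order=20):
    """Estimate the spectral envelope from the lower-order FFT cepstrum.

    log_spectrum : logarithmic power spectrum of one frame (real-valued).
    cutoff_order : cepstral order below which components are kept.
    """
    cepstrum = np.fft.irfft(log_spectrum)          # inverse FFT -> cepstrum c(tau)
    liftered = np.zeros_like(cepstrum)
    liftered[:cutoff_order] = cepstrum[:cutoff_order]                # lower orders
    liftered[-(cutoff_order - 1):] = cepstrum[-(cutoff_order - 1):]  # mirror part
    return np.fft.rfft(liftered).real              # FFT back -> smooth envelope
```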
The sound signal processing apparatus 1 removes the spectral envelope calculated in step S5 from the spectrum found in step S4, under the control of the control means 10 (step S6). The removal in step S6 is carried out by subtracting the values at the respective frequencies of the spectral envelope from the values at the respective frequencies of the spectrum found in step S4. By removing the spectral envelope from the spectrum in step S6, the tilt of the spectrum is removed and the spectrum becomes flat, so that the fine structure of the spectrum is found as a result of the processing. Note that, instead of removing the spectral envelope from the spectrum, it may be possible to calculate the spectral fine structure by performing the inverse FFT on the higher-order components of the FFT cepstrum not used in calculating the spectral envelope, such as components of not lower than the 11th order or 21st order.
Under the control of the control means 10, the sound signal processing apparatus 1 detects a spectral peak in the spectrum obtained by the removal of the spectral envelope (step S7), and suppresses the detected spectral peak (step S8).
In step S7, a band including a spectral peak whose value is greater than a predetermined threshold value recorded in the recording means 11 is detected as a band including a spectral peak to be suppressed. Alternatively, a band including the n largest peaks (n being a natural number) may be detected as containing the spectral peaks to be suppressed. Further, it may be possible to detect, as the spectral peaks to be suppressed, a band including at most n of the largest peaks among the spectral peaks exceeding the predetermined threshold value. Note that a value of n of around 2 to 4 is appropriate.
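The detection criteria of step S7 can be sketched as follows (an illustrative sketch combining the threshold criterion and the at-most-n criterion described above; the function name and the simple three-point definition of a local maximum are assumptions):

```python
import numpy as np

def detect_peaks_to_suppress(fine_structure, threshold, n=3):
    """Return the indices of at most n local maxima exceeding the threshold,
    largest first (a value of n of around 2 to 4 being appropriate)."""
    s = np.asarray(fine_structure, dtype=float)
    # a local maximum is greater than both of its neighbours
    is_peak = (s[1:-1] > s[:-2]) & (s[1:-1] > s[2:])
    peak_idx = np.where(is_peak)[0] + 1
    peak_idx = peak_idx[s[peak_idx] > threshold]     # threshold criterion
    order = np.argsort(s[peak_idx])[::-1]            # sort, largest peak first
    return peak_idx[order][:n]                       # at most n peaks
```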
As the method of suppressing the spectral peak in step S8, some examples are listed below. The first suppression method converts the power values equal to or higher than the threshold value in a band including the detected spectral peak into the threshold value; in other words, the power exceeding the threshold value is removed from the spectrum. It is not necessary to convert the values equal to or higher than the threshold value into the threshold value itself; they may instead be converted into a value based on the threshold value, for example, a value greater than the threshold value by a predetermined amount.
The second suppression method converts the power values equal to or higher than the spectral envelope in a peripheral band including the detected spectral peak, for example, a band with a width of several hundred hertz around the spectral peak, into the corresponding spectral envelope values.
The third suppression method converts the values in the band between the points at which the detected spectral peak crosses the spectral envelope, that is, the band in which the power forming the spectral peak rises above and then falls back below the spectral envelope, into the corresponding spectral envelope values.
The fourth suppression method suppresses a spectral peak by replacing the power values in a band including the detected spectral peak with the total value or, for example, the average value of the values in a wider band, for example, a band with a width of several hundred hertz around the spectral peak.
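The first and second suppression methods can be sketched as follows (illustrative sketches; the band representations and function names are assumptions, and the remaining two methods differ only in how the replacement values are chosen):

```python
import numpy as np

def suppress_by_threshold(spectrum, band, threshold):
    """First method: clip power at or above the threshold down to the threshold."""
    out = np.asarray(spectrum, dtype=float).copy()
    out[band] = np.minimum(out[band], threshold)
    return out

def suppress_by_envelope(spectrum, envelope, center, half_width):
    """Second method: in a band around the detected peak, replace values at or
    above the spectral envelope with the corresponding envelope values."""
    out = np.asarray(spectrum, dtype=float).copy()
    lo = max(0, center - half_width)
    hi = min(len(out), center + half_width + 1)
    out[lo:hi] = np.minimum(out[lo:hi], envelope[lo:hi])
    return out
```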
Under the control of the control means 10, the sound signal processing apparatus 1 extracts feature components such as power (obtained by integrating, along the frequency axis, the power spectrum whose spectral peak has been suppressed), pitch, and cepstrum (step S9), and determines a voice interval based on the extracted spectral power and pitch (step S10). In the determination of a voice interval in step S10, the spectral power calculated in step S9 is compared with a threshold value for voice detection recorded in the recording means 11, and, if spectral power equal to or greater than the threshold value exists and a pitch component exists, the interval is determined to be a voice interval.
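The feature extraction and voice-interval decision of steps S9 and S10 can be sketched as follows (an illustrative sketch; representing an absent pitch by `None` and the function names are assumptions):

```python
import numpy as np

def total_power(power_spectrum):
    """Step S9 (in part): integrate the peak-suppressed power spectrum
    along the frequency axis to obtain the spectral power."""
    return float(np.sum(power_spectrum))

def is_voice_interval(spectral_power, pitch, power_threshold):
    """Step S10: the frame is a voice interval when the spectral power is at
    least the voice-detection threshold and a pitch component exists."""
    return spectral_power >= power_threshold and pitch is not None
```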
Then, under the control of the control means 10, the sound signal processing apparatus 1 refers to the acoustic models recorded in the acoustic model database 11b and the recognizable vocabulary and grammar recorded in the language dictionary 11c, based on a feature vector that is a feature component extracted from the spectrum obtained by suppressing the spectral peak, and executes voice recognition processing on a frame determined to be a voice interval (step S11). The voice recognition processing in step S11 is executed by calculating the similarity with respect to the acoustic models and referring to language information about the recognizable vocabulary.
Thus, in Embodiment 1 of the present invention, it is possible to detect peaks caused by non-stationary noise having sharp peaks, such as electronic sounds and siren sounds, by removing stationary noise even in a stationary noise environment with moderate peaks, such as engine sound and air conditioner sound, and to suppress the detected peaks. It is therefore possible to prevent non-stationary noise from being misrecognized as voice. Although the spectrum of voice (a vowel) has a plurality of peaks, these peaks are removed as part of the spectral envelope because they are not sharp compared with electronic sound, and thus the peaks of a vowel are not mistakenly suppressed.
Embodiment 2 is an embodiment configured by modifying the spectral peak detection method of Embodiment 1. Since the structural example of a sound signal processing apparatus of Embodiment 2 is the same as in Embodiment 1, the explanation thereof is omitted by referring to Embodiment 1, and the structure of the sound signal processing apparatus is denoted by the same reference codes as in Embodiment 1. Moreover, since the processing performed by the sound signal processing apparatus 1 of Embodiment 2 is the same as that in Embodiment 1, the explanation thereof is omitted by referring to Embodiment 1, and the respective processes performed by the sound signal processing apparatus 1 are denoted by the same step numbers as in Embodiment 1.
As the process in step S7 of detecting a spectral peak from the spectrum obtained by removing the spectral envelope, the sound signal processing apparatus 1 of Embodiment 2 detects, as a band including a spectral peak, a band in which the ratio of the total value of the values within a band of a predetermined width to the total value of the values in all bands other than that band exceeds a predetermined threshold value. More specifically, the frequency at which the power of the spectrum has its maximum value is detected, and the total value or, for example, the average value of the power in a band of a predetermined width, such as 100 Hz, around the detected frequency is calculated.
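The detection criterion of Embodiment 2 can be sketched as follows (an illustrative sketch using average values and assuming a linear power spectrum with positive values; the band width in bins and the function name are assumptions):

```python
import numpy as np

def detect_band_embodiment2(spectrum, width_bins, ratio_threshold):
    """Take a band of a predetermined width around the maximum-power frequency
    and flag it as containing a spectral peak when the ratio of its average
    power to the average power of all other bins exceeds the threshold."""
    s = np.asarray(spectrum, dtype=float)
    center = int(np.argmax(s))
    lo = max(0, center - width_bins // 2)
    hi = min(len(s), center + width_bins // 2 + 1)
    inside = s[lo:hi].mean()
    outside = np.concatenate([s[:lo], s[hi:]]).mean()
    return (lo, hi) if inside / outside > ratio_threshold else None
```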
Embodiment 3 is an embodiment configured by modifying the spectral peak detection method of Embodiment 1. Since the structural example of a sound signal processing apparatus of Embodiment 3 is the same as in Embodiment 1, the explanation thereof is omitted by referring to Embodiment 1, and the structure of the sound signal processing apparatus 1 is denoted by the same reference codes as in Embodiment 1. Moreover, since the processing performed by the sound signal processing apparatus 1 of Embodiment 3 is the same as that in Embodiment 1, the explanation thereof is omitted by referring to Embodiment 1, and the respective processes performed by the sound signal processing apparatus 1 are denoted by the same step numbers as in Embodiment 1.
As the process in step S7 of detecting a spectral peak from the spectrum obtained by removing the spectral envelope, the sound signal processing apparatus 1 of Embodiment 3 detects, as a band including a spectral peak, a first band of a first predetermined width in which the ratio of the total value of the values within the first band to the total value of the values within a second band of a second predetermined width near the first band exceeds a predetermined threshold value. More specifically, the frequency at which the power of the spectrum has its maximum value is detected, and the total value or, for example, the average value of the power in a band of a predetermined width, such as 100 Hz, around the detected frequency is calculated.
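The criterion of Embodiment 3 can be sketched as follows (an illustrative sketch; placing the second band immediately above the first band is an assumption, since the text only states that the second band lies near the first):

```python
import numpy as np

def detect_band_embodiment3(spectrum, first_width, second_width, ratio_threshold):
    """Compare the average power of a first band around the maximum-power
    frequency with that of a neighbouring second band; the first band is
    treated as a spectral peak when the ratio exceeds the threshold."""
    s = np.asarray(spectrum, dtype=float)
    center = int(np.argmax(s))
    lo1 = max(0, center - first_width // 2)
    hi1 = min(len(s), center + first_width // 2 + 1)
    lo2, hi2 = hi1, min(len(s), hi1 + second_width)   # assumed neighbouring band
    if hi2 <= lo2:
        return None                                   # no room for a second band
    ratio = s[lo1:hi1].mean() / s[lo2:hi2].mean()
    return (lo1, hi1) if ratio > ratio_threshold else None
```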
In Embodiments 1 through 3 described above, embodiments in which voice recognition is performed after removing non-stationary noise are illustrated as the invention related to voice recognition; however, the present invention is not limited to these embodiments and may be applied to various fields related to sound processing. For example, when the present invention is applied to telecommunication, in which a sound signal based on sound acquired by a receiver device is transmitted to the other party, it may be possible to transmit the sound signal after removing non-stationary noise from it by the processing of the present invention.
As this invention may be embodied in several forms without departing from the spirit or essential characteristics thereof, the present embodiments are therefore illustrative and not restrictive, since the scope of the invention is defined by the appended claims rather than by the description preceding them, and all changes that fall within the metes and bounds of the claims, or the equivalence of such metes and bounds, are therefore intended to be embraced by the claims.
Number | Date | Country | Kind |
---|---|---|---
2006-254931 | Sep 2006 | JP | national |