Exemplary embodiments according to the present invention will be explained in detail below with reference to the accompanying drawings.
A feature-vector compensating apparatus according to a first embodiment of the present invention designs compensation vectors in advance for a plurality of noise environments and stores the compensation vectors in a storing unit. At a time of speech recognition, the apparatus calculates a degree of similarity of an input speech with respect to each of the noise environments, obtains a compensation vector by weighting and summing the compensation vectors of the noise environments based on the calculated degrees of similarity, and compensates a feature vector based on the obtained compensation vector.
The noise-environment storing unit 120 stores therein Gaussian mixture model (GMM) parameters obtained by modeling a plurality of noise environments with the GMM, and compensation vectors calculated in advance as compensation vectors for a feature vector corresponding to each of the noise environments.
According to the first embodiment, it is assumed that parameters of three noise environments including a parameter 121 of a noise environment 1, a parameter 122 of a noise environment 2, and a parameter 123 of a noise environment 3 are calculated in advance, and stored in the noise-environment storing unit 120. The number of noise environments is not limited to three, in other words, any desired number of noise environments can be taken as reference data.
The noise-environment storing unit 120 can be configured with any recording medium that is generally available, such as a hard disk drive (HDD), an optical disk, a memory card, and a random access memory (RAM).
The input receiving unit 101 converts a speech input from an input unit (not shown), such as a microphone, into an electrical signal (speech data), performs an analog-to-digital (A/D) conversion on the speech data to convert the analog data into digital data based on, for example, pulse code modulation (PCM), and outputs digital speech data. The processes performed by the input receiving unit 101 can be implemented by the same methods as conventional digital speech-signal processing.
The feature extracting unit 102 divides the speech data received from the input receiving unit 101 into a plurality of frames of a predetermined length, and extracts a feature vector of the speech. The frame length can be 10 to 20 milliseconds. According to the first embodiment, the feature extracting unit 102 extracts a feature vector of the speech that includes the static, Δ, and ΔΔ parameters of the Mel-frequency cepstrum coefficients (MFCC).
In other words, the feature extracting unit 102 calculates, as the feature vector for each of the divided frames, a 39-dimensional feature vector consisting of a 13-dimensional MFCC and its Δ and ΔΔ parameters, by applying a discrete cosine transform to the output powers of a Mel-scale filter bank analysis.
The feature vector is not limited to the above one. In other words, any parameter can be used as a feature vector as long as it represents a feature of the input speech.
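As a concrete illustration of the 39-dimensional feature described above, the following sketch stacks a 13-dimensional MFCC matrix with its Δ and ΔΔ parameters. This is a minimal example assuming the static MFCCs are already available as a NumPy array; the regression-style delta formula used here is one common convention, and the function names are illustrative only.

```python
import numpy as np

def deltas(feat, k=2):
    """Regression-style delta features over a +/-k frame window (a common convention)."""
    T, D = feat.shape
    padded = np.pad(feat, ((k, k), (0, 0)), mode="edge")   # repeat edge frames
    denom = 2 * sum(i * i for i in range(1, k + 1))
    out = np.zeros_like(feat)
    for i in range(1, k + 1):
        out += i * (padded[i + k:i + k + T] - padded[k - i:k - i + T])
    return out / denom

def stack_39(mfcc13):
    """Stack static, delta, and delta-delta MFCCs into a 39-dimensional vector per frame."""
    d1 = deltas(mfcc13)        # delta parameters
    d2 = deltas(d1)            # delta-delta (acceleration) parameters
    return np.concatenate([mfcc13, d1, d2], axis=1)
```

Any equivalent delta definition (for example, a simple two-frame difference) could be substituted without changing the rest of the processing.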
The similarity calculating unit 103 calculates a degree of similarity for each of the above three noise environments determined in advance, which indicates a certainty that an input speech is generated under each of the noise environments, based on the feature vector extracted by the feature extracting unit 102.
The compensation-vector calculating unit 104 acquires a compensation vector of each noise environment from the noise-environment storing unit 120, and calculates a compensation vector for the feature vector of the input speech by weighting and summing the acquired compensation vectors with the degree of similarity calculated by the similarity calculating unit 103 as weights.
The feature-vector compensating unit 105 compensates the feature vector of the input speech by using the compensation vector calculated by the compensation-vector calculating unit 104. The feature-vector compensating unit 105 compensates the feature vector by adding the compensation vector to the feature vector.
First of all, the input receiving unit 101 receives an input of a speech uttered by a user (step S201). The input speech is converted into a digital speech signal by the input receiving unit 101.
The feature extracting unit 102 divides the speech signal into frames of 10 milliseconds, and extracts the feature vector of each of the frames (step S202). The feature extracting unit 102 extracts the feature vector by calculating the feature vector yt of the MFCC, as described above.
The similarity calculating unit 103 calculates a degree of similarity of a speech of the frame for each of the noise environments determined in advance, based on the feature vector yt extracted by the feature extracting unit 102 (step S203). When a model of a noise environment is e, the degree of similarity is calculated as a posterior probability p(e|yt) of the noise environment e given the feature vector yt at time t as in Equation (1):

p(e|yt)=p(yt|e)p(e)/p(yt)  (1)
where p(yt|e) is a probability that the feature vector yt appears in the noise environment e, and p(e) and p(yt) are a prior probability of the noise environment e and a probability of the feature vector yt, respectively.
When it is assumed that p(yt) is independent of the noise environment, and the prior probability of each of the noise environments is the same, the posterior probability p(e|yt) can be calculated using Equation (2):
p(e|yt)=αp(yt|e) (2)
where p(yt|e) and α are calculated using Equations (3) and (4), respectively:

p(yt|e)=Σs p(s)N(yt; μse, Σse)  (3)

α=1/Σe p(yt|e)  (4)

where N denotes a Gaussian distribution, p(s) is a prior probability of each component of the GMM, and the feature vector yt is modeled by the GMM. The parameters of the GMM, the mean vector μ and the covariance matrix Σ, can be calculated by using the expectation-maximization (EM) algorithm.
The parameters of the GMM can be obtained using a Hidden Markov Model Toolkit (HTK) for a large number of feature vectors prepared in a noise environment as training data. HTK is widely used in speech recognition to train HMMs.
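The degree-of-similarity calculation of Equations (1) to (4) can be sketched as follows, assuming each noise environment is modeled by a diagonal-covariance GMM stored as a dictionary of weights, means, and variances (a hypothetical layout; an actual implementation would load these parameters from the noise-environment storing unit 120):

```python
import numpy as np

def log_gauss_diag(y, mean, var):
    """Log density of a diagonal-covariance Gaussian at y."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (y - mean) ** 2 / var)

def env_likelihood(y, gmm):
    """p(y|e) = sum_s p(s) N(y; mu_s, Sigma_s) for one environment GMM (Equation (3))."""
    return sum(w * np.exp(log_gauss_diag(y, m, v))
               for w, m, v in zip(gmm["weights"], gmm["means"], gmm["vars"]))

def env_posteriors(y, gmms):
    """Degrees of similarity p(e|y) over all environments (Equations (2) and (4))."""
    lik = np.array([env_likelihood(y, g) for g in gmms])
    return lik / lik.sum()   # alpha = 1 / sum_e p(y|e)
```

Note that the equal-prior assumption of Equation (2) is built in: the posteriors are simply the normalized per-environment likelihoods.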
The compensation-vector calculating unit 104 calculates the compensation vector rt for the feature vector of the input speech by weighting and summing the compensation vectors rse pre-calculated for each noise environment, using the degrees of similarity calculated by the similarity calculating unit 103 as weights (step S204). The compensation vector rt is calculated using Equation (5):

rt=Σe p(e|yt)rte  (5)

where rte is calculated using Equation (6):

rte=Σs p(s|yt, e)rse  (6)

Namely, the compensation vector rte of each noise environment e is calculated by weighting and summing the pre-calculated compensation vectors rse in the same manner as the conventional SPLICE method (Equation (6)). Then, the compensation vector rt for the feature vector of the input speech is calculated by weighting and summing the compensation vectors rte of the noise environments using the degrees of similarity as weights (Equation (5)).
The compensation vector rse can be calculated by the same method as the conventional SPLICE method. Given numerous pairs (xn, yn), where n is a positive integer, xn is a feature vector of clean speech data, and yn is the feature vector of the corresponding noisy speech data in each of the noise environments, the compensation vector rse can be calculated using Equation (7), where the superscript "e" representing the noise environment is omitted:

rs=Σn p(s|yn)(xn−yn)/Σn p(s|yn)  (7)

where p(s|yn) is calculated using Equation (8):

p(s|yn)=p(s)p(yn|s)/Σs′ p(s′)p(yn|s′)  (8)
The GMM parameters and the compensation vectors calculated in the above manner are stored in the noise-environment storing unit 120 in advance. Therefore, at step S204, the compensation vector rt is calculated by using the compensation vector rse of each noise environment stored in the noise-environment storing unit 120.
Finally, the feature-vector compensating unit 105 performs a compensation of the feature vector yt by adding the compensation vector rt calculated by the compensation-vector calculating unit 104 to the feature vector yt calculated at step S202 (step S205).
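The offline training of Equations (7) and (8) and the runtime compensation of Equations (5) and (6), followed by the addition of step S205, can be sketched as follows, under the same hypothetical diagonal-GMM layout (one GMM per noise environment; env_post holds the degrees of similarity p(e|yt) obtained at step S203):

```python
import numpy as np

def component_posteriors(y, gmm):
    """p(s|y) over the mixture components of one environment GMM (Equation (8))."""
    lik = np.array([w * np.exp(-0.5 * np.sum(np.log(2 * np.pi * v) + (y - m) ** 2 / v))
                    for w, m, v in zip(gmm["weights"], gmm["means"], gmm["vars"])])
    return lik / lik.sum()

def train_compensation_vectors(clean, noisy, gmm):
    """Per-component SPLICE compensation vectors r_s for one environment (Equation (7))."""
    num = np.zeros_like(gmm["means"])            # shape (S, D)
    den = np.zeros(len(gmm["weights"]))
    for x, y in zip(clean, noisy):
        p = component_posteriors(y, gmm)
        num += p[:, None] * (x - y)[None, :]     # accumulate p(s|yn)(xn - yn)
        den += p                                 # accumulate p(s|yn)
    return num / den[:, None]

def compensate(y, gmms, r_env, env_post):
    """Runtime compensation (Equations (5) and (6)) plus the addition of step S205."""
    r_t = np.zeros_like(y)
    for e, gmm in enumerate(gmms):
        r_te = component_posteriors(y, gmm) @ r_env[e]   # Equation (6)
        r_t += env_post[e] * r_te                        # Equation (5)
    return y + r_t                                       # step S205
```

With a single environment whose noisy data is the clean data shifted by a constant offset, the trained r_s recovers that offset and the compensation restores the clean features, which matches the intent of Equation (7).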
The feature vector compensated in the above manner is output to a speech recognizing apparatus. The speech processing using the feature vector is not limited to speech recognition; the method according to the present embodiment can be applied to any kind of processing, such as speaker recognition.
In this manner, the feature-vector compensating apparatus 100 approximates an unseen noise environment with a linear combination of a plurality of noise environments. Therefore, the feature vector can be compensated with high precision even when the noise environment at the time of performing the speech recognition does not match any noise environment assumed at the time of making a design, which makes it possible to achieve high speech-recognition performance using the compensated feature vector.
In the conventional feature-vector compensating method, in which only one noise environment is selected for each frame of an input speech signal, speech-recognition performance degrades greatly when there is an error in selecting the noise environment. In contrast, the feature-vector compensating method according to the present embodiment linearly combines a plurality of noise environments based on the degrees of similarity instead of selecting only one noise environment; therefore, even if there is an error in calculating the degree of similarity for some reason, its influence on the calculation of the compensation vector is small, and as a result, the performance degrades less.
According to the first embodiment, a degree of similarity of a noise environment at each time t is obtained from a feature vector yt at the time t alone; however, a feature-vector compensating apparatus according to a second embodiment of the present invention calculates the degree of similarity by using a plurality of feature vectors at times before and after the time t together.
According to the second embodiment, the function of the similarity calculating unit 303 is different from that of the similarity calculating unit 103 according to the first embodiment. Other units and functions are the same as those of the feature-vector compensating apparatus 100 according to the first embodiment shown in the accompanying drawings.
The similarity calculating unit 303 calculates the degree of similarity by using feature vectors in a time window of plural frames.
Steps S401 and S402 are performed in the same way as steps S201 and S202 performed by the feature-vector compensating apparatus 100, so that a detailed explanation thereof is omitted.
After extracting the feature vectors at step S402, the similarity calculating unit 303 calculates, for each of the noise environments, a probability that the extracted feature vectors appear in that noise environment (appearance probability) (step S403).
Subsequently, the similarity calculating unit 303 calculates the degree of similarity of the frame at the time t by using a value obtained as a weighted product of the appearance probabilities calculated for the frames at the respective times (step S404). In other words, the similarity calculating unit 303 calculates the degree of similarity p(e|yt−a:t+b) by using Equation (9), where a and b are positive integers, and yt−a:t+b is a feature-vector series from a time t−a to a time t+b.
p(e|yt−a:t+b)=αp(yt−a:t+b|e) (9)
where p(yt−a:t+b|e) and α in Equation (9) are calculated using Equations (10) and (11), respectively:

p(yt−a:t+b|e)=Πτ=−a, . . . , b {p(yt+τ|e)}^w(τ)  (10)

α=1/Σe p(yt−a:t+b|e)  (11)

where w(τ) is a weight for each time t+τ. The value of w(τ) can be set, for example, to w(τ)=1 for all values of τ, or can be set to decrease as the absolute value of τ increases. Then, the compensation vector rt can be obtained, in the same way as in Equation (5), using the degree of similarity p(e|yt−a:t+b) calculated in the above manner.
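The windowed similarity of Equations (9) to (11) can be sketched as follows; working in the log domain turns the weighted product of Equation (10) into a weighted sum, which avoids numerical underflow over long windows. The env_loglik callable, which returns log p(y|e) for all environments at once, is an assumed helper:

```python
import numpy as np

def windowed_posteriors(feats, t, a, b, env_loglik, weights):
    """Degree of similarity p(e | y_{t-a:t+b}) per Equations (9)-(11).

    feats: (T, D) feature-vector series; weights: array of w(tau) for tau = -a..b.
    The log of Equation (10) is a weighted sum of per-frame log-likelihoods.
    """
    logp = np.zeros(env_loglik(feats[t]).shape[0])   # one entry per environment
    for tau in range(-a, b + 1):
        logp += weights[tau + a] * env_loglik(feats[t + tau])  # Equation (10), in logs
    logp -= logp.max()          # stabilize before exponentiating
    p = np.exp(logp)
    return p / p.sum()          # normalization alpha of Equation (11)
```

Setting a = b = 0 with w(0) = 1 reduces this to the per-frame similarity of the first embodiment.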
Namely, the compensation-vector calculating unit 104 calculates the compensation vector rt, in the same way as step S204 of the first embodiment, using the degree of similarity calculated at step S404 (step S405).
The feature-vector compensating unit 105 compensates the feature vector yt by using the compensation vector rt, in the same way as step S205 of the first embodiment (step S406), and the process of compensating the feature vector is completed.
In this manner, in the feature-vector compensating apparatus according to the second embodiment, the degree of similarity can be calculated by using a plurality of feature vectors; and therefore, it is possible to suppress an abrupt change of a compensation vector, and to calculate a feature vector with a high precision. For this reason, it is possible to achieve a high speech-recognition performance using the feature vector.
The feature-vector compensating apparatus includes a control device such as a central processing unit (CPU) 51, a storage device such as a read only memory (ROM) 52 and a random access memory (RAM) 53, a communication interface (I/F) 54 for performing a communication via a network, and a bus 61 that connects the above components.
A computer program (hereinafter, "feature-vector compensating program") executed in the feature-vector compensating apparatus is provided pre-installed in a storage device such as the ROM 52.
Alternatively, the feature-vector compensating program can be provided as a file in an installable or executable format stored in a computer-readable recording medium, such as a compact disc read-only memory (CD-ROM), a flexible disk (FD), a compact disc recordable (CD-R), or a digital versatile disc (DVD).
As another alternative, the feature-vector compensating program can be stored in a computer that is connected to a network such as the Internet, so that the program can be downloaded through the network. As still another alternative, the feature-vector compensating program can be provided or distributed through the network such as the Internet.
The feature-vector compensating program has a module structure including the above function units (the input receiving unit, the feature extracting unit, the similarity calculating unit, the compensation-vector calculating unit, and the feature-vector compensating unit). As actual hardware, the CPU 51 reads the feature-vector compensating program out of the ROM 52 and executes it, whereby the above function units are loaded onto and generated on a main memory of the computer.
Additional advantages and modifications will readily occur to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details and representative embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit or scope of the general inventive concept as defined by the appended claims and their equivalents.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2006-105091 | Apr 2006 | JP | national |