The present invention claims priority of Korean Patent Application No. 10-2008-0124372, filed on Dec. 09, 2008, which is incorporated herein by reference.
The present invention relates to a speech recognition system based on a microphone array and, more particularly, to an apparatus and method for high-performance speech recognition based on sound source separation and sound source identification, wherein source signals are separated from mixed sound signals using independent component analysis (hereinafter, referred to as “ICA”).
Speech recognition extracts linguistic information from a user's speech signal and converts the extracted information into character strings. The recognition rate is high in a relatively quiet environment. However, speech recognition systems are mounted in computers, robots, and mobile terminals, and may be used in various environments such as a living room, exhibit hall, laboratory, public place, and the like, in which various types of noise are present. Noise is one of the major factors that lower the performance of a speech recognition system, and many noise handling techniques have been developed to suppress it.
Recently, techniques that handle noise using input from two or more microphones have been introduced. Among these, a beamforming technique, which strengthens a user's speech signal coming from a given direction while attenuating noise signals coming from other directions, and independent component analysis (ICA), which separates original sounds from mixed sound signals by a statistical learning algorithm, are well known in the art.
In an apparatus receiving speech, such as a speech recognizer or a wired/wireless phone, ICA can be applied to effectively remove or suppress noise and interfering signals generated from noise sources such as nearby speakers, televisions, audio units, and the like; however, the noise that can be removed or suppressed may be limited to point noise sources rather than diffuse noise sources. Mixed sound signals formed of plural sound sources are reasonably well separated into the original sound signals by ICA; however, the separated sound sources are difficult to identify.
In other words, the conventional speech recognition techniques employing ICA can separate source signals from the mixed sound signals, but cannot identify each of the separated sound signals through the use of a speech recognizer. That is, it is necessary to accurately identify the sound signal of a particular user among the separated sound signals, but conventional techniques do not provide a solution in this respect.
Therefore, the present invention provides an apparatus and method for high-performance speech recognition based on sound signal separation and sound signal identification, wherein sound sources are separated by using ICA.
The present invention further provides an apparatus and method for speech recognition based on sound signal separation and sound signal identification, wherein sound signals input to microphones are separated by using ICA, and which is capable of automatically identifying the user's speech to be recognized from among the separated sound signals.
In accordance with an aspect of the present invention, there is provided an apparatus for speech recognition based on source separation and identification, including: a sound source separator for separating mixed signals, which are input to two or more microphones, into sound source signals by using independent component analysis (ICA), and estimating direction information of the separated sound source signals; a speech recognizer for calculating normalized log likelihood probabilities of the separated sound source signals; and a speech signal identifier for identifying a sound source corresponding to the user's speech signal by using the estimated direction information and reliability scores of the separated sound source signals based on the normalized log likelihood probabilities.
In accordance with another aspect of the present invention, there is provided a method for speech recognition based on source separation and source identification, including: separating mixed signals, which are input to two or more microphones, into source signals by using independent component analysis (ICA), and estimating direction information (direction of arrival, DOA) of the separated sound source signals; calculating normalized log likelihood probabilities of the separated sound source signals; and identifying a sound source corresponding to a user's speech signal using the estimated direction information and the reliability scores based on the normalized log likelihood probabilities.
In accordance with the present invention, a speech recognizer can be used without significant performance degradation even in an environment such as a living room or exhibit hall where multiple point noise sources are present, enabling development of diverse application systems based on speech recognition.
In addition, by virtue of the source identification functionality of the present invention, the user can speak from any location, without restrictions such as having to speak in front of the speech recognizer or from a given direction, significantly enhancing user convenience.
The objects and features of the present invention will become apparent from the following description of embodiments given in conjunction with the accompanying drawings, in which:
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
As shown in
M microphones are arranged at regular intervals in the speech recognition apparatus, and the M mixed sound signals input through the microphones are denoted by x1(t), . . . , xM(t) 102. If the impulse response of the acoustic propagation path from a sound source n to a microphone m is denoted by hmn(l), and the signal of sound source n by sn(t), Equation 1 below holds:

xm(t)=Σn Σl hmn(l)sn(t−l) [Equation 1]
The ICA-DOA estimator 104 serves as a sound source separator, which separates source signals from the signals xm(t) input to the microphones to obtain separated signals yn(t) using Equation 2 below:

yn(t)=Σm Σl wnm(l)xm(t−l) [Equation 2]

In this regard, ICA is a representative approach for obtaining wnm(l), corresponding to the inverse of hmn(l).
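As an illustration only (not part of the original disclosure), the convolutive mixing of Equation 1 can be sketched as a toy Python implementation; the separation of Equation 2 has the same form with wnm(l) and xm(t) in place of hmn(l) and sn(t):

```python
def mix(sources, h):
    """Equation 1: convolutive mixing. sources[n] holds source n's
    samples s_n(t); h[m][n] is the impulse response h_mn(l) from
    source n to microphone m. Returns the microphone signals
    x_m(t) = sum_n sum_l h_mn(l) * s_n(t - l)."""
    M, N, T = len(h), len(sources), len(sources[0])
    x = [[0.0] * T for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for l, tap in enumerate(h[m][n]):
                for t in range(l, T):
                    x[m][t] += tap * sources[n][t - l]
    return x
```

A real system would perform this convolution per block in the frequency domain; the nested loops above merely make the double summation of Equation 1 explicit.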
Equations 1 and 2 can be respectively converted into frequency domain representations through Fast Fourier Transform (FFT) as shown in Equation 3.
Xm(f,t)=Hmn(f)Sn(f,t), Yn(f,t)=Wnm(f)Xm(f,t) [Equation 3]
That is, in frequency-domain ICA, the microphone input signals xm(t) in the time domain are converted into the frequency domain, and the unmixing matrix Wnm(f) is obtained by repeatedly executing the learning rule given by Equation 4, starting from an initial value.
Wnm(f)←Wnm(f)+ΔWnm(f),
ΔWnm(f)=μ·(I−E[Φ(Yn)YnH])·Wnm(f) [Equation 4]
After calculating the separated signals Yn(f,t) in the frequency domain by using Equation 3 on the basis of the learned unmixing matrix Wnm(f), the separated sound signals yn(t) in the time domain are finally obtained through the inverse Fourier transform.
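One iteration of the natural-gradient learning rule of Equation 4 at a single frequency bin can be sketched as follows. This is an illustrative Python sketch, not part of the original disclosure; the score function Φ(y)=y/|y| is a common choice for complex-valued frequency-domain ICA that is assumed here, since the text does not specify Φ:

```python
def phi(y):
    # Score function Phi(.); y / |y| is a common choice for complex
    # signals (an assumption here, the text does not specify Phi).
    return y / (abs(y) + 1e-12)

def ica_update(W, Y_frames, mu=0.1):
    """One iteration of Equation 4 at one frequency bin:
        W <- W + mu * (I - E[Phi(Y) Y^H]) * W
    W is an N x N unmixing matrix (lists of complex numbers);
    Y_frames is a list of length-N separated output vectors Y(f, t)
    over frames t."""
    N, T = len(W), len(Y_frames)
    # E[Phi(Y) Y^H], averaged over the T frames
    R = [[sum(phi(Y[i]) * Y[j].conjugate() for Y in Y_frames) / T
          for j in range(N)] for i in range(N)]
    # (I - R) @ W, then the scaled natural-gradient step
    IR = [[(1 if i == j else 0) - R[i][j] for j in range(N)]
          for i in range(N)]
    dW = [[sum(IR[i][k] * W[k][j] for k in range(N)) for j in range(N)]
          for i in range(N)]
    return [[W[i][j] + mu * dW[i][j] for j in range(N)] for i in range(N)]
```

In practice this update runs independently for every frequency bin until convergence, after which Yn(f,t)=Wnm(f)Xm(f,t) is inverse-transformed to the time domain.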
Although separated sound signals y1(t), . . . , yn(t) can be obtained with the ICA, their corresponding original sound sources are not known. Hence, it is necessary for the speech recognition apparatus to automatically identify the speech signal of the user among these separated sound signals yn(t).
To calculate the directions of arrival (DOA) of the sound sources, the frequency response matrix (or mixing matrix) Hmn(f) is first obtained from the learned unmixing matrix Wnm(f) by Hmn(f)=Wnm−1(f). Herein, the separated sound signals may be exchanged in order (permutation problem) and scaled in amplitude (scaling problem) due to the characteristics of ICA, and the response matrix Hmn(f) can be represented as Hmn(f)=Amn·exp(jφn)exp(j2πfc−1dmcosθn,f), where Amn and exp(jφn) respectively denote the amplitude attenuation and phase modulation from the original sound signals.
The ratio between two frequency response matrices Hmn(f) and Hm′n(f) can be calculated by Equation 5 below.
Hmn(f)/Hm′n(f)=(Amn/Am′n)·exp(j2πfc−1(dm−dm′)cosθn,f) [Equation 5]
As Equation 5 indicates a frequency response ratio with respect to an identical sound source n, Amn/Am′n≈1. Therefore, DOA θn,f of the separated signal yn(t) at a frequency f can be calculated by using Equation 6 below.
θn,f=cos−1(c·arg(Hmn(f)/Hm′n(f))/(2πf(dm−dm′))) [Equation 6]

In Equation 6, the constant c denotes the speed of sound (340 m/s).
That is, for two sound sources, the values of DOA(1) θ1,f and DOA(2) θ2,f at different frequencies are plotted as circles 200 or crosses 202. θ1,f and θ2,f can have slightly different values at different frequencies, and tend to be less accurate at very low or very high frequencies. Thus, it is preferable to calculate the direction DOA(n) of the separated signal yn(t) by averaging the values of θn,f over all frequencies, or over an interval [f1, f2] in which the values are highly reliable, as shown in Equation 7:

DOA(n)=(1/K)·Σf∈[f1,f2] θn,f [Equation 7]

where K denotes the number of frequencies averaged.
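The DOA recovery of Equations 5 through 7 can be sketched in Python as follows; this is an illustrative sketch with hypothetical function names, not part of the original disclosure:

```python
import cmath
import math

C = 340.0  # speed of sound in m/s, as given in the text

def doa_per_freq(H_mn, H_m2n, f, d):
    """Equation 6 (as derived from Equation 5): recover theta_{n,f} in
    degrees from the phase of the ratio of two mixing-matrix entries
    for the same source n; d is the microphone spacing d_m - d_m'."""
    cos_theta = cmath.phase(H_mn / H_m2n) * C / (2.0 * math.pi * f * d)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard numerical drift
    return math.degrees(math.acos(cos_theta))

def doa_average(thetas):
    """Equation 7: average theta_{n,f} over a reliable band [f1, f2]."""
    return sum(thetas) / len(thetas)
```

Note that the phase wraps beyond the spatial aliasing frequency, which is one reason the per-frequency estimates degrade at high frequencies and are averaged only over a reliable band.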
As described above, the directions DOA(n) of the separated sound signals y1(t), . . . , yN(t) can be obtained through the ICA-DOA estimator 104. Thereafter, in order to calculate a speech recognition reliability of the separated sound signals, the speech recognizer 108 calculates a k-dimensional feature vector for each of the separated sound signals y1(t), . . . , yN(t) over a preset interval (e.g., a 20 ms window for every 10 ms). When the feature vector sequences extracted respectively from the separated sound signals y1(t), . . . , yN(t) are denoted by Z1, . . . , ZN, and a search network formed with a set of hidden Markov models (HMMs) as a probabilistic model for speech recognition is denoted by λ, the normalized log likelihood probability ln of the separated sound signal yn(t) can be calculated by Equation 8 below.
ln=max log(Pr(Zn|λ))/T [Equation 8]
As the log likelihood probability accumulates with increasing speech length, it is divided by the number of frames T in the entire signal interval for normalization. If one of the separated sound signals y1(t), . . . , yN(t) corresponds to the user's speech, the corresponding separated sound signal is highly likely to yield the highest probability through the operation of the HMM search network. Thus, if lk is the maximum among the calculated normalized log likelihood probabilities l1, . . . , lN, the kth separated signal yk(t) is considered to be the user's speech.
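The normalization of Equation 8 and the maximum selection can be sketched as follows; this is an illustrative sketch assuming the per-frame log-scores of the best HMM path are already available from the search network, which the text does not detail:

```python
def normalized_log_likelihood(frame_log_probs):
    """Equation 8: the accumulated log likelihood of the best HMM
    path, divided by the number of frames T. The per-frame scores in
    frame_log_probs are assumed precomputed (hypothetical input)."""
    return sum(frame_log_probs) / len(frame_log_probs)

def pick_user_source(lls):
    """Index k of the separated signal with the highest normalized
    log likelihood, l_k = max{l_1, ..., l_N}."""
    return max(range(len(lls)), key=lambda n: lls[n])
```

Without dividing by T, a longer utterance would accumulate a larger (more negative) log likelihood, making scores across signals of different effective lengths incomparable.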
However, in reality, signals separated by ICA do not contain only the original sound source, and may still include other sound sources or interfering speech signals from nearby speakers. Therefore, the kth separated signal yk(t) having the maximum log likelihood probability lk can be a sound signal other than the user's speech signal.
Therefore, the present embodiment additionally utilizes reliability information regarding the separated signal yk(t) presumed to be the user's speech signal with the maximum log likelihood probability lk. The reliability is defined as the difference between the highest value lk and the second highest value lsecond of the obtained log likelihood probabilities l1, . . . , lN, that is, c(k)=|lk−lsecond|. In this case, when the difference between lk and lsecond is higher than a specific threshold value, yk(t) is considered to be the user's speech signal.
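The reliability test can be sketched as follows; an illustrative sketch only, and the threshold value is a tuning parameter not given in the text:

```python
def reliability(lls):
    """c(k) = |l_k - l_second|: gap between the best and second-best
    normalized log likelihoods of the separated signals."""
    ranked = sorted(lls, reverse=True)
    return abs(ranked[0] - ranked[1])

def is_user_speech(lls, threshold):
    # The top-scoring source is accepted as the user's speech only
    # when the gap to the runner-up exceeds the threshold.
    return reliability(lls) > threshold
```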
In
As described above, the separated signals y1(t), . . . , yN(t) and their direction information DOA(1), . . . , DOA(N) (106) are derived from the input signals x1(t), . . . , xM(t) 102 by the ICA-DOA estimator 104; the normalized log likelihood probabilities l1, . . . , lN are calculated by the speech recognizer 108; and the reliability c(k) of the maximum normalized log likelihood probability lk (lk=max{l1, . . . , lN}) is calculated by the speech signal identifier 112.
In addition, in accordance with the embodiment of the present invention, the positions of the noise sources other than the user's speech are assumed to be fixed. Based on this assumption, the present invention further enhances the performance of user speech identification.
Referring to
At step 406, DOA(j) of each of the N−1 noise sources excluding the sound source k is compared with the reference DOA values stored in the reference DOA storage 408, and the reference DOA value closest to DOA(j) is found and updated. If the closest reference DOA is ref_DOA(r), ref_DOA(r) can be updated by ref_DOA(r)←(1−ρ)·ref_DOA(r)+ρ·DOA(j) (0≦ρ≦1). The reason for updating the reference DOAs is that, even though the positions of the noise sources are assumed to be fixed, the estimated DOA values are slightly different each time they are calculated. The initial value of ref_DOA(r) is obtained by setting ρ=1, and thereafter the update continues with another predetermined ρ value as above.
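The reference DOA update of step 406 amounts to an exponential moving average over the stored noise directions; a minimal sketch under the stated assumptions (illustrative names, not part of the original disclosure):

```python
def update_ref_doa(ref_doas, doa, rho=0.1):
    """Step 406: blend a fresh DOA estimate into the closest stored
    reference, ref_DOA(r) <- (1 - rho) * ref_DOA(r) + rho * DOA(j).
    Setting rho = 1 on the first observation initializes the
    reference. Returns the index r of the updated entry."""
    r = min(range(len(ref_doas)), key=lambda i: abs(ref_doas[i] - doa))
    ref_doas[r] = (1.0 - rho) * ref_doas[r] + rho * doa
    return r
```

A small ρ makes the stored references track slow estimation jitter while staying anchored to the fixed noise positions.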
Meanwhile, if at step 402 the reliability c(k) is less than the threshold value θ, sound source identification is performed at step 410 by using DOA(k) for the sound source k with the highest output probability and DOA(s) for the sound source s with the second highest output probability. That is, among the reference DOA values for the N−1 noise sources stored in the reference DOA storage 408, the one closest to DOA(k) is found, and the difference DOA_diff(k) between DOA(k) and the found reference DOA value is calculated. The difference DOA_diff(s) is calculated similarly. Then, of DOA_diff(k) and DOA_diff(s), the source with the larger difference is determined to be the user's speech and the other is determined to be a noise source. Finally, at step 412, according to the result of the source identification, one or more words Ws from the source s are recognized as the user's speech if the source k is determined to be a noise source, and one or more words Wk are recognized as the user's speech if the source s is determined to be a noise source.
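The low-reliability identification of step 410 can be sketched as follows (an illustrative sketch with hypothetical names, not part of the original disclosure):

```python
def identify_user(ref_doas, doa_k, doa_s):
    """Step 410 (low-reliability path): since the noise positions are
    assumed fixed, the candidate whose DOA deviates more from every
    stored noise reference is taken as the user's speech.
    Returns 'k' or 's'."""
    doa_diff = lambda d: min(abs(d - r) for r in ref_doas)
    return 'k' if doa_diff(doa_k) > doa_diff(doa_s) else 's'
```

The intuition is that a source lying close to a stored noise direction is probably that noise source, so the candidate far from all stored directions is kept as the user.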
As described above, the present invention performs separation of sound source signals by using ICA for high-performance speech recognition. Herein, the input signals to the microphones are separated using ICA, and the user's speech signal is automatically identified from among the separated source signals.
As described above, speech recognition based on source separation and source identification is a speech recognition technique that is robust to noise. Sound source separation can be successfully performed in a noisy environment on the basis of two or more microphones and ICA, and thus may be applied to diverse fields related to wireless headsets, hearing aids, mobile phones, speech recognizers, and medical image analysis.
While the invention has been shown and described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the following claims.
Number | Date | Country | Kind
---|---|---|---
10-2008-0124371 | Sep 2008 | KR | national