Priority is claimed on Japanese Patent Application No. 2013-200391, filed on Sep. 26, 2013, the content of which is incorporated herein by reference.
1. Field of the Invention
The present invention relates to a speech processing apparatus, a speech processing method, and a speech processing program.
2. Description of Related Art
A sound emitted in a room is repeatedly reflected by walls or installed objects to generate reverberations. When reverberations are added, the frequency characteristics are changed from those of the original speech, and thus the speech recognition rate in a speech recognition apparatus performing speech recognition may be lowered. In addition, since a previously-uttered speech is added to a currently-uttered speech in the speech recognition apparatus, the articulation rate may decrease. Therefore, reverberation reducing techniques of reducing reverberation components from a speech recorded under reverberant environments have been developed.
For example, Japanese Patent No. 4396449 (Patent Document 1) describes a reverberation removing method of acquiring a transfer function of a reverberation space using an impulse response of a feedback path, which is adaptively identified by an inverse filter processing unit, and reconstructing a sound source signal by dividing a reverberation speech signal by the magnitude of the transfer function. In the reverberation removing method described in Patent Document 1, the impulse response indicating the reverberation characteristic is estimated. Here, since the reverberation time ranges from 0.2 seconds to 2.0 seconds which is relatively long, the computational load excessively increases and a processing delay becomes marked. Accordingly, application thereof to speech recognition has not been widely spread.
H-G. Hirsch, Harald Finster, A New Approach for the Adaptation of HMMs to Reverberation and Background Noise, Speech Communication, Elsevier, 2008, 244-263 (Non-patent Document 5) describes a method of preparing a plurality of acoustic models obtained under reverberation environments having different reverberation times in advance and searching for an acoustic model having the highest likelihood in an environment in which a speech is recorded. The reverberation time is a time until reverberation intensity relative to a maximum value is attenuated to a predetermined intensity. In the method described in Non-patent Document 5, speech recognition is performed using the searched acoustic model.
However, the technique described in Non-patent Document 5 does not consider a case in which the direction of a speaker is changed with respect to the speech recognition apparatus. Accordingly, when the direction of a speaker is changed, there is a problem in that the reverberation reduction performance decreases and thus the speech recognition accuracy decreases.
The invention is made in consideration of the above-mentioned circumstances and an object thereof is to provide a speech processing apparatus, a speech processing method, and a speech processing program which can realize reverberation reduction for improving speech recognition accuracy even when the direction of a sound source is changed.
(1) In order to achieve the above-mentioned object, according to an aspect of the present invention, there is provided a speech processing apparatus including: a sound collecting unit configured to collect sound signals; a sound source direction estimating unit configured to estimate a direction of a sound source of each sound signal collected by the sound collecting unit; a reverberation reducing filter calculating unit configured to calculate a reverberation reducing filter to be applied to the sound signals collected by the sound collecting unit; and a reduction processing unit configured to apply the reverberation reducing filter calculated by the reverberation reducing filter calculating unit to the sound signals, wherein the reverberation reducing filter calculating unit calculates the reverberation reducing filter to be applied based on the directions of the sound sources estimated by the sound source direction estimating unit.
(2) As another aspect of the invention, in the speech processing apparatus according to (1), the reverberation reducing filter calculating unit may calculate the reverberation reducing filter using an extension filter which is generated using a late reflection component of the sound signal and a response of the late reflection component of the direction of each of the sound sources.
(3) As another aspect of the invention, in the speech processing apparatus according to (1) or (2), the sound source direction estimating unit may estimate the direction of the sound source using a feature vector of the single sound signal collected by the sound collecting unit and a probability model of the direction of each of the sound sources.
(4) As another aspect of the invention, the speech processing apparatus according to any one of (1) to (3) may further include a sound source separating unit configured to separate a full reverberant signal and a late reflection component from the sound signals collected by the sound collecting unit, and the reverberation reducing filter calculating unit may calculate the reverberation reducing filter using an extension filter which is generated using the late reflection component separated by the sound source separating unit and a response of the late reflection component of the direction of each of the sound sources.
(5) As another aspect of the invention, in the speech processing apparatus according to (4), the reduction processing unit may reduce the late reflection component from the full reverberant signal separated by the sound source separating unit by applying the reverberation reducing filter calculated by the reverberation reducing filter calculating unit to the full reverberant signal.
(6) As another aspect of the invention, the speech processing apparatus according to (4) or (5) may further include: a first sound signal processing unit configured to calculate a first feature vector of the sound signals collected by the sound collecting unit based on a first room transfer function; and a second sound signal processing unit configured to calculate a second feature vector of the sound signals collected by the sound collecting unit based on a second room transfer function, the sound source separating unit may include a first sound source separating unit configured to separate the full reverberant signal based on the first feature vector calculated by the first sound signal processing unit and a second sound source separating unit configured to separate the late reflection component based on the second feature vector calculated by the second sound signal processing unit, and the reduction processing unit may reduce the late reflection component separated by the second sound source separating unit from the full reverberant signal separated by the first sound source separating unit by applying the reverberation reducing filter calculated by the reverberation reducing filter calculating unit to the full reverberant signal.
(7) As another aspect of the invention, in the speech processing apparatus according to any one of (1) to (6), the sound source direction estimating unit may estimate the directions of the sound sources based on at least one of an image captured by an imaging unit and detection results of azimuth sensors attached to the vicinities of the sound sources.
(8) According to still another aspect of the invention, there is provided a speech processing method including: a sound collecting step of collecting sound signals; a sound source direction estimating step of estimating a direction of a sound source of each sound signal collected in the sound collecting step; a reverberation reducing filter calculating step of calculating a reverberation reducing filter to be applied to the sound signals collected in the sound collecting step based on the directions of the sound sources estimated in the sound source direction estimating step; and a reduction step of applying the reverberation reducing filter calculated in the reverberation reducing filter calculating step to the sound signals.
(9) According to still another aspect of the invention, there is provided a non-transitory computer-readable recording medium having recorded thereon a speech processing program causing a computer of a speech processing apparatus to perform: a sound collecting procedure of collecting sound signals; a sound source direction estimating procedure of estimating a direction of a sound source of each sound signal collected in the sound collecting procedure; a reverberation reducing filter calculating procedure of calculating a reverberation reducing filter to be applied to the sound signals collected in the sound collecting procedure based on the directions of the sound sources estimated in the sound source direction estimating procedure; and a reduction procedure of applying the reverberation reducing filter calculated in the reverberation reducing filter calculating procedure to the sound signals.
According to the configurations of (1), (8), or (9), it is possible to reduce reverberations by applying the reverberation reducing filter calculated depending on the directions of the sound sources emitting the sound signals to the sound signals. Accordingly, it is possible to achieve the reduction of reverberation to improve speech recognition accuracy even when the direction of a sound source is changed.
According to the configuration of (2), since the reverberation reducing filter is calculated using the extension filter, it is possible to perform the reverberation reduction with a small computational load.
According to the configuration of (3), since the directions of the sound sources can be estimated using a single sound signal collected by the sound collecting unit, it is possible to estimate the directions of the sound sources with a small computational load.
According to the configuration of (4), since the directions of the sound sources can be estimated using a plurality of sound signals collected by the sound collecting unit and the reverberation reduction can be performed by applying the reverberation reducing filter calculated depending on the estimated directions of the sound sources to the sound signals, it is possible to achieve the reverberation reduction to improve speech recognition accuracy.
According to the configuration of (5), since the late reflection component can be reduced using the reverberation reducing filter, it is possible to perform the reverberation reduction with a small computational load.
According to the configuration of (6), since the late reflection component separated by the second sound source separating unit can be reduced from the full reverberant signal separated by the first sound source separating unit, it is possible to perform the reverberation reduction with a small computational load.
According to the configuration of (7), since the directions of the sound sources can be estimated depending on the captured image or the detection results of the azimuth sensors, it is possible to estimate the direction of a sound source with a small computational load.
First, the invention will be described in brief.
A speech processing apparatus according to the invention separates a collected sound signal into a full reverberant signal and a late reflection signal. Then, the speech processing apparatus according to the invention estimates the direction of a speaker (sound source) with respect to the apparatus based on a late reflection signal and calculates a reverberation reducing filter to be applied to the sound signal based on the estimated direction of a sound source. Then, the speech processing apparatus according to the invention corrects the separated late reflection signal using the reverberation reducing filter. Then, the speech processing apparatus according to the invention performs a reduction process on the full reverberant signal based on the corrected late reflection signal. As a result, the speech processing apparatus according to the invention can achieve a reverberation reduction to improve speech recognition accuracy even when the direction of a sound source is changed.
The sound source may be a speaker having directivity or the like.
Hereinafter, an embodiment of the invention will be described with reference to the accompanying drawings.
In this arrangement example, a speaker Sp is located at a position separated by a distance d from the center of the sound collecting unit 12 in a room Rm as a reverberation environment. The direction (azimuth) of the speaker Sp (sound source) with respect to the sound collecting unit 12 is defined, for example, as θ1, . . . , θg, . . . , θG in a counterclockwise direction. The room Rm has an inner wall that reflects arriving sound waves. The sound collecting unit 12 collects a speech l(ω) directly arriving from the speaker Sp as a sound source and a speech e(ω) reflected by the inner wall. Here, ω represents a frequency.
The direction of the speaker Sp (sound source) is not limited to an azimuth on the horizontal plane but includes an azimuth in the vertical direction. The azimuth in the vertical direction includes, for example, the ceiling side (upper side), the bottom side (lower side), and the like of the room Rm.
The speech directly arriving from the sound source and the reflected speech are referred to as a direct sound and a reflection, respectively. A section in which the elapsed time after the direct sound is uttered is shorter than a predetermined time (for example, equal to or less than about 30 ms), the number of reflection times is relatively small, and reflection patterns are distinguished from each other in the reflection is referred to as an early reflection. A section in which the elapsed time is longer than that of the early reflection, the number of reflection times is relatively larger, and reflection patterns are not distinguished from each other in the reflection is referred to as a late reflection, a late reverberation, or simply a reverberation. In general, the time used to distinguish the early reflection and the late reflection varies depending on the size of the room Rm, but for example, a frame length as a process unit in speech recognition corresponds to the time. This is because the direct sound processed in a previous frame and the late reflection subsequent to the early reflection have an influence on the processing of a current frame.
When a reverberation is added, the frequency characteristic is changed from the original speech. Accordingly, in a speech recognition apparatus that recognizes a speech, the speech recognition rate may decrease. In the speech recognition apparatus, since a previously-uttered speech overlaps with a currently-uttered speech, articulation rate may decrease. Accordingly, in this embodiment, it is possible to improve the speech recognition rate by reducing the late reflection signal.
In general, the closer a sound source is to the sound collecting unit 12 (the smaller the distance d), the larger the share of the direct sound from the sound source and the smaller the ratio of reverberations. In the description below, a speech collected by the sound collecting unit 12 that includes no reverberation component, or only a reverberation component small enough to ignore, is referred to as a clean speech.
The sound collecting unit 12 collects sound signals of one or multiple (N, where N is an integer greater than 0) channels and transmits the collected sound signals of N channels to the speech processing apparatus 11. N microphones are arranged at different positions in the sound collecting unit 12. The sound collecting unit 12 includes, for example, microphones that receive sound waves of a specific frequency band (for example, 200 Hz to 4 kHz). The sound collecting unit 12 may transmit the collected sound signals of N channels in a wireless manner or a wired manner. When N is greater than 1, the sound signals only have to be synchronized with each other between the channels at the time of transmission. The sound collecting unit 12 may be fixed or may be installed in a moving object such as a vehicle, an aircraft, or a robot so as to be movable.
The speech processing apparatus 11 stores room transfer functions (RTF) A(ω) depending on the direction of the speaker Sp. The speech processing apparatus 11 separates the collected speeches into a full reverberant signal and a late reflection signal based on the stored room transfer functions. The speech processing apparatus 11 estimates the direction of the speaker Sp based on the separated late reflection signal. The speech processing apparatus 11 calculates characteristics of a noise reducing filter based on the estimated direction of the speaker Sp and the separated late reflection signal. The speech processing apparatus 11 performs a reverberation reducing process of reducing the reverberation of the separated full reverberant signal based on the calculated characteristics of the noise reducing filter. The speech processing apparatus 11 performs a speech recognition process on the speech signals subjected to the reverberation reducing process.
The configuration of the speech processing apparatus 11 according to this embodiment will be described below.
The storage unit 104 stores a room transfer function (first room transfer function) A(ω) and a room transfer function (second room transfer function) AL(ω). Here, the superscript L denotes a signal or information on late reflection.
The sound source separating unit 101 acquires the sound signals of N channels transmitted from the sound collecting unit 12 and separates the acquired sound signals of N channels into a full reverberant signal s(ω) and a late reflection signal (late reflection component) sL(ω) based on the room transfer function A(ω) stored in the storage unit 104. The sound source separating unit 101 outputs the separated full reverberant signal s(ω) and late reflection signal sL(ω) to the reduction unit 102. The configuration of the sound source separating unit 101 will be described later.
The reduction unit 102 estimates the direction of the speaker Sp based on the late reflection signal sL(ω) input from the sound source separating unit 101. The reduction unit 102 calculates characteristics of the noise reducing filter based on the estimated direction of the speaker Sp and the input late reflection signal sL(ω). The reduction unit 102 performs a reverberation reducing process of reducing the reverberation of the input full reverberant signal s(ω) based on the calculated characteristics of the noise reducing filter. The reduction unit 102 outputs an estimated value (hereinafter, referred to as a sound signal subjected to reverberation reduction) eθ̂(ω) of the sound signal subjected to the reverberation reducing process to a speech recognizing unit 103. Here, θ̂ represents the angle of the estimated direction of the speaker Sp.
The speech recognizing unit 103 recognizes speech details (for example, a text indicating a word or a sentence) by performing a speech recognizing process on the reverberation-reduced sound signal eθ̂(ω) input from the reduction unit 102, and outputs the recognition data indicating the recognized speech details to the outside. The speech recognizing unit 103 includes, for example, a hidden Markov model (HMM) which is an acoustic model and a word dictionary.
Here, the speech recognizing unit 103 calculates a sound feature quantity of the reverberation-reduced sound signal for every predetermined time interval (for example, 10 ms). The sound feature quantity is, for example, a feature vector which is a set of 34-dimensional Mel-frequency cepstrum coefficients (MFCC), a static Mel-scale log spectrum (static MSLS), a delta MSLS, and single delta power, a set of a static Mel-scale log spectrum, a delta MSLS, and single delta power, or the like. The speech recognizing unit 103 determines phonemes from the calculated sound feature quantity using an acoustic model, and recognizes a word from a phoneme sequence including the determined phonemes using a word dictionary.
The sound source separating unit 101 and the reduction unit 102 will be described below with reference to
First, the sound source separating unit 101 will be described. As shown in
The sound signals u(ω) collected by a plurality of microphones of the sound collecting unit 12 are input to the sound signal processing unit 1011. The sound signals u(ω) are a vector [u1(ω), . . . , uK(ω)]T when there are K sound sources. A vector x(ω) including the signals observed by M microphones is expressed by Expression (1).
x(ω)=[x1(ω), . . . , xM(ω)]T (1)
When the room transfer function A(ω) stored in the storage unit 104 is an element of the M×K-dimensional set C^{M×K}, the sound signal processing unit 1011 computes the vector x(ω) using Expression (2) based on Expression (1). The set C represents a set as a combination of M microphones and K sound sources. The sound signal processing unit 1011 outputs the calculated vector x(ω) to the sound source separation processing unit 1013. A(ω) is a room transfer function of early reflection, for example, acquired in advance by measurement or experiment. A(ω) may be measured every time.
x(ω)=A(ω)u(ω) (2)
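As an illustration of Expression (2), the following is a minimal sketch that applies a room transfer function to source spectra at several frequency bins at once. The sizes `F`, `M`, and `K`, the random values, and the array layout are assumptions chosen only for the example; in the apparatus, A(ω) would come from measurement or experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
F, M, K = 8, 4, 2   # hypothetical sizes: frequency bins, microphones, sound sources

# Hypothetical room transfer functions, one M x K complex matrix per frequency bin.
A = rng.standard_normal((F, M, K)) + 1j * rng.standard_normal((F, M, K))
# Hypothetical source spectra, one K-dimensional complex vector per frequency bin.
u = rng.standard_normal((F, K)) + 1j * rng.standard_normal((F, K))

# Expression (2) evaluated independently at every bin: x(w) = A(w) u(w).
x = np.einsum("fmk,fk->fm", A, u)
print(x.shape)  # (8, 4)
```

Each row of `x` holds the M microphone observations for one frequency bin, which is the form in which the vector x(ω) would be passed on for separation.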
Similarly, the sound signal processing unit 1012 computes the vector xL(ω) using Expression (3) using the room transfer function AL(ω) stored in the storage unit 104. The sound signal processing unit 1012 outputs the calculated vector xL(ω) to the sound source separation processing unit 1014. AL(ω) is a room transfer function of late reflection, for example, acquired in advance by measurement or experiment. AL(ω) may be measured every time.
$$x^{L}(\omega) = A^{L}(\omega)\,u(\omega) \tag{3}$$
The sound source separation processing unit 1013 separates the vector x(ω) into sound signals of one or more sound sources by performing a sound source separating process on the vector x(ω) input from the sound signal processing unit 1011. The sound source separation processing unit 1013 outputs the separated full reverberant signal s(ω) to the reduction unit 102. The full reverberant signal s(ω) is almost equal to a reverberation signal r(ω). The reverberation signal r(ω) is expressed by Expression (4) based on the early reflection signal e(ω) and the late reflection signal l(ω).
r(ω)=e(ω)+l(ω) (4)
The sound source separation processing unit 1013 calculates the full reverberant signal s(ω) using Expression (5) based on, for example, a geometric-constrained high order decorrelation-based source separation (GHDSS) method as the sound source separating process.
s(ω)=GHDSS[x(ω)] (5)
The sound source separation processing unit 1014 separates the vector xL(ω) into sound signals of one or more sound sources by performing a sound source separating process on the vector xL(ω) input from the sound signal processing unit 1012. The sound source separation processing unit 1014 outputs the separated late reflection signal sL(ω) to the reduction unit 102. The sound source separation processing unit 1014 calculates the late reflection signal sL(ω) using Expression (6) and, for example, using the GHDSS method as the sound source separating process.
$$s^{L}(\omega) = \mathrm{GHDSS}\left[x^{L}(\omega)\right] \tag{6}$$
The sound source separation processing unit 1013 and the sound source separation processing unit 1014 may use, for example, an adaptive beam forming method of estimating a sound source direction and controlling directivity so as to have the highest sensitivity in a designated sound source direction instead of the GHDSS method. At the time of estimating the sound source direction, the sound source separation processing unit 1013 and the sound source separation processing unit 1014 may use a multiple signal classification (MUSIC) method.
The GHDSS method will be described below.
The GHDSS method is a method of separating collected sound signals of multiple channels into sound signals by sound sources. In this method, a separation matrix [V(ω)] (a full reverberant signal s(ω) or a late reflection signal sL(ω)) is sequentially calculated and the input speech vector [x(ω)] is multiplied by the separation matrix [V(ω)] to estimate a sound source vector [u(ω)]. The separation matrix [V(ω)] is a pseudo-inverse matrix of a transfer function matrix [H(ω)] having transfer functions from respective sound sources to the microphones of the sound collecting unit 12 as elements. The input speech vector [x(ω)] is a vector having frequency-domain coefficients of the sound signals of channels as elements. The sound source vector [u(ω)] is a vector having frequency-domain coefficients of the sound signals emitted from the respective sound sources as elements.
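The role of the separation matrix as a pseudo-inverse of the transfer function matrix can be sketched as follows. This is not the GHDSS method itself, which updates [V(ω)] sequentially; it only shows the non-adaptive case in which [H(ω)] is known exactly. The sizes and random values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 4, 2   # hypothetical: 4 microphones, 2 sound sources

# Hypothetical transfer function matrix H(w) at one frequency bin (M x K).
H = rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))
# Hypothetical source vector u(w) and the resulting input speech vector x(w).
u_true = rng.standard_normal(K) + 1j * rng.standard_normal(K)
x = H @ u_true

# Separation matrix V(w) as the pseudo-inverse of H(w) (K x M).
V = np.linalg.pinv(H)
# Multiplying the input speech vector by V(w) estimates the source vector.
u_est = V @ x
```

With more microphones than sources and a full-column-rank `H`, `u_est` matches `u_true` up to rounding error. An adaptive method is needed in practice because the exact transfer functions are not available.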
At the time of calculating the separation matrix [V(ω)], the sound source separation processing unit 1013 and the sound source separation processing unit 1014 calculate the sound source vector [u(ω)] so as to minimize two cost functions such as separation sharpness JSS and geometric constraint JGC.
The separation sharpness JSS is an index value indicating a degree to which one sound source is erroneously separated as a different sound source and is expressed, for example, by Expression (7).
$$J_{SS} = \left\| [u(\omega)][u(\omega)]^{*} - \mathrm{diag}\left([u(\omega)][u(\omega)]^{*}\right) \right\|^{2} \tag{7}$$
In Expression (7), ∥ . . . ∥2 represents a Frobenius norm of . . . , and * represents the conjugate transpose of a vector or a matrix. diag( . . . ) represents a diagonal matrix having diagonal elements of . . . .
The geometric constraint JGC(ω) is an index value indicating a degree of error of the sound source vector [u(ω)] and is expressed, for example, by Expression (8).
$$J_{GC} = \left\| \mathrm{diag}\left([V(\omega)][A(\omega)] - [I]\right) \right\|^{2} \tag{8}$$
In Expression (8), [I] represents a unit matrix.
The reduction unit 102 will be described below. As shown in
The late reflection signal sL(ω) input from the sound source separation processing unit 1014 includes redundant information in the time domain. Accordingly, the vector parameter estimating unit 1021 estimates the feature vector fL of the late reflection signal sL(ω) using Expression (9) and outputs the estimated feature vector fL to the direction estimating unit 1022.
$$f^{L} = F\left[s^{L}(\omega)\right] \tag{9}$$
In Expression (9), F represents a feature extraction order for acquiring the feature vector fL. The feature vector is, for example, 12-dimensional mel-frequency cepstrum coefficients (MFCC), or 12-dimensional delta MFCC, or one-dimensional delta energy.
The direction estimating unit 1022 estimates the estimated value θ̂ of the direction θ of the speaker Sp by evaluating the feature vector fL input from the vector parameter estimating unit 1021 based on the likelihood of Expression (10).
In Expression (10), arg max p( . . . ) gives the argument that maximizes p( . . . ), and μθg is a probability model of a set of directions {θ1, . . . , θg, . . . , θG}. The direction estimating unit 1022 uses the θg for which the calculated value is a maximum to select an extension filter Hθ̂ of an appropriate equalizer in the reverberation reducing filter calculating unit 1023.
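The selection of the direction with the highest likelihood can be sketched as follows. The sketch assumes that each direction θg has a single diagonal-covariance Gaussian as its probability model, which is a simplification made for the example. The candidate angles, means, variances, and function names are all hypothetical.

```python
import numpy as np

def gaussian_log_likelihood(f, mean, var):
    """Log-density of a diagonal-covariance Gaussian at feature vector f."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (f - mean) ** 2 / var)

def estimate_direction(f_late, models):
    """Return the direction whose model gives the highest likelihood for f_late."""
    scores = {
        theta: gaussian_log_likelihood(f_late, m["mean"], m["var"])
        for theta, m in models.items()
    }
    return max(scores, key=scores.get)

# Hypothetical direction models keyed by azimuth in degrees.
models = {
    0:  {"mean": np.array([0.0, 0.0]), "var": np.array([1.0, 1.0])},
    30: {"mean": np.array([2.0, 1.0]), "var": np.array([1.0, 1.0])},
    60: {"mean": np.array([4.0, 2.0]), "var": np.array([1.0, 1.0])},
}
theta_hat = estimate_direction(np.array([1.9, 1.2]), models)
print(theta_hat)  # 30
```

The returned key would then index the stored extension filters, so that the filter for the estimated direction is passed on.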
In Expression (10), the probability model μθg is, for example, learned in advance. In learning the probability model μθg, the late reflection signal sL(ω) is expressed by Expression (11) instead of Expressions (3) and (6).
$$s^{L}(\omega) = A^{L}(\omega)\,u(\omega) \tag{11}$$
The feature vector fLθ in the direction θ is expressed by Expression (12) using the extension filter Hθ of the equalizer which is made into a parameter.
$$f_{\theta}^{L} = F\left[s^{L}(\omega)\,H_{\theta}\right] \tag{12}$$
The plurality of extension filters Hθ for each direction θ of the speaker Sp (sound source) are stored in the direction estimating unit 1022, for example, in advance by experiment or measurement.
The direction estimating unit 1022 selects the extension filter Hθ in Expression (12) from the stored extension filters Hθ and outputs the selected extension filter Hθ as the estimated value Hθ̂ to the reverberation reducing filter calculating unit 1023.
The probability model μθg in Expression (10) is learned using Expression (13) based on the set of directions {θ1, . . . , θg, . . . , θG}. This process is performed off-line.
In Expression (13), μ is an unknown model parameter and fθi is a training vector of i-th late reflection. The training vector is equalized by the extension filter Hθ.
The reverberation reducing filter calculating unit 1023 corrects the late reflection signal sL(ω) input from the sound source separation processing unit 1014 based on the equalizer characteristic corresponding to the estimated value Hθ̂ of the extension filter input from the direction estimating unit 1022. The reverberation reducing filter calculating unit 1023 outputs the corrected late reflection signal sLθ to the reverberation reducing unit 1024.
Theoretically, the room transfer function A(ω) is necessary for each direction θ of the speaker Sp.
This is because the reverberation characteristic of the room Rm varies whenever the direction θ of the speaker Sp varies. Particularly, it has been confirmed by experiment that the late reflection signal sL(ω) varies with the variation in the direction θ of the speaker Sp. However, it is difficult to measure the room transfer characteristic for each direction θ of the speaker Sp in M microphones. Accordingly, in this embodiment, computation is equivalently carried out by using the late reflection signal sL(ω) separated from multiple channels by the sound source separating unit 101. Accordingly, in this embodiment, it is possible to simplify the influence of the direction θ of the speaker Sp in the room transfer functions of multiple channels in a filtered sound signal of one channel.
That is, the reverberation reducing filter calculating unit 1023 calculates the equalized late reflection signal sLθ(ω) using Expression (14).
$$s_{\theta}^{L}(\omega) = s^{L}(\omega)\,H_{\theta} \tag{14}$$
In Expression (14), the late reflection signal sL(ω) is the separated late reflection using a general room transfer function while it is equalized using the extension filter Hθ.
The extension filter Hθ is, for example, a filter characteristic acquired by measuring the late reflection signal sL(ω) depending on the actual direction θ of the speaker Sp.
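The equalization of the late reflection signal by the extension filter can be sketched as a per-frequency multiplication. The sketch assumes that Hθ is given as one gain per frequency bin and that the late reflection signal is a (bins, frames) array of frequency-domain coefficients. The function name and the numeric values are hypothetical.

```python
import numpy as np

def equalize_late_reflection(s_late, h_theta):
    """Multiply every frequency bin of s^L by the filter response H_theta."""
    s_late = np.asarray(s_late)
    h_theta = np.asarray(h_theta)
    if s_late.shape[0] != h_theta.shape[0]:
        raise ValueError("s_late and h_theta need the same number of frequency bins")
    # Reshape H_theta so that it broadcasts over any trailing time-frame axis.
    return s_late * h_theta.reshape(-1, *([1] * (s_late.ndim - 1)))

# Hypothetical late reflection spectrum: 2 frequency bins x 2 time frames.
s_late = np.array([[1 + 1j, 2.0], [0.5, 1j]])
h_theta = np.array([2.0, 0.5])   # hypothetical per-bin filter response
s_late_eq = equalize_late_reflection(s_late, h_theta)
```

Because only a per-bin multiplication on a single separated channel is involved, the correction for the speaker direction does not require a separate multichannel room transfer function for each direction.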
Here, sLAθ(ω) is a substantial late reflection signal based on the room transfer functions Aθ(ω) of multiple channels. The design of this filter is carried out, for example, using a pole positioning method on a frequency grid of a logarithmic function based on Non-Patent Documents 1 and 2.
Non-Patent Document 1: “Body Modeling”, In Proceedings of the International Computer Music Conference, 2007.
Non-Patent Document 2: J. Laroche and J-L. Meillier, “Multichannel Excitation/Filter Modeling of Percussive Sounds with Application to the Piano”, In Proceedings IEEE Transactions Speech and Audio Processing, 1994.
The reverberation reducing filter calculating unit 1023 first sets a target response to the late reflection signal sL(ω). That is, the late reflection signal sL(ω) input to the reverberation reducing filter calculating unit 1023 is set as the target response.
Then, the reverberation reducing filter calculating unit 1023 calculates the extension filter Hθ for {θ1, . . . , θg, . . . , θG} by appropriately setting the poles of the room transfer functions so as to achieve the target response sL(ω). The reverberation reducing filter calculating unit 1023 may perform an averaging pre-process so as to prevent a reverse phenomenon of the target response sL(ω). The reverberation reducing filter calculating unit 1023 stores, for example, a direction model correlated with each direction θ of the speaker Sp. The direction model is, for example, a Gaussian mixture model (GMM). The GMM is a kind of acoustic model in which the output probabilities for input sound feature quantities are weighted and added with a plurality of (for example, 256) normal distributions as a basis. Accordingly, the direction model is defined by statistics such as mixture weighting coefficients, mean values, and a covariance matrix. At the time of learning the GMM for each direction θ, the statistics may be determined in advance so as to maximize the likelihood using learning speech signals to which the reverberation characteristic is added for each direction θ. An HMM may be used as the direction model or a general discriminator such as a support vector machine (SVM) may be used.
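The weighted addition of normal distributions in a GMM direction model can be sketched as below. The sketch assumes diagonal covariances and evaluates the sum in the log domain for numerical stability. The function name and the parameter values are hypothetical and much smaller than the 256 mixtures mentioned above.

```python
import numpy as np

def gmm_log_likelihood(f, weights, means, variances):
    """log p(f) for a diagonal-covariance GMM: log sum_k w_k N(f; mean_k, var_k)."""
    f = np.asarray(f, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum((f - means) ** 2 / variances, axis=1)
    log_terms = np.log(weights) + log_norm + log_exp
    peak = np.max(log_terms)   # log-sum-exp trick avoids underflow
    return peak + np.log(np.sum(np.exp(log_terms - peak)))

# Hypothetical two-component model over a two-dimensional feature vector.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
near = gmm_log_likelihood([0.1, 0.0], weights, means, variances)
far = gmm_log_likelihood([10.0, 10.0], weights, means, variances)
```

A feature vector close to one of the component means scores higher than a distant one, which is the property used when direction models are compared.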
After the extension filter Hθ̂is estimated by the direction estimating unit 1022, the reverberation reducing filter calculating unit 1023 corrects the separated late reflection signal sL(ω) using Expression (14) without using a correlated room transfer function A(ω).
The full reverberant signal s(ω) from the sound source separation processing unit 1013 and the corrected late reflection signal sLθ(ω) from the reverberation reducing filter calculating unit 1023 are input to the reverberation reducing unit 1024.
The reverberation reducing unit 1024 employs a reverberation model of one channel described with reference to
In Expression (15), | . . . | is the absolute value of . . . .
In Expression (15), |s(ω, t)|² is the power of the separated reflection signal (where |s(ω, t)|² is almost equal to |r(ω, t)|²) and |sL(ω, t)|² is the power of the late reflection signal sL(ω). The reverberation reducing unit 1024 generates a reverberation-reduced sound signal eθ̂(ω) obtained by converting the calculated frequency-domain coefficient e(ω, t) of the early reflection signal into the time domain and outputs the generated reverberation-reduced sound signal eθ̂(ω) to the speech recognizing unit 103.
As described above, in this embodiment, it is possible to calculate the frequency-domain coefficient e(ω, t) of the early reflection signal as expressed by Expression (15) through the equalizing process in the reverberation reducing filter calculating unit 1023 and the exclusion of the weighting coefficient δp in the reverberation reducing unit 1024.
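The idea of obtaining the power of the early reflection signal from the power of the full reverberant signal and the power of the late reflection signal can be illustrated with a generic power spectral subtraction. The sketch below is not a reproduction of Expression (15): the function name and the flooring rule with its `floor` parameter are assumptions chosen only to keep the result non-negative.

```python
import numpy as np

def subtract_late_reflection(s_power, late_power, floor=0.05):
    """Generic power spectral subtraction with flooring (assumed form).

    Returns s_power - late_power where that difference is positive,
    and floor * s_power elsewhere.
    """
    s_power = np.asarray(s_power, dtype=float)
    late_power = np.asarray(late_power, dtype=float)
    diff = s_power - late_power
    return np.where(diff > 0.0, diff, floor * s_power)

# Two frequency bins: the first keeps the difference, the second is floored.
e_power = subtract_late_reflection([4.0, 1.0], [1.0, 2.0], floor=0.1)
```

Here `s_power` stands for |s(ω, t)|² and `late_power` for the power of the corrected late reflection signal, evaluated per frequency bin and frame.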
The speech processing in this embodiment will be described below.
(Step S101) The sound signal processing unit 1011 calculates the vector x(ω) using Expression (2) based on the room transfer function A(ω) stored in the storage unit 104 for the sound signals of N channels input from the sound collecting unit 12. Then, the sound signal processing unit 1011 outputs the calculated vector x(ω) to the sound source separation processing unit 1013. The sound signal processing unit 1011 performs the process of step S102 after step S101 ends.
(Step S102) The sound signal processing unit 1012 calculates the vector xL(ω) using Expression (3) based on the room transfer function AL(ω) stored in the storage unit 104 for the sound signals of N channels input from the sound collecting unit 12. Then, the sound signal processing unit 1012 outputs the calculated vector xL(ω) to the sound source separation processing unit 1014. The sound signal processing unit 1012 performs the process of step S103 after step S102 ends. Steps S101 and S102 may be reversed in order or may be performed at the same time.
(Step S103) The sound source separation processing unit 1013 performs the sound source separating process on the vector x(ω) input from the sound signal processing unit 1011, for example, using the GHDSS method to separate the vector into sound signals of one or more sound sources and outputs the separated full reverberant signal s(ω) to the reduction unit 102. The sound source separation processing unit 1013 performs the process of step S104 after step S103 ends.
(Step S104) The sound source separation processing unit 1014 performs the sound source separating process on the vector xL(ω) input from the sound signal processing unit 1012, for example, using the GHDSS method to separate the vector into sound signals of one or more sound sources and outputs the separated late reflection signal sL(ω) to the reduction unit 102. The sound source separation processing unit 1014 performs the process of step S105 after step S104 ends. Steps S103 and S104 may be reversed in order or may be performed at the same time.
(Step S105) The vector parameter estimating unit 1021 estimates the feature vector fL of the late reflection signal sL(ω) input from the sound source separation processing unit 1014 using Expression (12) and outputs the estimated feature vector fL to the direction estimating unit 1022. The vector parameter estimating unit 1021 performs the process of step S106 after step S105 ends.
(Step S106) The direction estimating unit 1022 estimates the direction of the speaker Sp based on the feature vector fL input from the vector parameter estimating unit 1021 and the likelihood of Expression (10). Then, the direction estimating unit 1022 estimates the equalizer characteristic Hθ using Expression (12) and outputs the estimated extension filter Hθ to the reverberation reducing filter calculating unit 1023. The direction estimating unit 1022 performs the process of step S107 after step S106 ends.
(Step S107) The reverberation reducing filter calculating unit 1023 corrects the late reflection signal sL(ω) input from the sound source separation processing unit 1014 based on the equalizer characteristic corresponding to the estimated value Hθ̂ of the extension filter input from the direction estimating unit 1022. The reverberation reducing filter calculating unit 1023 outputs the corrected late reflection signal sLθ to the reverberation reducing unit 1024. The reverberation reducing filter calculating unit 1023 performs the process of step S108 after step S107 ends.
(Step S108) The reverberation reducing unit 1024 estimates the reverberation-reduced sound signal eθ̂(ω) based on the full reverberant signal s(ω) input from the sound source separation processing unit 1013 and the corrected late reflection signal sLθ(ω) input from the reverberation reducing filter calculating unit 1023. The reverberation reducing unit 1024 outputs the reverberation-reduced sound signal eθ̂(ω) to the speech recognizing unit 103. The reverberation reducing unit 1024 performs the process of step S109 after step S108 ends.
(Step S109) The speech recognizing unit 103 recognizes speech details (for example, a text indicating a word or a sentence) by performing a speech recognizing process on the reverberation-reduced sound signal eθ̂(ω) input from the reduction unit 102, and outputs recognition data indicating the recognized speech details to the outside.
In this way, the speech processing ends.
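The data flow of steps S101 to S109 can be summarized in the following sketch. Each processing unit is represented by a caller-supplied callable. The function name and the dictionary keys are hypothetical labels introduced for this example, and the sketch fixes only the order in which signals are passed between units, not the processing inside them.

```python
def process_speech(channels, units):
    """Run the data flow of steps S101 to S109 using caller-supplied callables."""
    x = units["filter_full"](channels)         # S101: vector x from A
    x_late = units["filter_late"](channels)    # S102: vector xL from AL
    s = units["separate"](x)                   # S103: full reverberant signal s
    s_late = units["separate"](x_late)         # S104: late reflection signal sL
    f_late = units["feature"](s_late)          # S105: feature vector fL
    h = units["estimate_equalizer"](f_late)    # S106: direction and equalizer
    s_late_eq = units["equalize"](s_late, h)   # S107: corrected late reflection
    e = units["reduce"](s, s_late_eq)          # S108: reverberation-reduced signal
    return units["recognize"](e)               # S109: recognition data
```

Because steps S101 and S102, and likewise steps S103 and S104, do not depend on each other, an implementation could also run each pair concurrently, as noted in the step descriptions.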
A test result in which the speech recognition accuracy was verified using the speech processing apparatus 11 according to this embodiment will be described below. The test was carried out in the environment shown in
First, the speech recognizing unit 103 was trained by having 24 speakers utter speech 200 times using the Japanese Newspaper Article Sentences (JNAS) corpus. A phonetically tied mixture (PTM) HMM including a total of 8256 normal distributions, which is a kind of continuous HMM, was used as the acoustic model.
The test was carried out at distances of 0.5 m, 1.0 m, 1.5 m, 2.0 m, and 2.5 m which are distances between the sound collecting unit 12 and the speaker Sp and at directions of the speaker Sp of θ1=30°, θ2=15°, θ3=0°, θ4=−15°, and θ5=−30° for each distance. Here, θ3=0° indicates the direction of the speaker Sp perpendicular to the sound collecting unit 12. The test was carried out 200 times at each position. At each position, the test was carried out for the five angles. In the same test room, the room transfer functions for the positions and the directions were measured and stored in the storage unit 104.
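The test conditions described above form a grid of distances and directions, which can be enumerated as in the following short sketch. The variable names are chosen only for illustration.

```python
from itertools import product

distances_m = [0.5, 1.0, 1.5, 2.0, 2.5]   # distance between sound collecting unit 12 and speaker Sp
directions_deg = [30, 15, 0, -15, -30]    # theta1 to theta5

# Every (distance, direction) pair is one test condition.
conditions = list(product(distances_m, directions_deg))
print(len(conditions))  # 25 combinations
```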
First to fourth setting rooms will be described below with reference to
The room states of the first to fourth setting rooms shown in
As shown in
The effectiveness of estimation of the direction of a speaker Sp will be described below with reference to
Appropriate selection of the direction θ̂ of a speaker Sp is effective for selecting the estimated value Hθ̂ of the extension filter which is the optimal equalization parameter. First, in the first to fourth setting rooms, three different random positions (first to third positions) are selected. The values of columns shown in
The results when the speech recognizing process is carried out using the speech processing apparatus 11 according to this embodiment in test room A and test room B will be described.
In
In
Non-Patent Document 3: S. Griebel and M. Brandstein, “Wavelet Transform Extrema Clustering for Multi-channel Speech Dereverberation”
Non-Patent Document 4: B. Yegnanarayana and P. Satyaranyarana, “Enhancement of Reverberant Speech Using LP Residual Signals”, In Proceedings of IEEE Trans. on Audio, Speech and Lang. Proc., 2000.
As shown in
For example, when the distance to the speaker Sp was 0.5 m, the word recognition rate was about 68% in method A, about 70% in method B, about 72% in method C, and about 72.5% in method D. The word recognition rate was about 74% in method E, about 77.5% in method F, and about 78% in method G.
For example, when the distance to the speaker Sp was 2.5 m, the word recognition rate was about 15% in method A, about 25% in method B, about 27% in method C, and about 28% in method D. The word recognition rate was about 30% in method E, about 46% in method F, and about 47% in method G.
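To make the comparison between methods easier to read, the approximate word recognition rates quoted above for 0.5 m and 2.5 m can be tabulated and expressed as a gain in percentage points over method A. The values are the approximate figures from the text, and the variable names are illustrative.

```python
# Approximate word recognition rates (%) quoted in the text, keyed by distance (m).
rates = {
    0.5: {"A": 68.0, "B": 70.0, "C": 72.0, "D": 72.5, "E": 74.0, "F": 77.5, "G": 78.0},
    2.5: {"A": 15.0, "B": 25.0, "C": 27.0, "D": 28.0, "E": 30.0, "F": 46.0, "G": 47.0},
}

# Gain of each method over method A, in percentage points.
gain_over_a = {
    dist: {method: rate - by_method["A"] for method, rate in by_method.items()}
    for dist, by_method in rates.items()
}

for dist, gains in gain_over_a.items():
    print(dist, gains["G"])  # 0.5 -> 10.0, 2.5 -> 32.0
```

On these figures, the gain of method G over method A grows from about 10 points at 0.5 m to about 32 points at 2.5 m.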
As shown in
For example, when the distance to the speaker Sp was 1.0 m, the word recognition rate was about 11% in method A, about 20% in method B, about 22% in method C, and about 24% in method D. The word recognition rate was about 26% in method E, about 39% in method F, and about 40% in method G.
For example, when the distance to the speaker Sp was 2.0 m, the word recognition rate was about −14% in method A, about 7% in method B, about 10% in method C, and about 12% in method D. The word recognition rate was about 14% in method E, about 26% in method F, and about 27% in method G.
An example of the word recognition rate depending on the direction θ of a speaker Sp will be described below with reference to
First, the test result in test room A with a reverberation time of 240 ms will be described.
As shown in
The test result of test room B with a reverberation time of 640 ms will be described below.
As shown in
As described above, the speech processing apparatus 11 according to this embodiment includes the sound collecting unit 12 configured to collect sound signals, the sound source direction estimating unit (the direction estimating unit 1022) configured to estimate the direction of a sound source of each sound signal collected by the sound collecting unit 12, the reverberation reducing filter calculating unit 1023 configured to calculate a reverberation reducing filter to be applied to the sound signals collected by the sound collecting unit 12, and the reduction processing unit (the reverberation reducing unit 1024) configured to apply the reverberation reducing filter calculated by the reverberation reducing filter calculating unit 1023 to the sound signals. The reverberation reducing filter calculating unit 1023 calculates the reverberation reducing filter to be applied based on the directions of the sound sources estimated by the sound source direction estimating unit (the direction estimating unit 1022).
According to this configuration, the speech processing apparatus 11 according to this embodiment can reduce reverberations by applying the reverberation reducing filter calculated depending on the directions of the sound sources emitting the sound signals to the sound signals. Accordingly, it is possible to achieve the reverberation reduction to improve speech recognition accuracy even when the direction of a sound source is changed.
For example, in apparatuses according to the related art, in order to secure the robustness of a system to a variation in the direction of a sound source, it is necessary to collect and calculate the room transfer functions corresponding to all directions of the sound source using microphones. On the other hand, in the speech processing apparatus 11 according to this embodiment, it is possible to secure the robustness of a system to the variation in the direction of a sound source using a simple equalizer process without processing the sound signals of multiple channels. In the speech processing apparatus 11 according to this embodiment, it is not necessary to process the sound signals of multiple channels, unlike the related art, and it is thus possible to reduce the computational load.
The first embodiment has described an example where the reduction unit 102 performs estimation of the direction of a speaker Sp and reduction of reverberations using the full reverberant signal s(ω) and the late reflection signal (late reflection component) sL(ω) into which the collected sound signals of N channels are separated by the sound source separating unit 101.
The estimation of the direction of a speaker Sp or the reduction of reverberations may be performed by only the reduction unit 102.
For example, a full reverberant signal s(ω) and a late reflection signal sL(ω) collected in advance may be directly input to the reduction unit 102A.
Alternatively, a full reverberant signal s(ω) and a late reflection signal (late reflection component) sL(ω) into which a sound signal collected by one microphone of the microphones of the sound collecting unit 12 is separated by the sound source separating unit 101 may be input to the reduction unit 102A.
The acquisition unit 1025 of the reduction unit 102A may acquire an image captured by an imaging device and may output the acquired image to the direction estimating unit 1022. The direction estimating unit 1022 may estimate the direction of a speaker Sp (sound source) based on the captured image.
The acquisition unit 1025 may acquire a detected value output from an azimuth sensor or the like mounted on the head of a speaker Sp and may output the acquired detected value to the direction estimating unit 1022. The direction estimating unit 1022 may estimate the direction of the speaker Sp (sound source) based on the acquired detected value.
Alternatively, the reduction unit 102A may be connected to the respective microphones of the sound collecting unit 12.
This embodiment has described an example where a word uttered by a speaker Sp is recognized, but the invention is not limited to this example. The sound signals collected by the sound collecting unit 12 are not limited to speeches but may be music.
In this case, the speech processing apparatus 11 may estimate, for example, a tempo of a piece of music by performing a beat tracking process (not shown) and estimating a direction of a sound source.
Examples of equipment into which the speech processing apparatus 11 is assembled include a robot, a vehicle, and a mobile terminal. In this case, the robot, the vehicle, or the mobile terminal may include the sound collecting unit 12.
The sound source direction may be estimated by recording a program for performing the functions of the speech processing apparatus 11 according to the invention on a computer-readable recording medium and causing a computer system to read and execute the program recorded on the recording medium. Here, the “computer system” may include an OS or hardware such as peripherals. The “computer system” may include a WWW system including a homepage providing environment (or display environment). Examples of the “computer-readable recording medium” include portable media such as a flexible disk, a magneto-optical disk, a ROM, and a CD-ROM and a storage device such as a hard disk built in a computer system. The “computer-readable recording medium” may include a medium that temporarily holds a program for a predetermined time, like a volatile memory (RAM) in a computer system serving as a server or a client in a case where the program is transmitted via a network such as the Internet or a communication circuit such as a telephone circuit.
The program may be transmitted from a computer system storing the program in a storage device or the like to another computer system via a transmission medium or by transmission waves in the transmission medium. Here, the “transmission medium” via which the program is transmitted means a medium having a function of transmitting information such as a network (communication network) such as the Internet or a communication circuit (communication line) such as a telephone line. The program may be configured to realize a part of the above-mentioned functions or may be configured to realize the above-mentioned functions by combination with a program recorded in advance in a computer system, like a so-called differential file (differential program).
While preferred embodiments of the invention have been described and shown above, it should be understood that these are exemplary of the invention and are not to be considered as limiting. Additions, omissions, substitutions, and other modifications can be made without departing from the spirit or scope of the present invention. Accordingly, the invention is not to be considered as being limited by the foregoing description, and is only limited by the scope of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2013-200391 | Sep 2013 | JP | national |