1. Field of the Invention
The present invention relates to a speech recognition system and a speech recognizing method.
2. Background Art
When a robot operates while communicating with people, for example, it has to recognize their speech while executing motions. When the robot executes motions, so-called ego noise (ego-motion noise) caused by robot motors or the like is generated. Accordingly, the robot has to perform speech recognition in an environment in which ego noise is being generated.
Several methods in which templates stored in advance are subtracted from spectra of obtained sounds have been proposed to reduce ego noise (S. Boll, “Suppression of Acoustic Noise in Speech Using Spectral Subtraction”, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-27, No. 2, 1979, and A. Ito, T. Kanayama, M. Suzuki, S. Makino, “Internal Noise Suppression for Speech Recognition by Small Robots”, Interspeech 2005, pp. 2685-2688, 2005.). These methods are single-channel based noise reduction methods. Single-channel based noise reduction methods generally degrade the intelligibility and quality of the audio signal, for example, through the distorting effects of musical noise, a phenomenon that occurs when noise estimation fails (I. Cohen, “Noise Estimation by Minima Controlled Recursive Averaging for Robust Speech Enhancement”, IEEE Signal Processing Letters, vol. 9, No. 1, 2002).
On the other hand, linear sound source separation (SSS) techniques are also very popular in the field of robot audition, where noise suppression is mostly carried out using SSS techniques with microphone arrays (K. Nakadai, H. Nakajima, Y. Hasegawa and H. Tsujino, “Sound source separation of moving speakers for robot audition”, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3685-3688, 2009, and S. Yamamoto, J. M. Valin, K. Nakadai, J. Rouat, F. Michaud, T. Ogata, and H. G. Okuno, “Enhanced Robot Speech Recognition Based on Microphone Array Source Separation and Missing Feature Theory”, IEEE/RSJ International Conference on Robotics and Automation (ICRA), 2005). However, neither a directional noise model such as is assumed in the case of interfering speakers (S. Yamamoto, K. Nakadai, M. Nakano, H. Tsujino, J. M. Valin, K. Komatani, T. Ogata, and H. G. Okuno, “Real-time robot audition system that recognizes simultaneous speech in the real world”, Proc. of the IEEE/RSJ International Conference on Robots and Intelligent Systems (IROS), 2006.) nor a diffuse background noise model (J. M. Valin, J. Rouat and F. Michaud, “Enhanced Robot Audition Based on Microphone Array Source Separation with Post-Filter”, Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2123-2128, 2004.) holds entirely for ego-motion noise. In particular, because the motors are located in the near field of the microphones, they produce sounds that have both diffuse and directional characteristics.
Thus, a speech recognition system and a speech recognizing method that achieve high-accuracy speech recognition in an environment with ego noise have not conventionally been developed.
Accordingly, there is a need for a speech recognition system and a speech recognizing method for high-accuracy speech recognition in an environment with ego noise.
A speech recognition system according to a first aspect of the present invention includes a sound source separating and speech enhancing section; an ego noise predicting section; and a missing feature mask generating section for generating missing feature masks using outputs of the sound source separating and speech enhancing section and the ego noise predicting section; an acoustic feature extracting section for extracting an acoustic feature of each sound source using an output for said each sound source of the sound source separating and speech enhancing section; and a speech recognizing section for performing speech recognition using outputs of the acoustic feature extracting section and the missing feature masks.
In the speech recognition system according to the present aspect, the missing feature mask generating section generates missing feature masks using the outputs of the sound source separating and speech enhancing section and the ego noise predicting section. Accordingly, input data of the speech recognizing section can be adjusted based on the results of sound separation and the predicted ego noise to improve speech recognition accuracy.
A speech recognition system according to a second aspect of the present invention includes a sound source separating and speech enhancing section; an ego noise predicting section; a speaker missing feature mask generating section for generating speaker missing feature masks for each sound source using an output for said each sound source of the sound source separating and speech enhancing section; and an ego noise missing feature mask generating section for generating ego noise missing feature masks for each sound source using an output for said each sound source of the sound source separating and speech enhancing section and an output of the ego noise predicting section. The speech recognition system according to the present aspect further includes a missing feature mask integrating section for integrating speaker missing feature masks and ego noise missing feature masks to generate total missing feature masks; an acoustic feature extracting section for extracting an acoustic feature of each sound source using an output for said each sound source of the sound source separating and speech enhancing section; and a speech recognizing section for performing speech recognition using outputs of the acoustic feature extracting section and the total missing feature masks.
The speech recognition system according to the present aspect is provided with the missing feature mask integrating section for integrating speaker missing feature masks and ego noise missing feature masks to generate total missing feature masks. Accordingly, appropriate total missing feature masks can be generated for each individual environment, using the outputs of the sound source separating and speech enhancing section and the output of the ego noise predicting section to improve speech recognition accuracy.
In a speech recognition system according to a first embodiment of the second aspect of the present invention, the ego noise missing feature mask generating section generates the ego noise missing feature masks for each sound source using a ratio between a value obtained by dividing the output of the ego noise predicting section by the number of the sound sources and an output for said each sound source of the sound source separating and speech enhancing section.
In the speech recognition system according to the present embodiment, reliability for ego noise of an output for each sound source of the sound source separating and speech enhancing section is determined using a ratio between a value obtained by dividing energy of the ego noise by the number of the sound sources and energy of sound for each sound source. Accordingly, a portion of the output which is contaminated by the ego noise can be effectively removed to improve speech recognition accuracy.
In a speech recognition system according to a second embodiment of the second aspect of the present invention, the missing feature mask integrating section adopts a speaker missing feature mask as a total missing feature mask for each sound source when an output for said each sound source of the sound source separating and speech enhancing section is equal to or greater than a value obtained by dividing the output of the ego noise predicting section by the number of the sound sources and adopts an ego noise missing feature mask as the total missing feature mask for said each sound source when the output for said each sound source of the sound source separating and speech enhancing section is smaller than the value obtained by dividing the output of the ego noise predicting section by the number of the sound sources.
In the speech recognition system according to the present embodiment, an appropriate total missing feature mask can be generated depending on energy of sounds from the sound sources and energy of the ego noise and speech recognition accuracy can be improved by using the total missing feature mask.
A speech recognizing method according to a third aspect of the present invention includes the steps of separating sound sources by a sound source separating and speech enhancing section; predicting ego noise by an ego noise predicting section; generating missing feature masks using outputs of the sound source separating and speech enhancing section and an output of the ego noise predicting section, by a missing feature mask generating section; extracting an acoustic feature of each sound source using an output for said each sound source of the sound source separating and speech enhancing section, by an acoustic feature extracting section; and performing speech recognition using outputs of the acoustic feature extracting section and the missing feature masks, by a speech recognizing section.
In the speech recognizing method according to the present aspect, the missing feature mask generating section generates missing feature masks using the outputs of the sound source separating and speech enhancing section and the output of the ego noise predicting section. Accordingly, input data of the speech recognizing section can be adjusted based on the results of sound separation and the predicted ego noise to improve speech recognition accuracy.
A speech recognizing method according to a fourth aspect of the present invention includes the steps of separating sound sources by a sound source separating and speech enhancing section; predicting ego noise by an ego noise predicting section; generating speaker missing feature masks for each sound source using an output for said each sound source of the sound source separating and speech enhancing section, by a speaker missing feature mask generating section; and generating ego noise missing feature masks for each sound source using an output for said each sound source of the sound source separating and speech enhancing section and an output of the ego noise predicting section, by an ego noise missing feature mask generating section. The speech recognizing method according to the present aspect further includes the steps of integrating speaker missing feature masks and ego noise missing feature masks to generate total missing feature masks, by a missing feature mask integrating section; extracting an acoustic feature of each sound source using an output for said each sound source of the sound source separating and speech enhancing section, by an acoustic feature extracting section; and performing speech recognition using outputs of the acoustic feature extracting section and the total missing feature masks, by a speech recognizing section.
In the speech recognizing method according to the present aspect, appropriate total missing feature masks can be generated by the missing feature mask integrating section for each individual environment, using the outputs of the sound source separating and speech enhancing section and the output of the ego noise predicting section to improve speech recognition accuracy.
In a speech recognizing method according to a first embodiment of the fourth aspect of the present invention, in the step of generating ego noise missing feature masks, the ego noise missing feature masks for each sound source are generated using a ratio between a value obtained by dividing the output of the ego noise predicting section by the number of the sound sources and an output for said each sound source of the sound source separating and speech enhancing section.
In the speech recognizing method according to the present embodiment, reliability for ego noise of an output for each sound source of the sound source separating and speech enhancing section is determined using a ratio between a value obtained by dividing energy of the ego noise by the number of the sound sources and energy of sound for each sound source. Accordingly, a portion of the output which is contaminated by the ego noise can be effectively removed to improve speech recognition accuracy.
In a speech recognizing method according to a second embodiment of the fourth aspect of the present invention, in the step of integrating speaker missing feature masks and ego noise missing feature masks to generate total missing feature masks, a speaker missing feature mask is adopted as a total missing feature mask for each sound source when an output for said each sound source of the sound source separating and speech enhancing section is equal to or greater than a value obtained by dividing the output of the ego noise predicting section by the number of the sound sources and an ego noise missing feature mask is adopted as the total missing feature mask for said each sound source when the output for said each sound source of the sound source separating and speech enhancing section is smaller than the value obtained by dividing the output of the ego noise predicting section by the number of the sound sources.
In the speech recognizing method according to the present embodiment, an appropriate total missing feature mask can be generated depending on energy of sounds from the sound sources and energy of the ego noise and speech recognition accuracy can be improved by using the total missing feature mask.
The sound source separating and speech enhancing section 100 includes a sound source localizing section 101, a sound source separating section 103 and a speech enhancing section 105. The sound source localizing section 101 localizes sound sources using acoustic data obtained from a plurality of microphones set on a robot. The sound source separating section 103 separates the sound sources using localized positions of the sound sources. The sound source separating section 103 uses a linear separating algorithm called Geometric Source Separation (GSS) (S. Yamamoto, K. Nakadai, M. Nakano, H. Tsujino, J. M. Valin, K. Komatani, T. Ogata, and H. G. Okuno, “Real-time robot audition system that recognizes simultaneous speech in the real world”, Proc. of the IEEE/RSJ International Conference on Robots and Intelligent Systems (IROS), 2006.). As shown in
The ego noise predicting section 200 detects operational states of motors used in the robot and estimates ego noise based on the operational states. The configuration and function of the ego noise predicting section 200 will be described in detail later.
The missing feature mask generating section 300 generates missing feature masks which are appropriate for each individual speaker (sound source) in the environment, based on outputs of the sound source separating and speech enhancing section 100 and the ego noise predicting section 200. The configuration and function of the missing feature mask generating section 300 will be described in detail later.
The acoustic feature extracting section 401 extracts an acoustic feature for each individual speaker (sound source) from the outputs of the sound source separating and speech enhancing section 100.
The speech recognizing section 501 performs speech recognition using the acoustic feature for each individual speaker (sound source) obtained by the acoustic feature extracting section 401 and missing feature masks for each individual speaker (sound source) obtained by the missing feature mask generating section 300.
The ego noise predicting section 200 will be described below. The ego noise predicting section 200 includes an operational state detecting section 201 which detects operational states of motors used in the robot, a template database 205 which stores noise templates corresponding to respective operational states and a noise template selecting section 203 which selects the noise template corresponding to the operational state which is the closest to the current operational state detected by the operational state detecting section 201. The noise template selected by the noise template selecting section 203 corresponds to an estimated noise.
To generate the template database 205, the robot performs a sequence of motions, with pauses of less than one second set between consecutive motions. Meanwhile, the operational state detecting section 201 detects operational states, and acoustic data are obtained.
In step S1010 of
An operational state of the robot is represented by an angle $\theta$, an angular velocity $\dot{\theta}$, and an angular acceleration $\ddot{\theta}$ of each joint motor of the robot. Assuming that the number of the joints of the robot is J, a feature vector representing an operational state is as below.
$[\theta_1(k), \dot{\theta}_1(k), \ddot{\theta}_1(k), \ldots, \theta_J(k), \dot{\theta}_J(k), \ddot{\theta}_J(k)]$
Parameter k represents time. Values of the angle $\theta$, the angular velocity $\dot{\theta}$, and the angular acceleration $\ddot{\theta}$ are obtained at the predetermined time and normalized to the range [−1, 1].
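By way of illustration only, the construction of the normalized feature vector may be sketched as follows. The function names `normalize` and `state_vector`, the linear scaling, and the per-quantity limits passed in `limits` are assumptions introduced for this sketch; the description states only that the values are normalized to the range [−1, 1].

```python
from typing import List, Sequence, Tuple

Range = Tuple[float, float]  # hypothetical (minimum, maximum) limits of a quantity


def normalize(value: float, limits: Range) -> float:
    """Map value linearly from [lo, hi] onto [-1, 1], clamping out-of-range input."""
    lo, hi = limits
    if hi <= lo:
        raise ValueError("upper limit must exceed lower limit")
    scaled = 2.0 * (value - lo) / (hi - lo) - 1.0
    return max(-1.0, min(1.0, scaled))


def state_vector(angles: Sequence[float],
                 velocities: Sequence[float],
                 accelerations: Sequence[float],
                 limits: Tuple[Range, Range, Range]) -> List[float]:
    """Build the 3J-dimensional vector [angle_1, velocity_1, acceleration_1, ...]."""
    if not (len(angles) == len(velocities) == len(accelerations)):
        raise ValueError("each joint needs an angle, a velocity and an acceleration")
    angle_lim, velocity_lim, acceleration_lim = limits
    vector: List[float] = []
    for angle, velocity, acceleration in zip(angles, velocities, accelerations):
        vector.extend([
            normalize(angle, angle_lim),
            normalize(velocity, velocity_lim),
            normalize(acceleration, acceleration_lim),
        ])
    return vector
```

For a robot with J joints the returned list has 3J entries, ordered joint by joint as in the feature vector above.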
In step S1020 of
[D(1,k),D(2,k), . . . ,D(F,k)]
Parameter k represents time, and the first index, which runs from 1 to F, identifies a frequency range. The frequency ranges are obtained by dividing a range from 0 kHz to 8 kHz into 256 segments. The acoustic data are obtained at the predetermined time.
In step S1030 of
$[\theta_1(k), \dot{\theta}_1(k), \ddot{\theta}_1(k), \ldots, \theta_J(k), \dot{\theta}_J(k), \ddot{\theta}_J(k)]$
and a frequency spectrum corresponding to the operational state
[D(1,k),D(2,k), . . . ,D(F,k)]
are stored in the template database 205.
In step S1040 of
Feature vectors representing operational states and frequency spectra of acoustic data are time-tagged. Accordingly, templates can be generated by combining a feature vector and a frequency spectrum whose time tags agree with each other. The template database shown in
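By way of illustration only, the pairing of time-tagged feature vectors and frequency spectra into templates may be sketched as follows. Representing each stream as a dictionary keyed by an integer time tag, and the name `build_templates`, are assumptions made for this sketch and are not taken from the description.

```python
from typing import Dict, List, Tuple

# A template pairs an operational-state feature vector with a noise spectrum.
Template = Tuple[List[float], List[float]]


def build_templates(states: Dict[int, List[float]],
                    spectra: Dict[int, List[float]]) -> List[Template]:
    """Combine a feature vector and a spectrum whenever their time tags agree."""
    shared_tags = sorted(states.keys() & spectra.keys())
    return [(states[tag], spectra[tag]) for tag in shared_tags]
```

Time tags present in only one of the two streams produce no template in this sketch.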
In step S2010 of
In step S2020 of
Assuming that the number of the joints of the robot is J, feature vectors of operational states correspond to points in a 3J-dimensional space. A feature vector of an operational state represented by an arbitrary template in the template database 205 is represented by
$\vec{s} = (s_1, s_2, \ldots, s_{3J})$
while a feature vector of the obtained operational state is represented by
$\vec{q} = (q_1, q_2, \ldots, q_{3J})$.
Then, selecting the template corresponding to the operational state which is the closest to the obtained operational state corresponds to obtaining the template having the feature vector
$\vec{s} = (s_1, s_2, \ldots, s_{3J})$
which minimizes the distance
$d(\vec{s}, \vec{q}) = \sqrt{\sum_{i=1}^{3J} (s_i - q_i)^2}$
in the 3J-dimensional Euclidean space.
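By way of illustration only, the selection of the closest template may be sketched as an exhaustive nearest-neighbor search under the Euclidean distance. The names `euclidean` and `select_template` and the list-of-pairs layout of the database are assumptions made for this sketch; the description does not state how the search over the template database 205 is carried out.

```python
import math
from typing import List, Sequence, Tuple

Template = Tuple[List[float], List[float]]  # (state feature vector, noise spectrum)


def euclidean(s: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal dimension."""
    if len(s) != len(q):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((si - qi) ** 2 for si, qi in zip(s, q)))


def select_template(templates: Sequence[Template],
                    query: Sequence[float]) -> List[float]:
    """Return the noise spectrum whose stored state is closest to the query state."""
    if not templates:
        raise ValueError("template database is empty")
    _, spectrum = min(templates, key=lambda t: euclidean(t[0], query))
    return spectrum
```

A linear scan is the simplest choice; for a large database a spatial index could replace it without changing the result.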
The missing feature mask generating section 300 will be described below. A missing feature mask will be called an MFM hereinafter. The MFM generating section 300 includes a speaker MFM generating section 301, an ego noise MFM generating section 303 and an MFM integrating section 305 which integrates both MFMs to generate a single MFM.
Missing feature theory automatic speech recognition (MFT-ASR) is a Hidden Markov Model based speech recognition technique that applies a mask to decrease the contribution of unreliable parts of distorted speech (B. Raj and R. M. Stern, “Missing-feature approaches in speech recognition”, IEEE Signal Processing Magazine, vol. 22, pp. 101-116, 2005.). Keeping the reliable parameters that are essential for speech recognition yields a substantial increase in recognition accuracy.
The speaker MFM generating section 301 obtains a reliability against speaker separation artifacts (a separation reliability) and generates a speaker MFM based on the reliability. A separation reliability for a speaker is represented by the following expression, for example (S. Yamamoto, J. M. Valin, K. Nakadai, J. Rouat, F. Michaud, T. Ogata, and H. G. Okuno, “Enhanced Robot Speech Recognition Based on Microphone Array Source Separation and Missing Feature Theory”, IEEE/RSJ International Conference on Robotics and Automation (ICRA), 2005).
Ŝin and Ŝout are respectively the post-filter input and output energy estimates of the speech enhancing section 105 for time-series frame k and Mel-frequency band f. B̂(f, k) denotes the background noise estimate and mm(f, k) gives a measure for the reliability. The input energy estimate Ŝin of the speech enhancing section 105 is a sum of Ŝout, the background noise estimate B̂(f, k) and a leak energy estimate. So, the reliability of separation of speakers becomes 1 when there exists no leak (when a speech is completely separated without blending of any other sounds from other sources) and the reliability of separation of speakers approaches 0 as the leak becomes larger. The reliability of separation of speakers is obtained for each sound source, each time-series frame k and each Mel-frequency band f. Generation of a speaker MFM based on the reliability of separation of speakers thus obtained will be described later.
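By way of illustration only, the separation reliability may be sketched as follows. The sketch assumes the form mm = (Ŝout + B̂) / Ŝin, which is one reading consistent with the description above (equal to 1 with no leak, approaching 0 as the leak grows). The function name, the clamping to [0, 1] and the handling of zero input energy are likewise assumptions.

```python
def separation_reliability(s_in: float, s_out: float, background: float) -> float:
    """Assumed form: m_m = (S_out + B) / S_in, clamped to the range [0, 1]."""
    if s_in <= 0.0:
        # Assumption: with no input energy the feature is treated as unreliable.
        return 0.0
    return max(0.0, min(1.0, (s_out + background) / s_in))
```

The function would be evaluated once per sound source, time-series frame k and Mel-frequency band f.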
The ego noise MFM generating section 303 obtains a reliability of ego noise and generates an ego noise MFM based on the reliability. It is assumed that ego noise, that is, motor noise of the robot is distributed uniformly among the existing sound sources. Accordingly, noise energy for a sound source is obtained by dividing the whole energy by the number of the sound sources (the number of the speakers). The reliability for ego noise can be represented by the following expression.
Ŝe(f, k) is the noise template, that is, the noise energy estimate, and l represents the number of speakers. To keep the value ranges of me(f, k) and mm(f, k) consistent, me(f, k) is limited to values between 0 and 1. According to Expression (3), if high motor noise Ŝe(f, k) is estimated, the reliability is zero, whereas low motor noise sets me(f, k) close to 1. The reliability of ego noise is obtained for each sound source, each time-series frame k and each Mel-frequency band f.
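By way of illustration only, the reliability for ego noise may be sketched as follows. The sketch assumes the form me = 1 − min(1, (Ŝe / l) / Ŝout), which is one reading consistent with the description: the noise energy is shared equally among the l sound sources, the result stays between 0 and 1, high motor noise gives 0 and low motor noise gives a value close to 1. The function name and the handling of zero output energy are assumptions.

```python
def ego_noise_reliability(s_e: float, s_out: float, num_speakers: int) -> float:
    """Assumed form: m_e = 1 - min(1, (S_e / l) / S_out)."""
    if num_speakers < 1:
        raise ValueError("at least one sound source is required")
    if s_out <= 0.0:
        # Assumption: with no separated-source energy the feature is unreliable.
        return 0.0
    ratio = (s_e / num_speakers) / s_out
    return 1.0 - min(1.0, ratio)
```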
Generation of the speaker MFM based on the reliability of separation of speakers and generation of the ego noise MFM based on the reliability for ego noise will be described below. Masks can be grouped into hard masks, which take a value of either 0 or 1, and soft masks, which take any value between 0 and 1 inclusive. The hard mask (hard MFM) can be represented by the following expression. The subscript x stands for either m, which denotes the reliability of separation of speakers mm(f, k) and the mask for a speaker Mm(f, k), or e, which denotes the reliability for ego noise me(f, k) and the mask for the ego noise Me(f, k).
Mx(f,k)=1 if mx(f,k)≧Tx
Mx(f,k)=0 if mx(f,k)<Tx (4)
The soft mask (soft MFM) can be represented by the following expression.
σx is the tilt value of a sigmoid weighting function and Tx is a predefined threshold. A speech feature is considered unreliable, if the reliability measure is below a threshold value Tx.
Further, a concept called minimum energy criterion (mec) is introduced. If the energy of the noisy signal is smaller than a given threshold Tmec, the mask is determined by the following expression.
Mx(f,k)=0 if Ŝout(f,k)<Tmec (6)
The minimum energy criterion is used to avoid wrong estimations caused by computations performed with very low-energy signals, e.g. during pauses or silent moments.
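By way of illustration only, the hard mask of Expression (4), a soft mask, and the minimum energy criterion of Expression (6) may be sketched as follows. The hard mask and the minimum energy criterion follow the expressions as given. The soft mask is an assumption: a sigmoid with tilt σx centered on Tx, set to 0 below the threshold, which is one form consistent with the description of σx and Tx. All function names are hypothetical.

```python
import math


def hard_mask(reliability: float, threshold: float) -> float:
    """Expression (4): 1 if the reliability reaches the threshold, otherwise 0."""
    return 1.0 if reliability >= threshold else 0.0


def soft_mask(reliability: float, threshold: float, tilt: float) -> float:
    """Assumed sigmoid form: 1 / (1 + exp(-tilt * (m - T))) for m >= T, else 0."""
    if reliability < threshold:
        return 0.0  # below the threshold the feature is considered unreliable
    return 1.0 / (1.0 + math.exp(-tilt * (reliability - threshold)))


def apply_mec(mask: float, s_out: float, t_mec: float) -> float:
    """Expression (6): force the mask to 0 when the signal energy is below T_mec."""
    return 0.0 if s_out < t_mec else mask
```

In use, `apply_mec` would be applied to whichever mask was computed, so that low-energy frames such as pauses do not contribute.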
The MFM integrating section 305 integrates a speaker MFM and an ego noise MFM to generate a total MFM. As described above, the speaker MFM and the ego noise MFM serve different purposes. Nevertheless, they can be used in a complementary fashion within the context of multi-speaker speech recognition under ego-motion noise. The total mask can be represented by the following expression.
Mtot(f,k)=wmMm(f,k) ∔ weMe(f,k) (7)
Mtot(f, k) is the total mask and wx is the weight of the corresponding mask. The operator ∔ denotes any way of integration, including the AND operation and the OR operation.
Explanation of the flowchart of
In step S0010 of
In step S0020 of
In step S0030 of
In step S0040 of
In step S0050 of
In step S0060 of
In step S0070 of
Experiments
Experiments for checking performance of the speech recognition system will be described below.
1) Experimental Settings
A humanoid robot is used for the experiments. The robot is equipped with an 8-ch microphone array on top of its head. Of the robot's many degrees of freedom, only a vertical head motion (tilt) and 4 motors for the motion of each arm, with altogether 9 degrees of freedom, were used. Random motions performed by the given set of limbs were recorded, yielding a training database 30 minutes long and a test database 10 minutes long. Because the noise recordings are comparatively longer than the utterances used in the isolated word recognition, those segments in which all joints contribute to the noise were selected. After normalizing the energies of the utterances to yield an SNR of −6 dB (noise: two other interfering speakers), the noise signal consisting of ego noise (including ego-motion noise and fan noise) and environmental background noise is mixed with clean speech utterances. This Japanese word dataset includes 236 words for 1 female and 2 male speakers that are used in a typical humanoid robot interaction dialog. Acoustic models are trained with the Japanese Newspaper Article Sentences (JNAS) corpus, 60 hours of speech data spoken by 306 male and female speakers; hence the speech recognition is a word-open test. 13 static MSLS (Mel-scale logarithmic spectrum), 13 delta MSLS and 1 delta power were used as acoustic features. Speech recognition results are given as average Word Correct Rates (WCR).
MFMs with the following heuristically selected parameters Te=0, Tm=0.2, σe=2, σm=0.003 (energy interval: [0, 1]) were evaluated.
2) Results of the Experiments
1) Soft masks outperform hard masks for almost every condition. This improvement is attained due to the improved probabilistic representation of the reliability of each feature.
2) The ego noise masks perform well for low SNRs (Signal to Noise Ratio); however, WCRs deteriorate for high SNRs. The reason is that faulty predictions of ego-motion noise degrade the quality of the mask, and thus the ASR accuracy, more for clean speech than for noisy speech. On the other hand, at high SNRs (implying no robotic motion or very loud speech) multi-speaker masks improve the outcomes significantly, but their contribution suffers at lower SNRs.
3) As the separation interval gets narrower, the WCRs tend to drop drastically. A slight increase in the accuracy provided by speaker masks Mm compared to ego noise masks Me at −5 dB for narrow separation angles was observed. The reason is that the artifacts caused by sound source separation for very close speakers become very dominant.
Based on the assessment of trends 1) and 2) described above, in Expression (7) representing total masks, soft masks are adopted as speaker masks Mm and ego noise masks Me, and weights wx are determined by the following expression.
{we,wm}={1,0} if SNR<0
{we,wm}={0,1} if SNR≧0 (8)
SNR represents the signal to noise ratio. The signal to noise ratio is set for each speaker to the ratio of an output of the speech enhancing section 105 to a quotient obtained by dividing an output of the ego noise predicting section 200 by the number of the speakers.
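By way of illustration only, the weight selection of Expression (8) may be sketched as follows. The sketch assumes that the SNR is expressed in decibels, so that SNR ≥ 0 corresponds to the output energy for a speaker being at least the per-speaker share of the ego noise energy. The function name and argument names are hypothetical.

```python
def total_mask(speaker_mask: float, ego_noise_mask: float,
               s_out: float, s_e: float, num_speakers: int) -> float:
    """Select the speaker mask when SNR >= 0 dB, otherwise the ego noise mask."""
    if num_speakers < 1:
        raise ValueError("at least one speaker is required")
    noise_share = s_e / num_speakers  # ego noise energy attributed to one speaker
    if s_out >= noise_share:
        # {w_e, w_m} = {0, 1}: the speaker mask becomes the total mask.
        return speaker_mask
    # {w_e, w_m} = {1, 0}: the ego noise mask becomes the total mask.
    return ego_noise_mask
```

Comparing energies directly avoids taking a logarithm, which would be undefined when either energy is zero.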
On the other hand, the WCR of total masks adopting AND-based or OR-based integration was lower than the better of the WCR of speaker masks Mm and the WCR of ego noise masks Me.
Number | Date | Country | Kind |
---|---|---|---|
2010-232817 | Oct 2010 | JP | national |
Number | Name | Date | Kind |
---|---|---|---|
6993483 | Milner | Jan 2006 | B1 |
8019089 | Seltzer et al. | Sep 2011 | B2 |
8073690 | Nakadai et al. | Dec 2011 | B2 |
8392185 | Nakadai et al. | Mar 2013 | B2 |
20080071540 | Nakano et al. | Mar 2008 | A1 |
Entry |
---|
Nakadai, K.; Yamamoto, S.; Okuno, H.G.; Nakajima, H.; Hasegawa, Y.; Tsujino, H., “A robot referee for rock-paper-scissors sound games,” IEEE International Conference on Robotics and Automation (ICRA 2008), pp. 3469-3474, May 19-23, 2008. |
Yamamoto, S.; Nakadai, K.; Nakano, M.; Tsujino, H.; Valin, J.-M.; Komatani, K.; Ogata, T.; Okuno, H.G., “Design and implementation of a robot audition system for automatic speech recognition of simultaneous speech,” IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2007), pp. 111-116, Dec. 9-13, 2007. |
Nishimura, Y.; Ishizuka, M.; Nakadai, K.; Nakano, M.; Tsujino, H., “Speech Recognition for a Humanoid with Motor Noise Utilizing Missing Feature Theory,” 6th IEEE-RAS International Conference on Humanoid Robots, 2006, pp. 26-33, Dec. 4-6, 2006. |
Yamamoto, S.; Nakadai, K.; Tsujino, H.; Yokoyama, T.; Okuno, H.G., “Improvement of robot audition by interfacing sound source separation and automatic speech recognition with Missing Feature Theory,” Proceedings of the 2004 IEEE International Conference on Robotics and Automation (ICRA '04), vol. 2, pp. 1517-1523, Apr. 26-May 1, 2004. |
Steven F. Boll, “Suppression of Acoustic Noise in Speech Using Spectral Subtraction”, IEEE transactions on Acoustics, Speech and Signal Processing, vol. ASSP-27, No. 2, Apr. 1979, pp. 113-120. |
A. Ito et al., “Internal Noise Suppression for Speech Recognition by Small Robots”, Interspeech 2005, pp. 2685-2688. |
I Cohen et al., “Noise Estimation by Minima Controlled Recursive Averaging for Robust Speech Enhancement”, IEEE Signal Processing Letters, vol. 9, No. 1, Jan. 2002, pp. 12-15. |
Kazuhiro Nakadai et al., “Sound Source Separation of Moving Speakers for Robot Audition”, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009, pp. 3685-3688. |
Shun'Ichi Yamamoto et al., “Enhanced Robot Speech Recognition Based on Microphone Array Source Separation and Missing Feature Theory” IEEE/RSJ International conference on Robotics and Automation (ICRA), 2005, pp. 1-6. |
Shun'Ichi Yamamoto et al., “Real-Time Robot Audition System That Recognizes Simultaneous Speech in the Real World”, Proc. of the IEEE/RSJ International Conference on Robots and Intelligent Systems (IROS), 2006, pp. 5333-5338. |
Jean-Marc Valin et al., “Enhanced Robot Audition Based on Microphone Array Source Separation with Post-Filter”, Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2004, pp. 2123-2128. |
Israel Cohen et al., “Microphone Array Post-Filtering for Non-Stationary Noise Suppression”, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2002, pp. 901-904. |
Bhiksha Raj et al., “Missing-Feature Approaches in Speech Recognition”, IEEE Signal Processing Magazine, vol. 22, 2005, pp. 101-116. |
Number | Date | Country | |
---|---|---|---|
20120095761 A1 | Apr 2012 | US |