An embodiment of the invention relates generally to a system and method for performing speech enhancement using deep neural network-based signal statistics.
Currently, a number of consumer electronic devices are adapted to receive speech from a near-end talker (or environment) via microphone ports, transmit this signal to a far-end device, and concurrently output audio signals, including the speech of a far-end talker, that are received from the far-end device. While the typical example is a portable telecommunications device (mobile telephone), with the advent of Voice over IP (VoIP), desktop computers, laptop computers, and tablet computers may also be used to perform voice communications.
When using these electronic devices, the user also has the option of using the speakerphone mode, the at-ear handset mode, or a headset to capture his speech. However, a common complaint with any of these modes of operation is that the speech captured by the microphone port or the headset includes environmental noise, such as wind noise, secondary speakers in the background, or other background noises. This environmental noise often renders the user's speech unintelligible and thus degrades the quality of the voice communication. Additionally, when the user's speech is unintelligible, further processing of the captured speech also suffers. Further processing may include, for example, automatic speech recognition (ASR).
The embodiments of the invention are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” embodiment of the invention in this disclosure are not necessarily to the same embodiment, and they mean at least one. In the drawings:
In the following description, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown to avoid obscuring the understanding of this description.
In the description, certain terminology is used to describe features of the invention. For example, in certain situations, the terms “component,” “unit,” “module,” and “logic” are representative of hardware and/or software configured to perform one or more functions. For instance, examples of “hardware” include, but are not limited or restricted to, an integrated circuit such as a processor (e.g., a digital signal processor, microprocessor, application-specific integrated circuit, micro-controller, etc.). Of course, the hardware may be alternatively implemented as a finite state machine or even combinatorial logic. An example of “software” includes executable code in the form of an application, an applet, a routine, or even a series of instructions. The software may be stored in any type of machine-readable medium.
While not shown, the electronic device 10 may also be used with a headset that includes a pair of earbuds and a headset wire. The user may place one or both of the earbuds into his ears, and the microphones in the headset may receive his speech.
The microphone 120 may be an air-interface sound pickup device that converts sound into an electrical signal. As the near-end user is using the electronic device 10 to transmit his speech, ambient noise may also be present. Thus, the microphone 120 captures the near-end user's speech as well as the ambient noise around the electronic device 10. A reference signal may be used to drive the loudspeaker 130 to generate a loudspeaker signal. The loudspeaker signal that is output from the loudspeaker 130 may also be part of the environmental sound that is captured by the microphone 120; if so, the loudspeaker signal is fed back, via the near-end device's microphone signal, to the far-end device. Because the loudspeaker 130 is driven by the far-end device's downlink signal, components of the far-end talker's own speech would be returned to the far-end device and heard as echo. Thus, the microphone 120 may receive at least one of: a near-end talker signal (e.g., a speech signal), an ambient near-end noise signal, or a loudspeaker signal. The microphone 120 generates and transmits a microphone signal (e.g., an acoustic signal).
In one embodiment, system 200 further includes an acoustic echo canceller (AEC) 140 that is a linear echo canceller. For example, the AEC 140 may be an adaptive filter that linearly estimates the echo to generate a linear echo estimate. In some embodiments, the AEC 140 generates an echo-cancelled signal using the linear echo estimate.
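The adaptive algorithm is not specified in this description; the sketch below assumes a conventional normalized least-mean-squares (NLMS) filter, a common realization of a linear AEC. All names and parameters (filter length, step size) are illustrative, not taken from this description:

```python
import numpy as np

def nlms_aec(mic, ref, filter_len=256, mu=0.5, eps=1e-8):
    """Minimal NLMS echo canceller sketch: adapts an FIR estimate of the
    linear echo path from the reference (loudspeaker) signal and subtracts
    the estimated echo from the microphone signal. Assumes ref is at least
    as long as mic."""
    mic = np.asarray(mic, dtype=float)
    ref = np.asarray(ref, dtype=float)
    w = np.zeros(filter_len)                      # echo-path estimate
    out = np.zeros(len(mic))                      # echo-cancelled signal
    padded = np.concatenate([np.zeros(filter_len - 1), ref])
    for n in range(len(mic)):
        x = padded[n:n + filter_len][::-1]        # latest reference samples
        e = mic[n] - w @ x                        # error = cancelled sample
        w += mu * e * x / (x @ x + eps)           # normalized LMS update
        out[n] = e
    return out
```

Because the update is purely linear in the reference, such a filter can only model the linear portion of the echo path, which motivates the DNN-based stages described below.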
System 200 further includes a loudspeaker signal estimator 150 that receives the microphone signal from the microphone 120 and the AEC echo-cancelled signal from the AEC 140. The loudspeaker signal estimator 150 uses the microphone signal and the AEC echo-cancelled signal to estimate the loudspeaker signal that is received by the microphone 120. The loudspeaker signal estimator 150 generates a loudspeaker signal estimate.
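The internal operation of the loudspeaker signal estimator 150 is not detailed here. One simple reading, assumed in the sketch below, is that the component the AEC 140 removed (the microphone signal minus the echo-cancelled signal, i.e., the linear echo estimate) serves as the estimate of the loudspeaker signal received at the microphone 120:

```python
import numpy as np

def estimate_loudspeaker_signal(mic, aec_out):
    """Hypothetical loudspeaker signal estimate: the component removed by
    the AEC, taken as a proxy for the loudspeaker signal picked up by the
    microphone. This realization is an assumption, not the patent's."""
    return np.asarray(mic, dtype=float) - np.asarray(aec_out, dtype=float)
```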
The DNN 170 is trained offline, before run-time operation, using a target training signal that includes a signal approximation of clean speech. Once the DNN 170 is trained offline, the DNN 170 receives the microphone signal, the reference signal, the AEC echo-cancelled signal, and the estimated loudspeaker signal in the frequency domain, and generates a speech reference signal that includes signal statistics for residual echo or signal statistics for noise.
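The topology of the DNN 170 is not fixed by this description. The following numpy sketch assumes a small fully connected network with ReLU hidden layers whose linear output layer regresses per-bin residual echo and noise statistics; all layer sizes and the feature layout are illustrative assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dnn_forward(features, weights, biases):
    """Feed-forward pass: ReLU hidden layers, linear output layer that
    regresses per-bin signal statistics (e.g., residual echo and noise
    log-powers). Topology and activations are assumptions."""
    h = features
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return weights[-1] @ h + biases[-1]

# Example: 4 input spectra x 257 bins stacked into one feature vector,
# two hidden layers, 2 x 257 outputs (residual echo and noise powers).
rng = np.random.default_rng(0)
dims = [4 * 257, 512, 512, 2 * 257]
weights = [rng.standard_normal((dims[i + 1], dims[i])) * 0.01 for i in range(3)]
biases = [np.zeros(dims[i + 1]) for i in range(3)]
stats = dnn_forward(rng.standard_normal(4 * 257), weights, biases)
```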
Using the DNN 170 has the advantage that the system 200 is able to address the non-linearities in the electronic device 10 and suppress the noise and the linear and non-linear echoes in the microphone signal accordingly. For instance, the AEC 140 is only able to address the linear echoes in the microphone signal, such that the AEC 140's performance may suffer from the non-linearities of the electronic device 10.
Further, a traditional residual echo power estimator that is used in lieu of the DNN 170 in conventional systems may not reliably estimate the residual echo due to the non-linearities that are not addressed by the AEC 140. In conventional systems, this results in residual echo leakage. The DNN 170, by contrast, is able to accurately estimate the residual echo in the microphone signal even during double-talk situations, which yields higher near-end speech quality during double-talk. The DNN 170 is also able to accurately estimate the near-end noise power level so as to minimize the impairment to near-end speech after noise suppression.
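The gain rule used by the noise suppressor is not specified in this description; the sketch below assumes a conventional Wiener-style spectral gain built from the DNN-estimated residual echo and noise powers, with an assumed gain floor:

```python
import numpy as np

def suppression_gain(mic_power, residual_echo_power, noise_power, g_min=0.1):
    """Wiener-style gain per time-frequency bin: attenuate where the
    estimated residual echo plus noise dominates the microphone power.
    The specific gain rule and the floor g_min are assumptions."""
    interference = residual_echo_power + noise_power
    gain = 1.0 - interference / np.maximum(mic_power, 1e-12)
    return np.clip(gain, g_min, 1.0)
```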
The frequency-time transformer 180 then receives the clean speech signal in the frequency domain from the DNN 170 and performs an inverse transformation to generate a clean speech signal in the time domain. In one embodiment, the frequency-time transformer 180 performs an Inverse Short-Time Fourier Transform (ISTFT) on the clean speech signal in the frequency domain to obtain the clean speech signal in the time domain.
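For concreteness, the analysis transform (as in the time-frequency transformer 160) and the synthesis transform can be sketched with scipy.signal; the sampling rate, window length, and overlap below are assumptions:

```python
from scipy.signal import stft, istft

fs = 16000                      # assumed sampling rate
nperseg, noverlap = 512, 384    # assumed window length and overlap

def to_freq(x):
    """Analysis STFT, standing in for the time-frequency transformer 160."""
    _, _, X = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return X

def to_time(X):
    """Inverse STFT, standing in for the frequency-time transformer 180."""
    _, x = istft(X, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return x
```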
In both systems 400 and 500, each of the feature processors 410₁-410₄ receives a respective one of the microphone signal, the reference signal, the AEC echo-cancelled signal, and the estimated loudspeaker signal in the frequency domain from the time-frequency transformer 160.
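The features computed by each processor are not enumerated here; a common choice, assumed in the sketch below, is the per-frame log-magnitude spectrum of each frequency-domain input:

```python
import numpy as np

def log_magnitude_features(X, eps=1e-8):
    """One plausible per-signal feature: log-magnitude spectrum of a
    frequency-domain signal X (bins x frames). Log compression is an
    assumption; this description does not name the exact features."""
    return np.log(np.abs(X) + eps)
```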
The feature normalization may be calculated based on the mean and standard deviation of the training data. The normalization may be performed over all feature dimensions, on a per-feature-dimension basis, or a combination thereof. In one embodiment, the mean and standard deviation may be integrated into the weights and biases of the first and output layers of the DNN 170 to reduce computational complexity.
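The integration of the mean and standard deviation into the first layer follows from simple algebra: W((x − μ)/σ) + b = (W/σ)x + (b − (W/σ)μ), so the explicit normalization step can be dropped at run time. A minimal sketch of this folding for the first layer (the output layer can absorb target denormalization analogously):

```python
import numpy as np

def fold_normalization(W1, b1, mu, sigma):
    """Fold per-dimension mean/std normalization into the first layer:
    W1 @ ((x - mu) / sigma) + b1  ==  W1_f @ x + b1_f."""
    W1_f = W1 / sigma                   # scale each input column by 1/sigma
    b1_f = b1 - W1_f @ mu               # absorb the mean shift into the bias
    return W1_f, b1_f

# Quick check on random data:
rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((8, 5)), rng.standard_normal(8)
mu, sigma = rng.standard_normal(5), rng.uniform(0.5, 2.0, 5)
x = rng.standard_normal(5)
W1_f, b1_f = fold_normalization(W1, b1, mu, sigma)
assert np.allclose(W1 @ ((x - mu) / sigma) + b1, W1_f @ x + b1_f)
```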
The following embodiments of the invention may be described as a process, which is usually depicted as a flowchart, a flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed. A process may correspond to a method, a procedure, etc.
The method 700 starts at Block 701 with training a DNN offline by exciting at least one microphone using a target training signal that includes a signal approximation of clean speech. At Block 702, a loudspeaker is driven with a reference signal and the loudspeaker outputs a loudspeaker signal. At Block 703, the at least one microphone generates a microphone signal based on at least one of: a near-end talker signal, an ambient noise signal, or the loudspeaker signal. At Block 704, an AEC generates an AEC echo-cancelled signal based on the reference signal and the microphone signal. At Block 705, a loudspeaker signal estimator generates an estimated loudspeaker signal based on the microphone signal and the AEC echo-cancelled signal. At Block 706, the DNN receives the microphone signal, the reference signal, the AEC echo-cancelled signal, and the estimated loudspeaker signal, and at Block 707, the DNN generates a speech reference signal that includes signal statistics for residual echo or signal statistics for noise based on the microphone signal, the reference signal, the AEC echo-cancelled signal, and the estimated loudspeaker signal. In one embodiment, the speech reference signal that includes signal statistics for residual echo or signal statistics for noise includes at least one of: an estimate of non-linear echo in the microphone signal that is not cancelled by the AEC, an estimate of residual echo in the microphone signal, or an estimate of the ambient noise power level in the microphone signal. At Block 708, a noise suppressor generates a clean speech signal by suppressing noise or residual echo in the microphone signal based on the speech reference signal.
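Tying the run-time blocks of method 700 together, the following illustrative flow reuses the sketch helpers defined earlier in this description; `dnn` stands in for a trained network mapping stacked features to per-bin residual echo and noise powers (all names are illustrative, not from this description):

```python
import numpy as np

def enhance(mic, ref, dnn):
    """Illustrative run-time flow of method 700 (Blocks 703-708), built
    from the earlier sketches; shapes of the dnn outputs are assumed to
    match the STFT bins/frames of the microphone signal."""
    aec_out = nlms_aec(mic, ref)                          # Block 704
    spk_est = estimate_loudspeaker_signal(mic, aec_out)   # Block 705
    Y, X, E, S = (to_freq(s) for s in (mic, ref, aec_out, spk_est))
    feats = np.vstack([log_magnitude_features(Z) for Z in (Y, X, E, S)])
    echo_pow, noise_pow = dnn(feats)                      # Blocks 706-707
    gain = suppression_gain(np.abs(Y) ** 2, echo_pow, noise_pow)  # Block 708
    return to_time(gain * Y)                              # clean speech
```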
In embodiments where the electronic device 10 takes the form of a computer, such embodiments include computers that are generally portable (such as laptop, notebook, tablet, and handheld computers), as well as computers that are generally used in one place (such as conventional desktop computers, workstations, and servers).
The electronic device 10 may also take the form of other types of devices, such as mobile telephones, media players, personal data organizers, handheld game platforms, cameras, and/or combinations of such devices. For instance, the device 10 may be provided in the form of a handheld electronic device that includes various functionalities (such as the ability to take pictures, make telephone calls, access the Internet, communicate via email, record audio and/or video, listen to music, play games, connect to wireless networks, and so forth).
An embodiment of the invention may be a machine-readable medium having stored thereon instructions which program a processor to perform some or all of the operations described above. A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), such as Compact Disc Read-Only Memory (CD-ROM), Read-Only Memory (ROM), Random Access Memory (RAM), and Erasable Programmable Read-Only Memory (EPROM). In other embodiments, some of these operations might be performed by specific hardware components that contain hardwired logic. Those operations might alternatively be performed by any combination of programmable computer components and fixed hardware circuit components. In one embodiment, the machine-readable medium includes instructions stored thereon which, when executed by a processor, cause the processor to perform the method on an electronic device as described above.
While the invention has been described in terms of several embodiments, those of ordinary skill in the art will recognize that the invention is not limited to the embodiments described, but can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative instead of limiting. There are numerous other variations to different aspects of the invention described above, which in the interest of conciseness have not been provided in detail. Accordingly, other embodiments are within the scope of the claims.