In the field of voice communications, a communication device receives a far-end signal from a far-end talker typically over a network. The far-end signal is played via a loudspeaker of the communication device. A person who is co-located with a communication device is known as a “near-side talker.” A near-side talker may be relatively far away from the microphones of the communication device, as compared to a distance of the loudspeaker from the microphones. Accordingly, sound played out of the loudspeaker (e.g., sound corresponding to the far-end signal) echoes and reaches the microphones, along with sound from the near-side talker. Double talk refers to a situation where sound from the near-side talker reaches the microphones simultaneously with sound from the far-end talker (e.g., from the loudspeaker).
Because sound from the loudspeaker reaches the microphones along with sound from the near-side talker, the near-to-far ratio may decrease during double talk, resulting in poor bi-directional communication performance of the communication device. The near-to-far ratio, or NFR, is the ratio of the power of sound from the near-side talker to the power of sound from the far-end talker, as measured at the microphones.
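Stated as a formula, where P_near and P_far denote the power of the near-side and far-end components as measured at the microphones (the notation is chosen here only for illustration):

```latex
\mathrm{NFR} = \frac{P_{\text{near}}}{P_{\text{far}}}
```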
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same reference numbers in different figures indicate similar or identical items.
Overview
Described herein are techniques to improve acoustic performance of a communication device. The techniques include use of an acoustic-echo processing (AEP) module that processes signals to adjust and/or cancel components of those signals that are emitted from a loudspeaker, such as voice sounds from a far-end talker or audio from a media source, and that may reach one or more microphones of the device. The AEP module is therefore often referred to as an acoustic echo cancellation module, since its purpose is to process signals in order to adjust and/or cancel components within the signal that can result in echo. The AEP module outputs an echo-processed signal, which is fed into a speech recognizer to recognize its content, such as whether the content includes the near-side talker issuing a command. The speech recognizer produces a confidence score when it detects certain keywords, and that score is fed back to the AEP module to improve its performance, particularly during a double talk situation. Several embodiments and enhancements are described for this architecture, in which an AEP module utilizes feedback from a speech recognizer.
In one embodiment, a communication device may comprise one or more loudspeakers and one or more microphones. The communication device may be, for example, a telephone handset, a headset, a smart phone, a conference phone, a cellular phone, a voice controlled assistant, or any appropriate consumer electronic device that is capable of producing sound from one or more loudspeakers and receiving sound from one or more microphones. For clarity, the present disclosure will be described with respect to only one loudspeaker and only one microphone. This is not meant to be limiting. In various implementations, there are multiple loudspeakers and an array of multiple microphones.
In one implementation, the loudspeaker may output sound from a far-end talker, from a media source (e.g., music or a video), etc. The microphone may receive sound from a near-side talker located proximate to the communication device. The microphone may also receive sound that is echoed from the loudspeaker. That is, the microphone may receive signals that include components from the near-side talker and also components from a far-end talker via echo from the loudspeaker, resulting in double talk. This may decrease the near-to-far ratio, which is the ratio of the power of the near-side talker with respect to the far-end talker. Detecting a double talk scenario is important for echo cancellation processing, since the adaptive filter of the AEP module should not adapt when a near-side talker is present in the room. If the filter continues to adapt during double talk, the adaptive filter coefficients could diverge, and in that event the performance of the AEP module in canceling echo will degrade.
When double talk is present at the communication device, an AEP module processes the signals in order to attempt to cancel out components of the signal that undesirably interfere with the primary portion or desired components of the signal. For example, the communication device may include a loudspeaker that is being utilized to produce sounds such as, for example, the voice of a far-end talker, music, audio from a video source, and so forth. A user of the communication device may speak to the communication device through a microphone on the communication device. Thus, the user can function as a near-side talker. The user may issue a command to the device, e.g., “call Lynn,” or the user may be using the communication device to speak with Lynn, in which case Lynn's voice comes from the loudspeaker and the user's voice goes into the microphone. If Lynn and the user speak at the same time, the microphone will generally pick up sound from the loudspeaker as well as the user's voice. This results in the occurrence of double talk. Since Lynn presumably wants to hear the user's voice as clearly as possible, the AEP module processes the resulting signal in order to cancel out as much of Lynn's voice from the loudspeaker as possible.
In a situation where the user is speaking a command for the communication device, the communication device needs to be able to discern the command. If the command arrives at the microphone at the same time as sound from the loudspeaker, the AEP module will process the resulting signal to cancel out as much of the sound from the loudspeaker as possible.
Since double talk can occur fairly regularly, it is helpful to predict when double talk will occur. Accordingly, in accordance with various embodiments of the present disclosure, an automatic speech recognition engine is included. The automatic speech recognition engine analyzes an acoustic-echo processed signal received from the AEP module to determine whether certain content is present within the acoustic-echo processed signal. For example, the automatic speech recognition engine analyzes the acoustic-echo processed signal in order to determine whether spoken words, such as commands, are present. Thus, the automatic speech recognition engine may look for certain keywords. If the automatic speech recognition engine determines that certain content, such as, for example, spoken words or speech, is present within the acoustic-echo processed signal, the automatic speech recognition engine assigns a value, generally between zero and one. It is further assumed that a value of 0 implies that no target keywords are detected, while a value of 1 implies that some target keywords are detected; intermediate values between 0 and 1 reflect the confidence of the detection. In accordance with various embodiments, if speech is detected, then the value is one and is provided or fed back to the AEP module.
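A minimal sketch of how a keyword-spotting stage might map detections in the acoustic-echo processed signal to such a value is given below. The keyword list, the posterior-based scoring, and the function and variable names are assumptions made for illustration only; they are not details of any particular recognizer described herein.

```python
# Illustrative sketch (not the actual recognizer): map keyword detections in an
# acoustic-echo processed signal to a confidence value between 0 and 1.

TARGET_KEYWORDS = {"call", "play", "stop"}  # hypothetical keyword list

def keyword_confidence(recognized_words):
    """Return 0.0 if no target keyword was detected, 1.0 for a fully confident
    detection, and an intermediate value otherwise.

    `recognized_words` is assumed to be a list of (word, posterior) pairs
    produced by some keyword-spotting front end."""
    best = 0.0
    for word, posterior in recognized_words:
        if word.lower() in TARGET_KEYWORDS:
            best = max(best, posterior)   # keep the strongest keyword hit
    return min(max(best, 0.0), 1.0)       # clamp to the [0, 1] range

# Example: a confident "call" detection yields a value near 1 that can be fed
# back to the AEP module.
value = keyword_confidence([("call", 0.92), ("lynn", 0.85)])
```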
The AEP module utilizes this value in order to enhance acoustic-echo cancellation processing of future signals. The value indicates to the AEP module that there is a high likelihood of double talk in the near future, i.e., that signals will include a component from a near-side talker, due to the presence of desirable content, i.e., speech from a near-side talker, determined by the automatic speech recognition engine. The value provided by the automatic speech recognition engine can be utilized to facilitate processing of signals by the AEP module, for example by slowing down or halting adjustments to the adaptive filter, whose purpose within the AEP module is to cancel echo that originates from the far end. Additionally, control parameters of the communication device, such as, for example, the volume and/or a frequency of the audio that goes to the loudspeaker, can be adjusted.
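One way such feedback could slow or halt filter adaptation is sketched below. The step-size values and the thresholding policy are assumptions chosen for illustration, not parameters specified by this disclosure.

```python
# Illustrative sketch: scale (or halt) the adaptive filter's step size based on
# the value fed back from the speech recognizer. A value near 1 signals likely
# double talk, so adaptation is slowed or frozen to keep the filter coefficients
# from diverging; a value near 0 allows normal adaptation.

NOMINAL_STEP_SIZE = 0.1   # assumed nominal step size for the adaptive filter

def adaptation_step_size(recognizer_value, nominal=NOMINAL_STEP_SIZE):
    if recognizer_value >= 0.9:      # near-side speech very likely: freeze
        return 0.0
    # Otherwise slow adaptation in proportion to the detection confidence.
    return nominal * (1.0 - recognizer_value)
```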
The techniques and systems described herein may be implemented in a number of ways. Example implementations are provided below with reference to the following figures.
Illustrative Environment and System
The communication device 102 may be implemented in any number of ways. It may be a telephone handset, a headset, a phone, a portable phone, a tablet or computing device, or any of a number of electronic devices that are capable of producing sound from a speaker and receiving sound in one or more microphones. In this illustration, the communication device 102 is implemented as an electronic voice controlled assistant physically positioned on a table 108 within the environment (e.g., a room, hall, office, etc.). In other implementations, it may be placed in any number of locations (e.g., ceiling, wall, in a lamp, beneath a table, on a work desk, in a hall, under a chair, etc.). The device 102 is shown communicatively coupled to far-end talkers 110 over a network 112. The far-end talkers 110 may include individual people, such as person 114, or automated systems 116 that verbally interact with the user 104. The automated system 116 is shown hosted on one or more servers 118_1, . . . , 118_S.
The communication device 102 may include a loudspeaker 120 and a microphone 122. As previously noted, the communication device 102 may comprise one or more loudspeakers 120 and one or more microphones 122. For clarity, the present disclosure is described with respect to only one loudspeaker 120 and only one microphone 122. This is not meant to be limiting.
The loudspeaker 120 may be configured to output sound waves produced by the communication device 102. The sound may be generated based on information stored in the communication device 102 and/or information otherwise received by the communication device 102 from an appropriate source, such as far-end talkers 110. The loudspeaker 120 may be configured to operate in a loudspeaker mode, where sound waves produced by the loudspeaker 120 reach the user 104 and also the microphone 122.
The microphone 122 may receive sound from the user 104 or other sources in the environment. The microphone 122 may also receive sound produced and/or echoed from the loudspeaker 120. Thus, the microphone 122 may receive sound from both the user 104 and the loudspeaker 120. The microphone 122 may also receive sound from other sound sources located proximate to the communication device 102.
In this embodiment, the loudspeaker 120 outputs sound from a far-end talker 110, and the user 104 is a near-side talker for the communication device 102. Thus, the microphone 122 may receive sound from both the near-side talker 104 and the far-end talker 110. A near-to-far ratio refers to a ratio of the power of sound from the near-side talker to the power of sound from the far-end talker, as detected by the microphone 122 of the communication device 102.
During double talk, the microphone 122 may simultaneously receive sound from the near-side talker (e.g., from the user 104) and from the far-end talker 110 (e.g., via echo from the loudspeaker 120). For the far-end talker to hear the near-side talker clearly during double talk, it may be desirable to cancel and/or attenuate echo from the loudspeaker 120 and to enhance sound from the near-side talker in the signals detected by the communication device 102.
Accordingly, the communication device 102 may also include an AEP module 124 and an automatic speech recognition engine 126. The AEP module 124 is often referred to as an acoustic echo cancellation module since the purpose of the AEP module is to process signals in order to adjust and/or cancel components within the signal that can result in echo. In other embodiments and not illustrated in
In an embodiment, the AEP module 124 comprises an adaptive filter 202. The signal processing module 200 receives a signal 204 from the microphone 122, where the content of the signal may include one or more “desirable” components from a near-side signal (i.e., components from the user 104) and one or more “undesirable” components from the loudspeaker 120 (or some other source). The adaptive filter 202 adaptively filters this signal to substantially cancel or remove the undesirable components or elements of the signal. Generally, most or all components or elements of a near-side signal are desirable, while most or all components or elements from the loudspeaker 120 (or some other source) are undesirable. Coefficients of the adaptive filter 202 may be adapted dynamically to capture the undesired components from the loudspeaker 120 and cancel those components from the input of the microphone 122.
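A bare-bones sketch of such an adaptive echo canceller is given below. It uses a time-domain normalized least-mean-squares (NLMS) update, which is only one of several adaptation schemes an AEP module might employ; the tap count, step size, and class name are illustrative assumptions rather than details of the adaptive filter 202 described above.

```python
import numpy as np

# Illustrative NLMS echo canceller sketch: far_end_sample is the reference sent
# to the loudspeaker, mic_sample is the microphone signal (near-side speech plus
# echo). The filter's echo estimate is subtracted from the microphone signal to
# produce the acoustic-echo processed output.

class NlmsEchoCanceller:
    def __init__(self, num_taps=256, step_size=0.1, eps=1e-8):
        self.w = np.zeros(num_taps)   # adaptive filter coefficients
        self.x = np.zeros(num_taps)   # delay line of far-end reference samples
        self.step_size = step_size
        self.eps = eps

    def process(self, far_end_sample, mic_sample, step_size=None):
        mu = self.step_size if step_size is None else step_size
        self.x = np.roll(self.x, 1)
        self.x[0] = far_end_sample
        echo_estimate = np.dot(self.w, self.x)
        e = mic_sample - echo_estimate            # echo-processed output sample
        norm = np.dot(self.x, self.x) + self.eps
        self.w += mu * e * self.x / norm          # NLMS coefficient update
        return e
```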
In an embodiment, the AEP module 124 provides acoustic-echo processed signals 206 to a far-end speaker for output to a far-end talker (illustrated in
In accordance with an embodiment, a value 208 of one indicates to the AEP module 124 that there is a high likelihood of double talk in the near future, i.e., that signals will include a component from a near-side talker, due to the presence of desirable content within an acoustic-echo processed signal 206, i.e., speech from the user 104, as determined by the automatic speech recognition engine 126. The AEP module 124 can therefore use the value 208 to facilitate acoustic-echo cancellation processing of signals 204 and to enhance acoustic-echo cancellation processing of future signals. The value 208 provided by the automatic speech recognition engine 126 can be utilized to facilitate processing of signals by the AEP module 124 by adjusting various control parameters of the AEP module 124 such as, for example, by slowing down or halting adjustments to the adaptive filter 202, whose purpose within the AEP module 124 is to cancel echo that originates from the far end. Additionally, the volume and/or a frequency of the audio that goes to the loudspeaker 120 can be adjusted.
Illustrative Operations
At 402, the automatic speech recognition engine 126 receives an acoustic-echo processed signal 206 from the AEP module 124. At 404, the automatic speech recognition engine 126 determines audio content within the acoustic-echo processed signal 206. At 406, based upon the audio content within the acoustic-echo processed signal, the automatic speech recognition engine 126 produces a value. At 408, the value is provided to the AEP module 124. At 410, based upon the value, acoustic-echo cancellation processing of signals 204 within the AEP module 124 is facilitated.
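Tying the pieces together, the feedback loop of blocks 402-410 might look roughly like the per-frame loop sketched below. The frame-based structure, the `recognize` hook, and the helpers `NlmsEchoCanceller`, `keyword_confidence`, and `adaptation_step_size` (from the earlier illustrative sketches) are all assumptions for illustration, not a definitive implementation of the operations shown.

```python
# Illustrative per-frame loop: the AEP output is passed to the recognizer
# (402-404), the recognizer produces a value that is fed back (406-408), and
# that value throttles filter adaptation on the next frame (410).

def run_feedback_loop(far_end_frames, mic_frames, recognize):
    aep = NlmsEchoCanceller()
    value = 0.0                              # no keywords detected yet
    outputs = []
    for far_frame, mic_frame in zip(far_end_frames, mic_frames):
        mu = adaptation_step_size(value)     # 410: throttle adaptation
        processed = [aep.process(x, d, step_size=mu)
                     for x, d in zip(far_frame, mic_frame)]
        outputs.append(processed)
        # 402-408: recognize content in the processed frame and produce a
        # value for the next frame. `recognize` is an assumed hook returning
        # (word, posterior) pairs for the frame.
        value = keyword_confidence(recognize(processed))
    return outputs
```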
As previously discussed, a value 208 of one indicates to the AEP module 124 that there is a high likelihood of double talk in the near future, i.e., that signals will include a component from a near-side talker, due to the presence of desirable content within an acoustic-echo processed signal 206, i.e., speech from the user 104, as determined by the automatic speech recognition engine 126. The AEP module 124 can therefore use the value 208 to facilitate acoustic-echo cancellation processing of signals 204 and to enhance acoustic-echo cancellation processing of future signals. The value 208 provided by the automatic speech recognition engine 126 can be utilized to facilitate processing of signals by the AEP module 124 by slowing down or halting adjustments to the adaptive filter 202, whose purpose within the AEP module 124 is to cancel echo that originates from the far end. Additionally, the volume and/or a frequency of the audio that goes to the loudspeaker 120 can be adjusted.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as illustrative forms of implementing the claims.