The invention relates to speech processing, and more particularly, to a speech processing apparatus and method for acoustic echo reduction.
Acoustic echo originates in a local audio loopback that occurs when a microphone picks up audio signals from a speaker and sends them back to a far-end talker/user. The far-end talker then hears the echo of his own voice as he speaks. The goal of acoustic echo cancellation/reduction is to reduce or cancel acoustic echoes in a microphone signal and then send the clean microphone signal to the far-end talker, thereby improving the quality and intelligibility of the microphone signal or dialog. In actual implementations, the performance of Acoustic Echo Cancellation (AEC) depends heavily on the mechanical design of the communication device. Poor mechanical designs or mechanical defects, such as gasket leaks or proximity of microphones to speakers, are very likely to cause acoustic echoes. Even with an AEC function, it is difficult for a communication device with such mechanical defects to improve the speech quality.
As is well known in the art, an acoustic path in the communication device guides external sound to the microphone and must not have leaks (such as a gasket leak) that can cause multi-path echo or noise problems. A gasket is made of acoustically opaque material that prevents sound from passing through it. Common gasket materials include various kinds of rubber and compressible, closed-cell foams. The gasket must seal completely to the product case/housing and to the microphone or the printed circuit board (PCB). A leak in the gasket seal allows the speaker output or other noise to propagate inside the product case into the microphone port. In cases where the mechanical design or gasket design cannot be modified, the multi-path echo or noise problems still need to be solved.
What is needed is a speech processing apparatus and method for acoustic echo reduction applicable to communication devices with mechanical defects that cause strong acoustic echoes.
In view of the above-mentioned problems, an object of the invention is to provide a speech processing apparatus capable of reducing acoustic echoes for a communication device having a mechanical defect that causes strong acoustic echoes.
One embodiment of the invention provides a speech processing apparatus in a communication device having a mechanical defect. The apparatus comprises an acoustic echo cancellation (AEC) unit, a multiplier and a processor. The AEC unit cancels an echo in a first audio signal from a microphone using a known AEC algorithm to generate a second audio signal. The multiplier multiplies a gain by corresponding M frames of a downlink audio signal to provide a gained downlink signal for a speaker. The processor performs a set of operations comprising: muting an uplink audio signal when a first power level for M frames of a first input signal associated with the second audio signal is less than a first threshold value; and, reducing the gain when the first power level and a second power level for M frames of a second input signal associated with the downlink audio signal are respectively greater than or equal to the first threshold value and a second threshold value, where M>=1.
Another embodiment of the invention provides a speech processing method applicable to a communication device having a mechanical defect, comprising: cancelling an echo in a first audio signal from one or more microphones using a known AEC algorithm to generate a second audio signal; muting an uplink audio signal when a first power level for M frames of a first input signal associated with the second audio signal is less than a first threshold value; reducing a gain when the first power level and a second power level for M frames of a downlink audio signal are respectively greater than or equal to the first threshold value and a second threshold value; and, multiplying the gain by corresponding M frames of the downlink audio signal to provide a gained downlink signal for a speaker, where M>=1.
Further scope of the applicability of the present invention will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
The present invention will become more fully understood from the detailed description given hereinbelow and the accompanying drawings which are given by way of illustration only, and thus are not limitative of the present invention, and wherein:
As used herein and in the claims, the term “and/or” includes any and all combinations of one or more of the associated listed items. The use of the terms “a” and “an” and “the” and similar referents in the context of describing the invention are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. Throughout the specification, the same components with the same function are designated with the same reference numerals.
The invention deals with strong acoustic echoes caused by a mechanical defect of a communication device. A feature of the invention is to mute an uplink audio signal TX when the power level Pt of the uplink audio signal TX is less than a first threshold value TH1, to prevent a far-end talker from hearing his own acoustic echoes. Another feature of the invention is to reduce the magnitude of a downlink audio signal RX or the volume of the speaker when Pt>=TH1 and the power level Pr of the downlink audio signal RX is greater than or equal to a second threshold value TH2; thus, the magnitudes of the echo signals received by the microphones are reduced and the residual echo signals contained in the input audio signal S1 are easily eliminated by the AEC unit 130.
The speech processing apparatus 100 receives one or more microphone signals from the one or more microphones 110. The components contained in the pre-processing unit 115 vary according to the type and the number of microphones 110. For example, if there is only one microphone 110 that outputs an analog audio signal, the pre-processing unit 115 is an analog to digital converter (ADC) configured to convert the analog audio signal into a digital audio signal S1; if there are multiple microphones 110 that output multiple analog audio signals, the pre-processing unit 115 includes multiple ADCs (coupled to the multiple microphones 110) and one average unit, and the average unit is configured to average the output signals from the ADCs to generate the digital audio signal S1; if there are multiple microphones 110 that output multiple digital audio signals, the pre-processing unit 115 includes one average unit configured to average the multiple digital audio signals from the multiple microphones 110 to generate one digital audio signal S1; if there is only one microphone 110 that outputs the digital audio signal S1, the pre-processing unit 115 is omitted. Thus, the pre-processing unit 115 is optional and represented by dashed lines in
The AEC unit 130, the multiplier 170 and the pre-processing unit 115 may be implemented by software, hardware, firmware, or a combination thereof. An example of a pure hardware solution would be a field programmable gate array (FPGA) design or an application specific integrated circuit (ASIC) design. The AEC unit 130 is configured to cancel acoustic echoes in the digital audio signal S1 by any well-known AEC algorithm or architecture to generate an echo-cancelled signal S2. In one embodiment, the AEC unit 130 includes a subtracter 131 only. In this embodiment, the subtracter 131 subtracts the downlink audio signal RX from the digital audio signal S1 to generate the echo-cancelled signal S2.
In an alternative embodiment, the AEC unit 130 includes a subtracter 131 and an adaptive filter 132. In practice, the speaker 120 can originate one or more echo signals, and each echo signal may traverse a direct or reflected path from the speaker 120 to the microphones 110; moreover, the higher the volume of the speaker 120, the larger the magnitudes of the echo signals. To cancel the echo signals in the microphone channel, the adaptive filter 132 is placed in parallel to the echo paths between the downlink audio signal RX and the audio signal S1, with the downlink audio signal RX as a reference. The adaptive filter 132 adjusts its impulse response to model the echo paths, so that its output signal S5 is a replica of the echo signals, i.e., of the component of the audio signal S1 that is correlated with the downlink audio signal RX. Since the operations of the adaptive filter 132 are well known in the art, its detailed descriptions are omitted herein. The subtracter 131 subtracts the echo replica signal S5 from the digital audio signal S1 to generate an echo-cancelled signal S2. Because the adaptive filter 132 is optional, it is represented by dashed lines in
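As an illustration of the subtracter 131 and adaptive filter 132 working together, the following is a minimal sketch that assumes a normalized least-mean-squares (NLMS) update. The text only requires "any well-known AEC algorithm", so NLMS is one possible choice rather than the disclosed method, and the function name and the parameters `taps`, `mu` and `eps` are hypothetical.

```python
def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Return the echo-cancelled signal S2 = S1 - S5 using an NLMS adaptive filter.

    mic : samples of the digital audio signal S1
    ref : samples of the downlink audio signal RX (the reference)
    taps, mu, eps : hypothetical tuning parameters, not taken from the text
    """
    w = [0.0] * taps          # filter coefficients (estimate of the echo path)
    history = [0.0] * taps    # most recent reference samples, newest first
    out = []
    for s1, rx in zip(mic, ref):
        history = [rx] + history[:-1]
        # Echo replica S5: reference filtered by the current path estimate.
        s5 = sum(wi * xi for wi, xi in zip(w, history))
        # Subtracter 131: remove the replica from the microphone signal.
        s2 = s1 - s5
        # NLMS update: step size normalized by the reference energy.
        energy = sum(xi * xi for xi in history) + eps
        step = mu * s2 / energy
        w = [wi + step * xi for wi, xi in zip(w, history)]
        out.append(s2)
    return out
```

With `taps=1` and a fixed coefficient of 1, the structure reduces to the subtracter-only embodiment, in which RX itself is subtracted from S1.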
The NR unit 140 is configured to reduce noise in the echo-cancelled signal S2 by any well-known NR algorithms, such as traditional NR algorithms or artificial intelligence NR (AI-NR). For traditional NR algorithms, noise can be reduced in either time domain or frequency domain: (1) time domain: an infinite impulse response (IIR) filtering operation is performed over the echo-cancelled signal S2 in time domain to obtain a noise-reduced signal S3; (2) frequency domain: noise contained in multiple frequency bands in the echo-cancelled signal S2 is filtered out in frequency domain to obtain the noise-reduced signal S3. For AI-NR, a machine learning model (implemented using a recurrent neural network or a convolutional neural network) is trained to classify each of multiple frequency bands contained in the echo-cancelled signal S2 as “speech-dominant” or “noise-dominant (or non-speech)”, and then the noise in the frequency bands classified as “noise-dominant (or non-speech)” in the echo-cancelled signal S2 is eliminated in frequency domain to obtain a noise-reduced signal S3.
Next, the power estimation unit 150 respectively calculates/estimates a power level Pt per M frames of the noise-reduced signal S3 and a power level Pr per M frames of the downlink audio signal RX according to the following power equation:
where x(n) denotes a discrete audio signal and N denotes the number of samples in M frames of the discrete audio signal x(n). N is a power of two, such as 128, 256 or 1024. M is a pre-defined integer and the M frames of the noise-reduced signal S3 correspond to the M frames of the downlink audio signal RX. Correspondingly, the decision unit 160 performs the decision method in
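The power equation referred to above is not reproduced in this text. Assuming it is the conventional mean-square estimate, P = (1/N) Σ x(n)² taken over the N samples in M frames, which is consistent with the definitions of x(n) and N given above, the computation performed by the power estimation unit 150 might be sketched as follows. The function names and the default frame length of 256 are illustrative choices.

```python
def power_level(samples):
    """Mean-square power of one block of N samples (assumed form of the power equation)."""
    n = len(samples)
    if n == 0:
        return 0.0
    return sum(x * x for x in samples) / n


def frame_powers(signal, frame_len=256, m=1):
    """Yield one power level per M frames; frame_len and m are illustrative values."""
    block = frame_len * m   # N, the number of samples in M frames
    for start in range(0, len(signal) - block + 1, block):
        yield power_level(signal[start:start + block])
```

Applying `frame_powers` to the noise-reduced signal S3 and to the downlink audio signal RX with the same `frame_len` and `m` yields corresponding Pt and Pr values for the same M frames.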
Step S201: Set the gain value g of the multiplier 170 to a default value, such as 1, upon system initialization. Please note that step S201 is performed only once (i.e., upon system initialization), but steps S202-S210 are performed once per M frames (M=1) of the signals S3 and RX.
Step S202: Respectively receive two power levels Pt and Pr from the power estimation unit 150 per M frames (M=1) of the signals S3 and RX.
Step S204: Determine whether the power level Pt is greater than or equal to a first threshold value TH1. If YES, the flow goes to step S206; otherwise, the flow goes to step S208.
Step S206: Determine whether the power level Pr is greater than or equal to a second threshold value TH2. If YES, the flow goes to step S210; otherwise, the flow returns to step S202. Please note that the TH1 and TH2 values are independent and vary according to the mechanical defects of the communication device 10, such as the relative distance of the microphone 110 to the speaker 120, or the degree of gasket leaks. The condition “Pt>=TH1 and Pr<TH2” indicates that the near-end talker is speaking and the far-end talker is silent; the noise-reduced signal S3 is transmitted as the uplink audio signal TX to the far-end talker. Since the speaker 120 is silent, no acoustic echoes are produced. Accordingly, there is no need to modify the gain value g.
Step S208: Mute the uplink audio signal TX. The condition “Pt<TH1” indicates that the power level Pt for the near-end talker is so small that it is hard for the far-end talker to hear the near-end talker's voice. In this scenario, the decision unit 160 regards the near-end talker as “not speaking” and directly mutes the uplink audio signal TX by setting the values of the uplink audio signal TX to zero. The advantage of transmitting the muted uplink signal TX is that it prevents the far-end talker from hearing the echo of his own voice as he speaks.
Step S209: Reset the gain value g to the default value 1 as set in step S201. Then, the flow goes back to step S202.
Step S210: Reduce the gain value g. The condition “Pt>=TH1 and Pr>=TH2” is related to a double-talk case. The term “double-talk” refers to both the near-end and the far-end talkers speaking concurrently. The double-talk case includes the following two scenarios: scenario A: Pr>Pt>=TH1; and, scenario B: Pt>=TH1 and Pr>=TH2. Scenario A represents the case in which the far-end talker speaks louder than the near-end talker. Scenario B represents the case in which the far-end talker does not necessarily speak louder than the near-end talker, but the power level Pr is at least TH2. In either scenario, the volume of the speaker 120 would be so high that the microphones 110 can easily pick up the speaker's output and create acoustic echoes. Thus, the gain value needs to be reduced to reduce the magnitudes of the echo signals received by the microphones 110. There are two approaches for reducing the gain value each time the condition “Pt>=TH1 and Pr>=TH2” is satisfied. Approach 1: a previous gain value gP of the last round is multiplied by a constant number f1 to obtain a current gain value gC, i.e., gC=gP×f1, where 0<f1<1. For example, f1=0.5. Approach 2: set the current gain value gC according to the proportion of Pr to Prmax, i.e., gC=Pr/Prmax, where Prmax denotes the maximum power level per M frames of the downlink audio signal RX. For example, if Prmax=100 and Pr=80, then the current gain value gC=80/100=0.8. Theoretically, because Approach 2 sets the current gain value gC according to the proportion of Pr to Prmax, the transition of the speaker volume is smoother and the voice quality is better in comparison with Approach 1. After the gain value is reduced, the residual echoes picked up by the microphones 110 and contained in the digital audio signal S1 are also reduced. Afterward, it is simpler for the AEC unit 130 to eliminate the residual echoes in the digital audio signal S1, thus improving the quality and intelligibility of the uplink signal TX.
Then, the flow returns to step S202 and runs through steps S202-S210 again for the following M frames of the signals S3 and RX.
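The decision flow of steps S201-S210 can be summarized in a short sketch. The class and field names below are hypothetical, and Approach 2 is selected here simply by supplying a `pr_max` value, which is an illustrative convention rather than something stated in the text.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_GAIN = 1.0


@dataclass
class DecisionUnit:
    th1: float                      # first threshold TH1 (device-specific)
    th2: float                      # second threshold TH2 (device-specific)
    f1: float = 0.5                 # Approach 1 factor, 0 < f1 < 1
    pr_max: Optional[float] = None  # if set, use Approach 2 (g = Pr / Prmax)
    gain: float = DEFAULT_GAIN      # step S201: default gain at initialization

    def step(self, pt, pr):
        """Run steps S202-S210 once for one pair (Pt, Pr); return (mute_tx, gain)."""
        if pt < self.th1:
            # S204 -> S208: near-end regarded as not speaking, mute TX.
            self.gain = DEFAULT_GAIN          # S209: reset the gain
            return True, self.gain
        if pr < self.th2:
            # S206: near-end only, speaker silent, gain left unchanged.
            return False, self.gain
        # S210: double-talk, reduce the gain.
        if self.pr_max:
            self.gain = pr / self.pr_max      # Approach 2
        else:
            self.gain *= self.f1              # Approach 1
        return False, self.gain
```

For instance, with `DecisionUnit(th1=10, th2=20)`, two consecutive double-talk rounds under Approach 1 take the gain from 1 to 0.5 and then to 0.25, and a subsequent round with Pt below TH1 mutes TX and restores the gain to 1.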
Finally, the multiplier 170 is configured to multiply sample values of the following M frames of the downlink audio signal RX by the current gain value gC to produce a gained audio signal S4. The speaker 120 then plays the gained audio signal S4.
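Because the gain decided from one block of M frames is applied to the following block, the multiplier 170 effectively runs one block behind the decision unit 160. A minimal sketch of that timing is given below; the function name and the list-of-blocks representation are assumptions made for illustration.

```python
def apply_gain(rx_blocks, gains, default_gain=1.0):
    """Scale each block of RX by the gain decided on the previous block.

    rx_blocks : list of sample lists, one per M frames of the downlink signal RX
    gains     : gains[k] is the gain gC decided after analysing block k
    Returns the blocks of the gained audio signal S4.
    """
    s4_blocks = []
    g = default_gain   # no decision exists before the first block
    for block, next_gain in zip(rx_blocks, gains):
        s4_blocks.append([g * x for x in block])
        g = next_gain  # takes effect on the following block
    return s4_blocks
```

In this sketch the power level Pr is still measured on the unscaled signal RX, as described above, and only the signal sent to the speaker 120 is scaled.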
In summary, in a case where strong acoustic echoes are caused by mechanical defects or mechanical designs of the communication device 10/30 that are unlikely to be modified, the speech processing apparatus 100/300 of the invention can significantly reduce acoustic echoes for the far-end talker and improve the quality and intelligibility of the uplink audio signal TX.
In an embodiment, the speech processing apparatus 100/300 (excluding the ADC(s) in the pre-processing unit 115) is implemented with a general-purpose processor and a program memory. The program memory stores a processor-executable program. When the processor-executable program is executed by the general-purpose processor, the general-purpose processor is configured to function as: the pre-processing unit 115 (excluding the ADC(s)), the AEC unit 130, the NR units 140-141, the power estimation unit 150, the decision unit 160 and the multiplier 170.
The above embodiments and functional operations can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. The method and logic flow described in
While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that this invention should not be limited to the specific construction and arrangement shown and described, since various other modifications may occur to those ordinarily skilled in the art.
This application claims priority under 35 USC 119(e) to U.S. provisional application No. 63/186,072, filed on May 8, 2021, the content of which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
63186072 | May 2021 | US