The present application relates generally to audio processing and, more specifically, to systems and methods for providing noise reduction that has consistency between speech-present periods and speech-absent periods (speech gaps).
The proliferation of smart phones, tablets, and other mobile devices has fundamentally changed the way people access information and communicate. People now make phone calls in diverse places such as crowded bars, busy city streets, and windy outdoors, where adverse acoustic conditions pose severe challenges to the quality of voice communication. Additionally, voice commands have become an important method for interaction with electronic devices in applications where users have to keep their eyes and hands on the primary task, such as, for example, driving. As electronic devices become increasingly compact, voice command may become the preferred method of interaction with electronic devices. However, despite recent advances in speech technology, recognizing voice in noisy conditions remains difficult. Therefore, mitigating the impact of noise is important to both the quality of voice communication and performance of voice recognition.
Headsets have been a natural extension of telephony terminals and music players as they provide hands-free convenience and privacy when used. Compared to other hands-free options, a headset represents an option in which microphones can be placed at locations near the user's mouth, with constrained geometry between the user's mouth and the microphones. This results in microphone signals that have better signal-to-noise ratios (SNRs) and are simpler to control when applying multi-microphone based noise reduction. However, when compared to traditional handset usage, headset microphones are relatively remote from the user's mouth. As a result, the headset does not provide the noise shielding effect provided by the user's hand and the bulk of the handset. As headsets have become smaller and lighter in recent years due to the demand for headsets to be subtle and out of the way, this problem has become even more challenging.
When a user wears a headset, the user's ear canals are naturally shielded from the outside acoustic environment. If a headset provides tight acoustic sealing to the ear canal, a microphone placed inside the ear canal (the internal microphone) would be acoustically isolated from the outside environment such that environmental noise would be significantly attenuated. Additionally, a microphone inside a sealed ear canal is free of wind-buffeting effects. The user's voice can be conducted through various tissues in the user's head to reach the ear canal, where the sound is trapped by the seal. A signal picked up by the internal microphone should thus have much higher SNR compared to the microphone outside of the user's ear canal (the external microphone).
Internal microphone signals are not free of issues, however. First of all, the body-conducted voice tends to have its high-frequency content severely attenuated and thus has much narrower effective bandwidth compared to voice conducted through air. Furthermore, when the body-conducted voice is sealed inside an ear canal, it forms standing waves inside the ear canal. As a result, the voice picked up by the internal microphone often sounds muffled and reverberant while lacking the natural timbre of the voice picked up by the external microphones. Moreover, effective bandwidth and standing-wave patterns vary significantly across different users and headset fitting conditions. Finally, if a loudspeaker is also located in the same ear canal, sounds made by the loudspeaker would also be picked up by the internal microphone. Even with acoustic echo cancellation (AEC), the close coupling between the loudspeaker and internal microphone often leads to severe voice distortion.
Other efforts have been attempted in the past to take advantage of the unique characteristics of the internal microphone signal for superior noise reduction performance. However, attaining consistent performance across different users and different usage conditions has remained challenging. It can be particularly challenging to provide robustness and consistency for noise reduction both when the user is speaking and in gaps when the user is not speaking (speech gaps). Some known methods attempt to address this problem; however, those methods may be more effective when the user's speech is present but less so when the user's speech is absent. What is needed is a method that overcomes the drawbacks of the known methods. More specifically, what is needed is a method that improves noise reduction performance during speech gaps such that it is not inconsistent with the noise reduction performance during speech periods.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Methods and systems for providing consistency in noise reduction during speech and non-speech periods are provided. An example method includes receiving a first audio signal and a second audio signal. The first audio signal includes at least a voice component. The second audio signal includes at least the voice component modified by at least a human tissue of a user. The voice component may be the speech of the user. The first and second audio signals include periods where the speech of the user is not present. The method can also include assigning a first weight to the first audio signal and a second weight to the second audio signal. The method also includes processing the first audio signal to obtain a first full-band power estimate. The method also includes processing the second audio signal to obtain a second full-band power estimate. For the periods when the user's speech is not present, the method includes adjusting, based at least partially on the first full-band power estimate and the second full-band power estimate, the first weight and the second weight. The method also includes blending, based on the first weight and the second weight, the first signal and the second signal to generate an enhanced voice signal.
In some embodiments, the first signal and the second signal are transformed into subband signals. In other embodiments, assigning the first weight and the second weight is performed per subband and based on SNR estimates for the subband. The first signal is processed to obtain a first SNR for the subband and the second signal is processed to obtain a second SNR for the subband. If the first SNR is larger than the second SNR, the first weight for the subband receives a larger value than the second weight for the subband. Otherwise, if the second SNR is larger than the first SNR, the second weight for the subband receives a larger value than the first weight for the subband. In some embodiments, the difference between the first weight and the second weight corresponds to the difference between the first SNR and the second SNR for the subband. However, this SNR-based method is more effective when the user's speech is present but less effective when the user's speech is absent. More specifically, when the user's speech is present, according to this example, selecting the signal with a higher SNR leads to the selection of the signal with lower noise. Because the noise in the ear canal tends to be 20-30 dB lower than the noise outside, there is typically a 20-30 dB noise reduction relative to the external microphone signal. However, when the user's speech is absent, in this example, the SNR is 0 at both the internal and external microphone signals. Deciding the weights based only on the SNRs, as in the SNR-based method, would lead to evenly split weights when the user's speech is absent in this example. As a result, only 3-6 dB of noise reduction is typically achieved relative to the external microphone signal when only the SNR-based method is used.
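The SNR-based weighting described above can be illustrated with a minimal sketch. The mapping from SNR to weight used here (weights proportional to the linear SNRs) and the function name `snr_weights` are assumptions made for illustration only; the text requires only that the signal with the larger SNR receive the larger weight.

```python
def snr_weights(snr_first, snr_second):
    """Return (first_weight, second_weight) for one subband.

    Hypothetical mapping: weights are proportional to the linear
    (not dB) SNR of each signal and sum to one. The source text does
    not specify a particular mapping.
    """
    if snr_first < 0.0 or snr_second < 0.0:
        raise ValueError("linear SNR values must be non-negative")
    total = snr_first + snr_second
    if total == 0.0:
        # Speech absent: both SNRs are zero, so the SNR alone cannot
        # favor either signal and the weights split evenly.
        return 0.5, 0.5
    first_weight = snr_first / total
    return first_weight, 1.0 - first_weight


# Speech present, second (internal) signal has the higher SNR.
print(snr_weights(1.0, 100.0))
# Speech gap: both SNRs are zero, giving an even split.
print(snr_weights(0.0, 0.0))
```

The second call shows the deficiency discussed above: with no speech, an SNR-only rule has no basis for preferring the quieter signal.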
To mitigate this deficiency of SNR-based mixing methods during speech-absent periods (speech gaps), the full-band noise power is used, in various embodiments, to decide the mixing weights during the speech gaps. Because there is no speech, lower full-band power means there is lower noise power. The method, according to various embodiments, selects the signals with lower full-band power in order to maintain the 20-30 dB noise reduction in speech gaps. In some embodiments, during the speech gaps, adjusting the first weight and the second weight includes determining a minimum value between the first full-band power estimate and the second full-band power estimate. When the minimum value corresponds to the first full-band power estimate, the first weight is increased and the second weight is decreased. When the minimum value corresponds to the second full-band power estimate, the second weight is increased and the first weight is decreased. In some embodiments, the weights are increased and decreased by applying a shift. In various embodiments, the shift is calculated based on a difference between the first full-band power estimate and the second full-band power estimate. The shift receives a larger value for a larger difference value. In certain embodiments, the shift is applied only after determining that the difference exceeds a pre-determined threshold. In other embodiments, a ratio of the first full-band power estimate to the second full-band power estimate is calculated. The shift is calculated based on the ratio. The shift receives a larger value the further the value of ratio is from 1.
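One possible form of the speech-gap weight adjustment is sketched below. The threshold of 6 dB, the linear growth of the shift with the power difference, the parameter names, and the assumption that the two weights sum to one are all illustrative choices and are not taken from the text, which states only that the shift grows with the difference and may be applied after a pre-determined threshold is exceeded.

```python
import math


def shift_weights_in_gap(first_weight, second_weight,
                         first_power, second_power,
                         threshold_db=6.0, db_for_full_shift=20.0):
    """Shift blending weights toward the signal with lower full-band power.

    Assumptions (hypothetical, for illustration): the weights sum to one,
    powers are linear and non-negative, and the shift grows linearly
    from 0 to 1 once the difference exceeds threshold_db.
    """
    eps = 1e-12  # avoids log of zero for silent inputs
    diff_db = 10.0 * math.log10((first_power + eps) / (second_power + eps))
    if abs(diff_db) <= threshold_db:
        return first_weight, second_weight  # difference too small to act on
    shift = min(1.0, (abs(diff_db) - threshold_db) / db_for_full_shift)
    if diff_db < 0.0:
        # First signal has the lower power: shrink the second weight.
        second_weight = second_weight * (1.0 - shift)
        first_weight = 1.0 - second_weight
    else:
        # Second signal has the lower power: shrink the first weight.
        first_weight = first_weight * (1.0 - shift)
        second_weight = 1.0 - first_weight
    return first_weight, second_weight


# Evenly split weights, second signal about 20 dB quieter than the first.
print(shift_weights_in_gap(0.5, 0.5, 1.0, 0.01))
```

A ratio-based variant would behave the same way, since the dB difference is a monotonic function of the power ratio and is zero when the ratio equals 1.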
In some embodiments, the second audio signal represents at least one sound captured by an internal microphone located inside an ear canal. In certain embodiments, the internal microphone is at least partially sealed for isolation from acoustic signals external to the ear canal.
In some embodiments, the first signal represents at least one sound captured by an external microphone located outside an ear canal. In some embodiments, prior to assigning the first weight and the second weight, the second signal is aligned with the first signal. In some embodiments, the assigning of the first weight and the second weight includes determining, based on the first signal, a first noise estimate and determining, based on the second signal, a second noise estimate. The first weight and the second weight can be calculated based on the first noise estimate and the second noise estimate.
In some embodiments, blending includes mixing the first signal and the second signal according to the first weight and the second weight. According to another example embodiment of the present disclosure, the steps of the method for providing consistency in noise reduction during speech and non-speech periods are stored on a non-transitory machine-readable medium comprising instructions, which, when implemented by one or more processors, perform the recited steps.
Other example embodiments of the disclosure and aspects will become apparent from the following description taken in conjunction with the following drawings.
Embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements.
The present technology provides systems and methods for audio processing which can overcome or substantially alleviate problems associated with ineffective noise reduction during speech-absent periods. Embodiments of the present technology can be practiced on any earpiece-based audio device that is configured to receive and/or provide audio such as, but not limited to, cellular phones, MP3 players, phone handsets and headsets. While some embodiments of the present technology are described in reference to operation of a cellular phone, the present technology can be practiced with any audio device.
According to an example embodiment, the method for audio processing includes receiving a first audio signal and a second audio signal. The first audio signal includes at least a voice component. The second audio signal includes the voice component modified by at least a human tissue of a user, the voice component being speech of the user. The first and second audio signals may include periods when the speech of the user is not present. The first and second audio signals may be transformed into subband signals. The example method includes assigning, per subband, a first weight to the first audio signal and a second weight to the second audio signal. The example method includes processing the first audio signal to obtain a first full-band power estimate. The example method includes processing the second audio signal to obtain a second full-band power estimate. For the periods when the user's speech is not present (speech gaps), the example method includes adjusting, based at least partially on the first full-band power estimate and the second full-band power estimate, the first weight and the second weight. The example method also includes blending, based on the adjusted first weight and the adjusted second weight, the first audio signal and the second audio signal to generate an enhanced voice signal.
Referring now to
In various embodiments, the microphones 106 and 108 are either analog or digital. In either case, the outputs from the microphones are converted into synchronized pulse-code modulation (PCM) format at a suitable sampling frequency and connected to the input port of the digital signal processor (DSP) 112. The signals xin and xex denote signals representing sounds captured by internal microphone 106 and external microphone 108, respectively.
The DSP 112 performs appropriate signal processing tasks to improve the quality of microphone signals xin and xex. The output of DSP 112, referred to as the send-out signal (sout), is transmitted to the desired destination, for example, to a network or host device 116 (see signal identified as sout uplink), through a radio or wired interface 114.
If a two-way voice communication is needed, a signal is received by the network or host device 116 from a suitable source (e.g., via the wireless or wired interface 114). This is referred to as the receive-in signal (rin) (identified as rin downlink at the network or host device 116). The receive-in signal can be coupled via the radio or wired interface 114 to the DSP 112 for processing. The resulting signal, referred to as the receive-out signal (rout), is converted into an analog signal through a digital-to-analog convertor (DAC) 110 and then connected to a loudspeaker 118 in order to be presented to the user. In some embodiments, the loudspeaker 118 is located in the same ear canal 104 as the internal microphone 106. In other embodiments, the loudspeaker 118 is located in the ear canal opposite the ear canal 104. In example of
In various embodiments, each ITE module 202 includes an internal microphone 106 and the loudspeaker 118 (shown in
In some embodiments, each of the BTE modules 204 and 206 includes at least one external microphone 108 (also shown in
In some embodiments, the seal of the ITE module(s) 202 is good enough to isolate acoustic waves coming from the outside acoustic environment 102. However, when speaking or singing, a user can hear the user's own voice reflected by ITE module(s) 202 back into the corresponding ear canal. The sound of the user's voice can be distorted because, while traveling through the skull of the user, high frequencies of the sound are substantially attenuated. Thus, the user can hear mostly the low frequencies of the voice. The user's voice cannot be heard by the user outside of the earpieces since the ITE module(s) 202 isolate external sound waves.
In the example in
By way of example and not limitation, suitable noise reduction methods are described by Ephraim and Malah, “Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, December 1984, and U.S. patent application Ser. No. 12/832,901 (now U.S. Pat. No. 8,473,287), entitled “Method for Jointly Optimizing Noise Reduction and Voice Quality in a Mono or Multi-Microphone System,” filed on Jul. 8, 2010, the disclosures of which are incorporated herein by reference for all purposes.
In various embodiments, the microphone signals xin and xex, with or without NR, and noise estimates (e.g., “external noise and SNR estimates” output from NT/NR module 302 and/or “internal noise and SNR estimates” output from NT/NR module 304) from the NT/NR modules 302 and 304 are sent to a microphone spectral alignment (MSA) module 306, where a spectral alignment filter is adaptively estimated and applied to the internal microphone signal xin. A primary purpose of MSA module 306, in the example in
The external microphone signal xex, the spectrally-aligned internal microphone signal xin,align, and the estimated noise levels at both microphones 106 and 108 are then sent to a microphone signal blending (MSB) module 308, where the two microphone signals are intelligently combined based on the current signal and noise conditions to form a single output with optimal voice quality. The functionalities of various embodiments of the NT/NR modules 302 and 304, MSA module 306, and MSB module 308 are discussed in more detail in U.S. patent application Ser. No. 14/853,947, entitled “Microphone Signal Fusion”, filed Sep. 14, 2015.
In some embodiments, external microphone signal xex and the spectrally-aligned internal microphone signal xin,align are blended using blending weights. In certain embodiments, the blending weights are determined in MSB module 308 based on the “external noise and SNR estimates” and the “internal noise and SNR estimates”.
For example, MSB module 308 operates in the frequency domain and determines the blending weights of the external microphone signal and the spectrally-aligned internal microphone signal in each frequency bin based on the SNR differential between the two signals in the bin. When a user's speech is present (for example, the user of headset 200 is speaking during a phone call) and the outside acoustic environment 102 becomes noisy, the SNR of the external microphone signal xex becomes lower as compared to the SNR of the internal microphone signal xin. Therefore, the blending weights are shifted toward the internal microphone signal xin. Because acoustic sealing tends to reduce the noise in the ear canal by 20-30 dB relative to the external environment, the shift can potentially provide 20-30 dB noise reduction relative to the external microphone signal. When the user's speech is absent, the SNRs of both internal and external microphone signals are effectively zero, so the blending weights become evenly distributed between the internal and external microphone signals. Therefore, if the outside acoustic environment is noisy, the resulting blended signal sout includes part of the noise. The blending of internal microphone signal xin and noisy external microphone signal xex may result in 3-6 dB noise reduction, which is generally insufficient for extraneous noise conditions.
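The gap between the 20-30 dB and 3-6 dB figures can be checked with a short calculation. The sketch below assumes, for illustration only, that the noise at the two microphones is uncorrelated and that the blending weights are applied to signal amplitudes; neither assumption is stated in the text, and the function name is hypothetical.

```python
import math


def blended_noise_reduction_db(external_weight, internal_weight, isolation_db):
    """Noise reduction of the blend relative to the external microphone.

    Assumes uncorrelated noise at the two microphones and amplitude-domain
    weights, so the blended noise power is the weighted sum of powers with
    squared weights. isolation_db is how far the ear-canal noise sits
    below the external noise.
    """
    external_power = 1.0  # reference level
    internal_power = external_power * 10.0 ** (-isolation_db / 10.0)
    blended_power = (external_weight ** 2) * external_power \
        + (internal_weight ** 2) * internal_power
    return -10.0 * math.log10(blended_power / external_power)


# Even split during a speech gap with 20 dB of isolation: roughly 6 dB.
print(round(blended_noise_reduction_db(0.5, 0.5, 20.0), 2))
# Weights fully biased to the internal microphone: the full 20 dB.
print(round(blended_noise_reduction_db(0.0, 1.0, 20.0), 2))
```

Under these assumptions the external noise dominates an evenly weighted blend, so the benefit of the acoustic seal is largely lost unless the weights are biased toward the internal microphone signal.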
In various embodiments, the method includes utilizing differences between the power estimates for the external and the internal microphone signals for locating gaps in the speech of the user of headset 200. In certain embodiments, for the gap intervals, the blending weight for the external microphone signal is decreased or set to zero and the blending weight for the internal microphone signal is increased or set to one before blending of the internal microphone and external microphone signals. Thus, during the gaps in the user's speech, the blending weights are biased to the internal microphone signal, according to various embodiments. As a result, the resulting blended signal contains a lesser amount of the external microphone signal and, therefore, a lesser amount of noise from the outside acoustic environment. When the user is speaking, the blending weights are determined based on “noise and SNR estimates” of internal and external microphone signals. Blending the signals during the user's speech improves the quality of the signal. For example, the blending of the signals can improve the quality of signals delivered to the far-end talker during a phone call or to an automatic speech recognition system by the radio or wired interface 114.
In various embodiments, DSP 112 includes a microphone power spread (MPS) module 310 as shown in
In various embodiments, the MPS module 310 generates microphone power spread (MPS) estimates for the internal microphone signal and external microphone signal. The MPS estimates are provided to MSB module 308. In certain embodiments, the MPS estimates are used for a supplemental control of microphone signal blending. In some embodiments, MSB module 308 applies a global bias toward the microphone signal with significantly lower full-band power, for example, by increasing the weights for that microphone signal and decreasing the weights for the other microphone signal (i.e., shifting the weights toward the microphone signal with significantly lower full-band power) before the two microphone signals are blended.
In block 404, method 400 can proceed with assigning a first weight to the first audio signal and a second weight to the second audio signal. In some embodiments, prior to assigning the first weight and the second weight, the first audio signal and the second audio signal are transformed into subband signals and, therefore, the weights may be assigned for each subband. In some embodiments, the first weight and the second weight are determined based on noise estimates in the first audio signal and the second audio signal. In certain embodiments, when the user's speech is present, the first weight and the second weight are assigned based on subband SNR estimates in the first audio signal and the second audio signal.
In block 406, method 400 can proceed with processing the first audio signal to obtain a first full-band power estimate. In block 408, method 400 can proceed with processing the second audio signal to obtain a second full-band power estimate. In block 410, during speech gaps when the user's speech is not present, the first weight and the second weight may be adjusted based, at least partially, on the first full-band power estimate and the second full-band power estimate. In some embodiments, if the first full-band power estimate is less than the second full-band power estimate, the weights are shifted toward the first audio signal, that is, the first weight is increased and the second weight is decreased. If the second full-band power estimate is less than the first full-band power estimate, the weights are shifted toward the second audio signal, that is, the second weight is increased and the first weight is decreased.
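The text does not specify how the full-band power estimates of blocks 406 and 408 are computed. One common approach, shown here purely as an assumption, is to take the mean squared value of each frame of time-domain samples and smooth it recursively across frames. The function name and the `smoothing` parameter are hypothetical.

```python
def update_full_band_power(previous_estimate, frame, smoothing=0.9):
    """Recursively smoothed full-band power estimate for one audio signal.

    Hypothetical estimator: frame power is the mean squared sample value,
    and the running estimate is a first-order recursive average.
    smoothing must lie in [0, 1); larger values react more slowly.
    """
    if not frame:
        raise ValueError("frame must contain at least one sample")
    if not 0.0 <= smoothing < 1.0:
        raise ValueError("smoothing must be in [0, 1)")
    frame_power = sum(sample * sample for sample in frame) / len(frame)
    return smoothing * previous_estimate + (1.0 - smoothing) * frame_power


# Track the estimate over two frames of a unit-amplitude square wave.
estimate = 0.0
for _ in range(2):
    estimate = update_full_band_power(estimate, [1.0, -1.0, 1.0, -1.0],
                                      smoothing=0.5)
print(estimate)
```

Running the same update on the first and second audio signals yields the two estimates that block 410 compares during speech gaps.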
In block 412, the first signal and the second signal can be used to generate an enhanced voice signal by being blended together based on the adjusted first weight and the adjusted second weight.
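The blending of block 412 can be sketched as a per-subband weighted sum. The representation of each signal as a list of complex subband values for one frame, and the name `blend_subbands`, are assumptions for illustration; the text states only that the two signals are blended based on the adjusted weights.

```python
def blend_subbands(first_subbands, second_subbands, weights):
    """Blend two subband-domain frames into one enhanced frame.

    first_subbands, second_subbands: complex subband values for one frame.
    weights: one (first_weight, second_weight) pair per subband, already
    adjusted for speech gaps where applicable.
    """
    if not (len(first_subbands) == len(second_subbands) == len(weights)):
        raise ValueError("inputs must have one entry per subband")
    return [
        first_weight * first_value + second_weight * second_value
        for first_value, second_value, (first_weight, second_weight)
        in zip(first_subbands, second_subbands, weights)
    ]


# Two subbands: the first taken entirely from the first signal,
# the second weighted mostly toward the second signal.
frame = blend_subbands([1 + 0j, 2 + 0j], [0j, 4 + 0j],
                       [(1.0, 0.0), (0.25, 0.75)])
print(frame)
```

In a speech gap with a noisy outside environment, the weight pairs would approach (0, 1) in every subband, so the output would follow the second (internal) signal.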
The components shown in
Mass data storage 530, which can be implemented with a magnetic disk drive, solid state drive, or an optical disk drive, is a non-volatile storage device for storing data and instructions for use by processor unit(s) 510. Mass data storage 530 stores the system software for implementing embodiments of the present disclosure for purposes of loading that software into main memory 520.
Portable storage device 540 operates in conjunction with a portable non-volatile storage medium, such as a flash drive, floppy disk, compact disk, digital video disc, or Universal Serial Bus (USB) storage device, to input and output data and code to and from the computer system 500 of
User input devices 560 can provide a portion of a user interface. User input devices 560 may include one or more microphones, an alphanumeric keypad, such as a keyboard, for inputting alphanumeric and other information, or a pointing device, such as a mouse, a trackball, stylus, or cursor direction keys. User input devices 560 can also include a touchscreen. Additionally, the computer system 500 as shown in
Graphics display system 570 includes a liquid crystal display (LCD) or other suitable display device. Graphics display system 570 is configurable to receive textual and graphical information and process the information for output to the display device.
Peripheral devices 580 may include any type of computer support device to add additional functionality to the computer system.
The components provided in the computer system 500 of
The processing for various embodiments may be implemented in software that is cloud-based. In some embodiments, the computer system 500 is implemented as a cloud-based computing environment, such as a virtual machine operating within a computing cloud. In other embodiments, the computer system 500 may itself include a cloud-based computing environment, where the functionalities of the computer system 500 are executed in a distributed fashion. Thus, the computer system 500, when configured as a computing cloud, may include pluralities of computing devices in various forms, as will be described in greater detail below.
In general, a cloud-based computing environment is a resource that typically combines the computational power of a large grouping of processors (such as within web servers) and/or that combines the storage capacity of a large grouping of computer memories or storage devices. Systems that provide cloud-based resources may be utilized exclusively by their owners or such systems may be accessible to outside users who deploy applications within the computing infrastructure to obtain the benefit of large computational or storage resources.
The cloud may be formed, for example, by a network of web servers that comprise a plurality of computing devices, such as the computer system 500, with each server (or at least a plurality thereof) providing processor and/or storage resources. These servers may manage workloads provided by multiple users (e.g., cloud resource customers or other users). Typically, each user places workload demands upon the cloud that vary in real-time, sometimes dramatically. The nature and extent of these variations typically depend on the type of business associated with the user.
The present technology is described above with reference to example embodiments. Therefore, other variations upon the example embodiments are intended to be covered by the present disclosure.