The invention relates to hearing devices, and more particularly, to a hearing device with an end-to-end neural network that reduces the comb-filter effect by performing active noise cancellation and audio signal processing.
It is hard for people to adjust to hearing aids. The fact is that no matter how good a hearing aid is, it always sounds like a hearing aid. A significant cause of this is the “comb-filter effect,” which arises because the digital signal processing in the hearing aid delays the amplified sound relative to the leak-path/direct sound that enters the ear through venting in the ear tip and any leakage around it. The delay is the time that the hearing aid takes to (1) sample and convert an analog audio signal into a digital audio signal; (2) perform digital signal processing; and (3) convert the processed signal back into an analog audio signal to be delivered to the hearing aid speaker. Prior experiments showed that even a delay of around 2 milliseconds (ms) results in a clear comb-filter effect, while an ultralow delay below 0.5 ms does not. This delay is perceived as echoes or reverberation by the person wearing the hearing aid while listening to environmental sounds such as speech and background noise. The comb-filter effect significantly degrades the sound quality.
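The delay-induced comb filtering described above can be checked numerically: summing the direct sound with an equal-level copy delayed by τ yields the magnitude response |1 + e^(−j2πfτ)|, with notches at odd multiples of 1/(2τ). The following sketch is purely illustrative and is not part of the claimed device:

```python
import numpy as np

def comb_response(delay_ms, freqs_hz):
    """Magnitude response of direct sound summed with an equal-level
    copy delayed by delay_ms: |1 + exp(-j*2*pi*f*tau)|."""
    tau = delay_ms * 1e-3
    return np.abs(1.0 + np.exp(-2j * np.pi * freqs_hz * tau))

freqs = np.array([250.0, 500.0, 1000.0])
# With a 2 ms delay, the first notch falls at 1/(2*tau) = 250 Hz,
# squarely inside the speech band.
resp = comb_response(2.0, freqs)
print(resp)  # ~0 (notch) at 250 Hz, ~2 (peak) at 500 Hz and 1000 Hz
```

With a 0.5 ms delay, by contrast, the first notch moves up to 1 kHz, which is consistent with the observation that sub-0.5 ms delays do not produce an audible comb-filter effect.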
As is well known in the art, the sound through the leak path (i.e., the direct sound) can be removed by introducing Active Noise Cancellation (ANC). After the direct sound is cancelled, the comb-filter effect is mitigated. US Pub. No. 2020/0221236A1 discloses a hearing device with an additional ANC circuit for cancelling the sound through the leak path. Theoretically, the ANC circuit may operate in the time domain or the frequency domain. Normally, the ANC circuit in a hearing aid includes one or more time-domain filters because the signal processing delay of the ANC circuit is typically required to be less than 50 μs. For an ANC circuit operating in the frequency domain, the short-time Fourier Transform (STFT) and inverse STFT processes contribute signal processing delays ranging from 5 to 50 milliseconds (ms), a figure that already includes the processing of the ANC circuit itself. However, most state-of-the-art audio algorithms manipulate audio signals in the frequency domain for advanced audio signal processing.
What is needed is a hearing device that integrates time-domain and frequency-domain audio signal processing, reduces the comb-filter effect, performs ANC and advanced audio signal processing, and improves audio quality.
In view of the above-mentioned problems, an object of the invention is to provide a hearing device capable of integrating time-domain and frequency-domain audio signal processing and improving audio quality.
One embodiment of the invention provides a hearing device. The hearing device comprises a main microphone, M auxiliary microphones, a transform circuit, at least one processor, at least one storage media and a post-processing circuit. The main microphone and the M auxiliary microphones respectively generate a main audio signal and M auxiliary audio signals. The transform circuit respectively transforms multiple first sample values in current frames of the main audio signal and the M auxiliary audio signals into a main spectral representation and M auxiliary spectral representations. The at least one storage media includes instructions operable to be executed by the at least one processor to perform a set of operations comprising: performing active noise cancellation (ANC) operations over the first sample values using an end-to-end neural network to generate multiple second sample values; and, performing audio signal processing operations over the main spectral representation and the M auxiliary spectral representations using the end-to-end neural network to generate a compensation mask. The post-processing circuit modifies the main spectral representation with the compensation mask to generate a compensated spectral representation, and generates an output audio signal according to the second sample values and the compensated spectral representation, where M>=0.
Another embodiment of the invention provides an audio processing method applicable to a hearing device. The audio processing method comprises: providing a main audio signal by a main microphone and M auxiliary audio signals by M auxiliary microphones, where M>=0; respectively transforming first sample values in current frames of the main audio signal and the M auxiliary audio signals into a main spectral representation and M auxiliary spectral representations; performing active noise cancellation (ANC) operations over the first sample values using an end-to-end neural network to obtain multiple second sample values; performing audio signal processing operations over the main spectral representation and the M auxiliary spectral representations using the end-to-end neural network to obtain a compensation mask; modifying the main spectral representation with the compensation mask to obtain a compensated spectral representation; and, obtaining an output audio signal according to the second sample values and the compensated spectral representation.
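Under the assumption of a single main microphone (M = 0) and with a trivial stand-in for the end-to-end neural network, the claimed method can be sketched as follows; `process_frame`, `IdentityNet` and the frame size are hypothetical names chosen for illustration, not the disclosed implementation:

```python
import numpy as np

FRAME = 256  # hypothetical frame/FFT size N

def process_frame(frame, network):
    """One pass of the claimed method for a single main-microphone
    frame, with `network` standing in for the end-to-end neural net."""
    # Transform the first sample values into a main spectral representation.
    spectrum = np.fft.rfft(frame)
    # ANC path: time-domain sample values in, second sample values out.
    anc_samples = network.anc(frame)
    # Audio-processing path: spectral representation in, compensation mask out.
    mask = network.mask(spectrum)
    # Post-processing: apply the compensation mask, return to the time
    # domain, and combine with the ANC output.
    compensated = np.fft.irfft(mask * spectrum, n=FRAME)
    return compensated + anc_samples

class IdentityNet:
    """Trivial stand-in network: no anti-noise, unity compensation mask."""
    def anc(self, frame):
        return np.zeros_like(frame)
    def mask(self, spectrum):
        return np.ones_like(spectrum.real)

frame_in = np.random.randn(FRAME)
out = process_frame(frame_in, IdentityNet())
# With the trivial network, the frame passes through unchanged.
```

A trained network would instead emit anti-noise samples in `anc` and per-band gains in `mask`, but the data flow between the two domains is as shown.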
Further scope of the applicability of the present invention will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
The present invention will become more fully understood from the detailed description given hereinbelow and the accompanying drawings which are given by way of illustration only, and thus are not limitative of the present invention, and wherein:
As used herein and in the claims, the term “and/or” includes any and all combinations of one or more of the associated listed items. The use of the terms “a” and “an” and “the” and similar referents in the context of describing the invention are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. Throughout the specification, the same components with the same function are designated with the same reference numerals.
A feature of the invention is to use an end-to-end neural network to simultaneously perform the ANC function and advanced audio signal processing, e.g., noise suppression, acoustic feedback cancellation (AFC), sound amplification and so on. Another feature of the invention is that the end-to-end neural network receives a time-domain audio signal and a frequency-domain audio signal for each microphone so as to gain the benefits of both time-domain signal processing (e.g., extremely low system latency) and frequency-domain signal processing (e.g., better frequency analysis). In comparison with conventional ANC technology, which is most effective on lower frequencies of sound, e.g., between 50 and 1000 Hz, the end-to-end neural network of the invention can reduce both high-frequency noise and low-frequency noise.
A main microphone 11, located outside the ear, is used to collect ambient sound to generate a main audio signal au-1. If Q>1, at least one auxiliary microphone 12˜1Q generates at least one auxiliary audio signal au-2˜au-Q. The pre-processing unit 120 is configured to receive Q audio signals au-1˜au-Q and generate audio data of current frames i of Q time-domain digital audio signals s1[n]˜sQ[n] and Q current spectral representations F1(i)˜FQ(i) corresponding to the audio data of the current frames i of the time-domain digital audio signals s1[n]˜sQ[n], where n denotes the discrete time index and i denotes the frame index of the time-domain digital audio signals s1[n]˜sQ[n]. The end-to-end neural network 130 receives input parameters, the Q current spectral representations F1(i)˜FQ(i) and audio data of the current frames i of the Q time-domain signals s1[n]˜sQ[n], performs ANC and AFC functions, noise suppression and sound amplification to generate a frequency-domain compensation mask stream G1(i)˜GN(i) and audio data of the current frame i of a time-domain digital data stream u[n]. The post-processing unit 150 receives the frequency-domain compensation mask stream G1(i)˜GN(i) and the audio data of the current frame i of the time-domain data stream u[n] to generate audio data of the current frame i of a time-domain digital audio signal y[n], where N denotes the Fast Fourier transform (FFT) size. Finally, the output circuit 160 converts the digital audio signal y[n] into a sound pressure signal in an ear canal of the user. The output circuit 160 includes a digital-to-analog converter (DAC) 161, an amplifier 162 and a loudspeaker 163.
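The framing performed by the pre-processing unit 120 for one microphone can be pictured as below; the frame length, hop size and Hann window are assumed values, since the disclosure does not fix them:

```python
import numpy as np

def current_spectral_representation(s, i, frame_len=256, hop=128):
    """Return audio data of current frame i of time-domain signal s[n]
    and its spectral representation F(i), as the pre-processing unit
    produces per microphone. Frame/hop sizes and the Hann window are
    assumed choices for illustration."""
    frame = s[i * hop : i * hop + frame_len]
    F_i = np.fft.rfft(frame * np.hanning(frame_len))
    return frame, F_i

# One second of a 440 Hz tone at an assumed 16 kHz sampling rate.
s1 = np.sin(2 * np.pi * 440 / 16000 * np.arange(16000))
frame, F1 = current_spectral_representation(s1, i=3)
# The spectral peak lands near bin 440/16000*256 = 7.04.
```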
The pre-processing unit 120, the end-to-end neural network 130 and the post-processing unit 150 may be implemented by software, hardware, firmware, or a combination thereof. In one embodiment, the pre-processing unit 120, the end-to-end neural network 130 and the post-processing unit 150 are implemented by at least one processor and at least one storage media (not shown). The at least one storage media stores instructions/program codes operable to be executed by the at least one processor to cause the processor to function as: the pre-processing unit 120, the end-to-end neural network 130 and the post-processing unit 150. In an alternative embodiment, only the end-to-end neural network 130 is implemented by at least one processor and at least one storage media (not shown). The at least one storage media stores instructions/program codes operable to be executed by the at least one processor to cause the at least one processor to function as: the end-to-end neural network 130.
The end-to-end neural network 130 may be implemented by a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a time delay neural network (TDNN) or any combination thereof. Various machine learning techniques associated with supervised learning may be used to train a model of the end-to-end neural network 130 (hereinafter called “model 130” for short). Example supervised learning techniques to train the end-to-end neural network 130 include, without limitation, stochastic gradient descent (SGD). In supervised learning, a function ƒ (i.e., the model 130) is created by using four sets of labeled training examples (described below), each of which consists of an input feature vector and a labeled output. The end-to-end neural network 130 is configured to use the four sets of labeled training examples to learn or estimate the function ƒ (i.e., the model 130), and then to update model weights using the backpropagation algorithm in combination with a cost function. Backpropagation iteratively computes the gradient of the cost function with respect to each weight and bias, then updates the weights and biases in the direction opposite the gradient, to find a local minimum. The goal of learning in the end-to-end neural network 130 is to minimize the cost function given the four sets of labeled training examples.
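The training loop just described (backpropagation of a cost-function gradient, with SGD-style weight updates) can be sketched on a toy linear model; this is a generic illustration of the technique, not the disclosed network:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy labeled training examples: input feature vectors X, labeled outputs y.
X = rng.standard_normal((64, 4))
true_w = np.array([0.5, -1.0, 2.0, 0.0])
y = X @ true_w

w, b = np.zeros(4), 0.0
lr = 0.1
for epoch in range(200):
    # Forward pass and mean-squared-error cost residual.
    err = (X @ w + b) - y
    # Backpropagation: gradient of the cost w.r.t. each weight and bias.
    grad_w = 2 * X.T @ err / len(X)
    grad_b = 2 * err.mean()
    # Update opposite the gradient direction toward a (local) minimum.
    w -= lr * grad_w
    b -= lr * grad_b

print(np.round(w, 2))  # approaches true_w = [0.5, -1.0, 2.0, 0.0]
```

The actual model 130 has many layers and both frequency-domain and time-domain outputs, but each weight is updated by this same gradient-descent rule.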
According to the input parameters, the end-to-end neural network 130 receives the Q current spectral representations F1(i)˜FQ(i) and audio data of the current frames i of Q time-domain input streams s1[n]˜sQ[n] in parallel, performs the ANC function and advanced audio signal processing, and generates one frequency-domain compensation mask stream (including N mask values G1(i)˜GN(i)) corresponding to N frequency bands and audio data of the current frame i of one time-domain output sample stream u[n]. Here, the advanced audio signal processing includes, without limitation, noise suppression, AFC, sound amplification, alarm-preserving, environmental classification, direction of arrival (DOA) and beamforming, speech separation and wearing detection. For purposes of clarity and ease of description, the following embodiments are described with the advanced audio signal processing only including noise suppression, AFC and sound amplification. However, it should be understood that the embodiments of the end-to-end neural network 130 are not so limited, but are generally applicable to other types of audio signal processing, such as environmental classification, direction of arrival (DOA) and beamforming, speech separation and wearing detection.
For the sound amplification function, the input parameters for the end-to-end neural network 130 include, without limitation, magnitude gains, a maximum output power value of the signal z[n] (i.e., the output of inverse STFT 154) and a set of N modification gains g1˜gN corresponding to the N mask values G1(i)˜GN(i), where the N modification gains g1˜gN are used to modify the waveform of the N mask values G1(i)˜GN(i). For the noise suppression, AFC and ANC functions, the input parameters for the end-to-end neural network 130 include, without limitation, the level or strength of suppression. For the noise suppression function, the input data for a first set of labeled training examples are constructed artificially by adding various noise to clean speech data, and the ground truth (or labeled output) for each example in the first set of labeled training examples requires a frequency-domain compensation mask stream (including N mask values G1(i)˜GN(i)) for corresponding clean speech data. For the sound amplification function, the input data for a second set of labeled training examples are weak speech data, and the ground truth for each example in the second set of labeled training examples requires a frequency-domain compensation mask stream (including N mask values G1(i)˜GN(i)) for corresponding amplified speech data based on corresponding input parameters (e.g., including a corresponding magnitude gain, a corresponding maximum output power value of the signal z[n] and a corresponding set of N modification gains g1˜gN). For the AFC function, the input data for a third set of labeled training examples are constructed artificially by adding various feedback interference data to clean speech data, and the ground truth for each example in the third set of labeled training examples requires a frequency-domain compensation mask stream (including N mask values G1(i)˜GN(i)) for corresponding clean speech data.
For the ANC function, the input data for a fourth set of labeled training examples are constructed artificially by adding the direct sound data to clean speech data, and the ground truth for each example in the fourth set of labeled training examples requires N sample values of the time-domain denoised audio data u[n] for corresponding clean speech data. For speech data, a wide range of people's speech is collected, such as people of different genders, different ages, different races and different language families. For noise data, various sources of noise are used, including markets, computer fans, crowds, cars, airplanes, construction, etc. For the feedback interference data, interference data at various coupling levels between the loudspeaker 163 and the microphones 11˜1Q are collected. For the direct sound data, the sound from the inputs of the hearing devices to the users' eardrums is collected across a wide range of users. During the process of artificially constructing the input data, each of the noise data, the feedback interference data and the direct sound data is mixed at different levels with the clean speech data to produce a wide range of SNRs for the four sets of labeled training examples.
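Mixing the noise, feedback interference or direct sound data with clean speech at different levels to produce a target SNR reduces to scaling the interferer by a power ratio; `mix_at_snr` is a hypothetical helper illustrating the arithmetic:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals
    snr_db, then mix; used to span a wide range of SNRs when
    constructing labeled training examples."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(1)
speech = np.sin(2 * np.pi * 0.01 * np.arange(8000))  # stand-in "clean speech"
noisy = mix_at_snr(speech, rng.standard_normal(8000), snr_db=5.0)
```

Sweeping `snr_db` over, e.g., -5 dB to 20 dB yields the wide range of mixing levels described above.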
In a training phase, the TDNN 131 and the FD-LSTM network 132 are jointly trained with the first, the second and the third sets of labeled training examples, each labeled as a corresponding frequency-domain compensation mask stream (including N mask values G1(i)˜GN(i)); the TDNN 131 and the TD-LSTM network 133 are jointly trained with the fourth set of labeled training examples, each labeled as N corresponding time-domain audio sample values. Once trained, the TDNN 131 and the FD-LSTM network 132 can process new unlabeled audio data, for example audio feature vectors, to generate N corresponding frequency-domain mask values G1(i)˜GN(i) for the N frequency bands, while the TDNN 131 and the TD-LSTM network 133 can process new unlabeled audio data, for example audio feature vectors, to generate N corresponding time-domain audio sample values for the current frame i of the signal u[n]. In one embodiment, the N mask values G1(i)˜GN(i) are N band gains (bounded between Th1 and Th2, where Th1<Th2) corresponding to the N frequency bands in the current spectral representations F1(i)˜FQ(i). Thus, if any band gain value Gk(i) gets close to Th1, it indicates the signal on the corresponding frequency band k is noise-dominant; if any band gain value Gk(i) gets close to Th2, it indicates the signal on the corresponding frequency band k is speech-dominant. When the end-to-end neural network 130 is trained, the higher the SNR value in a frequency band k is, the higher the band gain value Gk(i) in the frequency-domain compensation mask stream becomes.
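The bounded band gains Gk(i) behave like an ideal-ratio-style mask clipped to [Th1, Th2]; the ratio-mask formula below is an assumed illustration of such a training target, not the network's actual output:

```python
import numpy as np

def bounded_band_gains(speech_power, noise_power, th1=0.0, th2=1.0):
    """Per-band gain Gk(i): near th2 where speech dominates the band,
    near th1 where noise dominates (ideal-ratio-mask-style target,
    clipped to the bounds [th1, th2])."""
    irm = speech_power / (speech_power + noise_power)
    return np.clip(irm, th1, th2)

speech_p = np.array([9.0, 0.1])  # band 0 speech-dominant, band 1 not
noise_p = np.array([1.0, 9.9])
gains = bounded_band_gains(speech_p, noise_p)
# band 0 -> 0.9 (high SNR, gain near th2); band 1 -> 0.01 (gain near th1)
```

This matches the stated behavior: the higher the per-band SNR, the higher the band gain in the compensation mask stream.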
In brief, the low latency of the end-to-end neural network 130 between the time-domain input signals s1[n]˜sQ[n] and the responsive time-domain output signal u[n] fully satisfies the ANC requirement (i.e., less than 50 μs). In addition, the end-to-end neural network 130 manipulates the input current spectral representations F1(i)˜FQ(i) in the frequency domain to achieve the goals of noise suppression, AFC and sound amplification, thus greatly improving the audio quality. In this manner, the framework of the end-to-end neural network 130 integrates and exploits cross-domain audio features by leveraging audio signals in both the time domain and the frequency domain to improve hearing aid performance.
The above embodiments and functional operations can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. The operations and logic flows described in
While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that this invention should not be limited to the specific construction and arrangement shown and described, since various other modifications may occur to those ordinarily skilled in the art.
This application claims priority under 35 USC 119(e) to U.S. provisional application No. 63/171,592, filed on Apr. 7, 2021, the content of which is incorporated herein by reference in its entirety.
Number | Name | Date | Kind |
---|---|---|---|
8229127 | Jørgensen et al. | Jul 2012 | B2 |
10542354 | Tiefenau et al. | Jan 2020 | B2 |
10805740 | Snyder | Oct 2020 | B1 |
20060182295 | Dijkstra | Aug 2006 | A1 |
20070269066 | Derleth | Nov 2007 | A1 |
20140177857 | Kuster | Jun 2014 | A1 |
20140270290 | Cheung | Sep 2014 | A1 |
20200221236 | Jensen et al. | Jul 2020 | A1 |
20210125625 | Huang | Apr 2021 | A1 |
20220044696 | Kim | Feb 2022 | A1 |
Number | Date | Country |
---|---|---|
111584065 | Aug 2020 | CN |
111916101 | Nov 2020 | CN |
Entry |
---|
Wayne Staab, “Hearing Aid Delay,” https://hearinghealthmatters.org/waynesworld/2016/hearingaid-delay/, dated Jan. 19, 2016, 7 pages. |
Laura Winther Balling et al., “Reducing hearing aid delay for optimal sound quality: a new paradigm in processing”, https://www.hearingreview.com/hearingproducts/hearing-aids/bte/reducing-hearing-aid-delay-for-optimalsound-quality-a-new-paradigm-in-processing, dated Apr. 23, 2020, 11 pages. |
Erdogan et al., “Improved MVDR beamforming using single-channel mask prediction networks,” Mitsubishi Electric Research Laboratories, Sep. 2016, 6 pages. |
Zhang et al., “A Deep Learning Approach to Active Noise Control,” Interspeech 2020, Oct. 25-29, 2020, 5 pages. |
Number | Date | Country | |
---|---|---|---|
20220329953 A1 | Oct 2022 | US |
Number | Date | Country | |
---|---|---|---|
63171592 | Apr 2021 | US |