The present disclosure relates to an ear-worn device, such as a hearing aid.
Hearing aids help people who have difficulty hearing to hear better. Typically, hearing aids amplify received sound; some also attempt to enhance the incoming sound.
According to one aspect, an ear-worn device includes: two or more microphones configured to generate time-domain audio signals, each of the two or more microphones configured to generate one of the time-domain audio signals; processing circuitry including analog processing circuitry, digital processing circuitry, beamforming circuitry, and short-time Fourier transformation (STFT) circuitry, the processing circuitry configured to generate, from the time-domain audio signals, one or more frequency-domain non-beamformed audio signals and one or more frequency-domain beamformed signals; and enhancement circuitry including neural network circuitry configured to: receive multiple frequency-domain input audio signals originating from the one or more frequency-domain non-beamformed audio signals and the one or more frequency-domain beamformed signals; and implement a single neural network trained to generate, based on the multiple frequency-domain input audio signals, a noise-reduced and spatially-focused output audio signal or an output for generating a noise-reduced and spatially-focused output audio signal.
According to one aspect, an ear-worn device includes two or more microphones configured to generate time-domain audio signals, each of the two or more microphones configured to generate one of the time-domain audio signals; processing circuitry including analog processing circuitry, digital processing circuitry, beamforming circuitry, and short-time Fourier transformation (STFT) circuitry, the processing circuitry configured to generate, from the time-domain audio signals, either or both of one or more frequency-domain non-beamformed audio signals and one or more frequency-domain beamformed signals; and enhancement circuitry including neural network circuitry configured to: receive multiple frequency-domain input audio signals originating from either or both of the one or more frequency-domain non-beamformed audio signals and the one or more frequency-domain beamformed signals; and implement a single neural network trained to generate, based on the multiple frequency-domain input audio signals, a noise-reduced and spatially-focused output audio signal or an output for generating a noise-reduced and spatially-focused output audio signal.
According to one aspect, an ear-worn device includes two or more microphones configured to generate time-domain audio signals, each of the two or more microphones configured to generate one of the time-domain audio signals; processing circuitry including analog processing circuitry, digital processing circuitry, beamforming circuitry, and short-time Fourier transformation (STFT) circuitry, the processing circuitry configured to generate, from the time-domain audio signals, either or both of one or more frequency-domain non-beamformed audio signals and one or more frequency-domain beamformed signals; and enhancement circuitry including neural network circuitry configured to: receive multiple frequency-domain input audio signals originating from either or both of the one or more frequency-domain non-beamformed audio signals and the one or more frequency-domain beamformed signals; and implement a neural network trained to generate, based on the multiple frequency-domain input audio signals, a noise-reduced output audio signal or an output for generating a noise-reduced output audio signal.
According to one aspect, an ear-worn device includes two or more microphones configured to generate time-domain audio signals, each of the two or more microphones configured to generate one of the time-domain audio signals; processing circuitry including analog processing circuitry, digital processing circuitry, beamforming circuitry, and short-time Fourier transformation (STFT) circuitry, the processing circuitry configured to generate, from the time-domain audio signals, either or both of one or more frequency-domain non-beamformed audio signals and one or more frequency-domain beamformed signals; and enhancement circuitry including neural network circuitry configured to: receive multiple frequency-domain input audio signals originating from either or both of the one or more frequency-domain non-beamformed audio signals and the one or more frequency-domain beamformed signals; and implement a neural network trained to generate, based on the multiple frequency-domain input audio signals, a spatially-focused output audio signal or an output for generating a spatially-focused output audio signal.
According to one aspect, an ear-worn device includes two or more microphones configured to generate audio signals, each of the two or more microphones configured to generate one of the audio signals; processing circuitry including analog processing circuitry, digital processing circuitry, and beamforming circuitry, the processing circuitry configured to generate, from the audio signals, either or both of one or more non-beamformed audio signals and one or more beamformed signals; and enhancement circuitry including neural network circuitry configured to: receive multiple input audio signals originating from either or both of the one or more non-beamformed audio signals and the one or more beamformed signals; and implement a neural network trained to generate, based on the multiple input audio signals, a noise-reduced and spatially-focused output audio signal or an output for generating a noise-reduced and spatially-focused output audio signal.
According to one aspect, an ear-worn device includes two or more microphones configured to generate audio signals, each of the two or more microphones configured to generate one of the audio signals; processing circuitry including analog processing circuitry, digital processing circuitry, and beamforming circuitry, the processing circuitry configured to generate, from the audio signals, either or both of one or more non-beamformed audio signals and one or more beamformed signals; and enhancement circuitry including neural network circuitry configured to: receive multiple input audio signals originating from either or both of the one or more non-beamformed audio signals and the one or more beamformed signals; and implement a neural network trained to generate, based on the multiple input audio signals, a noise-reduced output audio signal or an output for generating a noise-reduced output audio signal.
According to one aspect, an ear-worn device includes two or more microphones configured to generate audio signals, each of the two or more microphones configured to generate one of the audio signals; processing circuitry including analog processing circuitry, digital processing circuitry, and beamforming circuitry, the processing circuitry configured to generate, from the audio signals, either or both of one or more non-beamformed audio signals and one or more beamformed signals; and enhancement circuitry including neural network circuitry configured to: receive multiple input audio signals originating from either or both of the one or more non-beamformed audio signals and the one or more beamformed signals; and implement a neural network trained to generate, based on the multiple input audio signals, a spatially-focused output audio signal or an output for generating a spatially-focused output audio signal.
According to one aspect, an ear-worn device includes two or more microphones configured to generate time-domain audio signals, each of the two or more microphones configured to generate one of the time-domain audio signals; and enhancement circuitry including neural network circuitry configured to receive multiple frequency-domain input audio signals originating from the time-domain audio signals and implement a neural network trained to generate, based on the multiple frequency-domain input audio signals, a noise-reduced and spatially-focused output audio signal or an output for generating a noise-reduced and spatially-focused output audio signal.
According to one aspect, an ear-worn device includes two or more microphones configured to generate time-domain audio signals, each of the two or more microphones configured to generate one of the time-domain audio signals; and enhancement circuitry including neural network circuitry configured to receive multiple frequency-domain input audio signals originating from the time-domain audio signals and implement a neural network trained to generate, based on the multiple frequency-domain input audio signals, a noise-reduced output audio signal or an output for generating a noise-reduced output audio signal.
According to one aspect, an ear-worn device includes two or more microphones configured to generate time-domain audio signals, each of the two or more microphones configured to generate one of the time-domain audio signals; and enhancement circuitry including neural network circuitry configured to receive multiple frequency-domain input audio signals originating from the time-domain audio signals and implement a neural network trained to generate, based on the multiple frequency-domain input audio signals, a spatially-focused output audio signal or an output for generating a spatially-focused output audio signal.
According to one aspect, an ear-worn device includes two or more microphones configured to generate microphone audio signals, each of the two or more microphones configured to generate one of the microphone audio signals; and enhancement circuitry including neural network circuitry configured to receive multiple input audio signals originating from the microphone audio signals and implement a neural network trained to generate, based on the multiple input audio signals, a noise-reduced and spatially-focused output audio signal or an output for generating a noise-reduced and spatially-focused output audio signal.
According to one aspect, an ear-worn device includes two or more microphones configured to generate microphone audio signals, each of the two or more microphones configured to generate one of the microphone audio signals; and enhancement circuitry including neural network circuitry configured to receive multiple input audio signals originating from the microphone audio signals and implement a neural network trained to generate, based on the multiple input audio signals, a noise-reduced output audio signal or an output for generating a noise-reduced output audio signal.
According to one aspect, an ear-worn device includes two or more microphones configured to generate microphone audio signals, each of the two or more microphones configured to generate one of the microphone audio signals; and enhancement circuitry including neural network circuitry configured to receive multiple input audio signals originating from the microphone audio signals and implement a neural network trained to generate, based on the multiple input audio signals, a spatially-focused output audio signal or an output for generating a spatially-focused output audio signal.
In some embodiments of any of the above aspects, the one or more frequency-domain beamformed signals include a frequency-domain beamformed signal having a dipole, hypercardioid, supercardioid, or cardioid directional pattern.
In some embodiments of any of the above aspects, the two or more microphones include a front microphone and a back microphone, and the one or more frequency-domain non-beamformed audio signals include a frequency-domain non-beamformed audio signal originating from the front microphone and a frequency-domain non-beamformed audio signal originating from the back microphone.
In some embodiments of any of the above aspects, the single neural network includes a recurrent network.
In some embodiments of any of the above aspects, the ear-worn device further includes interpolation circuitry configured to interpolate between 1. one of the multiple frequency-domain input audio signals, and 2. the noise-reduced and spatially-focused output, or a processed version thereof.
In some embodiments of any of the above aspects, the noise-reduced and spatially-focused output audio signal uses a mapping of gains to respective spatial regions. In some embodiments of any of the above aspects, the mapping is predetermined. In some embodiments of any of the above aspects, the mapping of the gains to the respective spatial regions includes applying a gain of 1 to audio generated from sounds coming from a target spatial region and applying a gain of 0 to audio generated from sounds coming from other spatial regions. In some embodiments of any of the above aspects, the target spatial region has an angle, relative to a wearer of the ear-worn device, approximately equal to or between 10-180 degrees. In some embodiments of any of the above aspects, the mapping of the gains to the respective spatial regions includes mapping more than two spatial regions each to a different gain, and one or more of the spatial regions are processed with gains not equal to 1 or 0. In some embodiments of any of the above aspects, the mapping is not predetermined. In some embodiments of any of the above aspects, the output for generating the noise-reduced and spatially-focused output audio signal includes a sound map indicating frequency components originating from each of multiple spatial regions. In some embodiments of any of the above aspects, the enhancement circuitry is further configured to apply a beam pattern to the sound map, and the beam pattern is based on a selection from a wearer of the ear-worn device. In some embodiments of any of the above aspects, the selection from the wearer of the ear-worn device includes a selection of a size of a front-facing spatial region to use for focusing. In some embodiments of any of the above aspects, the output for generating the noise-reduced and spatially-focused output audio signal includes values calculated for a metric from audio from multiple beams, each of the multiple beams pointing at a different angle around a wearer of the ear-worn device, and the enhancement circuitry is configured to combine the audio from the multiple beams using the values for the metric.
In some embodiments of any of the above aspects, the neural network is trained on both captured data and synthetic data.
In some embodiments of any of the above aspects, the ear-worn device further includes an inertial measurement unit (IMU) and second processing circuitry configured to track head movements of a wearer of the ear-worn device using measurements from the IMU and cause an absolute coordinate system to be used for the spatial focusing based on the head movements.
In some embodiments of any of the above aspects, the processing circuitry is coupled between the two or more microphones and the enhancement circuitry; the analog processing circuitry is coupled between the two or more microphones and the digital processing circuitry; the digital processing circuitry is coupled between the analog processing circuitry and the beamforming circuitry; the beamforming circuitry is coupled between the digital processing circuitry and the STFT circuitry; the analog processing circuitry is configured to perform one or more of analog preamplification, analog filtering, and analog-to-digital conversion; and the digital processing circuitry is configured to perform one or more of wind reduction, input calibration, and anti-feedback processing.
In some embodiments of any of the above aspects, the processing circuitry is coupled between the two or more microphones and the enhancement circuitry; the analog processing circuitry is coupled between the two or more microphones and the digital processing circuitry; the digital processing circuitry is coupled between the analog processing circuitry and the STFT circuitry; the STFT circuitry is coupled between the digital processing circuitry and the beamforming circuitry; the analog processing circuitry is configured to perform one or more of analog preamplification, analog filtering, and analog-to-digital conversion; and the digital processing circuitry is configured to perform one or more of wind reduction, input calibration, and anti-feedback processing.
In some embodiments of any of the above aspects, the neural network circuitry is configured to output a single output based on the multiple frequency-domain input audio signals.
According to one aspect, an ear-worn device is configured to collect audio from multiple beams, each of the multiple beams pointing at a different angle around a wearer of the ear-worn device; calculate values for a metric from the audio from the multiple beams; and combine the audio from the multiple beams using the values for the metric. In some embodiments, the ear-worn device includes neural network circuitry configured to implement a neural network trained to calculate the values for the metric. In some embodiments, the metric includes signal-to-noise ratio. In some embodiments, the metric includes speaker power.
According to one aspect, an ear-worn device includes an inertial measurement unit (IMU) and second processing circuitry configured to track head movements of a wearer of the ear-worn device using measurements from the IMU and cause an absolute coordinate system to be used for the spatial focusing based on the head movements.
Various aspects and embodiments of the application will be described with reference to the following figures. It should be appreciated that the figures are not necessarily drawn to scale.
Items appearing in multiple figures are indicated by the same reference number in all the figures in which they appear.
Reducing noise in the output of hearing aids and other ear-worn devices is a difficult challenge. Reducing noise in scenarios in which the wearer is listening to one speaker while there are other interfering speakers in the vicinity is a particularly difficult challenge. Recently, neural networks for separating speech from noise have been developed. Further description of such neural networks for reducing noise may be found in U.S. Patent App. Pub. No. US20230232169A1, titled “Method, Apparatus and System for Neural Network Hearing Aid,” published on Jul. 20, 2023 and based on an application filed on Jan. 14, 2022, which is incorporated by reference herein in its entirety.
The inventors have recognized that neural networks that accept two or more input audio signals originating from two or more microphones on an ear-worn device may be better able to reduce noise and sound from interfering speakers. For general noise reduction, the inventors have recognized that if, at a previous time step, the neural network heard noise coming from a certain direction-of-arrival (DOA), the neural network may have a prior to cancel out noise from that DOA at the current time step. From another perspective, sound sources tend to move slowly with time, so if the neural network has identified a particular segment of sound as speech and knows its DOA, the neural network may reasonably infer that other sounds from the same direction are also speech.
For reducing sound from interfering speakers, conventional ear-worn devices may use beamforming to focus on sound received from certain directions and attenuate sounds received from other directions. This may involve processing sounds from different microphones in different ways (e.g., applying different delays to the signals received at different microphones). Conventional beamforming (both adaptive and non-adaptive) may provide an intelligibility boost, because it may enable focusing on sounds coming from in front of the wearer (from where it is assumed that sounds of interest originate) and attenuate sounds (e.g., background noise and interfering speakers) on the sides and back of the wearer.
However, conventional beamforming patterns (e.g., cardioids, supercardioids, hypercardioids, and dipoles) may also have shortcomings, including: 1. A theoretical beamforming pattern may become warped once it is implemented by microphones placed on a behind-the-ear hearing aid, due at least in part to interference from the head, torso, and ear of the wearer; this may cause performance to suffer. 2. In reverberant environments, the indirect path may come through from the front-facing direction. For example, in a reverberant room, when a speaker is talking from directly behind the wearer, that speaker's voice may reverberate all around the room and enter the hearing aid microphones from in front of the wearer; such sounds may not be attenuated by a front-facing beamforming pattern. 3. Conventional beamforming may work better on high-frequency sounds than low-frequency sounds. In other words, conventional beamforming may be better at using high-frequency sounds for sound localization than low-frequency sounds. 4. Generally, there is a limit to how much sound reduction conventional beamforming patterns can provide.
The inventors have addressed these shortcomings by developing neural networks that perform spatial focusing. Spatial focusing may include applying different gains to audio signals based on the locations of the sources of the sounds from which the audio signals were generated. The locations of the sounds may be derived from differences in timing of the sounds arriving at multiple microphones. The inventors have recognized that with a single microphone, speakers from different directions may sound the same to a neural network; in other words, a neural network may not be able to tell whether a speaker is in front of the wearer or behind the wearer (or, in general, where a speaker is located). A neural network using inputs from multiple microphones may break this ambiguity. Thus, the neural network may accept multiple input audio signals originating from two or more microphones on an ear-worn device and be trained to perform spatial focusing (in some embodiments, in addition to performing noise reduction), which may help to focus on sounds coming from a target direction and reduce sound coming from other directions. As one particular example, focusing on sounds originating from in front of the ear-worn device wearer may help to reduce sound from interfering speakers located behind and to the sides of the ear-worn device wearer; other target directions may be used as well.
Generally, the inventors have developed neural networks that may accept two or more input audio signals originating from two or more microphones on an ear-worn device and that may provide a significantly larger increase in signal-to-noise ratio (SNR) and higher-performance speaker isolation than achievable with conventional noise reduction and/or beamforming techniques. Thus, these neural networks may be trained to perform noise reduction, spatial focusing, or both. Additionally, the inventors have developed different methods for how such neural networks may perform spatial focusing, including methods for mapping gains to respective spatial regions from which sounds originated, methods for generating sound maps and applying beam patterns using the sound maps, and methods for collecting sound using multiple beams around the wearer and weighting the sound using a metric such as SNR. Additionally, the inventors have developed different graphical user interfaces (GUIs) for a processing device (e.g., a smartphone or tablet) in operative communication with an ear-worn device; a wearer may use these GUIs to control spatial focusing performed by the ear-worn device. The neural networks may have sufficiently low latencies to enable implementation of the neural networks in an ear-worn device. In some embodiments, a single recurrent neural network may be used in conjunction with frequency-domain processing, which may be helpful for realizing low latencies.
In the data path 400, the processing circuitry 404 is coupled between the microphones 402 and the enhancement circuitry 412. The enhancement circuitry 412 is coupled between the processing circuitry 404 and the processing circuitry 414. The processing circuitry 414 is coupled between the enhancement circuitry 412 and the receiver 418. As referred to herein, if element A is described as coupled between element B and element C, there may be other elements between elements A and B and/or between elements A and C.
The microphones 402 may include two or more (e.g., 2, 3, 4, or more) microphones. For example, the microphones 402 may include two microphones, a front microphone that is closer to the front of the wearer of the ear-worn device and a back microphone that is closer to the back of the wearer of the ear-worn device (e.g., as in the hearing aid 156). As another example, the microphones 402 may include more than two microphones in an array (e.g., as in the eyeglasses 358). As another example, one microphone may be on a first ear-worn device and one microphone may be on a second ear-worn device coupled wirelessly to the first ear-worn device. The microphones 402 may be configured to receive sound signals and generate time-domain audio signals 442 from the sound signals. The time-domain audio signals 442 may represent multiple individual audio signals, each generated by one of the microphones 402. Thus, each of the time-domain audio signals 442 may originate from one of the microphones 402.
In some embodiments, the processing circuitry 404 may include analog processing circuitry. The analog processing circuitry may be configured to perform analog processing on the time-domain audio signals 442 received from the microphones 402. For example, the analog processing circuitry may be configured to perform one or more of analog preamplification, analog filtering, and analog-to-digital conversion. Thus, the analog processing circuitry may be configured to generate analog-processed time-domain audio signals from the time-domain audio signals 442. The analog-processed time-domain audio signals may include multiple individual signals, each an analog-processed version of one of the time-domain audio signals 442. As referred to herein, analog processing circuitry may include analog-to-digital conversion circuitry, and an analog-processed signal may be a digital signal that has been converted from analog to digital by analog-to-digital conversion circuitry.
In some embodiments, the processing circuitry 404 may include digital processing circuitry. The digital processing circuitry may be configured to perform digital processing on the analog-processed time-domain audio signals received from the analog processing circuitry. For example, the digital processing circuitry may be configured to perform one or more of wind reduction, input calibration, and anti-feedback processing. Thus, the digital processing circuitry may be configured to generate digital-processed time-domain audio signals from the analog-processed time-domain audio signals. The digital-processed time-domain audio signals may include multiple individual signals, each a digital-processed version of one of the analog-processed time-domain audio signals.
In some embodiments, the processing circuitry 404 may include beamforming circuitry. In some embodiments, the beamforming circuitry may be configured to operate in the time domain, and the enhancement circuitry 412 may be configured to operate in the frequency domain. In such embodiments, STFT circuitry in the processing circuitry 404 may be coupled between the beamforming circuitry and the enhancement circuitry 412, and the beamforming circuitry may be configured to perform beamforming on two or more of the digital-processed time-domain audio signals received from the digital processing circuitry. Thus, the beamforming circuitry may be configured to generate one or more time-domain beamformed audio signals from two or more of the digital-processed time-domain audio signals. The time-domain beamformed audio signals may include one or more individual signals, each a beamformed version of two or more digital-processed time-domain audio signals. Beamforming will be described in further detail below.
The STFT circuitry may be configured to perform STFT on one or more of the time-domain beamformed audio signals and/or one or more of the digital-processed time-domain audio signals. The STFT may convert a signal within a short time window (e.g., on the order of milliseconds) from a time-domain signal into a frequency-domain signal. Thus, the STFT circuitry may be configured to generate one or more frequency-domain beamformed audio signals 422 from the time-domain beamformed audio signals and to generate one or more frequency-domain non-beamformed audio signals 420 from one or more of the digital-processed time-domain (non-beamformed) audio signals. The one or more frequency-domain beamformed audio signals 422 may include one or more individual signals, each a frequency-domain version of one of the time-domain beamformed signals. The one or more frequency-domain non-beamformed audio signals 420 may include one or more individual signals, each a frequency-domain version of one of the digital-processed time-domain signals.
In some embodiments, the beamforming circuitry may be configured to operate in the frequency domain, and the enhancement circuitry 412 may be configured to operate in the frequency domain. In such embodiments, the STFT circuitry may be coupled between the processing circuitry 404 and the beamforming circuitry. The STFT circuitry may be configured to perform STFT on the digital-processed time-domain audio signals received from the digital processing circuitry. The STFT may convert a signal within a short time window (e.g., on the order of milliseconds) from a time-domain signal into a frequency-domain signal. Thus, the STFT circuitry may be configured to generate one or more frequency-domain non-beamformed audio signals 420 from the digital-processed time-domain audio signals. The one or more frequency-domain non-beamformed audio signals 420 may include one or more individual signals, each a frequency-domain version of one of the digital-processed time-domain signals. The beamforming circuitry may be configured to perform beamforming on two or more of the frequency-domain non-beamformed audio signals received from the STFT circuitry. Thus, the beamforming circuitry may be configured to generate one or more frequency-domain beamformed audio signals 422 from two or more of the frequency-domain non-beamformed audio signals. The frequency-domain beamformed audio signals 422 may include one or more individual signals, each a beamformed version of two or more frequency-domain non-beamformed audio signals. Beamforming will be described in further detail below.
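As a concrete, non-limiting illustration of this STFT-then-beamform arrangement, the following Python/NumPy sketch converts two microphone channels to the frequency domain and then applies a per-bin phase-shift-and-subtract beamformer. The frame length, hop size, sample rate, and delay are illustrative assumptions, not values specified by this disclosure.

```python
import numpy as np

def stft(x, frame_len=128, hop=64):
    """Split a time-domain signal into windowed frames and convert each to the frequency domain."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)  # shape: (n_frames, n_bins)

def freq_domain_beamform(front_fd, back_fd, delay_samples, fs, frame_len=128):
    """Delay-and-subtract beamforming applied per frequency bin.

    A pure time delay corresponds to a linear phase shift in the frequency
    domain, so the beamformer reduces to one complex multiply per bin.
    """
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)  # bin center frequencies in Hz
    phase = np.exp(-2j * np.pi * freqs * delay_samples / fs)
    return front_fd - back_fd * phase

# Illustrative use on synthetic front/back microphone signals.
fs = 16_000
t = np.arange(fs) / fs
front = np.sin(2 * np.pi * 440.0 * t)
back = np.roll(front, 3)  # pretend the back microphone hears the source 3 samples later
beamformed_fd = freq_domain_beamform(stft(front), stft(back), delay_samples=3, fs=fs)
```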
Thus, in some embodiments (e.g., when the beamforming circuitry is configured to operate on time-domain signals), the analog processing circuitry may be coupled between the two or more microphones and the digital processing circuitry, the digital processing circuitry may be coupled between the analog processing circuitry and the beamforming circuitry, and the beamforming circuitry may be coupled between the digital processing circuitry and the STFT circuitry. In some embodiments (e.g., when the beamforming circuitry is configured to operate on frequency-domain signals), the analog processing circuitry may be coupled between the two or more microphones and the digital processing circuitry, the digital processing circuitry may be coupled between the analog processing circuitry and the STFT circuitry, and the STFT circuitry may be coupled between the digital processing circuitry and the beamforming circuitry.
The enhancement circuitry 412 includes the neural network circuitry 424 which may be configured to receive multiple input audio signals originating from the one or more frequency-domain non-beamformed audio signals 420 and the one or more frequency-domain beamformed audio signals 422. As referred to herein, a first signal may be said to originate from a second signal if the first signal is the same as the second signal or results from processing of the second signal. Thus, when this description describes that neural network circuitry 424, for example, receives first signals originating from second signals, it should be understood that in some embodiments, the neural network circuitry 424 may receive the second signals themselves. In some embodiments, the neural network circuitry 424 may receive processed versions of the second signals. As particular examples, in some embodiments, the multiple input audio signals received by the neural network circuitry 424 may be the one or more frequency-domain non-beamformed audio signals 420 and the one or more frequency-domain beamformed audio signals 422. In some embodiments, the multiple input audio signals received by the neural network circuitry 424 may be versions of the one or more frequency-domain non-beamformed audio signals 420 and the one or more frequency-domain beamformed audio signals 422 that have been processed further by the enhancement circuitry 412.
The neural network circuitry 424 may be configured to implement a neural network (e.g., a recurrent neural network) trained to perform noise reduction and/or spatial focusing. Thus, in some embodiments, the neural network may be trained to reduce noise (i.e., reduce non-speech). In some embodiments, the neural network may be trained to perform spatial focusing. In some embodiments, the neural network may be trained to both reduce noise and perform spatial focusing. It should be appreciated that a neural network may be considered trained to perform noise reduction even if the neural network itself does not generate a noise-reduced audio signal; a neural network that generates an output for use in generating a noise-reduced audio signal may still be considered trained to perform noise reduction. For example, the neural network may generate a mask that may be used to generate a noise-reduced audio signal. It should also be appreciated that a neural network may be considered trained to perform spatial focusing even if the neural network itself does not generate a spatially-focused audio signal; a neural network that generates an output for use in generating a spatially-focused audio signal may still be considered trained to perform spatial focusing. The output may be, as non-limiting examples, a mask for generating a spatially-focused audio signal, a sound map, a mask for generating a sound map, or values calculated for a metric from audio from multiple beams (each of the multiple beams pointing at a different angle around a wearer of the ear-worn device). In some embodiments, the neural network circuitry 424 may be configured to output a single output based on the multiple input audio signals. In some embodiments, the output 450 of the enhancement circuitry 412 may be the output of the neural network circuitry 424. In some embodiments, the output of the neural network circuitry 424 may undergo further processing (e.g., by the interpolation circuitry 1930 and/or the processing circuitry 1950 described below) prior to being outputted as the output 450 of the enhancement circuitry 412.
The processing circuitry 414 may be configured to perform further processing on the output 450 of the enhancement circuitry 412. For example, the processing circuitry 414 may include digital processing circuitry configured to perform one or more of wide-dynamic range compression and output calibration. Additionally, the processing circuitry 414 may include inverse STFT (iSTFT) circuitry configured to perform inverse STFT on the output of the digital processing circuitry. The iSTFT may be configured to convert a frequency-domain signal into a time-domain signal having a short time window.
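For the inverse STFT stage, a minimal overlap-add sketch, consistent with the stft helper above and using the same assumed frame length and hop size, may look as follows; the overlap-add structure shown is a common implementation choice, not one mandated by this disclosure.

```python
import numpy as np

def istft(frames_fd, frame_len=128, hop=64):
    """Overlap-add reconstruction of a time-domain signal from frequency-domain frames.

    Inverse of the stft sketch above: each frame is inverse-transformed,
    windowed again, and summed into the output at its hop offset, with a
    normalization term compensating for the overlapping windows.
    """
    window = np.hanning(frame_len)
    frames = np.fft.irfft(frames_fd, n=frame_len, axis=-1)
    out = np.zeros((len(frames) - 1) * hop + frame_len)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += frame * window
        norm[i * hop : i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-12)
```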
The receiver 418 (which may be the same as the receivers 118 and/or 318) may be configured to play back the output of the iSTFT circuitry as sound into the ear of the user. The receiver 418 may also be configured to implement digital-to-analog conversion prior to the playing back.
In some embodiments, one or more of the processing circuitry 404, the enhancement circuitry 412 (including the neural network circuitry 424), and the processing circuitry 414 (or portions of any of the above) may be implemented on a single chip (i.e., a single semiconductor die or substrate) in the ear-worn device. Further description of chips incorporating (among other elements) neural network circuitry for use in ear-worn devices may be found in U.S. patent application Ser. No. 18/232,854, entitled "Neural Network Chip for Ear-Worn Device," filed Aug. 11, 2023, which is incorporated by reference herein in its entirety.
It should be appreciated from the above description of the data paths 400, 500, and 600 that, in general, two or more microphones (e.g., the microphones 402) may be configured to generate time-domain audio signals, such that each of the two or more microphones is configured to generate one of the time-domain audio signals. Processing circuitry (e.g., the processing circuitry 404) may be configured to generate, from the time-domain audio signals, either or both of 1. one or more non-beamformed audio signals, and 2. one or more beamformed signals. Enhancement circuitry (e.g., the enhancement circuitry 412) may include neural network circuitry (e.g., the neural network circuitry 424) configured to receive multiple input audio signals originating from either or both of 1. the one or more non-beamformed audio signals, and 2. the one or more beamformed signals. In some embodiments, the multiple input audio signals may be frequency-domain signals. In some embodiments, the multiple input audio signals may be time-domain signals. As will be described further below, the neural network circuitry may be further configured to implement a neural network trained to perform noise reduction and/or spatial focusing. The neural network may be trained to generate a noise-reduced and/or spatially-focused output audio signal, or an output for generating a noise-reduced and/or spatially-focused output audio signal.
In some embodiments, the multiple input audio signals may include one frequency-domain beamformed signal. In some embodiments, the multiple input audio signals may include two frequency-domain beamformed signals. In some embodiments, the multiple input audio signals may include more than two (e.g., 3, 4, 5, or any other suitable number) frequency-domain beamformed signals. In some embodiments, the multiple input audio signals may include one frequency-domain non-beamformed signal. In some embodiments, the multiple input audio signals may include two frequency-domain non-beamformed signals. In some embodiments, the multiple input audio signals may include more than two (e.g., 3, 4, 5, or any other suitable number) frequency-domain non-beamformed signals. When the microphones 402 include a front microphone and a back microphone (e.g., as in the hearing aid 156), the one or more frequency-domain non-beamformed audio signals may include a frequency-domain non-beamformed audio signal originating from the front microphone and/or a frequency-domain non-beamformed audio signal originating from the back microphone.
In some embodiments, instead of or in addition to the example combinations of signals fed into the neural network circuitry 424 described above, signals that have undergone other processing may be fed into the enhancement circuitry. For example, processing circuitry not illustrated may calculate the difference between the front and back microphone signals at any stage of a data path, and once this difference signal has undergone processing by the rest of the data path, it may be fed to the neural network circuitry in addition to or instead of either or both of the front frequency-domain audio signal 420f and the back frequency-domain audio signal 420b. Additionally or alternatively, such a processed signal may be used to generate beamformed signals; for example, the front frequency-domain audio signal 420f and the difference between the front frequency-domain audio signal 420f and the back frequency-domain audio signal 420b may be used as inputs to beamforming circuitry.
According to one non-limiting embodiment, an ear-worn device includes two or more microphones; processing circuitry including analog processing circuitry, digital processing circuitry, beamforming circuitry, and STFT circuitry; enhancement circuitry that includes neural network circuitry; further processing circuitry; and a receiver. According to one non-limiting embodiment, an ear-worn device includes two or more microphones; processing circuitry including analog processing circuitry, digital processing circuitry, and beamforming circuitry; enhancement circuitry that includes neural network circuitry; further processing circuitry; and a receiver. According to one non-limiting embodiment, an ear-worn device includes two or more microphones; processing circuitry including analog processing circuitry, digital processing circuitry, and STFT circuitry; enhancement circuitry that includes neural network circuitry; further processing circuitry; and a receiver. According to one non-limiting embodiment, an ear-worn device includes two or more microphones; processing circuitry including analog processing circuitry and digital processing circuitry; enhancement circuitry that includes neural network circuitry; further processing circuitry; and a receiver.
In some embodiments of any of the data paths described herein, the enhancement circuitry 412 may be configured to operate on time-domain signals. Thus, the STFT circuitry and iSTFT circuitry may be absent in any of the data paths described herein, and the enhancement circuitry 412 may be configured to receive as inputs one or more time-domain audio signals (e.g., one or more non-beamformed audio signals and/or one or more time-domain beamformed audio signals).
Referring back to beamforming, beamforming may generally include applying delays (which should be understood to include a delay of 0) to one or more audio signals and summing (which should be understood to include subtracting) the delayed signals together. Different delays may be applied to signals originating from different microphones. In embodiments including just two microphones, namely a front microphone (e.g., the front microphone 102f and/or 402f) and a back microphone (e.g., the back microphone 102b and/or 402b), beamforming may include applying a delay to the signal from one of the microphones and subtracting the delayed signal from the signal from the other microphone. The resulting signal may have a directional pattern that depends, at least in part, on the spacing between the front microphone and the back microphone as well as the delay applied; in other words, the gain of the resulting signal may vary as a function of angle from the microphones. Examples of directional patterns include dipoles, hypercardioids, supercardioids, and cardioids. Certain directional patterns may generally attenuate signals coming from behind the wearer and not attenuate signals coming from in front of the wearer.
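The following sketch shows this delay-and-subtract operation for two microphones in the time domain; the use of whole-sample delays is a simplifying assumption for illustration.

```python
import numpy as np

def delay_and_subtract(front, back, delay_samples):
    """Two-microphone time-domain beamformer: delay the back signal, then subtract.

    If the delay matches the acoustic travel time between the microphones,
    sound arriving from directly behind reaches the front microphone exactly
    delay_samples after the back microphone and cancels, producing a
    back-facing null; a delay of 0 produces a dipole-like pattern.
    """
    if delay_samples == 0:
        return front - back
    delayed = np.concatenate([np.zeros(delay_samples), back[:-delay_samples]])
    return front - delayed
```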
As described above, the neural network circuitry 424 may be configured to receive as inputs (in some embodiments, in addition to other signals) one or more signals originating from beamformed signals. In some embodiments, these beamformed signals may be frequency-domain signals (e.g., the frequency-domain beamformed signals 422) although time-domain beamformed signals may also be used as described above. In some embodiments, the one or more beamformed signals may include a signal having a dipole pattern. In some embodiments, the one or more beamformed signals may include a signal having a front-facing supercardioid pattern. In some embodiments, the one or more beamformed signals may include a signal having a back-facing supercardioid pattern. In some embodiments, the one or more beamformed signals may include a signal having a front-facing cardioid pattern. In some embodiments, the one or more beamformed signals may include a signal having a back-facing cardioid pattern. In some embodiments, the one or more beamformed signals may include a signal having a front-facing supercardioid pattern and a signal having a back-facing supercardioid pattern. In some embodiments, the one or more beamformed signals may include a signal having a front-facing cardioid pattern and a signal having a back-facing cardioid pattern. In some embodiments, the one or more beamformed signals may include a signal having a front-facing supercardioid pattern, a signal having a back-facing supercardioid pattern, and a signal having a dipole pattern. In some embodiments, the one or more beamformed signals may include a signal having a front-facing cardioid pattern, a signal having a back-facing cardioid pattern, and a signal having a dipole pattern. In some embodiments, the one or more beamformed signals may include one signal. In some embodiments, the one or more beamformed signals may include two signals. In some embodiments, the one or more beamformed signals may include three signals. In some embodiments, the one or more beamformed signals may include more than three signals.
In some embodiments, the neural network may be trained to reduce noise (i.e., reduce non-speech). In some embodiments, the neural network may be trained to perform spatial focusing. In some embodiments, the neural network may be trained to both reduce noise and perform spatial focusing. In particular, the neural network circuitry 424 may be configured to receive multiple input audio signals 1932_1 . . . 1932_n. In some embodiments, the input audio signals 1932_1 . . . 1932_n may be frequency-domain audio signals. In some embodiments, the input audio signals 1932_1 . . . 1932_n may be time-domain audio signals. In embodiments in which the input audio signals 1932_1 . . . 1932_n to the neural network circuitry 424 are frequency-domain audio signals, the input audio signals 1932_1 . . . 1932_n may originate from either or both of the one or more frequency-domain non-beamformed audio signals 420 (which may be, for example, either or both of the front frequency-domain audio signal 420f and the back frequency-domain audio signal 420b) and the one or more frequency-domain beamformed audio signals 422 (which may be, for example, audio signals having dipole, cardioid, supercardioid, or hypercardioid beam patterns, and may be generated, for example, based on the front frequency-domain audio signal 420f and the back frequency-domain audio signal 420b, or generally, based on the one or more frequency-domain non-beamformed audio signals 420).
Based on the input audio signals 1932_1 . . . 1932_n, the neural network circuitry 424 may be configured to output a single output mask 1934. The mask 1934 may be used to generate an output 1986. In the example of
The mask 1934 may be a real or complex mask that varies with frequency. Thus, when the mask 1934 is multiplied by the input audio signal 1932_1, it may operate differently on different frequency components of the input audio signal 1932_1. In other words, the mask 1934 may cause different frequency components of the input audio signal 1932_1 to be multiplied by different real or complex values. A real mask may modify just magnitude, while a complex mask may modify both magnitude and phase.
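A minimal sketch of applying such a mask, assuming frequency-domain frames of shape (frames, bins) and a mask with one value per bin; the example mask values are purely illustrative.

```python
import numpy as np

def apply_mask(input_fd, mask):
    """Apply a per-frequency mask to frequency-domain frames.

    input_fd has shape (n_frames, n_bins); mask has shape (n_bins,) and
    broadcasts across frames. Real values scale only magnitude; complex
    values scale magnitude and rotate phase.
    """
    return input_fd * mask

# Example masks for 65 bins (the one-sided spectrum of a 128-point FFT).
n_bins = 65
real_mask = np.linspace(1.0, 0.0, n_bins)          # keep low bins, attenuate high bins
complex_mask = real_mask * np.exp(1j * np.pi / 4)  # same magnitudes, plus a 45-degree phase shift
```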
Turning to
The processing circuitry 1950 of
Generally, the interpolation circuitry 1930 of
The control circuitry 1982 of
As described above, the output of the neural network circuitry 424 may be the output 1986, or a mask 1934 that generates the output 1986. In some embodiments, the output 1986 may be an output audio signal. In some embodiments, the output 1986 may be a different kind of output, such as a sound map, or values calculated for a metric from audio from multiple beams (each of the multiple beams pointing at a different angle around a wearer of the ear-worn device), both of which are described below. When the output 1986 is not an audio signal, in some embodiments, the output 1980 may be an audio signal. Thus, the processing circuitry 1950 may convert the output 1986 into an audio signal. When the output 1986 is an output audio signal, the output 1986 may be a noise-reduced and/or spatially-focused output signal. When the output 1986 is processed into an output audio signal (e.g., the output 1980), the output audio signal may be a noise-reduced and/or spatially-focused output signal. Because the neural network may be trained to generate a noise-reduced and/or spatially-focused output signal, or an output used for generating a noise-reduced and/or spatially-focused output signal, the neural network may be considered trained to perform noise reduction and/or spatial focusing.
Neural networks will now be described in further detail. When the neural network implemented by the neural network circuitry 424 is trained to perform noise reduction, in some embodiments the output audio signal (e.g., the output 1986) may be a noise-reduced output audio signal. In the example of
When the neural network implemented by the neural network circuitry 424 is trained to perform spatial focusing, in some embodiments the output audio signal (e.g., the output 1986) may be a spatially-focused output audio signal. In the example of
In some embodiments, a spatially-focused output audio signal (which may also be a noise-reduced output audio signal) may use a mapping of gains to respective spatial regions (in other words, particular DOAs). In some embodiments, gain may be equal to or between 0 and 1. In some embodiments, gain may be equal to or greater than 0. Mapping gains to spatial regions may result in focusing, as higher gains may be applied to sounds originating from certain spatial regions and lower gains may be applied to sounds originating from other spatial regions. The spatial regions may be defined relative to the wearer of the ear-worn device. In some embodiments, the mapping may include applying a gain of 1 (or approximately 1) to audio generated from sounds coming from a target spatial region (which need not necessarily be continuous) and applying a gain of 0 (or approximately 0) to audio generated from sounds coming from other spatial regions. Thus, a spatially-focused output audio signal may include sounds coming from spatial regions mapped to a gain of 1 and exclude sounds coming from spatial regions mapped to a gain of 0.
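A minimal sketch of such a binary gain mapping, assuming the DOA of each sound is already known and using a hypothetical 60-degree front-facing target region:

```python
def region_gain(doa_deg, target_half_width_deg=30.0):
    """Binary gain mapping: 1 inside a front-facing target region, 0 elsewhere.

    DOA is measured in degrees from straight ahead of the wearer, so a
    half-width of 30 degrees corresponds to a 60-degree target region.
    """
    return 1.0 if abs(doa_deg) <= target_half_width_deg else 0.0
```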
It should be appreciated that target spatial regions having angles relative to the wearer other than 60 degrees may also be used. In some embodiments, the target spatial region may have an angle approximately equal to or between 10-180 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 20-180 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 30-180 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 40-180 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 50-180 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 60-180 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 10-150 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 20-150 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 30-150 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 40-150 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 50-150 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 60-150 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 10-120 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 20-120 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 30-120 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 40-120 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 50-120 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 60-120 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 10-90 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 20-90 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 30-90 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 40-90 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 50-90 degrees. In some embodiments, the target spatial region may have an angle approximately equal to or between 60-90 degrees. For example, the angle may be equal to or approximately equal to 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, or 180 degrees. It should be appreciated that the ear-worn device may be considered to focus on sounds originating from within the target spatial region 2140.
For training, in some embodiments, a training input audio signal may be formed from component audio signals, some originating from within a target spatial region (e.g., the target spatial region 2140) and some originating from outside the target spatial region. Components of a training input audio signal may be labeled as target (gain=1) or distractor (gain=0) depending on whether they originate from within or outside the target spatial region. Signals from multiple microphones may be generated from the training input audio signal. For a neural network trained to output a mask, a training mask may be determined such that, when the training mask is applied to one of the microphone signals or a beamformed version of the microphone signals, just components originating from within the target spatial region (i.e., those input signals labeled as target) remain. It should be appreciated that components of training input signals may be labeled as target (include) or distractor (null out) depending on whether they originate from within or outside the target spatial region, rather than explicitly labeling them with their DOA. The neural network may thereby be trained to take input signals during inference and predict a mask that, when used to mask an input audio signal, results in just components of the input audio signal that originated within the target spatial region remaining. For a neural network trained to output an audio signal, the training output may be an audio signal having components just originating from within the target spatial region. The neural network may thereby be trained to take input signals during inference and predict an output audio signal (e.g., the output 1986) having just components of an input audio signal that originated within the target spatial region remaining.
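As a hedged illustration of how such a training mask might be computed, the sketch below uses a magnitude-ratio mask, which is one common choice; this disclosure does not commit to a particular mask formula, and the helper names are hypothetical.

```python
import numpy as np

def training_mask(component_fds, component_gains, eps=1e-12):
    """Build a training mask from labeled components of a training mixture.

    component_fds: list of (n_frames, n_bins) complex arrays, one per source.
    component_gains: 1 for components labeled target, 0 for distractors.
    Applied to the mixture, the mask approximately leaves only the
    target-labeled components.
    """
    mixture = sum(component_fds)
    desired = sum(g * c for g, c in zip(component_gains, component_fds))
    return np.abs(desired) / (np.abs(mixture) + eps)  # real magnitude-ratio mask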
In any of the training described herein, two methods for generating multi-microphone localized training data may include 1. Collecting audio originating from sounds at different DOAs with multiple microphones, and 2. Synthetically creating multiple microphone signals as though a sound source was localized at a specific DOA using audio simulation. Synthetic generation of training signals may include adding directionality to different sound sources (speech signals and noise) in simulation and then adding these together to create a new signal with spatial audio. A neural network may be trained on either or both of synthetic data and captured data.
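A minimal sketch of the second (synthetic) method for a two-microphone device, assuming a far-field source, an illustrative microphone spacing, and the nominal speed of sound; real simulations may also model head effects and reverberation.

```python
import numpy as np

def simulate_two_mics(source, doa_deg, fs=16_000, spacing_m=0.01, c=343.0):
    """Synthesize front/back microphone signals for a far-field source at a given DOA.

    The inter-microphone delay is spacing * cos(DOA) / c. It is applied as a
    linear phase shift in the frequency domain so that fractional-sample
    delays are represented exactly.
    """
    delay_s = spacing_m * np.cos(np.deg2rad(doa_deg)) / c
    spectrum = np.fft.rfft(source)
    freqs = np.fft.rfftfreq(len(source), d=1.0 / fs)
    back = np.fft.irfft(spectrum * np.exp(-2j * np.pi * freqs * delay_s), n=len(source))
    return source, back  # front microphone signal, delayed back microphone signal
```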
In some embodiments, gain may smoothly transition, or transition approximately smoothly, as a function of DOA.
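For example, a raised-cosine taper is one simple way to realize such a smooth transition; the region boundaries below are hypothetical:

```python
import numpy as np

def smooth_gain(doa_deg, pass_deg=30.0, stop_deg=90.0):
    """Gain that transitions smoothly with DOA: 1 inside the pass region,
    0 beyond the stop region, and a raised-cosine taper in between."""
    a = abs(doa_deg)
    if a <= pass_deg:
        return 1.0
    if a >= stop_deg:
        return 0.0
    frac = (a - pass_deg) / (stop_deg - pass_deg)
    return 0.5 * (1.0 + np.cos(np.pi * frac))
```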
For training, in some embodiments, a training input audio signal may be formed from component audio signals originating from within different spatial regions. Each component of a training input audio signal may be labeled with a gain associated with the spatial region from which the component audio signal originated. Signals from multiple microphones may be generated from the training input audio signal. When the neural network is trained to output a mask, a training mask may be determined such that, when the training mask is applied to one of the microphone signals or a beamformed version of the microphone signals, what remains is each component audio signal multiplied by a gain (which may be 0, 1, or other values in between) corresponding to the spatial region from which it originated, and then summed together. The neural network may thereby be trained to take input audio signals during inference and predict a mask that, when used to mask an input audio signal, results in each component audio signal multiplied by a gain (which may be, for example, 0, 1, or other values in between) corresponding to the spatial region from which it originated, and then summed together. For a neural network trained to output an audio signal, the training output may be an audio signal in which each component audio signal is multiplied by a gain corresponding to the spatial region from which it originated, and then summed together. The neural network may thereby be trained to take an input signal during inference and predict an output audio signal (e.g., the output 1986) in which each component audio signal is multiplied by a gain corresponding to the spatial region from which it originated, and then summed together.
In some embodiments, the mappings may be predetermined. In other words, the bounds of the various spatial regions and the gain associated with each spatial region may be determined at training time. The mappings of spatial regions to gains may thus be fixed in advance of inference.
In some embodiments, using the sound map, the enhancement circuitry 1912 may be configured to apply a beam pattern to the sound map using the processing circuitry 1950, and the beam pattern need not be predetermined. The result of applying the beam pattern may be a spatially-focused audio signal, and thus the sound map may be used for generating a spatially-focused audio signal. As will be described below, the beam pattern may be based on a selection from a wearer of the ear-worn device. To apply a beam pattern, the processing circuitry 1950 may be configured to apply different gains to sounds originating from the different spatial regions (as indicated by the sound map) and these gains need not be predetermined.
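Viewed this way, applying a beam pattern reduces to weighting each region's frequency components by that region's gain and summing; a minimal sketch with hypothetical names (the sound-map layout here is an assumption):

```python
import numpy as np

def apply_beam_pattern(sound_map, region_gains):
    # sound_map: (regions x bins) magnitudes attributed to each
    # spatial region for one frame.
    # region_gains: (regions,) gains, which need not be predetermined
    # (e.g., they may come from a wearer selection at run time).
    return region_gains @ sound_map   # weighted sum over regions
```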
In some embodiments that include an ear-worn device with an array of more than two microphones (e.g., on eyeglasses such as the eyeglasses 358), beamforming circuitry (e.g., the beamforming circuitry 408) may be configured to generate, at each time step, multiple beams (e.g., between or equal to 10 and 20 beams), each of which points at a different angle around the 360-degree circle relative to the wearer. The output 1986 may be values calculated for a metric from audio from the multiple beams. In particular, the neural network circuitry 424 may implement a neural network trained to calculate an individual value for the metric for the audio from each individual beam. Thus, each beam may have a different value for the metric. As examples, the metric may be SNR or speaker power. The enhancement circuitry 1912 or 2012 (in particular, the processing circuitry 1950) may be configured to combine the audio from the multiple beams using the values for the metric. For example, when the metric is SNR, the neural network may be configured to provide the per-frame SNR (i.e., the output 1986) of inputted audio signals. The processing circuitry 1950 may then be configured to output the sum of the audio from each beam weighted by each beam's SNR as calculated by the neural network. In particular, low-SNR beams may be down-weighted and high-SNR beams may be up-weighted, so that focus is placed on those beams having the highest-SNR audio. The result may therefore be a spatially-focused audio signal, and thus the values calculated for the metric may be used for generating a spatially-focused audio signal. In some embodiments, before summing the beams, the processing circuitry may be configured to apply a moving average across the audio from the different beams.
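A minimal sketch of this metric-weighted combination (hypothetical names; assumes linear, non-negative per-beam SNR estimates and beams arranged around the full circle so that the moving average wraps):

```python
import numpy as np

def smooth_across_beams(beam_audio, k=3):
    # Circular moving average over neighboring beams; beams point
    # around the full 360 degrees, so averaging wraps at the ends.
    shifts = range(-(k // 2), k // 2 + 1)
    return sum(np.roll(beam_audio, s, axis=0) for s in shifts) / k

def combine_beams(beam_audio, beam_snr, eps=1e-8):
    # beam_audio: (n_beams x frame_len), one frame of audio per beam.
    # beam_snr: per-beam SNR for this frame (assumed linear and
    # non-negative), e.g., as predicted by the neural network.
    smoothed = smooth_across_beams(beam_audio)
    weights = beam_snr / (np.sum(beam_snr) + eps)  # up-weight high SNR
    return weights @ smoothed                      # weighted sum of beams
```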
In some embodiments, spatial focusing may be based on wearer selection. The wearer may be able to control spatial focusing using a processing device (e.g., a smartphone or tablet) in operative communication with the ear-worn device. In some embodiments, the processing device may be configured to receive a selection from the wearer of whether to perform spatial focusing. If turned on, spatial focusing by the neural network as described above may be used. If turned off, spatial focusing may not be performed by the neural network. The processing device may transmit to the ear-worn device an indication of the wearer selection of whether to turn focusing on or off, and the control circuitry 1982 may be configured to use this indication to control whether focusing is enabled or disabled. For example, the control circuitry 1982 may use this indication to modify the interpolation coefficient used by the interpolation circuitry 1930 to 0 (focusing turned off) or a value greater than 0 (focusing turned on). As another example, the control circuitry 1982 may use this indication to modify a parameter of the neural network to perform spatial focusing or not.
In some embodiments, the processing device may be configured to receive a selection from the wearer of how much focusing to perform. For example, the output of the enhancement circuitry may be a weighted combination of focused and non-focused audio, with the weighting determined by the wearer selection. The processing device 2448 may transmit an indication of the wearer selection of a degree of focusing (e.g., a value between or equal to 0 and 1) to the ear-worn device, and the control circuitry 1982 may be configured to use this indication to control the degree of focusing. For example, the control circuitry 1982 may use this indication to modify the interpolation coefficient used by the interpolation circuitry 1930 (e.g., to the value between or equal to 0 and 1). As another example, the control circuitry 1982 may use this indication to modify a parameter of the neural network controlling how much spatial focusing the neural network should perform.
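As a minimal sketch (hypothetical name), the wearer-selected degree of focusing can act as a simple interpolation coefficient between focused and non-focused audio:

```python
def mix_focus(focused, unfocused, alpha):
    # alpha in [0, 1]: wearer-selected degree of focusing
    # (0 = focusing off, 1 = fully focused), e.g., used as the
    # interpolation coefficient of the interpolation circuitry.
    return alpha * focused + (1.0 - alpha) * unfocused
```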
In embodiments in which the enhancement circuitry 1912 performs spatial focusing by applying a beam pattern to a sound map, in some embodiments, the beam pattern applied using the sound map may be based on wearer selection. In some embodiments, the selection from the wearer may be a selection of which spatial regions to include and which spatial regions to exclude. The wearer may make the selection using a processing device (e.g., a smartphone or tablet) in operative communication with the ear-worn device. The processing device 2448 may transmit to the ear-worn device an indication of the wearer selection of which spatial regions to focus on, and the control circuitry 1982 may use this indication to control the spatial regions used (e.g., by controlling operation of the processing circuitry 1950).
In some embodiments, the selection from the wearer may be a selection of a size of a front-facing spatial region to use for focusing. For example, consider a sound map that includes 16 spatial regions. The enhancement circuitry 1912 may modify the size of the front-facing spatial region used for focusing by selecting the one front-facing spatial region of the 16 spatial regions, the two front-facing spatial regions, the four front-facing spatial regions, or the eight front-facing spatial regions. Thus, the wearer may be able to select a focusing width using a processing device (e.g., a smartphone or tablet) in operative communication with the ear-worn device. The processing device 2448 may transmit an indication of the wearer selection of focusing width to the ear-worn device, and the control circuitry 1982 may use this indication to control the spatial regions used (e.g., by controlling operation of the processing circuitry 1950).
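A rough sketch of selecting a focusing width (hypothetical names; assumes regions are indexed by angle with region 0 centered straight ahead):

```python
import numpy as np

def front_region_gains(n_regions=16, width=4):
    # Select the `width` spatial regions closest to straight ahead
    # (circular indexing; region 0 is centered on the front),
    # e.g., width = 1, 2, 4, or 8 out of 16 regions.
    gains = np.zeros(n_regions)
    by_distance = sorted(range(n_regions), key=lambda i: min(i, n_regions - i))
    gains[by_distance[:width]] = 1.0
    return gains
```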
It should be appreciated that the forms of any of the GUIs described herein are non-limiting and other GUIs that accomplish the same functionality may also be used.
In some embodiments, an ear-worn device may include an inertial measurement unit (IMU). An IMU may include one or more of an accelerometer, a gyroscope, and a magnetometer. In some embodiments, the ear-worn device may use an IMU (in particular, an accelerometer and gyroscope of the IMU) configured to track head movements of the wearer. Two non-limiting uses for tracking head movements will be described below.
As described above, in some embodiments an ear-worn device may generate a sound map indicating the frequency components originating from each of multiple spatial regions. In such embodiments, it may be helpful to apply a moving average to values determined for different spatial regions, as this may average away some error. However, if the wearer rotates their head quickly, this may blur the average across the different spatial regions. As will be described further below, an IMU configured to track head movements (e.g., using an accelerometer and gyroscope) may be able to correct for this.
It should be appreciated that in arrays of two microphones (e.g., on hearing aids such as the hearing aid 156), the beams that can be created using beamforming (e.g., cardioids and supercardioids) may be broad, such that if the wearer was talking to a person in front of them and then turns their head, even by 90 degrees, the person's speech may only decrease in amplitude by a few dB. However, with an array of more than two microphones (e.g., on eyeglasses such as the eyeglasses 358), more narrow beams may be created. With a narrow beam, even slight head rotations may cause the amplitude of sound from a person previously directly in front of the wearer to decrease substantially. An IMU configured to track head movements (e.g., using an accelerometer and gyroscope) may be able to correct for this as well.
In more detail, an IMU configured to track head movements (e.g., using an accelerometer and gyroscope) may enable spatial regions to be defined in an absolute coordinate system, rather than the coordinate system of the wearer's head (which could rotate very quickly). The absolute coordinate system may be defined relative to the wearer's head, but on a slow timescale. Thus, if the wearer is sitting and talking to a person, briefly turns their head to look at something, then turns back to the person they are talking to, the coordinate system may stay in the same place and not rotate (or not rotate very much) with the head. But if the wearer turns and starts talking to another person, the coordinate system may slowly (e.g., over the course of several seconds) rotate with the head. To realize this, an exponential moving average may be applied to the coordinate system, such that the coordinate system is an exponential moving average of the head orientation. The timescale of the exponential moving average may be, for example, several seconds (e.g., 2, 3, 4, 5, 6, 7, 8, 9, or 10 seconds, or any other suitable value). In some embodiments, during head movements, sounds from the new direction (i.e., the direction the wearer is turning their head towards) may be focused on immediately, but sounds from the old direction (i.e., the direction the wearer is turning their head away from) may continue to be focused on and released slowly. In other words, as the wearer moves their head toward a new direction, the aperture may be broadened fairly quickly to focus on sounds from the new direction while continuing to focus on sounds from the previous direction, and then focus on the sounds from the previous direction may be slowly wound down as the wearer continues to look in the new direction. The winding down may be modulated as a function of how long the wearer looks in the new direction, so that a quick head glance does not cause a permanently wider aperture. In some embodiments, this behavior may be realized by an exponential moving average with a long timescale. In some embodiments, this behavior may be realized by combining (1) sounds from the new direction with full weight, and (2) sounds from the old direction processed with an exponential moving average.
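A minimal sketch of such a slow coordinate frame (hypothetical names; illustrative timescales; assumes head yaw has already been integrated from the gyroscope):

```python
import numpy as np

class SlowCoordinateFrame:
    # Exponential moving average of head yaw, giving a reference
    # frame that follows the head only on a slow (~tau_s) timescale.
    def __init__(self, tau_s=5.0, dt_s=0.01):
        self.alpha = dt_s / tau_s              # EMA step per IMU sample
        self.state = np.array([1.0, 0.0])      # unit vector avoids angle wrap

    def update(self, yaw_rad):
        # yaw_rad: current head yaw, e.g., integrated from gyroscope data.
        head = np.array([np.cos(yaw_rad), np.sin(yaw_rad)])
        self.state = (1 - self.alpha) * self.state + self.alpha * head
        self.state /= np.linalg.norm(self.state) + 1e-12
        return np.arctan2(self.state[1], self.state[0])  # slow-frame yaw
```

With this scheme, a quick glance barely moves the frame, while a sustained turn rotates it over roughly the chosen timescale; the fast-attach, slow-release aperture described above could then be built by combining the instantaneous head direction with this averaged frame.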
As described above, binaural communication may occur in a system of two hearing aids (e.g., two of the hearing aid 156), cochlear implants, or earphones that communicate with each other over a wireless communication link. Binaural communication may also occur in an ear-worn device such as eyeglasses with built-in hearing aids (e.g., the eyeglasses 358), in which a device in a portion of the eyeglasses near one ear can communicate with a device in a portion of the eyeglasses near the other ear over a wired communication link within the eyeglasses. In some embodiments, binaural communication may facilitate communication of spatial information. For example, a mask or a sound map may be communicated from one device to another. In some embodiments, the binaural communication may occur over a low-latency communication link such as a near-field magnetic induction (NFMI) communication link.
In some embodiments, a GUI may display representations of different speakers, and the wearer may select from the GUI one of the speakers on which to focus. In some embodiments, multiple tight beams may be generated using beamforming on an array of more than two microphones, and the power of the speech signal in audio from each beam may be calculated. Beams having power above a threshold may be considered to have a speaker in the direction of the beam. In some embodiments, a neural network may be trained to determine the direction of speakers based on audio from one or more beams. In some embodiments, the information about the direction of the speakers may be transmitted from the ear-worn device to a processing device (e.g., a smartphone or tablet). The processing device may then use this information to display representations of the speakers at each of their respective directions in a GUI. Based on the wearer selecting one of the representations of the speakers from the GUI, the processing device may transmit an indication of this selection to the ear-worn device, and the ear-worn device may use a beam focused on the direction of that speaker going forward.
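As a rough sketch of the power-thresholding step (hypothetical names; the power measure and threshold are illustrative):

```python
import numpy as np

def speaker_directions(beam_audio, beam_angles_deg, power_thresh):
    # beam_audio: (n_beams x samples) speech-band audio per beam.
    # Beams whose power exceeds the threshold are taken to contain
    # a speaker in that beam's direction.
    power = np.mean(beam_audio ** 2, axis=1)
    return [a for a, p in zip(beam_angles_deg, power) if p > power_thresh]
```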
In some embodiments, a neural network trained to determine the direction of a target speaker (e.g., using any of the methods described herein) may run on a processing device in operative communication with an ear-worn device. The processing device may be configured to transmit an indication of the direction of the target speaker to the ear-worn device, and the ear-worn device may use a beam focused on that direction going forward.
For time-domain processing, combinations of convolutional neural networks (CNNs) and long short-term memory (LSTM) networks may be desirable; however, using two networks may increase the latency and/or power consumption. Generally, latency may refer to how long it takes from reception of a sound signal by the ear-worn device to output of the enhanced sound signal by the ear-worn device. Lower latencies may be important when the enhancement circuitry is implemented on an ear-worn device such as a hearing aid; latencies above a threshold may result in the wearer hearing echoes and/or misalignment between sound and moving lips. The inventors have recognized that it may be possible to use a single recurrent neural network (RNN) in the enhancement circuitry, and that RNNs may be well-suited for frequency-domain processing. Accordingly, in some embodiments, a single RNN may be used (e.g., implemented by neural network circuitry), and all or most of the processing done by the enhancement circuitry may be in the frequency domain.
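A minimal sketch of such a single-RNN, frequency-domain enhancer (PyTorch; layer sizes and names are illustrative, not taken from this disclosure):

```python
import torch
import torch.nn as nn

class MaskRNN(nn.Module):
    # Single-RNN, frequency-domain enhancer: per-frame STFT magnitudes
    # in, per-frame mask out. Sizes are illustrative only.
    def __init__(self, n_bins=128, hidden=96):
        super().__init__()
        self.rnn = nn.GRU(n_bins, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_bins)

    def forward(self, mag):                    # mag: (batch, frames, bins)
        h, _ = self.rnn(mag)
        return torch.sigmoid(self.out(h))      # mask in [0, 1]
```

Because the GRU is unidirectional, each frame's mask depends only on current and past frames, which is consistent with the latency constraints discussed here.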
It should also be appreciated that, in some embodiments, a single neural network may be used both for spatial focusing and for noise reduction. While using one neural network optimized for spatial focusing and another optimized for noise reduction may potentially provide higher performance, the inventors have recognized that using two neural networks may involve higher latency and/or power consumption than is feasible for an ear-worn device. In some embodiments, the latency of one neural network trained both for spatial focusing and for noise reduction may be approximately equal to or less than 6 milliseconds. In some embodiments, the latency may be approximately equal to or less than 5 milliseconds. In some embodiments, the latency may be approximately equal to or less than 4 milliseconds. However, in some embodiments, one neural network optimized for spatial focusing and one neural network optimized for noise reduction may be used.
At step 3102, the neural network circuitry (e.g., the enhancement circuitry 412, 1912, and/or 2012) receives multiple input audio signals originating from either or both of one or more non-beamformed audio signals and one or more beamformed signals. In some embodiments, the multiple input audio signals may be frequency-domain audio signals. In some embodiments, the multiple input audio signals may include one or more frequency-domain non-beamformed signals. In some embodiments, the multiple input audio signals may include one or more beamformed frequency-domain signals. In some embodiments, the multiple input audio signals may include one or more frequency-domain non-beamformed audio signals and one or more frequency-domain beamformed signals. A frequency-domain non-beamformed audio signal may originate from a microphone signal that has undergone processing by analog processing circuitry, digital processing circuitry, and STFT circuitry. In some embodiments, the one or more frequency-domain non-beamformed signals may include a signal originating from a signal generated by a front microphone. In some embodiments, the one or more frequency-domain non-beamformed signals may include a signal originating from a signal generated by a back microphone. In some embodiments, the one or more frequency-domain non-beamformed signals may include a difference between two signals originating from signals each generated by one of two different microphones. Further description of non-beamformed signals may be found above (e.g., with reference to the frequency-domain non-beamformed audio signals 420).
A beamformed frequency-domain signal may originate from two or more microphone signals that have undergone processing by the analog processing circuitry, the digital processing circuitry, beamforming circuitry, and the STFT circuitry. Further description of beamformed signals may be found above (e.g., with reference to the frequency-domain beamformed audio signals 422).
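As a rough sketch of assembling such inputs (hypothetical names; the window, hop, microphone spacing, and the frequency-domain delay-and-sum beam are illustrative stand-ins for the STFT and beamforming circuitry described above):

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    # Windowed frames, then a real FFT per frame: (frames, bins).
    win = np.hanning(n_fft)
    frames = [win * x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(frames, axis=-1)

def network_inputs(front, back, mic_dist=0.01, fs=16000, c=343.0, n_fft=256):
    # Non-beamformed inputs: front mic, back mic, and their difference.
    # Beamformed input: a simple frequency-domain delay-and-sum beam
    # steered to the front (advance the back mic by the travel time).
    f, b = stft(front, n_fft), stft(back, n_fft)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    beam = f + b * np.exp(2j * np.pi * freqs * mic_dist / c)
    return np.stack([f, b, f - b, beam])       # (signals, frames, bins)
```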
At step 3104, the neural network circuitry implements a neural network trained to generate, based on the multiple input audio signals, a noise-reduced and/or spatially-focused output audio signal or an output for generating a noise-reduced and/or spatially-focused output audio signal. In some embodiments, the neural network may be trained to perform noise reduction. In some embodiments, the neural network may be trained to perform spatial focusing. In some embodiments, the neural network may be trained to perform noise reduction and spatial focusing. The neural network circuitry may be configured to output a single output based on the multiple input audio signals. Further description of neural networks, noise reduction, and spatial focusing may be found above.
While the above description of the process 3100 has focused on embodiments using frequency-domain signals, in some embodiments time-domain signals may be used.
At step 3202, the ear-worn device collects audio from multiple beams, each of the multiple beams pointing at a different angle around the wearer. In some embodiments, beamforming circuitry may be configured to generate the multiple beams at each time step. In some embodiments, the multiple beams may include between or equal to 10 and 20 beams.
At step 3204, the ear-worn device calculates values for a metric from the audio from the multiple beams. In particular, the ear-worn device may calculate an individual value for the metric for the audio from each individual beam. Thus, each beam may have a different value for the metric. As examples, the metric may be SNR or speaker power. In some embodiments, a neural network may calculate the values for the metric. The neural network may be implemented by neural network circuitry (e.g., the neural network circuitry 424).
At step 3206, the ear-worn device combines the audio from the multiple beams using the values for the metric. In some embodiments, the enhancement circuitry may calculate the sum of the audio from each beam weighted by each beam's value for the metric. For example, if the metric is SNR, then low-SNR beams may be down-weighted and high-SNR beams may be up-weighted. Thus, focus may be placed on those beams having the highest SNR audio. In some embodiments, before summing the beams, the enhancement circuitry may be configured to apply a moving average across the audio from the different beams.
It should be appreciated that at least in some embodiments, the spatial focusing described above may not be based on using image recognition to determine the direction the wearer is looking, nor on using voice recognition to identify sounds coming from a particular speaker.
Having described several embodiments of the techniques in detail, various modifications and improvements will readily occur to those skilled in the art. Such modifications and improvements are intended to be within the spirit and scope of the invention. Accordingly, the foregoing description is by way of example only, and is not intended as limiting. For example, any components described above may comprise hardware, software or a combination of hardware and software.
The indefinite articles “a” and “an,” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.”
The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified.
As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified.
The terms “approximately” and “about” may be used to mean within ±20% of a target value in some embodiments, within ±10% of a target value in some embodiments, within ±5% of a target value in some embodiments, and within ±2% of a target value in some embodiments. The terms “approximately” and “about” may include the target value.
Also, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” or “having,” “containing,” “involving,” and variations thereof herein, is meant to encompass the items listed thereafter and equivalents thereof as well as additional items.
Having described above several aspects of at least one embodiment, it is to be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art. Such alterations, modifications, and improvements are intended to be objects of this disclosure. Accordingly, the foregoing description and drawings are by way of example only.
This application is a Continuation of U.S. Ser. No. 18/477,087, filed Sep. 28, 2023; which claims priority to U.S. Provisional Ser. No. 63/517,755, filed Aug. 4, 2023; which are incorporated herein by reference.