Any and all applications for which a foreign or domestic priority claim is identified in the Application Data Sheet as filed with the present application are hereby incorporated by reference under 37 CFR 1.57.
The present application relates to the field of hearing devices, e.g. hearing aids or headsets, in particular to noise reduction in hearing devices.
In directional noise reduction, the target sound is often assumed to be impinging from a certain direction or position, as illustrated in
For a frequency band, k, given a microphone input signal, x(k), comprising a multitude M of microphone signals x(k)=[x1(k), . . . , xM(k)]T, we can obtain an output signal y from a linear combination of the input signals by multiplying each microphone signal by a (complex-valued) weight, i.e. y=wHx, where w(k)=[w1(k), . . . , wM(k)]T and H denotes the Hermitian transposition. In the present application, this functionality is termed ‘beamforming’ and is provided by a ‘beamformer’.
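As a minimal numerical sketch (illustrative only; the array names and values below are assumptions, not part of the disclosure), the per-band operation y=wHx may, e.g. in Python/numpy, look as follows:

    import numpy as np

    # Beamforming for one frequency band k: y = w^H x.
    M = 2                                                     # number of microphones
    rng = np.random.default_rng(0)
    x = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # microphone signals x(k)
    w = np.array([0.5 + 0.1j, 0.5 - 0.1j])                    # beamformer weights w(k)
    y = np.vdot(w, x)   # np.vdot conjugates its first argument, i.e. y = w^H x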
The optimal beamformer weight wθ that maximizes the signal to noise ratio (SNR) for a given (single) target position θ in a noise field described by the (inter-microphone) noise covariance matrix RV is given by
wθ=RV−1dθ/(dθHRV−1dθ),
where dθ is the relative transfer function between the microphones for signals received from a target position θ, see also
A covariance matrix of a vector X=(X1, . . . , XM)T is a general term for a matrix C, whose elements CXi,Xj are equal to the covariance of Xi and Xj, i.e.
CXi,Xj=E[(Xi−E[Xi])(Xj−E[Xj])*],
where E is the expectation operator.
Covariance matrices are (in the context of audio processing in the time-frequency domain) defined as Rx(k,l)=E [x(k,l) xH(k,l)] (when exemplified for the noisy microphone signals x), where k and l are frequency and time indices, respectively, and x is an M-dimensional vector:
x(k,l)=[x1(k,l), . . . , xM(k,l)]T,
comprising (generally complex) values of each of the M microphone signals for the given frequency and time (k,l).
The inter-microphone target and noise covariance matrices RT(k,l) and RV(k,l) may be defined as respective covariance matrices for the input microphone signal vector x(k,l) comprising values of the M electric input signals (the microphone signals) at frequency index k and time index l, when the input microphone signal vector x(k,l) is labelled as target (T) and noise (V), respectively. The labelling as target (T) and noise (V) may e.g. be provided by a frequency band level voice activity detector (assuming that the target signal comprises voice (e.g. speech)).
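A minimal sketch of such label-driven covariance estimation (recursive first-order smoothing; the smoothing coefficient and the labelling input are assumptions) could look as follows:

    import numpy as np

    # Recursive estimation of RT and RV for one frequency band; each
    # frame contributes its outer product x x^H to either RT or RV,
    # depending on the (here hypothetical) target/noise label.
    M = 2
    alpha = 0.05                        # assumed smoothing coefficient
    RT = np.zeros((M, M), complex)      # inter-microphone target covariance
    RV = np.zeros((M, M), complex)      # inter-microphone noise covariance

    def update_covariances(RT, RV, x, labelled_as_target):
        outer = np.outer(x, x.conj())   # x x^H for the current frame
        if labelled_as_target:          # frame labelled target (T)
            RT = (1 - alpha) * RT + alpha * outer
        else:                           # frame labelled noise (V)
            RV = (1 - alpha) * RV + alpha * outer
        return RT, RV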
In general, the smaller the distance between the sound source and the input transducer (e.g. microphone) picking up its sound, the more this distance matters for the value of the acoustic transfer function (ATF) representing the propagation of sound from the source to the input transducer. In other words, the larger the distance, the smaller the change of the acoustic transfer function per unit length (d(ATF)/dL, L representing distance) for a given direction from input transducer to sound source. Direction may hence be a good approximation for defining the acoustic transfer function for a given position of the sound source, when the distance between the sound source and the input transducer picking up sound from the sound source is above a threshold distance Lth. The threshold distance Lth may e.g. be taken to be in a range from 1 m to 3 m, e.g. around 2 m (e.g. determined in dependence of the distance between the input transducers of the hearing device).
The above expression for optimal beamformer weight wθ is valid under the assumption that the target is impinging from a single position/direction relative to the user (e.g. for an MVDR beamformer). Often this is not true. The target signal may not always impinge from a single position/direction. We may consider several signals as target signals or several positions/directions as target positions/directions at the same time. As described in the present disclosure, several simultaneous sound sources from different positions/directions may e.g. be considered as simultaneous targets, or all sound sources (or every speech source) in the frontal half-plane may e.g. be considered as target signals. Also, in the case of uncertainty about the target position/direction, the target positions/directions may advantageously be assumed to cover a range of possible target positions/directions.
For an MVDR beamformer, the target covariance matrix RT may be determined as the outer product of the steering vector (dθ) and its Hermitian transposition dθH, i.e. RT=dθdθH, where dθ is the steering vector comprising relative transfer functions between the microphones for sound from the (sole) target position θ.
The present disclosure relates (mainly, but not exclusively) to a Generalized EigenVector beamformer (GEV), sometimes also termed a ‘Generalized Eigen Value beamformer’. The weights w of a GEV-beamformer can be determined as the set of weights which maximizes the signal-to-noise ratio (SNR) given by:
SNR=(wHRTw)/(wHRVw),
where the signal to noise ratio (SNR) is defined from a (inter-microphone) target covariance matrix RT and a (inter-microphone) noise covariance matrix RV.
We may estimate the weights w which maximize the SNR as the eigenvector belonging to the largest generalized eigenvalue (hence the name GEV). It should be noticed that the set of weights (w) maximizing the SNR can be found only up to a scalar. We may hence choose to scale the weights e.g. in order to fulfill a unit response towards a pre-defined target position.
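A minimal sketch of this estimation (using scipy's generalized Hermitian eigensolver; the example matrices and the chosen unit-response scaling are assumptions) could be:

    import numpy as np
    from scipy.linalg import eigh

    M = 2
    d = np.array([1.0, 0.8 + 0.2j])          # assumed example steering vector
    RT = np.outer(d, d.conj())               # example (here rank-1) target covariance
    RV = np.eye(M) + 0.1 * np.ones((M, M))   # example noise covariance (pos. definite)

    # eigh solves RT w = lambda RV w; the last column belongs to the
    # largest generalized eigenvalue, i.e. the SNR-maximizing weights.
    eigvals, eigvecs = eigh(RT, RV)
    w = eigvecs[:, -1]

    # The weights are only defined up to a scalar; rescale them so that
    # w^H d = 1 (unit response towards a pre-defined target position).
    w = w / np.vdot(w, d).conj()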
The present disclosure also deals with topics related to a minimum variance distortionless response (MVDR) beamformer.
The MVDR beamformer also referred to in the present disclosure is a special case of the GEV beamformer, where the inter-microphone target covariance matrix RT is a singular target covariance matrix given by the outer product of the steering vector dθ and its Hermitian transposition dθH.
In the present disclosure the term ‘look vector’ is sometimes used instead of the term ‘steering vector’. The term look vector refers to the case where a single target direction or position is considered, e.g. in connection with an MVDR beamformer, where the target direction is often in a look direction of the user (e.g. as determined by the direction of the nose of the user). Further, instead of the term look vector, the term “relative transfer function” may be used.
The present disclosure also deals with topics related to a generalized sidelobe canceller (GSC) beamformer structure. The GSC converts M input signals into a target-preserving path (comprising a target maintaining beamformer) and M−1 independent sidelobe cancelling beamformer paths (comprising respective target cancelling beamformers).
An MVDR beamformer as well as a GEV beamformer can be implemented as, and constrained by means of, a GSC structure. Even though the target may not be limited to a single direction, we may choose to normalize the output of a GEV beamformer such that we obtain e.g. a unit response from a (one) preferred location/direction.
The present disclosure includes a plurality of aspects. It is the intention that features of the devices and methods of the different aspects can be combined between the different aspects.
A general aim of the present disclosure is to provide a basis for a migration from a scenario where a target sound source is a (single) localized (point) source to a scenario where the target sound originates from a multitude of differently located sound sources.
An aspect of the present disclosure relates to a hearing aid comprising a beamformer providing at least one beamformed signal as a linear combination of a multitude of electric input signals, wherein the weights of the beamformer are determined by maximizing a target signal to noise ratio for sound from a plurality of target positions. The target signal to noise ratio is e.g. determined in dependence of first and second output variances of the at least one beamformer determined when the electric input signals or the at least one beamformed signal (Y) are labelled as target and noise, respectively.
In a first aspect of the present application, a hearing aid adapted to be worn by a user is provided. The hearing aid comprises:
The beamformer weights (w) are adaptively optimized to a plurality of target positions (θ) by maximizing a target signal to noise ratio (SNR) for sound from said plurality of target positions (θ), wherein the signal to noise ratio (SNR) is determined in dependence of first and second output variances (|YT|2, |YV|2) or time-averaged first and second output variances (<|YT|2>, <|YV|2>) of said at least one beamformer, and where said first and second output variances (|YT|2, |YV|2) are determined when said electric input signals (x) or said at least one beamformed signal (Y) are labelled as target (T) and noise (V), respectively.
Thereby an improved hearing aid may be provided.
Instead of the term ‘output variance’ the term ‘output power’ may be used in the above defined hearing aid.
The definition of the output as a set of complex weights (w) multiplied by the input signal (x) may assume processing in the time-frequency domain. As is known in the art, this may as well be performed in the time-domain. In the time domain this would correspond to a convolution.
The beamformed signal may be a function of time (l) and frequency (k). The labelling of the electric input signals (or the beamformed signal) as target (T) (e.g. speech) or noise (V), respectively, may be performed on a time-frequency level (k,l) (i.e. for each time-frequency unit). The labelling may be provided by a target signal detector (e.g. comprising a voice activity detector). The target signal detector may e.g. be configured to provide an indicator of whether or not, or with what probability, a given time-frequency unit (k,l) comprises a target signal, e.g. speech.
The term ‘noise’ may in the present context cover signals that are not labelled as target, e.g. natural or artificial noise, or signals representing a disturbance or distraction of the user from the target signal(s).
The (target) signal to noise ratio (SNR) may be expressed as
SNR(k)=<z*z>T/<z*z>V,
where <z*z>T and <z*z>V indicate an average of the signal z for a frequency band k across time frames labelled as target and noise, respectively, and zT(k,l) and zV(k,l) are the signals for time-frequency units labelled as target and noise, respectively.
The hearing aid may comprise a multitude of analysis filter banks configured to provide the electric input signals (x) in a time-frequency representation (k,l), where k is a frequency index and l is a time index. The hearing aid may comprise a multitude (M) of analysis filter banks for converting time domain electric input signals (x) from said multitude (M) of microphones to respective electric input signals (X) in a time-frequency representation (k,l). Each time-frequency unit (k′,l′) of an electric input signal in a time-frequency representation (k,l) comprises a (generally complex) value of the electric input signal at that time (l′) and frequency (k′).
The hearing aid may comprise a target signal detector configured to provide an indicator of whether or not, or with what probability, a given time-frequency unit (k,l) comprises a target signal, e.g. speech. The target signal detector may comprise a voice activity detector (e.g. a speech detector). A target signal does not necessarily have to be a speech signal (even though it is often labelled as such). It may as well be defined as a signal impinging from certain directions (such as the frontal half-plane). The target detector may e.g. comprise a direction of arrival detector.
The hearing aid may comprise a voice activity detector for estimating whether or not, or with what probability, an input signal, or a time-frequency unit of the input signal, comprises a voice signal at a given point in time, and to provide a voice activity control signal indicative thereof. The labelling of target or noise may be provided by the voice activity detector (alone or in combination with other detectors). Typically, the labelling will be based on the input signal x, e.g. at a reference microphone, but the labelling may also be based on a linear combination of signals from more than one microphone (it may be the at least one beamformed signal (Y), but it may alternatively be based on any other beamformer). The voice activity detector may be configured to classify the input signal as dominated by speech or dominated by noise (e.g. non-speech). The voice activity detector may be configured to provide the voice activity control signal in a time-frequency representation, e.g. so that a value of the voice activity control signal is provided for each time-frequency unit (k,l).
The voice activity detector may be based on or comprise an artificial neural network (see e.g.
The voice activity control signal may (in addition to the input audio signal(s), e.g. the/an electric input signal(s) or the beamformed signal) be dependent on spatial information derived from said electric input signals (x) or a signal or signals derived therefrom. The spatial information may be derived from a comparison between a beamformer with its maximum sensitivity towards the frontal half-plane and another beamformer with its sensitivity towards the back half-plane (‘front’ and ‘back’ being e.g. defined relative to the user).
The electric input signals (x) or the at least one beamformed signal (Y) may be labelled as target (T) and noise (V), respectively, in dependence of the voice activity control signal from a voice activity detector.
The target signal to noise ratio may be determined as a difference between, or a ratio of, the first and second output variances (|YT|2, |YV|2) or time-averaged first and second output variances (<|YT|2>, <|YV|2>) of said at least one beamformer.
The target signal to noise ratio may be determined in dependence of the beamformer weights (w) and of time-averaged outer products of the multitude of electric input signals x with themselves, <xxH>T and <xxH>V, where <⋅> denotes average over time, and wherein <xxH>T and <xxH>V are determined when said electric input signals (x) are labelled as target (T) and noise (V), respectively. The labelling may e.g. be provided by the target signal detector (e.g. comprising a voice activity detector and/or a direction of arrival detector).
The target signal to noise ratio may be determined in dependence of said beamformer weights (w) and of an inter-microphone target covariance matrix (RT) and an inter-microphone noise covariance matrix (RV). The signal to noise ratio may e.g. be determined as the ratio
SNR=(wHRTw)/(wHRVw),
where wHRTw approximates said first output variance, and where wHRVw approximates said second output variance. The inter-microphone target covariance matrix (RT) and the inter-microphone noise covariance matrix (RV) are determined when said electric input signals (x) or said at least one beamformed signal (Y) are labelled as target and noise, respectively. Typically, the labelling is based on the input signal x, e.g. at a reference microphone, but the labelling may also be based on signals generated from a linear combination of more than one of the multitude of electric input signals from the (M) microphones.
The at least one beamformer may be implemented as a linear combination of two or more pre-defined or adaptively determined beamformers. The beamformer weights of the pre-defined beamformers may be fixed (e.g. determined during manufacture of the hearing aid or during fitting of the hearing aid to a particular user's needs). The pre-defined beamformers may thus be denoted ‘fixed beamformers’. The two or more pre-defined or adaptively determined beamformers may comprise M pre-defined or adaptively determined beamformers, where M is larger than or equal to two.
The first pre-defined or adaptively determined beamformer may be configured to have a unit response towards a target direction. A unit response towards a target direction may be provided by a target maintaining beamformer. When the beampattern is adapted in order to attenuate noise, only a unit gain constraint towards a target position may be applied. There may, however, in principle be other directions which have more directional amplification than the unit gain direction.
The first pre-defined or adaptively determined beamformer may be denoted a target-maintaining beamformer. The first pre-defined or adaptively determined beamformer may be configured to provide a unit response (sometimes termed a ‘distortionless response’) for a selected one of said multitude M of microphones, said microphone being denoted the reference microphone.
The one or more second pre-defined or adaptively determined beamformers may be configured to have a spatial minimum towards a respective one of said plurality of target positions. A spatial minimum towards a target direction may be provided by a target cancelling beamformer. The second pre-defined or adaptively determined beamformers may be denoted target-cancelling beamformers. The second pre-defined or adaptively determined beamformers may be constituted by M−1 second beamformers.
The at least one beamformer may be implemented as a generalized sidelobe canceller (GSC) for providing at least one beamformed signal (Y) in dependence of said adaptively determined beamformer weights (w), wherein the generalized sidelobe canceller comprises a multitude M of fixed beamformers, one of the M fixed beamformers being a target signal maintaining beamformer, and M−1 of the fixed beamformers being target-cancelling beamformers, each being configured to generate a beamformed signal in dependence of associated fixed beamformer weights (wF,m, m=1, . . . , M) and wherein the adaptively determined beamformer weights (w) are determined in dependence of an adaptive parameter β or parameter vector β.
The adaptive parameter β or parameter vector β may be determined in dependence of time-averaged values of the target-maintaining beamformer and the M−1 target-cancelling beamformers.
Alternatively, the adaptive parameter β is determined in dependence of the target covariance matrix and the noise covariance matrix.
In a generalized sidelobe canceller we may express the output Y in terms of the fixed beamformers and the adaptive parameter vector β, i.e.
Y=C0−βTC,
where β is an (M−1)×1 vector, and C is an (M−1)×1 vector of target cancelling beamformers. It can be mentioned that in some other texts (e.g. in the reference handbook quoted in [Bitzer & Simmer; 2001]) β is conjugated compared to the definition we use here. The beamformer output (Y) may as well be rewritten as
Y=C0−Σm βmCm, m=1, . . . , M−1.
The fixed, distortionless target beamformer C0 and the fixed target cancelling beamformers C can be expressed in terms of the input signal x and the beamformer weights expressed by the M×1 vector a and the M×(M−1) blocking matrix B, respectively, i.e.
C0=aHx and C=BHx.
The output may thus be expressed in terms of the input signal x and the parameters a, B and β:
Y=aHx−βTBHx=(a−Bβ*)Hx.
We see that the output may be expressed either in terms of the fixed beamformers or in terms of the input signal.
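A minimal numerical sketch verifying this equivalence (example values only; the convention Y=C0−βTC from above is used, so the input-signal form reads (a−Bβ*)Hx):

    import numpy as np

    rng = np.random.default_rng(1)
    M = 3
    x = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # input signals
    a = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # target-maintaining weights
    B = rng.standard_normal((M, M - 1)) + 1j * rng.standard_normal((M, M - 1))  # blocking matrix
    beta = rng.standard_normal(M - 1) + 1j * rng.standard_normal(M - 1)

    C0 = np.vdot(a, x)                    # C0 = a^H x
    C = B.conj().T @ x                    # C  = B^H x
    Y1 = C0 - beta @ C                    # Y = C0 - beta^T C (beta not conjugated here)
    Y2 = np.vdot(a - B @ beta.conj(), x)  # Y = (a - B beta*)^H x
    assert np.allclose(Y1, Y2)            # the two expressions agree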
The average of the squared-magnitude of the output may thus be expressed as
<|Y|2>=<(C0−βTC)(C0−βTC)*>,
which can be re-written as a sum of averages across different beamformer products, or as
<|Y|2>=(a−Bβ*)HR(a−Bβ*),
i.e. where the output instead is expressed in terms of the covariance matrix R (e.g. RX, RT or RV, depending on the input samples which are selected for averaging) and the parameters a, B and β.
We may notice that the estimation of covariance matrices requires averages across M real-valued terms and averages across M×(M−1)/2 complex-valued terms (taking into account that the covariance matrix is Hermitian). Similarly, we also see that the expression in terms of sums of beamformer products results in averages across M real-valued beamformer product terms and averages across M×(M−1)/2 complex-valued beamformer products (again taking into account that each complex-valued beamformer product also is represented by its complex conjugate).
The target covariance matrix (RT) may be determined in advance of normal use of the hearing aid. The target covariance matrix (RT) may e.g. be determined during manufacture or during a fitting session. The target covariance matrix (RT) may e.g. be determined in dependence of user input, e.g. about currently preferred target positions (e.g. directions, e.g. indicated via a user interface, e.g. a graphical user interface, e.g. of an APP of a smartphone, or similar (e.g. portable, e.g. handheld) processing device). The target covariance matrix (RT) may e.g. be determined in dependence of prior knowledge of one or more target positions.
The first output variance may be determined based on target directions determined in advance of normal use of the hearing aid.
2nd Aspect: (Target Covariance Matrix Determined in Dependence of a Multitude of Steering Vectors (dθ) for the Currently Present Target Sound Sources)
In an aspect, the present disclosure relates to a hearing aid comprising microphone/beamformer configurations that can be represented by a full-rank target covariance matrix. The 2nd aspect relates (mainly, but not exclusively) to a Generalized Eigenvector beamformer (GEV) (or an approximation thereof).
It is proposed to adaptively optimize the beamformer weights (w) of a beamformer to a plurality of target positions (θ), e.g. by maximizing a target signal to noise ratio (SNR) for sound from said plurality of target positions (θ). The signal to noise ratio may be expressed in dependence of the beamformer weights (w) and an inter-microphone target covariance matrix RT.
The inter-microphone target covariance matrix RT of a beamformer may be updated as
RT=Σθσθ2dθdθH,
where σθ2 is the target variance for a target position θ, dθ is a steering vector for a given position θ, and H denotes Hermitian transposition, and a summation is made over a plurality of (target sound source) positions θ. The steering vector dθ is defined as the relative transfer function (in a frequency band k) between a reference microphone and the other microphones, for a given target (sound source) position θ, i.e. dθ(k)=hθ(k)/hθ(k,ref), where hθ(k) is a vector of transfer functions from the source position θ to each of the microphones (m=1, . . . , M), and hθ(k, ref) is the transfer function from the source position θ to the reference microphone (m=ref). The summation may be over separately identified relevant target sound source positions. The positions of the target sound sources (relative to the hearing device microphones) may be approximated by directions to the target sound sources (i.e. from the hearing device microphones to the target sound source positions). In the present disclosure the parameter θ is intended to cover both interpretations (position, direction, where position e.g. may be represented by a combination of direction and distance relative to the hearing device microphones).
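A minimal sketch of building such a full-rank target covariance matrix (the steering vectors and variances below are assumed example values, not measured transfer functions):

    import numpy as np

    # RT = sum_theta sigma_theta^2 * d_theta d_theta^H over a small set
    # of assumed target positions (here three, for M = 2 microphones).
    steering = [np.array([1.0, 1.0 + 0.0j]),
                np.array([1.0, 0.9 + 0.3j]),
                np.array([1.0, 0.9 - 0.3j])]
    variances = [1.0, 0.5, 0.5]          # sigma_theta^2 per position

    M = steering[0].size
    RT = np.zeros((M, M), complex)
    for var, d in zip(variances, steering):
        RT += var * np.outer(d, d.conj())
    # With non-collinear steering vectors, RT has rank > 1 (full rank here).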
The target variance σθ2 may be regarded as a direction-dependent weighting of the different directions.
The target covariance matrix RT may be estimated as a fixed full-rank matrix by different means:
1) Estimate the most likely target directions and obtain RT=ΣθpθdθdθH, as described below.
2) Estimate the most likely target directions and create a target covariance matrix from a weighted sum of desired target directions. E.g. a sum of target steering vectors from a frontal half-plane, a sum of front target steering vectors obtained from a group of individuals.
3) RT may be “calibrated” in a special calibration mode, where a target sound (in absence of noise) is played from the target direction. In an MVDR beamformer, we would solely estimate the steering vector from the target covariance matrix (assuming that the target is a point source) e.g. from the eigenvector belonging to the largest eigenvalue. Here we assume that the target is not fully described by a single direction, and instead we keep the full-rank target covariance matrix.
For the MVDR beamformer, the target covariance matrix RT=σθ2dθdθH (one target sound source) has rank 1. In cases where the target covariance matrix (RT) does not have rank 1 (but rank >1), the MVDR solution cannot be used and a GEV (or approximated GEV) beamformer may be used instead. For a GEV beamformer (in the presence of multiple target sound sources at different positions θ=θ1, . . . , θNTS, where NTS is the (current) number of target sound sources), the target covariance matrix (RT) may, according to the present disclosure (as indicated above), e.g. be expressed as a linear combination (e.g. a sum) of outer products of the steering vectors dθ of the individual target sound sources located at respective positions θ, RT=Σθσθ2dθdθH.
The summation in the expression of the target covariance matrix RT may e.g. be a weighted sum of the (outer) product of the most likely target steering vectors (see e.g.
RT=ΣθpθdθdθH,
where pθ is a (real) weight factor (e.g. a probability estimate) of a given position θ.
The parameters pθ and σθ2 are not necessarily related. pθ is a probability, and probabilities usually sum to 1, whereas the variances σθ2 do not necessarily sum to 1. However, when maximizing the SNR given by the below equation
SNR=(wHRTw)/(wHRVw),
the optimal weight (w) does not change, if we e.g. multiply RT by a scalar. Hence, if the sound source from a given direction is very energetic, it will dominate the target covariance matrix. And it would most likely also be estimated as the most likely target direction.
In a 2nd aspect of the present application, a hearing aid adapted to be worn by a user is provided. The hearing aid comprises:
Thereby an improved hearing aid may be provided.
The hearing aid is configured to adaptively optimize the beamformer weights to maximize the target signal-to-noise ratio (SNR). The target signal-to-noise ratio (SNR) may be given by:
SNR=(wHRTw)/(wHRVw),
where the superscript H denotes Hermitian transposition.
The hearing aid may be configured to adaptively optimize the beamformer weights (w) by estimating the beamformer weights as the eigenvector belonging to the largest generalized eigenvalue.
The directional noise reduction system (e.g. the at least one beamformer) may comprise (or be constituted by) a Generalized Eigenvector beamformer (GEV). The term GEV is taken from [Warsitz & Haeb-Umbach; 2007]. However, in [Warsitz & Haeb-Umbach; 2007] a slightly different SNR is maximized, namely the ratio between the noisy input mixture and the noise estimate. In the present disclosure, on the other hand, a GEV-term is used wherein we maximize the ratio between a target covariance matrix and a noise covariance matrix. The structure of the equation is similar (i.e. same equations to estimate the maximum), but the covariance matrix in the numerator is defined differently.
The beamformed signal provided by the at least one beamformer may be formed as a linear combination of at least two (e.g. all) of said multitude of electric input signals, by multiplying each such electric input signal by a (e.g. complex-valued) beamformer weight (w).
The beamformer weights for the plurality of target positions (e.g. directions) (θ) to currently present target sound sources may be determined in dependence of a multitude of steering vectors (dθ) for the currently present target sound sources.
The inter-microphone target covariance matrix RT may be determined in dependence of steering vectors of the currently considered target sound sources. The inter-microphone target covariance matrix RT may be determined as a (possibly weighted) sum of the (outer) product of the steering vectors dθdθH of the currently considered target sound sources (from positions θ=θ1, . . . , θNTS, where NTS is the (current) number of target sound sources).
The summation in the expression of the target covariance matrix RT may be a weighted sum of the outer product of the most likely target steering vectors according to the following expression
RT=ΣθpθdθdθH,
where pθ is a (real) weight factor (e.g. a probability estimate) of a given position (θ).
The target covariance matrix (RT) may be adaptively determined. The plurality of target positions (θ) of the target sound sources and/or the corresponding target covariance matrix (RT) may be adaptively determined. The plurality of target positions (θ) of the target sound sources and/or the target covariance matrix (RT) may e.g. be adaptively determined using a maximum likelihood (ML) procedure, see e.g. EP3413589A1.
The target covariance matrix (RT) may be pre-determined. The target covariance matrix (RT) may e.g. be determined in a procedure prior to the normal operation of the hearing aid by the user, e.g. during manufacture or fitting of the hearing aid. The target covariance matrix (RT) may thus be fixed during use of the hearing aid.
The (pre-determined) target covariance matrix (RT) may be determined corresponding to a set of pre-determined target positions (θ).
The (pre-determined) target covariance matrix (RT) may be determined via a calibration routine in a separate calibration mode.
The target covariance matrix (RT) may be determined as a sum of outer products of steering vectors (dθ,p) for different persons (p). The target covariance matrix (RT) may be determined as a weighted sum of outer products of steering vectors (dθ,p) for different persons (p=1, . . . , P), the weights (wd,p) being e.g. dependent on age, or gender (e.g. relative to the age or gender of the user).
The target covariance matrix (RT) may be determined as a sum of RT-matrices obtained from different individuals. An individual target covariance matrix may e.g. be estimated by playing a sound from a desired target direction and estimating the target covariance matrix from the hearing aid microphone signals while the hearing instrument is mounted for intended use on an individual. During absence of noise (or during high SNR), a good estimate of a target covariance matrix is easy to obtain from an individual hearing aid user. By basing the target covariance matrix on an average across many individuals, we can ensure that the hearing instrument performs well across a population of individual users.
The target covariance matrix (RT) may be pre-determined as a Rank >1 matrix of size M×M, where M is the number of electric input signals. The target covariance matrix may e.g. be a full rank matrix.
The number M of currently active microphone signals may e.g. be equal to the number of microphones of the microphone system.
The hearing aid may comprise a processor configured to apply one or more processing algorithms to the beamformed signal, or to a further noise reduced signal, and to provide a further processed signal. One of the processing algorithms may be a compressive amplification algorithm configured to compensate for a hearing impairment of the user.
The hearing aid may comprise a transform unit for converting a time domain signal to a signal in the transform domain (e.g. frequency domain or Laplace domain, Z transform, wavelet transform, etc.). The transform unit may be constituted by or comprise a time-frequency-conversion unit for providing a time-frequency (TF) representation of an input signal.
The hearing aid may comprise a multitude of time-frequency-conversion units for providing the multitude M of time-domain electric input signals xm(n), m=1, . . . , M, in a time-frequency representation (k,l), where k and l are frequency and time indices, respectively. The hearing aid may comprise a synthesis filter bank for converting the time-frequency representation of a signal (e.g. the further processed signal) to an output signal in the time domain. The time-domain output signal may be fed to an output transducer of the hearing device, e.g. to a loudspeaker, for providing the output signal as stimuli perceivable for the user of the hearing device as sound.
The time-frequency representation may comprise an array or map of corresponding complex or real values of the signal in question in a particular time and frequency range. The TF conversion unit may comprise a filter bank for filtering a (time varying) input signal and providing a number of (time varying) output signals each comprising a distinct frequency range of the input signal. The TF conversion unit may comprise a Fourier transformation unit (e.g. comprising a Discrete Fourier Transform (DFT) algorithm, or a Short Time Fourier Transform (STFT) algorithm, or similar) for converting a time variant input signal to a (time variant) signal in the (time-) frequency domain. The hearing aid may comprise respective inverse transform units according to the particular application to convert one or more signals from the transform domain in question to the time domain.
The hearing aid may comprise a postfilter connected to said at least one beamformer and adapted to receive said at least one beamformed signal, or a processed version thereof, and wherein said postfilter is configured to provide gains to be applied to said at least one beamformed signal, or to said processed version thereof, in dependence of a postfilter target signal to noise ratio (PF-SNR) to provide a further noise reduced signal. The postfilter target signal to noise ratio (PF-SNR) is preferably estimated with a smaller time constant than the target signal to noise ratio (SNR) being maximized by the optimized weights of the at least one beamformer. The target and noise covariance matrices (RT and RV, respectively) used in an estimation of the postfilter target signal to noise ratio (PF-SNR) may e.g. be averaged with a smaller time constant than the target signal to noise ratio (SNR). This is e.g. illustrated in
The hearing aid may be configured to provide the target signal to noise ratio (SNR) and/or the postfilter target signal to noise ratio (PF-SNR) in a time-frequency representation. The directional noise reduction system may comprise the postfilter.
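As a minimal sketch of the two time constants (the coefficient values are assumptions; a larger coefficient corresponds to a smaller time constant):

    # First-order recursive averaging used at two different speeds:
    # slow smoothing for the beamformer SNR, fast smoothing for PF-SNR.
    alpha_slow = 0.01   # assumed: large time constant (beamformer SNR)
    alpha_fast = 0.2    # assumed: small time constant (postfilter SNR)

    def smooth(avg, new, alpha):
        # one exponential-averaging step
        return (1 - alpha) * avg + alpha * new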
The at least one beamformer may be configured to exhibit a distortionless response for a specified position of a target sound source. The distortionless response for a specified position of a target sound source can be expressed as wHdθ=1, i.e. (wHdθ)(dθHw)=1, where w is a vector comprising the optimized weights of the at least one beamformer and dθ is the steering vector for the specified preferred target position (θ).
The beamformer weights (w) of the at least one beamformer may be determined to provide that the at least one beamformer exhibits a distortionless response for a specified position (θ) of a target sound source.
The beamformer weights may e.g. be further adapted in order to optimize the SNR of a target signal impinging from a range between +/−15°, but the beamformer weights may be normalized in order to achieve a distortionless response from 0°.
The weights of the at least one beamformer may be normalized such that the sound from said plurality of target positions (θ) exhibits a unit response. A condition for providing a unit response from a plurality of target positions may be written as wHRTw=1, as a constraint on the optimized weights of the at least one beamformer providing unit energy of sound from the plurality of target directions.
The hearing aid may comprise an output transducer configured to provide output stimuli perceivable as sound to the user in dependence of the beamformed signal or the further processed signal.
The hearing aid may be constituted by or comprise an air-conduction type hearing aid, a bone-conduction type hearing aid, a cochlear implant type hearing aid, or a combination thereof.
The 3rd aspect relates (mainly, but not exclusively) to a Generalized Eigenvector beamformer (GEV). The 3rd aspect may relate to configurations that can be represented by a full-rank covariance matrix.
It is proposed to update the inter-microphone target covariance matrix RT and the inter-microphone noise covariance matrix RV of a beamformer based on voice activity detection, e.g. based a) on an estimated direction of arrival of sound from a target sound source, b) on a comparison of signal content provided by a target-maintaining and a target cancelling beamformer, respectively (e.g. a difference between the two), or on c) speech detection.
In a third aspect of the present application, a hearing aid adapted to be worn by a user is provided by the present disclosure. The third hearing aid comprises:
The beamformer may be configured to update said inter-microphone target covariance matrix RT and said inter-microphone noise covariance matrix RV in dependence of a detection of voice activity, wherein said voice activity detection is based on one or more of
The purpose of the adaptation of the beamformer weights to a plurality of positions may be either to enhance a plurality of target positions, or to enhance a target signal whose direction is acoustically defined by a full-rank target covariance matrix.
In the present context, the term ‘speech detection’ is intended to include ‘voice detection’ and may e.g. be provided by a binary voice/no voice detection of a voice activity detector. The speech (voice) detection may further include an undecided state, when the presence or absence of speech (voice) cannot be decided. The voice activity detector may e.g. be implemented as a neural network based on training examples labelled in time and frequency as either voice or no voice (cf. e.g.
The beamformer weights (w) of the at least one beamformer may be determined to provide that the at least one beamformer exhibits a distortionless response for a specified position (θ) of a target sound source.
The specified position may be constituted by a fixed position, e.g. in front of the listener. The specified position may, however, be allowed to change over time, e.g. based on the dominant direction (eigenvector) of the target covariance matrix, or simply based on selecting a column of the target covariance matrix (assuming that the target covariance matrix is close to rank 1). In an embodiment, only specified positions within a range of possible positions are allowed.
The signal-to-noise ratio (SNR) may be given by:
SNR=(wHRTw)/(wHRVw),
where H denotes Hermitian transposition.
The directional noise reduction system may comprise a Generalized Eigenvector beamformer (GEV). The GEV beamformer may be configured to update an inter-microphone target covariance matrix RT and a noise covariance matrix RV in dependence of voice activity detection, as indicated in a), b) or c) above.
The hearing aid may comprise a transform unit for converting a time domain signal to a signal in the transform domain (e.g. frequency domain or Laplace domain, Z transform, wavelet transform, etc.). The hearing aid may comprise respective inverse transform units according to the particular application to convert one or more signals from the transform domain in question to the time domain.
The voice activity detection (specifically the comparison of signal content provided by a target-maintaining and a target cancelling beamformer, respectively) may be based on b′) a difference or a ratio between signal content provided by a target-maintaining and a target cancelling beamformer.
The hearing aid may comprise a front-back detector based on a comparison between a beamformer pointing towards the front and a beamformer pointing towards the back (front and back being defined relative to the user; a front direction being e.g. defined as a direction of the user's nose or a look direction of the user).
The beamformer pointing towards the front may be configured to have a lowest sensitivity (e.g. a null) for sound impinging from the back (cf.
The beamformer pointing towards the front may be a super-cardioid configured to have its maximum sensitivity for sound impinging from the front of the user (cf.
The front and the back beamformers may be based on two microphones (i.e. first order beampatterns), preferably microphones located in a horizontal plane (e.g. relative to the head of the user, on a line or close to a line parallel to the front-back axis, e.g. defined by the nose of the user). The front and back beamformers may, however, be based on all available microphone signals.
The target direction of the front and back facing beamformers may be fixed, e.g. tied to a look direction (θ) of the user defining a pre-defined steering vector dθ.
The target direction may however, deviate from the look direction of the user.
The fixed beampatterns may be chosen based on any desired target angle of interest, e.g. a fixed beampattern that optimizes
where gθ is a possible weight of the signal impinging from the direction θ, and dθ is a relative acoustic transfer function from the direction θ known in advance. The summations Σθ∈target and Σθ∈noise indicate a sum across specific (e.g. pre-recorded) steering vectors (dθ) assigned as belonging to target directions and to noise directions, respectively.
The inter-microphone target covariance matrix RT and the inter-microphone noise covariance matrix RV may be updated in dependence of respective target- and noise-update criteria related to the comparison of signal content provided by a target-maintaining and a target cancelling beamformer, respectively.
The target update criterion and the noise update criterion may be complementary (e.g. in that the noise update criterion is equal to NOT (target update criterion)).
The target-update criterion may e.g. be
If: log|CF(t,f)|−log|CB(t,f)|>κF: Update RT
where ‘log’ is a logarithm function, |⋅| denotes an absolute value (magnitude), CF and CB refer to the outputs of the target-maintaining (e.g. front facing) and target cancelling (e.g. back facing) beamformers, respectively, and κF and κB are thresholds of the respective beamformers. The thresholds κF and κB are illustrated in
The noise-update criterion may e.g. be
If: log|CF(t,f)|−log|CB(t,f)|≤κB: Update RV.
The parameter κ (kappa) corresponds to the magnitude difference between the two beampatterns for a given angle (θ). For the particular plot, the beampattern difference as function of angle is a monotonically decreasing function, so in the present particular case, a difference between the front and back beamformer of e.g. 3 dB may correspond to the angle where CF−CB=3 dB. So, for all angles where the difference is greater than 3 dB, we update the inter-microphone target covariance matrix RT, and for all angles where the difference is smaller than e.g. −3 dB we update the inter-microphone noise covariance matrix RV. For the angles where the difference between CF and CB is neither greater than +3 dB nor smaller than −3 dB, we do not update either of the two covariance matrices.
The thresholds κF and κB may e.g. be equal, e.g. equal to 3 dB. If the thresholds are equal, the update is front-back symmetric, and either RT or RV is updated.
It may be advantageous not to perform updates when the difference CF−CB is (numerically) small. The same goes for other thresholds, e.g. a voice-activity-based threshold.
The thresholds κF and κB may e.g. be adaptively determined, e.g. in dependence of a current acoustic environment.
The inter-microphone target covariance matrix RT and the inter-microphone noise covariance matrix RV may be updated recursively, when the respective target- and noise-update criteria are fulfilled. The target and noise covariance matrices may be updated recursively with the same time constant or with different time constants. The time constant may change based on a sensor input, e.g. if it is detected that the head is moving or turning, the update rate may be increased.
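A minimal sketch combining the front/back update criterion with the recursive updates (the threshold values, smoothing coefficient and dB convention for 'log' are assumptions):

    import numpy as np

    kappa_F = 3.0    # assumed front threshold [dB]
    kappa_B = -3.0   # assumed back threshold [dB]
    alpha = 0.05     # assumed smoothing coefficient

    def update(RT, RV, x, CF, CB):
        # CF, CB: front/back beamformer outputs for this time-frequency unit
        diff_db = 20 * np.log10(np.abs(CF)) - 20 * np.log10(np.abs(CB))
        outer = np.outer(x, x.conj())
        if diff_db > kappa_F:            # judged to impinge from the front
            RT = (1 - alpha) * RT + alpha * outer
        elif diff_db <= kappa_B:         # judged to impinge from the back
            RV = (1 - alpha) * RV + alpha * outer
        # otherwise: dead zone, update neither matrix
        return RT, RV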
The front back ratio (FBR) optimizing beamformer may be based on estimated target (RT) and noise (RV) covariance matrices that are determined when the respective target- and noise-update criteria are fulfilled.
The dashed line in
The front back decision may be implemented using a neural network. An implementation of an input stage comprising an FBR optimizing beamformer using a neural network (e.g. comprising Gated Recurrent Unit (GRU) layers) is illustrated in
The target covariance matrix (RT) may be a fixed (predetermined), full rank matrix and only the noise covariance matrix (RV) may be adaptive.
The direction-based decision may be combined with a decision based on voice activity. E.g. the target covariance may only be updated when the time-frequency unit fulfils a direction-based criterion as well as when voice (or other sounds defined as sounds of interest) is present.
The combination with a voice activity detector is advantageous: combining a voice activity detector (VAD) with a direction-based decision (e.g. based on FBR) can ensure that the target covariance matrix is updated only when both a voice-based and a direction-based criterion are fulfilled.
The target covariance matrix RT is updated when voice is active AND the signal is impinging from the front.
The options for when noise covariance matrix RV is updated may differ from when the target covariance matrix RT is updated:
RV is updated when the signal is impinging from the back.
RV is updated when the signal is impinging from the back AND no speech is detected.
RV is updated when the signal is impinging from the back OR no speech is detected.
An example of a speech detector (e.g., a voice activity detector) implemented as a neural network is described in
Instead of a front-back detector based (general) voice activity detector as described in connection with
Instead of a voice activity detector for detecting voice in an environment of the user, an own-voice detector may be implemented in a similar manner (by denoting the user's own voice as ‘target’ and substituting front- and back-facing beamformers with beamformers pointing towards the user's mouth and opposite, the latter e.g. being implemented as an own-voice cancelling beamformer).
In the framework of
A joint decision across frequency (OVDjoint(k,l)) may be based on a trained neural network. The neural network may be a feed forward network, a convolutive network or a recurrent network. The input layer may be based on more than one time frame of comparisons between at least one own voice cancelling beamformer and another linear combination of the input microphones, such as an own voice enhancing beamformer. The training of the neural network may be based on audio samples which are labelled either as own voice or no own voice. At the output layer, before applying a threshold, a nonlinear activation function, e.g. a sigmoid function, may be applied.
In addition to the comparison between beampatterns, the own voice decision may be based on a (general) voice activity detector, such that own voice is only detected when the beampattern comparison indicates own voice and the voice activity detector detects voice. The voice activity detector may e.g. be based on a trained neural network as well (cf. e.g.
Own voice decisions may be found jointly across hearing aids in a binaural setup. E.g. “own voice” may only be detected if both hearing aids detect own voice, and “no own voice” may only be detected if both instruments detect no own voice, meaning that we may also have time frames which are labelled “undecided”.
The own voice cancelling beamformer and the own voice enhancing beamformer may as well be applied as separate inputs to the detector (before comparison, hereby leaving the comparison to the neural network).
In an embodiment the neural network is implemented as a simple linear regression unit.
A target covariance matrix (RT) formed as a weighted sum of (e.g. currently) possible target steering vectors, i.e. RT=Σθ=θ1, . . . , θNΘ pθdθdθH, may be applied.
The position range included in the search for currently relevant positions of target sound sources may e.g. be limited based on prior knowledge related to a specific listening situation (e.g. a specific acoustic environment). The position range may e.g. be quantized, see e.g. EP3413589A1.
The range of positions evaluated by the MLE method may e.g. be the full range around the user relevant for a conversation between the user and one or more communication partners (e.g. distance ∈ [0.5 m; 3 m] AND angle ∈ [0°, 360°]). The range of positions evaluated by the MLE method may e.g. be a subrange, or a plurality of subranges, of the full range (e.g. a sub-range of a front half plane (e.g. a sub-range around 0°) and a sub-range of a back half plane (e.g. a sub-range around 180°), etc.). The range of angles evaluated by the MLE method may e.g. be [0°, 360°] (full range around the user) or limited to a sub-range, e.g. to [−90°, +90°] (e.g. a front half plane, e.g. relative to the user's look direction), or to [0°, +180°]/[0°, −180°] (the left or right half-plane), or to sub-ranges thereof.
The term ‘a range of positions’ may (alternatively or additionally) be understood as ‘a range of positions across individuals’, e.g. a specific position measured across different people.
The most likely target position (or target steering vector) may be selected from a dictionary of different candidates. Instead of selecting a single target steering vector (e.g. for use in an MVDR beamformer) a number (NΘ) of (currently) possible target steering vectors (given the current (noisy) electric input signals picked up by the input transducers of the hearing device) may be identified as the steering vectors (dθ) having the largest likelihood (e.g. probability, pθ). The steering vectors having the largest likelihood may e.g. be taken to be the steering vectors having a likelihood larger than a threshold value (pth), or the NΘ steering vectors corresponding to the NΘ largest (current) likelihood values (e.g. probabilities). The probability threshold value (pth) may e.g. be larger than or equal to 0.4, e.g. in a range between 0.5 and 1, such as larger than or equal to 0.6. The number (NΘ) of largest likelihood values may e.g. be in the range between two and ten, e.g. between three and five, e.g. equal to three or four.
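A minimal sketch of such dictionary-based selection (dictionary entries, likelihoods, NΘ and pth are assumed examples):

    import numpy as np

    def target_covariance(dictionary, p, n_best=3, p_th=0.4):
        # dictionary: list of M-dim steering vectors; p: likelihood per entry
        p = np.asarray(p)
        picked = np.argsort(p)[::-1][:n_best]          # n_best largest likelihoods
        picked = [i for i in picked if p[i] >= p_th]   # optional threshold
        M = dictionary[0].size
        RT = np.zeros((M, M), complex)
        for i in picked:
            RT += p[i] * np.outer(dictionary[i], dictionary[i].conj())
        return RT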
A target covariance matrix (RT) may also be given by an average of target covariance matrices obtained across different individuals.
A mathematical division in an LMS/NLMS update may be less accurate compared to a division used in a single step (direct) estimation. In the NLMS case, however, the division may be implemented simply by a shift operator, which is computationally much cheaper.
The LMS and NLMS (and sign LMS) solutions have, compared to a single step (direct) estimation, the implementation-related advantage that they only consider second order terms (two numbers multiplied) rather than 4th order terms (i.e. 4 numbers multiplied); 4th order multiplications are much harder to implement in fixed-point arithmetic, as the required bit-width is much wider.
In a fourth aspect of the present application, a hearing aid adapted to be worn by a user is provided by the present disclosure. The fourth hearing aid comprises:
The M×1 target-maintaining beamformer refers to the number (M) of input signals to the target maintaining beamformer.
For example, a beamformed signal for attenuating sound from said specified position of the target sound source, such as the output of a target cancelling beamformer of the at least M−1 target cancelling beamformers, can be seen as a target cancelling signal.
The NLMS algorithm may be normalized in different ways. The NLMS update term of β (μ<⋅>) may be normalized using the norm of the fixed target cancelling beamformers C, i.e.
β(n+1)=β(n)+μ(C*Y)/(CHC).
Alternatively, the NLMS update of β may be normalized using the norm of the output signal (Y):
β(n+1)=β(n)+μ(C*Y)/(Y*Y).
Both the numerator and the denominator in the above equations may be averaged across time, i.e. using <C*Y> and <CHC> (or <Y*Y>), where <⋅> denotes a time average.
For example, C*Y can be expressed as
C*Y=[C1*Y, . . . , CM−1*Y]T,
where C=[C1, . . . , CM−1]T.
However, in this particular case, smoothing is not necessary, as the recursive update of β automatically applies smoothing.
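A minimal sketch of the first normalization variant above (the step size μ and the division guard are assumed choices):

    import numpy as np

    def nlms_step(beta, C, Y, mu=0.1, eps=1e-8):
        # beta, C: (M-1)-dim complex vectors; Y: complex beamformer output.
        # Update term mu * C*Y, normalized by the norm of C (C^H C).
        return beta + mu * (C.conj() * Y) / (np.vdot(C, C).real + eps)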
An update of β for complex valued signals can be expressed as
β(n+1)=β(n)+μ<C*Y>.
The real part of β can be updated as
Re{β(n+1)}=Re{β(n)}+μ Re{<C*Y>}.
The imaginary part of β can be updated as
Im{β(n+1)}=Im{β(n)}+μ Im{<C*Y>}.
For example, in the case of M=2 microphones, we have
In this particular case, β1 may be updated as
or
For example, in the case of M>2 microphones, β1 can be updated as
The sign(⋅) operator shall be understood such that we update the real part of β in the direction of the sign of the real part of the gradient, and the imaginary part of β in the direction of the sign of the imaginary part of the gradient. As the sign LMS is very simple, it may be advantageous to update β several times within one frame, i.e. update β at a higher rate than the frame rate, e.g. 10 times per frame.
Rather than estimating the size of the update step of the gradient algorithm, we may simply take a step in the direction of the gradient, e.g.,
β(n+1)=β(n)+μ sign(<C*Y>),
where sign(C*Y)=sign(Re(C*Y))+i sign(Im(C*Y)).
In the special case of M=2 microphones, we may omit the expectation, and we have
β(n+1)=β(n)+μ sign(C*Y),
where sign(C*Y)=sign(Re(C*Y))+i sign(Im(C*Y)).
Such determination of β(n) can be simplified even further as shown in the following:
Given that C=a+ib and Y=c+id, we have C*Y=ac+bd+i(ad−bc).
Hereby sign(C*Y)=sign(ac+bd)+isign(ad−bc).
For the real part, we have
For the imaginary part, we have
It can be shown that if sign(a)=sign(c) and sign(b)=sign(d), then if sign(a)=sign(d) we also have sign(b)=sign(c). It is thus shown that either we have one of the two conditions:
sign(a)=sign(c) and sign(b)=sign(d), or
sign(a)≠sign(c) and sign(b)≠sign(d),
or we have one of the two conditions:
sign(a)=sign(d) and sign(b)≠sign(c), or
sign(a)≠sign(d) and sign(b)=sign(c).
Consequently, embodiments of the present disclosure can allow either the real or the imaginary part to be updated without calculating the products ad, ac, bd or bc.
For example, rather than calculating
β(n+1)=β(n)+μΔβ, where Δβ=sign(C*Y)=sign(Re(C*Y))+i sign(Im(C*Y)),
embodiments of the present disclosure provide the following calculation: depending on which of the above sign conditions holds, either the real part of β is updated by μ sign(a)sign(c), or the imaginary part of β is updated by μ sign(a)sign(d). Hereby, for each iteration, embodiments of the present disclosure can allow either the real part of β or the imaginary part of β to be updated, while avoiding the expensive calculation of the product between two complex numbers.
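A minimal sketch of this selective sign update (M=2, scalar β; μ is an assumed step size, and ties at exactly zero sign are ignored for brevity):

    import numpy as np

    def sign_lms_step(beta, C, Y, mu=0.01):
        a, b = C.real, C.imag              # C = a + ib
        c, d = Y.real, Y.imag              # Y = c + id
        sa, sb, sc, sd = np.sign(a), np.sign(b), np.sign(c), np.sign(d)
        if sa * sb * sc * sd > 0:
            # sign(ac) == sign(bd), so sign(Re(C*Y)) = sign(ac+bd) = sa*sc:
            # update the real part without forming any product.
            return beta + mu * sa * sc
        # otherwise sign(ad) == -sign(bc), so sign(Im(C*Y)) = sign(ad-bc) = sa*sd:
        # update the imaginary part instead.
        return beta + 1j * mu * sa * sd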
The adaptive update algorithm may e.g. comprise or be constituted by a SIGN LMS algorithm. The update of the adaptive parameter may be performed using the sign LMS-algorithm (e.g. a complex sign LMS algorithm), e.g. for the M>2 case, where the target cancelling beamformers are combined (added together).
In the choice between using the LMS or the NLMS algorithm, it may be worth mentioning that the division in an LMS/NLMS update can be less accurate compared to a division used in a single step estimation. In the NLMS case, the division can simply be implemented by a shift operator, which is (computationally) much cheaper than a division operation. The shift operator may correspond to rounding the denominator to the nearest 2N.
The microphone system may comprise more than 2 microphones. The microphone system may comprise 3 microphones.
The equation ∇β=0 may be arranged to isolate βi such that each element of the β-vector is given in terms of beamformer averages and the other elements of the β-vector, as outlined in the following.
Based on the gradient, different update rules can be derived for three microphones. In the case of three microphones, the gradient w.r.t. β1 and β2 is given by
By setting the gradients=0, we obtain the two expressions:
In this solution, we do not have to select a learning rate. We solely need to select a smoothing coefficient for the average operators <⋅>. In addition, if the β-terms fluctuate during convergence, it may be advantageous to apply low-pass filtering to β1 and to β2. It is proposed to iteratively alternate between determining β1 (by the above equation (possibly including LP-filtering of β1, β2), given a previously estimated value of β2) and β2 (by the above equation (possibly including LP-filtering of β1, β2), given the previously estimated value of β1).
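A minimal sketch of this alternating scheme (derived from the least-squares normal equations for Y=C0−β1C1−β2C2 under the β-convention used above; the exact elided equations of the disclosure may differ):

    import numpy as np

    def alternate_betas(C0, C1, C2, n_iter=10):
        # C0, C1, C2: complex arrays of beamformer outputs across time.
        avg = lambda z: np.mean(z)
        beta1, beta2 = 0j, 0j
        for _ in range(n_iter):
            # each beta is isolated from the zero-gradient condition,
            # given the current estimate of the other beta
            beta1 = (avg(C1.conj() * C0) - beta2 * avg(C1.conj() * C2)) / avg(np.abs(C1) ** 2)
            beta2 = (avg(C2.conj() * C0) - beta1 * avg(C2.conj() * C1)) / avg(np.abs(C2) ** 2)
        return beta1, beta2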
The equation ∇β=0 may be solved to isolate βi such that each element of the β-vector is given solely in terms of beamformer averages, as outlined in the following.
Notice, from the above equations, we may also insert
into
and thus isolate β1. Hereby we obtain the direct estimation of β1 and β2
and
As it appears, the adaptive parameter vector β (e.g. β1, β2 of a microphone system comprising three microphones), or the adaptive parameter β (of a microphone system comprising two microphones), may be estimated based on averaging across the M (e.g. M=2, 3) microphones (rather than estimating the noise covariance matrix (RV)).
The direct estimation of β1 and β2 is advantageous, as it does not depend on previous estimates of β1 and β2, and the convergence hereby only depends on the beamformer averages. On the other hand, the estimation depends on quartic (4th order) terms, which may be problematic to implement in fixed-point.
As an alternative, the alternating solution, where the estimate of β1 depends on β2 and vice versa, only depends on quadratic terms. Still, neither update rule contains a learning rate.
By defining a specified position of the target sound source, it is ensured that the estimated MVDR beamformer weights are normalized such that the output signal has an undistorted response for the specified target position.
The equation ∇β=0, where ∇β refers to the gradient with respect to β, may refer to the derivative of the magnitude squared (|⋅|2) of the output (Y) of the MVDR beamformer with respect to the adaptive parameter vector β.
The target-maintaining beamformer and the at least two target-cancelling beamformers may be fixed (i.e. defined by predefined (or occasionally (not constantly) updated) weights).
The adaptive parameter vector β may be determined in dependence of the time-averaged values of the (e.g. fixed) target-maintaining beamformer (C0) and the at least two target-cancelling beamformers (C1, C2). Time-averaging the beamformers is an alternative to actually estimating the covariance matrices.
The at least M−1 (e.g. two) target-cancelling beamformers may be based on a subset of the M electric input signals.
The microphone system may comprise (e.g. consist of) M=3 microphones. In a first mode of operation, the two target-cancelling beamformers may be based on three electric input signals. In a second mode of operation, the two target-cancelling beamformers may be based on two electric input signals (cf. e.g.
The adaptive parameter vector β for a three input GSC-beamformer may be estimated directly, but the direct estimation is difficult to implement as it requires a large dynamic range (due to the quartic terms). As an alternative to the direct estimation, an estimate may be based on solving the equation ∇β=0, where ∇β refers to the gradient with respect to β. Thereby the adaptive parameters β1 and β2 for a three-input solution may be estimated in an alternating way (e.g. first β1, then β2, then β1, etc.). This estimation only contains second order terms, which is more desirable in a fixed-point implementation.
The Generalized Eigenvector beamformer (GEV) (or an approximation thereof) implemented as a generalized sidelobe canceller (GSC) structure is determined as a function of an adaptive parameter β (or parameter vector β), e.g. a) by using the LMS (or NLMS) algorithm (e.g. a Sign-LMS algorithm) or b) by solving the equation ∇β=0 (∇β being the gradient with respect to β).
The directional noise reduction system may e.g. comprise a Generalized Eigenvector beamformer (GEV) solution for 2 or 3 or more input transducers (e.g. microphones), in particular an update rule for estimating adaptive parameters (β) based on averaging across the three (or more) fixed beamformers (rather than estimating the noise covariance matrix), e.g. including fading between beamformer weights for two and three input transducers (e.g. microphones).
For two microphones, we still only have two fixed beamformers, but contrary to solely finding an average when noise is present, we also find a separate average when speech (or target) is present. For two microphones, we thus estimate averages based on two beamformers, but the averages are split into averages when either we detect noise or we detect speech, which is similar to either updating a speech covariance matrix or a noise covariance matrix (based on the electric input signals from the input transducers), but in the present disclosure it is proposed to base the (speech and noise) covariances on beamformed signals instead (from the target-maintaining and the target-cancelling beamformers).
In a fifth aspect of the present application, a hearing aid adapted to be worn by a user is provided by the present disclosure. The fifth hearing aid comprises:
The adaptively determined beamformer weights of the generalized sidelobe canceller beamformer may be determined in dependence of an adaptive parameter β or parameter vector β
The sign LMS is a special case of the gradient algorithm, where the gradient step is limited to moving in the direction of the sign of the gradient (i.e., the sign of the real part and the imaginary part of the gradient).
The target signal to noise ratio may e.g., be expressed in dependence of the fixed (target-maintaining and target-cancelling) beamformers and the adaptive parameter β or parameter vector β.
When the adaptive parameter β or parameter vector β is updated using a Sign LMS algorithm, the step size may be held constant. The Sign LMS algorithm may be the complex Sign LMS algorithm. The step size may be complex. The real and imaginary parts of the complex step size may be kept constant.
The advantage of the sign LMS is its simplicity. E.g., divisions during the LMS update can be avoided. The adaptive parameter may be updated more frequently, e.g., more than once per frame. And the lack of accuracy in the sign-gradient step can be compensated by a frequent update of small gradient steps.
In order to increase the convergence rate, the step size(s) of the adaptive algorithm, e.g., the Sign LMS algorithm, may be updated with a momentum term.
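A momentum-augmented sign step may be sketched as follows (illustrative only; mu and alpha are hypothetical constants, and the descent sign convention of the update is an assumption):

    import numpy as np

    def momentum_sign_step(beta, grad, velocity, mu=1e-3, alpha=0.9):
        # Accumulate a momentum term on top of the complex sign-gradient
        # step; past update directions speed up convergence along a
        # consistent gradient direction.
        step = mu * (np.sign(grad.real) + 1j * np.sign(grad.imag))
        velocity = alpha * velocity + step
        return beta - velocity, velocity    # descent convention assumed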
By updating the beamformer weights/adaptation coefficients in order to maximize the general SNR, rather than a specific target signal-to-noise ratio (from one specific target sound source (Sθ) at one specific location (θ)), the beamformer weights/adaptive parameters of the beamformer allow target signals to impinge from more directions than the position for which a distortionless response is provided. The target sound sources may e.g., be sound sources comprising speech. The target sound sources may e.g., be sound sources comprising music.
The (GSC) beamformer may e.g., comprise a Generalized Eigenvector beamformer (GEV) (or an approximation thereof) configured to provide a noise reduced signal determined as a function of an adaptive parameter β or parameter vector β. The Generalized Eigenvector beamformer (GEV) (or an approximation thereof) may e.g., comprise a multitude M of fixed beamformers, each being configured to generate a beam pattern/beamformed signal in dependence of associated beamformer weights. The adaptively determined beamformer weights of the GSC-beamformer may e.g. be determined in dependence of the adaptive parameter β or parameter vector β.
The (GSC) beamformer may e.g., comprise a multitude (e.g. M) of fixed beamformers, e.g. one target maintaining beamformer and M−1 target-cancelling beamformers.
The (GSC) beamformer may e.g., comprise one beamformer configured to have a distortionless response for a specific target position of a target sound source, and M−1 beamformers configured to be independent target-cancelling beamformers for the specific target position (or positions).
Hereby it is ensured that the estimated weights of the Generalized Eigenvector beamformer (GEV), or an approximation thereof, are normalized such that the output signal of the beamformer has an undistorted response for sound from the specific target position.
The distortionless position may be adaptively changed over time.
The hearing aid may comprise a postfilter connected to said at least one beamformer and adapted to receive said at least one beamformed signal, or a processed version thereof, and wherein said postfilter is configured to provide gains to be applied to said at least one beamformed signal, or to said processed version thereof, in dependence of a postfilter target signal to noise ratio to provide a further noise reduced signal.
The adaptive parameter β or parameter vector β (and thus the adaptively determined beamformer weights of the GSC beamformer) may e.g. be determined by averaging across the (e.g. M) fixed beamformers.
The update rule according to the present disclosure is an alternative to estimating a noise covariance matrix.
The Generalized Eigenvector beamformer (GEV) (or an approximation thereof) may e.g. comprise one target-maintaining beamformer and M−1 target cancelling beamformers.
In a specific power saving mode of operation, the M−1 target cancelling beamformers (and optionally the target maintaining beamformer) receive as inputs only a subset of the M electric input signals (see e.g.
The subset of the M electric input signals may comprise two electric input signals (whereby the M−1 target-cancelling beamformers (and optionally the target-maintaining beamformer) may be first order beamformers).
Each of the M−1 target cancelling beamformers may be configured to only take a subset of the M microphone signals as inputs, e.g. such that each target cancelling beamformer has inputs comprising a different subset of the M input signals (e.g. a subset consisting of two input signals, such that each target cancelling beamformer becomes a first order beamformer, e.g. a first order cardioid).
In connection with any of the preceding aspects or in a further, separate, aspect, the hearing aid is configured to provide that the first and second target cancelling beamformers of a three microphone input solution (at least in a second mode of operation) are based on a subset of the three microphone inputs, and wherein the directional noise reduction system is configured to switch (e.g. fade) beamformer weights of the target maintaining beamformer and the first and second target cancelling beamformers between sets of beamformer weights optimized for three (first mode) and two microphones (second mode), respectively.
In connection with any of the preceding aspects or in a further, separate, aspect, the hearing aid comprises a directional noise reduction system comprising a three microphone input GSC-structure comprising a (e.g. fixed) target maintaining beamformer and first and second (e.g. fixed) target cancelling beamformers, wherein the first and second target cancelling beamformers are based on a subset of two of the three microphone inputs, and wherein the hearing aid is configured to shift (e.g. fade) between a first and a second mode of operation, wherein all three microphone inputs are active in the first mode and wherein only two of the three microphone inputs are active in the second mode of operation.
The hearing aid is configured to store beamformer weights in memory that are optimized in advance for the first and second modes of operation of the directional system.
The transition from the first to the second mode of operation (or from the second to the first mode of operation) may be controlled by the user via a user interface, or by a control signal in dependence of an indicator of the complexity of the current acoustic environment around the user.
In a three-microphone input system (first mode of operation), the GSC-structure comprises a (e.g. fixed) three-input target maintaining beamformer and first and second (e.g. fixed) two-input target cancelling beamformers, whereas in a two-microphone input system the GSC-structure comprises a (one, e.g. fixed) two-input target maintaining beamformer and a single (e.g. fixed) two-input target cancelling beamformer.
The main advantage of using two-microphone (target-cancelling) beamformers in a three-microphone system is that it becomes easier to fade between a 2-microphone system and a 3-microphone system, as the target cancelling beamformer can be re-used.
The trigger for entering the specific mode of operation (change from three to two inputs to the target cancelling beamformers) may e.g., be related to saving power. The trigger may e.g., be based on the input level or another parameter, e.g. provided by a sound scene detector.
The sound scene complexity may trigger fading from two to three microphones. The trigger may e.g., be a function of level, SNR, remaining battery time, movement.
In general, it can be argued that the battery capacity is not well spent on powering three microphones in all situations. We may apply three microphone-based beamforming only in the most complex environments. Which microphones to fade away from may depend on the microphone configuration in the hearing device. It may be desirable to fade towards the two microphones whose axis is most parallel to a front-rear axis in an ordinary situation where the user focuses attention on a sound source in the environment. In an own voice pickup (telephone) mode of operation of the hearing aid, a different two-microphone configuration may be selected. In case one of the microphones is detected not to work, the two still functional microphones may preferably be selected.
The hearing aid may comprise a classifier of the current acoustic environment of the user. The classifier may provide a classification signal representative of a current acoustic environment. The fading between which microphones of the microphone system to be used for beamforming in a given acoustic situation may be controlled by the classification signal.
In situations, with little or no noise or a signal from a single direction, there is a risk that the target covariance matrices and the noise covariances (or the corresponding fixed beamformers) may converge towards the same value (i.e., RT=RV). This can be avoided by adding a bias to the covariance matrices, i.e.
and
where σT and σV are small constants and I is the identity matrix (having 1's in the diagonal and 0's elsewhere). As the microphone signals will always contain microphone noise, it is most important to add a bias to the target covariance matrix, hereby ensuring that the beamformer system will converge towards an MVDR beamformer in the case of low input levels.
In the above example, the bias is a rank-1 matrix, but the added target covariance bias may also be a full rank matrix.
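A sketch of the biasing (one plausible reading of the elided expressions above, assuming, per the rank-1 remark, a target bias proportional to the outer product of a steering vector d and a diagonal noise bias; the constants are illustrative):

    import numpy as np

    def bias_covariances(R_T, R_V, d, sigma_T=1e-4, sigma_V=1e-4):
        # Regularize the covariance estimates so that R_T and R_V cannot
        # converge to the same value at low input levels; the target bias
        # here is rank-1 (sigma_T * d d^H), the noise bias diagonal.
        R_T = R_T + sigma_T * np.outer(d, d.conj())
        R_V = R_V + sigma_V * np.eye(R_V.shape[0])
        return R_T, R_V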
An SNR estimate (e.g., the target signal to noise ratio (SNR) of the beamformed signal) may be used as input to a postfilter.
The LMS or NLMS update term of the adaptive parameter or parameter matrix β (μ<⋅>) may be normalized in a number of different ways, see e.g., 4th aspect.
The LMS or NLMS update rate may e.g., be further increased by adding a momentum term.
The 6th aspect relates to a two-(or three-) microphone input directional system implemented as a GSC-structure comprising a GEV beamformer comprising two (or three) fixed beamformers, wherein an update rule for estimating the adaptive parameter (β) of the GSC structure is based on one of a) a direct determination based on estimation values of the beamformed signals of the two (or three) fixed beamformers, and b) the Sign LMS algorithm.
In a sixth aspect of the present application, a hearing aid adapted to be worn by a user is provided by the present disclosure. The sixth hearing aid comprises:
An update rule for estimating said adaptive parameter (β) is based on at least one of
The multitude M of microphones may be equal to two. The multitude M of microphones may be equal to three.
For a two-microphone case (M=2), two solutions (providing maximum and minimum values) of the equation obtained by differentiating the cost function (e.g. SNR) with respect to the adaptive parameter β may be given by the expression
where
where C0, and C1 are the (output signals of the) target-maintaining (C0) and target-cancelling (C1) beamformers, respectively, and the CTN and CVN, N=0, 1, are the values of the target-maintaining (C0) and target-cancelling (C1) beamformers, respectively, during target (T), e.g. speech, and noise (N), respectively, and wherein |⋅| indicates magnitude, * indicates complex conjugate, and <⋅> indicates time average.
The second degree polynomial yields two solutions for β: one value maximizing the SNR and another value minimizing the SNR. Using Muller's method, the solution to the quadratic formula can be rewritten as
From this equation, we can see that for
So
yields the MVDR solution (i.e. A=0), which is the solution maximizing the SNR.
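As a numerical illustration (a minimal sketch; which of the two roots maximizes the SNR depends on the sign conventions of the coefficients A, B, C, so the chosen branch is an assumption):

    import numpy as np

    def beta_root(A, B, C):
        # Solve A*beta**2 + B*beta + C = 0 in Muller's form,
        # beta = 2C / (-B - sqrt(B**2 - 4AC)), which remains finite for
        # A -> 0 and then reduces to beta = -C/B (the MVDR-type solution).
        disc = np.sqrt(B * B - 4 * A * C + 0j)   # complex-safe square root
        return 2 * C / (-B - disc)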
In the case of three microphones, there is a need to solve a set of complex second order polynomials of the type Aβ12+Bβ1+C=0, (and Aβ22+Bβ2+C=0) where (for the case of β1)
If the target is solely from the steering vector direction, all terms containing CT1 and CT2 will disappear, and |CT0|2=1. The polynomial thus reduces to
where β1 can be isolated as
corresponding to the MVDR solution.
Update rules for the adaptive parameter (β) based on the complex sign LMS algorithm for the two-microphone case (M=2) may be given by the following expressions for the real and imaginary parts of β:
where YT and YV are the most recent available output signal estimates of the GSC-GEV beamformer, when the output is labelled as target (T) and (non-target) noise (V), respectively, and where the fixed step sizes of the adaptive algorithm apply in the directions of the real and imaginary parts of β, respectively.
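A sketch of such a step (the exact update expressions are the ones referred to above; this version assumes a descent step on |YV|2 with YV=C0V−β·C1V, and a complex sign acting separately on the real and imaginary parts; names are illustrative):

    import numpy as np

    def csign(z):
        # Complex sign: sign of the real and the imaginary part separately.
        return np.sign(z.real) + 1j * np.sign(z.imag)

    def sign_lms_beta(beta, C0_V, C1_V, mu_re, mu_im):
        # One noise-driven complex sign-LMS step for M = 2: descend on
        # |Y_V|**2, whose gradient term w.r.t. conj(beta) is -conj(C1)*Y.
        Y_V = C0_V - beta * C1_V
        g = csign(np.conj(C1_V) * Y_V)
        return beta + mu_re * g.real + 1j * mu_im * g.imag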
For three microphones (M=3), update rules for the adaptive parameter vector (β) based on the sign-based LMS may be given by
and
where CTN and CVN, N=0, 1, 2 are the values of the target-maintaining (C0) and target-cancelling (C1, C2) beamformers, respectively, during target (T), e.g. speech, and noise (N), respectively, and where YV and YT both depend on the most recent estimates of β1 and β2, and
and
The sign LMS algorithm (i.e. the adaptive parameters) may be updated more than once per time frame, e.g. 2 times per frame or 5 times per frame.
7th Aspect (GSC Beamformer where SNR is Expressed in Terms of Beamformer Weights Optimized in Dependence of Electric Input Signals);
The 7th aspect relates to a hearing aid comprising an adaptive beamformer implemented as a generalized sidelobe canceller (GSC) comprising M fixed beamformers, wherein the beamformer weights (w) of the adaptive beamformer are adaptively optimized to a plurality of target positions (θ) by maximizing a target signal to noise ratio (SNR) for sound from said plurality of target positions (θ), where said beamformer weights (w) are optimized in dependence of the time-averaged outer vector products <xxH>T and <xxH>V of the multitude of electric input signals arranged as a vector x, with xH its Hermitian transpose.
In a seventh aspect of the present application, a hearing aid adapted to be worn by a user is provided. The seventh hearing aid comprises:
The outer vector product xxH refers to the matrix of pairwise products of the electric input signals from the M microphones arranged as a vector x=(x1, . . . , xM)T, where superscript T denotes transposition.
The hearing aid may comprise a voice activity detector configured to estimate whether or not or with what probability a given input signal contains voice (e.g., speech) and to provide a voice activity control signal indicative thereof.
The voice activity control signal may be used to decide when the input signal is estimated to contain target speech (T) and noise (V), respectively.
The decision may as well depend on other cues, e.g., spatial cues.
The beamformer weights (w) may be iteratively updated using a gradient based optimization procedure. This is possible since the weights are given as a function of the adaptive parameter β (which may be optimized based on the gradient algorithm). This is further described in the paragraph below.
The beamformer weights (w) may be expressed as a function of β as w=a−Bβ*, and the beamformed signal Y as Y=(a−Bβ*)Hx=C0−βC, where C0=aHx and C=BHx, and where a is the vector containing the weights of the target maintaining beamformer C0, and B is the blocking matrix containing the weights of the target cancelling beamformers, C. We thus have
Where T and V indicate ‘target’ and ‘noise’ (not target), respectively, and where <z*z>T and <z*z>V indicate an average of the signal z for a frequency band k across time frames (l) labelled as target and noise, respectively. zT and zV are the signals for time-frequency units labelled as target (T) and noise (V), respectively (e.g. by a voice activity detector).
The SNR is given as
and
We thus see that the SNR is expressed in dependence of either the output signals (YT and YV) or the weights w (in terms of β) and the target and noise covariance matrices (RT, RV).
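For illustration, the beamformer-average form of the SNR may be sketched as follows (assuming frame sequences already labelled target and noise, e.g. by a voice activity detector, and shown for the scalar (M=2) case; names are illustrative):

    import numpy as np

    def snr_from_beamformers(C0_T, C_T, C0_V, C_V, beta):
        # SNR in a frequency band expressed via the fixed beamformer
        # outputs (Y = C0 - beta*C) averaged over target- and
        # noise-labelled frames, without forming covariance matrices.
        Y_T = C0_T - beta * C_T
        Y_V = C0_V - beta * C_V
        return np.mean(np.abs(Y_T) ** 2) / np.mean(np.abs(Y_V) ** 2)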
The 8th aspect relates to a hearing aid comprising an adaptive beamformer configured to generate at least one beamformed signal in dependence of beamformer weights (w), which are adaptively optimized to a plurality of target positions (θ) by maximizing a target signal to noise ratio (SNR) for sound from said plurality of target positions, and wherein the signal to noise ratio is expressed in dependence of the beamformer weights (w) and wherein the beamformer weights are optimized in dependence of steering vectors (dθ) for the plurality of target positions (θ).
In an eighth aspect of the present application, a hearing aid adapted to be worn by a user is provided. The eighth hearing aid comprises:
The inter-microphone target covariance matrix RT may be determined as a (possibly weighted) sum of the (outer) product of the steering vectors dθdθH of the currently considered target sound sources (from positions θ=θ1, . . . , θNTS, where NTS is the (current) number of target sound sources).
The summation in the expression of the target covariance matrix RT may be a weighted sum of the outer product of the most likely target steering vectors according to the following expression:
where pθ is a (real) weight factor (e.g. a probability estimate) of a given position θ.
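A sketch of this weighted sum (the weights pθ and the steering vectors are assumed given; for a single position with pθ=1 the expression reduces to the rank-1 case discussed earlier):

    import numpy as np

    def target_covariance(steering_vectors, p):
        # R_T as a weighted sum of outer products d_theta d_theta^H over
        # the most likely target positions, weighted by p_theta.
        M = len(steering_vectors[0])
        R_T = np.zeros((M, M), dtype=complex)
        for d_theta, p_theta in zip(steering_vectors, p):
            R_T += p_theta * np.outer(d_theta, np.conj(d_theta))
        return R_T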
The SNR may be expressed in terms of steering vectors (look vectors):
We notice that |Y|2 can be written as
As aH x and BH x are expressions for beamformers, we see that the covariance may be replaced by beamformer averages. Similarly, the numerator |YT|2 may be expressed in terms of beamformers
The numerator may thus as well be expressed in terms of fixed beamformer values given by a weighted sum of steering vectors multiplied by the fixed beamformer weights given by a and B.
The target covariance matrix (RT) may be determined as a sum of outer products of steering vectors (dθ,p) for different persons (p). The target covariance matrix (RT) may be determined as a weighted sum of outer products of steering vectors (dθ,p) for different persons (p=1, . . . , P), the weights (wd,p) being e.g. dependent on age, or gender (e.g. relative to the age or gender of the user).
In one formulation, the SNR is expressed in terms of the averages of the products between the fixed beamformer outputs (Y): more specifically, the expected beamformer output while the target is dominant (YT) and the expected beamformer output while the noise is dominant (YV).
The SNR which is optimized is thus estimated in dependence of these time-averaged beamformer outputs.
The at least one beamformer may be implemented as a GEV beamformer, whose SNR may be expressed in terms of the target covariance (RT) and noise covariance (RV) matrices. However, the SNR may also be expressed in terms of the different estimated fixed beamformers in the GSC structure (where the covariance matrices need not be specifically estimated). In other words, rather than estimating covariance matrices, the averages are estimated as cross-products between a target maintaining beamformer and target cancelling beamformers (estimated during presence/absence of target/noise).
9th Aspect (GSC Beamformer with a Variable Number of Active Input Signals);
State of the art hearing devices may comprise more than two microphones, e.g. three or more. In general, it can be argued that the (generally scarce) battery capacity is not well spent on powering all microphones of a hearing aid in all situations. All (e.g. three) microphone-based beamforming may be applied only in the most complex acoustic environments. Which microphones to fade away from may depend on the microphone configuration in the hearing device. It may be desirable to fade towards the two microphones whose axis is most parallel to a front-rear axis in an ordinary situation where the user focuses attention on a sound source in the environment. In an own voice pickup (telephone) mode of operation of the hearing aid, a different two-microphone configuration may be selected. In that case, we need a different steering vector d, and consequently a different target-maintaining beamformer a and a different target cancelling beamformer B. In case one of the microphones is detected not to work, the two still functional microphones may preferably be selected.
In a ninth aspect of the present application, a hearing aid adapted to be worn by a user is provided. The ninth hearing aid comprises:
In case fading is implemented from two to one microphone, the GSC structure can be dispensed with. Otherwise, the GSC structure may be used.
The target-maintaining and target-cancelling beamformers need not necessarily be fixed. The beamformer weights (a and B) may be adaptively updated to adapt towards the target direction.
Preferably, the system comprises M−1 target cancelling beamformers. But fewer than M−1 may be used (so that we optimize for fewer than M−1 microphones).
The number M of microphones may be equal to three.
Different options exist in order to make two independent target-cancelling beamformers. One option is to select M−1 of M columns in the matrix given by (I−(ddref*/dHd)). In that case each target-cancelling beamformer becomes a linear combination of all three microphone signals (for M=3, cf. e.g.,
In another option (cf. e.g.,
The resulting weights become dependent on all three microphones, and we can obtain the same resulting gains from this linear combination as we could when each target cancelling beamformer was based on all three microphones.
In the second mode of operation, the number of electric input signals (Xm,) to the target maintaining beamformer may e.g., be equal to M−1.
The advantage of the second solution (for M=3) is that we can fade e.g., microphone 3 out simply by turning the target cancelling beamformer (C2) based on microphone 1 (M1) and microphone 3 (M3) off. The target cancelling beamformer (C1) based on microphone 1 (M1) and microphone 2 (M2) does not need to be changed depending on whether the resulting beamformer depends on 2 or 3 microphones.
The target maintaining beamformer C0, however, needs to be scaled differently when it is based on a different number of microphones. In the case of three microphones (M=3), the weights a of the target maintaining beamformer are given by d d1*/(|d1|2+|d2|2+|d3|2), where microphone 1 (M1) is the reference microphone. In the case of two microphones (M=2), it is given by: d d1*/(|d1|2+|d2|2). The scaling difference between three and two microphones is thus (|d1|2+|d2|2+|d3|2)/(|d1|2+|d2|2), a value which typically is around 1.5. But the correct scaling difference would be (|d1|2+|d2|2+|d3|2)/(|d1|2+|d2|2) or (|d1|2+|d2|2)/(|d1|2+|d2|2+|d3|2), depending on the fading direction.
In the second mode of operation, the number Msub,0 of electric input signals received by the target maintaining beamformer (a) may be equal to M.
In the second mode of operation, the subsets (SSm-1, m=2, . . . , M) of the M electric input signals (Xm, m=1, . . . , M) received by the M−1 target-cancelling beamformers (bm-1, m=2, . . . , M) may comprise M−1≥Msub,m-1≥2 of the M electric input signals (Xm, m=1, . . . , M).
The number (Msub,m-1) of electric input signals (Xm,) of a given subset (SSm-1, m=2, . . . , M) may be equal for all M−1 target cancelling beamformers (bm-1, m=2, . . . , M).
At least two of the subsets may comprise a different number of electric input signals.
At least two of the subsets may have at least one electric input signal different from each other.
The number (Msub,m-1) of electric input signals (Xm,) may e.g., be equal to two for at least one, such as a majority or all, of the M−1 target cancelling beamformers (Cm-1, m=2, . . . , M).
When switching between the first and second modes of operation, an instant change between using two and three microphones may be applied. To minimize abrupt sound effects that might annoy the user, a fading between the two modes may, however, be applied.
The hearing aid may be configured to switch (e.g., fade) between the first and second modes of operation of the directional system in dependence of a mode selection signal.
In a directional system comprising three microphones, the target maintaining beamformer may apply approximately ⅓ of the weighting to each microphone signal. In a directional system comprising two microphones, the target maintaining beamformer may apply approximately ½ of the weighting to each microphone signal. The simplest approach when fading is to apply a scaling to each weight while fading, i.e. while fading from three to two microphones. An example hereof is shown in
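A sketch of such a weight fade (illustrative; the two-microphone weight set is assumed zero-padded on the faded-out microphone so that both sets have length M):

    import numpy as np

    def fade_weights(w_three, w_two, alpha):
        # Linear cross-fade between weight sets optimized for three and
        # two microphones; alpha runs from 0 (three mics) to 1 (two mics).
        return (1.0 - alpha) * np.asarray(w_three) + alpha * np.asarray(w_two)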
The hearing aid may comprise a classifier of the current acoustic environment of the user. The classifier may provide a classification signal representative of a current acoustic environment. The mode selection signal may be determined (or influenced) by the classification signal.
The mode selection signal may e.g. be determined (or influenced) by a detected sound scene complexity. Switching (fading) from M−1 to M (e.g. from two to three) inputs to the target-cancelling beamformers may e.g. be initiated by a detection of a more complex acoustic environment and vice versa.
The hearing aid may be configured to switch (e.g. fade) between the first and second modes of operation (e.g. change from M to M−1 (e.g. from three to two) inputs to the target-cancelling beamformers) to save power. The hearing aid may comprise a detector of a current status of the energy source, e.g. providing an estimate of a current rest-capacity of the battery of the hearing aid, or of a current power consumption of the hearing aid.
The mode selection signal may e.g. be a function of input level of the electric input signal(s), SNR, remaining battery time, movement of the user, etc.
The hearing aid may comprise a multitude (M≥2) of microphones.
The hearing aid may comprise a directional noise reduction system comprising a generalized sidelobe canceller (GSC) beamformer configured to provide at least one beamformed signal, wherein the generalized sidelobe canceller comprises a multitude M of fixed beamformers, one of the M fixed beamformers being a target signal maintaining beamformer (a), and M−1 of the fixed beamformers being target-cancelling beamformers (bm-1, m=2, . . . , M).
The at least M−1 (e.g. two) target-cancelling beamformers may be based on a subset of the M electric input signals.
The microphone system may consist of M=3 microphones. In a first mode of operation, the two target-cancelling beamformers may be based on three electric input signals. In a second mode of operation, the two target-cancelling beamformers may be based on two electric input signals. The hearing aid may be adapted to switch (e.g. fade) between the first and the second mode of operation (e.g. in dependence of a classifier of the acoustic environment, or of an estimate of a current rest-capacity of the battery of the hearing aid, or of a current power consumption of the hearing aid).
The directional noise reduction system may e.g. comprise a Generalized Eigenvector beamformer (GEV) solution for 2 or 3 or more input transducers (e.g. microphones), in particular comprising an update rule for estimating adaptive parameters (β) based on averaging across the three (or more) fixed beamformers (rather than estimating the noise covariance matrix). The directional noise reduction system may e.g. be configured to include the option of fading between beamformer weights for two and three input transducers (e.g. microphones)).
The Generalized Eigenvector beamformer (GEV) (or an approximation thereof) may e.g. comprise one target-maintaining beamformer and M−1 target cancelling beamformers.
In a specific power saving mode of operation, the M−1 target cancelling beamformers may receive as inputs only a subset of the M electric input signals (see e.g.
The subset of the M electric input signals may comprise two electric input signals (whereby the M−1 target-cancelling beamformers are first order beamformers).
Each of the M−1 target cancelling beamformers may only take a subset of the M microphone signals as inputs, e.g. such that each target cancelling beamformer has inputs comprising a different subset of the M input signals (e.g. a subset consisting of two input signals, such that each target cancelling beamformer becomes a first order beamformer).
In connection with any of the preceding aspects or in a further, separate, aspect, the hearing aid may be configured to provide that the first and second target cancelling beamformers of a three microphone input solution are based on a subset of the three microphone inputs, and wherein the beamformer (e.g. in a specific mode of operation) is configured to fade the beamformer weights of the first and second target cancelling beamformers between sets of beamformer weights optimized for two and three microphones, respectively.
The main advantage of using two-microphone beamformers in a three-microphone system is that it becomes easier to fade between a 2-microphone system and a 3-microphone system, as the target cancelling beamformer can be re-used.
The trigger for entering the specific mode of operation (change from three to two inputs to the target cancelling beamformers) may e.g. be related to saving power. The trigger may e.g. be based on the input level or another parameter, e.g. provided by a sound scene detector.
The sound scene complexity may trigger fading from two to three microphones. The trigger may e.g. be a function of level, SNR, remaining battery time, movement.
In general, it can be argued that the battery capacity is not well spent on powering three microphones in all situations. We may apply three-microphone-based beamforming only in the most complex environments. Which microphones to fade away from may depend on the microphone configuration in the hearing device. It may be desirable to fade towards the two microphones whose axis is most parallel to a front-rear axis in an ordinary situation where the user focuses attention on (a) sound source(s) in the environment. In an own voice pickup (telephone) mode of operation of the hearing aid, a different two-microphone configuration may be selected. Further, in case one of the microphones is detected not to work, the two still functional microphones may preferably be selected.
The hearing aid may be adapted to provide a frequency dependent gain and/or a level dependent compression and/or a transposition (with or without frequency compression) of one or more frequency ranges to one or more other frequency ranges, e.g. to compensate for a hearing impairment of a user. The hearing aid may comprise a signal processor for enhancing the input signals and providing a processed output signal.
The hearing aid may comprise an output unit for providing a stimulus perceived by the user as an acoustic signal based on a processed electric signal. The output unit may comprise a number of electrodes of a cochlear implant (for a CI type hearing aid) or a vibrator of a bone conducting hearing aid. The output unit may comprise an output transducer. The output transducer may comprise a receiver (loudspeaker) for providing the stimulus as an acoustic signal to the user (e.g. in an acoustic (air conduction based) hearing aid). The output transducer may comprise a vibrator for providing the stimulus as mechanical vibration of a skull bone to the user (e.g. in a bone-attached or bone-anchored hearing aid). The output unit may (additionally or alternatively) comprise a transmitter for transmitting sound picked up-by the hearing aid to another device, e.g. a far-end communication partner (e.g. via a network, e.g. in a telephone mode of operation, or in a headset configuration).
The hearing aid may comprise an input unit for providing an electric input signal representing sound. The input unit may comprise an input transducer, e.g. a microphone, for converting an input sound to an electric input signal. The input unit may comprise a wireless receiver for receiving a wireless signal comprising or representing sound and for providing an electric input signal representing said sound.
The wireless receiver and/or transmitter may e.g. be configured to receive and/or transmit an electromagnetic signal in the radio frequency range (3 kHz to 300 GHz). The wireless receiver and/or transmitter may e.g. be configured to receive and/or transmit an electromagnetic signal in a frequency range of light (e.g. infrared light 300 GHz to 430 THz, or visible light, e.g. 430 THz to 770 THz).
The hearing aid may comprise a directional microphone system adapted to spatially filter sounds from the environment, and thereby enhance a target acoustic source among a multitude of acoustic sources in the local environment of the user wearing the hearing aid. The directional system may be adapted to detect (such as adaptively detect) from which direction a particular part of the microphone signal originates. This can be achieved in various different ways as e.g. described in the prior art. In hearing aids, a microphone array beamformer is often used for spatially attenuating background noise sources. The beamformer may comprise a linear constraint minimum variance (LCMV) beamformer. Many beamformer variants can be found in literature. The minimum variance distortionless response (MVDR) beamformer is widely used in microphone array signal processing. Ideally the MVDR beamformer keeps the signals from the target direction (also referred to as the look direction) unchanged, while attenuating sound signals from other directions maximally. The generalized sidelobe canceller (GSC) structure is an equivalent representation of the MVDR beamformer offering computational and numerical advantages over a direct implementation in its original form. The present application further relates to a Generalized EigenVector beamformer, also termed a ‘Generalized Eigen Value beamformer’, in both cases abbreviated as GEV.
The hearing aid may comprise antenna and transceiver circuitry allowing a wireless link to an entertainment device (e.g. a TV-set), a communication device (e.g. a telephone), a wireless microphone, or another hearing aid, etc. The hearing aid may thus be configured to wirelessly receive a direct electric input signal from another device. Likewise, the hearing aid may be configured to wirelessly transmit a direct electric output signal to another device. The direct electric input or output signal may represent or comprise an audio signal and/or a control signal and/or an information signal.
In general, a wireless link established by antenna and transceiver circuitry of the hearing aid can be of any type. The wireless link may be a link based on near-field communication, e.g. an inductive link based on an inductive coupling between antenna coils of transmitter and receiver parts. The wireless link may be based on far-field, electromagnetic radiation. Preferably, frequencies used to establish a communication link between the hearing aid and the other device are below 70 GHz, e.g. located in a range from 50 MHz to 70 GHz, e.g. above 300 MHz, e.g. in an ISM range above 300 MHz, e.g. in the 900 MHz range or in the 2.4 GHz range or in the 5.8 GHz range or in the 60 GHz range (ISM=Industrial, Scientific and Medical, such standardized ranges being e.g. defined by the International Telecommunication Union, ITU). The wireless link may be based on a standardized or proprietary technology. The wireless link may be based on Bluetooth technology (e.g. Bluetooth Low-Energy technology, e.g. LE Audio), or Ultra WideBand (UWB) technology.
The hearing aid may be or form part of a portable (i.e. configured to be wearable) device, e.g. a device comprising a local energy source, e.g. a battery, e.g. a rechargeable battery. The hearing aid may e.g. be a low weight, easily wearable, device, e.g. having a total weight less than 100 g, such as less than 20 g, such as less than 5 g.
The hearing aid may comprise a ‘forward’ (or ‘signal’) path for processing an audio signal between an input and an output of the hearing aid. A signal processor may be located in the forward path. The signal processor may be adapted to provide a frequency dependent gain according to a user's particular needs (e.g. hearing impairment). The hearing aid may comprise an ‘analysis’ path comprising functional components for analyzing signals and/or controlling processing of the forward path. Some or all signal processing of the analysis path and/or the forward path may be conducted in the frequency domain, in which case the hearing aid comprises appropriate analysis and synthesis filter banks. Some or all signal processing of the analysis path and/or the forward path may be conducted in the time domain.
The hearing aid, e.g. the input unit, and/or the antenna and transceiver circuitry may comprise a transform unit for converting a time domain signal to a signal in the transform domain (e.g. frequency domain or Laplace domain, Z transform, wavelet transform, etc.). The transform unit may be constituted by or comprise a TF-conversion unit for providing a time-frequency representation of an input signal. The time-frequency representation may comprise an array or map of corresponding complex or real values of the signal in question in a particular time and frequency range. The TF conversion unit may comprise a filter bank for filtering a (time varying) input signal and providing a number of (time varying) output signals each comprising a distinct frequency range of the input signal. The TF conversion unit may comprise a Fourier transformation unit (e.g. a Discrete Fourier Transform (DFT) algorithm, or a Short Time Fourier Transform (STFT) algorithm, or similar) for converting a time variant input signal to a (time variant) signal in the (time-) frequency domain. The frequency range considered by the hearing aid from a minimum frequency fmin to a maximum frequency fmax may comprise a part of the typical human audible frequency range from 20 Hz to 20 kHz, e.g. a part of the range from 20 Hz to 12 kHz. Typically, a sample rate fs is larger than or equal to twice the maximum frequency fmax, fs≥2fmax. A signal of the forward and/or analysis path of the hearing aid may be split into a number NI of frequency bands (e.g. of uniform width), where NI is e.g. larger than 5, such as larger than 10, such as larger than 50, such as larger than 100, such as larger than 500, at least some of which are processed individually. The hearing aid may be adapted to process a signal of the forward and/or analysis path in a number NP of different frequency channels (NP≤NI). The frequency channels may be uniform or non-uniform in width (e.g. increasing in width with frequency), overlapping or non-overlapping.
The hearing aid may be configured to operate in different modes, e.g. a normal mode and one or more specific modes, e.g. selectable by a user, or automatically selectable. A mode of operation may be optimized to a specific acoustic situation or environment, e.g. a communication mode, such as a telephone mode. A mode of operation may include a low-power mode, where functionality of the hearing aid is reduced (e.g. to save power), e.g. to disable wireless communication, and/or to disable specific features of the hearing aid (e.g. to decrease the number of microphones actively used).
The hearing aid may comprise a number of detectors configured to provide status signals relating to a current physical environment of the hearing aid (e.g. the current acoustic environment), and/or to a current state of the user wearing the hearing aid, and/or to a current state or mode of operation of the hearing aid. Alternatively or additionally, one or more detectors may form part of an external device in communication (e.g. wirelessly) with the hearing aid. An external device may e.g. comprise another hearing aid, a remote control, an audio delivery device, a telephone (e.g. a smartphone), an external sensor, etc.
One or more of the number of detectors may operate on the full band signal (time domain). One or more of the number of detectors may operate on band split signals ((time-) frequency domain), e.g. in a limited number of frequency bands.
The number of detectors may comprise a level detector for estimating a current level of a signal of the forward path. The detector may be configured to decide whether the current level of a signal of the forward path is above or below a given (L-)threshold value. The level detector may operate on the full band signal (time domain) and/or on band split signals ((time-) frequency domain).
The hearing aid may comprise a voice activity detector (VAD) for estimating whether or not (or with what probability) an input signal comprises a voice signal (at a given point in time). A voice signal may in the present context be taken to include a speech signal from a human being. It may also include other forms of utterances generated by the human speech system (e.g. singing). The voice activity detector unit may be adapted to classify a current acoustic environment of the user as a VOICE or NO-VOICE environment. This has the advantage that time segments of the electric microphone signal comprising human utterances (e.g. speech) in the user's environment can be identified, and thus separated from time segments only (or mainly) comprising other sound sources (e.g. artificially generated noise). The voice activity detector may be adapted to detect as a VOICE also the user's own voice. Alternatively, the voice activity detector may be adapted to exclude a user's own voice from the detection of a VOICE.
The hearing aid may comprise an own voice detector for estimating whether or not (or with what probability) a given input sound (e.g. a voice, e.g. speech) originates from the voice of the user of the system. A microphone system of the hearing aid may be adapted to be able to differentiate between a user's own voice and another person's voice and possibly from NON-voice sounds.
The number of detectors may comprise a movement detector, e.g. an acceleration sensor. The movement detector may be configured to detect movement of the user's facial muscles and/or bones, e.g. due to speech or chewing (e.g. jaw movement) and to provide a detector signal indicative thereof.
The hearing aid may comprise a classification unit configured to classify the current situation based on input signals from (at least some of) the detectors, and possibly other inputs as well. In the present context ‘a current situation’ may be taken to be defined by one or more of
The classification unit may be based on or comprise a neural network, e.g. a recurrent neural network, e.g. a trained neural network.
The hearing aid may comprise an acoustic (and/or mechanical) feedback control (e.g. suppression) or echo-cancelling system. Adaptive feedback cancellation has the ability to track feedback path changes over time. It is typically based on a linear time invariant filter to estimate the feedback path, but its filter weights are updated over time. The filter update may be calculated using stochastic gradient algorithms, including some form of the Least Mean Square (LMS) or the Normalized LMS (NLMS) algorithms. They both have the property to minimize the error signal in the mean square sense with the NLMS additionally normalizing the filter update with respect to the squared Euclidean norm of some reference signal.
The hearing aid may further comprise other relevant functionality for the application in question, e.g. compression, noise reduction, etc.
The hearing aid may comprise a hearing instrument, e.g. a hearing instrument adapted for being located at the ear or fully or partially in the ear canal of a user, a headset, an earphone, an ear protection device or a combination thereof. A hearing system may comprise a speakerphone (comprising a number of input transducers (e.g. a microphone array) and a number of output transducers, e.g. one or more loudspeakers, and one or more audio (and possibly video) transmitters e.g. for use in an audio conference situation), e.g. comprising a beamformer filtering unit, e.g. providing multiple beamforming capabilities.
In an aspect, use of a hearing aid as described above, in the ‘detailed description of embodiments’ and in the claims, is moreover provided. Use may be provided in a system comprising one or more hearing aids (e.g. hearing instruments), headsets, ear phones, active ear protection systems, etc., e.g. in handsfree telephone systems, teleconferencing systems (e.g. including a speakerphone), public address systems, karaoke systems, classroom amplification systems, etc.
In an aspect, a method of operating a hearing aid adapted to be worn by a user is provided by the present disclosure. The hearing aid comprises a microphone system comprising a multitude M of microphones, where M is larger than or equal to two, adapted for picking up sound from an environment of the user and to provide corresponding electric input signals. The method comprises:
It is intended that some or all of the structural features of the device described above, in the ‘detailed description of embodiments’ or in the claims can be combined with embodiments of the method, when appropriately substituted by a corresponding process and vice versa. Embodiments of the method have the same advantages as the corresponding devices.
In an aspect, a method of operating a hearing aid adapted to be worn by a user is furthermore provided by the present application. The hearing aid comprises a microphone system comprising a multitude (M) of microphones, where M is larger than or equal to two, adapted for picking up sound from an environment of the user and to provide corresponding electric input signals xm(n), m=1, . . . , M, n representing time. The method comprises:
It is intended that some or all of the structural features of the device described above, in the ‘detailed description of embodiments’ or in the claims can be combined with embodiments of the method, when appropriately substituted by a corresponding process and vice versa. Embodiments of the method have the same advantages as the corresponding devices.
The beamformer weights (w) for the plurality of target positions (θ) of the currently present plurality of target sound sources may be determined in dependence of a multitude of steering vectors (dθ) for the currently present plurality of target sound sources, e.g. as a weighted sum.
Methods According to 3rd-9th Aspects:
Methods according to the 3rd to 9th aspects are provided by the present disclosure by substituting structural features of hearing aids according to the 3rd to 9th aspects by equivalent process features.
It is intended that some or all of the structural features of the hearing aid devices according to the 3rd to 9th aspects described above, in the ‘detailed description of embodiments’ or in the claims can be combined with embodiments of the methods according to the 3rd to 9th aspects, respectively, when appropriately substituted by corresponding processes (and vice versa). Embodiments of the methods have the same advantages as the corresponding devices.
A computer program (product) comprising instructions which, when the program is executed by a computer, cause the computer to carry out (steps of) the methods described above (e.g. according to any of the 1st to 9th aspects), in the ‘detailed description of embodiments’ and in the claims is furthermore provided by the present application.
In a further aspect, a hearing system comprising a hearing aid as described above (e.g. according to any of the 1st to 9th aspects), in the ‘detailed description of embodiments’, and in the claims, AND an auxiliary device is moreover provided.
The hearing system may be adapted to establish a communication link between the hearing aid and the auxiliary device to provide that information (e.g. control and status signals, possibly audio signals) can be exchanged or forwarded from one to the other.
The auxiliary device may be constituted by or comprise a remote control, a smartphone, or other portable or wearable electronic device, such as a smartwatch or the like.
The auxiliary device may be constituted by or comprise a remote control for controlling functionality and operation of the hearing aid(s). The function of a remote control may be implemented in a smartphone, the smartphone possibly running an APP allowing to control the functionality of the audio processing device via the smartphone (the hearing aid(s) comprising an appropriate wireless interface to the smartphone, e.g. based on Bluetooth or some other standardized or proprietary scheme).
The auxiliary device may be constituted by or comprise an audio gateway device adapted for receiving a multitude of audio signals (e.g. from an entertainment device, e.g. a TV or a music player, a telephone apparatus, e.g. a mobile telephone or a computer, e.g. a PC, a wireless microphone, etc.) and adapted for selecting and/or combining an appropriate one of the received audio signals (or combination of signals) for transmission to the hearing aid.
The auxiliary device may be constituted by or comprise another hearing aid. The hearing system may comprise two hearing aids adapted to implement a binaural hearing system, e.g. a binaural hearing aid system.
In a further aspect, a non-transitory application, termed an APP, is furthermore provided by the present disclosure. The APP comprises executable instructions configured to be executed on an auxiliary device to implement a user interface for a hearing aid or a hearing system described above in the ‘detailed description of embodiments’, and in the claims. The APP may be configured to run on a cellular phone, e.g. a smartphone, or on another portable device allowing communication with said hearing aid or said hearing system.
Embodiments of the disclosure may e.g. be useful in applications such as hearing aids or headsets or earphones or combinations thereof.
The aspects of the disclosure may be best understood from the following detailed description taken in conjunction with the accompanying figures. The figures are schematic and simplified for clarity, and they just show details to improve the understanding of the claims, while other details are left out. Throughout, the same reference numerals are used for identical or corresponding parts. The individual features of each aspect may each be combined with any or all features of the other aspects. These and other aspects, features and/or technical effects will be apparent from and elucidated with reference to the illustrations described hereinafter in which:
Further scope of applicability of the present disclosure will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the disclosure, are given by way of illustration only. Other embodiments may become apparent to those skilled in the art from the following detailed description.
The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations. The detailed description includes specific details for the purpose of providing a thorough understanding of various concepts. However, it will be apparent to those skilled in the art that these concepts may be practiced without these specific details. Several aspects of the apparatus and methods are described by various blocks, functional units, modules, components, circuits, steps, processes, algorithms, etc. (collectively referred to as “elements”). Depending upon particular application, design constraints or other reasons, these elements may be implemented using electronic hardware, computer program, or any combination thereof.
The electronic hardware may include micro-electronic-mechanical systems (MEMS), integrated circuits (e.g. application specific), microprocessors, microcontrollers, digital signal processors (DSPs), field programmable gate arrays (FPGAs), programmable logic devices (PLDs), gated logic, discrete hardware circuits, printed circuit boards (PCB) (e.g. flexible PCBs), and other suitable hardware configured to perform the various functionality described throughout this disclosure, e.g. sensors, e.g. for sensing and/or registering physical properties of the environment, the device, the user, etc. Computer program shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.
The present application relates to the field of hearing devices, e.g. hearing aids or headsets, in particular to noise reduction in such devices.
For a minimum variance distortionless response (MVDR) beamformer, the set of beamformer weights which maximizes the SNR at the output may be found by maximizing the following ratio:

SNR(w) = (wHRTw)/(wHRVw) = σθ2|wHdθ|2/(σV2 wHΓVw),
where σθ2 is the target variance, σV2 is the noise variance, and RV=σV2ΓV, where ΓV is the normalized noise correlation matrix, ΓV=RV/TRACE(RV), where the TRACE-function extracts the diagonal elements of a matrix (here RV) and adds them together. The SNR at the reference microphone is thus given by σθ2/σV2. In the particular case where the target signal is a single point source, the target covariance matrix is given by RT=σθ2ΓT=σθ2dθdθH. We notice that the target covariance matrix (RT) is singular, as it is given by the outer product of dθ. In this particular case, we have a closed-form solution given by

wθ = RV−1dθ/(dθHRV−1dθ),
which is the well-known MVDR beamformer solution.
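By way of illustration, the closed-form solution may be evaluated numerically as in the following minimal Python/numpy sketch (function name and test values are illustrative only and not part of the disclosure):

    import numpy as np

    def mvdr_weights(R_v, d):
        # w = Rv^{-1} d / (d^H Rv^{-1} d); solve() avoids an explicit inverse.
        Rv_inv_d = np.linalg.solve(R_v, d)
        return Rv_inv_d / (d.conj() @ Rv_inv_d)

    # Illustrative M = 2 example:
    d = np.array([1.0, 0.8 * np.exp(-1j * 0.3)])   # steering vector d_theta
    R_v = np.array([[1.0, 0.3 + 0.1j],
                    [0.3 - 0.1j, 1.0]])            # noise covariance matrix RV
    w = mvdr_weights(R_v, d)
    print(np.conj(w) @ d)                          # ~1: distortionless, wH d = 1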
Assuming that the target is only present at a single direction may not always comply with what the listener would like to listen to. In order to cope with that, one strategy is to estimate the most likely target direction, either by finding the most likely steering vector dθ (cf. e.g. EP3413589A1) or by estimating the most likely direction to a point sound source of current interest to the user (cf. e.g. EP3300078A1).
Now we will consider the more general case, where more than one direction could be of interest to the listener, e.g. multiple talkers from (separate) single directions, or everything in the front half-plane. In that case we assume that the target covariance matrix has full rank, e.g. that the target covariance matrix may be a sum of outer products of different steering vectors, i.e.

RT = Σθ σθ2 dθdθH.
In that case, we need to find the set of weights w which maximizes

SNR(w) = (wHRTw)/(wHRVw).
The above problem is well-known and the solution w can be estimated as the eigenvector belonging to the largest generalized eigenvalue.
The weights w may as well be estimated iteratively, e.g. by updating w using a gradient-based optimization.
The gradient is given by

∇wSNR ∝ (RTw − SNR(w)·RVw)/(wHRVw).
Other gradient algorithms may be used. We may find a set of weights w that maximizes the SNR and e.g. fulfils ∇wSNR=0. However, we can only find w up to a scaling, and we thus have to choose how the set of weights that maximizes the SNR should be normalized, e.g. by having a unit gain towards a certain direction, i.e. fulfilling |wHdθ|2=1.
This type of beamforming is often referred to as generalized eigenvector (GEV) beamforming.
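As an illustration of the GEV principle, a minimal sketch (assuming scipy is available; the preferred steering vector used for the normalization is an input chosen by the designer) is:

    import numpy as np
    from scipy.linalg import eigh

    def gev_weights(R_t, R_v, d_pref):
        # Solve the generalized eigenproblem RT v = lambda RV v; eigh returns
        # eigenvalues in ascending order, so the last eigenvector maximizes
        # the generalized Rayleigh quotient (the SNR).
        _, V = eigh(R_t, R_v)
        w = V[:, -1]
        g = np.conj(w) @ d_pref        # wH d_pref
        return w / np.conj(g)          # rescale so that wH d_pref = 1 (unit gain)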
The directional noise reduction system comprises at least one beamformer (BF) for generating at least one beamformed signal (YF) in dependence of (typically complex) beamformer weights (w=[w1, . . . , wM]T) configured to be applied to the multitude (M) of electric input signals, thereby providing the at least one beamformed signal (YF) as a weighted sum of the multitude (M) of electric input signals (x), YF=wHx. To implement the weighted sum of the input signals, the beamformer comprises respective combination units (here multiplication units (‘X’)) for applying the (typically complex) weights (w1, . . . , wM) to the respective electric input signals (Xm, m=1, . . . , M) to provide respective weighted input signals (w1X1, . . . , wMXM). The weights may be complex conjugated before being multiplied onto the input signals, i.e. wH=[w1*, . . . , wM*]. The embodiment of the beamformer of
(cf. the M multiplication units (X) and the sum unit (+) whose output is the beamformed signal (YF))
The hearing device comprises an SNR-estimator (SNR-EST) for estimating an SNR (SNR) of the beamformed signal (YF). The SNR estimator is connected to or forms part of the beamformer (BF) (as in
The hearing device, e.g. (as indicated in
The hearing device (HD) may comprise (or have access to) a number of stored parameters relevant for the optimization of the beamformer weights (w). The hearing device (HD) may comprise memory (MEM) storing such parameters. The parameters may include predetermined values of such parameters, e.g. steering vectors (dθ) (or inter-microphone target covariance matrices (RT)) for a plurality of target positions (θ), and/or inter-microphone noise covariance matrices RV for a multitude of current acoustic environments (preferably of relevance to the user of the hearing device).
The noise reduction system may comprise further noise reduction blocks, e.g. a single-channel postfilter for further reducing noise in the spatially filtered (beamformed) signal (YF), cf. block (PF) with dashed enclosure in
In the embodiment of
The beamformer weights (w) are adaptively optimized to a plurality of target positions (θ) by maximizing a target signal to noise ratio (SNR) for sound from the plurality of target positions (θ). The optimization procedure is schematically indicated by block ‘Amend w to maximize SNR’, which in the embodiment of
The hearing device (e.g. the beamformer SNR-estimation block (SNR-EST)) may be configured to adaptively estimate steering vectors (dθ) for the plurality of target positions (θ) (or corresponding inter-microphone target covariance matrices RT), and/or inter-microphone noise covariance matrices RV in dependence of the voice activity control signal (VLAB) being labelled target (T) or noise (N), cf. arrow from the voice activity detector (VAD) to the SNR-estimation block (SNR-EST). The SNR-estimation block (SNR-EST) may include respective level detectors and/or smoothing units (e.g. low-pass filters) for smoothing the current estimates of target and noise over time (e.g. in the beamformed signal and/or in the electric input signals, etc.).
The hearing device may further comprise an audio signal processor (e.g. a hearing aid processor) (SPRO) for applying one or more processing algorithms to the beamformed (possibly further noise reduced) signal (YF, YNR), e.g. a compressive amplification algorithm that adapts a dynamic input level to fit a range of audible sound levels of the user and applies corresponding frequency dependent gains to the input signal (e.g. the beamformed signal (YF) (or the optionally further noise reduced signal YNR) in the embodiment of
The hearing device may further comprise a synthesis filter bank (FBS) configured to convert a signal (OUT) in the time-frequency domain to a signal (out) in the time domain.
The hearing device further comprises an output transducer (here a loudspeaker (SPK)) for converting the processed output signal (out) to stimuli perceivable as sound to the user. The output transducer may (e.g. in a ‘headset application’) alternatively or additionally comprise an audio transmitter for transmitting the processed signal (out) to another device or system. The output transducer may e.g. comprise a vibrator of a bone-conducting hearing aid. The output transducer may e.g. comprise an electrode array of a cochlear implant type of hearing aid (in which case the synthesis filter bank can be dispensed with).
The hearing aid may comprise other functional units, e.g. one or more detectors, e.g. a classifier of the current acoustic environment around the user, that may be used to improve the quality of the stimuli presented to the user, e.g. to increase the user's listening comfort and/or to increase an intelligibility of speech content in the audio signals picked up or received by the hearing device.
Different expressions of the signal to noise ratio used to optimize the beamformer weights are proposed in the following.
In an embodiment according to the present disclosure, the beamformer weights (w) are adaptively optimized to the plurality of target positions (θ) by maximizing a target signal to noise ratio (SNR) for sound from the plurality of target positions (θ), wherein the target signal to noise ratio (SNR) is expressed in dependence of first and second output variances (|YT|2, |YV|2) (or time averaged versions thereof <|YT|2>, <|YV|2>) of the at least one beamformer (e.g. of the at least one beamformed signal (YF), or of the (further) noise reduced signal (YNR) from a postfilter (PF)).
The first and second output variances (|YT|2, |YV|2) (or time averaged versions thereof <|YT|2>, <|YV|2>) are determined when the electric input signals (x) or the at least one beamformed signal (YF (or YNR) in
In an embodiment according to the present disclosure, the beamformer weights (w) are adaptively optimized to a plurality of target positions (θ) by maximizing a target signal to noise ratio (SNR) for sound from the plurality of target positions (θ), wherein the signal to noise ratio is determined in dependence of the beamformer weights (w) and of time averaged products of the multitude of electric input signals X and XH, <XXH>T and <XXH>V, where <·> denotes average over time.
The time averaged products <XXH>T and <XXH>V are determined when the electric input signals (X) are labelled as target (T) and noise (V), respectively. The inner product XHX is equal to X1X1* + . . . + XMXM*, where * denotes complex conjugation.
In an embodiment according to the present disclosure, the beamformer weights (w) are adaptively optimized to a plurality of target positions (θ) by maximizing a target signal to noise ratio (SNR) for sound from the plurality of target positions (θ), wherein the signal to noise ratio is determined in dependence of the beamformer weights (w) and of respective inter-microphone target covariance (RT) and inter-microphone noise covariance (RV) matrices.
One or both of the inter-microphone target covariance (RT) and inter-microphone noise covariance (RV) matrices are either predetermined (e.g. selected in a given acoustic situation among a multitude of predetermined values for known acoustic situations), cf. memory (MEM) and inputs dθ/RT, RV to the SNR-estimation block in
In an embodiment of the present disclosure, the beamformer weights (w) are adaptively optimized to a plurality of target positions (θ) by maximizing a target signal to noise ratio (SNR) for sound from the plurality of target positions (θ), wherein the signal to noise ratio is expressed in dependence of the beamformer weights (w) and an inter-microphone target covariance matrix RT. The inter-microphone target covariance matrix RT may be determined in dependence of respective steering vectors (dθ) for the plurality of target positions (θ). The plurality of steering vectors (dθ) may be dynamically determined during use of the hearing device in dependence of the voice activity control signal (VLAB) (cf. arrow to the SNR-estimation block in
The target covariance matrix (RT) may be a fixed (predetermined), full rank matrix. The noise covariance matrix (RV) may be adaptively determined.
The target and/or noise covariance matrices (RT, RV) may be estimated, when the voice activity control signal (VLAB) signal indicates that the input (audio) signal(s) monitored by the voice activity detector is (are) labelled ‘target’ (T=>RT) and ‘noise’ (N=>RV), respectively.
As shown and described in connection with
The 3rd example relates (mainly, but not exclusively) to a hearing device comprising a Generalized Eigenvector beamformer (GEV), in short denoted ‘GEV-beamformer’. In connection with the GEV-beamformer, the target covariance matrix (RT) is assumed to be a full-rank matrix. It is proposed to update the inter-microphone target covariance matrix RT and the inter-microphone noise covariance matrix RV of a beamformer based on voice activity detection, e.g. a) on an estimated direction of arrival of sound from a target sound source, b) on a comparison of signal content provided by a target-maintaining and a target-cancelling beamformer, respectively (e.g. a difference between the two), or c) on speech detection.
As mentioned, maximizing the ratio

SNR(w) = (wHRTw)/(wHRVw)

only finds a set of weights w up to a scaling factor.
One way to cope with that may e.g. be to still impose an undistorted direction constraint for a preferred direction θ*, e.g. scaling w such that

wHdθ* = 1.
Consequently, only a single (real-valued) scaling value of w fulfils the above constraint. This particular scaling ensures that the response towards a preferred target direction θ* is distortionless. However, the SNR may still be optimized even though the current target direction deviates from the preferred target direction.
In MVDR beamforming, for a fixed target direction, we only have to estimate the noise covariance matrix. The target covariance matrix RT=dθdθH is known in advance. The noise covariance matrix (RV) is typically estimated during absence of speech. When we maximize an SNR where the target is assumed not only to impinge from a desired direction, we thus need to estimate the target covariance matrix (RT) as well as the noise covariance matrix (RV).
In a ‘front-back’-framework, it is assumed that sounds impinging from the front are of interest to the listener and sounds impinging from the back are considered as noise. In that case, we aim at optimizing the front-back ratio; we thus have to decide when to update the target covariance matrix and the noise covariance matrix.
For that purpose, it is proposed to use a front-back detector based on the comparison between a beamformer pointing towards the front and a beamformer pointing towards the back. This is illustrated in
The input stage of the hearing device of
Based on a comparison (e.g. on a time-frequency unit level) between a) the magnitude response of a beamformer pointing towards the front (CF, with a null (or a maximum attenuation) pointing towards the back) and b) the magnitude response of a beamformer pointing towards the back ((rear) CB, with a null (or a maximum attenuation) pointing towards the front direction) it can be determined (cf. computation block ‘Compare’ in
For each unit in time and in frequency (t, f) (or l,k) we thus update the target covariance matrix RT, and the noise covariance matrix RV, based on the following criteria
where κF and κB are thresholds as illustrated in
The target (RT) and noise (RV) covariance matrices may be updated recursively as

RT(n) = λTRT(n−1) + (1−λT)x(n)xH(n)

and

RV(n) = λVRV(n−1) + (1−λV)x(n)xH(n),
respectively, where n is the time frame index, and λT, λV are coefficients in the interval [0;1] controlling the exponential decay of the update. Often the coefficient λ∈[0;1[ is expressed in terms of a time constant τ given by

λ = e−1/(τ·Fs),
where Fs is the sample rate of the covariance matrix update.
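A minimal sketch of such a front-back controlled recursive update for one time-frequency unit follows (Python/numpy; the exact form of the threshold comparison, and the threshold and smoothing values, are assumptions for illustration):

    import numpy as np

    def update_covariances(R_t, R_v, x, C_f, C_b,
                           kappa_f=2.0, kappa_b=2.0, lam_t=0.99, lam_v=0.99):
        # Label the unit 'target' or 'noise' from the power ratio of the
        # front- and back-pointing beamformer outputs, then update the
        # corresponding covariance matrix recursively.
        outer = np.outer(x, np.conj(x))                      # instantaneous x xH
        if np.abs(C_f) ** 2 > kappa_f * np.abs(C_b) ** 2:    # front dominates
            R_t = lam_t * R_t + (1.0 - lam_t) * outer
        elif np.abs(C_b) ** 2 > kappa_b * np.abs(C_f) ** 2:  # back dominates
            R_v = lam_v * R_v + (1.0 - lam_v) * outer
        return R_t, R_v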
The advantage of using cardioid directivity patterns for comparison (as illustrated in
where gθ is a possible weight of the signal impinging from the direction θ, and dθ is a relative transfer function from the direction θ, known in advance. (The terms ‘relative transfer function’ and ‘steering vector’ are used interchangeably and denoted dθ.) A special case of the above equation is the super cardioid maximizing the front-back ratio, where the target directions are given by all directions from the frontal half-plane in a diffuse noise field, and the noise directions are given by all the directions from the rear half-plane in a diffuse noise field.
The text-book definition of the front-back ratio (FBR) is given by the ratio between all signals impinging from the frontal half-plane and the signals impinging from the back half-plane. Implicitly, the FBR assumes that signals in the frontal half-plane are of interest (target) and signals from the back are of less interest (noise). The FBR is e.g. defined in [Simmer et al.; 2001].
The above examples are shown for the case of two microphones, but they also hold for more than two microphones.
Further inputs to the neural network may be a voice activity control signal (VAD(k,l)) as shown in
In the embodiment of
The embodiments of
The advantage of applying a neural network is that joint decisions can be made taking into account information across frequency channels.
Whereas the MVDR beamformer is the optimal solution for enhancing a single target direction in a noise field, the GEV beamformer can regard multiple directions as target directions simultaneously. This is illustrated in
The direction-based decision may be combined with a voice-based criterion, e.g. one or more of: a) RT is only updated when the sound is from the front and voice is present, b) RV may only be updated in the absence of voice, c) RV may only be updated in the absence of voice and/or when the sound is from the rear half-plane.
The directivity index (DI) is given by the ratio between the response (R) of the target direction θ0 and the response of all other directions:
The front-back ratio (FBR) is defined as the ratio between the responses (R) of the front half plane and the responses of the back half plane:
where pθ is a weight based on the estimated probability (or likelihood) of the given direction θ, where pθ may be estimated as e.g. disclosed in EP3413589A1, or e.g. estimated by a trained neural network. The noise covariance matrix (RV) is estimated during speech absence (as e.g. indicated by a voice activity detector).
In this particular case, we assume that target directions from both 0 degrees and 180 degrees are likely, and we thus obtain a full rank target covariance matrix as Rs=d0d0H+d180d180H. As we see from
In the following, the connection between the minimum variance distortionless response (MVDR) beamformer weights and the generalized sidelobe canceller (GSC) weights is established. It is shown that the GSC structure with a single adaptive parameter can be used even though the target look vector (steering vector) is dynamically updated.
Given an acoustic transfer function h(k) (within the k'th frequency channel), we can calculate the normalized look vector d(k) = h(k)/∥h(k)∥,
such that |d|2=1.
The MVDR beamformer is designed such that noise is suppressed maximally under the constraint that the signal from the target direction is passed through distortionless.
It can be shown (see e.g. [Bitzer &Simmer; 2001], [Souden et al.; 2010]) that the filter coefficients of the MVDR beamformer can be expressed by:
where RV(k) is the inter-microphone noise covariance matrix, and the vector d(k) is the look vector corresponding to the acoustic transfer function to the target, normalized with respect to the reference microphone.
The equation above is the same as
where
as
The above equations are valid for any number of microphones M>1. For the special case of (M=) 2 microphones, it can be shown that the MVDR filter output YF may be expressed in terms of the fixed beamformer outputs C0 and C1, as
where the complex scalar β is given by
where C0(k)=w0H(k)x(k) and C1(k)=w1H(k)x(k), where
noting that w0Hw1=0. Notice that this is only one example of how the beamformer weights may be selected. C1(k) just has to fulfil that the signal from a desired target direction is cancelled, and C0(k) is selected in order to have a given response in the desired target direction, e.g., a unit response. This is actually a special case of the generalized sidelobe canceller, where we have
where a typically is an M×1 delay-sum beamformer vector not altering the target signal direction, B is a blocking matrix of size M×(M−1), and β is an (M−1)×1 adaptive filtering vector with the adaptive coefficients, which for the MVDR solution is given by [Bitzer & Simmer; 2001]

β=(BHRVB)−1BHRVa,
where a and B are orthogonal to each other, i.e. aHB=01×(M-1), and β is updated when speech is not present. The beamformer weights are thus calculated as
Notice that we in the following are using a slightly different notation, where the adaptive coefficient is conjugated compared to the definition of β from [Bitzer & Simmer; 2001].
In the special case of two microphones, we have
where (omitting the frequency index k)
We notice that we may find β either directly from the signals C0 and C1 or we may find β from the noise covariance matrix Rv, i.e.
according to the particular design. If, for example, the signals C0 and C1 are used elsewhere in the device in question, it might be advantageous to derive β directly from these signals
but, on the other hand, if it is necessary to change the look direction (steering vector d) (and hereby the weights w0 and w1), it is a disadvantage that the weights are included inside the expectation operator. In that case, it is an advantage to derive β directly from the noise covariance matrix.
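Both options may be sketched as follows for the two-microphone case (minimal Python/numpy sketch; whether β or its complex conjugate appears depends on the chosen notation, cf. the remark on [Bitzer & Simmer; 2001] above):

    import numpy as np

    def beta_from_signals(C0, C1):
        # beta = <C1 C0*> / <|C1|^2>, averaged over noise-only frames.
        return np.mean(C1 * np.conj(C0)) / np.mean(np.abs(C1) ** 2)

    def beta_from_covariance(R_v, a, b1):
        # beta = (b1H RV b1)^{-1} b1H RV a (for M = 2 the blocking matrix is b1).
        return (np.conj(b1) @ R_v @ a) / np.real(np.conj(b1) @ R_v @ b1)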
The parameter β may be estimated adaptively. In that case, the least-square error is estimated. Given the output Y of the MVDR (GSC) beamformer
the least-square error (err, omitting the frequency band index k) can be written as
err = <YY*> = <|Y|2>.
In terms of C0, C1 and β, we can re-write |Y|2 as
Recalling that a complex (Wirtinger) derivative is given by

∂f/∂β* = ½(∂f/∂ℜ(β) + j·∂f/∂ℑ(β)),

where ℜ and ℑ indicate the real and imaginary part, respectively, of the (complex) parameter β, we thus find
In the case of two microphones, we may remove the expectation, as smoothing is obtained by the recursive update of β. We thus have
where ‘j’ is the complex unit having the property j2=−1. In order to minimize |Y|2, we thus update β in the negative gradient direction, i.e.
where n is a time index. Rather than using the LMS algorithm, we may obtain a faster update using the normalized LMS (NLMS) algorithm given by
Other normalization schemes may also be applied. E.g., we may normalize the gradient by the output:
Also, as we apply many small steps, the accuracy of the division is not required to be as high as when estimating β in a single step; we may thus normalize using an approximation of the denominator (e.g., rounding |C1|2 to the nearest 2N, where N is an integer). Hereby the division can be implemented by a shift operator. In order to avoid dividing by zero, a small value may be added to the denominator. In an embodiment, the multiplied beamformer signals are averaged across time frames.
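A sketch of one NLMS step with a power-of-two approximated denominator, so that the division reduces to a shift in fixed-point arithmetic, could be as follows (the step size, the regularization constant and the convention Y=C0−βC1 are assumptions):

    import numpy as np

    def nlms_beta_step(beta, C0, C1, mu=0.1, eps=1e-12):
        # Error signal Y = C0 - beta*C1; move beta against the gradient of
        # |Y|^2, normalized by |C1|^2 rounded to the nearest power of two.
        Y = C0 - beta * C1
        denom = np.abs(C1) ** 2 + eps                  # eps avoids division by zero
        denom_pow2 = 2.0 ** np.round(np.log2(denom))   # nearest 2^N -> shift
        return beta + mu * np.conj(C1) * Y / denom_pow2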
The illustrated GSC-implementation of the MVDR beamformer (GSC, cf. dashed outline enclosing the beamformers a, b1, b2 and combination units CU1, CU2, CU3, CU4) comprises a blocking matrix B, wherein each row of B corresponds to the weights of two independent target cancelling beamformers, i.e.
The two target-cancelling beamformers can be written as
where x = [X1, X2, X3]T comprises the noisy input signals from the three microphones (M1, M2, M3, respectively, in
Similarly, we can write the noisy distortionless target signal as
C0 = aHx,
where
Now we consider the adaptive update coefficient β=[β1, β2]T=(BH RVB)−1BH RVa, where superscript T denotes transposition. We notice that the coefficient depends on the noise covariance matrix given by
RV = <xxH>,

where <·> denotes the (time-) average operator (i.e. average in absence of target, ‘VAD=0’). Similar to the two-microphone case, we can avoid estimating the covariance matrix. We re-write β as
Notice that in the two-microphone case (M=2), β = <CCH>−1<CC0*> reduces to

β = <C1C0*>/<|C1|2>,

which (apart from the complex conjugation) is similar to the definition of β used otherwise in this application for a two-microphone GSC beamformer, and to the definition provided in [Bitzer & Simmer; 2001].
For the three-microphone case, we may further re-write the above equation as
We thus have
As <|C1|2><|C2|2> ≥ <C2C1*><C1C2*>, we may add a constant to the denominator in order to avoid dividing by zero and hereby limit the size of β.
From the above, it is clear that in order to calculate β we need to average across two real terms (<|C1|2> and <|C2|2>) and three complex terms (<C2C1*> = <C1C2*>*, <C2C0*> and <C1C0*>). This is one average less compared to averaging each element of the covariance matrix (3 real averages and 3 complex averages), and thus advantageous when minimizing computational complexity (and thus power consumption) is a priority, as e.g. in hearing aids. In general, for M microphones, we need one real average less than when calculating the covariance matrix directly. This can be explained by the fact that the noise covariance matrix is present in both the numerator and the denominator. Therefore, we only need to estimate the covariance up to a scaling, which in the beamformer multiplication implementation saves a single real average.
The target preserving beamformer (C0) is given by the weights a. The weights may be obtained from the target steering vector d as
The weights of the target cancelling beamformers can be found by selecting M−1 rows from the matrix given by
e.g.
Other options (than the example above) for selecting the beamformer weights b1 and b2, which both cancel the signal from the target direction, exist.
In this case, each target cancelling beamformer becomes a linear combination of all three microphones. We notice that aH B=0.
Alternatively, we may construct the blocking matrix solely from two independent two-microphone beamformers (similar to a Griffiths-Jim beamformer).
We may e.g., select the two first order beamformers in the blocking matrix in the following way (e.g. by removing input signal X3 to b1 and input signal X2 to b2):
We define
The weights of the two first order beamformers thus become
Hereby the structure of the blocking matrix becomes
We notice that aH B=0 is still true.
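The construction may be sketched as follows (Python/numpy; the steering vector values, the choice a = d/(dHd) and the particular first order weights are illustrative assumptions):

    import numpy as np

    # Assumed (normalized) steering vector for the target direction:
    d = np.array([1.0, 0.9 * np.exp(-1j * 0.2), 0.8 * np.exp(-1j * 0.4)])
    d = d / np.linalg.norm(d)

    a = d / (np.conj(d) @ d)           # target maintaining weights, aH d = 1

    # Two first order (pairwise) target cancelling beamformers, b1 using
    # microphones (1,2) and b2 using microphones (1,3):
    b1 = np.array([np.conj(d[1]), -np.conj(d[0]), 0.0])
    b2 = np.array([np.conj(d[2]), 0.0, -np.conj(d[0])])
    B = np.column_stack([b1, b2])      # blocking matrix, M x (M-1)

    print(np.allclose(np.conj(b1) @ d, 0.0))   # target cancelled: b1H d = 0
    print(np.allclose(np.conj(a) @ B, 0.0))    # aH B = 0 still holds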
The three-microphone generalized sidelobe canceller based on two first order target cancelling beamformers is illustrated in
In general, it can be argued that the battery capacity is not well spent on powering three microphones in all situations. We may apply three-microphone-based beamforming only in the most complex environments. Which microphones to fade away from may depend on the microphone configuration in the hearing device. It may be desirable to fade towards the two microphones whose axis is most parallel to a front-rear axis in an ordinary situation where the user focuses attention on a sound source in the environment. In an own-voice pickup (telephone) mode of operation of the hearing aid, a different two-microphone configuration may be selected. In case one of the microphones is detected not to work, the two still functional microphones may preferably be selected.
In a three-microphone input system (first mode of operation), the GSC-structure comprises a (e.g. fixed) three-input target maintaining beamformer (a) and first and second (e.g. fixed) two-input target cancelling beamformers (b1, b2), whereas in a two-microphone input system the GSC-structure comprises a (one, e.g. fixed) two-input target maintaining beamformer (a) and a single (e.g. fixed) two-input target cancelling beamformer (b1).
The starting point is a system comprising 3 inputs to the target maintaining (TM) beamformer (a) and 2 inputs to each of the target cancelling (TC) beamformers (b1, b2).
We then remove an input (e.g. by deactivating a microphone) to the target maintaining as well as to the target cancelling beamformers, whereby the TC-beamformers that previously received the ‘removed input’ are ‘cancelled’ (input signals are faded to zero), whereas the two remaining inputs to the TM-beamformer are faded to increase their input levels (to avoid artefacts).
The fading from more to fewer (e.g. three to two) input audio data streams to a (e.g. target maintaining) beamformer over a certain fading time period may e.g. comprise that an input stage is configured to provide the more (e.g. three) data streams as input signals to the directional noise reduction system at a first point in time t1 and to provide the fewer (e.g. two) data streams as input signals at a second point in time t2, where the second time t2 is larger than the first time t1. The fading time Δtfad=t2−t1 may e.g. be smaller than a predefined time range, e.g. Δtfad<20 s, or <10 s, such as <5 s, e.g. between 1 s and 5 s.
The fading process may comprise determining respective fading parameters ((t1, t2), α(t1), α(t2)) of a fading curve that gradually decreases (or increases) a weight of affected input signals, cf. e.g. fading curves (α vs. t) in the upper and lower parts of
In a directional system comprising three microphones as in
In the example of
As an example, imagine that the target maintaining beamformer is a=[⅓, ⅓, ⅓]T in a three-input configuration. When fading out the third microphone, we have

aHx = ⅓·x1 + ⅓·x2 + ⅓·0,

if we do not change the weights. Instead, we propose to change (e.g. fade) a to [½, ½, 0]T before turning the third microphone off.
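A minimal sketch of such a weight fade (a linear fading curve is assumed here; the disclosure leaves the particular fading curve open):

    import numpy as np

    def fade_weights(a_start, a_end, n_steps):
        # Yield intermediate weight vectors from a_start to a_end.
        for n in range(n_steps + 1):
            alpha = n / n_steps
            yield (1.0 - alpha) * a_start + alpha * a_end

    a3 = np.array([1/3, 1/3, 1/3])     # three-input configuration
    a2 = np.array([1/2, 1/2, 0.0])     # two-input configuration
    for a in fade_weights(a3, a2, n_steps=4):
        print(a)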
The hearing aid may be configured to store beamformer weights (and fading parameters) in memory that are optimized in advance for the first and second modes of operation of the directional system.
The main advantage of using two-microphone (target-cancelling) beamformers in a three-microphone system is that it becomes easier to fade between a 2-microphone system and a 3-microphone system, as the target cancelling beamformer can be re-used.
The initiation of a transition from the first to the second mode of operation (or from the second to the first mode of operation) may be controlled by the user via a user interface or by a control signal in dependence of an indicator of a complexity of the current acoustic environment around the user.
The trigger for entering the specific mode of operation (change from three to two inputs to the target cancelling beamformers) may e.g. be related to saving power. The trigger may e.g. be based on the input level or another parameter, e.g. provided by a sound scene detector.
The sound scene complexity may trigger fading from two to three microphones. The trigger may e.g. be a function of level, SNR, remaining battery time, or movement.
The MVDR beamformer weight is proportional to

wMVDR ∝ RV−1d.
If the estimated beamformer weights w are not already scaled such that wH d=1, we may normalize the weights such that
Hereby we have
Recall that the weights can be written as
We thus have
For two microphones, we may isolate β1 from w1=a1−β1b11, i.e.
For three microphones, we may easily isolate β1 and β2 in the case, where the blocking matrix is given in terms of first order beamformers,
In that case we have
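For two microphones, the chain ‘GEV weights → normalization → extraction of β1’ may be sketched as follows (Python/scipy; the relation w1 = a1 − β1b11 is used as stated above, and the conjugation convention is an assumption):

    import numpy as np
    from scipy.linalg import eigh

    def beta_from_gev(R_t, R_v, d, a, b1):
        # Dominant generalized eigenvector of (RT, RV):
        _, V = eigh(R_t, R_v)
        w = V[:, -1]
        w = w / np.conj(np.conj(w) @ d)   # normalize so that wH d = 1
        beta1 = (a[0] - w[0]) / b1[0]     # from w1 = a1 - beta1*b11 (b11 != 0)
        return w, beta1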
Besides saving a few multiplications in each row of the blocking matrix, the advantage of basing the target cancelling beamformers on first order cardioids (e.g. in the ‘two-input target cancelling beamformer’ mode of operation of the system) is that it is possible to adjust each beamformer independently (to obtain a deeper target cancellation null) simply by adjusting a single parameter.
Also, it becomes easier to fade from a three-microphone system into a two-microphone system without changing the target cancelling beamformer weights, because we have predefined sets of weights with 3 and 2 weights for each target-cancelling beamformer. When the target cancelling beamformer weight is based on two microphones, it becomes easier to fade to a two-microphone system (or from 2 to 3 microphones), because the target cancelling beamformer will remain the same.
Similar to the two-microphone case, the update coefficients (β=(β1, β2)) may be adaptively estimated for more than two microphones. For the three-microphone case, the least-squares error is estimated. Given the output Y of the 3-input GSC-structure
in the notation of
In terms of C0, C1, C2, β1, and β2, the magnitude squared of the output (|Y|2) can be re-written as
The derivative with respect to the real (ℜ) and imaginary (ℑ) parts of β1 can be derived:
For convenience the above equation can be expressed as
Likewise, the derivative w.r.t β2 can be derived and is given by
We thus notice that the gradient update is similar to the two-microphones case, and the LMS-update of
is given by
where
Rather than using the LMS framework, a faster update may be obtained using the Normalized LMS (NLMS) algorithm given by
The LMS/NLMS formulas for updating the coefficients (β) are valid for any number of microphones M>1.
It may however be worth mentioning that the division in an LMS/NLMS update can be less accurate compared to a division used in a single step estimation. In the NLMS case, however, the division can be implemented simply by a shift operator, which is computationally cheaper. An alternating estimation of the coefficients (β) is proposed in the following.
In order to estimate β, we have several options. We may either estimate β directly as
and
Or we may update β1 and β2 using the NLMS algorithm as
As the direct calculation of β1 and β2 includes quartic (4th order) terms, it is not very easy to implement in fixed-point arithmetic. The NLMS algorithm seems (at first sight) more promising, as only quadratic terms are involved.
However, the NLMS requires careful selection of the learning rate in order to ensure stability.
As an alternative to the solutions considered above, let us revisit the gradients
of the output magnitude squared |Y|2 with respect to the update coefficients β
By setting the gradients=0, we obtain the two expressions:
It is proposed to iteratively alternate between determining β1 (by the above equation, given a previously estimated value of β2) and β2 (by the above equation, given the previously estimated value of β1).
In this solution, we do not have to select a learning rate. We solely need to select a smoothing coefficient for the average operators <·>. In addition, if the β-terms fluctuate during convergence, it may be advantageous to apply a little low-pass filtering to β1 and to β2.
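A sketch of the proposed alternating estimation is given below (Python/numpy; the convention Y = C0 − β1C1 − β2C2 and the plain-mean averaging are assumptions):

    import numpy as np

    def alternating_beta(C0, C1, C2, n_iter=10, smooth=0.0):
        # <.> realized as plain means over the supplied (noise-labelled) frames:
        c11, c22 = np.mean(np.abs(C1) ** 2), np.mean(np.abs(C2) ** 2)
        c10, c20 = np.mean(np.conj(C1) * C0), np.mean(np.conj(C2) * C0)
        c12 = np.mean(np.conj(C1) * C2)
        b1, b2 = 0.0 + 0.0j, 0.0 + 0.0j
        for _ in range(n_iter):
            b1_new = (c10 - b2 * c12) / c11               # grad w.r.t. beta1 = 0
            b2_new = (c20 - b1_new * np.conj(c12)) / c22  # grad w.r.t. beta2 = 0
            # Optional mild low-pass filtering if the terms fluctuate:
            b1 = smooth * b1 + (1.0 - smooth) * b1_new
            b2 = smooth * b2 + (1.0 - smooth) * b2_new
        return b1, b2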
Notice that, from the above equations, we may also insert
and thus isolate β1. Hereby, we again obtain
and
which are identical to the above formulas for β1 and β2.
Consider the signal to noise ratio defined from a target covariance matrix RT and a noise covariance matrix RV:
We may estimate the weights w which maximize the SNR as the eigenvector belonging to the largest generalized eigenvalue (thereby providing a GEV-beamformer). Notice that the weight vector w can be scaled by an arbitrary phase and amplitude and still maximize the SNR. If calculating the eigenvector is computationally too expensive, we may instead provide an approximation by maximizing the SNR using gradient ascent. The gradient is given by
Now, let us consider a more constrained optimization, where we write the weight in terms of a generalized sidelobe canceller.
In the following we assume that we can define specific time-frequency units belonging to either the target signal T or the noise signal V. We may thus estimate the target covariance matrix based on the part of the input signal belonging to the target, RT = <xxH>T. The subscript T denotes that we average over samples belonging to the target signal. Similarly, we may define the noise covariance matrix as RV = <xxH>V, where the subscript V denotes an average based on the samples belonging to the noise.
Recall that the output signal is given by
where C0(k) is a signal providing an unaltered response for a given target direction, and C1(k), . . . , CM-1(k) are M−1 independent target cancelling beamformers. In the following, we disregard the frequency index k. β denotes the adaptive weights.
Recall that the output Y = wHx = (wC0 − β1*wC1 − β2*wC2 − . . . − β(M−1)*wC(M−1))Hx.
As RT can be written as <xxH>T, we may write the SNR as
For simplicity we consider the first order differential beamformer, where
Hereby we can rewrite the SNR loss function in terms of β.
In an aspect, a generalized sidelobe canceller (GSC) comprising a specific location/direction, for which the target is unaltered, is provided. The adaptation coefficients may be updated in order to maximize the SNR of the beamformed signal (rather than the target-direction-signal to-noise-ratio), hereby allowing the target to impinge from more directions than the specific (unaltered) target direction.
We know that the gradient of |Y|2 is given by
In order to find the derivative of the fraction, we use the quotient rule: The derivative of a quotient is the denominator times the derivative of the numerator minus the numerator times the derivative of the denominator, all divided by the square of the denominator, i.e.
We may thus update using gradient ascent (here generalized to M microphones, cf. CT*, CV*)
and
Where
Notice that removing the averages in the above update, assuming that the averaging can instead be applied via the learning rate μ of β, is not going to work (it is mathematically incorrect). The MVDR LMS update without averaging does, however, work, as averaging via the learning rate is valid when there is only a single averaging term.
If we take a closer look at the gradient
we see that the numerator and denominator have the same order (quartic); the LMS update is thus already normalized. However, rather than normalizing by the real scalar <|YV|2>2, we may also consider other normalization strategies:
(or alternatively (preferably) reusing the calculation of averages from the numerator (same as above))
We may also normalize by averaging across the output signal.
Or averaging by the target cancelling beamformers:
(or alternatively applying separate averages (same result))
The LMS update rate may be further increased by adding a momentum term. The momentum term is given by
where ρ>1, e.g. equal to 1.01 or 1.1, and 0<σ<1, e.g. 0.5. Notice that the intervals would be swapped if the cost function was to be minimized rather than maximized. We may also apply a maximum and a minimum value that μ may take, i.e.
By considering the gradient,
we see that in the case where the target signal impinges from the look direction, we have CT1=0, and the gradient reduces to
which only differs from the LMS gradient of the MVDR beamformer by a scalar.
In practice, it may be hard to ensure a stable step size, as small values in the denominator of the gradient may dramatically increase the step size. A more stable update is achieved by moving with a more fixed step size towards the gradient ascent direction, or approximately towards the gradient ascent direction (e.g. using the (e.g. complex) sign-LMS algorithm). An approximate, but cheap, step in the gradient direction is to adapt β with a fixed value in the direction of the sign of the real and imaginary parts, i.e.
In the case of two microphones, the expectation may be omitted:
where YT and YV are the most recent available output signal estimates, and μℜ and μℑ are fixed step sizes.
We also notice that we may easily convert the GEV gradient ascent beamformer into an MVDR gradient ascent beamformer simply by setting the term CT1*YT<|YV|2> to zero.
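A single sign-LMS style step may be sketched as follows (Python/numpy; the instantaneous gradient expression, the convention Y = C0 − βC1 and the step sizes are assumptions consistent with the derivation above):

    import numpy as np

    def sign_lms_beta_step(beta, CT0, CT1, CV0, CV1, mu_re=1e-3, mu_im=1e-3):
        YT = CT0 - beta * CT1            # most recent target-labelled output
        YV = CV0 - beta * CV1            # most recent noise-labelled output
        # Instantaneous quotient-rule gradient of |YT|^2/|YV|^2 w.r.t. beta*
        # (expectations omitted; the fixed step size provides the smoothing):
        g = (-np.conj(CT1) * YT * np.abs(YV) ** 2
             + np.abs(YT) ** 2 * np.conj(CV1) * YV) / (np.abs(YV) ** 4 + 1e-12)
        # Fixed-size step (approximately) in the gradient ascent direction:
        return beta + mu_re * np.sign(g.real) + 1j * mu_im * np.sign(g.imag)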
In order to increase the convergence rate, the step sizes may be updated with a momentum term, i.e. either
and
or directly depending on the (gradient ascent) cost function
and
where ρ is slightly greater than 1, such as 1.01 or 1.1, and 0<σ<1, such as 0.5 or 0.2. We also recommend that an upper bound and a lower bound are set on the step sizes.
The update may also be based on averages, which is the mathematically correct implementation; however, omitting the average operator only has limited consequences for two microphones, as the fixed step size ensures the necessary smoothing:
For three microphones the sign-based LMS is given by
and
Notice that YV and YT both depend on the most recent estimates of β1 and β2.
We may also find the optimal β simply by setting the gradient to zero, i.e. for the two-microphone case:
By rearranging the terms, we have
We thus find β by solving the complex second order polynomial, where the two solutions correspond to the values of β at either the maximum or the minimum of the SNR. The two solutions to the polynomial are given by
where
An example of the SNR plotted as function of the real and imaginary parts of β is shown in
And we see that the solution reduces to the well-known
To summarize: we may estimate a β that maximizes an SNR, given two target maintaining beamformer signals, CT0 and CV0, and two target cancelling beamformer signals, CT1 and CV1, where CT0 and CT1 are updated when the target is present and CV0 and CV1 are updated when the target is absent. It is not required that the target is solely impinging from the look direction, but the output signal is still distortionless with respect to the selected steering vector. Whether the current signal is defined as either target or noise can be determined by a voice activity detector, a DOA detector, or similar.
Based on a voice activity detector (VAD), which enables update of the target estimate when speech is detected from the front direction (e.g. determined by a comparison between a front and a rear cardioid and a voice activity detector, cf. lower signal path in
The voice activity detector (VAD, e.g. implemented using a trained neural network, see e.g.
The front-back comparison may, as illustrated in the lower signal branch of
As we directly aim at maximizing the SNR, our objective function
directly yields an estimate of a signal-to-noise ratio. We may thus base a postfilter gain on this SNR estimate, either as a direct mapping of SNR to gain per frequency, or by training a neural network to map our SNR estimates across frequency to gain values (see e.g. EP3694229A1).
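As an illustration of the first option, the direct per-frequency mapping could e.g. be a Wiener-style gain with a lower limit (the particular mapping and floor value are assumptions, not mandated by the disclosure):

    import numpy as np

    def postfilter_gain(snr_lin, g_min=0.1):
        # Map a linear-scale SNR estimate per frequency band to a gain.
        g = snr_lin / (1.0 + snr_lin)      # Wiener-style mapping
        return np.maximum(g, g_min)        # gain floor limits audible artefacts

    print(postfilter_gain(np.array([0.1, 1.0, 10.0, 100.0])))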
Avoiding Rs=Rv:
In situations with little or no noise, or with a signal from a single direction, there is a risk that the target covariance matrix and the noise covariance matrix (or the corresponding fixed beamformers) may converge towards the same value. This can be avoided by adding a bias to the covariance matrices,
i.e.
and
where σs and σv are small constants. As the microphone signals will always contain microphone noise, it is most important to add a bias to the target covariance matrix, hereby ensuring that the beamformer system will converge towards an MVDR beamformer in the case of low input levels.
The biases may in a similar way be added to the fixed target cancelling beamformers, i.e.
Notice that the target covariance matrix may not necessarily be biased towards a single look direction. The bias may as well be selected as a weighted sum across several target directions θ, i.e.
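A sketch of such biasing (Python/numpy; the particular bias forms, a weighted sum of steering-vector outer products for RT and a scaled identity for RV, are assumptions consistent with the text above):

    import numpy as np

    def bias_covariances(R_t, R_v, d_list, p_list, sigma_s=1e-3, sigma_v=1e-4):
        # Weighted sum of outer products d dH over candidate target directions:
        bias_t = sum(p * np.outer(d, np.conj(d)) for p, d in zip(p_list, d_list))
        R_t = R_t + sigma_s * bias_t
        R_v = R_v + sigma_v * np.eye(R_v.shape[0])   # microphone-noise-like bias
        return R_t, R_v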
Recall the gradient given by
The gradient is given in terms of averages between different beamformer signals. We may as well express the equation in terms of the covariance matrices. As Y = (a−Bβ)Hx and C = BHx, we have
The above modifications are intended to be used in all embodiments comprising beamformers, where SNR of the beamformed signal is maximized (e.g. in connection with GEV or GEV approximated beamformers).
The BTE- and ITE-parts are mechanically and electrically connected via an interconnecting element (IC) comprising an electric cable for electrically connecting electronic circuitry (e.g. a processor) in the BTE-part to electronic circuitry (e.g. the loudspeaker) in the ITE-part.
The processor may comprise a directional noise reduction system according to the present disclosure. The processor may be connected to the microphone system. The directional noise reduction system comprises the at least one beamformer for generating at least one beamformed signal in dependence of beamformer weights (w) configured to be applied to the multitude (M) of electric input signals. The at least one beamformed signal may e.g., be provided as a weighted sum of the multitude (M) of electric input signals provided by the microphone system. The processor may e.g., be configured to adaptively optimize the beamformer weights (w) to a plurality of target positions (θ) by maximizing a target signal to noise ratio (SNR) for sound from the plurality of target positions as provided by the various aspects of the present disclosure.
The hearing aid shown in the embodiment of
As the estimates (PF-SNR) of the SNR estimator (SNR(w)) have to be faster due to the fast changes of the speech signals, the estimates of the covariance matrices are based on much faster averaging (e.g. of the order of 1-10 ms). This update of the covariance matrices (RT, RV) may potentially be controlled by a different voice activity control signal (VAD2) provided by a further voice activity detector. The estimated SNR (PF-SNR) is mapped to a postfilter gain (PFG, cf. output of the postfilter gain block (PF-gain)), which is applied to the beamformed signal (Y) by a multiplication unit (CU, ‘X’), thereby providing the output signal (OUT). The postfilter gain block (PF-gain) may contain smoothing across time as well. The postfilter gain block may be implemented using a neural network trained on examples of SNR estimates and desired gain patterns (cf. e.g. EP3694229A1). The (frequency domain) output signal (OUT) from the combination unit is fed to a synthesis filter bank (FBS) providing a corresponding output signal (out) in the time domain. In a typical hearing aid application, the output of the combination unit (CU), comprising a noise reduced input (audio) signal, would be fed to a processor for applying further processing algorithms to the signal to enhance its value for a user of the hearing aid, e.g. for increasing intelligibility of speech in the signal or for increasing comfort in listening to an environment signal (e.g. music), as e.g. the audio signal processor (SPRO) in
The input signal to the neural network (NN) is shown as a single microphone signal (here an electric input signal in the time-frequency (logarithmic) domain, optionally normalized (Ũ(k,l)), but it may be several signals, or a combination of several signals. The output signal of the neural network may be provided as a time and frequency dependent voice activity control signal (VAD(k,l)) indicative of a voice (e.g. speech) being present or not (or indefinite) in a given time frequency unit (k,l) of the current input signal(s).
It is intended that the structural features of the devices described above, either in the detailed description and/or in the claims, may be combined with steps of the method, when appropriately substituted by a corresponding process.
As used, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well (i.e. to have the meaning “at least one”), unless expressly stated otherwise. It will be further understood that the terms “includes,” “comprises,” “including,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will also be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, but an intervening element may also be present, unless expressly stated otherwise. Furthermore, “connected” or “coupled” as used herein may include wirelessly connected or coupled. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. The steps of any disclosed method are not limited to the exact order stated herein, unless expressly stated otherwise.
It should be appreciated that reference throughout this specification to “one embodiment” or “an embodiment” or “an aspect” or features included as “may” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. Furthermore, the particular features, structures or characteristics may be combined as suitable in one or more embodiments of the disclosure. The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art.
The claims are not intended to be limited to the aspects shown herein but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more.
Related application data: parent application No. 18330416 (US), filed Jun. 2023; child application No. 18732872 (US).