The present invention generally relates to integrated noise reduction for devices having at least one local microphone array.
Hearing loss is a type of sensory impairment that is generally of two types, namely conductive and/or sensorineural. Conductive hearing loss occurs when the normal mechanical pathways of the outer and/or middle ear are impeded, for example, by damage to the ossicular chain or ear canal. Sensorineural hearing loss occurs when there is damage to the inner ear, or to the nerve pathways from the inner ear to the brain.
Individuals who suffer from conductive hearing loss typically have some form of residual hearing because the hair cells in the cochlea are undamaged. As such, individuals suffering from conductive hearing loss typically receive an auditory prosthesis that generates motion of the cochlea fluid. Such auditory prostheses include, for example, acoustic hearing aids, bone conduction devices, and direct acoustic stimulators.
In many people who are profoundly deaf, however, the reason for their deafness is sensorineural hearing loss. Those suffering from some forms of sensorineural hearing loss are unable to derive suitable benefit from auditory prostheses that generate mechanical motion of the cochlea fluid. Such individuals can benefit from implantable auditory prostheses that stimulate nerve cells of the recipient's auditory system in other ways (e.g., electrical, optical and the like). Cochlear implants are often proposed when the sensorineural hearing loss is due to the absence or destruction of the cochlea hair cells, which transduce acoustic signals into nerve impulses. An auditory brainstem stimulator is another type of stimulating auditory prosthesis that might also be proposed when a recipient experiences sensorineural hearing loss due to damage to the auditory nerve.
In one aspect, a method is provided. The method comprises: receiving sound signals with at least a local microphone array of a device, wherein the sound signals comprise at least one target sound; generating an a priori estimate of the at least one target sound in the received sound signals based on a predetermined location of a source of the at least one target sound; generating a direct estimate of the at least one target sound in the received sound signals based on a real-time estimate of a location of a source of the at least one target sound; and generating a weighted combination of the a priori estimate and the direct estimate, wherein the weighted combination is an integrated estimate of the target sound.
In another aspect, a device is provided. The device comprises: a local microphone array configured to receive sound signals, wherein the sound signals comprise at least one target sound; and one or more processors configured to: generate an a priori estimate of the at least one target sound in the received sound signals using only an a priori relative transfer function (RTF) vector, generate a direct estimate of the at least one target sound in the received sound signals using only an estimated RTF vector generated from the received sound signals, and generate a weighted combination of the a priori estimate and the direct estimate, wherein the weighted combination is an integrated estimate of the target sound.
Embodiments of the present invention are described herein in conjunction with the accompanying drawings, in which:
In devices having one or more microphone arrays, such as auditory prostheses (e.g., hearing aids, cochlear implants, bone conduction devices, etc.), multi-microphone noise reduction systems are used to preserve desired sounds (e.g., speech), while rejecting unwanted sounds (e.g., noise). In certain conventional noise reduction systems, a local microphone array (LMA) worn on the recipient (i.e., part of the device) is used to focus on a sound source (e.g., speaker) that is in a predefined direction, such as directly in front of the recipient. While such a noise reduction system may be robust, it is also prone to poor performance in situations where the desired speaker is not in the predefined direction. Examples of such situations may be found in classroom environments or while a recipient is travelling in a motor vehicle. The integrated noise reduction techniques presented herein improve upon these existing noise reduction systems in two ways: (i) by including the ability to focus on a target sound source (e.g., speaker) that is not in the predefined direction and, in certain arrangements, (ii) by including external microphones (XMs) that operate together with the LMA, resulting in further noise reduction compared with using only the LMA.
In certain embodiments presented herein, integrated noise reduction techniques will utilize two separate tuning parameters, one for controlling the sound received from the predefined direction, and the other for the sound received from an estimated direction where the target sound source may be located. In these embodiments, each of these directions can be defined using the LMA and the XMs. In order to define the predefined direction with the LMA and the XMs, a modified version of the improved method of estimation of a transfer function for the XM is used, where the input signals have to undergo a specific series of transformations.
Using one or several XMs along with the LMA can provide significant speech intelligibility improvement, for instance where an XM is close to the desired speaker, or where an XM provides a relevant noise reference. Additionally, the integrated noise reduction techniques presented herein are flexible in that they encompass a wide range of noise reduction options according to the tuning of the system.
For ease of understanding, the following description is organized into several sections. In particular, section II describes a data model, which considers the general case of a local microphone array (LMA) in conjunction with one or several external microphones (XMs), which can be reduced to a single external microphone without compromising the equations provided herein. A transformed domain, as well as a pre-whitened-transformed domain is also introduced in order to simplify the flow of signal processing operations and realize distinct digital signal processing (DSP) block schemes.
In section III, an integrated minimum variance distortionless response (MVDR) beamformer is discussed as applied to a local microphone array. This beamformer leverages both a priori assumptions and estimated quantities. In section IV, the integrated MVDR beamformer is extended to a local microphone array operating together with one or more external microphones, again leveraging both a priori assumptions and estimated quantities.
Consider a noise reduction system that consists of a local microphone array (LMA) of Ma microphones and Me external microphones, providing a total of Ma+Me microphones. Also consider a scenario where there is only one desired/target sound source, such as a target speech source, in a noisy environment. Proceeding to formulate the problem in the short-time Fourier transform (STFT) domain, the received signal can be represented at one particular frequency, k, and one time frame, l, as:
where (dropping the dependency on k and l for brevity), y=[yaT, yeT]T, ya=[ya,1 ya,2 . . . ya,M
In general, the speech component (target sound), x, can be represented in terms of a relative transfer function (RTF) vector such that:
x=as=hs1 (3)
where s1=aa,1s is the speech in a reference microphone of the LMA (without loss of generality, the first microphone is chosen as the reference microphone) and h is the RTF vector defined as:
consisting of an RTF vector corresponding to the LMA signals, ha and an RTF vector corresponding to the XM signals, he. With such a formulation, the noise reduction system will aim to produce an estimate for the speech component in the reference microphone, s1.
The (Ma+Me)×(Ma+Me) speech-plus-noise, noise-only, and speech-only spatial correlation matrices are given respectively as:
$R_{yy} = E\{yy^{H}\}$ (5)
$R_{nn} = E\{nn^{H}\}$ (6)
$R_{xx} = E\{xx^{H}\}$ (7)
where $E\{\cdot\}$ is the expectation operator and $(\cdot)^{H}$ is the Hermitian transpose. It is assumed that the speech components are uncorrelated with the noise components, and hence the speech-only correlation matrix can be found from the difference of the speech-plus-noise correlation matrix and the noise-only correlation matrix:
$R_{xx} = R_{yy} - R_{nn}$ (8)
The speech-plus-noise and noise-only correlation matrices are estimated from the received microphone signals during speech-plus-noise and noise-only periods, using a voice activity detector (VAD). The correlation matrices can also be calculated solely for the LMA signals respectively as Ry
The estimate of the speech component in the reference microphone, z1, is then obtained through the linear filtering of the microphone signals, such that:
$z_{1} = w^{H}y$
where $w = [w_{a}^{T}\ w_{e}^{T}]^{T}$ is the complex-valued filter to be designed.
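The correlation estimates and the linear filtering described above can be sketched for a single frequency bin as follows. This is a minimal illustration under stated assumptions, not the implementation of the described system: the function name `spatial_correlations`, the synthetic data, and the boolean `vad` flags (standing in for the output of a voice activity detector) are all hypothetical.

```python
import numpy as np

def spatial_correlations(frames, vad):
    """Sample estimates of R_yy, R_nn and R_xx for one frequency bin.

    frames: (num_frames, M) complex STFT coefficients, one row per time frame.
    vad:    boolean flags, True where speech is judged active (assumed given).
    """
    frames = np.asarray(frames)
    vad = np.asarray(vad, dtype=bool)
    y_sn = frames[vad]        # speech-plus-noise frames
    y_n = frames[~vad]        # noise-only frames
    # Rows hold y^T, so the average of y y^H over frames is Y^T conj(Y) / N.
    R_yy = y_sn.T @ y_sn.conj() / len(y_sn)
    R_nn = y_n.T @ y_n.conj() / len(y_n)
    R_xx = R_yy - R_nn        # difference of correlation matrices, as in (8)
    return R_yy, R_nn, R_xx

# Synthetic data (illustrative only): one source with RTF vector h plus white noise.
rng = np.random.default_rng(0)
M, N = 4, 5000
h = np.array([1.0, 0.8 - 0.3j, 0.5 + 0.4j, -0.2 + 0.9j])
s = rng.standard_normal(N) + 1j * rng.standard_normal(N)
frames = 0.3 * (rng.standard_normal((2 * N, M)) + 1j * rng.standard_normal((2 * N, M)))
frames[:N] += s[:, None] * h[None, :]      # speech is active in the first N frames
vad = np.arange(2 * N) < N

R_yy, R_nn, R_xx = spatial_correlations(frames, vad)

# Linear filtering z_1 = w^H y, here with a trivial filter that selects the
# reference (first) microphone.
w_ref = np.zeros(M, dtype=complex)
w_ref[0] = 1.0
z1 = frames @ w_ref.conj()
```

With enough frames, the principal eigenvector of `R_xx` points along the RTF vector `h`, which is the property that the estimation methods in the later sections rely on.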
As will be described later, working with the signals in a transformed domain will result in convenient relations to be made and an overall simplification of the flow of signal processing operations. The transformation will be based on an a priori assumed RTF vector for the LMA signals, {tilde over (h)}a (which may or may not be equal to ha). Firstly, an Ma×(Ma−1) unitary blocking matrix Ba for {tilde over (h)}a and an Ma×1 vector ba are defined such that:
where BaHBa=I(M
where Ta=[Ba, ba],TaHTa=IM
The transformed noise signals can also be similarly defined:
It should be understood that this transformation domain is the LMA signals that pass through a blocking matrix and a matched filter, as in the first stage of a generalized sidelobe canceller (GSC) (i.e., the adaptive implementation of an MVDR beamformer), along with the XM signals.
A spatial pre-whitening operation can be defined from the noise-only correlation matrix in the previously described transform domain by using the Cholesky decomposition:
$E\{(T^{H}n)(T^{H}n)^{H}\} = LL^{H}$ (14)
where L is an (Ma+Me)×(Ma+Me) lower triangular matrix. In block form, L can be realized as:
where La and Lx are lower triangular matrices. It should be noted that La corresponds to the LMA signals and is obtained from a Cholesky decomposition of the noise correlation matrix of the LMA signals in the transformed domain, hence:
$E\{(T_{a}^{H}n_{a})(T_{a}^{H}n_{a})^{H}\} = L_{a}L_{a}^{H}$ (16)
A signal vector in the transformed domain can consequently be pre-whitened by pre-multiplying it with $L^{-1}$. Such signal quantities will be denoted with the underbar notation, $\underline{(\cdot)}$. Hence, the signal y in this so-called pre-whitened-transformed domain is given by:
and similarly for n:
The respective correlation matrices are also given by:
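$\underline{R}_{yy} = E\{\underline{y}\,\underline{y}^{H}\}$
$\underline{R}_{nn} = E\{\underline{n}\,\underline{n}^{H}\} = I_{(M_{a}+M_{e})}$
$\underline{R}_{xx} = \underline{R}_{yy} - \underline{R}_{nn}$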
The spatial correlation matrices for the speech and noise and the noise-only, and the speech-only can also be calculated solely for the LMA signals respectively as Ry
The following is also a summary of how the symbolic notation should be interpreted throughout this document:
The MVDR beamformer minimizes the total noise power (minimum variance), while preserving the received signal in a particular direction (distortionless response). This direction is specified by defining the appropriate RTF vector for the MVDR beamformer. Considering only the LMA, the MVDR problem can be formulated as follows (which will be referred to as the MVDRa):
where ha is the RTF vector from (4), which in practice is unknown and hence will be replaced either by a priori assumptions or estimated from the speech-plus-noise correlation matrices. The optimal noise reduction filter is then given by:
Finally, the speech estimate, za,1, from this MVDRa beamformer is obtained through the linear filtering of the microphone signals with the complex-valued filter wa:
za,1=waHya (24)
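As a rough numerical illustration of the MVDRa beamformer, the sketch below uses the standard closed-form MVDR solution, $w_{a} = R_{n_a n_a}^{-1}h_{a} / (h_{a}^{H}R_{n_a n_a}^{-1}h_{a})$, which is assumed here to be the form of the optimal filter. The function name `mvdr_weights` and the example matrices are hypothetical.

```python
import numpy as np

def mvdr_weights(R_nn, h):
    """Standard closed-form MVDR filter: w = R^{-1} h / (h^H R^{-1} h)."""
    r_inv_h = np.linalg.solve(R_nn, h)
    return r_inv_h / np.vdot(h, r_inv_h)   # np.vdot conjugates its first argument

# Illustrative noise correlation matrix (Hermitian positive-definite) and RTF vector.
v = np.array([1, 1j, -1, -1j])
R_nana = np.eye(4) + 0.5 * np.outer(v, v.conj())
h_a = np.array([1.0, 0.8 - 0.3j, 0.5 + 0.4j, -0.2 + 0.9j])  # first entry is 1 (reference microphone)

w_a = mvdr_weights(R_nana, h_a)

# Speech estimate for one frame of microphone signals: z_{a,1} = w_a^H y_a, as in (24).
y_a = np.array([0.3 + 0.1j, -0.2 + 0.5j, 0.7 - 0.4j, 0.1 + 0.2j])
z_a1 = np.vdot(w_a, y_a)

# Output noise power with the beamformer versus the reference microphone alone.
noise_power_mvdr = np.vdot(w_a, R_nana @ w_a).real
noise_power_ref = R_nana[0, 0].real
```

The filter satisfies the distortionless constraint $w_{a}^{H}h_{a} = 1$, and its output noise power cannot exceed that of the reference microphone, because selecting the reference microphone alone is itself a filter that satisfies the constraint.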
In sections III-A and III-B, strategies for designing an MVDRa beamformer using an RTF vector based either on a priori assumptions or estimated from the speech-plus-noise correlation matrices are discussed. Section III-C illustrates an integrated beamformer that integrates the use of a priori assumptions with estimates.
A. Using an a priori Assumed RTF Vector
The MVDRa problem can be formulated as in (22), except using an a priori assumed RTF vector, {tilde over (h)}a=[1 {tilde over (h)}a,2 . . . {tilde over (h)}a,m]T instead of ha. This {tilde over (h)}a can be based on a priori assumptions regarding microphone characteristics, position, speaker location and room acoustics (e.g., no reverberation). Similar to (23), the optimal noise reduction filter is then given by:
The speech estimate, {tilde over (z)}a,1, from this MVDRa with an a priori assumed RTF vector is then:
{tilde over (z)}a,1={tilde over (w)}aHya (26)
This conventional formulation of the MVDRa can also be equivalently posed in the pre-whitened-transformed domain (section II-C). As derived in Appendix A, the speech estimate in this domain is given by:
Where lM
More specifically,
and processing block 112 which applies
to ya,M
to ya,M
The RTF vector may also be estimated without reliance on any a priori assumptions and can be used to enhance the speech regardless of the speech source location. One such method is covariance whitening, which is equivalent to a method based on a Generalized Eigenvalue Decomposition (GEVD).
In such examples, a rank-1 matrix approximation problem can be formulated to estimate the RTF vector for a given set of LMA signals such that:
where ∥.∥F is the Frobenius norm, and {circumflex over (R)}xa,r1 is a rank-1 approximation to (Ry
$\hat{R}_{x_{a},r1} = \hat{\phi}_{x_{a},r1}\,\hat{h}_{a}\hat{h}_{a}^{H}$ (29)
Where ĥa=[1ĥa,2 . . . ĥa,M
As opposed to using the raw signal correlation matrices, the estimation problem of (28) can be equivalently formulated in the pre-whitened-transformed domain. In Appendix B, it is shown that the estimated RTF vector is then:
where pmax is a generalized eigenvector of the matrix pencil {Ry
As was done in section III-A, this filter based on estimated quantities can also be reformulated in the pre-whitened-transformed domain. Leaving the derivations once again to Appendix B, the corresponding speech estimate using the estimated RTF vector is:
where η*ρ*pmax can be considered as the pre-whitened-transformed filter (where {.}* is the complex conjugate), which can be used to directly filter the pre-whitened, transformed signals, ya. These operations can also be realized in a distinct set of signal processing blocks, as illustrated in
More specifically,
The direct speech estimate, {circumflex over (z)}a,1, is an estimate of the target sound (e.g., speech) in the received sound signals, based solely on an estimated RTF vector. The estimated RTF vector is generated using real-time estimates of, for example, the location of the source of the target sound, reverberant characteristics of the target sound source, etc. The direct speech estimate, {circumflex over (z)}a,1, is an example of a direct estimate of at least one target sound in the received sound signals.
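The covariance whitening idea behind the estimated RTF vector can be sketched as follows. For simplicity this sketch works on the raw LMA correlation matrices rather than in the pre-whitened-transformed domain used in the text, so it should be read as an illustration of the general technique only. The function name `estimate_rtf_cw` and the test matrices are hypothetical.

```python
import numpy as np

def estimate_rtf_cw(R_yy, R_nn):
    """Estimate an RTF vector by covariance whitening.

    The noise correlation matrix is factored as R_nn = L L^H, the
    speech-plus-noise matrix is whitened, and the principal eigenvector of
    the whitened matrix is de-whitened and normalised to the reference
    (first) microphone.
    """
    L = np.linalg.cholesky(R_nn)
    L_inv = np.linalg.inv(L)
    R_yy_w = L_inv @ R_yy @ L_inv.conj().T
    R_yy_w = 0.5 * (R_yy_w + R_yy_w.conj().T)   # remove rounding asymmetry
    _, eigvecs = np.linalg.eigh(R_yy_w)         # eigenvalues in ascending order
    p_max = eigvecs[:, -1]                      # principal eigenvector
    h_unnorm = L @ p_max                        # back to the microphone domain
    return h_unnorm / h_unnorm[0]

# Illustrative rank-1 speech model: R_yy = phi * h h^H + R_nn.
v = np.array([1, 1j, -1, -1j])
R_nn = np.eye(4) + 0.5 * np.outer(v, v.conj())
h_true = np.array([1.0, 0.8 - 0.3j, 0.5 + 0.4j, -0.2 + 0.9j])
phi = 2.0
R_yy = phi * np.outer(h_true, h_true.conj()) + R_nn

h_hat = estimate_rtf_cw(R_yy, R_nn)
```

In this idealised case, where the correlation matrices are exact, `h_hat` recovers `h_true`. With matrices estimated from finite data, the estimate degrades, which is one reason the text associates a confidence measure with the estimated RTF vector.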
Described above are two general MVDR approaches, one that imposes a priori assumptions for the definition of the RTF vector in the MVDR filter, and another that involves an estimation of this RTF vector. In conventional arrangements, a choice typically has to be made between one of these approaches with an acceptance of their inevitable drawbacks. However, in accordance with the integrated noise reduction techniques presented herein, both approaches are integrated into one global filter, referred to herein as an “integrated MVDRa beamformer” that exploits the benefits of each approach.
In general, the integrated MVDRa beamformer provides for integrated tunings which allow different “weights” to be applied to each of (1) an a priori assumed representation of target sound within received sound signals (e.g., an a priori estimate of at least one target sound in the received sound signals), and (2) an estimated representation of the target sound within received sound signals (e.g., a direct estimate of at least one target sound in the received sound signal). The weights applied to each of the a priori assumed representation of the target sound and the estimated representation of the target sound are selected based on “confidence measures” associated with each of the a priori assumed representation of the target sound and the estimated representation of the target sound, respectively.
For instance, with the integrated MVDRa beamformer, if the speech source moves outside of the direction defined by an a priori assumed RTF vector, more weight can be given to an estimated RTF vector to account for the loss in performance that would otherwise result from using the a priori assumed RTF vector alone. On the other hand, if the estimated RTF vector becomes unreliable, less weight can be given thereto and the system can revert to using the a priori assumed RTF vector, which may have an improved performance if the speech source is indeed in the direction defined by the a priori assumed RTF vector. Combination/mixing of the a priori assumed RTF vector and the estimated RTF vector is also possible. That is, the tuning parameters can achieve multiple beamformers, i.e. one that relies on a priori assumptions alone, one that relies on estimated quantities alone, or the mixture of both.
One particular tuning of interest may be to place a large weight on an a priori assumed RTF vector, while weighting an estimated RTF vector only when appropriate. This represents a mechanism for reverting to an a priori assumed RTF vector when the estimated RTF vector is unreliable.
In the following, the integrated MVDRa beamformer is briefly derived. If the case is considered where {tilde over (h)}a is defined according to a priori assumptions and ĥa is estimated from (86), an integrated MVDRa cost function can be given as:
where α∈[0, ∞] and β∈[0, ∞] are tuning parameters that control how heavily the respective RTF vectors (i.e., the a priori assumed RTF vector and the estimated RTF vector) are weighted. This cost function is the combination of that of an MVDRa (as in (22)) defined by {tilde over (h)}a and another defined by ĥa, except that the constraints have been softened by α and β.
The solution to (33) is given by:
$w_{a,\mathrm{int}} = f_{pr}(\alpha,\beta)\,\tilde{w}_{a} + f_{est}(\alpha,\beta)\,\hat{w}_{a}$ (34)
where {tilde over (w)}a and ŵa are defined in (25) and (31) respectively.
with the constants:
This integrated MVDRa beamformer reveals that the MVDRa beamformer based on a priori assumptions from (25) and that which is based on estimated quantities from (31) can be combined according to the functions fpr(α, β) and fest(α, β), respectively.
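The combination property can be checked numerically with the sketch below. It assumes a soft-constraint cost of the form $w^{H}R_{nn}w + \alpha|w^{H}\tilde{h}_{a} - 1|^{2} + \beta|w^{H}\hat{h}_{a} - 1|^{2}$, which is one plausible reading of "constraints softened by α and β" and may differ from the exact cost function of (33). The weights `f_pr` and `f_est` are found numerically by least squares rather than from closed-form expressions, and all function names and values are hypothetical.

```python
import numpy as np

def mvdr_weights(R_nn, h):
    """Standard closed-form MVDR filter: w = R^{-1} h / (h^H R^{-1} h)."""
    r_inv_h = np.linalg.solve(R_nn, h)
    return r_inv_h / np.vdot(h, r_inv_h)

def integrated_weights(R_nn, h_pr, h_est, alpha, beta):
    """Minimiser of the ASSUMED soft-constraint cost
    w^H R w + alpha*|w^H h_pr - 1|^2 + beta*|w^H h_est - 1|^2."""
    A = (R_nn
         + alpha * np.outer(h_pr, h_pr.conj())
         + beta * np.outer(h_est, h_est.conj()))
    return np.linalg.solve(A, alpha * h_pr + beta * h_est)

# Illustrative noise correlation matrix and RTF vectors.
v = np.array([1, 1j, -1, -1j])
R_nn = np.eye(4) + 0.5 * np.outer(v, v.conj())
h_pr = np.ones(4, dtype=complex)                               # a priori assumed RTF
h_est = np.array([1.0, 0.8 - 0.3j, 0.5 + 0.4j, -0.2 + 0.9j])   # estimated RTF

w_pr = mvdr_weights(R_nn, h_pr)
w_est = mvdr_weights(R_nn, h_est)
w_int = integrated_weights(R_nn, h_pr, h_est, alpha=2.0, beta=5.0)

# Express w_int as f_pr * w_pr + f_est * w_est (weights found numerically).
basis = np.column_stack([w_pr, w_est])
coeffs = np.linalg.lstsq(basis, w_int, rcond=None)[0]
f_pr, f_est = coeffs

# The integrated output equals the two individual outputs weighted by the
# complex conjugates of f_pr and f_est.
y = np.array([0.3 + 0.1j, -0.2 + 0.5j, 0.7 - 0.4j, 0.1 + 0.2j])
z_int = np.vdot(w_int, y)
z_mix = np.conj(f_pr) * np.vdot(w_pr, y) + np.conj(f_est) * np.vdot(w_est, y)
```

Under this assumed cost, the integrated filter lies exactly in the span of the two MVDR filters, so the integrated speech estimate can be formed by weighting the a priori and direct speech estimates rather than by designing a third filter.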
As in the previous sections, this integrated beamformer can also be expressed in the pre-whitened-transformed domain as follows:
and with the constants equivalently, but alternatively defined as:
where {tilde over (h)}a and ĥa are given in (79) and (88) respectively.
The resulting speech estimate from this integrated beamformer is then given by:
The benefit of this pre-whitened-transformed domain is apparent where, with such an integrated beamformer of (38), {tilde over (w)}a,M
More specifically,
Also shown in
and a processing block 112 which applies
to ya,M
to ya,M
to ya,M
The first branch 113(1) also comprises a first weighting block 116. The first weighting block 116 is configured to weight the a priori speech estimate, {tilde over (z)}a,1, in accordance with the complex conjugate of the function fpr(α, β) (i.e., (35) and (40), above). More generally, the first weighting block 116 is configured to weight the a priori speech estimate, {tilde over (z)}a,1, in accordance with a cost function controlled by a plurality of tuning parameters (e.g., (α, β)). The tuning parameters of the cost function (e.g., fpr(α, β)) are set based on one or more confidence measures 118 generated for the a priori speech estimate, {tilde over (z)}a,1. The one or more confidence measures 118 represent an assessment or estimate of the accuracy/reliability of the a priori speech estimate, {tilde over (z)}a,1, and hence the accuracy of the a priori RTF vector used to generate that speech estimate. The first weighting block 116 generates a weighted a priori speech estimate, shown in
The second branch 113(2) includes a pre-whitened-transformed filter 114, which filters the pre-whitened-transformed signals in accordance with (32). The output of the pre-whitened-transformed filter 114 is a direct speech estimate, {circumflex over (z)}a,1, that is generated based solely on an estimated RTF vector (i.e., an estimate of the speech in the received sound signals, which takes into consideration microphone characteristics and may contain information such as the location and some reverberant characteristics of the speech source). In other words, the direct speech estimate {circumflex over (z)}a,1, is an example of a direct estimate of at least one target sound in the received sound signals.
The second branch 113(2) also comprises a second weighting block 120. The second weighting block 120 is configured to weight the direct speech estimate, {circumflex over (z)}a,1, in accordance with the complex conjugate of the function fest(α, β) (i.e., (36) and (40), above). More generally, the second weighting block 120 is configured to weight the direct speech estimate, {circumflex over (z)}a,1, in accordance with a cost function controlled by a plurality of tuning parameters (e.g., (α, β)). The tuning parameters of the cost function (e.g., fest(α, β)) are set based on one or more confidence measures 122 generated for the direct speech estimate, {circumflex over (z)}a,1. The one or more confidence measures 122 represent an assessment or estimate of the accuracy/reliability of the direct speech estimate, {circumflex over (z)}a,1, and hence the accuracy of the estimated RTF vector used to generate that speech estimate. The second weighting block 120 generates a weighted direct speech estimate, shown in
IV. MVDR with a LMA and XM Signals (MVDRa,e)
Section III, above, illustrates an embodiment in which the integrated beamformer operates based on local microphone array (LMA) signals. As noted above, LMA signals are generated by a local microphone array (LMA) that is part of the device that performs the integrated noise reduction techniques. In the case of auditory prostheses, such as cochlear implants, the LMA is worn on the recipient.
As described further below, the integrated noise reduction techniques described herein can be extended to include external microphone (XM) signals, in addition to the LMA signals. These XM signals are generated by one or more external microphones (XMs) that are not part of the device that performs the integrated noise reduction techniques, but that can nevertheless communicate with the device (e.g., via a wireless connection). The external microphones may be any type of microphone (e.g., microphones in a wireless microphone device, microphones in a separate computing device (e.g., phone, laptop, tablet, etc.), microphones in another auditory prosthesis, microphones in a conference phone system, microphones in a hands-free system, etc.) for which the location of the microphone(s) is unknown relative to the microphones of the LMA. In other words, as used herein, an external microphone may be any microphone that has an unknown location, which may change over time, with respect to the local microphone array.
Extending the techniques herein to the use of LMA signals and XM signals, the integrated beamformer is referred to as the MVDRa,e:
where h is the RTF vector ((4), above) that includes Ma components corresponding to the LMA, ha, and Me components corresponding to the XMs, he, and Rnn is the (Ma+Me)×(Ma+Me) noise correlation matrix:
where the upper left block is the noise correlation matrix from the LMA signals, Rn
with the speech estimate, z=wHy. Since, as noted above, the XMs have an unknown location, which may change over time, with respect to the local microphone array, generally no a priori assumptions can be made about the location of the XMs. Consequently, there are two potential approaches that can be taken in order to find h, namely: (i) only the missing component of the RTF vector corresponding to that of the XM signals is estimated, while the a priori assumed RTF vector for the LMA signals is preserved; or (ii) the entire RTF vector is estimated for the LMA signals and the XM signals. In sections IV-A and IV-B, strategies for both approaches are briefly described.
As previously mentioned, one option for the definition of h for the MVDRa,e is such that the a priori RTF vector for the LMA signals, {tilde over (h)}a, is preserved and only the RTF vector for the XM signals is estimated. Such an RTF will therefore be defined as follows:
It should be noted that although {tilde over (h)} partially contains an estimated RTF vector, this is done with respect to the a priori assumptions set by {tilde over (h)}a, and hence the notation for {tilde over (h)} is kept to be that of an a priori RTF vector (this is further elaborated upon in section IV-E). A method to compute {tilde over (h)}e in the case of one XM using the cross-correlation between the external microphone and a speech reference provided by (26) using a GEVD is outlined below
As in (28) a rank-1 matrix approximation problem can be formulated to estimate an entire RTF vector for a given set of microphone signals such that:
where {tilde over (R)}x,r1 is a rank-1 approximation to Rxx (recall (8)). The a priori assumed RTF vector for the LMA signals can also be included for the definition of {tilde over (R)}x,r1 and hence is given by:
As opposed to using the raw signal correlation matrices, the estimation problem of (45) can be equivalently formulated in the pre-whitened-transformed domain. In Appendix C, it is shown that the estimated RTF vector could be found from a GEVD on the matrix pencil {JTRyyJ, JTRnn
where the selection matrix, Je=[0(M
Finally, this estimate is then used to compute the corresponding MVDRa,e filter with an a priori assumed RTF vector and a partially estimated RTF vector as:
where {tilde over (h)} as defined in (53) can be equivalently represented as:
As was done in section III, this filter can also be reformulated in the pre-whitened-transformed domain. Leaving the derivations once again to Appendix C, the corresponding speech estimate was then found to be:
where
can be considered as a pre-whitened-transformed filter, which can be used to directly filter the last (Me+1) elements of the pre-whitened-transformed signals, i.e. ya,M
More specifically,
Also shown in
In the case where the RTF vector for both the LMA and XM signals is to be estimated, a variation of (45) is considered:
where {circumflex over (R)}x,r1 is a rank-1 approximation to Rxx (without any a priori information):
with {circumflex over (q)}a the estimated RTF vector for the LMA signals and {circumflex over (q)}e the RTF vector for the XM signals.
Once again, it will be convenient to re-frame the problem in the pre-whitened-transformed domain. From the derivations in Appendix D, the estimated RTF vector is given by:
where qmax is a generalized eigenvector of the matrix pencil {Ryy, Rn }, which as a result of the pre-whitening (Rnn−1M
As derived in Appendix D, the corresponding speech estimate in the pre-whitened-transformed domain is given by:
where η*qqmax can be considered as a pre-whitened-transformed filter, which can be used to directly filter the pre-whitened-transformed signals, y.
More specifically,
Also shown in
In the case of the integrated MVDRa for the LMA signals in section III-C, two general approaches for designing the beamformer were considered: one that imposes a priori assumptions for the definition of the RTF vector in the MVDR filter, and another that involves an estimation of this RTF vector. For the MVDRa,e, two analogous approaches can also be considered: one that imposes a priori assumptions for the definition of the RTF vector for the LMA signals while estimating only the RTF vector for the XM signals, and another that estimates the entire RTF vector, including both the LMA and XM signals. Although both approaches involve an estimation, in the approach where only the RTF vector for the XM signals is estimated, the estimation is done in accordance with the a priori assumptions set for the LMA. Therefore, just as in the integrated MVDRa, two general approaches to designing the MVDRa,e according to either a priori assumptions or full estimation can be considered. Consequently, an integrated MVDRa,e beamformer can also be derived in order to integrate the two general approaches. The resulting cost function is:
where {tilde over (h)} is defined from (49) and ĥ from (53). The solution is then:
$w_{\mathrm{int}} = g_{pr}(\alpha,\beta)\,\tilde{w} + g_{est}(\alpha,\beta)\,\hat{w}$ (57)
where {tilde over (w)} and ŵ are given in (48) and (54), respectively.
with the constants:
As in section III-C, this integrated MVDRa,e beamformer also reveals that the MVDRa,e beamformer based on a priori assumptions from (48) and that which is based on estimated quantities from (54) can be combined according to the functions gpr(α, β) and gest(α, β), respectively.
This integrated beamformer can also be expressed in the pre-whitened-transformed domain as follows:
and the constants equivalently, but alternatively defined as:
where {tilde over (h)} and ĥ are given in (88) from Appendix C and (97) from Appendix D respectively.
The resulting speech estimate from this integrated beamformer is then given by:
The benefit of the pre-whitened-transformed domain is once again apparent. With such an integrated beamformer, the pre-whitened-transformed signals can be directly filtered accordingly, and then combined with the appropriate weightings as defined by the functions gpr(α, β) and gest(α, β), to yield the respective speech estimate. These functions gpr(α, β) and gest(α, β) can be tuned so as to emphasize the result from an MVDR beamformer that uses either an a priori assumed RTF vector or an estimated RTF vector. This results in a digital signal processing scheme as depicted in
More specifically,
Also shown in
The first branch 513(1) also comprises a first weighting block 516. The first weighting block 516 is configured to weight the speech estimate, {tilde over (z)}1, in accordance with the complex conjugate of the function gpr(α, β) (i.e., (58) and (63), above). More generally, the first weighting block 516 is configured to weight the speech estimate, {tilde over (z)}1, in accordance with a cost function controlled by a plurality of tuning parameters (e.g., (α, β)). The tuning parameters of the cost function (e.g., gpr(α, β)) are set based on one or more confidence measures 518 generated for the speech estimate, {tilde over (z)}1. The one or more confidence measures 518 represent an assessment or estimate of the accuracy/reliability of the speech estimate, {tilde over (z)}1, and hence the accuracy of the partial a priori assumed RTF vector and partial estimated RTF vector used to generate the speech estimate (i.e., using a priori assumptions for the definition of the RTF vector for the LMA signals, while estimating only the RTF vector for the XM signals). The first weighting block 516 generates a weighted a priori speech estimate, shown in
The second branch 513(2) includes the filter 532 (i.e., (55), above), which uses the pre-whitened-transformed signals 509 to generate a direct speech estimate, {circumflex over (z)}1 (i.e., a speech estimate generated using an estimated RTF vector including both the LMA and XM signals). The second branch 513(2) also comprises a second weighting block 520. The second weighting block 520 is configured to weight the direct speech estimate, {circumflex over (z)}1, in accordance with the complex conjugate of the function gest(α, β) (i.e., (59) and (63), above). More generally, the second weighting block 520 is configured to weight the direct speech estimate, {circumflex over (z)}1, in accordance with a cost function controlled by a plurality of tuning parameters (e.g., (α, β)). The tuning parameters of the cost function (e.g., gest(α, β)) are set based on one or more confidence measures 522 generated for the direct speech estimate, {circumflex over (z)}1. The one or more confidence measures 522 represent an assessment or estimate of the accuracy/reliability of the direct speech estimate, {circumflex over (z)}1, and hence the accuracy of the estimated RTF vector including both the LMA and XM signals. The second weighting block 520 generates a weighted direct speech estimate, shown in
With this integrated beamformer for both the LMA and XMs, the decision process is now, as shown in the flowchart of
At 844, after determining whether or not the XM signals should be used, a decision is made as to whether or not the estimated RTF vector is reliable. In other words, a decision can then be made on how much to weight the a priori assumed RTF vector and the estimated RTF vector. This decision is controlled by α and β in the same manner as for the Integrated MVDRa Beamformer from section III-C. In the case where the XM is used, the a priori assumed RTF vector consists of an a priori assumed RTF vector for the LMA signals and an estimated RTF vector for the XM signals, while the estimated RTF vector is for both the LMA and XM signals.
In the second stage of the decision process, it should be noted that in order to simplify the tuning, α and β could be made inversely proportional, and can even be tuned such that gpr(α, β) and gest(α, β) form a convex combination. Alternatively, if it is imposed that α→∞, then this preserves the a priori constraint and it is only β that remains to be tuned, which would be that of a contingency noise reduction strategy. In the case where both α→∞ and β→∞, this corresponds to two hard constraints imposed upon the noise minimization, and is then considered as a linearly constrained minimum variance (LCMV) beamformer. It is also noted for the case of the MVDRa where α→∞, β=0, that the original MVDRa with a priori constraints is achieved. Hence, the original beamformer has not been compromised and can be reverted to at any time with this particular tuning.
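The convex-combination tuning mentioned above can be illustrated with a minimal sketch. It assumes a single hypothetical scalar weight, `confidence_direct`, that stands in for the pair gpr(α, β) and gest(α, β); the actual functions are defined by (58), (59), and (63) and are not reproduced here. The function name and its arguments are illustrative only.

```python
def integrate_estimates(z_prior, z_direct, confidence_direct):
    """Mix an a priori and a direct speech estimate with convex weights.

    confidence_direct is a hypothetical scalar in [0, 1]; it is an
    assumption of this sketch, not the patent's g_pr / g_est functions.
    """
    # Clamp the weight so the two weights always form a convex combination.
    w_direct = min(max(confidence_direct, 0.0), 1.0)
    w_prior = 1.0 - w_direct
    # Weighted sum of the two (possibly complex-valued) estimates.
    return w_prior * z_prior + w_direct * z_direct


# One frequency bin: trust the direct estimate only a little.
z_integrated = integrate_estimates(1 + 1j, 3 - 1j, 0.25)
```

Setting `confidence_direct` to zero returns the a priori estimate unchanged, which mirrors the observation above that the original beamformer with a priori constraints can be recovered by a particular tuning.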
The various noise reduction strategies encompassed by this integrated beamformer are summarized in
The integrated noise reduction techniques presented herein may be implemented in a number of devices/systems that include a local microphone array (LMA) to capture sound signals. These devices/systems include, for example, auditory prostheses (e.g., cochlear implants, acoustic hearing aids, auditory brainstem stimulators, bone conduction devices, middle ear auditory prostheses, direct acoustic stimulators, bimodal auditory prostheses, bilateral auditory prostheses, etc.), computing devices (e.g., mobile phones, tablet computers, etc.), conference phones, hands-free telephone systems, etc.
Referring first to
The cochlear implant 1000 comprises an external component 1002 and an internal/implantable component 1004. The external component 1002 includes a sound processing unit 1012 that is directly or indirectly attached to the body of the recipient, an external coil 1006 and, generally, a magnet (not shown in
The sound processing unit 1012 comprises a local microphone array (LMA) 1013, comprised of microphones 1008(1) and 1008(2), configured to receive sound input signals. In this example, the sound processing unit 1012 may also include one or more auxiliary input devices 1009, such as one or more telecoils, audio ports, data ports, cable ports, etc., and a wireless transmitter/receiver (transceiver) 1011.
The sound processing unit 1012 also includes, for example, at least one battery 1007, a radio-frequency (RF) transceiver 1021, and a processing block 1050. The processing block 1050 comprises a number of elements, including an integrated noise reduction module 1025 and a sound processor 1033. The processing block 1050 may also include other elements that have, for ease of illustration, been omitted from
The integrated noise reduction module 1025 is configured to perform the integrated noise reduction techniques described elsewhere herein. For example, the integrated noise reduction module 1025 corresponds to the integrated MVDRa beamformer 125 and the MVDRa,e beamformer 525, described above. As such, in different embodiments, the integrated noise reduction module 1025 may include the processing blocks described above with reference to
As noted above, the integrated noise reduction techniques, and thus the integrated noise reduction module 1025, generate an integrated speech estimate from sound signals received via at least the LMA 1013. Shown in
Returning to the example embodiment of
As noted, stimulating assembly 1018 is configured to be at least partially implanted in the recipient's cochlea 1037. Stimulating assembly 1018 includes a plurality of longitudinally spaced intra-cochlear electrical stimulating contacts (electrodes) 1026 that collectively form a contact or electrode array 1028 for delivery of electrical stimulation (current) to the recipient's cochlea. Stimulating assembly 1018 extends through an opening in the recipient's cochlea (e.g., cochleostomy, the round window, etc.) and has a proximal end connected to stimulator unit 1020 via lead region 1016 and a hermetic feedthrough (not shown in
As noted, the cochlear implant 1000 includes the external coil 1006 and the implantable coil 1022. The coils 1006 and 1022 are typically wire antenna coils each comprised of multiple turns of electrically insulated single-strand or multi-strand platinum or gold wire. Generally, a magnet is fixed relative to each of the external coil 1006 and the implantable coil 1022. The magnets fixed relative to the external coil 1006 and the implantable coil 1022 facilitate the operational alignment of the external coil with the implantable coil. This operational alignment of the coils 1006 and 1022 enables the external component 1002 to transmit data, as well as possibly power, to the implantable component 1004 via a closely-coupled wireless link formed between the external coil 1006 and the implantable coil 1022. In certain examples, the closely-coupled wireless link is a radio frequency (RF) link. However, various other types of energy transfer, such as infrared (IR), electromagnetic, capacitive and inductive transfer, may be used to transfer the power and/or data from an external component to an implantable component and, as such,
As noted above, the integrated noise reduction module 1025 is configured to generate an integrated speech estimate, and the sound processor 1033 is configured to use the integrated speech estimate to generate stimulation signals for delivery to the recipient. More specifically, the sound processor 1033 (e.g., one or more processing elements implementing firmware, software, etc.) is configured to use the integrated speech estimate to generate stimulation control signals 1036 that represent electrical stimulation for delivery to the recipient. In the embodiment of
The local microphone array (LMA) 1113 comprises microphones 1108(1) and 1108(2) that are configured to convert received sound signals 1116 into LMA signals. Although not shown in
The LMA signals are provided to electronics module 1170 for further processing. In general, electronics module 1170 is configured to convert the LMA signals into one or more transducer drive signals 1180 that activate transducer 1171. More specifically, electronics module 1170 includes, among other elements, a processing block 1150 and transducer drive components 1176.
The processing block 1150 comprises a number of elements, including an integrated noise reduction module 1125 and sound processor 1133. Each of the integrated noise reduction module 1125 and the sound processor 1133 may be formed by one or more processors (e.g., one or more Digital Signal Processors (DSPs), one or more uC cores, etc.), firmware, software, etc. arranged to perform operations described herein. That is, the integrated noise reduction module 1125 and the sound processor 1133 may each be implemented as firmware elements, partially or fully implemented with digital logic gates in one or more application-specific integrated circuits (ASICs), partially or fully in software, etc.
The integrated noise reduction module 1125 is configured to perform the integrated noise reduction techniques described elsewhere herein. For example, the integrated noise reduction module 1125 corresponds to the integrated MVDRa beamformer 125 and the MVDRa,e beamformer 525, described above. As such, in different embodiments, the integrated noise reduction module 1125 may include the processing blocks described above with reference to
The sound processor 1133 is configured to process the integrated speech estimate (generated from one or both of the LMA signals and the XM signals) for use by the transducer drive components 1176. The transducer drive components 1176 generate transducer drive signal(s) 1180 which are provided to the transducer 1171. The transducer 1171 illustrates an example of a stimulation unit that receives the transducer drive signal(s) 1180 and generates vibrations for delivery to the skull of the recipient via a transcutaneous or percutaneous anchor system (not shown) that is coupled to bone conduction device 1100. Delivery of the vibration causes motion of the cochlea fluid in the recipient's contralateral functional ear, thereby activating the hair cells in the functional ear.
User interface 1172 allows the recipient to interact with bone conduction device 1100. For example, user interface 1172 may allow the recipient to adjust the volume, alter the speech processing strategies, power on/off the device, etc. Although not shown in
Mobile computing device 1200 first comprises an antenna 1236 and a telecommunications interface 1238 that are configured for communication on a telecommunications network. The telecommunications network over which the antenna 1236 and the telecommunications interface 1238 communicate may be, for example, a Global System for Mobile Communications (GSM) network, a code division multiple access (CDMA) network, a time division multiple access (TDMA) network, or another kind of network.
The mobile computing device 1200 also includes a wireless local area network interface 1240 and a short-range wireless interface/transceiver 1242 (e.g., an infrared (IR) or Bluetooth® transceiver). Bluetooth® is a registered trademark owned by the Bluetooth® SIG. The wireless local area network interface 1240 allows the mobile computing device 1200 to connect to the Internet, while the short-range wireless transceiver 1242 enables the external device 1206 to wirelessly communicate (i.e., directly receive and transmit data to/from another device via a wireless connection), such as over a 2.4 Gigahertz (GHz) link. It is to be appreciated that any other interfaces now known or later developed including, but not limited to, Institute of Electrical and Electronics Engineers (IEEE) 802.11, IEEE 802.16 (WiMAX), fixed line, Long Term Evolution (LTE), etc., may also or alternatively form part of the mobile computing device 1200.
In the example of
The display screen 1258 is an output device, such as a liquid crystal display (LCD), for presentation of visual information to the cochlear implant recipient. The user interface 1256 may take many different forms and may include, for example, a keypad, keyboard, mouse, touchscreen, display screen, etc. Memory 1260 may comprise any one or more of read only memory (ROM), random access memory (RAM), magnetic disk storage media devices, optical storage media devices, flash memory devices, electrical, optical, or other physical/tangible memory storage devices. The one or more processors 1258 are, for example, microprocessors or microcontrollers that execute instructions for the integrated noise reduction logic 1225 and sound processing logic 1233.
When executed by the one or more processors 1250, the integrated noise reduction logic 1225 is configured to perform the integrated noise reduction techniques described elsewhere herein. For example, the integrated noise reduction logic 1225 corresponds to the integrated MVDRa beamformer 125 and the MVDRa,e beamformer 525, described above. As such, in different embodiments, the integrated noise reduction logic 1225 may include software forming the processing blocks described above with reference to
At 1394, an a priori estimate of the at least one target sound in the received sound signals is generated, wherein the a priori estimate is based at least on a predetermined location of a source of the at least one target sound. At 1396, a direct estimate of the at least one target sound in the received sound signals is generated, wherein the direct estimate is based at least on a real-time estimate of a location of a source of the at least one target sound. At 1398, a weighted combination of the a priori estimate and the direct estimate is generated, where the weighted combination is an integrated estimate of the target sound. Subsequent sound processing operations may be performed in the device using the integrated estimate of the target sound.
In certain embodiments, the a priori estimate of the at least one target sound is generated using only an a priori relative transfer function (RTF) vector generated from the received sound signals. In certain embodiments, the direct estimate of the at least one target sound is generated using only an estimated relative transfer function (RTF) vector for the received sound signals.
In certain embodiments, the weighted combination of the a priori estimate and the direct estimate is generated by weighting the a priori estimate in accordance with a first cost function controlled by a first set of tuning parameters to generate a weighted a priori estimate; and weighting the direct estimate in accordance with a second cost function controlled by a second set of tuning parameters to generate a weighted direct estimate. The weighted direct estimate and the weighted a priori estimate are then mixed with one another. The first set of tuning parameters may be set based on one or more confidence measures associated with the a priori estimate of the at least one target sound, wherein the one or more confidence measures represent an estimate of a reliability of the a priori estimate. The second set of tuning parameters may be set based on one or more confidence measures associated with the direct estimate of the at least one target sound, wherein the one or more confidence measures represent an estimate of a reliability of the direct estimate.
As detailed above, presented herein are integrated noise reduction techniques, sometimes referred to as an integrated beamformer (e.g., an integrated MVDRa beamformer or an integrated MVDRa,e beamformer). In general, the integrated noise reduction techniques combine the use of an a priori (i.e., predetermined, assumed, or pre-defined) location of a target sound source with a real-time estimated location of the sound source.
It is to be appreciated that the above described embodiments are not mutually exclusive and that the various embodiments can be combined in various manners and arrangements.
The invention described and claimed herein is not to be limited in scope by the specific preferred embodiments herein disclosed, since these embodiments are intended as illustrations, and not limitations, of several aspects of the invention. Any equivalent embodiments are intended to be within the scope of this invention. Indeed, various modifications of the invention in addition to those shown and described herein will become apparent to those skilled in the art from the foregoing description. Such modifications are also intended to fall within the scope of the appended claims.
A pre-whitened-transformed version of the a priori assumed RTF vector can be considered where:
where lM
Substitution of (65) into (26) yields the speech estimate as:
II. Appendix B—MVDRa with Estimated RTF Vector
As opposed to using the raw signal correlation matrices, the estimation problem of (28) can be equivalently formulated first in the transformed domain since the Frobenius norm is invariant under a unitary transformation, therefore:
Furthermore, it has been argued that spatial pre-whitening should also be included in the optimisation problem. Consequently, the estimation problem can be re-framed in the pre-whitened-transformed domain as follows:
where Ry
Ryy=PλPH (70)
where P is a unitary matrix of eigenvectors and λ is a diagonal matrix with the associated eigenvalues in descending order. The estimated RTF vector is then defined using the principal (first in this case) eigenvector, pmax:
where the scaling ηp=ea1TTaLapmax and the M×1 vector ea1=[1 0 . . . 0]T.
This estimated RTF vector can now be used as an alternative to {tilde over (h)}a for the MVDRa defined in (25), and is given by:
This filter based on estimated quantities can also be reformulated in the pre-whitened-transformed domain. Starting with the definition of the pre-whitened-transformed version of ĥa:
Hence (72) becomes:
ŵa=TaLa−Hŵa (74)
where
Substitution of (74) into (32) yields the speech estimate as:
Following the procedure as in (68), the transformation is firstly applied, also including the penalty term:
after which the pre-whitening operation can also be included in the optimisation problem:
where Ryy=L−1THRyyTL−H and Rnn=L−1THRnn
where the block dimensions are such that KA is an (Ma−1)×(Ma−1) matrix, KB an (Ma−1)×(Me+1) matrix, KC a (Me+1)×(Ma−1) matrix and Kx,r1 and Kx are (Me+1)×(Me+1) matrices realised as:
where {tilde over (R)}x,r1=L−1TH{tilde over (R)}x,r1TL−H and J=[0(M
and similarly for the second term of Kx+. It follows that (79) then reduces to the following (Me+1)×(Me+1) matrix approximation problem:
The solution then follows from the GEVD on the matrix pencil {JTRyyJ, JTRnnJ} and hence reduces to an EVD of JTRyyJ:
JTRyyJ=VΓVH (84)
where V is an (Me+1)×(Me+1) unitary matrix of eigenvectors and Γ is a diagonal matrix with the associated eigenvalues in descending order. The estimated RTF vector for the XM signals is then defined from the corresponding principal (first in this case) eigenvector, vmax:
where the selection matrix, Je=[0(M
Finally, this estimate is then used to compute the corresponding MVDRa,e filter with an a priori assumed RTF vector and a partially estimated RTF vector, along with the penalty term as:
where {tilde over (h)} as defined in (44) can be equivalently represented as:
This filter can also be realised in the pre-whitened-transformed domain. The pre-whitened-transformed version of {tilde over (h)} can firstly be considered where:
Therefore, (86) can be re-written as:
{tilde over (w)}=TL−H{tilde over (w)} (89)
where:
Therefore, the corresponding speech estimate will be:
Once again, it will be convenient to re-frame the problem in the pre-whitened-transformed domain similarly to (78):
In this case however, the problem cannot be reduced to a lower order as the entire RTF vector is being estimated. Hence the solution follows from an EVD on Ryy:
Ryy=QΣQH (93)
(94)
where Q is a (Ma+Me)×(Ma+Me) unitary matrix of eigenvectors and Σ is a diagonal matrix with the associated eigenvalues in descending order. The estimated RTF vector is then given by the principal (first in this case) eigenvector, qmax:
where ηq=ex1TTLqmax and ex1=[1 0 . . . 0|0 . . . 0]T.
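The eigenvector-based estimate described above can be sketched numerically. This is a minimal illustration under stated assumptions, not the patent's implementation: the transformation T is taken to be the identity, L is assumed to be the Cholesky factor of the noise correlation matrix (so that Rnn = L L^H), and the function name `estimate_rtf` is hypothetical. The scaling step follows the definition of ηq given above.

```python
import numpy as np


def estimate_rtf(R_yy, R_nn):
    """Estimate an RTF vector from the principal eigenvector of the
    pre-whitened correlation matrix (T assumed to be the identity)."""
    # Assumed whitening factor: R_nn = L @ L^H (Cholesky, lower triangular).
    L = np.linalg.cholesky(R_nn)
    L_inv = np.linalg.inv(L)
    # Pre-whitened correlation matrix: L^-1 R_yy L^-H.
    R_w = L_inv @ R_yy @ L_inv.conj().T
    # eigh returns eigenvalues in ascending order, so the principal
    # eigenvector is the last column rather than the first.
    _, eigvecs = np.linalg.eigh(R_w)
    q_max = eigvecs[:, -1]
    # Map back to the signal domain and scale by eta_q = e1^T L q_max so
    # that the reference (first) element of the RTF vector equals one.
    h_unscaled = L @ q_max
    eta_q = h_unscaled[0]
    return h_unscaled / eta_q


# Synthetic check: one source with a known RTF in correlated noise.
h_true = np.array([1.0, 0.5 - 0.2j, -0.3 + 0.4j])
R_nn = np.array([[1.0, 0.2, 0.0],
                 [0.2, 1.5, 0.1],
                 [0.0, 0.1, 0.8]], dtype=complex)
R_yy = 2.0 * np.outer(h_true, h_true.conj()) + R_nn
h_est = estimate_rtf(R_yy, R_nn)
```

With exact (rather than estimated) correlation matrices, as in this synthetic case, the pre-whitened matrix is a rank-one term plus the identity, so the principal eigenvector is proportional to the whitened RTF vector and the scaled result matches `h_true` up to numerical precision.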
The estimated RTF vector can therefore be used as an alternative to {tilde over (h)} for the MVDRa,e:
This filter based on estimated quantities can also be reformulated in the pre-whitened-transformed domain. Starting with the definition for the pre-whitened-transformed version of this estimated RTF:
Hence (96) becomes:
ŵ=TL−H{circumflex over (w)} (98)
where
The corresponding speech estimate using the estimated RTF vector is therefore:
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IB2019/057011 | 8/20/2019 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62723157 | Aug 2018 | US |