The present disclosure relates to a hearing device, and a method of operating a hearing device.
Generally, the speech intelligibility experienced by users of assistive listening devices depends heavily on the specific listening environment. One of the main issues encountered by hearing aid (HA) users is severely degraded speech intelligibility in noisy multi-talker environments, a situation known as the “cocktail party problem”.
Various intrusive methods exist to predict speech intelligibility with acceptable reliability, such as the short-time objective intelligibility (STOI) metric and the normalized covariance metric (NCM).
However, the STOI method and the NCM method are intrusive, i.e., they both require access to the “clean” speech signal. In most real-life situations, such as the cocktail party, the “clean” speech signal is rarely available as a reference speech signal.
Accordingly, there is a need for hearing devices, methods and hearing systems that overcome drawbacks of the background.
A hearing device is disclosed. The hearing device comprises an input module for provision of a first input signal, the input module comprising a first microphone; a processor for processing input signals and providing an electrical output signal based on input signals; a receiver for converting the electrical output signal to an audio output signal; and a controller operatively connected to the input module. The controller comprises a speech intelligibility estimator for estimating a speech intelligibility indicator indicative of speech intelligibility based on the first input signal. The controller may be configured to control the processor based on the speech intelligibility indicator. The speech intelligibility estimator comprises a decomposition module for decomposing the first input signal into a first representation of the first input signal, e.g. in a frequency domain. The first representation may comprise one or more elements representative of the first input signal. The decomposition module may comprise one or more characterization blocks for characterizing the one or more elements of the first representation e.g. in the frequency domain.
Further, a method of operating a hearing device is provided. The method comprises converting audio to one or more microphone input signals including a first input signal; obtaining a speech intelligibility indicator indicative of speech intelligibility related to the first input signal; and controlling the hearing device based on the speech intelligibility indicator. Obtaining the speech intelligibility indicator comprises obtaining a first representation of the first input signal in a frequency domain by determining one or more elements of the representation of the first input signal in the frequency domain using one or more characterization blocks.
It is an advantage of the present disclosure that it allows speech intelligibility to be assessed without having a reference speech signal available. The speech intelligibility is advantageously estimated by decomposing the input signals into a representation using one or more characterization blocks. The representation obtained enables reconstruction of a reference speech signal, and thereby leads to an improved assessment of the speech intelligibility. In particular, the present disclosure exploits the disclosed decomposition and the disclosed representation to improve the accuracy of the non-intrusive estimation of the speech intelligibility in the presence of noise.
A hearing device includes: an input module for provision of a first input signal, the input module comprising a first microphone; a processor for processing the first input signal and providing an electrical output signal based on the first input signal; a receiver for converting the electrical output signal to an audio output signal; and a controller operatively connected to the input module, the controller comprising a speech intelligibility estimator configured to determine a speech intelligibility indicator indicative of speech intelligibility based on the first input signal, wherein the controller is configured to control the processor based on the speech intelligibility indicator; wherein the speech intelligibility estimator comprises a decomposition module configured to decompose the first input signal into a first representation of the first input signal in a frequency domain, wherein the first representation comprises one or more elements representative of the first input signal; and wherein the decomposition module comprises one or more characterization blocks for characterizing the one or more elements of the first representation in the frequency domain.
Optionally, the decomposition module is configured to decompose the first input signal into the first representation by mapping a feature of the first input signal to the one or more characterization blocks.
Optionally, the decomposition module is configured to map the feature of the first input signal to the one or more characterization blocks by comparing the feature with the one or more characterization blocks, and deriving the one or more elements of the first representation based on the comparison.
Optionally, the one or more characterization blocks comprise one or more target speech characterization blocks.
Optionally, the one or more characterization blocks comprise one or more noise characterization blocks.
Optionally, the decomposition module is configured to decompose the first input signal into the first representation by comparing a feature of the first input signal with one or more target speech characterization blocks and/or one or more noise characterization blocks, and determining the one or more elements of the first representation based on the comparison.
Optionally, the decomposition module is configured to determine a second representation of the first input signal, wherein the second representation comprises one or more elements representative of the first input signal, and wherein the decomposition module is also configured to characterize the one or more elements of the second representation.
Optionally, the decomposition module is configured to determine the second representation by comparing a feature of the first input signal with one or more target speech characterization blocks and/or one or more noise characterization blocks, and determining the one or more elements of the second representation based on the comparison.
Optionally, the hearing device is configured to train the one or more characterization blocks.
Optionally, the one or more characterization blocks are a part of a codebook, and/or a dictionary.
A method of operating a hearing device includes: converting sound to one or more microphone signals including a first input signal; obtaining a speech intelligibility indicator indicative of speech intelligibility related to the first input signal; and controlling the hearing device based on the speech intelligibility indicator, wherein the act of obtaining the speech intelligibility indicator comprises obtaining a first representation of the first input signal in a frequency domain by determining one or more elements of the first representation of the first input signal in the frequency domain using one or more characterization blocks.
Optionally, the act of determining the one or more elements of the first representation of the first input signal using the one or more characterization blocks comprises mapping a feature of the first input signal to the one or more characterization blocks.
Optionally, the act of obtaining the speech intelligibility indicator comprises generating a reconstructed reference speech signal based on the first representation, and determining the speech intelligibility indicator based on the reconstructed reference speech signal.
Optionally, the one or more characterization blocks comprise one or more target speech characterization blocks.
Optionally, the one or more characterization blocks comprise one or more noise characterization blocks.
Other features will be described in the detailed description.
The above and other features and advantages will become readily apparent to those skilled in the art by the following detailed description of exemplary embodiments thereof with reference to the attached drawings, in which:
Various exemplary embodiments and details are described hereinafter, with reference to the figures when relevant. It should be noted that the figures may or may not be drawn to scale and that elements of similar structures or functions are represented by like reference numerals throughout the figures. It should also be noted that the figures are only intended to facilitate the description of the embodiments. They are not intended as an exhaustive description of the invention or as a limitation on the scope of the invention. In addition, an illustrated embodiment need not have all the aspects or advantages shown. An aspect or an advantage described in conjunction with a particular embodiment is not necessarily limited to that embodiment and can be practiced in any other embodiment even if not so illustrated, or if not so explicitly described.
Conventional speech intelligibility metrics are intrusive, i.e., they require a reference speech signal, which is rarely available in real-life applications. It has been suggested to derive a non-intrusive intelligibility measure for noisy and nonlinearly processed speech, i.e. a measure which can predict intelligibility from a degraded speech signal without requiring a clean reference signal. The suggested measure estimates clean signal amplitude envelopes in the modulation domain from the degraded signal. However, the measure in such an approach does not allow the clean reference signal to be reconstructed and is not sufficiently accurate compared to the original intrusive STOI measure. Further, the measure in such an approach performs poorly in complex listening environments, e.g. with a single competing speaker.
The disclosed hearing device and methods propose to determine a representation estimated in the frequency domain from the (noisy) input signal. The representation may be for example a spectral envelope. The representation disclosed herein is determined using one or more predefined characterization blocks. The one or more characterization blocks are defined and computed so that they fit or represent the noisy speech signal sufficiently well, and support a reconstruction of the reference speech signal. This results in a representation that is sufficient to be considered as a representation of the reference speech signal, and that enables reconstruction of the reference speech signal to be used for the assessment of the speech intelligibility indicator.
The present disclosure provides a hearing device that non-intrusively estimates the speech intelligibility of the listening environment by estimating a speech intelligibility indicator based on a representation of the (noisy) input signal. The present disclosure proposes to use the estimated speech intelligibility indicator to control the processing of input signals.
It is an advantage of the present disclosure that no access to a reference speech signal is needed to estimate the speech intelligibility indicator. The present disclosure proposes a hearing device and a method that are capable of reconstructing the reference speech signal (i.e. a reference speech signal representing the intelligibility of the speech signal) based on a representation of the input signal (i.e. the noisy input signal). The present disclosure overcomes the lack of availability of, or lack of access to, a reference speech signal by exploiting the input signals, features of the input signals, such as the frequency, the spectral envelope, or autoregressive parameters thereof, and characterization blocks to derive a representation of the input signal, such as a spectral envelope of the reference speech signal, without access to the reference speech signal.
A hearing device is disclosed. The hearing device may be a hearing aid, wherein the processor is configured to compensate for a hearing loss of a user. The hearing device may be a hearing aid, e.g. of a behind-the-ear (BTE) type, in-the-ear (ITE) type, in-the-canal (ITC) type, receiver-in-canal (RIC) type or receiver-in-the-ear (RITE) type. The hearing device may be a hearing aid of the cochlear implant type, or of the bone anchored type.
The hearing device comprises an input module for provision of a first input signal, the input module comprising a first microphone, such as a first microphone of a set of microphones. The input signal is for example an acoustic sound signal processed by a microphone, such as a first microphone signal. The first input signal may be based on the first microphone signal. The set of microphones may comprise one or more microphones. The set of microphones comprises a first microphone for provision of a first microphone signal and/or a second microphone for provision of a second microphone signal. A second input signal may be based on the second microphone signal. The set of microphones may comprise N microphones for provision of N microphone signals, wherein N is an integer in the range from 1 to 10. In one or more exemplary hearing devices, the number N of microphones is two, three, four, five or more. The set of microphones may comprise a third microphone for provision of a third microphone signal.
The hearing device comprises a processor for processing input signals, such as microphone signal(s). The processor is configured to provide an electrical output signal based on the input signals to the processor. The processor may be configured to compensate for a hearing loss of a user.
The hearing device comprises a receiver for converting the electrical output signal to an audio output signal. The receiver may be configured to convert the electrical output signal to an audio output signal to be directed towards an eardrum of the hearing device user.
The hearing device optionally comprises an antenna for converting one or more wireless input signals, e.g. a first wireless input signal and/or a second wireless input signal, to an antenna output signal. The wireless input signal(s) originate from external source(s), such as spouse microphone device(s), a wireless TV audio transmitter, and/or a distributed microphone array associated with a wireless transmitter.
The hearing device optionally comprises a radio transceiver coupled to the antenna for converting the antenna output signal to a transceiver input signal. Wireless signals from different external sources may be multiplexed in the radio transceiver to a transceiver input signal or provided as separate transceiver input signals on separate transceiver output terminals of the radio transceiver. The hearing device may comprise a plurality of antennas and/or an antenna may be configured to operate in one or a plurality of antenna modes. The transceiver input signal comprises a first transceiver input signal representative of the first wireless signal from a first external source.
The hearing device comprises a controller. The controller may be operatively connected to the input module, such as to the first microphone, and to the processor. The controller may be operatively connected to a second microphone if present. The controller may comprise a speech intelligibility estimator for estimating a speech intelligibility indicator indicative of speech intelligibility based on the first input signal. The controller may be configured to estimate the speech intelligibility indicator indicative of speech intelligibility. The controller is configured to control the processor based on the speech intelligibility indicator.
In one or more exemplary hearing devices, the processor comprises the controller. In one or more exemplary hearing devices, the controller is collocated with the processor.
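As a purely illustrative sketch of how a controller might use the speech intelligibility indicator to control the processor, the following Python example maps an indicator value to processing settings. The names `ProcessorSettings` and `control_processor`, the indicator range of 0 to 1, the thresholds and the setting values are all hypothetical assumptions made for illustration and are not specified in the present disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessorSettings:
    """Hypothetical processing parameters selected by the controller."""
    noise_reduction_db: float
    directional_mode: bool


def control_processor(indicator: float,
                      low: float = 0.4,
                      high: float = 0.7) -> ProcessorSettings:
    """Select settings from an indicator assumed to lie in [0, 1].

    Lower indicator values (poorer estimated intelligibility) select
    more aggressive processing. The thresholds are illustrative only.
    """
    if not 0.0 <= indicator <= 1.0:
        raise ValueError("indicator is assumed to lie in [0, 1]")
    if indicator < low:
        return ProcessorSettings(noise_reduction_db=12.0, directional_mode=True)
    if indicator < high:
        return ProcessorSettings(noise_reduction_db=6.0, directional_mode=True)
    return ProcessorSettings(noise_reduction_db=0.0, directional_mode=False)
```

In such a sketch, the controller would call `control_processor` once per estimated indicator value and forward the returned settings to the processor.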
The speech intelligibility estimator may comprise a decomposition module for decomposing the first microphone signal into a first representation of the first input signal. The decomposition module may be configured to decompose the first microphone signal into a first representation in the frequency domain. For example, the decomposition module may be configured to determine the first representation based on the first input signal, e.g. the first representation in the frequency domain. The first representation may comprise one or more elements representative of the first input signal, such as one or more elements in the frequency domain. The decomposition module may comprise one or more characterization blocks for characterizing the one or more elements of the first representation, such as in the frequency domain.
The one or more characterization blocks may be seen as one or more frequency-based characterization blocks. In other words, the one or more characterization blocks may be seen as one or more characterization blocks in the frequency domain. The one or more characterization blocks may be configured to fit or represent the noisy speech signal, e.g. with minimized error. The one or more characterization blocks may be configured to support a reconstruction of the reference speech signal.
The term “representation” as used herein refers to one or more elements characterizing and/or estimating a property of an input signal. The property may be reflected or estimated by a feature extracted from the input signal, such as a feature representative of the input signal. For example, a feature of the first input signal may comprise a parameter of the first input signal, a frequency of the first input signal, a spectral envelope of the first input signal and/or a frequency spectrum of the first input signal. A parameter of the first input signal may be an auto-regressive, AR, coefficient of an auto-regressive model.
In one or more exemplary hearing devices, the one or more characterization blocks form part of a codebook, and/or a dictionary. For example, the one or more characterization blocks form part of a codebook in the frequency domain or a dictionary in the frequency domain.
For example, the controller or the speech intelligibility estimator may be configured to estimate the speech intelligibility indicator based on the first representation, which enables the reconstruction of the reference speech signal. Stated differently, the speech intelligibility indicator is predicted by the controller or the speech intelligibility estimator based on the first representation as a representation sufficient for reconstructing the reference speech signal.
In an illustrative example where the disclosed technique is applied, an additive noise model is assumed for the (noisy) first input signal:
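$$y(n) = s(n) + w(n), \tag{1}$$
where $y(n)$, $s(n)$ and $w(n)$ represent the first input signal (e.g. a noisy sample speech signal from the input module), the reference speech signal and the noise, respectively. The reference speech signal can be modelled as a stochastic autoregressive, AR, process e.g.:
$$s(n) = \mathbf{a}_s(n)^{T}\,\mathbf{s}(n-1) + u(n), \tag{2}$$
where $\mathbf{s}(n-1) = [s(n-1), \ldots, s(n-P)]^{T}$ represents the $P$ past reference speech sample signals, $\mathbf{a}_s(n) = [a_{s_1}(n), \ldots, a_{s_P}(n)]^{T}$ is a vector containing the linear prediction coefficients of the reference speech signal, and $u(n)$ is the excitation of the reference speech signal with excitation variance $\sigma_u^2(n)$. Similarly, the noise can be modelled as an AR process e.g.:
$$w(n) = \mathbf{a}_w(n)^{T}\,\mathbf{w}(n-1) + v(n), \tag{3}$$
where $\mathbf{w}(n-1) = [w(n-1), \ldots, w(n-Q)]^{T}$ represents the $Q$ past noise sample signals, $\mathbf{a}_w(n) = [a_{w_1}(n), \ldots, a_{w_Q}(n)]^{T}$ is a vector containing the linear prediction coefficients of the noise, and $v(n)$ is the excitation of the noise with excitation variance $\sigma_v^2(n)$.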
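The additive noise model with AR speech and AR noise can be simulated in a few lines, which may help when experimenting with the estimation steps described in the following. The Python sketch below is a minimal example under stated assumptions: the excitations are drawn as zero-mean Gaussian samples, the coefficients are held constant over the signal, and the function names `simulate_ar` and `simulate_noisy_input` as well as all numeric values are hypothetical.

```python
import numpy as np


def simulate_ar(coeffs, excitation_var, n_samples, rng):
    """Draw x(n) = sum_k coeffs[k-1] * x(n-k) + e(n).

    The excitation e(n) is assumed zero-mean Gaussian with variance
    excitation_var (an assumption of this sketch).
    """
    coeffs = np.asarray(coeffs, dtype=float)
    order = len(coeffs)
    x = np.zeros(n_samples + order)          # leading zeros act as initial state
    e = rng.normal(0.0, np.sqrt(excitation_var), n_samples)
    for n in range(n_samples):
        past = x[n:n + order][::-1]          # x(n-1), ..., x(n-order)
        x[n + order] = coeffs @ past + e[n]
    return x[order:]


def simulate_noisy_input(a_s, var_u, a_w, var_v, n_samples, seed=0):
    """Return (y, s, w) with y(n) = s(n) + w(n)."""
    rng = np.random.default_rng(seed)
    s = simulate_ar(a_s, var_u, n_samples, rng)   # reference speech
    w = simulate_ar(a_w, var_v, n_samples, rng)   # noise
    return s + w, s, w
```

For example, `simulate_noisy_input([1.5, -0.7], 1.0, [0.5], 0.25, 1000)` produces a noisy signal from a stable second-order speech model and a first-order noise model.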
In one or more exemplary hearing devices, the hearing device is configured to model the input signals using an autoregressive, AR, model.
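One common way to fit an AR model to a frame of samples is the autocorrelation (Yule-Walker) method. The present disclosure does not state which estimator is used, so the following Python sketch is only one possible choice; the function name `estimate_lpc` is hypothetical, and the sign convention assumes that each sample is predicted as a positive weighted sum of past samples.

```python
import numpy as np


def estimate_lpc(frame, order):
    """Estimate AR coefficients and excitation variance of one frame.

    Assumed model: x(n) = sum_k a[k-1] * x(n-k) + e(n).
    Solves the Yule-Walker equations R a = r built from the biased
    autocorrelation estimate of the frame.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Biased autocorrelation r(0), ..., r(order)
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)]) / n
    # Toeplitz autocorrelation matrix with entries r(|i - j|)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    excitation_var = r[0] - a @ r[1:order + 1]
    return a, excitation_var
```

For long frames of a stable AR process, the returned coefficients approach the coefficients that generated the frame.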
In one or more exemplary hearing devices, the decomposition module may be configured to decompose the first input signal into the first representation by mapping a feature of the first input signal into one or more characterization blocks, e.g. using a projection of a frequency-based feature of the first input signal. For example, the decomposition module may be configured to map a feature of the first input signal into one or more characterization blocks using an autoregressive model of the first input signal with linear prediction coefficients relating the frequency-based feature of the first input signal to the one or more characterization blocks of the decomposition module.
In one or more exemplary hearing devices, mapping the feature of the first input signal into the one or more characterization blocks may comprise comparing the feature with one or more characterization blocks and deriving the one or more elements of the first representation based on the comparison. For example, the decomposition module may be configured to compare a frequency-based feature of the first input signal with the one or more characterization blocks by estimating a minimum mean square error of the linear prediction coefficients and of excitation co-variances related to the first input signal for each of the characterization blocks.
In one or more exemplary hearing devices, the one or more characterization blocks may comprise one or more target speech characterization blocks. For example, the one or more target speech characterization blocks may form part of a target speech codebook in the frequency domain or a target speech dictionary in the frequency domain.
In one or more exemplary hearing devices, a characterization block may be an entry of a codebook or an entry of a dictionary.
In one or more exemplary hearing devices, the one or more characterization blocks may comprise one or more noise characterization blocks. For example, the one or more noise characterization blocks may form part of a noise codebook in the frequency domain or a noise dictionary in the frequency domain.
In one or more exemplary hearing devices, the decomposition module is configured to determine the first representation by comparing the feature of the first input signal with the one or more target speech characterization blocks and/or the one or more noise characterization blocks and determining the one or more elements of the first representation based on the comparison. For example, the decomposition module is configured to determine the one or more elements of the first representation as estimated coefficients related to the first input signal for each of the one or more of the target speech characterization blocks and/or for each of the one or more of the noise characterization blocks. For example, the decomposition module may be configured to map a feature of the first input signal into the one or more target speech characterization blocks and the one or more of the noise characterization blocks using an autoregressive model of the first input signal with linear prediction coefficients relating a frequency-based feature of the first input signal to the one or more target speech characterization blocks and/or to the one or more noise characterization blocks. For example, the decomposition module may be configured to compare a frequency-based feature of the estimated reference speech signal with the one or more characterization blocks by estimating a minimum mean square error of the linear prediction coefficients and of excitation co-variances related to estimated reference speech signal for each of the one or more target speech characterization blocks and/or each of the one or more noise characterization blocks.
In one or more exemplary hearing devices, the first representation may comprise a reference signal representation. In other words, the first representation may be related to a reference signal representation, such as a representation of the reference signal, e.g. of the reference speech signal. The reference speech signal may be seen as a reference signal representing the intelligibility of the speech signal accurately. In other words, the reference speech signal exhibits similar properties as the signal emitted by an audio source, such as sufficient information about the speech intelligibility.
In one or more exemplary hearing devices, the decomposition module is configured to determine the one or more elements of the reference signal representation as estimated coefficients related to an estimated reference speech signal for each of the one or more of the characterization blocks (e.g. target speech characterization blocks). For example, the decomposition module may be configured to map a feature of the estimated reference speech signal into one or more characterization blocks (e.g. target speech characterization blocks) using an autoregressive model of the first input signal with linear prediction coefficients relating a frequency-based feature of the estimated reference speech signal to the one or more characterization blocks (e.g. target speech characterization blocks). For example, the decomposition module may be configured to compare a frequency-based feature (e.g. a spectral envelope) of the estimated reference speech signal with the one or more characterization blocks (e.g. target speech characterization blocks) by estimating a minimum mean square error of the linear prediction coefficients and of excitation co-variances related to estimated reference speech signal for each of the one or more characterization blocks (e.g. target speech characterization blocks).
In one or more exemplary hearing devices, the decomposition module is configured to decompose the first input signal into a second representation of the first input signal, wherein the second representation comprises one or more elements representative of the first input signal. The decomposition module may comprise one or more characterization blocks for characterizing the one or more elements of the second representation.
In one or more exemplary hearing devices, the second representation may comprise a representation of a noise signal, such as a noise signal representation.
In one or more exemplary hearing devices, the decomposition module is configured to determine the second representation by comparing the feature of the first input signal with the one or more target speech characterization blocks and/or the one or more noise characterization blocks and determining the one or more elements of the second representation based on the comparison. For example, when the second representation is targeted at representing the estimated noise signal, the decomposition module is configured to determine the one or more elements of the second representation as estimated coefficients related to the estimated noise signal for each of the one or more of the noise characterization blocks. For example, the decomposition module may be configured to map a feature of the estimated noise signal into the one or more of the noise characterization blocks using an autoregressive model of the estimated noise signal with linear prediction coefficients relating a frequency-based feature of the estimated noise signal to the one or more noise characterization blocks. For example, the decomposition module may be configured to compare a frequency-based feature of the estimated noise signal with the one or more noise characterization blocks by estimating a minimum mean square error of the linear prediction coefficients and of excitation co-variances related to the estimated noise signal for each of the one or more noise characterization blocks.
In one or more exemplary hearing devices, the decomposition module is configured to determine the first representation as a reference signal representation and the second representation as a noise signal representation by comparing the feature of the first input signal with the one or more target speech characterization blocks and the one or more noise characterization blocks and determining the one or more elements of the first representation and the one or more elements of the second representation based on the comparisons. For example, the decomposition module is configured to determine the reference signal representation and the noise signal representation by comparing the feature of the first input signal with the one or more target speech characterization blocks and the one or more noise characterization blocks and determining the one or more elements of the reference signal representation and the one or more elements of the noise signal representation based on the comparisons.
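To make the joint comparison with one target speech characterization block and one noise characterization block more concrete, the Python sketch below splits a noisy power spectrum into a speech part and a noise part for a single pair of blocks. It is a simplified stand-in and not the estimator of the present disclosure: it uses an ordinary least-squares fit with clipping to non-negative values, and the function name `fit_excitation_variances` and its arguments are hypothetical.

```python
import numpy as np


def fit_excitation_variances(p_y, env_s, env_w):
    """Fit p_y(w) ~ var_u * env_s(w) + var_v * env_w(w) for one block pair.

    p_y   : power spectrum of the noisy frame on a frequency grid
    env_s : spectral envelope of one target speech characterization block
    env_w : spectral envelope of one noise characterization block
    Least squares with clipping is an assumption of this sketch.
    """
    basis = np.stack([np.asarray(env_s, dtype=float),
                      np.asarray(env_w, dtype=float)], axis=1)
    gains, *_ = np.linalg.lstsq(basis, np.asarray(p_y, dtype=float), rcond=None)
    var_u, var_v = np.clip(gains, 0.0, None)   # variances cannot be negative
    speech_part = var_u * basis[:, 0]          # element of a reference signal representation
    noise_part = var_v * basis[:, 1]           # element of a noise signal representation
    return (var_u, speech_part), (var_v, noise_part)
```

Repeating the fit for every pair of blocks yields one candidate reference signal representation and one candidate noise signal representation per pair.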
In an illustrative example where the disclosed technique is applied, the first representation is considered to comprise an estimated frequency spectrum of the reference speech signal. The second representation comprises an estimated frequency spectrum of the noise signal. The first representation and the second representation are estimated from linear prediction coefficients and excitation variances concatenated in an estimation vector $\theta = [\mathbf{a}_s\ \mathbf{a}_w\ \sigma_u^2(n)\ \sigma_v^2(n)]$. The first representation and the second representation are estimated using a target speech codebook comprising one or more target speech characterization blocks and/or a noise codebook comprising one or more noise characterization blocks. The target speech codebook and/or the noise codebook may be trained by the hearing device using a-priori training data or live training data. The characterization blocks may be seen as related to the spectral shape(s) of the reference speech signal or the spectral shape(s) of the first input signal in the form of linear prediction coefficients. Given the observed vector of the first input signal $\mathbf{y} = [y(0)\ y(1)\ \ldots\ y(N-1)]$ for the current frame of length $N$, the minimum mean square error, MMSE, estimate of the vector $\theta$ may be given as $\hat{\theta} = E(\theta \mid \mathbf{y})$ for the space of the parameters to be estimated, $\Theta$, and may be reformulated using Bayes' theorem as e.g.:
$$\hat{\theta} = \int_{\Theta} \theta\, p(\theta \mid \mathbf{y})\, d\theta = \int_{\Theta} \theta\, \frac{p(\mathbf{y} \mid \theta)\, p(\theta)}{p(\mathbf{y})}\, d\theta. \tag{4}$$
An estimation vector $\theta_{ij} = [\mathbf{a}_s^{i}\ \mathbf{a}_w^{j}\ \sigma_{u,ij}^{2}\ \sigma_{v,ij}^{2}]$ is defined for the $i$th target speech characterization block and the $j$th noise characterization block. Here, $A_s^{i}(\omega)$ and $A_w^{j}(\omega)$ denote the frequency spectra of the $i$th and $j$th vector, i.e. of the $i$th target speech characterization block and the $j$th noise characterization block. The target speech characterization blocks may form part of a target speech codebook and the noise characterization blocks may form part of a noise codebook. Also it is assumed that $\|f(\omega)\| = \int |f(\omega)|\, d\omega$. The spectral envelopes of the target speech codebook, the noise codebook and the first input signal are given by $1/|A_s^{i}(\omega)|^2$, $1/|A_w^{j}(\omega)|^2$ and $P_y(\omega)$, respectively.
In practice, the MMSE estimate of the estimation vector $\theta$ in Eq. 4 is evaluated as a weighted linear combination of $\theta_{ij}$ by e.g.:
$$\hat{\theta} = \frac{1}{N_s N_w} \sum_{i=1}^{N_s} \sum_{j=1}^{N_w} \theta_{ij}\, \frac{p(\mathbf{y} \mid \theta_{ij})}{p(\mathbf{y})}, \qquad p(\mathbf{y}) = \frac{1}{N_s N_w} \sum_{i=1}^{N_s} \sum_{j=1}^{N_w} p(\mathbf{y} \mid \theta_{ij}),$$
where $N_s$ and $N_w$ are the numbers of target speech characterization blocks and noise characterization blocks, respectively. $N_s$ and $N_w$ may be seen as the numbers of entries in the target speech codebook and in the noise codebook, respectively.
The weight of the MMSE estimate of the first input signal, $p(\mathbf{y} \mid \theta_{ij})$, can be computed e.g. from the Itakura-Saito distortion between the first input signal (or noisy spectrum) and the modelled first input signal (or modelled noisy spectrum), $d_{IS}(P_y(\omega), \hat{P}_y^{ij}(\omega))$.
The weighted summation of the LPC is optionally performed in the line spectral frequency, LSF, domain, e.g. in order to ensure stable inverse filters. The line spectral frequency domain is a specific representation of the LPC coefficients having mathematical and numerical benefits. As an example, the LPC coefficients are a low-order spectral approximation, i.e. they define the overall shape of the spectrum. To find the spectrum in between two sets of LPC coefficients, the LPC coefficients are converted to LSF, averaged, and converted back to LPC. Thus, the line spectral frequency domain is a more convenient (but equivalent) representation of the information of the LPC coefficients. The relation between LPC and LSF is similar to that between Cartesian and polar coordinates.
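The weighted combination over all pairs of target speech and noise characterization blocks can be sketched as follows. This Python example is illustrative only and rests on assumptions that are not stated in the present disclosure: the weight of each pair is taken as the exponential of the negative Itakura-Saito distortion, the distortion integral is approximated by a mean over a frequency grid, and the estimation vectors are averaged directly rather than in the line spectral frequency domain. The function names `itakura_saito` and `mmse_estimate` are hypothetical.

```python
import numpy as np


def itakura_saito(p, p_hat):
    """Itakura-Saito distortion between two positive spectra on one grid.

    The integral over frequency is approximated by a mean (assumption).
    """
    ratio = np.asarray(p, dtype=float) / np.asarray(p_hat, dtype=float)
    return float(np.mean(ratio - np.log(ratio) - 1.0))


def mmse_estimate(p_y, thetas, modelled_spectra):
    """Combine per-pair estimation vectors into one weighted estimate.

    p_y              : noisy power spectrum, shape (F,)
    thetas           : estimation vectors theta_ij, shape (Ns, Nw, D)
    modelled_spectra : modelled noisy spectra per pair, shape (Ns, Nw, F)
    Weights are assumed proportional to exp(-d_IS); they are normalised
    so that they sum to one over all pairs.
    """
    thetas = np.asarray(thetas, dtype=float)
    ns, nw, _ = thetas.shape
    d = np.array([[itakura_saito(p_y, modelled_spectra[i][j])
                   for j in range(nw)] for i in range(ns)])
    weights = np.exp(-(d - d.min()))      # subtract the minimum for numerical safety
    weights /= weights.sum()
    # Sum over i and j of weights[i, j] * thetas[i, j, :]
    theta_hat = np.tensordot(weights, thetas, axes=([0, 1], [0, 1]))
    return theta_hat, weights
```

Pairs whose modelled noisy spectrum is close to the observed spectrum receive weights near one, so the estimate is dominated by the best-matching characterization blocks.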
In one or more exemplary hearing devices, the hearing device is configured to train the one or more characterization blocks. For example, the hearing device is configured to train the one or more characterization blocks using a female voice, and/or a male voice. It may be envisaged that the hearing device is configured to train the one or more characterization blocks at manufacturing, or at the dispenser. Alternatively, or additionally, it may be envisaged that the hearing device is configured to train the one or more characterization blocks continuously. The hearing device is optionally configured to train the one or more characterization blocks so as to obtain representative characterization blocks that enable an accurate first representation, which in turn allows a reconstruction of the reference speech signal. For example, the hearing device may be configured to train the one or more characterization blocks using an autoregressive, AR, model.
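Training of characterization blocks can be illustrated with a simple clustering of feature vectors extracted from training speech or training noise. The present disclosure does not specify the training algorithm, so the Python sketch below swaps in plain k-means clustering with a Euclidean distance and a farthest-point initialisation as an assumption; the function name `train_codebook` is hypothetical, and the feature vectors could for example be AR parameters computed per frame.

```python
import numpy as np


def train_codebook(features, n_blocks, n_iter=20):
    """Cluster feature vectors into n_blocks characterization blocks.

    features : array of shape (n_frames, dim), one feature vector per frame
    Returns an array of shape (n_blocks, dim), one row per block.
    k-means with Euclidean distance is an assumption of this sketch.
    """
    features = np.asarray(features, dtype=float)
    # Farthest-point initialisation: start from the first frame, then
    # repeatedly add the frame farthest from all blocks chosen so far.
    blocks = [features[0]]
    while len(blocks) < n_blocks:
        dist = np.min([((features - b) ** 2).sum(axis=1) for b in blocks], axis=0)
        blocks.append(features[dist.argmax()])
    codebook = np.array(blocks)
    for _ in range(n_iter):
        # Assign every frame to its nearest block
        dist = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Move each block to the mean of its assigned frames
        for k in range(n_blocks):
            members = features[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook
```

Separate codebooks could be obtained by running the same routine once on speech features and once on noise features.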
In one or more exemplary hearing devices, the speech intelligibility estimator comprises a signal synthesizer for generating a reconstructed reference speech signal based on the first representation (e.g. a reference signal representation). The speech intelligibility indicator may be estimated based on the reconstructed reference speech signal. For example, the signal synthesizer may be configured to generate the reconstructed reference speech signal based on the first representation being a reference signal representation.
In one or more exemplary hearing devices, the speech intelligibility estimator comprises a signal synthesizer for generating a reconstructed noise signal based on the second representation. The speech intelligibility indicator may be estimated based on the reconstructed noisy speech signal. For example, the signal synthesizer may be configured to generate the reconstructed noisy speech signal based on the second representation being a noise signal representation, and/or the first representation being a reference signal representation.
In an illustrative example where the disclosed technique is applied, the reference speech signal may be reconstructed in the following exemplary manner. The first representation comprises an estimated frequency spectrum of the reference speech signal. The second representation comprises an estimated frequency spectrum of the noise signal. In other words, the first representation is a reference signal representation and the second representation is a noise signal representation. The first representation, in this example, comprises a time-frequency, TF, spectrum of the estimated reference signal, Ŝ. The first representation comprises one or more estimated AR filter coefficients, a_s, of the reference speech signal for each time frame. The reconstructed reference speech signal may be obtained based on the first representation by e.g.:
where
The second representation, in this example, comprises a time-frequency, TF, power spectrum of the estimated noise signal, Ŵ. The second representation comprises estimated noise AR filter coefficients, a_w, of the estimated noise signal that compose a TF spectrum of the estimated noise signal. The estimated noise signal may be obtained based on the second representation by e.g.:
where
The linear prediction coefficients, i.e. a_s and a_w, determine the shape of the envelope of the corresponding estimated reference signal Ŝ(ω) and of the estimated noise signal Ŵ(ω), respectively. The excitation variances, σ̂_u and σ̂_v, determine the overall signal magnitude. Finally, the reconstructed noisy speech signal may be determined as the sum of the reference signal spectrum and the noise signal spectrum (or power spectrum), e.g.:
Ŷ(ω)=Ŝ(ω)+Ŵ(ω). (13)
The time-frequency spectra may replace the discrete Fourier transform of the reference speech signal and the noisy speech signal as input in a STOI estimator.
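The relationship between the AR coefficients, the excitation variances and the spectra described above may be illustrated with the following sketch. It assumes the textbook AR power spectrum, σ²/|A(ω)|² with A(ω) = Σ a[k]·e^(−jωk) and a[0] = 1, and then forms the sum of Eq. (13). Whether the synthesis equations of this disclosure use magnitude or power spectra is not assumed here; the function names and the frequency grid are illustrative.

```python
import cmath
import math


def ar_power_spectrum(a, excitation_variance, n_bins):
    """AR power spectrum sigma^2 / |A(omega)|^2 on n_bins frequencies in [0, pi)."""
    spectrum = []
    for b in range(n_bins):
        omega = math.pi * b / n_bins
        a_omega = sum(a_k * cmath.exp(-1j * omega * k) for k, a_k in enumerate(a))
        spectrum.append(excitation_variance / abs(a_omega) ** 2)
    return spectrum


def reconstruct_noisy(s_hat, w_hat):
    """Eq. (13): bin-wise sum of the speech and noise spectra."""
    return [s + w for s, w in zip(s_hat, w_hat)]
```

In this sketch the coefficients only shape the envelope, while scaling `excitation_variance` scales the whole spectrum, which mirrors the roles described for a_s, a_w, σ̂_u and σ̂_v.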
In one or more exemplary hearing devices, the speech intelligibility estimator comprises a short-time objective intelligibility estimator. The short-time objective intelligibility estimator may be configured to compare the reconstructed reference speech signal with the reconstructed noisy speech signal and to provide the speech intelligibility indicator, e.g. based on the comparison. For example, elements of the first representation of the first input signal (e.g. the spectra (or power spectra) of the noisy speech, Ŷ) may be clipped by a normalisation procedure expressed in Eq. 14 in order to de-emphasize the impact of regions in which noise dominates the spectrum:
Ŷ′=max(min(λ·Ŷ,(1+10^(−β/20))·Ŝ),(1−10^(−β/20))·Ŝ), (14)
where Ŝ is the spectrum (or power spectrum) of the reconstructed reference signal, λ=√(ΣŜ²/ΣŶ²) is a scale factor for normalizing the noisy TF bins and β=−15 dB is e.g. the lower signal-to-distortion ratio bound. Given the local correlation coefficient, r_f(t), between Ŷ and Ŝ at frequency f and time t, the speech intelligibility indicator, SII, may be estimated by averaging across frequency bands and frames:
In one or more embodiments, the short-time objective intelligibility estimator may be configured to compare the reconstructed reference speech signal with the first input signal to provide the speech intelligibility indicator. In other words, the reconstructed noisy speech signal may be replaced by the first input signal as obtained from the input module. The first input signal may be captured by a single microphone (which is omnidirectional) or by a plurality of microphones (e.g. using beamforming). For example, the speech intelligibility indicator may be predicted by the controller or the speech intelligibility estimator by comparing the reconstructed speech signal and the first input signal using the STOI estimator, such as by comparing the correlation of the reconstructed speech signal and the first input signal using the STOI estimator.
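The normalisation and clipping of Eq. (14), followed by a correlation-based comparison, may be sketched as follows. This is a simplified illustration under stated assumptions: each frequency band is treated as a single segment of frames, the local correlation is taken between the clipped noisy spectrum and the reference spectrum as in the conventional STOI measure, and the indicator is a plain mean of the per-band correlations. Function names are hypothetical.

```python
import math


def clip_noisy(y_hat, s_hat, beta_db=-15.0):
    """Apply Eq. (14) to one band: scale y_hat to the energy of s_hat, then clip."""
    energy_y = sum(y * y for y in y_hat)
    if energy_y == 0.0:
        return [0.0 for _ in y_hat]
    lam = math.sqrt(sum(s * s for s in s_hat) / energy_y)
    upper = 1.0 + 10.0 ** (-beta_db / 20.0)
    lower = 1.0 - 10.0 ** (-beta_db / 20.0)
    return [max(min(lam * y, upper * s), lower * s) for y, s in zip(y_hat, s_hat)]


def correlation(x, y):
    """Sample correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den > 0.0 else 0.0


def intelligibility_indicator(s_bands, y_bands, beta_db=-15.0):
    """Mean over bands of the correlation between reference and clipped noisy spectra.

    s_bands[f] and y_bands[f] hold the values of band f over the frames of one segment.
    """
    scores = [
        correlation(s_f, clip_noisy(y_f, s_f, beta_db))
        for s_f, y_f in zip(s_bands, y_bands)
    ]
    return sum(scores) / len(scores)
```

When the first input signal is used in place of the reconstructed noisy speech signal, its band-wise spectrum would simply be passed as `y_bands`.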
In one or more exemplary hearing devices, the input module comprises a second microphone and a first beamformer. The first beamformer may be connected to the first microphone and the second microphone and configured to provide a first beamform signal, as the first input signal, based on first and second microphone signals. The first beamformer may be connected to a third microphone and/or a fourth microphone and configured to provide a first beamform signal, as the first input signal, based on a third microphone signal of the third microphone and/or a fourth microphone signal of the fourth microphone. The decomposition module may be configured to decompose the first beamform signal into the first representation. For example, the first beamformer may comprise a front beamformer or zero-direction beamformer, such as a beamformer directed to a front direction of the user.
In one or more exemplary hearing devices, the input module comprises a second beamformer. The second beamformer may be connected to the first microphone and the second microphone and configured to provide a second beamform signal, as a second input signal, based on first and second microphone signals. The second beamformer may be connected to a third microphone and/or a fourth microphone and configured to provide a second beamform signal, as the second input signal, based on a third microphone signal of the third microphone and/or a fourth microphone signal of the fourth microphone. The decomposition module may be configured to decompose the second input signal into a third representation. For example, the second beamformer may comprise an omni-directional beamformer.
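Purely as an illustration of how two microphone signals could yield a front-directed signal and an omni-directional signal, a two-microphone sketch is given below. The disclosure does not specify the beamformer design; the delay-and-subtract structure, the one-sample delay, the assignment of front and rear microphones, and the function names are all assumptions.

```python
def omni_beamform(front_mic, rear_mic):
    """Omni-directional output (assumption): average of the two microphone signals."""
    return [0.5 * (f + r) for f, r in zip(front_mic, rear_mic)]


def front_beamform(front_mic, rear_mic, delay=1):
    """Front-directed output (assumption): front signal minus delayed rear signal.

    Sound arriving from behind reaches the rear microphone first; when `delay`
    equals the inter-microphone travel time in samples, it is cancelled.
    """
    out = []
    for n in range(len(front_mic)):
        delayed_rear = rear_mic[n - delay] if n >= delay else 0.0
        out.append(front_mic[n] - delayed_rear)
    return out
```

In such a sketch, the output of `front_beamform` would play the role of the first beamform signal and the output of `omni_beamform` that of the second beamform signal.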
The present disclosure also relates to a method of operating a hearing device. The method comprises converting audio to one or more microphone signals including a first input signal; and obtaining a speech intelligibility indicator indicative of speech intelligibility related to the first input signal. Obtaining the speech intelligibility indicator comprises obtaining a first representation of the first input signal in a frequency domain by determining one or more elements of the representation of the first input signal in the frequency domain using one or more characterization blocks.
In one or more exemplary methods, determining one or more elements of the first representation of the first input signal using one or more characterization blocks comprises mapping a feature of the first input signal into the one or more characterization blocks. In one or more exemplary methods, the one or more characterization blocks comprise one or more target speech characterization blocks. In one or more exemplary methods, the one or more characterization blocks comprise one or more noise characterization blocks.
In one or more exemplary methods, obtaining the speech intelligibility indicator comprises generating a reconstructed reference speech signal based on the first representation, and determining the speech intelligibility indicator based on the reconstructed reference speech signal.
The method may comprise controlling the hearing device based on the speech intelligibility indicator.
The figures are schematic and simplified for clarity. Throughout, the same reference numerals are used for identical or corresponding parts.
The hearing device 2 comprises an input module 6 for provision of a first input signal 9. The input module 6 comprises a first microphone 8. The input module 6 may be configured to provide a second input signal 11. The first microphone 8 may be part of a set of microphones. The set of microphones may comprise one or more microphones. The set of microphones comprises a first microphone 8 for provision of a first microphone signal 9′ and optionally a second microphone 10 for provision of a second microphone signal 11′. The first input signal 9 is the first microphone signal 9′ while the second input signal 11 is the second microphone signal 11′.
The hearing device 2 optionally comprises an antenna 4 for converting a first wireless input signal 5 of a first external source (not shown in
The hearing device 2 comprises a processor 14 for processing input signals. The processor 14 provides an electrical output signal based on the input signals to the processor 14.
The hearing device comprises a receiver 16 for converting the electrical output signal to an audio output signal.
The processor 14 is configured to compensate for a hearing loss of a user and to provide an electrical output signal 15 based on input signals. The receiver 16 converts the electrical output signal 15 to an audio output signal to be directed towards an eardrum of the hearing device user.
The hearing device comprises a controller 12. The controller 12 is operatively connected to the input module 6 (e.g. to the first microphone 8) and to the processor 14. The controller 12 may be operatively connected to the second microphone 10, if any. The controller 12 is configured to estimate the speech intelligibility indicator indicative of speech intelligibility based on one or more input signals, such as the first input signal 9. The controller 12 comprises a speech intelligibility estimator 12a for estimating a speech intelligibility indicator indicative of speech intelligibility based on the first input signal 9. The controller 12 is configured to control the processor 14 based on the speech intelligibility indicator.
The speech intelligibility estimator 12a comprises a decomposition module 12aa for decomposing the first input signal 9 into a first representation of the first input signal 9 in a frequency domain. The first representation comprises one or more elements representative of the first input signal 9. The decomposition module comprises one or more characterization blocks, A1, . . . , Ai for characterizing the one or more elements of the first representation in the frequency domain. In one or more exemplary hearing devices, the decomposition module 12aa is configured to decompose the first input signal 9 into the first representation by mapping a feature of the first input signal 9 into one or more characterization blocks A1, . . . , Ai. For example, the decomposition module is configured to map a feature of the first input signal 9 into one or more characterization blocks A1, . . . , Ai using an autoregressive model of the first input signal with linear prediction coefficients relating the frequency-based feature of the first input signal 9 to the one or more characterization blocks A1, . . . , Ai of the decomposition module 12aa. The feature of the first input signal 9 comprises for example a parameter of the first input signal, a frequency of the first input signal, a spectral envelope of the first input signal and/or a frequency spectrum of the first input signal. A parameter of the first input signal may be an auto-regressive, AR, coefficient of an auto-regressive model, such as the coefficients in Equation (1).
In one or more exemplary hearing devices, the decomposition module 12aa is configured to compare the feature with one or more characterization blocks A1, . . . , Ai and to derive the one or more elements of the first representation based on the comparison. For example, the decomposition module 12aa compares a frequency-based feature of the first input signal 9 with the one or more characterization blocks A1, . . . , Ai by estimating a minimum mean square error of the linear prediction coefficients and of excitation co-variances related to the first input signal 9 for each of the characterization blocks, as illustrated in Equation (4).
For example, the one or more characterization blocks A1, . . . , Ai may comprise one or more target speech characterization blocks. In one or more exemplary hearing devices, a characterization block may be an entry of a codebook or an entry of a dictionary. For example, the one or more target speech characterization blocks may form part of a target speech codebook in the frequency domain or a target speech dictionary in the frequency domain.
In one or more exemplary hearing devices, the one or more characterization blocks A1, . . . , Ai may comprise one or more noise characterization blocks. For example, the one or more noise characterization blocks A1, . . . , Ai may form part of a noise codebook in the frequency domain or a noise dictionary in the frequency domain.
The decomposition module 12aa may be configured to determine the second representation by comparing the feature of the first input signal with the one or more target speech characterization blocks and/or the one or more noise characterization blocks and determining the one or more elements of the second representation based on the comparison. The second representation may be a noise signal representation while the first representation may be a reference signal representation.
For example, the decomposition module 12aa may be configured to determine the first representation and the second representation by comparing the feature of the first input signal with the one or more target speech characterization blocks and the one or more noise characterization blocks and determining the one or more elements of the first representation and the one or more elements of the second representation based on the comparisons, as illustrated in any of the Equations (5-10).
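As a hedged sketch of how the comparisons against both codebooks could be combined, the following example forms a weighted average over all pairs of target speech and noise characterization blocks. It assumes that each block is a parameter vector in a domain where averaging is meaningful (e.g. line spectral frequencies) and that a weight for each pair (i, j) is supplied by a caller-provided function `weight_fn`, which is hypothetical. The example does not reproduce Equations (5-10).

```python
from itertools import product


def mmse_over_codebooks(speech_entries, noise_entries, weight_fn):
    """Weighted average of codebook entries over all (speech, noise) pairs.

    speech_entries, noise_entries: lists of equal-length parameter vectors.
    weight_fn(i, j): non-negative weight for speech entry i and noise entry j.
    Returns (speech_estimate, noise_estimate).
    """
    pairs = list(product(range(len(speech_entries)), range(len(noise_entries))))
    weights = [weight_fn(i, j) for i, j in pairs]
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must not all be zero")
    dim_s = len(speech_entries[0])
    dim_w = len(noise_entries[0])
    speech_estimate = [
        sum(w * speech_entries[i][d] for w, (i, _) in zip(weights, pairs)) / total
        for d in range(dim_s)
    ]
    noise_estimate = [
        sum(w * noise_entries[j][d] for w, (_, j) in zip(weights, pairs)) / total
        for d in range(dim_w)
    ]
    return speech_estimate, noise_estimate
```

The number of pairs evaluated equals the number of target speech blocks times the number of noise blocks, so the codebook sizes directly determine the computational cost of this step.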
The hearing device may be configured to train the one or more characterization blocks, e.g. using a female voice, and/or a male voice.
The speech intelligibility estimator 12a may comprise a signal synthesizer 12ab for generating a reconstructed reference speech signal based on the first representation. The speech intelligibility estimator 12a may be configured to estimate the speech intelligibility indicator based on the reconstructed reference speech signal provided by the signal synthesizer 12ab. For example, the signal synthesizer 12ab is configured to generate the reconstructed reference speech signal based on the first representation, following e.g. Equation (11).
The signal synthesizer 12ab may be configured to generate a reconstructed noise signal based on the second representation, e.g. based on Equation (12).
The speech intelligibility indicator may be estimated based on the reconstructed noisy speech signal.
The speech intelligibility estimator 12a may comprise a short-time objective intelligibility (STOI) estimator 12ac. The short-time objective intelligibility estimator 12ac is configured to compare the reconstructed reference speech signal and a noisy input signal (either a reconstructed noisy input signal or the first input signal 9) and to provide the speech intelligibility indicator based on the comparison, as illustrated in Equations (13-15).
For example, the short-time objective intelligibility estimator 12ac compares the reconstructed reference speech signal and the noisy speech signal (reconstructed or not). In other words, the short-time objective intelligibility estimator 12ac assesses the correlation between the reconstructed reference speech signal and the noisy speech signal (e.g. the reconstructed noisy speech signal) and uses the assessed correlation to provide a speech intelligibility indicator to the controller 12, or to the processor 14.
The input module 6 is configured to provide a second input signal 11. The input module 6 comprises a second beamformer 19 connected to the second microphone 10 and to the first microphone 8. The second beamformer 19 is configured to generate a second beamform signal 11″ based on the first microphone signal 9′ and the second microphone signal 11′.
The hearing device 2A comprises a processor 14 for processing input signals. The processor 14 provides an electrical output signal based on the input signals to the processor 14.
The hearing device comprises a receiver 16 for converting the electrical output signal to an audio output signal.
The processor 14 is configured to compensate for a hearing loss of a user and to provide an electrical output signal 15 based on input signals. The receiver 16 converts the electrical output signal 15 to an audio output signal to be directed towards an eardrum of the hearing device user.
The hearing device comprises a controller 12. The controller 12 is operatively connected to the input module 6 (i.e. to the first beamformer 18) and to the processor 14. The controller 12 may be operatively connected to the second beamformer 19, if any. The controller 12 is configured to estimate the speech intelligibility indicator indicative of speech intelligibility based on the first beamform signal 9″. The controller 12 comprises a speech intelligibility estimator 12a for estimating a speech intelligibility indicator indicative of speech intelligibility based on the first beamform signal 9″. The controller 12 is configured to control the processor 14 based on the speech intelligibility indicator.
The speech intelligibility estimator 12a comprises a decomposition module 12aa for decomposing the first beamform signal 9″ into a first representation in a frequency domain. The first representation comprises one or more elements representative of the first beamform signal 9″. The decomposition module comprises one or more characterization blocks, A1, . . . , Ai for characterizing the one or more elements of the first representation in the frequency domain.
The decomposition module 12aa is configured to decompose the first beamform signal 9″ into the first representation (related to the estimated reference speech signal), and optionally into a second representation (related to the estimated noise signal) as illustrated in Equations (4-10).
When a second beamformer is included in the input module 6, the decomposition module may be configured to decompose the second input signal 11″ into a third representation (related to the estimated reference speech signal) and optionally a fourth representation (related to the estimated noise signal).
The speech intelligibility estimator 12a may comprise a signal synthesizer 12ab for generating a reconstructed reference speech signal based on the first representation, e.g. in Equation (11). The speech intelligibility estimator 12a may be configured to estimate the speech intelligibility indicator based on the reconstructed reference speech signal provided by the signal synthesizer 12ab.
The speech intelligibility estimator 12a may comprise a short-time objective intelligibility (STOI) estimator 12ac. The short-time objective intelligibility estimator 12ac is configured to compare the reconstructed reference speech signal and a noisy speech signal (e.g. reconstructed or directly obtained from the input module) and to provide the speech intelligibility indicator based on the comparison. For example, the short-time objective intelligibility estimator 12ac compares the reconstructed speech signal (e.g. the reconstructed reference speech signal) and the noisy speech signal (e.g. reconstructed or directly obtained from the input module). In other words, the short-time objective intelligibility estimator 12ac assesses the correlation between the reconstructed reference speech signal and the noisy speech signal (e.g. the reconstructed noisy speech signal or the input signal) and uses the assessed correlation to provide a speech intelligibility indicator to the controller 12, or to the processor 14.
In one or more exemplary hearing devices, the decomposition module 12aa is configured to decompose the first input signal 9 into the first representation by mapping a feature of the first input signal 9 into one or more characterization blocks A1, . . . , Ai. For example, the decomposition module is configured to map a feature of the first input signal 9 into one or more characterization blocks A1, . . . , Ai using an autoregressive model of the first input signal with linear prediction coefficients relating the frequency-based feature of the first input signal 9 to the one or more characterization blocks A1, . . . , Ai of the decomposition module 12aa. The feature of the first input signal 9 comprises for example a parameter of the first input signal, a frequency of the first input signal, a spectral envelope of the first input signal and/or a frequency spectrum of the first input signal. A parameter of the first input signal may be an auto-regressive, AR, coefficient of an auto-regressive model.
In one or more exemplary hearing devices, the decomposition module 12aa is configured to compare the feature with one or more characterization blocks A1, . . . , Ai and to derive the one or more elements of the first representation based on the comparison. For example, the decomposition module 12aa compares a frequency-based feature of the first input signal 9 with the one or more characterization blocks A1, . . . , Ai by estimating a minimum mean square error of the linear prediction coefficients and of excitation co-variances related to the first input signal 9 for each of the characterization blocks, as illustrated in Equation (4).
For example, the one or more characterization blocks A1, . . . , Ai may comprise one or more target speech characterization blocks. For example, the one or more target speech characterization blocks may form part of a target speech codebook in the frequency domain or a target speech dictionary in the frequency domain.
In one or more exemplary hearing devices, a characterization block may be an entry of a codebook or an entry of a dictionary.
In one or more exemplary hearing devices, the one or more characterization blocks may comprise one or more noise characterization blocks. For example, the one or more noise characterization blocks may form part of a noise codebook in the frequency domain or a noise dictionary in the frequency domain.
In one or more exemplary methods, determining 104aa one or more elements of the first representation of the first input signal using one or more characterization blocks comprises mapping 104ab a feature of the first input signal into the one or more characterization blocks. For example, mapping 104ab a feature of the first input signal into one or more characterization blocks may be performed using an autoregressive model of the first input signal with linear prediction coefficients relating the frequency-based feature of the first input signal to the one or more characterization blocks of the decomposition module.
In one or more exemplary methods, mapping 104ab the feature of the first input signal into the one or more characterization blocks may comprise comparing the feature with one or more characterization blocks and deriving the one or more elements of the first representation based on the comparison. For example, comparing a frequency-based feature of the first input signal with the one or more characterization blocks may comprise estimating a minimum mean square error of the linear prediction coefficients and of excitation co-variances related to the first input signal for each of the characterization blocks.
In one or more exemplary methods, the one or more characterization blocks comprise one or more target speech characterization blocks. In one or more exemplary methods, the one or more characterization blocks comprise one or more noise characterization blocks.
In one or more exemplary methods, the first representation may comprise a reference signal representation.
In one or more exemplary methods, determining 104aa one or more elements of the first representation of the first input signal using one or more characterization blocks may comprise determining 104ac the one or more elements of the reference signal representation as estimated coefficients related to an estimated reference speech signal for each of the one or more of the characterization blocks (e.g. target speech characterization blocks). For example, mapping a feature of the estimated reference speech signal into one or more characterization blocks (e.g. target speech characterization blocks) may be performed using an autoregressive model of the first input signal with linear prediction coefficients relating a frequency-based feature of the estimated reference speech signal to the one or more characterization blocks (e.g. target speech characterization blocks). For example, mapping a frequency-based feature of the estimated reference speech signal to the one or more characterization blocks (e.g. target speech characterization blocks) may comprise estimating a minimum mean square error of the linear prediction coefficients and of excitation co-variances related to estimated reference speech signal for each of the one or more characterization blocks (e.g. target speech characterization blocks).
In one or more exemplary methods, determining 104aa one or more elements of the first representation may comprise comparing 104ad the feature of the first input signal with the one or more target speech characterization blocks and/or the one or more noise characterization blocks and determining 104ae the one or more elements of the first representation based on the comparison.
In one or more exemplary methods, obtaining 104 a speech intelligibility indicator may comprise obtaining 104b a second representation of the first input signal, wherein the second representation comprises one or more elements representative of the first input signal. Obtaining 104b the second representation of the first input signal may be performed using one or more characterization blocks for characterizing the one or more elements of the second representation. In one or more exemplary methods, the second representation may comprise a representation of a noise signal, such as a noise signal representation.
In one or more exemplary methods, obtaining 104 the speech intelligibility indicator comprises generating 104c a reconstructed reference speech signal based on the first representation, and determining 104d the speech intelligibility indicator based on the reconstructed reference speech signal.
The method may comprise controlling 106 the hearing device based on the speech intelligibility indicator.
The intelligibility performance results shown in
The simulations show a high correlation between the disclosed non-intrusive technique and the intrusive STOI indicating that the disclosed technique is a suitable metric for automatic classification of speech signals. Further, these performance results also support that the representation disclosed herein provides a cue sufficient for accurately estimating speech intelligibility.
The use of the terms “first”, “second”, “third” and “fourth”, etc. does not imply any particular order; the terms are included to identify individual elements. Moreover, the use of the terms first, second, etc. does not denote any order or importance, but rather the terms first, second, etc. are used to distinguish one element from another. Note that the words first and second are used here and elsewhere for labelling purposes only and are not intended to denote any specific spatial or temporal ordering. Furthermore, the labelling of a first element does not imply the presence of a second element and vice versa.
Although particular features have been shown and described, it will be understood that they are not intended to limit the claimed invention, and it will be obvious to those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the claimed invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The claimed invention is intended to cover all alternatives, modifications, and equivalents.
This application is a continuation of U.S. patent application Ser. No. 16/011,982 filed on Jun. 19, 2018, pending, which claims priority to, and the benefit of, European Patent Application No. 17181107 filed on Jul. 13, 2017. The entire disclosures of the above applications are expressly incorporated by reference herein.