The present invention relates to hearing implant systems, and more specifically, to techniques for producing electrical stimulation signals in such systems based on estimates and predictions of noise powers in the input sound signals.
A normal ear transmits sounds as shown in FIG. 1 through the outer ear and the structures of the middle ear 103 to the cochlea 104.
Hearing is impaired when there are problems in the ability to transduce external sounds into meaningful action potentials along the neural substrate of the cochlea 104. To improve impaired hearing, auditory prostheses have been developed. For example, when the impairment is related to operation of the middle ear 103, a conventional hearing aid or middle ear implant may be used to provide acoustic-mechanical stimulation to the auditory system in the form of amplified sound. Or when the impairment is associated with the cochlea 104, a cochlear implant with an implanted stimulation electrode can electrically stimulate auditory nerve tissue with small currents delivered by multiple electrode contacts distributed along the electrode.
Typically, the electrode array 110 includes multiple electrode contacts 112 on its surface that provide selective stimulation of the cochlea 104. Depending on context, the electrode contacts 112 are also referred to as electrode channels. In cochlear implants today, a relatively small number of electrode channels are each associated with relatively broad frequency bands, with each electrode contact 112 addressing a group of neurons with an electric stimulation pulse having a charge that is derived from the instantaneous amplitude of the signal envelope within that frequency band.
The details of such an arrangement are set forth in the following discussion.
In the signal processing arrangement shown in FIG. 2, an initial input sound signal is transformed by a filter bank into multiple band pass signals, each representing an associated frequency band of audio frequencies.
The band pass signals y1 to yK (which can also be thought of as electrode channels) are output to a Stimulation Timer 206 that includes an Envelope Detector 202 and a Fine Structure Detector 203. The Envelope Detector 202 extracts the characteristic envelope signals Y1, . . . , YK that represent the channel-specific band pass envelopes. The envelope extraction can be represented by Yk=LP(|yk|), where |.| denotes the absolute value and LP(.) is a low-pass filter; for example, using 12 rectifiers and 12 digital Butterworth low pass filters of 2nd order, IIR-type. Alternatively, the Envelope Detector 202 may extract the Hilbert envelope, if the band pass signals y1, . . . , yK are generated by orthogonal filters.
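By way of illustration, the envelope extraction Yk=LP(|yk|) can be sketched in a few lines of Python. The scipy-based implementation below, the function name, and the 200 Hz cut-off frequency are illustrative assumptions rather than values taken from the text:

```python
import numpy as np
from scipy.signal import butter, lfilter

def extract_envelopes(band_signals, fs, cutoff_hz=200.0):
    """Envelope extraction Yk = LP(|yk|): full-wave rectification of each band
    pass signal followed by a 2nd-order IIR Butterworth low pass filter.

    band_signals : array of shape (K, N) holding the band pass signals y1..yK
    fs           : sampling rate in Hz
    cutoff_hz    : low pass cut-off (illustrative value, not from the text)
    """
    b, a = butter(2, cutoff_hz / (fs / 2.0), btype="low")  # 2nd-order Butterworth
    rectified = np.abs(band_signals)                       # |yk| (rectifier)
    return lfilter(b, a, rectified, axis=-1)               # Yk = LP(|yk|)
```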
The Fine Structure Detector 203 functions to obtain smooth and robust estimates of the instantaneous frequencies in the signal channels, processing selected temporal fine structure features of the band pass signals y1, . . . , yK to generate stimulation timing signals X1, . . . , XK. The band pass signals y1, . . . , yK can be assumed to be real valued signals, so in the specific case of an analytic orthogonal filter bank, the Fine Structure Detector 203 considers only the real valued part of yk. The Fine Structure Detector 203 is formed of K independent, equally-structured parallel sub-modules.
The extracted band-pass signal envelopes Y1, . . . , YK from the Envelope Detector 202, and the stimulation timing signals X1, . . . , XK from the Fine Structure Detector 203, are output from the Stimulation Timer 206 to a Pulse Generator 204 that produces the electrode stimulation signals Z for the electrode contacts in the implanted electrode array 205. The Pulse Generator 204 applies a patient-specific mapping function (for example, using instantaneous nonlinear compression of the envelope signal, known as the map law) that is adapted to the needs of the individual cochlear implant user during fitting of the implant in order to achieve natural loudness growth. The Pulse Generator 204 may apply a logarithmic function with a form factor C as a loudness mapping function, which typically is identical across all the band pass analysis channels. Different systems may use specific loudness mapping functions other than a logarithmic function, with either one identical function applied to all channels or an individual function for each channel to produce the electrode stimulation signals. The electrode stimulation signals typically are a set of symmetrical biphasic current pulses.
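The exact form of the loudness mapping is not specified here; the following Python sketch assumes a commonly used logarithmic compression with form factor C and maps a normalized envelope onto a patient-specific electric dynamic range. The function names and fitting parameters are hypothetical:

```python
import numpy as np

def map_law(envelope, c=1000.0):
    """Instantaneous compressive loudness mapping ("map law") with form factor C.
    Assumed form: f(x) = log(1 + C*x) / log(1 + C), mapping [0, 1] to [0, 1]."""
    x = np.clip(envelope, 0.0, 1.0)
    return np.log1p(c * x) / np.log1p(c)

def to_stimulation_level(mapped, thr, mcl):
    """Scale the compressed value into the electric dynamic range between the
    threshold (thr) and most comfortable loudness (mcl) levels determined
    during fitting (illustrative parameters, not values from the text)."""
    return thr + mapped * (mcl - thr)
```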
In some stimulation signal coding strategies, stimulation pulses are applied at a constant rate across all electrode channels, whereas in other coding strategies, stimulation pulses are applied at a channel-specific rate. Various specific signal processing schemes can be implemented to produce the electrical stimulation signals. Signal processing approaches that are well-known in the field of cochlear implants include continuous interleaved sampling (CIS), channel specific sampling sequences (CSSS) (as described in U.S. Pat. No. 6,348,070, incorporated herein by reference), spectral peak (SPEAK), and compressed analog (CA) processing.
In the CIS strategy, the signal processor only uses the band pass signal envelopes for further processing, i.e., they contain the entire stimulation information. For each electrode channel, the signal envelope is represented as a sequence of biphasic pulses at a constant repetition rate. A characteristic feature of CIS is that the stimulation rate is equal for all electrode channels and there is no relation to the center frequencies of the individual channels. It is intended that the pulse repetition rate is not a temporal cue for the patient (i.e., it should be sufficiently high so that the patient does not perceive tones with a frequency equal to the pulse repetition rate). The pulse repetition rate is usually chosen at greater than twice the bandwidth of the envelope signals (based on the Nyquist theorem).
In a CIS system, the stimulation pulses are applied in a strictly non-overlapping sequence. Thus, as a typical CIS-feature, only one electrode channel is active at a time and the overall stimulation rate is comparatively high. For example, assuming an overall stimulation rate of 18 kpps and a 12 channel filter bank, the stimulation rate per channel is 1.5 kpps. Such a stimulation rate per channel usually is sufficient for adequate temporal representation of the envelope signal. The maximum overall stimulation rate is limited by the minimum phase duration per pulse. The phase duration cannot be arbitrarily short because, the shorter the pulses, the higher the current amplitudes have to be to elicit action potentials in neurons, and current amplitudes are limited for various practical reasons. For an overall stimulation rate of 18 kpps, the phase duration is 27 μs, which is near the lower limit.
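The rate and phase-duration arithmetic in the preceding paragraph can be checked with a short sketch; the helper names are illustrative and a zero inter-phase gap is assumed:

```python
def cis_channel_rate(total_rate_pps, num_channels):
    """Per-channel stimulation rate when pulses are strictly interleaved."""
    return total_rate_pps / num_channels

def max_phase_duration_s(total_rate_pps, interphase_gap_s=0.0):
    """Upper bound on the phase duration of a symmetrical biphasic pulse: each
    pulse slot of 1/total_rate seconds must hold two phases plus any gap."""
    return (1.0 / total_rate_pps - interphase_gap_s) / 2.0

# Example from the text: 18 kpps overall rate with a 12 channel filter bank.
assert cis_channel_rate(18000, 12) == 1500.0               # 1.5 kpps per channel
print(round(max_phase_duration_s(18000) * 1e6, 1), "us")   # ~27.8 us, near 27 us
```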
The Fine Structure Processing (FSP) strategy by Med-El uses CIS in higher frequency channels, and uses fine structure information present in the band pass signals in the lower frequency, more apical electrode channels. In the FSP electrode channels, the zero crossings of the band pass filtered time signals are tracked, and at each negative to positive zero crossing, a Channel Specific Sampling Sequence (CSSS) is started. Typically CSSS sequences are applied on up to 3 of the most apical electrode channels, covering the frequency range up to 200 or 330 Hz. The FSP arrangement is described further in Hochmair I, Nopp P, Jolly C, Schmidt M, Schößer H, Garnham C, Anderson I, MED-EL Cochlear Implants: State of the Art and a Glimpse into the Future, Trends in Amplification, vol. 10, 201-219, 2006, which is incorporated herein by reference. The FS4 coding strategy differs from FSP in that up to 4 apical channels can have their fine structure information used. In FS4-p, stimulation pulse sequences can be delivered in parallel on any 2 of the 4 FSP electrode channels. With the FSP and FS4 coding strategies, the fine structure information is the instantaneous frequency information of a given electrode channel, which may provide users with an improved hearing sensation, better speech understanding and enhanced perceptual audio quality. See, e.g., U.S. Pat. No. 7,561,709; Lorens et al. “Fine structure processing improves speech perception as well as objective and subjective benefits in pediatric MED-EL COMBI 40+ users.” International journal of pediatric otorhinolaryngology 74.12 (2010): 1372-1378; and Vermeire et al., “Better speech recognition in noise with the fine structure processing coding strategy.” ORL 72.6 (2010): 305-311; all of which are incorporated herein by reference in their entireties.
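A minimal sketch of the zero-crossing tracking used to trigger CSSS pulses in the apical channels might look as follows; the function name is hypothetical, and the shape of the CSSS itself is not addressed:

```python
import numpy as np

def csss_trigger_indices(band_signal):
    """Return the sample indices of negative-to-positive zero crossings of a
    low-frequency (apical) band pass signal; in FSP/FS4 a channel specific
    sampling sequence (CSSS) is started at each such crossing."""
    s = np.asarray(band_signal, dtype=float)
    return np.flatnonzero((s[:-1] < 0.0) & (s[1:] >= 0.0)) + 1
```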
In signal processing of electronic communications signals such as for hearing implants, the input sound signal y[n] can be characterized as an additive mixture of an information bearing target signal s[n] and a non-information bearing noise signal d[n]. To extract the information from the target signal s[n], clearly it is desirable to minimize the effects of the noise signal d[n]. Accomplishing such minimization typically requires estimation of the noise power from signal d[n].
R. Martin, Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics, IEEE Trans. Speech Audio Proc., Vol. 9, No. 5, July 2001 (incorporated herein by reference in its entirety) describes a classic approach for estimating noise power in an input communications signal without a voice activity detector, tracking the spectral minima of the power spectrum of the noisy signal in frequency bands over a relatively long time window of typically 1-3 seconds. One drawback of this method is the limited tracking performance—if the noise power changes over time, the long observation window prevents the noise power estimate from following the changing noise power with little or no delay. This leads then to an underestimation of the noise power. But making the observation window shorter might lead to overestimation of the noise power since no speech pause might occur within the short window.
R. C. Hendriks, et al., Noise tracking using DFT domain subspace decompositions, IEEE Trans. Audio, Speech, and Lang. Proc., Vol. 16, no. 3, March 2008 (incorporated herein by reference in its entirety) also requires no voice activity detector and achieves better noise power tracking by an eigenvalue decomposition of correlation matrices constructed from time series of noisy discrete Fourier transform (DFT) coefficients. It attains good tracking performance for changing noise power, but at the cost of high calculation effort due to the need for the eigenvalue decomposition. A year later, R. C. Hendriks, et al., Fast noise PSD-estimation with low complexity, Proc. of the 34th IEEE Int. Conf. on Acoustics, Speech, and Signal Proc., April 2009 (incorporated herein by reference in its entirety) proposes an algorithm with similar noise power tracking, but with lower computational requirements. This method is based on the construction of high-resolution periodograms per frequency band/DFT bin in a lower resolution filter bank. Although no eigenvalue decomposition is necessary, the computation of a high resolution periodogram is, making this approach also computationally demanding.
R. C. Hendriks, et al., MMSE based noise PSD tracking with low complexity, Proc. of the 35th IEEE Int. Conf. on Acoustics, Speech, and Signal Proc., March 2010 (incorporated herein by reference in its entirety) proposes a noise power spectral density estimation based on minimum mean square error (MMSE) estimators, which offers better tracking performance. T. Gerkmann, R. C. Hendriks, Unbiased MMSE-Based Noise Power Estimation With Low Complexity and Low Tracking Delay, IEEE Trans. Audio, Speech, and Lang. Proc., Vol. 20, no. 4, May 2012 shows that this MMSE noise estimation can be interpreted as a voice activity detector-based power estimator that requires prior knowledge of the a priori signal-to-noise ratio (SNR), which is typically not known in advance, though it can be approximated by a fixed value, assuming an SNR uniformly distributed over a relatively wide range.
U.S. Pat. No. 8,634,581 (incorporated herein by reference in its entirety) uses a combined approach for estimation of the noise level. The input signal level is compared against a threshold that is derived from the estimated noise level at the previous time frame and a fixed multiplication factor (recursive). Based on this comparison, a first estimate of the noise level for the current time frame is built. A second mechanism derives a second estimate for the noise level at the current time frame by using a codebook. The larger of the two estimates is finally used as the noise level estimate.
U.S. Pat. No. 8,385,572 (incorporated herein by reference in its entirety) describes a noise reduction method that uses a multitude of models for the target signal and/or the interfering noise signal. The motivation for this approach lies in the fact that known noise reduction methods (e.g., Y. Ephraim, D. Malah, Speech Enhancement using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator, IEEE Trans. Acoustics, Speech, and Sig. Proc., Vol. ASSP-32, no. 6, December 1984; or R. Martin, Speech Enhancement based on Minimum Mean-Square Error Estimation and Supergaussian Priors, IEEE Trans. Speech Audio Proc., vol. 13, no. 5, pp. 845-856, September 2005, both incorporated herein by reference in their entireties) rely on assumptions of the signal statistics (target signal and/or noise signal), assuming typically a Gaussian or supergaussian distribution. These assumptions might not always match reality equally well, which therefore limits the achievable performance of a noise reduction algorithm relying on these signal models. To achieve a better match with reality in terms of signal statistics, and therefore potentially increasing the performance of a noise reduction algorithm, a variety of signal models and a selection procedure for selecting the best match to reality are proposed, using, e.g., a situation classification algorithm. Based on a quality metric provided by the user, the noise and signal models can also be exchanged, e.g., by the hearing aid acoustician and remain static during daily usage. An alternative approach using dynamic models is also described, whereby the models are trained by an algorithm using the input signal and situation detection.
Embodiments of the present invention are directed to a method of signal processing to generate hearing implant stimulation signals for a hearing implant system. An input sound signal characterized as an additive mixture of an information bearing target signal and a non-information bearing noise signal is transformed into multiple band pass signals each representing an associated frequency band of audio frequencies. The band pass signals are then processed in a sequence of sampling time frames and iterative steps to produce a noise power estimate. For each time frame and iteration the processing includes using a noise prediction model to determine if a currently observed signal sample includes the target signal. If so, then a current noise power estimate is updated without using the currently observed signal sample. Otherwise, the current noise power estimate is updated using the currently observed signal sample. The noise prediction model also is adapted based on the updated noise power estimate. The hearing implant stimulation signals are developed from the band pass signals and the noise power estimate for delivery to an implanted portion of the hearing implant system.
In further specific embodiments, updating the current noise power estimate using the currently observed signal sample may include using the current signal power and the estimated noise power from an immediately preceding time frame and a last iteration step. Updating the current noise power estimate without using the currently observed signal sample may include maintaining constant the current noise power estimate, or additionally using a weighted sum of neighboring noise power estimates with suitably chosen weights and parameters.
Using the noise prediction model to determine if the currently observed signal sample includes the target signal may be based on a hard decision comparison of the currently observed signal sample to a variable threshold; for example, a likelihood ratio test-statistic. Or it may be based on a probability-based decision comparison of the currently observed signal sample to a variable threshold using a speech absence probability function; for example, a sigmoidal function.
The noise prediction model may be a time variant noise model. For example, the noise prediction model may be based on previous time frame noise power estimates and/or previous iteration power estimates. The noise prediction model may be a first order autoregressive model; for example, based on estimates from neighboring sub-bands, or a linear autoregressive model of a linear combination of estimated noise power of a previous iteration and two directly neighboring sub-bands, or a linear autoregressive model of a linear combination of already estimated noise powers and estimated noise power of a preceding iteration and two neighboring noise power estimates, or a nonlinear model where predicted noise power is a nonlinear function with respect to estimated noise powers.
Adapting the noise prediction model may be based on a difference between the noise prediction model and the noise power estimate and/or a continuous adaptation of one or more model optimization criteria such as a mean squared error of a prediction error. Adapting the noise prediction model may be performed after all the iterative steps for a given time frame n have been performed, or after each iteration for a given time frame.
Developing the hearing implant stimulation signals may include using the noise power estimate for noise reduction or channel selection of the band pass signals, or for a power saving functionality of the hearing implant system.
Embodiments of the present invention are directed to an improved approach to blind estimation of the noise power in an input sound signal y[n] characterized as an additive mixture of an information bearing target signal s[n] (e.g., speech) and a non-information bearing disturbing (noise) signal d[n]: y[n]=s[n]+d[n], where n is the time index, referred to as the time frame. In particular, the problem of detecting time frames when the target signal s[n] is absent is addressed. In those time frames, an estimate for the noise power can be updated by using the (observable) input sound signal y[n], since then y[n]=d[n]. The noise power estimate is recursively reused to update the prediction for the next estimation step. This approach differs from existing methods such as described in U.S. Pat. No. 8,385,572 in that no signal model is directly used in a noise power estimation algorithm.
Estimating the noise power can be useful for a number of signal processing applications in a hearing implant system, including noise reduction, channel selection of the band pass signals, and power saving functionality of the hearing implant system.
In such systems, the Noise Power Estimation Module 306 splits the estimation of the unknown noise power into three main steps: (1) prediction of the noise power from previous estimates using a noise prediction model, (2) estimation (update) of the noise power based on a decision whether the currently observed signal sample contains the target signal, and (3) adaptation of the noise prediction model based on the updated noise power estimate.
Prediction and estimation can be performed several times for the same time point n, so that the Noise Power Estimation Module 306 processes the band pass signals yk[n] in a sequence of sampling time frames n and iterative steps i=1, . . . , I to produce a noise power estimate {circumflex over (P)}d[n, k, I]. For each time frame n and iteration i, the Noise Power Estimation Module 306 uses a noise prediction model {tilde over (P)}d[n, k, i] to determine if a currently observed signal sample Py[n, k] includes the target signal s[n]. If the currently observed signal sample Py[n, k] includes the target signal s[n], then a current noise power estimate {circumflex over (P)}d[n, k, i] is updated without using the currently observed signal sample Py[n, k]. Otherwise, if the currently observed signal sample Py[n, k] does not include the target signal s[n], then the current noise power estimate {circumflex over (P)}d[n, k, i] is updated using the currently observed signal sample Py[n, k]. The noise prediction model {tilde over (P)}d[n, k, i] also is adapted based on the updated noise power estimate {circumflex over (P)}d[n, k, i]. Performing multiple iterative steps increases the probability of a correct decision regarding speech presence or absence, and thus leads to a more accurate noise power estimate {circumflex over (P)}d[n, k, I].
The observed target signal s[n] and noise signal d[n] are assumed to be realizations of locally stationary stochastic processes in which the statistics of the processes (e.g., represented by statistical moments such as mean and variance) are allowed to change slowly over time. For example, the signal powers are time-variant, but remain more or less constant within a short time window. The time window within which the noise process can be regarded as being stationary (i.e., the moments don't change) is assumed to be longer than that of the target (speech) process. In addition, it is assumed that the noise and speech processes are statistically independent with zero mean. Using the second assumption, the signal power is Py=E{(s+d)2}=E{s2}+E{d2}=Ps+Pd, i.e., simply the sum of the speech power and the noise power, where E{⋅} denotes statistical expectation.
Typically, the input sound signal y[n] is decomposed into a number of sub-bands using, e.g., a filter bank (time domain, DFT, other subspaces, . . . ): yk[n]=FB(y[n]), k=1, . . . , K. The processing is typically performed per time frame and sub-band. If not needed, time and sub-band indices are suppressed in the following. Since the expectation operation cannot be performed in a real implementation, it is typically approximated using an average over time, e.g., by using a low pass filter. The estimated signal power is then Py=⟨(s+d)2⟩=⟨s2⟩+⟨d2⟩=Ps+Pd, where ⟨⋅⟩ denotes averaging over time. Either the squared signal as stated above or, equivalently, the squared envelope is used. For speech processing applications the low pass filter typically has a 6 dB cut-off frequency of approximately 5-50 Hz, which comprises the speech modulations. After low pass filtering, a sampling rate decimation to a significantly lower sampling rate (e.g., 80-100 Hz) can be applied in order to reduce the computational complexity of the following stages.
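As a hedged Python sketch, the per-band power Py[n, k] might be approximated as follows, assuming scipy is available; the 20 Hz cut-off, the 2nd filter order, and the 100 Hz decimated rate are illustrative values within the ranges given above:

```python
import numpy as np
from scipy.signal import butter, lfilter

def subband_power(yk, fs, lp_cutoff_hz=20.0, decimated_fs_hz=100.0):
    """Approximate Py[n, k] for one sub-band: square the band pass signal,
    smooth it with a low pass filter whose cut-off lies within the 5-50 Hz
    range of speech modulations, then decimate to roughly 80-100 Hz."""
    b, a = butter(2, lp_cutoff_hz / (fs / 2.0), btype="low")
    smoothed = lfilter(b, a, np.asarray(yk, dtype=float) ** 2)  # time average of yk^2
    step = max(1, int(round(fs / decimated_fs_hz)))             # decimation factor
    return smoothed[::step]
```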
More specifically, the hypothesis test at iteration i is a simple comparison of the current sample Py against a variable threshold η:
Py[n, k]≤η[n, k, i]: Py[n, k] consists of noise only (null-hypothesis H0)
Py[n, k]>η[n, k, i]: Py[n, k] consists of noise and speech (hypothesis H1).
The noise power estimate {circumflex over (P)}d[n, k, i] is then constructed based on the hypothesis-test decision. Recursive smoothing over time n and/or sub-band k may also be applied by which the correlation of the noise power over time and/or sub-bands is taken into account. If the hypothesis test indicates that the speech signal s[n] is absent (null-hypothesis H0), then the noise power estimate {circumflex over (P)}d[n, k, i] is updated using the current signal sample Py[n, k] and the estimated noise power from time point n−1 and the last iteration step I, {circumflex over (P)}d[n−1, k, I]:
{circumflex over (P)}d,sa[n,k,i]=α{circumflex over (P)}d[n−1,k,I]+(1−α)Py[n,k]
Using a hard threshold decision, the noise power estimate is then:
Py[n,k]≤η[n,k,i]:{circumflex over (P)}d[n,k,i]={circumflex over (P)}d,sa[n,k,i].
If the null-hypothesis is rejected (speech is present), the noise power estimate {circumflex over (P)}d[n, k, i] is kept constant, i.e.,
{circumflex over (P)}d,sp[n,k,i]={circumflex over (P)}d[n−1,k,I].
The update of the noise power estimate is then
Py[n,k]>η[n,k,i]:{circumflex over (P)}d[n,k,i]={circumflex over (P)}d,sp[n,k,i].
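The hard-decision update above can be summarized in a short Python sketch; the function name and the α=0.9 smoothing constant are illustrative assumptions:

```python
def update_noise_power_hard(p_y, p_d_prev, eta, alpha=0.9):
    """One hard-decision update for time frame n, sub-band k, iteration i.

    p_y      : currently observed signal power Py[n, k]
    p_d_prev : noise power estimate from frame n-1, last iteration I
    eta      : threshold eta[n, k, i] derived from the predicted noise power
    """
    if p_y <= eta:                                    # H0: noise only -> update
        return alpha * p_d_prev + (1.0 - alpha) * p_y
    return p_d_prev                                   # H1: speech present -> hold
```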
Alternatively, in the case of speech present, the noise power estimate {circumflex over (P)}d[n, k, i] can be updated using additionally a weighted sum of neighbouring noise power estimates, e.g.,
{circumflex over (P)}d,sp[n,k,i]=γ{circumflex over (P)}d[n−1,k,I]+(1−γ)Σl≠kwl,k{circumflex over (P)}d[n,l,i−1],
with suitably chosen weights wl,k, e.g.,
wl,k=a exp(−b|l−k|m)
and suitably chosen parameters a, b, m. With this weighting, distant sub-bands contribute less than neighbouring sub-bands, reflecting, e.g., a decrease of the correlation if the distance in frequency increases. The weights wl,k and/or the parameters a, b, m can also be estimated and updated continuously using already existing noise power estimates from time frames before n or from time frame n and previous iterations i. The smoothing parameters α (in case speech absent) and γ (speech present) determine the degree of influence of the noise power estimate from time frame n−1 and model in a simple manner the correlation of the noise power over time.
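A small sketch of such a weighting, with illustrative parameter values a, b, m that are not taken from the text, is:

```python
import numpy as np

def neighbour_weights(k, num_bands, a=1.0, b=0.5, m=1.0):
    """Weights w_{l,k} = a * exp(-b * |l - k|**m) for all sub-bands l, so that
    sub-bands far from k contribute less than direct neighbours."""
    l = np.arange(num_bands)
    return a * np.exp(-b * np.abs(l - k) ** m)
```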
Instead of a hard threshold decision as described above, a soft threshold decision could be used and might be advantageous since errors regarding the decision of speech absence or presence would have less weight. The output of the comparison with the threshold η is defined as speech absence probability. A decision
p[n,k,i]=g(η[n,k,i],Py[n,k]),
with a suitable function g(⋅) providing (soft) values for the speech absence probability in the interval [0,1] can be used. E.g., a sigmoidal function
p[n,k,i]=g(η[n,k,i],Py[n,k])=1/(1+exp(βk(Py[n,k]−η[n,k,i]))),
with βk determining the steepness of the function. A hard decision is achieved for the limit case βk→∞. Using the speech absence probability p[n, k, i], the noise power estimate at iteration i, time frame n, and sub-band k is then
{circumflex over (P)}d[n,k,i]=p[n,k,i]{circumflex over (P)}d,sa[n,k,i]+(1−p[n,k,i]){circumflex over (P)}d,sp[n,k,i],
with the speech-presence probability 1−p[n, k, i]. For the first simple case described above, the noise power estimate is then
{circumflex over (P)}d[n,k,i]=(1−{tilde over (p)}[n,k,i]){circumflex over (P)}d[n−1,k,I]+{tilde over (p)}[n,k,i]Py[n,k],
with a scaled speech-absence probability {tilde over (p)}[n, k, i]=p[n, k, i](1−α).
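A sketch of the soft-decision update under an assumed sigmoidal form of the speech absence probability follows; the exact sigmoid used in practice may differ, and the α and β values are illustrative:

```python
import numpy as np

def speech_absence_probability(p_y, eta, beta=5.0):
    """Sigmoidal speech absence probability: 0.5 at Py = eta, approaching 1
    well below the threshold and 0 well above it; beta sets the steepness and
    the hard decision is recovered in the limit beta -> infinity."""
    return 1.0 / (1.0 + np.exp(beta * (p_y - eta)))

def update_noise_power_soft(p_y, p_d_prev, eta, alpha=0.9, beta=5.0):
    """Blend the speech-absent and speech-present updates with the speech
    absence probability p."""
    p = speech_absence_probability(p_y, eta, beta)
    p_sa = alpha * p_d_prev + (1.0 - alpha) * p_y   # update when speech absent
    p_sp = p_d_prev                                 # hold when speech present
    return p * p_sa + (1.0 - p) * p_sp
```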
The threshold can be derived using a stochastic signal model that treats the involved signals Py, Ps, Pd as stochastic processes, using a likelihood ratio test-statistic (Neyman, J., Pearson, E., On the problem of the most efficient test of statistical hypotheses, Philosophical Transactions of the Royal Society of London, Series A, Containing Papers of a Mathematical or Physical Character 231, pp. 289-337, 1933; incorporated herein by reference in its entirety):
Λ(Py)=fPy|H1(Py|H1)/fPy|H0(Py|H0),
where fPy|H0 and fPy|H1 denote the conditional probability density functions of the observed signal power Py under the null-hypothesis H0 (noise only) and the hypothesis H1 (noise and speech), respectively. The threshold η is then chosen such that a prescribed false-alarm probability pFA (the probability of deciding for speech presence although only noise is present) is not exceeded:
pFA=p[Λ(Py)>η|H0]=∫{Py:Λ(Py)>η}fPy|H0(Py|H0)dPy.
With this equation, the threshold for a given false-alarm probability can be determined.
The threshold is a function of the unknown noise power Pd since Py=Ps+Pd. In order to be able to calculate a threshold, a prediction {tilde over (P)}d[n] of the unknown noise power for time n as discussed below is used. This yields for the threshold η[n]=η(pFA, {tilde over (P)}d[n]), where the function η(⋅) depends on the assumed probability density fPy|H0.
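As an illustration of how η(pFA, {tilde over (P)}d[n]) might be evaluated, the following sketch assumes, purely for the example, that the sub-band power under H0 is exponentially distributed with mean equal to the predicted noise power; the density assumed in an actual implementation may differ:

```python
import numpy as np

def threshold_from_false_alarm(p_d_pred, p_fa=0.05):
    """Threshold eta = eta(pFA, predicted noise power) under an assumed
    exponential H0 distribution of the sub-band power with mean p_d_pred:
    P[Py > eta | H0] = exp(-eta / p_d_pred) = pFA  =>  eta = -p_d_pred * ln(pFA)."""
    return -p_d_pred * np.log(p_fa)
```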
The key to an accurate estimation of the noise power is a correct decision whether the currently observed sample Py[n] results from speech and noise or noise only. This decision is based on the threshold calculation and depends on the targeted false-alarm probability and the noise power. Since the noise power is unknown and is itself the aim of the estimation process, the threshold cannot be calculated directly. Instead, a predicted value for the unknown noise power based on a time-variant noise model can be used, based on previous noise power estimates {circumflex over (P)}d[n−1, k, I], {circumflex over (P)}d[n−2, k, I], . . . as well as estimates produced at previous iteration steps, i.e., {circumflex over (P)}d[n, k, i−1], {circumflex over (P)}d[n, k, i−2], . . . . A prediction for the noise power for the current iteration step then can be made by using, e.g., an autoregressive model of first order (AR-1):
{tilde over (P)}d[n,k,i]=f(θ,{circumflex over (P)}d[n−1,k,I],{circumflex over (P)}d[n,k,i−1]),
where θ=[θ1, θ2, . . . , θM]T are the model parameters. In some specific embodiments, estimates from neighbouring sub-bands can be used in the prediction model, too:
{tilde over (P)}d[n,k,i]=f(θ,{circumflex over (P)}d[n−1,k,I],{circumflex over (P)}d[n,k,i−1],{circumflex over (P)}d[n−1,l≠ k,I],{circumflex over (P)}d[n,l≠k,i−1]).
The prediction model parameters θ for the noise power are adapted to increase the accuracy of succeeding predictions. This is done by using the final estimate for the noise power at time n and iteration-end I, {circumflex over (P)}d[n, k, I], and the prediction {tilde over (P)}d[n, k, I]. Specifically, the difference between the two gives information about the mismatch between the model and the actual noise process, and is used for adapting the model parameters. Since the model is adapted, the parameters change over time, i.e., the (linear or nonlinear) model itself changes over time. The adaptation rule as described further below defines how the parameters are adapted to the current situation.
For predicting the noise-power, various different specific models can be used; for example, a linear AR-11 model in which the predicted noise power is a linear combination of the estimated noise power of the previous iteration and two directly neighbouring sub-bands:
{tilde over (P)}d[n,k,i]=θ−1[n,k]{circumflex over (P)}d[n,k−1,i−1]+θ0[n,k]{circumflex over (P)}d[n,k,i−1]+θ+1[n,k]{circumflex over (P)}d[n,k+1,i−1],
whereby for i=1, {circumflex over (P)}d[n, k, 0]={circumflex over (P)}d[n−1, k, I], i.e., the estimate from the previous time frame n−1. Or a linear AR-ML model could be employed where the predicted noise power is a linear combination of M already estimated noise powers and the estimated noise power of the last iteration, as well as 2 L neighbouring noise power estimates:
Or a nonlinear model could be used where the predicted noise power is a nonlinear function with respect to the estimated noise powers, in which case, many different alternatives can be implemented, such as a recursive polynomial model.
For a linear-in-the-parameters prediction model, the model parameters can be condensed into a vector and the prediction is written as {tilde over (P)}d[n, k, i]=ψn,k,iTθn,k. For a linear AR-11 model:
ψn,k,iT=[{circumflex over (P)}d[n,k−1,i−1],{circumflex over (P)}d[n,k,i−1],{circumflex over (P)}d[n,k+1,i−1]]
and:
θn,kT=[θ−1[n,k],θ0[n,k],θ+1[n,k]].
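A minimal Python sketch of this AR-11 prediction as a dot product ψTθ follows; the clamping at the band edges is an assumption, since the text does not specify how the outermost sub-bands are handled:

```python
import numpy as np

def predict_noise_power_ar11(p_hat_prev_iter, k, theta):
    """Linear AR-11 prediction for sub-band k: a linear combination of the
    estimates for sub-bands k-1, k, k+1 from the previous iteration (or from
    the previous time frame when i = 1).

    p_hat_prev_iter : noise power estimates over all sub-bands, shape (K,)
    theta           : parameter vector [theta_-1, theta_0, theta_+1]
    """
    lo, hi = max(k - 1, 0), min(k + 1, len(p_hat_prev_iter) - 1)  # clamp at edges
    psi = np.array([p_hat_prev_iter[lo], p_hat_prev_iter[k], p_hat_prev_iter[hi]])
    return float(np.dot(psi, theta))
```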
Two cases, reflecting two situations prone to a false decision regarding speech presence or absence, can be briefly considered. In a case where the noise power is rising and speech is absent, a decision for speech presence is likely due to the increasing signal power. If at time frame n, sub-band k, iteration i=1 it was erroneously decided for speech presence, the noise power estimate is not updated and will not follow the increasing noise power, i.e., it will be too small. If in the neighbouring sub-bands k−1, k+1 the decision is correct, the estimates for the noise power are updated correctly and increase. In the next iteration step, the prediction for the noise power in sub-band k is based on the updated noise power estimates in the neighbouring sub-bands and will also increase, assuming the noise model is sufficiently accurate. The probability of a correct speech presence or absence decision at this iteration step is now increased since the noise power prediction will be more accurate, making the decision for speech absence more likely and resulting in a larger probability of an update of the noise power estimate.
In a different case where the noise power is falling and speech is present, a decision for speech absence is likely due to the decreasing signal level. That is, it might happen that at time frame n, sub-band k, iteration i=1, speech absence is decided and the noise power is then updated erroneously. Assuming correct decisions and updates in the neighbouring sub-bands, i.e., decreasing noise power estimates there, at iteration i=2 speech presence might be decided, leading to a correct update of the noise power.
With this method, the speech absence probability is iteratively calculated, and, due to the correlation across sub-bands, it is assumed that a false decision at one iteration step is corrected in one of the following steps.
The adaptation of the prediction model is based on the prediction error at time frame n, sub-band k, and the final iteration I:
e[n,k,I]={circumflex over (P)}d[n,k,I]−{tilde over (P)}d[n,k,I].
The prediction model parameters can then be adapted, e.g., using a steepest descent method that minimizes the mean squared prediction error J=E{e[n,k,I]2}:
θn,k=θn−1,k−μ∇θJ,
with a fixed (or time variant) step-size μ determining the adaptation accuracy and tracking speed. Typically, since the expectation E{⋅} cannot be calculated due to lack of knowledge of the statistics of the prediction error, a stochastic gradient descent method can be used, e.g., the least mean square (LMS) method
θn,k=θn−1,k−μ∇θe[n,k,I]2=θn−1,k+μψn,k,Ie[n,k].
Advantageously, the adaptation considers only cases where the probability for a good noise power estimation is high, i.e., cases when it is relatively certain that speech is not present, since then the noise power was estimated accurately with high probability. For an AR-11 prediction model with
ψn,k,IT=[{circumflex over (P)}d[n,k−1,I],{circumflex over (P)}d[n,k,I],{circumflex over (P)}d[n,k+1,I]]
the fixed step-size turns into a 3×3 diagonal time-variant step-size matrix,
Qn,k,I=μ diag(p[n,k−1,I],p[n,k,I],p[n,k+1,I]),
incorporating the speech-absent probabilities. With this matrix step-size, the update equation reads
θn,k=θn−1,k+Qn,k,Iψn,k,Ie[n,k],
thus restricting model adaptation more or less to speech-absent periods.
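A hedged sketch of this weighted LMS update, with an illustrative step-size value, is:

```python
import numpy as np

def adapt_ar11_parameters(theta, psi, e, p_absence, mu=1e-3):
    """LMS update of the AR-11 parameters with the scalar step-size replaced by
    a diagonal matrix that scales each component by the corresponding speech
    absence probability, so adaptation happens mostly in speech-absent periods.

    theta     : current parameter vector [theta_-1, theta_0, theta_+1]
    psi       : regression vector of final noise power estimates (k-1, k, k+1)
    e         : prediction error e[n, k, I]
    p_absence : speech absence probabilities for sub-bands k-1, k, k+1
    """
    q = mu * np.diag(np.asarray(p_absence, dtype=float))  # Q[n, k, I]
    return np.asarray(theta, dtype=float) + q @ (np.asarray(psi, dtype=float) * e)
```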
Adaptation and iteration can be interleaved in at least two possible ways. In the first method, the model parameters are adapted once per time frame, after all I iteration steps for that time frame have been performed. In the second method, the model parameters are adapted after each individual iteration step i within a time frame.
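Tying the pieces together, a hedged end-to-end sketch of one time frame under the first interleaving method might look as follows. The iteration count, false-alarm probability, smoothing constant, step-size, the exponential H0 assumption behind the threshold, and the omission of the speech-absence weighting in the adaptation step are all illustrative simplifications:

```python
import numpy as np

def estimate_noise_powers(p_y, p_d_prev, theta, num_iters=3,
                          p_fa=0.05, alpha=0.9, mu=1e-3):
    """One time frame of the iterative noise power estimator.

    p_y      : observed sub-band powers Py[n, k], shape (K,)
    p_d_prev : final noise power estimates of frame n-1, shape (K,)
    theta    : AR-11 parameter vectors per sub-band, shape (K, 3)
    """
    K = len(p_y)
    p_hat = np.array(p_d_prev, dtype=float)   # estimates of iteration i-1
    p_pred = np.zeros(K)
    for _ in range(num_iters):
        p_prev_iter = p_hat.copy()
        for k in range(K):
            lo, hi = max(k - 1, 0), min(k + 1, K - 1)
            psi = np.array([p_prev_iter[lo], p_prev_iter[k], p_prev_iter[hi]])
            p_pred[k] = psi @ theta[k]                  # AR-11 prediction
            eta = -p_pred[k] * np.log(p_fa)             # threshold (assumed exponential H0)
            if p_y[k] <= eta:                           # speech absent: update
                p_hat[k] = alpha * p_d_prev[k] + (1.0 - alpha) * p_y[k]
            else:                                       # speech present: hold
                p_hat[k] = p_d_prev[k]
    for k in range(K):                                  # adapt once after iteration I
        lo, hi = max(k - 1, 0), min(k + 1, K - 1)
        psi = np.array([p_hat[lo], p_hat[k], p_hat[hi]])
        e = p_hat[k] - p_pred[k]                        # prediction error e[n, k, I]
        theta[k] = theta[k] + mu * psi * e              # plain LMS (Q weighting omitted)
    return p_hat, theta
```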
Due to the recursive approach described above, changing noise power over time can be tracked with only a short delay. And due to the adaptation of the prediction model, the system is able to adapt to various acoustical situations, especially adaptation to various noise types. In addition, this approach is of relatively low arithmetical complexity compared to existing arrangements. Of course, due to the recursive approach, a system might become unstable for some unfavourable combination of parameters and input signal.
Embodiments of the invention may be implemented in part by any conventional computer programming language. For example, preferred embodiments may be implemented in a procedural programming language (e.g., “C”) or an object oriented programming language (e.g., “C++”, Python). Alternative embodiments of the invention may be implemented as pre-programmed hardware elements, other related components, or as a combination of hardware and software components.
Embodiments can be implemented in part as a computer program product for use with a computer system. Such implementation may include a series of computer instructions fixed either on a tangible medium, such as a computer readable medium (e.g., a diskette, CD-ROM, ROM, or fixed disk) or transmittable to a computer system, via a modem or other interface device, such as a communications adapter connected to a network over a medium. The medium may be either a tangible medium (e.g., optical or analog communications lines) or a medium implemented with wireless techniques (e.g., microwave, infrared or other transmission techniques). The series of computer instructions embodies all or part of the functionality previously described herein with respect to the system. Those skilled in the art should appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Furthermore, such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies. It is expected that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the network (e.g., the Internet or World Wide Web). Of course, some embodiments of the invention may be implemented as a combination of both software (e.g., a computer program product) and hardware. Still other embodiments of the invention are implemented as entirely hardware, or entirely software (e.g., a computer program product).
Although various exemplary embodiments of the invention have been disclosed, it should be apparent to those skilled in the art that various changes and modifications can be made which will achieve some of the advantages of the invention without departing from the true scope of the invention.
This application claims priority to U.S. Provisional Patent Application 62/349,175, filed Jun. 13, 2016, which is incorporated herein by reference in its entirety.