Embodiments described herein relate generally to speech processing systems.
It is often necessary to understand speech in a noisy environment, for example when using a mobile telephone in a crowded place, listening to a media file on a mobile device, or listening to a public announcement at a station.
It is possible to enhance a speech signal such that it is more intelligible in such environments.
Systems and methods in accordance with non-limiting embodiments will now be described with reference to the accompanying figures in which:
In an embodiment, a speech intelligibility enhancing system is provided for enhancing speech to be outputted in a noisy environment, the system comprising:
In systems in accordance with the above embodiments, the output is adapted to the noise environment. Further, the output is continually updated such that it adapts in real time to the changing noise environment. For example, if the above system is built into a mobile telephone and the user is standing outside a noisy room, the system can adapt to enhance the speech dependent on whether the door to the room is open or closed. Similarly, if the system is used in a public address system in a railway station, the system can adapt in real time to the changing noise conditions as trains arrive and depart.
In an embodiment, the signal to noise ratio is estimated on a frame-by-frame basis and the signal to noise ratio for a previous frame is used to update the parameters for a current frame. A typical frame length is from 1 to 3 seconds.
The above system can adapt the spectral shaping filter and/or the dynamic range compression stage to the noisy environment. In some embodiments, both the spectral shaping filter and the dynamic range compression stage will be adapted to the noisy environment.
When adapting the dynamic range compression in line with the SNR, the control parameter that is updated may be used to control the gain to be applied by said dynamic range compression. In further embodiments, the control parameter is updated such that it gradually suppresses the boosting of the low energy segments of the input speech with increasing signal to noise ratio. In some embodiments, a linear relationship is assumed between the SNR and the control parameter; in other embodiments, a non-linear (for example logistic) relationship is used.
To control the volume of the output, in some embodiments, the system further comprises an energy banking box, said energy banking box being a memory provided in said system and configured to store the total energy of said input speech before enhancement, said processor being further configured to increase the energy of low energy parts of the enhanced signal using energy stored in the energy banking box.
The spectral shaping filter may comprise an adaptive spectral shaping stage and a fixed spectral shaping stage. The adaptive spectral shaping stage may comprise a formant shaping filter and a filter to reduce the spectral tilt. In an embodiment, a first control parameter is provided to control said formant shaping filter and a second control parameter is configured to control said filter configured to reduce the spectral tilt and wherein said first and/or second control parameters are updated in accordance with the signal to noise ratio. The first and/or second control parameters may have a linear dependence on said signal to noise ratio.
The above discussion has concentrated on adapting the signal in response to an SNR. However, the system may be further configured to modify the spectral shaping filter in accordance with the input speech independent of noise measurements. For example, the processor may be configured to estimate the maximum probability of voicing when applying the spectral shaping filter, and wherein the system is configured to update the maximum probability of voicing every m seconds, wherein m is a value from 2 to 10.
The system may also be additionally or alternatively configured to modify the dynamic range compression in accordance with the input speech independent of noise measurements. For example, the processor is configured to estimate the maximum value of the signal envelope of the input speech when applying dynamic range compression and wherein the system is configured to update the maximum value of the signal envelope of the input speech every m seconds, wherein m is a value from 2 to 10.
The system may also be configured to output enhanced speech in a plurality of locations. For example, such a system may comprise a plurality of noise inputs corresponding to the plurality of locations, the processor being configured to apply a plurality of spectral shaping filters and a plurality of corresponding dynamic range compression stages, such that there is a spectral shaping filter and dynamic range compression stage pair for each noise input, the processor being configured to update the control parameters for each spectral shaping filter and dynamic range compression stage pair in accordance with the signal to noise ratio measured from its corresponding noise input. Such a system would be of use for example in a PA system with a plurality of speakers in different environments.
In further embodiments, a method for enhancing speech to be outputted in a noisy environment is provided, the method comprising:
The above embodiments have discussed adaptability of the system in response to SNR. However, in some embodiments, the speech is enhanced independent of the SNR of the environment where it is to be output. Here, a speech intelligibility enhancing system for enhancing speech to be output is provided, the system comprising:
For example, the processor may be configured to estimate the maximum probability of voicing when applying the spectral shaping filter, and wherein the system is configured to update the maximum probability of voicing every m seconds, wherein m is a value from 2 to 10.
The system may also be additionally or alternatively configured to modify the dynamic range compression in accordance with the input speech independent of noise measurements. For example, the processor is configured to estimate the maximum value of the signal envelope of the input speech when applying dynamic range compression and wherein the system is configured to update the maximum value of the signal envelope of the input speech every m seconds, wherein m is a value from 2 to 10.
In a further embodiment, a method for enhancing speech intelligibility is provided, the method comprising:
Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal.
The system 1 comprises a processor 3 which comprises a program 5 which takes input speech and information about the noise conditions where the speech will be output and enhances the speech to increase its intelligibility in the presence of noise. The storage 7 stores data that is used by the program 5. Details of what data is stored will be described later.
The system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to an input for data relating to the speech to be enhanced and also an input for collecting data concerning the real time noise conditions in the places where the enhanced speech is to be output. The type of data that is input may take many forms, which will be described in more detail later. The input 15 may be an interface that allows a user to directly input data. Alternatively, the input may be a receiver for receiving data from an external storage medium or a network.
Connected to the output module 13 is an audio output 17.
In use, the system 1 receives data through data input 15. The program 5, executed on processor 3, enhances the inputted speech in the manner which will be described with reference to
Step S21 operates in the frequency domain and its purpose is to increase the “crisp” and “clean” quality of the speech signal, and therefore improve the intelligibility of speech even in clear (not-noisy) conditions. This is achieved by sharpening the formant information (following observations in clear speech) and by reducing spectral tilt using pre-emphasis filters (following observations in Lombard speech). The specific characteristics of this sub-system are adapted to the degree of speech frame voicing.
The steps S21 and S23 are shown in more detail in
In this embodiment, the spectral intelligibility improvements are applied inside the adaptive Spectral Shaping stage S31. In this embodiment, the adaptive spectral shaping stage comprises a first transformation which is a formant sharpening transformation and a second transformation which is a spectral tilt flattening transformation. Both the first and second transformations are adapted to the voiced nature of speech, given as a probability of voicing per speech frame. These adaptive filter stages are used to suppress artefacts in the processed signal especially in fricatives, silence or other “quiet” areas of speech.
Given a speech frame, the probability of voicing which is determined in step S35 is defined as:
Where α=1/max(Pv(t)) is a normalisation parameter, and rms(t) and z(t) denote the RMS value and the zero-crossing rate, respectively.
A speech frame sri(t)
sri(t)=s(t)wr(ti−t) (2)
is extracted from the speech signal s(t) using a rectangular window wr(t) centred at each analysis instant ti. In an embodiment, the window length is 2.5 times the average fundamental period for the speaker's gender (8.3 ms and 4.5 ms for male and female speakers, respectively). In this particular embodiment, analysis frames are extracted every 10 ms. The two above transformations are adaptive (to the local probability of voicing) filters that are used to implement the adaptive spectral shaping.
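By way of illustration, the sketch below (Python with NumPy) extracts rectangularly windowed frames as in equation (2) and computes the per-frame RMS value and zero-crossing rate referred to above. Since equation (1) itself is not reproduced here, their combination into Pv(ti) is left to that expression; the 8.3 ms average fundamental period (male speaker) and the 10 ms analysis step follow the text, while the function name and the exact zero-crossing definition are illustrative assumptions.

```python
import numpy as np

def voicing_features(s, fs=16000, t0_avg=0.0083, hop=0.010):
    """Per-frame RMS and zero-crossing rate of the speech signal s.

    Frames use a rectangular window of length 2.5 times the average
    fundamental period (8.3 ms assumed here, i.e. a male speaker),
    centred on analysis instants spaced 10 ms apart, as in equation (2).
    """
    win_len = int(round(2.5 * t0_avg * fs))   # ~332 samples at 16 kHz
    hop_len = int(round(hop * fs))            # 160 samples at 16 kHz
    half = win_len // 2
    rms, zcr = [], []
    for centre in range(half, len(s) - half, hop_len):
        frame = s[centre - half:centre + half]          # rectangular window w_r
        rms.append(np.sqrt(np.mean(frame ** 2)))
        # zero-crossing rate: fraction of adjacent samples with a sign change
        zcr.append(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    return np.array(rms), np.array(zcr)
```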
First, the formant shaping filter is applied. The input of this filter is obtained by extracting speech frames sni(t) using Hanning windows of the same length as those specified for computing the probability of voicing, then applying an N-point discrete Fourier transform (DFT) in step S37
and estimating the magnitude spectral envelope E(ωk; ti) for every frame i. The magnitude spectral envelope is estimated using the magnitude spectrum in (3) and a spectral envelope estimation vocoder (SEEVOC) algorithm in step S39. Fitting the spectral envelope by cepstral analysis provides a set of cepstral coefficients, c:
which are used to compute the spectral tilt, T(ω, ti):
log T(ω,ti)=c0+2c1 cos(ω) (5)
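A minimal sketch of this tilt computation is given below. It fits the envelope with a plain real-cepstrum calculation rather than the SEEVOC algorithm named above, so it should be read as an approximation of the described step under that assumption; only the first two cepstral coefficients are needed for the tilt of equation (5).

```python
import numpy as np

def spectral_tilt(frame, n_fft=1024):
    """Estimate the spectral tilt log T(w) = c0 + 2*c1*cos(w) of one frame.

    A plain real cepstrum is used in place of the SEEVOC envelope fit
    referenced in the text.
    """
    windowed = frame * np.hanning(len(frame))
    mag = np.abs(np.fft.rfft(windowed, n_fft)) + 1e-12   # avoid log(0)
    cep = np.fft.irfft(np.log(mag), n_fft)               # real cepstrum
    c0, c1 = cep[0], cep[1]
    w = np.linspace(0.0, np.pi, n_fft // 2 + 1)          # bin frequencies
    log_tilt = c0 + 2.0 * c1 * np.cos(w)                 # equation (5)
    return np.exp(log_tilt)                              # T(w, t_i)
```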
Thus, the adaptive formant shaping filter is defined as:
The formant enhancement achieved using the filter defined by equation (6) is controlled by the local probability of voicing Pv(ti) and the β parameter, which allows for an extra noise-dependent adaptivity of Hs.
In an embodiment, β is fixed; in other embodiments, it is controlled in accordance with the signal to noise ratio (SNR) of the environment where the voice signal is to be outputted.
For example, β may be set to a fixed value of β0. In an embodiment, β0 is 0.25 or 0.3. If β is adapted with noise, then for example:
if SNR<=0, β=β0
if 0<SNR<=15, β=β0*(1−SNR/15)
if SNR>15, β=0
The above example assumes a linear relationship between β and the SNR, but a non-linear relationship could also be used.
The second adaptive (to the probability of voicing) filter which is applied in step S31 is used to reduce the spectral tilt. In an embodiment, the pre-emphasis filter is expressed as:
where ω0=0.1257π for a sampling frequency of 16 kHz.
In some embodiments, g is fixed; in other embodiments, g is dependent on the SNR of the environment where the voice signal is to be outputted.
For example, g may be set to a fixed value of g0. In an embodiment, g0 is 0.3. If g is adapted with noise, then for example:
if SNR<=0, g=g0
if 0<SNR<=15, g=g0*(1−SNR/15)
if SNR>15, g=0
The above example assumes a linear relationship between g and the SNR, but a non-linear relationship could also be used.
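Both schedules can be expressed as a single helper, sketched below. The values follow the piecewise-linear examples given above for β and g; only the function name is an arbitrary choice.

```python
def noise_adaptive_param(snr_db, p0):
    """Piecewise-linear fade-out of a shaping parameter with SNR.

    Full strength at or below 0 dB SNR, linear decay up to 15 dB,
    switched off above 15 dB, as in the examples for beta and g.
    """
    if snr_db <= 0:
        return p0
    if snr_db <= 15:
        return p0 * (1.0 - snr_db / 15.0)
    return 0.0

beta = noise_adaptive_param(snr_db=7.5, p0=0.3)   # -> 0.15
g = noise_adaptive_param(snr_db=20.0, p0=0.3)     # -> 0.0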
The fixed Spectral Shaping step (S33) is a filter Hr(ω; ti) used to protect the speech signal from low-pass operations during its reproduction. In frequency, Hr boosts the energy between 1000 Hz and 4000 Hz by 12 dB/octave and reduces the frequencies below 500 Hz by 6 dB/octave. Both voiced and unvoiced speech segments are equally affected by the low-pass operations. In this embodiment, the filter is not related to the probability of voicing.
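One way to realise such a filter is as a per-bin gain mask applied in the DFT domain, as sketched below. The dB/octave slopes follow the text, whereas the 0 dB reference points, the flat region between 500 Hz and 1000 Hz, and the behaviour above 4 kHz are assumptions made for this sketch.

```python
import numpy as np

def fixed_shaping_gain(freqs_hz):
    """Per-bin linear gain approximating the fixed filter H_r.

    +12 dB/octave between 1 kHz and 4 kHz, -6 dB/octave below 500 Hz
    (both from the text); the remaining regions are assumed flat.
    """
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    gain_db = np.zeros_like(freqs_hz)
    low = freqs_hz < 500.0
    gain_db[low] = -6.0 * np.log2(500.0 / np.maximum(freqs_hz[low], 31.25))
    band = (freqs_hz >= 1000.0) & (freqs_hz <= 4000.0)
    gain_db[band] = 12.0 * np.log2(freqs_hz[band] / 1000.0)
    gain_db[freqs_hz > 4000.0] = 12.0 * np.log2(4.0)   # hold +24 dB above 4 kHz
    return 10.0 ** (gain_db / 20.0)
```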
Finally, after the magnitude spectra are modified according to:
|Ŝ(ω,ti)|=|S(ω,ti)|·Hs(ω,ti)·Hp(ω,ti)·Hr(ω,ti) (8)
the modified speech signal is reconstructed by means of inverse DFT (S41) and Overlap-and-Add, using the original phase spectra as shown in
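A compact sketch of this modification-and-resynthesis step is given below. Equation (8) and the retention of the original phase follow the text, while the Hann analysis/synthesis windows, the 50% overlap and the overlap-add normalisation are conventional choices assumed for the example.

```python
import numpy as np

def shape_and_resynthesise(s, gain_for_frame, frame_len=512, hop=256):
    """Apply per-frame magnitude-domain gains and reconstruct by overlap-add.

    gain_for_frame(i, n_bins) should return the combined real-valued gain
    Hs*Hp*Hr for frame i (equation (8)); the original phase is kept and the
    frame is resynthesised with an inverse DFT, as in step S41.
    """
    window = np.hanning(frame_len)
    out = np.zeros(len(s) + frame_len)
    norm = np.zeros(len(s) + frame_len)
    for i, start in enumerate(range(0, len(s) - frame_len, hop)):
        frame = s[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        mag *= gain_for_frame(i, len(spec))            # |S_hat| = |S|*Hs*Hp*Hr
        mod = np.fft.irfft(mag * np.exp(1j * phase), frame_len)
        out[start:start + frame_len] += mod * window   # overlap-and-add
        norm[start:start + frame_len] += window ** 2
    return out[:len(s)] / np.maximum(norm[:len(s)], 1e-8)
```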
In the above described spectral shaping step, the parameters β and g may be controlled in accordance with real time information about the signal to noise ratio in the environment where the speech is to be outputted.
Returning to
The signal's time envelope is estimated in step S51 using the magnitude of the analytical signal:
ẽ(n)=|s(n)+jš(n)| (9)
where š(n) denotes the Hilbert transform of the speech signal s(n). Furthermore, because the estimate in (9) has fast fluctuations, a new estimate e(n) is computed based on a moving average operator with order given by the average pitch of the speaker's gender. In an embodiment, the speaker's gender is assumed to be male since the average fundamental period is longer for men. However, in some embodiments as noted above, the system can be adapted specifically for female speakers with a shorter fundamental period.
The signal is then passed to the DRC dynamic step S53. In an embodiment, during the DRC's dynamic stage S53, the envelope of the signal is dynamically compressed with 2 ms release and almost instantaneous attack time constants:
where ar=0.15 and aa=0.0001.
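The sketch below reproduces the envelope estimation of equation (9) and the moving-average smoothing. Since equation (10) is not reproduced above, the attack/release recursion shown is a conventional one-pole smoother using the stated constants and should be read as an assumed form rather than the exact expression.

```python
import numpy as np
from scipy.signal import hilbert

def dynamic_envelope(s, fs=16000, t0_avg=0.0083, a_r=0.15, a_a=0.0001):
    """Envelope estimation (equation (9)) followed by the dynamic stage.

    The analytic-signal magnitude is smoothed by a moving average whose
    order follows the average male fundamental period; the attack/release
    recursion is an assumed standard form using the stated constants.
    """
    e_tilde = np.abs(hilbert(s))                   # |s(n) + j*Hilbert{s}(n)|
    order = max(1, int(round(t0_avg * fs)))        # ~133 samples at 16 kHz
    e = np.convolve(e_tilde, np.ones(order) / order, mode="same")
    e_hat = np.empty_like(e)
    e_hat[0] = e[0]
    for n in range(1, len(e)):
        a = a_a if e[n] > e_hat[n - 1] else a_r    # fast attack, 2 ms release
        e_hat[n] = a * e_hat[n - 1] + (1.0 - a) * e[n]
    return e_hat
```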
Following the dynamic stage S53, a static amplitude compression step S55 controlled by an Input-Output Envelope Characteristic (IOEC) is applied.
The IOEC curve depicted in
ein(n)=20 log10(ê(n)/e0) (11)
setting the reference level e0 to 0.3 times the maximum level of the signal's envelope, a selection that provided good listening results for a broad range of SNRs. Then, applying the IOEC to (11) generates eout(n) and allows the computation of the time-varying gains:
g(n)=10^((eout(n)−ein(n))/20) (12)
which produces the DRC-modified speech signal which is shown in
sg(n)=g(n)s(n) (13)
As a final step, the global power of sg(n) is altered to match that of the unmodified speech signal.
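The static stage can be sketched as below. Equations (11) to (13) and the final power matching follow the text, while the linear interpolation between IOEC breakpoints (which must be supplied in increasing order of input level) is an assumption.

```python
import numpy as np

def apply_ioec(s, e_hat, ioec_in_db, ioec_out_db):
    """Static compression stage: equations (11)-(13) plus power matching.

    ioec_in_db / ioec_out_db describe the IOEC curve as matching arrays of
    input and output envelope levels in dB; linear interpolation between
    the breakpoints is assumed here.
    """
    e0 = 0.3 * np.max(e_hat)                                   # reference level
    e_in = 20.0 * np.log10(np.maximum(e_hat, 1e-12) / e0)      # equation (11)
    e_out = np.interp(e_in, ioec_in_db, ioec_out_db)           # IOEC lookup
    gain = 10.0 ** ((e_out - e_in) / 20.0)                     # equation (12)
    s_g = gain * s                                             # equation (13)
    # restore the global power of the unmodified signal
    s_g *= np.sqrt(np.sum(s ** 2) / max(np.sum(s_g ** 2), 1e-12))
    return s_g
```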
In an embodiment, the IOEC curve is controlled in accordance with the SNR where the speech is to be output. Such a curve is shown in
In
A piecewise linear IOEC (as the one given in
(Pi², Pi+1²) has the following analytical expression:
(Pi², Pi+1²): y(x,λ)=a(λ)x+b(λ); x∈[xi, xi+1] (14)
where a(λ) is the segment's slope
and b(λ) is the segment's offset
b(λ)=yi(λ)−a(λ)xi (16)
Two embodiments will now be discussed in which two types of morphing method are used to control the IOEC curve: a linear and a non-linear (logistic) slope variation over λ. For an embodiment where a linear relationship is employed, the following expression may be used for a:
For the non-linear (logistic) form:
where λ0 is the logistic offset, σ0 is the logistic slope, while
In an embodiment, λ0 and σ0 are constants given as input parameters for each type of noise (e.g., for SSN-type noise they may be chosen as −6 dB and 2, respectively). In a further embodiment, λ0 and/or σ0 may be controlled in accordance with the measured SNR. For example, they may be controlled as described above for β and g, with a linear relationship on the SNR.
Finally, imposing P0¹=P0², the adaptive IOEC is computed for a given λ, considering expression (17) or (18) as the slope for each of its segments i=
Psychometric measurements have indicated that speech intelligibility changes with SNR following a logistic function of the type used in accordance with the above embodiment.
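For illustration, the sketch below morphs the breakpoints of a compressive IOEC towards a pass-through curve as a function of the measured SNR. Because equations (15), (17) and (18) are not reproduced above, the blending of output levels (rather than of segment slopes), the exact linear weighting range and the logistic form are assumptions; the logistic offset and slope values are taken from the SSN example above.

```python
import numpy as np

def morph_ioec(x_pts, y_drc, y_flat, snr_db, mode="logistic",
               lam0=-6.0, sigma0=2.0, snr_lo=0.0, snr_hi=15.0):
    """Blend a compressive IOEC (y_drc) with a pass-through IOEC (y_flat).

    The weight is linear or logistic in the measured SNR (lambda); the
    0-15 dB linear range mirrors the beta/g schedules and is an assumption.
    """
    if mode == "linear":
        w = np.clip((snr_db - snr_lo) / (snr_hi - snr_lo), 0.0, 1.0)
    else:  # logistic in SNR with offset lam0 and slope sigma0
        w = 1.0 / (1.0 + np.exp(-(snr_db - lam0) / sigma0))
    # w = 0 -> full dynamic range compression, w = 1 -> no compression
    return (1.0 - w) * np.asarray(y_drc) + w * np.asarray(y_flat)
```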
In the above embodiments, the spectral shaping step S21 and the DRC step S23 are very fast processes which allow real-time execution while producing modified speech of high perceptual quality.
Systems in accordance with the above described embodiments show enhanced performance in terms of speech intelligibility gain, especially at low SNRs. They also provide suppression of audible artefacts inside the modified speech signal at high SNRs. At high SNRs, increasing the amplitude of low energy segments of speech (such as unvoiced speech) can cause perceptual quality and intelligibility degradation.
Systems and methods in accordance with the above embodiments provide a light, simple and fast method of adapting dynamic range compression to the noise conditions, inheriting high speech intelligibility gains at low SNRs from the non-adaptive DRC and improving perceptual quality and intelligibility at high SNRs.
Returning to
If speech is not present, the system is off. In stage S61 a voice activity detection module is provided to detect the presence of speech. Once speech is detected, the speech signal is passed for enhancement. The voice activity detection module may employ a standard voice activity detection (VAD) algorithm.
The speech will be output at speech output 63. Sensors are provided at speech output 63 to allow the noise and SNR at the output to be measured. The SNR determined at speech output 63 is used to calculate β and g in stage S21. Similarly, the SNR λ is used to control stage S23 as described in relation to
The current SNR at frame t is predicted from previous frames of noise, as these have already been observed in the past (t−1, t−2, t−3, . . . ). In an embodiment, the SNR is estimated using long windows in order to avoid fast changes in the application of stages S21 and S23. In an example, the window lengths can be from 1 s to 3 s.
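A deliberately simple running estimate of this kind is sketched below. The 2 s window sits inside the 1 to 3 s range given above; the use of a plain power ratio over the most recent samples is an assumption rather than a particular published estimator.

```python
import numpy as np

def frame_snr_db(speech, noise, fs=16000, window_s=2.0):
    """Long-window SNR estimate used to drive the adaptation.

    Both inputs are the most recent samples of the speech to be output and
    of the noise captured at the output location.
    """
    n = int(window_s * fs)
    p_speech = np.mean(speech[-n:] ** 2) + 1e-12
    p_noise = np.mean(noise[-n:] ** 2) + 1e-12
    return 10.0 * np.log10(p_speech / p_noise)
```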
The system of
In stage S23, in the above embodiment, e0 was set to 0.3 times the maximum value of the signal envelope. This maximum value can be continually updated dependent on the input signal. Again, it can be updated every n seconds, where n is a value between 2 and 10; in one embodiment, n is from 3 to 5.
The initial values for the maximum probability of voicing and the maximum value of the signal envelope are obtained from database 65 where speech signals have been previously analysed and these parameters have been extracted. These parameters are passed to parameter update stage S67 with the speech signal and stage S67 updates these parameters.
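One simple way to realise this periodic refresh is a rolling maximum tracker of the kind sketched below. The reset-every-m-seconds policy is an assumption about how the update could be implemented; the initial value would be taken from database 65.

```python
class RollingMaxTracker:
    """Running maximum of a parameter (e.g. probability of voicing or
    signal envelope), refreshed every m seconds as described in the text."""

    def __init__(self, initial_max, m_seconds=4.0):
        self.current_max = initial_max    # seeded from the pre-analysed database
        self.m_seconds = m_seconds
        self.elapsed = 0.0
        self.window_max = 0.0

    def update(self, value, frame_duration):
        self.window_max = max(self.window_max, value)
        self.elapsed += frame_duration
        if self.elapsed >= self.m_seconds:     # refresh every m seconds
            self.current_max = self.window_max
            self.window_max = 0.0
            self.elapsed = 0.0
        return self.current_max
```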
In an embodiment, during the dynamic range compression, energy is distributed over time. This modification is constrained by the following condition: the total energy of the signal before and after modification should remain the same (otherwise one could increase intelligibility simply by increasing the energy of the signal, i.e. the volume). Since the signal which is modified is not known a priori, Energy Banking box 69 is provided. In box 69, energy from the most energetic parts of speech is “taken” and saved (as in a bank) and is then distributed to the less energetic parts of speech. These less energetic parts are very vulnerable to the noise. In this way, the distribution of energy helps the modified signal as a whole to remain above the noise level. In an embodiment, this can be implemented by modifying equation (13) to be:
sga(n)=sg(n)a(n) (20)
Where a(n) is calculated from the values saved in the energy banking box to allow the overall modified signal to be above the noise level.
If E(sg(n))>E(Noise(n)) then a(n)=1, (21)
where E(sg(n)) is the energy of the enhanced signal sg(n) for the frame (n) and E(Noise(n)) is the energy of the noise for the same frame.
If E(sg (n))≤E(Noise(n)) the system attempts to further distribute energy to boost low energy parts of the signal so that they are above the level of the noise. However, the system only attempts to further distribute the energy if there is energy Eb stored in the energy banking box.
If the gain g(n)<1, then the energy difference between the input signal and the enhanced signal (E(s(n))−E(sg(n))) is stored in the energy banking box. The energy banking box stores the sum of these energy differences where g(n)<1 to provide the stored energy Eb.
To calculate α(n) when E(sg(n))≤E(Noise(n)), a bound on α(n) is derived as α1:
A second expression α2(n) for α(n) is derived using Eb:
Where γ is a parameter chosen such that 0<γ≤1, which expresses the fraction of the energy bank that can be allocated to a single frame. In an embodiment, γ=0.2, but other values can be used.
If α2(n)≥α1, then α(n)=α2(n) (24)
However,
If α2(n)<α1, then α(n)=1 (25)
When energy is distributed as above, the energy is removed from the energy banking box Eb such that the new value of Eb is:
Eb−E(sg(n))(α(n)−1) (26)
Once α(n) is derived, it is applied to the enhanced speech signal in step S71.
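A frame-level sketch of the energy banking logic is given below. The deposit rule, the decision of equations (24) and (25) and the withdrawal of equation (26) follow the text, whereas the expressions used for α1 and α2 are assumptions, because equations (22) and (23) are not reproduced above.

```python
import numpy as np

class EnergyBank:
    """Frame-level energy redistribution sketch for the energy banking box.

    Deposits follow the text (store E(s) - E(sg) whenever the gain < 1);
    the forms of alpha1 and alpha2 are assumed, while the decision rule of
    equations (24)-(25) and the withdrawal of equation (26) follow the text.
    """

    def __init__(self, gamma=0.2):
        self.e_b = 0.0          # stored energy Eb
        self.gamma = gamma      # fraction of the bank usable per frame

    def deposit(self, s_frame, sg_frame, gain):
        if gain < 1.0:          # energy removed by the DRC gain is banked
            self.e_b += max(np.sum(s_frame ** 2) - np.sum(sg_frame ** 2), 0.0)

    def boost_factor(self, sg_frame, noise_frame):
        e_sg = np.sum(sg_frame ** 2) + 1e-12
        e_noise = np.sum(noise_frame ** 2)
        if e_sg > e_noise:                          # equation (21): no boost needed
            return 1.0
        alpha1 = e_noise / e_sg                     # assumed form of the bound (22)
        alpha2 = 1.0 + self.gamma * self.e_b / e_sg # assumed form of (23)
        alpha = alpha2 if alpha2 >= alpha1 else 1.0 # equations (24)-(25)
        self.e_b -= e_sg * (alpha - 1.0)            # withdrawal per equation (26)
        return alpha
```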
The system of
The system of
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed the novel methods and apparatus described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of methods and apparatus described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms of modifications as would fall within the scope and spirit of the inventions.
| Number | Date | Country | Kind |
|---|---|---|---|
| 1319694.4 | Nov 2013 | GB | national |

| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/GB2014/053320 | 11/7/2014 | WO | 00 |

| Publishing Document | Publishing Date | Country | Kind |
|---|---|---|---|
| WO2015/067958 | 5/14/2015 | WO | A |
| Number | Date | Country |
|---|---|---|
| 20160019905 A1 | Jan 2016 | US |