This application is a national phase filing under 35 U.S.C. § 371 of International Patent Application No. PCT/IN2015/000048, filed Jan. 27, 2015, which claims the benefit of Indian Patent Application No. 739/MUM/2014, filed Mar. 4, 2014, each of which is incorporated herein by reference in its entirety.
The present invention generally relates to signal processing and more particularly to a method and system for improving speech intelligibility under adverse listening conditions.
It has been observed that a talker in a difficult communication environment usually alters the speaking style to make the speech more intelligible. The resulting speech is known as “clear speech”. Studies have shown that, in comparison to conversational style speech, clear speech is more intelligible for listeners in noisy backgrounds and for listeners with hearing impairment, children with learning disabilities, and non-native listeners. Increased consonant intensity and duration have been identified as the main contributors to the intelligibility advantage of clear speech. Studies using modification of conversational speech have shown that enhancement of consonant intensity resulted in improved speech intelligibility, while duration modification resulted in only marginal improvements, possibly due to errors in locating the boundaries of the segments to be modified and due to processing related artifacts. The marginal improvement may also arise because formants in conversational speech are relatively less targeted, a shortcoming that duration modification cannot correct.
Increasing the intensity of consonant segments relative to the nearby vowel segments is known as consonant-vowel ratio (CVR) modification. It is reported to be effective in improving perception of consonants, across speakers and vowel context dependencies, for listeners in noisy backgrounds and for hearing-impaired listeners. The techniques for CVR modification can be broadly classified into manual and automated depending on the methods used for locating the segments for modification. The manual techniques are useful in investigating the effectiveness of CVR modification in improving speech perception. Results of investigations with such techniques have shown that a significant improvement in speech intelligibility can be achieved by accurate selection and careful modification of perceptually salient segments in conversational speech. Automated techniques for CVR modification, implemented for real-time processing, can be useful for enhancing speech intelligibility in communication devices and hearing aids. To be useful in such applications, a technique should meet the following requirements: (i) the segments for modification should be detected with a high temporal accuracy and a low rate of insertion errors and without being significantly affected by speaker variability; (ii) modification of speech characteristics should be carried out without introducing perceptible distortions; (iii) the processing should have low computational complexity and memory requirement to enable real-time processing using the processors available in communication devices and hearing aids; and (iv) the signal delay introduced by the processing (processing delay consisting of the algorithmic and computational delays) should not be disruptive for audio-visual speech perception. These requirements are only partly met by the existing systems.
Kates (J. M. Kates, “Speech intelligibility enhancement,” U.S. Pat. No. 4,454,609, 1984) has described a method for enhancement of intelligibility of consonant sounds in communication systems by boosting high frequency components. The system comprises a bank of band-pass filters and envelope detectors, a controller to set the gain for each filter channel, by comparing its short-time energy with those of the selected reference channels, and application of these gains for dynamically modifying the overall spectral shape. Reference channels are selected for boosting the short-time energy of the high frequency channels with respect to the low frequency channels. Thus the method enhances the sounds characterized by high frequency release bursts and transitions and not all transient segments. Further, use of fixed frequency bands in the processing limits its adaptability to speaker variability.
Terry (A. M. Terry, “Method and apparatus for enhancement of telephonic speech signals,” U.S. Pat. No. 5,737,719, 1998) has described a system for boosting the second formant with respect to the first formant and modification of the consonant-vowel ratio. Processing uses a bank of bark-scale based band-pass filters. Short-time band energies are used to get an approximation of the auditory spectrum. Peak-picking is applied to locate first two formants and the second formant is enhanced with respect to the first one. Segments having energy levels below those associated with vowels but above those associated with silence are identified as consonantal and these are amplified. Auditory spectrum is converted to Fourier spectrum and inverse Fourier transform is used to produce the output. Although the method is suitable for real-time processing, errors in formant identification, errors in selecting consonantal segments, and use of analysis-synthesis, particularly conversion from auditory spectrum to Fourier spectrum and discarding of the phase information, are likely to result in processing related artifacts. Further, use of fixed bands in the method limits its adaptability to speech and speaker variability.
Michaelis (P. R. Michaelis, “Method and apparatus for improving the intelligibility of digitally compressed speech,” U.S. Pat. No. 6,889,186B1, 2005) has described a method which involves segmenting input speech into frames, carrying out spectral analysis to identify the type of sound in each frame, and applying a gain based on the type of sound in the frame and in the surrounding frames, to improve speech intelligibility. Frames identified as unvoiced fricatives and plosives are amplified and the preceding voiced frames are attenuated. This method does not address enhancement of voiced stops and fricatives which may be hard to perceive under adverse listening conditions. Fixed-frame based segmentation may cause short duration release bursts to get merged with the voiced segments, resulting in errors in classification of frames, thereby limiting the effectiveness of the modification in improving speech intelligibility. Further, need for classification of the frames increases computational complexity and dependence of the gain of a frame on the type of neighbouring frames causes excessive signal delay.
Vandali et al. (A. E. Vandali, G. M. Clark, “Emphasis of short-duration transient speech features,” U.S. Pat. No. 8,296,154B2, 2012) have described a transient emphasis system for use in auditory prostheses to assist in perception of low-intensity short-duration speech features. The method uses a bank of band-pass filters and envelope detectors. For each filter channel, a running history buffer of the envelope spanning 60 ms with 2.5 ms intervals is used to estimate its second derivative which is used to determine a channel gain function. As the method uses fixed frequency bands, it is not adaptive to speech and speaker variability and it also suffers from a relatively large signal delay.
Skowronski et al. (M. D. Skowronski, J. G. Harris, “Applied principles of clear and Lombard speech for automated intelligibility enhancement in noisy environments,” Journal of Speech Communication, vol. 48, pp. 549-558, 2006) reported a method for speech intelligibility enhancement based on redistribution of energy in voiced and unvoiced segments. In this method, a measure of spectral flatness derived from the short-time speech spectrum along with a Schmitt trigger based thresholding is used for classifying the segments as voiced or unvoiced. The voiced segments (those corresponding to vowels, semivowels, nasals, voiced plosives, and voiced fricatives) are attenuated and unvoiced segments are amplified, maintaining the overall energy unaltered. Possible errors in classification and sensitivity of the classification method to additive noise are the limiting factors in its usefulness in enhancing the unvoiced segments. Further, attenuation of the low-energy voiced plosives and fricatives may adversely affect their perception. Colotte et al. (V. Colotte, Y. Laprie, “Automatic enhancement of speech intelligibility,” Proceedings of ICASSP 2000, Istanbul, pp. 1057-1060) have reported a method using spectral variation function based on mel-cepstral analysis to locate stop and fricative segments and their amplification by 4 dB. In a method reported by Yoo et al. (S. D. Yoo, J. R. Boston, A. Jaroudi, C. C. Li, “Speech signal modification to increase intelligibility in noisy environment,” Journal of Acoustical Society of America, vol. 122, pp. 1138-1149, 2007), the transient regions of speech are extracted and emphasized using time-varying band-pass filters based on formant tracking. Tantibundhit et al. (C. Tantibundhit, F. Pernkopf, G. Kubin, “Speech enhancement based on joint time-frequency segmentation,” Proceedings of ICASSP 2009, Taipei, pp. 4673-4676) have described a method for speech modification based on wavelet packet decomposition. 
These methods are computation intensive and introduce significant signal delays.
In view of the foregoing, there is a need for a new method and system for consonant-vowel ratio modification without introducing perceptible distortions for improving speech intelligibility.
The present invention proposes a method and system for consonant-vowel ratio modification for improving speech perception under adverse listening conditions, such as those experienced by listeners in noisy backgrounds, hearing-impaired listeners, children with learning disabilities, and non-native listeners. It uses signal processing for enhancing the consonant-vowel ratio in speech signal by applying a gain function on the signal in time-domain and it introduces minimal perceptible distortions. The technique, presented in this disclosure, comprises the steps of (i) detection of perceptually salient segments for modification in digital speech signal, (ii) calculation of time-varying gain in accordance with the location of the detected segments for modification, and (iii) application of the calculated gain to the signal for improving its perception under adverse listening conditions. The segments for modification, consisting of the stop release and frication burst, are detected with a high temporal accuracy and low error rate, using the rate of change of spectral centroid derived from the short-time magnitude spectrum of speech added with a tone. The processing steps have low computational complexity and memory requirement. 
The method for detecting perceptually salient segments and calculating the time-varying gain comprises the steps of: windowing the samples of the digital speech signal to form overlapping frames and calculating the energy of the frames; smoothening the frame energy by a moving-average filter to get smoothened short-time energy, and applying a peak detector with exponential decay on the frame energy to track peak energy; generating a low-frequency tone, multiplying the low-frequency tone with the peak energy, and adding the resulting scaled tone to the digital speech signal to obtain a tone-added signal; windowing the tone-added signal and applying discrete Fourier transform (DFT) to obtain the short-time magnitude spectrum of the tone-added signal; applying a moving-average filter on the short-time magnitude spectrum to get smoothened short-time magnitude spectrum; calculating the spectral centroid of the smoothened short-time magnitude spectrum; smoothening the spectral centroid by median filtering to get smoothened spectral centroid; calculating the first difference of the smoothened spectral centroid to get the rate of change of smoothened spectral centroid; and selecting said time-varying gain using said smoothened short-time energy, said peak energy, and said rate of change of spectral centroid.
The signal delay introduced by the processing is acceptable for audio-visual perception and hence the method is suitable for real-time processing of speech signals in communication devices and hearing aids. In an aspect of the present invention, a system provides consonant-vowel ratio (CVR) modification using a 16-bit fixed-point processor with on-chip FFT hardware and interfaced to an audio codec for inputting the speech signal as analog audio input from a microphone and outputting the processed speech signal as analog audio output through a speaker. The preferred embodiment can be integrated with other FFT based speech enhancement techniques like noise suppression and dynamic range compression for use in communication devices, hearing aids, and other audio devices.
The present invention proposes a method and a system for consonant-vowel ratio modification for improving speech perception under adverse listening conditions and for use in communication devices and hearing aids. The processing technique assumes clean speech at a conversational level to be available as the input signal. In case of noisy input, the processing may be used along with a speech enhancement technique for noise suppression. In case of input with wide variation in the signal level, a dynamic range compression technique may be used. The processing is applied to make the speech signal robust against further degradation under adverse listening conditions and it does not adversely affect the perception of non-speech audio signals. The processing method along with the system is explained below with reference to the accompanying drawings in accordance with an embodiment of the present invention.
For CVR modification, the spectral transitions of interest need to be detected with a good temporal accuracy and without a significant effect of speaker variability. The processing associated with the detection of segments and their modification should have low computational complexity and memory requirement. Further, the algorithmic and computational delays associated with the processing should be low in order to be acceptable for use in speech communication devices. In a study on the use of the first four spectral moments for detection of stop release bursts, the spectral centroid was found to be the most significant contributor. It is the first moment of the distribution of spectral power and is related to the spectral slope. It is close to the center frequency for a flat spectrum and shifts towards the frequencies of highest power in a tilted spectrum. Its value is generally less than 0.5 kHz for vowels, semivowels, and nasals, greater than 0.5 kHz for voiced and unvoiced stops, and greater than 1 kHz for voiced and unvoiced fricatives. In the present invention, the peaks in the rate of change of spectral centroid are used for detecting the segments with sharp spectral transitions which are associated with major changes in the vocal tract configuration and occur at the release of closures in stops and affricates, and also in fricatives and nasals. The segments for modification are detected without labeling them.
The short-time spectrum is calculated by applying discrete Fourier transform (DFT) on windowed frames of the input signal. The spectral centroid Fc(n) of the nth frame of the speech signal is calculated by using the following equation:

$$F_c(n) = \frac{\sum_{k=1}^{N/2} (k f_s / N)\, |X(n,k)|}{\sum_{k=1}^{N/2} |X(n,k)|} \tag{1}$$
where X(n,k) is the short-time magnitude spectrum, k is the frequency index, N is the DFT size, and fs is the sampling frequency. The centroid values obtained from spectra of short frame lengths (5-10 ms) are more sensitive to the changes in formant structure than to the harmonic structure, and hence are better suited for locating the spectral transitions associated with major changes in the vocal tract configuration. The rate of change of centroid is computed using a first difference with time step K using the following equation:
$$dF_c(n) = F_c(n) - F_c(n-K) \tag{2}$$
In the preferred embodiment of the invention, the input speech signal is sampled at fs of 10 kHz. The centroid computation is carried out using 6 ms frames with Hanning window and frame shift of 1 ms. A relatively large DFT size N of 512 is used for calculating the spectrum as it helps in a fine tracking of the change in the centroid obtained from the frame-averaged spectra.
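The centroid and first-difference computations described above can be illustrated with a minimal Python sketch. It is not the disclosed fixed-point implementation: it assumes NumPy, assumes the centroid is the magnitude-weighted mean frequency over DFT indices 1 to N/2, and uses the hypothetical function names `spectral_centroid_track` and `first_difference`. The default parameters are the preferred-embodiment values stated above (10 kHz sampling, 6 ms Hanning-windowed frames, 1 ms shift, N = 512).

```python
import numpy as np

def spectral_centroid_track(x, fs=10_000, frame_ms=6, shift_ms=1, n_fft=512):
    """Return the spectral centroid (Hz) of each Hanning-windowed frame of x."""
    frame_len = int(fs * frame_ms / 1000)        # 60 samples at 10 kHz
    shift = int(fs * shift_ms / 1000)            # 10 samples at 10 kHz
    window = np.hanning(frame_len)
    freqs = np.arange(1, n_fft // 2 + 1) * fs / n_fft   # k*fs/N for k = 1..N/2
    n_frames = 1 + (len(x) - frame_len) // shift
    fc = np.zeros(n_frames)
    for n in range(n_frames):
        frame = x[n * shift : n * shift + frame_len] * window
        mag = np.abs(np.fft.rfft(frame, n_fft))[1:]     # |X(n,k)|, k = 1..N/2
        total = mag.sum()
        # Assumption: an all-zero frame is given a centroid of 0 Hz.
        fc[n] = (freqs * mag).sum() / total if total > 0 else 0.0
    return fc

def first_difference(fc, K=20):
    """Rate of change of centroid: dFc(n) = Fc(n) - Fc(n-K), zero for n < K."""
    d = np.zeros(len(fc))
    d[K:] = fc[K:] - fc[:-K]
    return d
```

For a steady 1 kHz sinusoid the centroid track stays close to 1 kHz and its first difference stays near zero, whereas a sharp spectral transition produces a peak in the first difference.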
$$E_p(n) = \begin{cases} E(n), & E(n) \ge E_p(n-1) \\ \alpha E_p(n-1), & \text{otherwise} \end{cases} \tag{3}$$
Use of $\alpha = 0.5^{1/200}$, with frame shift of 1 ms, corresponds to a half-value release time of 200 ms, and the resulting Ep(n) tracks the vowel energy and retains it during stop closures and other low energy clusters. The frame energy E(n) is smoothened by the L-point moving average filter 224 to get the smoothened short-time energy Es(n).
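The peak detector of Equation-3 and the energy smoothing can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the disclosed implementation: the function names are hypothetical, the peak energy is assumed to start from zero, and the moving average is assumed to be causal, averaging over fewer frames at the start of the signal.

```python
import numpy as np

def track_peak_energy(E, half_time_frames=200):
    """Peak detector with exponential decay, following Equation-3."""
    alpha = 0.5 ** (1.0 / half_time_frames)   # decays to half value in 200 frames
    Ep = np.zeros(len(E))
    prev = 0.0                                 # assumption: initial peak energy is zero
    for n, e in enumerate(E):
        prev = e if e >= prev else alpha * prev
        Ep[n] = prev
    return Ep

def moving_average(E, L=20):
    """Causal L-point moving average giving the smoothened energy Es(n)."""
    c = np.concatenate([[0.0], np.cumsum(np.asarray(E, dtype=float))])
    Es = np.empty(len(E))
    for n in range(len(E)):
        lo = max(0, n - L + 1)                 # fewer than L frames near the start
        Es[n] = (c[n + 1] - c[lo]) / (n + 1 - lo)
    return Es
```

With a frame shift of 1 ms, a single unit-energy frame followed by silence leaves the tracked peak at 0.5 after 200 frames, which is the 200 ms half-value release time.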
A 100 Hz tone is generated by the tone generator 233 and its output is scaled by the multiplier 232 to get the tone at a level of −20 dB with reference to the peak energy Ep(n). This tone is added using the adder 231 to the input signal 111 to obtain a tone added signal 112. Hanning window 241 is applied on this signal and N-point DFT is used by the magnitude spectrum calculator 242 to get the magnitude spectrum which is applied as input to the M-frame moving average filter 243 to get smoothened magnitude spectrum. It is applied to spectral centroid calculator 244, which calculates the spectral centroid using Equation-1. The output of the spectral centroid calculator 244 is smoothened by the L-point median filter 245 for suppressing ripples without significantly smearing the changes due to major spectral transitions. For detecting changes in the spectral centroid, the K-point first difference calculator 246 calculates dFc(n) using Equation-2. In the preferred embodiment of the invention, values of K, L, and M are set to correspond to time-step of 20 ms, i.e., K=L=M=20 for frame shift of 1 ms, as it was found to be optimal for detecting spectral transitions.
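The tone addition and the two smoothing stages of the detection path can be sketched in Python as below. The sketch rests on assumptions that the text does not settle: Ep(n) is treated as a mean-square (per-sample) energy held constant over each frame shift, so that a tone of amplitude sqrt(2 × 0.01 × Ep) lies 20 dB below it, and the moving-average and median filters are taken to be causal. All function names are hypothetical.

```python
import numpy as np

def add_scaled_tone(x, Ep, fs=10_000, shift=10, tone_hz=100.0, level_db=-20.0):
    """Add a tone whose power is level_db relative to the tracked peak energy.

    Assumption: Ep[n] is a mean-square energy for frame position n and is
    held over the `shift` samples belonging to that frame position.
    """
    x = np.asarray(x, dtype=float)
    ep = np.repeat(np.asarray(Ep, dtype=float), shift)[: len(x)]
    ep = np.pad(ep, (0, len(x) - len(ep)), mode="edge")   # extend last value
    amp = np.sqrt(2.0 * ep * 10.0 ** (level_db / 10.0))  # sine power = amp^2 / 2
    t = np.arange(len(x)) / fs
    return x + amp * np.sin(2.0 * np.pi * tone_hz * t)

def smooth_spectra(mag, M=20):
    """Causal M-frame moving average of magnitude spectra, shape (frames, bins)."""
    mag = np.asarray(mag, dtype=float)
    out = np.empty_like(mag)
    for n in range(len(mag)):
        out[n] = mag[max(0, n - M + 1) : n + 1].mean(axis=0)
    return out

def running_median(v, L=20):
    """Causal L-point median filter applied to the centroid track."""
    v = np.asarray(v, dtype=float)
    out = np.empty(len(v))
    for n in range(len(v)):
        out[n] = np.median(v[max(0, n - L + 1) : n + 1])
    return out
```

The median filter removes an isolated outlier in the centroid track while preserving a sustained step, which matches the stated purpose of suppressing ripples without significantly smearing major spectral transitions.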
The gain to be applied at frame position n is calculated by the gain selector 250 using three inputs: first difference of spectral centroid dFc(n), smoothened short-time energy Es(n), and peak energy Ep(n). The gain selection for CVR modification uses a hysteresis-based thresholding of dFc(n) with upper and lower thresholds of θh and θl. This is carried out with the help of a flag updated at each frame position as
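$$\mathrm{CVR}(n) = \begin{cases} 1, & dF_c(n) > \theta_h \\ 0, & dF_c(n) < \theta_l \\ \mathrm{CVR}(n-1), & \theta_l \le dF_c(n) \le \theta_h \end{cases} \tag{4}$$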
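The flag update of Equation-4 can be illustrated with a short Python sketch. The function name `cvr_flags` is hypothetical, the flag is assumed to be 0 before the first frame, and the default thresholds are the 350 Hz and 300 Hz values of the preferred embodiment.

```python
def cvr_flags(dfc, theta_h=350.0, theta_l=300.0):
    """Hysteresis thresholding of dFc(n), following Equation-4."""
    flags = []
    flag = 0                      # assumption: no modification before the first frame
    for d in dfc:
        if d > theta_h:
            flag = 1              # rise above the upper threshold sets the flag
        elif d < theta_l:
            flag = 0              # fall below the lower threshold clears it
        # otherwise the previous flag value is retained
        flags.append(flag)
    return flags
```

A value lying between the two thresholds neither sets nor clears the flag, so a value of 320 Hz leaves the flag at 0 if it was 0 and at 1 if it was 1.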
The threshold values of 350 Hz and 300 Hz are selected as θh and θl, respectively. Hysteresis based thresholding with these values prevents momentary fluctuations in dFc(n) from triggering CVR modification, without missing actual transitions. The maximum gain for enhancing the segment is set as Am subject to the condition that the energy of the frame after its amplification does not exceed the peak energy Ep(n). The maximum gain for a frame is calculated by the following equation:
$$G_m(n) = \min\left[A_m,\ \left(E_p(n)/E_s(n)\right)^{1/2}\right] \tag{5}$$
To avoid perceptible distortions caused by abrupt changes, the gain is changed from the current value to the target value in p logarithmic steps of γ given as the following:

$$\gamma = [G_m(n)]^{1/p} \tag{6}$$
The gain to be applied is calculated as the following:
$$G(n) = \begin{cases} \min[G(n-1)\,\gamma,\ G_m(n)], & \mathrm{CVR}(n) = 1 \\ \max[G(n-1)/\gamma,\ 1], & \text{otherwise} \end{cases} \tag{7}$$
To provide significant enhancement of the transient segments without introducing perceptible distortions, maximum gain of 9 dB (i.e. Am=2.82) is applied and p is selected as 3.
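The gain calculation of Equations 5 to 7 can be sketched in Python as follows. The sketch is illustrative only and adds two safeguards that are not part of the text: the maximum gain is clamped to at least 1 so that the step factor never falls below 1, and a frame with zero smoothened energy is given the maximum gain Am. The function name `gain_track` and the initial gain of 1 are assumptions.

```python
import math

def gain_track(cvr, Es, Ep, Am=10 ** (9 / 20), p=3):
    """Time-varying gain G(n) from the CVR flag, Es(n), and Ep(n).

    Am defaults to 9 dB (about 2.82) and p to 3, as in the preferred embodiment.
    """
    gains = []
    G_prev = 1.0                                     # assumption: unity initial gain
    for flag, es, ep in zip(cvr, Es, Ep):
        ratio = math.sqrt(ep / es) if es > 0 else Am # safeguard for Es(n) = 0
        Gm = max(1.0, min(Am, ratio))                # Equation-5, clamped to >= 1
        gamma = Gm ** (1.0 / p)                      # Equation-6
        if flag:
            G = min(G_prev * gamma, Gm)              # ramp up in logarithmic steps
        else:
            G = max(G_prev / gamma, 1.0)             # ramp back down to unity
        gains.append(G)
        G_prev = G
    return gains
```

With p = 3 and a peak energy well above the smoothened energy, the gain rises from 1 to Am over three flagged frames and returns to 1 over three unflagged frames, so the full 9 dB change is spread over 3 ms at a 1 ms frame shift.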
In the gain application path 210, the signal delay block 211 introduces a delay to approximately compensate for the delay in the detection of the spectral transitions. In the preferred embodiment with K=L=M=20, this delay is kept at 10 ms. The delayed signal is multiplied by the gain G(n) to get the CVR modified signal 141 as the output.
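The gain application path can be sketched in Python as below. This is an illustration under assumptions rather than the disclosed implementation: each frame gain G(n) is assumed to be held over the samples of one frame shift, frame positions beyond the end of the gain sequence are assumed to have unity gain, and the function name `apply_gain` is hypothetical. The 10 ms delay is the preferred-embodiment value.

```python
import numpy as np

def apply_gain(x, G, fs=10_000, shift=10, delay_ms=10):
    """Delay the input and multiply it by the per-frame gain G(n)."""
    x = np.asarray(x, dtype=float)
    d = int(fs * delay_ms / 1000)                    # 100 samples at 10 kHz
    delayed = np.concatenate([np.zeros(d), x])[: len(x)]
    g = np.repeat(np.asarray(G, dtype=float), shift)[: len(x)]
    if len(g) < len(x):
        # Assumption: unity gain where no frame gain is available.
        g = np.pad(g, (0, len(x) - len(g)), constant_values=1.0)
    return delayed * g
```

Delaying the signal before multiplication aligns each detected transition with the gain computed for it, since the detection path reports the transition only after its smoothing and differencing stages.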
The processing method has been validated by conducting listening tests for recognition of consonants in consonant-vowel, vowel-consonant, and consonant-vowel-consonant word lists and speech-spectrum shaped noise as a masker. The improvements in consonant recognition scores correspond to an SNR advantage of 2-6 dB.
The data transfer and buffering operations are interrupt driven and are devised for an efficient realization of the processing with analysis frame of 6 ms and frame shift of 1 ms. As shown in
The processing steps in CVR modification block 140 are the same as shown in
The processed outputs from the real-time processing system with the fixed-point processor described above with reference to
The invention has been described above with reference to its application in communication devices and hearing aids, wherein the analog input signal is processed to generate analog output signal using a processor interfaced to ADC and DAC. An example of the preferred embodiment is described using a 16-bit fixed-point DSP chip with on-chip FFT hardware and interfaced to a codec chip (with ADC and DAC) through serial data interface and DMA. The method can also be implemented using processors with other architectures and other types of interface to ADC and DAC, or using a processor with on-chip ADC and DAC. The processor chip used need not have on-chip FFT hardware if it has sufficiently high processing speed to implement the technique. The method described in this disclosure can also be used in communication devices with a processor operating on digitized speech signals available in the form of digital samples at regular intervals or in the form of data packets. In addition to its application in hearing aids and communication devices, the invention can also be used in applications like public address systems and other audio systems to improve speech intelligibility under various background noise and distortions.
The above description along with the accompanying drawings is intended to be illustrative and should not be interpreted as limiting the scope of the invention. Those skilled in the art to which the invention relates will appreciate that many variations of the described example implementations and other implementations exist within the scope of the claimed invention.
| Number | Date | Country | Kind |
|---|---|---|---|
| 739/MUM/2014 | Mar 2014 | IN | national |

| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/IN2015/000048 | 1/27/2015 | WO | 00 |

| Publishing Document | Publishing Date | Country | Kind |
|---|---|---|---|
| WO2015/132798 | 9/11/2015 | WO | A |

| Number | Name | Date | Kind |
|---|---|---|---|
| 4454609 | Kates | Jun 1984 | A |
| 5737719 | Terry | Apr 1998 | A |
| 6889186 | Michaelis | May 2005 | B1 |
| 8296154 | Vandali et al. | Oct 2012 | B2 |
| 20090168939 | Constantinidis | Jul 2009 | A1 |
| 20110051924 | LeBlanc | Mar 2011 | A1 |
| 20110191101 | Uhle | Aug 2011 | A1 |
| 20110286618 | Vandali | Nov 2011 | A1 |
| 20120281863 | Iwano | Nov 2012 | A1 |
| 20130143618 | Seshadri | Jun 2013 | A1 |
| 20130218568 | Tamura | Aug 2013 | A1 |
| 20130282379 | Stephenson | Oct 2013 | A1 |

| Entry |
|---|
| International Search Report dated Aug. 25, 2016 in corresponding International Patent Application No. PCT/IN2015/000048. |
| Skowronski et al., “Applied principles of clear and Lombard speech for automated intelligibility enhancement in noisy environments,” Journal of Speech Communication, vol. 48, pp. 549-558, 2006. |
| Colotte et al., “Automatic enhancement of speech intelligibility,” Proceedings of ICASSP 2000, Istanbul, pp. 1057-1060. |
| Yoo et al., “Speech signal modification to increase intelligibility in noisy environment,” Journal of Acoustical Society of America, vol. 122, pp. 1138-1149, 2007. |
| Tantibundhit et al., “Speech enhancement based on joint time-frequency segmentation,” Proceedings of ICASSP 2009, Taipei, pp. 4673-4676. |

| Number | Date | Country |
|---|---|---|
| 20160365099 A1 | Dec 2016 | US |