Voice Activity Detection (VAD), also known as speech activity detection or speech detection, may be relied upon by many speech/audio applications, such as speech coding, speech recognition, or speech enhancement applications, in order to detect a presence or absence of speech in a speech/audio signal. VAD is usually language independent. VAD may facilitate speech processing and may also be used to deactivate some processes during a non-speech section of an audio session to avoid unnecessary coding/transmission of silence in order to save on computation and network bandwidth.
According to an example embodiment, a method for detecting speech in an audio signal may comprise identifying a pattern of at least one occurrence of time-separated first and second distinctive feature values of first and second feature values, respectively, in at least two different frequency bands of an electronic representation of an audio signal of speech that includes voiced and unvoiced phonemes and noise. The identifying may include associating the first distinctive feature values with the voiced phonemes and the second distinctive feature values with the unvoiced phonemes. The first and second distinctive feature values may represent information distinguishing the speech from the noise. The time-separated first and second distinctive feature values may be non-overlapping, temporally, in the at least two different frequency bands. The method may comprise producing a speech detection result for a given frame of the electronic representation of the audio signal based on the pattern identified. The speech detection result may indicate a likelihood of a presence of the speech in the given frame.
The first feature values may represent power over time of the electronic representation of the audio signal in a first frequency band of the at least two frequency bands. The first distinctive feature values may represent a first concentration of power in the first frequency band. The second feature values may represent power over time of the electronic representation of the audio signal in a second frequency band of the at least two frequency bands. The second distinctive feature values may represent a second concentration of power in the second frequency band, wherein the first frequency band may be lower in frequency relative to the second frequency band.
The first feature values may represent degrees of harmonicity over time of the electronic representation of the audio signal in a first frequency band of the at least two frequency bands. The first distinctive feature values may represent non-zero degrees of harmonicity in the first frequency band. The second feature values may represent power over time of the electronic representation of the audio signal in a second frequency band of the at least two frequency bands. The second distinctive feature values may represent a concentration of power in the second frequency band, wherein the first frequency band may be lower in frequency relative to the second frequency band.
The identifying may include employing feature values accumulated in at least one previous frame to identify the pattern of time-separated first and second distinctive feature values, the at least one previous frame transpiring previous to the given frame.
The identifying may include computing phase differences between first modulated signal components of the electronic representation of the audio signal in a first frequency band of the at least two different frequency bands and second modulated signal components of the electronic representation of the audio signal in a second frequency band of the at least two different frequency bands, wherein the first frequency band may be lower in frequency relative to the second frequency band.
The identifying may include employing the phase differences computed to detect a temporal alternation of the time-separated distinctive features in the at least two different frequency bands. The likelihood of the presence of the speech may be higher in response to the temporal alternation being detected relative to the temporal alternation not being detected, and the pattern may be the temporal alternation.
The identifying may include applying a modulation filter to the electronic representation of the audio signal and the modulation filter may be based on a syllable rate of human speech.
In an event the speech detection result satisfies a criterion for indicating that speech is present, the producing may include extending, temporally, the speech detection result for the given frame by associating the speech detection result with one or more frames immediately following the given frame.
The speech detection result may be a first speech detection result indicating the likelihood of the presence of the speech in the given frame. The producing may include combining the first speech detection result with a second speech detection result indicating the likelihood of the presence of the speech in the given frame to produce a combined speech detection result indicating the likelihood of the presence of the speech in the given frame with improved robustness against false-alarms during absence of speech relative to the first speech detection result and the second speech detection result. The combined speech detection result may prevent an indication that the speech is present in the given frame in an event the first speech detection result indicates that the speech is likely not present at the given frame or during frames previous to the given frame. The combining may employ the second speech detection result to detect an end of the speech in the electronic representation of the audio signal.
The method may include producing the second speech detection result by averaging magnitudes of first modulated signal components of the electronic representation of the audio signal in a first frequency band of the at least two different frequency bands and second modulated signal components of the electronic representation of the audio signal in a second frequency band of the at least two different frequency bands.
According to another example embodiment, an apparatus for detecting speech in an audio signal may comprise an audio interface configured to produce an electronic representation of an audio signal of speech including voiced and unvoiced phonemes and noise. The apparatus may further comprise a processor coupled to the audio interface, the processor configured to implement an identification module configured to identify a pattern of at least one occurrence of time-separated first and second distinctive feature values of first and second feature values, respectively, in at least two different frequency bands of the electronic representation of the audio signal of speech including the voiced and unvoiced phonemes and noise. To identify the pattern, the identification module may be configured to associate the first distinctive feature values with the voiced phonemes and the second distinctive feature values with the unvoiced phonemes. The first and second distinctive feature values may represent information distinguishing the speech from the noise. The time-separated first and second distinctive feature values may be non-overlapping, temporally, in the at least two different frequency bands. The apparatus may still further comprise a speech detection module configured to produce a speech detection result for a given frame of the electronic representation of the audio signal based on the pattern identified. The speech detection result may indicate a likelihood of a presence of the speech in the given frame.
The first feature values may represent power over time of the electronic representation of the audio signal in a first frequency band of the at least two frequency bands. The first distinctive feature values may represent a first concentration of power in the first frequency band. The second feature values may represent power over time of the electronic representation of the audio signal in a second frequency band of the at least two frequency bands. The second distinctive feature values may represent a second concentration of power in the second frequency band, wherein the first frequency band may be lower in frequency relative to the second frequency band.
The first feature values may represent degrees of harmonicity over time of the electronic representation of the audio signal in a first frequency band of the at least two frequency bands. The first distinctive feature values may represent non-zero degrees of harmonicity in the first frequency band. The second feature values may represent power over time of the electronic representation of the audio signal in a second frequency band of the at least two frequency bands. The second distinctive feature values may represent a concentration of power in the second frequency band, wherein the first frequency band may be lower in frequency relative to the second frequency band.
The identification module may be further configured to compute phase differences between first modulated signal components of the electronic representation of the audio signal in a first frequency band of the at least two different frequency bands and second modulated signal components of the electronic representation of the audio signal in a second frequency band of the at least two different frequency bands. The first frequency band may be lower in frequency relative to the second frequency band.
The identification module may be further configured to employ the phase differences computed to detect a temporal alternation of the time-separated first and second distinctive features in the at least two different frequency bands. The likelihood of the presence of the speech may be higher in response to the temporal alternation being detected relative to the temporal alternation not being detected. The pattern may be the temporal alternation.
The identification module may be further configured to apply a modulation filter to the electronic representation of the audio signal. The modulation filter may be based on a syllable rate of human speech.
In an event the speech detection result satisfies a criterion for indicating that speech is present, the speech detection module may be further configured to extend, temporally, the speech detection result for the given frame by associating the speech detection result with one or more frames immediately following the given frame.
The speech detection result may be a first speech detection result indicating the likelihood of the presence of the speech in the given frame. The speech detection module may be further configured to combine the first speech detection result with a second speech detection result indicating the likelihood of the presence of the speech in the given frame to produce a combined speech detection result indicating the likelihood of the presence of the speech in the given frame with improved robustness against false-alarms during absence of speech relative to the first speech detection result and the second speech detection result. The combined speech detection result may prevent an indication that the speech is present in the given frame in an event the first speech detection result indicates that the speech is likely not present at the given frame or during frames previous to the given frame. The second speech detection result may be employed to detect an end of the speech in the electronic representation of the audio signal.
The speech detection module may be further configured to produce the second speech detection result by averaging magnitudes of first modulated signal components of the electronic representation of the audio signal in a first frequency band of the at least two different frequency bands and second modulated signal components of the electronic representation of the audio signal in a second frequency band of the at least two different frequency bands.
Yet another example embodiment may include a non-transitory computer-readable medium having stored thereon a sequence of instructions which, when loaded and executed by a processor, causes the processor to complete methods disclosed herein.
It should be understood that embodiments disclosed herein can be implemented in the form of a method, apparatus, system, or computer readable medium with program codes embodied thereon.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The foregoing will be apparent from the following more particular description of example embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments of the present invention.
Example embodiments produce VAD features for speech detection (referred to interchangeably herein as voice activity detection (VAD) or speech activity detection) that robustly distinguish between speech and interfering noise. In contrast to features described in previous publications, an example embodiment of a VAD feature explicitly takes into account a temporally alternating structure of voiced and unvoiced speech components, as well as a typical syllable rate of human speech. A low frequency resolution of the spectrum is sufficient to determine the VAD feature according to the example embodiment, enabling a computationally low-complex feature. According to an example embodiment, the VAD feature may obviate a need for spectral analysis of an audio signal for many frequency bands, thus, reducing a number of frequency bins otherwise employed for VAD.
An example embodiment disclosed herein may exploit a temporal alternation of voiced and unvoiced phonemes for VAD. In contrast to speech recognizers that employ complex models of phoneme sequences to identify spoken utterances, the example embodiment disclosed herein may employ detection of the temporal alternation. It should be understood that speech recognizers are employed for a different application, that is, speech recognition versus speech detection. An example embodiment disclosed herein may produce a speech detection result that may be used by a speech recognition application to determine whether a sequence of audio samples contains speech and, thus, when to perform speech recognition processing; improved VAD may, in turn, improve the speech recognition application.
As disclosed above, speech processing methods may rely on VAD that separates speech from noise. For this task, several features have been introduced in literature that employ different characteristic properties of speech. An embodiment disclosed herein introduces a VAD feature that is robust against various types of noise. By considering an alternating excitation structure of low and high frequencies, speech may be detected with a high confidence, as disclosed further below. The example embodiment may be a computationally low-complex feature that can cope even with a limited spectral resolution that may be typical for in-car-communication (ICC) systems, as disclosed further below. As disclosed further below, simulations confirm robustness of an example embodiment of a VAD feature disclosed herein and show improved performance compared to established VAD features.
VAD is an essential prerequisite for many speech processing methods (J. Ramirez, J. M. Górriz, and J. C. Segura, "Voice Activity Detection. Fundamentals and Speech Recognition System Robustness," in Robust Speech Recognition and Understanding (M. Grimm and K. Kroschel, eds.), pp. 1-22, Intech, 2007). In automotive scenarios, different applications may benefit from VAD: speech-controlled applications, such as navigation systems, achieve more accurate recognition results when only speech intervals are taken into account. Hands-free telephony allows the passengers to make phone calls with high speech quality from the moving car. In-car-communication (ICC) systems (G. Schmidt and T. Haulick, "Signal processing for in-car communication systems," Signal processing, vol. 86, no. 6, pp. 1307-1326, 2006; C. Luke, H. Özer, G. Schmidt, A. Theiß, and J. Withopf, "Signal processing for in-car communication systems," in 5th Biennial Workshop on DSP for In-Vehicle Systems, (Kiel, Germany), 2011) even facilitate conversations inside the passenger cabin. Many of the incorporated speech enhancement methods, such as noise estimation, focus on time intervals where either speech is present or where the signal purely consists of noise.
In ICC systems (G. Schmidt and T. Haulick, "Signal processing for in-car communication systems," Signal processing, vol. 86, no. 6, pp. 1307-1326, 2006; C. Luke, H. Özer, G. Schmidt, A. Theiß, and J. Withopf, "Signal processing for in-car communication systems," in 5th Biennial Workshop on DSP for In-Vehicle Systems, (Kiel, Germany), 2011), one passenger's speech is recorded by speaker-dedicated microphones and an enhanced signal is immediately played back by a loudspeaker close to another passenger. These systems allow for convenient communication between passengers even under adverse noise conditions at high speeds. Speech enhancement techniques may be employed to process a microphone signal and generate the enhanced signal played on the loudspeaker that may be adjusted to an acoustic environment of the car.
Special constraints that affect the design of the VAD may be considered in an ICC system. Since both the original speech and the processed signal are perceivable in parallel by the passengers, latency is a more critical issue compared to other applications, such as speech recognition or hands-free telephony. System latencies of more than 10 ms result in reverberations that are perceived as annoying by the passengers (G. Schmidt and T. Haulick, "Signal processing for in-car communication systems," Signal processing, vol. 86, no. 6, pp. 1307-1326, 2006). Therefore, the signal is typically processed using very small block sizes and small window lengths. Small window lengths and high sampling rates result in a low frequency resolution of the spectrum. Therefore, fine-structured harmonic components of speech can barely be observed from the spectrum. Only the formant structure of speech is reflected by the envelope of the spectrum.
Several features for VAD have been introduced that represent characteristic properties of speech (S. Graf, T. Herbig, M. Buck, and G. Schmidt, “Features for voice activity detection: a comparative analysis,” EURASIP Journal on Advances in Signal Processing, vol. 2015, November 2015). Some of these features rely on a high frequency resolution and are, therefore, not directly applicable in ICC. Instead, given the constraints for ICC applications, features that employ information from the spectral envelope or from temporal variations of the signal come into consideration.
An example embodiment disclosed herein may be employed in a variety of in-car-communication (ICC) systems; however, it should be understood that embodiments disclosed herein are not limited to ICC applications. An ICC system may support communications paths within a car by receiving speech signals of a speaking passenger via a microphone or other suitable sound receiving device and playing back such speech signals for one or more listening passengers via a loudspeaker or other suitable electroacoustic transducer. In ICC applications, frequency resolution of the spectrum is much lower compared to other applications, as disclosed above. An example embodiment disclosed herein may cope with such a constraint that may be applicable in an ICC system, such as the ICC system of FIG. 1, disclosed below.
The microphone signal may be enhanced by the ICC system based on differentiating acoustic noise produced in the acoustic environment 103, such as windshield wiper noise 114 produced by the windshield wiper 113a or 113b or other acoustic noise produced in the acoustic environment 103 of the car 102, from the speech signals 104 to produce the enhanced speech signals 110 that may have the acoustic noise suppressed. It should be understood that the communications path may be a bi-directional path that also enables communication from the second user 106b to the first user 106a. As such, the speech signals 104 may be generated by the second user 106b via another microphone (not shown) and the enhanced speech signals 110 may be played back on another loudspeaker (not shown) for the first user 106a.
The speech signals 104 may include voiced signals 105 and unvoiced signals 107. The speaker's speech may be composed of voiced phonemes, produced by the vocal cords (not shown) and vocal tract including the mouth and lips 109 of the first user 106a. As such, the voiced signals 105 may be produced when the speaker's vocal cords vibrate during pronunciation of a phoneme. The unvoiced signals 107, by contrast, do not entail use of the speaker's vocal cords. For example, a difference between the phonemes /s/ and /z/ or /f/ and /v/ is the vibration of the speaker's vocal cords. The voiced signals 105, such as the vowels /a/, /e/, /i/, /u/, /o/, may tend to be louder than the unvoiced signals 107. The unvoiced signals 107, on the other hand, may tend to be more abrupt, like the stop consonants /p/, /t/, /k/.
It should be understood that the car 102 may be any suitable type of transport vehicle and that the loudspeaker 108 may be any suitable type of device used to deliver the enhanced speech signals 110 in an audible form for the second user 106b. Further, it should be understood that the enhanced speech signals 110 may be produced and delivered in a textual form to the second user 106b via any suitable type of electronic device and that such textual form may be produced in combination with or in lieu of the audible form.
An example embodiment disclosed herein may be employed in an ICC system, such as the ICC system disclosed above with reference to FIG. 1.
An example embodiment disclosed herein may cope with a very low frequency resolution in a spectrum that may result from special conditions, such as the small window lengths, high sampling rates, and low Fast Fourier Transform (FFT) lengths that may be employed in an ICC system, limiting spectral resolution, as disclosed above. An example embodiment disclosed herein may consider a temporally alternating excitation structure of low and high frequencies corresponding to voiced and unvoiced phonemes, respectively, such as the voiced and unvoiced phonemes disclosed above with reference to FIG. 1.
By considering multiple speech characteristics, an example embodiment disclosed herein may achieve an increase in robustness against interfering noises. Example embodiments disclosed herein may be employed in a variety of speech/audio applications, such as speech coding, speech recognition, speech enhancement applications, or any other suitable speech/audio application in which VAD may be applied, such as the ICC application disclosed above with reference to FIG. 1.
An example embodiment disclosed herein may employ a temporally alternating concentration of power in low and high frequencies or, alternatively, a temporally alternating concentration of non-zero degrees of harmonicity in the low frequencies and power in the high frequencies, to indicate an alternating occurrence of voiced and unvoiced phonemes. Modulation may be employed to quantify a temporal profile of a speech signal's variations and phase differences of modulation between different frequency bands may be employed to detect the alternating structure providing an improvement over modulation features that may consider magnitude alone.
Embodiments disclosed herein take into account modulation frequencies near the typical syllable rate of human speech of about 4 Hz (E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator," in ICASSP, vol. 2, (Munich, Germany), pp. 1331-1334, April 1997). In earlier publications, e.g., (J.-H. Bach, B. Kollmeier, and J. Anemüller, "Modulation based detection of speech in real background noise: Generalization to novel background classes," in ICASSP, (Dallas, Tex., USA), pp. 41-44, 2010), modulation was identified as a good indicator for the presence of speech since it reflects the characteristic temporal structure of speech. According to an example embodiment disclosed herein, phase differences of the modulation between high and low frequencies may be employed to detect the presence of speech more robustly. Further, by additionally employing a modulation feature, embodiments disclosed herein may further improve performance, as disclosed further below.
According to an example embodiment, speech detection may employ two characteristic properties of human speech to distinguish speech from pure noise. A typical syllable rate of speech of about 4 Hz may be utilized by considering modulations in this frequency range. Furthermore, an alternating structure of voiced and unvoiced speech may be exploited, such as disclosed further below.
According to an example embodiment, the speech detection result may indicate a probability that speech is present, for example, via a probability value in a range between 0 and 1, or any other suitable range of values.
An example embodiment of a VAD feature may identify a pattern of time-separated occurrence of voiced and unvoiced phonemes. Identifying the voiced and unvoiced phonemes may be based on detecting concentrations of power in the different frequency bands. Alternatively, identifying the voiced and unvoiced phonemes may be based on detecting a concentration of harmonic components in the first frequency band and a concentration of power in the second frequency band. Voiced speech may be represented by a high value of the VAD feature. By using a harmonicity-based feature, such as auto-correlation maximum, to detect the concentration of harmonic components, voiced speech may be detected based on the high value of the VAD feature resulting from a harmonic concentration value instead of a power concentration value in the lower frequency band, that is, the first frequency band, disclosed above.
As disclosed above, speech includes both voiced and unvoiced phonemes. The terms “voiced” and “unvoiced” refer to a type of excitation, such as a harmonic or noise-like excitation. For example, for voiced phonemes, periodic vibrations of the vocal cords result in the harmonic excitation. Some frequencies, the so-called “formants,” are emphasized due to resonances of the vocal tract. Primarily low frequencies are emphasized that may be captured by the first frequency band. Unvoiced phonemes are generally produced by a constriction of the vocal tract resulting in the noise-like excitation (turbulent airflow) which does not exhibit harmonic components. Primarily high frequencies are excited that are captured by the second frequency band.
Some values of the voiced signal feature values 405 and unvoiced signal feature values 407 are distinctive, that is, some values may be prominent feature values that are distinctively higher relative to other feature values, such as the first distinctive feature values 415 of the first feature values 419 and the second distinctive feature values 417 of the second feature values 421. According to an example embodiment, speech detection may be triggered in an event an occurrence of distinctive feature values of the first and second feature values is time-separated (i.e., non-overlapping, temporally), such as the occurrence of time-separated first and second distinctive feature values 423 shown in the plot 400. The first distinctive feature values 415 and the second distinctive feature values 417 may represent information distinguishing speech from noise, and the time-separated first distinctive feature values 415 and second distinctive feature values 417 are non-overlapping, temporally, in the first and second frequency bands, as indicated by the occurrence of time-separated first and second distinctive feature values 423. The first distinctive feature values 415 and the second distinctive feature values 417 may represent prominent feature values distinctively higher relative to other feature values of the voiced signal feature values 405 and the unvoiced signal feature values 407, respectively, separating the first distinctive feature values 415 and the second distinctive feature values 417 from noise.
Embodiments disclosed herein may produce a speech detection result, such as the speech detection result 662 of FIG. 6, disclosed further below.
A frame, corresponding to the variable ℓ employed in the equations following below, may correspond to a time interval of a sequence of audio samples in an audio signal, wherein the time interval may be shifted for each ℓ. For example, a frame corresponding to ℓ=1 may include an initial 128 samples from the audio signal, whereas another frame corresponding to ℓ=2 may include a next 128 samples from the audio signal, wherein the initial 128 samples and the next 128 samples may overlap, as the frames corresponding to ℓ=1 and ℓ=2 may be overlapping frames.
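For illustration only, and not as part of the disclosure above, such overlapping framing may be sketched in Python with numpy; the function name frame_signal and the 0-based frame indexing are illustrative conventions, and the window length of 128 samples with a frame-shift of 32 samples follows the pre-processing parameters disclosed further below.

import numpy as np

def frame_signal(x, window_len=128, frame_shift=32):
    # Frame l (0-based here; the text above counts frames from 1) covers
    # samples [l * frame_shift, l * frame_shift + window_len); consecutive
    # frames overlap by window_len - frame_shift samples.
    num_frames = 1 + (len(x) - window_len) // frame_shift
    return np.stack([x[l * frame_shift : l * frame_shift + window_len]
                     for l in range(num_frames)])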
Alternatively, the pattern of time-separated distinctive feature values may be a time-separated occurrence 523b that may be identified by detecting the power concentration features 530 in the second frequency band of high frequencies 534a while, instead of the power concentration features 530, harmonicity features 536 are detected in the first frequency band of low frequencies 532b. As shown in the block diagram 500, the harmonicity features 536, as applied to the higher frequencies 534b, do not contain any information about the presence or absence of speech, as indicated by "no reaction" in the block diagram 500.
As such, turning back to FIG. 4, the first feature values 419 may represent power over time of the electronic representation of the audio signal in a first frequency band of the at least two frequency bands. The first distinctive feature values 415 may represent a first concentration of power in the first frequency band. The second feature values 421 may represent power over time of the electronic representation of the audio signal in a second frequency band of the at least two frequency bands. The second distinctive feature values 417 may represent a second concentration of power in the second frequency band, wherein the first frequency band may be lower in frequency relative to the second frequency band.
Alternatively, the first feature values 419 may represent degrees of harmonicity over time of the electronic representation of the audio signal in a first frequency band of the at least two frequency bands. The first distinctive feature values 415 may represent non-zero degrees of harmonicity in the first frequency band. The second feature values 421 may represent power over time of the electronic representation of the audio signal in a second frequency band of the at least two frequency bands. The second distinctive feature values 417 may represent a concentration of power in the second frequency band, wherein the first frequency band may be lower in frequency relative to the second frequency band.
To identify the pattern, the identification module 656 may be configured to associate the first distinctive feature values, such as the first distinctive feature values 415 of
The apparatus 650 may still further comprise a speech detection module 660 configured to produce a speech detection result 662 for a given frame (not shown) of the electronic representation of the audio signal 658 based on the pattern identified. The speech detection result 662 may indicate a likelihood of a presence of the speech in the given frame.
Embodiments disclosed herein may exploit a nearly alternating occurrence of low and high frequency speech components. In the following, the underlying speech characteristics that are exploited are disclosed in more detail. Subsequently, an example embodiment of a VAD feature is introduced that uses both modulation and phase differences to indicate the presence of speech. In addition, in another example embodiment of the VAD feature, a contribution of magnitude and phase information to the VAD feature performance is evaluated. An example embodiment of the VAD feature performance is compared to established, yet more complex, VAD features, as disclosed further below.
Speech Characteristics
Several different characteristic properties of speech can be employed to distinguish speech from noise (S. Graf, T. Herbig, M. Buck, and G. Schmidt, "Features for voice activity detection: a comparative analysis," EURASIP Journal on Advances in Signal Processing, vol. 2015, November 2015). A first indicator for presence of speech is given by the signal's power. High power values may be caused by speech components; however, power-based features are vulnerable to various types of noise. An example embodiment of a VAD feature disclosed herein may also take into account a temporal or spectral structure of speech.
Improvements can be achieved by considering the non-stationarity of the signal. Typically, the background noise is more stationary compared to speech. This is employed, e.g., by the long-term signal variability (LTSV) feature (P. K. Ghosh, A. Tsiartas, and S. Narayanan, "Robust voice activity detection using long-term signal variability," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 600-613, 2011) that evaluates the temporal entropy

H(k,ℓ) = −Σ_{ℓ′=0}^{L−1} Φxx(k,ℓ−ℓ′)·log(Φxx(k,ℓ−ℓ′))

of the power spectrum Φxx(k,ℓ) over a window of L frames. Here, k indicates the frequency index, whereas the frame index is denoted by ℓ. However, non-stationarity only reflects that the power spectrum varies over time, as illustrated in the bottom spectrogram 774 of FIG. 7.
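For illustration, this temporal entropy may be sketched in Python; the analysis window of L = 30 frames, the per-window normalization of the power spectrum (commonly used for LTSV so that each bin behaves like a probability distribution), and the small constants guarding the logarithm are illustrative assumptions.

import numpy as np

def temporal_entropy(power_spec, L=30):
    # power_spec: array of shape (num_frames, num_bins) holding Phi_xx(k, l).
    num_frames, num_bins = power_spec.shape
    H = np.zeros((num_frames, num_bins))
    for l in range(L - 1, num_frames):
        window = power_spec[l - L + 1 : l + 1]                 # last L frames
        p = window / (window.sum(axis=0, keepdims=True) + 1e-12)
        H[l] = -(p * np.log(p + 1e-12)).sum(axis=0)            # H(k, l)
    return H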
Many interferences, such as a car's signal indicator or wiper noise, also result in fast variations of the power spectrum. More advanced features, therefore, consider the manner of the power spectrum's variations. Human speech consists of a sequence of different phonemes. Modulation-based features are capable of reflecting this sequential structure by investigating the temporal profile (E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator," in ICASSP, vol. 2, (Munich, Germany), pp. 1331-1334, April 1997; J.-H. Bach, B. Kollmeier, and J. Anemüller, "Modulation based detection of speech in real background noise: Generalization to novel background classes," in ICASSP, (Dallas, Tex., USA), pp. 41-44, 2010), as illustrated in the middle spectrogram 772 of FIG. 7.
An example embodiment of a VAD feature disclosed herein may assume an alternating structure of voiced and unvoiced phonemes. The corresponding alternating excitation of low and high frequencies is illustrated in the top spectrogram 770 of FIG. 7.
Modulation-Phase Difference Feature
An example embodiment disclosed herein may exploit the alternating structure of low and high frequencies by determining phase differences of modulated signal components between different frequency bands. In a pre-processing step, a magnitude spectrum of the audio signal may be compressed to two frequency bands and stationary components removed. Modulations of the non-stationary components may be determined and normalized with respect to a temporal variance of the spectrum. Phase differences between both frequency bands may be taken into account, improving robustness of the example embodiments of the VAD features as compared to conventional modulation features. In the following, example embodiments for processing an input signal are disclosed in more detail.
Pre-Processing
According to an example embodiment, a pre-emphasis filter may be applied to the input audio signal in order to reduce noise in low frequencies, such as automotive noise, and to emphasize frequencies that are relevant for speech. The resulting signal may be transferred to a frequency domain using a short-time Fourier transform (STFT). As disclosed above, a window length and frame-shift may be very small, such as in an ICC application. According to an example embodiment, a Hamming window of 128 samples and a small frame-shift of R = 32 samples may be employed.
A magnitude spectrum |X(k,ℓ)| of the ℓ-th frame may be accumulated along the frequency bins k

B(w,ℓ) = Σ_{k=k_min(w)}^{k_max(w)} |X(k,ℓ)|  (1)

to capture the excitation of low and high frequencies by different frequency bands. Embodiments disclosed herein may employ two frequency bands w ∈ {1,2} that capture low frequencies [200 Hz, 2 kHz] and high frequencies [4.5 kHz, 8 kHz] corresponding to voiced and unvoiced speech, respectively.
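A minimal sketch of the band accumulation of Equation (1), assuming a sampling rate of 16 kHz and an FFT length of 128 as illustrative defaults; the function name accumulate_bands is an illustrative convention.

import numpy as np

def accumulate_bands(mag_spec, fs=16000, fft_len=128):
    # mag_spec: (num_frames, num_bins) magnitude spectrum |X(k, l)|.
    # Sum the bins between the cut-off frequencies of each band (Equation (1)).
    freqs = np.arange(mag_spec.shape[1]) * fs / fft_len
    bands = [(200.0, 2000.0), (4500.0, 8000.0)]        # [Hz], w = 1 and w = 2
    return np.stack([mag_spec[:, (freqs >= lo) & (freqs <= hi)].sum(axis=1)
                     for lo, hi in bands])             # shape (2, num_frames)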
According to an example embodiment, stationary components, corresponding to modulation frequency zero, may be removed by applying a high-pass filter

B_hp(w,ℓ) = (1+β₁)·(B(w,ℓ) − B(w,ℓ−1))/2 + β₁·B_hp(w,ℓ−1)  (2)

to each frequency band. The slope of the filter may be controlled by a parameter β₁ = 10^(−4.8·R/f_s), corresponding to −48 dB/s, wherein f_s is the sampling rate.
Modulation Filter
According to an example embodiment, the VAD feature may investigate non-stationary components that are modulated with a frequency Ω_mod ≈ 4 Hz corresponding to the syllable rate of human speech. By applying a band-pass filter

B_mod(w,ℓ) = (1−β₂)·B_hp(w,ℓ) + β₂·B_mod(w,ℓ−1)·e^(2πj·Ω_mod·R/f_s)  (3)

embodiments disclosed herein achieve complex-valued signals in which the modulation frequency is emphasized, as illustrated by the transfer function of a combination of high-pass and modulation (hp+mod) filter 846 of FIG. 8.
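A minimal sketch of the complex modulation filter of Equation (3); the value of β₂ is not specified above and is left as a parameter, and Ω_mod = 4 Hz, R = 32, and f_s = 16 kHz are illustrative defaults.

import numpy as np

def modulation_filter(B_hp, beta2, omega_mod=4.0, R=32, fs=16000):
    # Complex band-pass that emphasizes components modulated at omega_mod.
    rot = np.exp(2j * np.pi * omega_mod * R / fs)   # phase advance per frame
    B_mod = np.zeros(B_hp.shape, dtype=complex)
    for l in range(B_hp.shape[1]):
        prev = B_mod[:, l - 1] if l > 0 else 0.0
        B_mod[:, l] = (1 - beta2) * B_hp[:, l] + beta2 * prev * rot
    return B_mod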
The complex values can be separated into phase and magnitude that both contain valuable information about the signal. The phase of B_mod(w,ℓ) captures information about where power is temporally concentrated. Values close to zero indicate that, at frame ℓ, the frequencies in the frequency band w are excited. In contrast, a phase close to ±π corresponds to an excitation in the past, a half modulation period ago.
The magnitude value represents the degree of modulation. High values indicate speech, whereas absence of speech results in low values close to zero. Since the magnitude still depends on the scaling of the input signal, embodiments disclosed herein apply normalization.
Normalization
An example embodiment may normalize the modulation signal with respect to the variance in each frequency band to acquire a VAD feature that is independent from the scaling of the input signal. The variance

B²_norm(w,ℓ) = (1−β₃)·B²_hp(w,ℓ) + β₃·B²_norm(w,ℓ−1)  (4)

may be estimated by recursive smoothing of B²_hp(w,ℓ) with a smoothing constant β₃ = β₂, corresponding to −24 dB/s. The transfer function of the filter is depicted by the transfer function of the filter with normalization 844 of FIG. 8. After normalization

B̃(w,ℓ) = B_mod(w,ℓ)/√(B²_norm(w,ℓ)),  (5)

the magnitude of B̃(w,ℓ) represents the contribution of the modulation frequency to the overall non-stationary signal components. Magnitude as well as phase information from both frequency bands may be combined in an example embodiment of the VAD feature to increase robustness against interferences.
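A minimal sketch of the normalization of Equations (4) and (5); the small constant under the square root is an illustrative numerical guard.

import numpy as np

def normalize(B_mod, B_hp, beta3):
    # Recursive variance estimate per Equation (4), then Equation (5).
    var = np.zeros_like(B_hp)
    for l in range(B_hp.shape[1]):
        prev = var[:, l - 1] if l > 0 else 0.0
        var[:, l] = (1 - beta3) * B_hp[:, l] ** 2 + beta3 * prev
    return B_mod / np.sqrt(var + 1e-12)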
Magnitude and Phase Differences
According to an example embodiment, magnitude as well as phase difference may be taken into account to detect an alternating excitation of low and high frequencies:

MPD(ℓ) = −|B̃(1,ℓ)|·|B̃(2,ℓ)|·cos(∠(B̃(1,ℓ)·B̃*(2,ℓ))).  (6)

High feature values close to one indicate a distinct modulation and an alternating excitation of the low and high frequencies. In this case, the magnitudes of both frequency bands are close to one and the cosine of the phase difference results in −1. In an event no distinct modulation is present, an example embodiment of the VAD feature assumes values close to zero, since at least one magnitude is close to zero. Negative feature values indicate a distinct modulation but no alternating excitation structure.
Since the cosine function employed in Equation (6), disclosed above, produces a continuous value in a range [−1, 1], Equation (6) may be employed to distinguish a 180° phase difference between the frequency bands, corresponding to MPD(ℓ) values close to +1, from in-phase modulation, corresponding to values close to −1, and from an absence of distinct modulation, corresponding to values close to zero. As such, according to the example embodiment of Equation (6), disclosed above, a VAD feature may employ detection of both modulation, for example 4 Hz, as well as a specific phase shift (i.e., phase difference), such as a 180° phase shift, between high and low frequencies to determine a likelihood of the presence of speech for a time interval of speech corresponding to a given frame ℓ. The 180° phase shift may be determined based on the scalar product employing a cosine, as in Equation (6), disclosed above, or by shifting one of the two modulation signals in time and summing the two, such that an alternating excitation yields a maximum value.
According to embodiments disclosed herein, an example embodiment of the VAD feature may be implemented, efficiently, by:

MPD(ℓ) = −(Re{B̃(1,ℓ)}·Re{B̃(2,ℓ)} + Im{B̃(1,ℓ)}·Im{B̃(2,ℓ)}).  (7)
A conventional modulation feature, similar to (E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator," in ICASSP, vol. 2, (Munich, Germany), pp. 1331-1334, April 1997), can be derived by averaging the magnitudes of both frequency bands

MOD(ℓ) = (|B̃(1,ℓ)| + |B̃(2,ℓ)|)/2  (8)

without considering phase differences. High values close to one indicate a distinct modulation, whereas low values close to zero indicate an absence of modulation. According to an example embodiment, either Equation (6) or Equation (7), disclosed above, may be employed to determine a modulation feature that checks for modulation in a specific way, that is, by detecting the 180° phase shift to determine whether speech is present.
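For illustration, Equations (7) and (8) may be sketched together, assuming the normalized band signals B̃(w,ℓ) are stacked in a complex numpy array of shape (2, number of frames); the function names are illustrative.

import numpy as np

def mpd(B_tilde):
    # Equation (7): equals -|B1||B2|cos(phase difference) of Equation (6).
    b1, b2 = B_tilde[0], B_tilde[1]
    return -(b1.real * b2.real + b1.imag * b2.imag)

def mod(B_tilde):
    # Equation (8): average magnitude of both bands, ignoring phase.
    return 0.5 * (np.abs(B_tilde[0]) + np.abs(B_tilde[1]))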
Post-Processing
Based on results from experiments disclosed herein, it may be observed that an example embodiment of the VAD feature is robust against various types of interferences, such as various types of interferences in automotive environments. According to an example embodiment, an alternating structure of speech that is considered by the feature is quite specific. A high confidence that speech is actually present in the signal can be expected when the example embodiment of the VAD feature indicates speech. However, even during the presence of speech, the example embodiment of the VAD feature may not permanently assume high values, as speech signals do not consistently exhibit this alternating energy (i.e., power concentration) characteristic in the first and second frequency bands or, alternatively, an alternating concentration of harmonic components in the first frequency band and a concentration of power in the second frequency band, as disclosed above.
According to an example embodiment, maximum values may be held for some frames, e.g., by

MPD_holdtime(ℓ) = max_{ℓ′ ∈ {0, …, L′−1}} MPD(ℓ−ℓ′)  (9)

to implement a hold time. With this mechanism, the example embodiment may start to detect speech when the expected characteristic occurs. However, a duration of detection is fixed by a parameter L′. As such, to better react to an end of speech, a combination with another feature may be employed according to an example embodiment.
An example embodiment may use a combination of different features to take advantage of capabilities of the different features. According to an example embodiment, in an event MPD with hold time, that is, MPD_holdtime(ℓ), disclosed above, indicates speech, a high confidence that speech was actually present during the previous L′ frames may be assumed. Using this information, another feature can be controlled. As such, an example embodiment of a VAD feature, such as MPD_holdtime(ℓ), disclosed above, may be used to control other VAD features, such as MOD(ℓ), as disclosed by COMB(ℓ), disclosed below. For example, if MPD_holdtime(ℓ) results in a low value, the effect of the MOD(ℓ) value may be limited. A higher value of the MOD(ℓ) feature may be required before the value of the COMB(ℓ) feature, disclosed below, exceeds a given threshold. Thus, a false speech detection rate may be reduced compared to employing the MOD(ℓ) feature without the MPD_holdtime(ℓ) feature.
An example embodiment may combine MPD with hold time with MOD using a multiplication:

COMB(ℓ) = MPD_holdtime(ℓ)·MOD(ℓ).  (10)
According to an example embodiment, such a combination prevents MOD from triggering a detection when MPD did not indicate speech during the previous frames. On the other hand, the end of speech can be detected according to embodiments disclosed herein by taking the MOD feature into account. The combination, therefore, adopts the high robustness of MPD and the ability of MOD to react to the end of speech.
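A minimal sketch of the hold-time maximum of Equation (9) and the multiplicative combination of Equation (10); hold_frames corresponds to the parameter L′, and the inputs are assumed to be numpy arrays.

import numpy as np

def comb(mpd_vals, mod_vals, hold_frames):
    # Equation (9): running maximum over the last L' frames of MPD.
    held = np.array([mpd_vals[max(0, l - hold_frames + 1) : l + 1].max()
                     for l in range(len(mpd_vals))])
    # Equation (10): COMB = MPD_holdtime * MOD.
    return held * mod_vals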
According to another example embodiment, a system 930 may accumulate a magnitude spectrum |X(k,ℓ)| into frequency bands B(w,ℓ), as employed in Equation (1), disclosed above, where k denotes the frequency bin and ℓ denotes the current frame.
As disclosed above with regard to Equation (1), two frequency bands w ∈ {1,2} that capture low frequencies [200 Hz, 2 kHz] and high frequencies [4.5 kHz, 8 kHz] corresponding to voiced and unvoiced speech, respectively, may be employed. As such, the system 930 includes two paths, a first path 935a and a second path 935b, each corresponding to a particular frequency band of the two frequency bands. Frequency bins kmin and kmax may be chosen that correspond to the cut-off frequencies of the two frequency bands, so that low frequency components (primarily voiced speech portions) and high frequency components (primarily unvoiced speech portions) may be captured by the separated frequency bands.
The system 930 may further include a modulation filter and normalization section 938. The modulation filter and normalization section 938 may include a first 4 Hz modulation filter 940a and normalization term 942a pair employed in the first path 935a and a second 4 Hz modulation filter 940b and normalization term 942b pair employed in the second path 935b. In the 4 Hz modulation filters 940a and 940b, a typical syllable rate of speech (e.g., 4 Hz) may be considered by applying a filter along time for each frequency band. An infinite impulse response (IIR) filter may emphasize the modulation frequency Ω_mod ≈ 4 Hz and attenuate all other frequencies. A parameter β, corresponding to a decay of −3 dB per modulation cycle, may be used to control decay of the impulse response. The resulting complex-valued signals B_mod(w,ℓ) 944, given by

B_mod(w,ℓ) = (1−β)·B(w,ℓ) + β·e^(2πj·Ω_mod·R/f_s)·B_mod(w,ℓ−1),

represent strength and phase of the modulation.
The 4 Hz modulation filters 940a and 940b may be extended by a high-pass and a low-pass filter to achieve a stronger suppression of the influence of the stationary background (i.e., modulation frequency 0 Hz), as well as of highly fluctuating components (i.e., modulation frequencies ≫ 4 Hz).
The magnitudes of the complex-valued signals B_mod(w,ℓ) 944 depend on a power of the input signal x(n). To remove this dependency and to emphasize phase information, a normalization B̃_mod(w,ℓ) 946 may be applied. For example, normalization with respect to the second moment (≈ power) in each frequency band, such as the normalization of Equations (4) and (5), disclosed above, enables a contribution of the 4 Hz modulation to the complete power to be taken into account.
The system 930 may further include a weighted sum of phase shifted bands section 948. In the weighted sum of phase shifted bands section 948, one of the phase shifters 949a and 949b may be employed for shifting the normalized B̃_mod(w,ℓ) of one of the frequency bands, such as the normalized B̃_mod(w,ℓ) 946 in the second path 935b. Either of the phase shifters 949a or 949b may be employed to detect a 180° phase difference between the lower and higher frequency bands for detecting the alternating pattern. According to an example embodiment, in an event the 180° phase difference is not detected, noise may be assumed. As such, according to an example embodiment, a VAD feature, such as the modulation-phase difference (MPD) feature MPD(ℓ) 952, may be designed to not only detect a modulation frequency but to also detect a phase shift of, for example, 180° between the high and low frequency bands.
In the weighted sum of phase shifted bands section 948, the normalized complex-valued signals B̃_mod(w,ℓ) 946 from different bands, that is, from the paths 935a and 935b, may be combined via a combiner 950 to generate the modulation-phase difference feature MPD(ℓ) 952 using a weighted sum of the phase-shifted bands with weighting coefficients σ_w. According to an example embodiment, the weighting coefficients may be chosen so that the expected phase differences of 180° between the lower and higher frequency bands may be compensated, for example, σ₁=1, σ₂=0.5, σ₃=−0.5, σ₄=−1 for four frequency bands. Alternatively, by taking into account the magnitude as well as the phase difference, the alternating excitation of low and high frequencies may be detected as in Equation (6), disclosed above.
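Purely as an illustrative sketch (the weighted-sum formula itself is not reproduced above, so the exact combination is an assumption), a signed weighted sum whose magnitude grows when low and high bands are modulated in anti-phase might be written as follows.

import numpy as np

def mpd_weighted(B_tilde_mod, sigma=(1.0, 0.5, -0.5, -1.0)):
    # B_tilde_mod: (num_bands, num_frames) normalized complex modulation signals.
    # Signed weights realize the expected 180-degree phase shift between low and
    # high bands; taking the magnitude of the sum and normalizing by sum(|sigma|)
    # are assumptions, not the disclosed formula.
    sigma = np.asarray(sigma, dtype=float)[:, None]
    return np.abs((sigma * B_tilde_mod).sum(axis=0)) / np.abs(sigma).sum()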
Advantages of the example embodiment of the system 930, disclosed above, and other example embodiments disclosed herein, include increased robustness against various types of noise by considering speech characteristics. According to example embodiments, a computationally low-complex VAD feature may be produced because only a few frequency bands are considered. According to an example embodiment, a feature value for VAD may be normalized to the range 0 ≤ MPD(ℓ) ≤ 1, wherein higher values, such as values closer to one or any other suitable value, may indicate a higher likelihood of presence of speech, and may be independent from input signal power due to normalization. It should be understood that ℓ indicates a current frame for which the MPD(ℓ) may be generated.
Simulations and Results
Simulations disclosed herein employed the UTD-CAR-NOISE database (N. Krishnamurthy and J. H. L. Hansen, "Car noise verification and applications," International Journal of Speech Technology, December 2013) that contains an extensive collection of car noise recordings. Driving noise as well as typical non-stationary interferences, such as indicator or wiper noise, were recorded in 20 different cars. The noise database is available at a high sampling rate. An example embodiment of the VAD feature was, therefore, evaluated with different sampling rates of fs = 24 kHz as well as fs = 16 kHz that are typically employed for ICC applications.
Also, in each car, a short speech sequence—counting from zero to nine—was recorded. This sequence was employed to investigate whether the alternating excitation structure is detected as expected by an example embodiment of the VAD feature. Some digits, such as "six": "s" (unvoiced) "i" (voiced) "x" (unvoiced), exhibit the alternating structure, whereas others, e.g., "one," consist purely of voiced phonemes. The expectation was that only digits with the alternating structure can be detected.
As a basic VAD approach, a threshold was applied to an example embodiment of the VAD feature value, as the VAD feature value may be a soft feature value indicating a likelihood value as opposed to a hard result that may be a boolean value. Detection rate Pd and false-alarm rate Pfa are common measures to evaluate VAD methods (J. Ramirez, J. M. Górriz, and J. C. Segura, "Voice Activity Detection. Fundamentals and Speech Recognition System Robustness," in Robust Speech Recognition and Understanding). The speech sequence was manually labeled to determine the ratio Pd between the number of correctly detected frames and the overall number of speech frames. The false-alarm rate Pfa is determined analogously by dividing the number of false alarms by the overall number of non-speech frames.
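For illustration, the detection rate Pd and false-alarm rate Pfa of a thresholded soft feature may be sketched as follows, assuming numpy arrays and per-frame boolean ground-truth labels; the function name is illustrative.

import numpy as np

def detection_rates(feature, labels, threshold):
    # labels: boolean array, True for manually labeled speech frames.
    detected = feature > threshold
    pd = detected[labels].mean()       # correctly detected / all speech frames
    pfa = detected[~labels].mean()     # false alarms / all non-speech frames
    return pd, pfa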
For further simulations, a lower sampling rate of 16 kHz is employed and speech data from the Texas Instruments/Massachusetts Institute of Technology (TIMIT) speech database (J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallet, and N. L. Dahlgren, "DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM," 1993) is artificially mixed with the automotive noise. In FIG. 11, the resulting receiver operating characteristic (ROC) curves are shown.
MPD can detect speech with a low false-alarm rate. The speech characteristic that is considered is very specific, therefore much speech is missed. By temporally extending the detections using hold time, more speech can be captured by MPD (with hold time). Even with hold time, the false-alarm rate is very low which underlines the robustness of MPD against noise.
LTSV employs the non-stationarity of the signal to detect speech, therefore, it is also triggered by many non-stationary interferences. Using LTSV, very high detection rates can be achieved when accepting these false-alarms as shown by the LTSV ROC curve 1118.
The modulation feature (MOD) ROC curve 1116 lies between the curves of MPD (with hold time) 1112 and the LTSV ROC curve 1118. By combining (COMB) MPD (with hold time) and MOD, the best performance is achieved in this simulation as shown by the COMB ROC curve 1120.
The MPD ROC curve 1114 again shows the robustness of the MPD feature disclosed herein against interferences. On the left side, the slope is very steep, so speech can be detected even for a very low false-alarm rate. Other features, such as LTSV, are less robust against interferences reflected by a less steep slope of the LTSV ROC curve 1118.
As disclosed above, the MPD feature may miss much speech activity due to the very specific speech characteristic that is considered. As such, an example embodiment may employ a hold time, as disclosed above, and the detection rate can be increased without significantly increasing the false-alarm rate. In this evaluation, the detection rate for longer utterances is of interest. In contrast to the earlier analysis using the digit sequence, specific elements of a speech sequence are not considered. Therefore, a much longer hold time L′ ≈ 2 s can be chosen, which is beneficial in practical applications.
An example embodiment may combine the MPD feature with the modulation feature, and the results show that the performance can be increased. This combined feature (COMB) outperforms all the other features considered in this analysis as shown by the COMB ROC curve 1120.
Turning back to the method of detecting speech, disclosed above, the identifying may include employing feature values accumulated in at least one previous frame to identify the pattern of time-separated first and second distinctive feature values, the at least one previous frame transpiring previous to the given frame.
Such a temporal context may be captured by considering modulated signal components. For example, the identifying may include computing phase differences between first modulated signal components of the electronic representation of the audio signal in a first frequency band of the at least two different frequency bands and second modulated signal components of the electronic representation of the audio signal in a second frequency band of the at least two different frequency bands, wherein the first frequency band may be lower in frequency relative to the second frequency band.
The identifying may include employing the phase differences computed to detect a temporal alternation of the time-separated distinctive features in the at least two different frequency bands, such as disclosed with reference to Equation (6), disclosed above. The likelihood of the presence of the speech may be higher in response to the temporal alternation being detected relative to the temporal alternation not being detected, and the pattern may be the temporal alternation.
The identifying may include applying a modulation filter to the electronic representation of the audio signal and the modulation filter may be based on a syllable rate of human speech, such as disclosed with reference to Equation (3) disclosed above.
In an event the speech detection result satisfies a criterion for indicating that speech is present, the producing may include extending, temporally, the speech detection result for the given frame by associating the speech detection result with one or more frames immediately following the given frame, such as disclosed with reference to Equation (9), disclosed above.
The speech detection result may be a first speech detection result indicating the likelihood of the presence of the speech in the given frame, such as MPD_holdtime(ℓ), as disclosed above with reference to Equation (9). As disclosed with reference to Equation (10), above, the producing may include combining the first speech detection result MPD_holdtime(ℓ) with a second speech detection result MOD(ℓ), indicating the likelihood of the presence of the speech in the given frame, to produce a combined speech detection result COMB(ℓ), indicating the likelihood of the presence of the speech in the given frame. The combined speech detection result COMB(ℓ) may prevent an indication that the speech is present in the given frame in an event the first speech detection result MPD_holdtime(ℓ) indicates that the speech is likely not present at the given frame or during frames previous to the given frame. The second speech detection result MOD(ℓ) may be employed to detect an end of the speech in the electronic representation of the audio signal. The combined speech detection result COMB(ℓ) enables an improved robustness against false-alarms during absence of speech relative to the first speech detection result MPD_holdtime(ℓ) and the second speech detection result MOD(ℓ), as disclosed above.
The method may include producing the second speech detection result by averaging magnitudes of first modulated signal components of the electronic representation of the audio signal in a first frequency band of the at least two different frequency bands and second modulated signal components of the electronic representation of the audio signal in a second frequency band of the at least two different frequency bands.
Turning back to FIG. 6, the identification module 656 may be further configured to compute phase differences between first modulated signal components of the electronic representation of the audio signal 658 in a first frequency band of the at least two different frequency bands and second modulated signal components of the electronic representation of the audio signal 658 in a second frequency band of the at least two different frequency bands, and may be further configured to employ the phase differences computed to detect a temporal alternation of the time-separated first and second distinctive feature values in the at least two different frequency bands, such as disclosed above with reference to Equation (6).
The identification module 656 may be further configured to apply a modulation filter (not shown), such as disclosed with reference to Equation (3) disclosed above, to the electronic representation of the audio signal 658. The modulation filter may be based on a syllable rate of human speech.
In an event the speech detection result 662 satisfies a criterion for indicating that speech is present, the speech detection module 660 may be further configured to extend, temporally, the speech detection result 662 for the given frame by associating the speech detection result 662 with one or more frames immediately following the given frame, such as disclosed with reference to Equation (9), disclosed above.
The speech detection result 662 may be a first speech detection result indicating the likelihood of the presence of the speech in the given frame. The speech detection module 660 may be further configured to combine the first speech detection result with a second speech detection result indicating the likelihood of the presence of the speech in the given frame to produce a combined speech detection result indicating the likelihood of the presence of the speech in the given frame. As disclosed with reference to Equation (10), disclosed above, the speech detection module 660 may be further configured to employ the combined speech detection result to prevent an indication that the speech is present in the given frame in an event the first speech detection result indicates that the speech is likely not present at the given frame or during frames previous to the given frame. The speech detection module 660 may be further configured to employ the second speech detection result to detect an end of the speech in the electronic representation of the audio signal 658.
The speech detection module 660 may be further configured to produce the second speech detection result by averaging magnitudes of first modulated signal components of the electronic representation of the audio signal in a first frequency band of the at least two different frequency bands and second modulated signal components of the electronic representation of the audio signal 658 in a second frequency band of the at least two different frequency bands.
The processor 654 may be further configured to generate an enhanced electronic representation of the audio signal based on the first, second, or combined speech detection results. The enhanced electronic representation of the audio signal may be transmitted via another audio interface (not shown) of the apparatus 650 to produce an enhanced audio signal (not shown).
According to an example embodiment, a VAD feature may expect a temporally alternating excitation structure of high and low frequencies for speech, such as disclosed above.
Further example embodiments disclosed herein may be configured using a computer program product; for example, controls may be programmed in software for implementing example embodiments. Further example embodiments may include a non-transitory computer-readable medium containing instructions that may be executed by a processor, and, when loaded and executed, cause the processor to complete methods described herein. It should be understood that elements of the block and flow diagrams may be implemented in software or hardware, such as via one or more arrangements of circuitry disclosed herein, or equivalents thereof.
The teachings of all patents, published applications and references cited herein are incorporated by reference in their entirety.
While this invention has been particularly shown and described with references to example embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.
This application claims the benefit of U.S. Provisional Application No. 62/335,139, filed on May 12, 2016 and U.S. Provisional Application No. 62/338,884, filed on May 19, 2016. The entire teachings of the above applications are incorporated herein by reference.