This application is the U.S. National Phase Application of International Application No. PCT/IN2019/050571, filed on Aug. 3, 2019, and asserts priority to Application No. IN 201821032901 filed Sep. 1, 2018, the disclosures of which are hereby incorporated by reference in their entirety.
The present disclosure relates to processing of speech signals, and more particularly to real-time pitch tracking by detection of glottal excitation epochs in speech signal.
Voiced speech is the output of a time-varying vocal tract filter excited by pulsatile airflow due to quasi-periodic vibration of the vocal folds in the larynx. The excitation is characterized by an impulsive excitation around the instants of glottal closure, known as the excitation epochs, and the duration between two successive epochs is termed the pitch period. The rate of vibration of the vocal folds is termed the fundamental frequency of voicing, the pitch frequency, or the pitch. Pitch estimation is required for many speech processing applications such as speech codecs, voice conversion systems, speaker recognition, speech recognition of tonal languages, diagnosis of voice disorders, speech training aids, and other applications involving pitch tracking.
Speech codecs are used in speech communication devices for low bit rate signal transmission, by exploiting the redundancy in the speech signal, by coding the voicing, pitch, and vocal tract filter parameters. Syllabic-level pitch contour is needed for speech recognition in tonal languages. Pitch information has been reported to be useful in significantly reducing the computation time for speaker recognition. Pitch modification is an essential component of voice conversion, for converting the properties of the speech signal of the source speaker to those of the target speaker. Detection of abnormality in the distribution of the pitch periods and measurement of jitter is useful for diagnosis of voice disorders. Speech training aids providing a feedback of the pitch to the speaker during phonation can be used for improving the speech intelligibility, particularly for the tonal languages, and for improving the prosodic features. Most of these applications require real-time pitch tracking.
A number of pitch estimation methods have been reported for different applications. These methods can be broadly grouped into window-based and event-based methods. The window-based methods segment the signal using an analysis window, treating the signal as stationary for the duration of the analysis window. The window-based methods may use time-domain, frequency-domain, or time-frequency domain analysis. The time-domain analysis uses the periodicity property of the voiced speech signal and the frequency-domain analysis uses the harmonic structure in the spectrum of the voiced speech signal. A combination of these properties is used in the time-frequency domain analysis. The window-based methods cannot track fast changes in the pitch and may have pitch doubling and pitch halving errors. The event-based methods locate points associated with a significant epoch in each cycle of the glottal excitation. These methods generally require the presence of excitation component at the fundamental frequency of the speech signal and hence are not suited for high-pass filtered speech.
In a method proposed by Atal (B. S. Atal, “Speech signal pitch detector using prediction error data,” U.S. Pat. No. 3,740,476, 1973), peaks of the glottal excitation are detected by amplitude thresholding of the linear prediction (LP) residual. Several variants of this technique have been used in speech codecs. Cox et al. (R. V. Cox and R. E. Crochiere, “Real-time pitch detection by stream processing,” U.S. Pat. No. 4,486,900, 1984) proposed real-time pitch estimation using autocorrelation over a 20-ms window and sequential peak picking to locate the autocorrelation peaks in the pitch range of the signal.
In a method proposed by Picone et al. (J. Picone and D. Prezas, “Parallel processing pitch detector,” U.S. Pat. No. 4,879,748, 1989), four pitch periods are estimated by applying peak-picking on the LP residual, negated LP residual, speech signal, and negated speech signal, and voting is used for the final estimate of the pitch period. Ma et al. (C. X. Ma and L. F. Willems, “Human speech processing apparatus for detecting instants of glottal closure,” U.S. Pat. No. 6,470,308 B1, 2002) proposed detection of glottal epochs by amplitude thresholding of the low-pass filtered and rectified signal, with the low-pass filter realized as a moving average filter with a trapezoidal window of length less than the lowest pitch period and the threshold obtained as the output of another moving average filter with a larger window length.
Nucci et al. (A. Nucci and R. Keralapura, “Hierarchical real-time speaker recognition for biometric VoIP verification and targeting,” U.S. Pat. No. 8,160,877 B1, 2012) proposed pitch estimation using the largest non-DC peak in the power spectrum of amplitude envelope obtained using discrete energy separation algorithm. Sung et al. (Y. Sung, M. Wang, and X. Lei, “Mobile speech recognition with explicit tone features,” U.S. Pat. No. 8,725,498 B1, 2014) proposed three embodiments for pitch tracking using frequency-domain analysis, autocorrelation analysis, and band-pass filtering with the passband selected for the pitch range.
In a method proposed by Talkin (D. Talkin, “Simultaneous estimation of fundamental frequency, voicing state, and glottal closure instant,” U.S. Pat. No. 9,263,052 B1, 2016), initial candidate epochs are detected using the peak and the pulse shape of the normalized and polarity-corrected LP residual, initial estimate of the fundamental frequency is obtained by normalized cross-correlation applied on a linear combination of the signal and its LP residual, and the voicing probability is based on the RMS value of the signal. These initial estimates are refined by minimizing a cost function using dynamic programming.
Kacic (Z. Kacic, “Pitch period and voiced/unvoiced marking method and apparatus,” PCT International Publication No. WO 2018/026329 A1, 2018) proposed method and apparatus for obtaining the pitch period using band-pass filtering of speech signal with the center of the passband selected using a coarse pitch estimated from the short-time autocorrelation of the signal. The pitch marks are located at the signal peaks nearest to the positive zero crossings of the band-pass filtered signal and the pitch period is estimated as the interval between two pitch marks.
In an epoch detection method by Murty et al. (K. S. R. Murty and B. Yegnanarayana, “Epoch extraction from speech signals,” IEEE Transactions on Audio, Speech, and Language Processing, 16 (8), pp. 1602-1613, 2008), the effect of the vocal tract response is reduced by passing the pre-emphasized signal through two marginally stable cascaded zero frequency resonators (ZFR). The positive zero-crossings of the sinusoid-like signal generated by repeated mean-subtraction of the output of the resonator represent the glottal closure instants (GCIs). In a method by Drugman et al. (T. Drugman and T. Dutoit, “Glottal closure and opening instant detection from speech signals,” Proceedings of Interspeech 2009, pp. 2891-2894), the epoch containing intervals are marked from the local-minima to the subsequent positive zero-crossings on a running mean-based speech signal and the highest peaks of the LP residual in these intervals are marked as the epochs. These techniques require the presence of the fundamental and hence cannot be used for epoch detection of high-pass filtered speech.
Patil et al. (H. A. Patil and S. Viswanath, “Effectiveness of Teager energy operator for epoch detection from speech signals,” International Journal of Speech Technology, 14 (4), pp. 321-337, 2011) and Shikhah et al. (N. Shikhah and M. Deriche, “A novel pitch estimation technique using the Teager energy function,” Proceedings of IEEE ISSPA 1999, pp. 135-138) used Teager energy operator on a low-pass filtered speech for GCI detection. This method is not suitable for epoch detection of high-pass filtered speech.
In a method proposed by Prathosh et al. (A. P. Prathosh, T. V. Ananthapadmanabha, and A. G. Ramakrishnan, “Epoch extraction based on integrated linear prediction residual using plosion index,” IEEE Transactions on Audio, Speech, and Language Processing, 21 (12), pp. 2471-2480, 2013), an integrated LP residual (ILPR) is calculated by inverse filtering the signal using LP coefficients estimated from short-time, Hamming windowed, and pre-emphasized signal, to reduce the bipolar swing of the LP residual around epochs due to the phase angle of formants. A modified short-time crest factor, termed the dynamic plosion index, is used on the half-wave rectified ILPR to estimate instants of significant excitation. The high peak-valley swing of the dynamic plosion index, which is computed for a fixed window, marks the instant of glottal closure. Prathosh et al. (A. P. Prathosh, P. Sujith, A. G. Ramakrishnan, and P. K. Ghosh, “Cumulative impulse strength for epoch extraction,” IEEE Signal Processing Letters, 23 (4), pp. 424-428, 2016) proposed a recursive algorithm using a temporal measure derived from the ILPR to detect the glottal epochs.
In a method by Gonzalez et al. (S. Gonzalez and M. Brooke, “PEFAC—a pitch estimation algorithm robust to high levels of noise,” IEEE Transactions on Audio, Speech, and Language Processing, 22 (2), pp. 518-528, 2014), the smoothed short-time spectrum is normalized by long-time average spectrum in the log-frequency domain, for robustness against noise while retaining the harmonic structure. The harmonic structure is enhanced by applying a smooth comb filter and the most probable pitch candidate is selected for each frame. The fundamental frequency is estimated by applying a temporal continuity measure on the initially estimated pitch values.
Vikram et al. (C. M. Vikram and S. R. M. Prasanna, “Epoch extraction from telephone quality speech using single pole filter,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25 (3), pp. 624-636, 2017) proposed detection of the glottal excitation epochs in telephony speech using an infinite impulse response (IIR) filter bank approach, assuming that filtering at half the sampling frequency provides a good separation between the carrier and amplitude-modulated components of the signal. A narrowband filter with resonance at half the sampling frequency is used to enhance instants of glottal excitation. The average of the envelopes of all filters has a high peak-to-valley swing around the instants of glottal closure. The salient points are determined initially as lying between the successive positive zero-crossings of the smoothed average envelope and then located within these intervals by marking the highest peak-to-valley swing in the output of the narrowband filter.
The available pitch estimation methods have varying computational complexities and algorithmic delays and are generally not well suited for real-time pitch tracking with high accuracy and good dynamic response, particularly for high-pass filtered speech.
A method and a system are disclosed for real-time pitch tracking by detection of glottal excitation epochs in speech signal, using Hilbert envelope to enhance saliency of the glottal excitation epochs and to reduce the ripples due to the vocal tract filter.
In an implementation of the present disclosure, a method for real-time pitch tracking is disclosed. The method comprises applying a dynamic range compression on the speech signal to obtain a dynamic range compressed signal, calculating a Hilbert envelope of the dynamic range compressed signal, and obtaining epochs and pitch periods by processing the Hilbert envelope by applying dynamic peak tracking, saliency enhancement, and amplitude-duration thresholding.
In another implementation, a system is disclosed. The system comprises a dynamic range compression module configured to perform dynamic range compression of the speech signal to obtain a dynamic range compressed signal, a Hilbert envelope calculation module configured to calculate the Hilbert envelope of the dynamic range compressed signal, and an epoch marking and pitch detection module configured to mark epochs and to output pitch periods by processing the Hilbert envelope by applying dynamic peak tracking, saliency enhancement, and amplitude-duration thresholding.
The detailed description of the invention is described with reference to the accompanying figures.
A method and system are disclosed for pitch tracking by detection of glottal excitation epochs in speech signal, wherein the method permits real-time processing and is robust against high-pass filtering. Further, the method is based on calculating the Hilbert envelope of the speech signal to enhance the excitation epochs and to suppress the ripples related to the vocal tract response. A dynamic range compression can be applied before the calculation of the Hilbert envelope, and an epoch marker may be used to detect the high-saliency points in the Hilbert envelope. The impulses corresponding to the detected epochs can then be used for pitch period estimation.
The voiced speech signal can be assumed as the convolution of the impulse response of the time-varying vocal tract and glottal filter with the quasi-periodic impulse train due to glottal vibration. The speech signal s(n) during voiced regions can be approximated by the short-time harmonic model as
where $b_k$ and $\theta_k$ represent the combined effect of the vocal tract and glottal filters and $\omega_0$ is the fundamental frequency. The Hilbert envelope of the speech signal $s(n)$ is the squared magnitude of the complex analytic signal $s_a(n)$, which is given as
$$s_a(n) = s(n) + j\,s_h(n) \tag{2}$$
where $s_h(n)$ is the Hilbert transform (see A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, Upper Saddle River, N.J.: Prentice-Hall, 1999) of the speech signal $s(n)$. The Hilbert transform can be obtained by a $\pi/2$-phase shifter, also known as the Hilbert transformer, with the frequency and impulse responses given as
The Hilbert envelope $e_h(n)$ may be given as
$$e_h(n) = s^2(n) + s_h^2(n) \tag{5}$$
The Hilbert transform $s_h(n)$ for the speech signal $s(n)$ in Equation 1 can be given as
The Hilbert envelope $e_h(n)$ can be expressed as
The Hilbert envelope $e_h(n)$ consists of an offset and a sum of harmonics of $\omega_0$, with several harmonics in $s(n)$ contributing to the fundamental and enhancing the instants of significant excitation.
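As an illustrative derivation only, one form consistent with the definitions above may be written out as follows. The cosine form of the harmonic model, the number of harmonics $K$, and the condition $K\omega_0 < \pi$ are assumptions made for this sketch. The ideal Hilbert transformer responses are the standard textbook ones.

$$
\begin{aligned}
&\text{Assumed harmonic model:} && s(n) = \sum_{k=1}^{K} b_k \cos(k\omega_0 n + \theta_k) \\
&\text{Ideal Hilbert transformer:} && H(e^{j\omega}) = \begin{cases} -j, & 0 < \omega < \pi \\ \;\;\,j, & -\pi < \omega < 0 \end{cases} \qquad h(n) = \begin{cases} \dfrac{2\sin^2(\pi n/2)}{\pi n}, & n \neq 0 \\ 0, & n = 0 \end{cases} \\
&\text{Each harmonic is shifted by } \pi/2\text{:} && s_h(n) = \sum_{k=1}^{K} b_k \sin(k\omega_0 n + \theta_k) \\
&\text{Envelope, using } \cos\alpha\cos\beta + \sin\alpha\sin\beta = \cos(\alpha - \beta)\text{:} && e_h(n) = s^2(n) + s_h^2(n) = \sum_{k=1}^{K} b_k^2 + 2\sum_{k=1}^{K-1}\sum_{l=k+1}^{K} b_k b_l \cos\big((l-k)\omega_0 n + \theta_l - \theta_k\big)
\end{aligned}
$$

Under these assumptions the first sum is the offset, and every pair of adjacent harmonics ($l = k+1$) contributes a term at $\omega_0$. A component at the fundamental may therefore appear in the envelope even when the fundamental itself is absent from $s(n)$, as in high-pass filtered speech.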
The processing modules of the embodiment illustrated in
The dynamic range compression serves as a pre-processing step to the Hilbert envelope calculation in order to reduce the possibility of misdetection of the epochs during low-energy speech segments. Dynamic range compression can be implemented in several ways.
In the dynamic range compression module 210 as illustrated in
$$a(n) = a(n-1) + \frac{|s_{in}(n)| - |s_{in}(n-L)|}{L} \tag{8}$$
The value of $L$ selected corresponds to a 25-ms window, i.e. $L = 25 \times 10^{-3} f_s$, where $f_s$ is the sampling frequency. For the input signal range of [−1, +1], the A-law compressed envelope is given as
A time-varying gain $g(n)$ is calculated from the magnitude envelope $a(n)$ and the compressed envelope $\tilde{a}(n)$ as
$$g(n) = \tilde{a}(n)/a(n) \tag{10}$$
The speech signal $s_{in}(n)$ is delayed with a delay equal to the delay introduced by the magnitude envelope estimation module and is multiplied with the time-varying gain $g(n)$ to obtain the dynamic range compressed signal $s(n)$ as
$$s(n) = g(n)\,s_{in}(n-(L-1)/2) \tag{11}$$
The value of A in Equation 9 is set as 40 to provide compression without excessive increase of noise during the silences and it results in the highest gain of approximately 19 dB.
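The compression steps described above may be sketched as follows. This is a minimal illustration, not the disclosed implementation: the A-law curve used for the compressed envelope is the textbook form and is an assumption here, as are the function name `dynamic_range_compress`, the zero initial state, and the rounding of $L$ to an odd value so that the delay $(L-1)/2$ is an integer.

```python
import numpy as np

def dynamic_range_compress(s_in, fs, A=40.0, window_ms=25.0):
    """Sketch of envelope-based compression following Equations 8, 10, and 11."""
    s_in = np.asarray(s_in, dtype=float)
    L = int(round(window_ms * 1e-3 * fs))
    if L % 2 == 0:
        L += 1  # assumption: odd L keeps the delay (L - 1) / 2 an integer

    # Equation 8: running mean of |s_in| over the last L samples
    # (closed form of the recursion with zero initial state).
    mag = np.abs(s_in)
    csum = np.cumsum(np.concatenate(([0.0], mag)))
    idx = np.arange(len(mag))
    start = np.maximum(idx - L + 1, 0)
    a = (csum[idx + 1] - csum[start]) / L

    # Assumed textbook A-law curve for an envelope in [0, 1].
    a_c = np.clip(a, 0.0, 1.0)
    denom = 1.0 + np.log(A)
    a_tilde = np.where(
        a_c < 1.0 / A,
        A * a_c / denom,
        (1.0 + np.log(np.maximum(A * a_c, 1.0))) / denom,
    )

    # Equation 10: time-varying gain; as a -> 0 the gain tends to A / (1 + ln A).
    safe_a = np.where(a_c > 0.0, a_c, 1.0)
    g = np.where(a_c > 0.0, a_tilde / safe_a, A / denom)

    # Equation 11: delay the input by (L - 1) / 2 samples and apply the gain.
    delay = (L - 1) // 2
    delayed = np.concatenate((np.zeros(delay), s_in))[: len(s_in)]
    return g * delayed
```

With the assumed A-law curve and A = 40, the small-signal gain is 40/(1 + ln 40), about 8.5 or 18.6 dB, which is in line with the stated highest gain of approximately 19 dB.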
The Hilbert transformer 510, used for the Hilbert envelope calculation as shown in
$$s_{ht}(n) = s(n) * h_t(n) \tag{12}$$
$$s_d(n) = s(n-(M-1)/2) \tag{13}$$
$$e_{ht}(n) = s_{ht}^2(n) + s_d^2(n) \tag{14}$$
In order to suppress the glottal and vocal tract filter responses without excessive smearing of the representation of the glottal excitation in the envelope, $M$ is empirically selected to correspond to 15 ms, i.e. $M = 15 \times 10^{-3} f_s$.
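Equations 12 to 14 may be sketched with a length-$M$ FIR Hilbert transformer as follows. The design of $h_t(n)$ shown here, a Hamming-windowed truncation of the ideal impulse response, is an assumption for illustration, as are the function name `fir_hilbert_envelope` and the rounding of $M$ to an odd value so that $(M-1)/2$ is an integer delay.

```python
import numpy as np

def fir_hilbert_envelope(s, fs, length_ms=15.0):
    """Sketch of the Hilbert envelope of Equations 12-14 using an FIR transformer."""
    s = np.asarray(s, dtype=float)
    M = int(round(length_ms * 1e-3 * fs))
    if M % 2 == 0:
        M += 1  # assumption: odd M gives an integer delay of (M - 1) / 2
    delay = (M - 1) // 2

    # Ideal Hilbert impulse response 2 / (pi k) for odd k, 0 for even k,
    # shifted by `delay` to make it causal.
    k = np.arange(M) - delay
    odd = (k % 2) != 0
    h_t = np.zeros(M)
    h_t[odd] = 2.0 / (np.pi * k[odd])
    h_t *= np.hamming(M)  # assumption: Hamming window to reduce truncation ripple

    s_ht = np.convolve(s, h_t)[: len(s)]                  # Equation 12
    s_d = np.concatenate((np.zeros(delay), s))[: len(s)]  # Equation 13
    return s_ht ** 2 + s_d ** 2                           # Equation 14
```

For a steady sinusoid well inside the passband of the transformer, the envelope computed this way settles near the squared amplitude once the filter has filled, which gives a simple sanity check for the delay alignment between `s_ht` and `s_d`.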
The epoch marking and pitch detection module 230 in the block diagram of
The dynamic peak detector module 610 of
The valley d(n) tracks the time-varying offset in the Hilbert envelope, where the constants μ and v, selected to be in the range [0,1], control the rise and fall rates. A fast rise (small μ) and slow fall (large v) help in suppressing the ripples while retaining saliency of the epochs. In an exemplary embodiment, these values are selected as μ=0.1 and v=0.9954 for 90% rise in one sample and 60% fall in 100 samples.
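The rise and fall behavior described above may be illustrated with a generic asymmetric first-order tracker. The update rule below is an assumed form chosen only so that a coefficient of 0.1 gives a 90% rise in one sample; it is not taken from the disclosed equations, and the names `track_peak`, `mu`, and `nu` are hypothetical.

```python
import numpy as np

def track_peak(x, mu=0.1, nu=0.9954):
    """Asymmetric first-order tracker (assumed form): fast rise, slow fall."""
    x = np.asarray(x, dtype=float)
    p = np.zeros(len(x))
    prev = 0.0  # assumption: zero initial state
    for i, xi in enumerate(x):
        if xi >= prev:
            # Rising input: move (1 - mu) of the way toward the input.
            prev = mu * prev + (1.0 - mu) * xi
        else:
            # Falling input: decay slowly toward the input.
            prev = nu * prev + (1.0 - nu) * xi
        p[i] = prev
    return p
```

In this sketch a unit step is followed to 0.9 in a single sample, and the tracked value then decays by a factor of 0.9954 per sample while the input stays at zero.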
The nonlinear smoother 620 of
Referring to the saliency detector module 630 of
In the saliency enhancer module of the saliency detector module 630 as shown in
$$y(n) = [-x(n) + 8x(n-1) - 8x(n-3) + x(n-4)]/12 \tag{17}$$
It may be noted that the differentiator may be replaced by other operations to emphasize the points with a high rate of change. One such operation is a real-time version of the Teager energy operator given as
$$y(n) = x^2(n-1) - x(n)\,x(n-2) \tag{18}$$
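Equations 17 and 18 may be sketched directly as follows. The function names and the zero initial conditions are assumptions for illustration.

```python
import numpy as np

def five_point_differentiator(x):
    """Equation 17: causal five-point differentiator (delay of two samples)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xp = np.concatenate((np.zeros(4), x))  # xp[i + 4] holds x(i); zeros before start
    return (-xp[4:4 + n] + 8.0 * xp[3:3 + n] - 8.0 * xp[1:1 + n] + xp[0:n]) / 12.0

def teager_energy(x):
    """Equation 18: causal Teager energy operator (delay of one sample)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xp = np.concatenate((np.zeros(2), x))  # xp[i + 2] holds x(i)
    return xp[1:1 + n] ** 2 - xp[2:2 + n] * xp[0:n]
```

As a check on the coefficients, a unit-slope ramp gives an output of 1 from the differentiator once four past samples are available, and a sinusoid of amplitude $B$ and frequency $\omega$ gives a constant $B^2\sin^2(\omega)$ from the Teager energy operator.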
In the saliency detector module (630) as shown in
$$A_\theta(n) = A_\theta(n-1) + [|y(n)| - |y(n-P)|]/P \tag{19}$$
where $P$ corresponds to a 10-ms window, i.e. $P = 10 \times 10^{-3} f_s$. The duration threshold $T_\theta(n)$ is calculated from the pitch periods, as half of the mean of the preceding ten pitch periods that lie within a set range, which may be 2-15 ms. A lower limit, which may be 2 ms, is applied on the duration threshold $T_\theta(n)$.
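The two thresholds may be computed as in the following sketch. It covers only the threshold calculation, not how the thresholds are applied to mark epochs. The function names, the zero initial state, the reading of "preceding ten pitch periods" as the ten most recent in-range periods, and the fallback to the lower limit when no valid period is available are all assumptions.

```python
import numpy as np

def amplitude_threshold(y, fs, window_ms=10.0):
    """Equation 19: running mean of |y| over P samples (zero initial state)."""
    mag = np.abs(np.asarray(y, dtype=float))
    P = int(round(window_ms * 1e-3 * fs))
    csum = np.cumsum(np.concatenate(([0.0], mag)))
    idx = np.arange(len(mag))
    start = np.maximum(idx - P + 1, 0)
    return (csum[idx + 1] - csum[start]) / P

def duration_threshold_ms(pitch_periods_ms, lo_ms=2.0, hi_ms=15.0,
                          history=10, floor_ms=2.0):
    """Half the mean of the most recent in-range pitch periods, floored at floor_ms."""
    valid = [p for p in pitch_periods_ms if lo_ms <= p <= hi_ms][-history:]
    if not valid:
        return floor_ms  # assumption: use the lower limit until periods are available
    return max(0.5 * float(np.mean(valid)), floor_ms)
```

For example, a run of 8-ms pitch periods gives a duration threshold of 4 ms, while a run of 3-ms periods would give 1.5 ms and is therefore held at the 2-ms lower limit.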
The implementation of the glottal excitation epoch detector uses a total storage of 725 variables and coefficients: 253 for magnitude envelope calculation in Equation 8, 3 for dynamic range compression in Equation 9, 1 for compressed signal in Equations 10-11, 302 for Hilbert envelope in Equations 12-14, 47 for smoothed peak in Equations 15-16 and two-stage median mean smoothing, 5 for differentiation in Equation 17, 103 for amplitude thresholding, and 11 for duration thresholding. The technique involves an algorithmic delay of 21.4 ms, consisting of 12.5 ms for compression, 7.5 ms for Hilbert envelope, and 1.4 ms for epoch marking.
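The storage and delay figures above can be cross-checked with a short tally. The dictionary keys are descriptive labels introduced here, and the values are the counts and delays stated in the preceding paragraph.

```python
# Storage (variables and coefficients) per processing step, as stated above.
storage = {
    "magnitude envelope (Eq. 8)": 253,
    "dynamic range compression (Eq. 9)": 3,
    "compressed signal (Eqs. 10-11)": 1,
    "Hilbert envelope (Eqs. 12-14)": 302,
    "smoothed peak and median-mean smoothing": 47,
    "differentiation (Eq. 17)": 5,
    "amplitude thresholding": 103,
    "duration thresholding": 11,
}

# Algorithmic delay per stage in milliseconds, as stated above.
delay_ms = {
    "compression": 12.5,
    "Hilbert envelope": 7.5,
    "epoch marking": 1.4,
}

total_storage = sum(storage.values())
total_delay_ms = round(sum(delay_ms.values()), 1)
print(total_storage, total_delay_ms)
```

The compression and Hilbert envelope delays are half of the 25-ms and 15-ms window lengths, consistent with the $(L-1)/2$ and $(M-1)/2$ sample delays in Equations 11 and 13.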
The various modules disclosed in the above description can be implemented using digital signal processors, embedded microcontrollers, FPGAs (field programmable gate arrays), or ASICs (application specific integrated circuits) or a combination of such processors. Further, one, two, or more modules can be integrated into a single processor.
The above description along with the accompanying drawings is intended to disclose and describe the preferred embodiment of the invention in sufficient detail to enable those skilled in the art to practice the invention. It should not be interpreted as limiting the scope of the invention. Various changes in form and detail may be made without departing from its spirit and scope.
Number | Date | Country | Kind |
---|---|---|---|
201821032901 | Sep 2018 | IN | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IN2019/050571 | 8/3/2019 | WO |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2020/044362 | 3/5/2020 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
3740476 | Atal | Jun 1973 | A |
4486900 | Cox et al. | Dec 1984 | A |
4879748 | Picone et al. | Nov 1989 | A |
4887299 | Cummins et al. | Dec 1989 | A |
5054085 | Meisel | Oct 1991 | A |
5381512 | Holton et al. | Jan 1995 | A |
5668925 | Rothweiler | Sep 1997 | A |
6470308 | Ma et al. | Oct 2002 | B1 |
6901353 | Huang | May 2005 | B1 |
7042986 | Lashley | May 2006 | B1 |
7376204 | Music | May 2008 | B1 |
8160877 | Nucci et al. | Apr 2012 | B1 |
8725498 | Sung et al. | May 2014 | B1 |
9263052 | Talkin | Feb 2016 | B1 |
10453479 | Wilhelms-Tricarico | Oct 2019 | B2 |
20100070283 | Kato | Mar 2010 | A1 |
20130262096 | Wilhelms-Tricarico et al. | Oct 2013 | A1 |
20150302845 | Nakano et al. | Oct 2015 | A1 |
20170032803 | Pandey | Feb 2017 | A1 |
20170347207 | De Haan | Nov 2017 | A1 |
Number | Date | Country |
---|---|---|
2018026329 | Feb 2018 | WO |
Entry |
---|
Dash et al., “High Density Noise Removal by Using Cascading Algorithms” 2015 Fifth International Conference on Advanced Computing & Communication Technologies, 2015, pp. 96-101, IEEE (Year: 2015). |
Harrison et al., “Time-Compression Overlap Add: Description and Implementation” 2015 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM) (pp. 64-69). IEEE. (Year: 2015). |
Ananthapadmanabha et al. “Epoch Extraction from Linear Prediction Residual for Identification of Closed Glottis Interval.” In: IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-27, No. 4, Aug. 1979. |
Murty, et al., “Epoch Extraction From Speech Signals,” IEEE Transactions on Audio, Speech, and Language Processing, 16 (8), pp. 1602-1613, 2008. |
Number | Date | Country | |
---|---|---|---|
20210201938 A1 | Jul 2021 | US |