This invention relates generally to the field of telecommunications, and more particularly to single-ended measurement of speech quality.
The capability of measuring speech quality in a telecommunications network is important to telecommunications service providers. Measurements of speech quality can be employed to assist with network maintenance and troubleshooting, and can also be used to evaluate new technologies, protocols and equipment. However, anticipating how people will perceive speech quality can be difficult. The traditional technique for measuring speech quality is a subjective listening test. In a subjective listening test a group of people manually, i.e., by listening, score the quality of speech according to, e.g., an Absolute Category Rating (“ACR”) scale: Bad (1), Poor (2), Fair (3), Good (4), Excellent (5). The average of the scores, known as a Mean Opinion Score (“MOS”), is then calculated and used to characterize the performance of speech codecs, transmission equipment, and networks. Other kinds of subjective tests and scoring schemes may also be used, e.g., degradation mean opinion scores (“DMOS”). Regardless of the scoring scheme, subjective listening tests are time-consuming and costly.
Machine-automated, “objective” measurement is known as an alternative to subjective listening tests. Objective measurement provides a rapid and economical means to estimate user opinion, and makes it possible to perform real-time speech quality measurement on a network-wide scale. Objective measurement can be performed either intrusively or non-intrusively. Intrusive measurement, also called double-ended or input-output-based measurement, is based on measuring the distortion between the received and transmitted speech signals, often with an underlying requirement that the transmitted signal be a “clean” signal of high quality. Non-intrusive measurement, also called single-ended or output-based measurement, does not require the clean signal to estimate quality. In a working commercial network it may be difficult to provide both the clean signal and the received speech signal to the test equipment because of the distances between endpoints. Consequently, non-intrusive techniques should be more practical for implementation outside of a test facility because they do not require a clean signal.
Several non-intrusive measurement schemes are known. In C. Jin and R. Kubichek, “Vector quantization techniques for output-based objective speech quality,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. 1, May 1996, pp. 491-494, comparisons between features of the received speech signal and vector quantizer (“VQ”) codebook representations of the features of clean speech are used to estimate quality. In W. Li and R. Kubichek, “Output-based objective speech quality measurement using continuous hidden Markov models,” in Proc. 7th Int. Symp. Signal Processing Applications, vol. 1, July 2003, pp. 389-392, the VQ codebook reference is replaced with a hidden Markov model. In P. Gray, M. P. Hollier, and R. E. Massara, “Non-intrusive speech-quality assessment using vocal-tract models,” Proc. Inst. Elect. Eng., Vision, Image, Signal Process., vol. 147, no. 6, pp. 493-501, December 2000 and D. S. Kim, “ANIQUE: An auditory model for single-ended speech quality estimation,” IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 821-831, September 2005, vocal tract modeling and modulation-spectral features derived from the temporal envelope of speech, respectively, provide quality cues for non-intrusive quality measurement. More recently, a non-intrusive method using neurofuzzy inference was proposed in G. Chen and V. Parsa, “Nonintrusive speech quality evaluation using an adaptive neurofuzzy inference system,” IEEE Signal Process. Lett., vol. 12, no. 5, pp. 403-406, May 2005. The International Telecommunication Union ITU-T P.563 standard represents the “state-of-the-art” algorithm: ITU-T P.563, Single Ended Method for Objective Speech Quality Assessment in Narrow-Band Telephony Applications, International Telecommunication Union, Geneva, Switzerland, May 2004. However, each of these known non-intrusive measurement schemes is computationally intensive relative to the capabilities of equipment which could currently be widely deployed at low cost.
Consequently, a less computationally intensive non-intrusive solution would be desirable in order to facilitate deployment outside of test facilities.
In accordance with one embodiment of the invention, a single-ended speech quality measurement method comprises the steps of: extracting perceptual features from a received speech signal; assessing the perceptual features with at least one statistical model of the features to form indicators of speech quality; and employing the indicators of speech quality to produce a speech quality score.
In accordance with another embodiment of the invention, apparatus operable to provide a single-ended speech quality measurement, comprises: a feature extraction module operable to extract perceptual features from a received speech signal; a statistical reference model and consistency calculation module operable in response to output from the feature extraction module to assess the perceptual features to form indicators of speech quality; and a scoring module operable to employ the indicators of speech quality to produce a speech quality score.
One advantage of the inventive technique is reduction of processing requirements for speech quality measurement without significant degradation in performance. Simulations with Perceptual Linear Prediction (“PLP”) coefficients have shown that the inventive technique can outperform P.563 by up to 44.74% in correlation R for SMV-coded speech under noisy conditions. The inventive technique is comparable to P.563 under various other conditions. An average 40% reduction in processing time was obtained compared to P.563, even though P.563 was implemented in a faster procedural computer language than the interpreted language used to run the inventive technique. Thus, the speedup that can be obtained from the inventive technique programmed with a procedural language such as C is expected to be much greater.
Referring now to the feature extraction module (102), perceptual linear prediction (“PLP”) cepstral coefficients serve as primary features and are extracted from the speech signal every 10 ms. The coefficients are obtained from an “auditory spectrum” constructed to exploit three psychoacoustic precepts: critical-band spectral resolution, the equal-loudness curve, and the intensity-loudness power law. The auditory spectrum is approximated by an all-pole auto-regressive model, the coefficients of which are transformed to PLP cepstral coefficients. The order of the auto-regressive model determines the amount of detail in the auditory spectrum preserved by the model. Higher order models tend to preserve more speaker-dependent information. Since the illustrated embodiment is directed to measuring quality variation due to the transmission system rather than the speaker, speaker independence is a desirable property. In the illustrated embodiment, fifth-order PLP coefficients as described in H. Hermansky, “Perceptual linear prediction (PLP) analysis of speech,” J. Acoust. Soc. Amer., vol. 87, pp. 1738-1752, 1990, (“Hermansky”), which is incorporated by reference, are employed as speaker-independent speech spectral parameters. Other types of features, such as RASTA-PLP, may also be employed in lieu of PLP.
Referring now to the time segmentation module (104), time segmentation is employed to separate the speech frames into different classes. Each class appears to exert a different influence on the overall speech quality. Time segmentation is performed using a voice activity detector (“VAD”) and a voicing detector. The VAD identifies each 10-ms speech frame as being active or inactive. The voicing detector further labels active frames as voiced or unvoiced. In the illustrated embodiment the VAD from ITU-T Rec. G.729-Annex B, A Silence Compression Scheme for G.729 Optimized for Terminals Conforming to Recommendation V.70, International Telecommunication Union, Geneva, Switzerland, November 1996, which is incorporated by reference, is employed.
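The three-way labeling described above can be sketched as follows. This is only an illustrative sketch: the function name segment_frames and its inputs are hypothetical, and the per-frame VAD and voicing decisions are assumed to be supplied by external detectors (the G.729 Annex B VAD itself is not implemented here).

```python
def segment_frames(features, is_active, is_voiced):
    """Group per-frame feature vectors into voiced, unvoiced and inactive classes.

    features  : list of per-frame feature vectors (one per 10-ms frame)
    is_active : list of booleans from a VAD (assumed given)
    is_voiced : list of booleans from a voicing detector (assumed given)
    """
    classes = {"voiced": [], "unvoiced": [], "inactive": []}
    for vec, active, voiced in zip(features, is_active, is_voiced):
        if not active:
            # Inactive frames are never passed to the voicing decision.
            classes["inactive"].append(vec)
        elif voiced:
            classes["voiced"].append(vec)
        else:
            classes["unvoiced"].append(vec)
    return classes
```

Each resulting list can then be scored against the reference model of its own class.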
Referring to the GMM reference model (106), where $u$ is a $K$-dimensional feature vector, a Gaussian mixture density is a weighted sum of $M$ component densities:
$$p(u \mid \lambda) = \sum_{i=1}^{M} \alpha_i \, b_i(u)$$
where $\alpha_i \ge 0$, $i = 1, \ldots, M$, are the mixture weights, with
$$\sum_{i=1}^{M} \alpha_i = 1$$
and $b_i(u)$, $i = 1, \ldots, M$, are $K$-variate Gaussian densities with mean vector $\mu_i$ and covariance matrix $\Sigma_i$. The parameter list $\lambda = \{\lambda_1, \ldots, \lambda_M\}$ defines a particular Gaussian mixture density, where $\lambda_i = \{\mu_i, \Sigma_i, \alpha_i\}$. GMM parameters are initialized using the k-means algorithm described in A. Gersho and R. Gray, Vector Quantization and Signal Compression. Norwell, Mass.: Kluwer, 1992, which is incorporated by reference, and estimated using the expectation-maximization (“EM”) algorithm described in A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. Royal Statistical Society, vol. 39, pp. 1-38, 1977, which is incorporated by reference. The EM algorithm iterations produce a sequence of models with monotonically non-decreasing log-likelihood (“LL”) values. The algorithm is deemed to have converged when the difference of LL values between two consecutive iterations drops below $10^{-3}$.
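As an illustration of evaluating a Gaussian mixture density and of the EM stopping rule, a minimal sketch follows. It assumes diagonal covariance matrices purely for simplicity (the text does not state the covariance structure), and the function names are hypothetical. The density is computed in the log domain to avoid numerical underflow.

```python
import math

def gaussian_diag_logpdf(u, mean, var):
    """Log of a K-variate Gaussian density, ASSUMING a diagonal covariance."""
    k = len(u)
    log_det = sum(math.log(v) for v in var)
    maha = sum((x - m) ** 2 / v for x, m, v in zip(u, mean, var))
    return -0.5 * (k * math.log(2.0 * math.pi) + log_det + maha)

def gmm_logpdf(u, weights, means, variances):
    """log p(u | lambda): log of the weighted sum of component densities."""
    terms = [math.log(a) + gaussian_diag_logpdf(u, m, v)
             for a, m, v in zip(weights, means, variances) if a > 0.0]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def em_converged(prev_ll, curr_ll, tol=1e-3):
    """Stopping rule: LL gain between consecutive iterations is below tol."""
    return (curr_ll - prev_ll) < tol
```

For example, gmm_logpdf([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]]) evaluates a two-component mixture whose components coincide, so it equals the log-density of a single standard Gaussian at zero.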
Referring specifically to the reference model (106), a GMM is used to model the PLP cepstral coefficients of each class of speech frames. For instance, consider the class of clean speech signals. Three different Gaussian mixture densities $p_{\text{class}}(u \mid \lambda)$ are trained. The subscript “class” represents either voiced, unvoiced, or inactive frames. In principle, by evaluating a statistical model at the PLP cepstral coefficients $x$ of the test signal, i.e., $p_{\text{class}}(x \mid \lambda)$, a measure of consistency between the coefficient vector and the statistical model is obtained. Voiced coefficient vectors are applied to $p_{\text{voiced}}(u \mid \lambda)$, unvoiced vectors to $p_{\text{unvoiced}}(u \mid \lambda)$, and inactive vectors to $p_{\text{inactive}}(u \mid \lambda)$.
Referring now to the consistency calculation module (108), it should be noted that a simplifying assumption is made that vectors between frames are independent. Improved performance might be obtained from more sophisticated approaches that model the statistical dependency between frames, such as Markov modeling. Nevertheless, a model with low computational complexity has benefits as already discussed above. For a given speech signal whose feature vectors have been classified as described above, the consistency between the feature vectors of a class and the statistical model of that class is calculated as
$$C_{\text{class}} = \frac{1}{N_{\text{class}}} \sum_{j=1}^{N_{\text{class}}} \log p_{\text{class}}(x_j \mid \lambda)$$
where $x_1, \ldots, x_{N_{\text{class}}}$ are the feature vectors in the class, and $N_{\text{class}}$ is the number of such vectors in the statistical model class. Larger $C_{\text{class}}$ indicates greater consistency. $C_{\text{class}}$ is set to zero whenever $N_{\text{class}}$ is zero. For each class, the product of the consistency measure $C_{\text{class}}$ and the fraction of frames of that class in the speech signal is calculated. The products for all the model classes serve as quality indicators to be mapped to an objective estimate of the subjective score value.
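The consistency calculation and the resulting quality indicators can be sketched as below. The sketch assumes that the consistency measure is the average per-frame log-likelihood of a class's vectors under that class's model, which follows from the frame-independence assumption. The function names and the dictionary layout are hypothetical.

```python
def consistency(log_likelihoods):
    """C_class: mean per-frame log-likelihood, or zero when the class is empty."""
    n = len(log_likelihoods)
    return sum(log_likelihoods) / n if n else 0.0

def quality_indicators(ll_by_class):
    """Weight each class's consistency by that class's share of all frames.

    ll_by_class maps a class name ("voiced", "unvoiced", "inactive") to the
    list of log p_class(x_j | lambda) values for the frames in that class.
    """
    total = sum(len(lls) for lls in ll_by_class.values())
    indicators = {}
    for name, lls in ll_by_class.items():
        fraction = len(lls) / total if total else 0.0
        indicators[name] = consistency(lls) * fraction
    return indicators
```

A class with no frames contributes an indicator of zero, matching the rule that the consistency is set to zero when the class is empty.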
Referring now to the mapping module (110), mapping functions which may be utilized include multivariate polynomial regression and multivariate adaptive regression splines (“MARS”), as described in J. H. Friedman, “Multivariate adaptive regression splines,” The Annals of Statistics, vol. 19, no. 1, pp. 1-141, March 1991. With MARS, the mapping is constructed as a weighted sum of basis functions, each taking the form of a truncated spline $(z - t)_+^q$ or $(t - z)_+^q$, where $z$ is one of the quality indicators, $t$ is the knot location, $q$ is the spline order, and the subscript $+$ denotes the positive part of the argument.
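A MARS-style mapping from quality indicators to a score can be sketched as a weighted sum of products of truncated splines. This is a hedged illustration only: the function names are hypothetical, and the knots, weights and intercept in the usage note are made-up values, not fitted parameters from the described embodiment.

```python
def hinge(z, knot, sign=+1, order=1):
    """Truncated spline: positive part of sign * (z - knot), raised to order."""
    return max(0.0, sign * (z - knot)) ** order

def mars_predict(indicators, intercept, terms):
    """Evaluate a weighted sum of truncated-spline basis functions.

    indicators : list of quality indicator values
    terms      : list of (weight, factors), where factors is a list of
                 (indicator_index, knot, sign) tuples multiplied together
    """
    score = intercept
    for weight, factors in terms:
        basis = 1.0
        for idx, knot, sign in factors:
            basis *= hinge(indicators[idx], knot, sign)
        score += weight * basis
    return score
```

With the hypothetical values mars_predict([-2.0, -1.0, 0.0], 3.0, [(0.5, [(0, -3.0, +1)]), (-1.0, [(1, 0.0, -1)])]), both basis functions evaluate to 1, giving 3.0 + 0.5 - 1.0 = 2.5.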
While the invention is described through the above exemplary embodiments, it will be understood by those of ordinary skill in the art that modification to and variation of the illustrated embodiments may be made without departing from the inventive concepts herein disclosed. Moreover, while the preferred embodiments are described in connection with various illustrative structures, one skilled in the art will recognize that the system may be embodied using a variety of specific structures. Accordingly, the invention should not be viewed as limited except by the scope and spirit of the appended claims.