The invention relates to measuring speech intelligibility, and more specifically, to measuring speech intelligibility using acoustic correlates of distinctive features.
Distinctive features of speech are the fundamental characteristics that make each phoneme in all the languages of the world unique, and are described in Jakobson, R., C. G. M. Fant, and M. Halle, PRELIMINARIES TO SPEECH ANALYSIS: THE DISTINCTIVE FEATURES AND THEIR CORRELATES (MIT Press, Cambridge, Mass.; 1961) (hereinafter “Jakobson et al.”), the disclosure of which is hereby incorporated by reference herein in its entirety. They function to discriminate each phoneme from all others and as such are traditionally identified by the binary extremes of each feature's range. Jakobson et al. defined twelve features that fully discriminate the world's phonemes: 1) vocalic/non-vocalic, 2) consonantal/non-consonantal, 3) compact/diffuse, 4) grave/acute, 5) flat/plain, 6) nasal/oral, 7) tense/lax, 8) continuous/interrupted, 9) strident/mellow, 10) checked/unchecked, 11) voiced/unvoiced, and 12) sharp/plain.
Distinctive features are phonological, developed primarily to express in a simple manner the rules of a language for combining phonetic segments into meaningful words, and are described in Mannell, R., Phonetics & Phonology topics: Distinctive Features, http://clas.mq.edu.au/speech/phonetics/phonology/featurcs/index.html (accessed Feb. 18, 2009) (hereinafter “Mannell”), the disclosure of which is hereby incorporated by reference herein in its entirety. However, distinctive features are manifest in spoken language through acoustic correlates. For example, “compact” denotes a clustering of formants, while “diffuse” denotes a wide range of formant frequencies of a phoneme. All twelve distinctive features may be expressed in terms of acoustic correlates, as described in Jakobson et al., which are measurable from speech waveforms. Jakobson et al. suggest measures for acoustic correlates; however, such measures are neither unique nor optimal in any sense, and many measures exist which may be used as acoustic correlates of distinctive features.
Distinctive features, through acoustic correlates, are naturally related to speech intelligibility, because a change in distinctive feature (e.g., tense to lax) results in a change in phoneme (e.g., /p/ to /b/) which produces different words when used in the same context (e.g., “pat” and “bat” are distinct English words). Highly intelligible speech contains phonemes that are easily recognized (quantified variously by listener cognitive load or noise robustness) and exhibits acoustic correlates that are highly separable. Conversely, speech of low intelligibility contains phonemes that are easily confused with others and exhibits acoustic correlates that are not highly separable. Therefore, the separability of acoustic correlates of distinctive features is a measure of the intelligibility of speech. Separation of acoustic correlates of distinctive features may be measured in several ways. Distinctive features naturally separate into binary classes, so classification methods may be used to map acoustic correlates to speech intelligibility. Binary classes, however, do not produce sufficient differentiation between the distinctive features. What is needed, then, is a method that measure speech intelligibility with higher resolution than the known binary classes.
In one aspect, the invention relates to a method for measuring speech intelligibility, the method including the steps of inputting a speech waveform, extracting at least one acoustic feature from the waveform, segmenting at least one phoneme from the at least one first acoustic feature, extracting at least one acoustic correlate measure from the at least one phoneme, determining at least one intelligibility measure, and mapping the at least one acoustic correlate measure to the at least one intelligibility measure. In an embodiment, the speech waveform is input from a talker. In another embodiment, the speech waveform is based at least in part on a stimulus sent to the talker. In another embodiment, the at least one acoustic feature is extracted utilizing a frame-based procedure. In yet another embodiment, the at least one acoustic correlate measure is extracted utilizing a segment-based procedure. In still another embodiment, the at least one intelligibility measure includes a vector.
In an embodiment of the above aspect, the vector expresses the acoustic correlate measure in a non-binary value. In another embodiment, the non-binary value has a value in a range from −1 to +1. In another embodiment, the non-binary value has a value in a range from 0% to 100%.
In another aspect, the invention relates to an article of manufacture having computer-readable program portions embedded thereon for measuring speech intelligibility, the program portions including instructions for inputting a speech waveform from a talker, instructions for extracting at least one acoustic feature from the waveform, instructions for segmenting at least one phoneme from the at least one first acoustic feature, instructions for extracting at least one acoustic correlate measure from the at least one phoneme, instructions for determining at least one intelligibility measure, and instructions for mapping the at least one acoustic correlate measure to the at least one intelligibility measure.
In another aspect, the invention relates to a system for measuring speech intelligibility, the system including a receiver for receiving a speech waveform from a talker, a first extractor for extracting at least one acoustic feature from the waveform, a first processor for segmenting at least one phoneme from the at least one first acoustic feature, a second extractor for extracting at least one acoustic correlate measure from the at least one phoneme, a second processor for determining at least one intelligibility measure, and a mapping module for mapping the at least one acoustic correlate measure to the at least one intelligibility measure. In an embodiment, the system includes a system processor including the first extractor, the first processor, the second extractor, the second processor, and the mapping module.
In another aspect, the invention relates to a method of measuring speech intelligibility, the method including the step of utilizing a non-binary value to characterize a distinctive feature of speech. In another aspect, the invention is related to a speech analysis system utilizing the above-recited method. In another aspect, the invention is related to a speech rehabilitation system utilizing the above-recited method.
In another aspect, the invention relates to a method of tuning a hearing device, the method including the steps of sending a stimulus to a hearing device associated with a user, receiving a user response, wherein the user response is based at least in part on the stimulus, measuring an intelligibility value of the user response, comparing the stimulus to the intelligibility value, determining an error associated with the comparison, and adjusting at least one parameter of the hearing device based at least in part on the error. In an embodiment, the user response includes a distinctive feature of speech. In another embodiment, the error is determined based at least in part on a non-binary value characterization of the distinctive feature of speech. In yet another embodiment, the error is determined based at least in part on a binary value characterization of the distinctive feature of speech. In still another embodiment, the adjustment is based at least in part on a prior knowledge of a relationship between the intelligibility value and a parameter of the hearing device.
There are shown in the drawings, embodiments which are presently preferred, it being understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown.
The HMM may be trained as phoneme models, bi-phone models, N-phone models, syllable models or word models. A Viterbi path of the speech waveform through the HMM may be used for segmentation, so the phonemic representation of each state in the HMM is required. Phonemic representation of each state may utilize hand-labeling phoneme boundaries for the HMM training data. Specific states are assigned to specific phonemes (more than one state may be used to represent each phoneme for all types of HMMs).
Because segmentation is performed using an ASR engine, the acoustic feature extraction process may be a conventional ASR front end. Human factor cepstral coefficients (HFCCs) a spectral flatness measure, a voice bar measure (e.g., energy between 200 and 400 Hz), and delta and delta-delta coefficients as acoustic features may be utilized. HFCCs and delta and delta-delta coefficients are described in Skowronski, M. D. and J. G. Harris, “Exploiting independent filter bandwidth of human factor cepstral coefficients in automatic speech recognition,” J. Acoustical Society of America, vol. 116, no. 3, pp. 1774-1780, September 2004 (hereinafter “Skowronski et al. 2004”), the disclosure of which is hereby incorporated by reference herein in its entirety. Spectral flatness measure is described in Skowronski, M. D. and J. G. Harris, “Applied principles of clear and Lombard speech for intelligibility enhancement in noisy environments,” Speech Communication, vol. 48, no. 5, pp. 549-558, May 2006 (hereinafter “Skowronski et al. 2006”), the disclosure of which is hereby incorporated by reference herein in its entirety. Acoustic features may be measured for each analysis frame (20 ms duration), with uniform overlap (10 ms) between adjacent frames. Analysis frames and overlaps having other durations and times are contemplated.
Acoustic correlates for each phoneme of the speech waveform are then measured from segmented regions (Step 108). The correlates may include HFCC calculated over a single window spanning the entire region of a phoneme (which may be much longer than 20 ms), a single voice bar measure, and/or a single spectral flatness measure, augmented with several other acoustic correlates. Various other acoustic correlates may be appended to the set of correlates listed above that provide additional information targeting specific distinctive features of phonemes. Jakobson et al. suggest several measures including, but not limited to, main-lobe width of an autocorrelation function of the acoustic waveform in the segmented region, ratio of low-frequency to high-frequency energy, ratio of energy at the beginning and end of the segment, ratio of maximum to minimum spectral density (calculated variously by direct spectral measurement or from any spectral envelope estimate such as that from linear prediction), the spectral second moment, plosive burst duration, ratio of plosive burst energy to overall phoneme energy, and formant frequency and bandwidth estimates.
The acoustic correlates for each phoneme are then mapped to the intelligibility measures by a mapper function (Step 110). The intelligibility measures may comprise a vector of values (one for each distinctive feature) that quantifies the degree to which each distinctive feature is expressed in the acoustic correlates for each phoneme, ranging from 0% to 100%. For example, a phoneme with more low-frequency energy than high-frequency energy will produce an intelligibility measure for the distinctive feature grave/acute close to 100%, while a phoneme dominated by noise-like properties will produce an intelligibility measure for strident/mellow close to 100%. Phonemes may be coarticulated, so the acoustic correlates of neighboring phonemes may be included as input to the mapper function in producing the intelligibility measure for the central phoneme of interest.
The mapper function maps the input space (acoustic correlates) to the output space (intelligibility measures). No language in the world requires all twelve distinctive features to identify each phoneme of that language, so the size of the output space various with each language. For English, the first nine distinctive features listed above are sufficient to identify each phoneme. Thus, the output space of the mapper function for English phonemes contains nine dimensions. The mapper function may be any linear or nonlinear method for combining the acoustic correlates to produce intelligibility measures. Because the output space is of limited range and the intelligibility measures may be used to discriminate phonemes, the mapper function may be implemented with a feed-forward artificial neural network (ANN). Sigmoid activation functions may be utilized in the output layer of the ANN to ensure a limited range of the output space. The particular architecture of the ANN (number and size of each network layer) may vary by application. In certain embodiments, three layers may be utilized. It is generally desirable for the input layer to be the same size as the input space and for the output layer to be the same size as the output space. At least one hidden layer may ensure that the ANN may approximate any nonlinear function. The mapper function may be trained using the same speech data used to train the HMM segmenter. The output of the ANN may be trained using binary target values for each distinctive feature.
The intelligibility measure us then estimated (Step 112), using a one or more processes. In one embodiment, the intelligibility measure is estimated from acoustic correlates using a neural network mapping function, the measured values are referred to as continuous-valued distinctive features (CVDFs). CVDFs are in the range of about −1 to about +1. In certain embodiments, CVDFs are in the range of −1 to +1 and may be converted to percentages by the equation:
CVDFs may be transformed for normality considerations by using the inverse of the neural network output activation function, producing inverse CVDFs (iCVDFs):
In another embodiment, the intelligibility measure may be estimated as a probability using likelihood models for the positive and negative groups of each distinctive feature. The distribution of acoustic correlates may be modeled using an appropriate likelihood model (e.g., mixture of Gaussians). To train a pair of models for a distinctive feature, the available speech database is divided into two groups, one for all phonemes with a positive value for the distinctive feature and one for all phonemes with a negative value for the distinctive feature. Acoustic correlates are extracted and used to train a statistical model for each group. To use the models, the acoustic correlates of a speech input are extracted, then the likelihoods from each pair of models for each distinctive feature are calculated. The likelihoods for a distinctive feature are combined using Bayes' Rule to produce a probability that the speech input exhibits the positive and negative value of the distinctive feature. Distinctive feature a priori probabilities may be included in Bayes' Rule based on feature distributions of the target language (e.g., English contains only three nasal phonemes while the rest are oral). When the intelligibility measure is estimated from acoustic correlates using a statistical model, the measured values are referred to as distinctive feature probabilities (DFPs).
The proposed intelligibility measure quantifies the distinctiveness of speech and is useful in many applications. One series of applications uses the change in the proposed intelligibility measure to quantify the change in speech from a talker due to a treatment. The talker may be undergoing speech or auditory therapy, and the intelligibility measure may be used to quantify progress. A related application is to quantify the changes in speech due to changes in the parameters of a hearing instrument then use that knowledge to fit a hearing device (i.e., hearing aids, cochlear implants) to a patient, as described below.
Hearing devices are endowed with tunable parameters so that the devices may be customized to compensate for an individual's hearing loss. The hearing device modifies the acoustic properties of sounds incident to an individual to enhance the perception of the characteristics of the sounds for the purposes of detection and recognition. One method for tuning hearing device parameters includes using a stimulus/response test paradigm to access the effects of a hearing device parameter set on the perception of speech for an individual hearing device user. Thereafter, each stimulus/response pair are compared to estimate a difference in speech properties. The method then converts the differences in speech properties of the stimulus/response pairs to a change in the device parameter set using prior knowledge of the relationship between device parameters and speech properties.
In a proposed method of testing using the system 200 of
In certain embodiments of the system and method, isolated vowel-consonant-vowel (VCV) nonsense words may be used as the speech material 206 with variation in the consonant (e.g., /aba/, /ada/, /afa/). Isolated VCV stimulus words are easy to compare with responses, producing primarily substitution errors of the consonant (e.g., /aba/ recognized as /apa/). The initial and final vowels provide context for the consonant phonemes. The fact that the words are nonsensical significantly reduces the influence of language on the responses (i.e., prevents a patient from guessing at the correct response).
The S/R comparator 218 uses distinctive features (DFs) of speech, as described in Jakobson et al., to compare the stimulus 214 and response 216 for each pair. DFs are binary subunits of phonemes that uniquely encode each phoneme in a language. For example, the English language is described by a set of nine DFs: {vocalic, consonantal, compact, grave, flat, nasal, tense, continuant, strident}. Other phonological theories, such as those presented in Chomsky, N. and Halle, M., THE SOUNDS PATTERN OF ENGLISH (Harper and Row, New York; 1968), present alternative DF sets, any of which are appropriate for S/R comparison. The disclosure of Chomsky is hereby incorporated by reference herein in its entirety. The DFs of the S/R pairs are compared to produce an error:
E
t(f)=F(Et,+(f),Et,−(f),N)
where
The errors Et,+(f) and Et,−(f) may also be tabulated from continuous-valued distinctive features (CVDFs), as described above with regard to
Other functions F(·) may be utilized, such as those that incorporate prior knowledge of the distributions of Et,+(f) and Et,−(f) for random S/R pairs. The function F(•) may also include importance weights based on the distributions of DFs in the language of the stimuli.
Hearing devices typically have many tunable parameters (some have more than 100 tunable parameters), which makes optimizing each parameter independently a challenge due to the combinatorially large number of possible parameter sets. To circumvent the difficulties of optimization in a large parameter space, a low-dimensional model of independent parameters may be imposed onto the set of hearing device parameters such that the hearing device parameters (or a subset of hearing device parameters) are derived from the low-dimensional model.
One low-dimensional model that may be employed is bump-tilt-gain (BTG) that uses five parameters: {bump gain, bump quality, bump center frequency, tilt slope, overall gain}. BTG, in one instance, describes a filter that distributes energy across frequency which affects spectral cues and, consequently, speech intelligibility. It is desirable for the hearing device 208 to include the capability of implementing BTG.
The prior knowledge 222 represents the relationship between speech properties and tunable device or device model parameters. The relationship is determined prior to a patient's tuning session, based on either expert knowledge or experiments measuring the effects of tunable parameters on speech. Prior knowledge of the relationship between DFs and BTG parameters may be presented in a master table, where each row represents a unique parameter set β and each column represents the effect of β on each DF, averaged over all utterances of the speech material in a speech database. For example, the baseline parameter set β0 (zero bump gain and zero tilt slope) has no effect on DFs, while a different parameter set with nonzero bump gain and/or tilt slope may cause speech to become more grave, more compact, and less nasal compared to β0.
To help quantify the magnitude of change in DFs in the master table, CVDFs may be used for finer resolution of distinctive features. Because CVDFs are not normally distributed, they may be transformed CVDFs to inverse CVDFs (iCVDFs):
Inverse CVDFs are more normally distributed, which facilitates averaging over all utterances of speech material in a speech database. For greater statistical power, ΔiCVDF for each utterance is measured as the difference in iCVDFs between β and β0. The master table was filled by averaging ΔiCVDFs over all utterances:
where
Prior knowledge of the relationship between DFs and BTG parameter sets may be in other forms besides a master table. The master table is used by the optimization algorithm (described below) in a non-parametric classifier (nearest neighbor), but a parametric classifier may also be used which requires the prior knowledge to be in the form of model parameters learned from utterances of speech material in a speech database.
The optimization algorithm 220 combines the measured error in speech properties with prior knowledge to produce a new parameter set for the next test. Using errors in DFs, Et(f), and prior knowledge in the form of master table entries Kβ(f), the parameter set for test t+1, βt+1, is determined as follows:
where
The errors Et(f) are scaled by step size δ(f) then combined with the current master table entry Kβt(f) as an offset. The offset entry is then compared with all master table entries, and β of the closest entry in a mean-squared sense is returned. The step size parameter δ(f) performs several functions. For example, it normalizes the variances between Et(f) and Kβ(f), controls the step size of movement in ΔiCVDF space, and weights the importance of each feature.
In the applications described above in
Another application of the intelligibility measure is to evaluate the distinctiveness of speech material used in listening tests and psychoacoustic evaluations. Performance on such tests varies due to several factors, and the proposed intelligibility measure may be used to explain part of the variation in performance due to speech material distinctiveness variation. The intelligibility measure may also be used to screen speech material for such tests to ensure uniform distinctiveness.
The testing methods and systems may be performed on a computer testing system 300 such as that depicted in
After a signal is presented, the user is given enough time to make a sound signal representing what he heard. The output signal corresponding to each input signal is recorded. The output signal 310 may be a sound repeated by the user 308 into a microphone 312. The resulting analog signal 314 is converted by an analog/digital converter 316 into a digital signal 318 delivered to the processor 320. Alternatively, the user 308 may type a textual representation of the sound heard into a keyboard 322. In the processor 320, the output signal 310 is stored and compared to the immediately preceding stimulus.
The S/R comparator (
The present invention can be realized in hardware, software, or a combination of hardware and software. The present invention can be realized in a centralized fashion in one computer system, or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system or other apparatus adapted for carrying out the methods described herein is suited. A typical combination of hardware and software can be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein.
The present invention also can be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which when loaded in a computer system is able to carry out these methods. Computer program in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code or notation; b) reproduction in a different material form.
In the embodiments described above, the software may be configured to run on any computer or workstation such as a PC or PC-compatible machine, an Apple Macintosh, a Sun workstation, etc. In general, any device can be used as long as it is able to perform all of the functions and capabilities described herein. The particular type of computer or workstation is not central to the invention, nor is the configuration, location, or design of a database, which may be flat-file, relational, or object-oriented, and may include one or more physical and/or logical components.
The servers may include a network interface continuously connected to the network, and thus support numerous geographically dispersed users and applications. In a typical implementation, the network interface and the other internal components of the servers intercommunicate over a main bi-directional bus. The main sequence of instructions effectuating the functions of the invention and facilitating interaction among clients, servers and a network, can reside on a mass-storage device (such as a hard disk or optical storage unit) as well as in a main system memory during operation. Execution of these instructions and effectuation of the functions of the invention is accomplished by a central-processing unit (“CPU”).
A group of functional modules that control the operation of the CPU and effectuate the operations of the invention as described above can be located in system memory (on the server or on a separate machine, as desired). An operating system directs the execution of low-level, basic system functions such as memory allocation, file management, and operation of mass storage devices. At a higher level, a control block, implemented as a series of stored instructions, responds to client-originated access requests by retrieving the user-specific profile and applying the one or more rules as described above.
Communication may take place via any media such as standard telephone lines, LAN or WAN links (e.g., T1, T3, 56 kb, X.25), broadband connections (ISDN, Frame Relay, ATM), wireless links, and so on. Preferably, the network can carry TCP/IP protocol communications, and HTTP/HTTPS requests made by the client and the connection between the client and the server can be communicated over such TCP/IP networks. The type of network is not a limitation, however, and any suitable network may be used. Typical examples of networks that can serve as the communications network include a wireless or wired Ethernet-based intranet, a local or wide-area network (LAN or WAN), and/or the global communications network known as the Internet, which may accommodate many different communications media and protocols.
While there have been described herein what are to be considered exemplary and preferred embodiments of the present invention, other modifications of the invention will become apparent to those skilled in the art from the teachings herein. The particular methods of manufacture and geometries disclosed herein are exemplary in nature and are not to be considered limiting. It is therefore desired to be secured in the appended claims all such modifications as fall within the spirit and scope of the invention. Accordingly, what is desired to be secured by Letters Patent is the invention as defined and differentiated in the following claims, and all equivalents.
This application claims the benefit of U.S. Provisional Patent Application No. 61/164,454, filed Mar. 29, 2009, and U.S. Provisional Patent Application No. 61/262,482, filed Nov. 18, 2009, the disclosures of which are hereby incorporated by reference herein in their entireties.
Number | Date | Country | |
---|---|---|---|
61164454 | Mar 2009 | US | |
61262482 | Nov 2009 | US |