The discussion below is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
A pattern recognition system, such as a speech recognition system or a handwriting recognition system, takes an input signal and attempts to decode the signal to find a pattern represented by the signal. For example, in a speech recognition system, a speech signal (often referred to as a test signal) is received by the recognition system and is decoded to identify a string of words represented by the speech signal.
Many pattern recognition systems utilize models in which units are represented by a single tier of connected states. Using a training signal, probability distributions for occupying the states and for transitioning between states are determined for each of the units. In speech recognition, phonetic units are used. To decode a speech signal, the signal is divided into frames and each frame is transformed into a feature vector. The feature vectors are then compared to the distributions for the states to identify a most likely sequence of states that can be represented by the frames. The phonetic unit that corresponds to that sequence is then selected.
This Summary is provided to introduce some concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
A Weighted Likelihood Ratio Hidden Markov Model is utilized for speech processing. The model emphasizes spectral peaks when comparing spectra. Probability density functions for states in the model can be developed with weights based on the comparison.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices. Tasks performed by the programs and modules are described below and with the aid of figures. Those skilled in the art can implement the description and figures as processor executable instructions, which can be written on any form of a computer readable medium.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
A-to-D converter 206 converts the analog signal from microphone 204 into a series of digital values. In several embodiments, A-to-D converter 206 samples the analog signal at 16 kHz and 16 bits per sample, thereby creating 32 kilobytes of speech data per second. These digital values are provided to a frame constructor 207, which, in one embodiment, groups the values into 25 millisecond frames that start 10 milliseconds apart.
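The sampling and framing arithmetic above can be sketched as follows. This is a minimal illustration, not the frame constructor 207 itself; the function names and the choice to drop a trailing partial frame are assumptions made for the example.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz sampling, as in the embodiment above
BYTES_PER_SAMPLE = 2      # 16 bits per sample
FRAME_MS = 25             # frame length
SHIFT_MS = 10             # spacing between frame starts


def bytes_per_second(sample_rate_hz=SAMPLE_RATE_HZ,
                     bytes_per_sample=BYTES_PER_SAMPLE):
    """Data rate produced by the A-to-D converter."""
    return sample_rate_hz * bytes_per_sample


def split_into_frames(samples, sample_rate_hz=SAMPLE_RATE_HZ,
                      frame_ms=FRAME_MS, shift_ms=SHIFT_MS):
    """Group samples into overlapping frames.

    Assumption: a trailing partial frame is dropped rather than padded.
    """
    frame_len = sample_rate_hz * frame_ms // 1000   # 400 samples at 16 kHz
    shift = sample_rate_hz * shift_ms // 1000       # 160 samples at 16 kHz
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames
```

With these settings each frame holds 400 samples and consecutive frames overlap by 240 samples, so one second of audio yields 98 complete frames.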
The frames of data created by frame constructor 207 are provided to feature extractor 208, which extracts a feature from each frame. Examples of feature extraction modules include modules for performing Linear Predictive Coding (LPC), LPC derived cepstrum, Perceptual Linear Prediction (PLP), auditory model feature extraction, and Mel-Frequency Cepstrum Coefficients (MFCC) feature extraction. Note that system 200 is not limited to these feature extraction modules and that other modules may be used within the context of system 200.
The feature extraction module 208 produces a stream of feature vectors that are each associated with a frame of the speech signal. This stream of feature vectors is provided to a decoder 212, which identifies a most likely sequence of words based on the stream of feature vectors, a lexicon 214, a language model 216 (for example, based on an N-gram, context-free grammars, or hybrids thereof), and an acoustic model 218.
The most probable sequence of hypothesis words is provided to a confidence measure module 220. Confidence measure module 220 identifies which words are most likely to have been improperly identified by the speech recognizer, based in part on a secondary acoustic model (not shown). Confidence measure module 220 then provides the sequence of hypothesis words to an output module 222 along with identifiers indicating which words may have been improperly identified. Those skilled in the art will recognize that confidence measure module 220 is not necessary for the operation of system 200.
During training, a speech signal corresponding to training text 226 is input to trainer 224, along with a lexical transcription of the training text 226. Trainer 224 trains acoustic model 218 based on the training inputs. Acoustic model 218 is intended to be one example implementation of a model. Other types of pattern recognition systems, such as handwriting recognition systems, can utilize the subject matter described herein.
WLR
Acoustic model 218 can be developed as a Weighted Likelihood Ratio (WLR) Hidden Markov Model (HMM). The model can be applied to various different languages. The WLR emphasizes spectral peaks and reduces emphasis on valleys when comparing two given speech spectra. A WLR measure is more consistent with human perception of speech formants, which lie at the natural resonances of the vocal tract, and tends to be more robust to noise interference than measures that place no emphasis on spectral peaks. In terms of local (in frequency) signal-to-noise ratio (SNR), the peaks of a speech spectrum are less corrupted by noise. Thus, a WLR HMM can include a high weight based on peaks of spectra and a low weight based on valleys of spectra. Alternatively, a single spectrum, e.g. only the linear spectrum of the test signal, can be used as the weight to provide an asymmetric WLR measure. In a standard WLR HMM, the difference between the linear spectra of the test signal and the reference signal is used as the weighting function.
In
WLR can be formulated as an integral over frequency. In the integrand, log St(ω)−log Sr(ω) is the difference between two log spectra, the test spectrum log St(ω) and the reference spectrum log Sr(ω), and St(ω)−Sr(ω) is the difference between the corresponding linear spectra, which can be used as a weighting function. WLR distortion dwlr can be expressed as:
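One form of equation 1 consistent with the description above is given below. The 1/2π normalization and the integration limits of −π to π are assumptions, chosen to match the usual convention for spectra on a normalized frequency axis.

```latex
d_{wlr} = \frac{1}{2\pi} \int_{-\pi}^{\pi}
  \left[ \log S_t(\omega) - \log S_r(\omega) \right]
  \left[ S_t(\omega) - S_r(\omega) \right] \, d\omega
\tag{1}
```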
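Parseval's theorem states that the sum (or integral) of the square of a function is equal to the sum (or integral) of the square of its transform; in its generalized form, the inner product of two functions is equal to the inner product of their transforms. Because cepstral coefficients are the Fourier series coefficients of a log spectrum and autocorrelation coefficients are the Fourier series coefficients of a linear (power) spectrum, WLR spectral distortion can, according to Parseval's theorem, be re-formulated as: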
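A cepstral-domain form of equation 2 consistent with this description is sketched below. The infinite summation limits are an assumption; in practice the sum would be truncated to the number of coefficients actually computed.

```latex
d_{wlr} = \sum_{i=-\infty}^{\infty}
  \left[ c_t(i) - c_r(i) \right]
  \left[ r_t(i) - r_r(i) \right]
\tag{2}
```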
Here, rt(i) and ct(i) are autocorrelation and cepstral coefficients of the test spectrum, respectively. Similarly, rr(i) and cr(i) are autocorrelation and cepstral coefficients for the reference spectrum, respectively. Autocorrelation coefficients provide an indication of correlation between a signal and a time shifted version of the signal. It should be noted that the weighting function can satisfy equation 3 below. In other words, the 0th coefficients of rt(i) and rr(i) are constrained to unity power, or 1.
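The cepstral-domain computation can be illustrated with a short sketch. It assumes that equation 2 is a sum over i of [ct(i)−cr(i)]·[rt(i)−rr(i)], that equation 3 is the constraint rt(0) = rr(0) = 1, and that the coefficients are even in i, so only indices 0 and above are stored and positive indices are counted twice. The function names are hypothetical.

```python
def normalize_autocorrelation(r):
    """Scale autocorrelation coefficients so that r[0] == 1 (unity power)."""
    if r[0] == 0:
        raise ValueError("zero-power signal cannot be normalized")
    return [value / r[0] for value in r]


def wlr_distortion(c_test, r_test, c_ref, r_ref):
    """Truncated cepstral-domain WLR distortion (assumed form of equation 2).

    Each argument is a list of coefficients for indices i = 0, 1, 2, ...
    """
    r_test = normalize_autocorrelation(r_test)
    r_ref = normalize_autocorrelation(r_ref)
    n = min(len(c_test), len(r_test), len(c_ref), len(r_ref))
    total = 0.0
    for i in range(n):
        term = (c_test[i] - c_ref[i]) * (r_test[i] - r_ref[i])
        # Index 0 appears once; indices +i and -i contribute equally.
        total += term if i == 0 else 2.0 * term
    return total
```

After normalization the i = 0 term vanishes, so the 0th cepstral coefficients do not affect the result. Note that a truncated sum is only an approximation of the integral form and is not guaranteed to be nonnegative for arbitrary inputs.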
MFCC based WLR
In one example, the cepstra in equation 2 are MFCCs, although other coefficients such as LPCC can be used. MFCCs are obtained by performing a Fourier transform on a frame of the signal and converting the resulting power spectrum to a mel-frequency spectrum. The logarithm of the resulting spectrum is then obtained and an inverse Fourier transform is performed to obtain the coefficients. MFCC includes both static and dynamic features and in one example includes 13 coefficients for the static part. Static features can represent a particular interval of time (for example a frame) while dynamic features can represent the time-changing attributes of a signal. Here, an arithmetical mean of MFCC can be used to approximate the centroids of the WLR-based measure. Given the MFCC, a corresponding weighting function can be derived with autocorrelation coefficients.
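The static MFCC steps described above can be sketched as follows. Several details are assumptions not stated in the text: the mel mapping 2595·log10(1 + f/700), 24 triangular filters, a small floor to avoid taking the log of zero, and a DCT-II in place of the inverse Fourier transform (a common substitution for real, symmetric log spectra). The direct DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power spectrum for bins 0..N/2 via a direct (slow) DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank_energies(power, sample_rate_hz, num_filters):
    """Apply triangular filters equally spaced on the mel scale."""
    num_bins = len(power)
    nyquist = sample_rate_hz / 2.0
    max_mel = hz_to_mel(nyquist)
    # Filter edges expressed as fractional bin positions.
    edges = [mel_to_hz(max_mel * m / (num_filters + 1)) / nyquist * (num_bins - 1)
             for m in range(num_filters + 2)]
    energies = []
    for m in range(1, num_filters + 1):
        left, center, right = edges[m - 1], edges[m], edges[m + 1]
        energy = 0.0
        for b in range(num_bins):
            if left < b <= center:
                energy += power[b] * (b - left) / (center - left)
            elif center < b < right:
                energy += power[b] * (right - b) / (right - center)
        energies.append(energy)
    return energies


def static_mfcc(frame, sample_rate_hz=16000, num_filters=24, num_coeffs=13):
    """Return the static MFCCs of one frame (13 by default)."""
    energies = mel_filterbank_energies(power_spectrum(frame),
                                       sample_rate_hz, num_filters)
    log_energies = [math.log(e + 1e-10) for e in energies]  # floor avoids log(0)
    return [sum(log_e * math.cos(math.pi * i * (m + 0.5) / num_filters)
                for m, log_e in enumerate(log_energies))
            for i in range(num_coeffs)]
```

Dynamic features are not covered by this sketch; they would be derived from how the static coefficients change across neighboring frames.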
WLR-HMM
The WLR distortion discussed above can be applied to an HMM that includes a plurality of states and transitions between states. In one example, the states are represented as linguistic units such as phones or words. A probability density function (pdf) can be associated with each state. It can be shown from equation 1 that the WLR distortion values are nonnegative: because the log function is monotonically increasing, the difference of the linear spectra has the same sign as the difference of the corresponding log spectra, so the integrand is nonnegative. A mixture of exponential kernels can be used to model the output pdf as shown in equation 4, and the resulting model can be referred to as a WLR-HMM. In equation 4, bj represents the pdf for the jth state in the model and k represents the component/mixture index.
Here, ot is the observation vector including rt(i) and ct(i), μjk is the mean vector, and βjk is the inverse of the mean WLR distortion for the k-th component of the j-th state. Wjk is the weighting coefficient of the k-th component for the j-th state. The pdf can also be realized as in equation 5.
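A mixture of exponential kernels of this kind can be sketched as below. The sketch assumes that each kernel has the exponential-density form βjk·exp(−βjk·d), where d is the WLR distortion between ot and μjk, which is consistent with βjk being an inverse mean distortion. That form, and the function name, are assumptions rather than a reproduction of equation 4.

```python
import math


def state_output_pdf(distortions, weights, betas):
    """Output density of one state as a mixture of exponential kernels.

    distortions[k]: WLR distortion between the observation and centroid k
    weights[k]:     mixture weight W_jk (expected to sum to 1 over k)
    betas[k]:       inverse mean distortion beta_jk
    Assumed kernel: beta * exp(-beta * d).
    """
    if not (len(distortions) == len(weights) == len(betas)):
        raise ValueError("all inputs must have one entry per component")
    return sum(w * b * math.exp(-b * d)
               for d, w, b in zip(distortions, weights, betas))
```

Under this assumed form, an observation with zero distortion to a single-component state scores βjk, and the score decays exponentially as the distortion grows.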
The auxiliary Q-function for WLR-HMM density can be written as:
By taking the partial derivative of the right side of equation 5 with respect to each parameter and setting it equal to 0, the updated βjk, centroids and kernel weights can be derived as:
Here, Ψjk(t) is an indicator function that is 1 if ot is associated with the k-th component of the j-th state and 0 otherwise.
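The re-estimation step can be illustrated for a single state with hard assignments, where each frame's component index plays the role of the indicator function. The update rules below are assumptions that follow from an exponential kernel β·exp(−β·d): β becomes the count divided by the summed distortion, the weight becomes the fraction of frames assigned, and the centroid is approximated by the arithmetic mean. The function and argument names are hypothetical.

```python
import math


def reestimate_state(assignments, distortions, observations, num_components):
    """Re-estimate betas, centroids and weights for one state.

    assignments[t]:  component index assigned to frame t (the indicator)
    distortions[t]:  WLR distortion of frame t to its assigned centroid
    observations[t]: feature vector of frame t (list of floats)
    """
    dim = len(observations[0])
    counts = [0] * num_components
    distortion_sums = [0.0] * num_components
    vector_sums = [[0.0] * dim for _ in range(num_components)]
    for k, d, o in zip(assignments, distortions, observations):
        counts[k] += 1
        distortion_sums[k] += d
        for i, value in enumerate(o):
            vector_sums[k][i] += value

    total = sum(counts)
    weights = [n / total for n in counts]
    betas, centroids = [], []
    for k in range(num_components):
        if counts[k] == 0:
            # No frames assigned: leave this component undefined.
            betas.append(None)
            centroids.append(None)
            continue
        # Inverse of the mean distortion; infinite if every distortion is zero.
        betas.append(counts[k] / distortion_sums[k]
                     if distortion_sums[k] > 0 else math.inf)
        centroids.append([s / counts[k] for s in vector_sums[k]])
    return betas, centroids, weights
```

A soft-assignment variant would replace the 0/1 indicator with component posteriors; that is not shown here.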
2-stream WLR-HMM
Dynamic cepstral features can play a more important role, especially for noisy speech recognition. As discussed above, WLR-HMM can help improve the noise robustness of static MFCC through a more robust distortion measure. The static features and dynamic features can be merged. Using equation 10, the features can be integrated as two streams when computing the likelihood scores. Weighting coefficients γ1 and γ2 are used to reflect the relative importance of the two streams and to normalize the different dynamic ranges of their scores.
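A two-stream combination of this kind can be sketched in the log domain. The sketch assumes that equation 10 raises each stream's likelihood to its weight and multiplies the results, which is the usual multi-stream HMM form; that form and the function name are assumptions.

```python
import math


def two_stream_log_likelihood(static_likelihood, dynamic_likelihood,
                              gamma_static, gamma_dynamic):
    """Combine static and dynamic stream scores for one state and frame.

    Assumed form: log b = gamma1 * log b_static + gamma2 * log b_dynamic.
    """
    if static_likelihood <= 0.0 or dynamic_likelihood <= 0.0:
        return -math.inf  # a zero-likelihood stream rules the state out
    return (gamma_static * math.log(static_likelihood)
            + gamma_dynamic * math.log(dynamic_likelihood))
```

Working in the log domain turns the weighted product into a weighted sum, which makes the role of γ1 and γ2 as scale factors on the two score ranges explicit.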
An HMM framework based on the WLR measure, called WLR-HMM, can be used as an acoustic model for a speech recognition system as discussed above. When combined with dynamic cepstral features, a multiple-stream WLR-HMM can improve performance in noisy situations.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.