The present invention relates to speech recognition systems and in particular to speech recognition systems that exploit vocal tract resonances in speech.
In human speech, a great deal of information is contained in the first three or four resonant frequencies of the speech signal. In particular, when a speaker is pronouncing a vowel, the frequencies (and to a lesser extent, bandwidths) of these resonances indicate which vowel is being spoken.
Such resonant frequencies and bandwidths are often referred to collectively as formants. During sonorant speech, which is typically voiced, formants can be found as spectral prominences in a frequency representation of the speech signal. However, during non-sonorant speech, the formants cannot be found directly as spectral prominences. Because of this, the term “formants” has sometimes been interpreted as only applying to sonorant portions of speech. To avoid confusion, some researchers use the phrase “vocal tract resonance” to refer to formants that occur during both sonorant and non-sonorant speech. In both cases, the resonance is related to only the oral tract portion of the vocal tract.
To detect formants, systems of the prior art analyzed the spectral content of a frame of the speech signal. Since a formant can be at any frequency, the prior art has attempted to limit the search space before identifying a most likely formant value. Under some systems of the prior art, the search space of possible formants is reduced by identifying peaks in the spectral content of the frame. Typically, this is done by using linear predictive coding (LPC) which attempts to find a polynomial that represents the spectral content of a frame of the speech signal. Each of the roots of this polynomial represents a possible resonant frequency in the signal and thus a possible formant. Thus, using LPC, the search space is reduced to those frequencies that form roots of the LPC polynomial.
In other formant tracking systems of the prior art, the search space is reduced by comparing the spectral content of the frame to a set of spectral templates in which formants have been identified by an expert. The closest “n” templates are then selected and used to calculate the formants for the frame. Thus, these systems reduce the search space to those formants associated with the closest templates.
One system of the prior art, developed by the same inventors as the present invention, used a consistent search space that was the same for each frame of an input signal. Each set of formants in the search space was mapped into a feature vector. Each of the feature vectors was then applied to a model to determine which set of formants was most likely.
This system works well but is computationally expensive because it typically uses Mel-Frequency Cepstral Coefficient feature vectors. Mapping a set of formants into such a feature vector requires applying a set of frequencies to a complex filter that is based on all of the formants in the set, followed by a windowing step and a discrete cosine transform step. This computation was too time-consuming to be performed at run time, so all of the sets of formants had to be mapped before run time and the mapped feature vectors had to be stored in a large table. This is less than ideal because storing all of the mapped feature vectors requires a substantial amount of memory.
In another system developed by the present inventors, a set of discrete vocal tract resonance vectors are stored in a codebook. Each of the discrete vectors is converted into a simulated feature vector that is compared to an input feature vector to determine which discrete vector best represents an input speech signal. This system is less than ideal because it does not determine continuous values for the vocal tract resonance vectors but instead selects one of the discrete vocal tract resonance codewords.
A method and apparatus tracks vocal tract resonance components in a speech signal. The components are tracked by defining a state equation that is linear with respect to a past vocal tract resonance vector and that predicts a current vocal tract resonance vector. An observation equation is also defined that is linear with respect to a current vocal tract resonance vector and that predicts at least one component of an observation vector. The state equation, the observation equation, and a sequence of observation vectors are used to identify a sequence of vocal tract resonance vectors. Under one embodiment, the observation equation is defined based on a linear approximation to a non-linear function. The parameters of the linear approximation are selected based on an estimate of a vocal tract resonance vector.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 is operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
The present invention provides methods for identifying the formant frequencies and bandwidths in a speech signal across a continuous range of formant frequencies and bandwidths, both in sonorant and non-sonorant speech. Thus, the invention is able to track vocal tract resonance frequencies and bandwidths.
To do this, the present invention models the hidden vocal tract resonance frequencies and bandwidths as a sequence of hidden states that each produces an observation. In one particular embodiment, the hidden vocal tract resonance frequencies and bandwidths are modeled using a state equation of:
$$x_t = \Phi x_{t-1} + (I - \Phi)T + w_t \qquad \text{EQ. 1}$$
and an observation equation of:
$$o_t = C(x_t) + v_t \qquad \text{EQ. 2}$$
where xt is a hidden vocal tract resonance vector at time t consisting of xt={f1,b1,f2,b2,f3,b3,f4,b4}, xt−1 is a hidden vocal tract resonance vector at a previous time t−1, Φ is a system matrix, I is the identity matrix, T is a target vector for the vocal tract resonance frequencies and bandwidths, wt is noise in the state equation, ot is an observed vector, C(xt) is a mapping function from the hidden vocal tract resonance vector to an observation vector, and vt is the noise in the observation. Under one embodiment, Φ is a diagonal matrix with each entry having a value between 0.7 and 0.9 that has been empirically determined, and T is a vector, which, in one embodiment, has a value of:
Under one embodiment, the observed vector is a Linear Predictive Coding-Cepstra (LPC-cepstra) vector where each component of the vector represents an LPC order. As a result, the mapping function C(xt) can be determined precisely by an analytical nonlinear function. The nth component of the vector-valued function C(xt) for frame t is:
where Cn(xt) is the nth element in an Nth order LPC-Cepstrum feature vector, K is the number of vocal tract resonance (VTR) frequencies, fk(t) is the kth VTR frequency for frame t, bk(t) is the kth VTR bandwidth for frame t, and fs is the sampling frequency, which in many embodiments is 8 kHz and in other embodiments is 16 kHz. The C0 element is set equal to logG, where G is a gain.
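The mapping from a VTR vector to LPC-cepstrum components can be sketched in code. The sketch below assumes the standard closed form for an all-pole model with K conjugate pole pairs, C_n = sum over k of (2/n) * exp(-pi * n * b_k / fs) * cos(2 * pi * n * f_k / fs), which matches the symbols defined above and the exponent and cosine terms referred to later in the text. The function name `vtr_to_lpc_cepstrum` and its argument layout are illustrative assumptions, and the C0 (log gain) element is not computed.

```python
import math

def vtr_to_lpc_cepstrum(freqs_hz, bandwidths_hz, order, fs_hz=8000.0):
    """Return [C_1, ..., C_order] for one frame.

    Assumed closed form (all-pole model, one conjugate pole pair per VTR):
        C_n = sum_k (2/n) * exp(-pi*n*b_k/fs) * cos(2*pi*n*f_k/fs)
    """
    cepstrum = []
    for n in range(1, order + 1):
        c_n = 0.0
        for f_k, b_k in zip(freqs_hz, bandwidths_hz):
            # Bandwidth controls the decay (the exponent term).
            decay = math.exp(-math.pi * n * b_k / fs_hz)
            # Frequency controls the oscillation (the cosine term).
            oscillation = math.cos(2.0 * math.pi * n * f_k / fs_hz)
            c_n += (2.0 / n) * decay * oscillation
        cepstrum.append(c_n)
    return cepstrum
```

For example, `vtr_to_lpc_cepstrum([500.0, 1500.0], [60.0, 100.0], order=12)` would return twelve components for a frame with two resonances; those input values are arbitrary and chosen only for illustration.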
To identify a sequence of hidden vocal tract resonance vectors from a sequence of observation vectors, the present invention uses a Kalman filter. A Kalman filter provides a recursive technique that can determine a best estimate of the continuous-valued hidden vocal tract resonance vectors in the linear dynamic system represented by Equations 1 and 2. Such Kalman filters are well known in the art.
The Kalman filter requires that the right-hand side of Equations 1 and 2 be linear with respect to the hidden vocal tract resonance vector. However, the mapping function of Equation 3 is non-linear with respect to the vocal tract resonance vector. To address this, the present invention uses piecewise linear approximations in place of the exponent and cosine terms in Equation 3. Under one embodiment, the exponent term is represented by five linear regions and the cosine term is represented by ten linear regions.
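A piecewise linear approximation of this kind can be illustrated with a short sketch. It assumes evenly spaced region boundaries and chords that interpolate the function at the boundaries; the actual regions and parameters are those given in Tables 1 and 2, which this sketch does not reproduce. The names `piecewise_linear_segments` and `evaluate_piecewise` are hypothetical.

```python
import math

def piecewise_linear_segments(func, lo, hi, num_regions):
    """Approximate func on [lo, hi] by num_regions chords.

    Returns a list of (left, right, slope, intercept) tuples.
    Evenly spaced regions are an assumption of this sketch.
    """
    width = (hi - lo) / num_regions
    segments = []
    for r in range(num_regions):
        left = lo + r * width
        right = left + width
        slope = (func(right) - func(left)) / width
        intercept = func(left) - slope * left
        segments.append((left, right, slope, intercept))
    return segments

def evaluate_piecewise(segments, x):
    """Evaluate the approximation with the segment whose region contains x.

    Values outside the covered range reuse the nearest end segment.
    """
    for left, right, slope, intercept in segments:
        if x <= right:
            return slope * x + intercept
    _, _, slope, intercept = segments[-1]
    return slope * x + intercept

# Cosine term for LPC order n = 1 at fs = 8 kHz, ten regions over 0..4000 Hz.
cos_segments = piecewise_linear_segments(
    lambda f: math.cos(2.0 * math.pi * f / 8000.0), 0.0, 4000.0, 10)
# Exponent term for n = 1, five regions over an assumed 0..500 Hz bandwidth range.
exp_segments = piecewise_linear_segments(
    lambda b: math.exp(-math.pi * b / 8000.0), 0.0, 500.0, 5)
```

Each tuple carries a slope and an intercept, which play the roles of the per-region linear parameters selected from the value of the VTR vector.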
Using these linear approximations, Equation 3 is rewritten as:
where αkx is the slope and βkx is the intercept of the linear segment that approximates the exponent term and γkx is the slope and δkx is the intercept of the linear segment that approximates the cosine term. Note that all four terms are dependent on xt because the linear segments that are used to approximate the non-linear functions are selected based on the region determined by the value of xt according to Tables 1 and 2.
The form of the mapping function in Equation 4 is still not linear in xt because of the quadratic term. Under one embodiment of the present invention, the incremental portion of this term is ignored, resulting in a linear equation from xt to Cn(xt).
In this form, as long as the parameters are fixed based on the regions of the segment exemplified in Tables 1 and 2, a Kalman Filter is applied directly to obtain the sequence of continuous valued states x1:T from a sequence of observed LPC feature vectors o1:T.
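Once the observation equation is linear, a standard Kalman filter recursion applies. The following is a minimal sketch, assuming NumPy is available and assuming the linearized observation has the affine form o_t = H x_t + c + v_t with a single H and c for all frames (in the method described here they would be chosen per frame). `Q` and `R` are noise covariances, that is, inverses of the corresponding precision matrices. All names are illustrative.

```python
import numpy as np

def kalman_filter(observations, Phi, T, Q, H, c, R, x0, P0):
    """Filter a sequence of observation vectors.

    State:       x_t = Phi x_{t-1} + (I - Phi) T + w_t,  w_t ~ N(0, Q)
    Observation: o_t = H x_t + c + v_t,                  v_t ~ N(0, R)
    Returns an array with one filtered state estimate per frame.
    """
    identity = np.eye(len(x0))
    offset = (identity - Phi) @ T          # pull toward the target vector T
    x = np.asarray(x0, dtype=float)
    P = np.asarray(P0, dtype=float)
    estimates = []
    for o in observations:
        # Prediction step from the state equation.
        x_pred = Phi @ x + offset
        P_pred = Phi @ P @ Phi.T + Q
        # Update step from the linear observation equation.
        innovation = np.asarray(o, dtype=float) - (H @ x_pred + c)
        S = H @ P_pred @ H.T + R
        gain = P_pred @ H.T @ np.linalg.inv(S)
        x = x_pred + gain @ innovation
        P = (identity - gain @ H) @ P_pred
        estimates.append(x)
    return np.array(estimates)
```

This sketch performs filtering only; a smoothing pass over the whole utterance would be a separate extension.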
In step 500 of
Under one embodiment, the formants and bandwidths are quantized according to the entries in Table 3 below, where Min(Hz) is the minimum value for the frequency or bandwidth in Hertz, Max(Hz) is the maximum value in Hertz, and “Num. Quant.” is the number of quantization states. For the frequencies and the bandwidths, the range between the minimum and maximum is divided into equal intervals, one fewer than the number of quantization states, to provide the separation between adjacent quantization states. For example, for bandwidth B1 in Table 3, the range of 260 Hz is divided into 4 equal intervals by the 5 quantization states, such that each state is separated from its neighbors by 65 Hz (i.e., 40, 105, 170, 235, 300).
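The B1 example can be reproduced with a few lines of code. In this sketch the spacing is the range divided by the number of states minus one, which is what yields the 65 Hz separation and the five listed values; the function name `quantization_levels` is hypothetical.

```python
def quantization_levels(min_hz, max_hz, num_states):
    """Return num_states evenly spaced values from min_hz to max_hz inclusive."""
    if num_states < 2:
        raise ValueError("need at least two quantization states")
    step = (max_hz - min_hz) / (num_states - 1)
    return [min_hz + i * step for i in range(num_states)]

# Bandwidth B1 from the example: 40 Hz to 300 Hz with 5 states.
b1_levels = quantization_levels(40.0, 300.0, 5)
```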
The number of quantization states in Table 3 could yield a total of more than 100 million different sets of VTRs. However, because of the constraint F1<F2<F3<F4 there are substantially fewer sets of VTRs in the codebook.
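The effect of the ordering constraint on codebook size can be sketched as follows. The grids passed in are placeholders supplied by the caller, not the values of Table 3, and `build_codebook` is a hypothetical name. Entries are interleaved as (f1, b1, f2, b2, ...) to match the layout of the VTR vector.

```python
from itertools import product

def build_codebook(freq_grids, bw_grids):
    """Enumerate VTR sets, keeping only those with strictly increasing frequencies.

    freq_grids and bw_grids are lists of per-resonance quantization levels.
    """
    codebook = []
    for freqs in product(*freq_grids):
        # Enforce F1 < F2 < ... between neighboring resonances.
        if any(lower >= upper for lower, upper in zip(freqs, freqs[1:])):
            continue
        for bws in product(*bw_grids):
            entry = tuple(v for pair in zip(freqs, bws) for v in pair)
            codebook.append(entry)
    return codebook
```

With two resonances, frequency grids `[[200, 600], [400, 800]]`, and bandwidth grids `[[50], [60, 90]]`, the unconstrained product has 8 entries while the constrained codebook keeps 6, because the combination with F1 = 600 and F2 = 400 is rejected.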
After the codebook has been formed, the entries in the codebook are used to train parameters that describe a residual random variable at step 502. The residual random variable is the difference between a set of observation training feature vectors and a set of simulated feature vectors. In terms of an equation:
$$\nu_t = o_t - S(x_t[i]) \qquad \text{EQ. 5}$$
where νt is the residual, ot is the observed training feature vector at time t and S(xt[i]) is a simulated feature vector.
As shown in
where Sn(xt[i]) is the nth element in an Nth order LPC-Cepstrum feature vector, K is the number of VTRs, fk is the kth VTR frequency, bk is the kth VTR bandwidth, and fs is the sampling frequency, which in many embodiments is 8 kHz. The S0 element is set equal to log G, where G is a gain.
To produce the observed training feature vectors ot used to train the residual model, a human speaker 612 generates an acoustic signal that is detected by a microphone 616, which also detects additive noise 614. Microphone 616 converts the acoustic signals into an analog electrical signal that is provided to an analog-to-digital (A/D) converter 618. The analog signal is sampled by A/D converter 618 at the sampling frequency fs and the resulting samples are converted into digital values. In one embodiment, A/D converter 618 samples the analog signal at 8 kHz with 16 bits per sample, thereby creating 16 kilobytes of speech data per second. In other embodiments, A/D converter 618 samples the analog signal at 16 kHz. The digital samples are provided to a frame constructor 620, which groups the samples into frames. Under one embodiment, frame constructor 620 creates a new frame every 10 milliseconds that includes 25 milliseconds worth of data.
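The framing arithmetic can be made concrete with a small sketch. At 8 kHz, a 25 millisecond frame holds 200 samples and a 10 millisecond advance is 80 samples, so consecutive frames overlap by 120 samples. The function name `frame_signal` is hypothetical, and dropping a trailing partial frame is an assumption of this sketch rather than something the text specifies.

```python
def frame_signal(samples, fs_hz=8000, frame_ms=25, hop_ms=10):
    """Group samples into overlapping frames (new frame every hop_ms)."""
    frame_len = fs_hz * frame_ms // 1000   # 200 samples at 8 kHz
    hop_len = fs_hz * hop_ms // 1000       # 80 samples at 8 kHz
    frames = []
    start = 0
    # Assumption: a trailing partial frame is discarded.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

# Data rate check: 8000 samples/s at 16 bits (2 bytes) per sample.
bytes_per_second = 8000 * 16 // 8
```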
The frames of data are provided to an LPC-Cepstrum feature extractor 622, which converts the signal to the frequency domain using a Fast Fourier Transform (FFT) 624 and then identifies a polynomial that represents the spectral content of a frame of the speech signal using an LPC coefficient system 626. The LPC coefficients are converted into LPC cepstrum coefficients using a recursion 628. The output of the recursion is a set of training feature vectors 630 representing the training speech signal.
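The conversion from LPC coefficients to LPC cepstrum coefficients is commonly done with a well-known recursion, sketched below. The sketch assumes the sign convention H(z) = G / (1 - sum over k of a_k z^-k); with the opposite sign convention the coefficients would need to be negated first. The function name `lpc_to_cepstrum` is hypothetical, and the zeroth (log gain) coefficient is not computed.

```python
def lpc_to_cepstrum(a, num_ceps):
    """Convert LPC coefficients to cepstrum coefficients [c_1, ..., c_num_ceps].

    a[k-1] holds a_k for the all-pole model H(z) = G / (1 - sum_k a_k z^-k).
    Recursion: c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k},
    with a_n taken as 0 for n greater than the LPC order.
    """
    p = len(a)
    c = [0.0] * (num_ceps + 1)   # c[0] is unused in this sketch
    for n in range(1, num_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        # Only terms with 1 <= n - k <= p contribute.
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```

For a single resonance with pole radius r and pole angle theta, the LPC coefficients are a_1 = 2 r cos(theta) and a_2 = -r^2, and the recursion yields c_n = (2/n) r^n cos(n theta), which is the same exponent-times-cosine form used for the simulated feature vectors.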
The simulated feature vectors 610 and the training feature vectors 630 are provided to residual trainer 632 which trains the parameters for the residual νt.
Under one embodiment, νt is a single Gaussian with mean h and a precision D, where h is a vector with a separate mean for each component of the feature vector and D is a diagonal precision matrix with a separate value for each component of the feature vector.
These parameters are trained using an Expectation-Maximization (EM) algorithm under one embodiment of the present invention. During the E-step of this algorithm, a posterior probability γt(i)=p(xt[i]|o1N) is determined. Under one embodiment, this posterior is determined using a backward-forward recursion defined as:
where ρt(i) and σt(i) are recursively determined as:
Under one aspect of the invention, the transition probabilities p(xt[i]|xt−1[j]) and p(xt[i]|xt+1[j]) are determined using Equation 1 above, which is repeated here for convenience using the codebook index notation:
$$x_t[i] = \Phi x_{t-1}[j] + (I - \Phi)T + w_t \qquad \text{EQ. 10}$$
where xt[i] is the value of the VTRs at frame t, xt−1[j] is the value of the VTRs at previous frame t−1, Φ is a rate, T is a target for the VTRs associated with frame t and wt is the noise at frame t, which in one embodiment is assumed to be a zero-mean Gaussian with a precision matrix B.
Using this dynamic model, the transition probabilities can be described as Gaussian functions:
$$p(x_t[i] \mid x_{t-1}[j]) = N(x_t[i];\ \Phi x_{t-1}[j] + (I - \Phi)T,\ B) \qquad \text{EQ. 11}$$
$$p(x_t[i] \mid x_{t+1}[j]) = N(x_{t+1}[j];\ \Phi x_t[i] + (I - \Phi)T,\ B) \qquad \text{EQ. 12}$$
Alternatively, the posterior probability γt(i)=p(xt[i]|o1N) may be estimated by making the probability only dependent on the current observation vector and not the sequence of vectors such that the posterior probability becomes:
$$\gamma_t(i) \approx p(x_t[i] \mid o_t) \qquad \text{EQ. 13}$$
which can be calculated as:
where ĥ is the mean of the residual and D̂ is the precision of the residual as determined from a previous iteration of the EM algorithm or as initially set if this is the first iteration.
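A frame-wise posterior of this kind can be sketched as a normalized Gaussian likelihood over the codebook entries. The sketch assumes a uniform prior over codebook entries and a diagonal precision, and it works in the log domain for numerical stability; whether these match the exact expression used is an assumption. The name `frame_posteriors` is hypothetical.

```python
import math

def frame_posteriors(o_t, simulated, h, d):
    """Return gamma_t(i) for every codebook entry i.

    o_t:       observed feature vector for the frame
    simulated: list of simulated feature vectors S(x[i])
    h:         residual mean (one value per component)
    d:         diagonal of the residual precision (one value per component)
    Assumes a uniform prior over codebook entries.
    """
    log_liks = []
    for s_i in simulated:
        ll = 0.0
        for o, s, mean, prec in zip(o_t, s_i, h, d):
            r = o - s - mean
            ll += 0.5 * math.log(prec / (2.0 * math.pi)) - 0.5 * prec * r * r
        log_liks.append(ll)
    # Normalize with the max subtracted to avoid underflow.
    peak = max(log_liks)
    weights = [math.exp(ll - peak) for ll in log_liks]
    total = sum(weights)
    return [w / total for w in weights]
```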
After the E-step is performed to identify the posterior probability $\gamma_t(i) = p(x_t[i] \mid o_1^N)$, an M-step is performed to determine the mean h and each diagonal element $d^{-1}$ of the variance $D^{-1}$ (the inverse of the precision matrix) of the residual using:
where N is the number of frames in the training utterance, I is the number of quantization combinations for the VTRs, ot is the observed feature vector at time t and S(xt[i]) is a simulated feature vector for VTRs xt[i].
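The M-step can be sketched as posterior-weighted averages of the residual and of its squared deviation. This is the standard weighted maximum-likelihood update for a single Gaussian with diagonal variance and is offered as an assumption about the form of the update, with the hypothetical name `m_step`. It relies on the posteriors for each frame summing to one, so the normalizer is the number of frames N.

```python
def m_step(observations, simulated, gammas):
    """Re-estimate the residual mean and diagonal variance.

    observations: list of N observed feature vectors o_t
    simulated:    list of I simulated feature vectors S(x[i])
    gammas:       gammas[t][i] is the posterior of entry i at frame t
    """
    num_frames = len(observations)
    dim = len(observations[0])
    mean = [0.0] * dim
    for o_t, g_t in zip(observations, gammas):
        for g, s_i in zip(g_t, simulated):
            for n in range(dim):
                mean[n] += g * (o_t[n] - s_i[n])
    mean = [m / num_frames for m in mean]
    variance = [0.0] * dim
    for o_t, g_t in zip(observations, gammas):
        for g, s_i in zip(g_t, simulated):
            for n in range(dim):
                dev = o_t[n] - s_i[n] - mean[n]
                variance[n] += g * dev * dev
    variance = [v / num_frames for v in variance]
    return mean, variance
```

Alternating a posterior computation with this update, each time feeding in the latest mean and variance, gives the iteration that the residual trainer performs until the values stabilize.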
Residual trainer 632 updates the mean and variance multiple times by iterating the E-step and the M-step, each time using the mean and variance from the previous iteration. After the mean and variance reach stable values, they are stored as residual parameters 634.
Once residual parameters 634 have been constructed they can be used in step 504 of
In
The stream of feature vectors 730 is provided to a VTR tracker 732 together with residual parameters 634 and simulated feature vectors 610. VTR tracker 732 uses dynamic programming to identify a sequence of most likely VTR vectors 734. In particular, it utilizes a Viterbi decoding algorithm where each node in the trellis diagram has an optimal partial score of:
Based on the optimality principle, the optimal partial likelihood at the processing stage of t+1 can be computed using the following Viterbi recursion:
In Equation 18, the “transition” probability p(xt+1[i]=x[i]|xt[i]=x[i′]) is calculated using state Equation 10 above to produce a Gaussian distribution of:
$$p(x_{t+1}[i] = x[i] \mid x_t[i] = x[i']) = N(x_{t+1}[i];\ \Phi x_t[i'] + (I - \Phi)T,\ B) \qquad \text{EQ. 19}$$
where Φxt[i′]+(I−Φ)T is the mean of the distribution and B is the precision of the distribution.
The observation probability p(ot+1|xt+1[i]=x[i]) of Equation 18 is treated as a Gaussian and is computed from observation Equation 5 and the residual parameters h and D such that:
$$p(o_{t+1} \mid x_{t+1}[i] = x[i]) = N(o_{t+1};\ S(x_{t+1}[i]) + h,\ D) \qquad \text{EQ. 20}$$
Back tracing of the optimal quantization index i′ in Equation 18 provides the initial VTR sequence 734.
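The decoding and back tracing can be sketched with a generic log-domain Viterbi routine. It takes precomputed log scores, which in this method would come from the Gaussian transition and observation probabilities of Equations 19 and 20; the function name `viterbi` and the argument layout are illustrative assumptions, and no pruning is applied.

```python
def viterbi(log_init, log_trans, log_obs):
    """Return the most likely sequence of codebook indices.

    log_init[i]:     log prior of index i at the first frame
    log_trans[j][i]: log probability of moving from index j to index i
    log_obs[t][i]:   log observation likelihood of index i at frame t
    """
    num_states = len(log_init)
    score = [log_init[i] + log_obs[0][i] for i in range(num_states)]
    back_pointers = []
    for t in range(1, len(log_obs)):
        new_score = []
        pointers = []
        for i in range(num_states):
            # Best predecessor index for landing on index i at frame t.
            best_prev = max(range(num_states),
                            key=lambda j: score[j] + log_trans[j][i])
            new_score.append(score[best_prev] + log_trans[best_prev][i]
                             + log_obs[t][i])
            pointers.append(best_prev)
        score = new_score
        back_pointers.append(pointers)
    # Back trace from the best final index.
    state = max(range(num_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

The cost of the full search grows with the square of the codebook size per frame, which is the motivation for beam pruning.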
To reduce the number of computations that must be performed, a pruning beam search may be performed instead of a rigorous Viterbi search. In one embodiment, an extreme form of pruning is used where only one index is identified for each frame.
After initial VTR sequence 734 has been identified at step 504, the initial VTR sequence is provided to a linear parameter estimator 736, which selects the parameters for the linear approximations of Equation 4 above at step 506. Specifically, for each frame, the initial VTR vector for the frame is used to determine the values of the linear parameters αkx, βkx, γkx, and δkx for each vocal tract resonance index k and each LPC order n.
Under one embodiment, the values of linear parameters αkx and βkx are determined for an LPC order n by applying bandwidth bk of the initial VTR vector to the exponent term
and evaluating the exponent. The linear segment of
Under one embodiment, the values of linear parameters γkx and δkx are determined for an LPC order n by applying frequency fk of the initial VTR vector to the cosine term
and evaluating the cosine. The linear segment of
At step 508, the linear parameters for each frame are applied to Equation 4. With the incremental portion of the quadratic term ignored, Equation 4 is substituted for the mapping function in Equation 2. Equations 1 and 2 are then provided to a Kalman filter 738, which re-estimates the VTR vectors 734 for each frame. At step 510, the process determines whether more iterations are to be performed. If so, the process returns to step 506, where the linear parameters are re-estimated from the new VTR vectors. The new linear parameters are then applied to Equation 2 through Equation 4, and Equations 1 and 2 are used in Kalman filter 738 at step 508 to re-estimate the VTR vectors. Steps 506, 508 and 510 are iterated until a determination is made at step 510 that no further iterations are needed. At that point, the process ends at step 512 and the last estimation of VTR vectors 734 is used as the sequence of vocal tract resonance frequencies and bandwidths for the input signal.
Note that the Kalman Filter 738 provides continuous values for the vocal tract resonance vectors. Thus, the resulting sequence of vocal tract resonance frequencies and bandwidths is not limited to the discrete values found in VTR codebook 600.
Although the present invention has been described with reference to particular embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.