The present invention relates generally to audio processing and in particular, to speech signal processing.
Embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals refer to similar elements.
Voice command and continuous speech recognition are used for mobile Internet devices, for example, with in-car applications and phones that have limited keyboard functionality. It is desirable to be able to provide clean input to any speech recognition engine, but background noise in the environment impedes this objective. For example, experiments have shown that the open dictation word accuracy can degrade to approximately 20% in car noise and cafeteria environments, which may be unacceptable to the user.
Today's speech engines have some noise reduction features to reduce the impact of background noise. However, these features may not be sufficient to allow open dictation in challenging environments. Accordingly, Kalman filtering techniques may be used to improve speech signal processing.
With some embodiments presented herein, speech recognition performance may be enhanced by bifurcating audio noise filtering processing into separate speech recognition and human reception paths. That is, the audio path may be cloned to generate a “perception” (or auditory reception) channel and a separate channel that is used for preprocessing audio for the speech recognition engine.
Audio (e.g., digitized audio from a microphone) comes into the SPE (speech processing engine) and is split into two paths: a speech recognition path, entering the Kalman filter block 104, and an audio perception path (cloned audio) that is processed using standard noise suppression techniques in block 110 for reception by a user. The Kalman filter utilizes components from the speaker/voice model 106, as well as from the environmental noise model 107, to filter out noise from the audio signal and provide a filtered signal to the automatic speech recognition (ASR) engine 108.
The speaker/voice model 106 (at least an initial version) is generated before SPE execution since the SPE works off of it, although an initial version may be fairly bare, and the speaker/voice model may be updated while the SPE is executing. The speaker/voice model 106 provides particular characteristics associated with the current speaker. Such characteristics could include one or more glottal harmonics, including the user's particular fundamental glottal frequency, along with any other suitable information. For example, if previously acquired models (e.g., resulting from user training) are available, they may also be incorporated into the speaker/voice model 106. As indicated, previously generated “clean” audio information (x′(n)) for the particular user may also be used.
The environmental noise model 107, like the speaker/voice model, may be based on initial default data/assumptions for assumed noise environments or for specific or previously characterized environments (e.g., an office, a car, an airplane, etc). It may be static data (e.g., assumed background noise elements) associated with an environment and/or it may comprise dynamic data obtained from real-time sensors and the like. For example, it could include sensor inputs such as car speed, background noise microphone data, and air conditioning information, to enhance the performance of the noise model estimator. In some embodiments, a noise estimation method may be employed, e.g., for a single channel, by detecting periods of speech absence using a voice activity detector algorithm. The noise model may be further enhanced using an iterative loop between the noise model and Kalman filtering.
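As a minimal sketch of the single-channel approach mentioned above, noise power might be tracked only during frames that a voice activity detector marks as speech-free. The energy-threshold detector, the exponential smoothing, and all function and parameter names here are illustrative assumptions, not details taken from the description.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def estimate_noise_power(frames, vad_threshold, smoothing=0.9):
    """Track noise power using only frames judged to be speech-free.

    Assumption: a frame whose energy is below `vad_threshold` contains
    no speech (a deliberately simple voice activity detector).
    Returns None if no speech-free frame was seen.
    """
    noise_power = None
    for frame in frames:
        energy = frame_energy(frame)
        if energy >= vad_threshold:
            continue  # speech present: leave the noise estimate untouched
        if noise_power is None:
            noise_power = energy  # first speech-free frame seeds the estimate
        else:
            noise_power = smoothing * noise_power + (1.0 - smoothing) * energy
    return noise_power


# Two quiet frames around one loud (speech) frame.
frames = [[0.1, -0.1, 0.1, -0.1], [1.0, -1.0, 1.0, -1.0], [0.1, -0.1, 0.1, -0.1]]
noise = estimate_noise_power(frames, vad_threshold=0.5)
```

In this example the loud middle frame is skipped, so the estimate reflects only the two quiet frames.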
The filter 104 may use either or both the speaker model and noise model to filter the received audio signal. Again, from the speaker model, it may use an extension to add periodic components in the form of pulses into the Kalman filtering to account for glottal harmonics generated by the speech source (e.g., human or other entity speaker using, for example, a dictation, voice controlled, or translation device). Kalman filtering has typically been used with a white noise input, but in the case of human speech, the addition of a periodic input may more closely resemble the physiology of speech generation. The speaker model information, including the predetermined model information and glottal harmonic parameters, may be used to load a set of predetermined or previously determined coefficients for the speaker model. Kalman filtering results in audio that does not necessarily noticeably improve human perception, but it does typically improve the performance of the speech recognition engine. Therefore, the audio path is cloned (two paths) to maximize both human perception and the speech recognition input using the Kalman preprocessing filtering.
An implemented filter 104 using Kalman techniques can be used to model the vocal tract response as an AR or ARMA system, using an independent input and a driving noise, along with a noisy observation that accounts for additive colored-noise.
In conventional Kalman applications, the driving periodic input is typically neglected and only a driving white noise is used for simplicity. This assumption implies that the filter will (under an ideal performance) produce a clean but unvoiced speech signal, which neither has physiological value nor sounds natural. However, the assumption may be adequate in cases where only filter parameters are needed.
On the other hand, we have determined that the linear Kalman filter may capture the fundamental interactive features observed in voice production, thus yielding better estimates of the clean input under noisy conditions. When combined with CP analysis and source modeling, for example, it may perform even better for speech processing applications. The error in a scheme of this nature will be associated with its parameter estimation errors rather than being the product of a physiological/acoustical misrepresentation. Therefore, speech enhancement schemes disclosed herein are based on the linear Kalman filter, with the structure shown in the following table under the “Linear” heading.
The state xk corresponds to the clean speech input that is produced by the glottal source uk and environmental noise wk. (xk is not an actual input to the SPE.) The measured signal yk is corrupted by the observation noise vk. As described before, previous Kalman approaches neglect the periodic input uk for simplicity, yielding white noise excited speech. However, the inclusion of such a periodic input and CP representation of the state transition matrix provides better estimates of the clean input xk and thus better speech recognition performance. In the following section, Kalman filtering, as applied herein, will be discussed in more detail.
In some embodiments, a Kalman filtering model-based approach is used for speech enhancement. It assumes that the clean speech follows a particular representation that is linearly corrupted with background noise. With standard Kalman filtering, clean speech is typically represented using an autoregressive (AR) model, which normally has a white Gaussian noise as an input. This is represented in discrete time equation 1.
where x[n] is the clean speech, αn the AR or linear prediction coding (LPC) coefficients, w[n] the white noise input, and p is the order of the AR model (normally assumed to follow the rule of thumb p=fs/1000+2, where fs is the sampling rate in Hz; e.g., p=10 for fs=8000 Hz). This model can be rewritten to produce the desired structure needed for the Kalman filter, as described in equations (2) and (3). Thus,
$$x_{k+1} = \Phi x_k + G w_k \tag{2}$$
$$y_k = H x_k + v_k \tag{3}$$
where xk+1 and xk are vectors containing p samples of the future and current clean speech, Φ is the state transition matrix that contains the LPC coefficients in the last row of a controllable canonical form, wk represents the white noise input that is converted into a vector that affects the current sample via the vector gain G. The clean speech is projected via the projector vector H to obtain the current sample that is linearly added to the background noise vk to produce the corrupted observation or noisy speech yk.
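The canonical form described above might be assembled as in the following sketch. It assumes the AR recursion takes the common form x[n] = a_1 x[n−1] + ... + a_p x[n−p] + w[n] (the sign convention on the coefficients is an assumption), that the state vector holds the p most recent samples with the oldest first, and that fs is given in Hz. The function names are hypothetical.

```python
def ar_order(fs_hz):
    """Rule-of-thumb AR order p = fs/1000 + 2, with fs taken in Hz (assumption)."""
    return fs_hz // 1000 + 2


def companion_form(lpc):
    """Return (Phi, G, H) for x[n] = sum_k lpc[k-1] * x[n-k] + w[n].

    The state holds the p most recent samples, oldest first, so the
    newest sample sits in the last slot. The coefficient sign convention
    is an assumption, not taken from the description.
    """
    p = len(lpc)
    phi = [[0.0] * p for _ in range(p)]
    for i in range(p - 1):
        phi[i][i + 1] = 1.0  # shift: each stored sample moves up one slot
    phi[p - 1] = [float(a) for a in reversed(lpc)]  # last row: a_p ... a_1
    g = [0.0] * (p - 1) + [1.0]  # driving noise affects only the newest sample
    h = [0.0] * (p - 1) + [1.0]  # observation reads only the newest sample
    return phi, g, h


# Second-order example with coefficients a_1 = 0.5, a_2 = -0.25.
phi, g, h = companion_form([0.5, -0.25])
```

With these inputs, `phi` has a shift in its first row and the LPC coefficients in its last row, while `g` and `h` both select the newest sample.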
Kalman filtering comprises two basic steps, a propagation step and an update step. In the propagation step the model is used to predict the current sample based on the previous estimate (hence the notation n|n−1). This is represented in equation (4). Note that only one buffer of one vector containing the previous p points is required. The update step is depicted in equations (5)-(7), where the predicted samples are first corrected considering the error between the observation and the prediction. This correction is weighted by the Kalman gain Kn, which is defined in equation (6); equation (7) updates the associated error covariance. Note that all these parameters may be computed once within each frame, i.e., speech is considered a stationary process within each frame (normally of duration no longer than 25 ms).
$$\hat{x}_{n|n-1} = \Phi\,\hat{x}_{n-1|n-1} \tag{4}$$
$$\hat{x}_{n|n} = \hat{x}_{n|n-1} + K_n\left(y_n - H_n \hat{x}_{n|n-1}\right) \tag{5}$$
$$K_n = P_{n|n-1} H_n^{T}\left(H_n P_{n|n-1} H_n^{T} + R_n\right)^{-1} \tag{6}$$
$$P_{n|n} = \left(I - K_n H_n\right) P_{n|n-1} \tag{7}$$
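One propagation and update cycle per equations (4)-(7) might look like the following sketch for a scalar observation, where H is a row vector and the bracketed term in equation (6) reduces to a scalar. The predicted covariance P(n|n−1) is passed in as an argument because its computation is not covered by equations (4)-(7); the function names and the example values are illustrative assumptions.

```python
def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def kalman_step(phi, h, r, x_prev, p_pred, y):
    """One standard Kalman cycle for a scalar observation y.

    phi: state transition matrix, h: observation row vector,
    r: observation noise variance, x_prev: previous estimate,
    p_pred: predicted covariance P(n|n-1), supplied by the caller.
    """
    n = len(x_prev)
    x_pred = mat_vec(phi, x_prev)                          # eq. (4)
    ph = mat_vec(p_pred, h)                                # P H^T (column)
    s = sum(a * b for a, b in zip(h, ph)) + r              # H P H^T + R (scalar)
    k = [v / s for v in ph]                                # eq. (6)
    innovation = y - sum(a * b for a, b in zip(h, x_pred))
    x_new = [xp + ki * innovation for xp, ki in zip(x_pred, k)]  # eq. (5)
    hp = [sum(h[i] * p_pred[i][j] for i in range(n)) for j in range(n)]  # H P (row)
    # (I - K H) P expands to P - K (H P)
    p_new = [[p_pred[i][j] - k[i] * hp[j] for j in range(n)] for i in range(n)]  # eq. (7)
    return x_new, p_new, k


# Illustrative values only.
phi = [[0.0, 1.0], [-0.25, 0.5]]
h = [0.0, 1.0]
x_new, p_new, k = kalman_step(phi, h, r=1.0, x_prev=[0.0, 0.0],
                              p_pred=[[1.0, 0.0], [0.0, 1.0]], y=2.0)
```

Here the gain on the newest sample is 0.5, so the estimate moves halfway from the prediction toward the observation.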
The “modified Kalman filter” proposed herein extends the standard filter by generalizing the two basic noise assumptions in the system, i.e., assuming that glottal pulses also drive the AR model during voiced segments and that the background noise has resonances associated with it (a non-white process). The glottal pulses are represented by u[n] and are present when there is vocal fold vibration. The background noise is assumed to follow an AR model of order q (which may be estimated, e.g., empirically obtained as q=fs/2000). Therefore, the two equations that represent the new structure of the system are
Since the models for speech and noise have a similar structure, the state equation needed for the Kalman filter can be extended by creating two subsystems embedded in a larger diagonal matrix. The same system structure is used to track speech and noise as shown in equations (10) to (13), where the subscript s indicates speech and v indicates background noise. The glottal pulses are introduced only in the current sample, for which the vector B has the same structure as G.
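The idea of embedding the speech and noise subsystems in one larger block-diagonal system can be sketched as follows. This is an assumed layout rather than a transcription of equations (10) to (13): the combined state stacks the speech state above the noise state, the observation is taken as the sum of the newest speech and noise samples, and B injects the glottal pulse into the newest speech sample only. All names are hypothetical.

```python
def block_diag(a, b):
    """Place square matrices a and b on the diagonal of a larger matrix."""
    na, nb = len(a), len(b)
    top = [list(row) + [0.0] * nb for row in a]
    bottom = [[0.0] * na + list(row) for row in b]
    return top + bottom


def augment(phi_s, h_s, phi_v, h_v):
    """Combine speech (s) and noise (v) subsystems into one system.

    Assumed layout: state = [speech state, noise state].
    """
    phi = block_diag(phi_s, phi_v)
    h = list(h_s) + list(h_v)  # y = newest speech sample + newest noise sample
    # Glottal pulse enters the newest speech sample only (same shape as G).
    b = [0.0] * (len(h_s) - 1) + [1.0] + [0.0] * len(h_v)
    return phi, h, b


# Second-order speech model and first-order noise model (illustrative values).
phi, h, b = augment([[0.0, 1.0], [-0.25, 0.5]], [0.0, 1.0], [[0.9]], [1.0])
```

The resulting `phi` is 3×3, with the speech block in the upper left and the noise block in the lower right.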
The equations for Kalman propagation and update differ from those of the standard Kalman filter because, among other reasons, the glottal pulses are included and the noise covariance matrix Rn is not, since the noise is tracked by the filter itself. These changes are made by replacing equation (4) with (14) and equation (6) with (15). Thus,
$$\hat{x}_{n|n-1} = \Phi\,\hat{x}_{n-1|n-1} + B u_k \tag{14}$$
$$K_n = P_{n|n-1} H_n^{T}\left(H_n P_{n|n-1} H_n^{T}\right)^{-1} \tag{15}$$
With these modifications, the filter better represents speech signal and background noise conditions, thus yielding better noise removal and ASR performance.
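A sketch of one cycle of the modified filter, using equations (14) and (15) together with the unchanged state update of equation (5), is given below for a scalar observation. The three-element state layout (two speech samples followed by one noise sample), the numeric values, and the function name are assumptions for illustration; the predicted covariance is again supplied by the caller.

```python
def modified_kalman_step(phi, h, b, x_prev, p_pred, u, y):
    """One cycle of the modified filter for a scalar observation y.

    u is the glottal pulse sample (zero in unvoiced segments).
    No observation noise covariance R appears, since the noise is part
    of the tracked state. Note that H P H^T must be non-zero.
    """
    n = len(x_prev)
    x_pred = [sum(phi[i][j] * x_prev[j] for j in range(n)) + b[i] * u
              for i in range(n)]                                  # eq. (14)
    ph = [sum(p_pred[i][j] * h[j] for j in range(n)) for i in range(n)]
    s = sum(h[i] * ph[i] for i in range(n))                       # H P H^T, no R
    k = [v / s for v in ph]                                       # eq. (15)
    innovation = y - sum(h[i] * x_pred[i] for i in range(n))
    x_new = [x_pred[i] + k[i] * innovation for i in range(n)]     # eq. (5)
    return x_new, k


# Assumed layout: state = [older speech, newest speech, newest noise].
phi = [[0.0, 1.0, 0.0], [-0.25, 0.5, 0.0], [0.0, 0.0, 0.9]]
h = [0.0, 1.0, 1.0]
b = [0.0, 1.0, 0.0]
p_pred = [[1.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 1.0]]
x_new, k = modified_kalman_step(phi, h, b, [0.0, 0.0, 0.0], p_pred, u=1.0, y=2.0)
speech_estimate = x_new[1]  # newest clean-speech sample
```

Because no R term is present, the updated speech and noise samples in this example sum exactly to the observation; the gain decides how the observation is split between them.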
The new Kalman filtering technique can not only be used for enhancement of speech recognition, but also to improve speech synthesis. With reference to
By reducing memory requirements at the front-end of the hardware, lower power operation may be enabled to increase the number of operations per watt. The hardware implementation of the speech enhancement algorithms in the front-end 301 provides an opportunity for achieving low power and also enables the use of a threshold detector 304 to provide a wake-up signal to the back-end of the processor hardware. The back-end 305 provides a hardware implementation of the speech recognition algorithms (e.g., HMM and/or neural network based), which is typically memory intensive and high performance. Thus, by dividing the hardware (e.g., SPE hardware) into a compute-intensive front-end and a high performance back-end, “voice-wake” and “always-listening” features may also be implemented for speech enhancement and recognition.
The IO section 410 comprises an audio processing section 412 and peripheral interface(s) 414. The peripheral interface(s) provide interfaces (e.g., PCI, USB) for communicating with and enabling various different peripheral devices 415 (keyboard, wireless interface, printer, etc.). The audio processing section 412 may receive various audio input/output (analog and/or digital) for providing/receiving audio content from a user. It may also communicate with internal modules, for example, to communicate audio between a user and a network (e.g., cell, Internet, etc.). The audio processing section 412 includes the various components (e.g., A/D/A converters, codecs, etc.) for processing audio as dictated by the functions of the platform 402. In particular, the audio processing section 412 includes an SPE 413, as discussed herein, for implementing speech processing. In particular, it may comprise a power efficient structure as described in
In the preceding description, numerous specific details have been set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures and techniques may not have been shown in detail in order not to obscure an understanding of the description. With this in mind, references to “one embodiment”, “an embodiment”, “example embodiment”, “various embodiments”, etc., indicate that the embodiment(s) of the invention so described may include particular features, structures, or characteristics, but not every embodiment necessarily includes the particular features, structures, or characteristics. Further, some embodiments may have some, all, or none of the features described for other embodiments.
In the preceding description and following claims, the following terms should be construed as follows: The terms “coupled” and “connected,” along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Rather, in particular embodiments, “connected” is used to indicate that two or more elements are in direct physical or electrical contact with each other. “Coupled” is used to indicate that two or more elements co-operate or interact with each other, but they may or may not be in direct physical or electrical contact.
The term “PMOS transistor” refers to a P-type metal oxide semiconductor field effect transistor. Likewise, “NMOS transistor” refers to an N-type metal oxide semiconductor field effect transistor. It should be appreciated that whenever the terms: “MOS transistor”, “NMOS transistor”, or “PMOS transistor” are used, unless otherwise expressly indicated or dictated by the nature of their use, they are being used in an exemplary manner. They encompass the different varieties of MOS devices including devices with different VTs, material types, insulator thicknesses, gate(s) configurations, to mention just a few. Moreover, unless specifically referred to as MOS or the like, the term transistor can include other suitable transistor types, e.g., junction-field-effect transistors, bipolar-junction transistors, metal semiconductor FETs, and various types of three dimensional transistors, MOS or otherwise, known today or not yet developed.
The invention is not limited to the embodiments described, but can be practiced with modification and alteration within the spirit and scope of the appended claims. For example, it should be appreciated that the present invention is applicable for use with all types of semiconductor integrated circuit (“IC”) chips. Examples of these IC chips include but are not limited to processors, controllers, chip set components, programmable logic arrays (PLA), memory chips, network chips, and the like.
It should also be appreciated that in some of the drawings, signal conductor lines are represented with lines. Some may be thicker, to indicate more constituent signal paths, have a number label, to indicate a number of constituent signal paths, and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines.
It should be appreciated that example sizes/models/values/ranges may have been given, although the present invention is not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the FIGS, for simplicity of illustration and discussion, and so as not to obscure the invention. Further, arrangements may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the platform within which the present invention is to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting.
Number | Date | Country | |
---|---|---|---|
20120004909 A1 | Jan 2012 | US |