The present disclosure relates to computer-implemented speech processing and more particularly to a speech feature extraction technique that improves performance of automatic speech recognizers in the presence of noise.
This section provides background information related to the present disclosure which is not necessarily prior art.
Computer-implemented, automatic speech recognizers today are essentially complex pattern recognition systems that compare the incoming speech utterance to a set of trained speech models stored within the memory of the recognizer or accessible to the recognizer via a communications link. The speech models are typically trained under controlled conditions by supplying a corpus of speech data (e.g., utterances from human subjects reading assigned text passages).
Once trained, the models are made available to the recognizer which processes input speech by testing how well the incoming speech matches each of the trained models. Typically recognition probability scores are generated for each model. Thus for a recognizer supplied with an incoming utterance, “cat,” the trained “cat” model might return a probability score of 98%; the trained “bat” model might return a probability score of 70%; and the “aardvark” model would likely return a recognition probability score of 0%. The foregoing is merely a simplified example to demonstrate the basic recognition concept. While recognizers can work with speech models trained to recognize specific words (as in this example), they can also be trained to recognize continuous speech, where the trained models are based on more fundamental sounds such as phonemes rather than words; they can also be trained to recognize different speakers' voices, where each speaker to be recognized provides training data that are used to train models for that speaker.
Some recognizers are also capable of adapting or improving the speech models while the system is being used. In such systems, the initially provided speech models are adapted to improve recognition probability scores, based on utterances received from users as the system is being used. Anyone who has used a speech recognizer for dictation will understand that these systems learn the user's unique speech patterns over time. What is actually happening behind the scenes is that the speech models are being adapted to that user's voice.
Speech recognizers work fairly well under optimal conditions, where the incoming speech is obtained under conditions similar to those used when the training data were collected. Variation from these optimal conditions can rapidly degrade recognition performance. Microphone placement (proximity to the user's mouth) and background noise are two factors that significantly affect the recognizer's performance. If a user utters words in a noisy environment, perhaps with less than optimal microphone placement (such as in a moving vehicle, or via a mobile phone in a noisy place), the recognition probability scores drop precipitously and recognition results suffer. Some systems attempt to compensate for poor recognition by resorting to additional or more computationally intensive recognition algorithms. Recognition performance may improve, but the time required to perform the recognition will likely increase. This is one reason why mobile phone-based recognition systems will sometimes take a long time to recognize a phrase that on other occasions they were able to recognize quickly.
As discussed more fully in this disclosure, there are several techniques that can be used to improve recognizer performance under difficult conditions such as in the presence of noise or when the communication channel is degraded (through poor microphone placement or other transmission loss). The present disclosure attacks the problem by improving the way the speech signals are processed to extract features that are used to train the speech models and then used to process the incoming speech.
Discussion of Feature Extraction
When human speech is processed so that an automatic speech recognizer can analyze it, the speech is captured in analog form by a microphone and then digitized by an analog-to-digital converter. This converts the human speech into a time-domain sequence of digital values representing the instantaneous waveform amplitude at each sample extracted by the analog-to-digital converter. In its native digitized form, the speech signal can be of any length, dictated by the duration of the utterance. Pattern recognition of a time-domain sequence of digital values of indeterminate length is an intractable problem. Therefore, to make pattern recognition possible, the digitized speech signal is first broken into units of predefined length. This process is known as “windowing.” Windowing breaks the digital data stream into smaller, fixed-length chunks that can be fed to the recognizer, one chunk at a time.
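By way of illustration, the following is a minimal sketch (in Python) of how a digitized utterance might be split into fixed-length, overlapping windowed frames. The 8 kHz sample rate, 25 ms frame length, 10 ms hop, and Hamming window are illustrative assumptions and are not values prescribed by this disclosure.

import numpy as np

def frame_signal(x, sample_rate=8000, frame_ms=25, hop_ms=10):
    """Split a digitized utterance x into fixed-length, Hamming-windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame (200 at 8 kHz)
    hop_len = int(sample_rate * hop_ms / 1000)       # samples between frame starts
    window = np.hamming(frame_len)                   # tapers the frame edges
    n_frames = 1 + max(0, (len(x) - frame_len) // hop_len)
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        start = i * hop_len
        frames[i] = x[start:start + frame_len] * window
    return frames

# Example: a one-second utterance at 8 kHz yields 98 frames of 200 samples each.
frames = frame_signal(np.random.randn(8000))
print(frames.shape)   # (98, 200)

Under these assumptions, each row of the returned array is one fixed-length chunk that can be handed to the feature extractor.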
However, it turns out that processing chunks of raw digital speech data in the time domain remains largely unsuccessful because even for the same word uttered several times, the raw digital speech data will vary significantly from utterance to utterance. Thus comparing utterance A with utterance B on the basis of individual raw digital speech data points is not effective. Speech recognizer systems deal with this by extracting “features” from the raw digital speech data. The goal is to identify features that are effective in discriminating utterance A from utterance B, while reducing the number of comparisons that need to be performed. Many speech recognizers today are based on extracted features known as “cepstral coefficients.”
As will be more fully described below, the present disclosure seeks to improve automatic speech recognition and automatic speech recognizers by utilizing a new way of extracting features from the speech signal.
Therefore, to reiterate, unlike human audition, the performance of speech-based recognition systems degrades significantly in the presence of noise and background interference. This can be attributed to an inherent mismatch between the training and deployment conditions, especially when the characteristics of all possible noise sources are not known in advance. Several strategies have therefore been proposed in the literature to reduce the effect of this mismatch. They can be broadly categorized into three main groups: 1) speech enhancement techniques that can filter out the noise in the spectral or temporal domain; 2) robust feature extraction techniques that can generate speech features that are invariant to channel conditions; and 3) back-end adaptation techniques that can reduce the effect of training-deployment mismatch by adjusting the parameters of a statistical recognition model. Even though significant improvements in recognition performance can be expected from the third approach, the overall system performance is still limited by the quality of the speech features. Therefore, this disclosure focuses on extraction of speech features that are robust to mismatch between training and testing conditions.
Traditionally, speech features used in most of the state-of-the-art speech recognition systems have relied on spectral-based techniques which include Mel-frequency cepstral coefficients (MFCCs), linear predictive coefficients (LPCs), and perceptual linear prediction (PLP). Noise-robustness is achieved by modifying these well-established techniques to compensate for channel variability. For example, cepstral mean normalization (CMN) and cepstral variance normalization adjust the mean and variance of the speech features in the cepstral domain to reduce the effect of convolutive channel distortion. Another example is the Relative spectra (RASTA) technique which suppresses the acoustic noise by high-pass (or band-pass) filtering of the log-spectral representation of speech. More recently, advanced signal processing techniques like the feature-space non-linear transformation techniques, the ETSI advanced front end (AFE), stereo-based piecewise linear compensation (SPLICE) and power-normalized cepstral coefficients (PNCC), have been used to improve the noise-robustness. The AFE approach, for example, integrates several methods to remove the effects of both additive and convolutive noises. A two-stage Mel-warped Wiener filtering, combined with an SNR-dependent waveform processing is used to reduce the effect of additive noise and a blind equalization technique is used to mitigate the channel effects.
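As a concrete illustration of the simplest of these compensation steps, cepstral mean normalization amounts to subtracting the per-utterance mean of each cepstral coefficient (and cepstral variance normalization additionally divides by the standard deviation). The sketch below assumes the features are already arranged with one row per frame; it is illustrative only.

import numpy as np

def cmvn(cepstra, normalize_variance=False):
    """Cepstral mean (and optional variance) normalization.

    cepstra: array of shape (n_frames, n_coefficients), e.g. MFCC features.
    Subtracting the per-utterance mean removes a constant convolutive channel
    offset in the log-cepstral domain.
    """
    out = cepstra - cepstra.mean(axis=0)
    if normalize_variance:
        out = out / (cepstra.std(axis=0) + 1e-8)   # guard against zero variance
    return out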
An alternate and promising approach towards extracting noise-robust speech features is to use data-driven statistical learning techniques that do not make strict assumptions about the spectral properties of the speech signal. Examples include kernel-based techniques, which operate under the premise that the robustness of the speech signal is encoded in high-dimensional temporal and spectral manifolds that remain intact even in the presence of ambient noise; the objective of the feature extraction procedure is then to identify the parameters of the noise-invariant manifold. The procedure used in a standard kernel-based technique required solving a quadratic optimization problem for each frame of speech, which made the data-driven approach highly computationally intensive. Also, due to their semi-parametric nature, the methods proposed in prior systems did not incorporate any a priori information available from neurobiological and psycho-acoustical studies, which has been shown to be important for speech recognition. More recently, it has been demonstrated that cortical neurons use highly efficient and sparse encoding of visual and auditory signals. It has been shown that auditory signals can be sparsely represented by a group of basis functions which are functionally similar to gammatone functions, which are equivalent to time-domain representations of human cochlear filters and are also used in psycho-acoustical studies. Other neurobiological studies have proposed a hierarchical auditory processing model consisting of spectro-temporal receptive fields (STRFs) that capture information embedded at different frequency, spectral, and temporal scales. The results from many of these recent neurobiological and psycho-acoustical studies are being incorporated in small-scale speech recognition systems.
This section provides a general summary of the disclosure, and is not a comprehensive disclosure of its full scope or all of its features.
Departing from the conventional cepstral coefficient techniques, the disclosed method and apparatus provides a computationally efficient, hierarchical auditory feature extraction method and apparatus that uses a transformation technique, such as a non-linear reproducing kernel Hilbert space (RKHS) transformation of gammatone basis functions.
More specifically, the method and apparatus processes the time domain speech signal, digitally represented as a vector of a first dimension, and converts that vector into a speech feature vector that has advantageous properties when compared with conventional cepstral coefficient-based feature vectors.
The method operates on the time domain speech signal, stored in memory of a processor. A set of gammatone basis functions, represented as a set of gammatone basis vectors of the first dimension are also stored in the memory of the processor. The processor applies a reproducing kernel function to transform the stored gammatone basis vectors and the stored speech signal to a higher dimensional space. Then, using the processor, a set of similarity vectors is computed in said higher dimensional space based on the stored gammatone basis vectors and the stored speech signal. The processor then applies an inverse function to transform the set of similarity vectors in said higher dimensional space to a set of similarity vectors of the first dimension, and then selects one of the set of similarity vectors of the first dimension as a processed representation of said speech signal.
The transformation from the higher dimensional space back to the first dimension effects a nonlinear transformation. The nonlinear transformation and the use of gammatone basis functions thus generate an extracted speech feature vector that represents many of the nuances of human speech better than conventional cepstral coefficients. The higher dimensional space may be a Hilbert space, in which case the transformation is a reproducing kernel Hilbert space (RKHS) transformation. To reduce the computational burden on the processor, the transformation may be performed by precomputing and storing in memory a transformation matrix and using the transformation matrix to perform the step of applying an inverse function.
In addition to the foregoing steps and operations, the method and apparatus may additionally apply a regularization parameter that penalizes large similarity values to enhance robustness of the processed representation of said speech signal in the presence of noise. The method and apparatus may also perform the step of selecting one of said set of similarity vectors by applying a winner-take-all function. In addition, the method and apparatus may further use the processor to apply a compressive weighting function to the selected one of said set of similarity vectors. The compressive weighting function may be configured to enhance the resolution at low similarity scores and reduce the resolution at high similarity scores. The method and apparatus may further apply a feature pooling function to the selected one of said set of similarity vectors. The method and apparatus may further perform the step of sparsifying the selected one of the set of similarity vectors to reduce its dimensionality. The sparsifying operation may be configured to reduce dimensionality to a predetermined dimensionality corresponding to the requirements of a predetermined speech recognizer. Additionally, the processor may be programmed to decorrelate the selected one of the set of similarity vectors, as by applying a discrete cosine transform. The processor may also be programmed to compute at least one of velocity coefficients and acceleration coefficients and to append said at least one of velocity coefficients and acceleration coefficients to said selected one of said set of similarity vectors.
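To make the last two operations concrete, the following sketch shows one common way to decorrelate a pooled feature vector with a discrete cosine transform and to append velocity (delta) and acceleration (delta-delta) coefficients. The choice of 13 retained coefficients and the simple frame-to-frame gradient used for the deltas are illustrative assumptions rather than requirements of this disclosure.

import numpy as np
from scipy.fftpack import dct

def postprocess(features, n_keep=13):
    """Decorrelate per-frame features with a DCT and append delta coefficients.

    features: array of shape (n_frames, n_channels), one row per analysis frame.
    Returns an array of shape (n_frames, 3 * n_keep): static, velocity, acceleration.
    """
    static = dct(features, type=2, norm='ortho', axis=1)[:, :n_keep]
    velocity = np.gradient(static, axis=0)        # frame-to-frame slope
    acceleration = np.gradient(velocity, axis=0)
    return np.hstack([static, velocity, acceleration])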
Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
The drawings described herein are for illustrative purposes only of selected embodiments and not all possible implementations, and are not intended to limit the scope of the present disclosure.
Corresponding reference numerals indicate corresponding parts throughout the several views of the drawings.
Example embodiments will now be described more fully with reference to the accompanying drawings.
In this disclosure, we describe a computationally efficient hierarchical auditory feature extraction model using an RKHS based statistical learning approach. The model is summarized in
The description below is organized as follows. Section I gives an overview of an exemplary automatic speech recognizer. The recognizer may be implemented using the SPARK features described herein. Section II describes the mathematical basis underlying the SPARK feature extraction algorithm. Section III presents experimental results summarizing the effect of different hyper-parameters and kernel functions when SPARK features are evaluated for a speech recognition task using the AURORA2 corpus. Section IV discusses some further extensions of the SPARK technique. Section V concludes the disclosure with a discussion of how a SPARK feature extractor may be implemented using a suitable processor or set of processors. Before we present the SPARK algorithm we summarize some of the mathematical notations that will be used in this disclosure:
Section I. Exemplary Automatic Speech Recognition System
Referring to
The output of the feature extractor 10 is used first during training, to train the speech models 14. The output of the feature extractor 10 is subsequently used to convert incoming speech to be translated into the parameterized form used by the pattern classifier 12 during recognition. For illustration purposes the speech models 14 may be implemented as Hidden Markov Models (HMM) where the speech unit (phoneme, word, etc.) is represented by a set of states (shown as circles) and transitions (shown as arrows), each having an associated probability distribution. The HMM can be seen as a production model in which each transition corresponds to the emission of a speech frame or feature vector. To each state a probability distribution is assigned, representing the probability of producing an event. To each transition a probability distribution is also assigned, representing the probability of transitioning from that state to another state (or back to the same state).
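For readers unfamiliar with HMM scoring, the following generic sketch shows how a sequence of feature vectors can be scored against one trained model with the forward algorithm. It is a textbook illustration rather than the specific recognizer of this disclosure; the assumptions that state 0 is the entry state and that any state may terminate the utterance are simplifications made for brevity.

import numpy as np

def log_forward_score(log_trans, log_emit):
    """Log-likelihood of a feature sequence under one HMM (forward algorithm).

    log_trans: (S, S) array of log transition probabilities between the S states.
    log_emit:  (T, S) array; log probability of each of the T frames under each state.
    Assumes state 0 is the entry state and any state may end the utterance.
    """
    alpha = log_trans[0] + log_emit[0]            # initialize from the entry state
    for t in range(1, len(log_emit)):
        m = alpha.max()                           # manual log-sum-exp for stability
        alpha = m + np.log(np.exp(alpha - m) @ np.exp(log_trans)) + log_emit[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())    # total log-likelihood

The recognizer would evaluate such a score for every trained model and pass the results to the decision stage.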
The pattern classifier 12 computes a similarity measure between the input speech and each reference pattern represented by the trained models 14. The classifier process defines a local measure of closeness between feature vectors. The classifier also aligns two speech patterns so that they may be compared notwithstanding that they may differ in duration and rate of speaking.
The output of pattern classifier 12 is coupled to the decision processor 16 which selects the “closest” reference pattern based on decision rules that take into account the results of the similarity measurements (e.g., recognition probability scores). The decision processor 16 produces a recognition output 18 which may include a text-based representation of the recognized utterance, and/or an identification or verification of the speaker's identity, for example.
The feature extractor 10, pattern classifier 12 and decision processor 16 may be implemented using a programmed processor or computer 20 with associated computer-readable memory 22 which is configured to store the trained models 14. If desired, the functionality represented by the feature extractor 10, pattern classifier 12 and decision processor 16 may be implemented by separate processors or computers that communicate with one another over a suitable communications link, such as the Internet. For example, the feature extractor 10 may be implemented using a processor within a mobile phone, while the pattern classifier 12 and trained models 14 may be implemented using a processor located within a server coupled to the mobile phone by the telecommunications infrastructure. In such an embodiment the decision processor may be implemented either on the server or on the processor within the mobile phone.
The SPARK feature extraction algorithm implemented by the preferred feature extractor 10 will now be described with reference to
Section II. SPARK Feature Extraction Algorithm
In this section, we describe the mathematics underlying the SPARK feature extraction procedure. The first part of this analysis will involve deriving the mathematical form of the SPARK similarity functions based on RKHS regression techniques. For the analysis presented in this section, we will assume that a frame of speech signal is extracted using an appropriate windowing function (Hamming or Hanning).
A. SPARK Similarity Functions
As shown in
φ_m[n] = a_m n^(θ−1) cos(2π f_m n) e^(−πβ·ERB(f_m)·n)   (1)
where f_m is the center frequency parameter, a_m is the amplitude, θ is the order of the gammatone basis, and β is a parameter which controls the decay of the envelope together with ERB(·), a monotonic frequency-dependent function called the equivalent rectangular bandwidth (ERB) scale. One possible form of ERB(f_m), which has been used in this disclosure, takes the form
ERB(f_m) = 0.108 f_m + 24.7.   (2)
Also, in this disclosure we have chosen θ=4 and β=1.019.
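The following sketch illustrates how a bank of discretized gammatone basis vectors might be constructed from (1) and (2) with θ = 4 and β = 1.019. The sample rate, the set of center frequencies, the interpretation of the discrete index n in seconds, and the unit-norm choice for the amplitude a_m are illustrative assumptions, not values fixed by this disclosure.

import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth scale of equation (2)."""
    return 0.108 * f + 24.7

def gammatone_basis(center_freqs_hz, n_samples=200, sample_rate=8000,
                    theta=4, beta=1.019):
    """Discretized gammatone basis functions following equation (1).

    Returns an array of shape (M, n_samples); each row is one basis vector
    phi_m of length n_samples (the frame length P). The amplitude a_m is
    chosen here so that each basis vector has unit L2 norm.
    """
    t = np.arange(1, n_samples + 1) / sample_rate          # index n in seconds
    basis = np.empty((len(center_freqs_hz), n_samples))
    for m, fm in enumerate(center_freqs_hz):
        envelope = t ** (theta - 1) * np.exp(-np.pi * beta * erb(fm) * t)
        basis[m] = envelope * np.cos(2 * np.pi * fm * t)
        basis[m] /= np.linalg.norm(basis[m])               # a_m: unit-norm scaling
    return basis

# Example: M = 32 channels with center frequencies spread over the telephone band.
prototypes = gammatone_basis(np.linspace(100, 3800, 32))   # shape (32, 200)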
We will compactly represent the discrete-time gammatone function φ_m[n − τ_{l,m}] as a vector φ_{l,m} ∈ ℝ^P, and correspondingly the similarity function will be written s(φ_{l,m}, x). We now define a discrete-time waveform f[n], n = 1, ..., P, which is constructed using the time-shifted basis functions according to

f[n] = Σ_{m=1}^{M} Σ_{l=1}^{L} s(φ_{l,m}, x) φ_m[n − τ_{l,m}]   (3)
Our objective will be to determine the form of the similarity functions s(φl,m,x) by ensuring that the waveform ƒ[n] is close to the speech waveform x[n] according to some optimization criterion.
Before we present the optimization function, we rewrite the time-domain expressions in a matrix-vector notation as
f = Φs   (4)
where f ∈ ℝ^P, s ∈ ℝ^(LM) is a vector given by s = [s_{1,1}, s_{1,2}, ..., s_{L,M}]^T with its elements given by s_{l,m} = s(φ_{l,m}, x), and Φ ∈ ℝ^(P×LM) is a matrix whose columns are the time-shifted basis vectors, Φ = [φ_{1,1}, ..., φ_{L,M}].
The optimization procedure for SPARK features involves minimizing a cost function C with respect to the similarity vector s, where C is given by

C = λ‖s‖² + ‖x − Φs‖²   (5)
The first part of the cost function acts as a regularizer which penalizes large values of s_{l,m}, thus favoring similarity measures that are smooth (or, equivalently, penalizing high-frequency components of the similarity function). The second part of the cost function C is the least-square error computed between the speech vector and the reconstructed waveform f[n]. The hyper-parameter λ in C controls the tradeoff between achieving a lower reconstruction error and obtaining a smoother similarity function. Equating the derivative of C with respect to s to zero
leads to
Φ^T x = (Φ^T Φ + λI) s   (6)
where I denotes an identity matrix. The optimal s* can be found to be
s* = [Φ^T Φ + λI]^(−1) Φ^T x   (7)
Equation (7) shows that the optimal similarity function s* is expressed in terms of inner products between different time-shifted gammatone bases, Φ^T Φ = {φ_{l,m} · φ_{u,v}}, l, u = 1, ..., L; m, v = 1, ..., M, and between the time-shifted gammatone bases and the input speech vector, Φ^T x = {φ_{l,m} · x}. Thus the similarity function admits a linear form and involves computing only inner products. We extend this framework to a more general, nonlinear form of similarity functions by converting the inner products in (7) into kernel expansions over the gammatone and speech vectors.
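Before turning to the kernel form, the linear similarity vector of (7) can be sketched directly. The function below assumes the time-shifted gammatone basis vectors are stacked as the columns of Φ, and the regularization value is illustrative.

import numpy as np

def linear_similarity(Phi, x, lam=0.01):
    """Similarity vector s* of equation (7): s* = (Phi^T Phi + lam*I)^-1 Phi^T x.

    Phi: (P, L*M) matrix whose columns are time-shifted gammatone basis vectors.
    x:   (P,) windowed speech frame.
    """
    G = Phi.T @ Phi                                    # Gram matrix of inner products
    A = np.linalg.inv(G + lam * np.eye(G.shape[0]))    # depends only on the basis
    return A @ (Phi.T @ x)

Because A depends only on the basis, it can be computed once and reused for every frame, which is the same precomputation observation made for the kernel form below.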
We introduce a nonlinear transformation function ψ: ℝ^P → ℝ^D, D ≫ P, which will map the vectors x and φ_{l,m} to a higher dimensional space according to x → ψ(x) and φ_{l,m} → ψ(φ_{l,m}). The high-dimensional mapping could consist of cross-correlation terms, for example, (x[1], x[2], ..., x[P]) → (x[1], x[2], {x[1]}², {x[2]}², {x[1]x[2]}, ...), which capture nonlinear attributes of the speech signal. Thus, extending (4) to the high-dimensional space, the reconstruction function f ∈ ℝ^D can be written as
f = ψ(Φ)s   (8)
where ψ(Φ) ∈ ℝ^(D×LM) is a matrix given by ψ(Φ) = [ψ(φ_{1,1}), ..., ψ(φ_{L,M})]. Then, following the regression procedure described above, the similarity function can be expressed in terms of inner products in the higher dimensional space according to

s* = [ψ(Φ)^T ψ(Φ) + λI]^(−1) ψ(Φ)^T ψ(x)   (9)
Unfortunately, computing inner products directly in the high-dimensional space is computationally intensive. The use of reproducing kernels avoids this “curse of dimensionality” by avoiding direct inner-product computation. For example, consider a nonlinear mapping ψ of a two-dimensional vector y ∈ ℝ² such that the product between two vectors y, z ∈ ℝ² in the high-dimensional space can be expressed as ψ(y)·ψ(z) = (1 + y·z)², which requires computing inner products only in the low-dimensional space and hence is more computationally tractable. In general, any symmetric positive-definite function K(·,·) (also referred to as a reproducing kernel function) can be expressed as K(x,y) = ψ(x)·ψ(y) and hence can be used in (9). In the literature, many forms of reproducing kernels have been reported, including the Gaussian radial basis function and the polynomial spline function. In neurophysiology, kernel functions have also been used for computing similarity measures in neural responses. Equation (9) can be expressed in terms of kernels as
s* = (K + λI)^(−1) K(Φ, x)   (10)
where K ∈ ℝ^(LM×LM) is an RKHS kernel matrix with elements K(φ_{l,m}, φ_{u,v}). Thus, a generic form of the RKHS-based similarity function can be expressed as
s(φ_{l,m}, x) = (K + λI)^(−1) K(φ_{l,m}, x)   (11)
Note that the matrix inverse in (11) involves only the gammatone basis and hence can be precomputed and stored. Thus, the computation of the SPARK similarity metric involves computing kernels and a matrix-vector multiplication which can be made computationally efficient.
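A sketch of this kernel-based computation is given below, using the second-order inhomogeneous polynomial kernel from the two-dimensional example above. The regularization value and the kernel choice are illustrative; the basis-dependent inverse is computed once and reused for every incoming frame, as noted in the text.

import numpy as np

def poly_kernel(A, B, degree=2):
    """K(a, b) = (1 + a . b)^degree, evaluated for all row/column pairs of A and B."""
    return (1.0 + A @ B) ** degree

def precompute_spark_operator(Phi, lam=0.01, degree=2):
    """Precompute (K + lam*I)^-1 from the gammatone basis alone (equation 10).

    Phi: (P, L*M) matrix of time-shifted gammatone basis vectors (columns).
    """
    K = poly_kernel(Phi.T, Phi, degree)              # (LM, LM) kernel Gram matrix
    return np.linalg.inv(K + lam * np.eye(K.shape[0]))

def spark_similarity(Phi, x, inv_op, degree=2):
    """Similarity vector s* = (K + lam*I)^-1 K(Phi, x) for one speech frame x."""
    k_x = poly_kernel(Phi.T, x[:, None], degree)[:, 0]   # kernel between basis and frame
    return inv_op @ k_x

# The kernel trick behind the two-dimensional example: with
# psi(y) = [1, sqrt(2)*y1, sqrt(2)*y2, y1**2, y2**2, sqrt(2)*y1*y2],
# the high-dimensional inner product psi(y).psi(z) equals (1 + y.z)**2.
psi = lambda v: np.array([1, np.sqrt(2)*v[0], np.sqrt(2)*v[1],
                          v[0]**2, v[1]**2, np.sqrt(2)*v[0]*v[1]])
y, z = np.array([0.3, -1.2]), np.array([0.5, 0.7])
assert np.isclose(psi(y) @ psi(z), (1 + y @ z) ** 2)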
B. Feature Pooling
An important consequence of projecting the speech signal onto a gammatone function space (emulating the auditory STRFs) is that the highest scores (in the ‖·‖₂ sense) in the similarity metric vector s will capture the salient, higher-order, spectro-temporal aspects of the speech signal. On the other hand, the low-energy components of s will also capture similarities to noise and channel artifacts. Feature pooling serves two purposes. First, it introduces competitive masking, where only the largest similarity score is chosen. This function emulates the local competitive behavior which has been observed in auditory receptive fields. The second purpose of feature pooling is to introduce a compressive weighting function (similar to psycho-acoustical responses) which enhances the resolution at low similarity scores and reduces the resolution at high similarity scores. Mathematically, the output b_m, m = 1, ..., M, resulting from feature pooling is given by

b_m = ζ( max_{l=1,...,L} |s(φ_{l,m}, x)| )   (12)
where ζ(·) is the compressive weighting function, which could be a logarithmic function log(·) or a power function (·)^(1/p), p > 1. Note that the pooling is performed over the set consisting of time-shifted bases obtained from the same gammatone function.
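A sketch of this pooling stage is shown below. It assumes the similarity vector is ordered as s = [s_{1,1}, s_{1,2}, ..., s_{L,M}] (the channel index m varying fastest), takes the winner-take-all maximum over the L time shifts of each channel, and applies a power-law compressive weighting; the exponent p = 15 mirrors the power function reported in the experiments below but is otherwise an assumption.

import numpy as np

def pool_features(s, L, M, p=15):
    """Feature pooling of Section II-B: winner-take-all plus compressive weighting.

    s: (L*M,) similarity vector ordered as [s_{1,1}, s_{1,2}, ..., s_{L,M}],
       i.e. the channel index m varies fastest and the shift index l slowest.
    Returns b: (M,) pooled outputs b_m = zeta(max over the L shifts), with
    zeta(u) = u**(1/p) as the compressive weighting function.
    """
    S = np.abs(s).reshape(L, M)         # rows: time shifts l, columns: channels m
    winners = S.max(axis=0)             # winner-take-all across shifts per channel
    return winners ** (1.0 / p)         # compressive power-law weighting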
C. SPARK Feature Extraction Signal-Flow
The flow-chart describing the complete SPARK feature extraction procedure is presented in
Section III. Experiments and Performance Evaluation
A. Experimental Setup
We have evaluated the SPARK features for the task of noise-robust speech recognition using the AURORA2 dataset. The AURORA2 task involves recognizing English digits in the presence of additive noise and convolutional noise. The task consists of three types of test sets. The first test set (set A) contains 4 subsets of 1001 utterances corrupted by subway, babble, car, and exhibition hall noises, respectively, at different SNR levels. The second set (set B) contains 4 subsets of 1001 utterances corrupted by restaurant, street, airport, and train station noises at different SNR levels. The test set C contains 2 subsets of 1001 sentences, corrupted by subway and street noises and was generated after filtering the speech with an MIRS filter before adding different types of noise.
For all the experiments reported in this paper, a hidden Markov model (HMM)-based speech recognizer has been used. The HMM recognizer was implemented using the hidden Markov toolkit (HTK) package. For each digit a whole word HMM was trained with 16 states per HMM and with three diagonal Gaussian mixture components per state. Additional HMMs were trained for the “sil” and “sp” models.
Next, we summarize the effect of different algorithmic hyper-parameters on the performance of a SPARK-based recognition system.
B. Effect of the Time-Shift Resolution
As described in Section II and shown in
C. Effect of Different Kernel Functions
The generic form of the similarity function s(·,·) is given by (11) and is dependent on the choice of the kernel function K(·,·). In this experiment, we evaluated the effect of different types of RKHS functions on the recognition performance of the SPARK based system. The results are summarized in Table II for the following kernel functions: (a) linear, K(x,y) = x·y; (b) exponential, K(x,y) = exp(c x·y); (c) sigmoid, K(x,y) = tanh(a x·y + c); and (d) polynomial, K(x,y) = (x·y)^d. The results show that the choice of the kernel function affects the recognition performance, specifically when compared to the case when the linear kernel is used. The improvements in performance demonstrate the utility of exploiting nonlinear features in speech to achieve noise-robustness. Note that the best performance is obtained for a fourth-order polynomial kernel when we fixed ζ(·) = (·)^(1/15).
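For reference, the four kernel families compared in Table II can be written as simple functions of two vectors; the default constants below echo values mentioned elsewhere in this disclosure (a = 0.01 and c = −0.01 for the sigmoid, d = 4 for the polynomial) and are otherwise illustrative.

import numpy as np

kernels = {
    "linear":      lambda x, y: x @ y,
    "exponential": lambda x, y, c=0.01: np.exp(c * (x @ y)),
    "sigmoid":     lambda x, y, a=0.01, c=-0.01: np.tanh(a * (x @ y) + c),
    "polynomial":  lambda x, y, d=4: (x @ y) ** d,
}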
D. Effect of Compressive Weighting Function
The compressive weighting function, as described in Section II-B, amplifies the lower values and de-amplifies larger values of the similarity metric. Table III summarizes the effect of different polynomial weighting functions on the performance of the SPARK-based speech recognition system (for K(x,y) = tanh(0.01 x·y − 0.01)). The results indicate an optimal order of the weighting function that yields the best recognition performance.
E. Effect of Parameter λ
Parameter λ is the regularization parameter which penalizes large values of the similarity metric and, in the process, makes the solution in (11) more stable. Table IV summarizes the effect of λ on the recognition performance; the results show that solutions which penalize large values of s yield better recognition performance under noisy conditions.
F. Comparison with the Basic ETSI Front-End (MFCC)
The accuracy of the SPARK-based recognition system has been compared against the baseline speech features extracted using the ETSI STQ WI007 DSR front-end. The basic ETSI front-end generates the 39-dimensional MFCC features without any cepstral mean normalization (CMN).
G. Comparison with Gammatone Filter-Bank Based Features
The objective of the next set of experiments was to compare the SPARK features with gammatone filter-bank based features. The signal flow for the gammatone filter-bank features is shown in
H. Comparison with ETSI AFE
The last set of experiments compared the SPARK features to the state-of-the-art ETSI AFE front-end. The ETSI AFE uses noise estimation, two-pass Wiener filter-based noise suppression, and blind feature equalization techniques. To incorporate an equivalent noise-compensation into the SPARK features, we used a power bias subtraction (PBS) method. The PBS method resembles conventional spectral subtraction (SS) in some respects, but instead of estimating the noise from non-speech segments, which usually requires a very accurate voice activity detector (VAD), PBS simply subtracts a bias that is adaptively computed based on the level of the background noise. Tables VI and VII compare the performance of the ETSI AFE and SPARK+PBS (λ=0.01) recognition systems under different types of noise. Even though for Set A the performance improvement of the SPARK+PBS system over the ETSI AFE system is not statistically significant, for Set B and Set C the SPARK+PBS system consistently outperforms the ETSI AFE for all types of noise except subway and exhibition noise at low SNR. In fact, SPARK shows an overall relative improvement of 4.69% with respect to the ETSI AFE, which is statistically significant.
Table VIII shows a comparative performance of SPARK+PBS features against the basic ETSI FE, the conventional gammatone filterbank, and the ETSI AFE. Even under clean recording conditions, SPARK+PBS demonstrates improvement over the baseline ETSI AFE system, but the advantage of SPARK+PBS features becomes more apparent under noisy conditions.
Section IV. Extending the SPARK Technique
In this disclosure, we have presented a framework for extracting noise-robust speech features called sparse auditory reproducing kernel (SPARK) coefficients. The approach follows a computationally efficient hierarchical model where parallel similarity functions (emulating neurobiologically inspired auditory receptive fields) are computed, followed by a pooling method (emulating neurobiologically inspired local competitive behavior). In this disclosure, we have derived an optimal form of the similarity functions which uses reproducing kernels to capture the nonlinear information embedded in the speech signal. Experimental results obtained for the AURORA2 speech recognition tasks demonstrate the following:
Under clean recording conditions, the performance of both baseline MFCC and SPARK based systems are comparable with a recognition accuracy of 99.25%. The result is consistent with other state-of-the-art results reported for the AURORA2 dataset.
The SPARK features demonstrate a more robust performance in the presence of both additive and convolutive noise. We have demonstrated that SPARK can achieve average word recognition rates of 80.38%, 81.24%, and 78.52% for sets A, B, and C of the AURORA2 corpus. We have also shown that for the AURORA2 task, SPARK features combined with the PBS technique consistently out-performs the state-of-the-art ETSI AFE based features.
A possible extension to this work, to further improve noise-robustness, is to incorporate an L1 metric instead of an L2 metric in the regression framework (5). We anticipate that this procedure, even though it is more computationally intensive, could lead to more noise-robust speech features.
Section V. Processor Implementation of the SPARK Feature Extractor
As discussed above, the SPARK feature extractor 10 applies a similarity function to compare the incoming speech to a set of time-shifted gammatone kernels. Referring to
A property of the reproducing kernel function 26 is that it transforms the input data into a higher-dimensional space, effecting a non-linear transformation in the process. As discussed above, this non-linearity is a desirable property because it modifies the gammatone waveforms to more closely model the properties of human hearing.
The reproducing kernel function 26 is then transformed back into the lower-dimensional space by multiplying it by an inverse matrix shown in dashed lines at 32. The inverse matrix comprises two components, a reproducing kernel Hilbert space (RKHS) kernel matrix 34 and an optimization parameter 36, implemented by applying the regularization parameter discussed in Section III-E above to the identity matrix 38. Multiplying the reproducing kernel function 26 with the inverse matrix 32 transforms the resulting matrix back to the original lower-dimensional space.
Note that while the reproducing kernel function 26 receives both the gammatone functions 28 and the input speech 30 as inputs, the inverse matrix requires only the gammatone functions 28 (which are supplied as inputs to the RKHS kernel matrix 34). This means that the entire inverse matrix can be precomputed (before any input speech is received). The precomputed values of the inverse matrix 32 are stored in memory 22 (
With this understanding of the similarity function 24, refer now to
At this point the output represents a set of similarity value gammatone basis-speech vector products. A winner-takes-all function is then applied at 46, to select one of the set of products that represents the largest output. This is referred to in the above discussion as the MAX operation. After making the winner-takes-all selection, the resulting output is a single vector, of the same dimensionality as the input speech signal. However, whereas the input speech signal corresponded to time domain parameters, the output of the winner-takes-all function is a raw SPARK vector. The original time-domain speech signal has been transformed into non-linear, time-shifted gammatone similarity parameters.
In many applications it is helpful to further reduce the dimensionality of the speech representation. Thus the processor is programmed to apply a compressive weighting function at 48. This weighting function is discussed above in Section II-B on Feature Pooling. After applying the compressive weighting function, the SPARK speech parameters are improved by enhancing the resolution at low similarity scores while reducing resolution at high similarity scores.
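Putting the earlier sketches together, one illustrative per-utterance pass might look as follows. Every function name used here (frame_signal, gammatone_basis, precompute_spark_operator, spark_similarity, pool_features, postprocess) refers to the hypothetical sketches given earlier in this document rather than to components of the disclosure itself, and the frame length, number of channels, number of time shifts, and shift spacing are all assumptions made for illustration.

import numpy as np

P, L, M = 200, 4, 32
prototypes = gammatone_basis(np.linspace(100, 3800, M), n_samples=P)   # (M, P)

# Build L time-shifted (zero-padded) copies of each prototype as columns of Phi.
shifts = np.linspace(0, P // 2, L, dtype=int)
Phi = np.zeros((P, L * M))
for l, tau in enumerate(shifts):
    for m in range(M):
        Phi[tau:, l * M + m] = prototypes[m, :P - tau]

inv_op = precompute_spark_operator(Phi, lam=0.01)    # computed once, offline
frames = frame_signal(np.random.randn(8000))         # stand-in for a real utterance
pooled = np.array([pool_features(spark_similarity(Phi, f, inv_op), L, M)
                   for f in frames])                 # (n_frames, M) pooled scores
spark_features = postprocess(pooled)                 # DCT decorrelation + deltas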
The remaining steps shown in
In a mobile device application, such as in a mobile phone application, the SPARK features may be computed using the onboard processor of the mobile device, running as a background application or a thread. Alternatively, a separate digital signal processing circuit (DSP) can be included in the mobile device to compute the SPARK features. If desired, the features may be generated using an analog embodiment whereby analog bandpass filters are used to generate the features. An application specific integrated circuit (ASIC) can be used to implement this.
The SPARK features may be computed or generated in the mobile device and then sent wirelessly to an Internet-based or cloud-based server system for further recognition processing. If desired, the SPARK features can be used for speaker identification, so that the speaker's voice can be used to authenticate himself or herself to the mobile device. In this regard, speaker identification or authentication can serve as a way for a user to activate the mobile device without the need to manually type a pass phrase or password. The ability to enter such authentication or identification information by voice is particularly advantageous with mobile devices, such as watches or other small devices worn on the user's body, that do not have large touchscreens or keypads for pass phrase or password entry.
The foregoing description of the embodiments has been provided for purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure. Individual elements or features of a particular embodiment are generally not limited to that particular embodiment, but, where applicable, are interchangeable and can be used in a selected embodiment, even if not specifically shown or described. The same may also be varied in many ways. Such variations are not to be regarded as a departure from the disclosure, and all such modifications are intended to be included within the scope of the disclosure.
Related U.S. Application Data: Provisional Application No. 61/643,550, filed May 2012, United States.