PULSE-BASED AUTOMATIC SPEECH RECOGNITION

BACKGROUND

An automatic speech recognition (ASR) system translates a recorded audio signal into text. At its core, this is a pattern classification problem. However, both the nonstationarity of the signal and the large variation in the temporal dimension of the speech feature sequences prevent classical classifiers (such as Bayes or nearest-neighbor classifiers) and state-of-the-art classifiers (such as support vector machines, or SVMs), which are limited to static patterns or fixed-dimension inputs, from being applied in a straightforward manner.

SUMMARY

Embodiments of the present disclosure are related to speech recognition.

In one embodiment, among others, a speech recognition method comprises: converting an auditory signal into a pulse train; segmenting the pulse train into a series of frames having a predefined duration; and identifying a portion of the auditory signal by applying at least a portion of the series of frames segmented from the pulse train to a kernel adaptive autoregressive-moving-average (KAARMA) network. In one or more aspects of these embodiments, the pulse train can be transformed into a reproducing kernel Hilbert space (RKHS) using a spike kernel. The KAARMA network can generate a state that identifies the portion of the auditory signal. The portion of the auditory signal can be a phoneme, a triphone, or a word.

In one or more aspects of these embodiments, the speech recognition method can comprise identifying a word by applying the series of frames segmented from the pulse train to a KAARMA chain including the KAARMA network. The word can be identified based at least in part upon a plurality of states generated by KAARMA networks of the KAARMA chain. The portion of the series of frames can be applied to the KAARMA network in a natural left-to-right temporal sequence. The portion of the series of frames can be applied to the KAARMA network in a reversed right-to-left temporal sequence.

In one or more aspects of these embodiments, the portion of the auditory signal can be identified by applying the portion of the series of frames to the KAARMA network in a natural left-to-right temporal sequence and applying the portion of the series of frames to a second KAARMA network in a reversed right-to-left temporal sequence. The portion of the auditory signal can be identified based upon a first state generated by the KAARMA network and a second state generated by the second KAARMA network. The auditory signal can be converted into a plurality of pulse trains associated with corresponding frequency bands of the auditory signal. Individual pulse trains of the plurality of pulse trains can be segmented into a corresponding series of frames having the predefined duration. The portion of the auditory signal can be identified by applying corresponding portions of the corresponding series of frames associated with the corresponding frequency bands to the KAARMA network.

In another embodiment, a speech recognition system comprises processing circuitry including a processor and memory comprising a speech recognition application, where execution of the speech recognition application by the processor causes the processing circuitry to: convert an auditory signal into a pulse train; segment the pulse train into a series of frames having a predefined duration; and identify a portion of the auditory signal by applying at least a portion of the series of frames segmented from the pulse train to a kernel adaptive autoregressive-moving-average (KAARMA) network. In one or more aspects of these embodiments, the KAARMA network can generate a state that identifies the portion of the auditory signal.

In one or more aspects of these embodiments, execution of the speech recognition application can cause the processing circuitry to identify a word by applying the series of frames segmented from the pulse train to a KAARMA chain including the KAARMA network. The word can be identified based at least in part upon a plurality of states generated by KAARMA networks of the KAARMA chain. The portion of the auditory signal can be identified by applying the portion of the series of frames to the KAARMA network in a natural left-to-right temporal sequence and applying the portion of the series of frames to a second KAARMA network in a reversed right-to-left temporal sequence. The portion of the auditory signal can be identified based upon a first state generated by the KAARMA network and a second state generated by the second KAARMA network.

In one or more aspects of these embodiments, the auditory signal can be converted into a plurality of pulse trains associated with corresponding frequency bands of the auditory signal, individual pulse trains of the plurality of pulse trains segmented into a corresponding series of frames having the predefined duration. The portion of the auditory signal can be identified by applying corresponding portions of the corresponding series of frames associated with the corresponding frequency bands to the KAARMA network.

Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims. In addition, all optional and preferred features and modifications of the described embodiments are usable in all aspects of the disclosure taught herein. Furthermore, the individual features of the dependent claims, as well as all optional and preferred features and modifications of the described embodiments are combinable and interchangeable with one another.

BRIEF DESCRIPTION OF THE DRAWINGS

Many aspects of the present disclosure can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views.

FIG. 1 is a schematic diagram of an example of a general state-space model for a dynamical system in accordance with various embodiments of the present disclosure.

FIG. 2 illustrates an example of a kernel adaptive autoregressive-moving-average (KAARMA) network in accordance with various embodiments of the present disclosure.

FIGS. 3A and 3B are examples of algorithms that implement portions of the KAARMA network of FIG. 2 in accordance with various embodiments of the present disclosure.

FIGS. 4A through 4D include tables illustrating performance results of the KAARMA network of FIG. 2 in accordance with various embodiments of the present disclosure.

FIG. 5 illustrates an example of a KAARMA chain of 3 grammar states in accordance with various embodiments of the present disclosure.

FIGS. 6A through 6C illustrate an example of segmentation of frequency responses of a 12-filter gammatone filterbank in accordance with various embodiments of the present disclosure.

FIGS. 7A through 7C illustrate an example of word recognition using pulse-based KAARMA chain in accordance with various embodiments of the present disclosure.

FIGS. 8A through 8C include tables illustrating and comparing performance results of the KAARMA chain and a hidden Markov model (HMM) framework in accordance with various embodiments of the present disclosure.

FIGS. 9A through 9C illustrate examples of spike train templates in accordance with various embodiments of the present disclosure.

FIG. 10 includes a table comparing performance of a liquid state machine (LSM) using various supervised learning techniques in accordance with various embodiments of the present disclosure.

FIG. 11 illustrates a spike train misclassified by the LSM of FIG. 10 in accordance with various embodiments of the present disclosure.

FIGS. 12A and 12B illustrate examples of online performance of quantized KAARMA (or QKAARMA) for spike train classification during training and testing in accordance with various embodiments of the present disclosure.

FIG. 13 illustrates an example of a minimized finite state machine in accordance with various embodiments of the present disclosure.

FIGS. 14A and 14B illustrate examples of state transition trajectories for a spike train generated by templates 0 and 1, respectively, in accordance with various embodiments of the present disclosure.

FIG. 15 is a schematic block diagram of an example of a speech processing device in accordance with various embodiments of the present disclosure.

DETAILED DESCRIPTION

Disclosed herein are various embodiments related to speech recognition. Reference will now be made in detail to the description of the embodiments as illustrated in the drawings, wherein like reference numbers indicate like parts throughout the several views.

A kernel adaptive autoregressive-moving-average (KAARMA) algorithm can model dynamical systems and can act as a temporal classifier, as shown using the benchmark Tomita grammars. This disclosure describes the use of KAARMA networks as an acoustic model for speech recognition. Since speech production is both nonlinear and nonstationary in nature, KAARMA-based approaches can deliver computationally efficient solutions. The performance of these grammar classifiers is shown for a fundamental application in speech processing, isolated word recognition, using both conventional feature sequences (e.g., mel-frequency cepstral coefficients or MFCCs) and biologically inspired signals (e.g., spike or pulse trains). The performance of these classifiers can be enhanced under a statistical framework.

KAARMA

The KAARMA algorithm will now be briefly described. Refer to “The kernel adaptive autoregressive-moving-average algorithm” by K. Li and J. C. Principe (IEEE Trans. on Neural Networks and Learning Systems, April 2015), which is hereby incorporated by reference in its entirety, for a more in-depth discussion. Referring to FIG. 1, shown is an example of a general state-space model for a dynamical system. Let the dynamical system be defined in terms of general continuous nonlinear state-transition and observation functions, f(⋅,⋅) and h(⋅), respectively, as:

$\begin{matrix} x_{i} = f (x_{i - 1}, u_{i}), & (1) \\ y_{i} = h (x_{i}), & (2) \end{matrix}$

where

$\begin{matrix} \begin{matrix} f (x_{i - 1}, u_{i}) \overset{Δ}{=} {[f^{(1)} (x_{i - 1}, u_{i}), \dots, f^{(n_{x})} (x_{i - 1}, u_{i})]}^{T} \\ = {[x_{i}^{(1)}, \dots, x_{i}^{(n_{x})}]}^{T}, \end{matrix} & (3) \\ \begin{matrix} h (x_{i}) \overset{Δ}{=} {[h^{(1)} (x_{i}), \dots, h^{(n_{y})} (x_{i})]}^{T} \\ = {[y_{i}^{(1)}, \dots, y_{i}^{(n_{y})}]}^{T}, \end{matrix} & (4) \end{matrix}$

with input $u_{i} \in \mathbb{R}^{n_{u}}$, state $x_{i} \in \mathbb{R}^{n_{x}}$, output $y_{i} \in \mathbb{R}^{n_{y}}$, and the parenthesized superscript $^{(k)}$ indicating the k-th component of a vector or the k-th column of a matrix. Note that the input, state, and output vectors have independent degrees of freedom or dimensionality.

For simplicity, EQNS. (1) and (2) can be rewritten in terms of a new hidden state vector as:

$\begin{matrix} s_{i} \overset{Δ}{=} \begin{bmatrix} x_{i} \\ y_{i} \end{bmatrix} = \begin{bmatrix} f (x_{i - 1}, u_{i}) \\ h \circ f (x_{i - 1}, u_{i}) \end{bmatrix}, & (5) \\ y_{i} = s_{i}^{(n_{s} - n_{y} + 1 : n_{s})} = \underbrace{\begin{bmatrix} 0 & I_{n_{y}} \end{bmatrix}}_{\mathbb{I}} \begin{bmatrix} x_{i} \\ y_{i} \end{bmatrix}, & (6) \end{matrix}$

where $I_{n_{y}}$ is an $n_{y} \times n_{y}$ identity matrix, 0 is an $n_{y} \times n_{x}$ zero matrix, and ∘ is the function composition operator. This augmented state vector $s_{i} \in \mathbb{R}^{n_{s}}$ is formed by concatenating the output $y_{i}$ with the original state vector $x_{i}$. With this rewriting, the measurement equation simplifies to multiplication by a fixed selector matrix $\mathbb{I} \overset{Δ}{=} [0 \; I_{n_{y}}]$.

Next, define an equivalent transition function $g (s_{i - 1}, u_{i}) = f (x_{i - 1}, u_{i})$ that takes the new state variable s as its argument. Using this notation, EQNS. (1) and (2) become:

$\begin{matrix} x_{i} = g (s_{i - 1}, u_{i}), & (7) \\ y_{i} = h (x_{i}) = h \circ g (s_{i - 1}, u_{i}) . & (8) \end{matrix}$

To learn the general continuous nonlinear transition and observation functions, g(⋅,⋅) and h∘g(⋅,⋅), respectively, the theory of reproducing kernel Hilbert space (RKHS) can be applied. First, map the augmented state vector $s_{i}$ and the input vector $u_{i}$ into two separate RKHSs as $φ (s_{i}) \in ℋ_{s}$ and $ϕ (u_{i}) \in ℋ_{u}$, respectively, using a spike kernel (e.g., a Schoenberg kernel). By the representer theorem, the state-space model defined by EQNS. (7) and (8) can be expressed as the following set of weights (functions in the input space) in the joint RKHS $ℋ_{su} \overset{Δ}{=} ℋ_{s} \otimes ℋ_{u}$:

$\begin{matrix} Ω \overset{Δ}{=} Ω_{ℋ_{su}} \overset{Δ}{=} \begin{bmatrix} g (\cdot, \cdot) \\ h \circ g (\cdot, \cdot) \end{bmatrix}, & (9) \end{matrix}$

where ⊗ is the tensor-product operator. Define the new features in the tensor-product RKHS as:

$\begin{matrix} ψ (s_{i - 1}, u_{i}) \overset{Δ}{=} φ (s_{i - 1}) \otimes ϕ (u_{i}) \in ℋ_{su} . & (10) \end{matrix}$

It follows that the tensor-product kernel is defined by:

$\begin{matrix} \begin{matrix} {〈 ψ (s, u), ψ (s^{'}, u^{'}) 〉}_{ℋ_{su}} = \mathcal{K}_{su} (s, u, s^{'}, u^{'}) \\ = (\mathcal{K}_{s} \otimes \mathcal{K}_{u}) (s, u, s^{'}, u^{'}) \\ = \mathcal{K}_{s} (s, s^{'}) \cdot \mathcal{K}_{u} (u, u^{'}) . \end{matrix} & (11) \end{matrix}$

This construction has several advantages over the simple concatenation of the input u and the state s. First, the tensor product kernel of two positive definite kernels is also a positive definite kernel. Second, since the adaptive filtering is performed in an RKHS using features, there is no constraint on the original input signals or the number of signals, as long as the appropriate reproducing kernel is used for each signal. Last but not least, this formulation imposes no restriction on the relationship between the signals in the original input space. This is important for input signals having different representations and spatio-temporal scales. For example, under this framework, a neurobiological system can be modeled taking spike trains, continuous amplitude local field potentials (LFPs), and vectorized state variables as inputs.
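As a minimal sketch of EQN. (11), the joint kernel can be evaluated as the product of a state kernel and an input kernel. The Gaussian form exp(−a‖x − x′‖²) and the function names below are assumptions made for illustration only; any pair of positive definite kernels (for example, a spike kernel for the input) could take their place.

```python
import math


def gaussian_kernel(x, y, a):
    """Assumed Gaussian kernel exp(-a * ||x - y||^2) with kernel parameter a."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-a * sq_dist)


def tensor_product_kernel(s, u, s_prime, u_prime, a_s, a_u):
    """Joint kernel of EQN. (11): K_s(s, s') * K_u(u, u')."""
    return gaussian_kernel(s, s_prime, a_s) * gaussian_kernel(u, u_prime, a_u)


# The state and the input may have different dimensions (here 3 and 12).
state = [0.0, 1.0, 0.5]
frame = [0.2] * 12
k_same = tensor_product_kernel(state, frame, state, frame, a_s=2.5, a_u=2.5)
k_shifted = tensor_product_kernel(state, frame, [0.0, 1.0, 1.5], frame, a_s=2.5, a_u=2.5)
```

Because the two factors are evaluated separately, the state and input never need to share a representation or scale, which is the property the paragraph above relies on.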

Finally, the kernel state-space model becomes:

$\begin{matrix} s_{i} = Ω^{T} ψ (s_{i - 1}, u_{i}), & (12) \\ y_{i} = \mathbb{I} s_{i} . & (13) \end{matrix}$
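The recursion of EQNS. (12) and (13) can be sketched as follows. This is an illustrative sketch only: it assumes Gaussian kernels of the form exp(−a‖x − x′‖²) and assumes that the weights are expanded over a finite dictionary of (state, input) centers with a coefficient matrix A, as in EQN. (18). The names `kaarma_step` and `run_sequence` are hypothetical.

```python
import math


def gaussian_kernel(x, y, a):
    """Assumed Gaussian kernel exp(-a * ||x - y||^2)."""
    return math.exp(-a * sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def kaarma_step(centers, coeffs, s_prev, u, a_s, a_u, n_y):
    """One step of EQNS. (12)-(13).

    centers: list of (state_center, input_center) pairs (the dictionary).
    coeffs:  one row of n_s coefficients per center (the matrix A).
    Returns the new state s_i and the output y_i (last n_y entries of s_i).
    """
    k = [gaussian_kernel(sc, s_prev, a_s) * gaussian_kernel(uc, u, a_u)
         for sc, uc in centers]
    n_s = len(coeffs[0])
    s = [sum(coeffs[j][d] * k[j] for j in range(len(k))) for d in range(n_s)]
    return s, s[n_s - n_y:]


def run_sequence(centers, coeffs, s0, inputs, a_s, a_u, n_y):
    """Feed an input sequence through the network and return the final output."""
    s, y = list(s0), list(s0[len(s0) - n_y:])
    for u in inputs:
        s, y = kaarma_step(centers, coeffs, s, u, a_s, a_u, n_y)
    return s, y


# Tiny hand-made network: one center, two state dimensions, one output.
centers = [([0.0, 0.0], [1.0])]
coeffs = [[0.5, -1.0]]
s1, y1 = kaarma_step(centers, coeffs, [0.0, 0.0], [1.0], a_s=2.5, a_u=2.5, n_y=1)
```

Only the final output of `run_sequence` needs to be compared with a label when the desired value is deferred to the end of the sequence.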

FIG. 2 shows an example of a simple KAARMA network. In general, the states ($s_{i}$) 503 are assumed hidden, and the desired state ($s_{i}^{(n_{s})}$) does not need to be available at every time step, e.g., a deferred desired output value for $y_{i}$ may only be observed at the final indexed step i=f, i.e., $d_{f}$.

Kernel Adaptive Recurrent Filtering. The learning procedure presented here computes the exact error gradient in the RKHS. For simplicity, consider only the Gaussian kernel $\mathcal{K}_{a} (x, x^{'}) = \exp (- a {\| x - x^{'} \|}^{2})$, with kernel parameter a > 0, in the derivation. For the state and input vectors, the joint inner products can be computed using $\mathcal{K}_{a_{s}} (s, s^{'})$ and $\mathcal{K}_{a_{u}} (u, u^{'})$, respectively.

The cost function at time i can be defined as:

$\begin{matrix} ɛ_{i} = \frac{1}{2} e_{i}^{T} e_{i}, & (14) \end{matrix}$

where $e_{i} = d_{i} - y_{i} \in \mathbb{R}^{n_{y} \times 1}$ is the error vector, with $d_{i}$ as the desired signal. The error gradient with respect to the RKHS weights $Ω_{i}$ at time i is:

$\begin{matrix} \frac{\partial ɛ_{i}}{\partial Ω_{i}} = \frac{\partial e_{i}^{T} e_{i}}{2 \partial Ω_{i}} = - e_{i}^{T} \frac{\partial y_{i}}{\partial Ω_{i}}, & (15) \end{matrix}$

where the partial derivative

$\frac{\partial y_{i}}{\partial Ω_{i}}$

consists of $n_{s}$ terms,

$\frac{\partial y_{i}}{\partial Ω_{i}^{(1)}}, \frac{\partial y_{i}}{\partial Ω_{i}^{(2)}}, \dots, \frac{\partial y_{i}}{\partial Ω_{i}^{(n_{s})}}$

corresponding to the state dimension. Note that the feature-space weights $Ω_{i}$ are functions with potentially infinite dimension. Fortunately, the functional derivative is well-posed in the RKHS: since Hilbert spaces are complete normed vector spaces, the Fréchet derivative can be used to compute EQN. (15).

For the k-th component of Ω_i, the gradient can be expanded using the chain rule as:

$\begin{matrix} \frac{\partial ɛ_{i}}{\partial Ω_{i}^{(k)}} = - e_{i}^{T} \frac{\partial y_{i}}{\partial Ω_{i}^{(k)}} = - e_{i}^{T} \frac{\partial y_{i}}{\partial s_{i}} \frac{\partial s_{i}}{\partial Ω_{i}^{(k)}}, \text{ where } \frac{\partial y_{i}}{\partial s_{i}} = \mathbb{I} . & (16) \end{matrix}$

Applying the product rule to the gradient yields:

$\begin{matrix} \begin{matrix} \frac{\partial s_{i}}{\partial Ω_{i}^{(k)}} = \frac{\partial Ω_{i}^{T} ψ (s_{i - 1}, u_{i})}{\partial Ω_{i}^{(k)}} \\ = Ω_{i}^{T} \frac{\partial ψ (s_{i - 1}, u_{i})}{\partial Ω_{i}^{(k)}} + I_{n_{s}}^{(k)} {ψ (s_{i - 1}, u_{i})}^{T}, \end{matrix} & (17) \end{matrix}$

where $I_{n_{s}}^{(k)} \in \mathbb{R}^{n_{s}}$ is the k-th column of the $n_{s} \times n_{s}$ identity matrix. The distinction between a recurrent formulation and the normal feed-forward kernel adaptive filtering (KAF) lies in the first (feedback) gradient term on the right-hand side of EQN. (17). In a recurrent network, past states are coupled with the current state input through feedback. Consequently, the partial derivatives of the previous states with respect to the current filter weights are nonzero.

Using the representer theorem, the weights $Ω_{i}$ at time i can be written as a linear combination of prior features given by:

Ω_i=Ψ_iA_i, (18)

where $Ψ_{i} \overset{Δ}{=} [ψ (s_{- 1}, u_{0}), \dots, ψ (s_{m - 2}, u_{m - 1})] \in \mathbb{R}^{n_{ψ} \times m}$ is a collection of the m past tensor-product features with potentially infinite dimension $n_{ψ}$, and $A_{i} \overset{Δ}{=} [α_{i, 1}, \dots, α_{i, n_{s}}] \in \mathbb{R}^{m \times n_{s}}$ is the set of corresponding coefficients. For feedforward KAF such as the kernel least mean squares (KLMS) algorithm, the number of basis functions grows linearly with each new sample (m=i). Here, m denotes the size of an arbitrary dictionary $Ψ_{i}$, initialized with $ψ (s_{- 1}, u_{0})$. Thus, the k-th component ($1 \leq k \leq n_{s}$) of the filter weights at time i becomes:

$\begin{matrix} Ω_{i}^{(k)} = Ψ_{i} A_{i}^{(k)} = Ψ_{i} α_{i, k} . & (19) \end{matrix}$

Substituting the expression for the weights $Ω_{i}$ in EQN. (18) into the feedback gradient on the right-hand side of EQN. (17) and applying the chain rule gives:

$\begin{matrix} \begin{matrix} Ω_{i}^{T} \frac{\partial ψ (s_{i - 1}, u_{i})}{\partial Ω_{i}^{(k)}} = A_{i}^{T} \frac{\partial Ψ_{i}^{T} ψ (s_{i - 1}, u_{i})}{\partial s_{i - 1}} \frac{\partial s_{i - 1}}{\partial Ω_{i}^{(k)}} \\ = \underbrace{2 a_{s} A_{i}^{T} K_{i} D_{i}^{T}}_{Λ_{i}} \frac{\partial s_{i - 1}}{\partial Ω_{i}^{(k)}}, \end{matrix} & (20) \end{matrix}$

where the partial derivative is evaluated using the Gaussian tensor-product kernel of EQN. (11), $K_{i} \overset{Δ}{=} \mathrm{diag} (Ψ_{i}^{T} ψ (s_{i - 1}, u_{i}))$ is a diagonal matrix with eigenvalues $K_{i}^{(j, j)} = \mathcal{K}_{a_{s}} (s_{j}, s_{i - 1}) \cdot \mathcal{K}_{a_{u}} (u_{j}, u_{i})$, and $D_{i} \overset{Δ}{=} [(s_{- 1} - s_{i - 1}), \dots, (s_{m - 2} - s_{i - 1})]$ is the difference matrix between the state centers of the filter and the current input state $s_{i - 1}$. The gradient coefficients in EQN. (20) can be collected into a matrix

$Λ_{i} \overset{Δ}{=} \frac{\partial s_{i}}{\partial s_{i - 1}} = 2 a_{s} A_{i}^{T} K_{i} D_{i}^{T},$

which is called the state-transition gradient. Substituting EQN. (20) into EQN. (17) gives the following recursion:

$\begin{matrix} \frac{\partial s_{i}}{\partial Ω_{i}^{(k)}} = Λ_{i} \frac{\partial s_{i - 1}}{\partial Ω_{i}^{(k)}} + I_{n_{s}}^{(k)} {ψ (s_{i - 1}, u_{i})}^{T} . & (21) \end{matrix}$

The gradient

$\frac{\partial s_{i}}{\partial Ω_{i}^{(k)}}$

measures the sensitivity of s_i, the state output at time i, to a small change in the k-th weight component, taking into account the effect of such variation in the weight values over the entire state trajectory s₀, . . . , s_i−1. In this evaluation, the initial state s₀, the input sequence u₁ⁱ, and the remaining weights (Ω_i^(j), where j≠k) are fixed. The state gradient

$\frac{\partial s_{i}}{\partial Ω_{i}^{(k)}}$

is back-propagated with respect to a constant weight via feedback.
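The state-transition gradient $Λ_{i} = 2 a_{s} A_{i}^{T} K_{i} D_{i}^{T}$ can be computed directly from the dictionary, as in the sketch below. The sketch assumes Gaussian kernels of the form exp(−a‖x − x′‖²) and a finite dictionary of (state, input) centers; the function names are hypothetical and are not taken from FIGS. 3A and 3B.

```python
import math


def gaussian_kernel(x, y, a):
    """Assumed Gaussian kernel exp(-a * ||x - y||^2)."""
    return math.exp(-a * sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def next_state(centers, coeffs, s_prev, u, a_s, a_u):
    """s_i = A^T k, where k holds the joint kernel evaluations (EQN. (12))."""
    k = [gaussian_kernel(sc, s_prev, a_s) * gaussian_kernel(uc, u, a_u)
         for sc, uc in centers]
    n_s = len(coeffs[0])
    return [sum(coeffs[j][d] * k[j] for j in range(len(k))) for d in range(n_s)]


def state_transition_gradient(centers, coeffs, s_prev, u, a_s, a_u):
    """Lambda_i = 2 a_s A^T K D^T, an n_s x n_s matrix (row r, column c)."""
    n_s = len(s_prev)
    lam = [[0.0] * n_s for _ in range(n_s)]
    for (sc, uc), a_row in zip(centers, coeffs):
        # Diagonal entry of K for this center.
        k_j = gaussian_kernel(sc, s_prev, a_s) * gaussian_kernel(uc, u, a_u)
        for r in range(n_s):
            for c in range(n_s):
                # Column of D for this center is (sc - s_prev).
                lam[r][c] += 2.0 * a_s * a_row[r] * k_j * (sc[c] - s_prev[c])
    return lam


centers = [([0.1, -0.2], [0.5]), ([0.4, 0.3], [-0.1]), ([-0.3, 0.2], [0.0])]
coeffs = [[0.7, -0.2], [-0.5, 0.9], [0.3, 0.4]]
lam = state_transition_gradient(centers, coeffs, [0.05, 0.1], [0.2], 2.5, 2.5)
```

One way to sanity-check such an implementation is to compare each entry of `lam` with a central finite difference of `next_state` with respect to the corresponding component of the previous state.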

Clearly EQN. (21) is independent of any teacher signal or error that the system may incur in the future and can be computed entirely from the observed data. Therefore, the state gradients can be forward propagated. Since the initial state is user-defined and functionally independent of the filter weights, by setting

$\frac{\partial s_{0}}{\partial Ω_{i}^{(k)}} = 0,$

the first step of the recursion in EQN. (21) becomes:

$\begin{matrix} \frac{\partial s_{1}}{\partial Ω_{i}^{(k)}} = I_{n_{s}}^{(k)} {ψ (s_{0}, u_{1})}^{T} . & (22) \end{matrix}$

By induction, the basis functions can be factored out and the recursion expressed as:

$\begin{matrix} \begin{matrix} \frac{\partial s_{i}}{\partial Ω_{i}^{(k)}} = Λ_{i} V_{i - 1}^{(k)} Ψ_{i - 1}^{' T} + I_{n_{s}}^{(k)} {ψ (s_{i - 1}, u_{i})}^{T} \\ = {[Λ_{i} V_{i - 1}^{(k)}, I_{n_{s}}^{(k)}] [Ψ_{i - 1}^{'}, ψ (s_{i - 1}, u_{i})]}^{T} \\ = V_{i}^{(k)} Ψ_{i}^{' T} \end{matrix} & (23) \end{matrix}$

where $Ψ_{i}^{'} \overset{Δ}{=} [Ψ_{i - 1}^{'}, ψ (s_{i - 1}, u_{i})] \in \mathbb{R}^{n_{ψ} \times i}$ are centers generated by the input sequence and forward-propagated states from a fixed filter weight $Ω_{i}$, and $V_{i}^{(k)} \overset{Δ}{=} [Λ_{i} V_{i - 1}^{(k)}, I_{n_{s}}^{(k)}] \in \mathbb{R}^{n_{s} \times i}$ is the updated state-transition gradient, with initializations $Ψ_{1}^{'} = [ψ (s_{0}, u_{1})]$ and $V_{1}^{(k)} = I_{n_{s}}^{(k)}$.

Combining EQN. (16) with EQN. (23) gives the error gradient:

$\begin{matrix} \frac{\partial ɛ_{i}}{\partial Ω_{i}^{(k)}} = - e_{i}^{T} \mathbb{I} V_{i}^{(k)} Ψ_{i}^{' T} . & (24) \end{matrix}$

Updating the weights in the negative gradient direction yields:

$\begin{matrix} Ω_{i + 1}^{(k)} = Ω_{i}^{(k)} + η Ψ_{i}^{'} {(\mathbb{I} V_{i}^{(k)})}^{T} e_{i} & (25) \\ = Ψ_{i} A_{i}^{(k)} + η Ψ_{i}^{'} {(\mathbb{I} V_{i}^{(k)})}^{T} e_{i} & \\ = [Ψ_{i}, Ψ_{i}^{'}] \begin{bmatrix} A_{i}^{(k)} \\ η {(\mathbb{I} V_{i}^{(k)})}^{T} e_{i} \end{bmatrix} & \\ \overset{Δ}{=} Ψ_{i + 1} A_{i + 1}^{(k)}, & (26) \end{matrix}$

where η is the learning rate. Because learning is performed in an LMS fashion in the RKHS, the conventional tradeoff between speed and accuracy exists, and the same rules as for the KLMS algorithm can be utilized.

Since the centers in a KAARMA network naturally form feature clusters during training, the growth of the network can be curbed by comparing each new center from the feature update Ψ′ with the existing ones. FIGS. 3A and 3B include pseudocode illustrating a KAARMA algorithm and a quantization algorithm, respectively. The algorithm of FIG. 3B outlines the quantization procedure of quantized KAARMA (or QKAARMA), which constrains the network growth in the last five lines of the KAARMA algorithm of FIG. 3A. If the minimum joint distance (input and state) between a new center and the existing centers is below a quantization threshold q, the new center $Ψ^{' (i)} \overset{Δ}{=} ψ (s_{i}^{'}, u_{i}^{'})$ is simply discarded, and its corresponding coefficients $A^{' (i)}$, where (i) indicates the i-th row, are added to those of the nearest existing neighbor (indexed by j*), thus updating the weights without growing the network structure.
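The quantization step described above can be sketched as follows. This is not the pseudocode of FIG. 3B; it is a simplified illustration in which the joint distance is assumed to be the Euclidean distance over the concatenated state and input vectors, and the names `joint_distance` and `quantize_update` are hypothetical.

```python
import math


def joint_distance(center_a, center_b):
    """Assumed joint distance: Euclidean norm over state and input differences."""
    (s_a, u_a), (s_b, u_b) = center_a, center_b
    sq = sum((x - y) ** 2 for x, y in zip(s_a, s_b))
    sq += sum((x - y) ** 2 for x, y in zip(u_a, u_b))
    return math.sqrt(sq)


def quantize_update(centers, coeffs, new_center, new_coeff, q):
    """Merge the new center into its nearest neighbor if closer than q.

    centers: list of (state, input) pairs; coeffs: one coefficient row per center.
    Returns the index of the center that received the coefficients.
    """
    if centers:
        dists = [joint_distance(c, new_center) for c in centers]
        j_star = min(range(len(dists)), key=dists.__getitem__)
        if dists[j_star] < q:
            # Discard the new center; add its coefficients to the neighbor's.
            coeffs[j_star] = [a + b for a, b in zip(coeffs[j_star], new_coeff)]
            return j_star
    # Otherwise the dictionary grows by one center.
    centers.append(new_center)
    coeffs.append(list(new_coeff))
    return len(centers) - 1


centers, coeffs = [], []
first = quantize_update(centers, coeffs, ([0.0, 0.0], [0.0]), [1.0], q=0.5)
merged = quantize_update(centers, coeffs, ([0.1, 0.0], [0.0]), [0.5], q=0.5)
grown = quantize_update(centers, coeffs, ([2.0, 0.0], [0.0]), [0.25], q=0.5)
```

A larger threshold q merges more updates into existing centers and keeps the network smaller, at the cost of a coarser representation.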

KAARMA for Automatic Speech Recognition

Automatic speech recognition (ASR) research can involve connectionist networks that can focus on a frame-to-frame predictive or discriminative model, by computing the local observation probability at each time step. This can be based upon an assumption that each frame of the speech signal can be labeled or assigned a desired value. Alternatively, certain speech recognition tasks can be treated as grammatical inference problems, exactly like syntactic pattern recognition involving the Tomita grammars.

As a proof of concept, the TI-46 corpus of isolated English digits was used for experimental testing in this disclosure. The corpus of speech was designed and collected at Texas Instruments, Inc. in 1980, consisting of utterances from 16 speakers (eight males and eight females) each speaking the digits “zero” through “nine” 26 times. Of the 4160 possible utterances, 4000 were used in the subsequent experimental testing. These utterances were further partitioned randomly into a training set (of 2700 utterances with an equal number of male/female utterances and digits: 135 utterances per gender, per digit) and a testing set (of 1300 utterances with an equal number of male/female utterances and digits: 65 utterances per gender, per digit).

To help align each utterance and reduce the number of non-speech data points used in computation, each utterance can be normalized with respect to its maximum absolute amplitude, then truncated automatically into the smallest window containing all voiced regions using a simple threshold-based end-point detection algorithm. Next, each truncated utterance was analyzed using 25 ms speech frames at 100 frames per second (fps). Each frame was Hamming windowed and filtered by a first-order pre-emphasis filter (α=0.95). The magnitude spectrum from the discrete Fourier transform (DFT) can be computed and scaled by the mel-scale triangular filter bank. The output energy can then be log-compressed and transformed via the discrete cosine transform (DCT) to cepstral coefficients.
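The framing, windowing, and pre-emphasis steps can be illustrated with the short sketch below. It covers only the front end up to the pre-emphasized frame and omits the DFT, mel filter bank, and DCT stages. The 16 kHz sample rate in the usage lines is a hypothetical value chosen for the example, since the sample rate is not stated above, and the function names are illustrative.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25.0, frames_per_second=100.0):
    """Split a signal into overlapping frames (25 ms long, one every 10 ms)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = int(round(sample_rate / frames_per_second))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


def hamming_window(frame):
    """Apply a Hamming window to one frame (frame length must exceed 1)."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)))
            for k, x in enumerate(frame)]


def pre_emphasis(frame, alpha=0.95):
    """First-order pre-emphasis filter: y[k] = x[k] - alpha * x[k - 1]."""
    return [frame[0]] + [frame[k] - alpha * frame[k - 1]
                         for k in range(1, len(frame))]


# One second of a hypothetical 16 kHz signal (sample rate is an assumption).
signal = [math.sin(2.0 * math.pi * 440.0 * n / 16000.0) for n in range(16000)]
frames = frame_signal(signal, sample_rate=16000)
# Steps applied in the order listed in the text above.
processed = [pre_emphasis(hamming_window(f)) for f in frames]
```

At 100 fps with 25 ms frames, consecutive frames overlap by 15 ms, so each sample contributes to two or three frames.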

For the ASR experiment, ten 1-vs-all KAARMA networks were trained, one for each digit. Each training utterance was represented by a sequence of 12 MFCCs and assigned a single label of ±1 based on the target class. The QKAARMA algorithm was used for training, with hidden states $s_{i} \in \mathbb{R}^{3}$, kernel parameters a_s=a_u=2.5, learning rate η=0.05, and quantization threshold q=0.81. To evaluate the complete ASR system, each test utterance was passed through all ten classifiers, and the digit corresponding to the KAARMA network with the largest output value was selected as the recognition result.

The classification performance of this KAARMA ASR system is summarized in the training and testing confusion matrices of FIG. 4A. The drop in performance from single 1-vs-all classifiers to the overall recognition system and from the training set to the testing set is expected, due to overfitting and the fact that errors in individual classifiers can accumulate across the entire testing set for the final 10-vs-10 word recognition system. Another contributing factor is that many digits share similar acoustic features in their temporal representation. Theoretically, recurrent networks that are trained using the gradient descent method are ill-suited for learning long-term dependencies. If two different digits converge to the same set of acoustic features after enough time has elapsed, or share significant overlap during utterances, it becomes much more difficult for recurrent networks such as KAARMA networks to distinguish the two. To better illustrate this phenomenon, each of the 10 digits present in the corpus is broken up into sub-word speech units of phonemes in the table of FIG. 4B.

One simple way to circumvent this problem for certain digits, without changing the experiment, is to reverse the order of the acoustic features. Digits that share the same trailing phoneme in the natural ordering may end with different phonemes in the reversed ordering (of course, the converse can also happen for certain digits). In the table of FIG. 4C, the results are tabulated for individual 1-vs-all classifiers using both the natural left-to-right temporal sequence convention and the reversed ordering (right-to-left) during training and testing of the digit corpus. The true positive (TP), false positive (FP), and 1-vs-all classification accuracy are listed for each of the two classifiers, with the better performance in each category between the two orderings highlighted in bold font.

If the ASR system is constructed by selecting the 1-vs-all classifier with the higher accuracy between the two ordering categories for each digit, the testing set recognition rate of this modified ASR can be improved over the uniform sequence ordering. However, if both classifiers are combined by multiplying the softmax scores, the performance can be further improved, as shown in the table of FIG. 4D.
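The combination of the two orderings can be sketched as below. The sketch assumes that the softmax is taken across the outputs of the 1-vs-all classifiers for each ordering and that the two resulting score vectors are multiplied element-wise; the function names and the example output values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of classifier outputs."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def recognize(forward_outputs, backward_outputs):
    """Multiply the softmax scores of both orderings and pick the best class."""
    p_forward = softmax(forward_outputs)
    p_backward = softmax(backward_outputs)
    combined = [a * b for a, b in zip(p_forward, p_backward)]
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined


# Hypothetical outputs of three 1-vs-all networks for one utterance.
forward = [0.1, 0.9, 0.8]    # natural left-to-right ordering
backward = [0.0, 0.2, 0.9]   # reversed right-to-left ordering
best, combined = recognize(forward, backward)
```

In this made-up example the forward classifiers alone would pick class 1, while the combined score picks class 2 because the reversed ordering separates the two candidates more clearly.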

KAARMA Chain

A KAARMA chain approach can be formulated for isolated word recognition under a statistical framework. Commonly, a hidden Markov model (HMM) framework is used for speech recognition. In the HMM framework, a speech signal, specifically the sequence of acoustic feature vectors $U = \{u_{1}, u_{2}, \dots, u_{n_{u}}\}$, is generated by a finite state automaton comprising states $S = \{s_{1}, s_{2}, \dots, s_{n_{s}}\}$ under a probabilistic framework. An HMM is equivalent to a stochastic regular grammar with each speech unit associated with a specific Markov model $M_{i}$ comprising states from S according to a predefined topology. The left-right (Bakis) model is the most commonly used topology for speech recognition. States are aligned from left to right to form a single Markov chain, indexed incrementally and with only self- or right-transitions allowed, i.e., $a_{i, j} = 0$ for j<i. Furthermore, the initial state is fixed at state $s_{1}$. Left-right HMMs are able to model the temporal properties of speech.

The training and recognition criteria for HMMs are based on maximizing the a posteriori probability $\Pr (M_{i} | U)$ that the observation U has been produced by the HMM $M_{i}$. Using Bayes' rule, the expression can be rewritten as:

$\begin{matrix} \Pr (M_{i} | U) = \frac{\Pr (U | M_{i}) \Pr (M_{i})}{\Pr (U)}, & (27) \end{matrix}$

where $\Pr (U | M_{i})$ is the maximum likelihood estimate (MLE) criterion, $\Pr (U)$ is constant during recognition, and the a priori probability $\Pr (M_{i})$ is an appropriate language model.

The statistical recognition criterion can be solved directly using the KAARMA algorithm. The maximum a posteriori (MAP) criterion can be defined as:

$\begin{matrix} M^{*} = \underset{M}{\arg\max} \Pr (M | U), & (28) \end{matrix}$

where M is the inference model. This is equivalent to finding the maximum a posteriori (most probable) state sequence for each model.

Define the states in a KAARMA chain as context-free grammars (e.g., phoneme, syllable, word, etc.), denoted by $Q = \{q_{1}, q_{2}, \dots, q_{L}\}$ (this distinction is made to not confuse a grammar state $q_{i}$ with the KAARMA internal state variables $s_{i}$, i.e., $q_{i} \in \{s_{1}, s_{2}, \dots, s_{n_{s}}\}$). Under this formulation, a single (global grammar) KAARMA network trained on the entire observation trajectory $U = \{u_{1}, u_{2}, \dots, u_{n_{u}}\}$ can be viewed as:

$\begin{matrix} {\tilde{y}}_{i} = \frac{\exp (y_{i}^{(t)})}{Σ_{j} \exp (y_{j}^{(t)})} = \Pr (q = 1 | U), & (29) \end{matrix}$

where $y_{i}$ is the output of a KAARMA network trained to recognize or classify the grammar q=1. The softmax function can be used to ensure that the posterior estimates are non-negative and sum to one. To improve the classification results, several KAARMA networks that specialize in different regions of a word can be trained. For each isolated word, the observation sequence can be partitioned into L equal segments. Each segment can be assigned a grammar state, and a separate KAARMA network can be used to learn its classification grammar. FIG. 5 shows an example of a KAARMA chain of 3 grammar states, used for the digit “7.”

Next, the state transitions for the KAARMA chain can be fixed at $a_{i, j} = 1$ for j>i. KAARMA networks are able to handle nonstationarities by leveraging their internal hidden states $s_{i}$. Unlike the restricted structure of a traditional left-to-right model, the hidden states $s_{i}$ in each grammar state are free to form transitions that best fit the available data, i.e., an ergodic model. In the KAARMA chain formulation, the recognized word can be given by the following MAP criterion:

$\begin{matrix} M^{*} = \underset{M}{argmax} \prod_{i}^{L} \Pr (q_{i} = M | u_{(i - 1) * n_{u} / L + 1}^{i * n_{u} / L}) . & (30) \end{matrix}$
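The MAP criterion of EQN. (30) can be sketched as follows. The per-segment posteriors are stand-ins here: each grammar-state network is represented by a hypothetical callable that maps a segment of frames to a probability, and the product is accumulated in the log domain, which is an implementation choice for numerical stability rather than part of the criterion.

```python
import math


def split_equal(frames, num_segments):
    """Partition a frame sequence into L contiguous, (nearly) equal segments."""
    n = len(frames)
    return [frames[(i * n) // num_segments:((i + 1) * n) // num_segments]
            for i in range(num_segments)]


def chain_map(frames, chains, num_segments):
    """Return the word whose chain maximizes the product of segment posteriors.

    chains: dict mapping each word to a list of L callables; each callable
    stands in for one grammar-state network and returns Pr(q_i = word | segment).
    """
    segments = split_equal(frames, num_segments)
    scores = {}
    for word, nets in chains.items():
        log_score = 0.0
        for net, segment in zip(nets, segments):
            # Floor the posterior to avoid log(0).
            log_score += math.log(max(net(segment), 1e-300))
        scores[word] = log_score
    return max(scores, key=scores.get), scores


# Stub posteriors for two hypothetical words, three grammar states each.
chains = {
    "seven": [lambda seg: 0.9, lambda seg: 0.8, lambda seg: 0.7],
    "one": [lambda seg: 0.9, lambda seg: 0.3, lambda seg: 0.2],
}
word, scores = chain_map(list(range(9)), chains, num_segments=3)
```

Because every grammar state must agree for the product to be large, a single poorly matching segment strongly penalizes a candidate word.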

A major advantage of performing adaptive filtering in the RKHS is the freedom to choose the input representation as long as the appropriate reproducing kernel is used. By having separate formulations of the exogenous input u and the internal state s vectors, the KAARMA algorithm imposes no restriction on the relationship between the two signals in the original input space. This enables it to work directly with nonnumeric data such as spike trains. The KAARMA chain paradigm can be applied to a biologically plausible speech recognition system. For each speech signal, biologically inspired features can be extracted to mimic the filtering performed by the human auditory system. The Schoenberg kernel can be used as a suitable reproducing kernel (or spike kernel) for spike trains. The spike kernel transforms the spike trains into a high-dimensional feature space (e.g., a Hilbert space).

Spike-Based Acoustic Features

First, a gammatone filterbank can be applied to each acoustic signal and its outputs converted into spike trains using leaky integrate-and-fire neurons with spike rate adaptation and a refractory current. The neuron parameters are resistance R_m=10², time constant τ_m=10⁻², spike threshold V_th=−55×10⁻³, spike delta V_spike=0.5, reversal potential for spike-rate adaptation E_k=−2×10⁻¹, and reset potential V_reset=−8×10⁻². This formulation is motivated by the mechanical to electrical transduction in the cochlea. Different regions of the basilar membrane vibrate to particular sound frequencies, in response to fluid flow in the cochlea. Sensory hair cells in the organ of Corti then convert the mechanical response to electrical signals which travel along the auditory nerve to the brain for processing.
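The conversion of one filter output into a spike train can be sketched with a basic leaky integrate-and-fire neuron. This is a simplified illustration only: it omits the spike-rate adaptation and refractory current mentioned above, it assumes the listed parameters are in SI units, and the resting potential `v_rest`, the time step `dt`, and the function name `lif_spike_times` are assumptions that do not come from the text.

```python
def lif_spike_times(current, dt=1e-4, r_m=1e2, tau_m=1e-2,
                    v_th=-55e-3, v_reset=-8e-2, v_rest=-70e-3):
    """Return spike times of a basic LIF neuron driven by a sampled input.

    current: sequence of input samples (e.g., one gammatone filter output).
    v_rest and dt are assumed values; adaptation and refractory terms are omitted.
    """
    v = v_rest
    spikes = []
    for k, i_in in enumerate(current):
        # Forward-Euler step of tau_m * dV/dt = -(V - v_rest) + r_m * I
        v += dt * (-(v - v_rest) + r_m * i_in) / tau_m
        if v >= v_th:
            spikes.append((k + 1) * dt)  # record the threshold crossing
            v = v_reset                  # reset the membrane potential
    return spikes

# Toy usage: a constant drive strong enough to cross threshold repeatedly.
spike_train = lif_spike_times([1e-3] * 500)
```

With no input the membrane stays at the assumed resting potential and no spikes are emitted; a stronger drive shortens the interspike intervals.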

The gammatone filterbank simulates the mechanical response of the cochlea in which the output of each filter models the frequency response of the basilar membrane at a particular location. FIG. 6A illustrates an example of frequency responses of a 12-filter gammatone filterbank with center frequencies equally spaced between 50 Hz and 8 kHz on the ERB-rate scale. Each gammatone filter can be defined in the time domain by the impulse response:

$\begin{matrix} g(t) = a t^{n - 1} e^{- 2 \pi b t} \cos (2 \pi f_{c} t + \phi) , & (31) \end{matrix}$

where f_c is the center frequency (in Hz), ϕ is the phase of the carrier (in radians), a is the amplitude, n is the filter order, b is the filter bandwidth (in Hz), and t indicates time (in s). FIG. 6B shows the gammatone filterbank outputs for an utterance of the digit “8,” superimposed with the corresponding spike trains generated using leaky integrate-and-fire (LIF) neurons. Next, the spike trains can be segmented into frames the same way as the MFCCs, with a frame duration of 25 ms and a rate of 100 fps. FIG. 6C shows an example frame representation of the spike train inputs for the same digit “8” utterance. Again, the positive class is replicated 3 times and this training set is used once to train the KAARMA classifiers. For the isolated word recognition problem, the spike frames are used directly as features (with hidden states s∈ℝ³, kernel parameters a_s=4, a_u=6, learning rate η=0.1, quantization threshold q=0.25). To reduce over-fitting, the parameters were not fully optimized over their respective ranges. The recognition performance is also shown using firing rates obtained from the spike frames (with hidden states s∈ℝ³, kernel parameters a_s=a_u=5, learning rate η=0.1, quantization threshold q=0.55).
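The impulse response of Eq. (31) and the frame segmentation can be sketched as follows. The function names are hypothetical, the default filter order n=4 is a common choice assumed here rather than a value from the text, and the handling of the tail of the signal (frames that would extend past the end are dropped) is likewise an assumption.

```python
import math

def gammatone_impulse_response(t, f_c, b, n=4, a=1.0, phi=0.0):
    """Evaluate g(t) = a t^(n-1) exp(-2 pi b t) cos(2 pi f_c t + phi) for t in seconds."""
    if t < 0:
        return 0.0  # causal filter: no response before the impulse
    return (a * t ** (n - 1) * math.exp(-2 * math.pi * b * t)
            * math.cos(2 * math.pi * f_c * t + phi))

def segment_into_frames(spike_times, duration, frame_len=0.025, frame_rate=100.0):
    """Cut one spike train into 25 ms frames taken every 1/frame_rate seconds.

    Spike times inside each frame are stored relative to the frame start.
    """
    hop = 1.0 / frame_rate
    frames = []
    k = 0
    while k * hop + frame_len <= duration + 1e-12:
        start = k * hop
        frames.append([s - start for s in spike_times
                       if start <= s < start + frame_len])
        k += 1
    return frames

# Toy usage: three spikes in a 50 ms signal give three overlapping frames.
frames = segment_into_frames([0.005, 0.012, 0.030], duration=0.05)
```

At 100 fps the frame hop is 10 ms, so consecutive 25 ms frames overlap by 15 ms and a single spike can appear in more than one frame.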

FIGS. 7A-7C illustrate an overview of the spike-based ASR system using KAARMA chains for isolated word recognition. FIG. 7A shows the segmentation of frequency responses and FIGS. 7B and 7C show an example of portions of the KAARMA chain for word recognition. For a 5-state KAARMA chain, the recognition results are shown in the tables of FIGS. 8A and 8B. The spike-based performances are summarized and compared to the performance of a 5-state HMM with a mixture of 8 Gaussians in the table of FIG. 8C.

State-of-the-art bioinspired digit recognition performance using a liquid state machine (LSM) on the TI-46 digit corpus has been reported. For the multispeaker spoken digit task with 1590 speech samples (80% used for training and the remaining 20% for testing) and 500 training epochs, the final classification rate for the 77-channel spike input LSM was 92.3%. The spike or pulse-based speech recognition of this disclosure achieves an accuracy of 95.23% for a larger speech set with 4000 samples (67% for training and 33% for testing), using a single training epoch (where only the desired class or 10% of the training data is replicated 3 times).

Furthermore, producing a constant output for a time-varying liquid state is a major challenge for an LSM, since the memory-less readout has to transform the transient and nonstationary states of the liquid filter into a stable output without any attractor states to rely on. For the KAARMA formulation using pulse-based signals, once the stable dynamics are learned, a finite state machine or deterministic finite automaton (DFA) can easily be extracted from the binary time sequences, where all the information of the input is contained in its temporal evolution, e.g., the interspike intervals of individual spike trains. The limitations of the LSM and the advantages of the DFA learned from the KAARMA algorithm are illustrated here for completeness.

Spike Train Classification using DFA learned from the KAARMA algorithm. Two Poisson spike trains of frequency 20 Hz and duration 0.5 s are generated as templates for two classes 0 and 1. FIG. 9A illustrates Poisson spike train templates (solid spikes for class 0 and dashed spikes for class 1). Actual spike trains used for training and testing are noisy versions of the templates, with each spike timing varying by a random amount, according to a zero mean Gaussian distribution with standard deviation or Gaussian jitter of 4 ms. An example of a jittered Class 0 spike train is shown at the bottom of FIG. 9A. The training set consisted of 500 realizations, with another 200 forming the independent test set. Due to the random displacement of spikes, it becomes impossible to recognize spike trains generated from a specific template using a single interspike interval. Furthermore, as non-numeric data, there are no spatial cues to rely on.
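The generation of the templates and their jittered realizations can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the function names and the random seed are hypothetical, the 500 training realizations are assumed to be split evenly between the two classes, and jittered spikes that land outside the 0.5 s window are left in place because the text does not say how they are treated.

```python
import random

def poisson_template(rate_hz, duration_s, rng):
    """Homogeneous Poisson spike train built from exponential interspike intervals."""
    times = []
    t = rng.expovariate(rate_hz)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate_hz)
    return times

def jitter(template, sigma_s, rng):
    """Displace every spike by zero-mean Gaussian noise, then restore time order."""
    return sorted(s + rng.gauss(0.0, sigma_s) for s in template)

rng = random.Random(0)  # fixed seed (an assumption) so the sketch is repeatable
templates = [poisson_template(20.0, 0.5, rng) for _ in range(2)]

# 500 labeled training realizations with 4 ms Gaussian jitter (even split assumed).
train_set = [(label, jitter(templates[label], 0.004, rng))
             for label in (0, 1) for _ in range(250)]
```

Because every spike moves independently, no single interspike interval identifies the template; a classifier has to use the overall temporal structure of the train.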

The liquid filter was a randomly connected recurrent neural microcircuit comprising 135 integrate-and-fire neurons, with 20% of the population randomly set as inhibitory. A readout neuron modeled as a perceptron or threshold gate was trained to classify the spike trains. The task of this threshold gate is to output 0 or 1 corresponding to the template used to generate the input spike train. The state of the microcircuit was sampled every 25 ms by low-pass filtering the response. Supervised learning was applied to the set of training examples in the form (state x^M(t), target y(t)) to train a readout function f^M. The experimental setup for the LSM is described in “Learning-tool: analysing the computational power of neural microcircuits (version 1.0)” (The IGI LSM Group, Jun. 11, 2006, http://www.lsm.tugraz.at/download/learning-tool-1.1-manual.pdf), which is hereby incorporated by reference in its entirety.

The QKAARMA network is trained directly on the spiking stimuli, downsampled using 10 ms bins such that when a bin contains multiple spikes, only a single 1 is recorded. Each jittered version of the target template forms a binary string of length 50. Note that the tensor-product formulation allows KAARMA networks to operate directly on spike train inputs via an appropriate spike train kernel. Here, binning is used to produce an exact binary-input finite state machine and to compare with the binned performance of the LSM. FIGS. 9B and 9C illustrate examples of test set spike trains in their original spike timings (left) and in the binned form (right) separated by template class 0 and 1, respectively. Hidden states of dimension 3 (s∈ℝ³) were used with the kernel parameter for both the states and the binary inputs set at a_s=a_u=1. The learning rate was fixed at η=0.1, with a quantization threshold of q=0.4 and a DFA extraction quantization threshold of q_DFA=0.11.
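The binning step can be sketched as follows. The function name `bin_spike_train` is hypothetical, and dropping spikes that fall outside the 0.5 s window is an assumption, since the text does not specify how such spikes are handled.

```python
def bin_spike_train(spike_times, duration=0.5, bin_width=0.010):
    """Convert spike times (seconds) into a binary sequence, one entry per 10 ms bin.

    A bin holding one or more spikes is recorded as a single 1.
    """
    num_bins = round(duration / bin_width)  # 0.5 s / 10 ms gives 50 bins
    bits = [0] * num_bins
    for s in spike_times:
        k = int(s // bin_width)
        if 0 <= k < num_bins:  # spikes outside the window are ignored (assumption)
            bits[k] = 1
    return bits

# Toy usage: two spikes share the first bin and still produce a single 1 there.
bits = bin_spike_train([0.001, 0.004, 0.013, 0.4999])
```

The resulting length-50 binary sequence is the form of input on which a binary-input finite state machine can operate.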

The performances of the LSM using various supervised learning techniques (linear classification, parallel delta (p-delta) rule, linear regression, and Levenberg-Marquardt (LM) backpropagation on a two-layer NN with 5 hidden units), measured by the correlation coefficient (CC), mean absolute error (MAE), error score, and mean squared error, are summarized in the table of FIG. 10. The best performance was given by the two-layer NN trained using LM backpropagation, which was trained for 48 epochs with the best validation performance at epoch 33. FIG. 11 illustrates a spike train misclassified by the LSM approach. Shown are LSM results for test input 3 with the readout function trained using linear classification, p-delta rule, linear regression, and backpropagation (desired given by dashed line and estimates given by circles).

FIG. 12A shows the online performance of the QKAARMA algorithm for the spike train classification task during training. While small perturbations in the input can cause large and unpredictable dynamical changes in the liquid states x^M of an LSM, the KAARMA network trained with stable states easily handled the variations in interspike intervals for each class of spike trains. The KAARMA network starts to show strong discrimination after only 60 training samples, compared to the 33 epochs (where each epoch consists of 500 training samples) needed to train an LSM readout with suboptimal performance. FIG. 12B shows the performance on the test set for the spike train classification task after training was completed. The QKAARMA algorithm was able to correctly label each test stimulus after a single learning pass through the training set. For this classification task, the KAARMA algorithm easily outperformed the LSM methods.

Unlike the LSM framework (which only has one attractor state: the resting state), KAARMA networks compute with stable attractors learned directly from data and can be easily converted into simple exact solutions in the form of deterministic finite automata. FIG. 13 illustrates the minimized finite state machine for accepting all template 0 spike trains and rejecting all template 1 spike trains, and vice versa, extracted from the trained KAARMA network. Shown are minimized DFA for Poisson spike trains with Gaussian jitter of 4 ms. State transitions were every 10 ms. As can be seen, the two minimized DFA are different, corresponding to distinct grammars governing the two Poisson spiking templates. Not only does this speed up computation dramatically via simple traversal of the states in the DFA with each input, as shown in FIGS. 14A and 14B, but this also opens the door for novel applications in neuroscience.

FIG. 14A shows the state transition trajectories for a spike train (shown as vertical lines at the bottom of each plot) generated by template 0, where accepted states are indicated by the dots. The final state at time step 51 gives the classification result. FIG. 14B shows the state transition trajectories for a spike train (shown as vertical lines at the bottom of each plot) generated by template 1, where accepted states are indicated by the dots. The final state at time step 51 again gives the classification result. KAARMA is able to create a state model for each class of spike train firings by extracting its grammar directly from the data. For example, long-term firing rates of neurons can be analyzed by comparing the dynamics or grammars in stable regions associated with different behaviors.

The KAARMA algorithm's capability to deliver computationally efficient and competitive solutions to fundamental applications in automatic speech recognition has been demonstrated. Simulations show that KAARMA-based classifiers can outperform similar HMM architectures on the TI-46 digit corpus.

With reference now to FIG. 15, shown is a schematic block diagram of a speech processing device 2100 according to an embodiment of the present disclosure. The speech processing device 2100 includes at least one processor circuit, for example, having a processor 2103 and a memory 2106, both of which are coupled to a local interface 2109. To this end, the speech processing device 2100 may comprise, for example, at least one server computer or like device. The local interface 2109 may comprise, for example, a data bus with an accompanying address/control bus or other bus structure as can be appreciated.

Stored in the memory 2106 are both data and several components that are executable by the processor 2103. In particular, stored in the memory 2106 and executable by the processor 2103 are a speech recognition application 2112 based upon pulse-based detection using a KAARMA network and/or KAARMA chain as previously discussed, one or more pulse-based data sets 2115 that may be used for training and/or testing of the KAARMA network and/or KAARMA chain, and potentially other applications 2118. Also stored in the memory 2106 may be a data store 2121 including, e.g., audio, video and other speech data. In addition, an operating system may be stored in the memory 2106 and executable by the processor 2103. It is understood that there may be other applications that are stored in the memory and are executable by the processor 2103 as can be appreciated.

Where any component discussed herein is implemented in the form of software, any one of a number of programming languages may be employed such as, for example, C, C++, C#, Objective C, Java®, JavaScript®, Perl, PHP, Visual Basic®, Python®, Ruby, Delphi®, Flash®, or other programming languages. A number of software components are stored in the memory and are executable by the processor 2103. In this respect, the term “executable” means a program file that is in a form that can ultimately be run by the processor 2103. Examples of executable programs may be, for example, a compiled program that can be translated into machine code in a format that can be loaded into a random access portion of the memory 2106 and run by the processor 2103, source code that may be expressed in proper format such as object code that is capable of being loaded into a random access portion of the memory 2106 and executed by the processor 2103, or source code that may be interpreted by another executable program to generate instructions in a random access portion of the memory 2106 to be executed by the processor 2103, etc. An executable program may be stored in any portion or component of the memory including, for example, random access memory (RAM), read-only memory (ROM), hard drive, solid-state drive, USB flash drive, memory card, optical disc such as compact disc (CD) or digital versatile disc (DVD), floppy disk, magnetic tape, or other memory components.

The memory is defined herein as including both volatile and nonvolatile memory and data storage components. Volatile components are those that do not retain data values upon loss of power. Nonvolatile components are those that retain data upon a loss of power. Thus, the memory 2106 may comprise, for example, random access memory (RAM), read-only memory (ROM), hard disk drives, solid-state drives, USB flash drives, memory cards accessed via a memory card reader, floppy disks accessed via an associated floppy disk drive, optical discs accessed via an optical disc drive, magnetic tapes accessed via an appropriate tape drive, and/or other memory components, or a combination of any two or more of these memory components. In addition, the RAM may comprise, for example, static random access memory (SRAM), dynamic random access memory (DRAM), or magnetic random access memory (MRAM) and other such devices. The ROM may comprise, for example, a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or other like memory device.

Also, the processor 2103 may represent multiple processors 2103 and the memory 2106 may represent multiple memories 2106 that operate in parallel processing circuits, respectively. In such a case, the local interface 2109 may be an appropriate network that facilitates communication between any two of the multiple processors 2103, between any processor 2103 and any of the memories 2106, or between any two of the memories 2106, etc. The processor 2103 may be of electrical or of some other available construction.

Although portions of the speech recognition application 2112, pulse-based data sets 2115, and other various systems described herein may be embodied in software or code executed by general purpose hardware, as an alternative the same may also be embodied in dedicated hardware or a combination of software/general purpose hardware and dedicated hardware. If embodied in dedicated hardware, each can be implemented as a circuit or state machine that employs any one of or a combination of a number of technologies. These technologies may include, but are not limited to, discrete logic circuits having logic gates for implementing various logic functions upon an application of one or more data signals, application specific integrated circuits having appropriate logic gates, or other components, etc. Such technologies are generally well known by those skilled in the art and, consequently, are not described in detail herein.

The speech recognition application 2112 and pulse-based data sets 2115 can comprise program instructions to implement logical function(s) and/or operations of the system. The program instructions may be embodied in the form of source code that comprises human-readable statements written in a programming language or machine code that comprises numerical instructions recognizable by a suitable execution system such as a processor in a computer system or other system. The machine code may be converted from the source code, etc. If embodied in hardware, each block may represent a circuit or a number of interconnected circuits to implement the specified logical function(s).

Also, any logic or application described herein, including the speech recognition application 2112 and pulse-based data sets 2115 that comprises software or code can be embodied in any non-transitory computer-readable medium for use by or in connection with an instruction execution system such as, for example, a processor 2103 in a computer system or other system. In this sense, the logic may comprise, for example, statements including instructions and declarations that can be fetched from the computer-readable medium and executed by the instruction execution system. In the context of the present disclosure, a “computer-readable medium” can be any medium that can contain, store, or maintain the logic or application described herein for use by or in connection with the instruction execution system.

The computer-readable medium can comprise any one of many physical media such as, for example, magnetic, optical, or semiconductor media. More specific examples of a suitable computer-readable medium would include, but are not limited to, magnetic tapes, magnetic floppy diskettes, magnetic hard drives, memory cards, solid-state drives, USB flash drives, or optical discs. Also, the computer-readable medium may be a random access memory (RAM) including, for example, static random access memory (SRAM) and dynamic random access memory (DRAM), or magnetic random access memory (MRAM). In addition, the computer-readable medium may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or other type of memory device.

In this disclosure, a simple and naive segmentation-alignment technique has been provided by first detecting the end points of the voiced regions in each isolated word utterance and then partitioning the acoustic features into segments of equal length corresponding to a fixed number of grammar states. Unlike classical HMMs, which operate on quasi-stationary states, good performance was obtained due to the KAARMA algorithm's ability to model dynamics within each arbitrarily defined state region. This has worked well for a small vocabulary of only ten words. For more complex and realistic speech recognition problems, the grammar state concept can be applied to a more fundamental speech unit, the phoneme, in order to avoid learning redundant grammar states in the construction of each word model. A dedicated KAARMA network may be trained for each phoneme using phoneme-labeled training data such as the TIMIT database. Since a KAARMA network provides no native phone/frame alignment capability, it could rely on a more conventional hybrid HMM framework by leveraging a Viterbi pass to compute an alignment of frames to states and the re-estimation of duration models in order to correctly classify phones.

Furthermore, pulse-based signals can be treated as binary sequences, which can encompass all the necessary dynamics in the form of a finite state machine or DFA. Computing using DFA, extracted from a KAARMA network trained from the pulse data, can be much faster than traditional methods involving analog integration or kernel functions since the state transitions are done automatically based on pulse arrival, e.g., via a lookup table.
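The lookup-table traversal can be sketched as follows. The two-state automaton used here is a made-up toy (it accepts binary sequences containing an even number of 1s) and is not a DFA extracted from a trained KAARMA network; the function name `run_dfa` and the table layout `transitions[state][bit]` are likewise assumptions.

```python
def run_dfa(transitions, accepting, bits, start=0):
    """Traverse a DFA stored as a lookup table transitions[state][bit].

    Returns (accepted, trajectory), where trajectory lists every state visited,
    starting with the initial state.
    """
    state = start
    trajectory = [state]
    for b in bits:
        state = transitions[state][b]  # one table lookup per binary input
        trajectory.append(state)
    return state in accepting, trajectory

# Toy automaton (hypothetical): state 0 = even number of 1s seen, state 1 = odd.
toy_transitions = [[0, 1], [1, 0]]
toy_accepting = {0}

binned_input = [1, 0, 1] + [0] * 47  # 50 binary inputs
accepted, trajectory = run_dfa(toy_transitions, toy_accepting, binned_input)
```

For a length-50 binary input the trajectory holds 51 states, and the final state alone decides acceptance, so classification costs one table lookup per bin and requires no kernel evaluations.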

It should be emphasized that the above-described embodiments of the present disclosure are merely possible examples of implementations set forth for a clear understanding of the principles of the disclosure. Many variations and modifications may be made to the above-described embodiment(s) without departing substantially from the spirit and principles of the disclosure. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.

It should be noted that ratios, concentrations, amounts, and other numerical data may be expressed herein in a range format. It is to be understood that such a range format is used for convenience and brevity, and thus, should be interpreted in a flexible manner to include not only the numerical values explicitly recited as the limits of the range, but also to include all the individual numerical values or sub-ranges encompassed within that range as if each numerical value and sub-range is explicitly recited. To illustrate, a concentration range of “about 0.1% to about 5%” should be interpreted to include not only the explicitly recited concentration of about 0.1 wt % to about 5 wt %, but also include individual concentrations (e.g., 1%, 2%, 3%, and 4%) and the sub-ranges (e.g., 0.5%, 1.1%, 2.2%, 3.3%, and 4.4%) within the indicated range. The term “about” can include traditional rounding according to significant figures of numerical values. In addition, the phrase “about ‘x’ to ‘y’” includes “about ‘x’ to about ‘y’”.
