The invention relates to automatic speech recognition systems. More precisely, the invention relates to a method for adjusting a discrete acoustic model complexity in an automatic speech recognition system comprising said discrete acoustic model, a pronunciation dictionary and, optionally, a language model or a grammar.
Automatic speech recognition (ASR) systems are widely used in different technical fields. ASR systems can enrich user-machine communication by providing a convenient interface which allows for speaking commands, dictating texts and filling in forms by voice. A possible application of ASR is also in telecommunications, for voice dialing or for enabling voice activated virtual agents that support customers calling call centers for help. It is important for such a system to achieve the best possible performance and an optimal operation time.
In a speech recognition system a number of knowledge sources about speech and language are used simultaneously to find accurate transcriptions of the spoken utterances. This idea is illustrated in the drawings.
The proposed invention is concerned with finding the acoustic model based on training data.
In particular the invention considers the discrete acoustic models known from the literature. A new method of obtaining a discrete model of optimal complexity (to be defined later) is proposed. The acoustic model obtained using the proposed method is optimal with respect to both accuracy and generalization. Thus the proposed method solves the accuracy/generalization tradeoff in an optimal manner. The proposed method is part of a larger set of methods which transform the speech database into an acoustic model (see the drawings).
This transformation will be described in detail in the following Sections.
Typically, the acoustic models needed by the speech recognizer are obtained through a multistage processing of pairs containing speech waveforms and their orthographic transcripts. In preparation for the proposed training method, including the complexity adjusting procedures, the following processing stages are necessary:
The aforementioned steps are required for the subsequent acoustic model training.
The data to be fed into the model complexity adjustment procedure can be acquired, for example, through a web interface. Such a web application allows for the registration of persons recording the speech. After registration the process of recording speech is started. The person reads the prompts shown at the top of the page and, after each prompt, the speech recording is transferred to our server together with the orthographic transcription of the recorded utterance, and the person is asked to record another prompt.
Thus the database contains pairs of orthographic transcriptions and speech waveforms (see the drawings).
The orthographic transcriptions are transformed into phonetic transcriptions using a trainable grapheme-to-phoneme converter, e.g. the Sequitur G2P tool [3], or rule-based systems.
The waveforms are transformed into sequences of feature vectors. The processing of the waveforms is organized temporally into 20-30 ms long frames. The frames advance with a step of 10 ms. Typical features are the Mel Frequency Cepstral Coefficients (MFCC) with delta and delta-delta derivatives. MFCCs are described, for example, in the HTK Book [2]. We denote the sequences of feature vectors as Y_i = {f_{i,j}}, i ∈ {1, …, G}, j ∈ {1, …, O_i}, where G is the number of waveform/transcription pairs and O_i is the number of frames in the i-th waveform. Each feature vector is a member of the Euclidean space, f_{i,j} ∈ ℝ^p. Typically the feature dimension p is equal to 39 for the MFCCs.
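By way of illustration only, the feature extraction stage described above can be sketched as follows in Python; the use of the librosa library, the 16 kHz sampling rate and the 25 ms frame length are assumptions made for this example and are not prescribed by the invention.

    import librosa
    import numpy as np

    def extract_features(wav_path, sr=16000):
        # Load the waveform and compute 13 MFCCs on 25 ms frames advancing by 10 ms.
        y, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
        # Append delta and delta-delta derivatives, giving p = 39 coefficients per frame.
        delta = librosa.feature.delta(mfcc)
        delta2 = librosa.feature.delta(mfcc, order=2)
        return np.vstack([mfcc, delta, delta2]).T   # shape (O_i, 39): one vector per frame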
The next step in processing the features is to decorrelate them. Toward this end the mean vector m ∈ ℝ^p and the scatter matrix S ∈ ℝ^{p×p} are computed according to:

m = (Σ_{i=1}^{G} O_i)^{−1} Σ_{i,j} f_{i,j},  S = (Σ_{i=1}^{G} O_i)^{−1} Σ_{i,j} (f_{i,j} − m)^T (f_{i,j} − m),  (1)

and the scatter matrix is factored as

S = R^T R,  (2)

where R is the Cholesky factor [4] of the scatter matrix.
Given the scatter matrix and the mean vector the features are decorrelated according to the prescription:
d_{i,j} = (f_{i,j} − m) R^{−1}. (3)
After the above procedure we have a set of decorrelated features, which are zero mean and whose correlation matrix is normalized to the identity matrix. The feature vector d_{i,j} = [a_{i,j}^T, Δ_{i,j}^T, ΔΔ_{i,j}^T]^T consists of the basic MFCCs, a_{i,j} ∈ ℝ^13, their delta derivatives, Δ_{i,j} ∈ ℝ^13, and their delta-delta derivatives, ΔΔ_{i,j} ∈ ℝ^13.
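A minimal numpy sketch of the decorrelation prescription above (an illustration only, not the training code of the invention) could look as follows; the features argument is assumed to stack all frames of all waveforms row-wise.

    import numpy as np

    def decorrelate(features):
        # features: array of shape (total_frames, p) stacking all f_{i,j} row-wise.
        m = features.mean(axis=0)                  # mean vector m
        centered = features - m
        S = centered.T @ centered / len(features)  # scatter matrix S
        R = np.linalg.cholesky(S).T                # upper-triangular Cholesky factor, S = R^T R
        d = centered @ np.linalg.inv(R)            # decorrelated features, eq. (3)
        return d, m, R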
According to some embodiments of the invention, the method for adjusting a discrete acoustic model complexity is carried out in an automatic speech recognition system comprising a discrete acoustic model, a pronunciation dictionary and, optionally, a language model or a grammar, said method comprising the steps of:
Preferably, the generalization coefficient N of the quantizer is larger than 5, more preferably larger than 10, and most preferably larger than 15.
Preferably, the quantizer in the step a1 is trained using the generalized Lloyd algorithm or the Equitz method.
Preferably, the complexity Π of the quantizer of the discrete acoustic model is defined as the number of codevectors in the trained quantizer.
Preferably, the quantizer in the step a1 is a product quantizer with the number of codevectors I distributed among the part quantizers.
In another preferred embodiment of the invention, the quantizer in the step a1 is a lattice quantizer.
In such a case, preferably, the complexity Π of the quantizer of the discrete acoustic model is defined as the volume of the lattice quantizer cell taken with a minus sign.
Preferably, the step a3 is carried out using the Viterbi training.
Preferably, the step a3 is carried out for clustered triphones or tied triphones.
The invention will now be described in detail with reference to the drawings, in which:
Choosing a proper model complexity is a much studied topic in machine learning. However, there is no single procedure applicable to a wide class of models. Herein we restrict our attention to discrete models, a.k.a. histograms with data dependent partitions. In a data dependent partition both the shape of the cells and the granularity (resolution/complexity/number of cells) are adjustable. The partition under consideration in this patent is derived from vector quantization [5] and is thus the so called Voronoi partition. The application of the invention is possible wherever there is a need for classification based on training data, e.g. in speaker recognition systems, recognition of faces, graphical signs and other types of data. A short account of vector quantization follows.
Our procedure for adjusting model complexity assumes the features are quantized [5]. There are several issues related to quantization of the features. One has to choose between lattice [6] and trained quantizers [7], between one-stage and product quantizers [5], etc. Next, the quantizer resolution has to be decided upon. The quantizer resolution is given, in the case of lattice quantizers, by the volume of the cell and, in the case of trained quantizers, by the number of codevectors in the codebook. Since the features belong to the Euclidean space of dimension p, we always talk here of vector quantizers.
A vector quantizer can be viewed as a mapping Q : ℝ^p → Y from the p-dimensional Euclidean space ℝ^p onto a discrete set Y ⊂ ℝ^p, where Y = {y_1, …, y_I}. The set Y is called the codebook. The elements of the codebook are the reproduction vectors or codevectors. The vector quantizer tiles the space into I sets known as quantizer bins or cells:

R_i = Q^{−1}(y_i) = {x ∈ ℝ^p : Q(x) = y_i}. (4)

The sets R_i have the following property:

R_i ∩ R_j = ∅ for j ≠ i. (5)
It can be shown that the reproduction vector inside the partition element R_i is optimal if it is the center of mass of that partition element. Formally:

y_i = ∫_{R_i} x p(x) dx / ∫_{R_i} p(x) dx, (6)

where p(x) is the source distribution. Since the source distribution is available only implicitly through the training set, the ensemble averages are replaced by sample averages to compute the actual placement of the reproduction vectors.
The input vectors to the quantizer are assigned reproduction vectors according to the nearest neighbor rule. It can be shown that the nearest neighbor rule is optimal, minimizing distortion induced by the quantization. Formally the nearest neighbor rule states:
R_i = {x : ∥x − y_i∥ ≤ ∥x − y_j∥, j ≠ i}, (7)
with any appropriate breaking of ties. A partition defined according to (7) is called the Voronoi partition.
Quantizers with bins (a countably infinite number of them) which are all the same and cover the whole space are known as lattice quantizers. The lattice quantizer, or more precisely its set of reproduction vectors, is defined as follows:
λ = {y : y = u^T M, u ∈ ℤ^p}, (8)
where M is the so called generator matrix. The volume of the lattice quantizer bin is given by:
det(M) (9)
Lattice quantizers do not require training, but constructing them is a difficult mathematical task [8].
A fragment of a hexagonal lattice covering the whole plane is shown in the drawings.
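As a simple illustration of eqs. (8) and (9), the sketch below quantizes to the cubic lattice ΔZ^p; the cubic lattice and the value of Δ are chosen only for this example (the hexagonal lattice of the drawing requires a dedicated decoding rule).

    import numpy as np

    def cubic_lattice_quantize(x, delta=0.5):
        # Generator matrix M = delta * I, so the lattice points are y = u^T M = delta * u, u in Z^p.
        x = np.atleast_1d(np.asarray(x, dtype=float))
        u = np.round(x / delta)            # nearest integer vector u
        y = delta * u                      # reproduction vector, cf. eq. (8)
        cell_volume = delta ** x.size      # det(M) = delta^p, cf. eq. (9)
        return y, cell_volume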
A different class of quantizers is the trained quantizers. There is a number of algorithms for obtaining a trained quantizer. To name a few, we have the generalized Lloyd algorithm (GLA) [5] and the method by Equitz [7], which requires fewer computations than the Lloyd algorithm at the price of being less accurate (this loss of accuracy is negligible in most practical applications). An often applied workaround aimed at lowering the complexity of training and encoding is dividing the space Ω of dimension dim(Ω) = p into subspaces Ω = Ω_1 × Ω_2 such that dim(Ω) = dim(Ω_1) + dim(Ω_2). Such quantizers are referred to as product quantizers.
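The generalized Lloyd algorithm mentioned above can be sketched as follows (an illustrative implementation only, alternating the nearest neighbor rule (7) with the centroid rule (6), the ensemble averages being replaced by sample averages):

    import numpy as np

    def train_gla(data, I, iters=20, seed=0):
        # data: (n_samples, p) training vectors; I: desired number of codevectors.
        rng = np.random.default_rng(seed)
        codebook = data[rng.choice(len(data), size=I, replace=False)].astype(float)
        for _ in range(iters):
            # Nearest neighbor rule, eq. (7): assign each vector to its closest codevector.
            dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Centroid rule, eq. (6): move each codevector to the sample mean of its cell.
            for i in range(I):
                members = data[labels == i]
                if len(members) > 0:
                    codebook[i] = members.mean(axis=0)
        return codebook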
Current practice regarding the adjustment of discrete model complexity is limited to the simple advice of using a codebook with, e.g., 256 entries; this value is typically found in a number of sources dealing with speech recognition with discrete models, cf. [9], [10]. This rather restrictive setting leaves no room for accurate modeling of the phonetic densities, especially if very large training sets are available. The difference between a greedy (low complexity) model and an accurate (high complexity) model is illustrated in the drawings.
Obviously the high complexity model from the
The assumption of a uniform distribution is restrictive. However, it gives important initial insight into the problem, and thus is briefly presented here. Assume that the probability density p(x) is of bounded support. Next assume that a space partition is given for which ∫_{R_i} p(x) dx = 1/K holds for every cell R_i intersecting the support, where K is the number of such cells.
Next, let X = {x_1, …, x_M} be a random sample whose elements are quantized, that is, each sample point is assigned a natural number in the range 1, …, I. It can be seen that the cell indices obtained in this way are governed by a multinomial distribution with K classes, where K is less than or equal to I. It should be pointed out that the number of cells K which intersect the support is unknown, and our goal is to estimate it.
Let V be the set of indices obtained by quantization of X. It can be shown that the conditional probability of the sample given a hypothetical number of cells K_h equals:

p(V | K_h) = [K_h! / (K_h − Z)!] · [M! / (S_1! ⋯ S_Z!)] · K_h^{−M}, (10)

where Z is the number of distinct bin indices included in V, S_i is the number of repetitions of the bin with index i, and M is the observation length. It can be seen that the maximum likelihood estimate of the hypothetical number of bins intersecting with the pdf under investigation does not depend on the middle term, which includes the multinomial coefficient. Thus the estimate can be obtained by maximizing (10) with respect to K_h:

K̂_h = argmax_{K_h ≥ Z} p(V | K_h). (11)
The likelihood (10) is equal to zero for K_h less than Z. We can distinguish the following three modi of this likelihood function:
It can be shown that the following conditions hold for each of the above listed modi:
The condition for 1) is:
Z=M. (12)
The condition for 2) is:

M > ln(Z + 1) / ln((Z + 1)/Z).
If M fulfills this condition then K̂_h is equal to Z. One can prove the following property, which is interesting from the theoretical viewpoint and establishes a link with the coupon collector problem known from the statistical literature [11]:
In the above expression H(K) is the harmonic number, equal by definition to H(K) = Σ_{i=1}^{K} 1/i.
The expression in the numerator is the mean number of trials needed to learn all bins intersecting with the support when the unknown number of such bins is K.
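As a numerical aside (not part of the claimed method, and assuming equally probable bins), the coupon collector quantity K·H(K) can be evaluated directly; for a codebook of K = 256 bins it predicts that on average about 1570 samples are needed before every bin has been observed at least once.

    K = 256
    H_K = sum(1.0 / i for i in range(1, K + 1))   # harmonic number H(K)
    print(K * H_K)                                # approximately 1567.8 expected samples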
The condition for 3) is that K̂_h solves:

H(K̂_h) − H(K̂_h − Z) = M / K̂_h,

where H(·) is the harmonic number introduced above and K̂_h > Z.
The proofs for the above three conditions follow.
The main vehicle of the proofs is the following expression, valid for the harmonic numbers [12]:

H(K) = ln(K) + C + 1/(2K) − 1/(12K²) + …,

where C is the Euler-Mascheroni constant. It can be seen that asymptotically, as K approaches infinity, the terms following the constant C vanish. This leads to the following property:
The proof for the condition 1) starts with taking the logarithm of the considered expression (eq. (11)):

ln p(V | K_h) = ln K_h! − ln (K_h − Z)! + ln M! − Σ_{i=1}^{Z} ln S_i! − M ln K_h.

Suppose now that K_h is a continuous variable, which amounts to allowing that variable to take on non-integer values. It can be seen that the middle sum does not depend on K_h, so the derivative with respect to that variable reads:

d/dK_h ln p(V | K_h) = H(K_h) − H(K_h − Z) − M / K_h.
The last expression allows us to state the condition 3), which is either the exact form

H(K̂_h) − H(K̂_h − Z) = M / K̂_h,

or, using the asymptotic expansion of the harmonic numbers, the approximate form

M = K̂_h ln [K̂_h / (K̂_h − Z)].

The latter equation lets us conclude that the sample length needed to learn a given percentage of the bins intersecting with the support is a multiple of K: substituting Z = αK̂_h gives M = K̂_h ln [1/(1 − α)].
Setting Z = M we see that, indeed, Z = M is the necessary and sufficient condition for the optimal K_h to approach infinity, thus proving the condition 1). This is due to the following identity:
It remains to prove the condition 2). In this case the maximum of the likelihood should be attained at K_h = Z. Thus we have the following inequality:
ln [p(V|Z)]>ln [p(V|Z+1)] (23)
which implies
ln(Z+1)−M ln(Z+1)<−M ln(Z) (24)
and, after some algebra, we arrive at the condition 2):

M > ln(Z + 1) / ln((Z + 1)/Z).
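A short numerical sketch of the maximum likelihood estimate (11) discussed above follows (an illustration based on the equal-probability likelihood (10); the K_max cap is an assumption introduced only to keep the search finite):

    import math
    from scipy.special import gammaln

    def estimate_K(Z, M, K_max=100000):
        # Maximize ln p(V|K_h) over K_h >= Z; the multinomial middle term of (10)
        # is omitted because it does not depend on K_h.
        def log_lik(K_h):
            return gammaln(K_h + 1) - gammaln(K_h - Z + 1) - M * math.log(K_h)
        return max(range(Z, K_max + 1), key=log_lik)   # returns K_max when Z = M (condition 1)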
The next step is the derivation of conditions analogous to those introduced in the previous Section, which are distribution free (we relax the assumption of bins of equal probabilities). To achieve this we introduce the vector of bin probabilities p = [p_1, …, p_{K_h}].
Since we do not impose a constraint on the probabilities of the bins, the considered probability function is now of the form:

p(V | K_h, p) = [M! / (S_1! ⋯ S_Z!)] ∏_{i=1}^{Z} p_{k_i}^{S_i}, (26)
where k = {k_1, …, k_Z} are the indices of the bins observed in V. We integrate the above function over the unit simplex D:
Note that integrating out the probabilities in eq. (26) is not the only available strategy. Another method would be to maximize over the joint vector of K_h and p. As can be seen, this is a polynomial optimization problem, which is in general NP-hard. However, some approximation algorithms exist which run in polynomial time, see e.g. [13]. Another, more viable, approach would be to use a pmf estimator with a proper handling of the back-off probabilities, and to use these estimated probabilities while computing a likelihood estimate of the joint vector according to (26). A good candidate algorithm for this approach could be the one given in [14]. In any case, as shown later in this document, the integrating-out strategy leads to neat mathematical results. The maximization strategy, though it forms an interesting alternative to the Natural Laws of Succession from [14], might be too computationally involved.
Let us assume that all pmfs are equally likely. This corresponds to the assumption that we do not know the true pmf and we attach to each possible p = [p_1, …, p_{K_h}] the same prior probability:
where the equality (28) follows from the fact that the value of the integral does not depend on the choice and order of the probabilities in the monomial integrand. We now present the most important results without going into technical details. Some details of the derivations are contained in the Appendix.
The following expression for the probability of K_h can be deduced starting from equation (26):
Similarly to the previously studied case of equal bin probabilities, we can distinguish the following three modi:
In choosing the quantizer complexity the idea is to balance the accuracy (complexity) of the model and its generalization ability. The generalization ability is measured by the ratio of M and Z, which we call the generalization coefficient in the remaining part of the patent. The larger this ratio is, the better the model will generalize, which means it will work better for samples outside the training set. However, as illustrated in the drawings, better generalization comes inevitably at the price of lower accuracy.
In the Section devoted to computations of the sample length needed to estimate the support (entitled "Dependencies for the non-uniform distribution") we saw that the sample needed to learn as much as 99% of the cells in the support is on the order of 100×K. The question is whether we actually need such good generalization, which comes inevitably at the price of lower accuracy. It seems that many of the cells learned with the generalization coefficient set to that number are of negligibly low probability. The intuition is that we can discard such cells and increase the accuracy of the model while sacrificing some generalization. Discarding these cells does not significantly increase the Bayes risk, since the cells are of such low probability. As derived using Monte Carlo experiments, it suffices to take the generalization coefficient equal to about 8 to 'saturate' the Bayes risk, which means that increasing the sample length beyond 8×K results in no further improvement of the classifier. This is a result of learning most of the 'typical' cells for the classes and discarding the low probability cells, which do not add to the Bayes risk significantly.
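For illustration, the generalization coefficient of a single subphonetic unit can be computed from the quantized training frames as follows (a sketch only; the indices argument is assumed to hold the quantizer bin indices of all frames assigned to that unit):

    import numpy as np

    def generalization_coefficient(indices):
        M = len(indices)                 # observation length for this subphonetic unit
        Z = len(np.unique(indices))      # number of distinct quantizer bins actually observed
        return M / Z                     # the generalization coefficient N = M / Z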
An element of the proposed invention is a method to distribute the codevectors of a resolution constrained, cf. [16], product vector quantizer between the parts of the product. The so called product codebook is given by C = C_1 × C_2, where the dimensions of the codevectors of C_1 and C_2 add up to the dimension of the codevectors of C. The codevectors of the product codebook C are given in terms of the codevectors of C_1 and C_2 as the concatenations:

y_i = [y_i^{(1)T}, y_i^{(2)T}]^T,

where y_i ∈ C, y_i^{(1)} ∈ C_1, y_i^{(2)} ∈ C_2 and I = I_1·I_2, I_1 = |C_1|, I_2 = |C_2|.
In light of the above definition we propose the following procedure for choosing I_1 and I_2 optimally so as to minimize the total distortion. We start the derivation by recalling known results from high rate quantization theory [16]. The distortion introduced by a high rate, resolution constrained quantizer, assuming the so called Gersho conjecture, is equal to:

D = C(k) ∫ p(x) [g(x)]^{−2/k} dx,

where g(x) is the density of the codevectors and C(k) is a constant that depends only on the dimension. The density of the codevectors is related to the number of such vectors by the following integral:

∫ g(x) dx = I.

According to the high rate quantization theory the optimal reproduction vector density reads, in terms of the source distribution:

g(x) = I · p(x)^{k/(k+2)} / ∫ p(t)^{k/(k+2)} dt,

where k is the dimension of the source vectors.
Let us define the marginal pdfs as:

p_1(x^{(1)}) = ∫_{Ω_2} p(x) dx^{(2)},  p_2(x^{(2)}) = ∫_{Ω_1} p(x) dx^{(1)},

where x lives in the product space Ω = Ω_1 × Ω_2, x^{(1)} is the projection of x onto the first subspace Ω_1, and x^{(2)} is the projection of x onto the second subspace Ω_2. The quantizers are embedded in the corresponding subspaces, thus we can write C_1 ⊂ Ω_1 and C_2 ⊂ Ω_2.
Applying the high rate quantization theory results to the problem of distributing the available I codevectors between the quantizers C_1 and C_2, we obtain the following Lagrange equation for the distortion induced by the product quantizer:

L(I_1, I_2, λ) = D_1(I_1) + D_2(I_2) + λ(I_1 I_2 − I),

where D_1(I_1) and D_2(I_2) are the high rate distortions of the part quantizers operating on the marginal densities p_1 and p_2, and λ is the Lagrange multiplier. Minimization of the Lagrange [17] equation with respect to I_1, I_2 and λ gives the desired solution (this computation can be done with ease using any computer algebra system, thus we do not provide it here).
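For illustration, the allocation can also be carried out numerically under the assumed parametric form D_j(I_j) = c_j · I_j^(−2/k_j) for the part distortions; the constants c_1, c_2 and the brute force search over the factorizations of I are assumptions made for this example only, not the formula of the invention.

    def allocate_codevectors(I, c1, k1, c2, k2):
        # Search the factorizations I = I1 * I2 and pick the split minimizing the total
        # high rate distortion D1(I1) + D2(I2), with D_j(I_j) = c_j * I_j ** (-2 / k_j).
        best = None
        for I1 in range(1, I + 1):
            if I % I1:
                continue
            I2 = I // I1
            distortion = c1 * I1 ** (-2.0 / k1) + c2 * I2 ** (-2.0 / k2)
            if best is None or distortion < best[0]:
                best = (distortion, I1, I2)
        return best[1], best[2]

For example, allocate_codevectors(4096, 1.0, 13, 1.0, 13) returns the symmetric split (64, 64), as expected when the two subspaces have equal dimension and comparable distortion constants.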
Given the above results, the quantizer resolution selection proceeds as follows. Let the complexity (resolution/number of codevectors/volume) of the discrete model be denoted as Π:
given: M, the training set, and the assumed generalization coefficient N
let H be the number of resulting subphonetic units (typically there are three subphonetic units per triphone); set M_j = M and Z_j = 1, j ∈ {1, …, H}
return: the optimal complexity Π and the optimal quantizer

Method 1. Method for Finding an Optimal Quantizer
The parameter N in this algorithm is the generalization coefficient introduced in the Section entitled "Choosing the sample length sufficient to 'saturate' the Bayes risk." The above algorithm should be performed separately for each stream of the feature vectors, that is, for the basic MFCCs, the delta MFCCs and the delta-delta MFCCs (cf. the Section entitled "Computation and normalization of features").
Since the generalization coefficient may vary across triphone clusters, we take as the generalization coefficient the smallest one over all triphone clusters. To compute the generalization coefficient one needs to go through the whole segmentation/training procedure. The segmentation/training procedure can be, e.g., Viterbi training, see [2], page 142. The algorithm results in a quantizer of optimal complexity given the training set. The returned optimal quantizer is a basis for forming the acoustic model in a straightforward manner, well known to those skilled in the art.
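One possible realization of the selection loop described above can be sketched as follows; the callables train_quantizer, segment and quantize are hypothetical placeholders supplied by the caller (for example GLA training, Viterbi segmentation and nearest neighbor encoding), and the doubling schedule for I is an assumption made only for this illustration, not the procedure claimed by the invention.

    def select_complexity(training_set, N, train_quantizer, segment, quantize, I_start=64):
        # train_quantizer(training_set, I) -> quantizer with I codevectors     (hypothetical placeholder)
        # segment(training_set, quantizer) -> per-subphonetic-unit frame lists (hypothetical placeholder)
        # quantize(quantizer, frames)      -> quantizer bin indices of frames  (hypothetical placeholder)
        I, best = I_start, None
        while True:
            quantizer = train_quantizer(training_set, I)
            clusters = segment(training_set, quantizer)
            # Smallest generalization coefficient M_j / Z_j over all triphone clusters.
            coeff = min(len(idx) / len(set(idx))
                        for idx in (quantize(quantizer, frames) for frames in clusters))
            if coeff < N:          # generalization dropped below the target: stop growing
                break
            best = (I, quantizer)  # remember the largest complexity that still generalizes
            I *= 2
        return best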
The procedure for adjusting the discrete model complexity can be executed during the training phase of a speech recognition system. The necessary technical devices which allow for the execution of the invented method are: any suitable computer with a CPU or multiple CPUs (Central Processing Unit), an appropriate amount of RAM (Random Access Memory) and I/O (Input/Output) modules. For example, it could be a desktop computer with a quad core Intel i7 processor, 6 GB of RAM, a hard disk with 320 GB capacity, a keyboard, a mouse and a computer display. The procedure can also be parallelized for execution on a single server or a cluster of servers. It could be a server with two 6-core Xeon processors, 24 GB of RAM and a 1 TB hard disk. The latter configuration might be necessary if the training set grows especially large.
The procedure for adjusting the discrete model complexity has been carried out for a relatively small training set, comprising 100 hours of speech data from around 100 different speakers, and consists of the following steps:
After these steps the acoustic model is ready for use in a speech recognition system, as shown in the drawings.
An ASR system obtained using the proposed invention is fast, because the probability of a feature vector is obtained in unit time: the operation of computing the probability of a feature vector is a simple table lookup. At the same time the system is more robust to speakers outside the training set than a system built with the classical approach to creating the acoustic model. Such an acoustic model optimized using the proposed invention can be stored in the memory of any device such as, for example, a mobile device, a laptop or a desktop device. The memory need not have a very low access time; it could even be a slow flash memory. Given an appropriately large training set collected from a large number of speakers, the system obtained using the proposed invention is truly speaker independent and does not require adaptation. This is due to the introduced generalization coefficient and the introduced procedure for adjusting the complexity of the discrete model. Additionally, we observe an improvement in WER (Word Error Rate) as compared to a classical system with the number of codevectors set arbitrarily, without optimization.
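By way of illustration of the table lookup mentioned above (a sketch, not the patented implementation): once the codebook is fixed, each subphonetic state can store a discrete probability table indexed by the codevector index, so evaluating an observation costs a single array access.

    import numpy as np

    class DiscreteEmissionModel:
        def __init__(self, prob_table):
            # prob_table[state, codevector_index] = P(codevector | state), estimated during training.
            self.prob_table = np.asarray(prob_table, dtype=float)

        def log_prob(self, state, codevector_index):
            # Constant time lookup; no Gaussian mixture evaluation is needed at decoding time.
            return float(np.log(self.prob_table[state, codevector_index]))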
The proposed method of adjusting complexity can be used virtually whenever fast and accurate classifiers are needed. Examples include, but are not limited to:
The introduced dependencies also allow for estimating the amount of data needed to achieve an assumed WER (Word Error Rate) in a speech recognition system or other classifiers. The method leading to such an estimate is the following; let N be equal to eight:
It can be shown, using Brion's formulae [19], that the integral in eq. (28) evaluates to:
In light of the above expression we get:
and next:
To get the probability of the hypothetical number of quantizer bins intersecting with the support, given the sample length M and the number of distinct bin indices in the training set Z, we apply the following derivation:
where C is some constant (not to be confused with the Euler-Mascheroni constant used earlier in this document). By evaluating the sum in the denominator:
we get:
Finally the probability of the hypothetical number of cells in the support is given by the following expression:
Foreign application priority data: P-399698, Jun 2012, PL (national).