This application claims the benefit of Korean Patent Application No. 10-2010-0107205, filed on Oct. 29, 2010, which is hereby incorporated by reference in its entirety into this application.
1. Technical Field
The present invention relates generally to an apparatus and method for creating an acoustic model and, more particularly, to an apparatus and method for creating an acoustic model, which can directly approximate a variation in the likelihood score and automatically find a penalty value for the complexity of an acoustic model based on the Minimum Description Length (MDL) criterion, thereby being able to freely adjust the size of an acoustic model in accordance with the specifications of a platform without deteriorating performance.
2. Description of the Related Art
The recognition performance of Automatic Speech Recognition (ASR) has been continuously increasing thanks to the advent of high-speed processors, an increase in the capacity of memory, the development of parallel processing techniques, an increase in the number of speech language resources, etc. Meanwhile, speech recognition systems are being mounted on a variety of hardware platforms ranging from a server-class computer to a small-sized portable terminal or a household electronic appliance. Accordingly, when speech recognition systems are designed, it is necessary to adjust their sizes in accordance with the computational capacities of the platforms so that they can achieve maximum recognition performance.
In order to adjust the size of a speech recognition system, a method of changing the size of an acoustic model or a language model may be chiefly considered. That is, it is necessary to reduce the size of a model while preventing recognition performance from decreasing to a level equal to or lower than a predetermined level, or to increase the size of a model so that performance can be improved.
In a Hidden Markov Model (HMM)-based speech recognition method, adjusting the size of an acoustic model means increasing or decreasing the total number of the mean vector and covariance matrix components (hereinafter referred to as “the total number of model parameters”) of all the HMM states that constitute the acoustic model. The computation of acoustic likelihood scores accounts for half or more of the overall computation of speech recognition, and therefore the adjustment of the size of an acoustic model is closely related not only to the size of the storage space for storing the model but also to speech recognition speed.
Research has been conducted into methods of learning an acoustic model using a sufficient number of model parameters with respect to given acoustic model learning data and then gradually reducing the number of Gaussian mixture components for each HMM state, in order to adjust the number of model parameters of an acoustic model in HMM-based speech recognition. These methods are configured to construct a binary tree by repeatedly merging the two Gaussian components having the most similar probability distributions and to prune the binary tree to an appropriate level, thereby creating an optimum acoustic model. For the purpose of measuring the distance between two Gaussian components, a Kullback-Leibler (KL) divergence measure, a Bhattacharyya distance measure, and the sum of the mixture weights of Gaussian components have been researched. Furthermore, a weighted KL divergence measure that reflects the weights of Gaussian components in the process of calculating the KL divergence between the Gaussian components has been proposed. It was reported that, among these measures, the KL divergence measure achieved relatively desirable performance.
However, the conventional KL divergence measure is limited in minimizing the amount of variation in the likelihood score, which is the intrinsic purpose of similarity measurement and probability distribution integration. Furthermore, in the conventional method, the total number of Gaussian components of an acoustic model is determined based on a penalty value for the complexity of the acoustic model that is predetermined in accordance with the Minimum Description Length (MDL) criterion. When information about the size of the acoustic model to be used in a system is provided, a variety of penalty values must be tried one by one to find an appropriate one.
An object of the present invention is to provide an apparatus and method for creating an acoustic model, which can directly approximate a variation in the likelihood score and automatically find a penalty value for the complexity of an acoustic model based on the MDL criterion, thereby being able to freely adjust the size of an acoustic model in accordance with the specifications of a platform without deteriorating performance.
In order to accomplish the above object, the present invention provides an apparatus for creating an acoustic model, including a binary tree creation unit for creating a binary tree by repeatedly merging a plurality of Gaussian components for each HMM state of an acoustic model based on a distance measure reflecting a variation in likelihood score; an information creation unit for creating information about the largest size of the acoustic model in accordance with a platform including a speech recognizer; and a binary tree reduction unit for reducing the binary tree in accordance with the information about the largest size of the acoustic model.
The apparatus may further include a binary tree storage unit for storing the reduced binary tree.
In order to accomplish the above object, the present invention provides a method of creating an acoustic model, including measuring the distances between a plurality of Gaussian components for each HMM state of an acoustic model based on a distance measure reflecting a variation in likelihood score; creating a binary tree by repeatedly merging two Gaussian components having the shortest distance; and reducing the binary tree in accordance with information about the largest size of the acoustic model corresponding to a platform including a speech recognizer.
The method may further include storing the reduced binary tree.
The above and other objects, features and advantages of the present invention will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
Reference now should be made to the drawings, in which the same reference numerals are used throughout the different drawings to designate the same or similar components.
The present invention will be described in detail below with reference to the accompanying drawings. In the following description, redundant descriptions and detailed descriptions of known functions and elements that may unnecessarily make the gist of the present invention obscure will be omitted. Embodiments of the present invention are provided to fully describe the present invention to those having ordinary knowledge in the art to which the present invention pertains. Accordingly, in the drawings, the shapes and sizes of elements may be exaggerated for the sake of clearer description.
The apparatus for creating an acoustic model according to an embodiment of the present invention may be configured to adjust the size of an acoustic model including a plurality of Gaussian components for each HMM state in accordance with a platform 111 and transfer it to a speech recognizer 112 included in the platform 111.
The platform 111 includes the speech recognizer 112, and may be any of a variety of platforms ranging from a small-sized terminal with limited computing resources, such as memory or a Central Processing Unit (CPU), to a server-class computer with almost unlimited computing resources. The apparatus for creating an acoustic model according to an embodiment of the present invention may be configured to adjust the size of an acoustic model so that speech can be recognized on such a variety of platforms.
As a prerequisite for the application of the apparatus for creating an acoustic model according to the embodiment of the present invention, the process of learning an acoustic model for speech recognition will now be described. The learning of an acoustic model for speech recognition requires a speech database in which speech pronounced by a plurality of utterers is stored, transcribed sentences which correspond to respective utterance files included in the speech database, and a pronunciation dictionary in which a pronunciation for each word is represented by means of phonetic symbols. An HMM-based statistical acoustic model is learned by a commonly known method using the above-described materials. The present invention is based on the assumption that L triphone HMM models having left-right acoustic context have been acquired.
In Equation 1, wr, μr, and σr are the mixture weight, mean vector and covariance matrix of an r-th Gaussian component, respectively. Furthermore, gr(x) is the normal distribution of the r-th Gaussian component, and Gr(x) is a normal distribution reflecting the weight of the r-th Gaussian component. In the speech recognition process, with respect to feature vectors extracted from each frame of input speech, the probability value of Equation 1 is calculated for the states of all the triphone HMMs included in the acoustic model. Accordingly, in order to increase speech recognition speed, it is important to reduce the number of all HMM states included in the acoustic model without deteriorating recognition performance.
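Although Equation 1 itself is not reproduced in this text, the computation it describes is the standard Gaussian-mixture state likelihood. The following sketch illustrates that computation under the common assumption of diagonal covariance matrices; the function and variable names are illustrative only and do not appear in the original description.

```python
import numpy as np

def state_likelihood(x, weights, means, variances):
    """Per-state GMM likelihood in the sense of Equation 1.

    x         : (D,) feature vector extracted from one speech frame
    weights   : (R,) mixture weights w_r of the state's Gaussian components
    means     : (R, D) mean vectors mu_r
    variances : (R, D) diagonal covariance entries of sigma_r
                (assumption: diagonal covariances, as is common in HMM acoustic models)
    """
    D = x.shape[0]
    # g_r(x): normal density of each component evaluated at x
    diff = x - means                                          # (R, D)
    exponent = -0.5 * np.sum(diff * diff / variances, axis=1)  # (R,)
    log_norm = -0.5 * (D * np.log(2.0 * np.pi) + np.sum(np.log(variances), axis=1))
    g = np.exp(log_norm + exponent)                            # (R,)
    # G_r(x) = w_r * g_r(x); the state score is their sum over all R components
    return np.sum(weights * g)
```

In a recognizer this score is evaluated for every HMM state and every frame, which is why reducing the total number of components directly reduces both storage and computation.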
Referring back to the drawing, the binary tree creation unit 101 creates a binary tree by repeating the process of merging a plurality of Gaussian components for each HMM state based on a distance measure reflecting a variation in the likelihood score. That is, the binary tree creation unit 101 measures the distances between the plurality of Gaussian components for each HMM state based on the distance measure reflecting a variation in the likelihood score and then repeats the process of merging the two Gaussian components having the shortest distance therebetween, thereby creating a binary tree. In this case, the binary tree creation unit 101 can obtain the distance measure reflecting a variation in the likelihood score by subtracting the approximate likelihood score after the merging of the plurality of Gaussian components from the approximate likelihood score before the merging. An algorithm for creating the binary tree using the binary tree creation unit 101 and the process of obtaining the distance measure reflecting a variation in the likelihood score will be described in detail below with reference to the drawings.
The information creation unit 102 is a unit that creates information about the largest size of the acoustic model that corresponds to the platform 111. The information about the largest size of the acoustic model may correspond to the specifications of the platform 111. That is, the acoustic model may have a size that varies depending on the specifications of a platform, such as internal memory, external memory and processing speed. Accordingly, the information creation unit 102 may receive platform-related information about the internal memory, external memory and processing speed of the platform 111, and create information about the largest size of the acoustic model corresponding to the platform 111 based on the received platform-related information.
The binary tree reduction unit 103 reduces the binary tree created by the binary tree creation unit 101 in accordance with the information about the largest size of the acoustic model created by the information creation unit 102. That is, the binary tree is reduced by receiving the information about the largest size of the acoustic model, which is based on the limitations of the platform 111 such as internal memory, external memory and processing speed, pruning the binary tree created by the binary tree creation unit 101, and eliminating Gaussian components that do not greatly influence recognition performance. The binary tree reduction unit 103 may convert the information about the largest size of the acoustic model, created by the information creation unit 102, into the total number of Gaussian components to be included in the acoustic model, and then use it to reduce the binary tree. Furthermore, the binary tree reduction unit 103 may perform searching downwards from the root node of the binary tree, and then obtain an optimum subset of the nodes of the binary tree in accordance with the MDL criterion corresponding to the number of model parameters, such as the weights, mean vectors and covariance matrices of the Gaussian components. Furthermore, the binary tree reduction unit 103 may transfer the optimum subset of the nodes of the binary tree to the speech recognizer 112 so that the speech recognizer 112 of the platform 111 can perform speech recognition using the reduced acoustic model. The process of reducing a binary tree using the binary tree reduction unit 103 will be described in detail below with reference to the drawings.
The binary tree storage unit 104 may store the binary tree reduced by the binary tree reduction unit 103. The binary tree stored by the binary tree storage unit 104 may be used for speech recognition later. In addition to the binary tree, the binary tree storage unit 104 may store model parameters, such as the weights, mean vectors and covariance matrices of the Gaussian components, and the total number of Gaussian components to be included in the acoustic model.
As described above, using the above configuration, the apparatus for creating an acoustic model according to the embodiment of the present invention is configured to adjust the size of an acoustic model including a plurality of Gaussian components for each HMM state in accordance with the platform 111, and transfer it to the speech recognizer 112 included in the platform 111.
The algorithm for creating a binary tree using the binary tree creation unit 101 will now be described. First, the algorithm starts with forming R Gaussian components, included in a specific HMM state s, into respective leaf nodes. Thereafter, the distance between the Gaussian components of each pair of possible Gaussian components is measured, two Gaussian components having the shortest distance are found, and the two Gaussian components are merged into one.
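A minimal sketch of this bottom-up construction is given below. It assumes a generic distance function and merge function (for example, the DL measure of Equation 4 and the moment-matched merge illustrated later), and it follows the node-ID convention described in the method section (leaf IDs 1 to R, merged-node IDs from R+1 upward); all names are illustrative.

```python
def build_binary_tree(components, distance, merge):
    """Bottom-up binary tree over the R Gaussian components of one HMM state.

    components : list of R Gaussian components (leaf nodes, IDs 1..R)
    distance   : function(comp_a, comp_b) -> non-negative distance
    merge      : function(comp_a, comp_b) -> merged Gaussian component
    Returns a list of (node_id, component, left_child_id, right_child_id) tuples.
    """
    nodes = [(i + 1, c, None, None) for i, c in enumerate(components)]
    active = list(nodes)            # nodes still available for merging
    next_id = len(components) + 1   # merged nodes get IDs R+1, R+2, ...

    while len(active) > 1:
        # find the pair of active nodes with the shortest distance
        best = None
        for i in range(len(active)):
            for j in range(i + 1, len(active)):
                d = distance(active[i][1], active[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = merge(active[i][1], active[j][1])
        new_node = (next_id, merged, active[i][0], active[j][0])
        nodes.append(new_node)
        # remove the two merged children from the active set and add their parent
        active = [n for k, n in enumerate(active) if k not in (i, j)] + [new_node]
        next_id += 1
    return nodes
```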
In the above algorithm, methods using a Kullback-Leibler (KL) divergence measure, a weighted KL divergence measure, a Bhattacharyya distance measure, or the sum of the mixture weights of Gaussian components as the distance measure, as described above, are presented as methods of measuring the distance between two Gaussian components. Such distance measures vary the topology of the binary tree shown in the drawing.
The above enumerated existing distance measures prefer that the variation between the likelihood score before the merging of two Gaussian components and the likelihood score after the merging be small. However, these distance measures do not directly utilize the variation in the likelihood score.
The apparatus for creating an acoustic model according to the embodiment of the present invention utilizes a Delta-Likelihood (DL) distance measure, that is, a new distance measure that directly reflects the variation in the likelihood score. Equation 2 expresses the approximate log likelihood score of a single Gaussian component over the learning data that it occupies. In Equation 2, D is the dimension of the feature vectors, σp is the covariance matrix of the p-th Gaussian component, and γp is the occupation count of that component, accumulated over the learning data.
When two Gaussian components gp and gq are merged into gr, the difference between the log likelihood scores before and after the merging may be calculated by the following Equation 3:
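Equations 2 and 3 are not reproduced in this text. The following LaTeX fragment gives the standard approximations that are consistent with the definitions above and with Equation 4 below; it is a reconstruction under that assumption, not a verbatim copy of the original equations.

```latex
% Eq. 2 (reconstructed): approximate log likelihood score of Gaussian component g_p
% over the learning data it occupies; gamma_p is the occupation count of the component.
\mathcal{L}_p \approx -\tfrac{1}{2}\,\gamma_p\left(\log\lvert\sigma_p\rvert + D\log(2\pi) + D\right)

% Eq. 3 (reconstructed): difference between the log likelihood scores before and after
% merging g_p and g_q into g_r, with gamma_r = gamma_p + gamma_q (the constant terms cancel).
\Delta\mathcal{L}(p,q) \approx \tfrac{1}{2}\left[(\gamma_p+\gamma_q)\log\lvert\sigma_r\rvert
  - \gamma_p\log\lvert\sigma_p\rvert - \gamma_q\log\lvert\sigma_q\rvert\right]
```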
When the value of Equation 3 is small, the distance between the two Gaussian components gp and gq can be determined to be short, and therefore the two components can be merged with each other. In practice, however, learning data cannot always be provided to the speech recognition system, and therefore it is difficult to obtain the values of γp and γq in Equation 3. Accordingly, the present invention proposes a new distance measure that utilizes wp and wq, the mixture weights of the Gaussian components, instead of those values. The proposed distance measure DL is defined as the following Equation 4:
dDL(Gp(x), Gq(x)) = (wp + wq)log|σr| − wp log|σp| − wq log|σq|   (4)
The number of model parameters before the merging is twice the number of model parameters after the merging. Because representing specific data with a larger number of parameters yields a greater likelihood score, the value of the proposed Equation 4 is always zero or positive.
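The following sketch illustrates Equation 4 for the common diagonal-covariance case. The moment-matching formulas used to obtain the merged component are the standard ones and are stated here only as an assumption, since the merge rule itself is not spelled out in the text; the function names are illustrative.

```python
import numpy as np

def merge_gaussians(wp, mup, varp, wq, muq, varq):
    """Moment-matched merge of two diagonal-covariance Gaussian components.
    (Standard formulas; assumed here, since the text does not spell out the merge rule.)"""
    wr = wp + wq
    mur = (wp * mup + wq * muq) / wr
    # merged second moment minus the squared merged mean
    varr = (wp * (varp + mup ** 2) + wq * (varq + muq ** 2)) / wr - mur ** 2
    return wr, mur, varr

def dl_distance(wp, varp, wq, varq, varr):
    """Delta-Likelihood distance of Equation 4. With diagonal covariances,
    log|sigma| reduces to the sum of the logs of the diagonal entries."""
    logdet_p = np.sum(np.log(varp))
    logdet_q = np.sum(np.log(varq))
    logdet_r = np.sum(np.log(varr))
    return (wp + wq) * logdet_r - wp * logdet_p - wq * logdet_q
```

In the tree-building sketch given earlier, dl_distance can serve as the distance function and merge_gaussians as the merge function.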
A bottom-up binary tree is constructed using the distance measure obtained as described above, as shown in the drawing.
As shown in the drawing, an optimum subset of the nodes of the binary tree is then selected in accordance with the MDL criterion of Equation 5.
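Equation 5 is likewise not reproduced in this text. A conventional MDL description length that matches the three terms discussed in the following paragraph has the form below; this is a reconstruction under that assumption, with T denoting the amount of data, and should not be read as the original equation.

```latex
% Eq. 5 (reconstructed): MDL description length of a candidate subset theta of binary tree nodes.
% First term: negative log likelihood of the data X under the model;
% second term: complexity penalty with k model parameters and adjustment variable alpha;
% third term: a constant C that does not affect the comparison between subsets.
\mathrm{MDL}(\theta) = -\log P(X \mid \theta) + \frac{\alpha\,k}{2}\log T + C
```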
Since, in Equation 5, the probability increases in proportion to the modeling capability for given data, the value of the first term decreases as the number of model parameters increases. In the second term, k is the total number of model parameters. The value of the second term increases in proportion to the number of model parameters, and therefore it functions as a penalty for a gradual increase in the complexity of the model. The α value is a variable that adjusts the degree of the penalty. The subset of binary tree nodes that is finally selected varies depending on this value. In the third term, C is a constant value, and is negligible because it does not influence the overall processing.
With regard to the penalty value adjustment variable α, in the conventional method, the total number of Gaussian components of the acoustic model is determined by the predetermined α value in Equation 5. Conversely, when information about the size of the acoustic model to be used in the system is provided, a variety of α values must be tried one by one to find an appropriate α value.
The apparatus for creating an acoustic model according to the embodiment of the present invention includes an algorithm for automating this process and automatically finding the optimum α value (see Equation 5) when the total number of finally desired Gaussian components is given. The graph in the accompanying drawing illustrates how the total number of Gaussian components gmmN varies with the α value; Δ(t) denotes the slope of this relationship at the t-th iteration.
In Equation 6, assuming that the slope represented by Δ(t) changes slowly, Δ(t)≈Δ(t+1). Accordingly, when t+1 and Δ(t) are substituted for t and Δ(t+1), respectively, in Equation 6, the following Equation 7 is obtained.
As the number of repetitions t is gradually increased from 0, gmmN(t) becomes closer to TargetGmmN. In this case, an optimum subset of nodes of the binary tree is obtained by applying α(t+1), and gmmN(t+1) at that time is calculated. Furthermore, when gmmN(t+1)=TargetGmmN, all Gaussian components may be output at that time, and the process of reducing the acoustic model may be terminated. When gmmN(t+1)≠TargetGmmN, t is increased by one, and the process restarts with the calculation of Equation 6.
Alternatively, the process of reducing the acoustic model may be terminated when the difference between gmmN(t+1) and TargetGmmN is equal to or smaller than a predetermined value, instead of when gmmN(t+1)=TargetGmmN. In this case, when the difference between gmmN(t+1) and TargetGmmN is not equal to or smaller than the predetermined value, t is increased by one, and the process restarts with the calculation of Equation 6.
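Equations 6 and 7 are not reproduced in this text, but the iteration they describe is a slope-based (secant-like) update of the α value. The following sketch is one possible realization under that reading; prune_tree_with_alpha is a hypothetical helper that prunes the binary trees with a given α according to the MDL criterion and returns the resulting total number of Gaussian components.

```python
def find_alpha(prune_tree_with_alpha, target_gmm_n, alpha0=1.0, alpha1=2.0,
               tolerance=0, max_iter=50):
    """Secant-like search for the penalty adjustment variable alpha.

    prune_tree_with_alpha : hypothetical helper; applies the MDL criterion with the
                            given alpha and returns gmmN, the total number of
                            Gaussian components that survive pruning.
    target_gmm_n          : TargetGmmN, the desired total number of components.
    tolerance             : stop when |gmmN(t+1) - TargetGmmN| <= tolerance.
    """
    alpha_prev, alpha = alpha0, alpha1
    gmm_prev = prune_tree_with_alpha(alpha_prev)
    gmm = prune_tree_with_alpha(alpha)
    for _ in range(max_iter):
        if abs(gmm - target_gmm_n) <= tolerance:
            break
        # Delta(t): change in component count per unit change in alpha,
        # assumed to vary slowly between consecutive iterations.
        slope = (gmm - gmm_prev) / (alpha - alpha_prev)
        if slope == 0:
            break  # a flat slope gives no direction for the next update
        next_alpha = alpha + (target_gmm_n - gmm) / slope
        alpha_prev, gmm_prev = alpha, gmm
        alpha, gmm = next_alpha, prune_tree_with_alpha(next_alpha)
    return alpha, gmm
```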
Finally, when the allowable size of the acoustic model, determined based on the hardware specifications of the platform on which the speech recognizer will be mounted, is Q bytes and the total number of unique HMM states is N, the total number of unique Gaussian components usable in the overall acoustic model can be obtained using the following equation:
where MeanSize is the memory size of the mean vector, CovSize is the memory size of the covariance matrix, and WeightSize is the memory size of the Gaussian component weight value.
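One plausible reading of the equation referenced above, stated here only as an assumption since the equation itself is not reproduced, is that the byte budget remaining after any fixed per-state overhead is divided by the per-Gaussian storage cost; per_state_overhead is a hypothetical parameter, as are the other names.

```python
def target_gaussian_count(q_bytes, n_states, mean_size, cov_size, weight_size,
                          per_state_overhead=0):
    """Estimate TargetGmmN, the total number of Gaussian components that fit
    into an acoustic model of at most q_bytes.

    Assumption: each Gaussian costs mean_size + cov_size + weight_size bytes,
    and per_state_overhead (hypothetical) covers any fixed cost per HMM state,
    e.g. transition probabilities, for the N unique states.
    """
    per_gaussian = mean_size + cov_size + weight_size
    available = q_bytes - n_states * per_state_overhead
    return int(max(0, available // per_gaussian))
```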
For the concrete steps of HMM-based speech recognition other than those described above, widely known common methods are used.
The method of creating an acoustic model according to the embodiment of the present invention may be configured to adjust the size of an acoustic model including a plurality of Gaussian components for each HMM state in accordance with a platform and transfer it to the speech recognizer included in the platform.
Referring to the drawing, the distances between a plurality of Gaussian components for each HMM state of an acoustic model are first measured, based on a distance measure reflecting a variation in the likelihood score, at step S601.
Thereafter, a binary tree is created by repeatedly merging two Gaussian components having the shortest distance at step S602. When the binary tree is created, IDs ranging from 1 to R are assigned to nodes corresponding to initial Gaussian components, and IDs sequentially increasing from R+1 by one are assigned to new nodes created after the merging, thereby creating the binary tree.
Once the binary tree is created at step S602, the binary tree is reduced in accordance with the information about the largest size of the acoustic model corresponding to the platform at step S603.
Once the binary tree has been reduced at step S603, the reduced binary tree may be stored at step S604.
Since the method of creating an acoustic model according to the embodiment of the present invention performs the process of creating an acoustic model similarly to the apparatus for creating an acoustic model according to the embodiment of the present invention shown in the drawing, a redundant description thereof will be omitted.
As described above, the present invention provides the apparatus and method for creating an acoustic model, which can directly approximate a variation in the likelihood score and automatically find a penalty value for the complexity of an acoustic model based on the MDL criterion, thereby being able to freely adjust the size of an acoustic model in accordance with the specifications of a platform without deteriorating performance.
Although the preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible, without departing from the scope and spirit of the invention as disclosed in the accompanying claims.