The present application is the National Phase of PCT/JP2008/053331, filed Feb. 26, 2008, which is based upon and claims priority from Japanese patent application 2007-048898 (filed on Feb. 28, 2007), the content of which is hereby incorporated in its entirety by reference in this specification.
The present invention relates to a voice recognition device, a voice recognition method, and a voice recognition program for recognizing voices, and more particularly to a voice recognition device, a voice recognition method, and a voice recognition program that increase voice recognition accuracy at a lower calculation cost by determining sound features and controlling the voice recognition related parameters.
In general, a voice recognition device is used to recognize received voices that are then converted to text for use in applications. For example, Patent Document 1 discloses a device that recognizes voices at high recognition accuracy, without an increase in calculation cost, while preventing a correct answer from being pruned. Non-Patent Document 1 also describes general voice recognition techniques and real-time voice recognition technology.
Patent Document 1:
Japanese Patent Publication Kokai JP2001-75596A (paragraphs 0063-0070, FIGS. 6-8)
Non-Patent Document 1:
Akio Ando “Real-time Speech Recognition”, Institute of Electronics, Information and Communication Engineers, pp. 28-33, 40-47, 62-67, 76-87, 126-129, 138-143, 148-165
All the disclosed contents of Patent Document 1 and non-Patent Document 1 given above are hereby incorporated by reference thereto in this application. The following gives an analysis of the technology related to the present invention.
A general voice recognition device comprises an input signal acquisition unit 91, a feature amount calculation unit 92, an acoustic model 93, a language model 94, a network search unit 95, and a recognition result output unit 96. The input signal acquisition unit 91 acquires (receives) input signals (voice signals) partitioned on a unit time basis. The feature amount calculation unit 92 calculates the feature amount from the input signals received by the input signal acquisition unit 91. The acoustic model 93 stores an acoustic model in advance. The language model 94 stores a language model in advance. The network search unit 95 searches for candidates for a word string as a result of voice recognition based on the feature amount calculated by the feature amount calculation unit 92, the acoustic model stored in the acoustic model 93, and the language model stored in the language model 94. The recognition result output unit 96 outputs candidates for a word string searched for by the network search unit 95.
In searching for a word string by means of the network search unit 95, the general voice recognition device finds, as the recognition result, a word string $\hat{\omega}$ of equation (1):

$$\hat{\omega} = \omega_0, \ldots, \omega_m \tag{1}$$
In this case, when the search method disclosed in Non-Patent Document 1 is used (see Chapter 6 of Non-Patent Document 1), the network search unit 95 can find the word string having the highest likelihood by using equation (2) when the input signal $x = x_0, \ldots, x_T$ is given in chronological order.
$$\hat{\omega} = \operatorname*{argmax}_{\omega} \left\{ \log P(x \mid \omega) + \lambda \log P(\omega) \right\} \tag{2}$$
In the above equation, λ is a parameter called the language weight. A larger value of the weight λ causes the device to make a search with focus on the language model, while a smaller value causes the device to make a search with focus on the acoustic model. P(x|ω) is the likelihood of the word string ω for the input signal x, calculated using the acoustic model, and P(ω) is the probability of occurrence of the word string ω, calculated using the language model.
In equation (2), argmax means calculation for finding the highest-likelihood word string for all possible combinations of word strings. However, because performing the above operation for all word strings requires an extremely high calculation cost, the candidates (hypotheses) for the word string are pruned in the actual operation. To prune the candidates for the word string, pruning parameters, such as the number of hypotheses and the likelihood width, must be set.
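As an illustration of equation (2) and the pruning just described, the following sketch (hypothetical Python; the names and numeric values are illustrative, not taken from the cited documents) combines the acoustic and language scores with a language weight and then applies likelihood-width and hypothesis-count pruning:

```python
# Hypothetical sketch of equation (2) plus pruning. `acoustic_logprob`
# and `language_logprob` stand in for log P(x|w) and log P(w).

def combined_score(acoustic_logprob, language_logprob, language_weight):
    # Equation (2): log P(x|w) + lambda * log P(w)
    return acoustic_logprob + language_weight * language_logprob

def prune(hypotheses, beam_width, max_hypotheses):
    """Keep hypotheses within `beam_width` of the best score (likelihood
    width pruning), capped at `max_hypotheses` (hypothesis count pruning)."""
    best = max(h["score"] for h in hypotheses)
    survivors = [h for h in hypotheses if h["score"] >= best - beam_width]
    survivors.sort(key=lambda h: h["score"], reverse=True)
    return survivors[:max_hypotheses]

hypotheses = [
    {"words": ["hello"], "score": combined_score(-120.0, -3.2, 10.0)},
    {"words": ["hollow"], "score": combined_score(-125.0, -7.9, 10.0)},
]
print(prune(hypotheses, beam_width=40.0, max_hypotheses=1))
```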
For the general voice recognition device described above, there are multiple parameters, such as the language weight and the pruning parameter described above, which must be set. The calculation cost and the recognition accuracy of the voice recognition device may be changed by controlling those parameters.
The voice recognition device described in Patent Document 1 allows the user to control the voice recognition related parameters described above.
The voice recognition device described in Patent Document 1 comprises voice data storage means 911, sound analysis means 912, acoustic model storage means 913, word dictionary storage means 914, likelihood operation means 915, pruning means 916, and recognition result output means 917. The voice data storage means 911 stores received voices. The sound analysis means 912 analyzes the sound of voice data stored in the voice data storage means 911 and outputs sound feature vectors. The acoustic model storage means 913 stores the acoustic model of phonemes. The word dictionary storage means 914 stores a word dictionary. The likelihood operation means 915 calculates the likelihood of each hypothesis, which is a recognition candidate, based on the sound feature vectors output by the sound analysis means 912, the acoustic model stored in the acoustic model storage means 913, and the word dictionary stored in the word dictionary storage means 914. The pruning means 916 finds the highest of the likelihoods of the hypotheses calculated by the likelihood operation means 915 and discards any hypothesis whose likelihood falls below that highest likelihood by more than a predetermined beam width. The recognition result output means 917 outputs the hypotheses selected by the pruning means 916 as recognition candidates.
In the voice recognition device having the components described above, the simplified acoustic model storage means 921 stores simplified acoustic models of phonemes. The simplified acoustic model probability operation means 922 calculates the simplified sound output probability of each HMM state at each time point within a predetermined period before and after the current time point, based on the sound feature vectors output by the sound analysis means 912 and the simplified acoustic models stored in the simplified acoustic model storage means 921. The order variation calculation means 923 ranks the simplified sound output probabilities of the HMM states calculated for each time point by the simplified acoustic model probability operation means 922, calculates the order variation width of each HMM state over a predetermined period before and after the current time point, and calculates the average of the order variation widths. The voice recognition device described in Patent Document 1 then adjusts the voice-recognition related parameters based on the average of the order variation width calculated by the order variation calculation means 923.
By configuring the device as described above, the voice recognition device described in Patent Document 1 allows the parameters to be changed so that the recognition accuracy is maximized within a predetermined calculation cost.
However, the method used by the voice recognition device described in Patent Document 1 requires the order variation calculation means 923 to conduct the time-consuming calculation of the average of the order changes in the HMM states before and after one particular time point. This results in the problem that the processing for calculating the optimum parameters causes a delay. Another problem with the method of the voice recognition device described in Patent Document 1 is that the calculation cost is not always reduced, because the likelihood calculation, which requires a high calculation cost, is performed for the simplified acoustic model and for the acoustic model separately.
As described above, the voice recognition system (voice recognition device) using the voice recognition technology described in Patent Document 1 and Non-Patent Document 1 has the following problems. A first problem is that the voice recognition method, which adjusts the parameters by calculating the order variations in the HMM (Hidden Markov Model) states using a simplified acoustic model, causes a processing delay because the method must conduct a time-consuming averaging calculation to find the order variations. A second problem is that this method must conduct an extra likelihood calculation for the simplified model, and this extra calculation may require a higher calculation cost.
In view of the foregoing, it is an object of the present invention to provide a voice recognition device, a voice recognition method, and a voice recognition program that judge the sound feature and recognize voices using appropriate parameters that increase the recognition accuracy at a low cost. It is another object of the present invention to provide a voice recognition device, a voice recognition method, and a voice recognition program that allow the appropriate parameters to be set without processing delay by considering the number of competing candidates at the same time point. It is still another object of the present invention to provide a voice recognition device, a voice recognition method, and a voice recognition program that require a smaller amount of calculation for finding appropriate parameters.
In accordance with a first aspect of the present invention, there is provided a voice recognition device that recognizes a voice of an input voice signal, comprising a voice model storage unit (for example, implemented by voice model storage unit 7) that stores in advance a predetermined voice model having a plurality of detail levels, the plurality of detail levels being information indicating a feature property of a voice for the voice model; a detail level selection unit (for example, implemented by detail level judgment unit 9) that selects a detail level, closest to a feature property of an input voice signal, from the detail levels of the voice model stored in the voice model storage unit; and a parameter setting unit (for example, implemented by parameter setting unit 10) that sets parameters for recognizing the voice of an input voice according to a detail level selected by the detail level selection unit.
The voice recognition device described above may be configured such that the detail level selection unit finds a detail level on a unit time basis and selects the detail level closest to the feature property of the input voice signal.
The voice recognition device described above may be configured such that the detail level selection unit performs statistical analysis of the detail level, selected on a unit time basis, for a plurality of unit times and finds a detail level of one particular unit time of interest.
The voice recognition device described above may further comprise a distance calculation unit (for example, implemented by distance calculation unit 8) that calculates distance information indicating a difference between the voice model stored in the voice model storage unit and the feature property of an input voice signal wherein the distance calculation unit calculates the distance information sequentially from low detail level distance information to higher detail level distance information, or sequentially from high detail level distance information to lower detail level distance information and the detail level selection unit finds a detail level corresponding to a minimum of the distance information calculated by the distance calculation unit.
The voice recognition device described above may be configured such that the voice model storage unit stores in advance a voice model having a parent-child structure.
The voice recognition device described above may further comprise an acoustic model storage unit (for example, implemented by acoustic model storage unit 3) that stores a predetermined acoustic model in advance; and a word string search unit (for example, implemented by network search unit 5) that searches for, and extracts, a word string as a result of voice recognition based on the parameters that are set by the parameter setting unit wherein the acoustic model storage unit stores in advance an acoustic model having predetermined relevance to the voice model stored in the voice model storage unit; and the word string search unit searches for a word string using relevance between the voice model and the acoustic model.
The voice recognition device described above may be configured such that the parameter setting unit sets at least one of a language weight parameter and a pruning parameter for performing predetermined pruning processing according to the detail level selected by the detail level selection unit.
The voice recognition device described above may further comprise an acoustic model storage unit (for example, implemented by acoustic model storage unit 13) that stores a plurality of predetermined acoustic models in advance; a language model storage unit (for example, implemented by language model storage unit 14) that stores a plurality of predetermined language models in advance; and a model selection unit (for example, implemented by model selection unit 12) that selects a set of an acoustic model and a language model from the plurality of acoustic models, stored in the acoustic model storage unit, and the plurality of language models stored in the language model storage unit, according to the detail level selected by the detail level selection unit.
The voice recognition device described above may further comprise an output change unit (for example, implemented by operation/response setting unit 15) that changes an output method or an output content of a voice recognition result of the input voice signal according to the detail level selected by the detail level selection unit.
The voice recognition device described above may further comprise a voice model update unit (for example, implemented by model learning unit 16) that updates the voice model, stored in the voice model storage unit, according to the detail level selected by the detail level selection unit.
In accordance with a second aspect of the present invention, there is provided a voice recognition method that recognizes a voice of an input voice signal, comprising: a detail level selection step that selects a detail level, closest to a feature property of an input voice signal, from a plurality of detail levels of a predetermined voice model stored in advance, the plurality of detail levels being information indicating a feature property of a voice for the voice model; and a parameter setting step that sets parameters for recognizing the voice of an input voice according to the selected detail level.
The voice recognition method described above may be configured such that the detail level selection step finds a detail level on a unit time basis and selects the detail level closest to the feature property of the input voice signal.
The voice recognition method described above may be configured such that the detail level selection step performs statistical analysis of the detail level, selected on a unit time basis, for a plurality of unit times and finds a detail level of one particular unit time of interest.
The voice recognition method described above may further comprise: a distance calculation step that calculates distance information indicating a difference between the voice model stored in advance and the feature property of an input voice signal, wherein the distance calculation step calculates the distance information sequentially from low detail level distance information to higher detail level distance information, or sequentially from high detail level distance information to lower detail level distance information and the detail level selection step finds a detail level corresponding to a minimum of the calculated distance information.
The voice recognition method described above may be configured such that the detail level closest to the feature property of the input voice signal is selected based on a voice model stored in advance and having a parent-child structure.
The voice recognition method described above may further comprise: a word string search step that searches for, and extracts, a word string as a result of voice recognition based on the parameters that are set wherein an acoustic model having predetermined relevance to the voice model is stored in advance and the word string search step searches for a word string using relevance between the voice model and the acoustic model.
The voice recognition method described above may be configured such that the parameter setting step sets at least one of a language weight parameter and a pruning parameter for performing predetermined pruning processing according to the selected detail level.
The voice recognition method described above may further comprise: a model selection step that selects a set of an acoustic model and a language model from a plurality of acoustic models stored in advance and a plurality of language models stored in advance according to the selected detail level.
The voice recognition method described above may further comprise: an output change step that changes an output method or an output content of a voice recognition result of the input voice signal according to the selected detail level.
The voice recognition method described above may further comprise: a voice model update step that updates the voice model stored in advance according to the selected detail level.
In accordance with a third aspect of the present invention, there is provided a voice recognition program that causes a computer to recognize a voice of an input voice signal, the program causing the computer to execute detail level selection processing that selects a detail level, closest to a feature property of an input voice signal, from a plurality of detail levels of a predetermined voice model stored in advance, the plurality of detail levels being information indicating a feature property of a voice for the voice model; and parameter setting processing that sets parameters for recognizing the voice of an input voice according to the selected detail level.
The voice recognition program described above may be configured such that, in the detail level selection processing, the program causes the computer to find a detail level on a unit time basis and select the detail level closest to the feature property of the input voice signal.
The voice recognition program described above may be configured such that, in the detail level selection processing, the program causes the computer to perform a statistical analysis of the detail level, selected on a unit time basis, for a plurality of unit times and find a detail level of one particular unit time of interest.
The voice recognition program described above may further cause the computer to execute distance calculation processing that calculates distance information indicating a difference between the voice model stored in advance and the feature property of an input voice signal wherein, in the distance calculation processing, the program causes the computer to calculate the distance information sequentially from low detail level distance information to higher detail level distance information, or sequentially from high detail level distance information to lower detail level distance information; and, in the detail level selection processing, the program causes the computer to find a detail level corresponding to a minimum of the calculated distance information.
The voice recognition program described above may be configured such that the program causes the computer to select the detail level closest to the feature property of the input voice signal based on a voice model stored in advance and having a parent-child structure.
The voice recognition program described above may further cause the computer, which has a storage unit (for example, acoustic model storage unit 3) that stores, in advance, an acoustic model having predetermined relevance to the voice model, to execute word string search processing that searches for, and extracts, a word string as a result of voice recognition based on the parameters that are set wherein, in the word string search processing, the program causes the computer to search for a word string using relevance between the voice model and the acoustic model.
The voice recognition program described above may be configured such that, in the parameter setting processing, the program causes the computer to set at least one of a language weight parameter and a pruning parameter for performing predetermined pruning processing according to the selected detail level.
The voice recognition program described above may further cause the computer to execute model selection processing that selects a set of an acoustic model and a language model from a plurality of acoustic models stored in advance and a plurality of language models stored in advance according to the selected detail level.
The voice recognition program described above may further cause the computer to execute output change processing of changing an output manner or an output content of a voice recognition result of the input voice signal according to the selected detail level.
The voice recognition program described above may further cause the computer to execute voice model update processing that updates the voice model stored in advance according to the selected detail level. Doing so allows the voice model to be adapted to the speaker or noise environment.
In short, the voice recognition device of the present invention is generally configured as follows to solve the problems described above. That is, the voice recognition device has a voice model having multiple detail levels representing the feature property of a voice, selects a detail level closest to the feature property of an input signal, and controls the parameters related to voice recognition according to the selected detail level.
In the configuration described above, the distance to the input signal is compared between a high detail level and a low detail level of the voice model. That is, if the high detail level is closer to the input signal, the feature property of the input signal is close to the feature property of the data used when the acoustic model was developed by learning and, therefore, the voice is recognized using low-calculation-cost parameters, considering that the feature property of the voice is reliable. Conversely, if the low detail level is closer to the input signal, the feature property of the input signal is far from the feature property of the learning data and, therefore, the voice is recognized using parameters that ensure high accuracy, considering that the feature property of the voice is not reliable. Dynamically controlling the parameters according to the detail level as described above always ensures highly accurate voice recognition at an optimum calculation cost. This achieves the first object of the present invention.
The ability to determine the optimum parameters, based only on the detail level information corresponding to the input signal at one particular time, causes no processing delay. This achieves the second object of the present invention.
Because a voice model having multiple detail levels has a size sufficiently smaller than that of an acoustic model, the calculation cost is reduced as compared with that of the voice recognition method in which a simplified acoustic model is used to find an order change in the HMM state for parameter adjustment (see Patent Document 1). This achieves the third object of the present invention.
According to the present invention, a detail level closest to the feature property of an input voice signal is selected from the detail levels of a voice model and, based on the selected detail level, the parameters for recognizing the input voice are set. Therefore, by judging the sound property, the voice may be recognized using appropriate parameters, which ensure high recognition accuracy, at a low calculation cost. That is, based on the information on a detail level of the voice model to which the input voice signal belongs, the present invention makes it possible to determine whether or not the feature property of the input voice signal is close to that of the voice data used when the acoustic model was developed, and is reliable. Therefore, the parameters for voice recognition may be set and, based on the parameters, voices may be recognized.
According to the present invention, appropriate parameters may be set without processing delay by considering the number of competing candidates at the same time point. That is, to find the information as to which detail level of the voice model the input voice signal belongs to, the present invention requires that only one particular target time point be considered, without need for conducting the time-consuming averaging calculation. Therefore, the voice may be recognized by setting parameters without processing delay.
According to the present invention, appropriate parameters may be determined with a small amount of operation. That is, a voice model having multiple detail levels has a size sufficiently smaller than that of an acoustic model. Therefore, the parameters may be set and the voice may be recognized with a small increase in the calculation cost.
First Exemplary Embodiment
A first exemplary embodiment of the present invention will be described below with reference to the drawings.
The input signal acquisition unit 1 is implemented in the actual device by the CPU of an information processing device operating under program control. The input signal acquisition unit 1 has the function to acquire (receive) input signals in a divided (or sampled) fashion on a unit time basis. For example, the input signal acquisition unit 1 receives voice signals from a voice input device, such as a microphone, as input signals. In addition, the input signal acquisition unit 1 may acquire, as input signals, voice signals stored in advance in a database, for example.
The feature amount calculation unit 2 is implemented in the actual device by the CPU of an information processing device operating under program control. The feature amount calculation unit 2 has a function to calculate the feature amount, which indicates the feature property of the input voice, based on the input signal received by the input signal acquisition unit 1.
The acoustic model storage unit 3 and the language model storage unit 4 are implemented in the actual device by a storage device such as a magnetic disk device or an optical disc device. The acoustic model storage unit 3 stores a predetermined acoustic model in advance. The language model storage unit 4 stores a predetermined language model in advance.
The network search unit 5 is implemented in the actual device by a CPU of an information processing device operating under program control. The network search unit 5 has a function to search for candidates for a word string based on the feature amount calculated by the feature amount calculation unit 2, the acoustic model stored in the acoustic model storage unit 3, and the language model stored in the language model storage unit 4. In addition, the network search unit 5 has a function to extract candidates for a word string as a result of voice recognition of input voices based on a search result of candidates for a word string.
The recognition result output unit 6 is implemented in the actual device by the CPU of an information processing device operating under program control. The recognition result output unit 6 has a function to output candidates for a word string searched for by the network search unit 5. The recognition result output unit 6 displays candidates for a word string, for example, on a display device as the voice recognition result of input voices. In addition, the recognition result output unit 6 may output, for example, a file that includes candidates for a word string, as the voice recognition result of input voices.
In the voice recognition system (voice recognition device) having the components described above, the voice model storage unit 7, distance calculation unit 8, detail level judgment unit 9, and parameter setting unit 10 have the following functions.
The voice model storage unit 7 is implemented in the actual device by a storage device such as a magnetic disk device or an optical disc device. The voice model storage unit 7 stores a voice model, which has multiple detail levels, in advance. The “detail level” refers to a measure that determines whether voice phenomena are represented coarsely or finely using a voice model.
The distance calculation unit 8 is implemented in the actual device by the CPU of an information processing device operating under program control. The distance calculation unit 8 has a function to calculate the distance of the feature amount, calculated by the feature amount calculation unit 2, from each detail level of the voice model stored in the voice model storage unit 7. More specifically, the distance calculation unit 8 calculates a value indicating the difference between the feature amount of an input voice and each detail level to calculate the distance between the feature amount of the input voice and each detail level.
The detail level judgment unit 9 is implemented in the actual device by the CPU of an information processing device operating under program control. The detail level judgment unit 9 has the function to compare the distances between the feature amount and the detail levels, calculated by the distance calculation unit 8, and to find (judge) the detail level that minimizes the distance to the feature amount calculated by the feature amount calculation unit 2. That is, the detail level judgment unit 9 selects the detail level, closest to the feature property of the received voice signal, from the detail levels of the voice model stored in the voice model storage unit 7.
The parameter setting unit 10 is implemented in the actual device by the CPU of an information processing device operating under program control. The parameter setting unit 10 has the function to set parameters, which will be necessary when the network search unit 5 searches for a word string, according to the value of the detail level judged by the detail level judgment unit 9.
As the feature amount, the feature amount calculation unit 2 calculates the value indicating the feature of the voice such as the cepstrum, log spectrum, spectrum, formant position, pitch, and spectrum power changes across (or over) multiple frames of the input voice. The feature amount and the method for calculating the feature amount described in this application are described, for example, in Chapter 2 of Non-Patent Document 1. The contents of Chapter 2 of Non-Patent Document 1 are hereby incorporated in this application by reference thereto.
The acoustic model storage unit 3 stores data, such as an HMM (Hidden Markov Model), as an acoustic model. The acoustic model described in this application is described, for example, in Chapter 3 of Non-Patent Document 1. The acoustic model creation method described in this application is described, for example, in Chapter 14 of Non-Patent Document 1. The contents of Chapter 3 and Chapter 14 of Non-Patent Document 1 are hereby incorporated in this application by reference thereto.
The language model storage unit 4 stores data, such as an N-gram, a word dictionary, and a context-free grammar etc., as the language model. The language model described herein and the voice recognition algorithm using the language model are described, for example, in Chapter 5 of Non-Patent Document 1. The contents of Chapter 5 of Non-Patent Document 1 are hereby incorporated in this application by reference thereto.
The network search unit 5 searches for a word string using a method such as a beam search. That is, the network search unit 5 searches the word string network, represented by the language model stored in the language model storage unit 4, for a correct word string using the acoustic model stored in the acoustic model storage unit 3 and, as a result of the sound recognition of the input sound, extracts candidates for the word string. The word string search method described herein is described, for example, in Chapter 6 of Non-Patent Document 1. The contents of Chapter 6 of Non-Patent Document 1 are hereby incorporated in this application by reference thereto.
The voice model storage unit 7 stores a voice model including multiple detail levels. For example, the voice model storage unit 7 stores data, such as an HMM or a GMM (Gaussian Mixture Model), as a voice model.
An HMM or a GMM is configured by combining multiple probability distribution functions. The Gaussian distribution is usually used for the probability distribution function, but a function other than the Gaussian distribution may also be used. The parameters of the probability distribution function are determined by learning voices using a method such as the EM algorithm. The EM algorithm described in this application is described, for example, in Chapter 4 of Non-Patent Document 1. The contents of Chapter 4 of Non-Patent Document 1 are hereby incorporated in this application by reference thereto.
Examples of the detail level held by the voice model include the number of mixed probability distribution functions and the mean values of the probability distribution functions.
A voice model having different detail levels is created by one of the following two creation methods: a top-down creation method and a bottom-up creation method. An example of the top-down creation method is as follows. First, a voice model having a small number of mixtures is created using learning data; after that, the probability distribution functions constituting the voice model are divided to increase the number of mixtures. The model with the increased number of mixtures is then trained again. By repeating this learning-and-division processing until a voice model with the required number of mixtures is obtained, a voice model having different detail levels is created, as sketched below.
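A minimal sketch of this top-down learning-and-division procedure, under the assumption of diagonal-covariance Gaussian mixtures (the splitting offset of 0.2 standard deviations is an illustrative choice, not a value from the original text):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical sketch of the top-down method: train a small mixture,
# split each Gaussian by offsetting its mean, and re-estimate.

def build_detail_levels(data, n_levels):
    levels = []
    gmm = GaussianMixture(n_components=1, covariance_type="diag").fit(data)
    levels.append(gmm)
    for _ in range(n_levels - 1):
        # Split: offset each mean along its per-dimension std deviation.
        means = np.concatenate([
            gmm.means_ + 0.2 * np.sqrt(gmm.covariances_),
            gmm.means_ - 0.2 * np.sqrt(gmm.covariances_),
        ])
        gmm = GaussianMixture(n_components=2 * gmm.n_components,
                              covariance_type="diag",
                              means_init=means).fit(data)  # re-learn
        levels.append(gmm)
    return levels  # one mixture per detail level: 1, 2, 4, ... components

levels = build_detail_levels(np.random.randn(500, 12), n_levels=4)
print([g.n_components for g in levels])  # [1, 2, 4, 8]
```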
A voice model having different detail levels may also be generated by creating voice models while varying the detail level determined by a combination of phonemes, such as monophone, diphone, triphone, quinphone, and so on.
An example of the bottom-up voice model creation method is as follows. That is, a voice model having different detail levels is created by rearranging (or reconstructing) a voice model, which is created by some learning means and configured as a mixture of multiple probability distribution functions, according to distance, using a method such as the K-means method. The K-means method mentioned here is described, for example, in R. O. Duda, P. E. Hart, and D. G. Stork, "Pattern Classification" (John Wiley & Sons; Japanese translation supervised by Morio Onoe, New Technology Communications, pp. 528-529).
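A minimal sketch of such a bottom-up rearrangement, assuming equal-weight diagonal Gaussians that are grouped by K-means and merged by simple moment matching (the merge rule is an illustrative assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative bottom-up sketch: the component means of a high-detail
# mixture are grouped by K-means, and each group becomes one parent Gaussian.

def merge_to_lower_level(means, variances, n_parents):
    labels = KMeans(n_clusters=n_parents, n_init=10).fit_predict(means)
    parent_means, parent_vars = [], []
    for c in range(n_parents):
        m, v = means[labels == c], variances[labels == c]
        mu = m.mean(axis=0)
        # Merged variance = mean child variance + spread of child means.
        parent_vars.append(v.mean(axis=0) + ((m - mu) ** 2).mean(axis=0))
        parent_means.append(mu)
    return np.array(parent_means), np.array(parent_vars)

means = np.random.randn(32, 12)      # 32 child Gaussians, 12 dimensions
variances = np.ones((32, 12))
pm, pv = merge_to_lower_level(means, variances, n_parents=8)
print(pm.shape, pv.shape)            # (8, 12) (8, 12)
```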
The voice model having different detail levels described above is created by a system designer in advance and stored in the voice model storage unit 7.
In this exemplary embodiment, the storage device of the information processing device, which implements the voice recognition device, stores various types of programs for executing the voice recognition processing. For example, the storage device stores the voice recognition program that causes a computer to execute the following two types of processing: the detail level selection processing for selecting a detail level, closest to the feature property of a received voice signal, from the multiple detail levels of a predetermined voice model stored in advance, the multiple detail levels being information indicating the voice feature property for the voice model; and the parameter setting processing for setting the parameters for recognizing the received voice according to the selected detail level.
Next, the following describes the operation. First, the input signal acquisition unit 1 acquires (receives) the input signal partitioned on a unit time basis (step S1). Next, the feature amount calculation unit 2 calculates the feature amount of the input voice based on the per-unit-time input signal acquired by the input signal acquisition unit 1 (step S2). For example, as the feature amount, the feature amount calculation unit 2 calculates the feature amount vector $x_t$ of the input signal for the t-th unit time.
Next, the distance calculation unit 8 calculates the distance between each of the multiple detail levels of the voice model and the feature amount of the per-unit-time input signal (step S3). In this case, when an HMM or a GMM is used as the voice model, the distance calculation unit 8 calculates the likelihood indicated by equation (3), or the log likelihood, to calculate the distance between the feature amount and the detail level. In the standard Gaussian form consistent with the symbols defined below, the log likelihood of the k-th probability density function for the feature amount vector $x_t$ reads:

$$\log N(x_t; \mu_k, \Sigma_k) = -\tfrac{1}{2}(x_t - \mu_k)^{\top} \Sigma_k^{-1} (x_t - \mu_k) - \tfrac{1}{2}\log\lvert\Sigma_k\rvert + C \tag{3}$$

In the above equation, μ_k indicates the mean of the k-th probability density function, Σ_k indicates the variance of the k-th probability density function, C indicates the constant term (which depends on n), and n indicates the number of dimensions of the feature amount vector $x_t$.
When a likelihood or a log likelihood is used, a larger likelihood value or a larger log likelihood value means a shorter distance between the feature amount and the detail level. In calculating the distance between the feature amount and the detail level, the distance calculation unit 8 may calculate a distance measure, such as a Euclidean distance, instead of a likelihood or a log likelihood. Although the voice model at each detail level is represented by a mixture of multiple probability density functions, the distance between the feature amount of a per-unit-time input signal and the detail level is represented by the distance to the closest of those probability density functions.
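The distance computation of step S3 might look as follows (a hypothetical sketch under a diagonal-covariance assumption; the helper names are illustrative):

```python
import numpy as np

# Minimal sketch of step S3: the distance of feature vector x_t to one
# detail level is the negated largest per-component Gaussian log
# likelihood (equation (3) above).

def gaussian_log_likelihood(x, mu, var):
    return -0.5 * (np.sum((x - mu) ** 2 / var)
                   + np.sum(np.log(var))
                   + len(x) * np.log(2 * np.pi))

def distance_to_level(x, means, variances):
    # The closest single component represents the whole detail level.
    scores = [gaussian_log_likelihood(x, m, v)
              for m, v in zip(means, variances)]
    return -max(scores)  # larger likelihood => shorter distance

x = np.zeros(12)
means, variances = np.random.randn(8, 12), np.ones((8, 12))
print(distance_to_level(x, means, variances))
```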
Next, the detail level judgment unit 9 compares the distances to the detail levels of the voice model calculated by the distance calculation unit 8 and finds the detail level whose distance to the feature amount, calculated by the feature amount calculation unit 2, is shortest (step S4). That is, the detail level judgment unit 9 examines the multiple detail levels of the voice model stored in the voice model storage unit 7, based on the distances calculated by the distance calculation unit 8, and judges which detail level has the shortest distance to the feature amount calculated by the feature amount calculation unit 2.
In step S4, in addition to finding the detail level on a unit time basis, the detail level judgment unit 9 may conduct a statistical analysis to find the detail level that minimizes the average distance over multiple unit times, or over one utterance. That is, the detail level judgment unit 9 may find a detail level for each unit time and select the detail level closest to the feature property of the input voice signal, or it may conduct a statistical analysis of the detail levels selected over multiple unit times, one for each unit time, to find the detail level of one particular unit time of interest.
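One simple form of such a statistical analysis is a majority vote over a window of unit times, sketched below (the window size is an illustrative assumption):

```python
from collections import Counter

# Hypothetical sketch: per-unit-time detail level decisions are smoothed
# by a majority vote over a window around the unit time of interest.

def smooth_detail_levels(per_frame_levels, window=5):
    half = window // 2
    smoothed = []
    for t in range(len(per_frame_levels)):
        ctx = per_frame_levels[max(0, t - half):t + half + 1]
        smoothed.append(Counter(ctx).most_common(1)[0][0])
    return smoothed

print(smooth_detail_levels([3, 3, 1, 3, 2, 2, 3, 2, 2, 2]))
```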
Next, the parameter setting unit 10 sets the parameters, which will be used when the network search unit 5 searches for a word string, using the detail level judged by the detail level judgment unit 9 (step S5). In this case, the parameter setting unit 10 sets the language weight (for example, a weight coefficient) parameter, the pruning parameter, and the like. That is, the parameter setting unit 10 sets at least one of the language weight parameter and the pruning parameter, which is the parameter for predetermined pruning processing, according to the detail level selected by the detail level judgment unit 9. Note that the parameter setting unit 10 may also control other parameters used in searching for a word string.
The term "pruning" refers to processing that discards hypotheses (word string candidates) whose likelihood falls below the current highest likelihood by more than a predetermined likelihood width (threshold). For the pruning parameter, a likelihood width value used as the threshold for the pruning processing, and the like, are set.
The parameter setting unit 10 sets the language weight according to the detail level as follows. The parameter setting unit 10 sets a large language weight (for example, a large weight coefficient value) when the detail level is low, because the reliability of the voice information is low. Conversely, when the detail level is high, the parameter setting unit 10 sets a small language weight, because the reliability of the voice information is high.
The parameter setting unit 10 sets the pruning parameter for use at word string search time as follows. When the detail level is low, the parameter setting unit 10 sets the pruning parameter so that the number of hypotheses is increased because the reliability of voice information is low. Conversely, when the detail level is high, the parameter setting unit 10 sets the pruning parameter so that the number of hypotheses is decreased because the reliability of voice information is high.
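The control rules of the two preceding paragraphs might be realized by a mapping such as the following (the numeric ranges are illustrative assumptions, not values from the original text):

```python
# Illustrative mapping from the judged detail level to the search
# parameters: a low detail level (low reliability) gets a larger
# language weight and a wider beam.

def set_parameters(detail_level, max_level):
    reliability = detail_level / max_level           # 0 = low, 1 = high
    language_weight = 15.0 - 7.0 * reliability       # larger when unreliable
    beam_width = 200.0 - 120.0 * reliability         # wider when unreliable
    max_hypotheses = int(3000 - 2000 * reliability)  # more when unreliable
    return language_weight, beam_width, max_hypotheses

print(set_parameters(detail_level=1, max_level=8))   # low level: wide beam
print(set_parameters(detail_level=8, max_level=8))   # high level: narrow beam
```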
Next, the network search unit 5 searches for a word string based on the parameters set by the parameter setting unit 10 (step S6). In this case, the network search unit 5 uses the feature amount calculated by the feature amount calculation unit 2, the acoustic model stored in the acoustic model storage unit 3, and the language model stored in the language model storage unit 4 to search for a word string having the highest likelihood based on equation (2).
Finally, the recognition result output unit 6 outputs (for example, displays) the word string searched for by the network search unit 5 (step S7).
As described above, this exemplary embodiment judges the sound feature and recognizes voices using appropriate parameters that increase recognition accuracy at a low calculation cost.
For example, if the feature property of the input signal is close to the feature property of the learning data used when the voice model was developed, the distance is shortest at a high detail level of the voice model; if the feature property of the input signal is far from that of the learning data, the distance is shortest at a low detail level.
Dynamically controlling the voice recognition related parameters according to the detail level based on the property described above reduces the number of hypotheses when the detail level is high and, thereby, reduces the calculation amount, thus enabling the voice recognition to be carried out highly accurately at an optimum calculation cost.
Selecting the detail level whose distance to the input voice is shortest means that, when the highest detail level is selected, the voice model represents the input voice most accurately. This means that the information on the closest detail level gives the information on the number of competing candidates for a word string at one particular target time point, and that the parameters can therefore be set with consideration for the number of competing candidates, without need for conducting the time-consuming averaging calculation.
In addition, a voice model having multiple detail levels has a size sufficiently smaller than that of an acoustic model, thus reducing the calculation cost as compared with that of the method in which the conventional simplified acoustic model is used.
Second Exemplary Embodiment
Next, the following describes a second exemplary embodiment of the present invention. The basic configuration of a voice recognition device in this exemplary embodiment is the same as that of the voice recognition device described in the first exemplary embodiment.
A detail level judgment unit 9 carries out calculation for the voice model having multiple detail levels sequentially, from a low detail level to a high detail level, in step S4 shown in the first exemplary embodiment to find the detail level whose distance to the feature amount, calculated by a feature amount calculation unit 2, is a minimum. The detail level judgment unit 9 may also carry out calculation sequentially, from a high detail level to a low detail level, to find the detail level whose distance to the feature amount, calculated by the feature amount calculation unit 2, is a minimum.
In this exemplary embodiment, a distance calculation unit 8 calculates the distance sequentially from a low detail level to a high detail level, or from a high detail level to a low detail level. And, the detail level judgment unit 9 finds the detail level corresponding to the distance, calculated by the distance calculation unit 8, that is the minimum.
As described above, this exemplary embodiment finds the detail level whose distance to the feature amount, calculated by a feature amount calculation unit 2, is the minimum to allow the detail level corresponding to the minimum distance to be found efficiently.
For example, when the feature property of the input signal is close to the feature property of the learning data, the distance decreases monotonically as the detail level increases. In such a case, calculating the distances sequentially and stopping as soon as the distance no longer decreases finds the detail level corresponding to the minimum distance without calculating the distances for all detail levels.
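Under that monotonicity assumption, the sequential search can stop as soon as the distance stops decreasing, as in this hypothetical sketch (Euclidean distance to the closest component is used for brevity):

```python
import numpy as np

# Hypothetical sketch of the sequential search in this embodiment.

def distance_to_level(x, means):
    return float(np.min(np.linalg.norm(means - x, axis=1)))

def select_detail_level(x, levels):
    """levels: list of component-mean arrays, ordered low -> high detail."""
    best_level, best_dist = 0, float("inf")
    for i, means in enumerate(levels):
        d = distance_to_level(x, means)
        if d >= best_dist:
            break  # distance no longer decreasing: stop early
        best_level, best_dist = i, d
    return best_level, best_dist

levels = [np.random.randn(2 ** k, 12) for k in range(5)]  # 1..16 components
print(select_detail_level(np.zeros(12), levels))
```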
Third Exemplary Embodiment
Next, the following describes a third exemplary embodiment of the present invention with reference to the drawings.
The voice model storage unit 11 is implemented in the actual device by a storage device such as a magnetic disk device or an optical disc device. The voice model storage unit 11 stores a voice model having multiple detail levels arranged in a parent-child structure.
In this exemplary embodiment, the detail levels of the voice model, which is stored in the voice model storage unit 11 and has multiple detail levels, have a parent-child structure such as a tree structure. The parent-child structure mentioned here refers to the dependence relation between the probability distribution functions (children) belonging to a high detail level and the probability distribution functions (parents) belonging to a low detail level, in which each parent distribution is linked to the child distributions derived from it.
To create a parent-child relation, a parent distribution is divided into child distributions when the voice model is created in the top-down method. When the voice model is created in the bottom-up method, two or more child distributions are rearranged (or reconstructed) into a parent distribution. The voice model having such a parent-child structure is created in advance by the system designer and is stored in the voice model storage unit 11.
Next, the following describes the operation. In this exemplary embodiment, a detail level judgment unit 9 carries out the calculation for the voice model, which has multiple detail levels in a parent-child structure, sequentially from a low detail level to a high detail level in step S4 of the first exemplary embodiment, and finds the detail level that has the minimum distance to the feature amount calculated by a feature amount calculation unit 2. Because there is a parent-child structure among the distributions belonging to the detail levels, once the distribution whose distance is the minimum is found at some detail level, the detail level judgment unit 9 needs to carry out the calculation, at detail levels higher than that level, only for the child distributions of that minimum-distance distribution. For example, a distance calculation unit 8 and the detail level judgment unit 9 carry out the distance calculation and the minimum-distance judgment processing only for the child distributions of the distribution whose distance was found to be the minimum.
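A minimal sketch of this parent-child descent (illustrative names; Euclidean distance stands in for the likelihood-based distance):

```python
import numpy as np

# Hypothetical sketch: after the best distribution is found at one level,
# only its child distributions are scored at the next level, so each
# step costs only the branching factor, not the size of the level.

class Dist:
    def __init__(self, mean, children=()):
        self.mean = np.asarray(mean, dtype=float)
        self.children = list(children)

def search_tree(x, root):
    """Return (detail_level, distance) of the minimum along the greedy path."""
    node, level = root, 0
    best_level, best_dist = 0, float(np.linalg.norm(x - root.mean))
    while node.children:
        node = min(node.children, key=lambda c: np.linalg.norm(x - c.mean))
        level += 1
        d = float(np.linalg.norm(x - node.mean))
        if d < best_dist:
            best_level, best_dist = level, d
    return best_level, best_dist

root = Dist([0.0, 0.0],
            [Dist([1.0, 0.0], [Dist([1.2, 0.1]), Dist([0.8, -0.1])]),
             Dist([-1.0, 0.0], [Dist([-1.1, 0.2])])])
print(search_tree(np.array([1.1, 0.0]), root))
```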
This exemplary embodiment uses the configuration described above to allow the distance calculation unit 8 to carry out the distance calculation at a low calculation cost, thus further reducing the calculation cost of the voice recognition system as compared with that shown in the first exemplary embodiment.
Fourth Exemplary Embodiment
Next, the following describes a fourth exemplary embodiment of the present invention with reference to the drawings. In this exemplary embodiment, the basic configuration of a voice recognition device is the same as the configuration of the voice recognition device shown in the first exemplary embodiment.
This exemplary embodiment is different from the first exemplary embodiment in that relevance is established between a voice model having multiple detail levels stored in a voice model storage unit 7 and an acoustic model stored in an acoustic model storage unit 3.
In this exemplary embodiment, the acoustic model storage unit 3 stores, in advance, an acoustic model having predetermined relevance with a voice model stored in the voice model storage unit 7. The voice model storage unit 7 stores, in advance, a voice model having predetermined relevance with an acoustic model stored in the acoustic model storage unit 3. A network search unit 5 searches for candidates for a word string and extracts the candidates based on the relevance between the voice model and the acoustic model.
In this exemplary embodiment, to establish relevance between the voice model having multiple detail levels and the acoustic model, those of the multiple probability density functions constituting the voice model and of the multiple probability density functions constituting the acoustic model that are the same as, or similar to, each other are linked to each other.
The processing for establishing relevance between the voice model and the acoustic model (for example, creating links) is performed in advance by the system designer, and the voice model and acoustic model processed in this way are stored in the voice model storage unit 7 and the acoustic model storage unit 3, respectively.
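One way such a link could be exploited is a score cache keyed by a shared distribution identifier, sketched below (hypothetical; the original text does not specify the mechanism):

```python
# Hypothetical sketch: a per-frame score computed once for a shared
# probability density function during the detail level judgment
# (step S3) is cached and reused in the word string search (step S6)
# instead of being recomputed against the acoustic model.

class SharedScoreCache:
    def __init__(self):
        self._scores = {}

    def score(self, dist_id, x, score_fn):
        # score_fn is any per-distribution scorer, e.g. a log likelihood.
        if dist_id not in self._scores:
            self._scores[dist_id] = score_fn(x)
        return self._scores[dist_id]

    def clear(self):
        # Call once per unit time (frame): scores are frame-specific.
        self._scores.clear()

cache = SharedScoreCache()
s1 = cache.score("gauss_42", 0.7, lambda x: -0.5 * x * x)  # step S3
s2 = cache.score("gauss_42", 0.7, lambda x: -0.5 * x * x)  # step S6: cache hit
print(s1 == s2)  # True
```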
This exemplary embodiment uses the configuration described above to omit the calculation of the distance between the input signal and the acoustic model that would otherwise be carried out when searching for a word string in step S6 of the first exemplary embodiment. That is, based on the relevance given to the voice model and the acoustic model in advance, the distance from the input signal to the voice model having multiple detail levels, found in step S3, is reused so that the calculation described above can be omitted. As a result, the processing load of the network search unit 5 in step S6 is reduced.
Fifth Exemplary Embodiment
Next, the following describes a fifth exemplary embodiment of the present invention with reference to the drawings.
The model selection unit 12 is implemented in the actual device by the CPU of an information processing device operating under program control. The model selection unit 12 has the function to select an acoustic model and a language model according to the detail level calculated by the detail level judgment unit 9. That is, the model selection unit 12 selects a set of an acoustic model and a language model from the multiple acoustic models stored in the acoustic model storage unit 3 and the multiple language models stored in the language model storage unit 4 according to the detail level selected by the detail level judgment unit 9.
Next, the following describes the operation. In the first exemplary embodiment, the parameter setting unit 10 sets the parameters, which will be used when the network search unit 5 searches for a word string, in step S5. In this exemplary embodiment, the model selection unit 12 additionally selects, in the same step, the set of an acoustic model and a language model to be used for the search, according to the detail level judged by the detail level judgment unit 9.
This exemplary embodiment uses the configuration described above to select a smaller-size acoustic model, or to switch the language model to a model having a smaller vocabulary, when the detail level judgment unit 9 judges that the detail level is low, thus increasing voice recognition accuracy. In this exemplary embodiment, the voice recognition device controls the selection of the acoustic model and the language model as described above according to conditions such as the input voice.
Sixth Exemplary Embodiment
Next, the following describes a sixth exemplary embodiment of the present invention with reference to the drawings.
The operation/response setting unit 15 is implemented in the actual device by the CPU of an information processing device operating under program control. The operation/response setting unit 15 has the function to change the output method or the output contents according to the detail level judged by the detail level judgment unit 9. That is, the operation/response setting unit 15 changes the output method or the output contents of the voice recognition result of the input voice signal according to the detail level selected by the detail level judgment unit 9.
In this exemplary embodiment, the operation/response setting unit 15 causes a recognition result output unit 6 to display a message prompting the user to speak again, when the detail level judgment unit 9 judges that the detail level is low. Alternatively, the operation/response setting unit 15 may cause the recognition result output unit 6 to display a message indicating that speaker learning is necessary, or a message requesting the user to check whether the voice recognition result is correct. The operation/response setting unit 15 may also control the recognition result output unit 6 so as not to display the recognition result when the detail level judgment unit 9 judges that the detail level is low.
This exemplary embodiment uses the configuration described above to display only reliable recognition results.
Seventh Exemplary Embodiment
Next, the following describes a seventh exemplary embodiment of the present invention with reference to the drawings.
The model learning unit 16 is implemented in the actual device by a CPU of an information processing device operating under program control. The model learning unit 16 has a function to retrain (adapt) the voice model having multiple detail levels, as well as the acoustic model, according to the detail level judged by a detail level judgment unit 9. That is, the model learning unit 16 updates the voice model, stored in the voice model storage unit 7, according to the detail level selected by the detail level judgment unit 9, to adapt the voice model to the speaker environment or the noise environment.
In this exemplary embodiment, if the detail level judgment unit 9 judges that the detail level is low, the model learning unit 16 adapts the voice model having multiple detail levels, as well as the acoustic model, to the noise environment or the speaker environment so that the detail level is increased. More specifically, if the detail level is low because the voice model having multiple detail levels is, on average, biased in relation to the input signal, the model learning unit 16 corrects the bias of the voice model so that the detail level increases, and corrects the bias of the acoustic model in accordance with the correction of the voice model.
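A minimal sketch of such a bias correction, assuming a simple mean shift estimated from recent input features (an illustrative procedure, not the patent's exact one):

```python
import numpy as np

# Hypothetical sketch: the average offset between recent input features
# and their closest voice-model means is estimated and applied to all
# model means, shifting the model toward the current speaker/noise domain.

def correct_bias(model_means, recent_features):
    diffs = recent_features[:, None, :] - model_means[None, :, :]
    closest = model_means[np.argmin(np.linalg.norm(diffs, axis=2), axis=1)]
    bias = (recent_features - closest).mean(axis=0)
    return model_means + bias  # shift all means toward the input domain

means = np.random.randn(16, 12)
feats = np.random.randn(200, 12) + 0.5   # input biased per dimension
adapted = correct_bias(means, feats)
print(np.round((adapted - means).mean(), 2))  # a positive shift
```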
This exemplary embodiment uses the configuration described above to allow the voice recognition result to be output appropriately even if the noise environment or the speaker environment has changed greatly from that at the learning time.
In the voice recognition device, any of the configurations shown in the exemplary embodiments described above may be combined. For example, two or more of the configurations of the voice recognition device shown in the first exemplary embodiment to the seventh exemplary embodiment may be combined to configure a voice recognition device.
The exemplary embodiments and the examples may be changed and adjusted within the scope of the entire disclosure (including the claims) of the present invention and based on its basic technological concept. Within the scope of the claims of the present invention, the disclosed elements may be combined and selected in a variety of ways.
The present invention is applicable to a voice recognition device that recognizes the voice of an input voice signal. In particular, the present invention is applicable to a voice recognition device that implements the optimum voice recognition performance at a predetermined calculation cost.
Number | Date | Country | Kind
---|---|---|---
2007-048898 | Feb 2007 | JP | national

Filing Document | Filing Date | Country | Kind | 371c Date
---|---|---|---|---
PCT/JP2008/053331 | 2/26/2008 | WO | 00 | 8/20/2009

Publishing Document | Publishing Date | Country | Kind
---|---|---|---
WO2008/108232 | 9/12/2008 | WO | A
Number | Name | Date | Kind |
---|---|---|---|
5515475 | Gupta et al. | May 1996 | A |
5839101 | Vahatalo et al. | Nov 1998 | A |
5842163 | Weintraub | Nov 1998 | A |
5899973 | Bandara et al. | May 1999 | A |
6018708 | Dahan et al. | Jan 2000 | A |
6208964 | Sabourin | Mar 2001 | B1 |
6301555 | Hinderks | Oct 2001 | B2 |
6839667 | Reich | Jan 2005 | B2 |
7292975 | Lovance et al. | Nov 2007 | B2 |
7447635 | Konopka et al. | Nov 2008 | B1 |
8392188 | Riccardi | Mar 2013 | B1 |
20010021908 | Hinderks | Sep 2001 | A1 |
20020123891 | Epstein | Sep 2002 | A1 |
20030125945 | Doyle | Jul 2003 | A1 |
20060009973 | Nguyen et al. | Jan 2006 | A1 |
20080059184 | Soong et al. | Mar 2008 | A1 |
20080255839 | Larri et al. | Oct 2008 | A1 |
20110064233 | Van Buskirk | Mar 2011 | A1 |
Number | Date | Country |
---|---|---|
6-067698 | Mar 1994 | JP |
8-506430 | Jul 1996 | JP |
10-149192 | Jun 1998 | JP |
2000261321 | Sep 2000 | JP |
2001075596 | Mar 2001 | JP |
2004117503 | Apr 2004 | JP |
2005004018 | Jan 2005 | JP |
2005234214 | Sep 2005 | JP |
2006091864 | Apr 2006 | JP |
2005010868 | Feb 2005 | WO |
Entry |
---|
International Search Report for PCT/JP2008/053331 mailed Jun. 10, 2008. |
A. Ando. “Real-time Speech Recognition”, Institute of Electronics, Information and Communication Engineers, 2003, pp. 28-33, 40-47, 62-67, 76-87, 126-129, 138-143, 148-165. |
R. O. Duda et al., “Pattern Classification Second Edition”, John Wiley & Sons, New Technology Communications, 2003, pp. 528-529. |
Number | Date | Country | Kind
---|---|---|---
20100070277 | Mar 2010 | US | A1