This patent application claims the benefit of priority from Korean Patent Application No. 10-2012-0006898, filed on Jan. 20, 2012, the contents of which are incorporated herein by reference in their entirety.
1. Field of the Invention
The present invention relates, in general, to a phonetic recognition method, system and recording medium for recognizing phones from speech signals and, more particularly, to a continuous phonetic recognition method that uses a semi-Markov model for reducing an error rate in phonetic recognition, to a system for processing the method, and to a recording medium for storing the method.
2. Description of the Related Art
Phonetic recognition technology is technology for causing devices, such as computers, to comprehend human speech; it patterns the speech (signals) of human beings and determines how similar the patterned speech is to patterns previously stored in computers or the like.
In the modern age, such technology is regarded as a very important issue when applied to advanced devices, such as smart phones or navigation terminals. Recently, as the environments in which input devices, such as keyboards, touch screens, or remote controls, are used have diversified, cases in which such input devices cause inconvenience have arisen.
Generally, a Hidden Markov Model (HMM) has been used to recognize phones. The HMM is obtained by statistically modeling phonetic units, such as phones or words, and its details are well known in the art.
However, the HMM most widely used in phonetic recognition at the present time predicts phonetic labels for respective observations (frames) without performing explicit phone segmentation, on the assumption that only local statistical dependencies are present between neighboring observations (frames). That is, such an HMM is problematic in that there is a high error rate in continuous phonetic recognition because long-range dependencies are not taken into consideration.
Accordingly, the present invention has been made keeping in mind the above problems occurring in the prior art, and an object of the present invention is to provide a continuous phonetic recognition method that uses a semi-Markov model for speech recognition in which both continuous phonetic recognition and an error rate are taken into consideration, a system for processing the method, and a recording medium for storing the method.
In accordance with an aspect of the present invention, there is provided a phonetic recognition method of recognizing phones using a speech recognition system, including: receiving, by a phonetic data recognition device, speech; and recognizing, by a phonetic data processing device, phones from the received speech using a semi-Markov model.
Preferably, in recognizing the phones, a phonetic label sequence may be represented by the following equation:

ŷ = argmax_{y ∈ Y} ⟨w, φ(x, y)⟩
where ŷ denotes a phonetic label sequence, Y denotes a set of phonetic label sequences, x denotes an acoustic feature vector, y denotes a phonetic label, w denotes a parameter, and φ(x, y) denotes a segment-based joint feature map.
Preferably, the segment-based joint feature map may include:
where lj denotes the label of the j-th phone segment, nj denotes the last frame index of the j-th phone segment, J denotes the number of segments, {x}j denotes the acoustic feature vectors observed in the j-th phone segment, φtransition(lj-1, lj) denotes a transition feature indicating the relationship between a relevant phone and its subsequent phone when the relevant phone is present on the just previous label, φduration(nj-1, nj, lj) denotes a duration feature indicating the duration (nj − nj-1) of the relevant phone (for example, for the label lj), and φcontent({x}j, nj-1, nj, lj) denotes a content feature indicating acoustic feature data.
Preferably, the transition feature may be represented by a Kronecker delta function, and the duration feature may be defined as sufficient statistics of gamma distribution.
Preferably, the content feature may be represented by the following equation:
where l denotes a phone, k denotes a bin index, B(l) denotes a number of bins corresponding to a phonetic label l, bk is
(where k ∈ {1, . . . , B(l)}), and δ(lj=l) denotes a Kronecker delta function.
Preferably, the parameter w may be estimated by a Structured Support Vector Machine (S-SVM), and the S-SVM may be solved using a stochastic subgradient descent algorithm.
In accordance with another aspect of the present invention, there is provided a speech recognition system for recognizing phones, including a phonetic data recognition device for receiving speech, configuring speech data from the speech, and outputting the speech data; and a phonetic data processing device for recognizing phones from output signals of the phonetic data recognition device using a semi-Markov model.
The above and other objects, features and advantages of the present invention will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
Specific structural or functional descriptions related to embodiments based on the concept of the present invention and disclosed in the present specification or application are merely illustrated to describe embodiments based on the concept of the present invention, and the embodiments based on the concept of the present invention may be implemented in various forms and should not be interpreted as being limited to the above embodiments described in the present specification or application.
The embodiments based on the concept of the present invention may be modified in various manners and may have various forms, so that specific embodiments are intended to be illustrated in the drawings and described in detail in the present specification or application. However, it should be understood that those embodiments are not intended to limit the embodiments based on the concept of the present invention to specific disclosure forms and they include all changes, equivalents or modifications included in the spirit and scope of the present invention.
The terms such as “first” and “second” may be used to describe various components, but those components should not be limited by the terms. The terms are merely used to distinguish one component from other components, and a first component may be designated as a second component and a second component may be designated as a first component in a similar manner, without departing from the scope based on the concept of the present invention.
Throughout the entire specification, it should be understood that a representation indicating that a first component is “connected” or “coupled” to a second component may include the case where the first component is connected or coupled to the second component with some other component interposed therebetween, as well as the case where the first component is “directly connected” or “directly coupled” to the second component. In contrast, it should be understood that a representation indicating that a first component is “directly connected” or “directly coupled” to a second component means that no component is interposed between the first and second components.
Other representations describing relationships among components, that is, “between” and “directly between” or “adjacent to,” and “directly adjacent to,” should be interpreted in similar manners.
The terms used in the present specification are merely used to describe specific embodiments and are not intended to limit the present invention. A singular expression includes a plural expression unless a description to the contrary is specifically pointed out in context. In the present specification, it should be understood that the terms such as “include” or “have” are merely intended to indicate that features, numbers, steps, operations, components, parts, or combinations thereof are present, and are not intended to exclude a possibility that one or more other features, numbers, steps, operations, components, parts, or combinations thereof will be present or added.
Unless differently defined, all terms used here including technical or scientific terms have the same meanings as the terms generally understood by those skilled in the art to which the present invention pertains. The terms identical to those defined in generally used dictionaries should be interpreted as having meanings identical to contextual meanings of the related art, and are not interpreted as being ideal or excessively formal meanings unless they are definitely defined in the present specification.
Further, the same characters are interpreted as having the same meanings throughout, and different characters that share a subscript refer to related objects. Hereinafter, the present invention will be described in detail based on preferred embodiments of the present invention with reference to the attached drawings.
The phonetic data recognition device 20 recognizes phonetic data and is configured to, for example, receive speech, such as human speech, configure speech data from the speech, and output the speech data to the phonetic data processing device 30.
The phonetic data processing device 30 performs processing such that phones can be exactly recognized from the speech data received from the phonetic data recognition device 20, using a phonetic recognition model (or algorithm) according to the present invention. The phonetic recognition model according to the present invention will be described in detail below.
The phonetic recognition model according to the present invention captures long-range statistical dependencies within a single segment and across adjacent segments having various lengths, and predicts a phonetic label sequence y={s1(n1, l1), s2(n2, l2), s3(n3, l3)} by performing labeling based on the segments, where sj denotes the j-th segment, lj denotes the label of the j-th phone segment, and nj denotes the last frame index of the j-th phone segment.
For example, in
Phonetic recognition may be performed via a task for converting speech (for example, human speech) into a phonetic label sequence. The phonetic label sequence may be represented by the following Equation (1):

ŷ = argmax_{y ∈ Y} ⟨w, φ(x, y)⟩  (1)
where ŷ denotes a phonetic label sequence, Y denotes the set of phonetic label sequences, x denotes an acoustic feature vector, y denotes a phonetic label, w denotes a parameter, and φ(x, y) denotes a segment-based joint feature map. The above Equation (1) may be solved using the definition of the segment-based joint feature map and the determination of the parameter w.
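As an informal sketch of the argmax decoding of Equation (1), the following toy example scores a handful of candidate phonetic label sequences with a linear model ⟨w, φ(x, y)⟩ and picks the highest. The feature vectors, weights, and the exhaustive enumeration are illustrative assumptions; a real system would search over segmentations with dynamic programming, as described later.

```python
def score(w, phi):
    """Linear score <w, phi(x, y)> of one candidate labeling."""
    return sum(wi * pi for wi, pi in zip(w, phi))

def decode(w, candidates):
    """Return the candidate label sequence whose joint feature map scores
    highest under w; exhaustive enumeration stands in for the
    dynamic-programming search used in practice."""
    return max(candidates, key=lambda y: score(w, candidates[y]))

# hypothetical feature maps for two candidate labelings of the word "have"
w = [1.0, -0.5, 2.0]
candidates = {
    ("h", "ae", "v"): [1.0, 0.0, 1.0],   # phi(x, y1)
    ("h", "eh", "v"): [0.0, 1.0, 0.5],   # phi(x, y2)
}
best = decode(w, candidates)
```

Here the first candidate scores 1.0 + 0.0 + 2.0 = 3.0 and wins over the second (0.5), mirroring how the learned w selects among competing label sequences.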
The segment-based joint feature map is given by the following Equation (2):

φ(x, y) = Σ_{j=1}^{J} [φtransition(lj-1, lj)ᵀ, φduration(nj-1, nj, lj)ᵀ, φcontent({x}j, nj-1, nj, lj)ᵀ]ᵀ  (2)
where lj denotes the label of the j-th phone segment, nj denotes the last frame index of the j-th phone segment, and J denotes the number of segments. The above three features (transition feature, duration feature, and content feature) are defined as follows.
φtransition(lj-1, lj) denotes a transition feature indicating a relationship between a certain phone and its subsequent phone when the certain phone is present on a just previous label.
The transition feature is used to capture statistical dependencies between two neighboring phones and may be represented by a Kronecker delta function, that is, δ(lj-1=l′, lj=l).
The Kronecker delta function has a value of 1 when lj-1=l′ and lj=l are satisfied; otherwise it has a value of 0.
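The Kronecker-delta transition feature can be sketched as a one-hot indicator over ordered label pairs: exactly one entry is 1, at the position for (lj-1, lj). The phone inventory and vector layout below are illustrative assumptions.

```python
def transition_feature(prev_label, label, phones):
    """One-hot encoding of delta(l_{j-1} = l', l_j = l): a vector over
    all ordered phone pairs with a single 1 at the (prev_label, label)
    position and 0 everywhere else."""
    n = len(phones)
    feat = [0] * (n * n)
    feat[phones.index(prev_label) * n + phones.index(label)] = 1
    return feat

phones = ["/h/", "/ae/", "/v/"]  # hypothetical phone inventory
f = transition_feature("/h/", "/ae/", phones)
```

With three phones the vector has nine entries, one per ordered pair, so the model can learn a separate weight for every possible phone-to-phone transition.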
φduration(nj-1, nj, lj) denotes a duration feature indicating the duration (nj − nj-1) of a relevant phone (for example, for the phonetic label lj), and is represented by the following Equation (3):

φlduration(nj-1, nj, lj) = δ(lj = l)·[(nj − nj-1), ln(nj − nj-1)]ᵀ  (3)
The duration feature for the phone l is defined as the sufficient statistics of gamma distribution. For example, in the case of speech “have,” the duration feature (simply indicated by φd) may be represented by φd=[(φ/h/d)T, (φ/ae/d)T, . . . ]T.
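A minimal sketch of this duration feature follows, assuming the gamma-distribution sufficient statistics of a duration d are the pair (d, ln d) and that each phone owns its own sub-block of the vector, mirroring the φd = [(φ/h/d)T, (φ/ae/d)T, . . . ]T layout above; the phone inventory is hypothetical.

```python
import math

def duration_feature(d, label, phones):
    """Write the gamma sufficient statistics (d, ln d) of duration d into
    the two-entry sub-vector owned by `label`; all other blocks stay 0,
    which plays the role of the Kronecker delta delta(l_j = l)."""
    feat = [0.0] * (2 * len(phones))
    i = 2 * phones.index(label)
    feat[i] = float(d)
    feat[i + 1] = math.log(d)
    return feat

phones = ["/h/", "/ae/", "/v/"]      # hypothetical phone inventory
f = duration_feature(8, "/ae/", phones)  # an 8-frame /ae/ segment
```

Only the /ae/ block is populated, so the model learns phone-specific duration statistics rather than a single shared duration weight.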
φcontent({x}j, nj-1, nj, lj) denotes a content feature indicating acoustic feature data, and is represented by the following Equation (4):
where l denotes a phone, k denotes a bin index, and B(l) denotes the number of bins corresponding to the phonetic label l.
Further,
is satisfied, where k ∈ {1, . . . , B(l)}.
For example, in the case of the speech “have,” the content feature (simply indicated by φc) may be represented by φc=[(φ(/h/,1)c)T, (φ(/h/,2)c)T, . . . , (φ(/ae/,1)c)T, (φ(/ae/,2)c)T, . . . ]T.
That is, a single segment may be divided into a number of bins having the same length. Thereafter, the Gaussian sufficient statistics of the acoustic feature vectors in the respective bins are averaged, and the content feature can then be defined. Different parameters w may be assigned to the respective bins.
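The binning step just described can be sketched as follows. For simplicity the sketch averages raw frame-level feature vectors per bin, standing in for the averaged Gaussian sufficient statistics, and assumes the segment length divides evenly into the bins; the frame values are illustrative.

```python
def content_feature(frames, num_bins):
    """Split one segment's frame-level feature vectors into num_bins
    equal-length bins and average within each bin; the concatenation of
    the per-bin means forms the segment's content feature."""
    size = len(frames) // num_bins   # assumes even divisibility
    dim = len(frames[0])
    feat = []
    for k in range(num_bins):
        chunk = frames[k * size:(k + 1) * size]
        feat.extend(sum(v[i] for v in chunk) / size for i in range(dim))
    return feat

# six frames of 2-dimensional acoustic features for one segment, three bins
frames = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]]
f = content_feature(frames, 3)
```

Each bin contributes its own averaged sub-vector, so a separate part of the parameter w can be assigned to each bin, as the text notes.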
Here, the SMM inference (Equation (1)) will be schematically described below.
Let V(t,l) be the maximal score over all partial segmentations in which the last segment ends at the t-th frame with label l, and let U(t,l) be a tuple of the duration d and the previous label l′ occupied by the best path, where phone l′ transitions to phone l at time t−d. We derive the recursion of the dynamic programming for efficient SMM inference as:
where R(l) is the range of admissible durations of phone l, restricted to ensure tractable inference. Once the recursion reaches the end of the sequence, we traverse U(t,l) backwards to obtain the segmentation information of the sequence. An implementation of the recursion in the above equations requires O(T|L|Σl R(l)) computations of ⟨w, φ⟩. To save computation, the maximum values in the above equations are obtained by searching not the whole search space {1, . . . , R(l)} × L but a subspace of lower resolution {1, dl, 2dl, . . . , R(l)} × L, where dl > 1 is the search resolution for the phone l (longer-length phones have a larger dl than shorter-length phones).
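The recursion over V(t,l) and backtracking through U(t,l) can be sketched as below. Here R(l) is treated as a maximum admissible duration, seg_score and trans_score are hypothetical stand-ins for the ⟨w, φ⟩ segment and transition terms, and the lower-resolution duration search is omitted for clarity.

```python
def smm_decode(T, labels, max_dur, seg_score, trans_score):
    """Viterbi-style recursion for semi-Markov decoding.

    V[t][l]:   best score of any partial segmentation whose last segment
               ends at frame t with label l.
    U[(t,l)]:  (duration d, previous label) on that best path.
    max_dur(l) stands in for R(l); seg_score(s, e, l) and
    trans_score(l_prev, l) stand in for the <w, phi> terms.
    """
    V = {0: {None: 0.0}}   # None marks "no previous phone yet"
    U = {}
    for t in range(1, T + 1):
        V[t] = {}
        for l in labels:
            best, arg = float("-inf"), None
            for d in range(1, max_dur(l) + 1):
                if t - d < 0:
                    continue
                for lp, prev in V[t - d].items():
                    s = prev + trans_score(lp, l) + seg_score(t - d, t, l)
                    if s > best:
                        best, arg = s, (d, lp)
            if arg is not None:
                V[t][l], U[(t, l)] = best, arg
    # backtrack from the best final label to recover the segmentation
    t = T
    l = max(V[T], key=V[T].get)
    segments = []
    while t > 0:
        d, lp = U[(t, l)]
        segments.append((t - d, t, l))
        t, l = t - d, lp
    return list(reversed(segments))

# toy scores: label "a" fits frames [0, 2) and label "b" fits frames [2, 4)
def seg_score(s, e, l):
    good = (l == "a" and e <= 2) or (l == "b" and s >= 2)
    return float((e - s) ** 2 if good else -((e - s) ** 2))

segments = smm_decode(4, ["a", "b"], lambda l: 2,
                      seg_score, lambda lp, l: 0.0)
```

Because the score of a segment here grows faster than linearly with its length, the decoder prefers the two long segments (0, 2, "a") and (2, 4, "b") over chains of one-frame segments, illustrating the segment-level (rather than frame-level) labeling of the SMM.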
Such a parameter w may be estimated by a Structured Support Vector Machine (S-SVM).
The S-SVM optimizes the parameter w by minimizing a second-order objective function under combinations of linear margin constraints, as given by the following Equation (5):

minimize (1/2)‖w‖² + C Σi ξi
subject to F(xi, yi; w) − F(xi, y; w) ≥ Δ(yi, y) − ξi and ξi ≥ 0, for all i and all y ≠ yi,  (5)

where
C is greater than 0 and denotes a constant for controlling a trade-off between the maximization of a margin and the minimization of an error, and ξi denotes a slack variable.
In this case, the margin F(xi, yi; w) − F(xi, y; w) is, for example, the difference in score between the correct phonetic sequence and an arbitrary phonetic sequence, and is configured to be maximized. Accordingly, a parameter w that maximizes the difference is obtained.
During the procedure for maximizing the difference, a loss function Δ(yi, y) for scaling the difference between y and yi is taken into consideration. The loss is a criterion indicating how different the correct label and an arbitrary label are from each other.
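Combining this margin and loss gives the margin-rescaled hinge objective commonly used with structured SVMs, and a single stochastic subgradient step on it can be sketched as follows; the learning rate, the handling of the regularizer, and the list-based vectors are illustrative assumptions rather than the exact update of the cited algorithms.

```python
def subgradient_step(w, phi_true, phi_pred, loss, lr, C):
    """One stochastic subgradient step on the margin-rescaled hinge loss
    max(0, loss + <w, phi_pred - phi_true>) plus a (1/2)||w||^2
    regularizer; phi_pred is the feature map of the (loss-augmented)
    predicted label sequence, phi_true that of the correct one."""
    diff = [p - t for p, t in zip(phi_pred, phi_true)]
    margin = loss + sum(wi * di for wi, di in zip(w, diff))
    grad = list(w)                      # subgradient of the regularizer
    if margin > 0:                      # hinge is active: constraint violated
        grad = [g + C * d for g, d in zip(grad, diff)]
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# one toy update: the correct and predicted labelings differ in two features
w_new = subgradient_step([0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                         loss=1.0, lr=0.1, C=1.0)
```

The step moves w toward the features of the correct sequence and away from those of the violating one, which is how the margin constraints of Equation (5) are enforced without enumerating them all.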
Here, since the S-SVM has a very large number of margin constraints, it is difficult to solve the above Equation (5) directly. Therefore, the constraints that must be handled are reduced using a stochastic subgradient descent algorithm, such as those proposed by F. Sha in “Large margin training of acoustic models for speech recognition,” Ph.D. thesis, Univ. Pennsylvania, 2007, and by N. Ratliff, J. A. Bagnell, and M. Zinkevich in “(Online) subgradient methods for structured prediction,” AISTATS, 2007. Thereafter, as shown in
Referring to
Referring to
Referring to
The phonetic data processing device 30 analyzes segment-based phonetic label sequences from the received speech data and then performs phonetic recognition in step S120. The analysis of the phonetic label sequences may be performed based on Equations (1) to (5), as described above.
The method of the present invention can be implemented in the form of computer-readable code stored in a computer-readable recording medium. The code may enable the microprocessor of a computer to perform the phonetic recognition method.
The computer-readable recording medium includes all types of recording devices that store data readable by a computer system.
Examples of the computer-readable recording medium include Read Only Memory (ROM), Random Access Memory (RAM), Compact Disc ROM (CD-ROM), magnetic tape, a floppy disc, an optical data storage device, etc. Further, the program code for performing the phonetic recognition method according to the present invention may be transmitted in the form of a carrier wave (for example, via transmission over the Internet).
Furthermore, the computer-readable recording medium may be distributed across computer systems connected to each other over a network and may be stored and executed as computer-readable code in a distributed manner. Furthermore, the functional program, code, and code segments for implementing the present invention may be easily inferred by programmers skilled in the art to which the present invention pertains.
According to the phonetic recognition method, the system for processing the method, and the recording medium for storing the method in accordance with the present invention, there are advantages in that continuous phonetic recognition can be more easily performed and in that an error rate can be decreased.
Although the preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that various changes, modifications, and additions are possible, without departing from the scope and spirit of the invention as disclosed in the accompanying claims. Therefore, it should be understood that those changes, modifications and additions belong to the scope of the accompanying claims.
Number | Date | Country | Kind
---|---|---|---
10-2012-0006898 | Jan 2012 | KR | national