The above and other objects, features and other advantages of the present invention will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
Exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
Hereinafter, a speaker recognition method according to an exemplary embodiment of the present invention will be described with reference to
At step S100, basic data of a speaker is inputted in order to allocate an identification sign to a target speaker to be recognized. Since a family generally has more than two members, a service robot used in a home needs to discriminate one speaker from another. In the present embodiment, unique identification signs are allocated to speakers that are recognized as different family members through the speaker recognition method according to the present embodiment. It is preferable that a name or a nickname of a speaker, inputted through an external input device such as a keyboard or a touch screen, be used as the identification sign.
After allocating the identification sign to a speaker to be recognized, the speaker is constantly requested to respond to the requests using the speaker's voice at step S105. This is for learning a statistical model by collecting a plurality of speakers' voices, and for naturally performing a recognition operation using the learned model. In order to induce a speaker to respond to the request at step S105, it is preferable to use predetermined contents that elicit the speaker's voice. It is preferable that the predetermined contents include music contents that prompt a speaker to sing along while music is played, game contents that prompt a speaker to respond by voice while playing a game, and educational contents that prompt a speaker to respond by voice while learning.
When a speaker responds to the request of step S105 by voice, the voice of the speaker is inputted at step S110. In order to receive the speaker's voice, a microphone may be used as a voice input unit.
After inputting the voice of the speaker, the speaker's voice is extracted at step S115 from the voice data produced at step S110. The voice data inputted at step S110 includes noise around the speaker and sound related to the contents used at step S105. Therefore, the voice data inputted at step S110 is not suitable for collecting the voices of a plurality of speakers, learning a statistical model thereof, and performing a recognition operation using the learned model. Thus, it is required to extract the voice of the speaker from the voice data. Herein, a noise cancellation filter such as a Wiener filter can be used to remove the noise around the speaker. The sound from the contents used at step S105 can be removed easily by subtracting the related data from the voice data, because its waveform is already known.
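The known-waveform subtraction described above can be sketched as follows; this is a minimal illustration, assuming the playback signal is time-aligned and level-matched with the recording, and the function name and threshold are hypothetical, not from the source:

```python
import numpy as np

def remove_known_content(mixture, content, floor=1e-3):
    """Subtract the known contents waveform (e.g. the music being played)
    from the recorded mixture, leaving the speaker's voice.
    Assumes playback is time-aligned and level-matched (an assumption)."""
    n = min(len(mixture), len(content))
    residual = mixture[:n] - content[:n]
    # Suppress tiny residue left by imperfect subtraction.
    residual[np.abs(residual) < floor] = 0.0
    return residual
```

In practice an echo canceller would estimate the alignment and gain adaptively; the direct subtraction above only illustrates why a known waveform is easy to remove.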
After extracting the voice of the speaker from the voice data, a feature vector is extracted from the voice of the speaker for speaker recognition at step S120. That is, when a voice inputted through a microphone enters the system, feature vectors that can properly express the phonetic characteristics of a speaker are extracted every 1/100 second. Such vectors must express the phonetic characteristics of the speaker well and also be insensitive to differences in pronunciation and attitude of a speaker. Representative methods for extracting a feature vector are a linear predictive coding (LPC) extracting method that analyzes all frequency bands with equivalent weights, a mel frequency cepstral coefficients (MFCC) extracting method that extracts feature vectors using the characteristic that the human voice recognition pattern follows a mel scale similar to a log scale, a high frequency emphasis extracting method that emphasizes high frequency elements for clearly discriminating voice from noise, and a window function extracting method for minimizing distortion caused by dividing voice into short periods. It is preferable to use the MFCC extracting method among them to extract the feature vector.
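As a concrete illustration of the MFCC extraction preferred above, the following is a minimal numpy sketch in which the pre-emphasis filter plays the high-frequency-emphasis role and the Hamming window plays the window-function role; all parameter values (frame length, filter count, coefficient count) are illustrative defaults, not taken from the source:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """Extract one MFCC vector every 1/100 second from a mono signal."""
    # High-frequency emphasis (pre-emphasis filter).
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    frame_len, hop = int(0.025 * sr), int(0.010 * sr)  # 25 ms frames, 10 ms hop
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)  # window to limit framing distortion
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel-scale filterbank.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # Orthonormal DCT-II to decorrelate the log filterbank energies.
    n = np.arange(n_mels) + 0.5
    D = np.sqrt(2.0 / n_mels) * np.cos(
        np.pi * np.arange(n_ceps)[:, None] * n[None, :] / n_mels)
    D[0] /= np.sqrt(2.0)
    return log_energy @ D.T  # shape: (n_frames, n_ceps)
```

A one-second 16 kHz signal thus yields roughly one hundred 13-dimensional vectors, matching the 1/100-second extraction rate mentioned above.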
After extracting the feature vectors from the voice data of a speaker, a speaker model is generated by parameterizing the feature vector distribution of the speaker at step S125. In order to create the speaker model, a Gaussian mixture model (GMM), a hidden Markov model (HMM), or a neural network can be used. It is preferable to use the GMM to create the speaker model in the present embodiment.
The distribution of feature vectors extracted from the voice data of a speaker is modeled by a Gaussian mixture density. For a D-dimensional feature vector, the mixture density of a speaker can be expressed as the following Equation 1.
In Equation 1, wi denotes a mixture weight, and bi denotes the i-th component Gaussian density. Herein, the density is a weighted linear sum of M component Gaussian densities, each parameterized by a mean vector and a covariance matrix.
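Equation 1 itself is not reproduced in this text; in the standard Gaussian-mixture formulation matching the description above, it takes the form:

```latex
p(\mathbf{x} \mid \lambda) \;=\; \sum_{i=1}^{M} w_i \, b_i(\mathbf{x}),
\qquad \sum_{i=1}^{M} w_i = 1,
```

where each D-variate component density is

```latex
b_i(\mathbf{x}) \;=\; \frac{1}{(2\pi)^{D/2}\,\lvert \Sigma_i \rvert^{1/2}}
\exp\!\left( -\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu}_i)^{\top}
\Sigma_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) \right),
```

with mean vector \(\boldsymbol{\mu}_i\) and covariance matrix \(\Sigma_i\); this is the standard form rather than a reproduction of the original figure.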
Then, a speaker is recognized at step S130 based on the stored speaker models and the information analyzed from the input voice. The speaker recognition result is expressed using the identification sign allocated at step S100.
In order to recognize a speaker, the parameters of a Gaussian mixture model are estimated when voice is inputted from a speaker. Maximum likelihood estimation is used as the parameter estimation method. The likelihood of a Gaussian mixture model for an observation sequence can be expressed as the following Equation 2.
In Equation 2, the parameters of a speaker model are the weights, means, and covariances for i=1, 2, . . . , M. The maximum likelihood parameters are estimated using the expectation-maximization (EM) algorithm. When one of the family members speaks, the speaker is identified by finding the speaker model having the maximum a posteriori probability. This method of searching for a speaker can be expressed by the following Equation 3.
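Equations 2 and 3 are likewise not reproduced in this text; their standard forms, consistent with the description above, are the log-likelihood of a sequence of T feature vectors under a model \(\lambda\),

```latex
\log p(X \mid \lambda) \;=\; \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda),
```

and the maximum a posteriori search over the S registered speaker models (assuming equal prior probabilities for the family members),

```latex
\hat{S} \;=\; \arg\max_{1 \le k \le S} \; \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda_k).
```

These are the standard GMM speaker-identification expressions, not reproductions of the original figures.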
In the present embodiment, at step S130, a previously generated speaker model is adapted using the continuously inputted speaker's voice. Bayesian adaptation is well known as a method of obtaining an adapted speaker model. In order to adjust the speaker model, the adapted speaker model is obtained by updating the weights, means, and variances. This method is similar to a method of obtaining an adapted speaker model using a generalized background model. Hereinafter, the three parameter updates will be described with the related Equations.
The jth Gaussian mixture component of a registered speaker is calculated by the following Equation 4.
Each weight, mean, and variance parameter is calculated through the statistical calculations of Equation 5.
Based on these statistics, the adapted parameters of the jth mixture can be obtained as combinations weighted by the adaptation coefficients. Finally, a new speaker model can be generated for a voice that varies according to time and environment.
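The adaptation steps above can be sketched as follows; this is a minimal numpy illustration of Bayesian (MAP) adaptation of a diagonal-covariance GMM, in which the posterior responsibilities play the role of Equation 4, the soft counts and moments play the role of Equation 5, and the relevance factor r and function name are illustrative assumptions, not values from the source:

```python
import numpy as np

def map_adapt(weights, means, variances, X, r=16.0):
    """MAP-adapt a diagonal-covariance GMM (weights, means, variances)
    toward new feature vectors X of shape (T, D).  r is an assumed
    relevance factor controlling how fast the model follows new data."""
    T = len(X)
    # Posterior responsibility of each mixture j for each frame (Eq. 4 role).
    log_b = -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)
                    + np.sum((X[:, None, :] - means) ** 2 / variances, axis=2))
    log_p = np.log(weights) + log_b
    post = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)                  # (T, M)
    # Sufficient statistics: soft counts and first/second moments (Eq. 5 role).
    n = post.sum(axis=0)
    Ex = (post.T @ X) / np.maximum(n[:, None], 1e-10)
    Ex2 = (post.T @ (X ** 2)) / np.maximum(n[:, None], 1e-10)
    # Data-dependent adaptation coefficients blend old and new parameters.
    alpha = (n / (n + r))[:, None]
    new_means = alpha * Ex + (1 - alpha) * means
    new_vars = alpha * Ex2 + (1 - alpha) * (variances + means ** 2) - new_means ** 2
    new_w = alpha[:, 0] * n / T + (1 - alpha[:, 0]) * weights
    new_w /= new_w.sum()                                     # renormalize weights
    return new_w, new_means, np.maximum(new_vars, 1e-6)
```

Mixtures that explain many new frames receive a large adaptation coefficient and move toward the new voice, while unused mixtures stay close to the previous model, which is what lets the speaker model track time and environment variations.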
Hereinafter, a speaker recognition apparatus according to an exemplary embodiment of the present invention will be described with reference to
A contents storing unit 209 stores contents for requesting a speaker to constantly respond using the speaker's voice. It is preferable that the predetermined contents include music contents that prompt a speaker to sing along while music is played, game contents that prompt a speaker to respond by voice while playing a game, and educational contents that prompt a speaker to respond by voice while learning. A contents management unit 208 manages the contents stored in the contents storing unit 209 and outputs them to a speaker through an output unit 210.
An input unit 200 includes a voice input unit, such as a microphone, for receiving voice data of a speaker generated in response to the contents, and a general input unit, such as a keyboard or a touch screen, for receiving an identification sign such as a name or a nickname of the speaker whose voice is inputted.
A voice extracting module 202 is a device for extracting a voice of a speaker from a voice signal inputted through the input unit 200. It is preferable to use a noise cancellation filter 201, such as a Wiener filter, for canceling the noise from the voice signal inputted through the input unit 200.
After the voice of a speaker is extracted by the voice extracting module 202, a feature vector extracting module 203 extracts feature vectors required for speaker recognition. That is, when the voice inputted through the input unit 200 enters the system, feature vectors that can properly express the phonetic characteristics of a speaker are extracted every 1/100 second.
A speaker model generation module 205 creates a speaker model by parameterizing the feature vector distribution of the extracted speaker's voice data, and the created speaker model is stored in a memory 207.
A speaker recognition module 206 recognizes a speaker by searching the speaker models stored in the memory 207 based on the feature vectors of the extracted speaker's voice data.
Herein, a speaker model adaptation module 204 updates the speaker model stored in the memory 207 in order to adapt the generated speaker model using voice data continuously inputted in response to the contents.
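The data flow among the modules described above can be sketched as follows; this toy illustration keeps the reference numerals in comments but replaces the GMM machinery with deliberately simplified stand-ins (mean-vector models, nearest-model scoring, running-average adaptation), so every name and rule here is a hypothetical simplification, not the apparatus itself:

```python
import numpy as np

class SpeakerRecognizer:
    """Toy wiring of the apparatus: register, recognize, adapt."""

    def __init__(self):
        self.models = {}  # memory 207: identification sign -> mean feature vector

    def register(self, sign, features):
        # speaker model generation module 205 (simplified to a mean vector)
        self.models[sign] = features.mean(axis=0)

    def recognize(self, features):
        # speaker recognition module 206 (simplified to nearest stored model)
        probe = features.mean(axis=0)
        return min(self.models,
                   key=lambda s: np.linalg.norm(self.models[s] - probe))

    def adapt(self, sign, features, alpha=0.1):
        # speaker model adaptation module 204 (simplified to a running average)
        self.models[sign] = ((1 - alpha) * self.models[sign]
                             + alpha * features.mean(axis=0))
```

The real apparatus would plug the MFCC extractor and GMM scoring into the corresponding methods; the skeleton only shows how the registration, recognition, and adaptation modules share the model memory.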
As set forth above, according to exemplary embodiments of the invention, a speaker recognition method is provided that includes an on-line speaker registration method performed naturally and adaptively in a home service robot. Also, according to exemplary embodiments of the invention, a speaker recognition method is provided that can adapt the voice data of a registered speaker according to time and environment variations.
While the present invention has been shown and described in connection with the preferred embodiments, it will be apparent to those skilled in the art that modifications and variations can be made without departing from the spirit and scope of the invention as defined by the appended claims.
Number | Date | Country | Kind
---|---|---|---
10-2006-87004 | Sep 2006 | KR | national