ON-LINE SPEAKER RECOGNITION METHOD AND APPARATUS THEREOF

Information

  • Patent Application
  • Publication Number
    20080065380
  • Date Filed
    March 12, 2007
  • Date Published
    March 13, 2008
Abstract
A speaker recognition method and apparatus are provided. In the speaker recognition method, basic data and voice data of a speaker are received using contents that repeatedly prompt the speaker to respond with his or her voice. The speaker's voice is then extracted from the voice data, and a feature vector for recognition is extracted from that voice. A speaker model is created based on the extracted feature vector. Then, a speaker registered in the speaker model is recognized based on information analyzed from an input voice.
Description

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other objects, features and other advantages of the present invention will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:



FIG. 1 is a flowchart illustrating a speaker recognition method according to an exemplary embodiment of the present invention; and



FIG. 2 is a block diagram illustrating a speaker recognition apparatus according to an exemplary embodiment of the present invention.





DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings.


Hereinafter, a speaker recognition method according to an exemplary embodiment of the present invention will be described with reference to FIG. 1 in detail.



FIG. 1 is a flowchart illustrating a speaker recognition method according to an exemplary embodiment of the present invention.


At step S100, basic data of a speaker is input in order to allocate an identification sign to the target speaker to be recognized. Since a family generally includes two or more members, a service robot used in a home needs to discriminate one speaker from the others. In the present embodiment, unique identification signs are allocated to the speakers recognized as different family members through the speaker recognition method of the present embodiment. Preferably, a name or nickname of the speaker, entered through an external input device such as a keyboard or a touch screen, is used as the identification sign.


After the identification sign is allocated to a speaker to be recognized, the speaker is repeatedly requested to respond to prompts using his or her voice at step S105. This is done to learn a statistical model from a plurality of collected voice samples, and to perform the recognition operation naturally using the learned model. In order to induce the speaker to respond to the requests at step S105, it is preferable to use predetermined contents built around the speaker's voice. Preferably, the predetermined contents include music contents that prompt the speaker to sing along with played music, game contents that prompt the speaker to respond by voice while playing a game, and educational contents that prompt the speaker to respond by voice while learning.


When the speaker responds to the request of step S105 by voice, the speaker's voice is input at step S110. A microphone can be used as the voice input unit for receiving the speaker's voice.


After the voice is input, the speaker's voice is extracted at step S115 from the voice data produced at step S110. The voice data input at step S110 includes noise around the speaker and sound related to the contents used at step S105. Such raw voice data is therefore not suitable for collecting the voices of a plurality of speakers, learning a statistical model from them, and performing recognition using the learned model. Thus, the speaker's voice must be extracted from the voice data. Here, a noise cancellation filter such as a Wiener filter can be used to remove the noise around the speaker. The sound from the contents used at step S105 can be removed easily by subtracting the related data from the voice data, because its waveform is already known.
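A minimal sketch of this extraction step, assuming the content playback waveform is available, time-aligned, and scaled to the level at which it was captured; scipy's Wiener filter stands in for the noise cancellation filter, and the helper name is hypothetical.

```python
from scipy.signal import wiener

def extract_speaker_voice(recording, content_playback):
    """Isolate the speaker's voice from a recording that also contains
    the known content audio and ambient noise (hypothetical helper)."""
    # The content sound has a known waveform, so it can be removed by
    # simple subtraction once aligned with the recording (step S115).
    residual = recording - content_playback
    # A Wiener filter then suppresses the remaining ambient noise.
    return wiener(residual, mysize=29)
```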


After extracting the voice of the speaker from the voice data, a feature vector for speaker recognition is extracted from the voice at step S120. That is, when a voice input through a microphone enters the system, feature vectors that properly express the phonetic characteristics of the speaker are extracted every 1/100 second. Such vectors must express the phonetic characteristics of the speaker well and also be insensitive to differences in the speaker's pronunciation and attitude. Representative methods related to feature extraction are linear predictive coding (LPC), which analyzes all frequency bands with equal weights; mel frequency cepstral coefficients (MFCC), which exploit the characteristic that the human auditory response follows a mel scale similar to a log scale; high-frequency emphasis, which emphasizes high-frequency elements to clearly discriminate voice from noise; and windowing, which minimizes the distortion caused by dividing the voice into short periods. Among them, it is preferable to use the MFCC extracting method to extract the feature vector, as sketched below.
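As an illustration of this frame-level analysis (a sketch, not the patent's implementation), librosa can compute MFCC vectors with a 10 ms (1/100 second) frame shift; the sampling rate and coefficient count are assumed values.

```python
import librosa

def mfcc_features(signal, sr=16000, n_mfcc=13):
    """Extract one MFCC feature vector every 1/100 second (10 ms)."""
    # hop_length = sr / 100 yields one feature vector per 1/100 second.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                hop_length=sr // 100)
    return mfcc.T  # shape: (num_frames, n_mfcc)
```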


After extracting the feature vectors from the voice data of the speaker, a speaker model is generated by parameterizing the feature vector distribution of the speaker at step S125. A Gaussian mixture model (GMM), a hidden Markov model (HMM), or a neural network can be used to create the speaker model. In the present embodiment, it is preferable to use the GMM to create the speaker model.


The distribution of the feature vectors extracted from the voice data of a speaker is modeled by a Gaussian mixture density. For a D-dimensional feature vector, the mixture density of a speaker can be expressed as the following Equation 1.











$$p(x \mid \lambda_s) = \sum_{i=1}^{M} \omega_i \, b_i(x) \tag{Equation 1}$$

$$b_i(x) = \frac{1}{(2\pi)^{D/2}\,\lvert \Sigma_i \rvert^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \right)$$







In Equation 1, $\omega_i$ denotes a mixture weight, and $b_i(x)$ denotes the density of the $i$th Gaussian component. The mixture density is thus a weighted linear combination of $M$ Gaussian densities, each parameterized by a mean vector $\mu_i$ and a covariance matrix $\Sigma_i$.
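As a concrete illustration (a sketch, not the patent's implementation), scikit-learn's GaussianMixture fits the model of Equation 1 to a speaker's feature vectors; the component count and the diagonal-covariance choice are assumptions.

```python
from sklearn.mixture import GaussianMixture

def train_speaker_model(features, M=16):
    """Fit the GMM of Equation 1 to one speaker's feature vectors.

    features -- array of shape (num_frames, D), e.g. MFCC vectors
    M        -- number of Gaussian components (assumed value)
    """
    gmm = GaussianMixture(n_components=M, covariance_type="diag")
    gmm.fit(features)  # estimates weights, means, and covariances via EM
    return gmm
```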


Then, at step S130, a speaker registered in the speaker model is recognized based on information analyzed from the input voice. The speaker recognition is performed using the identification sign allocated at step S100.


In order to recognize a speaker, the parameters of the Gaussian mixture model are estimated when a voice is input from the speaker. Maximum likelihood estimation is used as the parameter estimation method. The likelihood of a Gaussian mixture model for an observation sequence $X = \{x_1, \ldots, x_T\}$ can be expressed as the following Equation 2.










$$p(X \mid \lambda_s) = \prod_{t=1}^{T} p(x_t \mid \lambda_s) \tag{Equation 2}$$







In Equation 2, the parameters of the speaker model, $\lambda_s = \{\omega_i, \mu_i, \Sigma_i\}$ for $i = 1, 2, \ldots, M$, consist of the weights, means, and covariances. The maximum likelihood parameters are estimated using the expectation-maximization (EM) algorithm. When one of the family members speaks, the speaker is found by searching for the speaker model with the maximum a posteriori probability. This search can be expressed by the following Equation 3.










$$\hat{S} = \arg\max_{k} \sum_{t=1}^{T} \log p(x_t \mid \lambda_k) \tag{Equation 3}$$
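In code, Equations 2 and 3 amount to summing per-frame log-likelihoods under each registered model and picking the best; a minimal sketch using the GaussianMixture models fitted above.

```python
def identify_speaker(features, speaker_models):
    """Return the identification sign of the best-matching speaker.

    speaker_models -- dict mapping identification signs (e.g. names)
                      to fitted GaussianMixture models
    """
    # score_samples gives log p(x_t | lambda_k) per frame; summing over
    # frames yields the log of Equation 2, i.e. Equation 3's criterion.
    return max(speaker_models,
               key=lambda k: speaker_models[k].score_samples(features).sum())
```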







In the present embodiment, at step S130, the previously generated speaker model is also adapted using the speaker's continuously input voice. Bayesian adaptation is well known as a method of obtaining the adapted speaker model. In order to adjust the speaker model, the adapted model is obtained by updating the weights, means, and variances. This method is similar to the method of obtaining an adapted speaker model using a generalized background model. Hereinafter, these updates are described with the related equations.


The a posteriori probability of the jth Gaussian component of a registered speaker's model is calculated by the following Equation 4.










$$p(j \mid x_t) = \frac{\omega_j \, b_j(x_t)}{\sum_{k=1}^{M} \omega_k \, b_k(x_t)} \tag{Equation 4}$$







The weight, mean, and variance parameters are then updated from statistics computed as in the following Equation 5.











$$n_j = \sum_{t=1}^{T} p(j \mid x_t)$$

$$E_j(x) = \frac{1}{n_j} \sum_{t=1}^{T} p(j \mid x_t)\, x_t$$

$$E_j(x^2) = \frac{1}{n_j} \sum_{t=1}^{T} p(j \mid x_t)\, x_t^2 \tag{Equation 5}$$







Based on these statistics, the adapted parameters of the jth mixture component are obtained by combining the old parameters with the new statistics through adaptation coefficients. Finally, a new speaker model that follows a voice varying according to time and environment can be generated.
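A minimal sketch of this adaptation step in the style of Reynolds-type MAP adaptation, computing the statistics of Equations 4 and 5 and blending them into the means; the relevance factor r and the mean-only update are assumptions, since the patent does not give the final combination formula.

```python
import numpy as np

def adapt_speaker_model(gmm, features, r=16.0):
    """Mean-only MAP adaptation of a fitted GaussianMixture (sketch).

    gmm      -- GaussianMixture previously fitted for this speaker
    features -- newly collected frames, shape (num_frames, D)
    r        -- relevance factor (assumed value)
    """
    # Equation 4: posterior p(j | x_t) of each component for each frame.
    post = gmm.predict_proba(features)             # shape (T, M)

    # Equation 5: occupancy counts n_j and first-order statistics E_j(x).
    n = post.sum(axis=0)                           # shape (M,)
    Ex = (post.T @ features) / np.maximum(n, 1e-10)[:, None]

    # Adaptation coefficient alpha_j = n_j / (n_j + r) blends the old
    # means with the new statistics (Reynolds-style update, assumed).
    alpha = (n / (n + r))[:, None]
    gmm.means_ = alpha * Ex + (1.0 - alpha) * gmm.means_
    return gmm
```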


Hereinafter, a speaker recognition apparatus according to an exemplary embodiment of the present invention will be described with reference to FIG. 2 in detail.



FIG. 2 is a block diagram illustrating a speaker recognition apparatus according to an exemplary embodiment of the present invention.


A contents storing unit 209 stores contents for repeatedly requesting a speaker to respond using his or her voice. Preferably, the contents include music contents that prompt the speaker to sing along with played music, game contents that prompt the speaker to respond by voice while playing a game, and educational contents that prompt the speaker to respond by voice while learning. A contents management unit 208 manages the contents stored in the contents storing unit 209 and outputs them to the speaker through an output unit 210.


An input unit 200 includes a voice input unit, such as a microphone, for receiving the voice data of a speaker generated in response to the contents, and a general input unit, such as a keyboard or a touch screen, for receiving an identification sign, such as a name or nickname, of the speaker whose voice is input.


A voice extracting module 202 is a device for extracting the voice of a speaker from the voice signal input through the input unit 200. It is preferable to use a noise cancellation filter 201, such as a Wiener filter, for canceling noise from the voice signal input through the input unit 200.


After the voice extracting module 202 extracts the voice of a speaker, a feature vector extracting module 203 extracts the feature vectors required for speaker recognition. That is, when the voice input through the input unit 200 enters the system, feature vectors that properly express the phonetic characteristics of the speaker are extracted every 1/100 second.


A speaker model generation module 205 creates a speaker model by parameterizing the feature vector distribution of the extracted speaker's voice data, and the created speaker model is stored in a memory 207.


A speaker recognition module 206 recognizes a speaker by searching the speaker models stored in the memory 207 based on the feature vectors of the extracted speaker's voice data.


Here, a speaker model adaptation module 204 updates the speaker model stored in the memory 207 in order to adapt the generated speaker model using the voice data continuously input in response to the contents.
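To make the module flow concrete, here is a hypothetical wiring of the units of FIG. 2, reusing the sketches given with the method description; the function names and the single-pass structure are assumptions.

```python
def run_recognition_cycle(recording, content_playback, speaker_models):
    """One recognition pass through the FIG. 2 pipeline (hypothetical)."""
    voice = extract_speaker_voice(recording, content_playback)  # modules 201-202
    feats = mfcc_features(voice)                                # module 203
    name = identify_speaker(feats, speaker_models)              # module 206
    adapt_speaker_model(speaker_models[name], feats)            # module 204
    return name  # identification sign of the recognized speaker
```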


As set forth above, exemplary embodiments of the invention provide a speaker recognition method, including an on-line speaker registration method, that is performed naturally and adaptively in a home service robot. Exemplary embodiments of the invention also provide a speaker recognition method that can adapt the voice data of a registered speaker to variations in time and environment.


While the present invention has been shown and described in connection with the preferred embodiments, it will be apparent to those skilled in the art that modifications and variations can be made without departing from the spirit and scope of the invention as defined by the appended claims.

Claims
  • 1. A speaker recognition method comprising: receiving basic data and voice data of a speaker using contents that constantly request the speaker to respond using the speaker's voice; extracting only a voice of the speaker from the voice data; extracting a feature vector for recognition from the voice of the speaker; creating a speaker model from the extracted feature vector; and recognizing a speaker stored in the speaker model based on information analyzed from an input voice.
  • 2. The speaker recognition method according to claim 1, further comprising: receiving basic data of a speaker to be recognized before the step of receiving basic data and voice data.
  • 3. The speaker recognition method according to claim 2, wherein the basic data of the speaker is a name of the speaker.
  • 4. The speaker recognition method according to claim 1, wherein the contents are music contents, game contents, or educational contents.
  • 5. The speaker recognition method according to claim 1, wherein the step of extracting only the voice includes canceling noise from the voice data and removing sound related to the contents from the voice data.
  • 6. The speaker recognition method according to claim 1, wherein in the step of extracting the feature vector, a MFCC (mel frequency cepstral coefficients) extracting method is used.
  • 7. The speaker recognition method according to claim 1, wherein in the step of creating the speaker model, the speaker model is created using a Gaussian mixture model.
  • 8. The speaker recognition method according to claim 1, wherein in the step of recognizing the speaker, the analyzed information from the input voice is a likelihood obtained through Equation: $p(X \mid \lambda_s) = \prod_{t=1}^{T} p(x_t \mid \lambda_s)$.
  • 9. The speaker recognition method according to claim 1, further comprising adapting a previously generated speaker model using a feature vector extracted from a voice of a speaker.
  • 10. The speaker recognition method according to claim 9, wherein in the step of adapting the previously generated speaker model, a jth Gaussian mixture component of the previously generated speaker model is calculated using Equation: $p(j \mid x_t) = \omega_j b_j(x_t) / \sum_{k=1}^{M} \omega_k b_k(x_t)$.
  • 11. A computer readable recording medium for recording a program that implements a speaker recognition method comprising: receiving basic data and voice data of a speaker using contents that constantly request the speaker to respond using the speaker's voice; extracting only a voice of the speaker from the voice data; extracting a feature vector for recognition from the voice of the speaker; creating a speaker model from the extracted feature vector; and recognizing a speaker stored in the speaker model based on information analyzed from an input voice.
  • 12. A speaker recognition apparatus comprising: a contents storing unit for storing contents that request a speaker to constantly respond using voice; an output unit for outputting the contents externally; a contents managing unit for controlling the output unit to output the contents stored in the contents storing unit; a voice input unit for receiving voice data of a speaker generated in response to the contents; a voice extracting module for extracting only a voice of a speaker by removing sound related to the contents from the voice signal; a feature vector extraction module for extracting a feature vector from the extracted voice of the speaker; a speaker model generation module for generating a speaker model of a speaker based on the extracted feature vector; a speaker model training module for adapting a speaker model of a speaker based on the extracted feature vector; a memory for storing information related to a speaker model; and a speaker recognition module for recognizing a speaker by searching a speaker model stored in the memory based on the extracted feature vector.
  • 13. The speaker recognition apparatus according to claim 12, further comprising: an input unit for receiving a name of each speaker who inputs voice through the voice input unit as an identification sign.
  • 14. The speaker recognition apparatus according to claim 12, wherein the contents stored in the contents storing unit are music contents, game contents, or educational contents.
  • 15. A home service robot comprising: a speaker recognition apparatus including: a contents storing unit for storing contents that request a speaker to constantly respond using voice; an output unit for outputting the contents externally; a contents managing unit for controlling the output unit to output the contents stored in the contents storing unit; a voice input unit for receiving voice data of a speaker generated in response to the contents; a voice extracting module for extracting only a voice of a speaker by removing sound related to the contents from the voice signal; a feature vector extraction module for extracting a feature vector from the extracted voice of the speaker; a speaker model generation module for generating a speaker model of a speaker based on the extracted feature vector; a speaker model training module for adapting a speaker model of a speaker based on the extracted feature vector; a memory for storing information related to a speaker model; and a speaker recognition module for recognizing a speaker by searching a speaker model stored in the memory based on the extracted feature vector.
  • 16. The home service robot according to claim 15, wherein the speaker recognition apparatus further includes an input unit for receiving a name of each speaker who inputs voice through the voice input unit as an identification sign.
  • 17. The home service robot according to claim 15, wherein the contents stored in the contents storing unit are music contents, game contents, or educational contents.
Priority Claims (1)
Number Date Country Kind
10-2006-87004 Sep 2006 KR national