The present invention relates generally to the field of speaker recognition.
A method and apparatus for speaker recognition is provided. One embodiment of a method determines whether a given speech signal is produced by an alleged speaker, where a plurality of statistical models (including at least one support vector machine) have been produced for the alleged speaker based on a previous speech signal received from the alleged speaker. The method includes receiving the given speech signal, which represents an utterance made by a speaker claiming to be the alleged speaker; scoring the given speech signal using at least two modeling systems, at least one of which is a support vector machine; combining the scores produced by the modeling systems, with equal weights, to produce a final score; and determining, in accordance with the final score, whether the speaker is likely the alleged speaker.
The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
The present invention relates to a method and apparatus for speaker recognition (i.e., determining the identity of a person supplying a speech signal). Specifically, the present invention provides methods for discerning between a target (or true) speaker and one or more impostor (or background) speakers. Given a sample speech input from a speaker and a claimed identity, the present invention determines whether the claim is true or false. Embodiments of the present invention combine novel acoustic and stylistic approaches to speaker modeling by fusing scores computed by individual models into a new score, via use of a “combiner” model.
In step 106, the method 100 models the speech signal using a plurality of modeling approaches. The result is a plurality of scores, generated by the different approaches, indicating whether the speech signal likely came from the target speaker or from an impostor. In one embodiment, each of the plurality of modeling approaches is a support vector machine (SVM)-based discriminative modeling approach. Each SVM is trained to classify between features for the target speaker and features for impostors (where there are more training instances for impostors, on the order of thousands, than for the true speaker, up to approximately eight). In one embodiment, the method 100 produces four individual scores (models) in step 106 (i.e., using four SVMs). In one embodiment, the SVMs use a linear kernel and differ in the types of features they use. Moreover, the SVMs use a cost function that makes false rejection more costly than false acceptance. In one embodiment, false rejection is five hundred times more costly than false acceptance.
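As an illustrative sketch only (using scikit-learn's LinearSVC, with synthetic features and the 500:1 cost ratio expressed as a class weight, all of which are assumptions rather than the claimed implementation), such a class-weighted linear SVM could be trained as follows:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Illustrative data: a handful of target-speaker vectors (label +1) versus
# thousands of impostor vectors (label -1), as described above.
rng = np.random.default_rng(0)
target_feats = rng.normal(loc=0.5, scale=1.0, size=(8, 40))       # up to ~8 true-speaker instances
impostor_feats = rng.normal(loc=0.0, scale=1.0, size=(3000, 40))  # on the order of thousands

X = np.vstack([target_feats, impostor_feats])
y = np.concatenate([np.ones(len(target_feats), dtype=int),
                    -np.ones(len(impostor_feats), dtype=int)])

# Linear kernel; weight errors on the target class so that a false rejection
# costs (here) five hundred times more than a false acceptance.
svm = LinearSVC(C=1.0, class_weight={1: 500.0, -1: 1.0}, max_iter=10000)
svm.fit(X, y)

# The signed decision value serves as this SVM's score for a test utterance.
score = svm.decision_function(X[:1])
```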
In step 108, the method 100 combines the scores produced in step 106 to produce a final score. The final score indicates a “consensus” as to the likelihood that the speaker is the target speaker or an impostor. In one embodiment, the scores are combined with equal weights.
In step 110, the method 100 identifies the likely speaker, based on the final score produced in step 108. Specifically, the method 100 classifies the input speech signal as coming from either the target speaker or an impostor. The method 100 then terminates in step 112.
The method 200 represents cepstral features of an input speech signal by combining a subspace spanned by training speakers (for whom normalization statistics are available) with the subspace's complementary space, modeling both subspaces separately with SVMs, and then combining the systems. Specifically, when polynomial features (on the order of tens of thousands) are used as features with an SVM, a peculiar situation arises. Since there are more features than impostor speakers (on the order of thousands, as discussed above), the distribution of features in a high dimensional space lies in a lower dimensional subspace spanned by the background (or impostor) speakers. This lower dimensional subspace is referred to herein as the “background subspace”. A subspace orthogonal to the background subspace captures all the variation in the feature space that is not observed between background speakers. This orthogonal subspace is referred to herein as the “background-complement subspace”. It is evident that the background subspace and the background-complement subspace have different characteristics for speaker recognition.
Referring back to
In step 206, the method 200 appends the MFCCs with delta and double-delta coefficients, tripling the number of dimensions (e.g., to a 39-dimensional feature vector in the current example, where the method 200 starts with 13 MFCCs). The method 200 then proceeds to step 208 and normalizes the resultant vector, in one embodiment using cepstral mean subtraction (CMS) and feature transformation to mitigate the effects of handset variation (e.g., variation in the means by which the user speech signal is captured).
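A minimal numpy sketch of steps 206 and 208, assuming a simple two-frame slope for the delta computation and per-utterance cepstral mean subtraction (the handset feature transformation is not shown):

```python
import numpy as np

def add_deltas(mfcc, width=2):
    """Append delta and double-delta coefficients, tripling the dimension.
    mfcc: (num_frames, 13) array; returns (num_frames, 39).
    A simple +/- `width` frame slope is used here for illustration."""
    padded = np.pad(mfcc, ((width, width), (0, 0)), mode="edge")
    delta = (padded[2 * width:] - padded[:-2 * width]) / (2.0 * width)
    padded_d = np.pad(delta, ((width, width), (0, 0)), mode="edge")
    delta2 = (padded_d[2 * width:] - padded_d[:-2 * width]) / (2.0 * width)
    return np.hstack([mfcc, delta, delta2])

def cepstral_mean_subtraction(feats):
    """Subtract the per-utterance mean of each coefficient (CMS)."""
    return feats - feats.mean(axis=0, keepdims=True)

mfcc = np.random.randn(200, 13)          # placeholder 13-dimensional MFCCs
feats = cepstral_mean_subtraction(add_deltas(mfcc))
assert feats.shape == (200, 39)
```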
In step 210, the method 200 appends the transformed vector with second order and third order polynomial coefficients, where the second order polynomial of X = [x1 x2]^T is poly(X,2) = [X^T x1^2 x1x2 x2^2]^T and the third order polynomial is poly(X,3) = [poly(X,2)^T x1^3 x1^2x2 x1x2^2 x2^3]^T. If the method 200 originally obtained thirteen MFCCs in step 202, then the resultant vector, referred to as the “polynomial feature vector”, will have 11479 dimensions.
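The dimensionality quoted above can be checked with a short sketch; scikit-learn's PolynomialFeatures is used here purely as a stand-in for the polynomial expansion described in step 210:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Starting from a 39-dimensional frame vector (13 MFCCs plus deltas and
# double deltas), append all second- and third-order monomials.  Excluding
# the constant term, the number of monomials of degree <= 3 in 39 variables
# is C(42, 3) - 1 = 11479, matching the dimensionality quoted above.
frame = np.random.randn(1, 39)
poly = PolynomialFeatures(degree=3, include_bias=False)
poly_frame = poly.fit_transform(frame)
assert poly_frame.shape[1] == 11479
```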
In step 212, the method 200 estimates the mean and standard deviations of the features of the polynomial feature vector over a given speech signal (utterance).
At this point, the method 200 branches into two individual processes that are performed in parallel. In the case where four SVMs are used to process the speech signal, the first two of the SVMs use the mean polynomial (MP) feature vectors for further processing, while the second two SVMs use the mean polynomial vector divided by the standard deviation polynomial vector (MSDP), as discussed in further detail below.
For the first two SVMs, the method 200 proceeds to step 214 and performs principal component analysis (PCA) on the polynomial features for the background (impostor) speaker utterances. The number, F, of features (e.g., F=11479 in the current example) is much larger than the number, S, of background speakers (S is on the order of thousands, as discussed above). Thus, the distribution of high-dimensional features lies in a lower dimensional speaker subspace. Only S−1 leading eigenvectors (also referred to as principal components (PCs)) have non-zero eigenvalues; the remaining F−S+1 eigenvectors have zero eigenvalues. The leading eigenvectors are normalized by the corresponding eigenvalues. All of the leading eigenvectors are selected because the total variance is distributed evenly across them.
The method 200 then proceeds to step 218 and projects features onto principal components. Specifically, the mean polynomial features are projected onto the normalized S−1 eigenvectors (F1), and onto the remaining F-S+1 un-normalized eigenvectors (F2).
Referring back to step 212, the second two SVMs modify the kernel to include a confidence estimate obtained from the standard deviation. If m(X) denotes the mean polynomial vector estimated over an utterance X, the linear kernel between two utterances Xi and Xj is k(Xi, Xj) = m(Xi)^T m(Xj).
This kernel may be modified as: k(Xi, Xj) = [m(Xi)/s(Xi)]^T [m(Xj)/s(Xj)], where s(X) is the standard deviation polynomial vector estimated over the utterance X and the division is performed element-wise.
This implies that the inner product is scaled by the standard deviation of the individual features, where the standard deviation is computed separately over each utterance. Instead of modifying the kernel, the features are modified by obtaining a new feature vector that is the mean polynomial vector divided by the standard deviation polynomial vector (MSDP).
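A small numerical check, under the kernel form reconstructed above, that dividing the mean polynomial vector by the per-utterance standard deviation vector (MSDP) is equivalent to scaling the inner-product kernel (the dimensions and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
frames_a = rng.standard_normal((300, 50))   # polynomial features, utterance A (toy dimension)
frames_b = rng.standard_normal((250, 50))   # polynomial features, utterance B

m_a, s_a = frames_a.mean(axis=0), frames_a.std(axis=0)
m_b, s_b = frames_b.mean(axis=0), frames_b.std(axis=0)

# Kernel with per-utterance standard-deviation scaling of each feature ...
k_modified = np.sum(m_a * m_b / (s_a * s_b))
# ... equals the plain linear kernel applied to the MSDP feature vectors.
msdp_a, msdp_b = m_a / s_a, m_b / s_b
k_linear = np.dot(msdp_a, msdp_b)
assert np.isclose(k_modified, k_linear)
```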
For the second two SVMs, the method 200 proceeds to step 216 and performs principal component analysis (PCA) on the polynomial features for the background (impostor) speaker utterances. As in step 214, two sets of eigenvectors are obtained: the first set (F3) corresponds to non-zero eigenvalues, and the second set (F4) corresponds to zero eigenvalues. In the first set, the eigenvalues are not spread evenly, as they are for mean polynomial vectors. This is due to the scaling by the standard deviation terms. In one embodiment, only the first five hundred leading eigenvectors (corresponding to ninety-nine percent of the total variance) are kept, and the first of these two SVMs uses the coefficients obtained from the first five hundred leading eigenvectors. The second of these two SVMs uses as features the coefficients obtained using the trailing eigenvectors corresponding to zero eigenvalues.
The method 200 then proceeds to step 218 as described above and projects features onto principal components. Specifically, the MSDP features are projected onto the five hundred leading eigenvectors (F3) and onto the trailing eigenvectors corresponding to zero eigenvalues (F4).
In step 220, the method 200 combines the coefficients produced in step 218 (F1, F2, F3, and F4), which comprise complementary output, using a single (“combiner”) system. In one embodiment, the combiner is any system (e.g., SVM, neural network, etc.) that can use any linear or non-linear combination strategy. In one embodiment, the combiner SVM sums the scores from all of the SVMs (e.g., the four SVMs in the current example) with equal weights to produce the final score, which is output in step 222. The method 200 then terminates in step 224.
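A minimal sketch of the equal-weight combination in step 220; the decision threshold of zero is an assumption, not part of the described embodiment:

```python
import numpy as np

def combine_scores(scores, threshold=0.0):
    """Equal-weight fusion of the individual SVM scores for one test
    utterance.  Returns the final score and an accept/reject decision;
    the zero threshold is illustrative and would be tuned in practice."""
    final = float(np.sum(scores))            # equal weights
    return final, final > threshold

final_score, accept = combine_scores([0.8, -0.1, 0.3, 0.5])   # e.g., the four SVM outputs
```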
In one embodiment, the background and background-complement transforms are estimated as follows. The covariance matrix of the features (F) for the background speakers (S) is a low-rank matrix having a rank of S−1. Instead of performing PCA in feature space, PCA is performed in speaker space. This is analogous to kernel PCA. The S−1 kernel principal components are then transformed into the corresponding principal components in feature space. The principal components in feature space are divided by the eigenvalues to produce the (S−1)×F background transform.
The computation of a complement transform depends on the original transform that was used. Since PCA was performed in the previous step, the background-complement transform is implemented implicitly (PCA is a direct result of the inner product kernel). A given feature vector is projected onto the eigenvectors of the background transform. The resultant coefficients are used to reconstruct the feature vector in the original space. The difference between the original and reconstructed feature vectors is used as the feature vector in the background-complement subspace. This is an F-dimensional subspace. Those skilled in the art will appreciate that other embodiments of the present invention may not rely on PCA and complementary transforms, but may be extended to other techniques including, but not limited to, independent component analysis and local linear PCA (the complement will be computed accordingly). In other embodiments using non-linear kernels (e.g., radial basis function), the complement may be produced in a very different way.
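The following numpy sketch illustrates one way the speaker-space PCA and reconstruction-residual complement described above could be realized; the toy dimensions, the centering on the background mean, and the exact normalization constants are assumptions rather than the claimed implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
S, F = 50, 2000                      # toy sizes: S background speakers, F >> S features
B = rng.standard_normal((S, F))      # one mean-polynomial vector per background speaker

mean_b = B.mean(axis=0)
Bc = B - mean_b                      # center on the background mean (assumed)

# PCA in "speaker space": eigendecompose the S x S Gram matrix instead of the
# F x F covariance (analogous to kernel PCA with a linear kernel).
evals, evecs = np.linalg.eigh(Bc @ Bc.T)
order = np.argsort(evals)[::-1][:S - 1]          # the S-1 non-zero eigenvalues
evals, evecs = evals[order], evecs[:, order]

# Map the kernel principal components back to feature space.
U = (Bc.T @ evecs) / np.sqrt(evals)              # orthonormal basis of the background subspace
pcs = (Bc.T @ evecs) / evals                     # divided by the eigenvalues: background transform

def background_features(x):
    """Projection onto the background subspace (features F1 / F3)."""
    return (x - mean_b) @ pcs

def complement_features(x):
    """Reconstruction residual, i.e. the background-complement features (F2 / F4)."""
    xc = x - mean_b
    return xc - U @ (U.T @ xc)

# Every background speaker maps (numerically) to the origin of the complement subspace.
assert np.allclose(complement_features(B[0]), 0.0, atol=1e-8)
```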
An interesting property of the background-complement subspace is that all of the feature vectors corresponding to the background speakers get mapped to the origin. Thus, SVM training is very easy. The origin is a single impostor data point (irrespective of the number of impostors), and one or more transformed feature vectors from the target training data are the true speaker data points. This is very different from training in the background subspace, where there are S impostor data points and one or more target speaker data points.
The method 200 may be implemented independently (e.g., in an autonomous speaker recognition system) or in conjunction with other systems and methods to provide improved speaker recognition performance.
The method 300 performs modeling based on output from a word recognizer. That is, knowing what was said in a given speech signal (i.e., the hypothesized words), the method 300 aims to identify who said it by characterizing long-term aspects of the speech (e.g., pitch, duration, energy, and the like). The method 300 computes a set of prosodic features associated with each recognized syllable (syllable-based non-uniform extraction region features, or SNERFs), transforms them into fixed-length vectors, and models them using support vector machines (SVMs). Although the method 300 is described in terms of characterizing the pitch, duration, and energy of speech, those skilled in the art will appreciate that other types of prosodic features (e.g., jitter, shimmer) could also be characterized in accordance with the present invention for the purposes of performing speaker recognition.
Referring back to
In step 304, the method 300 computes syllable-level prosodic features from the hypothesized words and time marks. In one embodiment, to estimate syllable regions, the method 300 syllabifies the hypothesized words and time marks using a program that employs a set of human-created rules that operate on the best-matched dictionary pronunciation for each word. For each resulting syllable region, the method 300 obtains phone-level alignment information (e.g., from the speech recognizer) and then extracts a large number of prosodic features related to the duration, pitch, and energy values in the syllable region. After extraction and stylization of these prosodic features, the method 300 creates a number of duration, pitch, and energy features aimed at capturing basic prosodic patterns at the syllable level.
In one embodiment, for duration features, the method 300 uses six different regions in the syllable. As illustrated in
In one embodiment, for pitch features, the method 300 uses two different regions in the syllable. As illustrated in
In one embodiment, for energy features, the method 300 uses four different regions in the syllable. As illustrated in
Referring back to
In step 310, the method 300 models the sample-level vector b(X) using an SVM. In one embodiment, the score assigned by the SVM to any particular speech signal is the signed Euclidean distance from the separating hyperplane to the point in hyperspace that represents the speech signal, where a negative value indicates an impostor. The output (score) is a real-valued number.
In step 312, the method 300 normalizes the scores assigned by the SVM. In one embodiment, the scores are normalized using an impostor-centric score normalization method. Specifically, each score is normalized by a mean and a variance, which are estimated by scoring the speech signal against the set of impostor models. The method 300 then terminates in step 314.
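A minimal sketch of the impostor-centric normalization in step 312, assuming the variance is applied as a standard deviation and that the impostor-model scores have already been computed:

```python
import numpy as np

def impostor_centric_norm(raw_score, impostor_scores):
    """Normalize a score by the mean and standard deviation of the scores the
    same speech signal obtains against a set of impostor models."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return (raw_score - mu) / sigma

normed = impostor_centric_norm(1.7, np.random.randn(200) * 0.5)   # impostor scores are placeholders
```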
In some embodiments, as described above, the set of syllable-level feature vectors X = {x1, x2, . . . , xN} is transformed into a single sample-level vector b(X) for modeling by the SVM. Since linear kernel SVMs are trained, the whole process is equivalent to using a kernel given by K(X,Y) = b(X)^T b(Y). Each component of X corresponds to either a syllable or a pause, and these components are referred to as “slots”. If a slot corresponds to a syllable, it contains the prosodic features for that syllable. If a slot corresponds to a pause, it contains the pause length. The overall idea is to make a representation of the distribution of the prosodic features and then use the parameters of that representation to form the sample-level vector b(X). In one embodiment, each prosodic feature is considered separately and models are generated for the distribution of prosodic features in unigrams, bigrams, and trigrams. This allows the change in the prosodic features over time to be modeled. In another embodiment, the prosodic features are considered in groups.
Furthermore, separate models are created for sequences including pauses in different positions of the sequence. For N=1 gram length (i.e., unigrams), each prosodic feature is modeled with a single model (S) including only non-pause slots (i.e., actual syllables). For N=2 gram length (i.e., bigrams), three different models are obtained: (S,S), (P,S) and (S,P) for each prosodic feature (where S represents a syllable and P represents a pause). For N=3 gram length (i.e., trigrams), five different models are obtained: (S,S,S), (P,S,S), (S,P,S), (S,S,P) and (P,S,P) for each prosodic feature. Each pair {prosodic feature, pattern} determines a “token”. The parameters corresponding to all tokens are concatenated to obtain the sample-level vector b(X). Three different embodiments of parameterizations of the token distributions, according to the present invention, are described in further detail with respect to
The method 700 then proceeds to step 706 and counts the number of times that each prosodic feature fell in each bin during the speech signal. Since it is not known a priori where to place thresholds for binning data, discretization is performed evenly on the rank distribution of values for a given prosodic feature, so that the resultant bins contain roughly equal amounts of data. When this is not possible (e.g., in the case of discrete features), unequal mass bins are allowed. For pauses, one set of hand-chosen threshold values (e.g., 60, 150, and 300 ms) is used to divide the pauses into four different lengths. In this approach, the undefined values are simply taken to be a separate bin. The bins for bigrams and trigrams are obtained by concatenating the bins for each feature in the sequence. This results in a grid, and the features are simply the counts corresponding to each bin in the grid. In one embodiment, the counts are normalized by the total number of syllables in the sample/speech signal. Many of the bins obtained by simple concatenation will correspond to places in the feature space where very few samples ever fall.
The method 700 then proceeds to step 708 and constructs the sample-level vector b(X). The sample level vector b(X) is composed only of the counts corresponding to bins for which the count was higher than a certain threshold in some held-out data. The method 700 then terminates in step 710.
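A unigram-only sketch of the count-based parameterization of the method 700 (equal-mass bins from the rank distribution, an extra bin for undefined values, counts normalized by the number of syllables); the bin count, the pruning threshold, and the bigram/trigram grid construction are omitted or assumed:

```python
import numpy as np

def equal_mass_bins(values, n_bins):
    """Thresholds placed on the rank distribution so each bin holds roughly
    the same amount of data (discrete features may yield unequal-mass bins)."""
    qs = np.linspace(0, 100, n_bins + 1)[1:-1]
    return np.percentile(values, qs)

def unigram_count_features(feature_values, n_bins=10, min_count=5, heldout_counts=None):
    """Per-bin counts for one prosodic feature over one speech signal,
    normalized by the number of syllables; undefined values (NaN) get their
    own bin.  Bins whose held-out count falls below `min_count` are dropped."""
    vals = np.asarray(feature_values, dtype=float)
    defined = vals[~np.isnan(vals)]
    edges = equal_mass_bins(defined, n_bins)
    counts = np.zeros(n_bins + 1)                        # last slot: undefined-value bin
    counts[:n_bins] = np.bincount(np.digitize(defined, edges), minlength=n_bins)[:n_bins]
    counts[n_bins] = np.isnan(vals).sum()
    counts /= len(vals)                                   # normalize by the number of syllables
    if heldout_counts is not None:                        # keep only well-populated bins
        counts = counts[heldout_counts >= min_count]
    return counts

# Toy usage: one prosodic feature's values over the syllables of a conversation side.
b_component = unigram_count_features(np.r_[np.random.randn(300), np.full(12, np.nan)])
```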
where pt is the length of the pause at slot t and ft is the value of the prosodic feature f at slot t. The logarithm is used to reflect the fact that the influence of the length of the pause decreases as the length of the pause itself increases. In this approach, discrete features are treated in the same way as continuous features, with the only precaution being that variances that become too small are clipped to a minimum value.
Once the background GMMs for each token have been trained, the method 800 proceeds to step 806 and obtains the features for each test and train sample by MAP adaptation of the GMM weights to the sample's data. The adapted weight is simply the posterior probability of a Gaussian given the feature vector, averaged over all syllables in the speech signal.
In step 808, the adapted weights for each token are finally concatenated to form the sample-level vector b(X). The method 800 then terminates in step 810.
For the one-dimensional case (i.e., unigrams), the method 800 is closely related to the method 700, with the “hard” bins replaced by Gaussians and the counts replaced by posterior probabilities. For longer N-grams, there is a bigger difference: the “soft” bins represented by the Gaussians are obtained by looking at the joint distribution from all dimensions, while in the method 700, the bins were obtained as a concatenation of the bins for the unigrams.
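A compact sketch of the soft-bin parameterization of the method 800, using scikit-learn's GaussianMixture for the background GMM of one token; the data, dimensionality, and component count are illustrative, and the adapted weight is taken directly as the averaged posterior as described above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Background GMM for one token, trained on pooled background-speaker data.
rng = np.random.default_rng(3)
background_data = rng.standard_normal((5000, 3))          # toy 3-dimensional prosodic vectors
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(background_data)

def adapted_weights(sample_data, gmm):
    """'Adapted' weight for each Gaussian: the posterior probability of that
    Gaussian given a feature vector, averaged over all syllables in the
    speech signal.  These weights form this token's portion of b(X)."""
    posteriors = gmm.predict_proba(sample_data)            # (num_syllables, n_components)
    return posteriors.mean(axis=0)

b_token = adapted_weights(rng.standard_normal((120, 3)) + 0.3, gmm)
```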
A variation of the Linde Buzo Gray (LBG) algorithm (i.e., as described by Gersho et al. in “Vector Quantization and Signal Compression”, 1992, Kluwer Academic Publishers Group, Norwell, Mass.) is used to create the models. The method 900 is initialized in step 902 and proceeds to step 904, where the Lloyd algorithm is used to create two clusters (i.e., as also described by Gersho et al.).
In step 906, the cluster with the higher total distortion is then further split into two by perturbing the mean of the original cluster by a small amount. These clusters are used as a starting point for running a few iterations of the Lloyd algorithm.
In step 908, the method 900 determines whether the desired number of clusters has been reached. In one embodiment, the desired number of clusters is determined empirically (e.g., by cross validation). If the method 900 concludes that the desired number of clusters has not been reached, the method 900 returns to step 906 and proceeds as described above to split the new cluster with the higher total distortion into two new clusters. One cluster at a time is split until the desired number of clusters is reached. In one embodiment, during every step, the distortion used is the weighted squared distance (i.e., d(x,y) = Σi (xi − yi)^2 / vi), where vi is the global variance of the data in dimension i. When an undefined feature is present, the term corresponding to that dimension is simply ignored in the computation of distortion. If at any step a cluster is created that has too few samples, this cluster is destroyed, and a cluster with high total distortion is split in two.
Alternatively, if the method 900 concludes in step 908 that the desired number of clusters has been reached, the method 900 proceeds to step 910 and creates a GMM by assigning one Gaussian to each cluster with mean and variance determined by the data in the cluster and weight given by the proportion of samples in that cluster. This approach naturally deals with discrete values resulting in clusters with a single discrete value when necessary. The variances for these clusters are set to a minimum when converting the codebook to a GMM. The method 900 then terminates in step 912.
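An illustrative numpy sketch of the LBG-style construction in the method 900 (split the highest-distortion cluster by perturbing its mean, refine with a few Lloyd iterations, convert the codebook to a diagonal GMM); the handling of undefined values and the minimum-cluster-size rule described above are omitted for brevity:

```python
import numpy as np

def lbg_gmm(data, n_clusters, n_lloyd_iters=5, eps=1e-3, min_var=1e-4):
    """Split the cluster with the highest total distortion by perturbing its
    mean, refine with a few Lloyd iterations, repeat until `n_clusters` is
    reached, then convert the codebook to a diagonal GMM.  Distortion is the
    variance-weighted squared distance d(x,y) = sum_i (x_i - y_i)^2 / v_i."""
    v = data.var(axis=0) + 1e-12                      # global per-dimension variance
    means = [data.mean(axis=0)]
    while len(means) < n_clusters:
        centers = np.array(means)
        d = (((data[:, None, :] - centers) ** 2) / v).sum(axis=2)
        assign = d.argmin(axis=1)
        distortion = np.array([d[assign == k, k].sum() for k in range(len(centers))])
        worst = int(distortion.argmax())              # split the worst cluster in two
        means = list(centers)
        means[worst] = centers[worst] + eps
        means.append(centers[worst] - eps)
        for _ in range(n_lloyd_iters):                # a few Lloyd iterations
            centers = np.array(means)
            d = (((data[:, None, :] - centers) ** 2) / v).sum(axis=2)
            assign = d.argmin(axis=1)
            means = [data[assign == k].mean(axis=0) if np.any(assign == k) else centers[k]
                     for k in range(len(centers))]
    centers = np.array(means)
    d = (((data[:, None, :] - centers) ** 2) / v).sum(axis=2)
    assign = d.argmin(axis=1)
    weights = np.array([(assign == k).mean() for k in range(n_clusters)])
    mus = np.array([data[assign == k].mean(axis=0) for k in range(n_clusters)])
    variances = np.array([np.maximum(data[assign == k].var(axis=0), min_var)
                          for k in range(n_clusters)])
    return weights, mus, variances

w, mu, var = lbg_gmm(np.random.randn(1000, 2), n_clusters=4)
```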
In one embodiment, the present invention may be implemented in conjunction with a word N-gram SVM-based system that outputs discriminant function values for given test vectors and speaker models. In accordance with this method, speaker-specific word N-gram models may be constructed using SVMs. The word N-gram SVM operates in a feature space given by the relative frequencies of word N-grams in the recognition output for a conversation side. Each N-gram corresponds to one feature dimension. N-gram frequencies are normalized (e.g., by rank-normalization, mean and variance normalization, Gaussianization, or the like) and modeled in an SVM with a linear kernel, with a bias (e.g., 500) against misclassification of positive examples.
In another embodiment, the present invention may be implemented in conjunction with a Gaussian mixture model (GMM)-based system that outputs the logarithm of the likelihood ratio between corresponding speaker and background models. In this case, three types of prosodic features are created: word features (containing the sequence of phone durations in the word and having varying numbers of components depending on the number of phones in the word's pronunciation, where each pronunciation gives rise to a different space), phone features (containing the durations of context-independent phones, which are one-dimensional vectors), and state-in-phone features (containing the sequence of hidden Markov model state durations in the phones). For extraction of these features, state-level alignments from a speech recognizer are used.
For each prosodic feature type, a model is built using the background model data for each occurring word or phone. Speaker models for each word and phone are then obtained through maximum a posteriori (MAP) adaptation of means and weights of the corresponding background model. During testing, three scores are obtained (one for each prosodic feature type). Each of these scores is computed as the sum of the logarithmic likelihoods of the feature vectors in the test speech signal, given its models. This number is then divided by the number of components that were scored. The final score for each prosodic feature type is obtained from the difference between the speaker-specific model score and the background model score. This score may be further normalized, and the three resultant scores may be used in the final combination either independently or after a simple summation of the three scores.
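A simplified sketch of the per-phone duration scoring described above, using scikit-learn's GaussianMixture; re-fitting a speaker model here stands in for MAP adaptation of the background model's means and weights, and the synthetic durations are purely illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)

# Background model for one phone: a GMM over its (one-dimensional) durations.
background_durations = rng.gamma(shape=2.0, scale=0.05, size=(5000, 1))
background_gmm = GaussianMixture(n_components=4, random_state=0).fit(background_durations)

# Speaker model: re-fitting on the speaker's training durations stands in here
# for MAP adaptation of the background model's means and weights.
speaker_durations = rng.gamma(shape=2.5, scale=0.05, size=(200, 1))
speaker_gmm = GaussianMixture(n_components=4, random_state=0).fit(speaker_durations)

# Score for this phone type on a test signal: average log likelihood under the
# speaker model minus average log likelihood under the background model.
test_durations = rng.gamma(shape=2.4, scale=0.05, size=(50, 1))
score = speaker_gmm.score(test_durations) - background_gmm.score(test_durations)
```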
The method 1000 is initialized at step 1002 and proceeds to step 1004, where the method 1000 obtains a noisy speech waveform (input speech signal).
In step 1006, the method 1000 estimates a clean speech waveform from the noisy speech waveform. In one embodiment, step 1006 is performed in accordance with Wiener filtering. In this case, the method 1000 first uses a neural-network-based voice activity detector to mark frames of the speech waveform as speech or non-speech. The method 1000 then estimates a noise spectrum as the average spectrum from the non-speech frames. Wiener filtering is then applied to the speech waveform using the estimated noise spectrum. By applying Wiener filtering to unsegmented noisy speech waveforms, the method 1000 can take advantage of long silence segments between speech segments for noise estimation.
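A sketch of the noise-spectrum estimation and Wiener-style gain in step 1006, assuming scipy's STFT and a pre-computed per-frame speech/non-speech decision (the neural-network voice activity detector itself is not shown):

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_denoise(wave, sr, speech_frames):
    """Estimate the noise spectrum as the average spectrum of the frames a
    voice activity detector marked as non-speech, then apply a Wiener-style
    gain to the noisy waveform.  `speech_frames` is one boolean per STFT
    frame; the neural-network VAD itself is assumed to exist elsewhere."""
    f, t, spec = stft(wave, fs=sr, nperseg=512)
    power = np.abs(spec) ** 2
    noise_power = power[:, ~speech_frames].mean(axis=1, keepdims=True)
    gain = np.maximum(1.0 - noise_power / np.maximum(power, 1e-12), 0.0)
    _, clean = istft(spec * gain, fs=sr, nperseg=512)
    return clean

sr = 8000
noisy = np.random.randn(sr * 3) * 0.1                  # placeholder 3-second waveform
n_frames = stft(noisy, fs=sr, nperseg=512)[2].shape[1]
vad = np.zeros(n_frames, dtype=bool)
vad[n_frames // 3:] = True                             # fake VAD labels: first third is non-speech
clean = wiener_denoise(noisy, sr, vad)
```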
In step 1008, the method 1000 extracts speech segments from the estimated clean speech waveform. In one embodiment, step 1008 is performed in accordance with a speech/non-speech segmenter that takes advantage of the cleaner signal produced in step 1006. In one embodiment, the segmenting is performed by Viterbi-decoding each conversation side separately, using a speech/non-speech hidden Markov model (HMM), followed by padding at the boundaries and merging of segments separated by short pauses.
In step 1010, the method 1000 selects frames of the estimated clean speech waveform for modeling. In one embodiment (e.g., where the speech waveform is scored in accordance with Gaussian mixture modeling), only the frames with average frame energy above a certain threshold are selected. In one embodiment, this threshold is relatively high in order to eliminate frames that are likely to be degraded by noise (e.g., noisy non-speech frames). The actual energy threshold for a given waveform is computed by multiplying an energy percent (EPC) parameter (between zero and one) by the difference between maximum and minimum frame log energy values, and adding the minimum log energy. The optimal EPC (i.e., the parameter for which the test set equal error rate is lowest) is dependent on both noise type and signal-to-noise ratio (SNR).
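The EPC-based frame selection of step 1010 can be sketched directly from the formula above; the EPC value of 0.3 is purely illustrative:

```python
import numpy as np

def select_frames(frame_log_energy, epc=0.3):
    """Keep frames whose log energy exceeds
    min + EPC * (max - min); the EPC value of 0.3 is illustrative only."""
    lo, hi = frame_log_energy.min(), frame_log_energy.max()
    return frame_log_energy >= lo + epc * (hi - lo)

keep = select_frames(np.random.randn(500) * 2.0 + 10.0)   # placeholder per-frame log energies
```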
In step 1012, the method 1000 scores the selected frames in accordance with at least two systems. In one embodiment, the method 1000 uses two systems to score the frames: the first system is a Gaussian mixture model (GMM)-based system, and the second system is a maximum likelihood linear regression and support vector machine (MLLR-SVM) system. In one embodiment, the GMM-based system models speaker-specific cepstral features, where the speaker model is adapted from a universal background model (UBM). MAP adaptation is then used to derive a speaker model from the UBM. In one embodiment, the MLLR-SVM system models speaker-specific translations of the Gaussian means of phone recognition models by estimating adaptation transforms using a phone-loop speech model with three regression classes for non-speech, obstruents, and non-obstruents (the non-speech transform is not used). The coefficients from the two speech adaptation transforms are concatenated into a single feature vector and modeled using SVMs. A linear inner-product kernel SVM is trained for each target speaker using the feature vectors from the background training set as negative examples and the target speaker training data as positive examples. In one embodiment, rank normalization on each feature dimension is used.
In step 1014, the method 1000 combines the scores computed in step 1012. In the case where the scoring systems are a GMM-based system and an MLLR-SVM system, the MLLR-SVM system (which is an acoustic model that uses cepstral features, but using non-standard representations of acoustic observations) may provide complementary information to the cepstral GMM-based system. In one embodiment, the scores are combined using a neural network score combiner having two inputs, no hidden layer, and a single linear output activation unit. The method 1000 then terminates in step 1016.
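A minimal sketch of a two-input, no-hidden-layer, linear-output combiner; ordinary least squares stands in for whatever training criterion the actual combiner uses, and the scores and labels are synthetic:

```python
import numpy as np

# Two inputs, no hidden layer, one linear output unit: combined = w1*gmm + w2*mllr + b.
rng = np.random.default_rng(4)
gmm_scores = rng.standard_normal(500)                               # synthetic development scores
mllr_scores = 0.7 * gmm_scores + 0.5 * rng.standard_normal(500)     # partially complementary
labels = (gmm_scores + mllr_scores + 0.3 * rng.standard_normal(500) > 0).astype(float)

X = np.column_stack([gmm_scores, mllr_scores, np.ones(500)])
w, *_ = np.linalg.lstsq(X, labels, rcond=None)                      # least-squares "training"

def combine(gmm_score, mllr_score):
    return w[0] * gmm_score + w[1] * mllr_score + w[2]

final = combine(1.2, 0.8)
```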
Alternatively, the speaker recognition module 1105 can be represented by one or more software applications (or even a combination of software and hardware, e.g., using Application Specific Integrated Circuits (ASIC)), where the software is loaded from a storage medium (e.g., I/O devices 1106) and operated by the processor 1102 in the memory 1104 of the general purpose computing device 1100. Thus, in one embodiment, the speaker recognition module 1105 for facilitating recognition of a speaker as described herein with reference to the preceding Figures can be stored on a computer readable medium or carrier (e.g., RAM, magnetic or optical drive or diskette, and the like).
It should be noted that although not explicitly specified, one or more steps of the methods described herein may include a storing, displaying and/or outputting step as required for a particular application. In other words, any data, records, fields, and/or intermediate results discussed in the methods can be stored, displayed, and/or outputted to another device as required for a particular application. Furthermore, steps or blocks in the accompanying Figures that recite a determining operation or involve a decision, do not necessarily require that both branches of the determining operation be practiced. In other words, one of the branches of the determining operation can be deemed as an optional step.
Although various embodiments which incorporate the teachings of the present invention have been shown and described in detail herein, those skilled in the art can readily devise many other varied embodiments that still incorporate these teachings.
This application claims the benefit of U.S. Provisional Patent Applications Ser. No. 60/803,971, filed Jun. 5, 2006; Ser. No. 60/823,245, filed Aug. 22, 2006; and Ser. No. 60/864,122, filed Nov. 2, 2006. All of these applications are herein incorporated by reference in their entireties.
This invention was made with Government support under grant numbers IRI-9619921 and IIS-0329258 awarded by the National Science Foundation. The Government has certain rights in this invention.